Date: 30th of May 2021.
This article was also published on Medium
Machine Learning for Asset Managers is a book by acclaimed mathematician Marcos M. López de Prado. Marcos is a recipient of the Quant of the Year award, and his research is at the forefront of innovation in Machine Learning for Finance. The book is very rich in facts and unorthodox ideas, and would make a great read for those who want to learn how modern Machine Learning is tackling the limitations of classical statistical methods when applied to Finance.
You can learn more about Marcos M. López de Prado from his website here.
This article provides a general summary of the ideas I noted down while going through the book (note that the book is already very compact).
A historical simulation of an investment strategy's performance (backtest) is not a theory; it is a (likely unrealistic) simulation of a past that never happened. Successful investment strategies are specific implementations of general theories. Only a theory can pin down the clear cause-effect mechanism that allows you to extract profits against the collective wisdom of the crowds.
Backtests can never prove that a strategy is a true positive; they can only provide evidence that a strategy is a false positive. A strategy must be supported by theory, not by historical simulations, so never develop a strategy solely through backtests. Furthermore, backtests can't simulate black swans; only theories have the breadth and depth needed. Backtests may insinuate that a strategy is profitable, but they do not tell us why.
Asset managers should focus their efforts on researching theories, not backtesting trading rules, and Machine Learning is a powerful tool for building financial theories.
The most insightful use of Machine Learning in finance is for discovering theories. The general approach is:
Machine Learning refers to a set of algorithms that learn complex patterns (structure not definable as a set of equations) in a high-dimensional space (large number of variables) without being specifically directed (researchers impose very little structure on the data).
Classical statistical methods rely on simplistic assumptions (linearity, independence), in-sample analysis, analytical solutions, and asymptotic properties because of the historically limited computing power. Today, ML and computational statistical methods such as cross-validation, ensemble estimators, regularization, bootstrapping, and Monte Carlo, deliver demonstrably better solutions.
The most successful quantitative investment firms in history rely primarily on ML, not econometrics. ML and econometrics complement each other because they have different strengths. ML is helpful at suggesting the ingredients of a theory while econometrics can be useful for testing a theory that is well grounded on empirical observation.
Recent scientific breakthroughs driven by ML include its use for:
Asset managers typically spend their time computing covariance matrices, using correlations, searching for low-dimensional representations of high-dimensional spaces, building predictive models, computing p-values, and solving mean-variance optimizations. Covariance matrices are used to run regressions, optimize portfolios, manage risks, and search for linkages. However, they are notoriously noisy, so they need to be denoised before use.
Applications of ML in asset management include price prediction, hedging, portfolio construction, detection of outliers and structural breaks, credit ratings, sentiment analysis, market making, bet sizing, securities taxonomy, etc.
The book discusses the following uses of ML in Finance:
ML in Finance tends to be criticized for being over-hyped, a black box, not applicable due to lack of large datasets, and ineffective for low signal-to-noise environments. However:
Estimates of covariance matrices computed from finite and non-deterministic random variables include some amount of noise and are usually derived from estimated factors that are numerically ill-conditioned. Working with those matrices directly without pre-processing can lead to misallocation of assets or increased transaction costs due to unnecessary re-balancing. ML can be used to reduce the noise and enhance the signal in these empirical covariance matrices.
The Marcenko-Pastur theorem gives the distribution of the eigenvalues of a random matrix. Applied to financial data sets, it can be used to measure the signal-to-noise ratio by separating the eigenvalues associated with noise from the eigenvalues associated with signal.
The Marcenko-Pastur PDF is fit to the eigenvalues of the correlation matrix, with the aim of finding the variance that minimizes the sum of the squared differences between the analytical PDF and the kernel density estimate of the observed eigenvalues.
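A minimal sketch of this fitting step, assuming we already have the eigenvalues of an empirical correlation matrix and the ratio q = T/N (number of observations over number of variables); the function names, bandwidth, and search bounds are illustrative choices rather than the book's exact implementation:

```python
import numpy as np
from scipy.optimize import minimize_scalar
from sklearn.neighbors import KernelDensity

def mp_pdf(var, q, pts=1000):
    # Marcenko-Pastur PDF for noise variance `var` and ratio q = T/N (> 1)
    e_min = var * (1 - (1.0 / q) ** 0.5) ** 2
    e_max = var * (1 + (1.0 / q) ** 0.5) ** 2
    e_val = np.linspace(e_min, e_max, pts)
    pdf = q / (2 * np.pi * var * e_val) * ((e_max - e_val) * (e_val - e_min)) ** 0.5
    return e_val, pdf

def fit_kde(obs, x, bandwidth=0.25):
    # Kernel density estimate of the observed eigenvalues, evaluated at x
    kde = KernelDensity(kernel="gaussian", bandwidth=bandwidth)
    kde.fit(np.asarray(obs).reshape(-1, 1))
    return np.exp(kde.score_samples(np.asarray(x).reshape(-1, 1)))

def fit_mp_variance(eigenvalues, q):
    # Find the variance that minimizes the squared error between the analytical
    # MP PDF and the KDE of the observed eigenvalues; return it with the cutoff.
    def sse(var):
        e_val, pdf_mp = mp_pdf(var, q)
        return np.sum((pdf_mp - fit_kde(eigenvalues, e_val)) ** 2)
    var = minimize_scalar(sse, bounds=(1e-5, 1.0), method="bounded").x
    e_max = var * (1 + (1.0 / q) ** 0.5) ** 2  # eigenvalues above e_max are treated as signal
    return var, e_max
```

The returned cutoff separates the noise eigenvalues (below it) from the signal eigenvalues (above it), which is what the denoising step below consumes.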
It is common in Finance to shrink the covariance matrix by making it closer to a diagonal matrix (Ledoit and Wolf 2004). This shrinkage, however, doesn't discriminate between noise and signal, further weakening an already weak signal. A better approach is to use the separated signal and noise eigenvalues to denoise the correlation matrix.
There are two approaches to achieve this: a numerical (constant-eigenvalue) method and a targeted-shrinkage method.
The numerical method sets a constant eigenvalue for all random eigenvectors while preserving the trace of the correlation matrix. The targeted approach, on the other hand, applies the shrinkage strictly to the random eigenvectors.
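A minimal sketch of the constant-eigenvalue (numerical) approach, assuming the eigenvalues are sorted in descending order and the number of signal factors n_facts has already been determined (e.g. from the Marcenko-Pastur cutoff above):

```python
import numpy as np

def denoise_corr(eig_val, eig_vec, n_facts):
    # Replace the eigenvalues deemed noise (beyond the first n_facts) by their
    # average, preserving the trace, then rebuild and rescale the correlation matrix.
    e = eig_val.copy().astype(float)
    e[n_facts:] = e[n_facts:].sum() / (len(e) - n_facts)
    corr = eig_vec @ np.diag(e) @ eig_vec.T
    d = np.sqrt(np.diag(corr))
    return corr / np.outer(d, d)  # restore a unit diagonal
```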
Detoning usually refers to the removal of the market component (or other prominent components) from a correlation matrix in order to amplify the signal from more subtle components such as sector, industry, or size. Detoning makes it easier to cluster a correlation matrix and discover dissimilarities across clusters.
A denoised correlation matrix \(C_{1}\) can be detoned using the following formula:
$$ \widetilde{C}_{2} = C_{1} - W_{M}A_{M}W^{\prime}_{M} = W_{D}A_{D}W^{\prime}_{D} $$
Where \(W_{M}\) and \(A_{M}\) are the eigenvectors and eigenvalues associated with market components, and \(W_{D}\) and \(A_{D}\) are for the non-market components.
The detoned correlation matrix is singular (non-invertible). It can't be used directly for mean-variance portfolio optimization. Instead, we should optimize on the principal components and map the optimal allocations back to the original basis.
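A minimal sketch of detoning, assuming the market component corresponds to the largest eigenvalue(s) of the (denoised) correlation matrix; the final step simply rescales the result back to a unit diagonal:

```python
import numpy as np

def detone_corr(corr, n_market=1):
    # Remove the n_market most prominent components and rescale to a correlation matrix.
    eig_val, eig_vec = np.linalg.eigh(corr)
    idx = eig_val.argsort()[::-1]              # sort eigenvalues in descending order
    eig_val, eig_vec = eig_val[idx], eig_vec[:, idx]
    w_m, a_m = eig_vec[:, :n_market], np.diag(eig_val[:n_market])
    corr_d = corr - w_m @ a_m @ w_m.T          # C_1 - W_M A_M W_M'
    d = np.sqrt(np.diag(corr_d))
    return corr_d / np.outer(d, d)
```

The output remains singular, as noted above, so it should feed a principal-components optimization rather than a direct matrix inversion.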
An experiment is performed to compute the errors associated with estimating a minimum variance portfolio with and without denoising. The experiments show that denoising is much more effective than shrinkage with a 59.85% RMSE reduction. Shrinkage adds little benefit beyond what denoising contributes.
Correlation is an important measure in Finance; however, it doesn't come without limitations. It neglects nonlinear relationships, it is influenced by outliers, and its use for variables that aren't normally distributed is questionable. Furthermore, it isn't a metric, because it doesn't satisfy the non-negativity and triangle inequality conditions, making it incoherent for comparison operations.
To overcome those limitations, we use Information theory concepts for quantifying the amount of uncertainty associated with a random variable, particularly the concept of Entropy. Information Theory can be used for defining the objective function in decision tree learning, defining loss function for classification problems, evaluating distance between two random variables, comparing clusters, and feature selection.
Correlation isn't a metric. However, if we require that the covariance of two random variables equals their correlation multiplied by the standard deviation of each variable (i.e., the Pearson correlation), then the following measure is a metric:
$$d_{\rho}[X,Y] = \sqrt{{1\over2}(1 - \rho[X,Y])}$$
The above metric however exhibits asymmetry when it comes to the sign of the correlation, with negative correlation deemed more distant than positive correlation. This can be resolved by using the following normalized correlation-based distance metric:
$$d_{|\rho|}[X,Y] = \sqrt{1 - |\rho[X,Y]|}$$
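As an illustration, both distance measures are one-liners on an array of pairwise correlations (a sketch; rho is assumed to be a NumPy array):

```python
import numpy as np

def corr_distance(rho):
    # d_rho = sqrt(0.5 * (1 - rho)): a metric, but anti-correlated pairs are maximally distant
    return np.sqrt(0.5 * (1.0 - rho))

def abs_corr_distance(rho):
    # d_|rho| = sqrt(1 - |rho|): treats rho = -0.9 as being as close as rho = 0.9
    return np.sqrt(1.0 - np.abs(rho))
```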
For a discrete random variable \(X\) that takes a value \(x\) from the set \(S_{x}\) with probability \(p[x]\), the entropy of \(X\) is defined as:
$$ H[X] = -\sum_{x \in S_{X}} p[x]\log[p[x]] $$
Entropy can be interpreted as the amount of uncertainty associated with \(X\). It is zero when all probability is concentrated in a single element of \(S_{x}\) and is at its maximum value when \(X\) is distributed uniformly.
For two random variables \(X\) and \(Y\), we define their joint entropy as:
$$ H[X, Y] = -\sum_{x,y \in S_{X} \times S_{Y}} p[x,y]\log[p[x,y]] $$
Conditional entropy of variable \(X\) conditioned on variable \(Y\) tells us the uncertainty we expect in \(X\) if we are told the value of \(Y\). It is defined as:
$$ H[X|Y] = H[X, Y] - H[Y] $$
Conditional entropy has two main properties: \(H[X|X] = 0\), and \(H[X] \geq H[X|Y]\).
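These quantities can be estimated from binned data in a few lines; this is a sketch of discrete plug-in estimates based on histogram binning (the bin count is an illustrative choice; quantization is discussed further below):

```python
import numpy as np
from scipy.stats import entropy

def marginal_entropy(x, bins=10):
    # H[X] from a histogram; scipy normalizes the counts and uses the natural log
    return entropy(np.histogram(x, bins=bins)[0])

def joint_entropy(x, y, bins=10):
    # H[X, Y] from a 2-D histogram
    return entropy(np.histogram2d(x, y, bins=bins)[0].flatten())

def conditional_entropy(x, y, bins=10):
    # H[X|Y] = H[X, Y] - H[Y]
    return joint_entropy(x, y, bins) - marginal_entropy(y, bins)
```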
For two discrete probability distributions \(p\) and \(q\), the KL divergence measures how much \(p\) diverges from a reference distribution \(q\). It isn't a metric: despite being non-negative, it violates the symmetry and triangle inequality conditions. It is defined as:
$$ D_{KL}[p||q] = - \sum_{x \in S_{X}} p[x]\log\left[\frac{q[x]}{p[x]}\right]$$
KL divergence is widely used in variational inference.
For two discrete probability distributions \(p\) and \(q\), cross-entropy is interpreted as the uncertainty associated with \(X\), when we evaluate its information content not on its true distribution \(p\) but the wrong distribution \(q\). It is defined as:
$$ H_{C}[p||q] = H[X] + D_{KL}[p||q] $$
Mutual information is used to measure the decrease in entropy for random variable \(X\) from knowing the value of random variable \(Y\). It is defined as follows:
$$ I[X, Y] = H[X] - H[X|Y] = H[X] + H[Y] - H[X,Y] $$
Mutual information is equal to zero when the two random variables are independent. It isn't a metric because it doesn't satisfy the triangle inequality. However, its strength is its grouping property which can be used to decompose mutual information into simple constituents.
Variation of information (VI) is a metric, and it can be interpreted as the uncertainty we expect in one variable if we are told the value of the other. It is defined as:
$$ VI[X, Y] = H[X|Y] + H[Y|X] = H[X] + H[Y] - 2I[X, Y] $$
All the above measures assume that the random variables are discrete. For the continuous case, we need to quantize the values by dividing the range spanning the observed values into bins of equal size.
The result of the estimator may be biased by our choice of bin size, in which case we need to find the optimal binning for each estimator.
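A sketch of mutual information and variation of information estimated from a 2-D histogram; the fixed bin count stands in for whatever optimal-binning rule is chosen:

```python
import numpy as np
from scipy.stats import entropy

def mutual_info_vi(x, y, bins=10, normalize=False):
    # Estimate I[X,Y] and VI[X,Y] from the quantized joint distribution of x and y.
    c_xy = np.histogram2d(x, y, bins=bins)[0]
    h_x, h_y = entropy(c_xy.sum(axis=1)), entropy(c_xy.sum(axis=0))
    h_xy = entropy(c_xy.flatten())
    i_xy = h_x + h_y - h_xy              # mutual information
    vi = h_x + h_y - 2.0 * i_xy          # variation of information
    if normalize:
        vi /= h_xy                       # normalized VI, bounded in [0, 1]
    return i_xy, vi
```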
The above methods can be extended to compare two partitions from the same data set instead of two random variables, where a partition P of a data set D is an unordered set of mutually disjoint non-empty subsets.
This extension allows us to compare partitioning algorithms across different data sets. For example, variation of information is useful for comparing outcomes from a partitional clustering algorithm in unsupervised learning.
The information-theoretic analogue of the correlation coefficient is the normalized mutual information.
For two random variables with an associated linear relationship, the correlation is approximately 1 while the normalized mutual information is approximately 0.9, implying the existence of some degree of uncertainty.
For two random variables \(x\) and \(y\) with an associated non-linear relationship such as \(y = 100|x| + e\), the correlation is approximately 0 while the normalized mutual information is 0.64, showing that normalized mutual information was still able to identify the existence of a relationship between the two variables.
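The following sketch reproduces that comparison; the normalization by the smaller marginal entropy is one common choice, and the exact values depend on the sample, the noise level, and the binning, so the numbers in the comments are only indicative:

```python
import numpy as np
from scipy.stats import entropy

def norm_mutual_info(x, y, bins=10):
    # I[X,Y] normalized by min(H[X], H[Y]), so a deterministic relationship scores ~1
    c_xy = np.histogram2d(x, y, bins=bins)[0]
    h_x, h_y, h_xy = entropy(c_xy.sum(1)), entropy(c_xy.sum(0)), entropy(c_xy.flatten())
    return (h_x + h_y - h_xy) / min(h_x, h_y)

rng = np.random.default_rng(0)
x, e = rng.normal(size=100_000), rng.normal(size=100_000)
y_lin = x + 0.1 * e                 # strong linear relationship
y_abs = 100.0 * np.abs(x) + e       # strong nonlinear relationship
print(np.corrcoef(x, y_lin)[0, 1], norm_mutual_info(x, y_lin))  # corr close to 1, NMI high
print(np.corrcoef(x, y_abs)[0, 1], norm_mutual_info(x, y_abs))  # corr close to 0, NMI clearly > 0
```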
Clustering is an unsupervised learning technique whose aim is to separate data points in a dataset into groups such that similarities within the group are maximized while similarities outside the group are minimized.
Clustering problems can be represented as splitting an undirected graph into its connected components, where each connected component is a cluster. The undirected graph can be generated from a proximity matrix representation of the data.
Clustering algorithms are either partitional (un-nested) or hierarchical (nested). Hierarchical clusters can be divisive (top-down) or agglomerative (bottom-up).
Clustering algorithms expect either a measure of similarity or a measure of dissimilarity. There are five types of clustering algorithms: connectivity-based (distance), centroid-based (k-means), distribution-based (statistical distributions), density-based (DBSCAN and OPTICS), and subspace (bi-clustering).
When faced with a clustering problem where the number of features is greater than the number of observations (curse of dimensionality), one solution is to project the feature space into a lower dimensional space.
For partitioning algorithms, the researcher needs to pre-define the number of clusters to use in a problem. One technique that aids the researcher in doing so is the elbow method, where clusters are introduced additively until the marginal percentage of variance (intra-group variance / total variance) reaches a pre-defined threshold.
Another technique is called the Optimal Number of Clusters (ONC).
For correlation clustering, a recommended approach is to derive the observation matrix as:
$$ X_{i,j} = \sqrt{\frac{1}{2} (1 - \rho_{i,j})}$$
The advantage of observation matrices is that the distance between two variables is a function of more than one correlation, and it acknowledges that a correlation change of 0.1 between two variables with correlations 0.9 and 1.0 is greater than between variables with correlations 0.1 and 0.2.
Using the observation matrix defined above, we can then perform base clustering with an algorithm such as k-means. To circumvent the limitations of the k-means algorithm (the need to pre-set the number of clusters \(K\), and the sensitivity to random initialization), two modifications must be introduced.
For the first limitation, using the t-stat of the silhouette scores as the objective function helps us pick an optimal \(K\). For the second limitation, running the algorithm with multiple initializations and keeping the best clustering reduces the bias introduced by random initialization.
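A minimal sketch of this base-clustering step, assuming X is the correlation-derived observation matrix defined above; the silhouette t-stat is computed as the mean silhouette score over its standard deviation, and the ranges for K and the number of initializations are illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples

def base_clustering(X, max_k=10, n_init=10, seed=0):
    # Try K = 2..max_k and several random initializations, keeping the
    # configuration that maximizes the t-stat of the silhouette scores.
    best_t, best_labels = -np.inf, None
    for k in range(2, max_k + 1):
        for i in range(n_init):
            km = KMeans(n_clusters=k, n_init=1, random_state=seed + i).fit(X)
            sil = silhouette_samples(X, km.labels_)
            t_stat = sil.mean() / sil.std()
            if t_stat > best_t:
                best_t, best_labels = t_stat, km.labels_
    return best_labels, best_t
```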
We can further improve the algorithm by identifying clusters of lower quality. We compute each cluster's quality score and their average, identify all clusters whose quality is below that average, and rerun the clustering algorithm on the members of those clusters. We accept the resulting configuration as a better one if the average quality of the clusters has improved.
To verify the efficacy of the ONC algorithm, a Monte Carlo experiment was devised where a randomly generated correlation matrix is drawn with high intra-block correlation and low inter-block correlation. The correlation matrix is shuffled, and the ONC algorithm is performed. Results show that the ONC algorithm was able to recover the correct number of clusters, except for a small number of errors.
Supervised learning problems are of two types: regression and classification. Classification problems involve predicting labels from a finite set of categorical or ordinal values. For financial datasets, it is important to define the correct labels for learning. There are four main labelling strategies.
The fixed-horizon method looks at the price series as a sequence of fixed-interval bars and assigns a label of 1/0/-1 based on the sign of the open-to-close price change of each bar. A threshold can also be introduced so that the label remains 0 unless the absolute open-to-close change exceeds that threshold.
The main limitation of this method is its proneness to heteroscedasticity due to intra-day seasonal activity patterns being transferred to the labels. This limitation can be rectified by either applying the method on tick, volume, or dollar bars, or by using the standardized returns adjusted for volatility.
Another limitation is that it doesn't account for price action within the fixed interval bars, which can pose a model risk as intra bar stop-losses will not be learned.
The triple-barrier method leverages the three main trading rules for closing a position: 1. profit target reached, 2. stop-loss reached, and 3. time horizon reached without triggering profit taking or stop-loss.
We can label the price data based on the above rules: for a given reference point, we define a profit-taking level, a stop-loss level, and a time limit. When the price breaches the profit-taking level we label the observation 1, when it reaches the stop-loss we label it -1, and when the time limit lapses first we label it 0.
When the position side is unknown, we can use the volatility to define symmetric profit-taking and stop-loss levels.
The main limitation of this method is that barrier touches are discrete events which may be triggered, or missed, by a very thin margin, yet the label flips entirely.
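A minimal sketch of the labelling rule for a single event, assuming close is an array-like of prices, t0 is the integer position of the reference point, and the profit-taking and stop-loss levels pt and sl are expressed as returns:

```python
import numpy as np

def triple_barrier_label(close, t0, horizon, pt, sl):
    # +1 if the profit-taking barrier is touched first, -1 if the stop-loss is
    # touched first, 0 if the time horizon lapses without touching either barrier.
    path = np.asarray(close, dtype=float)[t0:t0 + horizon + 1]
    ret = path / path[0] - 1.0
    touch_pt = np.where(ret >= pt)[0]     # bar offsets where the upper barrier is breached
    touch_sl = np.where(ret <= -sl)[0]    # bar offsets where the lower barrier is breached
    first_pt = touch_pt[0] if touch_pt.size else np.inf
    first_sl = touch_sl[0] if touch_sl.size else np.inf
    if np.isinf(first_pt) and np.isinf(first_sl):
        return 0
    return 1 if first_pt < first_sl else -1
```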
The Trend-Scanning method does away with profit-taking and stop-loss levels, instead opting for labels based on trend, where 1 is an up-trend, 0 is no trend, and -1 is a down-trend.
One way is to compute the t-value associated with the estimated slope coefficient in a linear time-trend model with a preset look-forward period \(L\). Because different values of \(L\) lead to different t-values, we can try multiple values of \(L\) and pick the one that maximizes the absolute t-value, giving us the most statistically significant trend observed across look-forward periods.
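A sketch of the procedure for a single observation, using statsmodels for the time-trend regression; the look-forward range and the array-based indexing are illustrative assumptions:

```python
import numpy as np
import statsmodels.api as sm

def trend_scan_label(close, t0, min_l=5, max_l=20):
    # Fit close_t = b0 + b1 * t over each look-forward window L in [min_l, max_l]
    # and keep the window whose slope t-value is largest in absolute value.
    best_t = 0.0
    for l in range(min_l, max_l + 1):
        y = np.asarray(close, dtype=float)[t0:t0 + l]
        if len(y) < l:
            break
        t_val = sm.OLS(y, sm.add_constant(np.arange(l))).fit().tvalues[1]
        if abs(t_val) > abs(best_t):
            best_t = t_val
    return int(np.sign(best_t)), best_t   # the sign is the label, the t-value its significance
```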
Meta-labelling is a technique used for bet sizing whose goal is to reduce an investor's exposure to false positives (i.e., to increase the precision of the model).
It works by training a secondary model on the prediction outcomes of the primary model, with losses labelled as 0 and gains as 1. The secondary model's purpose is to predict whether the primary model will succeed or fail, with the predicted probability used to size positions.
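A minimal sketch of the secondary model under hypothetical inputs: X is the feature matrix, primary_side holds the primary model's predicted sides in {-1, 1}, and returns holds the realized return of each bet; the random forest is just one possible classifier:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def fit_meta_model(X, primary_side, returns):
    # Meta-label: 1 if acting on the primary signal would have been profitable, else 0.
    meta_y = (primary_side * returns > 0).astype(int)
    X_meta = np.column_stack([X, primary_side])     # the predicted side is itself a feature
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    clf.fit(X_meta, meta_y)
    return clf      # clf.predict_proba(...) feeds the bet-sizing step described next
```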
For a trial with probability \(p\) of yielding a profit \(\pi\), and probability \(1 - p\) of yielding a loss \(-\pi\), the expected profit of the trial is \(\mu = \pi(2p - 1)\) and the variance is \(\sigma^{2} = 4\pi^{2}p(1 - p)\). The Sharpe ratio can then be estimated as:
$$ z = \frac{\mu}{\sigma} = \frac{p - \frac{1}{2}}{\sqrt{p(1-p)}} $$
Assuming \(z\) is approximately standard normal, we can derive the bet size as \(m = 2Z[z] - 1\), where \(Z[\cdot]\) is the cumulative distribution function of the standard Gaussian.
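A one-function sketch of this sizing rule, where p is the predicted probability of success:

```python
from scipy.stats import norm

def bet_size(p):
    # z = (p - 1/2) / sqrt(p (1 - p)); bet size m = 2 * Z[z] - 1, bounded in [-1, 1]
    z = (p - 0.5) / (p * (1.0 - p)) ** 0.5
    return 2.0 * norm.cdf(z) - 1.0

print(bet_size(0.5), bet_size(0.55), bet_size(0.7))   # zero bet at p = 0.5, growing with confidence
```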
Another approach is to treat each output of the meta-labelling classifiers as a Bernoulli trial and use the predictions of the trials to approximate a binomial distribution. Assuming the predictions are independent and identically distributed, the de Moivre-Laplace theorem states that the normal distribution approximates the binomial distribution for large enough \(n\).
The average prediction across the classifiers can then be used to derive a Student's t distribution, from which a bet size can be calculated.
The aim of feature importance analysis is to identify the variables in a problem that are relevant, and the variables that are redundant or irrelevant.
\(p\)-Values quantify the probability that, if the true coefficient associated with a variable were zero, we would have obtained a result equal to or more extreme than the one we estimated.
However, \(p\)-Values measure neither the probability that the null or the alternative hypothesis is true, nor the probability that the data are random, nor the size of the effect, nor the significance of the result.
Furthermore, \(p\)-Values can't discriminate among explanatory variables with high multi-collinearity, and they can produce probabilities that aren't generalizable out-of-sample due to \(p\)-hacking.
The fact that \(p\)-Values measure the probability of obtaining a result equal to or more extreme than the result we got subject to the null hypothesis being true makes it less relevant for researchers. Researchers are mainly interested in the probability of the null hypothesis being true subject to the obtained result, which can be computed using Bayes theorem.
ML feature importance techniques that address the limitations of \(p\)-values include Mean-Decrease Impurity, Mean-Decrease Accuracy, accumulated local effects, and Shapley values.
A sample that is the result of a classification split (such as in decision trees) is purest when it contains labels from one category, and is most impure when the labels are uniformly distributed across categories. Impurity can be evaluated as the entropy of the distribution of labels, from which we can calculate the information gain using the impurity before and after classification.
In the example of a decision tree, the importance of a feature is then calculated as the weighted information gain across all nodes where that feature was selected.
Similar to \(p\)-values, MDI is computed in-sample, though it resolves the three other limitations of \(p\)-values: it doesn't rely on distributional assumptions, it is an estimate averaged over a number of trees (thus reducing the variance of the estimate), and it isn't tied to a particular parametric specification.
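A minimal sketch of MDI using scikit-learn's random forest; limiting each split to one candidate feature (max_features=1) is a common way to reduce masking between correlated features, and the estimator settings are otherwise illustrative:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def mdi_importance(X, y, feature_names):
    # Average each tree's impurity-based importances; they sum to one by construction.
    clf = RandomForestClassifier(n_estimators=100, max_features=1, random_state=0)
    clf.fit(X, y)
    imp = np.array([tree.feature_importances_ for tree in clf.estimators_])
    return dict(zip(feature_names, imp.mean(axis=0)))
```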
Mean-decrease accuracy (MDA) is a method used to determine the relative informational content of explanatory variables. It follows these steps:
The main goal of MDA is to mend the in-sample limitation of MDI and \(p\)-values. MDA is derived by comparing the cross-validated performance of the model before and after shuffling one of its features.
The only limitation of MDA is its reliance on accuracy to score classifiers, which doesn't apply well in Finance as the confidence of the prediction needs to be taken into account.
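A sketch of MDA scored with negative log-loss rather than plain accuracy, in line with the critique above; clf is any scikit-learn classifier, X and y are NumPy arrays, and the fold count and shuffling scheme are illustrative choices:

```python
import numpy as np
from sklearn.base import clone
from sklearn.metrics import log_loss
from sklearn.model_selection import KFold

def mda_importance(clf, X, y, n_splits=5, seed=0):
    # Compare the out-of-fold score before and after shuffling each feature, one at a time.
    rng = np.random.default_rng(seed)
    base, drop = [], {j: [] for j in range(X.shape[1])}
    for train, test in KFold(n_splits=n_splits).split(X):
        model = clone(clf).fit(X[train], y[train])
        proba = model.predict_proba(X[test])
        base.append(-log_loss(y[test], proba, labels=model.classes_))
        for j in range(X.shape[1]):
            X_shuffled = X[test].copy()
            X_shuffled[:, j] = rng.permutation(X_shuffled[:, j])   # destroy feature j's information
            proba_j = model.predict_proba(X_shuffled)
            drop[j].append(-log_loss(y[test], proba_j, labels=model.classes_))
    base_score = np.mean(base)
    return {j: base_score - np.mean(drop[j]) for j in drop}   # larger drop = more important feature
```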
Cross-entropy loss (or log-loss) is a better alternative to accuracy in Finance. However, it isn't easily interpretable. One way to preserve interpretability is to average the likelihoods instead of the log-likelihoods; this metric is known as the negative average likelihood of the true labels (NegAL). It is defined as follows:
$$ NegAL = -N^{-1}\sum_{n=0}^{N-1}\sum_{k=0}^{K-1}y_{n,k}p_{n,k} $$
We can also define another metric called probability-weighted accuracy (PWA) as follows:
$$ PWA = \left. \sum_{n=0}^{N-1}y_{n}(p_{n} - K^{-1}) \right/ \sum_{n=0}^{N-1}(p_{n} - K^{-1}) $$
PWA punishes incorrect high-confidence predictions more harshly than standard accuracy.
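A sketch of PWA, assuming proba is an (N, K) array of predicted class probabilities and y_true contains the true labels encoded as column indices of proba:

```python
import numpy as np

def probability_weighted_accuracy(y_true, proba):
    # PWA = sum_n y_n (p_n - 1/K) / sum_n (p_n - 1/K), where p_n is the highest
    # predicted probability and y_n = 1 when that prediction is correct.
    k = proba.shape[1]
    p_n = proba.max(axis=1)
    y_n = (proba.argmax(axis=1) == y_true).astype(float)
    return np.sum(y_n * (p_n - 1.0 / k)) / np.sum(p_n - 1.0 / k)
```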
Substitution effects happen when two features share predictive power. For any two such features, MDI will halve their importance as they are randomly chosen with equal probability, and MDA will consider both unimportant as one compensates for the other during shuffling.
One way to reduce substitution effects is to perform PCA on the features and then run MDI/MDA on the principal components. Still, substitution effect for redundant features from nonlinear combinations of informative ones will persist, and the principal components are less interpretable.
A better approach is to perform the feature importance step after clustering the feature space; this is known as Clustered Feature Importance (CFI). We first project the features into a metric space using either a correlation-based approach or an information-theoretic approach. Then we apply the ONC algorithm to cluster the features, followed by the MDI/MDA feature importance algorithms on the resulting clusters.
Portfolio construction in Finance is traditionally posed as a convex optimization program solved by Markowitz's Critical Line Algorithm (CLA). CLA estimates an efficient frontier of portfolios that maximize expected return for a given level of risk, with risk represented by the standard deviation of portfolio returns.
However, mean-variance solutions are susceptible to instability due to certain covariance structures and certain types of signals used.
The instability due to the covariance structure can be measured in terms of the relative magnitude of its extreme eigenvalues. The condition number is one such metric, defined as the absolute value of the ratio between the maximal and minimal eigenvalues of the covariance matrix.
When securities within a portfolio are highly correlated, the standardized covariance matrix will have a high condition number, and the values of the inverse of the covariance matrix explode. So unless the correlation between the different assets in a portfolio is close to zero, we would expect an unstable optimization solution.
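A tiny illustration of how the condition number grows with correlation; the two example matrices are arbitrary:

```python
import numpy as np

def condition_number(matrix):
    # Absolute ratio of the largest to the smallest eigenvalue.
    eig_val = np.linalg.eigvalsh(matrix)
    return np.abs(eig_val).max() / np.abs(eig_val).min()

print(condition_number(np.array([[1.0, 0.10], [0.10, 1.0]])))   # ~1.2
print(condition_number(np.array([[1.0, 0.99], [0.99, 1.0]])))   # ~199
```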
Markowitz's solution is most needed precisely when assets in a portfolio are highly correlated, giving rise to the Markowitz curse: the more correlated the assets, the greater the need for diversification, and yet the more unstable the optimization solution.
Hierarchical Risk Parity (HRP) is an ML-based asset allocation method that leverages graph theory and unsupervised learning to build diversified portfolios. Even though it is suboptimal in-sample, HRP outperforms Markowitz out-of-sample.
Signal-induced instability is structural, and cannot be reduced by sampling more observations. It occurs because clusters of highly correlated assets within the covariance matrix end up affecting the conditioning of the full matrix.
The higher the intra-cluster correlation, the higher the condition number. Thus, we can reduce the instability by optimizing the dominant clusters separately.
Nested Clustered Optimization (NCO) is an ML-based algorithm that tackles the instability caused by the Markowitz curse. NCO's core idea is to split the optimization into sub-problems; it is therefore agnostic to the underlying algorithms used for clustering and for optimizing the individual clusters.
The algorithm is as follows:
A backtest isn't a controlled experiment but is rather a simulation of how a strategy would have performed in the past. It is easy to over-fit backtests due to the limited amount of applicable historical time series financial data. Running several trials on the data and selecting the best performing backtest result leads to selection bias under multiple testing.
ML's dark side is that it can easily over-fit data. The primary symptom of over-fitting is a divergence between a model's in-sample and out-of-sample performance (generalization error).
There are two types of over-fitting: training-set over-fitting and test-set over-fitting.
Training set over-fitting is when the model explains not only the signal but also the noise. There are three ways to reduce over-fitting:
Test-set over-fitting occurs when a researcher applies the same test to the same dataset multiple times, which eventually leads to false discoveries, also known as selection bias. In Finance, this usually manifests when a researcher tweaks a backtested strategy until the target performance is achieved. Instead, the researcher should go back to the research process and devise a new strategy.
Backtests are able to detect training set over-fitting but are powerless against test set over-fitting.
The current crisis in financial research wasn't caused by ML but rather by widespread misuse of classical statistical methods and p-hacking.
Machine Learning can help reduce the effect of test set over-fitting in three ways:
There are two approaches to correcting for multiple trials: the False Strategy Theorem and the Šidák correction method.
In Finance, where the signal-to-noise ratio is low, the odds that a tested strategy is a true strategy rather than a false one are very low. Thus we would expect the number of true positives relative to the total number of positives (the precision) to be very low.
For multiple independent trials, we can devise precision and recall formulas that adjust for multiple testing.
Out of multiple trials, the researcher will pick the one with the highest Sharpe ratio, causing the expected value of the maximum Sharpe ratio to be higher than the expected value of the Sharpe ratio of any single trial.
The False Strategy Theorem derives the expected value of the maximum Sharpe ratio as a function of the number of trials and the variance of the Sharpe ratios across the trials. It proves that as more trials are conducted, the expected maximum Sharpe ratio (the hurdle the researcher has to beat) increases. If the observed maximum Sharpe ratio isn't higher than the expected maximum Sharpe ratio, the discovered strategy is likely to be a false positive.
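A sketch of that expected-maximum hurdle, using the standard approximation for the expected maximum of \(N\) independent standard normal draws scaled by the cross-trial Sharpe variance; this assumes every trial's true Sharpe ratio is zero:

```python
import numpy as np
from scipy.stats import norm

EULER_GAMMA = 0.5772156649015329

def expected_max_sharpe(n_trials, var_sharpe):
    # E[max SR] ~ sqrt(V[SR]) * ((1 - gamma) * Phi^{-1}(1 - 1/N) + gamma * Phi^{-1}(1 - 1/(N e)))
    return np.sqrt(var_sharpe) * (
        (1.0 - EULER_GAMMA) * norm.ppf(1.0 - 1.0 / n_trials)
        + EULER_GAMMA * norm.ppf(1.0 - 1.0 / (n_trials * np.e))
    )

for n in (10, 100, 1000):
    print(n, round(expected_max_sharpe(n, var_sharpe=1.0), 2))  # the hurdle rises with the number of trials
```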
The Deflated Sharpe Ratio (DSR) is the probability of observing a Sharpe ratio greater than or equal to the estimated Sharpe ratio, under the null hypothesis that the true Sharpe ratio is zero, while adjusting for skewness, kurtosis, sample length, and multiple testing.
The effective number of independent trials can be estimated by clustering all the trials that have been run and counting each differentiated cluster as one trial.
The familywise error rate (FWER) is the probability of making at least one type I error in a multi-trial experiment involving multiple hypothesis tests.
The Šidák correction for multiple testing is used to keep the familywise error rate from blowing up as more tests are conducted on the same sample. It sets the per-trial significance level to \(1 - (1 - \alpha)^{1/K}\) for \(K\) trials, which is approximately the familywise \(\alpha\) divided by the number of trials (the Bonferroni bound), with the number of effective trials estimated using the ONC algorithm.
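For illustration, the Šidák-adjusted per-trial significance level and its Bonferroni approximation, assuming a 5% familywise error rate and ten effective trials:

```python
def sidak_alpha(alpha_family, n_trials):
    # Per-trial threshold that keeps the familywise error rate at alpha_family.
    return 1.0 - (1.0 - alpha_family) ** (1.0 / n_trials)

print(sidak_alpha(0.05, 10))   # ~0.0051, close to the Bonferroni bound 0.05 / 10 = 0.005
```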