Quantitative Methods & Machine Learning Essentials
Likelihood Function
The likelihood function gives the probability (or density) of the observed data as a function of the model parameters θ. It is often denoted p(x|θ), read as a function of θ with the data x held fixed.
Maximum Likelihood Estimation (MLE)
The Maximum Likelihood Estimator (MLE), δ(x), is the parameter value that maximizes the likelihood:
δ(x) = argmaxθ p(x|θ)
or equivalently, the value that maximizes the log-likelihood:
δ(x) = argmaxθ log p(x|θ)
For a Gaussian sample, the log-likelihood is log L(μ, σ²) = −(n/2)·log(2πσ²) − Σi (xi − μ)²/(2σ²).
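A minimal sketch in Python (simulated data and illustrative parameter values, not from the notes): the numerical MLE for a Gaussian sample should match the closed-form answers, the sample mean and the (biased) sample standard deviation.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=3.0, size=5_000)   # simulated Gaussian sample

def neg_loglik(params):
    mu, log_sigma = params            # optimize log(sigma) so sigma stays positive
    sigma = np.exp(log_sigma)
    return -np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                   - (x - mu) ** 2 / (2 * sigma**2))

res = minimize(neg_loglik, x0=[0.0, 0.0])
print(res.x[0], np.exp(res.x[1]))     # numerical MLE of mu and sigma
print(x.mean(), x.std(ddof=0))        # closed-form MLE for comparison
```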
Asymptotic Distribution of MLE
When n is large, the asymptotic distribution of the MLE is approximately normal:
√n·(θ̂ − θ₀) → N(0, I(θ₀)⁻¹)  (convergence in distribution)
where I(θ₀) is the Fisher information, I(θ₀) = −E[∂² log p(x|θ)/∂θ²] evaluated at θ₀.
Examples include:
- Gaussian (mean μ, known σ²): I(μ) = 1/σ², so the MLE is approximately N(μ, σ²/n).
- Exponential (rate λ): I(λ) = 1/λ², so the MLE is approximately N(λ, λ²/n).
Bayesian Estimation
The Bayesian forecast is the posterior predictive distribution, i.e. the likelihood of the next observation averaged over the posterior:
p(xT+1 | x1, …, xT) = ∫ p(xT+1 | θ)·p(θ | x1, …, xT) dθ
Assume the prior distribution on μ is Gaussian, μ ~ N(m0, τ²). The posterior for μ is then also Gaussian, with mean mT equal to a precision-weighted average of the prior mean m0 and the sample mean; the weight on the data grows with the sample size T.
When T→∞, mT converges to the sample mean (the MLE of μ), and the Maximum A Posteriori (MAP) estimator is almost the same as the MLE (relying almost entirely on data, not the prior). The MAP estimator is the value θMAP at which the posterior density is at its maximum.
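A minimal sketch of the conjugate Gaussian update for μ with known data variance σ²; the hyperparameter names m0 and tau2 are illustrative assumptions, not taken from the notes:

```python
import numpy as np

rng = np.random.default_rng(0)
mu_true, sigma2, T = 1.5, 4.0, 50
x = rng.normal(mu_true, np.sqrt(sigma2), size=T)   # observed sample

m0, tau2 = 0.0, 1.0                                # prior mean and prior variance
post_var = 1.0 / (1.0 / tau2 + T / sigma2)         # posterior variance
mT = post_var * (m0 / tau2 + x.sum() / sigma2)     # posterior mean = MAP (Gaussian posterior)

print(mT, x.mean())   # as T grows, mT approaches the sample mean (the MLE)
```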
Assume the prior distribution on θ is Gamma, p(θ) ∝ θ^(a−1)·e^(−bθ). The Gamma prior is conjugate for rate parameters (e.g., exponential or Poisson likelihoods), so the posterior is again Gamma with updated shape and rate.
Generalized Method of Moments (GMM)
The GMM estimator is defined as:
θ̂GMM = argminθ g(θ)′ W g(θ)
where g(θ) is the N×1 vector of sample moments (average residuals) and W is an N×N weighting matrix; the efficient choice of W is the inverse of the covariance matrix of the moment conditions.
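A minimal sketch of first-step GMM (identity weighting matrix) for an assumed two-moment example, estimating a mean and a variance:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
x = rng.normal(loc=1.0, scale=2.0, size=5_000)

def g_bar(theta):
    mu, sigma2 = theta
    # Sample moment conditions: E[x - mu] = 0 and E[(x - mu)^2 - sigma2] = 0
    return np.array([np.mean(x - mu), np.mean((x - mu) ** 2 - sigma2)])

W = np.eye(2)                                 # first-step weighting matrix

def objective(theta):
    g = g_bar(theta)
    return g @ W @ g                          # g' W g

res = minimize(objective, x0=[0.0, 1.0], method="Nelder-Mead")
print(res.x)                                  # roughly [1.0, 4.0]
```

In a second step, W would be replaced with the inverse of the estimated covariance matrix of the moments.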
Newey-West Estimator
The Newey-West estimator produces heteroskedasticity- and autocorrelation-consistent (HAC) standard errors; it is used when heteroskedasticity or autocorrelation is present in the residuals.
Delta Method
Given an estimator θ̂, the Delta Method is used to derive the asymptotic distribution of a vector of smooth functions h(θ̂):
√n·(h(θ̂) − h(θ₀)) → N(0, H Σ H′)
where H = ∂h/∂θ′ is the Jacobian of h evaluated at θ₀ and Σ is the asymptotic covariance matrix of θ̂.
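A minimal sketch of the Delta Method for an assumed example, h(θ) = log θ with θ estimated by the sample mean of exponential data:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
x = rng.exponential(scale=2.0, size=n)    # true mean theta = 2

theta_hat = x.mean()
var_theta_hat = x.var(ddof=1) / n         # estimated variance of the sample mean

# Delta method: Var[h(theta_hat)] ~ h'(theta)^2 * Var(theta_hat), with h'(t) = 1/t
grad = 1.0 / theta_hat
se_log = np.sqrt(grad**2 * var_theta_hat)
print(np.log(theta_hat), se_log)          # point estimate and delta-method standard error
```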
Hypothesis Testing Fundamentals
- Type I Error: False rejection of a true null hypothesis.
- Type II Error: Failure to reject a false null hypothesis.
Economic Design & Decision Theory
Economic design leads to decision theory: rather than fixing the significance level by convention, choose the decision rule (e.g., the rejection threshold) to balance the economic costs of Type I and Type II errors.
Event Studies Methodology
- Event Definition & Window, Selection Criteria for Firms
- Estimating the Reference Model to Get “Normal Returns”
- Compute Aggregate Abnormal Returns (see the sketch after this list): the cumulative abnormal return over the event window is CAR = γ′ε̂*, where γ is an (L2 × 1) vector with 1s in positions τ1 − T1 through τ2 − T1 and zeros elsewhere, and ε̂* is the vector of event-window abnormal returns.
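A minimal sketch of a single-firm event study under a market-model reference (all returns below are simulated stand-ins; L1 and L2 denote the estimation- and event-window lengths, matching the notation above):

```python
import numpy as np

rng = np.random.default_rng(0)
L1, L2 = 250, 21
r_m = rng.normal(0.0004, 0.01, L1 + L2)                    # market returns
r_i = 0.0002 + 1.1 * r_m + rng.normal(0, 0.02, L1 + L2)    # firm returns

# 1) Estimate the reference (market) model on the estimation window.
X = np.column_stack([np.ones(L1), r_m[:L1]])
alpha, beta = np.linalg.lstsq(X, r_i[:L1], rcond=None)[0]

# 2) Abnormal returns over the event window: realized minus "normal" returns.
ar = r_i[L1:] - (alpha + beta * r_m[L1:])

# 3) Aggregate: CAR = gamma' ar, with gamma selecting the event-window days.
gamma = np.ones(L2)
print(alpha, beta, gamma @ ar)
```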
Hypothesis Testing in Event Studies
Under H₀ (the event has no impact), abnormal returns have mean zero, with variance given by the estimation-window residual variance.
- Hypothesis Testing: compare the standardized cumulative abnormal return to a Student's t distribution
with df = L1 − 2
- Considerations:
- Heteroskedasticity: Use the cross-sectional variance of abnormal returns.
- Clustering for Correlations: Form a portfolio of the securities with the same event window (so cross-correlation is absorbed into the portfolio variance).
Time Series Analysis
Forecasting with AR(1) Models
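A minimal sketch, assuming the standard AR(1) specification yt = c + φ·yt−1 + εt (the exact specification is not reproduced in the notes above); the h-step forecast is obtained by iterating the fitted recursion:

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(0)
y = np.zeros(500)
for t in range(1, 500):                     # simulate y_t = 0.5 + 0.8*y_{t-1} + eps_t
    y[t] = 0.5 + 0.8 * y[t - 1] + rng.normal(scale=1.0)

model = ARIMA(y, order=(1, 0, 0)).fit()     # AR(1) estimated by (Gaussian) MLE
print(model.params)                         # constant, AR coefficient, innovation variance
print(model.forecast(steps=5))              # iterated h-step-ahead forecasts
```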
Unit Root Test
The unit root (Dickey-Fuller) test asks whether ρ = 1 in yt = ρ·yt−1 + εt. Equivalently, regress Δyt on yt−1 and test whether the coefficient (ρ − 1) equals zero: the null is a unit root, and rejection indicates stationarity.
- Estimation for Time Series:
- MLE
- GMM
- AIC/BIC (for lag and model selection)
- Seasonality:
θ₁ controls for short-term serial dependence; θₛ controls for seasonal dependence (the lag-s term).
- Pairs Trading: trade the spread between two cointegrated assets (long one, short the other), betting that the spread mean-reverts.
Example: two close substitutes, such as two stocks in the same industry whose prices move together.
Volatility Models
- Measuring Volatility (VIX, GARCH)
- GARCH(1,1) Model: σ²t = ω + α·ε²t−1 + β·σ²t−1 (conditional variance driven by last period's squared shock and last period's variance).
GARCH generates fat tails even if shocks are Gaussian.
- Likelihood for GARCH(1,1): with Gaussian shocks, log L = −½·Σt [log(2π) + log σ²t + ε²t/σ²t], maximized numerically (see the sketch at the end of this section).
- Non-Gaussian Extensions:
- QMLE: Uses Gaussian likelihood, relaxes distribution assumptions.
- Student’s t-GARCH: Adds heavier conditional tails.
- Other Extensions/Alternatives:
- IGARCH (Integrated GARCH): Persistent variance, no mean reversion. Used in RiskMetrics.
- EGARCH (Exponential GARCH): Captures leverage effect (asymmetric volatility response).
- MIDAS (Mixed Data Sampling): Forecasts low-frequency volatility using high-frequency data.
- Multivariate GARCH: Models covariance matrices. Risk: Parameter explosion.
- Value at Risk (VaR) & Expected Shortfall (ES)
- VaR(T,α): The loss over horizon T that is exceeded with probability at most α, i.e. the worst expected loss at confidence level 1-α.
- Expected Shortfall (ES): The expected loss conditional on the loss exceeding the VaR threshold.
- Normal Distribution: VaR and ES are fixed multiples of σ (standard-normal quantiles), so they scale directly with volatility.
- Student’s t-Distribution: heavier tails imply larger VaR and ES at the same confidence level.
RiskMetrics Example: IGARCH(1,1); Square-Root-of-Time Rule for scaling VaR.
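A minimal sketch of the GARCH(1,1) Gaussian (quasi-)likelihood described above; initializing the variance recursion with the sample variance is a common but assumed choice:

```python
import numpy as np
from scipy.optimize import minimize

def garch11_neg_loglik(params, r):
    # sigma2_t = omega + alpha * r_{t-1}^2 + beta * sigma2_{t-1}
    omega, alpha, beta = params
    sigma2 = np.empty_like(r)
    sigma2[0] = r.var()                      # assumed initialization
    for t in range(1, len(r)):
        sigma2[t] = omega + alpha * r[t - 1] ** 2 + beta * sigma2[t - 1]
    # Gaussian negative log-likelihood
    return 0.5 * np.sum(np.log(2 * np.pi) + np.log(sigma2) + r**2 / sigma2)

rng = np.random.default_rng(0)
r = rng.normal(0, 0.01, 2_000)               # stand-in demeaned returns
res = minimize(garch11_neg_loglik, x0=[1e-6, 0.05, 0.90], args=(r,),
               bounds=[(1e-12, None), (0.0, 1.0), (0.0, 1.0)], method="L-BFGS-B")
print(res.x)                                 # estimated omega, alpha, beta
```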
Classification Models
Use Cases: Credit risk, fraud detection, default prediction, customer churn, etc.
Logistic Regression (Logit)
Models the log-odds (logit) as linear in features: log[p/(1 − p)] = β₀ + β′x, so that p = 1/(1 + exp(−β₀ − β′x)).
- MLE Estimation
Log-likelihood Function: ℓ(β) = Σi [yi·log pi + (1 − yi)·log(1 − pi)], maximized numerically (no closed form); a minimal sketch follows this subsection.
- Multi-class Extension
Models relative log-odds for each class versus a reference class: log[P(y = k | x)/P(y = K | x)] = β₀k + βk′x for each non-reference class k.
- MLE Estimation
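A minimal sketch of logit estimation on simulated data, using scikit-learn (whose default fit maximizes an L2-penalized log-likelihood):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(1_000, 3))
p = 1 / (1 + np.exp(-(0.5 + X @ np.array([1.0, -2.0, 0.5]))))   # true log-odds linear in X
y = rng.binomial(1, p)

clf = LogisticRegression().fit(X, y)
print(clf.intercept_, clf.coef_)      # estimated betas
print(clf.predict_proba(X[:3]))       # direct class probabilities
```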
K-Nearest Neighbors (KNN)
Non-Parametric: No functional form assumed.
Predicts by averaging the labels of the k nearest training points (majority vote for classification).
- Large k: More bias, less variance.
- Small k: Less bias, more variance.
Limitations: Sensitive to feature scaling. Computationally expensive for large data.
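A minimal sketch illustrating the role of k and the need for feature scaling (synthetic data):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for k in (1, 5, 25):
    # Scale features first: KNN works on raw distances.
    knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    knn.fit(X_tr, y_tr)
    print(k, knn.score(X_te, y_te))   # small k: low bias / high variance; large k: the reverse
```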
Bayesian Classification
The Bayes classifier, which assigns each observation to the class with the highest posterior probability, achieves the lowest possible misclassification error of any classifier; in practice the class-conditional distributions must be estimated, so LDA/QDA approximate it under distributional assumptions.
Linear Discriminant Analysis (LDA)
Assumes a normal distribution for features within each class and equal covariance matrices across classes. It calculates a linear equation that tries to maximize the distance between group centers (means) while minimizing the spread within groups.
Each class k receives a linear score δk(x) = x′Σ⁻¹μk − ½·μk′Σ⁻¹μk + log πk; pick the class with the largest score.
| Model | Strengths | Weaknesses |
| --- | --- | --- |
| Logit (for linear relationships) | Interpretable, robust, linear boundaries, direct probabilities | Fails on nonlinear patterns, may underfit complex data |
| KNN | Non-parametric, captures nonlinearity, simple idea | Slow on large data, sensitive to scaling, needs k tuning |
| LDA (derived from probabilistic assumptions) | Efficient if normality & equal variance hold, probabilistic | Fails if assumptions violated, only linear boundaries |
| QDA (allows different covariance matrices for each class) | Captures nonlinear boundaries, flexible variances | Needs lots of data, overfits on small samples, assumes normality |

Performance Measurement
ROC Curve & AUC:
- Trade-off: Between True Positive Rate & False Positive Rate.
- Area Under the Curve (AUC): Measures overall classification quality.
Model Selection Strategies
Bias-Variance Trade-off
Mean Squared Error (MSE) Components: expected test MSE = Bias² + Variance + Irreducible Error.
Cross-Validation
K-Fold Cross-Validation:
- Split data into K folds (commonly K = 5 or 10).
- Rotate through folds, train on K-1 folds, test on the left-out fold.
- Average the errors to select the best model.
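A minimal sketch of 5-fold cross-validation used to compare two candidate models (synthetic data; the models are illustrative choices):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=200, n_features=20, noise=10.0, random_state=0)

for model in (LinearRegression(), Ridge(alpha=1.0)):
    scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error")
    print(type(model).__name__, -scores.mean())   # pick the model with the lower average MSE
```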
Linear Model Selection Techniques
- Best Subset Selection: Tries all possible combinations of predictors. Computationally expensive and prone to overfitting when p is large.
- Forward/Backward Stepwise Selection: More efficient than best subset, but not guaranteed optimal.
Regularization
- Ridge Regression: Shrinks all coefficients toward zero, but keeps all variables.
- Lasso Regression: Shrinks coefficients and drives some of them exactly to 0, performing variable selection.
Choosing Lambda (λ)
- Choose λ by cross-validation: fit the model over a grid of λ values and pick the one that minimizes the cross-validated error (see the sketch below).
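A minimal sketch of choosing λ by cross-validation with scikit-learn (which calls the penalty `alpha`); it also shows lasso zeroing out coefficients while ridge keeps all of them:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV, RidgeCV

X, y = make_regression(n_samples=200, n_features=50, n_informative=5,
                       noise=5.0, random_state=0)

lasso = LassoCV(cv=5).fit(X, y)
ridge = RidgeCV(alphas=np.logspace(-3, 3, 25), cv=5).fit(X, y)

print("lasso lambda:", lasso.alpha_, "nonzero betas:", int(np.sum(lasso.coef_ != 0)))
print("ridge lambda:", ridge.alpha_)   # ridge keeps all 50 coefficients (shrunk, not zeroed)
```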
Decision Trees & Ensembles
CART Algorithm
- Split the feature space into rectangular regions based on feature thresholds to minimize prediction error. At each step, choose the best split that improves prediction the most (greedy strategy).
- Minimize classification error for classification trees (common alternatives: the Gini index, G = Σk p̂mk·(1 − p̂mk), or cross-entropy).
- Minimize RSS for regression trees: RSS = Σj Σ(i in Rj) (yi − ŷRj)².
Tree Pruning
Look for the tree that minimizes the classification error (or RSS) plus a penalty on the size of the tree: minimize error(T) + α·|T|, where |T| is the number of leaves and α is chosen by cross-validation.
Ensemble Methods
- Bagging: Builds many trees on bootstrapped samples and averages predictions (regression) or takes a majority vote (classification) to reduce variance. (Bootstrapping more generally is also used to evaluate distributional properties, adjust bias, and improve the precision of asymptotic approximations.)
- Random Forests: At each split, only a random subset of predictors (typically √p of them) is considered, which makes the trees less correlated and further reduces variance.
- Boosting: Parameters include the number of trees (B), tree depth (d), and learning rate (λ). Each step fits a small tree fb(x) with d splits to the current residuals using all the data, then updates f(x) by adding λ·fb(x). A sketch comparing the three methods follows this list.
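A minimal sketch comparing bagging, a random forest (≈√p predictors at each split), and boosting (B trees, depth d, learning rate λ) on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (BaggingClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1_000, n_features=20, random_state=0)

models = {
    "bagging": BaggingClassifier(n_estimators=200, random_state=0),
    "random forest": RandomForestClassifier(n_estimators=200, max_features="sqrt",
                                            random_state=0),
    "boosting": GradientBoostingClassifier(n_estimators=200, max_depth=2,
                                           learning_rate=0.1, random_state=0),
}
for name, model in models.items():
    print(name, cross_val_score(model, X, y, cv=5).mean())   # CV accuracy
```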
Neural Networks
Neural networks capture complex patterns beyond linear models.
Architecture: Input → Hidden → Output Layers
- Depth: Number of layers.
- Width: Number of units per layer.
- Hyperparameters: Learning rate, batch size (number of observations to evaluate gradient), number of epochs.
Activation functions introduce non-linearity: ReLU, Leaky ReLU, Sigmoid.
Overfitting Solutions: Early stopping, batch normalization, architectural tuning.
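A minimal sketch of a small feed-forward network (two hidden layers of 32 ReLU units, early stopping); the architecture and data are illustrative assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=2_000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

net = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(32, 32), activation="relu",   # depth 2, width 32
                  learning_rate_init=1e-3, batch_size=64, max_iter=200,
                  early_stopping=True, random_state=0),
)
net.fit(X_tr, y_tr)
print(net.score(X_te, y_te))   # held-out accuracy
```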
Deep Surrogates
Pre-trained neural networks designed to mimic the outputs of complex models.
Benefits:
- Knowledge of the true Data Generating Process (DGP): the surrogate is trained on model-generated data, so the DGP is known by construction.
- Expressivity: Universal approximation theorem for shallow and deep networks.
- Efficiency: E.g., Option pricing: Deep surrogates can be 100-1000x faster than FFT methods.
- Economic Advantages: Offer effectively unlimited training data with no measurement error, and are accurate, cheaper to evaluate, and portable.
Transfer Learning (Combining Theory & Data)
Key Idea: Starts by training on synthetic data generated from theory or simulations (the “source domain”). Then, it fine-tunes the model on limited real data (the “target domain”).
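A minimal sketch of the pre-train / fine-tune loop in PyTorch; the network architecture, the stand-in data, and the choice of which layers to freeze are illustrative assumptions, not the method of any specific paper:

```python
import torch
import torch.nn as nn

net = nn.Sequential(                      # hypothetical pricing network
    nn.Linear(5, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 1),
)

def train(model, X, y, epochs, lr=1e-3):
    params = [p for p in model.parameters() if p.requires_grad]
    opt = torch.optim.Adam(params, lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(model(X), y)
        loss.backward()
        opt.step()

# 1) Source domain: pre-train on abundant synthetic data from the theoretical model.
X_syn, y_syn = torch.randn(10_000, 5), torch.randn(10_000, 1)   # stand-ins for simulated data
train(net, X_syn, y_syn, epochs=200)

# 2) Target domain: freeze the early layers, fine-tune the head on scarce real data.
for p in net[:4].parameters():
    p.requires_grad = False
X_real, y_real = torch.randn(200, 5), torch.randn(200, 1)       # stand-ins for market data
train(net, X_real, y_real, epochs=50)
```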
Exam-Relevant Benefits:
- Reduces variance.
- Requires less real data.
- Improves generalization.
- Better when the market is volatile, inputs are unusual, etc.
Empirical Evidence: Transfer Learning outperforms both Deep NNs and theoretical models (e.g., Heston) in option pricing.