Quantitative Methods & Machine Learning Essentials
Likelihood Function
The likelihood function gives the probability (or density) of the observed data as a function of the model parameters θ. It is often denoted p(x|θ), read as a function of θ with the data x held fixed.
Maximum Likelihood Estimation (MLE)
The Maximum Likelihood Estimator (MLE), δ(x), is the parameter value that maximizes the likelihood:
δ(x) = argmaxθ p(x|θ)
or equivalently, the value that maximizes the log-likelihood:
δ(x) = argmaxθ log p(x|θ)
For a Gaussian sample, the log-likelihood is log L(μ, σ²) = −(n/2)·log(2πσ²) − Σi (xi − μ)²/(2σ²).
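A minimal sketch in Python (simulated data and illustrative parameter values, not from the notes): the numerical MLE for a Gaussian sample should match the closed-form answers, the sample mean and the (biased) sample standard deviation.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=3.0, size=5_000)   # simulated Gaussian sample

def neg_loglik(params):
    mu, log_sigma = params            # optimize log(sigma) so sigma stays positive
    sigma = np.exp(log_sigma)
    return -np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                   - (x - mu) ** 2 / (2 * sigma**2))

res = minimize(neg_loglik, x0=[0.0, 0.0])
print(res.x[0], np.exp(res.x[1]))     # numerical MLE of mu and sigma
print(x.mean(), x.std(ddof=0))        # closed-form MLE for comparison
```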
Asymptotic Distribution of MLE
When n is large, the asymptotic distribution of the MLE is approximately normal:
√n·(θ̂ − θ₀) → N(0, I(θ₀)⁻¹)  (convergence in distribution)
where I(θ₀) is the Fisher information, I(θ₀) = −E[∂² log p(x|θ)/∂θ²] evaluated at θ₀.
Examples include:
- Gaussian (mean μ, known σ²): I(μ) = 1/σ², so the MLE is approximately N(μ, σ²/n).
- Exponential (rate λ): I(λ) = 1/λ², so the MLE is approximately N(λ, λ²/n).
Bayesian Estimation
The Bayesian forecast is the posterior predictive distribution, i.e. the likelihood of the next observation averaged over the posterior:
p(xT+1 | x1, …, xT) = ∫ p(xT+1 | θ)·p(θ | x1, …, xT) dθ
Assume the prior distribution on μ is Gaussian, μ ~ N(m0, τ²). The posterior for μ is then also Gaussian, with mean mT equal to a precision-weighted average of the prior mean m0 and the sample mean; the weight on the data grows with the sample size T.
When T→∞, mT converges to the sample mean (the MLE of μ), and the Maximum A Posteriori (MAP) estimator is almost the same as the MLE (relying almost entirely on data, not the prior). The MAP estimator is the value θMAP at which the posterior density is at its maximum.
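A minimal sketch of the conjugate Gaussian update for μ with known data variance σ²; the hyperparameter names m0 and tau2 are illustrative assumptions, not taken from the notes:

```python
import numpy as np

rng = np.random.default_rng(0)
mu_true, sigma2, T = 1.5, 4.0, 50
x = rng.normal(mu_true, np.sqrt(sigma2), size=T)   # observed sample

m0, tau2 = 0.0, 1.0                                # prior mean and prior variance
post_var = 1.0 / (1.0 / tau2 + T / sigma2)         # posterior variance
mT = post_var * (m0 / tau2 + x.sum() / sigma2)     # posterior mean = MAP (Gaussian posterior)

print(mT, x.mean())   # as T grows, mT approaches the sample mean (the MLE)
```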
Assume the prior distribution on θ is Gamma, p(θ) ∝ θ^(a−1)·e^(−bθ). The Gamma prior is conjugate for rate parameters (e.g., exponential or Poisson likelihoods), so the posterior is again Gamma with updated shape and rate.
Generalized Method of Moments (GMM)
The GMM estimator is defined as:
θ̂GMM = argminθ g(θ)′ W g(θ)
where g(θ) is the N×1 vector of sample moments (average residuals) and W is an N×N weighting matrix; the efficient choice of W is the inverse of the covariance matrix of the moment conditions.
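A minimal sketch of first-step GMM (identity weighting matrix) for an assumed two-moment example, estimating a mean and a variance:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
x = rng.normal(loc=1.0, scale=2.0, size=5_000)

def g_bar(theta):
    mu, sigma2 = theta
    # Sample moment conditions: E[x - mu] = 0 and E[(x - mu)^2 - sigma2] = 0
    return np.array([np.mean(x - mu), np.mean((x - mu) ** 2 - sigma2)])

W = np.eye(2)                                 # first-step weighting matrix

def objective(theta):
    g = g_bar(theta)
    return g @ W @ g                          # g' W g

res = minimize(objective, x0=[0.0, 1.0], method="Nelder-Mead")
print(res.x)                                  # roughly [1.0, 4.0]
```

In a second step, W would be replaced with the inverse of the estimated covariance matrix of the moments.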
Newey-West Estimator
The Newey-West estimator produces heteroskedasticity- and autocorrelation-consistent (HAC) standard errors; it is used when heteroskedasticity or autocorrelation is present in the residuals.
Delta Method
Given an estimator θ̂, the Delta Method is used to derive the asymptotic distribution of a vector of smooth functions h(θ̂):
√n·(h(θ̂) − h(θ₀)) → N(0, H Σ H′)
where H = ∂h/∂θ′ is the Jacobian of h evaluated at θ₀ and Σ is the asymptotic covariance matrix of θ̂.
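A minimal sketch of the Delta Method for an assumed example, h(θ) = log θ with θ estimated by the sample mean of exponential data:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
x = rng.exponential(scale=2.0, size=n)    # true mean theta = 2

theta_hat = x.mean()
var_theta_hat = x.var(ddof=1) / n         # estimated variance of the sample mean

# Delta method: Var[h(theta_hat)] ~ h'(theta)^2 * Var(theta_hat), with h'(t) = 1/t
grad = 1.0 / theta_hat
se_log = np.sqrt(grad**2 * var_theta_hat)
print(np.log(theta_hat), se_log)          # point estimate and delta-method standard error
```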
Hypothesis Testing Fundamentals
- Type I Error: False rejection of a true null hypothesis.
- Type II Error: Failure to reject a false null hypothesis.
Economic Design & Decision Theory
Economic design leads to decision theory: rather than fixing the significance level by convention, choose the decision rule (e.g., the rejection threshold) to balance the economic costs of Type I and Type II errors.
Event Studies Methodology
- Event Definition & Window, Selection Criteria for Firms
- Estimating the Reference Model to Get “Normal Returns”
- Compute Aggregate Abnormal Returns (see the sketch after this list): the cumulative abnormal return over the event window is CAR = γ′ε̂*, where γ is an (L2 × 1) vector with 1s in positions τ1 − T1 through τ2 − T1 and zeros elsewhere, and ε̂* is the vector of event-window abnormal returns.
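A minimal sketch of a single-firm event study under a market-model reference (all returns below are simulated stand-ins; L1 and L2 denote the estimation- and event-window lengths, matching the notation above):

```python
import numpy as np

rng = np.random.default_rng(0)
L1, L2 = 250, 21
r_m = rng.normal(0.0004, 0.01, L1 + L2)                    # market returns
r_i = 0.0002 + 1.1 * r_m + rng.normal(0, 0.02, L1 + L2)    # firm returns

# 1) Estimate the reference (market) model on the estimation window.
X = np.column_stack([np.ones(L1), r_m[:L1]])
alpha, beta = np.linalg.lstsq(X, r_i[:L1], rcond=None)[0]

# 2) Abnormal returns over the event window: realized minus "normal" returns.
ar = r_i[L1:] - (alpha + beta * r_m[L1:])

# 3) Aggregate: CAR = gamma' ar, with gamma selecting the event-window days.
gamma = np.ones(L2)
print(alpha, beta, gamma @ ar)
```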
Hypothesis Testing in Event Studies
Under H₀ (the event has no impact), abnormal returns have mean zero, with variance given by the estimation-window residual variance.
- Hypothesis Testing: compare the standardized cumulative abnormal return to a Student's t distribution
with df = L1 − 2
- Considerations:
- Heteroskedasticity: Use the cross-sectional variance of abnormal returns.
- Clustering for Correlations: Form a portfolio of the securities with the same event window (so cross-correlation is absorbed into the portfolio variance).
Time Series Analysis
Forecasting with AR(1) Models
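A minimal sketch, assuming the standard AR(1) specification yt = c + φ·yt−1 + εt (the exact specification is not reproduced in the notes above); the h-step forecast is obtained by iterating the fitted recursion:

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(0)
y = np.zeros(500)
for t in range(1, 500):                     # simulate y_t = 0.5 + 0.8*y_{t-1} + eps_t
    y[t] = 0.5 + 0.8 * y[t - 1] + rng.normal(scale=1.0)

model = ARIMA(y, order=(1, 0, 0)).fit()     # AR(1) estimated by (Gaussian) MLE
print(model.params)                         # constant, AR coefficient, innovation variance
print(model.forecast(steps=5))              # iterated h-step-ahead forecasts
```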
Unit Root Test
The unit root (Dickey-Fuller) test asks whether ρ = 1 in yt = ρ·yt−1 + εt. Equivalently, regress Δyt on yt−1 and test whether the coefficient (ρ − 1) equals zero: the null is a unit root, and rejection indicates stationarity.
- Estimation for Time Series:
- MLE
- GMM
- AIC/BIC (for lag and model selection)
- Seasonality:
θ₁ controls for short-term serial dependence; θₛ controls for seasonal dependence (the lag-s term).
- Pairs Trading: trade the spread between two cointegrated assets (long one, short the other), betting that the spread mean-reverts.
Example: two close substitutes, such as two stocks in the same industry whose prices move together.
Volatility Models
- Measuring Volatility (VIX, GARCH)
- GARCH(1,1) Model: σ²t = ω + α·ε²t−1 + β·σ²t−1 (conditional variance driven by last period's squared shock and last period's variance).
GARCH generates fat tails even if shocks are Gaussian.
- Likelihood for GARCH(1,1): with Gaussian shocks, log L = −½·Σt [log(2π) + log σ²t + ε²t/σ²t], maximized numerically (see the sketch at the end of this section).
- Non-Gaussian Extensions:
- QMLE: Uses Gaussian likelihood, relaxes distribution assumptions.
- Student’s t-GARCH: Adds heavier conditional tails.
- Other Extensions/Alternatives:
- IGARCH (Integrated GARCH): Persistent variance, no mean reversion. Used in RiskMetrics.
- EGARCH (Exponential GARCH): Captures leverage effect (asymmetric volatility response).
- MIDAS (Mixed Data Sampling): Forecasts low-frequency volatility using high-frequency data.
- Multivariate GARCH: Models covariance matrices. Risk: Parameter explosion.
- Value at Risk (VaR) & Expected Shortfall (ES)
- VaR(T,α): The loss over horizon T that is exceeded with probability at most α, i.e. the worst expected loss at confidence level 1-α.
- Expected Shortfall (ES): The expected loss conditional on the loss exceeding the VaR threshold.
- Normal Distribution: VaR and ES are fixed multiples of σ (standard-normal quantiles), so they scale directly with volatility.
- Student’s t-Distribution: heavier tails imply larger VaR and ES at the same confidence level.
RiskMetrics Example: IGARCH(1,1); Square-Root-of-Time Rule for scaling VaR.
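A minimal sketch of the GARCH(1,1) Gaussian (quasi-)likelihood described above; initializing the variance recursion with the sample variance is a common but assumed choice:

```python
import numpy as np
from scipy.optimize import minimize

def garch11_neg_loglik(params, r):
    # sigma2_t = omega + alpha * r_{t-1}^2 + beta * sigma2_{t-1}
    omega, alpha, beta = params
    sigma2 = np.empty_like(r)
    sigma2[0] = r.var()                      # assumed initialization
    for t in range(1, len(r)):
        sigma2[t] = omega + alpha * r[t - 1] ** 2 + beta * sigma2[t - 1]
    # Gaussian negative log-likelihood
    return 0.5 * np.sum(np.log(2 * np.pi) + np.log(sigma2) + r**2 / sigma2)

rng = np.random.default_rng(0)
r = rng.normal(0, 0.01, 2_000)               # stand-in demeaned returns
res = minimize(garch11_neg_loglik, x0=[1e-6, 0.05, 0.90], args=(r,),
               bounds=[(1e-12, None), (0.0, 1.0), (0.0, 1.0)], method="L-BFGS-B")
print(res.x)                                 # estimated omega, alpha, beta
```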
Classification Models
Use Cases: Credit risk, fraud detection, default prediction, customer churn, etc.
Logistic Regression (Logit)
Models the log-odds (logit) as linear in features: log[p/(1 − p)] = β₀ + β′x, so that p = 1/(1 + exp(−β₀ − β′x)).
- MLE Estimation
Log-likelihood Function: ℓ(β) = Σi [yi·log pi + (1 − yi)·log(1 − pi)], maximized numerically (no closed form); a minimal sketch follows this subsection.
- Multi-class Extension
Models relative log-odds for each class versus a reference class: log[P(y = k | x)/P(y = K | x)] = β₀k + βk′x for each non-reference class k.
- MLE Estimation
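A minimal sketch of logit estimation on simulated data, using scikit-learn (whose default fit maximizes an L2-penalized log-likelihood):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(1_000, 3))
p = 1 / (1 + np.exp(-(0.5 + X @ np.array([1.0, -2.0, 0.5]))))   # true log-odds linear in X
y = rng.binomial(1, p)

clf = LogisticRegression().fit(X, y)
print(clf.intercept_, clf.coef_)      # estimated betas
print(clf.predict_proba(X[:3]))       # direct class probabilities
```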
K-Nearest Neighbors (KNN)
Non-Parametric: No functional form assumed.
Predicts by averaging the labels of the k nearest training points (majority vote for classification).
- Large k: More bias, less variance.
- Small k: Less bias, more variance.
Limitations: Sensitive to feature scaling. Computationally expensive for large data.
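A minimal sketch illustrating the role of k and the need for feature scaling (synthetic data):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for k in (1, 5, 25):
    # Scale features first: KNN works on raw distances.
    knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    knn.fit(X_tr, y_tr)
    print(k, knn.score(X_te, y_te))   # small k: low bias / high variance; large k: the reverse
```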
Bayesian Classification
The Bayes classifier, which assigns each observation to the class with the highest posterior probability, achieves the lowest possible misclassification error of any classifier; in practice the class-conditional distributions must be estimated, so LDA/QDA approximate it under distributional assumptions.
Linear Discriminant Analysis (LDA)
Assumes a normal distribution for features within each class and equal covariance matrices across classes. It calculates a linear equation that tries to maximize the distance between group centers (means) while minimizing the spread within groups.
Each class k receives a linear score δk(x) = x′Σ⁻¹μk − ½·μk′Σ⁻¹μk + log πk; pick the class with the largest score.
| Model | Strengths | Weaknesses |
| --- | --- | --- |
| Logit (for linear relationships) | Interpretable, robust, linear boundaries, direct probabilities | Fails on nonlinear patterns, may underfit complex data |
| KNN | Non-parametric, captures nonlinearity, simple idea | Slow on large data, sensitive to scaling, needs k tuning |
| LDA (derived from probabilistic assumptions) | Efficient if normality & equal variance hold, probabilistic | Fails if assumptions violated, only linear boundaries |
| QDA (allows different covariance matrices for each class) | Captures nonlinear boundaries, flexible variances | Needs lots of data, overfits on small samples, assumes normality |

Performance Measurement
ROC Curve & AUC:
- Trade-off: Between True Positive Rate & False Positive Rate.
- Area Under the Curve (AUC): Measures overall classification quality.
Model Selection Strategies
Bias-Variance Trade-off
Mean Squared Error (MSE) Components: expected test MSE = Bias² + Variance + Irreducible Error.
Cross-Validation
K-Fold Cross-Validation:
- Split data into K folds (commonly K = 5 or 10).
- Rotate through folds, train on K-1 folds, test on the left-out fold.
- Average the errors to select the best model.
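A minimal sketch of 5-fold cross-validation used to compare two candidate models (synthetic data; the models are illustrative choices):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=200, n_features=20, noise=10.0, random_state=0)

for model in (LinearRegression(), Ridge(alpha=1.0)):
    scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error")
    print(type(model).__name__, -scores.mean())   # pick the model with the lower average MSE
```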
Linear Model Selection Techniques
- Best Subset Selection: Tries all possible combinations of predictors. Computationally expensive and prone to overfitting when p is large.
- Forward/Backward Stepwise Selection: More efficient than best subset, but not guaranteed optimal.
Regularization
- Ridge Regression: Shrinks all coefficients toward zero, but keeps all variables.
- Lasso Regression: Shrinks coefficients and drives some of them exactly to 0, performing variable selection.
Choosing Lambda (λ)
- Choose λ by cross-validation: fit the model over a grid of λ values and pick the one that minimizes the cross-validated error (see the sketch below).
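A minimal sketch of choosing λ by cross-validation with scikit-learn (which calls the penalty `alpha`); it also shows lasso zeroing out coefficients while ridge keeps all of them:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV, RidgeCV

X, y = make_regression(n_samples=200, n_features=50, n_informative=5,
                       noise=5.0, random_state=0)

lasso = LassoCV(cv=5).fit(X, y)
ridge = RidgeCV(alphas=np.logspace(-3, 3, 25), cv=5).fit(X, y)

print("lasso lambda:", lasso.alpha_, "nonzero betas:", int(np.sum(lasso.coef_ != 0)))
print("ridge lambda:", ridge.alpha_)   # ridge keeps all 50 coefficients (shrunk, not zeroed)
```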
Decision Trees & Ensembles
CART Algorithm
- Split the feature space into rectangular regions based on feature thresholds to minimize prediction error. At each step, choose the best split that improves prediction the most (greedy strategy).
- Minimize classification error for classification trees (common alternatives: the Gini index, G = Σk p̂mk·(1 − p̂mk), or cross-entropy).
- Minimize RSS for regression trees: RSS = Σj Σ(i in Rj) (yi − ŷRj)².
Tree Pruning
Look for the tree that minimizes the classification error (or RSS) plus a penalty on the size of the tree: minimize error(T) + α·|T|, where |T| is the number of leaves and α is chosen by cross-validation.
Ensemble Methods
- Bagging: Builds many trees on bootstrapped samples and averages predictions (regression) or takes a majority vote (classification) to reduce variance. (Bootstrapping more generally is also used to evaluate distributional properties, adjust bias, and improve the precision of asymptotic approximations.)
- Random Forests: At each split, only a random subset of predictors (typically √p of them) is considered, which makes the trees less correlated and further reduces variance.
- Boosting: Parameters include the number of trees (B), tree depth (d), and learning rate (λ). Each step fits a small tree fb(x) with d splits to the current residuals using all the data, then updates f(x) by adding λ·fb(x). A sketch comparing the three methods follows this list.
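A minimal sketch comparing bagging, a random forest (≈√p predictors at each split), and boosting (B trees, depth d, learning rate λ) on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (BaggingClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1_000, n_features=20, random_state=0)

models = {
    "bagging": BaggingClassifier(n_estimators=200, random_state=0),
    "random forest": RandomForestClassifier(n_estimators=200, max_features="sqrt",
                                            random_state=0),
    "boosting": GradientBoostingClassifier(n_estimators=200, max_depth=2,
                                           learning_rate=0.1, random_state=0),
}
for name, model in models.items():
    print(name, cross_val_score(model, X, y, cv=5).mean())   # CV accuracy
```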
Neural Networks
Neural networks capture complex patterns beyond linear models.
Architecture: Input → Hidden → Output Layers
- Depth: Number of layers.
- Width: Number of units per layer.
- Hyperparameters: Learning rate, batch size (number of observations to evaluate gradient), number of epochs.
Activation functions introduce non-linearity: ReLU, Leaky ReLU, Sigmoid.
Overfitting Solutions: Early stopping, batch normalization, architectural tuning.
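A minimal sketch of a small feed-forward network (two hidden layers of 32 ReLU units, early stopping); the architecture and data are illustrative assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=2_000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

net = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(32, 32), activation="relu",   # depth 2, width 32
                  learning_rate_init=1e-3, batch_size=64, max_iter=200,
                  early_stopping=True, random_state=0),
)
net.fit(X_tr, y_tr)
print(net.score(X_te, y_te))   # held-out accuracy
```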
Deep Surrogates
Pre-trained neural networks designed to mimic the outputs of complex models.
Benefits:
- Knowledge of the true Data Generating Process (DGP): the surrogate is trained on model-generated data, so the DGP is known by construction.
- Expressivity: Universal approximation theorem for shallow and deep networks.
- Efficiency: E.g., Option pricing: Deep surrogates can be 100-1000x faster than FFT methods.
- Economic Advantages: Offer effectively unlimited training data with no measurement error, and are accurate, cheaper to evaluate, and portable.
Transfer Learning (Combining Theory & Data)
Key Idea: Starts by training on synthetic data generated from theory or simulations (the “source domain”). Then, it fine-tunes the model on limited real data (the “target domain”).
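A minimal sketch of the pre-train / fine-tune loop in PyTorch; the network architecture, the stand-in data, and the choice of which layers to freeze are illustrative assumptions, not the method of any specific paper:

```python
import torch
import torch.nn as nn

net = nn.Sequential(                      # hypothetical pricing network
    nn.Linear(5, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 1),
)

def train(model, X, y, epochs, lr=1e-3):
    params = [p for p in model.parameters() if p.requires_grad]
    opt = torch.optim.Adam(params, lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(model(X), y)
        loss.backward()
        opt.step()

# 1) Source domain: pre-train on abundant synthetic data from the theoretical model.
X_syn, y_syn = torch.randn(10_000, 5), torch.randn(10_000, 1)   # stand-ins for simulated data
train(net, X_syn, y_syn, epochs=200)

# 2) Target domain: freeze the early layers, fine-tune the head on scarce real data.
for p in net[:4].parameters():
    p.requires_grad = False
X_real, y_real = torch.randn(200, 5), torch.randn(200, 1)       # stand-ins for market data
train(net, X_real, y_real, epochs=50)
```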
Exam-Relevant Benefits:
- Reduces variance.
- Requires less real data.
- Improves generalization.
- Better when the market is volatile, inputs are unusual, etc.
Empirical Evidence: Transfer Learning outperforms both Deep NNs and theoretical models (e.g., Heston) in option pricing.