Quantitative Methods & Machine Learning Essentials

Likelihood Function

The likelihood function is the probability (or density) of the observed data, viewed as a function of the model parameters θ. It is often denoted p(x|θ).

Maximum Likelihood Estimation (MLE)

The Maximum Likelihood Estimator (MLE), δ(x), is defined as:

δ(x) = argmax_θ p(x|θ)

or equivalently:

δ(x) = argmax_θ log p(x|θ)

For a Gaussian sample x₁, …, xₙ, the log-likelihood being maximized is log p(x|μ, σ²) = −(n/2) log(2πσ²) − (1/(2σ²)) Σᵢ (xᵢ − μ)².
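
As a quick illustration, here is a minimal sketch of Gaussian MLE by direct numerical minimization of the negative log-likelihood (Python with NumPy/SciPy; the simulated data and starting values are illustrative):

```python
# Minimal sketch: Gaussian MLE via numerical optimization.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=500)   # simulated data

def neg_log_lik(params, x):
    mu, log_sigma = params             # optimize log(sigma) to keep sigma > 0
    sigma = np.exp(log_sigma)
    n = x.size
    return 0.5 * n * np.log(2 * np.pi * sigma**2) + np.sum((x - mu)**2) / (2 * sigma**2)

res = minimize(neg_log_lik, x0=[0.0, 0.0], args=(x,))
mu_hat, sigma_hat = res.x[0], np.exp(res.x[1])
print(mu_hat, sigma_hat)   # should be close to the sample mean and std
```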

Asymptotic Distribution of MLE

When n is large, the asymptotic distribution of the MLE is given by:

√n (θ̂ − θ₀) →d N(0, I(θ₀)⁻¹),  i.e., approximately θ̂ ~ N(θ₀, I(θ₀)⁻¹/n)

where:

I(θ₀) = −E[∂² log p(x|θ)/∂θ∂θ′] evaluated at θ₀ (the Fisher information).

Examples include:

  • Gaussian: for the mean with known σ², I(μ) = 1/σ², so μ̂ = x̄ ≈ N(μ, σ²/n).
  • Exponential: for rate θ (density θe^(−θx)), θ̂ = 1/x̄ and I(θ) = 1/θ², so θ̂ ≈ N(θ, θ²/n).
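
The exponential case is easy to check by simulation; this sketch (θ, n, and the replication count are illustrative) compares the Monte Carlo standard deviation of θ̂ = 1/x̄ to the asymptotic value θ/√n:

```python
# Sketch: Monte Carlo check of the exponential-rate MLE's asymptotic std.
import numpy as np

rng = np.random.default_rng(1)
theta, n, reps = 2.0, 400, 5000
samples = rng.exponential(scale=1/theta, size=(reps, n))  # Exp(theta) draws
theta_hat = 1.0 / samples.mean(axis=1)                    # MLE per replication
print(theta_hat.std(), theta / np.sqrt(n))                # empirical vs asymptotic std
```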

Bayesian Estimation

The Bayesian forecast is given by:

p(x_(T+1) | x₁, …, x_T) = ∫ p(x_(T+1) | θ) p(θ | x₁, …, x_T) dθ

Assume the prior distribution on μ is Gaussian:

μ ~ N(m₀, ν₀²)

Combined with the likelihood of x₁, …, x_T ~ N(μ, σ²) (σ² known), the posterior is also Gaussian:

μ | x₁, …, x_T ~ N(m_T, ν_T²)

where:

ν_T² = (1/ν₀² + T/σ²)⁻¹  and  m_T = ν_T² (m₀/ν₀² + T·x̄/σ²)

As T→∞, the prior washes out: ν_T² → 0 and m_T converges to the sample mean x̄ (and hence to μ), so the Maximum A Posteriori (MAP) estimator is almost the same as the MLE (relying almost entirely on the data, not the prior). The MAP estimator is the value θ_MAP at which the posterior density is at its maximum.

Assume the prior distribution on θ is Gamma:

p(θ) = [β^α / Γ(α)] θ^(α−1) e^(−βθ),  i.e., θ ~ Gamma(α, β)
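
A minimal numerical sketch of the conjugate normal-mean update above (σ known; all numbers are illustrative):

```python
# Sketch of the Gaussian prior -> Gaussian posterior update for the mean.
import numpy as np

rng = np.random.default_rng(2)
sigma, mu_true = 1.0, 0.7
x = rng.normal(mu_true, sigma, size=50)

m0, nu0_sq = 0.0, 4.0                          # prior mean and variance on mu
T, xbar = x.size, x.mean()
nuT_sq = 1.0 / (1.0/nu0_sq + T/sigma**2)       # posterior variance
mT = nuT_sq * (m0/nu0_sq + T*xbar/sigma**2)    # precision-weighted posterior mean
print(mT, nuT_sq)   # as T grows, mT -> xbar and nuT_sq -> 0
```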

Generalized Method of Moments (GMM)

The GMM estimator is defined as:

θ̂_GMM = argmin_θ g_T(θ)′ W g_T(θ)

where g_T(θ) stacks the N sample moment conditions (average residuals) and W is an N×N weighting matrix; the efficient choice is the inverse of the covariance matrix of the moment conditions.

√T (θ̂ − θ₀) →d N(0, (D′S⁻¹D)⁻¹)  (with the efficient weighting W = S⁻¹)

where:

D = E[∂gₜ(θ₀)/∂θ′]  (the Jacobian of the moment conditions)

S = Σⱼ E[gₜ(θ₀) gₜ₋ⱼ(θ₀)′]  (the long-run covariance matrix of the moments)
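
A toy, just-identified GMM sketch: match the first two moments of a sample using an identity weighting matrix (data, starting values, and names are illustrative):

```python
# Minimal just-identified GMM sketch: two moment conditions, identity weighting.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)
x = rng.normal(1.0, 2.0, size=1000)

def gmm_objective(theta, x, W=np.eye(2)):
    mu, sigma_sq = theta
    g = np.array([np.mean(x - mu),                   # E[x - mu] = 0
                  np.mean((x - mu)**2 - sigma_sq)])  # E[(x - mu)^2 - sigma^2] = 0
    return g @ W @ g

res = minimize(gmm_objective, x0=[0.0, 1.0], args=(x,))
print(res.x)   # close to (1.0, 4.0)
```

With more moments than parameters (over-identification), the second-step efficient estimator replaces the identity matrix with an estimate of S⁻¹.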

Newey-West Estimator

The Newey–West estimator gives a heteroskedasticity- and autocorrelation-consistent (HAC) estimate of the long-run covariance S when the residuals are heteroskedastic or autocorrelated: Ŝ = Γ̂₀ + Σⱼ₌₁..q (1 − j/(q + 1))(Γ̂ⱼ + Γ̂ⱼ′), where Γ̂ⱼ is the j-th sample autocovariance of the moment conditions.

Delta Method

Given an estimator θ̂, the Delta Method is used to derive the asymptotic distribution of a vector of smooth functions h(θ̂):

√T (h(θ̂) − h(θ₀)) →d N(0, H V H′)

where:

H = ∂h(θ)/∂θ′ evaluated at θ₀ (the Jacobian of h), and V is the asymptotic covariance matrix of θ̂ (so that √T(θ̂ − θ₀) →d N(0, V)).
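
A one-parameter numerical illustration, assuming h(θ) = 1/θ with θ̂ the sample mean (the data and function are illustrative, not from the text):

```python
# Sketch: delta-method standard error for h(theta) = 1/theta.
import numpy as np

rng = np.random.default_rng(4)
x = rng.exponential(scale=0.5, size=2000)   # mean 0.5, so 1/mean is about 2
theta_hat = x.mean()
var_theta = x.var(ddof=1) / x.size          # Var(sample mean)

h = 1.0 / theta_hat
grad = -1.0 / theta_hat**2                  # h'(theta)
se_h = np.sqrt(grad**2 * var_theta)         # delta-method SE: h'(theta)^2 * Var(theta_hat)
print(h, se_h)
```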

Hypothesis Testing Fundamentals

  • Type I Error: False rejection of a true null hypothesis.
  • Type II Error: Failure to reject a false null hypothesis.

Economic Design & Decision Theory

Economic test design leads to decision theory: choose the decision rule δ to minimize expected loss (risk), e.g.

R(θ, δ) = E_θ[L(θ, δ(x))]

Event Studies Methodology

  1. Event Definition & Window, Selection Criteria for Firms
  2. Estimating the Reference Model to Get “Normal Returns”

    Rᵢₜ = αᵢ + βᵢRₘₜ + εᵢₜ,  E[εᵢₜ] = 0,  Var(εᵢₜ) = σ²_εᵢ  (the market model)

    The abnormal return on event date τ is the out-of-sample residual:

    ARᵢτ = Rᵢτ − α̂ᵢ − β̂ᵢRₘτ

    Its variance equals σ²_εᵢ plus a correction for the sampling error in α̂ᵢ and β̂ᵢ, which vanishes as the estimation-window length L₁ grows.

  3. Compute Aggregate Abnormal Returns

    CARᵢ(τ₁, τ₂) = Σ_(τ=τ₁)..(τ₂) ARᵢτ  (equivalently γ′ARᵢ)

    where:

    σᵢ²(τ₁, τ₂) = γ′Vᵢγ, with Vᵢ the covariance matrix of the event-window abnormal returns.

  4. Aggregate Abnormal Returns: γ is an (L2 × 1) vector with 1s in positions τ1 − T1 through τ2 − T1 and zeros elsewhere.
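
A hedged sketch of steps 2 and 3 on simulated data (window lengths and parameters are illustrative; for simplicity, the t-statistic here ignores the estimation error in α̂, β̂ noted above):

```python
# Sketch: market-model abnormal returns and a CAR over the event window.
import numpy as np

rng = np.random.default_rng(5)
L1, L2 = 250, 21                              # estimation and event window lengths
r_m = rng.normal(0.0004, 0.01, L1 + L2)       # market returns
r_i = 0.0002 + 1.2 * r_m + rng.normal(0, 0.015, L1 + L2)

# Step 2: estimate alpha, beta on the estimation window only
X = np.column_stack([np.ones(L1), r_m[:L1]])
(alpha, beta), *_ = np.linalg.lstsq(X, r_i[:L1], rcond=None)
resid = r_i[:L1] - alpha - beta * r_m[:L1]
sigma2_eps = resid @ resid / (L1 - 2)

# Step 3: abnormal returns over the event window, then cumulate
ar = r_i[L1:] - alpha - beta * r_m[L1:]
car = ar.sum()
t_stat = car / np.sqrt(L2 * sigma2_eps)       # ignores estimation error in alpha, beta
print(car, t_stat)
```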

Hypothesis Testing in Event Studies

Under H₀:

CARᵢ(τ₁, τ₂) ~ N(0, σᵢ²(τ₁, τ₂))

  1. Hypothesis Testing:

    SCARᵢ(τ₁, τ₂) = CARᵢ(τ₁, τ₂) / σ̂ᵢ(τ₁, τ₂),

    which follows a Student's t distribution with L₁ − 2 degrees of freedom.

    Aggregating across N events: J₁ = CAR̄(τ₁, τ₂) / σ̂(CAR̄) ≈ N(0, 1), where CAR̄ = (1/N) Σᵢ CARᵢ(τ₁, τ₂).

  2. Considerations:
    1. Heteroskedasticity: Use cross-sectional variance.

      σ̂²(CAR̄) = (1/N²) Σᵢ (CARᵢ − CAR̄)²  (the cross-sectional variance estimator)

    2. Clustering for Correlations: Form a portfolio of the securities with the same event window.

Time Series Analysis

  1. AR(1) model: yₜ = μ + ρyₜ₋₁ + εₜ, where εₜ is white noise with mean 0 and variance σ².
  2. Under stationarity (|ρ| < 1), the unconditional moments are:

    E[yₜ] = μ/(1 − ρ),  Var(yₜ) = σ²/(1 − ρ²),  Corr(yₜ, yₜ₋ⱼ) = ρʲ

Forecasting with AR(1) Models

For the AR(1) model yₜ = μ + ρyₜ₋₁ + εₜ, iterate the recursion forward from date T:

  1. One step ahead: E_T[y_(T+1)] = μ + ρy_T.
  2. h steps ahead: E_T[y_(T+h)] = μ(1 + ρ + … + ρ^(h−1)) + ρʰy_T, which decays toward the unconditional mean μ/(1 − ρ).
  3. Forecast-error variance: Var_T(y_(T+h)) = σ²(1 + ρ² + … + ρ^(2(h−1))), which grows toward σ²/(1 − ρ²).
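
These recursions are easy to verify numerically; a short sketch with illustrative parameters:

```python
# Sketch: iterated AR(1) forecasts and forecast-error variances.
mu, rho, sigma2, y_T = 0.1, 0.8, 0.04, 1.5
H = 5
forecasts, fvar = [], []
f, v = y_T, 0.0
for h in range(1, H + 1):
    f = mu + rho * f            # E_T[y_{T+h}] = mu + rho * E_T[y_{T+h-1}]
    v = sigma2 + rho**2 * v     # variance grows toward sigma2 / (1 - rho^2)
    forecasts.append(f); fvar.append(v)
print(forecasts, fvar)
```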

Unit Root Test

The (Dickey–Fuller) unit root test regresses the first difference on the lagged level and tests whether ρ = 0:

Δyₜ = α + ρyₜ₋₁ + εₜ,  H₀: ρ = 0 (unit root) vs. H₁: ρ < 0

Under H₀ the t-statistic follows the non-standard Dickey–Fuller distribution rather than a Student's t.
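
In practice the augmented version (ADF) is run with packaged critical values; a sketch using statsmodels (the random-walk data are illustrative):

```python
# Sketch: Augmented Dickey-Fuller test; a random walk should not reject H0.
import numpy as np
from statsmodels.tsa.stattools import adfuller

rng = np.random.default_rng(6)
y = np.cumsum(rng.normal(size=500))   # random walk: has a unit root
stat, pvalue, *_ = adfuller(y)
print(stat, pvalue)                   # large p-value: cannot reject the unit root
```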

  1. Estimation for Time Series:
    1. MLE
    2. GMM
    3. AIC/BIC

    AIC = −2 log L + 2k,  BIC = −2 log L + k log T,  where k is the number of parameters; pick the model that minimizes the criterion.

  2. Seasonality:

    yₜ = μ + εₜ + θ₁εₜ₋₁ + θ_s εₜ₋ₛ  (an MA specification with a seasonal lag at s)

    θ₁ controls for short-term serial dependence; θ_s controls for seasonal dependence.

  3. Pairs Trading:

    Example: if two prices p_t^A and p_t^B are cointegrated, the spread zₜ = p_t^A − γp_t^B is stationary and mean-reverting, so deviations of zₜ from its mean can be traded.

Volatility Models

  1. Measuring Volatility (VIX, GARCH)

    A simple baseline is the sample variance of returns, σ̂² = (1/T) Σₜ (rₜ − r̄)²; alternatives include realized variance from high-frequency returns and the option-implied VIX.

  2. GARCH(1,1) Model
    1. σ²ₜ = ω + αr²ₜ₋₁ + βσ²ₜ₋₁, with ω > 0, α, β ≥ 0, and α + β < 1 for covariance stationarity.

      Unconditional variance: σ̄² = ω/(1 − α − β).

      GARCH generates fat tails even if shocks are Gaussian.

    2. Likelihood for GARCH(1,1)

      Conditional on past information, rₜ ~ N(0, σ²ₜ), so

      log L = −(1/2) Σₜ [log(2π) + log σ²ₜ + r²ₜ/σ²ₜ],

      where σ²ₜ is built recursively from the GARCH equation (see the sketch after this list).

    3. Non-Gaussian Extensions:
      • QMLE: Maximizes the Gaussian likelihood even when shocks are non-Gaussian; parameter estimates remain consistent under mild conditions.
      • Student’s t-GARCH: Adds heavier conditional tails.
    4. Other Extensions/Alternatives:
      • IGARCH (Integrated GARCH): α + β = 1, so variance shocks are persistent with no mean reversion. Used in RiskMetrics.
      • EGARCH (Exponential GARCH): Captures leverage effect (asymmetric volatility response).
      • MIDAS (Mixed Data Sampling): Forecasts low-frequency volatility using high-frequency data.
      • Multivariate GARCH: Models covariance matrices. Risk: Parameter explosion.
  3. Value at Risk (VaR) & Expected Shortfall (ES)
    • VaR(T, α): The loss over horizon T that is exceeded with probability α (i.e., the worst loss at the 1 − α confidence level).
    • Expected Shortfall (ES): The expected loss conditional on the loss exceeding the VaR threshold.
    1. Normal Distribution

      For returns ~ N(μ, σ²): VaR_α = −(μ + σΦ⁻¹(α)) and ES_α = −μ + σφ(Φ⁻¹(α))/α, where φ and Φ are the standard normal pdf and cdf (sign conventions vary across texts).

    2. Student’s t-Distribution

      With Student's t shocks (ν degrees of freedom), replace the normal quantile with the (scaled) t quantile: VaR_α = −(μ + σ t_ν⁻¹(α)); ES has an analogous closed form involving the t density and the factor (ν + t_ν⁻¹(α)²)/(ν − 1).

    RiskMetrics Example: IGARCH(1,1); Square-Root-of-Time Rule for scaling VaR.
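
To make the GARCH recursion, its Gaussian likelihood, and a normal VaR concrete, here is a minimal sketch; the parameter values and the placeholder return series are illustrative, not estimates (assumes NumPy/SciPy):

```python
# Sketch: GARCH(1,1) variance recursion, Gaussian log-likelihood, 1-day normal VaR.
import numpy as np
from scipy.stats import norm

def garch_loglik(params, r):
    omega, alpha, beta = params
    sigma2 = np.empty_like(r)
    sigma2[0] = r.var()                       # a common initialization choice
    for t in range(1, r.size):
        sigma2[t] = omega + alpha * r[t-1]**2 + beta * sigma2[t-1]
    ll = -0.5 * np.sum(np.log(2*np.pi*sigma2) + r**2 / sigma2)
    return ll, sigma2

rng = np.random.default_rng(7)
r = rng.normal(0, 0.01, 1000)                 # placeholder return series
ll, sigma2 = garch_loglik([1e-6, 0.08, 0.90], r)

# Normal VaR at level alpha from the latest conditional volatility
alpha_var = 0.05
var_1d = -norm.ppf(alpha_var) * np.sqrt(sigma2[-1])
print(ll, var_1d)
```

In estimation, this log-likelihood would be maximized over (ω, α, β) subject to the stationarity constraints above.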

Classification Models

Use Cases: Credit risk, fraud detection, default prediction, customer churn, etc.

  1. Logistic Regression (Logit)

    Models the log-odds (logit) as linear in features:

    log[p/(1 − p)] = β₀ + β₁x₁ + … + β_px_p = x′β,  so that  p = exp(x′β)/(1 + exp(x′β))

    1. MLE Estimation

      L(β) = Πᵢ pᵢ^(yᵢ) (1 − pᵢ)^(1−yᵢ)

      Log-likelihood Function:

      ℓ(β) = Σᵢ [yᵢ log pᵢ + (1 − yᵢ) log(1 − pᵢ)]

    2. Multi-class Extension

      Models relative log-odds for each class versus a reference class:

      log[P(y = k | x) / P(y = K | x)] = x′βₖ  for each class k, with K the reference class

  2. K-Nearest Neighbors (KNN)

    Non-Parametric: No functional form assumed.

    Predicts based on the average of neighbors’ labels:

    P̂(y = j | x) = (1/k) Σ_(i ∈ N_k(x)) 1(yᵢ = j),  where N_k(x) is the set of the k nearest training points to x

    • Large k: More bias, less variance.
    • Small k: Less bias, more variance.

    Limitations: Sensitive to feature scaling. Computationally expensive for large data.

  3. Bayesian Classification

    Achieves the lowest possible expected misclassification error (the Bayes error rate) among all classifiers, but requires the true conditional class probabilities, so in practice it serves as a benchmark.

    ŷ(x) = argmaxⱼ P(y = j | x)

  4. Linear Discriminant Analysis (LDA)

    Assumes a normal distribution for features within each class and equal covariance matrices across classes. It calculates a linear equation that tries to maximize the distance between group centers (means) while minimizing the spread within groups.

    δₖ(x) = x′Σ⁻¹μₖ − (1/2)μₖ′Σ⁻¹μₖ + log πₖ

    Pick the class with the largest score.

    Model comparison:

    • Logit (for linear relationships). Strengths: interpretable, robust, linear boundaries, direct probabilities. Weaknesses: fails on nonlinear patterns, may underfit complex data.
    • KNN. Strengths: non-parametric, captures nonlinearity, simple idea. Weaknesses: slow on large data, sensitive to scaling, needs k tuning.
    • LDA (derived from probabilistic assumptions). Strengths: efficient if normality & equal variance hold, probabilistic. Weaknesses: fails if assumptions violated, only linear boundaries.
    • QDA (allows different covariance matrices for each class). Strengths: captures nonlinear boundaries, flexible variances. Weaknesses: needs lots of data, overfits on small samples, assumes normality.
  5. Performance Measurement

    Common metrics come from the confusion matrix, e.g. Accuracy = (TP + TN)/N, True Positive Rate (Recall) = TP/(TP + FN), False Positive Rate = FP/(FP + TN), Precision = TP/(TP + FP).

    ROC Curve & AUC:

    • Trade-off: Between True Positive Rate & False Positive Rate.
    • Area Under the Curve (AUC): Measures overall classification quality (a logit-plus-AUC sketch follows this list).
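
A short end-to-end sketch of item 1 plus the AUC metric, using scikit-learn on synthetic data (coefficients, sizes, and seeds are illustrative):

```python
# Sketch: logistic regression fit and AUC on synthetic data.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(8)
X = rng.normal(size=(2000, 3))
p = 1 / (1 + np.exp(-(0.5 + X @ np.array([1.0, -2.0, 0.5]))))  # true log-odds linear in X
y = rng.binomial(1, p)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
clf = LogisticRegression().fit(X_tr, y_tr)
auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
print(clf.coef_, auc)
```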

Model Selection Strategies

  1. Bias-Variance Trade-off

    Mean Squared Error (MSE) Components

    E[(y₀ − f̂(x₀))²] = Var(f̂(x₀)) + [Bias(f̂(x₀))]² + Var(ε)  (irreducible error)

  2. Cross-Validation

    K-Fold Cross-Validation:

    • Split data into K folds (commonly K = 5 or 10).
    • Rotate through folds, train on K-1 folds, test on the left-out fold.
    • Average the errors to select the best model.
  3. Linear Model Selection Techniques

    • Best Subset Selection: Tries all possible combinations of predictors. Computationally expensive and prone to overfitting when p is large.
    • Forward/Backward Stepwise Selection: More efficient than best subset, but not guaranteed optimal.
  4. Regularization

    • Ridge Regression: Shrinks all coefficients toward zero, but keeps all variables.

      β̂_ridge = argmin_β Σᵢ (yᵢ − xᵢ′β)² + λ Σⱼ βⱼ²

      Closed form: β̂_ridge = (X′X + λI)⁻¹X′y

    • Lasso Regression: Shrinks coefficients and can drive some of them exactly to 0, performing variable selection.

      β̂_lasso = argmin_β Σᵢ (yᵢ − xᵢ′β)² + λ Σⱼ |βⱼ|

    Choosing Lambda (λ)

    Choose λ by K-fold cross-validation: compute the CV error over a grid of λ values and pick the minimizer (see the sketch below).
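
A sketch of ridge and lasso with cross-validated λ (scikit-learn calls the penalty weight `alpha`; the synthetic data and grid are illustrative):

```python
# Sketch: cross-validated ridge and lasso; lasso zeroes out irrelevant coefficients.
import numpy as np
from sklearn.linear_model import RidgeCV, LassoCV

rng = np.random.default_rng(9)
X = rng.normal(size=(500, 10))
beta = np.array([3.0, -2.0, 0, 0, 0, 0, 0, 0, 0, 0])   # sparse truth
y = X @ beta + rng.normal(size=500)

ridge = RidgeCV(alphas=np.logspace(-3, 3, 25)).fit(X, y)
lasso = LassoCV(cv=5).fit(X, y)
print(ridge.coef_)   # all shrunk, none exactly zero
print(lasso.coef_)   # several coefficients set exactly to zero
```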

Decision Trees & Ensembles

  1. CART Algorithm

    1. Split the feature space into rectangular regions based on feature thresholds to minimize prediction error. At each step, choose the best split that improves prediction the most (greedy strategy).
    2. Minimize classification error (alternative: Gini or cross-entropy) for classification trees:

      E = 1 − maxₖ p̂ₘₖ;  alternatives: Gini index G = Σₖ p̂ₘₖ(1 − p̂ₘₖ) and cross-entropy D = −Σₖ p̂ₘₖ log p̂ₘₖ

    3. Minimize RSS for regression trees:

      RSS = Σⱼ Σ_(i ∈ Rⱼ) (yᵢ − ŷ_Rⱼ)²,  where ŷ_Rⱼ is the mean response in region Rⱼ

  2. Tree Pruning

    Look for the tree that minimizes the classification error but with a penalty on the size of the tree (number of leaves):

    C_α(T) = Σₘ Σ_(i: xᵢ ∈ Rₘ) (yᵢ − ŷ_Rₘ)² + α|T|,  where |T| is the number of leaves (for classification, replace RSS with the classification error)

  3. Ensemble Methods

    1. Bagging: Builds many trees on bootstrapped samples and averages predictions (regression) or takes a majority vote (classification), mainly to reduce variance. (More generally, the bootstrap is used to evaluate distributional properties, adjust bias, and improve the precision of asymptotic approximations.)
    2. Random Forests: At each split, consider only m ≈ √p randomly chosen predictors, which makes the trees less correlated.
    3. Boosting: Parameters are the number of trees (B), tree depth (d), and learning rate (λ). Sequentially fit a small tree f_b(x) with d splits to the current residuals, then update f(x) ← f(x) + λf_b(x). A pruning/forest sketch follows this list.
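
A compact sketch of a cost-complexity-pruned CART and a random forest with m = √p, using scikit-learn (dataset and hyperparameters are illustrative):

```python
# Sketch: pruned decision tree (ccp_alpha penalizes leaves) and a random forest.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(ccp_alpha=0.005).fit(X_tr, y_tr)  # alpha|T| penalty
rf = RandomForestClassifier(n_estimators=300, max_features="sqrt").fit(X_tr, y_tr)
print(tree.score(X_te, y_te), rf.score(X_te, y_te))
```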

Neural Networks

Neural networks capture complex patterns beyond linear models.

  1. Architecture: Input → Hidden → Output Layers

    • Depth: Number of layers.
    • Width: Number of units per layer.
    • Hyperparameters: Learning rate, batch size (number of observations to evaluate gradient), number of epochs.

    Activation functions introduce non-linearity: ReLU, Leaky ReLU, Sigmoid.

    ReLU(z) = max(0, z);  Leaky ReLU(z) = max(cz, z) for a small c > 0;  Sigmoid(z) = 1/(1 + e⁻ᶻ)

    Overfitting Solutions: Early stopping, batch normalization, architectural tuning (a minimal network sketch appears at the end of this section).

  2. Deep Surrogates

    Pre-trained neural networks designed to mimic the outputs of complex models.

    Benefits:

    • The true Data Generating Process (DGP) is known, since the training data are simulated from the model.
    • Expressivity: Universal approximation theorem for shallow and deep networks.
    • Efficiency: e.g., in option pricing, deep surrogates can be 100–1000× faster than FFT methods.
    • Economic Advantages: Simulated training data are unlimited and noise-free; surrogates are accurate, cheap to evaluate, and portable.
  3. Transfer Learning (Combining Theory & Data)

    Key Idea: Starts by training on synthetic data generated from theory or simulations (the “source domain”). Then, it fine-tunes the model on limited real data (the “target domain”).

    Exam-Relevant Benefits:

    • Reduces variance.
    • Requires less real data.
    • Improves generalization.
    • Better when the market is volatile, inputs are unusual, etc.

    Empirical Evidence: Transfer Learning outperforms both Deep NNs and theoretical models (e.g., Heston) in option pricing.
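
To tie the architecture vocabulary above to something concrete, here is a minimal sketch of a small feed-forward classifier with early stopping, using scikit-learn's MLPClassifier (all hyperparameters and the toy dataset are illustrative):

```python
# Sketch: two-hidden-layer network (depth 2, widths 32 and 16) with early stopping.
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

X, y = make_moons(n_samples=2000, noise=0.25, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
scaler = StandardScaler().fit(X_tr)            # networks are sensitive to feature scale

net = MLPClassifier(hidden_layer_sizes=(32, 16),
                    activation="relu",         # non-linearity, as discussed above
                    learning_rate_init=1e-3,   # hyperparameters: rate, batch, epochs
                    batch_size=64,
                    max_iter=500,
                    early_stopping=True,       # one of the overfitting remedies above
                    random_state=0)
net.fit(scaler.transform(X_tr), y_tr)
print(net.score(scaler.transform(X_te), y_te))
```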