Machine Learning Fundamentals: Models, Bias-Variance, Metrics

1. The Three Goals of Modeling

| Goal | Key Question | Example | Models |
| --- | --- | --- | --- |
| Prediction | “What will happen?” | Accurately flagging spam emails. | Random Forest, GBM, SVM |
| Attribution | “Why does this happen?” | Identifying which ad campaign had a significant impact on sales. | Logistic/Linear Regression |
| Estimation | “What is the true relationship?” | Modeling the true dose-response curve of a new drug. | Logistic/Linear Regression |

2. The Bias-Variance Tradeoff

  • Bias (Underfitting): Error from a model being too simple and making incorrect assumptions. It fails to capture the true underlying pattern in the data.

  • Variance (Overfitting): Error from a model being too complex and sensitive to the specific training data. It learns the random noise, not just the signal, and fails to generalize to new data.

| Model Type | Bias / Variance Profile | Result |
| --- | --- | --- |
| Simple Models (e.g., Linear/Logistic Reg) | High Bias, Low Variance | Underfits (e.g., trying to fit a straight line to a curve) |
| Complex Models (e.g., Deep Decision Tree) | Low Bias, High Variance | Overfits (e.g., drawing a wiggly line that hits every single point) |
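
A minimal sketch of the tradeoff using scikit-learn; the synthetic sine dataset and the three model choices are illustrative assumptions, not part of the notes above:

```python
# Bias-variance sketch: a straight line underfits a noisy curve, an
# unpruned tree overfits it, and a shallow tree lands in between.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 6, size=(300, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=300)  # true curve + noise

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

models = {
    "Linear (high bias)": LinearRegression(),  # straight line -> underfits
    "Deep tree (high variance)": DecisionTreeRegressor(random_state=0),  # memorizes noise
    "Shallow tree (balanced)": DecisionTreeRegressor(max_depth=3, random_state=0),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    print(f"{name:26s} train MSE = {mean_squared_error(y_tr, model.predict(X_tr)):.3f}  "
          f"test MSE = {mean_squared_error(y_te, model.predict(X_te)):.3f}")
```

The telltale signature of variance: the deep tree’s training error collapses toward zero while its test error stays well above it; that gap is the overfitting.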

3. Key Model Comparisons

| Model | Main Idea | Tricky Part / Key Parameter |
| --- | --- | --- |
| Logistic/Regularized Reg. | Predicts probability for binary classification. | Regularization (“Complexity Budget”): adds a penalty to shrink coefficients and prevent overfitting. Lasso (L1) can shrink coefficients to exactly zero (feature selection); Ridge (L2) shrinks all coefficients toward zero. |
| Naïve Bayes | Fast probabilistic classifier that assumes all features are independent. | The “Naïve” Assumption: it ignores all feature interactions, which makes it bad for Attribution but very fast for Prediction (e.g., spam filtering). Use Gaussian for continuous data and Multinomial for count data. |
| SVM | Finds a max-margin hyperplane (boundary). | C Parameter (“Budget for Mistakes”): controls the tradeoff between a wide margin and correctly classifying training points. Low C = wider margin (High Bias/Low Var); High C = narrower margin (Low Bias/High Var). Kernels handle non-linear data; RBF suits complex, “blob-like” shapes. |
| KNN | “Majority vote” of the K nearest neighbors. | K Parameter: K controls flexibility. Small K = Overfits (High Var); Large K = Underfits (High Bias). To fix overfitting, INCREASE K to make the decision boundary smoother. |
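
A minimal scikit-learn sketch of the knobs named in the table; the synthetic dataset and the specific parameter values are illustrative assumptions, not recommendations:

```python
# Side-by-side comparison of the key parameters from the table above.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

models = {
    # Lasso-style penalty: can zero out coefficients (feature selection)
    "LogReg L1": LogisticRegression(penalty="l1", solver="liblinear"),
    # Ridge-style penalty: shrinks all coefficients toward zero
    "LogReg L2": LogisticRegression(penalty="l2"),
    # Independence assumption; Gaussian variant for continuous features
    "Naive Bayes": GaussianNB(),
    # Low C = wide margin (high bias); high C = narrow margin (high variance)
    "SVM RBF, C=0.1": SVC(kernel="rbf", C=0.1),
    "SVM RBF, C=100": SVC(kernel="rbf", C=100),
    # Small K overfits (high variance); large K underfits (high bias)
    "KNN, K=1": KNeighborsClassifier(n_neighbors=1),
    "KNN, K=25": KNeighborsClassifier(n_neighbors=25),
}
for name, model in models.items():
    print(f"{name:16s} 5-fold CV accuracy = {cross_val_score(model, X, y, cv=5).mean():.3f}")
```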

4. Classification Evaluation Metrics

Confusion Matrix Structure

|  | Actual: No | Actual: Yes |
| --- | --- | --- |
| Predicted: No | True Negative (TN) | False Negative (FN) (A “Miss”) |
| Predicted: Yes | False Positive (FP) (A “False Alarm”) | True Positive (TP) |

Key Metrics

| Metric | “The Question it Answers” | Formula | Focuses On… |
| --- | --- | --- | --- |
| Precision | “When I predict YES, am I right?” | TP / (TP + FP) | Minimizing False Positives (important when the cost of a false alarm is high). |
| Recall | “Did I find all the actual YES cases?” | TP / (TP + FN) | Minimizing False Negatives (important when the cost of a miss is high). |
| F1 Score | “What’s the balance between them?” | 2 × (P × R) / (P + R) | A single score that balances Precision and Recall. |
  • TRICKY PART: Accuracy ((TP+TN)/Total) is misleading for imbalanced datasets. A model can get 99% accuracy by always predicting the majority class, but it will have 0% recall for the minority class.
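
A minimal sketch of that trap; the 99:1 class ratio and the always-predict-majority baseline are illustrative assumptions:

```python
# High accuracy, zero recall: the imbalanced-data accuracy trap.
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# ~99% of samples in the majority ("No") class
X, y = make_classification(n_samples=2000, weights=[0.99], random_state=0)

majority = DummyClassifier(strategy="most_frequent").fit(X, y)
pred = majority.predict(X)  # always predicts "No"

print("accuracy :", accuracy_score(y, pred))                    # ~0.99 -- looks great
print("precision:", precision_score(y, pred, zero_division=0))  # 0.0
print("recall   :", recall_score(y, pred))                      # 0.0 -- every minority case is a miss
print("F1       :", f1_score(y, pred))                          # 0.0
```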


5. Decision Trees & Ensemble Methods

Decision Tree (The Building Block)

  • Core Idea: A “20 Questions” game; it asks a series of if-then questions to split the data.

  • Goal: To create “pure” nodes (groups containing only one class).

  • Splitting Metric: Uses Gini Impurity, 1 − Σ pᵢ² (where pᵢ is the fraction of class i in the node), to measure how “mixed up” a group is and find the best split; a perfectly pure node has Gini = 0.

  • Problem: A single, deep decision tree will almost always Overfit the data.

  • Solution: Pruning (simplifying the tree) or using Ensemble Methods.
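
A minimal sketch of both ideas: Gini impurity computed by hand, and depth-limiting as a stand-in for pruning (the depth value is an illustrative assumption; scikit-learn also offers cost-complexity pruning via ccp_alpha):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

def gini(labels):
    """Gini impurity = 1 - sum(p_i^2); 0.0 means a perfectly 'pure' node."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p**2)

print(gini([1, 1, 1, 1]))  # 0.0 -> pure node
print(gini([0, 0, 1, 1]))  # 0.5 -> maximally mixed (two classes)

X, y = make_classification(n_samples=600, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

deep = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)                 # unpruned
pruned = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_tr, y_tr)  # simplified
print("deep   train/test acc:", deep.score(X_tr, y_tr), round(deep.score(X_te, y_te), 3))
print("pruned train/test acc:", round(pruned.score(X_tr, y_tr), 3), round(pruned.score(X_te, y_te), 3))
```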

Ensemble Methods: The Key Distinction

| Method | Technique | How it Works | Main Goal | Analogy |
| --- | --- | --- | --- | --- |
| Random Forest | Bagging | Parallel: averages hundreds of independent, deep trees. Each tree is trained on a bootstrap sample of the data AND uses a random subset of features at each split, which de-correlates the trees. | Reduce Variance | “Wisdom of the Crowd” |
| Gradient Boosting (GBM) | Boosting | Sequential: builds “weak,” shallow trees one by one. Each new tree is trained to predict the errors (residuals) left by the ensemble built so far. The final model is a sum of all the trees. | Reduce Bias | “Team of Specialists” |
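
A minimal scikit-learn sketch contrasting the two; the hyperparameters here are illustrative assumptions:

```python
# Bagging (deep, de-correlated trees averaged in parallel) vs. boosting
# (shallow trees added sequentially, each correcting the ensemble's residuals).
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

rf = RandomForestClassifier(n_estimators=300, max_features="sqrt", random_state=0)
gbm = GradientBoostingClassifier(n_estimators=300, max_depth=2, learning_rate=0.1,
                                 random_state=0)

for name, model in [("Random Forest (bagging)", rf), ("GBM (boosting)", gbm)]:
    print(f"{name:24s} 5-fold CV accuracy = {cross_val_score(model, X, y, cv=5).mean():.3f}")
```

Note the defaults mirror the table: the forest grows deep trees and averages them, while the GBM keeps each tree weak (max_depth=2) and relies on the sequence to reduce bias.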

6. Recommender Systems

The Three Main Approaches

| Method | Asks the Question… | Analogy | Key Detail |
| --- | --- | --- | --- |
| User-Based CF | “Who is like you?” | “Movie Twin” | Finds similar users. Suffers from the “user cold-start” problem (bad for new users). |
| Item-Based CF | “What is like this?” | “Perfect Pair” | Finds similar items. More efficient when the number of users >> the number of items. |
| Content-Based | “What has similar features?” | “Item Expert” | Uses item attributes (genre, actor). Solves the “item cold-start” problem (good for new items). |
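
A minimal sketch of the item-based CF lookup on a toy user × item ratings matrix; the matrix and the choice of cosine similarity are made-up illustrations:

```python
# Item-based CF: "What is like this?" -- compare item COLUMNS of the
# ratings matrix and recommend the nearest neighbor.
import numpy as np

ratings = np.array([   # rows = users, columns = items, 0 = unrated
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
    [0, 1, 4, 5],
], dtype=float)

def cosine_sim(a, b):
    return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

n_items = ratings.shape[1]
item_sim = np.array([[cosine_sim(ratings[:, i], ratings[:, j])
                      for j in range(n_items)] for i in range(n_items)])

# Most similar item to item 0 (excluding itself)
best = max((j for j in range(n_items) if j != 0), key=lambda j: item_sim[0, j])
print("item most similar to item 0:", best)  # item 1 -- rated alike by the same users
```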

Recommender System Metrics

  • Similarity Metrics (How to find pairs/twins; a code sketch follows at the end of this section)

    • Pearson: For explicit ratings (1-5). Accounts for user rating bias (“taste profile”).

    • Jaccard: For binary data (yes/no). Measures simple overlap (“Venn diagram”).

    • Cosine: For sparse ratings. Measures taste “direction,” not magnitude.

  • “Beyond Accuracy” Evaluation Metrics

    • Coverage: How much of your catalog can actually be recommended?

    • Novelty: Does it recommend items the user hasn’t seen before?

    • Serendipity: Is it a successful and surprising (“lucky discovery”) recommendation?

    • Diversity: Is there variety within a single list of recommendations?
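
A minimal sketch of the three similarity metrics from the list above, on made-up toy data:

```python
import numpy as np

def pearson(a, b):
    """Centering each user's ratings removes their rating bias ('taste profile')."""
    a, b = a - a.mean(), b - b.mean()
    return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

def cosine(a, b):
    """Angle between rating vectors: taste 'direction', not magnitude."""
    return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

def jaccard(a, b):
    """Overlap of two yes/no sets: |A & B| / |A | B| (the 'Venn diagram')."""
    return len(a & b) / len(a | b)

harsh = np.array([1.0, 2.0, 1.0, 3.0])    # a harsh rater
lenient = np.array([3.0, 4.0, 3.0, 5.0])  # same taste, every rating shifted up by 2

print(pearson(harsh, lenient))  # 1.0   -- bias removed: identical taste
print(cosine(harsh, lenient))   # ~0.97 -- high, but penalized by the rating shift
print(jaccard({"A", "B", "C"}, {"B", "C", "D"}))  # 0.5 -- 2 shared of 4 total
```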