Machine Learning Fundamentals: Models, Bias-Variance, Metrics
1. The Three Goals of Modeling
| Goal | Key Question | Example | Models |
| Prediction | “What will happen?” | Accurately flagging spam emails. | Random Forest, GBM, SVM |
| Attribution | “Why does this happen?” | Identifying which ad campaign had a significant impact on sales. | Logistic/Linear Regression |
| Estimation | “What is the true relationship?” | Modeling the true dose-response curve of a new drug. | Logistic/Linear Regression |
2. The Bias-Variance Tradeoff
Bias (Underfitting): Error from a model being too simple and making incorrect assumptions. It fails to capture the true underlying pattern in the data.
Variance (Overfitting): Error from a model being too complex and sensitive to the specific training data. It learns the random noise, not just the signal, and fails to generalize to new data.
| Model Type | Bias / Variance Profile | Result |
| Simple Models (e.g., Linear/Logistic Reg) | High Bias, Low Variance | Underfits (e.g., trying to fit a straight line to a curve) |
| Complex Models (e.g., Deep Decision Tree) | Low Bias, High Variance | Overfits (e.g., drawing a wiggly line that hits every single point) |
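A minimal sketch of the tradeoff, assuming scikit-learn and a made-up noisy quadratic dataset: a straight line underfits it (high bias), while an unpruned decision tree memorizes it (high variance).

```python
# Minimal sketch: high-bias vs high-variance model on noisy quadratic data.
# The dataset and model choices are illustrative assumptions.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = X[:, 0] ** 2 + rng.normal(scale=1.0, size=300)   # true curve + noise

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

models = {
    "straight line (high bias)": LinearRegression(),
    "unpruned tree (high variance)": DecisionTreeRegressor(random_state=0),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    print(name,
          "train MSE:", round(mean_squared_error(y_tr, model.predict(X_tr)), 2),
          "test MSE:", round(mean_squared_error(y_te, model.predict(X_te)), 2))

# Expected pattern: the line has similar (but high) train/test error (underfit);
# the deep tree has near-zero train error and noticeably worse test error (overfit).
```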
3. Key Model Comparisons
| Model | Main Idea | Tricky Part / Key Parameter |
| Logistic/Regularized Reg. | Predicts probability for binary classification. | Regularization (“Complexity Budget”): Adds a penalty to shrink coefficients and prevent overfitting. Lasso (L1) can shrink coefficients exactly to zero (feature selection). Ridge (L2) shrinks all coefficients towards zero but not to exactly zero (see the sketch after this table). |
| Naïve Bayes | Fast probabilistic classifier that assumes all features are independent. | The “Naïve” Assumption: It ignores all feature interactions. This makes it bad for Attribution but very fast for Prediction (e.g., spam filtering). Use Gaussian for continuous data and Multinomial for count data. |
| SVM | Finds a max-margin hyperplane (boundary). | C Parameter (“Budget for Mistakes”): Controls the tradeoff between a wide margin and correctly classifying training points. Low C = Wider Margin (High Bias/Low Var); High C = Narrow Margin (Low Bias/High Var). Kernels: For non-linear data. RBF is for complex, “blob-like” shapes. |
| KNN | “Majority vote” of the ‘K’ nearest neighbors. | K Parameter: K controls flexibility. Small K = Overfits (High Variance); Large K = Underfits (High Bias). To fix overfitting, INCREASE K to make the model smoother. |
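A minimal sketch of the regularization “complexity budget” from the first row, assuming scikit-learn and a synthetic dataset where most features are uninformative; the C value is an illustrative choice, not a recommendation.

```python
# Minimal sketch: L1 vs L2 regularization on a classification task with many
# uninformative features. Dataset and C value are illustrative assumptions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           n_redundant=0, random_state=0)

lasso_like = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)  # L1 penalty
ridge_like = LogisticRegression(penalty="l2", solver="liblinear", C=0.1)  # L2 penalty
lasso_like.fit(X, y)
ridge_like.fit(X, y)

# L1 drives many coefficients exactly to zero (implicit feature selection);
# L2 shrinks them all towards zero but rarely to exactly zero.
print("L1 zero coefficients:", np.sum(lasso_like.coef_ == 0))
print("L2 zero coefficients:", np.sum(ridge_like.coef_ == 0))
```

The same bias/variance logic applies to the SVM's C and KNN's K: loosening the budget (high C, small K) lowers bias and raises variance; tightening it (low C, large K) does the reverse.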
4. Classification Evaluation Metrics
Confusion Matrix Structure
| | Actual: No | Actual: Yes |
| Predicted: No | True Negative (TN) | False Negative (FN) (A “Miss”) |
| Predicted: Yes | False Positive (FP) (A “False Alarm”) | True Positive (TP) |
Key Metrics
| Metric | “The Question it Answers” | Formula | Focuses On… |
| Precision | “When I predict YES, am I right?” | TP / (TP + FP) | Minimizing False Positives (important when the cost of a false alarm is high). |
| Recall | “Did I find all the actual YES cases?” | TP / (TP + FN) | Minimizing False Negatives (important when the cost of a miss is high). |
| F1 Score | “What’s the balance between them?” | 2*(P*R)/(P+R) | A single score that balances Precision and Recall. |
TRICKY PART: Accuracy ((TP+TN)/Total) is misleading for imbalanced datasets. A model can get 99% accuracy by always predicting the majority class, but it will have 0% recall for the minority class.
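A minimal sketch of that pitfall, assuming scikit-learn and a made-up 99:1 imbalanced label vector: the “always predict No” model scores 99% accuracy but 0% recall.

```python
# Minimal sketch: why accuracy misleads on imbalanced data. The 99:1 class
# ratio and the always-predict-majority model are assumptions chosen to
# mirror the note above.
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = np.array([0] * 990 + [1] * 10)   # only 1% positive class
y_pred = np.zeros_like(y_true)            # lazy model: always predicts "No"

print("accuracy :", accuracy_score(y_true, y_pred))                      # 0.99
print("recall   :", recall_score(y_true, y_pred, zero_division=0))       # 0.0 -- misses every positive
print("precision:", precision_score(y_true, y_pred, zero_division=0))    # 0.0 -- no positive predictions at all
print("f1       :", f1_score(y_true, y_pred, zero_division=0))           # 0.0
```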
5. Decision Trees & Ensemble Methods
Decision Tree (The Building Block)
Core Idea: Works like a “20 Questions” game, asking a series of if-then questions to split the data.
Goal: To create “pure” nodes (groups containing only one class).
Splitting Metric: Uses Gini Impurity to measure how “mixed up” a group is and find the best split (see the sketch below).
Problem: A single, deep decision tree will almost always Overfit the data.
Solution: Pruning (simplifying the tree) or using Ensemble Methods.
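A minimal sketch of the Gini calculation, written from the standard formula 1 − Σ p_k²; the node class counts are made-up examples.

```python
# Minimal sketch: Gini impurity for a node. The class counts are made up.
import numpy as np

def gini_impurity(class_counts):
    """Return 1 - sum(p_k^2) for a node with the given class counts."""
    counts = np.asarray(class_counts, dtype=float)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

print(gini_impurity([50, 50]))   # 0.50 -> maximally mixed (worst possible split)
print(gini_impurity([90, 10]))   # 0.18 -> mostly one class
print(gini_impurity([100, 0]))   # 0.00 -> perfectly "pure" node (the goal)
```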
Ensemble Methods: The Key Distinction
| Method | Technique | How it Works | Main Goal | Analogy |
| Random Forest | Bagging | Parallel: Averages hundreds of independent, deep trees. Each tree is trained on a bootstrap sample of the data AND uses a random subset of features for each split. This de-correlates the trees. | Reduce Variance | “Wisdom of the Crowd” |
| Gradient Boosting (GBM) | Boosting | Sequential: Builds “weak,” simple trees one-by-one. Each new tree is trained to predict the errors (residuals) of the one before it. The final model is a sum of all trees. | Reduce Bias | “Team of Specialists” |
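A minimal sketch contrasting the two on the same synthetic data, assuming scikit-learn; the hyperparameters (300 trees, depth-2 boosting trees, learning rate 0.1) are illustrative assumptions, not tuned values.

```python
# Minimal sketch: bagging (Random Forest) vs boosting (GBM) on the same data.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Parallel: many deep, de-correlated trees, averaged to reduce variance.
rf = RandomForestClassifier(n_estimators=300, max_features="sqrt", random_state=0)

# Sequential: many shallow "weak" trees, each fit to the previous errors, to reduce bias.
gbm = GradientBoostingClassifier(n_estimators=300, max_depth=2, learning_rate=0.1,
                                 random_state=0)

for name, model in [("Random Forest", rf), ("Gradient Boosting", gbm)]:
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: mean CV accuracy = {scores.mean():.3f}")
```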
6. Recommender Systems
The Three Main Approaches
| Method | Asks the Question… | Analogy | Key Detail |
| User-Based CF | “Who is like you?” | “Movie Twin” | Finds similar users. Suffers from the “user cold-start” problem (bad for new users). |
| Item-Based CF | “What is like this?” | “Perfect Pair” | Finds similar items. More efficient when the number of Users >> number of Items (see the sketch after this table). |
| Content-Based | “What has similar features?” | “Item Expert” | Uses item attributes (genre, actor). Solves the “item cold-start” problem (good for new items). |
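A minimal sketch of item-based CF on a made-up 4-user × 4-item ratings matrix, assuming scikit-learn's cosine_similarity: unseen items are scored by a similarity-weighted average of the items the user already rated.

```python
# Minimal sketch: item-based collaborative filtering. The tiny ratings
# matrix is made up; 0 means "not rated".
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

R = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
    [0, 1, 4, 5],
], dtype=float)                             # rows = users, cols = items

item_sim = cosine_similarity(R.T)           # item-item similarity ("what is like this?")
user = R[0]                                 # ratings from user 0
unseen = np.where(user == 0)[0]
rated = user > 0

for item in unseen:
    # weighted average of the user's own ratings, weighted by similarity to `item`
    score = item_sim[item, rated] @ user[rated] / item_sim[item, rated].sum()
    print(f"predicted rating for item {item}: {score:.2f}")
```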
Recommender System Metrics
Similarity Metrics (How to find pairs/twins)
Pearson: For explicit ratings (1-5). Accounts for user rating bias (“taste profile”).
Jaccard: For binary data (yes/no). Measures simple overlap (“Venn diagram”).
Cosine: For sparse ratings. Measures taste “direction,” not magnitude.
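A minimal numpy-only sketch of the three measures; the rating vectors and “liked” sets are made-up examples.

```python
# Minimal sketch: Pearson, Cosine, and Jaccard similarity on toy data.
import numpy as np

a = np.array([5, 4, 1, 2], dtype=float)    # user A's explicit 1-5 ratings
b = np.array([4, 5, 2, 1], dtype=float)    # user B rates the same items

# Pearson: centers each user's ratings, so a harsh and a generous rater
# with the same taste profile still come out similar.
pearson = np.corrcoef(a, b)[0, 1]

# Cosine: angle between the raw rating vectors (direction, not magnitude).
cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Jaccard: binary likes only -- size of the overlap over size of the union.
a_liked = {"item1", "item2", "item4"}
b_liked = {"item1", "item4", "item5"}
jaccard = len(a_liked & b_liked) / len(a_liked | b_liked)

print(f"pearson={pearson:.2f}  cosine={cosine:.2f}  jaccard={jaccard:.2f}")
```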
“Beyond Accuracy” Evaluation Metrics
Coverage: How much of your catalog can actually be recommended?
Novelty: Does it recommend items the user hasn’t seen before?
Serendipity: Is it a successful and surprising (“lucky discovery”) recommendation?
Diversity: Is there variety within a single list of recommendations?
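A minimal sketch of how two of these could be computed; the catalog, recommendation lists, and item similarities are all made-up assumptions.

```python
# Minimal sketch: catalog coverage and intra-list diversity. All data is made up.
import numpy as np

catalog = {"A", "B", "C", "D", "E", "F"}
recs_per_user = [["A", "B"], ["A", "C"], ["B", "A"]]    # top-2 lists for 3 users

# Coverage: what fraction of the catalog ever gets recommended to anyone?
recommended = {item for recs in recs_per_user for item in recs}
print("coverage:", len(recommended) / len(catalog))      # 3/6 = 0.5

# Diversity: average pairwise dissimilarity (1 - similarity) within one list.
# The item-item similarities here are assumed values for illustration.
sim = {("A", "B"): 0.9, ("A", "C"): 0.2}

def diversity(recs):
    pairs = [(recs[i], recs[j]) for i in range(len(recs)) for j in range(i + 1, len(recs))]
    return np.mean([1 - sim.get(p, sim.get((p[1], p[0]), 0.0)) for p in pairs])

print("diversity of ['A','B']:", diversity(["A", "B"]))   # low: near-duplicate items
print("diversity of ['A','C']:", diversity(["A", "C"]))   # high: a varied list
```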
