Machine Learning Fundamentals: Models, Bias-Variance, Metrics
1. The Three Goals of Modeling
| Goal | Key Question | Example | Models |
| Prediction | “What will happen?” | Accurately flagging spam emails. | Random Forest, GBM, SVM |
| Attribution | “Why does this happen?” | Identifying which ad campaign had a significant impact on sales. | Logistic/Linear Regression |
| Estimation | “What is the true relationship?” | Modeling the true dose-response curve of a new drug. | Logistic/Linear Regression |
2. The Bias-Variance Tradeoff
Bias (Underfitting): Error from a model being too simple and making incorrect assumptions. It fails to capture the true underlying pattern in the data.
Variance (Overfitting): Error from a model being too complex and sensitive to the specific training data. It learns the random noise, not just the signal, and fails to generalize to new data.
| Model Type | Bias / Variance Profile | Result |
| Simple Models (e.g., Linear/Logistic Reg) | High Bias, Low Variance | Underfits (e.g., trying to fit a straight line to a curve) |
| Complex Models (e.g., Deep Decision Tree) | Low Bias, High Variance | Overfits (e.g., drawing a wiggly line that hits every single point) |
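A minimal sketch of the tradeoff, assuming scikit-learn and a made-up noisy quadratic dataset: a straight line underfits it (high bias), while an unpruned decision tree memorizes it (high variance).

```python
# Minimal sketch: high-bias vs high-variance model on noisy quadratic data.
# The dataset and model choices are illustrative assumptions.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = X[:, 0] ** 2 + rng.normal(scale=1.0, size=300)   # true curve + noise

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

models = {
    "straight line (high bias)": LinearRegression(),
    "unpruned tree (high variance)": DecisionTreeRegressor(random_state=0),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    print(name,
          "train MSE:", round(mean_squared_error(y_tr, model.predict(X_tr)), 2),
          "test MSE:", round(mean_squared_error(y_te, model.predict(X_te)), 2))

# Expected pattern: the line has similar (but high) train/test error (underfit);
# the deep tree has near-zero train error and noticeably worse test error (overfit).
```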
3. Key Model Comparisons
| Model | Main Idea | Tricky Part / Key Parameter |
| Logistic/Regularized Reg. | Predicts probability for binary classification. | Regularization (“Complexity Budget”): Adds a penalty to shrink coefficients and prevent overfitting. Lasso (L1) can shrink coefficients exactly to zero (feature selection). Ridge (L2) shrinks all coefficients towards zero but not to exactly zero (see the sketch after this table). |
| Naïve Bayes | Fast probabilistic classifier that assumes all features are independent. | The “Naïve” Assumption: It ignores all feature interactions. This makes it bad for Attribution but very fast for Prediction (e.g., spam filtering). Use Gaussian for continuous data and Multinomial for count data. |
| SVM | Finds a max-margin hyperplane (boundary). | C Parameter (“Budget for Mistakes”): Controls the tradeoff between a wide margin and correctly classifying training points. Low C = Wider Margin (High Bias/Low Var); High C = Narrow Margin (Low Bias/High Var). Kernels: For non-linear data. RBF is for complex, “blob-like” shapes. |
| KNN | “Majority vote” of the ‘K’ nearest neighbors. | K Parameter: K controls flexibility. Small K = Overfits (High Variance); Large K = Underfits (High Bias). To fix overfitting, INCREASE K to make the model smoother. |
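A minimal sketch of the regularization “complexity budget” from the first row, assuming scikit-learn and a synthetic dataset where most features are uninformative; the C value is an illustrative choice, not a recommendation.

```python
# Minimal sketch: L1 vs L2 regularization on a classification task with many
# uninformative features. Dataset and C value are illustrative assumptions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           n_redundant=0, random_state=0)

lasso_like = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)  # L1 penalty
ridge_like = LogisticRegression(penalty="l2", solver="liblinear", C=0.1)  # L2 penalty
lasso_like.fit(X, y)
ridge_like.fit(X, y)

# L1 drives many coefficients exactly to zero (implicit feature selection);
# L2 shrinks them all towards zero but rarely to exactly zero.
print("L1 zero coefficients:", np.sum(lasso_like.coef_ == 0))
print("L2 zero coefficients:", np.sum(ridge_like.coef_ == 0))
```

The same bias/variance logic applies to the SVM's C and KNN's K: loosening the budget (high C, small K) lowers bias and raises variance; tightening it (low C, large K) does the reverse.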
4. Classification Evaluation Metrics
Confusion Matrix Structure
| | Actual: No | Actual: Yes |
| Predicted: No | True Negative (TN) | False Negative (FN) (A “Miss”) |
| Predicted: Yes | False Positive (FP) (A “False Alarm”) | True Positive (TP) |
Key Metrics
| Metric | “The Question it Answers” | Formula | Focuses On… |
| Precision | “When I predict YES, am I right?” | TP / (TP + FP) | Minimizing False Positives (important when the cost of a false alarm is high). |
| Recall | “Did I find all the actual YES cases?” | TP / (TP + FN) | Minimizing False Negatives (important when the cost of a miss is high). |
| F1 Score | “What’s the balance between them?” | 2*(P*R)/(P+R) | A single score that balances Precision and Recall. |
TRICKY PART: Accuracy ((TP+TN)/Total) is misleading for imbalanced datasets. A model can get 99% accuracy by always predicting the majority class, but it will have 0% recall for the minority class.
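A minimal sketch of that pitfall, assuming scikit-learn and a made-up 99:1 imbalanced label vector: the “always predict No” model scores 99% accuracy but 0% recall.

```python
# Minimal sketch: why accuracy misleads on imbalanced data. The 99:1 class
# ratio and the always-predict-majority model are assumptions chosen to
# mirror the note above.
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = np.array([0] * 990 + [1] * 10)   # only 1% positive class
y_pred = np.zeros_like(y_true)            # lazy model: always predicts "No"

print("accuracy :", accuracy_score(y_true, y_pred))                      # 0.99
print("recall   :", recall_score(y_true, y_pred, zero_division=0))       # 0.0 -- misses every positive
print("precision:", precision_score(y_true, y_pred, zero_division=0))    # 0.0 -- no positive predictions at all
print("f1       :", f1_score(y_true, y_pred, zero_division=0))           # 0.0
```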
5. Decision Trees & Ensemble Methods
Decision Tree (The Building Block)
Core Idea: Works like a “20 Questions” game, asking a series of if-then questions to split the data.
Goal: To create “pure” nodes (groups containing only one class).
Splitting Metric: Uses Gini Impurity to measure how “mixed up” a group is and find the best split (see the sketch below).
Problem: A single, deep decision tree will almost always Overfit the data.
Solution: Pruning (simplifying the tree) or using Ensemble Methods.
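A minimal sketch of the Gini calculation, written from the standard formula 1 − Σ p_k²; the node class counts are made-up examples.

```python
# Minimal sketch: Gini impurity for a node. The class counts are made up.
import numpy as np

def gini_impurity(class_counts):
    """Return 1 - sum(p_k^2) for a node with the given class counts."""
    counts = np.asarray(class_counts, dtype=float)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

print(gini_impurity([50, 50]))   # 0.50 -> maximally mixed (worst possible split)
print(gini_impurity([90, 10]))   # 0.18 -> mostly one class
print(gini_impurity([100, 0]))   # 0.00 -> perfectly "pure" node (the goal)
```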
Ensemble Methods: The Key Distinction
| Method | Technique | How it Works | Main Goal | Analogy |
| Random Forest | Bagging | Parallel: Averages hundreds of independent, deep trees. Each tree is trained on a bootstrap sample of the data AND uses a random subset of features for each split. This de-correlates the trees. | Reduce Variance | “Wisdom of the Crowd” |
| Gradient Boosting (GBM) | Boosting | Sequential: Builds “weak,” simple trees one-by-one. Each new tree is trained to predict the errors (residuals) of the one before it. The final model is a sum of all trees. | Reduce Bias | “Team of Specialists” |
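A minimal sketch contrasting the two on the same synthetic data, assuming scikit-learn; the hyperparameters (300 trees, depth-2 boosting trees, learning rate 0.1) are illustrative assumptions, not tuned values.

```python
# Minimal sketch: bagging (Random Forest) vs boosting (GBM) on the same data.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Parallel: many deep, de-correlated trees, averaged to reduce variance.
rf = RandomForestClassifier(n_estimators=300, max_features="sqrt", random_state=0)

# Sequential: many shallow "weak" trees, each fit to the previous errors, to reduce bias.
gbm = GradientBoostingClassifier(n_estimators=300, max_depth=2, learning_rate=0.1,
                                 random_state=0)

for name, model in [("Random Forest", rf), ("Gradient Boosting", gbm)]:
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: mean CV accuracy = {scores.mean():.3f}")
```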
6. Recommender Systems
The Three Main Approaches
| Method | Asks the Question… | Analogy | Key Detail |
| User-Based CF | “Who is like you?” | “Movie Twin” | Finds similar users. Suffers from the “user cold-start” problem (bad for new users). |
| Item-Based CF | “What is like this?” | “Perfect Pair” | Finds similar items. More efficient when the number of Users >> number of Items (see the sketch after this table). |
| Content-Based | “What has similar features?” | “Item Expert” | Uses item attributes (genre, actor). Solves the “item cold-start” problem (good for new items). |
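A minimal sketch of item-based CF on a made-up 4-user × 4-item ratings matrix, assuming scikit-learn's cosine_similarity: unseen items are scored by a similarity-weighted average of the items the user already rated.

```python
# Minimal sketch: item-based collaborative filtering. The tiny ratings
# matrix is made up; 0 means "not rated".
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

R = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
    [0, 1, 4, 5],
], dtype=float)                             # rows = users, cols = items

item_sim = cosine_similarity(R.T)           # item-item similarity ("what is like this?")
user = R[0]                                 # ratings from user 0
unseen = np.where(user == 0)[0]
rated = user > 0

for item in unseen:
    # weighted average of the user's own ratings, weighted by similarity to `item`
    score = item_sim[item, rated] @ user[rated] / item_sim[item, rated].sum()
    print(f"predicted rating for item {item}: {score:.2f}")
```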
Recommender System Metrics
Similarity Metrics (How to find pairs/twins)
Pearson: For explicit ratings (1-5). Accounts for user rating bias (“taste profile”).
Jaccard: For binary data (yes/no). Measures simple overlap (“Venn diagram”).
Cosine: For sparse ratings. Measures taste “direction,” not magnitude.
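A minimal numpy-only sketch of the three measures; the rating vectors and “liked” sets are made-up examples.

```python
# Minimal sketch: Pearson, Cosine, and Jaccard similarity on toy data.
import numpy as np

a = np.array([5, 4, 1, 2], dtype=float)    # user A's explicit 1-5 ratings
b = np.array([4, 5, 2, 1], dtype=float)    # user B rates the same items

# Pearson: centers each user's ratings, so a harsh and a generous rater
# with the same taste profile still come out similar.
pearson = np.corrcoef(a, b)[0, 1]

# Cosine: angle between the raw rating vectors (direction, not magnitude).
cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Jaccard: binary likes only -- size of the overlap over size of the union.
a_liked = {"item1", "item2", "item4"}
b_liked = {"item1", "item4", "item5"}
jaccard = len(a_liked & b_liked) / len(a_liked | b_liked)

print(f"pearson={pearson:.2f}  cosine={cosine:.2f}  jaccard={jaccard:.2f}")
```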
“Beyond Accuracy” Evaluation Metrics
Coverage: How much of your catalog can actually be recommended?
Novelty: Does it recommend items the user hasn’t seen before?
Serendipity: Is it a successful and surprising (“lucky discovery”) recommendation?
Diversity: Is there variety within a single list of recommendations?
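A minimal sketch of how two of these could be computed; the catalog, recommendation lists, and item similarities are all made-up assumptions.

```python
# Minimal sketch: catalog coverage and intra-list diversity. All data is made up.
import numpy as np

catalog = {"A", "B", "C", "D", "E", "F"}
recs_per_user = [["A", "B"], ["A", "C"], ["B", "A"]]    # top-2 lists for 3 users

# Coverage: what fraction of the catalog ever gets recommended to anyone?
recommended = {item for recs in recs_per_user for item in recs}
print("coverage:", len(recommended) / len(catalog))      # 3/6 = 0.5

# Diversity: average pairwise dissimilarity (1 - similarity) within one list.
# The item-item similarities here are assumed values for illustration.
sim = {("A", "B"): 0.9, ("A", "C"): 0.2}

def diversity(recs):
    pairs = [(recs[i], recs[j]) for i in range(len(recs)) for j in range(i + 1, len(recs))]
    return np.mean([1 - sim.get(p, sim.get((p[1], p[0]), 0.0)) for p in pairs])

print("diversity of ['A','B']:", diversity(["A", "B"]))   # low: near-duplicate items
print("diversity of ['A','C']:", diversity(["A", "C"]))   # high: a varied list
```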
