Supervised and Unsupervised Learning Model Reference
Posted on Jun 13, 2026 in Computer Engineering
Supervised Classification
Logistic Regression (LR)
- Type: Binary Classification
- Scaling: Yes (StandardScaler)
- Outliers: Not robust
- Categorical Variables: No (encode first)
- Core Idea: The sigmoid function maps a linear combination of the features to a probability between 0 and 1; a probability ≥ 0.5 (the usual default threshold) predicts class 1.
- Advantages: Fast, simple, interpretable, outputs probabilities.
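- Disadvantages: Binary in its basic form (multiclass needs a multinomial or one-vs-rest extension); assumes a linear decision boundary, so it underfits data that is not linearly separable unless features are transformed.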
- Metrics: Accuracy, Precision, Recall, F1-Score, Confusion Matrix.
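The core idea above can be sketched in a few lines of plain Python. This is a minimal illustration, not a trained model: the weights, bias, and sample values are hypothetical, and `predict_class` is a helper name introduced here.

```python
import math

def sigmoid(z):
    """Map any real number to the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_class(features, weights, bias, threshold=0.5):
    """Return (probability, label) for one sample under a linear model."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    p = sigmoid(z)
    return p, int(p >= threshold)

# Hypothetical coefficients for a two-feature model (not learned from data)
weights, bias = [0.8, -0.4], 0.1
prob, label = predict_class([2.0, 1.0], weights, bias)
print(round(prob, 4), label)
```

Raising `threshold` trades recall for precision, which is why the threshold is usually tuned together with the metrics listed above.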
Decision Trees (DT)
- Type: Classification and Regression
- Scaling: No (not required)
- Outliers: Robust
- Categorical Variables: Yes in principle (some libraries, such as scikit-learn, still require encoding first)
- Core Idea: IF-ELSE splits by feature; leaf nodes represent final predictions.
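- Advantages: Interpretable, no scaling needed, handles numeric and categorical features, fast.
- Disadvantages: Prone to overfitting; unstable (small changes in the data can produce a very different tree).
- Metrics: Accuracy, Confusion Matrix. Gini Impurity is the split criterion used while building the tree rather than an evaluation metric.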
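To make Gini impurity concrete, here is a small sketch of how it can be computed for a node and for a candidate split. The function names `gini` and `split_gini` and the toy labels are assumptions made for illustration.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 - sum(p_k^2) over the class proportions p_k."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def split_gini(left, right):
    """Size-weighted Gini impurity of the two children of a split."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

# Toy node with a 50/50 class mix, and a split that separates it perfectly
parent = ["yes", "yes", "yes", "no", "no", "no"]
left, right = ["yes", "yes", "yes"], ["no", "no", "no"]
print(gini(parent), split_gini(left, right))
```

A tree builder would evaluate `split_gini` for each candidate split and keep the one with the lowest value.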
Random Forest (RF)
- Type: Classification and Regression
- Scaling: No
- Outliers: Robust
- Categorical Variables: Yes in principle (some libraries, such as scikit-learn, still require encoding first)
- Core Idea: Ensemble of trees (Bagging) using majority vote for final prediction.
- Advantages: Reduces overfitting relative to a single tree, stable, provides feature importance, can handle missing values (implementation-dependent).
- Disadvantages: Slower to train and predict than a single tree, less interpretable, requires hyperparameter tuning.
- Metrics: Accuracy, Feature Importance, Out-of-Bag (OOB) error.
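The two mechanics behind bagging, bootstrap sampling and majority voting, can be sketched as follows. This is only an illustration of the idea: no trees are trained here, and the per-tree votes are hypothetical.

```python
import random
from collections import Counter

def bootstrap_indices(n, rng):
    """Draw n row indices with replacement (one bag per tree)."""
    return [rng.randrange(n) for _ in range(n)]

def majority_vote(predictions):
    """Return the most common label (ties go to the first label seen)."""
    return Counter(predictions).most_common(1)[0][0]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
bag = bootstrap_indices(10, rng)
# Rows never drawn into the bag are "out-of-bag" for this tree
oob = sorted(set(range(10)) - set(bag))

# Hypothetical predictions from five trees for a single sample
votes = ["spam", "ham", "spam", "spam", "ham"]
final = majority_vote(votes)
print(bag, oob, final)
```

The OOB error listed above comes from scoring each row using only the trees whose bag did not contain it.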
Supervised Regression
Linear Regression
- Type: Continuous target only
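- Scaling: Recommended (plain least-squares predictions do not depend on it, but it matters for regularized variants and for comparing coefficient magnitudes)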
- Outliers: Not robust (outliers distort the regression line)
- Categorical Variables: No (encode first)
- Core Idea: y = b0 + b1x1 + b2x2 + … predicts a continuous value.
- Advantages: Simple, fast, interpretable, coefficients indicate feature impact.
- Disadvantages: Assumes linearity, fails on complex patterns, sensitive to outliers.
- Metrics: MAE, MSE, RMSE, R².
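As a rough sketch of the model and its metrics, the snippet below fits a one-feature line with the closed-form least-squares solution and then computes MAE, MSE, RMSE, and R². The helper names and the toy data are assumptions for illustration.

```python
import math

def fit_simple_ols(x, y):
    """Closed-form least squares for y = b0 + b1*x (single feature)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    b1 = sxy / sxx
    return my - b1 * mx, b1

def regression_metrics(y_true, y_pred):
    """Return MAE, MSE, RMSE and R^2 for paired true/predicted values."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    mean_y = sum(y_true) / n
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    r2 = 1.0 - (mse * n) / ss_tot  # 1 - SS_res / SS_tot
    return {"MAE": mae, "MSE": mse, "RMSE": math.sqrt(mse), "R2": r2}

# Toy data that lies exactly on y = 1 + 2x
x = [1.0, 2.0, 3.0, 4.0]
y = [3.0, 5.0, 7.0, 9.0]
b0, b1 = fit_simple_ols(x, y)
metrics = regression_metrics(y, [b0 + b1 * xi for xi in x])
print(b0, b1, metrics)
```

With several features the same idea applies, but the coefficients are usually obtained with a linear-algebra routine rather than by hand.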
K-Nearest Neighbors (KNN)
- Type: Classification and Regression
- Scaling: Yes (Distance-based; mandatory)
- Outliers: Not robust
- Categorical Variables: No
- Core Idea: Classification via majority vote of K neighbors; Regression via average of K neighbors.
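- Advantages: Simple, no explicit training phase (lazy learner), effective with small datasets.
- Disadvantages: Slow at prediction time on large data (every query is compared against the training set), sensitive to outliers.
- Choosing K: Too low a K overfits to noise; too high a K underfits. K = 5 is a common starting default.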
- Metrics: Accuracy, F1 (Class); MAE, MSE, R² (Regression).
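Both uses of KNN described above can be sketched with one neighbor search. This is a brute-force illustration using Euclidean distance on already-scaled toy data; the function names and values are assumptions, and `math.dist` needs Python 3.8 or later.

```python
import math
from collections import Counter

def k_nearest(train_X, query, k):
    """Indices of the k training points closest to query (Euclidean)."""
    order = sorted(range(len(train_X)),
                   key=lambda i: math.dist(train_X[i], query))
    return order[:k]

def knn_classify(train_X, train_y, query, k=5):
    """Majority vote among the k nearest neighbors."""
    idx = k_nearest(train_X, query, k)
    return Counter(train_y[i] for i in idx).most_common(1)[0][0]

def knn_regress(train_X, train_y, query, k=5):
    """Average target value of the k nearest neighbors."""
    idx = k_nearest(train_X, query, k)
    return sum(train_y[i] for i in idx) / len(idx)

# Toy data: two well-separated groups
X = [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]]
labels = ["a", "a", "a", "b", "b", "b"]
values = [10.0, 12.0, 11.0, 50.0, 52.0, 48.0]

label = knn_classify(X, labels, [1.0, 0.9], k=3)
value = knn_regress(X, values, [1.0, 0.9], k=3)
print(label, value)
```

Because every query sorts the whole training set, the cost grows with the data size, which is the slowness noted in the disadvantages.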
Unsupervised Learning
K-Means (KM)
- Type: Clustering (No target variable)
- Scaling: Yes (Distance-based)
- Outliers: Not robust (shifts centroids)
- Categorical Variables: No
- Core Idea: Iteratively assign points to the nearest of K centroids and update centroids until stable.
- Inertia: Sum of squared distances from each point to its assigned centroid; lower means more compact clusters. Used in the Elbow method.
- Elbow Method: Plot inertia vs. K; pick K where the curve bends.
- Advantages: Fast, simple, scalable.
- Disadvantages: K must be set in advance, assumes roughly spherical clusters, results can vary between runs because of random initialization.
- Metrics: Inertia, Silhouette score.
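The assign-and-update loop and the inertia calculation can be sketched as below. To keep the example repeatable, the starting centroids are passed in explicitly instead of being chosen at random, which is an assumption of this sketch; the function names and toy points are also illustrative.

```python
import math

def nearest(point, centroids):
    """Index of the centroid closest to point."""
    return min(range(len(centroids)),
               key=lambda j: math.dist(point, centroids[j]))

def kmeans(points, init_centroids, max_iter=100):
    """Alternate assignment and centroid update until centroids stop moving."""
    centroids = [list(c) for c in init_centroids]
    for _ in range(max_iter):
        assign = [nearest(p, centroids) for p in points]
        updated = []
        for j, old in enumerate(centroids):
            members = [p for p, a in zip(points, assign) if a == j]
            # An empty cluster keeps its previous centroid
            updated.append([sum(col) / len(members) for col in zip(*members)]
                           if members else old)
        if updated == centroids:
            break
        centroids = updated
    assign = [nearest(p, centroids) for p in points]
    inertia = sum(math.dist(p, centroids[a]) ** 2
                  for p, a in zip(points, assign))
    return centroids, assign, inertia

points = [[1.0, 1.0], [1.0, 2.0], [2.0, 1.0],
          [8.0, 8.0], [8.0, 9.0], [9.0, 8.0]]
centroids, assign, inertia = kmeans(points, [points[0], points[3]])
print(centroids, assign, round(inertia, 4))
```

For the Elbow method, one would run this for several values of K and plot the returned inertia against K.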
Hierarchical Clustering (HC)
- Type: Clustering (No target variable)
- Scaling: Yes (Distance-based)
- Outliers: Depends on linkage
- Categorical Variables: No
- Core Idea: Agglomerative approach; start with each of the N points as its own cluster and repeatedly merge the closest pair of clusters until one remains.
- Dendrogram: Tree diagram showing merge history; cut horizontally to determine cluster count.
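- Linkage Types:
  - Single: Distance between the nearest pair of points; prone to chaining, but outliers stand out as late-merging singletons.
  - Complete: Distance between the farthest pair of points; favors compact clusters of similar diameter.
  - Average: Mean of all pairwise distances between the two clusters; less sensitive to outliers than single or complete.
  - Ward: Merges the pair that least increases within-cluster variance; a common general-purpose choice.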
- Advantages: No K needed in advance, deterministic, shows full merge history.
- Disadvantages: Slow on large data, highly sensitive to linkage choice.
- Metrics: Dendrogram, Silhouette score, Elbow (distortion).
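The linkage types differ only in how they score the distance between two clusters. The sketch below computes each score for a pair of toy clusters; average linkage is taken here as the mean of all pairwise distances, and Ward is expressed as the increase in within-cluster sum of squares caused by the merge. The function names and points are assumptions for illustration.

```python
import math
from itertools import product

def pairwise(a, b):
    """All point-to-point Euclidean distances between clusters a and b."""
    return [math.dist(p, q) for p, q in product(a, b)]

def single_linkage(a, b):
    return min(pairwise(a, b))

def complete_linkage(a, b):
    return max(pairwise(a, b))

def average_linkage(a, b):
    d = pairwise(a, b)
    return sum(d) / len(d)

def centroid(cluster):
    return [sum(col) / len(cluster) for col in zip(*cluster)]

def ward_cost(a, b):
    """Increase in within-cluster sum of squares if a and b are merged."""
    na, nb = len(a), len(b)
    return na * nb / (na + nb) * math.dist(centroid(a), centroid(b)) ** 2

A = [[0.0, 0.0], [0.0, 1.0]]
B = [[3.0, 0.0], [4.0, 0.0]]
print(single_linkage(A, B), complete_linkage(A, B),
      average_linkage(A, B), ward_cost(A, B))
```

An agglomerative pass would evaluate the chosen score for every pair of current clusters and merge the pair with the smallest value.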
Association Rules (APr)
- Note: A rule with high confidence but Lift ≈ 1 is uninformative: the consequent is just as common with the antecedent as without it, so the two are roughly independent.
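The note above is easier to see with numbers. The sketch below computes support, confidence, and lift for a single rule on a made-up set of baskets in which the rule bread → milk has high confidence but a lift of 1; the basket contents and function names are hypothetical.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in items."""
    items = set(items)
    return sum(items <= set(t) for t in transactions) / len(transactions)

def rule_metrics(transactions, antecedent, consequent):
    """Support, confidence and lift for the rule antecedent -> consequent."""
    s_a = support(transactions, antecedent)
    s_c = support(transactions, consequent)
    s_ac = support(transactions, set(antecedent) | set(consequent))
    confidence = s_ac / s_a      # P(consequent | antecedent)
    lift = confidence / s_c      # ratio to the consequent's base rate
    return {"support": s_ac, "confidence": confidence, "lift": lift}

# Made-up baskets: milk is in 8 of 10, and in 4 of the 5 that contain bread
baskets = ([{"bread", "milk"}] * 4 + [{"bread"}]
           + [{"milk"}] * 4 + [{"eggs"}])
m = rule_metrics(baskets, {"bread"}, {"milk"})
print(m)
```

Here milk is bought about 80% of the time whether or not bread is in the basket, so the rule adds no information despite its confidence.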