Supervised and Unsupervised Learning Model Reference

Supervised Classification

Logistic Regression (LR)

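  • Type: Binary classification (multiclass possible via multinomial or one-vs-rest extensions)
  • Scaling: Recommended (e.g., StandardScaler), especially with regularization or gradient-based solvers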
  • Outliers: Not robust
  • Categorical Variables: No (encode first)
  • Core Idea: Sigmoid function maps the linear score to a probability between 0 and 1; by default, a probability ≥ 0.5 predicts class 1.
  • Advantages: Fast, simple, interpretable, outputs probabilities.
  • Disadvantages: Binary in its basic form, assumes a linear decision boundary, underfits non-linear data unless features are engineered.
  • Metrics: Accuracy, Precision, Recall, F1-Score, Confusion Matrix.
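The sigmoid-and-threshold idea above can be sketched in a few lines of plain Python. The weights, bias, and sample values are made up for illustration; in practice they would be learned from data. This is a minimal sketch and does not guard against overflow for extreme scores.

```python
import math

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_class(weights, bias, features, threshold=0.5):
    """Return (probability of class 1, predicted class)."""
    # Linear score z = b + w . x, squashed by the sigmoid into a probability.
    z = bias + sum(w * x for w, x in zip(weights, features))
    probability = sigmoid(z)
    return probability, int(probability >= threshold)

# Hypothetical coefficients and one already-scaled sample.
prob, label = predict_class(weights=[1.5, -0.5], bias=0.0, features=[1.0, 1.0])
print(round(prob, 3), label)
```

Raising the threshold above 0.5 trades recall for precision; lowering it does the opposite.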

Decision Trees (DT)

  • Type: Classification and Regression
  • Scaling: No (not required)
  • Outliers: Robust
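  • Categorical Variables: Yes in principle; some implementations (e.g., scikit-learn) still require encoding
  • Core Idea: Recursive IF-ELSE splits on feature values; leaf nodes hold the final predictions.
  • Advantages: Interpretable, no scaling needed, handles numeric and categorical features, fast.
  • Disadvantages: Prone to overfitting, sensitive to small data changes.
  • Metrics: Accuracy, Confusion Matrix. Gini impurity is the split criterion used during training, not an evaluation metric.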
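Gini impurity for a node is 1 minus the sum of squared class proportions, and a candidate split is scored by the size-weighted impurity of its children. The sketch below illustrates this on toy labels; the function names are hypothetical and the data is invented.

```python
from collections import Counter

def gini_impurity(labels):
    """1 - sum(p_k^2) over the class proportions p_k in a node."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((count / total) ** 2 for count in counts.values())

def split_gini(left, right):
    """Size-weighted impurity of the two child nodes produced by a split."""
    n = len(left) + len(right)
    return (len(left) / n) * gini_impurity(left) + (len(right) / n) * gini_impurity(right)

parent = ["yes", "yes", "no", "no"]
parent_gini = gini_impurity(parent)                 # evenly mixed node
pure_split = split_gini(["yes", "yes"], ["no", "no"])  # perfectly separating split
print(parent_gini, pure_split)
```

A tree builder would evaluate many candidate splits this way and keep the one with the lowest weighted impurity.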

Random Forest (RF)

  • Type: Classification and Regression
  • Scaling: No
  • Outliers: Robust
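  • Categorical Variables: Yes in principle; some implementations still require encoding
  • Core Idea: Ensemble of trees trained on bootstrap samples (bagging); majority vote for classification, average for regression.
  • Advantages: Reduces overfitting, stable, provides feature importance, can handle missing values (implementation-dependent).
  • Disadvantages: Slower than a single tree, less interpretable, benefits from hyperparameter tuning.
  • Metrics: Accuracy, Out-of-Bag (OOB) error; feature importances serve as a diagnostic rather than a performance score.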
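The bagging and aggregation steps can be illustrated without building real trees. The sketch below assumes the individual tree predictions are already available (the vote lists are invented) and only shows bootstrap sampling, the out-of-bag rows it leaves behind, and how predictions are combined.

```python
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement (the 'bagging' step)."""
    return [rng.choice(rows) for _ in rows]

def majority_vote(predictions):
    """Classification: the most common class label among the trees wins."""
    return Counter(predictions).most_common(1)[0][0]

def average_prediction(predictions):
    """Regression: the forest output is the mean of the tree outputs."""
    return sum(predictions) / len(predictions)

rng = random.Random(0)           # fixed seed so the sketch is repeatable
rows = list(range(10))           # stand-in for row indices of a training set
sample = bootstrap_sample(rows, rng)
# Rows never drawn are "out-of-bag" for this tree and can be used to estimate error.
out_of_bag = [r for r in rows if r not in sample]

final_class = majority_vote(["spam", "ham", "spam"])   # hypothetical tree votes
final_value = average_prediction([2.0, 3.0, 4.0])      # hypothetical tree outputs
print(final_class, final_value)
```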

Supervised Regression

Linear Regression

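  • Type: Regression (continuous target only)
  • Scaling: Recommended (not required for ordinary least squares; important with regularization or gradient descent)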
  • Outliers: Not robust (outliers distort the regression line)
  • Categorical Variables: No (encode first)
  • Core Idea: y = b0 + b1x1 + b2x2 + … predicts a continuous value.
  • Advantages: Simple, fast, interpretable, coefficients indicate feature impact.
  • Disadvantages: Assumes linearity, fails on complex patterns, sensitive to outliers.
  • Metrics: MAE, MSE, RMSE, R².
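The four regression metrics listed above can be computed directly from true and predicted values. This is a minimal sketch on invented numbers; it assumes the true values are not all identical, since R² is undefined in that case.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return MAE, MSE, RMSE, and R^2 for paired true/predicted values."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)                    # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)     # total sum of squares
    r2 = 1.0 - ss_res / ss_tot                             # assumes ss_tot > 0
    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "R2": r2}

metrics = regression_metrics([3.0, 5.0, 7.0], [2.0, 5.0, 9.0])
print(metrics)
```

MAE and RMSE are in the units of the target; RMSE penalizes large errors more heavily because the errors are squared before averaging.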

K-Nearest Neighbors (KNN)

  • Type: Classification and Regression
  • Scaling: Yes (Distance-based; mandatory)
  • Outliers: Not robust
  • Categorical Variables: No (encode first)
  • Core Idea: Classification via majority vote of K neighbors; Regression via average of K neighbors.
  • Advantages: Simple, no explicit training phase (lazy learner), effective with small datasets.
  • Disadvantages: Slow at prediction time on large data, sensitive to outliers.
  • Choosing K: Too low overfits to noise; too high underfits. K = 5 is a common starting point.
  • Metrics: Accuracy, F1 (Class); MAE, MSE, R² (Regression).
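Both KNN variants share the same neighbor search and differ only in how the neighbors' values are combined. The sketch below uses Euclidean distance and a brute-force search on invented, already-scaled points; the function names are hypothetical.

```python
import math
from collections import Counter

def k_nearest(train_points, query, k):
    """Indices of the k training points closest to the query (Euclidean distance)."""
    ranked = sorted(range(len(train_points)),
                    key=lambda i: math.dist(train_points[i], query))
    return ranked[:k]

def knn_classify(train_points, train_labels, query, k=5):
    """Classification: majority vote among the k nearest neighbors."""
    neighbors = k_nearest(train_points, query, k)
    return Counter(train_labels[i] for i in neighbors).most_common(1)[0][0]

def knn_regress(train_points, train_targets, query, k=5):
    """Regression: average target of the k nearest neighbors."""
    neighbors = k_nearest(train_points, query, k)
    return sum(train_targets[i] for i in neighbors) / len(neighbors)

points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (1.0, 1.0), (0.9, 1.1), (1.1, 0.9)]
labels = ["a", "a", "a", "b", "b", "b"]
targets = [1.0, 1.0, 1.0, 3.0, 3.0, 3.0]

predicted_label = knn_classify(points, labels, (0.05, 0.05), k=3)
predicted_value = knn_regress(points, targets, (1.0, 1.0), k=3)
print(predicted_label, predicted_value)
```

Because every prediction scans all training points, cost grows with dataset size, which is the "slow on large data" disadvantage noted above.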

Unsupervised Learning

K-Means (KM)

  • Type: Clustering (No target variable)
  • Scaling: Yes (Distance-based)
  • Outliers: Not robust (outliers pull centroids toward them)
  • Categorical Variables: No
  • Core Idea: Iteratively assign points to the nearest of K centroids and update centroids until stable.
  • Inertia: Sum of squared distances to centroid; lower is more compact. Used in the Elbow method.
  • Elbow Method: Plot inertia vs. K; pick K where the curve bends.
  • Advantages: Fast, simple, scalable.
  • Disadvantages: K must be set in advance, assumes roughly spherical clusters, results can differ between runs because of random initialization.
  • Metrics: Inertia, Silhouette score.
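The assign-then-update loop and the inertia calculation can be sketched as follows. To keep the example deterministic, the starting centroids are passed in explicitly rather than chosen at random; the points and starting centroids are invented.

```python
import math

def assign(points, centroids):
    """Index of the nearest centroid for each point."""
    return [min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for p in points]

def update(points, assignments, centroids):
    """Move each centroid to the mean of its assigned points."""
    new_centroids = []
    for c, old in enumerate(centroids):
        members = [p for p, a in zip(points, assignments) if a == c]
        if not members:
            new_centroids.append(old)   # leave an empty cluster's centroid in place
            continue
        new_centroids.append(tuple(sum(m[d] for m in members) / len(members)
                                   for d in range(len(old))))
    return new_centroids

def inertia(points, assignments, centroids):
    """Sum of squared distances from each point to its assigned centroid."""
    return sum(math.dist(p, centroids[a]) ** 2 for p, a in zip(points, assignments))

def k_means(points, initial_centroids, max_iter=100):
    centroids = [tuple(c) for c in initial_centroids]
    for _ in range(max_iter):
        new_centroids = update(points, assign(points, centroids), centroids)
        if new_centroids == centroids:   # stable: no centroid moved
            break
        centroids = new_centroids
    assignments = assign(points, centroids)
    return centroids, assignments, inertia(points, assignments, centroids)

points = [(1.0, 1.0), (1.0, 2.0), (5.0, 5.0), (5.0, 6.0)]
centroids, assignments, total_inertia = k_means(points, [(0.0, 0.0), (6.0, 6.0)])
print(centroids, assignments, total_inertia)
```

For the Elbow method, the same function would be run for several values of K and the resulting inertia values plotted against K.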

Hierarchical Clustering (HC)

  • Type: Clustering (No target variable)
  • Scaling: Yes (Distance-based)
  • Outliers: Depends on linkage
  • Categorical Variables: No
  • Core Idea: Agglomerative approach; start with each of the N points as its own cluster and repeatedly merge the closest pair of clusters.
  • Dendrogram: Tree diagram showing merge history; cut horizontally to determine cluster count.
  • Linkage Types:
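    • Single: Distance between the closest pair of points; prone to chaining, which can help isolate outliers.
    • Complete: Distance between the farthest pair of points; favors compact, roughly spherical clusters.
    • Average: Mean of all pairwise distances between the two clusters; less sensitive to outliers than single or complete.
    • Ward: Merges the pair that least increases within-cluster variance; a common default with Euclidean distance.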
  • Advantages: No K needed in advance, deterministic, shows full merge history.
  • Disadvantages: Slow on large data, highly sensitive to linkage choice.
  • Metrics: Silhouette score, Elbow (distortion); the dendrogram is a visual aid for choosing the cut.
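The linkage types differ only in how they turn point-to-point distances into a cluster-to-cluster distance. The sketch below compares single, complete, and average linkage on two invented clusters using Euclidean distance; Ward is omitted to keep the example short.

```python
import math
from itertools import product

def pairwise_distances(cluster_a, cluster_b):
    """All Euclidean distances between a point in A and a point in B."""
    return [math.dist(p, q) for p, q in product(cluster_a, cluster_b)]

def single_linkage(cluster_a, cluster_b):
    """Distance between the closest pair of points."""
    return min(pairwise_distances(cluster_a, cluster_b))

def complete_linkage(cluster_a, cluster_b):
    """Distance between the farthest pair of points."""
    return max(pairwise_distances(cluster_a, cluster_b))

def average_linkage(cluster_a, cluster_b):
    """Mean of all pairwise distances."""
    distances = pairwise_distances(cluster_a, cluster_b)
    return sum(distances) / len(distances)

cluster_a = [(0.0, 0.0), (0.0, 1.0)]
cluster_b = [(3.0, 0.0), (4.0, 0.0)]

single = single_linkage(cluster_a, cluster_b)
complete = complete_linkage(cluster_a, cluster_b)
average = average_linkage(cluster_a, cluster_b)
print(single, complete, average)
```

An agglomerative pass would compute one of these distances for every pair of current clusters and merge the pair with the smallest value.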

Association Rules (APr)

  • Note: A rule with high confidence but Lift ≈ 1 is uninformative: the consequent is about as common with the antecedent as without it.
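The note can be checked numerically using the standard definitions: confidence is the share of antecedent transactions that also contain the consequent, and lift is confidence divided by the consequent's overall frequency. The basket data below is invented so that confidence is high while lift is 1; the sketch assumes both itemsets occur at least once.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Return (support, confidence, lift) for the rule antecedent -> consequent."""
    antecedent, consequent = set(antecedent), set(consequent)
    n = len(transactions)
    count_a = sum(1 for t in transactions if antecedent <= set(t))
    count_c = sum(1 for t in transactions if consequent <= set(t))
    count_both = sum(1 for t in transactions if (antecedent | consequent) <= set(t))
    support = count_both / n
    confidence = count_both / count_a      # assumes the antecedent occurs at least once
    lift = confidence / (count_c / n)      # assumes the consequent occurs at least once
    return support, confidence, lift

# Bread is in 8 of 10 baskets overall, and in 4 of the 5 baskets that contain milk.
transactions = (
    [{"milk", "bread"}] * 4
    + [{"milk"}]
    + [{"bread"}] * 4
    + [{"eggs"}]
)
support, confidence, lift = rule_metrics(transactions, {"milk"}, {"bread"})
print(support, confidence, lift)
```

Here confidence is 0.8, yet bread appears in 80% of all baskets anyway, so knowing that a basket contains milk adds nothing.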