Machine Learning Model Evaluation: Classification & Clustering Metrics

Classification Model Evaluation Metrics

Understanding how to evaluate classification models is crucial for assessing their effectiveness. This section details key metrics derived from the confusion matrix.

A. Confusion Matrix for Binary Classification

The Confusion Matrix is a fundamental tool for evaluating the performance of a classification model, especially for binary classification (Positive (+) and Negative (-)).

                 Predicted +      Predicted –
  Actual +       TP (f++)         FN (f+-)
  Actual –       FP (f-+)         TN (f--)

Key Terms in the Confusion Matrix:

  • True Positive (TP): Instances that are Actual + and Predicted +. These are correctly identified positives.
  • False Negative (FN): Instances that are Actual + but Predicted –. These are missed positives (also known as a Type II error).
  • False Positive (FP): Instances that are Actual – but Predicted +. These are falsely identified positives (also known as a Type I error).
  • True Negative (TN): Instances that are Actual – and Predicted –. These are correctly identified negatives.

Derived Totals:

  • Total Actual Positives (Np or P): Np = TP + FN
  • Total Actual Negatives (Nn or N): Nn = FP + TN
  • Total Instances (N_total): N_total = TP + FN + FP + TN
  • Correct Classifications: TP + TN
  • Errors: FP + FN
  • Perfect Classifier: FP = 0, FN = 0
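
As a minimal sketch of how these counts are obtained in practice (plain Python; the label lists are hypothetical, with 1 = positive and 0 = negative):

    # Tally TP, FN, FP, TN from paired actual/predicted labels.
    # These example lists are hypothetical; 1 = positive, 0 = negative.
    actual    = [1, 1, 0, 1, 0, 0, 1, 0]
    predicted = [1, 0, 0, 1, 1, 0, 1, 0]

    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)

    print(f"TP={tp}, FN={fn}, FP={fp}, TN={tn}")  # TP=3, FN=1, FP=1, TN=3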

B. Key Classification Metrics & Formulas

Each of these metrics is computed directly from the four confusion-matrix counts; together they give a fuller picture of classifier behavior than accuracy alone. A short code sketch after the list shows how to compute them.

  • Accuracy

    Fraction of correctly classified instances.

    Accuracy = (TP + TN) / (TP + FP + FN + TN)

  • True Positive Rate (TPR) / Recall / Sensitivity / Hit Rate

    Fraction of actual positives correctly predicted.

    TPR = TP / (TP + FN) = TP / Np

  • False Positive Rate (FPR)

    Fraction of actual negatives incorrectly predicted as positive.

    FPR = FP / (FP + TN) = FP / Nn

  • False Negative Rate (FNR)

    Fraction of actual positives incorrectly predicted as negative.

    FNR = FN / (TP + FN) = FN / Np

  • True Negative Rate (TNR) / Specificity

    Fraction of actual negatives correctly predicted.

    TNR = TN / (FP + TN) = TN / Nn

  • Precision

    Fraction of predicted positives that are actually positive.

    Precision = TP / (TP + FP)

  • F1 Score (Harmonic Mean of Precision & Recall)

    Overall measure of predictive performance, balancing Precision and Recall. A high F1 score indicates that both Precision and Recall are reasonably high.

    F1 = 2 * (Precision * Recall) / (Precision + Recall) = 2TP / (2TP + FP + FN)
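
The formulas above map directly onto code. A minimal sketch in plain Python (the function name compute_metrics is made up for illustration; ratios with a zero denominator are reported as None rather than raising an error):

    def compute_metrics(tp, fn, fp, tn):
        """Compute the metrics defined above from confusion-matrix counts."""
        def div(a, b):
            return a / b if b else None  # guard against division by zero
        return {
            "accuracy":  div(tp + tn, tp + fn + fp + tn),
            "tpr":       div(tp, tp + fn),   # recall / sensitivity
            "fpr":       div(fp, fp + tn),
            "fnr":       div(fn, tp + fn),
            "tnr":       div(tn, fp + tn),   # specificity
            "precision": div(tp, tp + fp),
            "f1":        div(2 * tp, 2 * tp + fp + fn),
        }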

C. Classifier Archetypes

These archetypes illustrate how different model behaviors impact evaluation metrics. (P = Total Actual Positives, N = Total Actual Negatives)

  • Case 1: Perfect Classifier

    A model that correctly classifies all instances.

    • Confusion Matrix (CM): TP=P, FN=0, FP=0, TN=N
    • Metrics: TPR=1, FPR=0, Precision=1, Accuracy=1, F1=1
  • Case 2: Worst Classifier

    A model where all instances are wrongly classified.

    • Confusion Matrix (CM): TP=0, FN=P, FP=N, TN=0
    • Metrics: TPR=0, FPR=1, Precision=0, Accuracy=0, F1=0 (Note: with P > 0 and N > 0 all of these are defined and equal 0; Precision is undefined if N = 0, and Recall is undefined if P = 0.)
  • Case 3: Ultra-Liberal Classifier (Always Predicts Positive)

    A model that always predicts the positive class.

    • Confusion Matrix (CM): TP=P, FN=0, FP=N, TN=0
    • Metrics: TPR=1, FPR=1, Precision=P/(P+N), Accuracy=P/(P+N), F1=2P/(2P+N)
    • Note: Accuracy is P/(P+N), which is only 0 if P=0.
  • Case 4: Ultra-Conservative Classifier (Always Predicts Negative)

    A model that always predicts the negative class.

    • Confusion Matrix (CM): TP=0, FN=P, FP=0, TN=N
    • Metrics: TPR=0, FPR=0, Precision=N/A (since TP+FP=0), Accuracy=N/(P+N)
    • Note: Accuracy is N/(P+N), which is only 0 if N=0.
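
Using the compute_metrics sketch from Section B, the four archetypes can be checked for any P > 0 and N > 0 (the totals below are hypothetical):

    P, N = 70, 144  # hypothetical totals of actual positives and negatives

    archetypes = {                         # (TP, FN, FP, TN)
        "perfect":            (P, 0, 0, N),
        "worst":              (0, P, N, 0),
        "ultra-liberal":      (P, 0, N, 0),  # always predicts positive
        "ultra-conservative": (0, P, 0, N),  # always predicts negative
    }

    for name, counts in archetypes.items():
        metrics = compute_metrics(*counts)
        print(name, {k: (round(v, 3) if v is not None else None)
                     for k, v in metrics.items()})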

D. Classification Example Data

Consider the following data for a classification task:

  • True Positives (TP): 52
  • False Negatives (FN): 18
  • False Positives (FP): 21
  • True Negatives (TN): 123
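
Plugging these counts into the formulas from Section B (for instance via the compute_metrics sketch above) gives, approximately:

    metrics = compute_metrics(tp=52, fn=18, fp=21, tn=123)
    # Expected (rounded): accuracy ≈ 0.818, tpr/recall ≈ 0.743, fpr ≈ 0.146,
    # fnr ≈ 0.257, tnr ≈ 0.854, precision ≈ 0.712, f1 ≈ 0.727
    print(metrics)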

Clustering Model Evaluation Metrics

Evaluating clustering models is distinct from classification, as it often involves assessing intrinsic data structures without ground truth labels.

A. Cluster Validity: Why Evaluate?

Evaluating cluster validity is essential for several reasons:

  • Avoid finding patterns in noise: Ensures that identified clusters represent meaningful structures, not random fluctuations.
  • Compare clustering algorithms: Allows for objective comparison of different algorithms on the same dataset.
  • Compare sets of clusters: Helps in comparing two different sets of clusters or individual clusters.

B. Types of Cluster Validity Measures

  • External Index

    Measures how well cluster labels match externally supplied class labels (ground truth).

    Example: Entropy (illustrated in the sketch after this list).

  • Internal Index

    Measures the “goodness” of a clustering structure without relying on external information.

    Example: Sum of Squared Error (SSE).

  • Relative Index

    Used to compare two different clusterings or clusters, often by applying an external or internal index.
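
As a concrete example of an external index, entropy measures how mixed each cluster is with respect to known class labels: compute the class distribution inside each cluster, take its entropy, and average the per-cluster entropies weighted by cluster size (0 means every cluster is pure). A minimal sketch, assuming hypothetical label lists:

    from collections import Counter
    from math import log2

    def clustering_entropy(cluster_labels, class_labels):
        """Size-weighted average entropy of the class mix inside each cluster.
        0 means every cluster contains a single class; higher means more mixing."""
        n = len(class_labels)
        total = 0.0
        for cluster in set(cluster_labels):
            members = [c for cl, c in zip(cluster_labels, class_labels) if cl == cluster]
            probs = [cnt / len(members) for cnt in Counter(members).values()]
            entropy = -sum(p * log2(p) for p in probs)
            total += (len(members) / n) * entropy
        return total

    # Hypothetical example: cluster 0 is pure, cluster 1 mixes two classes.
    print(clustering_entropy([0, 0, 0, 1, 1, 1], ["a", "a", "a", "a", "b", "b"]))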

C. Internal Measures for Clustering

  • Sum of Squared Errors (SSE) / Inertia

    Measures the compactness of clusters. A lower SSE generally indicates better clustering, as data points are closer to their respective cluster centroids.

    SSE = Σ_i Σ_{x ∈ C_i} ||x - m_i||²

    Where C_i is cluster i and m_i is its centroid (mean).

    SSE can also be used with the “elbow method” to estimate the optimal number of clusters (K); see the sketch after this list.

  • Cluster Cohesion & Separation

    These measures assess how well-defined and distinct clusters are.

    • Cohesion (Within-cluster Sum of Squares – WSS or SSE)

      Measures how closely related objects are within a cluster. It is the same as the SSE defined above.

      WSS = Σ_i Σ_{x ∈ C_i} (x - m_i)²

    • Separation (Between-cluster Sum of Squares – BSS)

      Measures how distinct or well-separated a cluster is from other clusters.

      BSS = Σ_i |C_i| (m - m_i)²

      Where |C_i| is the size of cluster i, m_i is the centroid of cluster i, and m is the overall mean of the dataset.

    Generally, a good clustering exhibits high cohesion (low WSS) and high separation (high BSS).

    Total Sum of Squares (TSS): TSS = WSS + BSS (TSS is constant for a given dataset).
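
To illustrate the elbow method mentioned above: run K-means for a range of K values, record the SSE for each, and look for the point where the decrease levels off. A minimal sketch, assuming scikit-learn is available (its KMeans exposes the SSE/WSS as the inertia_ attribute); the synthetic data here is made up for illustration:

    import numpy as np
    from sklearn.cluster import KMeans  # assumes scikit-learn is installed

    # Hypothetical 2-D data with three loose blobs; in practice X is your feature matrix.
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(loc, 0.5, size=(30, 2))
                   for loc in ((0, 0), (4, 4), (0, 4))])

    # SSE (inertia) for K = 1..8; the "elbow" suggests a reasonable K.
    for k in range(1, 9):
        sse = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
        print(k, round(sse, 2))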

D. Clustering Example Data

Consider the following data points and centroids for 3 clusters (C1, C2, C3) with features F1 and F2:

  • C1: (1,0), (1,1) → Centroid: (1, 0.5)
  • C2: (1,2), (2,3), (2,2), (1,2) → Centroid: (1.5, 2.25)
  • C3: (3,1), (3,3), (2,1) → Centroid: (2.67, 1.67)
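
A minimal sketch (plain Python) that recomputes the centroids from these points and checks cohesion and separation; it uses the exact means rather than the rounded centroids listed above:

    clusters = {
        "C1": [(1, 0), (1, 1)],
        "C2": [(1, 2), (2, 3), (2, 2), (1, 2)],
        "C3": [(3, 1), (3, 3), (2, 1)],
    }

    def mean(points):
        return tuple(sum(p[d] for p in points) / len(points) for d in range(2))

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    all_points = [p for pts in clusters.values() for p in pts]
    m = mean(all_points)  # overall mean of the dataset

    wss = sum(sq_dist(p, mean(pts)) for pts in clusters.values() for p in pts)
    bss = sum(len(pts) * sq_dist(mean(pts), m) for pts in clusters.values())
    tss = sum(sq_dist(p, m) for p in all_points)

    # WSS ≈ 5.583, BSS ≈ 7.972, TSS ≈ 13.556; TSS = WSS + BSS as expected.
    print(round(wss, 3), round(bss, 3), round(tss, 3))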