Machine Learning Clustering: K-Means and Hierarchical Methods

Machine Learning Clustering Fundamentals

Clustering vs. Data Storage

Clustering is a data analysis technique used to discover patterns. This contrasts with data storage systems such as Firestore, which are designed to store the data needed to run an application or service.

  • Data Storage (e.g., Firestore): Stores data to run the app/service.
  • Data Analysis (Clustering): Analyzes data to discover patterns (machine learning).

Defining Clustering and Unsupervised Learning

Clustering is the task of grouping data points by similarity. What counts as similar is decided by the rules or requirements of the task at hand.

It is an unsupervised method because no class labels are provided, and the algorithm discovers structure in the data on its own.

The Ill-Defined Nature of Clustering

Clustering is considered ill-defined because there is no single correct clustering result; the outcome depends on the distance metric, context, and the observer’s interpretation.

The Role of Distance Metrics

A distance metric defines how similarity between data points is measured. Different distance metrics can lead to different clustering results.
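
As a minimal illustration (NumPy only; the points are made up), the same pair of points is closer under Euclidean distance than under Manhattan distance, and such differences can change which cluster a point joins:

```python
# Minimal sketch: the same pair of points measured under two metrics.
# The choice of metric can change nearest-neighbor relations and
# therefore the resulting clusters.
import numpy as np

a = np.array([0.0, 0.0])
b = np.array([3.0, 4.0])

euclidean = np.linalg.norm(a - b)   # straight-line distance: 5.0
manhattan = np.abs(a - b).sum()     # sum of axis-wise distances: 7.0

print(euclidean, manhattan)
```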

K-Means Clustering Algorithm

K-Means Algorithm Description

K-means clustering is an unsupervised algorithm that partitions data into K clusters by assigning each data point to the nearest cluster center. It is suitable for numeric data with compact, roughly round clusters.
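
A minimal NumPy sketch of one K-means run (Lloyd's algorithm) makes the loop concrete; the function and its details are illustrative, not taken from the source:

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Illustrative K-means: assign each point to the nearest center,
    then move each center to the mean of its assigned points.
    Assumes numeric 2-D data and that no cluster ever goes empty."""
    rng = np.random.default_rng(seed)
    # initialize centers as k randomly chosen data points
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # distances of every point to every center, shape (n_points, k)
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)        # nearest-center assignment
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):  # stop once centers settle
            break
        centers = new_centers
    return labels, centers
```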

Determining the Optimal K Value

Choosing K means deciding up front how many groups we want; K-means then produces a flat grouping, with no hierarchy among the clusters.

When K is not specified, we determine the optimal value by trying different values of K and observing the decrease in clustering error (e.g., the Sum of Squared Errors, SSE).

The Elbow Method

As K increases, the error decreases, but after a certain point, the improvement slows down significantly. This slowdown is known as the elbow, and that point is chosen as the optimal K.
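
A sketch of this procedure with scikit-learn (an assumed dependency; the data here are random placeholders) records the SSE, exposed as `inertia_`, for each candidate K:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))                # placeholder data

sse = {}
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    sse[k] = km.inertia_                     # sum of squared errors for this K

# Inspect (or plot) the values and pick the K where the decrease flattens.
for k, err in sse.items():
    print(k, round(err, 1))
```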

Disadvantages of K-Means

The K-means clustering algorithm has several disadvantages:

  • It requires choosing K beforehand.
  • It is sensitive to initialization.
  • It has difficulty handling nested or elongated (long, thin) clusters.
  • It may get stuck in local optima.

Local Optima and Initialization

K-means depends on the random initialization of cluster centers, and poor initial choices can cause the algorithm to converge to a sub-optimal solution instead of the global optimum. Repeated runs with different initial centers help reduce the chance of getting stuck in a poor local optimum.
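
One way to sketch this with scikit-learn (assumed here; the data are placeholders): run K-means several times, each with a single random initialization, and keep the run with the lowest SSE. The `n_init` parameter of `KMeans` automates the same idea.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))                # placeholder data

best = None
for seed in range(10):
    # one run per seed, each with a different random initialization
    km = KMeans(n_clusters=3, init="random", n_init=1, random_state=seed).fit(X)
    if best is None or km.inertia_ < best.inertia_:
        best = km                            # keep the lowest-SSE run
```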

Struggles with Nested Clusters

K-means struggles with nested clusters because it assigns points based on distance to a single center, which cannot properly separate clusters where one cluster is contained inside another.
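
A quick way to see this (scikit-learn assumed; `make_circles` generates two nested rings): K-means with K = 2 typically slices the data in half rather than separating the inner ring from the outer one.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_circles

# Two concentric rings: one cluster nested inside the other.
X, true_labels = make_circles(n_samples=400, factor=0.3, noise=0.05,
                              random_state=0)
pred = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
# Comparing pred with true_labels usually shows poor agreement,
# because no single center can represent a ring-shaped cluster.
```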

Hierarchical Clustering Methods

Agglomerative Clustering (Bottom-Up Approach)

Agglomerative clustering is a bottom-up hierarchical method that starts with each data point as its own cluster and repeatedly merges the closest clusters. This results in a tree-like grouping with depth.
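
A sketch with scikit-learn (an assumed dependency; the three Gaussian blobs are made-up data):

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(0)
# three well-separated blobs as toy data (150 points total)
X = np.vstack([rng.normal(loc=c, size=(50, 2)) for c in (0.0, 5.0, 10.0)])

# start from 150 singleton clusters and merge upward until 3 remain
labels = AgglomerativeClustering(n_clusters=3, linkage="average").fit_predict(X)
```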

Advantages of Hierarchical Clustering

  1. It does not require choosing K beforehand.
  2. It provides a clear hierarchical structure of clusters and subclusters.
  3. It produces a dendrogram for visualization.
  4. It provides a natural notion of distance between clusters.

The Dendrogram

The whole tree produced by hierarchical clustering is called a dendrogram. A dendrogram shows the order in which clusters are merged and the distance (dissimilarity) at which merges occur, allowing the number of clusters to be chosen by cutting the tree.
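
With SciPy (an assumed dependency; the data are placeholders), building the tree and cutting it can be sketched as follows: `linkage` records every merge with the distance at which it happens, and `fcluster` cuts the tree at a chosen height.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, size=(30, 2)) for c in (0.0, 6.0)])

Z = linkage(X, method="average")   # every merge and the distance it occurred at
labels = fcluster(Z, t=3.0, criterion="distance")  # cut the tree at height 3.0
# dendrogram(Z) draws the tree itself (requires matplotlib).
```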

Linkage Criteria

Linkage criteria define the rule for measuring the distance between two clusters, and hence for deciding which groups to merge in hierarchical clustering (a numeric sketch of all four follows the list):

  • Single Linkage: Distance between the closest pair of points in two clusters.
  • Complete Linkage: Distance between the farthest pair of points.
  • Average Linkage: Average distance between all pairs of points (overall similarity).
  • Centroid Linkage: Distance between cluster centroids (comparing group centers).
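
The four rules can be computed directly for two tiny clusters (NumPy only; the coordinates are made up):

```python
import numpy as np

A = np.array([[0.0, 0.0], [2.0, 0.0]])      # cluster A
B = np.array([[0.0, 3.0], [2.0, 5.0]])      # cluster B

# distances between every point in A and every point in B
pair = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)

single   = pair.min()                                       # closest pair
complete = pair.max()                                       # farthest pair
average  = pair.mean()                                      # mean over all pairs
centroid = np.linalg.norm(A.mean(axis=0) - B.mean(axis=0))  # distance between centers

print(single, complete, average, centroid)
```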

Centroids vs. Center Points

The distinction between these terms is important, especially when dealing with different data types; a small sketch contrasting the two follows the list:

  • Centroid: The average position of all points in a cluster, well-defined for numeric data.
  • Center Point (Medoid): The most representative actual data point in the cluster, chosen via a distance function (e.g., the point with the smallest total distance to all other members); useful when an average is not meaningful, such as for non-numeric data.
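
A small NumPy sketch contrasting the two (the cluster points are made up):

```python
import numpy as np

cluster = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]])

# centroid: the mean position, which need not be an actual data point
centroid = cluster.mean(axis=0)                  # (1.5, 1.5)

# medoid: the member with the smallest total distance to all other members
pairwise = np.linalg.norm(cluster[:, None, :] - cluster[None, :, :], axis=2)
medoid = cluster[pairwise.sum(axis=1).argmin()]  # an actual data point

print(centroid, medoid)
```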