Unsupervised Learning: Transformations and Clustering Methods

Unsupervised Learning Transformations and Clustering

1. Purpose of Unsupervised Transformations

Unsupervised transformations create a new representation of the data that is easier to interpret or use than the original raw data.

2. Application of Dimensionality Reduction

Dimensionality reduction simplifies high-dimensional data while preserving the most important information.

3. Role of Clustering Algorithms

Clustering algorithms group data into clusters based on similarity, revealing hidden structure in the dataset.

4. Challenge in Evaluating Unsupervised Learning

The main challenge is the lack of labeled data, which makes it difficult to measure correctness or accuracy.

5. Importance of Feature Scaling

Feature scaling ensures that all features contribute equally to distance-based algorithms and improves results.
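
A minimal scikit-learn sketch of scaling before a distance-based method (the breast cancer dataset here is just a stand-in for any numeric feature matrix):

    from sklearn.datasets import load_breast_cancer
    from sklearn.preprocessing import StandardScaler

    # Stand-in dataset; any numeric feature matrix works the same way.
    X, _ = load_breast_cancer(return_X_y=True)

    # StandardScaler shifts each feature to mean 0 and unit variance, so no
    # single feature dominates the Euclidean distances used by k-Means, DBSCAN, etc.
    X_scaled = StandardScaler().fit_transform(X)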

Principal Component Analysis (PCA)

6. What is PCA?

PCA is an unsupervised dimensionality reduction technique that transforms data into uncorrelated components ordered by explained variance.

7. How PCA Aids Dimensionality Reduction

PCA reduces dimensionality by keeping only the components with the highest variance.
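
A minimal sketch of PCA for dimensionality reduction (scikit-learn; the dataset and the choice of two components are illustrative):

    from sklearn.datasets import load_breast_cancer
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    X, _ = load_breast_cancer(return_X_y=True)
    X_scaled = StandardScaler().fit_transform(X)  # PCA is sensitive to feature scale

    # Keep only the two components with the highest variance.
    pca = PCA(n_components=2)
    X_pca = pca.fit_transform(X_scaled)

    print(X_pca.shape)                    # (n_samples, 2)
    print(pca.explained_variance_ratio_)  # fraction of variance kept per component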

8. Limitations of PCA

PCA is a linear method that only considers variance: it cannot capture non-linear structure in the data and, being unsupervised, ignores any relationship with a target variable.

9. Feature Extraction with Images

Feature extraction converts raw pixels into a compact and meaningful representation of the image.

Clustering Challenges and Methods

10. Major Challenge in Clustering

The major challenge is determining the optimal number of clusters, since unlabeled data give no direct indication of the true grouping.

11. The Elbow Method

The Elbow method plots a clustering criterion, typically the within-cluster sum of squared distances (inertia), against the number of clusters and selects the value where the improvement starts to level off, forming an "elbow" in the curve.
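
A sketch of the Elbow method using k-Means inertia on synthetic blobs (the range of k and the data are illustrative):

    import matplotlib.pyplot as plt
    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs

    X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

    # Inertia = within-cluster sum of squared distances; it always decreases
    # with k, so we look for the point where the decrease levels off.
    ks = range(1, 11)
    inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

    plt.plot(list(ks), inertias, marker="o")
    plt.xlabel("number of clusters k")
    plt.ylabel("inertia")
    plt.show()  # the "elbow" suggests a reasonable k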

12. Silhouette Analysis

Silhouette Analysis measures clustering quality with a score between −1 and 1: values near 1 mean points lie well inside their own cluster and far from neighboring clusters, values near 0 indicate overlapping clusters, and negative values suggest misassigned points.
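
A sketch of silhouette scoring for a k-Means clustering (synthetic data; parameter values are illustrative):

    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs
    from sklearn.metrics import silhouette_score

    X, _ = make_blobs(n_samples=300, centers=4, random_state=0)
    labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

    # Mean silhouette coefficient over all points: near +1 means points sit
    # well inside their own cluster, near 0 means overlapping clusters, and
    # negative values suggest misassigned points.
    print(silhouette_score(X, labels))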

k-Means Clustering

13. How k-Means Works

k-Means iteratively assigns points to the nearest centroid and updates centroids until convergence.
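
A minimal k-Means sketch (scikit-learn, synthetic blobs; parameter values are illustrative):

    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs

    X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

    # Alternate between assigning points to the nearest centroid and moving each
    # centroid to the mean of its assigned points, until assignments stop changing.
    km = KMeans(n_clusters=3, n_init=10, random_state=42)
    labels = km.fit_predict(X)

    print(km.cluster_centers_)  # final centroid positions
    print(km.predict(X[:5]))    # unlike agglomerative clustering or DBSCAN, k-Means can label new points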

14. Shapes Captured by k-Means

k-Means can only capture convex, roughly spherical cluster shapes.

15. Boundary Determination in k-Means

Each point is assigned to its nearest centroid, so the boundary between two clusters lies exactly halfway between their centroids.

16. k-Means Relation to Decomposition (PCA)

Like PCA, k-Means can be viewed as a decomposition method in which each data point is represented by a single component: its nearest cluster center. This view is known as vector quantization.
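
A sketch of the vector-quantization view: every point is replaced by its nearest cluster center, so the dataset is described by a small codebook of centroids (synthetic data; the choice of 10 centers is arbitrary):

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs

    X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
    km = KMeans(n_clusters=10, n_init=10, random_state=0).fit(X)

    # Each point is represented by a single component: its cluster center.
    X_reconstructed = km.cluster_centers_[km.labels_]
    print(np.mean((X - X_reconstructed) ** 2))  # average reconstruction error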

17. Strengths and Weaknesses of k-Means
  • Strengths: Simple, fast, scalable.
  • Weaknesses: Requires k, sensitive to initialization, limited to convex clusters.

Hierarchical Clustering

18. Principle of Agglomerative Clustering

Each point starts as its own cluster, and the most similar clusters are iteratively merged.

19. Predictions with Agglomerative Clustering

No, agglomerative clustering cannot make predictions for new data points; it only produces labels for the data it was fitted on.
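
A minimal agglomerative clustering sketch (scikit-learn, synthetic blobs); note the estimator offers fit_predict but no predict, which is the limitation described above:

    from sklearn.cluster import AgglomerativeClustering
    from sklearn.datasets import make_blobs

    X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

    # Start with every point as its own cluster, then repeatedly merge the
    # two most similar clusters until only n_clusters remain.
    agg = AgglomerativeClustering(n_clusters=3)
    labels = agg.fit_predict(X)

    # There is no agg.predict(): the merge history applies only to the fitted
    # data, so new points would require refitting from scratch.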

20. What is Hierarchical Clustering?

Hierarchical clustering is an iterative process in which clusters are merged step by step, producing a hierarchy of nested clusterings rather than a single partition.

21. Visualization Tool for Hierarchical Clustering

A dendrogram visualizes hierarchical clustering by showing the order and distance of every merge, and it works even for multidimensional datasets that cannot be plotted directly.
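
A dendrogram sketch using SciPy's ward linkage on a small synthetic dataset (the data and linkage choice are illustrative):

    import matplotlib.pyplot as plt
    from scipy.cluster.hierarchy import dendrogram, ward
    from sklearn.datasets import make_blobs

    X, _ = make_blobs(n_samples=30, centers=3, random_state=0)

    # ward() returns a linkage matrix recording which clusters were merged
    # and at what distance; dendrogram() draws that merge hierarchy.
    linkage = ward(X)
    dendrogram(linkage)
    plt.xlabel("sample index")
    plt.ylabel("cluster distance")
    plt.show()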

22. Performance on Globular Datasets

Both k-Means and agglomerative clustering perform well on globular (convex) datasets.

23. Core Limitation of KMeans and Agglomerative Clustering

They assume clusters are convex and compact, and they struggle with complex, non-convex shapes such as two interleaved half-moons.

24. Definition of ‘Convex’ Shape

A shape is convex if the line segment between any two of its points lies entirely inside the shape.

Density-Based Spatial Clustering (DBSCAN)

25. Advantage of DBSCAN over k-Means/Agglomerative

DBSCAN does not require the number of clusters to be specified in advance, can identify noise points, and can find clusters of complex (non-convex) shape.

26. Central Tenet of DBSCAN

Clusters are dense regions separated by sparse areas.

27. Determining Core Points in DBSCAN

A point is a core point if at least min_samples points (including the point itself) lie within distance eps of it.

28. DBSCAN Operation

DBSCAN expands clusters from dense core points and labels sparse points as noise.
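
A DBSCAN sketch on the two-moons dataset, a non-convex shape where k-Means fails (the dataset and the default eps/min_samples values are illustrative and usually need tuning):

    from sklearn.cluster import DBSCAN
    from sklearn.datasets import make_moons
    from sklearn.preprocessing import StandardScaler

    X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)
    X_scaled = StandardScaler().fit_transform(X)

    # A core point has at least min_samples points within distance eps; clusters
    # grow outward from core points, and anything not reachable from a core
    # point is labeled -1 (noise). Defaults: eps=0.5, min_samples=5.
    db = DBSCAN()
    labels = db.fit_predict(X_scaled)

    print(set(labels))                   # cluster labels; -1 marks noise
    print(len(db.core_sample_indices_))  # number of core points found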

29. Point Categories in DBSCAN Conclusion

The three categories of points determined are: core points, boundary points, and noise points.

30. Effect of Repeated DBSCAN Application

Core points and noise points remain stable across runs, while boundary points may be assigned to different clusters depending on the order in which points are visited.

31. Predictions with DBSCAN

No, DBSCAN cannot make predictions for new, unseen data; like agglomerative clustering, it only labels the data it was fitted on.

Evaluating Clustering Algorithms

32. Difference Between External and Internal Indices

External indices use true labels, while internal indices rely only on data structure.

33. Adjusted Rand Index (ARI)

ARI measures agreement between clustering results and true labels, correcting for chance.

34. Normalized Mutual Information (NMI)

NMI measures shared information between predicted clusters and true labels.
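
A sketch of both external indices; synthetic blobs provide the ground-truth labels that real-world data usually lack (see point 35):

    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs
    from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

    X, y_true = make_blobs(n_samples=300, centers=3, random_state=0)
    y_pred = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

    # Both scores compare the predicted clustering with the true labels and are
    # invariant to the arbitrary numbering of the cluster labels.
    print(adjusted_rand_score(y_true, y_pred))            # 1.0 = perfect, ~0 = random
    print(normalized_mutual_info_score(y_true, y_pred))   # between 0 and 1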

35. Limitation of ARI in Real-World Applications

True labels are usually unavailable in real-world data.

36. Importance of Manual Analysis

High scores do not guarantee meaningful clusters, so human interpretation is necessary.