Unsupervised Learning: Transformations and Clustering Methods
1. Purpose of Unsupervised Transformations
Unsupervised transformations create a new representation of the data that is easier to interpret or use than the original raw data.
2. Application of Dimensionality Reduction
Dimensionality reduction simplifies high-dimensional data while preserving the most important information.
3. Role of Clustering Algorithms
Clustering algorithms group data into clusters based on similarity, revealing hidden structure in the dataset.
4. Challenge in Evaluating Unsupervised Learning
The main challenge is the lack of labeled data, which makes it difficult to measure correctness or accuracy.
5. Importance of Feature Scaling
Feature scaling puts all features on a comparable range so that no single feature dominates the distance computations used by many clustering and dimensionality-reduction algorithms, which typically improves results.
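A minimal sketch of scaling before a distance-based algorithm, assuming scikit-learn's StandardScaler and KMeans on the built-in iris data:

```python
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X, _ = load_iris(return_X_y=True)

# Rescale each feature to zero mean and unit variance so that no single
# feature dominates the Euclidean distances used by k-means.
X_scaled = StandardScaler().fit_transform(X)

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_scaled)
```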
Principal Component Analysis (PCA)
6. What is PCA?
PCA is an unsupervised dimensionality reduction technique that transforms data into uncorrelated components ordered by explained variance.
7. How PCA Aids Dimensionality Reduction
PCA reduces dimensionality by keeping only the components with the highest variance.
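A minimal sketch of PCA for dimensionality reduction, assuming scikit-learn and its built-in breast cancer dataset:

```python
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_breast_cancer

X, _ = load_breast_cancer(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

# Keep only the first two principal components, the directions of highest variance.
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

print(X_pca.shape)                    # (569, 2): 30 features reduced to 2
print(pca.explained_variance_ratio_)  # share of total variance kept by each component
```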
8. Limitations of PCA
PCA is a linear method and only considers variance, ignoring non-linear structure and output relationships.
9. Feature Extraction with Images
Feature extraction converts raw pixels into a compact and meaningful representation of the image.
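A minimal sketch of PCA used as feature extraction on small images, assuming scikit-learn's built-in 8x8 digit images:

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

# 8x8 digit images flattened into 64 raw pixel features.
X, _ = load_digits(return_X_y=True)

# PCA as feature extraction: represent each image by 16 component scores
# instead of 64 correlated pixel intensities.
pca = PCA(n_components=16, whiten=True, random_state=0)
X_features = pca.fit_transform(X)
print(X_features.shape)  # (1797, 16)
```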
Clustering Challenges and Methods
10. Major Challenge in Clustering
A major challenge is determining the right number of clusters, since the data themselves give no ground truth for how many groups exist.
11. The Elbow Method
The Elbow method plots a measure of clustering quality (typically the within-cluster sum of squares, or inertia) against the number of clusters and selects the k at which further increases stop producing large improvements, i.e. the "elbow" of the curve.
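A minimal sketch of the Elbow method, assuming scikit-learn's KMeans on a synthetic make_blobs dataset, with matplotlib for the plot:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

# Inertia (within-cluster sum of squares) for a range of k values;
# the "elbow" is where adding clusters stops paying off.
ks = range(1, 10)
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

plt.plot(list(ks), inertias, marker="o")
plt.xlabel("number of clusters k")
plt.ylabel("inertia")
plt.show()
```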
12. Silhouette Analysis
Silhouette Analysis measures how similar each point is to its own cluster compared with the nearest other cluster, yielding a score between −1 and 1; values near 1 indicate compact, well-separated clusters.
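A minimal sketch of silhouette analysis, assuming scikit-learn's silhouette_score with KMeans on synthetic data:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

# Average silhouette score for several candidate k values;
# the k with the highest score is usually the best candidate.
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, silhouette_score(X, labels))
```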
k-Means Clustering
13. How k-Means Works
k-Means iteratively assigns points to the nearest centroid and updates centroids until convergence.
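A minimal NumPy sketch of the assign-and-update loop (an illustration only, not a production implementation; empty clusters are not handled):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal k-means: alternate assignment and centroid-update steps."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each point joins the cluster of its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid moves to the mean of its assigned points.
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):
            break  # converged: centroids no longer move
        centers = new_centers
    return centers, labels

# Toy usage on two well-separated blobs.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
centers, labels = kmeans(X, k=2)
```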
14. Shapes Captured by k-Means
k-Means can only capture relatively simple, convex (roughly spherical) cluster shapes.
15. Boundary Determination in k-Means
Every point is assigned to its nearest centroid, so the boundary between two clusters lies halfway between their centroids; the centroids induce a Voronoi partition of the space.
16. k-Means Relation to Decomposition (PCA)
Like PCA, k-Means can be viewed as a decomposition method in which each data point is represented by a single component, its cluster center; this view is known as vector quantization.
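A minimal sketch of the vector-quantization view, assuming scikit-learn: with many clusters, each point is "reconstructed" as the single cluster center it is assigned to:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=200, noise=0.05, random_state=0)

# Use more clusters than there are "true" groups so that the centers
# act as a codebook covering the shape of the data.
km = KMeans(n_clusters=10, n_init=10, random_state=0).fit(X)

# Each point is replaced by its assigned cluster center.
X_reconstructed = km.cluster_centers_[km.labels_]
print(X_reconstructed.shape)  # (200, 2): one center per point
```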
17. Strengths and Weaknesses of k-Means
- Strengths: Simple, fast, scalable.
- Weaknesses: Requires k, sensitive to initialization, limited to convex clusters.
Hierarchical Clustering
18. Principle of Agglomerative Clustering
Each point starts as its own cluster, and the most similar clusters are iteratively merged.
19. Predictions with Agglomerative Clustering
No; agglomerative clustering cannot assign new, unseen points to existing clusters, so including them requires re-clustering the whole dataset.
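A minimal sketch, assuming scikit-learn's AgglomerativeClustering; note that the estimator has no predict method, only fit_predict on the full dataset:

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# Each point starts as its own cluster; the two most similar clusters are
# merged repeatedly until only n_clusters remain.
agg = AgglomerativeClustering(n_clusters=3)
labels = agg.fit_predict(X)
```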
20. What is Hierarchical Clustering?
Hierarchical clustering refers to the family of methods (agglomerative clustering among them) that repeatedly merge clusters, producing a nested hierarchy of partitions from every point in its own cluster up to a single all-encompassing cluster.
21. Visualization Tool for Hierarchical Clustering
A dendrogram visualizes the full merge hierarchy, including the distance at which each merge occurred, and works even for multidimensional datasets.
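A minimal sketch of a dendrogram, assuming SciPy's ward linkage and dendrogram functions on a small synthetic dataset:

```python
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, ward
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=30, centers=3, random_state=0)

# ward() returns the linkage matrix recording which clusters were merged
# and at what distance; dendrogram() draws the resulting hierarchy.
linkage = ward(X)
dendrogram(linkage)
plt.xlabel("sample index")
plt.ylabel("cluster distance")
plt.show()
```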
22. Performance on Globular Datasets
Both KMeans and Agglomerative Clustering perform well on globular or convex datasets.
23. Core Limitation of KMeans and Agglomerative Clustering
They assume clusters are convex and struggle with complex shapes.
24. Definition of ‘Convex’ Shape
A shape is convex if the line between any two points stays inside the shape.
Density-Based Spatial Clustering (DBSCAN)
25. Advantage of DBSCAN over k-Means/Agglomerative
DBSCAN does not require the number of clusters and can detect noise and complex shapes.
26. Central Tenet of DBSCAN
Clusters are dense regions separated by sparse areas.
27. Determining Core Points in DBSCAN
A point is a core point if it has at least min_samples neighbors within eps.
28. DBSCAN Operation
DBSCAN expands clusters from dense core points and labels sparse points as noise.
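A minimal DBSCAN sketch, assuming scikit-learn and the two-moons dataset; the eps and min_samples values are illustrative:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaved half-moons: non-convex clusters that k-means cannot separate.
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

# A point with at least min_samples neighbors within distance eps is a core
# point; clusters grow outward from core points, and points reachable from no
# core point are labeled -1 (noise).
labels = DBSCAN(eps=0.25, min_samples=5).fit_predict(X)
print(np.unique(labels))  # typically the two moons, plus -1 for any noise points
```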
29. Point Categories in DBSCAN Conclusion
The three categories of points determined are: core points, boundary points, and noise points.
30. Effect of Repeated DBSCAN Application
Core points and noise remain stable, while boundary points may vary slightly.
31. Predictions with DBSCAN
No, DBSCAN cannot make predictions on new, unseen data.
Evaluating Clustering Algorithms
32. Difference Between External and Internal Indices
External indices use true labels, while internal indices rely only on data structure.
33. Adjusted Rand Index (ARI)
ARI measures agreement between clustering results and true labels, correcting for chance.
34. Normalized Mutual Information (NMI)
NMI measures shared information between predicted clusters and true labels.
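A minimal sketch of both indices, assuming scikit-learn's metrics module and synthetic data with known labels:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

X, y_true = make_blobs(n_samples=300, centers=3, random_state=0)
y_pred = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Both scores compare the predicted clusters with the true labels and are
# invariant to permutations of the cluster ids; 1.0 means perfect agreement.
print("ARI:", adjusted_rand_score(y_true, y_pred))
print("NMI:", normalized_mutual_info_score(y_true, y_pred))
```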
35. Limitation of ARI in Real-World Applications
True labels are usually unavailable in real-world data.
36. Importance of Manual Analysis
High scores do not guarantee meaningful clusters, so human interpretation is necessary.
