Machine Learning Fundamentals: Data Analysis & Clustering Techniques
Core Machine Learning Concepts
Linear Regression
Linear Regression: Finds the best line that summarizes the relationship between two variables. Imagine a scatter plot of data points and a line representing their trend.
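For illustration, here is a minimal scikit-learn sketch that fits such a trend line to invented numbers (the data and variable names are placeholders, not from a real dataset):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Invented data: one explanatory variable (x) and one response (y)
X = np.array([[1], [2], [3], [4], [5]])   # e.g., years of experience
y = np.array([30, 35, 42, 48, 55])        # e.g., salary in $1000s

# Fit the line y = slope * x + intercept that best summarizes the trend
model = LinearRegression().fit(X, y)
print("slope:", model.coef_[0], "intercept:", model.intercept_)
print("prediction for x = 6:", model.predict([[6]])[0])
```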
Dimensionality Reduction
Dimensionality Reduction: Addresses datasets with a large number of variables, whose dispersion (covariance) matrix becomes large and hard to work with. It reduces the number of variables to a more manageable few while keeping most of the information.
Why Dimensionality Reduction?
- Simpler Analysis: Fewer features make it easier to find patterns.
- Faster Processing: Essential for applications like high-frequency trading.
- Better Visualization: Easier to visualize data with 2-3 features compared to 100.
Correlation Analysis
Correlation Analysis: Measures the strength and direction of the relationship between two variables, ranging from -1 to 1.
- If one variable (X) increases and the other (Y) also increases, the correlation (r) is positive.
- If r is closer to 1, X and Y are positively correlated.
- If r is closer to -1, X and Y are negatively correlated.
- If r is closer to 0, there is a weak or no linear relationship between X and Y.
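A quick NumPy sketch of computing r for two toy variables (the values below are invented for illustration):

```python
import numpy as np

# Invented data: y tends to increase with x
x = np.array([1, 2, 3, 4, 5, 6])
y = np.array([2, 4, 5, 4, 6, 7])

# Pearson correlation coefficient r, always between -1 and 1
r = np.corrcoef(x, y)[0, 1]
print(f"r = {r:.2f}")   # close to +1 here, i.e. a strong positive linear relationship
```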
Correlation for Feature Reduction
Correlation can help reduce dimensionality by identifying redundant features. Imagine you have many features (X1, X2, …, Xn) and want to reduce them to a smaller set while retaining important information.
Steps to Reduce Features Using Correlation:
- Find Correlation Between Features: Compare each feature pair (e.g., X1 & X2, X1 & X3) to check if they are highly correlated.
- Compare Correlation With Target Variable (Y): If two features are highly correlated with each other, check which one has a stronger relationship with the target variable (Y). Keep the feature with the stronger correlation to Y and remove the weaker one (see the sketch below).
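A minimal pandas sketch of these two steps, assuming invented columns X1, X2, X3, a target Y, and an arbitrary 0.9 cutoff for "highly correlated":

```python
import numpy as np
import pandas as pd

# Invented dataset: X2 is nearly a copy of X1, X3 is unrelated, Y is the target
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
df = pd.DataFrame({
    "X1": x1,
    "X2": x1 + 0.05 * rng.normal(size=200),   # redundant with X1
    "X3": rng.normal(size=200),
    "Y": 2 * x1 + rng.normal(size=200),
})

features = ["X1", "X2", "X3"]
corr = df.corr()
threshold = 0.9   # assumed cutoff for "highly correlated"

to_drop = set()
for i, a in enumerate(features):
    for b in features[i + 1:]:
        if abs(corr.loc[a, b]) > threshold:
            # Keep whichever of the pair is more strongly correlated with Y
            weaker = a if abs(corr.loc[a, "Y"]) < abs(corr.loc[b, "Y"]) else b
            to_drop.add(weaker)

print("dropping:", to_drop)   # the weaker of the redundant pair X1/X2
reduced = df[features].drop(columns=list(to_drop))
```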
Probable Error (PEr)
A limitation of correlation is that it doesn’t inherently consider the size of the dataset.
- A correlation of 0.5 in 20 samples is statistically less significant than 0.5 in 10,000 samples, even though the coefficient is the same.
- Probable Error (PEr) helps adjust the interpretation of correlation strength based on sample size.
PEr Formula and Interpretation
The formula for Probable Error (PEr) is:
PEr = 0.6745 × (1 − r²) / √n
Where:
- r = correlation coefficient
- n = number of data points
Interpretation of correlation strength using PEr:
- If r > 6 × PEr → the correlation is strong (significant)
- If r < PEr → the correlation is not significant
Example: If you have a large dataset, even a small correlation (e.g., 0.1) could be statistically meaningful due to the large sample size.
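A small Python sketch of this rule, reusing the sample sizes mentioned above (the helper function and threshold logic are just one way to code it):

```python
import math

def probable_error(r: float, n: int) -> float:
    """Probable error of a correlation coefficient r computed from n data points."""
    return 0.6745 * (1 - r ** 2) / math.sqrt(n)

for r, n in [(0.5, 20), (0.5, 10_000), (0.1, 10_000)]:
    per = probable_error(r, n)
    if r > 6 * per:
        verdict = "strong / significant correlation"
    elif r < per:
        verdict = "not significant"
    else:
        verdict = "inconclusive"
    print(f"r={r}, n={n}: PEr={per:.4f} -> {verdict}")
```

Running this shows that r = 0.5 with only 20 samples is inconclusive, while the same r with 10,000 samples, and even r = 0.1 with 10,000 samples, clears the 6 × PEr threshold.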
Principal Component Analysis (PCA)
PCA is a powerful technique used to reduce the number of variables (dimensions) in a dataset while preserving the most important information.
Why PCA is Essential
- Too many variables (features) can make data analysis complex and slow.
- Some features may be redundant, containing similar information.
- PCA helps identify a smaller set of “principal components” that capture most of the data’s variance.
How PCA Works
- Collect Data: Start with a dataset containing multiple variables (e.g., height, weight, age, income).
- Create a Covariance Matrix: This matrix illustrates how variables in your dataset are related to each other.
- Find Eigenvalues & Eigenvectors:
- Eigenvalues: Indicate how much information (variance) each new feature (Principal Component) captures.
- Eigenvectors: Represent the directions of these new features in the data space.
- Choose Principal Components:
- Select only the most important components (those with the largest eigenvalues).
- These new features are linear combinations of the original variables but effectively capture the most significant patterns in the data (see the sketch below).
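A minimal NumPy sketch of these four steps on made-up height/weight data (in practice a library routine such as sklearn.decomposition.PCA is normally used instead):

```python
import numpy as np

# 1. Collect data: invented height (cm) and weight (kg) for a few people
X = np.array([
    [160, 55], [165, 60], [170, 68], [175, 72], [180, 80], [185, 85],
], dtype=float)

# 2. Centre the data and build the covariance matrix
Xc = X - X.mean(axis=0)
cov = np.cov(Xc, rowvar=False)

# 3. Eigen-decomposition: eigenvalues = variance captured, eigenvectors = directions
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]            # sort components by captured variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# 4. Keep the top component(s) and project the data onto them
print("variance explained per component:", eigvals / eigvals.sum())
X_reduced = Xc @ eigvecs[:, :1]              # new 1-D feature: the first principal component
print(X_reduced.ravel())
```

How many components to keep usually comes down to how much of the total variance (the explained ratio printed above) you want to preserve.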
The Covariance Matrix in PCA
A covariance matrix is a table that quantifies the relationships between different variables in your dataset.
- Example: Consider a dataset with Height and Weight of people.
- If taller people tend to be heavier, Height and Weight have a positive covariance (they increase together).
- If one variable increases while the other decreases, the covariance is negative.
- If variables are completely unrelated, their covariance is close to zero.
- A covariance matrix stores these relationships for all variable pairs in a dataset.
- Example Covariance Matrix:
[ Cov(Height, Height) Cov(Height, Weight) ]
[ Cov(Weight, Height) Cov(Weight, Weight) ]
Understanding the covariance matrix helps PCA determine which variables contain similar information, allowing it to merge them into fewer, more meaningful variables.
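For example, NumPy’s np.cov builds exactly this table from raw Height and Weight columns (the numbers below are invented):

```python
import numpy as np

height = np.array([160, 165, 170, 175, 180, 185])   # cm (invented)
weight = np.array([ 55,  60,  68,  72,  80,  85])   # kg (invented)

# 2x2 covariance matrix:
# [[Cov(Height, Height), Cov(Height, Weight)],
#  [Cov(Weight, Height), Cov(Weight, Weight)]]
print(np.cov(height, weight))   # positive off-diagonal: taller people tend to be heavier
```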
Key PCA Concepts for Study
- Understanding the difference between Eigenvectors and Eigenvalues in the context of the covariance matrix.
- Visualizing these concepts, especially in a three-dimensional graph.
- Grasping the concept of Principal Components and their role in data transformation.
- How to select the optimal number of dimensions for the new dataset.
Week 3: Classification Algorithms
Naive Bayes Classifier
Introduction to Naive Bayes
- The Naive Bayes classifier is a probabilistic machine learning model used for classification tasks.
- It is based on Bayes’ Theorem and assumes that the features (predictors) are conditionally independent given the class label.
- This “naive” assumption simplifies computations, making the algorithm fast and efficient, even for large datasets.
Multinomial Naive Bayes
- Used for discrete data, such as word counts in text classification.
- Assumes features represent the frequency of events (e.g., word occurrences).
- Example Applications: Spam detection, document classification.
Multinomial Naive Bayes Example: Spam Detection
Let’s classify emails as spam or not spam using Multinomial Naive Bayes.
- Training Data – Normal Emails:
- Collect all words from normal (non-spam) emails and count their occurrences (e.g., using a histogram).
- Calculate the probability of seeing each word, given that it’s a normal email. For example, if “free” occurred 5 times in 100 total words in normal emails:
- P(free|normal) = 5/100 = 0.05
- Similarly, if “offer” occurred 8 times in 100 total words in normal emails:
- P(offer|normal) = 8/100 = 0.08
- Training Data – Spam Emails:
- Collect all words from spam emails and count their occurrences.
- Calculate the probability of seeing each word, given that it’s a spam email. For example, if “free” occurred 20 times in 50 total words in spam emails:
- P(free|spam) = 20/50 = 0.4
- Similarly, if “offer” occurred 40 times in 50 total words in spam emails:
- P(offer|spam) = 40/50 = 0.8
- Classifying a New Email:
Imagine you receive a new email containing only the word “offer!”. We want to determine if it’s normal or spam.
- Prior Probability for Normal Email P(Normal):
Estimate the initial probability that any email is normal, based on training data. If 100 out of 150 emails are normal:
- P(Normal) = 100 / (100 + 50) ≈ 0.66
This initial guess is called the prior probability.
- Score for “offer!” in Normal Class:
Multiply the prior probability by the likelihood of “offer” in a normal email:
- P(Normal) × P(offer|normal) = 0.66 × 0.08 ≈ 0.053
This 0.053 is the score for “offer!” belonging to the normal email class.
- Prior Probability for Spam Email P(Spam):
Estimate the initial probability that any email is spam. If 50 out of 150 emails are spam:
- P(Spam) = 50 / (100 + 50) ≈ 0.33
- Score for “offer!” in Spam Class:
Multiply the prior probability by the likelihood of “offer” in a spam email:
- P(Spam) × P(offer|spam) = 0.33 × 0.8 ≈ 0.26
This 0.26 is the score for “offer!” belonging to the spam class.
- Decision:
Since the score for “offer!” belonging to spam (0.26) is greater than the score for it belonging to normal email (0.053), the email “offer!” is classified as spam.
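The whole hand calculation can be reproduced in a few lines of plain Python, using the word counts and email counts assumed in the example:

```python
# Word counts assumed in the example above
normal_counts, normal_total = {"free": 5, "offer": 8}, 100
spam_counts, spam_total = {"free": 20, "offer": 40}, 50

# Prior probabilities from 100 normal and 50 spam training emails
p_normal, p_spam = 100 / 150, 50 / 150

def score(words, prior, counts, total):
    """Prior multiplied by the likelihood of each word (the Naive Bayes score)."""
    s = prior
    for w in words:
        s *= counts[w] / total
    return s

email = ["offer"]
normal_score = score(email, p_normal, normal_counts, normal_total)   # ≈ 0.053
spam_score = score(email, p_spam, spam_counts, spam_total)           # ≈ 0.27 (0.26 with the rounded priors above)
print("classified as:", "spam" if spam_score > normal_score else "normal")
```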
The “Naive” Assumption of Naive Bayes
- Naive Bayes treats all word orders as the same. For example, the score for “free offer” is identical to “offer free”.
- P(Normal) × P(free|normal) × P(offer|normal) = P(Normal) × P(offer|normal) × P(free|normal)
- This means Naive Bayes ignores grammar and language rules, assuming conditional independence between features (words).
- By ignoring relationships between words, Naive Bayes exhibits high bias. However, it often performs well in practice due to its low variance.
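In practice the counting and multiplication are delegated to a library. A hedged scikit-learn sketch of the same idea, with a tiny invented training set (the emails and labels are made up for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny invented training set
emails = ["meeting notes attached", "lunch plans tomorrow",
          "free offer click now", "free money offer"]
labels = ["normal", "normal", "spam", "spam"]

# Bag-of-words counts ignore word order, matching the "naive" assumption
clf = make_pipeline(CountVectorizer(), MultinomialNB())
clf.fit(emails, labels)

print(clf.predict(["free offer"]))
print(clf.predict(["offer free"]))   # same counts, so the same prediction as above
```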
Week 4: Clustering Techniques
K-Means Clustering
K-Means Definition
- K-Means is a partitioning method that divides data into k distinct clusters.
- Clustering is based on the distance of data points to the centroid (mean) of a cluster.
- It requires the number of clusters (k) to be specified in advance.
K-Means Clustering Process
- Initialization: Choose k initial centroids (randomly or using heuristics).
- Assignment Step: Assign each data point to the nearest centroid.
- Update Step: Calculate the new centroids as the mean of all data points assigned to each cluster.
- Repeat: Continue the assignment and update steps until convergence (when centroids no longer change significantly).
Imagine a graph with three clusters of points, each with a centroid at its center.
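A minimal scikit-learn sketch of this loop (the blob data and the choice k = 3 are invented for illustration):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with three well-separated groups
X, _ = make_blobs(n_samples=150, centers=3, random_state=42)

# k must be chosen in advance; fit() runs the assign/update loop until convergence
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)

print("centroids:\n", kmeans.cluster_centers_)
print("first 10 cluster labels:", kmeans.labels_[:10])
```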
Hierarchical Clustering
Hierarchical Clustering Definition
- A method of grouping data points into clusters in a step-by-step manner.
- It creates a hierarchy or “family tree” of clusters, where similar data points are progressively grouped together.
Agglomerative (Bottom-Up) Hierarchical Clustering
This approach starts with individual data points and merges them into larger clusters.
- Start: Each data point begins as its own individual cluster.
- Find the Closest Clusters: Measure the similarity (or distance) between all existing clusters.
- Merge: Combine the two most similar (closest) clusters into one.
- Repeat: Continue merging clusters until only one large cluster remains, or until a predefined number of clusters is reached (see the sketch below).
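A short SciPy sketch of this bottom-up merging on invented 2-D points: linkage repeatedly merges the closest clusters, and fcluster cuts the resulting tree at a chosen number of clusters.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Invented 2-D points forming two loose groups
X = np.array([[1.0, 1.0], [1.2, 1.1], [0.9, 0.8],
              [5.0, 5.0], [5.1, 4.9], [4.8, 5.2]])

# Agglomerative (bottom-up) merging of the closest clusters
Z = linkage(X, method="average")

# Cut the resulting tree to obtain, say, 2 clusters
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)   # e.g. [1 1 1 2 2 2]
```

Cutting the tree at different levels yields different numbers of clusters, which is the flexibility contrasted with K-Means in the next subsection.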
Comparing K-Means and Hierarchical Clustering
A key difference between K-Means and Hierarchical Clustering lies in their approach: K-Means requires a pre-defined number of clusters, while Hierarchical Clustering builds a tree-like structure that can be cut at different levels to yield varying numbers of clusters.
DBSCAN Clustering
DBSCAN Explanation
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a clustering algorithm that groups points in dense regions together while identifying outliers (noise).
- It does not require specifying the number of clusters in advance (unlike K-Means).
- It can discover clusters of arbitrary shapes (not limited to circular clusters like K-Means).
- It effectively detects outliers (points that do not fit into any cluster).
DBSCAN Point Types:
- Core Points: Have at least min_samples points within a specified distance (eps).
- Boundary Points: Are not core points themselves, but are within eps distance of at least one core point.
- Noise Points: Are not close enough to any cluster (often assigned a label of -1).
DBSCAN Example Walkthrough
Let’s illustrate DBSCAN with a given dataset of points: A, B, C, D, E, F, G, H, I. Assume specific neighborhood relationships exist within distance eps (ε).
- Step 1: Identify Core Points (Implicit in the example, but this is the first conceptual step).
- Step 2: Start a Cluster from Core Points
If B is a core point, it forms a new cluster (Cluster 1). Points directly reachable from B (A, C, G) are added to this cluster. Current Cluster 1: {B, A, C, G}.
- Step 3: Expand the Cluster
Check if any new members (A, C, G) are also core points to further expand the cluster.
- C and A are not core points (they don’t have enough neighbors within eps).
- G has only 2 neighbors (H, I), which is less than the assumed min_samples (e.g., 3), so G is NOT a core point.
Since no more core points are found within the current cluster’s reach, the expansion stops.
- Step 4: Identify Remaining Points
Points D, E, and F are not connected to B’s cluster and form another group where all are neighbors of each other. However, if none of them are core points (e.g., each has only 2 neighbors, less than min_samples), they do not form a new cluster and remain unclustered or are identified as noise, depending on the dataset structure and parameters.
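A hedged scikit-learn sketch of DBSCAN on invented 2-D points, with assumed parameters eps=0.5 and min_samples=3; a label of -1 marks noise, as described above:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Invented 2-D points: two dense regions and one isolated outlier
X = np.array([
    [1.0, 1.0], [1.1, 1.0], [0.9, 1.1], [1.0, 0.9],   # dense region
    [5.0, 5.0], [5.1, 5.1], [4.9, 5.0],               # second dense region
    [9.0, 0.0],                                        # isolated point -> noise
])

# eps: neighbourhood radius; min_samples: neighbours needed to be a core point
db = DBSCAN(eps=0.5, min_samples=3).fit(X)
print(db.labels_)   # e.g. [0 0 0 0 1 1 1 -1]
```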