Machine Learning Fundamentals: Data Analysis & Clustering Techniques
Core Machine Learning Concepts
Linear Regression
Linear Regression: Finds the best line that summarizes the relationship between two variables. Imagine a scatter plot of data points and a line representing their trend.
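For illustration, here is a minimal scikit-learn sketch that fits such a trend line to invented numbers (the data and variable names are placeholders, not from a real dataset):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Invented data: one explanatory variable (x) and one response (y)
X = np.array([[1], [2], [3], [4], [5]])   # e.g., years of experience
y = np.array([30, 35, 42, 48, 55])        # e.g., salary in $1000s

# Fit the line y = slope * x + intercept that best summarizes the trend
model = LinearRegression().fit(X, y)
print("slope:", model.coef_[0], "intercept:", model.intercept_)
print("prediction for x = 6:", model.predict([[6]])[0])
```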
Dimensionality Reduction
Dimensionality Reduction: Addresses datasets with a large number of variables, whose dispersion (covariance) matrix becomes large and hard to work with. It reduces the number of variables to a more manageable few while keeping most of the information.
Why Dimensionality Reduction?
- Simpler Analysis: Fewer features make it easier to find patterns.
- Faster Processing: Essential for applications like high-frequency trading.
- Better Visualization: Easier to visualize data with 2-3 features compared to 100.
Correlation Analysis
Correlation Analysis: Measures the strength and direction of the relationship between two variables, ranging from -1 to 1.
- If one variable (X) increases and the other (Y) also increases, the correlation (r) is positive.
- If r is closer to 1, X and Y are positively correlated.
- If r is closer to -1, X and Y are negatively correlated.
- If r is closer to 0, there is a weak or no linear relationship between X and Y.
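A quick NumPy sketch of computing r for two toy variables (the values below are invented for illustration):

```python
import numpy as np

# Invented data: y tends to increase with x
x = np.array([1, 2, 3, 4, 5, 6])
y = np.array([2, 4, 5, 4, 6, 7])

# Pearson correlation coefficient r, always between -1 and 1
r = np.corrcoef(x, y)[0, 1]
print(f"r = {r:.2f}")   # close to +1 here, i.e. a strong positive linear relationship
```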
Correlation for Feature Reduction
Correlation can help reduce dimensionality by identifying redundant features. Imagine you have many features (X1, X2, …, Xn) and want to reduce them to a smaller set while retaining important information.
Steps to Reduce Features Using Correlation:
- Find Correlation Between Features: Compare each feature pair (e.g., X1 & X2, X1 & X3) to check if they are highly correlated.
- Compare Correlation With Target Variable (Y): If two features are highly correlated with each other, check which one has a stronger relationship with the target variable (Y). Keep the feature with the stronger correlation to Y and remove the weaker one (see the sketch below).
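A minimal pandas sketch of these two steps, assuming invented columns X1, X2, X3, a target Y, and an arbitrary 0.9 cutoff for "highly correlated":

```python
import numpy as np
import pandas as pd

# Invented dataset: X2 is nearly a copy of X1, X3 is unrelated, Y is the target
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
df = pd.DataFrame({
    "X1": x1,
    "X2": x1 + 0.05 * rng.normal(size=200),   # redundant with X1
    "X3": rng.normal(size=200),
    "Y": 2 * x1 + rng.normal(size=200),
})

features = ["X1", "X2", "X3"]
corr = df.corr()
threshold = 0.9   # assumed cutoff for "highly correlated"

to_drop = set()
for i, a in enumerate(features):
    for b in features[i + 1:]:
        if abs(corr.loc[a, b]) > threshold:
            # Keep whichever of the pair is more strongly correlated with Y
            weaker = a if abs(corr.loc[a, "Y"]) < abs(corr.loc[b, "Y"]) else b
            to_drop.add(weaker)

print("dropping:", to_drop)   # the weaker of the redundant pair X1/X2
reduced = df[features].drop(columns=list(to_drop))
```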
Probable Error (PEr)
A limitation of correlation is that it doesn’t inherently consider the size of the dataset.
- A correlation of 0.5 in 20 samples is statistically less significant than 0.5 in 10,000 samples, even though the coefficient is the same.
- Probable Error (PEr) helps adjust the interpretation of correlation strength based on sample size.
PEr Formula and Interpretation
The formula for Probable Error (PEr) is:
PEr = 0.6745 × (1 − r²) / √n
Where:
- r = correlation coefficient
- n = number of data points
Interpretation of correlation strength using PEr:
- If r > 6 × PEr → the correlation is strong (significant)
- If r < PEr → the correlation is not significant
Example: If you have a large dataset, even a small correlation (e.g., 0.1) could be statistically meaningful due to the large sample size.
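A small Python sketch of this rule, reusing the sample sizes mentioned above (the helper function and threshold logic are just one way to code it):

```python
import math

def probable_error(r: float, n: int) -> float:
    """Probable error of a correlation coefficient r computed from n data points."""
    return 0.6745 * (1 - r ** 2) / math.sqrt(n)

for r, n in [(0.5, 20), (0.5, 10_000), (0.1, 10_000)]:
    per = probable_error(r, n)
    if r > 6 * per:
        verdict = "strong / significant correlation"
    elif r < per:
        verdict = "not significant"
    else:
        verdict = "inconclusive"
    print(f"r={r}, n={n}: PEr={per:.4f} -> {verdict}")
```

Running this shows that r = 0.5 with only 20 samples is inconclusive, while the same r with 10,000 samples, and even r = 0.1 with 10,000 samples, clears the 6 × PEr threshold.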
Principal Component Analysis (PCA)
PCA is a powerful technique used to reduce the number of variables (dimensions) in a dataset while preserving the most important information.
Why PCA is Essential
- Too many variables (features) can make data analysis complex and slow.
- Some features may be redundant, containing similar information.
- PCA helps identify a smaller set of “principal components” that capture most of the data’s variance.
How PCA Works
- Collect Data: Start with a dataset containing multiple variables (e.g., height, weight, age, income).
- Create a Covariance Matrix: This matrix illustrates how variables in your dataset are related to each other.
- Find Eigenvalues & Eigenvectors:
- Eigenvalues: Indicate how much information (variance) each new feature (Principal Component) captures.
- Eigenvectors: Represent the directions of these new features in the data space.
- Choose Principal Components:
- Select only the most important components (those with the largest eigenvalues).
- These new features are linear combinations of the original variables but effectively capture the most significant patterns in the data (see the sketch below).
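A minimal NumPy sketch of these four steps on made-up height/weight data (in practice a library routine such as sklearn.decomposition.PCA is normally used instead):

```python
import numpy as np

# 1. Collect data: invented height (cm) and weight (kg) for a few people
X = np.array([
    [160, 55], [165, 60], [170, 68], [175, 72], [180, 80], [185, 85],
], dtype=float)

# 2. Centre the data and build the covariance matrix
Xc = X - X.mean(axis=0)
cov = np.cov(Xc, rowvar=False)

# 3. Eigen-decomposition: eigenvalues = variance captured, eigenvectors = directions
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]            # sort components by captured variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# 4. Keep the top component(s) and project the data onto them
print("variance explained per component:", eigvals / eigvals.sum())
X_reduced = Xc @ eigvecs[:, :1]              # new 1-D feature: the first principal component
print(X_reduced.ravel())
```

How many components to keep usually comes down to how much of the total variance (the explained ratio printed above) you want to preserve.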
The Covariance Matrix in PCA
A covariance matrix is a table that quantifies the relationships between different variables in your dataset.
- Example: Consider a dataset with Height and Weight of people.
- If taller people tend to be heavier, Height and Weight have a positive covariance (they increase together).
- If one variable increases while the other decreases, the covariance is negative.
- If variables are completely unrelated, their covariance is close to zero.
- A covariance matrix stores these relationships for all variable pairs in a dataset.
- Example Covariance Matrix:
[ Cov(Height, Height) Cov(Height, Weight) ]
[ Cov(Weight, Height) Cov(Weight, Weight) ]
Understanding the covariance matrix helps PCA determine which variables contain similar information, allowing it to merge them into fewer, more meaningful variables.
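For example, NumPy’s np.cov builds exactly this table from raw Height and Weight columns (the numbers below are invented):

```python
import numpy as np

height = np.array([160, 165, 170, 175, 180, 185])   # cm (invented)
weight = np.array([ 55,  60,  68,  72,  80,  85])   # kg (invented)

# 2x2 covariance matrix:
# [[Cov(Height, Height), Cov(Height, Weight)],
#  [Cov(Weight, Height), Cov(Weight, Weight)]]
print(np.cov(height, weight))   # positive off-diagonal: taller people tend to be heavier
```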
Key PCA Concepts for Study
- Understanding the difference between Eigenvectors and Eigenvalues in the context of the covariance matrix.
- Visualizing these concepts, especially in a three-dimensional graph.
- Grasping the concept of Principal Components and their role in data transformation.
- How to select the optimal number of dimensions for the new dataset.
Week 3: Classification Algorithms
Naive Bayes Classifier
Introduction to Naive Bayes
- The Naive Bayes classifier is a probabilistic machine learning model used for classification tasks.
- It is based on Bayes’ Theorem and assumes that the features (predictors) are conditionally independent given the class label.
- This “naive” assumption simplifies computations, making the algorithm fast and efficient, even for large datasets.
Multinomial Naive Bayes
- Used for discrete data, such as word counts in text classification.
- Assumes features represent the frequency of events (e.g., word occurrences).
- Example Applications: Spam detection, document classification.
Multinomial Naive Bayes Example: Spam Detection
Let’s classify emails as spam or not spam using Multinomial Naive Bayes.
- Training Data – Normal Emails:
- Collect all words from normal (non-spam) emails and count their occurrences (e.g., using a histogram).
- Calculate the probability of seeing each word, given that it’s a normal email. For example, if “free” occurred 5 times in 100 total words in normal emails:
- P(free|normal) = 5/100 = 0.05
- Similarly, if “offer” occurred 8 times in 100 total words in normal emails:
- P(offer|normal) = 8/100 = 0.08
- Training Data – Spam Emails:
- Collect all words from spam emails and count their occurrences.
- Calculate the probability of seeing each word, given that it’s a spam email. For example, if “free” occurred 20 times in 50 total words in spam emails:
- P(free|spam) = 20/50 = 0.4
- Similarly, if “offer” occurred 40 times in 50 total words in spam emails:
- P(offer|spam) = 40/50 = 0.8
- Classifying a New Email:
Imagine you receive a new email containing only the word “offer!”. We want to determine if it’s normal or spam.
- Prior Probability for Normal Email P(Normal):
Estimate the initial probability that any email is normal, based on training data. If 100 out of 150 emails are normal:
- P(Normal) = 100 / (100 + 50) ≈ 0.66
This initial guess is called the prior probability.
- Score for “offer!” in Normal Class:
Multiply the prior probability by the likelihood of “offer” in a normal email:
- P(Normal) × P(offer|normal) = 0.66 × 0.08 ≈ 0.053
This 0.053 is the score for “offer!” belonging to the normal email class.
- Prior Probability for Spam Email P(Spam):
Estimate the initial probability that any email is spam. If 50 out of 150 emails are spam:
- P(Spam) = 50 / (100 + 50) ≈ 0.33
- Score for “offer!” in Spam Class:
Multiply the prior probability by the likelihood of “offer” in a spam email:
- P(Spam) × P(offer|spam) = 0.33 × 0.8 ≈ 0.26
This 0.26 is the score for “offer!” belonging to the spam class.
- Decision:
Since the score for “offer!” belonging to spam (0.26) is greater than the score for it belonging to normal email (0.053), the email “offer!” is classified as spam.
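The whole hand calculation can be reproduced in a few lines of plain Python, using the word counts and email counts assumed in the example:

```python
# Word counts assumed in the example above
normal_counts, normal_total = {"free": 5, "offer": 8}, 100
spam_counts, spam_total = {"free": 20, "offer": 40}, 50

# Prior probabilities from 100 normal and 50 spam training emails
p_normal, p_spam = 100 / 150, 50 / 150

def score(words, prior, counts, total):
    """Prior multiplied by the likelihood of each word (the Naive Bayes score)."""
    s = prior
    for w in words:
        s *= counts[w] / total
    return s

email = ["offer"]
normal_score = score(email, p_normal, normal_counts, normal_total)   # ≈ 0.053
spam_score = score(email, p_spam, spam_counts, spam_total)           # ≈ 0.27 (0.26 with the rounded priors above)
print("classified as:", "spam" if spam_score > normal_score else "normal")
```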
The “Naive” Assumption of Naive Bayes
- Naive Bayes treats all word orders as the same. For example, the score for “free offer” is identical to “offer free”.
- P(Normal) × P(free|normal) × P(offer|normal) = P(Normal) × P(offer|normal) × P(free|normal)
- This means Naive Bayes ignores grammar and language rules, assuming conditional independence between features (words).
- By ignoring relationships between words, Naive Bayes exhibits high bias. However, it often performs well in practice due to its low variance.
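In practice the counting and multiplication are delegated to a library. A hedged scikit-learn sketch of the same idea, with a tiny invented training set (the emails and labels are made up for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny invented training set
emails = ["meeting notes attached", "lunch plans tomorrow",
          "free offer click now", "free money offer"]
labels = ["normal", "normal", "spam", "spam"]

# Bag-of-words counts ignore word order, matching the "naive" assumption
clf = make_pipeline(CountVectorizer(), MultinomialNB())
clf.fit(emails, labels)

print(clf.predict(["free offer"]))
print(clf.predict(["offer free"]))   # same counts, so the same prediction as above
```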
Week 4: Clustering Techniques
K-Means Clustering
K-Means Definition
- K-Means is a partitioning method that divides data into k distinct clusters.
- Clustering is based on the distance of data points to the centroid (mean) of a cluster.
- It requires the number of clusters (k) to be specified in advance.
K-Means Clustering Process
- Initialization: Choose k initial centroids (randomly or using heuristics).
- Assignment Step: Assign each data point to the nearest centroid.
- Update Step: Calculate the new centroids as the mean of all data points assigned to each cluster.
- Repeat: Continue the assignment and update steps until convergence (when centroids no longer change significantly).
Imagine a graph with three clusters of points, each with a centroid at its center.
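A minimal scikit-learn sketch of this loop (the blob data and the choice k = 3 are invented for illustration):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with three well-separated groups
X, _ = make_blobs(n_samples=150, centers=3, random_state=42)

# k must be chosen in advance; fit() runs the assign/update loop until convergence
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)

print("centroids:\n", kmeans.cluster_centers_)
print("first 10 cluster labels:", kmeans.labels_[:10])
```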
Hierarchical Clustering
Hierarchical Clustering Definition
- A method of grouping data points into clusters in a step-by-step manner.
- It creates a hierarchy or “family tree” of clusters, where similar data points are progressively grouped together.
Agglomerative (Bottom-Up) Hierarchical Clustering
This approach starts with individual data points and merges them into larger clusters.
- Start: Each data point begins as its own individual cluster.
- Find the Closest Clusters: Measure the similarity (or distance) between all existing clusters.
- Merge: Combine the two most similar (closest) clusters into one.
- Repeat: Continue merging clusters until only one large cluster remains, or until a predefined number of clusters is reached (see the sketch below).
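A short SciPy sketch of this bottom-up merging on invented 2-D points: linkage repeatedly merges the closest clusters, and fcluster cuts the resulting tree at a chosen number of clusters.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Invented 2-D points forming two loose groups
X = np.array([[1.0, 1.0], [1.2, 1.1], [0.9, 0.8],
              [5.0, 5.0], [5.1, 4.9], [4.8, 5.2]])

# Agglomerative (bottom-up) merging of the closest clusters
Z = linkage(X, method="average")

# Cut the resulting tree to obtain, say, 2 clusters
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)   # e.g. [1 1 1 2 2 2]
```

Cutting the tree at different levels yields different numbers of clusters, which is the flexibility contrasted with K-Means in the next subsection.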
Comparing K-Means and Hierarchical Clustering
A key difference between K-Means and Hierarchical Clustering lies in their approach: K-Means requires a pre-defined number of clusters, while Hierarchical Clustering builds a tree-like structure that can be cut at different levels to yield varying numbers of clusters.
DBSCAN Clustering
DBSCAN Explanation
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a clustering algorithm that groups points in dense regions together while identifying outliers (noise).
- It does not require specifying the number of clusters in advance (unlike K-Means).
- It can discover clusters of arbitrary shapes (not limited to circular clusters like K-Means).
- It effectively detects outliers (points that do not fit into any cluster).
DBSCAN Point Types:
- Core Points: Have at least min_samples points within a specified distance (eps).
- Boundary Points: Are not core points themselves, but are within eps distance of at least one core point.
- Noise Points: Are not close enough to any cluster (often assigned a label of -1).
DBSCAN Example Walkthrough
Let’s illustrate DBSCAN with a given dataset of points: A, B, C, D, E, F, G, H, I. Assume specific neighborhood relationships exist within distance eps (ε).
- Step 1: Identify Core Points (Implicit in the example, but this is the first conceptual step).
- Step 2: Start a Cluster from Core Points
If B is a core point, it forms a new cluster (Cluster 1). Points directly reachable from B (A, C, G) are added to this cluster. Current Cluster 1: {B, A, C, G}.
- Step 3: Expand the Cluster
Check if any new members (A, C, G) are also core points to further expand the cluster.
- C and A are not core points (they don’t have enough neighbors within eps).
- G has only 2 neighbors (H, I), which is less than the assumed min_samples (e.g., 3), so G is NOT a core point.
Since no more core points are found within the current cluster’s reach, the expansion stops.
- Step 4: Identify Remaining Points
Points D, E, and F are not connected to B’s cluster and form another group where all are neighbors of each other. However, if none of them are core points (e.g., each has only 2 neighbors, less than min_samples), they do not form a new cluster and remain unclustered or are identified as noise, depending on the dataset structure and parameters.
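A hedged scikit-learn sketch of DBSCAN on invented 2-D points, with assumed parameters eps=0.5 and min_samples=3; a label of -1 marks noise, as described above:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Invented 2-D points: two dense regions and one isolated outlier
X = np.array([
    [1.0, 1.0], [1.1, 1.0], [0.9, 1.1], [1.0, 0.9],   # dense region
    [5.0, 5.0], [5.1, 5.1], [4.9, 5.0],               # second dense region
    [9.0, 0.0],                                        # isolated point -> noise
])

# eps: neighbourhood radius; min_samples: neighbours needed to be a core point
db = DBSCAN(eps=0.5, min_samples=3).fit(X)
print(db.labels_)   # e.g. [0 0 0 0 1 1 1 -1]
```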