Machine Learning Fundamentals: Algorithms and PyTorch Implementation
Machine Learning Core Concepts
Basic Mathematical Information
Matrix Multiplication
- If Matrix A has size (m x n) and Matrix B has size (n x p), the resulting product AB has size (m x p).
- The number of columns in A must equal the number of rows in B.
- Calculation involves multiplying each row of A by each column of B.
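A quick NumPy sketch of the shape rule (the matrices here are arbitrary examples):

import numpy as np

A = np.arange(6).reshape(2, 3)    # (2 x 3)
B = np.arange(12).reshape(3, 4)   # (3 x 4): columns of A == rows of B
C = A @ B                         # each entry: a row of A dotted with a column of B
print(C.shape)                    # (2, 4)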
Finding Log Base 2 (n)
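Log base 2 appears in the entropy calculations later; it can be computed directly with NumPy or via the change-of-base formula $\log_2 n = \ln n / \ln 2$. A minimal sketch:

import math
import numpy as np

n = 8
print(np.log2(n))                 # 3.0
print(math.log(n, 2))             # 3.0
print(math.log(n) / math.log(2))  # 3.0, change-of-base formula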
Key Machine Learning Definitions
- Supervised Learning: Models learn from labeled data to approximate a target function (hypothesis function).
- Classification: Goal is to categorize input into discrete classes (e.g., predicting disease presence, identifying fruit type).
- Regression: Goal is to predict continuous numerical values based on input data (e.g., predicting house price based on size).
- Unsupervised Learning: Models find patterns in unlabeled data (e.g., clustering).
- Overfitting (High Variance): Model performs well on training data but poorly on test data, often due to excessive complexity.
- Underfitting (High Bias): Model is too simple to capture the underlying patterns in the data.
Machine Learning Pipeline
- Data Preprocessing: Handling missing data, encoding categorical variables (one-hot, ordinal).
- Feature Engineering: Selecting relevant features, removing irrelevant ones, and designing/creating new features.
- Model Selection.
- Model Training: The model learns to map input to output, adjusting parameters to minimize prediction errors.
- Model Evaluation.
Data Preprocessing Techniques
Representing Data
Tabular Data Encoding
- One-Hot Encoding: Converts categorical labels into binary vectors. E.g., “science”, “arts”, “hybrid” → [1,0,0], [0,1,0], [0,0,1]. The number of elements in the list equals the number of unique labels.
- Ordinal Encoding: Assigns numerical values based on order or rank. E.g., “science”, “arts”, “hybrid” → 0, 2, 1 (the exact mapping depends on the defined order).
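A minimal scikit-learn sketch of both encodings; the column name and the category order passed to OrdinalEncoder are illustrative assumptions, not from the notes:

import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

df = pd.DataFrame({"stream": ["science", "arts", "hybrid", "arts"]})

onehot = OneHotEncoder()                              # one binary column per unique label
print(onehot.fit_transform(df[["stream"]]).toarray())

ordinal = OrdinalEncoder(categories=[["science", "hybrid", "arts"]])  # explicit rank order
print(ordinal.fit_transform(df[["stream"]]))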
Image and Video Representation (CNN)
- Pixel Grid Representation: Images are represented as grids of pixels.
- Flat Black & White: 1 channel, typically using 0s and 1s.
- Grayscale Image: 1 channel, using a scale of 256 possible levels (0 to 255 inclusive) representing varying shades of gray.
- RGB: Combines Red, Green, and Blue channels, each with 256 possible levels.
- Video: A sequence of images, represented by repeated pixel grid representations for each frame.
Word and Sentence Representation
- Words (One-Hot Encoding): The number of elements equals the size of the vocabulary. E.g., “ume” → [1, 0, 0], “is” → [0, 1, 0], “cute” → [0, 0, 1].
- Sentence (Sequence of Words): A list of one-hot encoded words. E.g., “ume is cute” → [[1,0,0], [0,1,0], [0,0,1]].
- Bag of Words (BoW): Represents the frequency of words in a document, ignoring grammar and word order. E.g., “ume is so so cute cute cute”:

| ume | is | so | cute |
|-----|----|----|------|
| 1 | 1 | 2 | 3 |
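A small Bag-of-Words sketch using scikit-learn's CountVectorizer on the example sentence (the alphabetical vocabulary order is simply how CountVectorizer sorts it):

from sklearn.feature_extraction.text import CountVectorizer

docs = ["ume is so so cute cute cute"]
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())  # vocabulary, e.g. ['cute' 'is' 'so' 'ume']
print(counts.toarray())                    # word frequencies, e.g. [[3 1 2 1]]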
Handling Data Issues
Class Imbalance
Issue: Model may learn that predicting the dominant class results in fewer mistakes.
Solution: Data Resampling
- Oversampling: Duplicate data points from the minority class.
- Undersampling: Use only a subsample of the majority class.
Missing Values
- Solution 1: Remove all data points with missing values (not recommended if the training sample is small).
- Solution 2 (Imputation): Replace missing values with the mean or mode of the column.
- Implementation: sklearn.impute.SimpleImputer(strategy="mean") or SimpleImputer(strategy="most_frequent").
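A short imputation sketch with SimpleImputer; the toy array with NaNs is illustrative:

import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 7.0],
              [np.nan, 5.0],
              [3.0, np.nan]])

mean_imputer = SimpleImputer(strategy="mean")           # NaNs -> column means
print(mean_imputer.fit_transform(X))

mode_imputer = SimpleImputer(strategy="most_frequent")  # mode; also works for categorical columns
print(mode_imputer.fit_transform(X))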
Outliers
- Solution 1: Z-Score Method
- Calculate Z-scores: sklearn.preprocessing.StandardScaler().fit_transform(data)
- Find data points where |Z| > threshold.
- Solution 2: IQR Method
- Arrange data in ascending order.
- Calculate IQR (Interquartile Range): $Q3 - Q1$ (np.percentile(data, 75) - np.percentile(data, 25)).
- Lower Bound = $Q1 - 1.5 \times IQR$
- Upper Bound = $Q3 + 1.5 \times IQR$
- Remove points outside this range (point > upper bound OR point < lower bound).
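A sketch of the IQR method above in NumPy; the data array (with 100 as an obvious outlier) is illustrative:

import numpy as np

data = np.array([2, 3, 4, 5, 6, 7, 100])   # 100 is an obvious outlier

q1, q3 = np.percentile(data, 25), np.percentile(data, 75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

filtered = data[(data >= lower) & (data <= upper)]
print(filtered)                             # the outlier 100 is removed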
Feature Scaling
Issue: Features with different scales may cause the model to give more importance to features with a larger range.
Solution 1: Z-Score Standardization: $z = (x - \mu) / \sigma$
Solution 2: Min-Max Scaling: $x' = (x - x_{\min}) / (x_{\max} - x_{\min})$
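A minimal sketch of both scalers in scikit-learn on toy data with very different column ranges:

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0]])

print(StandardScaler().fit_transform(X))  # each column: mean 0, standard deviation 1
print(MinMaxScaler().fit_transform(X))    # each column scaled to [0, 1]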
Supervised Learning Models
K-Nearest Neighbors (K-NN)
- Type: Supervised, primarily for classification.
- Best for: Datasets with a small number of attributes.
- Input/Output: Input can be continuous/discrete/binary; output is an array of labels.
K-NN Formula and Steps
- Calculate distances of data points from the given point using Manhattan distance $\sum_{i=1}^{n}|x_{a,i} - x_{b,i}|$ or Euclidean distance $\sqrt{\sum_{i=1}^{n}(x_{a,i} - x_{b,i})^2}$ ($n$ = number of features; $x_a$ & $x_b$ = two data points).
- Identify $K$ nearest neighbors based on distance.
- Classify the data point based on the majority vote among the $K$ neighbors.
Choosing K
- K too large: Decision boundary is overly smooth (linear), leading to underfitting.
- K too small: Decision boundary is overly complex, leading to overfitting.
Limitations
- Requires feature scaling.
- Curse of Dimensionality: In higher dimensions, points are more spread out, making it difficult to distinguish based on distance.
Mean Nearest-Farthest Distance Ratio: as the number of dimensions grows, the ratio of the nearest to the farthest neighbour distance approaches 1, so distances become less informative.
Implementation
- Use sklearn.neighbors.KNeighborsClassifier.
- Euclidean distance: np.linalg.norm(x_a - x_b) (where $x_a$ and $x_b$ are arrays of the feature values $[x_1, x_2, \dots, x_n]$).
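A small K-NN sketch with scikit-learn; the training points, labels, and query point are illustrative:

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X_train = np.array([[1, 1], [1, 2], [8, 8], [9, 8]])
y_train = np.array([0, 0, 1, 1])

knn = KNeighborsClassifier(n_neighbors=3)   # K = 3, Euclidean distance by default
knn.fit(X_train, y_train)
print(knn.predict([[2, 1]]))                # majority vote among the 3 nearest neighbours

# Euclidean distance between two points by hand:
x_a, x_b = np.array([1, 1]), np.array([2, 1])
print(np.linalg.norm(x_a - x_b))            # 1.0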
Decision Trees
- Type: Supervised learning, used for classification.
- Input: Binary, discrete, or discretized continuous data (e.g., range 0-10, 10-20).
- Output: Single decision value.
Building the Tree
- Get Root Node:
- Compute the entropy of the data and all attributes.
- Compute the remainder (or information gain) for each attribute. Choose the attribute with the smallest remainder as the root.
- Split Dataset: Split the dataset by the root attribute’s values.
- If an attribute value has more than one outcome, repeat the process using the next best attribute (smallest remainder).
- To find the next best attribute: Compute the remainder within each subset filtered by the previous best attribute.
- If two attributes have the same remainder, choose the one that comes earlier in the tie-breaking order.
- Repeat until: No attributes remain OR all samples in a subset belong to one class.
Entropy (I)
Measures the uncertainty or unpredictability of the outcomes of a decision.
- Maximum uncertainty: $I(y) = 1$ (random).
- No uncertainty: $I(y) = 0$ (deterministic).
Formula: $I(y) = -\sum_{k} p_k \log_2 p_k$ (for a binary outcome: $I(y) = -p\log_2 p - (1-p)\log_2(1-p)$).
Calculate Probability (p): $p = n(\text{samples with label}) / n(\text{total samples})$. If using a subset, filter by all samples with the attribute value (e.g., only see outcomes of points where attribute “taste = salty”).
Remainder
The uncertainty remaining after partitioning the dataset based on an attribute, or the weighted average of the entropy of the subsets after splitting the dataset based on that attribute.
Smaller remainder → better attribute.
Formula: $\text{Remainder}(A) = \sum_{v=1}^{V} W(A_v) \times I(A_v)$
Where $V$ = number of values (outcomes) of attribute $A$, $I(A_v)$ = entropy of the subset with value $v$, $W(A_v)$ = weight of that subset.
Weight Calculation: $W(A_v) = n(\text{samples with value } v) / n(\text{total samples})$
Information Gain (IG)
Reduction in uncertainty after splitting on an attribute. Higher IG → better attribute → should be chosen earlier.
Formula: $IG(A) = \text{Entropy}(\text{data}) - \text{Remainder}(A)$
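A NumPy sketch of entropy, remainder, and information gain; the toy labels and the "taste" attribute are illustrative:

import numpy as np

def entropy(labels):
    """I(y) = -sum_k p_k * log2(p_k) over the class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def remainder(attribute, labels):
    """Weighted average entropy of the subsets induced by an attribute."""
    total = len(labels)
    rem = 0.0
    for v in np.unique(attribute):
        subset = labels[attribute == v]
        rem += (len(subset) / total) * entropy(subset)
    return rem

y = np.array([1, 1, 0, 0, 1, 0])   # outcomes
taste = np.array(["salty", "salty", "sweet", "sweet", "salty", "sweet"])

ig = entropy(y) - remainder(taste, y)   # information gain of "taste"
print(entropy(y), remainder(taste, y), ig)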
Pruning (To Prevent Overfitting)
- Max-Depth: Limit the tree depth; use majority vote at maximum depth.
- Min-Sample Leaf: Set minimum leaf sample size; use majority vote if fewer samples exist.
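Both pruning options map directly to scikit-learn hyperparameters; a minimal sketch (X_train and y_train are placeholders, not from the notes):

from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(
    criterion="entropy",    # split using information gain
    max_depth=3,            # max-depth pruning
    min_samples_leaf=5,     # min-sample-leaf pruning
)
# Usage: tree.fit(X_train, y_train); predictions = tree.predict(X_test)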
Gradient Descent
Iteratively updates weights to minimize the loss by following the gradient: $w_j \leftarrow w_j - \gamma \, \partial J / \partial w_j$ (where $\gamma$ is the learning rate).
Global Minimum Theorem: The graph of the loss function against weights must have only one minimum, which is the global minimum.
Stops when $\partial J / \partial w = 0$ (converges to minimum).
- Loss function for Linear Regression: Mean Squared Error (MSE).
- Loss function for Logistic Regression: Binary Cross-Entropy (BCE) Loss.
Gradient Calculation ($\partial / \partial w_j$ of loss):
Determining Optimal Weight and Learning Rate
- Optimal Weight ($w^*$): Find the derivative of the cost function, set $d(\text{cost})/d(w) = 0$, and solve for $w$.
- Optimal Learning Rate ($\gamma$):
- Let the error be $e(j) = w(j) - w^*$. Substitute this into the gradient update equation.
- Rearrange the equation so that $e(j+1)$ appears on the LHS and $e(j)$ on the RHS.
- Find the value of $\gamma$ for which the error shrinks, i.e., $|e(j+1)| < |e(j)|$.
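A minimal gradient-descent sketch for a single weight under an MSE-style cost; the toy data (generated from $y = 3x$, so $w^* = 3$) and the learning rate are illustrative:

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = 3.0 * x                    # optimal weight w* is 3

w, gamma = 0.0, 0.05           # initial weight and learning rate
for _ in range(200):
    y_hat = w * x
    grad = (2 / len(x)) * np.sum((y_hat - y) * x)   # d(MSE)/dw
    w = w - gamma * grad       # gradient update step
print(w)                       # converges towards w* = 3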
Linear Regression
- Type: Supervised, used for predicting continuous values by fitting a straight line to data.
Model Hypothesis
- 1 Feature, 1 Label: $h_w(x) = w_0 + w_1x_1$
- Multiple Features: $h_w(x) = w_0 + w^Tx$
- $w^Tx = w_1x_1 + w_2x_2 + \dots + w_nx_n$
Key Terms
- $\hat{y}$: Predicted value.
- $x = [x_1, x_2, \dots, x_n]$: Input features.
- $w = [w_1, w_2, \dots, w_n]$: Weights (parameters).
- $w_0$: Y-intercept / bias.
- $i$: Index of a data point (runs from 1 to $m$ = number of samples; data.shape[0]).
- $j$: Index of a feature (runs from 1 to $n$ = number of features; data.shape[1] - 1).
Cost Function (MSE)
Measures the ‘goodness’ of the model.
Formula: $J(w) = \frac{1}{m}\sum_{i=1}^{m}\left(\hat{y}^{(i)} - y^{(i)}\right)^2$
Where $m$ = number of training data samples.
- Note: The loss function measures the loss of a single data point, i.e., $(\hat{y}-y)^2$ (squared error).
- The cost function measures the average loss across the entire training set.
Variants of Gradient Descent
- Mini-Batch: Uses a subset of data. Faster and cheaper, good for large datasets.
- Stochastic: Uses 1 data point per update. Fast but may overfit, unlikely to hit the exact global minimum.
Handling Non-Linear Relationships
Transform the relationship into a linear one by adding polynomial features (Polynomial Regression).
Polynomial Regression: Approximates any continuous function on a closed and bounded interval (min-max known) to any degree of accuracy (Weierstrass Approximation Theorem).
Example: $h_w(x) = w_0 + w_1x + w_2x^2$
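A polynomial regression sketch: generate polynomial features, then fit an ordinary linear model. The toy quadratic data and degree=2 are illustrative:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

x = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
y = 1 + 2 * x.ravel() + 0.5 * x.ravel() ** 2        # quadratic relationship

x_poly = PolynomialFeatures(degree=2, include_bias=False).fit_transform(x)  # [x, x^2]
model = LinearRegression().fit(x_poly, y)           # learns w_0, w_1, w_2
print(model.intercept_, model.coef_)                # ~1.0, [2.0, 0.5]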
Logistic Regression
- Type: Supervised, used for classification with continuous inputs.
- Decision Boundary: Usually linear (or can be transformed to linear).
- Use Case: Mainly for binary classification (yes/no, true/false, 0/1).
- Prediction: Predicts the probability of a given input belonging to a class. A threshold (default 0.5 in sklearn) determines the final classification.
Model (Sigmoid Function)
The sigmoid function maps any real-valued number into a probability between 0 and 1: $\sigma(z) = \frac{1}{1 + e^{-z}}$, where $z = w_0 + w^Tx$.
Cost Function (Binary Cross-Entropy)
$J(w) = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log\hat{y}^{(i)} + (1 - y^{(i)})\log(1 - \hat{y}^{(i)})\right]$
Common Issues and Fixes
Non-Linear Decision Boundary
Add polynomial terms of the existing features (e.g., a squared term for each feature in the dataset).
Example: Transform $h_w(x) = w_0 + w_1x_1 + w_2x_2$ into a circular boundary:
$h_w(x) = w_0 + w_1x_1 + w_2x_2 + w_3x_1^2 + w_4x_2^2$
Note: Coefficients may not necessarily be 0 and 1; this depends on the problem.
Multi-Class Classification
- One vs All (OvA): Fit one classifier per class. Train each model to distinguish one class (“1”) from all others (“0”). Choose the class whose classifier outputs the highest probability of being “1”.
- One vs One (OvO): Fit one classifier per class pair. The class with the higher probability “wins.” Choose the class with the highest number of “wins.”
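Both strategies are available as wrappers in scikit-learn; a minimal sketch using logistic regression on the 3-class iris dataset (chosen only as a convenient example):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier

X, y = load_iris(return_X_y=True)

ova = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)  # 3 classifiers
ovo = OneVsOneClassifier(LogisticRegression(max_iter=1000)).fit(X, y)   # 3 class pairs
print(ova.predict(X[:5]), ovo.predict(X[:5]))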
Regularization
Methods to prevent overfitting:
- Reduce the number of features (e.g., transform from high to low degree polynomial).
- Regularization: Reduce the magnitude of weights ($w_j$).
L2 (Ridge) Regularized Cost Function
Formula: $J_{\text{reg}}(w) = J(w) + \lambda\sum_{j=1}^{n} w_j^2$, where $\lambda$ is the regularization parameter, $J(w)$ is the unregularized cost, and the bias $w_0$ is not penalized.
- For Logistic Regression: $J(\text{BCE})$
- For Linear Regression: $J(\text{MSE})$
Regularized Loss Gradient
$\frac{\partial J_{\text{reg}}}{\partial w_j} = \frac{\partial J}{\partial w_j} + 2\lambda w_j$ (for $j \ge 1$).
Unsupervised Learning: Clustering
Unsupervised learning method used to find groupings based on data point feature similarities (no labels).
K-Means Clustering
Centroid Definition
Coordinates are the mean of all data points for each feature ($x_1, x_2, x_3, \dots, x_n$) within the cluster. $m$ = number of data points in the cluster.
Algorithm Steps
- Initialize $K$ centroids.
- For each data point (for $i = 1, \dots, m$), assign it to the nearest centroid (calculated using Euclidean distance).
- Update centroids (calculate the mean of all data points in the new cluster).
The loop stops when the centroids no longer change (convergence).
Initialization Methods
- K-Means: Randomly initialize $K$ centroids.
- K-Means++: Randomly initialize the first centroid. Select each subsequent centroid with probability $P$ proportional to $D(x)^2$, the squared distance of data point $x$ from its nearest existing centroid.
  - $D(x)^2$ = squared distance of the current point from the nearest centroid.
  - Larger distance → higher probability → more likely to be selected as the next centroid.
Evaluate Performance (J)
Calculate the Mean Euclidean Distance J of each data point from its centroid. A smaller value indicates better performance.
Determining K
- Elbow Method (Heuristic): Set $K$ to the ‘elbow’ point—the point where increasing the number of clusters ($K$) does not lead to a drastic improvement in clustering performance ($J$). (Note: Some graphs of J against K may not have a clear elbow).
- Application Dependent Method: $K$ is chosen based on specific case requirements.
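A sketch of the elbow method; here $J$ is approximated by KMeans' inertia_ (the sum of squared distances to the nearest centroid), and the blob data is illustrative:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

for k in range(1, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, km.inertia_)   # look for the K where the drop in J flattens out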
Hierarchical Clustering
Steps
- Initially, every point is a data cluster.
- In a loop, calculate the distances of each data point from all other data points using a linkage method.
- Find the minimum distance among all data points and merge them together.
- Repeat until all points form a single cluster.
Comparing Merged Clusters: After merging points [A, B] with point C, the distance is determined by the linkage type. For example, if using single linkage, the distance is $\min(\text{dist}(A, C), \text{dist}(B, C))$.
Visualization: Dendrogram
Draw from the bottom up until there is one link for all points. Drawing a horizontal line across the dendrogram shows the number of clusters (equal to the number of intersections between the line and the dendrogram).
Types of Linkages (Distance Metrics)
- Single Linkage: Distance between the points closest to one another in the clusters.
- Average Linkage: Mean distance between all points in each cluster.
- Complete Linkage: Distance between the points farthest from one another between clusters.
- Centroid Linkage: Distance between the centroids of the clusters.
Implementation Differences
- K-Means uses sklearn.cluster.KMeans(n_clusters=k) and minimizes a loss function.
- Hierarchical clustering uses sklearn.cluster.AgglomerativeClustering and combines pairs of clusters that are close together according to the linkage method.
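A side-by-side sketch of the two APIs; the blob data and the choice of single linkage are illustrative:

from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=100, centers=3, random_state=0)

kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
hier_labels = AgglomerativeClustering(n_clusters=3, linkage="single").fit_predict(X)

print(kmeans_labels[:10])
print(hier_labels[:10])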
Model Evaluation Metrics
Regression Metrics
- MSE: Mean Squared Error.
- MAE: Mean Absolute Error (mean of $|\hat{y} - y|$ over all samples).
Classification Metrics (Confusion Matrix)
|                 | Predicted Positive | Predicted Negative |
|-----------------|--------------------|--------------------|
| Actual Positive | TP (True Positive) | FN (False Negative) |
| Actual Negative | FP (False Positive) | TN (True Negative) |
Note: For all metrics below, values are between 0 and 1. Higher is better.
| Metric | Formula | Definition / Use Case |
|--------|---------|-----------------------|
| Accuracy (A) | $(TP + TN) / \text{Total Samples}$ | True predictions / total number of samples |
| Precision (P) | $TP / (TP + FP)$ | True positives / all predicted positives |
| Recall (R) | $TP / (TP + FN)$ | True positives / all actual positives |
| F1 Score | $2 / ((1/P) + (1/R))$ | Harmonic mean of Precision and Recall |
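A sketch of these metrics with scikit-learn; the toy labels are illustrative. Note that sklearn's confusion_matrix orders rows and columns by label value (0 then 1), so its layout differs from the table above:

from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

print(confusion_matrix(y_true, y_pred))   # rows: actual, columns: predicted
print(accuracy_score(y_true, y_pred))     # (TP + TN) / total
print(precision_score(y_true, y_pred))    # TP / (TP + FP)
print(recall_score(y_true, y_pred))       # TP / (TP + FN)
print(f1_score(y_true, y_pred))           # harmonic mean of P and R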
Hyperparameter Tuning and Validation
Hyperparameter: External configuration parameters of an ML model not learned from training data (e.g., learning rate, regularization parameter $\lambda$).
Methods to Find Best Hyperparameter
- Grid/Exhaustive Search: Test all specified hyperparameter combinations.
- Random Search: Select hyperparameters randomly to test.
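A minimal sketch of both searches over K for K-NN; the parameter grid and the iris data are illustrative:

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
param_grid = {"n_neighbors": [1, 3, 5, 7, 9]}

grid = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5).fit(X, y)        # tries all values
rand = RandomizedSearchCV(KNeighborsClassifier(), param_grid, n_iter=3, cv=5,
                          random_state=0).fit(X, y)                            # tries a random subset
print(grid.best_params_, rand.best_params_)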
Model Performance Measurement Steps
- Training:
- If finding the best hyperparameter: Train a new model with each hyperparameter value on the same dataset.
- If finding the best model: Select different models (K-NN, Logistic Regression, etc.) to train on the same dataset.
- Validating: Use a subset of the training sample to find the model/hyperparameter that produces the minimum error.
- Testing: Final evaluation on unseen test data.
Neural Networks and Deep Learning
Perceptrons
Perceptron Model
Input (weighted sum: $w_0 + w_1x_1 + w_2x_2 + \dots + w_nx_n$) → Activation Function ($g$) → Output ($\hat{y}$)
- The original perceptron, inspired by brain neurons, uses a step function on the weighted sum $z$: output is -1 if $z < 0$, and 1 if $z \ge 0$.
- The perceptron output without activation is equivalent to linear regression, forming a straight-line decision boundary.
- The activation function applies to the weighted sum of all inputs, not to individual input features.
Perceptron Update Algorithm
- Initialize weights randomly, predict $\hat{y}$.
- If $\hat{y} \ne y$, update the weights for the misclassified point: $w_j = w_j - \gamma(\hat{y} - y)x_j$ (where $\gamma$ is the learning rate and $x_j$ is the $j$-th feature of that point).
- Repeat until all points are classified correctly or maximum steps are reached.
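A NumPy sketch of the update rule above, trained on the AND gate (defined in the next subsection); prepending a constant 1 to each input so that $w_0$ acts as the bias is an assumption of this sketch:

import numpy as np

# AND gate inputs, with a constant 1 prepended so that w[0] plays the role of w_0
X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]])
y = np.array([0, 0, 0, 1])

w, gamma = np.zeros(3), 0.1
for _ in range(20):                                     # repeat until correct or max steps
    for x_i, y_i in zip(X, y):
        y_hat = 1 if np.dot(w, x_i) >= 0 else 0         # step activation
        w = w - gamma * (y_hat - y_i) * x_i             # perceptron update
print(w)                                                # weights implementing AND
print([1 if np.dot(w, x_i) >= 0 else 0 for x_i in X])   # [0, 0, 0, 1]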
Logic Gates
True = 1, False = 0
- NOT: Inverts the input (output is 1 if the input is 0, and 0 if the input is 1).
- AND (&): Only true if both inputs are 1.
- OR: True if at least 1 input is 1.
- NOR: Only true if both inputs are 0.
- XNOR: True if inputs are the same (1,1) or (0,0) → 1.
- XOR: True if inputs differ (0,1) or (1,0) → 1.
Neural Networks Structure
- Single Layer: No hidden layers; can implement gates like AND, OR, NAND.
- Multi-Layer: Input → Hidden layers → Output (required for non-linear problems like XOR, XNOR).
- Forward Propagation: Data flows from input → hidden layers → output. An activation function is used at each layer (except for the output layer when performing regression).
- Weight Matrix: Rows = number of inputs, Columns = number of outputs for the next layer. Example: A 3×5 matrix means the next layer has 5 neurons/outputs.
Evaluating Mathematical Representations
Start with the inner parentheses first. E.g., $\text{NOT}(\text{AND}(x_1, x_2))$. If $x_1 = 0, x_2 = 1$, then $\text{NOT}(0) \rightarrow 1$.
Activation Functions
- Softmax: Used for multi-class classification (for $C$ classes). Softmax(dim=1) returns a list of probabilities $p_i$ (probability that the input row belongs to class $i$); each row sums to 1.
  - The index of the element corresponds to the class label.
  - Predicted class = output.argmax(dim=1).
- Sigmoid: Used for binary classification (Logistic Regression). Handles non-linear boundaries if input features are transformed.
- No Activation: Used for regression tasks ($g(x) = x$).
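A small PyTorch sketch of Softmax and argmax over toy class scores:

import torch

logits = torch.tensor([[2.0, 1.0, 0.1],
                       [0.5, 2.5, 0.2]])   # 2 samples, C = 3 classes

probs = torch.nn.Softmax(dim=1)(logits)
print(probs)                 # each row sums to 1
print(probs.sum(dim=1))      # tensor([1., 1.])
print(probs.argmax(dim=1))   # predicted class labels: tensor([0, 1])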
PyTorch Implementation
Import: import torch.nn as nn
Containers
- nn.Module: The base class a custom neural network class inherits from when it is defined.
- nn.Sequential: Directly defines the model by calling layers sequentially.
Linear and Activation Functions
- nn.Linear(input_size, output_size): Defines a single fully connected layer.
- Activation functions (e.g., nn.Sigmoid, nn.ReLU, nn.Softmax) are separate modules in PyTorch and must be applied explicitly after each layer that requires an activation.
Building a Neural Network using nn.Module
Example: NN with one hidden layer for REGRESSION (the last layer has no activation function).
class NNRegressor(torch.nn.Module):
    def __init__(self, input_size, output_size):
        super().__init__()
        self.linear1 = torch.nn.Linear(input_size, output_size)
        self.linear2 = torch.nn.Linear(output_size, 1)
        # output size of 1st layer = input size of 2nd layer
        self.activate = torch.nn.ReLU()  # or any other activation func

    def forward(self, x):
        x = self.linear1(x)
        x = self.activate(x)
        x = self.linear2(x)
        return x

model1 = NNRegressor(4, 8)  # 4 features (x_n), 8 neurons to map to initially
Building a Neural Network using nn.Sequential
Used to package a series of modules.
model2 = torch.nn.Sequential(
    torch.nn.Linear(5, 8),
    torch.nn.ReLU(),
    torch.nn.Linear(8, 1)
)
Training Loop (Gradient Descent)
- Loss Functions: torch.nn.CrossEntropyLoss (multi-class), torch.nn.BCELoss (binary), torch.nn.MSELoss (regression).
- Optimizer: optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)
# Training loop example
loss_fn = torch.nn.BCELoss()
for epoch in range(epochs):
    optimizer.zero_grad()
    y_pred = model(x)
    loss = loss_fn(y_pred, y)
    loss.backward()   # compute gradient
    optimizer.step()  # update weights
Convolutional Neural Networks (CNN)
Usually used for object detection in images (e.g., cancer cells, classifying image types).
- Fully Connected Layer: Every output neuron “views” all pixels of the input image to determine its output.
- Convolution: Allows the model to view local regions of an image.
Building a CNN
Similar to a standard NN, but each convolutional layer is typically followed by a pooling layer.
self.c1 = torch.nn.Conv2d(input_channels, output_channels, kernel_size, stride=1, padding=0)
# padding="valid" = padding=0
self.pool = torch.nn.MaxPool2d(kernel_size)
To transition to a FULLY CONNECTED LAYER (Conv2d → MaxPool → Linear), flatten inputs first.
Input size = channels $\times$ height $\times$ width.
torch.flatten(x, start_dim=1, end_dim=-1)  # x = batch of feature maps; keeps the batch dimension
Components
- W/H = N: Dimensions (Width/Height).
- K: Kernel dimension (for W/H respectively).
- Each feature map value is the sum of the element-wise multiplication of the kernel with the local region it covers.
- P: Padding.
- S: Stride. A stride of ‘n’ is inclusive of the current first element. E.g., if the top left starts from A, stride = 2 means the second region should start from C.
Pooling
Downsamples feature maps, reducing dimensions.
- Types: Average, Max, Sum.
- PyTorch Methods: torch.nn.MaxPool2d, torch.nn.AvgPool1d, torch.nn.AdaptiveMaxPool2d.
- Output Shape: (initial height or width) / pooling factor.
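The standard convolution output size is $\lfloor (N - K + 2P)/S \rfloor + 1$ per dimension, and pooling then divides by the pooling factor; a small shape check in PyTorch (the image size, channel counts, and kernel sizes are illustrative):

import torch

x = torch.randn(1, 3, 32, 32)                 # batch of 1 RGB image, 32x32

conv = torch.nn.Conv2d(3, 8, kernel_size=5, stride=1, padding=0)
pool = torch.nn.MaxPool2d(2)

h = conv(x)
print(h.shape)        # (1, 8, 28, 28): (32 - 5 + 2*0)/1 + 1 = 28
h = pool(h)
print(h.shape)        # (1, 8, 14, 14): 28 / 2 = 14

flat = torch.flatten(h, start_dim=1)          # keep batch dim, flatten the rest
print(flat.shape)     # (1, 8 * 14 * 14) -> input size of the fully connected layer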