Essential Machine Learning Concepts Explained
Regularization Techniques in Machine Learning
Regularization is a set of techniques used to reduce overfitting by encouraging simpler models. It works by adding a penalty term to the loss function that discourages overly large or complex weights. The two most common forms are L2 (Ridge) and L1 (Lasso) regularization.
L2 Regularization (Ridge / Weight Decay)
L2 Regularization adds the squared L2 norm of the weights to the loss: L(w) = original loss + λ‖w‖². This shrinks weights proportionally to their size, encouraging small, smooth parameters. The gradient becomes ∇L(w) = ∇L₀(w) + 2λw, and the gradient descent update subtracts η·2λw from each weight. Large weights decay faster than small ones. L2 avoids overly large coefficients without setting them exactly to zero.
L1 Regularization (Lasso)
L1 Regularization adds the L1 norm of weights to the loss: L(w) = original loss + λ‖w‖₁ = λΣ|wⱼ|. This promotes sparsity by encouraging some weights to become exactly zero. The gradient is λ·sign(w), applied element-wise. L1 takes constant-sized steps toward 0, regardless of magnitude, unlike L2. This makes L1 good for feature selection and compressing models.
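As a rough sketch (not from the notes), the two update rules above can be written in a few lines of NumPy; grad_loss, lr, and lam are hypothetical placeholders for the unregularized loss gradient, learning rate, and regularization strength.

```python
import numpy as np

def regularized_step(w, grad_loss, lr=0.1, lam=0.01, penalty="l2"):
    """One gradient-descent step on loss + regularizer (illustrative sketch)."""
    if penalty == "l2":
        grad = grad_loss + 2 * lam * w        # ∇(λ‖w‖²) = 2λw: shrink proportionally to size
    elif penalty == "l1":
        grad = grad_loss + lam * np.sign(w)   # ∇(λ‖w‖₁) = λ·sign(w): constant-size pull toward 0
    else:
        raise ValueError(penalty)
    return w - lr * grad

w = np.array([0.5, -0.02, 3.0])
w = regularized_step(w, grad_loss=np.zeros_like(w), penalty="l1")
```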
L1 vs L2 Regularization
L1 leads to sparse models by zeroing unimportant weights; L2 leads to smooth models by shrinking all weights proportionally. L1 is preferred when only a few features are likely to matter; L2 when all features contribute somewhat. The choice is treated as a hyperparameter.
Effect on Learning Algorithms
Regularization changes both the gradient and the update rule. L2 reduces large weights gradually; L1 eliminates some weights completely. Both techniques reduce variance, improve generalization, and prevent overfitting.
Bayesian Interpretation (MAP Estimation)
Regularization can be viewed as imposing a prior on weights. Assume w is drawn from P(w), and data D is drawn from P(D|w). Learning becomes maximizing P(w|D) ∝ P(D|w)P(w). Taking the log yields the usual loss + regularization term. A Gaussian prior on w implies L2 regularization; a Laplace prior implies L1. Regularization thus has a probabilistic justification rooted in Bayesian inference.
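For the Gaussian-prior case, a short (standard) derivation makes the correspondence explicit; here σ² denotes the prior variance:

```latex
-\log P(w \mid D)
  = \underbrace{-\log P(D \mid w)}_{\text{original loss}}
    \;+\; \underbrace{\tfrac{1}{2\sigma^2}\lVert w\rVert^2}_{\lambda\lVert w\rVert^2,\ \ \lambda = 1/(2\sigma^2)}
    \;+\; \text{const},
\qquad \text{for a Gaussian prior } P(w)\propto \exp\!\Big(-\tfrac{\lVert w\rVert^2}{2\sigma^2}\Big).
```

The analogous step with a Laplace prior P(w) ∝ exp(−‖w‖₁/b) yields the L1 penalty λ‖w‖₁ with λ = 1/b.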
Summary of Regularization
- Reduces overfitting by penalizing large weights
- Shrinks or zeroes weights to simplify models
- Improves generalization
- Changes gradient and update behavior
- λ must be tuned, typically via cross-validation
Bias and Variance in Machine Learning Models
Bias/Variance describes two main sources of error in machine learning models. Bias is error from incorrect assumptions that prevent the model from capturing true data patterns. Variance is error from being too sensitive to small fluctuations in the training data. Total expected error decomposes into bias² plus variance (plus irreducible noise).
Understanding Bias
Bias results from simplifying assumptions, such as assuming linearity when the true relationship is nonlinear. High bias leads to underfitting—the model is too simple and misses important structure. To reduce bias, expand the model family by reducing constraints or adding features. This allows the model to better capture real-world complexity.
Understanding Variance
Variance comes from fitting the training data too closely, including noise. High variance leads to overfitting—the model performs well on the training set but poorly on new data. Causes include using overly complex models and having limited training data. Solutions include regularization (which shrinks the model family), increasing training data, or reducing model complexity.
The Bias-Variance Tradeoff
Increasing complexity lowers bias but raises variance. A well-balanced model minimizes both. The goal is to find a model that generalizes well—flexible enough to capture patterns, but not so complex that it memorizes noise.
Data Splitting for Model Evaluation
Data Splitting is essential for evaluating generalization. A typical split is 70% training, 10% validation, and 20% test. The training set is used to fit model parameters. The validation (development) set is used to tune hyperparameters and detect overfitting. The test set is used only for final evaluation and must remain untouched during model development.
Train vs. Test Loss Behavior
As model complexity increases, training loss always decreases, but test loss decreases then increases—the rise marks overfitting. Monitoring both losses helps identify the optimal model complexity.
Summary of Bias and Variance
- Bias = assumptions that oversimplify
- Variance = overreaction to noise
- Underfitting = high bias
- Overfitting = high variance
- Reduce bias by using more complex models
- Reduce variance by regularization or more data
- Balance is key to generalization
Normal Equations for Linear Regression
Normal Equations provide a closed-form solution to linear regression. Given feature matrix X (n × d), output vector y (n × 1), and weight vector w (d × 1), the solution is: w = (XᵀX)⁻¹Xᵀy. This computes the optimal w that minimizes the squared error loss. It works when XᵀX is invertible.
Non-Invertibility Conditions
Non-Invertibility occurs when XᵀX is not full rank, meaning the matrix cannot be inverted. This happens in two main cases: (1) Redundant features—when some columns of X are linearly dependent, and (2) Too many features—when d > n, i.e., more features than data points. In both cases, XᵀX becomes singular.
Consequences of Non-Invertibility
The system XᵀXw = Xᵀy has many solutions. This leads to overfitting and high variance since multiple weight vectors perfectly fit the training data.
Solutions for Non-Invertibility
Fixes include removing redundant or near-duplicate features (feature selection) or using regularization to ensure stability. With L2 regularization (ridge regression), the modified equation becomes w = (XᵀX + λI)⁻¹Xᵀy. Adding λI shifts every eigenvalue of XᵀX up by λ, making the matrix positive definite and therefore invertible.
The Pseudoinverse
If XᵀX is not invertible, the Moore-Penrose pseudoinverse of X can be used to compute a solution: w = X⁺y. The pseudoinverse always exists and provides a least-squares solution, even when multiple solutions exist.
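A minimal NumPy sketch of the three solution routes discussed above (plain normal equation, ridge, and pseudoinverse), using synthetic data purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))        # n = 100 examples, d = 5 features
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.normal(size=100)

# Plain normal equation (requires XᵀX to be invertible)
w_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Ridge regression: (XᵀX + λI) is invertible for any λ > 0
lam = 0.1
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

# Moore-Penrose pseudoinverse: works even when XᵀX is singular
w_pinv = np.linalg.pinv(X) @ y
```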
Practical Rules of Thumb
Always aim for more data than features (n > d) to ensure XᵀX is invertible. Avoid duplicate or near-duplicate features to maintain linear independence.
Normal Equations Summary
- Normal equation gives exact solution for linear regression
- Fails when XᵀX is not invertible
- Causes include too few examples or redundant features
- Fixes include regularization or using the pseudoinverse
- Always check data shape and feature quality before applying the normal equations
Non-Parametric Machine Learning Methods
Non-Parametric Methods do not assume a fixed set of parameters. Instead, they store training data and use it to make predictions, meaning model size grows with the dataset. This contrasts with parametric methods, which learn a fixed number of parameters and discard the training data after training.
Parametric Methods Overview
Parametric Methods include linear regression, logistic regression, and softmax regression. These directly model P(y|x) and learn a parameter vector w ∈ ℝᵈ. Logistic regression predicts using sigmoid(wᵀx), with w trained via maximum likelihood. Naive Bayes is a generative parametric method, modeling P(y) and P(x|y) to infer P(y|x) by estimating π and τ from counts.
k-Nearest Neighbors (kNN)
k-Nearest Neighbors (kNN) is a simple non-parametric classifier. To classify input x: (1) compute distances to all training points, (2) find the k closest, (3) return the most common label. Distance metrics include Euclidean and Manhattan. 1-NN perfectly fits training data but is sensitive to outliers; k-NN smooths predictions using a majority vote. kNN has no training time but is expensive at test time and suffers from the curse of dimensionality—in high dimensions, all points are far apart.
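A minimal NumPy sketch of the three kNN steps (distances, k closest, majority vote); X_train, y_train, and k are placeholder inputs:

```python
import numpy as np
from collections import Counter

def knn_predict(x, X_train, y_train, k=3):
    """Classify x by majority vote among its k nearest training points (Euclidean distance)."""
    dists = np.linalg.norm(X_train - x, axis=1)   # distance to every training point
    nearest = np.argsort(dists)[:k]               # indices of the k closest points
    return Counter(y_train[nearest]).most_common(1)[0][0]
```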
Kernel Methods
Kernel Methods generalize both kNN and logistic regression by computing similarity in a high-dimensional space using a kernel function K(x, x′) = φ(x)ᵀφ(x′), where φ is implicit. Examples: polynomial kernel K(x, x′) = (xᵀx′ + c)ᵈ and Gaussian RBF kernel K(x, x′) = exp(−γ‖x − x′‖²). Kernel methods allow linear models to behave nonlinearly in the original space. The kernel trick lets you operate in high-dimensional feature space without explicitly computing φ(x).
Kernelized Logistic Regression
Kernelized Logistic Regression replaces w with αᵢ values per training point. Predict using a sum of kernel evaluations: P(y|x) = sigmoid(Σ αᵢ K(x, xᵢ)). Setting K(x, z) = xᵀz recovers standard logistic regression. More powerful kernels enable nonlinear classification.
Support Vector Machines (SVMs)
Support Vector Machines (SVMs) are binary classifiers that learn a hyperplane wᵀx + b = 0 with maximum margin between classes. They minimize ½‖w‖² subject to yᵢ(wᵀxᵢ + b) ≥ 1. Points with margin yᵢ(wᵀxᵢ + b) ≤ 1 (those on or inside the margin) are the support vectors; only these influence the decision boundary. Unlike logistic regression, SVMs have no probabilistic interpretation and use a different loss function (hinge loss). Hard-margin SVMs require separable data; soft-margin SVMs introduce slack to allow misclassifications.
Kernelized SVMs
Kernelized SVMs apply the kernel trick to learn nonlinear boundaries. SVMs work well in high dimensions and are robust to overfitting, but they are computationally expensive and sensitive to kernel and hyperparameter choices.
Non-Parametric Methods Summary
- Parametric = fixed-size models, learn parameters then discard data
- Non-parametric = store training data, grow with dataset
- kNN is simple, no training, sensitive to noise and dimensionality
- Kernel methods implicitly lift data to high-dimensions, enable flexible decision boundaries
- SVMs maximize margin, depend only on support vectors, and work well with kernels for classification
Backpropagation Algorithm
Backpropagation is a method for efficiently computing gradients in a computation graph, especially in neural networks. It applies the chain rule recursively, starting from the output and working backward to inputs. Used during training to update weights via gradient descent.
Computation Graphs
Computation Graphs represent computations as nodes (operations) and edges (data flow). Each node performs a function and stores intermediate values. The forward pass computes the output. The backward pass computes gradients with respect to each node’s inputs. The goal is to compute ∂L/∂w for all model parameters w.
The Chain Rule in Backpropagation
The Chain Rule is central: if z = f(w) and L depends on z, then ∂L/∂w = ∂L/∂z × ∂z/∂w. In the graph, each node passes its gradient to its inputs by multiplying its incoming gradient by its local partial derivative.
Backpropagation on Tree Structures
Assume the computation graph is a tree (no shared nodes). Start at the root (output), and initialize ∂output/∂output = 1. Propagate the gradient to children based on the type of node.
- Multiplication Node: If e = x·y, then ∂e/∂x = y and ∂e/∂y = x. The child node receives the parent’s gradient multiplied by the other child’s value.
- Addition Node: If e = x + y, then ∂e/∂x = 1 and ∂e/∂y = 1. The child node receives the parent’s gradient multiplied by 1.
- Leaf Nodes (inputs) receive the final gradients; no further propagation is needed.
Backpropagation Recipe
The overall Backpropagation Recipe:
- Perform forward pass to compute all node values.
- Initialize root gradient to 1.
- Recursively apply chain rule from root to leaves. Each node computes ∂(parent)/∂(child) and multiplies by parent’s gradient.
- Continue until all leaf gradients are computed.
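To make the recipe concrete, here is a hand-worked sketch (not from the notes) for the tiny graph e = (x + y)·z, applying the addition and multiplication rules above:

```python
# Forward pass: compute and store every node's value
x, y, z = 2.0, -1.0, 3.0
s = x + y          # addition node
e = s * z          # multiplication node (root/output)

# Backward pass: start with de/de = 1 and apply the chain rule toward the leaves
de_de = 1.0
de_ds = de_de * z          # multiplication node: parent's gradient × other child's value
de_dz = de_de * s
de_dx = de_ds * 1.0        # addition node: parent's gradient × 1
de_dy = de_ds * 1.0
print(de_dx, de_dy, de_dz)   # 3.0 3.0 1.0
```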
Multivariate Chain Rule
When a variable has multiple downstream paths (e.g., shared inputs), its total gradient is the sum of all contributions: ∂L/∂x = Σ (∂L/∂zᵢ × ∂zᵢ/∂x). This handles weight sharing.
Numerical Gradients
Numerical Gradients are an alternative method: ∂f/∂x ≈ [f(x + ε) − f(x)] / ε.
- Pros: Simple to implement, good for checking correctness.
- Cons: Inefficient (requires one forward pass per input).
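A small sketch of using this estimate as a gradient check; the central-difference variant used here is slightly more accurate than the one-sided formula above:

```python
import numpy as np

def numerical_grad(f, x, eps=1e-5):
    """Central-difference estimate of ∂f/∂xᵢ for each coordinate (one pair of passes per input)."""
    grad = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = eps
        grad[i] = (f(x + e) - f(x - e)) / (2 * eps)
    return grad

f = lambda x: np.sum(x ** 2)               # analytic gradient is 2x
x = np.array([1.0, -2.0, 0.5])
assert np.allclose(numerical_grad(f, x), 2 * x, atol=1e-4)
```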
Backpropagation Summary
- Backprop computes gradients efficiently using the chain rule
- Works on computation graphs
- Forward pass computes values
- Backward pass propagates gradients
- Each node knows how to compute its own derivative
- Used to train neural networks via gradient descent
Neural Network Optimization Techniques
Neural Network Optimization addresses the challenges of training non-convex models. Neural networks are flexible but prone to overfitting and difficult to optimize. Key components include initialization, learning rate, SGD variants, and regularization.
Stochastic Gradient Descent (SGD)
Stochastic Gradient Descent (SGD) approximates the gradient by computing it over a small batch of examples. One update is much faster than full gradient descent. Smaller batches are faster but noisier, while larger batches are slower but more accurate. Each full pass through the dataset is called an epoch.
Weight Initialization Strategies
Initialization matters because neural networks are non-convex. Bad initialization (e.g., all zeros) leads to symmetry—all neurons behave identically. Fix this by initializing weights to small random values. Use He initialization (variance = 2/d_in) or Xavier initialization (variance = 1/d_in). This avoids exploding or vanishing gradients.
Learning Rate and Schedules
Learning Rate is critical. If too small, convergence is slow or stuck in flat regions. If too large, training becomes unstable. Use learning rate schedules—start high and decay over time. This allows the model to take large steps early and small steps later.
Momentum Optimization
Momentum smooths updates by accumulating a velocity over past gradients, which helps carry the optimizer through ravines and flat regions. The update uses: v = β₁v + (1−β₁)grad, then w = w − ηv. It accelerates progress along consistent directions and damps oscillations.
RMSProp Optimizer
RMSProp adapts the learning rate per parameter: s = β₂s + (1−β₂)(grad²), then w = w − η·grad / (√s + ε). This reduces the learning rate in directions with consistently large gradients.
Adam Optimizer
Adam combines momentum and RMSProp: v = β₁v + (1−β₁)grad, s = β₂s + (1−β₂)(grad²), and the update is w = w − η·v / (√s + ε). Adam is widely used and robust to hyperparameter tuning.
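A minimal NumPy sketch of the Adam update exactly as written above (bias correction, which most implementations also apply, is omitted); w, grad, and state are placeholders:

```python
import numpy as np

def adam_step(w, grad, state, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step following the update above (no bias correction)."""
    state["v"] = beta1 * state["v"] + (1 - beta1) * grad          # momentum term (first moment)
    state["s"] = beta2 * state["s"] + (1 - beta2) * grad ** 2     # RMSProp term (second moment)
    return w - lr * state["v"] / (np.sqrt(state["s"]) + eps)

w = np.zeros(3)
state = {"v": np.zeros(3), "s": np.zeros(3)}
w = adam_step(w, grad=np.array([0.1, -0.2, 0.3]), state=state)
```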
Regularization for Neural Networks
Regularization Techniques include weight decay, early stopping, and dropout.
Weight Decay (L2 Regularization)
Weight Decay (L2 Regularization) adds λ‖w‖² to the loss function. This shrinks weights by a factor of (1 − ηλ) during updates. It helps prevent overfitting by penalizing large parameters.
Early Stopping
Early Stopping monitors validation loss during training. If it starts rising, training is stopped, and the best-performing checkpoint is returned. This limits how far parameters can drift from initialization, effectively constraining the model.
Dropout Regularization
Dropout randomly zeroes out hidden neurons during training, each with probability p. To keep the expected output unchanged, scale the remaining activations by 1/(1 − p). Prevents co-adaptation—neurons must be useful independently. At test time, the full network is used without dropout.
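A minimal sketch of inverted dropout under this convention (drop probability p, rescale by 1/(1 − p)):

```python
import numpy as np

def dropout_forward(h, p=0.5, training=True):
    """Inverted dropout: drop each unit with probability p and rescale the survivors by 1/(1-p)."""
    if not training:
        return h                               # full network at test time, no extra scaling
    mask = (np.random.rand(*h.shape) >= p)     # keep each unit with probability 1-p
    return h * mask / (1 - p)
```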
Dropout as an Ensemble Method
Each dropout pattern represents a sub-network. Training with dropout averages over many such models, similar to an ensemble. Improves generalization and reduces overfitting.
Dropout and Naive Bayes Generalization
At the input layer, dropout reduces to learning P(y|xⱼ) for individual features, similar to Naive Bayes. At test time, all features are used, combining predictions multiplicatively.
Neural Network Optimization Summary
- SGD enables efficient training
- Initialization breaks symmetry and stabilizes training
- Learning rate and schedule are crucial
- Momentum and RMSProp improve convergence
- Adam combines both
- Regularization via weight decay, early stopping, and dropout controls overfitting and enhances generalization
Convolutional Neural Networks (CNNs)
Convolutional Neural Networks (CNNs) are deep learning models designed for images and other spatially structured inputs. They extract features hierarchically—from edges to shapes to objects—by stacking layers of convolution, nonlinearity, and pooling.
Convolution Operation
Convolution is a linear operation where a small kernel (e.g., 3×3) slides over an input matrix and computes dot products. Output size is (W−K+1)×(H−K+1). Convolutional layers learn kernels that detect specific patterns. They are parameter-efficient and exploit spatial locality.
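A minimal NumPy sketch of a single-channel valid convolution matching the output-size formula above; edge_filter is just an illustrative hand-crafted kernel:

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide a K×K kernel over a W×H image; output is (W-K+1)×(H-K+1)."""
    W, H = image.shape
    K = kernel.shape[0]
    out = np.zeros((W - K + 1, H - K + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + K, j:j + K] * kernel)   # dot product with the patch
    return out

edge_filter = np.array([[1., 0., -1.], [2., 0., -2.], [1., 0., -1.]])  # Sobel-like vertical-edge kernel
```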
Local Receptive Fields
Local Receptive Fields ensure each neuron focuses on a small region of the input. This mimics how vision systems detect small parts before combining them into larger structures.
Weight Sharing in CNNs
Weight Sharing allows the same kernel to detect the same feature in different image locations. This gives CNNs translation equivariance—if the input shifts, the output shifts similarly.
Conv Layers vs. Fully Connected Layers
Convolutional layers have far fewer parameters (e.g., 10 vs. 372) and require less data to train. Fully connected layers must relearn the same feature separately at every location; CNNs reuse one filter across all locations.
Input and Output Channels
Color images have 3 channels (RGB), so kernels have the shape Cin×K×K. To learn multiple features, use Cout output channels—the kernel tensor is Cout×Cin×K×K.
Stride and Padding
Stride determines how far the kernel moves each step; a stride greater than 1 reduces spatial resolution. Padding (e.g., zero-padding) keeps the output size consistent. Without it, width and height shrink with each layer.
Convolutional Layers and Activation
Convolutional Layers are typically followed by a ReLU nonlinearity. A convolution by itself is a constrained linear layer (local connectivity plus weight sharing); the ReLU supplies the nonlinearity. CNNs stack multiple Convolution + ReLU + Pool blocks to learn complex, high-level representations.
Pooling Layers (Max and Average)
Pooling (Max or Average) reduces resolution and increases receptive field. Max Pooling picks the maximum value in a small patch (e.g., 2×2). Pooling helps retain dominant features while discarding spatial noise.
Final Layers in CNNs
After several convolutional layers, CNNs use fully connected layers for global reasoning. First, flatten the output to a vector, then apply MLP layers to produce predictions.
Learning Features with CNNs
CNN weights (in convolutional and fully connected layers) are updated via backpropagation. Gradients indicate how increasing each weight would change the loss. Convolutional filters learn to activate strongly for useful features (e.g., edges, textures, objects).
CNN Feature Visualization
Visualization: Early filters learn color and edge detectors. Deeper layers capture high-level concepts (e.g., faces, animals, text).
Image Classification with CNNs
Image Classification: CNNs achieve strong performance on datasets like ImageNet. AlexNet (2012) marked a breakthrough in deep learning by winning ImageNet with a CNN.
Object Detection
Object Detection: Classify and locate objects in an image. One approach is to generate region proposals and then classify each region.
Semantic Segmentation
Semantic Segmentation: Assign a label to every pixel. Encoder-decoder models like U-Net perform downsampling (convolution + pooling), then upsampling (convolution + upsampling) to match the original image size.
Diffusion Models
Diffusion Models: Train CNNs to reverse noise. Start with clean images, then add noise to the training data. Train the model to predict the denoised version. At test time, start from pure noise and iteratively denoise using a CNN. It can also be conditioned on captions. The architecture often uses U-Net and per-pixel regression loss.
CNNs Summary
- CNNs extract features from local to global
- Use convolution, weight sharing, and pooling
- Learn complex representations layer by layer
- Reduce parameters vs. FC layers
- Used for classification, detection, segmentation, and generation
- Backprop tunes kernels to detect meaningful visual patterns
Word Vectors and Embeddings
Word Vectors represent each word w with a vector v_w ∈ ℝᵈ that encodes its meaning. Similar words have similar vectors. Components may capture semantics such as category, sentiment, or usage. This allows neural networks to process language numerically.
The Distributional Hypothesis
Distributional Hypothesis: Words that appear in similar contexts have similar meanings. For example, if “ongchoi” occurs in similar sentences as “spinach,” their vectors should be similar. This motivates learning word vectors by predicting context words.
Word2Vec Model
Word2Vec builds word vectors using a prediction task: For each word w, predict which words co-occur with it in a fixed-size sliding window. Generate positive examples (real co-occurrences) and negative ones (random words). Each prediction is a binary classification task: Did this word pair co-occur?
Word2Vec Training Objective
For each word w, learn a word vector v_w and a context vector c_w. Use a logistic regression loss for each (w, w′) pair: positive pairs should score high, and negative pairs should score low. The objective is non-convex because both v and c are learned. Gradient descent is used for optimization. Gradients are symmetric: update v when w is the input word, and update c when w is a context word.
Negative Sampling
Negative Sampling: Rather than computing a full softmax, randomly sample negative context words. Frequency-adjusted sampling (e.g., using P(w)⁰·⁷⁵) downweights common words like “the.” A typical ratio is 2 negatives per positive.
Word vs. Context Vectors
Both are learned during training, but only word vectors v are kept afterward. Context vectors c serve as classifier weights and are discarded.
Analogy Reasoning with Word Vectors
Analogy Reasoning: Vector arithmetic can capture relationships: v_vine ≈ v_tree − v_apple + v_grape. Given three words, analogies can be solved by computing a target vector and searching for the closest word (by cosine similarity). Analogies capture directions like gender (man → woman), grammar (fast → fastest), and hierarchy (animal → dog).
Bias in Word Vectors
Since Word2Vec learns from raw data, it picks up societal biases. For example, v_programmer − v_man + v_woman ≈ v_homemaker reflects historical stereotypes in the data, not semantic relationships. Embedding bias correlates with real-world demographic distributions. Machine learning models absorb whatever associations exist in the data.
Word Vector Embedding Layer
A layer that maps each word w to its vector v_w. It stores a lookup table of parameters (|V| × d). In PyTorch, it is implemented as nn.Embedding(). These embeddings are typically trained further in downstream tasks.
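A small usage sketch (the vocabulary size and dimension are placeholders):

```python
import torch
import torch.nn as nn

vocab_size, d = 10_000, 300              # |V| × d lookup table of trainable parameters
embedding = nn.Embedding(vocab_size, d)

token_ids = torch.tensor([12, 7, 512])   # word indices for a 3-token input
vectors = embedding(token_ids)           # shape (3, 300); one row per token, updated by backprop
```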
Word Vectors Summary
- Learn word vectors by predicting co-occurrence
- Optimize with gradient descent on binary classification tasks
- Negative sampling improves efficiency
- Vector arithmetic enables analogy solving
- Word2Vec vectors reflect both semantics and bias
- Only word vectors (not context vectors) are retained for use
Recurrent Neural Networks (RNNs)
Recurrent Neural Networks (RNNs) handle sequential inputs like text, where input length varies. Instead of using a fixed-size feature vector, RNNs process inputs one token at a time using a shared set of parameters. They are capable of encoding arbitrary-length sequences into fixed-size hidden states.
Word Vectors as RNN Input
Word Vectors convert each word w to a vector v_w ∈ ℝᵈ. These vectors are fed to the RNN one at a time. Inputs x₁, …, x_T are the vectors for words w₁, …, w_T.
Elman RNN Architecture
An Elman RNN maintains a hidden state hₜ updated at each step using the previous state hₜ₋₁ and current input xₜ: hₜ = tanh(Wₓxₜ + Wₕhₜ₋₁ + b). Parameters are shared across timesteps: Wₓ ∈ ℝʰˣᵈ, Wₕ ∈ ℝʰˣʰ, and b ∈ ℝʰ. The hidden state is initialized as h₀. The RNN layer transforms a T × d input into a T × h sequence of hidden states.
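A minimal NumPy sketch of the Elman recurrence above, with Wx of shape h × d and Wh of shape h × h:

```python
import numpy as np

def elman_forward(X, Wx, Wh, b):
    """Run an Elman RNN over a T×d input and return the T×h sequence of hidden states."""
    T, h = X.shape[0], Wh.shape[0]
    H = np.zeros((T, h))
    h_prev = np.zeros(h)                                # h₀
    for t in range(T):
        h_prev = np.tanh(Wx @ X[t] + Wh @ h_prev + b)   # same parameters at every timestep
        H[t] = h_prev
    return H
```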
RNN Training with BPTT
Training uses Backpropagation Through Time (BPTT). The loss from later timesteps is propagated through each previous hidden state, all the way to earlier word vectors. Gradients are computed using the chain rule across all timesteps, which makes RNN training sensitive to sequence length.
Language Modeling with RNNs
Language Modeling: Given a prefix, the model predicts the next word at each step. RNNs act as autoregressive models: each word’s prediction depends on the current hidden state, which encodes all previous words. At test time, text is generated by feeding in the model’s own predictions. Use greedy decoding (selecting the most likely word) or sampling from the output distribution.
Long-Range Dependencies Challenge
Long-Range Dependencies: Because hidden states are repeatedly updated with new inputs, information from earlier words is easily lost. This makes it hard to capture relationships across many timesteps, for example, subject-verb agreement or coreference across sentences.
Vanishing and Exploding Gradients
Vanishing & Exploding Gradients: As the gradient is propagated backward, it is repeatedly multiplied by Wₕ. If ‖Wₕ‖ < 1, the gradient shrinks and vanishes. If ‖Wₕ‖ > 1, the gradient grows and explodes. Both conditions make training unstable.
Gradient Solutions for RNNs
Solutions: Gradient clipping limits the gradient magnitude to prevent explosion. Vanishing gradients can be avoided using additive recurrence. Instead of purely multiplicative updates, additive updates preserve information better.
Gated Recurrent Networks (GRUs, LSTMs)
Gated RNNs address vanishing gradients by introducing gates.
Gated Recurrent Unit (GRU)
GRUs use update gates z ∈ [0,1] to control how much information to keep versus replace. When zᵢ = 1, neuron i is updated with the candidate value; when zᵢ = 0, the previous value is preserved. GRUs apply element-wise interpolation between the previous and candidate hidden states. Parameters: W_z, W_r, W.
Long Short-Term Memory (LSTM)
LSTMs separate the hidden state hₜ and the cell state cₜ. The cell state is updated additively and passed through forget, input, and output gates. The architecture retains long-term memory better and is widely used in NLP.
RNNs Summary
- RNNs model sequences with shared parameters across time
- Use word vectors as input
- Train with BPTT
- Can predict next word or encode full sequence
- Struggle with long-range dependencies due to vanishing gradients
- Gated RNNs (GRUs, LSTMs) mitigate this by using additive updates and gates to preserve information
Decision Trees and Ensemble Methods
Decision Tree Prediction Process
Decision Tree Prediction: At each node, split on one feature. At each leaf, store predicted output (classification = majority label, regression = mean value). Given new input, walk the tree to determine the output. Very interpretable.
Learning Regression Trees
Learning Trees (Regression): At each node, choose the best feature and split threshold. Try all features and split points. Choose the split minimizing prediction error. Predict the mean in each region and compute total squared error.
Decision Tree Stopping Criteria
Stopping Criteria: Splitting to leaf size 1 overfits. Use min samples, max depth, or pruning.
Learning Classification Trees
Learning Trees (Classification): Predict the majority label at each node. Use split criteria like majority class accuracy or Gini Impurity. Gini = ∑_c p(c)(1 − p(c)) = ∑_{c ≠ c′} p(c)·p(c′). Lower Gini = purer split. Split score = weighted average of child impurities.
Handling Missing Features in Trees
Handling Missing Features: Use surrogate variables—features that correlate with the chosen feature at each node. During test, if a feature is missing, use a surrogate to decide the path.
Ensemble Methods Overview
Ensembling: Combine predictions of multiple models to improve accuracy. Final prediction = average (regression) or majority vote (classification). Reduces overfitting, improves generalization.
Tree Ensemble Techniques
Tree Ensembles: Shallow trees capture weak patterns. Ensemble aggregates more info across trees.
Bagging (Bootstrap Aggregating)
Bagging: Train each tree on a bootstrap sample (sample with replacement). Each tree sees about 63% of the unique training examples. This introduces diversity between trees, and averaging them reduces the ensemble's variance.
Random Forests
Random Forests: Add randomness at the split level—only consider a random subset of features (e.g., √d) when splitting. Further reduces tree correlation, boosts diversity and performance.
Dropout as an Ensemble
Dropout as Ensemble: In neural nets, train different subnetworks each iteration by randomly dropping units. Test time: use the full net → output ≈ average of subnetworks. Shared weights, but ensemble-like effect.
Neural Network Ensembles
Neural Net Ensembles: Combine multiple trained networks. Stochastic training (SGD, init, dropout) already gives variation. Still, explicit ensembles (e.g., GPT-4) can improve performance at extra compute cost.
Vanilla RNNs and Gradient Issues
Vanilla RNN (Elman): At each timestep, an RNN receives the previous hidden state hₜ₋₁ and input vector xₜ. It computes hₜ = tanh(Wₓxₜ + Wₕhₜ₋₁ + b) with shared parameters across time. It struggles with long-range dependencies due to repeated multiplication of Wₕ, causing vanishing gradients over time.
The Vanishing Gradient Problem
Vanishing Gradient Problem: Gradients shrink as they are backpropagated through time. If ‖Wₕ‖ < 1, then (Wₕ)ᵗ → 0 quickly. This limits the ability to learn dependencies across distant timesteps.
Avoiding Vanishing Gradients
Avoiding Vanishing Gradients: Replace multiplicative recurrence with additive updates to allow gradients to accumulate rather than vanish. Purely additive recurrence preserves information but lacks flexibility.
Gated Recurrent Networks
Gated RNNs: Introduce gates to control the flow of information and selectively forget or preserve parts of the hidden state. Gates output values in [0, 1] using sigmoid functions and apply element-wise multiplication.
Gated Recurrent Unit (GRU)
GRU (Gated Recurrent Unit): A simplified gated RNN. It uses an update gate zₜ to control whether to keep or replace each component of hₜ₋₁ with a candidate update ~hₜ. If zₜᵢ = 1: replace neuron i with a new value. If zₜᵢ = 0: retain neuron i from the previous timestep. Parameters: W_z, W_r, W.
Long Short-Term Memory (LSTM)
LSTM (Long Short-Term Memory): A more expressive gated RNN. It maintains two states: cell state cₜ (long-term memory) and hidden state hₜ (output). Gates: forget gate fₜ, input gate iₜ, output gate oₜ control updates to the cell and hidden state. Cell state updated via cₜ = fₜ * cₜ₋₁ + iₜ * ~cₜ (additive). Output: hₜ = oₜ * tanh(cₜ).
Sequence-to-Sequence Tasks
Seq-to-seq Tasks: Tasks where input and output are both sequences (e.g., translation, summarization, or command-to-action). Two RNNs are used: an encoder processes the input into a feature vector, and a decoder generates the output one token at a time.
Encoder-Decoder Model Architecture
Encoder-Decoder Model: The encoder summarizes the input into a hidden/cell state vector, which is then used to initialize the decoder. The decoder predicts the next token conditioned on the previous hidden state and the previous token output. This approach is autoregressive and predicts an [END] token to stop generation.
Bidirectional RNNs
Bidirectional RNNs: Encodings are improved by running one RNN left-to-right and one right-to-left. The final encoding is a concatenation of the forward and backward final states. This doubles the feature dimension.
Attention Mechanism
Attention Mechanism: It solves the bottleneck of relying on a single encoder vector to store the entire input. It allows the decoder to attend to relevant encoder states at each generation step. Compute attention scores between the decoder state and each encoder state (e.g., using a dot product). Apply softmax to the scores to get weights αᵢ over encoder positions. Compute the context vector cₜ = ∑ᵢ αᵢ · encoder_stateᵢ as a weighted average. There are no trainable parameters in its basic form.
Luong Attention (2015)
Luong Attention (2015): A dot product is computed between the decoder hidden state hₜ and each encoder state, which is then passed through a softmax to get attention weights, resulting in the context vector cₜ. Combine cₜ and hₜ, then pass them through an MLP to produce a new hidden state used for output prediction.
General Attention Formulation
General Attention Formulation: Inputs include a query vector q, key vectors k₁,…,kₜ, and value vectors v₁,…,vₜ. The output is a weighted sum of values, where weights are derived from softmax(dot(q, kₜ)). In sequence-to-sequence RNNs: q is the decoder hidden state, kᵢ are the encoder hidden states, and vᵢ are the encoder hidden states.
Attention as Information Retrieval
Attention as Retrieval: Think of the decoder hidden state as the query, encoder states as the keys, and the representations passed back as the values—just like a search engine matching queries to documents.
Conclusion on RNNs and Attention
Conclusion: GRUs/LSTMs mitigate vanishing gradients using gates and additive updates. RNNs are used as encoders to produce feature vectors or as decoders to generate sequences. Sequence-to-sequence models use both; attention improves them by dynamically aligning outputs to relevant inputs.
Transformer Neural Network Architecture
Transformer Overview: Input is a sequence of T word embeddings of size d, forming a T × d matrix. Apply alternating multi-headed attention and feedforward layers, both transforming T × d input to T × d output.
Transformer Embedding Layer
Embedding Layer: Learns word vectors plus positional embeddings (added to encode word order).
Transformer Feedforward Layer
Feedforward Layer: An MLP is applied separately to each token, typically structured as Linear(d → d_hidden) → ReLU → Linear(d_hidden → d).
Motivation for Attention
Attention Motivation: Replaces recurrence with attention over the full sequence, allowing parallelism and long-range dependencies without sequential bottlenecks.
Self-Attention Mechanism
Self-Attention: For an input matrix X ∈ ℝ^{T×d}, apply three learned linear projections: Q = XW_Q, K = XW_K, V = XW_V (where W matrices ∈ ℝ^{d×d_attn}). Compute dot products QKᵀ ∈ ℝ^{T×T}, then apply softmax row-wise to obtain attention weights. Multiply by V: The output is softmax(QKᵀ)V. Each output token attends to all tokens (including itself) in the input sequence.
Multi-Headed Attention
Multi-Headed Attention: Use n heads, each with its own W_Q, W_K, W_V mapping d → d/n. Perform attention independently for each head, resulting in an output of T × d/n. Concatenate all head outputs to get the final T × d matrix. This allows each head to learn different types of relationships (e.g., syntax, coreference, position).
Overall Transformer Architecture
Transformer Architecture: Stack L alternating multi-head attention and feedforward layers with residual connections and Layer Normalization after each sublayer. The final output is a T × d matrix, with one vector per input token.
Residual Connections
Residual Connections: Let the layer input be X and the output be O; the final output is X + O. This helps gradient flow and avoids vanishing gradients. It is also common in CNNs.
Layer Normalization
Layer Normalization: For each token vector x ∈ ℝ^d, compute the mean μ and standard deviation σ. Then, normalize to (x − μ) / (σ + ε), and finally rescale with learnable parameters a and b: y = a * normalized + b. It is applied around the attention and feedforward sublayers for stability (after each sublayer in the original Transformer, before each sublayer in many modern variants).
Positional Embeddings
Positional Embeddings: Since attention is order-invariant, positional information is added by learning a vector per position. The final input to the Transformer is the word embedding plus the positional embedding.
Byte-Pair Encoding (BPE)
Byte-Pair Encoding (BPE): Tokenizes words into subword units to handle rare or unknown words. Common words become one token, while rare words become multiple subwords. This reduces vocabulary size and increases generalization.
Scaled Dot Product Attention
Scaled Dot Product Attention: Instead of raw dot products QKᵀ, scale by √d_attn so that large dot products do not saturate the softmax (pushing weights to nearly 0 or 1). Formula: softmax(QKᵀ / √d_attn) · V.
Autoregressive Decoding
Autoregressive Decoding: During generation, the decoder predicts one token at a time using previous outputs. Apply causal masking: mask QKᵀ scores where the query index is less than the key index with -∞, so softmax ignores future tokens. This allows parallel training but enforces an autoregressive constraint.
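A minimal NumPy sketch combining the scaled dot-product formula with causal masking; Q, K, V are assumed to be T × d_attn matrices for a single head:

```python
import numpy as np

def masked_attention(Q, K, V, causal=True):
    """Scaled dot-product attention with optional causal masking."""
    d_attn = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_attn)                    # T×T similarity scores
    if causal:
        T = scores.shape[0]
        scores[np.triu_indices(T, k=1)] = -np.inf         # query i cannot attend to keys j > i
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)         # row-wise softmax
    return weights @ V
```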
Transformer Encoder-Decoder Model
Transformer as Encoder-Decoder: The encoder maps the input sequence to contextual representations. The decoder generates outputs one-by-one, using encoder outputs and previously generated tokens.
Masked Language Modeling (MLM)
Masked Language Modeling (MLM): A pretraining task for BERT, where random tokens are masked, and the model is trained to predict them. It trains on large unlabeled corpora (e.g., internet data) with massive compute resources.
Fine-Tuning Transformers
Fine-Tuning: Pretrain on MLM, then add a classifier head, and finally fine-tune on a downstream task (e.g., sentiment analysis, QA, entailment) using all model parameters. This is very effective even on small datasets.
Vision Transformers (ViT)
Vision Transformers: Split an image into patches (e.g., 16×16), flatten them, and embed them as a sequence of vectors. Add a CLS token and positional embeddings, then feed them to a standard Transformer. The final CLS vector is used for classification.
CNNs vs. Vision Transformers
CNN vs Vision Transformer: CNNs use fixed-size receptive fields and strong spatial biases. Vision Transformers allow each patch to attend globally, exhibiting weaker inductive bias but better scaling with data.
Training Transformers
Training Transformers: Use large datasets, GPUs/TPUs, a parallelism-friendly architecture, and pretraining plus fine-tuning to achieve state-of-the-art results in NLP and beyond.
Unsupervised Learning Fundamentals
Unsupervised Learning: There are no labels y; the dataset consists only of inputs {x(1), …, x(n)}. The goal is to learn the structure of data without supervision. Types include: Clustering (grouping data points), Dimensionality Reduction (finding a low-dimensional subspace), and Embeddings (learning vectors for structured objects).
Clustering Algorithms
Clustering: Assigns each data point x(i) to a cluster index zᵢ ∈ {1, …, k}. The output is an assignment vector z₁:ₙ where each cluster j is a set of points with zᵢ = j. The number of clusters k is a hyperparameter and must be chosen beforehand (or heuristically).
k-Means Clustering Objective
k-Means Clustering (Objective): Represents each cluster with a centroid µⱼ. The loss is L = ∑₍ᵢ₌₁₎ⁿ ∥x(i) − µ_{zᵢ}∥², aiming to minimize the distance from each point to its cluster center. Also called reconstruction error—how well can we represent each x(i) using just the centroid µ_{zᵢ}?
k-Means Algorithm (Alternating Minimization)
k-Means Algorithm (Alternating Minimization): Gradient descent cannot be used because zᵢ are discrete. Instead, alternate two steps until convergence:
- Step 1 (Update zᵢ): Assign each point to the nearest centroid, so zᵢ = argminⱼ ∥x(i) − µⱼ∥².
- Step 2 (Update µⱼ): Recompute each centroid as the mean of assigned points, so µⱼ = average of {x(i) | zᵢ = j}.
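A minimal NumPy sketch of this alternating minimization, including the random-data-point initialization described in the next paragraph; n_iters and seed are placeholders:

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Alternating minimization for k-means: assign points to centroids, then recompute centroids."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=k, replace=False)]                 # init: k random data points
    for _ in range(n_iters):
        dists = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)  # n×k distances
        z = dists.argmin(axis=1)                                        # Step 1: nearest centroid
        new_mu = np.array([X[z == j].mean(axis=0) if np.any(z == j) else mu[j]
                           for j in range(k)])                          # Step 2: mean of assigned points
        if np.allclose(new_mu, mu):                                     # converged: assignments stable
            break
        mu = new_mu
    return z, mu
```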
k-Means Initialization
Initialization: Start with random µⱼ by choosing k random data points. It converges when z₁:ₙ and µ₁:k stop changing. It is not guaranteed to find the global minimum; it may converge to a local minimum. Therefore, use multiple random restarts and pick the clustering with the lowest loss.
k-Means vs. Classification
Comparison to Classification: Superficially similar—the output is a label from {1,…,k} and depends on learned parameters µ₁:k. But no true labels are known in clustering, and we only care about dividing up the training data—no generalization to new test data is typically considered.
Choosing the Number of Clusters (k)
Choosing k (Number of Clusters): More clusters always reduce the loss, so the loss alone cannot be used to pick k. Use the elbow criterion: plot loss versus k and look for the point where the curve flattens (the “elbow”). Pick k where adding more clusters yields diminishing returns on loss reduction.
K-Means Limitations and Gaussian Mixture Models
k-Means Limitations: It assumes all clusters are spherical, equal in size, and defined by a single centroid. It cannot model clusters with differing shapes, densities, or orientations.
Gaussian Mixture Models (GMM)
Gaussian Mixture Model (GMM): A probabilistic clustering method where each cluster is defined by a multivariate Gaussian with its own mean µⱼ, covariance Σⱼ, and mixing weight πⱼ. It allows soft assignments of points to clusters based on likelihood.
Multivariate Gaussian Distribution
Multivariate Gaussian: A generalization of the univariate Gaussian to a vector x ∈ ℝᵈ. It is parameterized by a mean vector µ ∈ ℝᵈ and a covariance matrix Σ ∈ ℝᵈˣᵈ. Covariance captures variance (diagonal entries) and correlation (off-diagonal entries) between variables. The probability density function is: P(x) = (1 / (2π)^{d/2} √det(Σ)) · exp(−½(x−µ)ᵀΣ⁻¹(x−µ)).
GMM Generative Process
GMM Generative Process:
- For each point, sample Zᵢ ∈ {1,…,k} from a categorical distribution with probabilities π₁:ₖ.
- Given Zᵢ = j, sample xᵢ ~ 𝒩(µⱼ, Σⱼ).
Zᵢ are latent variables (i.e., not observed).
GMM Parameters
GMM Parameters:
- π₁:ₖ = mixture weights, πⱼ = P(Zᵢ = j), ∑ πⱼ = 1
- µ₁:ₖ = cluster means
- Σ₁:ₖ = cluster covariances
Soft Inference (E-step)
Soft Inference (E-step): Compute the posterior P(Zᵢ = j | xᵢ) using Bayes’ Rule: P(Zᵢ = j | xᵢ) = (πⱼ · 𝒩(xᵢ; µⱼ, Σⱼ)) / ∑_{b=1}^k (π_b · 𝒩(xᵢ; µ_b, Σ_b)). This gives a soft assignment matrix R ∈ ℝⁿˣᵏ where Rᵢⱼ = P(Zᵢ = j | xᵢ).
Hard Assignment in GMMs
Hard Assignment: zᵢ = argmaxⱼ P(Zᵢ = j | xᵢ), which is the cluster with the highest posterior probability.
EM Algorithm for GMMs
EM Algorithm for GMMs: It iteratively improves parameters and soft assignments.
- E-step: Compute Rᵢⱼ = P(Zᵢ = j | xᵢ; current π, µ, Σ)
- M-step: Update π, µ, Σ using weighted averages with Rᵢⱼ as soft counts
M-step Updates
M-step Updates:
- µⱼ = ∑ Rᵢⱼ xᵢ / ∑ Rᵢⱼ
- πⱼ = ∑ Rᵢⱼ / n
- Σⱼ = ∑ Rᵢⱼ (xᵢ − µⱼ)(xᵢ − µⱼ)ᵀ / ∑ Rᵢⱼ
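A sketch of one full EM iteration combining the E-step with these M-step updates; it assumes SciPy's multivariate normal density is available and that pi, mu, and Sigma are arrays of the shapes described above:

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_step(X, pi, mu, Sigma):
    """One EM iteration for a GMM: E-step computes soft assignments R, M-step re-estimates parameters."""
    n, k = X.shape[0], len(pi)
    R = np.column_stack([pi[j] * multivariate_normal.pdf(X, mean=mu[j], cov=Sigma[j])
                         for j in range(k)])
    R /= R.sum(axis=1, keepdims=True)            # E-step: Rᵢⱼ = P(Zᵢ = j | xᵢ)
    Nj = R.sum(axis=0)                           # soft counts per cluster
    pi = Nj / n
    mu = (R.T @ X) / Nj[:, None]
    Sigma = np.array([(R[:, j, None] * (X - mu[j])).T @ (X - mu[j]) / Nj[j] for j in range(k)])
    return pi, mu, Sigma, R
```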
Expected Complete Log-Likelihood (ECLL)
ECLL (Expected Complete Log-Likelihood): The objective function for EM is ECLL = ∑₍ᵢ₌₁₎ⁿ ∑₍ⱼ₌₁₎ᵏ Rᵢⱼ log P(xᵢ, Zᵢ = j). Use current Rᵢⱼ values to compute the expectation over latent variables and maximize the expected log-likelihood.
k-Means vs. GMMs Comparison
- k-Means = hard assignment, minimizes ∑ ∥xᵢ − µ_{zᵢ}∥²
- GMMs = soft assignment, probabilistic model with Gaussian components
- EM for GMM = soft version of k-Means (inference + parameter update)
- k-Means = special case of GMM with fixed spherical covariance and hard assignments
Dimensionality Reduction Techniques
Dimensionality Reduction aims to find a low-dimensional subspace that best represents a high-dimensional dataset. Unlike clustering, which identifies subgroups, dimensionality reduction identifies structure across continuous directions. It is often used for visualization, compression, and preprocessing of high-dimensional data like images or genomes. For example, projecting genetic data (600,000-dimensional vectors) to 2D space via PCA reveals patterns corresponding to European geography.
Principal Component Analysis (PCA)
Principal Component Analysis (PCA) is the most widely used method for dimensionality reduction and finds orthogonal directions (principal components) that capture the most variance in the data. We assume the data has been mean-centered (i.e., ∑ x(i) = 0) so that the origin reflects the center of the data cloud.
Reconstruction Loss in PCA
Reconstruction Loss is defined as the squared Euclidean distance between the original point x(i) and its projection (wᵀx(i))w onto a unit direction vector w. The total loss L(w) = ∑ ∥x(i) − (wᵀx(i))w∥², where w is constrained to have ∥w∥ = 1. The projection of x onto w is given by (wᵀx)w, which is a scalar multiple of w. Our goal is to find the w that minimizes this loss over the dataset.
Equivalent Objective: Maximizing Variance
Equivalent Objective: Minimizing reconstruction loss is equivalent to maximizing the squared projection length (wᵀx(i))², since ∥x(i)∥² = (wᵀx(i))² + ∥x(i) − (wᵀx(i))w∥² by the Pythagorean Theorem. Therefore, minimizing loss is equivalent to maximizing ∑ (wᵀx(i))², which means maximizing the variance of projections onto w.
Variance Formulation and Covariance
Variance Formulation: The variance objective becomes wᵀΣw, where Σ = (1/n) ∑ x(i)x(i)ᵀ is the empirical covariance matrix of the dataset. Our goal is to find the unit vector w that maximizes wᵀΣw.
Eigendecomposition for PCA
Eigendecomposition: Since Σ is symmetric, it can be decomposed as Σ = UDUᵀ where U is an orthonormal matrix of eigenvectors and D is a diagonal matrix of eigenvalues λ₁ ≥ λ₂ ≥ … ≥ λ_d. Letting a = Uᵀw and knowing ∥a∥ = ∥w∥ = 1, we rewrite the variance objective as aᵀDa = ∑ λ_j a_j², which is maximized by setting a₁ = 1 and all other a_j = 0. Thus, w = u₁, which is the eigenvector corresponding to the largest eigenvalue of Σ.
Principal Components
Principal Components: The top eigenvector defines the first principal component, which is the direction capturing the most variance in the dataset. To reduce dimensionality to k dimensions, project onto the top k eigenvectors (those corresponding to the top k eigenvalues). The resulting k-dimensional projection preserves as much variance as possible in k dimensions.
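A minimal NumPy sketch of PCA as described: mean-center, form the covariance matrix, take the top-k eigenvectors, and project:

```python
import numpy as np

def pca(X, k):
    """Project data onto the top-k eigenvectors of the empirical covariance matrix."""
    Xc = X - X.mean(axis=0)                          # mean-center so the origin is the data center
    Sigma = (Xc.T @ Xc) / len(Xc)                    # empirical covariance (d×d)
    eigvals, eigvecs = np.linalg.eigh(Sigma)         # eigh: symmetric matrix, ascending eigenvalues
    top = eigvecs[:, np.argsort(eigvals)[::-1][:k]]  # top-k principal components (d×k)
    return Xc @ top                                  # n×k projection preserving maximal variance
```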
PCA Summary
- PCA identifies directions (principal components) that preserve maximal variance when projecting the data
- Each principal component is an eigenvector of the covariance matrix Σ
- The amount of variance explained by each component is proportional to its associated eigenvalue
- PCA is solved in closed form using eigendecomposition, requiring only linear algebra and no iterative optimization
Multi-Armed Bandit Problem
The Bandit Setting is a simplified form of reinforcement learning where there is no state, only actions and rewards. At each timestep t = 1,…,T, the agent chooses an action Aₜ ∈ {1,…,k} and receives a reward Rₜ drawn from an unknown distribution Pₐ(R) associated with action a. The agent’s goal is to maximize the total reward ∑ₜ₌₁ᵀ Rₜ without knowing the reward distributions in advance.
Exploration vs. Exploitation Tradeoff
Exploration vs Exploitation is the fundamental tradeoff: exploration involves trying unknown actions to learn their reward potential, while exploitation involves picking the action currently estimated to be best. Any good bandit algorithm must balance these two forces: explore enough to discover the best arm, and exploit enough to accumulate high rewards.
Regret in Bandit Algorithms
Regret is the difference between the total reward the algorithm achieves and the total reward the best fixed action (arm) would have given: Regret = µ(a*)·T − E[∑ₜ Rₜ], where µ(a*) = E[R | a*] is the expected reward of the optimal action a*. Regret is a theoretical measure used to evaluate algorithms, but it is not directly computable during execution.
Upper Confidence Bound (UCB) Algorithm
The Upper Confidence Bound (UCB) Algorithm solves bandits using the principle of optimism under uncertainty: at each round, pick the action with the highest upper confidence bound UCBₜ(a). Maintain a running estimate of the mean reward µ̂ₜ(a) for each action and a count nₜ(a) of how many times each action has been played.
UCB Score Formula
The UCB Score is defined as UCBₜ(a) = µ̂ₜ(a) + √(2 log t / nₜ(a)), where the first term encourages exploitation and the second term encourages exploration. As more samples are collected for action a, the confidence interval narrows (i.e., uncertainty shrinks), making that action less likely to be selected purely for exploration. The log(t) numerator grows slowly, ensuring that actions not chosen in a while will eventually become attractive again and be retried.
UCB Algorithm Steps
UCB Algorithm Steps:
- For t = 1 to k (number of actions), play each action once.
- For t = k+1 to T, choose Aₜ = argmaxₐ UCBₜ(a).
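A minimal NumPy sketch of these steps; pull(a) is a hypothetical callback that returns one reward sample for arm a:

```python
import numpy as np

def ucb_bandit(pull, k, T):
    """UCB: play every arm once, then always pick the arm with the highest upper confidence bound."""
    counts = np.zeros(k)
    means = np.zeros(k)
    for t in range(1, T + 1):
        if t <= k:
            a = t - 1                                            # play each action once
        else:
            ucb = means + np.sqrt(2 * np.log(t) / counts)        # exploitation term + exploration bonus
            a = int(np.argmax(ucb))
        r = pull(a)
        counts[a] += 1
        means[a] += (r - means[a]) / counts[a]                   # running mean estimate µ̂ₜ(a)
    return means, counts
```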
UCB Algorithm Interpretation
Interpretation: UCB balances learning and earning by combining an exploitation-promoting term (the sample mean) and an exploration bonus (uncertainty) into a single decision rule. Even without knowing the best action, UCB guarantees near-optimal performance over time by keeping an “open mind” and revisiting poorly explored actions.
UCB Regret Bound
The Regret Bound of UCB is sublinear: total regret R(T) = O(√(kT log T)) under bounded rewards R ∈ [0,1]. This implies that the average regret R(T)/T → 0 as T → ∞, meaning the algorithm asymptotically matches the optimal strategy.
Reinforcement Learning Fundamentals
Reinforcement Learning generalizes bandits by modeling sequential decision-making where actions influence future states. The environment is defined as a Markov Decision Process (MDP) with:
- S (set of states) and s_start (start state)
- A(s) (actions available in state s)
- T(s, a, s′) (transition probability)
- R(s, a, s′) (immediate reward)
- IsEnd(s) (whether state s ends an episode)
The Markov property means that state s fully determines future dynamics regardless of past history. An agent’s policy π(s) maps each state to an action a ∈ A(s). The agent aims to maximize total reward over an episode, not just immediate reward, by accounting for future reward potential. We use discounting with γ ∈ [0,1) to emphasize earlier rewards; the total discounted return is r₁ + γr₂ + γ²r₃ + …, which remains finite even in infinite-horizon settings.
The state value function V^π(s) is the expected total discounted reward starting at state s and following policy π. The optimal value V*(s) satisfies V*(s) = 0 if IsEnd(s); otherwise V*(s) = maxₐ Q*(s, a). The Q-function Q*(s, a) gives the expected return for taking action a in state s and acting optimally afterward: Q*(s, a) = ∑ₛ′ T(s, a, s′)·(R(s, a, s′) + γV*(s′)). The optimal policy is π*(s) = argmaxₐ Q*(s, a). In the RL setting, the agent does not know T or R, so it learns by interacting with the environment and collecting experience tuples (s, a, r, s′).
Tabular Q-Learning
In tabular Q-learning, we initialize Q̂(s, a) to 0 and iteratively update it using experience via Q̂(s, a) ← (1 − η)Q̂(s, a) + η(r + γV̂(s′)), where V̂(s′) = maxₐ′ Q̂(s′, a′) and η is a learning rate. This nudges Q̂ toward the new observed reward plus the estimated future value. To encourage exploration, we use ϵ-greedy policies: with probability ϵ, a random action is taken; otherwise, the action with the highest Q̂(s, a) is picked.
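A minimal sketch of the tabular update and ε-greedy action choice above; the environment interaction loop and the actions list are assumed to exist elsewhere:

```python
import random
from collections import defaultdict

Q = defaultdict(float)   # Q̂(s, a) initialized to 0 for every state-action pair

def q_update(s, a, r, s_prime, actions, terminal, eta=0.1, gamma=0.99):
    """One tabular Q-learning update: nudge Q̂(s,a) toward r + γ·maxₐ′ Q̂(s′,a′)."""
    v_next = 0.0 if terminal else max(Q[(s_prime, a2)] for a2 in actions)
    Q[(s, a)] = (1 - eta) * Q[(s, a)] + eta * (r + gamma * v_next)

def epsilon_greedy(s, actions, eps=0.1):
    """Explore with probability ε; otherwise pick the action with the highest Q̂(s, a)."""
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])
```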
State Discretization
In large or continuous state spaces, we use state discretization: each dimension is split into B buckets, leading to a total of B^d states, which is feasible when d is small.
Function Approximation in RL
To scale further, we apply function approximation, such as linear models: represent the state-action pair (s, a) with features ϕ(s, a) and define Q̂(s, a) = wᵀϕ(s, a). Update via gradient descent using squared loss: w ← w − η(wᵀϕ(s, a) − r − γV̂(s′))·ϕ(s, a). This generalizes Q-values across states using shared weights and extracted features.
Deep Q-Learning (DQN)
In Deep Q-Learning, we replace the linear model with a neural network Q̂_θ(s, a), and update θ via backpropagation on the loss ½(Q̂_θ(s, a) − r − γV̂(s′))², computing the gradient as (Q̂_θ(s, a) − r − γV̂(s′))·∇θQ̂_θ(s, a) with the target r + γV̂(s′) held fixed, and estimating V̂(s′) = maxₐ′ Q̂_θ(s′, a′).
Policy Gradient Methods (REINFORCE)
Instead of learning Q-values, we can also learn policies directly with policy gradient methods like REINFORCE: model π_θ(a | s), which is a probability distribution over actions, and maximize the expected trajectory reward V(θ) = E_z∼πθ[R(z)], where z = [s₁, a₁, r₁, …, s_T, a_T, r_T, s_T₊₁]. The discounted reward of trajectory z is R(z) = ∑ₜ γ^{t−1} rₜ, and the gradient is ∇θV(θ) = E[R(z) ∑ₜ ∇θ log π_θ(aₜ | sₜ)]. In practice, we sample trajectories and update θ ← θ + ηR(z) ∑ₜ ∇θ log π_θ(aₜ | sₜ) using stochastic gradient ascent. This avoids estimating Q-values and treats action selection as classification with high variance but direct credit assignment. Modern variants like Actor-Critic, PPO, and others reduce variance or improve stability but are built on this same core gradient estimator.
Adversarial Examples in Machine Learning
Adversarial Examples are inputs crafted by an attacker to intentionally cause a machine learning model to behave incorrectly. They can exist even for models with high average accuracy and are a major concern for security, interpretability, and robustness. In vision, adversarial perturbations are often small changes in pixel values that are imperceptible to humans but cause confident misclassifications. For example, a panda classified as a gibbon with 99% confidence after imperceptible noise is added.
Why Adversarial Examples Matter
Why We Care: Fooling facial recognition, self-driving cars, or spam detectors poses serious risks. Adversarial attacks also expose that models may rely on spurious correlations, not real understanding.
Adversarial Threat Models
The Threat Model includes three parts: attack vector (what the attacker can modify), knowledge (white-box = full access to the model, black-box = only query access), and goal (undirected = cause any error, directed = cause a specific target error). Attack Vectors include constrained perturbations (small norm difference, e.g., L₂ or L_∞), unconstrained inputs, and data poisoning (malicious training data). For images, attackers limit pixel changes to ε in each direction using norms like L_∞: max |xᵢ − x′ᵢ| ≤ ε or L₂: ‖x − x′‖₂ ≤ ε. White-box Attacks use model gradients directly; black-box attacks use queries and heuristics, often with a limited budget.
Fast Gradient Sign Method (FGSM)
The Fast Gradient Sign Method (FGSM) is a white-box attack that computes the perturbation as x′ = x + ε · sign(∇ₓ L(x, y)) to increase model loss. This approximates the model locally as linear and nudges the input in the direction that increases prediction error.
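A minimal PyTorch sketch of this attack under stated assumptions (model and loss_fn are placeholders; pixel values are assumed to live in [0, 1]):

```python
import torch

def fgsm_attack(model, loss_fn, x, y, eps=0.03):
    """Fast Gradient Sign Method: x' = x + ε·sign(∇ₓ L(x, y)) (white-box, undirected)."""
    x = x.clone().detach().requires_grad_(True)
    loss = loss_fn(model(x), y)
    loss.backward()                                   # populates x.grad with ∇ₓ L
    x_adv = x + eps * x.grad.sign()                   # one signed gradient step on the input
    return x_adv.clamp(0, 1).detach()                 # keep pixels in a valid range
```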
Defending Against Adversarial Attacks
Defending Against Attacks requires anticipating the worst-case input, not just random noise. Naive strategies like adding noise (data augmentation) are insufficient—they fail against worst-case perturbations.
Adversarial Training Defense
Adversarial Training defends by training the model on adversarial examples x_adv = A(x), where A is an attack algorithm like FGSM. The training loss is computed on adversarial inputs, so each gradient step becomes θ ← θ − η ∇θ L(f_θ(A(x)), y). This improves robustness to the specific attack used during training but may not generalize to all attacks.
Adversarial Attacks in NLP
NLP Attacks include Unicode attacks (swapping ASCII for similar-looking Unicode), typo-based attacks (causing large embedding changes from small textual edits), and meaning-preserving paraphrases (e.g., “what has” → “what’s”) that change model predictions despite semantic equivalence.
Jailbreaking Large Language Models (LLMs)
Jailbreaking LLMs uses prompts or images to force large language models like ChatGPT to output restricted content. Attacks include prompt engineering and adversarial image encoding, often by subtly altering inputs to bypass alignment safeguards.
Adversarial Examples Summary
- Adversarial examples affect images, text, and multimodal models
- White-box attacks like FGSM are fast and effective
- Adversarial training improves robustness against specific threats
- Models may still fail under novel perturbations even with high average accuracy
Spurious Correlations and ML Fairness
Spurious Correlations are input-label associations that are predictive in training but not causally related to the task. Models learn any pattern that helps minimize loss—even irrelevant or misleading ones. For example, a model predicts “wolf” when snow is present because wolves often appear in snowy pictures—it misclassifies a husky as a wolf when snow is present. In medicine, models learned to detect pneumonia based on radiology technician tokens that vary by hospital, not actual medical features—this leads to poor generalization across hospitals. In NLP, models associate identity terms (e.g., “gay”, “Muslim”) with toxicity or sentiment, simply because such mentions co-occur with hate speech or reviews—not because the terms themselves imply anything. Spurious correlations undermine generalization and robustness: models fail when tested in different conditions (e.g., ducks on land instead of water). Solutions include data diversification and out-of-distribution testing—never assume a model will generalize without measuring it.
Fairness in Machine Learning
Fairness in ML concerns how models make decisions affecting different groups and individuals, especially across sensitive attributes A (e.g., race, gender). Core setup: input features X, prediction target Y, model output R, and sensitive attribute A—we ask whether the model treats different A values fairly. Fairness cannot be achieved by just omitting A (no fairness through unawareness)—models can reconstruct A from proxies like zip code.
Types of Fairness Definitions
Types of Fairness:
Independence (Statistical Parity)
Independence (Statistical Parity): prediction R is independent of A—groups get an equal percentage of positive predictions regardless of actual labels.
Separation (Equalized Odds)
Separation (Equalized Odds): prediction R is conditionally independent of A given Y—equal true and false positive rates across groups.
Sufficiency (Calibration)
Sufficiency (Calibration): Y is conditionally independent of A given R—predictions are equally reliable for all groups (e.g., if a model outputs 0.8 probability, the actual outcome occurs 80% of the time across all groups). These fairness criteria are often mutually incompatible—you cannot satisfy all simultaneously if groups differ in base rates.
COMPAS Example
For example, COMPAS (a recidivism prediction tool used in court decisions) was more likely to falsely classify Black defendants as high risk (violating separation), yet was equally calibrated across groups (satisfying sufficiency). Takeaway: Fairness depends on what definition you use and what harms you prioritize.
Types of Fairness Harms
Types of Fairness Harms:
Allocative Harms
Allocative Harms: unequal access to opportunities/resources (e.g., bail decisions, medical care, resume filtering)—for example, Amazon’s recruiting AI penalized resumes with “women’s” and female colleges.
Unequal Accuracy
Unequal Accuracy: some groups consistently receive worse predictions—for example, ASR and facial recognition systems have higher error rates for Black speakers and dark-skinned women.
Representational Harms
Representational Harms: stereotypes are reinforced through model outputs—for example, search engines showing biased results, or machine translation always translating “doctor” as male.
Summary of Spurious Correlations and Fairness
- Machine Learning models learn from data, including undesirable patterns
- Spurious correlations hurt generalization
- Fairness definitions conflict
- Harms arise from allocation, unequal accuracy, or reinforcing societal bias
- Mitigation requires understanding model behavior, measuring performance across groups, and being explicit about which fairness definition matters for a given application