Essential Machine Learning Concepts Explained
Regularization Techniques in Machine Learning
Regularization is a set of techniques used to reduce overfitting by encouraging simpler models. It works by adding a penalty term to the loss function that discourages overly large or complex weights. The two most common forms are L2 (Ridge) and L1 (Lasso) regularization.
L2 Regularization (Ridge / Weight Decay)
L2 Regularization adds the squared L2 norm of the weights to the loss: L(w) = original loss + λ‖w‖². This shrinks weights proportionally to their size, encouraging small, smooth parameters. The gradient becomes ∇L(w) = ∇L₀(w) + 2λw, and the gradient descent update subtracts η·2λw from each weight. Large weights decay faster than small ones. L2 avoids overly large coefficients without setting them exactly to zero.
L1 Regularization (Lasso)
L1 Regularization adds the L1 norm of weights to the loss: L(w) = original loss + λ‖w‖₁ = λΣ|wⱼ|. This promotes sparsity by encouraging some weights to become exactly zero. The gradient is λ·sign(w), applied element-wise. L1 takes constant-sized steps toward 0, regardless of magnitude, unlike L2. This makes L1 good for feature selection and compressing models.
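As a rough sketch (not from the notes), the two update rules above can be written in a few lines of NumPy; grad_loss, lr, and lam are hypothetical placeholders for the unregularized loss gradient, learning rate, and regularization strength.

```python
import numpy as np

def regularized_step(w, grad_loss, lr=0.1, lam=0.01, penalty="l2"):
    """One gradient-descent step on loss + regularizer (illustrative sketch)."""
    if penalty == "l2":
        grad = grad_loss + 2 * lam * w        # ∇(λ‖w‖²) = 2λw: shrink proportionally to size
    elif penalty == "l1":
        grad = grad_loss + lam * np.sign(w)   # ∇(λ‖w‖₁) = λ·sign(w): constant-size pull toward 0
    else:
        raise ValueError(penalty)
    return w - lr * grad

w = np.array([0.5, -0.02, 3.0])
w = regularized_step(w, grad_loss=np.zeros_like(w), penalty="l1")
```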
L1 vs L2 Regularization
L1 leads to sparse models by zeroing unimportant weights; L2 leads to smooth models by shrinking all weights proportionally. L1 is preferred when only a few features are likely to matter; L2 when all features contribute somewhat. The choice is treated as a hyperparameter.
Effect on Learning Algorithms
Regularization changes both the gradient and the update rule. L2 reduces large weights gradually; L1 eliminates some weights completely. Both techniques reduce variance, improve generalization, and prevent overfitting.
Bayesian Interpretation (MAP Estimation)
Regularization can be viewed as imposing a prior on weights. Assume w is drawn from P(w), and data D is drawn from P(D|w). Learning becomes maximizing P(w|D) ∝ P(D|w)P(w). Taking the log yields the usual loss + regularization term. A Gaussian prior on w implies L2 regularization; a Laplace prior implies L1. Regularization thus has a probabilistic justification rooted in Bayesian inference.
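For the Gaussian-prior case, a short (standard) derivation makes the correspondence explicit; here σ² denotes the prior variance:

```latex
-\log P(w \mid D)
  = \underbrace{-\log P(D \mid w)}_{\text{original loss}}
    \;+\; \underbrace{\tfrac{1}{2\sigma^2}\lVert w\rVert^2}_{\lambda\lVert w\rVert^2,\ \ \lambda = 1/(2\sigma^2)}
    \;+\; \text{const},
\qquad \text{for a Gaussian prior } P(w)\propto \exp\!\Big(-\tfrac{\lVert w\rVert^2}{2\sigma^2}\Big).
```

The analogous step with a Laplace prior P(w) ∝ exp(−‖w‖₁/b) yields the L1 penalty λ‖w‖₁ with λ = 1/b.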
Summary of Regularization
- Reduces overfitting by penalizing large weights
- Shrinks or zeroes weights to simplify models
- Improves generalization
- Changes gradient and update behavior
- λ must be tuned, typically via cross-validation
Bias and Variance in Machine Learning Models
Bias/Variance describes two main sources of error in machine learning models. Bias is error from incorrect assumptions that prevent the model from capturing true data patterns. Variance is error from being too sensitive to small fluctuations in the training data. Total expected error decomposes into bias² plus variance (plus irreducible noise).
Understanding Bias
Bias results from simplifying assumptions, such as assuming linearity when the true relationship is nonlinear. High bias leads to underfitting—the model is too simple and misses important structure. To reduce bias, expand the model family by reducing constraints or adding features. This allows the model to better capture real-world complexity.
Understanding Variance
Variance comes from fitting the training data too closely, including noise. High variance leads to overfitting—the model performs well on the training set but poorly on new data. Causes include using overly complex models and having limited training data. Solutions include regularization (which shrinks the model family), increasing training data, or reducing model complexity.
The Bias-Variance Tradeoff
Increasing complexity lowers bias but raises variance. A well-balanced model minimizes both. The goal is to find a model that generalizes well—flexible enough to capture patterns, but not so complex that it memorizes noise.
Data Splitting for Model Evaluation
Data Splitting is essential for evaluating generalization. A typical split is 70% training, 10% validation, and 20% test. The training set is used to fit model parameters. The validation (development) set is used to tune hyperparameters and detect overfitting. The test set is used only for final evaluation and must remain untouched during model development.
Train vs. Test Loss Behavior
As model complexity increases, training loss always decreases, but test loss decreases then increases—the rise marks overfitting. Monitoring both losses helps identify the optimal model complexity.
Summary of Bias and Variance
- Bias = assumptions that oversimplify
- Variance = overreaction to noise
- Underfitting = high bias
- Overfitting = high variance
- Reduce bias by using more complex models
- Reduce variance by regularization or more data
- Balance is key to generalization
Normal Equations for Linear Regression
Normal Equations provide a closed-form solution to linear regression. Given feature matrix X (n × d), output vector y (n × 1), and weight vector w (d × 1), the solution is: w = (XᵀX)⁻¹Xᵀy. This computes the optimal w that minimizes the squared error loss. It works when XᵀX is invertible.
Non-Invertibility Conditions
Non-Invertibility occurs when XᵀX is not full rank, meaning the matrix cannot be inverted. This happens in two main cases: (1) Redundant features—when some columns of X are linearly dependent, and (2) Too many features—when d > n, i.e., more features than data points. In both cases, XᵀX becomes singular.
Consequences of Non-Invertibility
The system XᵀXw = Xᵀy has many solutions. This leads to overfitting and high variance since multiple weight vectors perfectly fit the training data.
Solutions for Non-Invertibility
Fixes include removing redundant or near-duplicate features (feature selection) or using regularization to ensure stability. With L2 regularization (ridge regression), the modified equation becomes w = (XᵀX + λI)⁻¹Xᵀy. Adding λI shifts every eigenvalue of XᵀX up by λ, making the matrix positive definite and therefore invertible.
The Pseudoinverse
If XᵀX is not invertible, the Moore-Penrose pseudoinverse of X can be used to compute a solution: w = X⁺y. The pseudoinverse always exists and provides a least-squares solution, even when multiple solutions exist.
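A minimal NumPy sketch of the three solution routes discussed above (plain normal equation, ridge, and pseudoinverse), using synthetic data purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))        # n = 100 examples, d = 5 features
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.normal(size=100)

# Plain normal equation (requires XᵀX to be invertible)
w_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Ridge regression: (XᵀX + λI) is invertible for any λ > 0
lam = 0.1
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

# Moore-Penrose pseudoinverse: works even when XᵀX is singular
w_pinv = np.linalg.pinv(X) @ y
```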
Practical Rules of Thumb
Always aim for more data than features (n > d) to ensure XᵀX is invertible. Avoid duplicate or near-duplicate features to maintain linear independence.
Normal Equations Summary
- Normal equation gives exact solution for linear regression
- Fails when XᵀX is not invertible
- Causes include too few examples or redundant features
- Fixes include regularization or using the pseudoinverse
- Always check data shape and feature quality before applying the normal equations
Non-Parametric Machine Learning Methods
Non-Parametric Methods do not assume a fixed set of parameters. Instead, they store training data and use it to make predictions, meaning model size grows with the dataset. This contrasts with parametric methods, which learn a fixed number of parameters and discard the training data after training.
Parametric Methods Overview
Parametric Methods include linear regression, logistic regression, and softmax regression. These directly model P(y|x) and learn a parameter vector w ∈ ℝᵈ. Logistic regression predicts using sigmoid(wᵀx), with w trained via maximum likelihood. Naive Bayes is a generative parametric method, modeling P(y) and P(x|y) to infer P(y|x) by estimating π and τ from counts.
k-Nearest Neighbors (kNN)
k-Nearest Neighbors (kNN) is a simple non-parametric classifier. To classify input x: (1) compute distances to all training points, (2) find the k closest, (3) return the most common label. Distance metrics include Euclidean and Manhattan. 1-NN perfectly fits training data but is sensitive to outliers; k-NN smooths predictions using a majority vote. kNN has no training time but is expensive at test time and suffers from the curse of dimensionality—in high dimensions, all points are far apart.
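A minimal NumPy sketch of the three kNN steps (distances, k closest, majority vote); X_train, y_train, and k are placeholder inputs:

```python
import numpy as np
from collections import Counter

def knn_predict(x, X_train, y_train, k=3):
    """Classify x by majority vote among its k nearest training points (Euclidean distance)."""
    dists = np.linalg.norm(X_train - x, axis=1)   # distance to every training point
    nearest = np.argsort(dists)[:k]               # indices of the k closest points
    return Counter(y_train[nearest]).most_common(1)[0][0]
```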
Kernel Methods
Kernel Methods generalize both kNN and logistic regression by computing similarity in a high-dimensional space using a kernel function K(x, x′) = φ(x)ᵀφ(x′), where φ is implicit. Examples: polynomial kernel K(x, x′) = (xᵀx′ + c)ᵈ and Gaussian RBF kernel K(x, x′) = exp(−γ‖x − x′‖²). Kernel methods allow linear models to behave nonlinearly in the original space. The kernel trick lets you operate in high-dimensional feature space without explicitly computing φ(x).
Kernelized Logistic Regression
Kernelized Logistic Regression replaces w with αᵢ values per training point. Predict using a sum of kernel evaluations: P(y|x) = sigmoid(Σ αᵢ K(x, xᵢ)). Setting K(x, z) = xᵀz recovers standard logistic regression. More powerful kernels enable nonlinear classification.
Support Vector Machines (SVMs)
Support Vector Machines (SVMs) are binary classifiers that learn a hyperplane wᵀx + b = 0 with maximum margin between classes. They minimize ½‖w‖² subject to yᵢ(wᵀxᵢ + b) ≥ 1. Points with margin yᵢ(wᵀxᵢ + b) ≤ 1 (those on or inside the margin) are the support vectors; only these influence the decision boundary. Unlike logistic regression, SVMs have no probabilistic interpretation and use a different loss function (hinge loss). Hard-margin SVMs require separable data; soft-margin SVMs introduce slack to allow misclassifications.
Kernelized SVMs
Kernelized SVMs apply the kernel trick to learn nonlinear boundaries. SVMs work well in high dimensions and are robust to overfitting, but they are computationally expensive and sensitive to kernel and hyperparameter choices.
Non-Parametric Methods Summary
- Parametric = fixed-size models, learn parameters then discard data
- Non-parametric = store training data, grow with dataset
- kNN is simple, no training, sensitive to noise and dimensionality
- Kernel methods implicitly lift data to high-dimensions, enable flexible decision boundaries
- SVMs maximize margin, depend only on support vectors, and work well with kernels for classification
Backpropagation Algorithm
Backpropagation is a method for efficiently computing gradients in a computation graph, especially in neural networks. It applies the chain rule recursively, starting from the output and working backward to inputs. Used during training to update weights via gradient descent.
Computation Graphs
Computation Graphs represent computations as nodes (operations) and edges (data flow). Each node performs a function and stores intermediate values. The forward pass computes the output. The backward pass computes gradients with respect to each node’s inputs. The goal is to compute ∂L/∂w for all model parameters w.
The Chain Rule in Backpropagation
The Chain Rule is central: if z = f(w) and L depends on z, then ∂L/∂w = ∂L/∂z × ∂z/∂w. In the graph, each node passes its gradient to its inputs by multiplying its incoming gradient by its local partial derivative.
Backpropagation on Tree Structures
Assume the computation graph is a tree (no shared nodes). Start at the root (output), and initialize ∂output/∂output = 1. Propagate the gradient to children based on the type of node.
- Multiplication Node: If e = x·y, then ∂e/∂x = y and ∂e/∂y = x. The child node receives the parent’s gradient multiplied by the other child’s value.
- Addition Node: If e = x + y, then ∂e/∂x = 1 and ∂e/∂y = 1. The child node receives the parent’s gradient multiplied by 1.
- Leaf Nodes (inputs) receive the final gradients; no further propagation is needed.
Backpropagation Recipe
The overall Backpropagation Recipe:
- Perform forward pass to compute all node values.
- Initialize root gradient to 1.
- Recursively apply chain rule from root to leaves. Each node computes ∂(parent)/∂(child) and multiplies by parent’s gradient.
- Continue until all leaf gradients are computed.
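To make the recipe concrete, here is a hand-worked sketch (not from the notes) for the tiny graph e = (x + y)·z, applying the addition and multiplication rules above:

```python
# Forward pass: compute and store every node's value
x, y, z = 2.0, -1.0, 3.0
s = x + y          # addition node
e = s * z          # multiplication node (root/output)

# Backward pass: start with de/de = 1 and apply the chain rule toward the leaves
de_de = 1.0
de_ds = de_de * z          # multiplication node: parent's gradient × other child's value
de_dz = de_de * s
de_dx = de_ds * 1.0        # addition node: parent's gradient × 1
de_dy = de_ds * 1.0
print(de_dx, de_dy, de_dz)   # 3.0 3.0 1.0
```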
Multivariate Chain Rule
When a variable has multiple downstream paths (e.g., shared inputs), its total gradient is the sum of all contributions: ∂L/∂x = Σ (∂L/∂zᵢ × ∂zᵢ/∂x). This handles weight sharing.
Numerical Gradients
Numerical Gradients are an alternative method: ∂f/∂x ≈ [f(x + ε) − f(x)] / ε.
- Pros: Simple to implement, good for checking correctness.
- Cons: Inefficient (requires one forward pass per input).
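A small sketch of using this estimate as a gradient check; the central-difference variant used here is slightly more accurate than the one-sided formula above:

```python
import numpy as np

def numerical_grad(f, x, eps=1e-5):
    """Central-difference estimate of ∂f/∂xᵢ for each coordinate (one pair of passes per input)."""
    grad = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = eps
        grad[i] = (f(x + e) - f(x - e)) / (2 * eps)
    return grad

f = lambda x: np.sum(x ** 2)               # analytic gradient is 2x
x = np.array([1.0, -2.0, 0.5])
assert np.allclose(numerical_grad(f, x), 2 * x, atol=1e-4)
```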
Backpropagation Summary
- Backprop computes gradients efficiently using the chain rule
- Works on computation graphs
- Forward pass computes values
- Backward pass propagates gradients
- Each node knows how to compute its own derivative
- Used to train neural networks via gradient descent
Neural Network Optimization Techniques
Neural Network Optimization addresses the challenges of training non-convex models. Neural networks are flexible but prone to overfitting and difficult to optimize. Key components include initialization, learning rate, SGD variants, and regularization.
Stochastic Gradient Descent (SGD)
Stochastic Gradient Descent (SGD) approximates the gradient by computing it over a small batch of examples. One update is much faster than full gradient descent. Smaller batches are faster but noisier, while larger batches are slower but more accurate. Each full pass through the dataset is called an epoch.
Weight Initialization Strategies
Initialization matters because neural networks are non-convex. Bad initialization (e.g., all zeros) leads to symmetry—all neurons behave identically. Fix this by initializing weights to small random values. Use He initialization (variance = 2/d_in) or Xavier initialization (variance = 1/d_in). This avoids exploding or vanishing gradients.
Learning Rate and Schedules
Learning Rate is critical. If too small, convergence is slow or stuck in flat regions. If too large, training becomes unstable. Use learning rate schedules—start high and decay over time. This allows the model to take large steps early and small steps later.
Momentum Optimization
Momentum smooths updates by accumulating a velocity over past gradients, which helps carry the optimizer through ravines and flat regions. The update uses: v = β₁v + (1−β₁)grad, then w = w − ηv. It accelerates progress along consistent directions and damps oscillations.
RMSProp Optimizer
RMSProp adapts the learning rate per parameter: s = β₂s + (1−β₂)(grad²), then w = w − η·grad / (√s + ε). This reduces the learning rate in directions with consistently large gradients.
Adam Optimizer
Adam combines momentum and RMSProp: v = β₁v + (1−β₁)grad, s = β₂s + (1−β₂)(grad²), and the update is w = w − η·v / (√s + ε). Adam is widely used and robust to hyperparameter tuning.
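A minimal NumPy sketch of the Adam update exactly as written above (bias correction, which most implementations also apply, is omitted); w, grad, and state are placeholders:

```python
import numpy as np

def adam_step(w, grad, state, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step following the update above (no bias correction)."""
    state["v"] = beta1 * state["v"] + (1 - beta1) * grad          # momentum term (first moment)
    state["s"] = beta2 * state["s"] + (1 - beta2) * grad ** 2     # RMSProp term (second moment)
    return w - lr * state["v"] / (np.sqrt(state["s"]) + eps)

w = np.zeros(3)
state = {"v": np.zeros(3), "s": np.zeros(3)}
w = adam_step(w, grad=np.array([0.1, -0.2, 0.3]), state=state)
```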
Regularization for Neural Networks
Regularization Techniques include weight decay, early stopping, and dropout.
Weight Decay (L2 Regularization)
Weight Decay (L2 Regularization) adds λ‖w‖² to the loss function. This shrinks weights by a factor of (1 − ηλ) during updates. It helps prevent overfitting by penalizing large parameters.
Early Stopping
Early Stopping monitors validation loss during training. If it starts rising, training is stopped, and the best-performing checkpoint is returned. This limits how far parameters can drift from initialization, effectively constraining the model.
Dropout Regularization
Dropout randomly zeroes out hidden neurons during training, each with probability p. To keep the expected output unchanged, scale the remaining activations by 1/(1 − p). Prevents co-adaptation—neurons must be useful independently. At test time, the full network is used without dropout.
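A minimal sketch of inverted dropout under this convention (drop probability p, rescale by 1/(1 − p)):

```python
import numpy as np

def dropout_forward(h, p=0.5, training=True):
    """Inverted dropout: drop each unit with probability p and rescale the survivors by 1/(1-p)."""
    if not training:
        return h                               # full network at test time, no extra scaling
    mask = (np.random.rand(*h.shape) >= p)     # keep each unit with probability 1-p
    return h * mask / (1 - p)
```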
Dropout as an Ensemble Method
Each dropout pattern represents a sub-network. Training with dropout averages over many such models, similar to an ensemble. Improves generalization and reduces overfitting.
Dropout and Naive Bayes Generalization
At the input layer, dropout reduces to learning P(y|xⱼ) for individual features, similar to Naive Bayes. At test time, all features are used, combining predictions multiplicatively.
Neural Network Optimization Summary
- SGD enables efficient training
- Initialization breaks symmetry and stabilizes training
- Learning rate and schedule are crucial
- Momentum and RMSProp improve convergence
- Adam combines both
- Regularization via weight decay, early stopping, and dropout controls overfitting and enhances generalization
Convolutional Neural Networks (CNNs)
Convolutional Neural Networks (CNNs) are deep learning models designed for images and other spatially structured inputs. They extract features hierarchically—from edges to shapes to objects—by stacking layers of convolution, nonlinearity, and pooling.
Convolution Operation
Convolution is a linear operation where a small kernel (e.g., 3×3) slides over an input matrix and computes dot products. Output size is (W−K+1)×(H−K+1). Convolutional layers learn kernels that detect specific patterns. They are parameter-efficient and exploit spatial locality.
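A minimal NumPy sketch of a single-channel valid convolution matching the output-size formula above; edge_filter is just an illustrative hand-crafted kernel:

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide a K×K kernel over a W×H image; output is (W-K+1)×(H-K+1)."""
    W, H = image.shape
    K = kernel.shape[0]
    out = np.zeros((W - K + 1, H - K + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + K, j:j + K] * kernel)   # dot product with the patch
    return out

edge_filter = np.array([[1., 0., -1.], [2., 0., -2.], [1., 0., -1.]])  # Sobel-like vertical-edge kernel
```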
Local Receptive Fields
Local Receptive Fields ensure each neuron focuses on a small region of the input. This mimics how vision systems detect small parts before combining them into larger structures.
Weight Sharing in CNNs
Weight Sharing allows the same kernel to detect the same feature in different image locations. This gives CNNs translation equivariance—if the input shifts, the output shifts similarly.
Conv Layers vs. Fully Connected Layers
Convolutional layers have far fewer parameters (e.g., 10 vs. 372) and require less data to train. Fully connected layers must relearn the same feature separately at every location; CNNs reuse one filter across all locations.
Input and Output Channels
Color images have 3 channels (RGB), so kernels have the shape Cin×K×K. To learn multiple features, use Cout output channels—the kernel tensor is Cout×Cin×K×K.
Stride and Padding
Stride determines how far the kernel moves each step; a stride greater than 1 reduces spatial resolution. Padding (e.g., zero-padding) keeps the output size consistent. Without it, width and height shrink with each layer.
Convolutional Layers and Activation
Convolutional Layers are typically followed by a ReLU nonlinearity. A convolution by itself is a constrained linear layer (local connectivity plus weight sharing); the ReLU supplies the nonlinearity. CNNs stack multiple Convolution + ReLU + Pool blocks to learn complex, high-level representations.
Pooling Layers (Max and Average)
Pooling (Max or Average) reduces resolution and increases receptive field. Max Pooling picks the maximum value in a small patch (e.g., 2×2). Pooling helps retain dominant features while discarding spatial noise.
Final Layers in CNNs
After several convolutional layers, CNNs use fully connected layers for global reasoning. First, flatten the output to a vector, then apply MLP layers to produce predictions.
Learning Features with CNNs
CNN weights (in convolutional and fully connected layers) are updated via backpropagation. Gradients indicate how increasing each weight would change the loss. Convolutional filters learn to activate strongly for useful features (e.g., edges, textures, objects).
CNN Feature Visualization
Visualization: Early filters learn color and edge detectors. Deeper layers capture high-level concepts (e.g., faces, animals, text).
Image Classification with CNNs
Image Classification: CNNs achieve strong performance on datasets like ImageNet. AlexNet (2012) marked a breakthrough in deep learning by winning ImageNet with a CNN.
Object Detection
Object Detection: Classify and locate objects in an image. One approach is to generate region proposals and then classify each region.
Semantic Segmentation
Semantic Segmentation: Assign a label to every pixel. Encoder-decoder models like U-Net perform downsampling (convolution + pooling), then upsampling (convolution + upsampling) to match the original image size.
Diffusion Models
Diffusion Models: Train CNNs to reverse noise. Start with clean images, then add noise to the training data. Train the model to predict the denoised version. At test time, start from pure noise and iteratively denoise using a CNN. It can also be conditioned on captions. The architecture often uses U-Net and per-pixel regression loss.
CNNs Summary
- CNNs extract features from local to global
- Use convolution, weight sharing, and pooling
- Learn complex representations layer by layer
- Reduce parameters vs. FC layers
- Used for classification, detection, segmentation, and generation
- Backprop tunes kernels to detect meaningful visual patterns
Word Vectors and Embeddings
Word Vectors represent each word w with a vector v_w ∈ ℝᵈ that encodes its meaning. Similar words have similar vectors. Components may capture semantics such as category, sentiment, or usage. This allows neural networks to process language numerically.
The Distributional Hypothesis
Distributional Hypothesis: Words that appear in similar contexts have similar meanings. For example, if “ongchoi” occurs in similar sentences as “spinach,” their vectors should be similar. This motivates learning word vectors by predicting context words.
Word2Vec Model
Word2Vec builds word vectors using a prediction task: For each word w, predict which words co-occur with it in a fixed-size sliding window. Generate positive examples (real co-occurrences) and negative ones (random words). Each prediction is a binary classification task: Did this word pair co-occur?
Word2Vec Training Objective
For each word w, learn a word vector v_w and a context vector c_w. Use a logistic regression loss for each (w, w′) pair: positive pairs should score high, and negative pairs should score low. The objective is non-convex because both v and c are learned. Gradient descent is used for optimization. Gradients are symmetric: update v when w is the input word, and update c when w is a context word.
Negative Sampling
Negative Sampling: Rather than computing a full softmax, randomly sample negative context words. Frequency-adjusted sampling (e.g., using P(w)⁰·⁷⁵) downweights common words like “the.” A typical ratio is 2 negatives per positive.
Word vs. Context Vectors
Both are learned during training, but only word vectors v are kept afterward. Context vectors c serve as classifier weights and are discarded.
Analogy Reasoning with Word Vectors
Analogy Reasoning: Vector arithmetic can capture relationships: v_vine ≈ v_tree − v_apple + v_grape. Given three words, analogies can be solved by computing a target vector and searching for the closest word (by cosine similarity). Analogies capture directions like gender (man → woman), grammar (fast → fastest), and hierarchy (animal → dog).
Bias in Word Vectors
Since Word2Vec learns from raw data, it picks up societal biases. For example, v_programmer − v_man + v_woman ≈ v_homemaker reflects historical stereotypes in the data, not semantic relationships. Embedding bias correlates with real-world demographic distributions. Machine learning models absorb whatever associations exist in the data.
Word Vector Embedding Layer
A layer that maps each word w to its vector v_w. It stores a lookup table of parameters (|V| × d). In PyTorch, it is implemented as nn.Embedding(). These embeddings are typically trained further in downstream tasks.
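A small usage sketch (the vocabulary size and dimension are placeholders):

```python
import torch
import torch.nn as nn

vocab_size, d = 10_000, 300              # |V| × d lookup table of trainable parameters
embedding = nn.Embedding(vocab_size, d)

token_ids = torch.tensor([12, 7, 512])   # word indices for a 3-token input
vectors = embedding(token_ids)           # shape (3, 300); one row per token, updated by backprop
```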
Word Vectors Summary
- Learn word vectors by predicting co-occurrence
- Optimize with gradient descent on binary classification tasks
- Negative sampling improves efficiency
- Vector arithmetic enables analogy solving
- Word2Vec vectors reflect both semantics and bias
- Only word vectors (not context vectors) are retained for use
Recurrent Neural Networks (RNNs)
Recurrent Neural Networks (RNNs) handle sequential inputs like text, where input length varies. Instead of using a fixed-size feature vector, RNNs process inputs one token at a time using a shared set of parameters. They are capable of encoding arbitrary-length sequences into fixed-size hidden states.
Word Vectors as RNN Input
Word Vectors convert each word w to a vector v_w ∈ ℝᵈ. These vectors are fed to the RNN one at a time. Inputs x₁, …, x_T are the vectors for words w₁, …, w_T.
Elman RNN Architecture
An Elman RNN maintains a hidden state hₜ updated at each step using the previous state hₜ₋₁ and current input xₜ: hₜ = tanh(Wₓxₜ + Wₕhₜ₋₁ + b). Parameters are shared across timesteps: Wₓ ∈ ℝʰˣᵈ, Wₕ ∈ ℝʰˣʰ, and b ∈ ℝʰ. The hidden state is initialized as h₀. The RNN layer transforms a T × d input into a T × h sequence of hidden states.
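A minimal NumPy sketch of the Elman recurrence above, with Wx of shape h × d and Wh of shape h × h:

```python
import numpy as np

def elman_forward(X, Wx, Wh, b):
    """Run an Elman RNN over a T×d input and return the T×h sequence of hidden states."""
    T, h = X.shape[0], Wh.shape[0]
    H = np.zeros((T, h))
    h_prev = np.zeros(h)                                # h₀
    for t in range(T):
        h_prev = np.tanh(Wx @ X[t] + Wh @ h_prev + b)   # same parameters at every timestep
        H[t] = h_prev
    return H
```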
RNN Training with BPTT
Training uses Backpropagation Through Time (BPTT). The loss from later timesteps is propagated through each previous hidden state, all the way to earlier word vectors. Gradients are computed using the chain rule across all timesteps, which makes RNN training sensitive to sequence length.
Language Modeling with RNNs
Language Modeling: Given a prefix, the model predicts the next word at each step. RNNs act as autoregressive models: each word’s prediction depends on the current hidden state, which encodes all previous words. At test time, text is generated by feeding in the model’s own predictions. Use greedy decoding (selecting the most likely word) or sampling from the output distribution.
Long-Range Dependencies Challenge
Long-Range Dependencies: Because hidden states are repeatedly updated with new inputs, information from earlier words is easily lost. This makes it hard to capture relationships across many timesteps, for example, subject-verb agreement or coreference across sentences.
Vanishing and Exploding Gradients
Vanishing & Exploding Gradients: As the gradient is propagated backward, it is repeatedly multiplied by Wₕ. If ‖Wₕ‖ < 1, the gradient shrinks and vanishes. If ‖Wₕ‖ > 1, the gradient grows and explodes. Both conditions make training unstable.
Gradient Solutions for RNNs
Solutions: Gradient clipping limits the gradient magnitude to prevent explosion. Vanishing gradients can be avoided using additive recurrence. Instead of purely multiplicative updates, additive updates preserve information better.
Gated Recurrent Networks (GRUs, LSTMs)
Gated RNNs address vanishing gradients by introducing gates.
Gated Recurrent Unit (GRU)
GRUs use update gates z ∈ [0,1] to control how much information to keep versus replace. When zᵢ = 1, neuron i is updated with the candidate value; when zᵢ = 0, the previous value is preserved. GRUs apply element-wise interpolation between the previous and candidate hidden states. Parameters: W_z, W_r, W.
Long Short-Term Memory (LSTM)
LSTMs separate the hidden state hₜ and the cell state cₜ. The cell state is updated additively and passed through forget, input, and output gates. The architecture retains long-term memory better and is widely used in NLP.
RNNs Summary
- RNNs model sequences with shared parameters across time
- Use word vectors as input
- Train with BPTT
- Can predict next word or encode full sequence
- Struggle with long-range dependencies due to vanishing gradients
- Gated RNNs (GRUs, LSTMs) mitigate this by using additive updates and gates to preserve information
Decision Trees and Ensemble Methods
Decision Tree Prediction Process
Decision Tree Prediction: At each node, split on one feature. At each leaf, store predicted output (classification = majority label, regression = mean value). Given new input, walk the tree to determine the output. Very interpretable.
Learning Regression Trees
Learning Trees (Regression): At each node, choose the best feature and split threshold. Try all features and split points. Choose the split minimizing prediction error. Predict the mean in each region and compute total squared error.
Decision Tree Stopping Criteria
Stopping Criteria: Splitting to leaf size 1 overfits. Use min samples, max depth, or pruning.
Learning Classification Trees
Learning Trees (Classification): Predict the majority label at each node. Use split criteria like majority class accuracy or Gini Impurity. Gini = ∑_c p(c)(1 − p(c)) = ∑_{c ≠ c′} p(c)·p(c′). Lower Gini = purer split. Split score = weighted average of child impurities.
Handling Missing Features in Trees
Handling Missing Features: Use surrogate variables—features that correlate with the chosen feature at each node. During test, if a feature is missing, use a surrogate to decide the path.
Ensemble Methods Overview
Ensembling: Combine predictions of multiple models to improve accuracy. Final prediction = average (regression) or majority vote (classification). Reduces overfitting, improves generalization.
Tree Ensemble Techniques
Tree Ensembles: Shallow trees capture weak patterns. Ensemble aggregates more info across trees.
Bagging (Bootstrap Aggregating)
Bagging: Train each tree on a bootstrap sample (sample with replacement). Each tree sees about 63% of the unique training examples. This introduces diversity between trees, and averaging them reduces the ensemble's variance.
Random Forests
Random Forests: Add randomness at the split level—only consider a random subset of features (e.g., √d) when splitting. Further reduces tree correlation, boosts diversity and performance.
Dropout as an Ensemble
Dropout as Ensemble: In neural nets, train different subnetworks each iteration by randomly dropping units. Test time: use the full net → output ≈ average of subnetworks. Shared weights, but ensemble-like effect.
Neural Network Ensembles
Neural Net Ensembles: Combine multiple trained networks. Stochastic training (SGD, init, dropout) already gives variation. Still, explicit ensembles (e.g., GPT-4) can improve performance at extra compute cost.
Vanilla RNNs and Gradient Issues
Vanilla RNN (Elman): At each timestep, an RNN receives the previous hidden state hₜ₋₁ and input vector xₜ. It computes hₜ = tanh(Wₓxₜ + Wₕhₜ₋₁ + b) with shared parameters across time. It struggles with long-range dependencies due to repeated multiplication of Wₕ, causing vanishing gradients over time.
The Vanishing Gradient Problem
Vanishing Gradient Problem: Gradients shrink as they are backpropagated through time. If ‖Wₕ‖ < 1, then (Wₕ)ᵗ → 0 quickly. This limits the ability to learn dependencies across distant timesteps.
Avoiding Vanishing Gradients
Avoiding Vanishing Gradients: Replace multiplicative recurrence with additive updates to allow gradients to accumulate rather than vanish. Purely additive recurrence preserves information but lacks flexibility.
Gated Recurrent Networks
Gated RNNs: Introduce gates to control the flow of information and selectively forget or preserve parts of the hidden state. Gates output values in [0, 1] using sigmoid functions and apply element-wise multiplication.
Gated Recurrent Unit (GRU)
GRU (Gated Recurrent Unit): A simplified gated RNN. It uses an update gate zₜ to control whether to keep or replace each component of hₜ₋₁ with a candidate update ~hₜ. If zₜᵢ = 1: replace neuron i with a new value. If zₜᵢ = 0: retain neuron i from the previous timestep. Parameters: W_z, W_r, W.
Long Short-Term Memory (LSTM)
LSTM (Long Short-Term Memory): A more expressive gated RNN. It maintains two states: cell state cₜ (long-term memory) and hidden state hₜ (output). Gates: forget gate fₜ, input gate iₜ, output gate oₜ control updates to the cell and hidden state. Cell state updated via cₜ = fₜ * cₜ₋₁ + iₜ * ~cₜ (additive). Output: hₜ = oₜ * tanh(cₜ).
Sequence-to-Sequence Tasks
Seq-to-seq Tasks: Tasks where input and output are both sequences (e.g., translation, summarization, or command-to-action). Two RNNs are used: an encoder processes the input into a feature vector, and a decoder generates the output one token at a time.
Encoder-Decoder Model Architecture
Encoder-Decoder Model: The encoder summarizes the input into a hidden/cell state vector, which is then used to initialize the decoder. The decoder predicts the next token conditioned on the previous hidden state and the previous token output. This approach is autoregressive and predicts an [END] token to stop generation.
Bidirectional RNNs
Bidirectional RNNs: Encodings are improved by running one RNN left-to-right and one right-to-left. The final encoding is a concatenation of the forward and backward final states. This doubles the feature dimension.
Attention Mechanism
Attention Mechanism: It solves the bottleneck of relying on a single encoder vector to store the entire input. It allows the decoder to attend to relevant encoder states at each generation step. Compute attention scores between the decoder state and each encoder state (e.g., using a dot product). Apply softmax to the scores to get weights αᵢ over encoder positions. Compute the context vector cₜ = ∑ᵢ αᵢ · encoder_stateᵢ as a weighted average. There are no trainable parameters in its basic form.
Luong Attention (2015)
Luong Attention (2015): A dot product is computed between the decoder hidden state hₜ and each encoder state, which is then passed through a softmax to get attention weights, resulting in the context vector cₜ. Combine cₜ and hₜ, then pass them through an MLP to produce a new hidden state used for output prediction.
General Attention Formulation
General Attention Formulation: Inputs include a query vector q, key vectors k₁,…,kₜ, and value vectors v₁,…,vₜ. The output is a weighted sum of values, where weights are derived from softmax(dot(q, kₜ)). In sequence-to-sequence RNNs: q is the decoder hidden state, kᵢ are the encoder hidden states, and vᵢ are the encoder hidden states.
Attention as Information Retrieval
Attention as Retrieval: Think of the decoder hidden state as the query, encoder states as the keys, and the representations passed back as the values—just like a search engine matching queries to documents.
Conclusion on RNNs and Attention
Conclusion: GRUs/LSTMs mitigate vanishing gradients using gates and additive updates. RNNs are used as encoders to produce feature vectors or as decoders to generate sequences. Sequence-to-sequence models use both; attention improves them by dynamically aligning outputs to relevant inputs.
Transformer Neural Network Architecture
Transformer Overview: Input is a sequence of T word embeddings of size d, forming a T × d matrix. Apply alternating multi-headed attention and feedforward layers, both transforming T × d input to T × d output.
Transformer Embedding Layer
Embedding Layer: Learns word vectors plus positional embeddings (added to encode word order).
Transformer Feedforward Layer
Feedforward Layer: An MLP is applied separately to each token, typically structured as Linear(d → d_hidden) → ReLU → Linear(d_hidden → d).
Motivation for Attention
Attention Motivation: Replaces recurrence with attention over the full sequence, allowing parallelism and long-range dependencies without sequential bottlenecks.
Self-Attention Mechanism
Self-Attention: For an input matrix X ∈ ℝ^{T×d}, apply three learned linear projections: Q = XW_Q, K = XW_K, V = XW_V (where W matrices ∈ ℝ^{d×d_attn}). Compute dot products QKᵀ ∈ ℝ^{T×T}, then apply softmax row-wise to obtain attention weights. Multiply by V: The output is softmax(QKᵀ)V. Each output token attends to all tokens (including itself) in the input sequence.
Multi-Headed Attention
Multi-Headed Attention: Use n heads, each with its own W_Q, W_K, W_V mapping d → d/n. Perform attention independently for each head, resulting in an output of T × d/n. Concatenate all head outputs to get the final T × d matrix. This allows each head to learn different types of relationships (e.g., syntax, coreference, position).
Overall Transformer Architecture
Transformer Architecture: Stack L alternating multi-head attention and feedforward layers with residual connections and Layer Normalization after each sublayer. The final output is a T × d matrix, with one vector per input token.
Residual Connections
Residual Connections: Let the layer input be X and the output be O; the final output is X + O. This helps gradient flow and avoids vanishing gradients. It is also common in CNNs.
Layer Normalization
Layer Normalization: For each token vector x ∈ ℝ^d, compute the mean μ and standard deviation σ. Then, normalize to (x − μ) / (σ + ε), and finally rescale with learnable parameters a and b: y = a * normalized + b. It is applied around the attention and feedforward sublayers for stability (after each sublayer in the original Transformer, before each sublayer in many modern variants).
Positional Embeddings
Positional Embeddings: Since attention is order-invariant, positional information is added by learning a vector per position. The final input to the Transformer is the word embedding plus the positional embedding.
Byte-Pair Encoding (BPE)
Byte-Pair Encoding (BPE): Tokenizes words into subword units to handle rare or unknown words. Common words become one token, while rare words become multiple subwords. This reduces vocabulary size and increases generalization.
Scaled Dot Product Attention
Scaled Dot Product Attention: Instead of raw dot products QKᵀ, scale by √d_attn so that large dot products do not saturate the softmax (pushing weights to nearly 0 or 1). Formula: softmax(QKᵀ / √d_attn) · V.
Autoregressive Decoding
Autoregressive Decoding: During generation, the decoder predicts one token at a time using previous outputs. Apply causal masking: mask QKᵀ scores where the query index is less than the key index with -∞, so softmax ignores future tokens. This allows parallel training but enforces an autoregressive constraint.
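A minimal NumPy sketch combining the scaled dot-product formula with causal masking; Q, K, V are assumed to be T × d_attn matrices for a single head:

```python
import numpy as np

def masked_attention(Q, K, V, causal=True):
    """Scaled dot-product attention with optional causal masking."""
    d_attn = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_attn)                    # T×T similarity scores
    if causal:
        T = scores.shape[0]
        scores[np.triu_indices(T, k=1)] = -np.inf         # query i cannot attend to keys j > i
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)         # row-wise softmax
    return weights @ V
```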
Transformer Encoder-Decoder Model
Transformer as Encoder-Decoder: The encoder maps the input sequence to contextual representations. The decoder generates outputs one-by-one, using encoder outputs and previously generated tokens.
Masked Language Modeling (MLM)
Masked Language Modeling (MLM): A pretraining task for BERT, where random tokens are masked, and the model is trained to predict them. It trains on large unlabeled corpora (e.g., internet data) with massive compute resources.
Fine-Tuning Transformers
Fine-Tuning: Pretrain on MLM, then add a classifier head, and finally fine-tune on a downstream task (e.g., sentiment analysis, QA, entailment) using all model parameters. This is very effective even on small datasets.
Vision Transformers (ViT)
Vision Transformers: Split an image into patches (e.g., 16×16), flatten them, and embed them as a sequence of vectors. Add a CLS token and positional embeddings, then feed them to a standard Transformer. The final CLS vector is used for classification.
CNNs vs. Vision Transformers
CNN vs Vision Transformer: CNNs use fixed-size receptive fields and strong spatial biases. Vision Transformers allow each patch to attend globally, exhibiting weaker inductive bias but better scaling with data.
Training Transformers
Training Transformers: Use large datasets, GPUs/TPUs, a parallelism-friendly architecture, and pretraining plus fine-tuning to achieve state-of-the-art results in NLP and beyond.
Unsupervised Learning Fundamentals
Unsupervised Learning: There are no labels y; the dataset consists only of inputs {x(1), …, x(n)}. The goal is to learn the structure of data without supervision. Types include: Clustering (grouping data points), Dimensionality Reduction (finding a low-dimensional subspace), and Embeddings (learning vectors for structured objects).
Clustering Algorithms
Clustering: Assigns each data point x(i) to a cluster index zᵢ ∈ {1, …, k}. The output is an assignment vector z₁:ₙ where each cluster j is a set of points with zᵢ = j. The number of clusters k is a hyperparameter and must be chosen beforehand (or heuristically).
k-Means Clustering Objective
k-Means Clustering (Objective): Represents each cluster with a centroid µⱼ. The loss is L = ∑₍ᵢ₌₁₎ⁿ ∥x(i) − µ_{zᵢ}∥², aiming to minimize the distance from each point to its cluster center. Also called reconstruction error—how well can we represent each x(i) using just the centroid µ_{zᵢ}?
k-Means Algorithm (Alternating Minimization)
k-Means Algorithm (Alternating Minimization): Gradient descent cannot be used because zᵢ are discrete. Instead, alternate two steps until convergence:
- Step 1 (Update zᵢ): Assign each point to the nearest centroid, so zᵢ = argminⱼ ∥x(i) − µⱼ∥².
- Step 2 (Update µⱼ): Recompute each centroid as the mean of assigned points, so µⱼ = average of {x(i) | zᵢ = j}.
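A minimal NumPy sketch of this alternating minimization, including the random-data-point initialization described in the next paragraph; n_iters and seed are placeholders:

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Alternating minimization for k-means: assign points to centroids, then recompute centroids."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=k, replace=False)]                 # init: k random data points
    for _ in range(n_iters):
        dists = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)  # n×k distances
        z = dists.argmin(axis=1)                                        # Step 1: nearest centroid
        new_mu = np.array([X[z == j].mean(axis=0) if np.any(z == j) else mu[j]
                           for j in range(k)])                          # Step 2: mean of assigned points
        if np.allclose(new_mu, mu):                                     # converged: assignments stable
            break
        mu = new_mu
    return z, mu
```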
k-Means Initialization
Initialization: Start with random µⱼ by choosing k random data points. It converges when z₁:ₙ and µ₁:k stop changing. It is not guaranteed to find the global minimum; it may converge to a local minimum. Therefore, use multiple random restarts and pick the clustering with the lowest loss.
k-Means vs. Classification
Comparison to Classification: Superficially similar—the output is a label from {1,…,k} and depends on learned parameters µ₁:k. But no true labels are known in clustering, and we only care about dividing up the training data—no generalization to new test data is typically considered.
Choosing the Number of Clusters (k)
Choosing k (Number of Clusters): More clusters always reduce the loss, so the loss alone cannot be used to pick k. Use the elbow criterion: plot loss versus k and look for the point where the curve flattens (the “elbow”). Pick k where adding more clusters yields diminishing returns on loss reduction.
K-Means Limitations and Gaussian Mixture Models
k-Means Limitations: It assumes all clusters are spherical, equal in size, and defined by a single centroid. It cannot model clusters with differing shapes, densities, or orientations.
Gaussian Mixture Models (GMM)
Gaussian Mixture Model (GMM): A probabilistic clustering method where each cluster is defined by a multivariate Gaussian with its own mean µⱼ, covariance Σⱼ, and mixing weight πⱼ. It allows soft assignments of points to clusters based on likelihood.
Multivariate Gaussian Distribution
Multivariate Gaussian: A generalization of the univariate Gaussian to a vector x ∈ ℝᵈ. It is parameterized by a mean vector µ ∈ ℝᵈ and a covariance matrix Σ ∈ ℝᵈˣᵈ. Covariance captures variance (diagonal entries) and correlation (off-diagonal entries) between variables. The probability density function is: P(x) = (1 / (2π)^{d/2} √det(Σ)) · exp(−½(x−µ)ᵀΣ⁻¹(x−µ)).
GMM Generative Process
GMM Generative Process:
- For each point, sample Zᵢ ∈ {1,…,k} from a categorical distribution with probabilities π₁:ₖ.
- Given Zᵢ = j, sample xᵢ ~ 𝒩(µⱼ, Σⱼ).
Zᵢ are latent variables (i.e., not observed).
GMM Parameters
GMM Parameters:
- π₁:ₖ = mixture weights, πⱼ = P(Zᵢ = j), ∑ πⱼ = 1
- µ₁:ₖ = cluster means
- Σ₁:ₖ = cluster covariances
Soft Inference (E-step)
Soft Inference (E-step): Compute the posterior P(Zᵢ = j | xᵢ) using Bayes’ Rule: P(Zᵢ = j | xᵢ) = (πⱼ · 𝒩(xᵢ; µⱼ, Σⱼ)) / ∑_{b=1}^k (π_b · 𝒩(xᵢ; µ_b, Σ_b)). This gives a soft assignment matrix R ∈ ℝⁿˣᵏ where Rᵢⱼ = P(Zᵢ = j | xᵢ).
Hard Assignment in GMMs
Hard Assignment: zᵢ = argmaxⱼ P(Zᵢ = j | xᵢ), which is the cluster with the highest posterior probability.
EM Algorithm for GMMs
EM Algorithm for GMMs: It iteratively improves parameters and soft assignments.
- E-step: Compute Rᵢⱼ = P(Zᵢ = j | xᵢ; current π, µ, Σ)
- M-step: Update π, µ, Σ using weighted averages with Rᵢⱼ as soft counts
M-step Updates
M-step Updates:
- µⱼ = ∑ Rᵢⱼ xᵢ / ∑ Rᵢⱼ
- πⱼ = ∑ Rᵢⱼ / n
- Σⱼ = ∑ Rᵢⱼ (xᵢ − µⱼ)(xᵢ − µⱼ)ᵀ / ∑ Rᵢⱼ
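A sketch of one full EM iteration combining the E-step with these M-step updates; it assumes SciPy's multivariate normal density is available and that pi, mu, and Sigma are arrays of the shapes described above:

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_step(X, pi, mu, Sigma):
    """One EM iteration for a GMM: E-step computes soft assignments R, M-step re-estimates parameters."""
    n, k = X.shape[0], len(pi)
    R = np.column_stack([pi[j] * multivariate_normal.pdf(X, mean=mu[j], cov=Sigma[j])
                         for j in range(k)])
    R /= R.sum(axis=1, keepdims=True)            # E-step: Rᵢⱼ = P(Zᵢ = j | xᵢ)
    Nj = R.sum(axis=0)                           # soft counts per cluster
    pi = Nj / n
    mu = (R.T @ X) / Nj[:, None]
    Sigma = np.array([(R[:, j, None] * (X - mu[j])).T @ (X - mu[j]) / Nj[j] for j in range(k)])
    return pi, mu, Sigma, R
```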
Expected Complete Log-Likelihood (ECLL)
ECLL (Expected Complete Log-Likelihood): The objective function for EM is ECLL = ∑₍ᵢ₌₁₎ⁿ ∑₍ⱼ₌₁₎ᵏ Rᵢⱼ log P(xᵢ, Zᵢ = j). Use current Rᵢⱼ values to compute the expectation over latent variables and maximize the expected log-likelihood.
k-Means vs. GMMs Comparison
- k-Means = hard assignment, minimizes ∑ ∥xᵢ − µ_{zᵢ}∥²
- GMMs = soft assignment, probabilistic model with Gaussian components
- EM for GMM = soft version of k-Means (inference + parameter update)
- k-Means = special case of GMM with fixed spherical covariance and hard assignments
Dimensionality Reduction Techniques
Dimensionality Reduction aims to find a low-dimensional subspace that best represents a high-dimensional dataset. Unlike clustering, which identifies subgroups, dimensionality reduction identifies structure across continuous directions. It is often used for visualization, compression, and preprocessing of high-dimensional data like images or genomes. For example, projecting genetic data (600,000-dimensional vectors) to 2D space via PCA reveals patterns corresponding to European geography.
Principal Component Analysis (PCA)
Principal Component Analysis (PCA) is the most widely used method for dimensionality reduction and finds orthogonal directions (principal components) that capture the most variance in the data. We assume the data has been mean-centered (i.e., ∑ x(i) = 0) so that the origin reflects the center of the data cloud.
Reconstruction Loss in PCA
Reconstruction Loss is defined as the squared Euclidean distance between the original point x(i) and its projection (wᵀx(i))w onto a unit direction vector w. The total loss L(w) = ∑ ∥x(i) − (wᵀx(i))w∥², where w is constrained to have ∥w∥ = 1. The projection of x onto w is given by (wᵀx)w, which is a scalar multiple of w. Our goal is to find the w that minimizes this loss over the dataset.
Equivalent Objective: Maximizing Variance
Equivalent Objective: Minimizing reconstruction loss is equivalent to maximizing the squared projection length (wᵀx(i))², since ∥x(i)∥² = (wᵀx(i))² + ∥x(i) − (wᵀx(i))w∥² by the Pythagorean Theorem. Therefore, minimizing loss is equivalent to maximizing ∑ (wᵀx(i))², which means maximizing the variance of projections onto w.
Variance Formulation and Covariance
Variance Formulation: The variance objective becomes wᵀΣw, where Σ = (1/n) ∑ x(i)x(i)ᵀ is the empirical covariance matrix of the dataset. Our goal is to find the unit vector w that maximizes wᵀΣw.
Eigendecomposition for PCA
Eigendecomposition: Since Σ is symmetric, it can be decomposed as Σ = UDUᵀ where U is an orthonormal matrix of eigenvectors and D is a diagonal matrix of eigenvalues λ₁ ≥ λ₂ ≥ … ≥ λ_d. Letting a = Uᵀw and knowing ∥a∥ = ∥w∥ = 1, we rewrite the variance objective as aᵀDa = ∑ λ_j a_j², which is maximized by setting a₁ = 1 and all other a_j = 0. Thus, w = u₁, which is the eigenvector corresponding to the largest eigenvalue of Σ.
Principal Components
Principal Components: The top eigenvector defines the first principal component, which is the direction capturing the most variance in the dataset. To reduce dimensionality to k dimensions, project onto the top k eigenvectors (those corresponding to the top k eigenvalues). The resulting k-dimensional projection preserves as much variance as possible in k dimensions.
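A minimal NumPy sketch of PCA as described: mean-center, form the covariance matrix, take the top-k eigenvectors, and project:

```python
import numpy as np

def pca(X, k):
    """Project data onto the top-k eigenvectors of the empirical covariance matrix."""
    Xc = X - X.mean(axis=0)                          # mean-center so the origin is the data center
    Sigma = (Xc.T @ Xc) / len(Xc)                    # empirical covariance (d×d)
    eigvals, eigvecs = np.linalg.eigh(Sigma)         # eigh: symmetric matrix, ascending eigenvalues
    top = eigvecs[:, np.argsort(eigvals)[::-1][:k]]  # top-k principal components (d×k)
    return Xc @ top                                  # n×k projection preserving maximal variance
```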
PCA Summary
- PCA identifies directions (principal components) that preserve maximal variance when projecting the data
- Each principal component is an eigenvector of the covariance matrix Σ
- The amount of variance explained by each component is proportional to its associated eigenvalue
- PCA is solved in closed form using eigendecomposition, requiring only linear algebra and no iterative optimization
Multi-Armed Bandit Problem
The Bandit Setting is a simplified form of reinforcement learning where there is no state, only actions and rewards. At each timestep t = 1,…,T, the agent chooses an action Aₜ ∈ {1,…,k} and receives a reward Rₜ drawn from an unknown distribution Pₐ(R) associated with action a. The agent’s goal is to maximize the total reward ∑ₜ₌₁ᵀ Rₜ without knowing the reward distributions in advance.
Exploration vs. Exploitation Tradeoff
Exploration vs Exploitation is the fundamental tradeoff: exploration involves trying unknown actions to learn their reward potential, while exploitation involves picking the action currently estimated to be best. Any good bandit algorithm must balance these two forces: explore enough to discover the best arm, and exploit enough to accumulate high rewards.
Regret in Bandit Algorithms
Regret is the difference between the total reward the algorithm achieves and the total reward the best fixed action (arm) would have given: Regret = µ(a*)·T − E[∑ₜ Rₜ], where µ(a*) = E[R | a*] is the expected reward of the optimal action a*. Regret is a theoretical measure used to evaluate algorithms, but it is not directly computable during execution.
Upper Confidence Bound (UCB) Algorithm
The Upper Confidence Bound (UCB) Algorithm solves bandits using the principle of optimism under uncertainty: at each round, pick the action with the highest upper confidence bound UCBₜ(a). Maintain a running estimate of the mean reward µ̂ₜ(a) for each action and a count nₜ(a) of how many times each action has been played.
UCB Score Formula
The UCB Score is defined as UCBₜ(a) = µ̂ₜ(a) + √(2 log t / nₜ(a)), where the first term encourages exploitation and the second term encourages exploration. As more samples are collected for action a, the confidence interval narrows (i.e., uncertainty shrinks), making that action less likely to be selected purely for exploration. The log(t) numerator grows slowly, ensuring that actions not chosen in a while will eventually become attractive again and be retried.
UCB Algorithm Steps
UCB Algorithm Steps:
- For t = 1 to k (number of actions), play each action once.
- For t = k+1 to T, choose Aₜ = argmaxₐ UCBₜ(a).
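A minimal NumPy sketch of these steps; pull(a) is a hypothetical callback that returns one reward sample for arm a:

```python
import numpy as np

def ucb_bandit(pull, k, T):
    """UCB: play every arm once, then always pick the arm with the highest upper confidence bound."""
    counts = np.zeros(k)
    means = np.zeros(k)
    for t in range(1, T + 1):
        if t <= k:
            a = t - 1                                            # play each action once
        else:
            ucb = means + np.sqrt(2 * np.log(t) / counts)        # exploitation term + exploration bonus
            a = int(np.argmax(ucb))
        r = pull(a)
        counts[a] += 1
        means[a] += (r - means[a]) / counts[a]                   # running mean estimate µ̂ₜ(a)
    return means, counts
```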
UCB Algorithm Interpretation
Interpretation: UCB balances learning and earning by combining an exploitation-promoting term (the sample mean) and an exploration bonus (uncertainty) into a single decision rule. Even without knowing the best action, UCB guarantees near-optimal performance over time by keeping an “open mind” and revisiting poorly explored actions.
UCB Regret Bound
The Regret Bound of UCB is sublinear: total regret R(T) = O(√(kT log T)) under bounded rewards R ∈ [0,1]. This implies that the average regret R(T)/T → 0 as T → ∞, meaning the algorithm asymptotically matches the optimal strategy.
Reinforcement Learning Fundamentals
Reinforcement Learning generalizes bandits by modeling sequential decision-making where actions influence future states. The environment is defined as a Markov Decision Process (MDP) with:
- S (set of states) and s_start (start state)
- A(s) (actions available in state s)
- T(s, a, s′) (transition probability)
- R(s, a, s′) (immediate reward)
- IsEnd(s) (whether state s ends an episode)
The Markov property means that state s fully determines future dynamics regardless of past history. An agent’s policy π(s) maps each state to an action a ∈ A(s). The agent aims to maximize total reward over an episode, not just immediate reward, by accounting for future reward potential. We use discounting with γ ∈ [0,1) to emphasize earlier rewards; the total discounted return is r₁ + γr₂ + γ²r₃ + …, which remains finite even in infinite-horizon settings.
The state value function V^π(s) is the expected total discounted reward starting at state s and following policy π. The optimal value V*(s) satisfies V*(s) = 0 if IsEnd(s); otherwise V*(s) = maxₐ Q*(s, a). The Q-function Q*(s, a) gives the expected return for taking action a in state s and acting optimally afterward: Q*(s, a) = ∑ₛ′ T(s, a, s′)·(R(s, a, s′) + γV*(s′)). The optimal policy is π*(s) = argmaxₐ Q*(s, a). In the RL setting, the agent does not know T or R, so it learns by interacting with the environment and collecting experience tuples (s, a, r, s′).
Tabular Q-Learning
In tabular Q-learning, we initialize Q̂(s, a) to 0 and iteratively update it using experience via Q̂(s, a) ← (1 − η)Q̂(s, a) + η(r + γV̂(s′)), where V̂(s′) = maxₐ′ Q̂(s′, a′) and η is a learning rate. This nudges Q̂ toward the new observed reward plus the estimated future value. To encourage exploration, we use ϵ-greedy policies: with probability ϵ, a random action is taken; otherwise, the action with the highest Q̂(s, a) is picked.
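A minimal sketch of the tabular update and ε-greedy action choice above; the environment interaction loop and the actions list are assumed to exist elsewhere:

```python
import random
from collections import defaultdict

Q = defaultdict(float)   # Q̂(s, a) initialized to 0 for every state-action pair

def q_update(s, a, r, s_prime, actions, terminal, eta=0.1, gamma=0.99):
    """One tabular Q-learning update: nudge Q̂(s,a) toward r + γ·maxₐ′ Q̂(s′,a′)."""
    v_next = 0.0 if terminal else max(Q[(s_prime, a2)] for a2 in actions)
    Q[(s, a)] = (1 - eta) * Q[(s, a)] + eta * (r + gamma * v_next)

def epsilon_greedy(s, actions, eps=0.1):
    """Explore with probability ε; otherwise pick the action with the highest Q̂(s, a)."""
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])
```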
State Discretization
In large or continuous state spaces, we use state discretization: each dimension is split into B buckets, leading to a total of B^d states, which is feasible when d is small.
Function Approximation in RL
To scale further, we apply function approximation, such as linear models: represent the state-action pair (s, a) with features ϕ(s, a) and define Q̂(s, a) = wᵀϕ(s, a). Update via gradient descent using squared loss: w ← w − η(wᵀϕ(s, a) − r − γV̂(s′))·ϕ(s, a). This generalizes Q-values across states using shared weights and extracted features.
Deep Q-Learning (DQN)
In Deep Q-Learning, we replace the linear model with a neural network Q̂_θ(s, a), and update θ via backpropagation on the loss ½(Q̂_θ(s, a) − r − γV̂(s′))², computing the gradient as (Q̂_θ(s, a) − r − γV̂(s′))·∇θQ̂_θ(s, a) with the target r + γV̂(s′) held fixed, and estimating V̂(s′) = maxₐ′ Q̂_θ(s′, a′).
Policy Gradient Methods (REINFORCE)
Instead of learning Q-values, we can also learn policies directly with policy gradient methods like REINFORCE: model π_θ(a | s), which is a probability distribution over actions, and maximize the expected trajectory reward V(θ) = E_z∼πθ[R(z)], where z = [s₁, a₁, r₁, …, s_T, a_T, r_T, s_T₊₁]. The discounted reward of trajectory z is R(z) = ∑ₜ γ^{t−1} rₜ, and the gradient is ∇θV(θ) = E[R(z) ∑ₜ ∇θ log π_θ(aₜ | sₜ)]. In practice, we sample trajectories and update θ ← θ + ηR(z) ∑ₜ ∇θ log π_θ(aₜ | sₜ) using stochastic gradient ascent. This avoids estimating Q-values and treats action selection as classification with high variance but direct credit assignment. Modern variants like Actor-Critic, PPO, and others reduce variance or improve stability but are built on this same core gradient estimator.
Adversarial Examples in Machine Learning
Adversarial Examples are inputs crafted by an attacker to intentionally cause a machine learning model to behave incorrectly. They can exist even for models with high average accuracy and are a major concern for security, interpretability, and robustness. In vision, adversarial perturbations are often small changes in pixel values that are imperceptible to humans but cause confident misclassifications. For example, a panda classified as a gibbon with 99% confidence after imperceptible noise is added.
Why Adversarial Examples Matter
Why We Care: Fooling facial recognition, self-driving cars, or spam detectors poses serious risks. Adversarial attacks also expose that models may rely on spurious correlations, not real understanding.
Adversarial Threat Models
The Threat Model includes three parts: attack vector (what the attacker can modify), knowledge (white-box = full access to the model, black-box = only query access), and goal (undirected = cause any error, directed = cause a specific target error). Attack Vectors include constrained perturbations (small norm difference, e.g., L₂ or L_∞), unconstrained inputs, and data poisoning (malicious training data). For images, attackers limit pixel changes to ε in each direction using norms like L_∞: max |xᵢ − x′ᵢ| ≤ ε or L₂: ‖x − x′‖₂ ≤ ε. White-box Attacks use model gradients directly; black-box attacks use queries and heuristics, often with a limited budget.
Fast Gradient Sign Method (FGSM)
The Fast Gradient Sign Method (FGSM) is a white-box attack that computes the perturbation as x′ = x + ε · sign(∇ₓ L(x, y)) to increase model loss. This approximates the model locally as linear and nudges the input in the direction that increases prediction error.
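A minimal PyTorch sketch of this attack under stated assumptions (model and loss_fn are placeholders; pixel values are assumed to live in [0, 1]):

```python
import torch

def fgsm_attack(model, loss_fn, x, y, eps=0.03):
    """Fast Gradient Sign Method: x' = x + ε·sign(∇ₓ L(x, y)) (white-box, undirected)."""
    x = x.clone().detach().requires_grad_(True)
    loss = loss_fn(model(x), y)
    loss.backward()                                   # populates x.grad with ∇ₓ L
    x_adv = x + eps * x.grad.sign()                   # one signed gradient step on the input
    return x_adv.clamp(0, 1).detach()                 # keep pixels in a valid range
```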
Defending Against Adversarial Attacks
Defending Against Attacks requires anticipating the worst-case input, not just random noise. Naive strategies like adding noise (data augmentation) are insufficient—they fail against worst-case perturbations.
Adversarial Training Defense
Adversarial Training defends by training the model on adversarial examples x_adv = A(x), where A is an attack algorithm like FGSM. The training loss is computed on adversarial inputs, so each gradient step becomes θ ← θ − η ∇θ L(f_θ(A(x)), y). This improves robustness to the specific attack used during training but may not generalize to all attacks.
Adversarial Attacks in NLP
NLP Attacks include Unicode attacks (swapping ASCII for similar-looking Unicode), typo-based attacks (causing large embedding changes from small textual edits), and meaning-preserving paraphrases (e.g., “what has” → “what’s”) that change model predictions despite semantic equivalence.
Jailbreaking Large Language Models (LLMs)
Jailbreaking LLMs uses prompts or images to force large language models like ChatGPT to output restricted content. Attacks include prompt engineering and adversarial image encoding, often by subtly altering inputs to bypass alignment safeguards.
Adversarial Examples Summary
- Adversarial examples affect images, text, and multimodal models
- White-box attacks like FGSM are fast and effective
- Adversarial training improves robustness against specific threats
- Models may still fail under novel perturbations even with high average accuracy
Spurious Correlations and ML Fairness
Spurious Correlations are input-label associations that are predictive in training but not causally related to the task. Models learn any pattern that helps minimize loss—even irrelevant or misleading ones. For example, a model predicts “wolf” when snow is present because wolves often appear in snowy pictures—it misclassifies a husky as a wolf when snow is present. In medicine, models learned to detect pneumonia based on radiology technician tokens that vary by hospital, not actual medical features—this leads to poor generalization across hospitals. In NLP, models associate identity terms (e.g., “gay”, “Muslim”) with toxicity or sentiment, simply because such mentions co-occur with hate speech or reviews—not because the terms themselves imply anything. Spurious correlations undermine generalization and robustness: models fail when tested in different conditions (e.g., ducks on land instead of water). Solutions include data diversification and out-of-distribution testing—never assume a model will generalize without measuring it.
Fairness in Machine Learning
Fairness in ML concerns how models make decisions affecting different groups and individuals, especially across sensitive attributes A (e.g., race, gender). Core setup: input features X, prediction target Y, model output R, and sensitive attribute A—we ask whether the model treats different A values fairly. Fairness cannot be achieved by just omitting A (no fairness through unawareness)—models can reconstruct A from proxies like zip code.
Types of Fairness Definitions
Types of Fairness:
Independence (Statistical Parity)
Independence (Statistical Parity): prediction R is independent of A—groups get an equal percentage of positive predictions regardless of actual labels.
Separation (Equalized Odds)
Separation (Equalized Odds): prediction R is conditionally independent of A given Y—equal true and false positive rates across groups.
Sufficiency (Calibration)
Sufficiency (Calibration): Y is conditionally independent of A given R—predictions are equally reliable for all groups (e.g., if a model outputs 0.8 probability, the actual outcome occurs 80% of the time across all groups). These fairness criteria are often mutually incompatible—you cannot satisfy all simultaneously if groups differ in base rates.
COMPAS Example
For example, COMPAS (a recidivism prediction tool used in court decisions) was more likely to falsely classify Black defendants as high risk (violating separation), yet was equally calibrated across groups (satisfying sufficiency). Takeaway: Fairness depends on what definition you use and what harms you prioritize.
Types of Fairness Harms
Types of Fairness Harms:
Allocative Harms
Allocative Harms: unequal access to opportunities/resources (e.g., bail decisions, medical care, resume filtering)—for example, Amazon’s recruiting AI penalized resumes with “women’s” and female colleges.
Unequal Accuracy
Unequal Accuracy: some groups consistently receive worse predictions—for example, ASR and facial recognition systems have higher error rates for Black speakers and dark-skinned women.
Representational Harms
Representational Harms: stereotypes are reinforced through model outputs—for example, search engines showing biased results, or machine translation always translating “doctor” as male.
Summary of Spurious Correlations and Fairness
- Machine Learning models learn from data, including undesirable patterns
- Spurious correlations hurt generalization
- Fairness definitions conflict
- Harms arise from allocation, unequal accuracy, or reinforcing societal bias
- Mitigation requires understanding model behavior, measuring performance across groups, and being explicit about which fairness definition matters for a given application