Machine Learning Fundamentals: Algorithms and PyTorch Implementation
Machine Learning Core Concepts
Basic Mathematical Information
Matrix Multiplication
- If Matrix A has size (m x n) and Matrix B has size (n x p), the resulting product AB has size (m x p).
- The number of columns in A must equal the number of rows in B.
- Calculation involves multiplying each row of A by each column of B.
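A quick NumPy sketch of the shape rule (the matrices here are arbitrary examples):

import numpy as np

A = np.arange(6).reshape(2, 3)    # (2 x 3)
B = np.arange(12).reshape(3, 4)   # (3 x 4): columns of A == rows of B
C = A @ B                         # each entry: a row of A dotted with a column of B
print(C.shape)                    # (2, 4)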
Finding Log Base 2 (n)
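Log base 2 appears in the entropy calculations later; it can be computed directly with NumPy or via the change-of-base formula $\log_2 n = \ln n / \ln 2$. A minimal sketch:

import math
import numpy as np

n = 8
print(np.log2(n))                 # 3.0
print(math.log(n, 2))             # 3.0
print(math.log(n) / math.log(2))  # 3.0, change-of-base formula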
Key Machine Learning Definitions
- Supervised Learning: Models learn from labeled data to approximate a target function (hypothesis function).
- Classification: Goal is to categorize input into discrete classes (e.g., predicting disease presence, identifying fruit type).
- Regression: Goal is to predict continuous numerical values based on input data (e.g., predicting house price based on size).
- Unsupervised Learning: Models find patterns in unlabeled data (e.g., clustering).
- Overfitting (High Variance): Model performs well on training data but poorly on test data, often due to excessive complexity.
- Underfitting (High Bias): Model is too simple to capture the underlying patterns in the data.
Machine Learning Pipeline
- Data Preprocessing: Handling missing data, encoding categorical variables (one-hot, ordinal).
- Feature Engineering: Selecting relevant features, removing irrelevant ones, and designing/creating new features.
- Model Selection.
- Model Training: The model learns to map input to output, adjusting parameters to minimize prediction errors.
- Model Evaluation.
Data Preprocessing Techniques
Representing Data
Tabular Data Encoding
- One-Hot Encoding: Converts categorical labels into binary vectors. E.g., “science”, “arts”, “hybrid” → [1,0,0], [0,1,0], [0,0,1]. The number of elements in the list equals the number of unique labels.
- Ordinal Encoding: Assigns numerical values based on order or rank. E.g., “science”, “arts”, “hybrid” → 0, 2, 1 (the exact mapping depends on the defined order).
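A minimal scikit-learn sketch of both encodings; the column name and the category order passed to OrdinalEncoder are illustrative assumptions, not from the notes:

import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

df = pd.DataFrame({"stream": ["science", "arts", "hybrid", "arts"]})

onehot = OneHotEncoder()                              # one binary column per unique label
print(onehot.fit_transform(df[["stream"]]).toarray())

ordinal = OrdinalEncoder(categories=[["science", "hybrid", "arts"]])  # explicit rank order
print(ordinal.fit_transform(df[["stream"]]))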
Image and Video Representation (CNN)
- Pixel Grid Representation: Images are represented as grids of pixels.
- Flat Black & White: 1 channel, typically using 0s and 1s.
- Grayscale Image: 1 channel, using a scale of 256 possible levels (0 to 255 inclusive) representing varying shades of gray.
- RGB: Combines Red, Green, and Blue channels, each with 256 possible levels.
- Video: A sequence of images, represented by repeated pixel grid representations for each frame.
Word and Sentence Representation
- Words (One-Hot Encoding): The number of elements equals the size of the vocabulary. E.g., “ume” → [1, 0, 0], “is” → [0, 1, 0], “cute” → [0, 0, 1].
- Sentence (Sequence of Words): A list of one-hot encoded words. E.g., “ume is cute” → [[1,0,0], [0,1,0], [0,0,1]].
- Bag of Words (BoW): Represents the frequency of words in a document, ignoring grammar and word order. E.g., “ume is so so cute cute cute”:

| ume | is | so | cute |
|-----|----|----|------|
| 1 | 1 | 2 | 3 |
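A small Bag-of-Words sketch using scikit-learn's CountVectorizer on the example sentence (the alphabetical vocabulary order is simply how CountVectorizer sorts it):

from sklearn.feature_extraction.text import CountVectorizer

docs = ["ume is so so cute cute cute"]
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())  # vocabulary, e.g. ['cute' 'is' 'so' 'ume']
print(counts.toarray())                    # word frequencies, e.g. [[3 1 2 1]]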
Handling Data Issues
Class Imbalance
Issue: Model may learn that predicting the dominant class results in fewer mistakes.
Solution: Data Resampling
- Oversampling: Duplicate data points from the minority class.
- Undersampling: Use only a subsample of the majority class.
Missing Values
- Solution 1: Remove all data points with missing values (not recommended if the training sample is small).
- Solution 2 (Imputation): Replace missing values with the mean or mode of the column.
- Implementation: sklearn.impute.SimpleImputer(strategy="mean") or SimpleImputer(strategy="most_frequent").
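A short imputation sketch with SimpleImputer; the toy array with NaNs is illustrative:

import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 7.0],
              [np.nan, 5.0],
              [3.0, np.nan]])

mean_imputer = SimpleImputer(strategy="mean")           # NaNs -> column means
print(mean_imputer.fit_transform(X))

mode_imputer = SimpleImputer(strategy="most_frequent")  # mode; also works for categorical columns
print(mode_imputer.fit_transform(X))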
Outliers
- Solution 1: Z-Score Method
- Calculate Z-scores: sklearn.preprocessing.StandardScaler().fit_transform(data)
- Find data points where |Z| > threshold.
- Solution 2: IQR Method
- Arrange data in ascending order.
- Calculate IQR (Interquartile Range): $Q3 - Q1$ (np.percentile(data, 75) - np.percentile(data, 25)).
- Lower Bound = $Q1 - 1.5 \times IQR$
- Upper Bound = $Q3 + 1.5 \times IQR$
- Remove points outside this range (point > upper bound OR point < lower bound).
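A sketch of the IQR method above in NumPy; the data array (with 100 as an obvious outlier) is illustrative:

import numpy as np

data = np.array([2, 3, 4, 5, 6, 7, 100])   # 100 is an obvious outlier

q1, q3 = np.percentile(data, 25), np.percentile(data, 75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

filtered = data[(data >= lower) & (data <= upper)]
print(filtered)                             # the outlier 100 is removed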
Feature Scaling
Issue: Features with different scales may cause the model to give more importance to features with a larger range.
Solution 1: Z-Score Standardization: $z = (x - \mu) / \sigma$
Solution 2: Min-Max Scaling: $x' = (x - x_{\min}) / (x_{\max} - x_{\min})$
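A minimal sketch of both scalers in scikit-learn on toy data with very different column ranges:

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0]])

print(StandardScaler().fit_transform(X))  # each column: mean 0, standard deviation 1
print(MinMaxScaler().fit_transform(X))    # each column scaled to [0, 1]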
Supervised Learning Models
K-Nearest Neighbors (K-NN)
- Type: Supervised, primarily for classification.
- Best for: Datasets with a small number of attributes.
- Input/Output: Input can be continuous/discrete/binary; output is an array of labels.
K-NN Formula and Steps
- Calculate distances of data points from the given point using Manhattan distance $\sum_{i=1}^{n}|x_{a,i} - x_{b,i}|$ or Euclidean distance $\sqrt{\sum_{i=1}^{n}(x_{a,i} - x_{b,i})^2}$ ($n$ = number of features; $x_a$ & $x_b$ = two data points).
- Identify $K$ nearest neighbors based on distance.
- Classify the data point based on the majority vote among the $K$ neighbors.
Choosing K
- K too large: Decision boundary is overly smooth (linear), leading to underfitting.
- K too small: Decision boundary is overly complex, leading to overfitting.
Limitations
- Requires feature scaling.
- Curse of Dimensionality: In higher dimensions, points are more spread out, making it difficult to distinguish based on distance.
Mean Nearest-Farthest Distance Ratio: as the number of dimensions grows, the ratio of the nearest to the farthest neighbour distance approaches 1, so distances become less informative.
Implementation
- Use sklearn.neighbors.KNeighborsClassifier.
- Euclidean distance: np.linalg.norm(x_a - x_b) (where $x_a$ and $x_b$ are arrays of the feature values $[x_1, x_2, \dots, x_n]$).
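A small K-NN sketch with scikit-learn; the training points, labels, and query point are illustrative:

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X_train = np.array([[1, 1], [1, 2], [8, 8], [9, 8]])
y_train = np.array([0, 0, 1, 1])

knn = KNeighborsClassifier(n_neighbors=3)   # K = 3, Euclidean distance by default
knn.fit(X_train, y_train)
print(knn.predict([[2, 1]]))                # majority vote among the 3 nearest neighbours

# Euclidean distance between two points by hand:
x_a, x_b = np.array([1, 1]), np.array([2, 1])
print(np.linalg.norm(x_a - x_b))            # 1.0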
Decision Trees
- Type: Supervised learning, used for classification.
- Input: Binary, discrete, or discretized continuous data (e.g., range 0-10, 10-20).
- Output: Single decision value.
Building the Tree
- Get Root Node:
- Compute the entropy of the data and all attributes.
- Compute the remainder (or information gain) for each attribute. Choose the attribute with the smallest remainder as the root.
- Split Dataset: Split the dataset by the root attribute’s values.
- If an attribute value has more than one outcome, repeat the process using the next best attribute (smallest remainder).
- To find the next best attribute: Compute the remainder within each subset filtered by the previous best attribute.
- If two attributes have the same remainder, choose the one that comes earlier in the tie-breaking order.
- Repeat until: No attributes remain OR all samples in a subset belong to one class.
Entropy (I)
Measures the uncertainty or unpredictability of the outcomes of a decision.
- Maximum uncertainty: $I(y) = 1$ (random).
- No uncertainty: $I(y) = 0$ (deterministic).
Formula: $I(y) = -\sum_{k} p_k \log_2 p_k$ (for a binary outcome: $I(y) = -p\log_2 p - (1-p)\log_2(1-p)$).
Calculate Probability (p): $p = n(\text{samples with label}) / n(\text{total samples})$. If using a subset, filter by all samples with the attribute value (e.g., only see outcomes of points where attribute “taste = salty”).
Remainder
The uncertainty remaining after partitioning the dataset based on an attribute, or the weighted average of the entropy of the subsets after splitting the dataset based on that attribute.
Smaller remainder → better attribute.
Formula: $\text{Remainder}(A) = \sum_{v=1}^{V} W(A_v) \times I(A_v)$
Where $V$ = number of values (outcomes) of attribute $A$, $I(A_v)$ = entropy of the subset with value $v$, $W(A_v)$ = weight of that subset.
Weight Calculation: $W(A_v) = n(\text{samples with value } v) / n(\text{total samples})$
Information Gain (IG)
Reduction in uncertainty after splitting on an attribute. Higher IG → better attribute → should be chosen earlier.
Formula: $IG(A) = \text{Entropy}(\text{data}) - \text{Remainder}(A)$
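A NumPy sketch of entropy, remainder, and information gain; the toy labels and the "taste" attribute are illustrative:

import numpy as np

def entropy(labels):
    """I(y) = -sum_k p_k * log2(p_k) over the class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def remainder(attribute, labels):
    """Weighted average entropy of the subsets induced by an attribute."""
    total = len(labels)
    rem = 0.0
    for v in np.unique(attribute):
        subset = labels[attribute == v]
        rem += (len(subset) / total) * entropy(subset)
    return rem

y = np.array([1, 1, 0, 0, 1, 0])   # outcomes
taste = np.array(["salty", "salty", "sweet", "sweet", "salty", "sweet"])

ig = entropy(y) - remainder(taste, y)   # information gain of "taste"
print(entropy(y), remainder(taste, y), ig)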
Pruning (To Prevent Overfitting)
- Max-Depth: Limit the tree depth; use majority vote at maximum depth.
- Min-Sample Leaf: Set minimum leaf sample size; use majority vote if fewer samples exist.
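Both pruning options map directly to scikit-learn hyperparameters; a minimal sketch (X_train and y_train are placeholders, not from the notes):

from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(
    criterion="entropy",    # split using information gain
    max_depth=3,            # max-depth pruning
    min_samples_leaf=5,     # min-sample-leaf pruning
)
# Usage: tree.fit(X_train, y_train); predictions = tree.predict(X_test)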
Gradient Descent
Iteratively updates weights to minimize the loss by following the gradient: $w_j \leftarrow w_j - \gamma \, \partial J / \partial w_j$ (where $\gamma$ is the learning rate).
Global Minimum Theorem: The graph of the loss function against weights must have only one minimum, which is the global minimum.
Stops when $\partial J / \partial w = 0$ (converges to minimum).
- Loss function for Linear Regression: Mean Squared Error (MSE).
- Loss function for Logistic Regression: Binary Cross-Entropy (BCE) Loss.
Gradient Calculation ($\partial / \partial w_j$ of loss):
Determining Optimal Weight and Learning Rate
- Optimal Weight ($w^*$): Find the derivative of the cost function, set $d(\text{cost})/d(w) = 0$, and solve for $w$.
- Optimal Learning Rate ($\gamma$):
- Let the error be $e(j) = w(j) - w^*$. Substitute this into the gradient update equation.
- Rearrange the equation so that $e(j+1)$ appears on the LHS and $e(j)$ on the RHS.
- Find the value of $\gamma$ for which the error shrinks, i.e., $|e(j+1)| < |e(j)|$.
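A minimal gradient-descent sketch for a single weight under an MSE-style cost; the toy data (generated from $y = 3x$, so $w^* = 3$) and the learning rate are illustrative:

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = 3.0 * x                    # optimal weight w* is 3

w, gamma = 0.0, 0.05           # initial weight and learning rate
for _ in range(200):
    y_hat = w * x
    grad = (2 / len(x)) * np.sum((y_hat - y) * x)   # d(MSE)/dw
    w = w - gamma * grad       # gradient update step
print(w)                       # converges towards w* = 3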
Linear Regression
- Type: Supervised, used for predicting continuous values by fitting a straight line to data.
Model Hypothesis
- 1 Feature, 1 Label: $h_w(x) = w_0 + w_1x_1$
- Multiple Features: $h_w(x) = w_0 + w^Tx$
- $w^Tx = w_1x_1 + w_2x_2 + \dots + w_nx_n$
Key Terms
- $\hat{y}$: Predicted value.
- $x = [x_1, x_2, \dots, x_n]$: Input features.
- $w = [w_1, w_2, \dots, w_n]$: Weights (parameters).
- $w_0$: Y-intercept / bias.
- $i$: Index of a data point (runs from 1 to $m$ = number of samples; data.shape[0]).
- $j$: Index of a feature (runs from 1 to $n$ = number of features; data.shape[1] - 1).
Cost Function (MSE)
Measures the ‘goodness’ of the model.
Formula: $J(w) = \frac{1}{m}\sum_{i=1}^{m}\left(\hat{y}^{(i)} - y^{(i)}\right)^2$
Where $m$ = number of training data samples.
- Note: The loss function measures the loss of a single data point, i.e., $(\hat{y}-y)^2$ (squared error).
- The cost function measures the average loss across the entire training set.
Variants of Gradient Descent
- Mini-Batch: Uses a subset of data. Faster and cheaper, good for large datasets.
- Stochastic: Uses 1 data point per update. Fast but may overfit, unlikely to hit the exact global minimum.
Handling Non-Linear Relationships
Transform the relationship into a linear one by adding polynomial features (Polynomial Regression).
Polynomial Regression: Approximates any continuous function on a closed and bounded interval (min-max known) to any degree of accuracy (Weierstrass Approximation Theorem).
Example: $h_w(x) = w_0 + w_1x + w_2x^2$
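A polynomial regression sketch: generate polynomial features, then fit an ordinary linear model. The toy quadratic data and degree=2 are illustrative:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

x = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
y = 1 + 2 * x.ravel() + 0.5 * x.ravel() ** 2        # quadratic relationship

x_poly = PolynomialFeatures(degree=2, include_bias=False).fit_transform(x)  # [x, x^2]
model = LinearRegression().fit(x_poly, y)           # learns w_0, w_1, w_2
print(model.intercept_, model.coef_)                # ~1.0, [2.0, 0.5]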
Logistic Regression
- Type: Supervised, used for classification with continuous inputs.
- Decision Boundary: Usually linear (or can be transformed to linear).
- Use Case: Mainly for binary classification (yes/no, true/false, 0/1).
- Prediction: Predicts the probability of a given input belonging to a class. A threshold (default 0.5 in sklearn) determines the final classification.
Model (Sigmoid Function)
The sigmoid function maps any real-valued number into a probability between 0 and 1: $\sigma(z) = \frac{1}{1 + e^{-z}}$, where $z = w_0 + w^Tx$.
Cost Function (Binary Cross-Entropy)
$J(w) = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log\hat{y}^{(i)} + (1 - y^{(i)})\log(1 - \hat{y}^{(i)})\right]$
Common Issues and Fixes
Non-Linear Decision Boundary
Add polynomial terms of the existing features (e.g., a squared term for each feature in the dataset).
Example: Transform $h_w(x) = w_0 + w_1x_1 + w_2x_2$ into a circular boundary:
$h_w(x) = w_0 + w_1x_1 + w_2x_2 + w_3x_1^2 + w_4x_2^2$
Note: Coefficients may not necessarily be 0 and 1; this depends on the problem.
Multi-Class Classification
- One vs All (OvA): Fit one classifier per class. Train each model to distinguish one class (“1”) from all others (“0”). Choose the class whose classifier outputs the highest probability of being “1”.
- One vs One (OvO): Fit one classifier per class pair. The class with the higher probability “wins.” Choose the class with the highest number of “wins.”
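Both strategies are available as wrappers in scikit-learn; a minimal sketch using logistic regression on the 3-class iris dataset (chosen only as a convenient example):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier

X, y = load_iris(return_X_y=True)

ova = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)  # 3 classifiers
ovo = OneVsOneClassifier(LogisticRegression(max_iter=1000)).fit(X, y)   # 3 class pairs
print(ova.predict(X[:5]), ovo.predict(X[:5]))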
Regularization
Methods to prevent overfitting:
- Reduce the number of features (e.g., transform from high to low degree polynomial).
- Regularization: Reduce the magnitude of weights ($w_j$).
L2 (Ridge) Regularized Cost Function
Formula: $J_{\text{reg}}(w) = J(w) + \lambda\sum_{j=1}^{n} w_j^2$, where $\lambda$ is the regularization parameter, $J(w)$ is the unregularized cost, and the bias $w_0$ is not penalized.
- For Logistic Regression: $J(\text{BCE})$
- For Linear Regression: $J(\text{MSE})$
Regularized Loss Gradient
$\frac{\partial J_{\text{reg}}}{\partial w_j} = \frac{\partial J}{\partial w_j} + 2\lambda w_j$ (for $j \ge 1$).
Unsupervised Learning: Clustering
Unsupervised learning method used to find groupings based on data point feature similarities (no labels).
K-Means Clustering
Centroid Definition
Coordinates are the mean of all data points for each feature ($x_1, x_2, x_3, \dots, x_n$) within the cluster. $m$ = number of data points in the cluster.
Algorithm Steps
- Initialize $K$ centroids.
- For each data point (for $i = 1, \dots, m$), assign it to the nearest centroid (calculated using Euclidean distance).
- Update centroids (calculate the mean of all data points in the new cluster).
The loop stops when the centroids no longer change (convergence).
Initialization Methods
- K-Means: Randomly initialize $K$ centroids.
- K-Means++: Randomly initialize the first centroid. Select each subsequent centroid with probability $P$ proportional to $D(x)^2$, the squared distance of data point $x$ from its nearest existing centroid.
  - $D(x)^2$ = squared distance of the current point from the nearest centroid.
  - Larger distance → higher probability → more likely to be selected as the next centroid.
Evaluate Performance (J)
Calculate the Mean Euclidean Distance J of each data point from its centroid. A smaller value indicates better performance.
Determining K
- Elbow Method (Heuristic): Set $K$ to the ‘elbow’ point—the point where increasing the number of clusters ($K$) does not lead to a drastic improvement in clustering performance ($J$). (Note: Some graphs of J against K may not have a clear elbow).
- Application Dependent Method: $K$ is chosen based on specific case requirements.
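A sketch of the elbow method; here $J$ is approximated by KMeans' inertia_ (the sum of squared distances to the nearest centroid), and the blob data is illustrative:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

for k in range(1, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, km.inertia_)   # look for the K where the drop in J flattens out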
Hierarchical Clustering
Steps
- Initially, every point is a data cluster.
- In a loop, calculate the distances of each data point from all other data points using a linkage method.
- Find the minimum distance among all data points and merge them together.
- Repeat until all points form a single cluster.
Comparing Merged Clusters: After merging points [A, B] with point C, the distance is determined by the linkage type. For example, if using single linkage, the distance is $\min(\text{dist}(A, C), \text{dist}(B, C))$.
Visualization: Dendrogram
Draw from the bottom up until there is one link for all points. Drawing a horizontal line across the dendrogram shows the number of clusters (equal to the number of intersections between the line and the dendrogram).
Types of Linkages (Distance Metrics)
- Single Linkage: Distance between the points closest to one another in the clusters.
- Average Linkage: Mean distance between all points in each cluster.
- Complete Linkage: Distance between the points farthest from one another between clusters.
- Centroid Linkage: Distance between the centroids of the clusters.
Implementation Differences
- K-Means uses sklearn.cluster.KMeans(n_clusters=k) and minimizes a loss function.
- Hierarchical clustering uses sklearn.cluster.AgglomerativeClustering and combines pairs of clusters that are close together according to the linkage method.
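A side-by-side sketch of the two APIs; the blob data and the choice of single linkage are illustrative:

from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=100, centers=3, random_state=0)

kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
hier_labels = AgglomerativeClustering(n_clusters=3, linkage="single").fit_predict(X)

print(kmeans_labels[:10])
print(hier_labels[:10])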
Model Evaluation Metrics
Regression Metrics
- MSE: Mean Squared Error.
- MAE: Mean Absolute Error (mean of $|\hat{y} - y|$ over all samples).
Classification Metrics (Confusion Matrix)
|                 | Predicted Positive | Predicted Negative |
|-----------------|--------------------|--------------------|
| Actual Positive | TP (True Positive) | FN (False Negative) |
| Actual Negative | FP (False Positive) | TN (True Negative) |
Note: For all metrics below, values are between 0 and 1. Higher is better.
| Metric | Formula | Definition / Use Case |
|--------|---------|-----------------------|
| Accuracy (A) | $(TP + TN) / \text{Total Samples}$ | True predictions / total number of samples |
| Precision (P) | $TP / (TP + FP)$ | True positives / all predicted positives |
| Recall (R) | $TP / (TP + FN)$ | True positives / all actual positives |
| F1 Score | $2 / ((1/P) + (1/R))$ | Harmonic mean of Precision and Recall |
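A sketch of these metrics with scikit-learn; the toy labels are illustrative. Note that sklearn's confusion_matrix orders rows and columns by label value (0 then 1), so its layout differs from the table above:

from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

print(confusion_matrix(y_true, y_pred))   # rows: actual, columns: predicted
print(accuracy_score(y_true, y_pred))     # (TP + TN) / total
print(precision_score(y_true, y_pred))    # TP / (TP + FP)
print(recall_score(y_true, y_pred))       # TP / (TP + FN)
print(f1_score(y_true, y_pred))           # harmonic mean of P and R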
Hyperparameter Tuning and Validation
Hyperparameter: External configuration parameters of an ML model not learned from training data (e.g., learning rate, regularization parameter $\lambda$).
Methods to Find Best Hyperparameter
- Grid/Exhaustive Search: Test all specified hyperparameter combinations.
- Random Search: Select hyperparameters randomly to test.
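A minimal sketch of both searches over K for K-NN; the parameter grid and the iris data are illustrative:

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
param_grid = {"n_neighbors": [1, 3, 5, 7, 9]}

grid = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5).fit(X, y)        # tries all values
rand = RandomizedSearchCV(KNeighborsClassifier(), param_grid, n_iter=3, cv=5,
                          random_state=0).fit(X, y)                            # tries a random subset
print(grid.best_params_, rand.best_params_)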
Model Performance Measurement Steps
- Training:
- If finding the best hyperparameter: Train a new model with each hyperparameter value on the same dataset.
- If finding the best model: Select different models (K-NN, Logistic Regression, etc.) to train on the same dataset.
- Validating: Use a subset of the training sample to find the model/hyperparameter that produces the minimum error.
- Testing: Final evaluation on unseen test data.
Neural Networks and Deep Learning
Perceptrons
Perceptron Model
Input (weighted sum: $w_0 + w_1x_1 + w_2x_2 + \dots + w_nx_n$) → Activation Function ($g$) → Output ($\hat{y}$)
- The original perceptron, inspired by brain neurons, uses a step function on the weighted sum $z$: output is -1 if $z < 0$, and 1 if $z \ge 0$.
- The perceptron output without activation is equivalent to linear regression, forming a straight-line decision boundary.
- The activation function applies to the weighted sum of all inputs, not to individual input features.
Perceptron Update Algorithm
- Initialize weights randomly, predict $\hat{y}$.
- If $\hat{y} \ne y$, update the weights for the misclassified point: $w_j = w_j - \gamma(\hat{y} - y)x_j$ (where $\gamma$ is the learning rate and $x_j$ is the $j$-th feature of that point).
- Repeat until all points are classified correctly or maximum steps are reached.
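A NumPy sketch of the update rule above, trained on the AND gate (defined in the next subsection); prepending a constant 1 to each input so that $w_0$ acts as the bias is an assumption of this sketch:

import numpy as np

# AND gate inputs, with a constant 1 prepended so that w[0] plays the role of w_0
X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]])
y = np.array([0, 0, 0, 1])

w, gamma = np.zeros(3), 0.1
for _ in range(20):                                     # repeat until correct or max steps
    for x_i, y_i in zip(X, y):
        y_hat = 1 if np.dot(w, x_i) >= 0 else 0         # step activation
        w = w - gamma * (y_hat - y_i) * x_i             # perceptron update
print(w)                                                # weights implementing AND
print([1 if np.dot(w, x_i) >= 0 else 0 for x_i in X])   # [0, 0, 0, 1]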
Logic Gates
True = 1, False = 0
- NOT: Inverts the input (output is 1 if the input is 0, and 0 if the input is 1).
- AND (&): Only true if both inputs are 1.
- OR: True if at least 1 input is 1.
- NOR: Only true if both inputs are 0.
- XNOR: True if inputs are the same (1,1) or (0,0) → 1.
- XOR: True if inputs differ (0,1) or (1,0) → 1.
Neural Networks Structure
- Single Layer: No hidden layers; can implement gates like AND, OR, NAND.
- Multi-Layer: Input → Hidden layers → Output (required for non-linear problems like XOR, XNOR).
- Forward Propagation: Data flows from input → hidden layers → output. An activation function is used at each layer (except for the output layer when performing regression).
- Weight Matrix: Rows = number of inputs, Columns = number of outputs for the next layer. Example: A 3×5 matrix means the next layer has 5 neurons/outputs.
Evaluating Mathematical Representations
Start with the inner parentheses first. E.g., $\text{NOT}(\text{AND}(x_1, x_2))$. If $x_1 = 0, x_2 = 1$, then $\text{NOT}(0) \rightarrow 1$.
Activation Functions
- Softmax: Used for multi-class classification (for $C$ classes). Softmax(dim=1) returns a list of probabilities $p_i$ (probability that the input row belongs to class $i$); each row sums to 1.
  - The index of the element corresponds to the class label.
  - Predicted class = output.argmax(dim=1).
- Sigmoid: Used for binary classification (Logistic Regression). Handles non-linear boundaries if input features are transformed.
- No Activation: Used for regression tasks ($g(x) = x$).
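A small PyTorch sketch of Softmax and argmax over toy class scores:

import torch

logits = torch.tensor([[2.0, 1.0, 0.1],
                       [0.5, 2.5, 0.2]])   # 2 samples, C = 3 classes

probs = torch.nn.Softmax(dim=1)(logits)
print(probs)                 # each row sums to 1
print(probs.sum(dim=1))      # tensor([1., 1.])
print(probs.argmax(dim=1))   # predicted class labels: tensor([0, 1])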
PyTorch Implementation
Import: import torch.nn as nn
Containers
- nn.Module: The base class a custom neural network class inherits from when it is defined.
- nn.Sequential: Directly defines the model by calling layers sequentially.
Linear and Activation Functions
- nn.Linear(input_size, output_size): Defines a single fully connected layer.
- Activation functions (e.g., nn.Sigmoid, nn.ReLU, nn.Softmax) are separate modules in PyTorch and must be applied explicitly after each layer that requires an activation.
Building a Neural Network using nn.Module
Example: NN with one hidden layer for REGRESSION (the last layer has no activation function).
class NNRegressor(torch.nn.Module):
    def __init__(self, input_size, output_size):
        super().__init__()
        self.linear1 = torch.nn.Linear(input_size, output_size)
        self.linear2 = torch.nn.Linear(output_size, 1)
        # output size of 1st layer = input size of 2nd layer
        self.activate = torch.nn.ReLU()  # or any other activation func

    def forward(self, x):
        x = self.linear1(x)
        x = self.activate(x)
        x = self.linear2(x)
        return x

model1 = NNRegressor(4, 8)  # 4 features (x_n), 8 neurons to map to initially
Building a Neural Network using nn.Sequential
Used to package a series of modules.
model2 = torch.nn.Sequential(
    torch.nn.Linear(5, 8),
    torch.nn.ReLU(),
    torch.nn.Linear(8, 1)
)
Training Loop (Gradient Descent)
- Loss Functions: torch.nn.CrossEntropyLoss (multi-class), torch.nn.BCELoss (binary), torch.nn.MSELoss (regression).
- Optimizer: optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)
# Training loop example
loss_fn = torch.nn.BCELoss()
for epoch in range(epochs):
    optimizer.zero_grad()
    y_pred = model(x)
    loss = loss_fn(y_pred, y)
    loss.backward()   # compute gradient
    optimizer.step()  # update weights
Convolutional Neural Networks (CNN)
Usually used for object detection in images (e.g., cancer cells, classifying image types).
- Fully Connected Layer: Every output neuron “views” all pixels of the input image to determine its output.
- Convolution: Allows the model to view local regions of an image.
Building a CNN
Similar to a standard NN, but each convolutional layer is typically followed by a pooling layer.
self.c1 = torch.nn.Conv2d(input_channels, output_channels, kernel_size, stride=1, padding=0)
# padding="valid" = padding=0
self.pool = torch.nn.MaxPool2d(kernel_size)
To transition to a FULLY CONNECTED LAYER (Conv2d → MaxPool → Linear), flatten inputs first.
Input size = channels $\times$ height $\times$ width.
torch.flatten(x, start_dim=1, end_dim=-1)  # x = batch of feature maps; keeps the batch dimension
Components
- W/H = N: Dimensions (Width/Height).
- K: Kernel dimension (for W/H respectively).
- Each feature map value is the sum of the element-wise multiplication of the kernel with the local region it covers.
- P: Padding.
- S: Stride. A stride of ‘n’ is inclusive of the current first element. E.g., if the top left starts from A, stride = 2 means the second region should start from C.
Pooling
Downsamples feature maps, reducing dimensions.
- Types: Average, Max, Sum.
- PyTorch Methods: torch.nn.MaxPool2d, torch.nn.AvgPool1d, torch.nn.AdaptiveMaxPool2d.
- Output Shape: (initial height or width) / pooling factor.
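The standard convolution output size is $\lfloor (N - K + 2P)/S \rfloor + 1$ per dimension, and pooling then divides by the pooling factor; a small shape check in PyTorch (the image size, channel counts, and kernel sizes are illustrative):

import torch

x = torch.randn(1, 3, 32, 32)                 # batch of 1 RGB image, 32x32

conv = torch.nn.Conv2d(3, 8, kernel_size=5, stride=1, padding=0)
pool = torch.nn.MaxPool2d(2)

h = conv(x)
print(h.shape)        # (1, 8, 28, 28): (32 - 5 + 2*0)/1 + 1 = 28
h = pool(h)
print(h.shape)        # (1, 8, 14, 14): 28 / 2 = 14

flat = torch.flatten(h, start_dim=1)          # keep batch dim, flatten the rest
print(flat.shape)     # (1, 8 * 14 * 14) -> input size of the fully connected layer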