Neural Network Architectures and Learning Concepts
Feedforward Neural Network (FNN)
A Feedforward Neural Network is the simplest type of artificial neural network in which information flows in only one direction, from the input layer to the output layer. There are no feedback connections or loops.
Basic Structure
- Consists of an input layer, one or more hidden layers, and an output layer.
- Data flows from input → hidden → output.
Single-layer Feedforward Network
Has only an input layer and an output layer. The inputs are connected directly to the outputs through weights, the output layer performs the main computation, and each output node produces one result.
Multilayer Feedforward Network (MLP)
Has one or more hidden layers between the input and output layers. It is more powerful and more commonly used than single-layer networks.
Working of MLP
Each neuron in a hidden layer computes the weighted sum of its inputs and applies an activation function (e.g., sigmoid, ReLU) to produce its output. The output of one layer becomes the input to the next layer.
Training (Learning Process)
The network is trained on data using a learning algorithm such as backpropagation. During training, the weights and biases are adjusted to reduce the error between the predicted output and the actual output, as sketched below.
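As a concrete illustration, here is a minimal sketch, assuming scikit-learn is installed, that trains a small feedforward network with backpropagation on a synthetic toy dataset (the layer size and hyperparameters are illustrative):

```python
# Minimal sketch: train a small feedforward network with backpropagation.
# Assumes scikit-learn; the dataset and layer size are illustrative.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# One hidden layer of 16 ReLU units; weights and biases are adjusted
# iteratively to reduce the error between predictions and targets.
model = MLPClassifier(hidden_layer_sizes=(16,), activation='relu',
                      max_iter=1000, random_state=0)
model.fit(X_train, y_train)
print("Test accuracy:", model.score(X_test, y_test))
```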
Characteristics
- Information moves only forward; no loops.
- Does not have memory of previous inputs.
Applications
Used for classification, regression (predicting continuous values), speech recognition, image recognition, etc.
Multilayer Perceptron (MLP)
A Multilayer Perceptron (MLP) is a type of feedforward neural network that contains one or more hidden layers between the input and output layers. It can learn complex non-linear relationships.
Architecture
- Input Layer: Receives the raw input features from the dataset (e.g., pixel values, sensor readings).
- Hidden Layers: One or more layers of neurons between input and output. Each neuron receives a weighted sum of inputs and passes it through an activation function.
- Output Layer: Produces the final output such as a class label or a numerical value.
Working
Each neuron computes a weighted sum of its inputs plus a bias term. This sum is passed through an activation function (such as sigmoid or ReLU) to produce the neuron's output. The outputs of one layer become the inputs to the next layer, so information flows forward through the network. During training, the network uses a learning algorithm such as backpropagation to adjust the weights and biases so that the error between the predicted output and the actual target is minimized.
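A minimal NumPy sketch of this forward computation, with random (untrained) weights and illustrative layer sizes:

```python
# Forward pass through an MLP with one hidden layer (illustrative sizes,
# random untrained weights).
import numpy as np

def relu(z):
    return np.maximum(0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.normal(size=3)                           # 3 input features

W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)    # hidden layer: 4 neurons
W2, b2 = rng.normal(size=(1, 4)), np.zeros(1)    # output layer: 1 neuron

h = relu(W1 @ x + b1)          # weighted sum + bias, then activation
y_hat = sigmoid(W2 @ h + b2)   # hidden output becomes input to the next layer
print(y_hat)
```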
Recurrent Neural Network (RNN)
A Recurrent Neural Network (RNN) is a type of neural network designed to handle sequential data. Unlike feedforward networks, RNNs have feedback connections, which give them a form of memory about previous inputs.
Feedback Concept
In an RNN, the output of a neuron at one time step can be fed back as input at the next time step. This creates a loop and allows the network to maintain an internal state that stores information about the past.
Single-layer Feedback Network
Consists of one main recurrent layer with feedback connections. At each time step, the output depends on both the current input and the previous hidden state.
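A minimal NumPy sketch of one recurrent step, where the new hidden state depends on the current input and the previous hidden state (sizes and weights are illustrative, not a trained model):

```python
# One recurrent step: h_t = tanh(W_x x_t + W_h h_{t-1} + b)
import numpy as np

rng = np.random.default_rng(0)
input_size, hidden_size = 4, 3
W_x = rng.normal(size=(hidden_size, input_size))   # input-to-hidden weights
W_h = rng.normal(size=(hidden_size, hidden_size))  # hidden-to-hidden (feedback) weights
b = np.zeros(hidden_size)

def rnn_step(x_t, h_prev):
    return np.tanh(W_x @ x_t + W_h @ h_prev + b)

h = np.zeros(hidden_size)                        # initial hidden state
for x_t in rng.normal(size=(5, input_size)):     # a sequence of 5 time steps
    h = rnn_step(x_t, h)                         # state carries information forward
print(h)
```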
Multilayer Feedback Network
Extends the idea to multiple hidden layers. Feedback can occur within the same layer and from deeper layers back to earlier layers. This allows for a richer and more complex memory system.
RNNs for Sequential Data
RNNs are suitable for tasks where order matters, such as time series, speech, text, and video.
Types of RNN Based on Input–Output Structure
- One-to-One: Single input, single output (e.g., image classification).
- One-to-Many: Single input, sequence output (e.g., image → image caption).
- Many-to-One: Sequence input, single output (e.g., sentiment analysis from a sentence or review).
- Many-to-Many: Sequence input, sequence output (e.g., machine translation, where a sentence in one language is translated into another sentence).
Applications
Used in language modelling, sentiment analysis, speech recognition, time-series prediction, and machine translation.
Radial Basis Function Network (RBFN)
A Radial Basis Function Network (RBFN) has three layers: Input, one Hidden Layer, and Output. The hidden layer units use radial basis functions that depend on the distance between the input and a center. Each hidden neuron responds strongly when the input is close to its center. The output layer computes a weighted sum of these responses. Used in pattern recognition, time-series prediction, and financial forecasting.
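A minimal NumPy sketch of an RBFN forward pass with Gaussian basis functions; the centers, width, and output weights are illustrative assumptions rather than a trained model:

```python
# RBFN forward pass: Gaussian hidden units plus a weighted-sum output.
import numpy as np

centers = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]])  # hidden-unit centers
width = 0.5                                               # spread of each RBF
w_out = np.array([0.3, -0.7, 1.2])                        # output weights

def rbfn(x):
    # Each hidden unit responds strongly when x is close to its center.
    dists = np.linalg.norm(centers - x, axis=1)
    phi = np.exp(-(dists ** 2) / (2 * width ** 2))
    # Output layer: weighted sum of the hidden responses.
    return w_out @ phi

print(rbfn(np.array([0.9, 1.1])))   # dominated by the second center
```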
Recursive Neural Network (RecNN)
A Recursive Neural Network (RecNN) is designed for structured input such as trees. It applies the same set of weights recursively over a tree structure. Unlike RNNs (which work on sequences), RecNNs work on hierarchical structures. Used in NLP, especially where sentence structure (parse tree) is important, e.g., sentiment analysis.
Long Short-Term Memory (LSTM)
Long Short-Term Memory (LSTM) networks are a special type of Recurrent Neural Network (RNN) designed to handle long-term dependencies and to overcome the vanishing gradient problem of standard RNNs.
Motivation
Standard RNNs have difficulty learning from long sequences due to vanishing and exploding gradients. LSTMs were designed to remember information for long periods of time by controlling the flow of information.
Basic Structure
An LSTM unit contains a cell state (long-term memory) and a hidden state (short-term output). It uses gates to control what to keep, what to forget, and what to output.
Forget Gate ($f_t$)
- Input: current input and previous hidden state.
- Output: a value between 0 and 1 for each element of the cell state.
- Function: decides what part of the old cell state to forget.
Input Gate ($i_t$)
- Has two parts: a sigmoid layer that decides which values to update, and a tanh layer that creates new candidate values.
- Together, they determine what new information to store in the cell state.
Updating Cell State
The old cell state is first multiplied by the forget gate output (to discard some information), and then the new candidate values, scaled by the input gate, are added. This produces the updated cell state: $C_t = f_t \cdot C_{t-1} + i_t \cdot \tilde{C}_t$.
Output Gate ($o_t$)
Decides what part of the cell state becomes the hidden state (output) for the current time step. The hidden state is used for prediction and passed to the next time step.
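A minimal NumPy sketch of a single LSTM step following the gate descriptions above (weight shapes and values are illustrative, not a trained model):

```python
# One LSTM step: forget gate, input gate, cell-state update, output gate.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
input_size, hidden_size = 4, 3
concat = input_size + hidden_size
# One weight matrix per gate, acting on the concatenation [h_{t-1}, x_t].
W_f, W_i, W_c, W_o = (rng.normal(size=(hidden_size, concat)) for _ in range(4))
b_f = b_i = b_c = b_o = np.zeros(hidden_size)

def lstm_step(x_t, h_prev, c_prev):
    z = np.concatenate([h_prev, x_t])
    f_t = sigmoid(W_f @ z + b_f)          # forget gate: what to keep from c_prev
    i_t = sigmoid(W_i @ z + b_i)          # input gate: which values to update
    c_tilde = np.tanh(W_c @ z + b_c)      # candidate values
    c_t = f_t * c_prev + i_t * c_tilde    # updated cell state
    o_t = sigmoid(W_o @ z + b_o)          # output gate
    h_t = o_t * np.tanh(c_t)              # new hidden state (output)
    return h_t, c_t

h, c = np.zeros(hidden_size), np.zeros(hidden_size)
h, c = lstm_step(rng.normal(size=input_size), h, c)
print(h)
```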
Advantages
- LSTMs can preserve long-term information and selectively forget irrelevant information.
- They reduce the effect of vanishing gradients and allow training on long sequences.
Applications
Used for language modelling, machine translation, speech recognition, time-series forecasting, etc.
Model Evaluation Metrics
- Accuracy: Out of all predictions, how many were correct? $\frac{TP+TN}{TP+TN+FP+FN}$
- Precision: When the model says Fraud, how often is it right? $\frac{TP}{TP+FP}$
- Recall: Out of all real frauds, how many did the model catch? $\frac{TP}{TP+FN}$
- Specificity: Out of all genuinely not-fraud cases, how many did the model correctly identify as not fraud? $\frac{TN}{TN+FP}$
- F1 Score: Balance between precision and recall. When you want both to be good. $2 \times \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$
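A minimal sketch that plugs assumed confusion-matrix counts (purely illustrative numbers) into the formulas above:

```python
# Illustrative confusion-matrix counts for a fraud-detection example.
TP, TN, FP, FN = 40, 900, 10, 50

accuracy    = (TP + TN) / (TP + TN + FP + FN)
precision   = TP / (TP + FP)
recall      = TP / (TP + FN)
specificity = TN / (TN + FP)
f1 = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall, specificity, f1)
```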
Convolutional Neural Network (CNN)
A Convolutional Neural Network (CNN), also called a ConvNet, is a feedforward deep learning model mainly used for image processing and computer vision tasks such as image classification, object detection, and recognition. It processes input data that has a grid-like topology, for example a 2D image represented as a matrix of pixel values. A CNN is designed to automatically and adaptively learn spatial hierarchies of features from images, using layers such as convolution, ReLU, pooling, flattening, and fully connected layers.
Input Representation
Each image is represented as a 2D (grayscale) or 3D (color) matrix of pixel values. This matrix is given as input to the CNN.
Convolution Layer
Uses small matrices called filters/kernels (e.g., 3×3). The filter is slid over the input image (with a step size called stride). At each position, the dot product of filter and image patch is computed to produce a feature map. Purpose: to extract low-level features like edges, lines, textures.
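A minimal NumPy sketch of a valid 2D convolution with stride 1, using an illustrative vertical-edge filter:

```python
# Slide a 3x3 filter over the image and take the dot product at each position.
import numpy as np

def conv2d(image, kernel, stride=1):
    kh, kw = kernel.shape
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    feature_map = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i * stride:i * stride + kh, j * stride:j * stride + kw]
            feature_map[i, j] = np.sum(patch * kernel)  # dot product of filter and patch
    return feature_map

image = np.arange(36, dtype=float).reshape(6, 6)
kernel = np.array([[1, 0, -1], [1, 0, -1], [1, 0, -1]], dtype=float)  # vertical edges
print(conv2d(image, kernel))
```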
ReLU Layer (Activation)
ReLU stands for Rectified Linear Unit. It applies the function $f(x) = \max(0, x)$ element-wise to the feature map, converting negative values to 0 and introducing non-linearity.
Pooling Layer
Performs down-sampling of feature maps. Reduces spatial dimensions, number of parameters, and computational cost. Common pooling operations:
- Max pooling – selects maximum value in a region.
- Average pooling – takes average value in a region.
This helps make the model more invariant to small translations in the input image.
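A minimal NumPy sketch of 2x2 max pooling with stride 2 over a small feature map (the values are illustrative):

```python
# Down-sample a feature map by keeping the maximum value in each 2x2 region.
import numpy as np

def max_pool(feature_map, size=2, stride=2):
    out_h = (feature_map.shape[0] - size) // stride + 1
    out_w = (feature_map.shape[1] - size) // stride + 1
    pooled = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            region = feature_map[i * stride:i * stride + size,
                                 j * stride:j * stride + size]
            pooled[i, j] = region.max()   # keep the strongest activation
    return pooled

fmap = np.array([[1, 3, 2, 0],
                 [4, 6, 1, 2],
                 [0, 2, 5, 7],
                 [1, 1, 3, 4]], dtype=float)
print(max_pool(fmap))   # 2x2 output: [[6, 2], [2, 7]]
```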
Stacking of Convolution + ReLU + Pooling Layers
Several such layers are stacked to learn high-level features. Early layers learn edges, later layers learn shapes, parts, and finally objects.
Flattening Layer
Converts the final pooled feature maps into a single long vector. This vector is then passed to the fully connected layers.
Fully Connected (FC) Layer
Works like a traditional neural network. Every neuron is connected to all outputs of the previous layer. Learns the non-linear combination of high-level features to perform classification.
Output Layer
Typically uses Softmax activation for multi-class classification. Produces a probability distribution over possible classes (e.g., cat, dog, bird).
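A minimal NumPy sketch of softmax turning raw class scores into a probability distribution (the scores are illustrative):

```python
# Softmax: exponentiate the scores and normalize so they sum to 1.
import numpy as np

def softmax(logits):
    exp = np.exp(logits - np.max(logits))  # subtract max for numerical stability
    return exp / exp.sum()

scores = np.array([2.0, 1.0, 0.1])         # e.g., cat, dog, bird
print(softmax(scores))                      # probabilities summing to 1
```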
Applications
- Image and video classification
- Object detection and tracking
- Face recognition
- Medical image analysis, etc.
Ensemble Learning Methods
Ensemble learning combines multiple weak models to build a strong predictive model. Random Forest uses bagging, while AdaBoost uses boosting. Both improve accuracy but work differently.
1. Random Forest
- Based on Bagging (Bootstrap Aggregating).
- Creates many datasets using sampling with replacement.
- Builds many decision trees in parallel.
- Uses random feature selection at each split (feature bagging).
- Final result: majority vote (classification) or average (regression).
- Reduces variance and prevents overfitting. Works well for noisy datasets.
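A minimal scikit-learn sketch of the Random Forest described above, on a synthetic toy dataset (assumes scikit-learn is installed):

```python
# Bagging of decision trees with random feature selection at each split.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 100 trees built in parallel on bootstrapped samples; prediction is a majority vote.
rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X_train, y_train)
print("Test accuracy:", rf.score(X_test, y_test))
```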
2. AdaBoost
- Based on Boosting.
- Builds weak learners sequentially.
- Each new learner focuses on misclassified examples.
- Assigns weights to data points and alpha values to weak learners.
- Final prediction is a weighted vote.
- Reduces bias by improving step-by-step. Sensitive to noisy data and outliers.
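A minimal scikit-learn sketch of AdaBoost; by default its weak learners are depth-1 decision stumps (assumes scikit-learn; toy dataset):

```python
# Boosting: weak learners are added sequentially, each focusing on the
# examples the previous learners misclassified.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

ada = AdaBoostClassifier(n_estimators=50, random_state=0)
ada.fit(X_train, y_train)
print("Test accuracy:", ada.score(X_test, y_test))
```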
3. Key Differences
- Random Forest → Parallel, reduces variance, uses full trees.
- AdaBoost → Sequential, reduces bias, uses stumps.
Random Forest is more robust; AdaBoost is more sensitive but often very accurate.
Ensemble Methods Fundamentals
Ensemble methods are machine learning techniques that combine predictions from multiple models to form a more accurate and robust final model. They work on the idea of “wisdom of the crowd,” where multiple weak learners together form a strong learner.
1. Weak and Strong Learners
- Weak Learner: Performs slightly better than random guessing.
- Strong Learner: Formed by combining many weak learners.
2. Why Ensemble Methods Work
- Reduce bias and variance.
- Different models make different errors.
- Combining many predictions cancels individual errors.
3. Types of Ensemble Methods
A. Bagging (Bootstrap Aggregating)
- Goal: Reduce variance.
- Trains multiple models on different bootstrapped samples. Models are trained in parallel.
- Final prediction: Majority vote or average.
- Example: Random Forest.
B. Boosting
- Goal: Reduce bias.
- Models are trained sequentially. Each new model focuses on the examples misclassified by previous models by increasing their weights.
- Final prediction uses weighted voting.
- Examples: AdaBoost, Gradient Boosting, XGBoost.
C. Stacking
- Goal: Combine strengths of different models.
- Trains multiple diverse models on the same data.
- A final meta-model learns how to best combine their outputs.
- Provides high predictive accuracy.
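A minimal scikit-learn sketch of stacking, where a logistic-regression meta-model combines two diverse base models (assumes scikit-learn; the dataset and model choices are illustrative):

```python
# Stacking: base models are trained on the data, and a meta-model learns
# how to best combine their outputs.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, random_state=0)

base_models = [('rf', RandomForestClassifier(n_estimators=50, random_state=0)),
               ('svm', SVC(probability=True, random_state=0))]
stack = StackingClassifier(estimators=base_models,
                           final_estimator=LogisticRegression())
stack.fit(X, y)
print("Training accuracy:", stack.score(X, y))
```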
Handling Class Imbalance
Class imbalance occurs when one class heavily outnumbers another. In such cases, standard classifiers become biased toward the majority class, making accuracy an unreliable metric. Special techniques are needed to correctly learn and detect the minority class.
1. Use Appropriate Evaluation Metrics
Accuracy is misleading for imbalanced data. Use precision, recall, F1-score, and PR-AUC, which focus on minority-class performance. Use the confusion matrix to examine true positives and false negatives.
2. Data Resampling Methods
- a. Oversampling: Increase minority class samples by duplication.
- b. SMOTE (Synthetic Minority Over-sampling Technique): Generates synthetic minority samples by interpolating between neighboring minority examples (see the sketch after this list).
- c. Undersampling: Reduces majority class size. Faster but may lose information.
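A minimal sketch of oversampling with SMOTE, assuming the imbalanced-learn package is installed (the imbalanced toy dataset is illustrative):

```python
# SMOTE creates synthetic minority samples to balance the class counts.
from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, weights=[0.95, 0.05],
                           random_state=0)
print("Before:", Counter(y))          # heavily skewed toward the majority class

X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print("After:", Counter(y_res))       # synthetic minority samples balance the classes
```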
3. Algorithm-Level Solutions
- a. Cost-Sensitive Learning: Assign a higher misclassification cost to the minority class, e.g., by using parameters like class_weight='balanced' (see the sketch below).
- b. Algorithms Robust to Imbalance: Boosting methods (AdaBoost, XGBoost) give more focus to misclassified data; tree-based models (Random Forest) capture minority patterns effectively.
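A minimal scikit-learn sketch of cost-sensitive learning with class_weight='balanced' on an illustrative imbalanced toy dataset:

```python
# class_weight='balanced' raises the misclassification cost of the rare class.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y,
                                                    random_state=0)

clf = RandomForestClassifier(class_weight='balanced', random_state=0)
clf.fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))
```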
4. Example
In fraud detection (99.5% normal, 0.5% fraud): Use F1-score and PR-AUC. Apply SMOTE to create synthetic fraud samples. Train Random Forest/XGBoost with balanced class weights. Evaluate using precision–recall thresholding.
Bias–Variance Tradeoff
The Bias–Variance Tradeoff is a key concept for improving machine learning models. It explains why a model performs poorly and guides the correct strategy to fix it. By analyzing whether a model has high bias or high variance, we can take targeted actions to improve its generalization.
1. Diagnosing High Bias (Underfitting)
- High error on both training & test data.
- Model too simple to learn patterns.
- Example: Linear Regression used for a non-linear problem.
Fixes for High Bias
- Use more complex models.
- Add relevant features.
- Reduce regularization strength.
- Use Boosting (AdaBoost) to reduce bias.
2. Diagnosing High Variance (Overfitting)
- Low training error but high test error.
- Model learns noise instead of pattern.
- Example: Very deep decision tree.
Fixes for High Variance
- Use simpler models.
- Collect more training data.
- Apply feature selection.
- Use regularization (L1/L2).
- Use Bagging (Random Forest) to reduce variance.
- Prune decision trees.
3. Why This Tradeoff Helps
- Helps identify exact reason for poor performance.
- Prevents random trial and error.
- Allows targeted optimization strategies.
- Leads to better generalization and efficient model tuning.
