Deep Learning Architectures: CNNs, RNNs, and GANs
1. Pooling Layers in CNNs
A pooling layer is a down-sampling layer in a Convolutional Neural Network (CNN), usually placed after a convolutional layer and its activation. It reduces the spatial dimensions (width × height) of the input feature maps, leaving the number of channels unchanged, while retaining the most salient information.
Types of Pooling Layers
- Max Pooling: Extracts the maximum value from each region covered by the sliding window. Purpose: Keeps the strongest activations, such as responses to sharp edges.
- Average Pooling: Computes the average (mean) of all values covered by the sliding window. Purpose: Retains smooth background information and provides a generalized view.
- Global Pooling: Reduces each entire feature map (H × W) to a single value (1 × 1) per channel, using either the maximum or the average. Purpose: Used before the final output layer in place of flattening followed by large fully connected layers.
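The three pooling types above can be sketched in a few lines of NumPy. This is a minimal illustration, not a framework implementation: the function names `pool2d` and `global_average_pool`, the 2 × 2 window, and the stride of 2 are assumptions chosen for the example.

```python
import numpy as np

def pool2d(feature_map, size=2, stride=2, mode="max"):
    """Apply max or average pooling to a single 2D feature map."""
    h, w = feature_map.shape
    out_h = (h - size) // stride + 1
    out_w = (w - size) // stride + 1
    out = np.zeros((out_h, out_w))
    reduce_fn = np.max if mode == "max" else np.mean
    for i in range(out_h):
        for j in range(out_w):
            # Slice the region currently covered by the sliding window.
            window = feature_map[i * stride:i * stride + size,
                                 j * stride:j * stride + size]
            out[i, j] = reduce_fn(window)
    return out

def global_average_pool(feature_maps):
    """Collapse an (H, W, C) tensor to one value per channel, shape (C,)."""
    return feature_maps.mean(axis=(0, 1))

x = np.array([[1, 3, 2, 4],
              [5, 6, 1, 2],
              [7, 2, 9, 1],
              [3, 4, 5, 6]], dtype=float)

max_pooled = pool2d(x, mode="max")   # [[6, 4], [7, 9]]
avg_pooled = pool2d(x, mode="avg")   # [[3.75, 2.25], [4.0, 5.25]]

# Two channels stacked along the last axis give two global averages.
global_pooled = global_average_pool(np.stack([x, 2 * x], axis=-1))
```

Note that none of these functions holds any weights, which is why pooling adds no trainable parameters.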
Key Features
- Zero Trainable Parameters: Relies on fixed mathematical functions; keeps the network lightweight.
- Dimensionality Reduction: Reduces computational cost (FLOPs) and memory usage.
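- Translation Invariance: Makes the network approximately invariant to small shifts or distortions, because a feature that moves within a pooling window produces the same output.
- Overfitting Control: Acts as a mild form of regularization by reducing the number of activations passed on, which also reduces the parameter count of later fully connected layers.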
2. Applications of CNNs
- Image Classification: Categorizing images into specific labels (e.g., medical diagnosis).
- Object Detection: Localizing and classifying multiple objects in one image (e.g., with detectors such as YOLO).
- Semantic Segmentation: Classifying every pixel in an image.
- Facial Recognition: Verifying identity using biometric features.
- Medical Image Analysis: Detecting anomalies in CT scans and X-rays.
3. Core Working Principle of CNNs
Fully connected neural networks flatten inputs into 1D vectors, discarding spatial relationships between neighboring pixels. CNNs preserve spatial structure (H × W × C) by sliding a small filter over the input (convolution) and reusing the same weights at every position to detect local features.
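As a rough sketch of the sliding-window idea, the function below applies one small filter to every position of a 2D image. The name `conv2d`, the toy image, and the hand-picked edge filter are assumptions for illustration; in a real CNN the filter values are learned. Like most deep learning libraries, it computes cross-correlation (the filter is not flipped).

```python
import numpy as np

def conv2d(image, kernel):
    """Valid (no padding), stride-1 cross-correlation of a 2D image with a 2D kernel."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i:i + kh, j:j + kw]
            # The same kernel weights are reused at every position (weight sharing).
            out[i, j] = np.sum(patch * kernel)
    return out

# Toy image: dark left half, bright right half.
image = np.array([[0, 0, 1, 1],
                  [0, 0, 1, 1],
                  [0, 0, 1, 1],
                  [0, 0, 1, 1]], dtype=float)

# Hypothetical filter that responds to a dark-to-bright vertical edge.
vertical_edge = np.array([[-1, 1],
                          [-1, 1]], dtype=float)

feature_map = conv2d(image, vertical_edge)
# Each row is [0, 2, 0]: the filter fires only where the edge is.
```

The filter has 4 weights regardless of image size, whereas a fully connected layer on the flattened 16-pixel input would need 16 weights per output neuron.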
Standard CNN Architecture
- Input Layer: Receives raw image data (e.g., 224 × 224 × 3).
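- Convolutional Layer: Slides learned filters over the input; at each position it multiplies the filter element-wise with the local patch and sums the result (a dot product) to produce feature maps.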
- Activation Function (ReLU): Introduces non-linearity.
- Pooling Layer: Performs down-sampling.
- Flattening Layer: Converts 3D feature maps into a 1D vector.
- Fully Connected Layer: Connects neurons to perform final classification.
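The layer sequence above can be traced end to end with a small NumPy forward pass. This is a sketch under stated assumptions: the weights are random and untrained, the input is a tiny 8 × 8 × 3 array rather than 224 × 224 × 3, and all function names, filter counts, and the 10-class output are hypothetical choices made to keep the shapes easy to follow.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv_layer(x, filters):
    """x: (H, W, C); filters: (K, K, C, F). Valid convolution, stride 1."""
    k, _, _, f = filters.shape
    out_h, out_w = x.shape[0] - k + 1, x.shape[1] - k + 1
    out = np.zeros((out_h, out_w, f))
    for i in range(out_h):
        for j in range(out_w):
            patch = x[i:i + k, j:j + k, :]                    # (K, K, C)
            # Sum over K, K, C for each of the F filters.
            out[i, j, :] = np.tensordot(patch, filters, axes=3)
    return out

def relu(x):
    return np.maximum(0, x)

def max_pool(x, size=2):
    """Non-overlapping max pooling; assumes H and W divide evenly by size."""
    h, w, c = x.shape
    return x.reshape(h // size, size, w // size, size, c).max(axis=(1, 3))

def softmax(z):
    e = np.exp(z - z.max())   # subtract the max for numerical stability
    return e / e.sum()

image = rng.random((8, 8, 3))                      # Input layer
filters = rng.standard_normal((3, 3, 3, 4))        # four 3x3 filters
fc_weights = rng.standard_normal((3 * 3 * 4, 10))  # 10 hypothetical classes
fc_bias = np.zeros(10)

features = conv_layer(image, filters)   # Convolutional layer -> (6, 6, 4)
activated = relu(features)              # Activation          -> (6, 6, 4)
pooled = max_pool(activated)            # Pooling             -> (3, 3, 4)
flat = pooled.reshape(-1)               # Flattening          -> (36,)
probs = softmax(flat @ fc_weights + fc_bias)  # Fully connected -> (10,)
```

Because the weights are random, `probs` carries no meaning here; the point is how the tensor shape changes at each stage.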
4. ReLU and Dropout Layers
ReLU (Rectified Linear Unit)
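ReLU is a non-linear activation function defined as f(x) = max(0, x). It introduces non-linearity, mitigates the vanishing gradient problem because its gradient is 1 for positive inputs, and produces sparse activations because negative inputs map to zero. Its main limitation is the “Dying ReLU” problem, in which a neuron whose inputs stay negative outputs zero, receives zero gradient, and stops learning. This is often mitigated by using Leaky ReLU, which keeps a small slope for negative inputs.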
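A short sketch of ReLU, Leaky ReLU, and their gradients makes the difference concrete. The slope `alpha=0.01` is a commonly used default but is an assumption here, as are the function names; the gradient at exactly 0 is taken from the negative side by convention in this sketch.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def relu_grad(x):
    # Gradient is 1 for positive inputs and 0 otherwise.
    return (x > 0).astype(float)

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)

def leaky_relu_grad(x, alpha=0.01):
    # Negative inputs still receive a small gradient, so the unit can recover.
    return np.where(x > 0, 1.0, alpha)

x = np.array([-2.0, -0.5, 0.0, 1.5])

relu_out = relu(x)                # [0, 0, 0, 1.5]
relu_g = relu_grad(x)             # [0, 0, 0, 1]
leaky_out = leaky_relu(x)         # [-0.02, -0.005, 0, 1.5]
leaky_g = leaky_relu_grad(x)      # [0.01, 0.01, 0.01, 1]
```

The zeros in `relu_g` are the source of the dying-ReLU problem: no gradient flows through those units for these inputs.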
Dropout Layer
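Dropout is a regularization technique that randomly deactivates a fraction of neurons (set by the dropout rate) at each training step to prevent overfitting. During testing, all neurons are active. To keep expected activations consistent between the two phases, the original formulation scales the weights down by the keep probability at test time, while most modern frameworks use inverted dropout, which instead scales the surviving activations up by 1 / (keep probability) during training and leaves the test-time network unchanged.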
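The following is a minimal sketch of inverted dropout, the variant that rescales during training. The function name `dropout` and its parameters are assumptions for this example, not a specific library's API.

```python
import numpy as np

def dropout(x, rate, training, rng):
    """Inverted dropout: zero a fraction `rate` of units and rescale the rest."""
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    if not training or rate == 0.0:
        return x                      # at test time every unit is active
    keep_prob = 1.0 - rate
    mask = rng.random(x.shape) < keep_prob
    # Dividing by keep_prob keeps the expected activation unchanged.
    return x * mask / keep_prob

rng = np.random.default_rng(0)
activations = np.ones(10_000)

train_out = dropout(activations, rate=0.5, training=True, rng=rng)
eval_out = dropout(activations, rate=0.5, training=False, rng=rng)
```

With a rate of 0.5, each value in `train_out` is either 0 or 2, so its mean stays close to the original mean of 1, while `eval_out` is returned untouched.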
5. Padding and Strided Convolution
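- Padding: Adding pixels (usually zeros) around the border to preserve edge information and control output size. Valid padding adds none, so the output shrinks; Same padding keeps the output the same size as the input at stride 1; Full padding adds enough that every input pixel is visited by every filter position, so the output grows.
- Strided Convolution: Moving the filter by a fixed number of pixels (the stride) at each step. A stride greater than 1 skips positions, which reduces computation and down-samples the output.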
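The combined effect of padding and stride on output size follows the standard relation out = floor((n + 2p - k) / s) + 1, where n is the input size, k the filter size, p the padding per side, and s the stride. The helper names below are hypothetical, and the per-side padding values for "same" assume stride 1 and an odd filter size.

```python
def conv_output_size(n, k, stride=1, padding=0):
    """Output length along one spatial dimension."""
    return (n + 2 * padding - k) // stride + 1

def padding_amount(k, mode):
    """Per-side padding for a filter of size k (stride 1, odd k assumed for 'same')."""
    if mode == "valid":
        return 0
    if mode == "same":
        return (k - 1) // 2
    if mode == "full":
        return k - 1
    raise ValueError(f"unknown padding mode: {mode}")

n, k = 224, 3
valid_size = conv_output_size(n, k, padding=padding_amount(k, "valid"))  # 222
same_size = conv_output_size(n, k, padding=padding_amount(k, "same"))    # 224
full_size = conv_output_size(n, k, padding=padding_amount(k, "full"))    # 226

# Stride 2 with one pixel of padding halves the spatial size.
strided_size = conv_output_size(n, k, stride=2, padding=1)               # 112
```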
6. Recurrent Neural Networks (RNNs)
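RNNs are designed to process sequential data (text, speech, time-series) by maintaining a hidden state (memory) that is updated at every time step. By input-output configuration, they are classified as One-to-One, One-to-Many, Many-to-One, or Many-to-Many. Common architectural variants include Bidirectional RNNs, LSTMs, and GRUs.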
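The hidden-state update can be sketched as a loop over time steps. This assumes the common vanilla (Elman-style) formulation h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h); the weight names, sizes, and random values are illustrative assumptions.

```python
import numpy as np

def rnn_forward(inputs, W_xh, W_hh, b_h):
    """Run a vanilla RNN over a sequence; returns the hidden state at every step."""
    h = np.zeros(W_hh.shape[0])          # initial hidden state (memory)
    states = []
    for x_t in inputs:
        # The new state depends on the current input and the previous state.
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(0)
input_size, hidden_size, seq_len = 3, 4, 5

W_xh = rng.standard_normal((hidden_size, input_size)) * 0.1
W_hh = rng.standard_normal((hidden_size, hidden_size)) * 0.1
b_h = np.zeros(hidden_size)
inputs = rng.standard_normal((seq_len, input_size))

states = rnn_forward(inputs, W_xh, W_hh, b_h)   # shape (5, 4)
last_state = states[-1]                         # what a Many-to-One model would use
```

A Many-to-Many model would read an output from every row of `states`, while a Many-to-One model (for example, sentiment classification) uses only `last_state`.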
7. Seq2Seq Models
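The Encoder-Decoder architecture converts an input sequence into an output sequence, which may differ in length. The encoder compresses the input into a fixed-length context vector, and the decoder generates the output one step at a time from that vector. It is widely used for machine translation, text summarization, and chatbots.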
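A toy encoder-decoder can illustrate the data flow. Everything here is an assumption made for the sketch: the weights are random and untrained (so the generated token IDs are meaningless), the vocabulary has 6 made-up token IDs, token 0 is treated as a start symbol, and decoding is greedy.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, hidden_size = 6, 8

def init(shape):
    return rng.standard_normal(shape) * 0.1

# Separate (untrained) weights for the encoder and the decoder.
enc_W_xh, enc_W_hh = init((hidden_size, vocab_size)), init((hidden_size, hidden_size))
dec_W_xh, dec_W_hh = init((hidden_size, vocab_size)), init((hidden_size, hidden_size))
dec_W_hy = init((vocab_size, hidden_size))

def one_hot(token_id):
    v = np.zeros(vocab_size)
    v[token_id] = 1.0
    return v

def encode(token_ids):
    """Read the whole input; the final hidden state is the context vector."""
    h = np.zeros(hidden_size)
    for t in token_ids:
        h = np.tanh(enc_W_xh @ one_hot(t) + enc_W_hh @ h)
    return h

def decode(context, start_id, max_len):
    """Generate tokens greedily, feeding each prediction back as the next input."""
    h, token, output = context, start_id, []
    for _ in range(max_len):
        h = np.tanh(dec_W_xh @ one_hot(token) + dec_W_hh @ h)
        token = int(np.argmax(dec_W_hy @ h))
        output.append(token)
    return output

context = encode([1, 4, 2, 5])                     # 4 input tokens -> one vector
generated = decode(context, start_id=0, max_len=3)  # 3 output tokens
```

The input has four tokens and the output three, showing that the two lengths are decoupled; the only link between them is `context`.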
8. Advanced Architectures
- LSTM: Uses cell states and gates (Forget, Input, Output) to learn long-term dependencies.
- Bi-LSTM: Processes sequences in both forward and backward directions.
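- GANs: Consist of a Generator, which creates synthetic data, and a Discriminator, which tries to tell it apart from real data. The two are trained in competition until the generated data looks realistic.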
- Deep Belief Networks (DBN): Stacks Restricted Boltzmann Machines (RBMs) for unsupervised feature learning.
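To make the LSTM entry above concrete, here is a sketch of a single LSTM time step showing how the Forget, Input, and Output gates act on the cell state. It follows the standard formulation; packing all four weight blocks into one matrix `W` and the gate order used to slice it are layout assumptions for this example, as are the names and sizes.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step.

    W: (4 * hidden, input + hidden), b: (4 * hidden,).
    Assumed block order: forget, input, output, candidate.
    """
    hidden = h_prev.shape[0]
    z = W @ np.concatenate([x_t, h_prev]) + b
    f = sigmoid(z[0:hidden])                 # forget gate: what to erase from the cell
    i = sigmoid(z[hidden:2 * hidden])        # input gate: how much new content to write
    o = sigmoid(z[2 * hidden:3 * hidden])    # output gate: how much of the cell to expose
    g = np.tanh(z[3 * hidden:])              # candidate cell content
    c_t = f * c_prev + i * g                 # additive cell-state update
    h_t = o * np.tanh(c_t)
    return h_t, c_t

rng = np.random.default_rng(0)
input_size, hidden_size = 3, 4
W = rng.standard_normal((4 * hidden_size, input_size + hidden_size)) * 0.1
b = np.zeros(4 * hidden_size)

h, c = np.zeros(hidden_size), np.zeros(hidden_size)
for x_t in rng.standard_normal((5, input_size)):   # run over a 5-step sequence
    h, c = lstm_step(x_t, h, c, W, b)
```

The additive update of `c_t` is what lets information and gradients persist over many steps. A Bi-LSTM would run a second, independent pass over the reversed sequence and combine the two hidden states at each step.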
