Computer Vision Concepts: Image Processing, Transforms, and Models
Q1: Image Representation and Processing
In computer vision, image representation is the method of converting a real-world scene into a digital format that a computer can understand and process. A digital image is represented as a two-dimensional function f(x, y), where x and y denote spatial coordinates and f represents the intensity or channel values at that location. In grayscale images, each pixel stores a single intensity value; in color images, each pixel is represented using multiple channels such as RGB, HSV, or YCbCr. The choice of color model depends on the application, as some models separate intensity from color information and can simplify processing. Image representation also considers spatial resolution, intensity resolution, and sampling, which together determine image quality and level of detail.
Image processing refers to the set of operations applied to images to enhance their quality or extract meaningful information. These operations include noise reduction, contrast enhancement, image filtering, edge detection, segmentation, and feature extraction. In computer vision systems, image processing acts as a crucial preprocessing step that prepares images for higher-level tasks such as object detection, recognition, tracking, and scene understanding. The main objective is not only to improve visual appearance but also to transform raw image data into a form suitable for analysis, interpretation, and decision-making by machines.
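As a small illustration of these ideas, the following sketch (assuming NumPy is available) builds a grayscale and an RGB array, inspects their spatial and intensity resolution, and separates luminance from color using the standard BT.601 weights:

```python
import numpy as np

# A grayscale image: one intensity value per pixel (the 2-D function f(x, y)).
gray = np.zeros((240, 320), dtype=np.uint8)      # 240 rows x 320 columns, 8-bit depth

# A color image: three channels per pixel (here ordered R, G, B).
rgb = np.random.randint(0, 256, size=(240, 320, 3), dtype=np.uint8)

# Spatial resolution is the pixel grid; intensity resolution is the bit depth.
print(gray.shape, gray.dtype)   # (240, 320) uint8 -> 256 gray levels
print(rgb.shape)                # (240, 320, 3)

# Converting RGB to a single luminance channel (ITU-R BT.601 weights)
# separates intensity from color information, as described above.
luma = (0.299 * rgb[..., 0] + 0.587 * rgb[..., 1] + 0.114 * rgb[..., 2]).astype(np.uint8)
print(luma.shape)               # (240, 320)
```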
Q2: Walsh–Hadamard Transform and Applications
The Walsh–Hadamard Transform (WHT) is an orthogonal transform used to represent an image or signal in the sequency domain. Unlike Fourier transforms, which use sine and cosine basis functions, the Walsh–Hadamard Transform uses square-wave basis functions whose values are only +1 and −1. Because every basis value is ±1, the transform requires only additions and subtractions, making it computationally efficient and suitable for fast image processing applications. The transform decomposes an image into components of increasing sequency (the number of sign changes in each basis function), the square-wave analogue of spatial frequency.
In computer vision, WHT is used for image compression, pattern recognition, feature extraction, and noise removal. Due to its simplicity and low computational complexity, it is suitable for real-time systems and hardware implementations. The transform has good energy compaction properties, meaning significant image information is concentrated in fewer coefficients; this reduces storage requirements and transmission bandwidth. WHT is also used in image watermarking and texture analysis where fast transformation is essential.
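The sketch below (assuming NumPy and SciPy) applies a 2-D WHT to a toy 8×8 block using the ±1 Hadamard matrix, verifies that the transform is invertible, and checks its energy compaction:

```python
import numpy as np
from scipy.linalg import hadamard   # natural-ordered Hadamard matrix (Sylvester construction)

N = 8
H = hadamard(N)                     # entries are +1 / -1, so no multiplications are needed

# A toy 8x8 image block: a smooth gradient plus a constant offset.
block = np.add.outer(np.arange(N), np.arange(N)).astype(float) + 100.0

# 2-D Walsh-Hadamard transform and its inverse (H is symmetric and H @ H = N * I).
wht = H @ block @ H / N
recon = H @ wht @ H / N
assert np.allclose(recon, block)

# Energy compaction: most of the energy sits in a handful of coefficients.
energy = wht ** 2
print("fraction of energy in the largest 4 coefficients:",
      np.sort(energy, axis=None)[-4:].sum() / energy.sum())
```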
Q3: Differences Between DFT and DCT
| DFT (Discrete Fourier Transform) | DCT (Discrete Cosine Transform) |
|---|---|
| Represents signals using both sine and cosine functions. | Represents signals using only cosine functions. |
| Produces complex-valued coefficients (real and imaginary parts). | Produces only real-valued coefficients. |
| Requires complex arithmetic, so it is typically more expensive per coefficient. | Uses only real arithmetic and is often cheaper in practice. |
| Suffers from spectral leakage due to signal discontinuities. | Reduces spectral leakage due to even symmetry. |
| Energy tends to be spread across many coefficients. | Strong energy compaction; energy concentrated in low frequencies. |
| Less suitable for image compression. | Widely used in image compression standards such as JPEG. |
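A quick way to see the energy-compaction difference is to transform the same smooth signal with both transforms and compare how much energy the largest coefficients carry; the sketch below assumes NumPy and SciPy:

```python
import numpy as np
from scipy.fft import dct

# A smooth ramp: its two ends differ, which is exactly the case where the
# DFT's implicit periodic extension creates a discontinuity (spectral leakage).
x = np.linspace(0.0, 1.0, 64)

X_dft = np.fft.fft(x)                 # complex coefficients
X_dct = dct(x, norm='ortho')          # real coefficients, even-symmetric extension

def energy_in_top_k(coeffs, k):
    e = np.abs(coeffs) ** 2
    return np.sort(e)[-k:].sum() / e.sum()

# The DCT packs far more of the energy into a few coefficients,
# which is why JPEG builds on it.
print("DFT, top 4 coefficients:", energy_in_top_k(X_dft, 4))
print("DCT, top 4 coefficients:", energy_in_top_k(X_dct, 4))
```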
Q4: Prewitt, Sobel, and Canny Edge Detection
Edge detection is a key operation in computer vision used to identify object boundaries and structural information in an image. Prewitt and Sobel operators are first-order derivative-based methods that compute image gradients in horizontal and vertical directions. The Prewitt operator uses simple convolution masks to detect edges, making it easy to implement but sensitive to noise. It provides a rough estimation of edges and is mainly used in basic applications.
The Sobel operator improves upon Prewitt by assigning higher weight to central pixels in the convolution mask. This makes Sobel more robust to noise and results in smoother, more accurate edges. Canny edge detection is a more advanced and near-optimal method that involves multiple stages: Gaussian smoothing, gradient computation, non-maximum suppression, and hysteresis thresholding. This approach ensures accurate edge localization, reduced noise response, and detection of true edges while minimizing false detections. Due to its reliability, Canny is widely used in real-world applications.
| Prewitt | Sobel | Canny |
|---|---|---|
| Simple gradient operator | Weighted gradient operator | Multi-stage, near-optimal detector |
| Noise sensitive | Less noise sensitive | Strong noise suppression |
| Produces relatively thick edges | Smoother edge maps | Thin and well-localized edges |
| Fast computation | Moderate computation | Computationally more expensive |
| No built-in thresholding | No built-in thresholding | Uses hysteresis thresholding |
| Used in basic applications | Improved edge detection in practice | Common in real-world applications |
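For a concrete comparison of the three detectors, the following sketch (assuming OpenCV is installed and a hypothetical file input.png exists) applies Prewitt masks with filter2D, the built-in Sobel operator, and Canny with hysteresis thresholds:

```python
import cv2
import numpy as np

# Assumes an image file named "input.png" is available (hypothetical path).
img = cv2.imread("input.png", cv2.IMREAD_GRAYSCALE)

# Prewitt: equal weights in the smoothing direction (no built-in OpenCV call,
# so the masks are applied with filter2D).
kx = np.array([[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]], dtype=np.float32)
ky = kx.T
prewitt = np.hypot(cv2.filter2D(img, cv2.CV_32F, kx),
                   cv2.filter2D(img, cv2.CV_32F, ky))

# Sobel: the same idea, but the centre row/column gets weight 2.
sobel = np.hypot(cv2.Sobel(img, cv2.CV_32F, 1, 0, ksize=3),
                 cv2.Sobel(img, cv2.CV_32F, 0, 1, ksize=3))

# Canny: Gaussian smoothing, gradients, non-maximum suppression, and
# hysteresis thresholding (the two values are the low/high thresholds).
edges = cv2.Canny(img, 100, 200)
```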
Q5: Region-based vs Edge-based Segmentation
| Region-based Segmentation | Edge-based Segmentation |
|---|---|
| Groups pixels based on similarity in intensity, color, or texture. | Detects object boundaries using intensity discontinuities. |
| Produces complete regions of objects. | Produces object outlines or boundaries. |
| More robust to noise. | Sensitive to noise and weak edges. |
| Uses techniques like region growing and split-and-merge. | Uses edge detectors such as Sobel, Prewitt, and Canny. |
| Suitable for medical and satellite images. | Suitable when object boundaries are clear. |
| Computationally more complex. | Computationally simpler. |
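Region growing, one of the region-based techniques listed above, can be sketched in a few lines; the example below is a minimal 4-connected implementation on a toy image, not a production algorithm:

```python
import numpy as np
from collections import deque

def region_grow(img, seed, tol=10):
    """Grow a region from `seed`, adding 4-connected pixels whose intensity
    stays within `tol` of the seed value (a minimal region-based approach)."""
    h, w = img.shape
    seed_val = int(img[seed])
    mask = np.zeros((h, w), dtype=bool)
    queue = deque([seed])
    mask[seed] = True
    while queue:
        y, x = queue.popleft()
        for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
            if 0 <= ny < h and 0 <= nx < w and not mask[ny, nx] \
                    and abs(int(img[ny, nx]) - seed_val) <= tol:
                mask[ny, nx] = True
                queue.append((ny, nx))
    return mask

# Toy image: a bright square on a dark background.
img = np.zeros((50, 50), dtype=np.uint8)
img[10:30, 10:30] = 200
print(region_grow(img, (15, 15)).sum())   # 400 pixels -> the full 20x20 square
```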
Q6: SVM, KNN, and Random Forest Compared
| SVM | KNN | Random Forest |
|---|---|---|
| Finds an optimal hyperplane with maximum margin. | Classifies using nearest neighbors. | Uses multiple decision trees (ensemble). |
| Effective in high-dimensional spaces. | Performs poorly in very high-dimensional data without dimensionality reduction. | Handles high-dimensional data well. |
| Requires kernel selection and parameter tuning. | No explicit training phase; stores training data. | Requires training of multiple trees. |
| Relatively memory efficient. | Memory intensive for large datasets. | Moderate memory usage. |
| Sensitive to kernel parameters and C value. | Sensitive to distance metric and k value. | Less sensitive to parameter tuning; robust ensemble. |
| High accuracy with small or medium datasets. | Slow at test time with very large training sets. | Fast and scalable for large datasets. |
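The sketch below (assuming scikit-learn) trains all three classifiers on the small digits dataset so the practical differences in setup can be seen side by side; the hyperparameter values are illustrative defaults, not tuned choices:

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier

# Small image-classification benchmark (8x8 digit images flattened to 64 features).
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

models = {
    "SVM (RBF kernel, needs C/gamma tuning)": SVC(kernel="rbf", C=1.0, gamma="scale"),
    "KNN (k and distance metric matter)":     KNeighborsClassifier(n_neighbors=5),
    "Random Forest (ensemble of trees)":      RandomForestClassifier(n_estimators=100, random_state=0),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    print(f"{name}: test accuracy = {model.score(X_test, y_test):.3f}")
```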
Q7: The SIFT Algorithm
Scale-Invariant Feature Transform (SIFT) is a powerful feature detection and description algorithm used to extract distinctive local features from images. The key advantage of SIFT is that its detected features are invariant to scale, rotation, illumination changes, and small affine transformations, which makes the algorithm robust for many real-world applications.
The algorithm starts with scale-space construction, where the input image is repeatedly blurred using Gaussian filters at different scales. This allows detection of keypoints that are independent of image size. The Difference of Gaussian (DoG) is used to efficiently approximate the Laplacian of Gaussian and is calculated as:
DoG(x, y, σ) = G(x, y, kσ) − G(x, y, σ)
Here, G(x, y, σ) is the Gaussian-blurred image at scale σ and k is a constant multiplicative factor. Local maxima and minima in the DoG images across scales and spatial locations are selected as candidate keypoints. These keypoints are refined by eliminating low-contrast points and unstable edge responses, for example using a Taylor series expansion to improve localization accuracy.
After keypoint detection, orientation assignment is performed to achieve rotation invariance. For each keypoint, gradient magnitude and orientation are computed using pixel differences in the x and y directions as follows:
m(x, y) = sqrt(Ix² + Iy²)
θ(x, y) = atan2(Iy, Ix)
A dominant orientation is assigned to each keypoint based on these gradients. This ensures the descriptor remains invariant to image rotation.
Finally, a feature descriptor is generated by dividing the region around each keypoint into subregions and computing orientation histograms. The result is typically a 128-dimensional feature vector that uniquely represents the local image structure. These descriptors are distinctive and robust, making SIFT widely used in image matching, object recognition, panorama stitching, and 3D reconstruction.
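In practice SIFT is rarely implemented from scratch; the sketch below uses OpenCV (assuming version 4.4 or newer, where cv2.SIFT_create is available, and two hypothetical image files) to detect keypoints, compute 128-dimensional descriptors, and match them with Lowe's ratio test:

```python
import cv2

# Assumes two image files (hypothetical names) and OpenCV >= 4.4.
img1 = cv2.imread("scene.png", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("object.png", cv2.IMREAD_GRAYSCALE)

sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(img1, None)   # keypoints + 128-D descriptors
kp2, des2 = sift.detectAndCompute(img2, None)

# Match descriptors and keep only those that pass Lowe's ratio test.
matcher = cv2.BFMatcher(cv2.NORM_L2)
matches = matcher.knnMatch(des1, des2, k=2)
good = [m for m, n in matches if m.distance < 0.75 * n.distance]
print(f"{len(kp1)} and {len(kp2)} keypoints, {len(good)} good matches")
```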
Q9: The GAN Algorithm
Generative Adversarial Networks (GANs) are deep learning models used to generate new data samples that resemble real data. A GAN consists of two neural networks trained simultaneously: the Generator (G) and the Discriminator (D). The generator takes random noise as input and learns to produce fake samples, while the discriminator evaluates whether a given sample is real (from training data) or fake (produced by G).
GAN training is formulated as a minimax optimization problem, where the generator tries to minimize the loss while the discriminator tries to maximize it. The objective is:
min_G max_D V(D, G) = E_x~p_data[log D(x)] + E_z~p_z[log(1 − D(G(z)))]
Here, x are real samples from p_data and z is random noise from a prior p_z. The discriminator D(x) outputs the probability that x is real, while D(G(z)) is the probability the generated sample is classified as real. During training, D learns to distinguish real from fake, and G learns to produce samples that can fool D. The adversarial process continues until an equilibrium where generated samples are highly realistic.
GANs are used in image synthesis, super-resolution, image-to-image translation, face generation, deepfake creation, and data augmentation. They can be challenging to train due to instability, mode collapse, and sensitivity to hyperparameters.
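The adversarial game can be sketched on a toy 1-D problem; the example below (assuming PyTorch) trains a tiny generator to imitate samples from N(4, 1), using the common non-saturating generator loss rather than the literal minimax form above:

```python
import torch
import torch.nn as nn

# Tiny generator and discriminator for 1-D data.
G = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1))
D = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1), nn.Sigmoid())
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCELoss()

for step in range(2000):
    real = torch.randn(64, 1) + 4.0       # samples from the "real" distribution N(4, 1)
    z = torch.randn(64, 1)                # noise from the prior p_z
    fake = G(z)

    # Discriminator step: maximise log D(x) + log(1 - D(G(z))).
    opt_d.zero_grad()
    loss_d = bce(D(real), torch.ones(64, 1)) + bce(D(fake.detach()), torch.zeros(64, 1))
    loss_d.backward()
    opt_d.step()

    # Generator step: non-saturating form, maximise log D(G(z)).
    opt_g.zero_grad()
    loss_g = bce(D(fake), torch.ones(64, 1))
    loss_g.backward()
    opt_g.step()

print("mean of generated samples:", G(torch.randn(1000, 1)).mean().item())  # drifts toward 4
```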
Q10: CNN and Fast R-CNN
A Convolutional Neural Network (CNN) is a deep learning architecture designed for image data that exploits spatial relationships between pixels. CNNs consist of convolutional layers, activation functions (commonly ReLU), pooling layers, and fully connected layers. Convolutional layers apply learnable filters to input images to extract features. The basic convolution operation can be expressed as:
Feature Map = Image * Kernel
where the kernel (filter) slides over the image and computes dot products to generate feature maps. Early layers learn low-level features (edges, corners, textures); deeper layers learn higher-level features (shapes, object parts). Pooling layers reduce spatial dimensions and computation while preserving important information. CNNs are widely used in image classification, face recognition, and object detection.
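The convolution step can be illustrated with a hand-crafted kernel; the sketch below (assuming NumPy and SciPy) applies a vertical-edge filter to a toy image, whereas in a real CNN the kernel weights would be learned:

```python
import numpy as np
from scipy.signal import convolve2d

# Toy image: dark on the left, bright on the right.
image = np.zeros((6, 6))
image[:, 3:] = 1.0

# A hand-crafted kernel that responds to vertical edges; a CNN learns
# such weights instead of fixing them by hand.
kernel = np.array([[ 1.0, 0.0, -1.0],
                   [ 1.0, 0.0, -1.0],
                   [ 1.0, 0.0, -1.0]])

feature_map = convolve2d(image, kernel, mode="valid")
print(feature_map)   # strong responses only where the intensity changes
```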
Fast R-CNN is an object detection framework that improves efficiency over earlier R-CNN approaches. Unlike R-CNN, which processes each region proposal separately through a CNN, Fast R-CNN processes the entire image once to produce convolutional feature maps. From these maps, Region of Interest (RoI) pooling extracts fixed-size feature vectors for each proposal.
These feature vectors are passed through fully connected layers to output both class probabilities and refined bounding box coordinates. By sharing convolutional computations across proposals, Fast R-CNN significantly reduces computation and increases detection speed while maintaining accuracy.
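The RoI pooling step can be illustrated with torchvision's ready-made operator; the sketch below (assuming PyTorch and torchvision, with random tensors standing in for a real feature map and proposals) shows how differently sized regions are mapped to a fixed 7×7 output:

```python
import torch
from torchvision.ops import roi_pool

# Shared feature map for the whole image: batch of 1, 256 channels, 32x32 spatial size.
features = torch.randn(1, 256, 32, 32)

# Two region proposals in (x1, y1, x2, y2) feature-map coordinates.
proposals = [torch.tensor([[ 2.0,  2.0, 12.0, 12.0],
                           [ 5.0, 10.0, 30.0, 28.0]])]

# RoI pooling turns each (differently sized) proposal into a fixed 7x7 grid,
# so both can be fed to the same fully connected layers.
pooled = roi_pool(features, proposals, output_size=(7, 7))
print(pooled.shape)   # torch.Size([2, 256, 7, 7])
```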
Q11: Image Filtering
Image filtering is a fundamental technique in computer vision and image processing used to enhance images or extract information by modifying pixel values based on their neighborhood. Main objectives include noise reduction, image smoothing, edge enhancement, and feature extraction. Filtering is commonly used as a preprocessing step before edge detection, segmentation, and recognition.
Image filtering is performed by applying a filter kernel (mask) over the image. The kernel slides across the image and the output pixel value is computed using convolution:
g(x, y) = Σ_i Σ_j f(x − i, y − j) · h(i, j)
where f(x, y) is the input image, h(i, j) is the filter kernel, and g(x, y) is the filtered image. The result depends on the filter type.
Filters are broadly classified as linear and non-linear. Linear filters such as mean and Gaussian use weighted averages of neighboring pixels for smoothing and noise reduction. Non-linear filters, such as the median filter, effectively remove impulse noise while preserving edges. Filters can also be low-pass (remove high-frequency noise) or high-pass (enhance edges and fine details).
Filtering can be applied in the spatial domain or the frequency domain. In the frequency domain, filtering is performed by modifying frequency components after applying the Fourier Transform. Image filtering improves image quality and highlights important features, making it an essential step in computer vision systems.
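The sketch below (assuming NumPy and SciPy) compares a linear mean filter, a Gaussian filter, and a non-linear median filter on a toy image corrupted with impulse noise:

```python
import numpy as np
from scipy import ndimage

# Toy image corrupted with salt-and-pepper (impulse) noise.
rng = np.random.default_rng(0)
img = np.full((64, 64), 128.0)
noisy = img.copy()
noisy[rng.random(img.shape) < 0.05] = 255.0   # salt
noisy[rng.random(img.shape) < 0.05] = 0.0     # pepper

mean_f   = ndimage.uniform_filter(noisy, size=3)    # linear low-pass (blurs noise and edges alike)
gauss_f  = ndimage.gaussian_filter(noisy, sigma=1)  # linear, weights fall off with distance
median_f = ndimage.median_filter(noisy, size=3)     # non-linear, removes impulses, keeps edges

for name, out in [("mean", mean_f), ("gaussian", gauss_f), ("median", median_f)]:
    print(name, "mean absolute error:", np.abs(out - img).mean().round(2))
```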
Q12: Adaptive Histogram Equalization (AHE)
Adaptive Histogram Equalization (AHE) is an image enhancement technique used to improve contrast in images with non-uniform illumination. Unlike global histogram equalization, which applies a single transformation to the entire image, AHE enhances contrast locally, making it effective for images with varying lighting conditions.
In AHE, the image is divided into small regions called tiles, and histogram equalization is applied independently to each region. This allows local details in dark or bright areas to become more visible. The basic histogram equalization transformation is:
s = (L − 1) × CDF(r)
where L is the number of gray levels and CDF(r) is the cumulative distribution function. In AHE, this transformation is computed separately for each local region.
A major drawback of naive AHE is that it can amplify noise in uniform regions. To overcome this, Contrast Limited Adaptive Histogram Equalization (CLAHE) clips the histogram to limit excessive contrast enhancement. AHE and CLAHE are widely used in medical imaging, satellite imagery, and low-light enhancement where preserving local details is important.
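The sketch below uses OpenCV (assuming a hypothetical low-contrast grayscale file) to contrast global histogram equalization with CLAHE:

```python
import cv2

# Assumes a low-contrast grayscale image file (hypothetical name).
gray = cv2.imread("low_contrast.png", cv2.IMREAD_GRAYSCALE)

# Global histogram equalization: one transformation for the whole image.
global_eq = cv2.equalizeHist(gray)

# CLAHE: equalization per 8x8 tile, with the histogram clipped at clipLimit
# so that noise in flat regions is not amplified.
clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
local_eq = clahe.apply(gray)
```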
Q8: The YOLO Algorithm
YOLO (You Only Look Once) is a real-time object detection algorithm that formulates detection as a single regression problem rather than using a multi-stage pipeline. Instead of generating region proposals and then classifying them, YOLO processes the entire image in one forward pass of a convolutional neural network. This unified approach makes YOLO extremely fast and suitable for real-time applications.
In YOLO, the input image is divided into an S × S grid, where each grid cell is responsible for detecting objects whose center lies within that cell. Each grid cell predicts a fixed number of bounding boxes, along with confidence scores and class probabilities. The confidence score indicates how certain the model is that an object exists and is computed as:
Confidence = P(Object) × IOU(pred, truth)
Each bounding box is represented by four parameters: (x, y, w, h), where x and y denote the center coordinates relative to the grid cell, and w, h are the width and height relative to the image dimensions. During training, YOLO minimizes a composite loss combining localization error, confidence error, and classification error.
The single-stage architecture enables YOLO to learn global contextual information from the entire image, helping reduce false detections from background regions. Due to its speed and reasonable accuracy, YOLO is widely used in autonomous driving, video surveillance, robotics, and traffic monitoring.
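The IoU term in the confidence score is a simple ratio of overlap to union; a minimal implementation is sketched below:

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# A predicted box that overlaps the ground-truth box by half its width.
print(iou((0, 0, 10, 10), (5, 0, 15, 10)))   # 0.333...
```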
