Machine Learning Model Types and Data Preprocessing
Machine Learning Model Types and Descriptions
1. Geometric Models
Geometric models represent data as points in a multidimensional space. Learning involves finding geometric structures like hyperplanes, clusters, or nearest neighbors that can separate or classify the data. These models rely on the distance between data points, vector spaces, and geometric transformations.
- Examples: Linear Classifiers (Perceptron, Logistic Regression, SVM), Nearest Neighbor Classifiers (k-NN), Clustering Models (K-means).
2. Probabilistic Models
Probabilistic models are grounded in probability theory and statistics. They model the relationship between inputs and outputs using conditional probabilities and make predictions by estimating the likelihood of each outcome, which makes them effective at handling uncertainty.
- Examples: Naïve Bayes Classifier, Hidden Markov Models (HMM), Bayesian Networks.
Advantage: Provides interpretable results and works well with noisy data.
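As a minimal illustration of a probabilistic classifier, the sketch below fits a Gaussian Naïve Bayes model with scikit-learn; the toy feature values and labels are invented for demonstration only.

```python
# Minimal Gaussian Naive Bayes sketch (toy data invented for illustration).
import numpy as np
from sklearn.naive_bayes import GaussianNB

X = np.array([[1.0, 2.1], [1.2, 1.9], [7.8, 8.0], [8.1, 7.7]])  # two features
y = np.array([0, 0, 1, 1])                                      # class labels

model = GaussianNB()
model.fit(X, y)

# predict_proba returns the estimated conditional probability of each class
print(model.predict([[1.1, 2.0]]))        # likely class 0
print(model.predict_proba([[1.1, 2.0]]))  # class likelihoods
```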
3. Logical Models
Logical models use if–then rules and decision boundaries expressed in logical terms. They are more symbolic and often interpretable compared to geometric or probabilistic models.
- Examples: Decision Trees (ID3, C4.5, CART), Rule-based Classifiers (RIPPER, CN2).
They are very useful when human-understandable rules are required.
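A short sketch of a logical model: a small decision tree is trained and its learned if–then rules are printed in human-readable form. The features (age, income) and labels are hypothetical.

```python
# Minimal decision-tree sketch showing extracted if-then rules (toy data).
from sklearn.tree import DecisionTreeClassifier, export_text

X = [[25, 40000], [35, 60000], [45, 80000], [20, 20000]]  # [age, income]
y = [0, 1, 1, 0]                                           # e.g. loan approved?

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# export_text prints the learned rules in human-readable if-then form
print(export_text(tree, feature_names=["age", "income"]))
```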
4. Grouping and Grading Models
- Grouping Models: Divide the dataset into groups of similar items (Clustering). Techniques: K-Means, Hierarchical Clustering. Applications: Customer segmentation, Document clustering.
- Grading Models: Assign an order or rank to data items instead of a strict classification. Applications: Information retrieval ranking, Recommender Systems.
5. Parametric vs. Non-Parametric Models
Parametric Models
Parametric models assume a fixed functional form for the relationship between input and output. Learning involves estimating a finite set of parameters.
- Examples: Linear Regression, Logistic Regression, Neural Networks.
Non-Parametric Models
Non-parametric models do not assume a fixed functional form. The complexity of the model grows with the data.
- Examples: k-Nearest Neighbor (k-NN), Decision Trees, Kernel-based Methods.
Detailed Look at Geometric Models
Geometric Models Explained
Geometric models represent data points in a multidimensional feature space, where each item is a vector of features. Learning involves finding geometric structures (lines, hyperplanes, clusters) to separate or classify points. They are valued for their simplicity, interpretability, and efficiency.
Types of Geometric Models
1. Linear Models
These models assume the decision boundary is a line (in 2D) or a hyperplane (in higher dimensions).
- Examples: Linear Regression, Logistic Regression, Support Vector Machines (SVM).
2. Nearest Neighbor Models
Based on distance measures (like Euclidean distance). A point is classified by its closest neighbors.
- Example: k-Nearest Neighbor (k-NN) classifier.
3. Clustering Models
These models divide data into natural groups by minimizing intra-cluster distance and maximizing inter-cluster distance.
- Examples: K-Means Clustering, Hierarchical Clustering.
4. Kernel-based Models
These extend linear models by implicitly mapping data into a higher-dimensional feature space where it becomes linearly separable.
- Example: Kernelized SVM.
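The sketch below illustrates the kernel idea: on concentric-circle data (generated with scikit-learn's make_circles) a linear SVM performs near chance, while an RBF-kernel SVM separates the classes. The dataset parameters are illustrative.

```python
# Kernelized SVM sketch: an RBF kernel separates data that is not linearly
# separable in the original space (concentric circles).
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

linear_svm = SVC(kernel="linear").fit(X, y)
rbf_svm = SVC(kernel="rbf", gamma=2.0).fit(X, y)

print("linear kernel accuracy:", linear_svm.score(X, y))  # near chance
print("RBF kernel accuracy:", rbf_svm.score(X, y))        # near 1.0
```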
Grouping and Grading Model Techniques
1. Grouping Models (Clustering)
Grouping models divide a dataset into groups of similar data points, typically used in unsupervised learning.
- Goal: Objects within the same group (intra-group) must be very similar; objects in different groups (inter-group) must be dissimilar.
- Techniques used: K-Means Clustering, Hierarchical Clustering, DBSCAN.
- Applications: Customer segmentation, Document/topic clustering, Image segmentation.
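A minimal K-Means grouping sketch, using invented customer-like data (annual spend and visits per month), shows how points are assigned to groups around centroids.

```python
# Minimal K-Means grouping sketch (customer-like toy data, invented values).
import numpy as np
from sklearn.cluster import KMeans

# columns: [annual spend, visits per month]
X = np.array([[200, 2], [220, 3], [215, 2],
              [950, 12], [980, 11], [1000, 13]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

print(kmeans.labels_)           # group assignment for each customer
print(kmeans.cluster_centers_)  # centroid (mean point) of each group
```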
2. Grading Models (Ranking)
Grading models assign an order, level, or rank to data items, focusing on ranking or prioritization.
- Techniques used: Ordinal Regression, Ranking Algorithms.
- Applications: Search engine result ranking, Product recommendation ranking, Credit scoring.
Machine Learning Workflow Steps
1. Problem Definition
Clearly define the goal of the ML application (classification, regression, clustering, or recommendation).
2. Data Collection
Gather relevant data from sources like databases, sensors, APIs, or web scraping. The quality and quantity of data are crucial.
3. Data Preprocessing
Raw data requires cleaning (handling missing data, removing duplicates), scaling (Normalization/Standardization), and Feature Selection/Extraction. The data is then split into training, validation, and test sets.
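A minimal preprocessing sketch: clean, impute, split, and scale. The column names and values are invented; a validation set can be carved out of the training portion with a second split in the same way.

```python
# Preprocessing sketch: clean, impute, split, and scale (columns are illustrative).
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({"age": [25, 32, None, 41],
                   "income": [40e3, 52e3, 61e3, 75e3],
                   "label": [0, 1, 1, 0]})

df = df.drop_duplicates()
df["age"] = df["age"].fillna(df["age"].median())   # simple imputation

X, y = df[["age", "income"]], df["label"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,
                                                    random_state=0)

scaler = StandardScaler().fit(X_train)             # fit on training data only
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
```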
4. Model Selection
Choose the model based on the problem type (e.g., K-Means for Clustering, SVM for Classification). The choice depends on the size, type of data, and application needs.
5. Model Training
Feed the training data into the model. The model learns patterns and relationships by optimizing its parameters.
6. Model Evaluation
Evaluate the trained model using test data. Performance is measured using metrics like Accuracy, Precision, Recall (for classification) or MSE, R² score (for regression).
7. Model Optimization
Fine-tune the model using techniques like Hyperparameter Tuning, Regularization to avoid overfitting, and Cross-validation.
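A sketch of hyperparameter tuning with cross-validation using scikit-learn's GridSearchCV; the parameter grid and the choice of an RBF SVM on the Iris dataset are illustrative.

```python
# Hyperparameter tuning sketch with 5-fold cross-validation (grid is illustrative).
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

param_grid = {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)

print(search.best_params_)  # best hyperparameter combination found
print(search.best_score_)   # mean cross-validated accuracy of that combination
```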
8. Deployment
Integrate the validated model into a real-world environment (mobile apps, web applications, or embedded systems).
9. Monitoring and Maintenance
Continuously monitor the model for performance degradation due to concept drift, requiring regular retraining and updating.
AI vs. Machine Learning Comparison
| Aspect | Artificial Intelligence (AI) | Machine Learning (ML) |
|---|---|---|
| Definition | AI is the broader concept of machines being able to carry out tasks in a way that we consider “intelligent.” | ML is a subset of AI that enables systems to learn from data and improve performance. |
| Goal | To simulate human intelligence (thinking, reasoning, decision-making). | To allow machines to learn from data and predict outcomes. |
| Scope | Very broad: includes ML, robotics, vision, expert systems, etc. | Narrower: focuses only on data-driven learning. |
| Approach | Can use rule-based systems, search algorithms, logic, as well as learning. | Uses statistical models, algorithms, and optimization techniques. |
| Data dependency | AI may or may not require data; it can be rule-based. | ML is highly data-dependent. |
| Applications | Self-driving cars, Chess-playing robots, Medical diagnosis, Speech recognition. | Spam filtering, Stock price prediction, Recommender systems. |
| Key techniques | Knowledge representation, Reasoning, Heuristics, Search. | Regression, Classification, Clustering, Neural Networks. |
| Relationship | AI is the superset. | ML is a subset of AI. |
Parametric vs. Non-Parametric Models
| Aspect | Parametric Models | Non-Parametric Models |
|---|---|---|
| Assumption | Assume a fixed functional form. | No assumption about functional form. |
| Parameters | Finite set of parameters to learn. | Number of parameters grows with data. |
| Complexity | Fixed, does not depend on data size. | Flexible, complexity increases with data. |
| Data Requirement | Works well with small datasets. | Requires large datasets. |
| Computation | Fast training and prediction. | Slower, computationally expensive. |
| Risk | May underfit if assumption is wrong. | May overfit if not regularized. |
| Examples | Linear Regression, Logistic Regression, Naïve Bayes. | k-NN, Decision Trees, SVM, Random Forests. |
Traditional Programming vs. Machine Learning
| Aspect | Traditional Programming | Machine Learning |
|---|---|---|
| Approach | Programmer explicitly writes rules/logic to process input data and generate output. | System learns patterns from data and generates its own rules/model. |
| Input | Input Data + Explicit Program/Rules → Output | Input Data + Output (Labels) → Algorithm learns → Model |
| Output | Result of rules applied to input. | A model that can make predictions on unseen data. |
| Dependency | Depends completely on human-coded instructions. | Depends on data quality and learning algorithm. |
| Flexibility | Rigid: hard to adapt to new scenarios without rewriting code. | Flexible: model can improve with more data. |
| Examples | Payroll system, Calculator, Banking transaction processing. | Spam email classification, Image recognition, Speech translation. |
Types of Machine Learning Learning Paradigms
a) Supervised Learning
Learning with labeled data (each input has a corresponding correct output). The algorithm learns a mapping function $y = f(x)$. Goal: Predict outcomes for new, unseen inputs.
- Examples: Predicting house prices (Regression), Email spam classification (Classification).
b) Unsupervised Learning
Learning with unlabeled data (no outputs are given). The algorithm tries to discover hidden patterns, groups, or structures.
- Goal: Clustering or dimensionality reduction.
- Examples: Market basket analysis, Customer segmentation.
c) Semi-Supervised Learning
Learning using a combination of a small amount of labeled data and a large amount of unlabeled data. Usefulness: Helpful when labeling is costly or time-consuming.
d) Reinforcement Learning
Learning through trial and error using feedback in terms of rewards or penalties. The agent interacts with an environment to maximize cumulative reward.
- Examples: Game playing (Chess, Go), Self-driving cars.
Supervised, Unsupervised, and Semi-Supervised Learning Explained
1. Supervised Learning
The model is trained on a labeled dataset, where both input features and output labels are known. The algorithm learns a mapping function from input to output.
- Goal: Predict the output for unseen data based on past labeled data.
- Examples: Classification (spam detection), Regression (house price prediction).
- Advantages: High accuracy if sufficient labeled data is available; easy to evaluate.
- Disadvantages: Requires large labeled datasets which are costly to prepare.
2. Unsupervised Learning
The dataset contains only input data without labels. The algorithm tries to learn patterns, structure, or relationships in the data.
- Goal: Discover hidden structures, groupings, or reduce data dimensionality.
- Examples: Clustering (customer segmentation), Dimensionality Reduction (PCA).
- Advantages: Useful when labeled data is not available; aids exploratory data analysis.
- Disadvantages: Harder to evaluate results since no true labels are available.
3. Semi-Supervised Learning
A combination of supervised and unsupervised learning, using a small amount of labeled data and a large amount of unlabeled data.
- Goal: Achieve better accuracy than unsupervised learning when labeled data is limited.
- Examples: Web content classification where only a few web pages are labeled; medical image diagnosis, where labeling is expensive.
- Advantages: Reduces the cost of data labeling; better performance than purely unsupervised methods.
- Disadvantages: Complex algorithms are required to balance labeled and unlabeled data.
Data Preprocessing Techniques
Min-Max Scaling (Normalization)
Min-Max scaling is a feature scaling technique used to normalize data into a fixed range, usually [0,1]. It linearly transforms the original values by preserving the relationships between them.
$$x' = \frac{x - x_{\text{min}}}{x_{\text{max}} - x_{\text{min}}}$$
Where: $x$ = original value; $x_{\text{min}}$ = minimum value in the dataset; $x_{\text{max}}$ = maximum value in the dataset; $x'$ = normalized value (in range [0,1]).
- Steps in Min-Max Scaling: 1. Find the minimum and maximum values. 2. Apply the formula for each value. 3. Get normalized values.
- Advantages: Simple, easy to compute, and preserves relationships among data values.
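A minimal sketch applying the formula directly and then reproducing the same result with scikit-learn's MinMaxScaler; the sample values are invented.

```python
# Min-Max scaling sketch: the formula applied directly, then via scikit-learn.
import numpy as np
from sklearn.preprocessing import MinMaxScaler

x = np.array([[10.0], [20.0], [30.0], [50.0]])

# direct application of x' = (x - x_min) / (x_max - x_min)
x_norm = (x - x.min()) / (x.max() - x.min())
print(x_norm.ravel())  # [0.   0.25 0.5  1.  ]

# equivalent result with MinMaxScaler
print(MinMaxScaler().fit_transform(x).ravel())
```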
Handling Missing Values
1. Deletion Methods
- Listwise Deletion: Remove entire rows where any value is missing. Simple, but risks losing important data.
- Column Deletion: Drop the entire column if too many values are missing (e.g., >70–80%). Risks losing potentially important information.
2. Imputation Methods
Fill in missing values with estimated values.
- Mean/Median/Mode Imputation: Use mean (for numerical, normal distribution), median (for skewed numerical), or mode (for categorical).
- Constant Value Imputation: Replace missing values with a fixed value (e.g., -999 or “Unknown”). Useful when missingness itself has meaning.
- K-Nearest Neighbors (KNN) Imputation: Replace missing values based on nearest neighbors’ feature values (more accurate but expensive).
- Regression Imputation: Predict missing values using a regression model built on other features.
- Multiple Imputation: Generates several plausible values using statistical models and combines them (e.g., by averaging), which is more reliable than single imputation.
3. Model-Based Handling
Some ML models (e.g., XGBoost, LightGBM, CatBoost) can handle missing values internally by learning optimal splits even with missing data.
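A small sketch of two imputation methods from the list above, using scikit-learn's SimpleImputer (mean) and KNNImputer on an invented array with missing entries.

```python
# Imputation sketch: mean imputation and KNN imputation on a toy array.
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer

X = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, 6.0], [8.0, np.nan]])

mean_imputed = SimpleImputer(strategy="mean").fit_transform(X)
knn_imputed = KNNImputer(n_neighbors=2).fit_transform(X)  # uses nearest rows

print(mean_imputed)
print(knn_imputed)
```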
Feature Selection
Feature Selection is the process of selecting the most relevant features (attributes) from a dataset to build an efficient and accurate Machine Learning model. It removes irrelevant, redundant, or noisy features.
- Need: Reduces dimensionality, improves accuracy, reduces training time, and prevents overfitting.
Filtering Technique (Filter Method)
Filtering evaluates feature relevance independently of the machine learning algorithm using statistical measures.
- Working: Each feature is ranked based on a statistical score, and a threshold keeps the best features before model training.
- Common Techniques: Correlation coefficient, Chi-square test, Information Gain / Mutual Information, Variance Threshold.
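A filter-method sketch: features are ranked by a chi-square score and the top k are kept before any model is trained. The Iris dataset is used only because its features are non-negative, as chi-square requires.

```python
# Filter-method sketch: rank features with a chi-square score and keep the top k.
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_iris(return_X_y=True)            # 4 features, all non-negative

selector = SelectKBest(score_func=chi2, k=2).fit(X, y)
print(selector.scores_)                       # statistical score per feature
print(selector.get_support(indices=True))     # indices of the 2 best features
```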
Categorical Variable Encoding
Machine Learning models require numerical values. Encoding converts categorical attributes (like Gender or Color) into a numerical format without losing meaning.
One-Hot Encoding
This technique converts each categorical value into a binary vector (0 or 1). Each unique category becomes a new column (feature).
Consider a categorical variable: Color = {Red, Blue, Green}. After one-hot encoding, three new columns (Color_Red, Color_Blue, Color_Green) are created.
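A minimal sketch of this exact Color example using pandas; the rows are invented.

```python
# One-hot encoding sketch for Color = {Red, Blue, Green} using pandas.
import pandas as pd

df = pd.DataFrame({"Color": ["Red", "Blue", "Green", "Red"]})
encoded = pd.get_dummies(df, columns=["Color"])
print(encoded)
# Columns Color_Blue, Color_Green, Color_Red hold 0/1 indicator values.
```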
Dimensionality Reduction Techniques
Process of Principal Component Analysis (PCA)
PCA is a dimensionality reduction technique that transforms high-dimensional data into a smaller set of uncorrelated variables called principal components, preserving most of the information (variance).
- Standardize the Data: Rescale features to have mean = 0 and standard deviation = 1.
- Compute Covariance Matrix: Measures how variables are correlated.
- Calculate Eigenvalues and Eigenvectors: Eigenvectors define the directions (principal components); Eigenvalues define the variance captured.
- Sort and Select Principal Components: Select top k eigenvectors corresponding to the largest eigenvalues.
- Transform the Data: Project the original dataset onto the new k-dimensional feature space.
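The sketch below follows these five steps directly with NumPy on a small invented 2-D dataset, keeping k = 1 component.

```python
# PCA sketch following the steps above with NumPy (toy 2-D data, k = 1).
import numpy as np

X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0]])

# 1. standardize the data
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
# 2. compute the covariance matrix
cov = np.cov(X_std, rowvar=False)
# 3. calculate eigenvalues and eigenvectors
eigvals, eigvecs = np.linalg.eigh(cov)
# 4. sort components by decreasing eigenvalue and keep the top k
order = np.argsort(eigvals)[::-1]
k = 1
components = eigvecs[:, order[:k]]
# 5. project the data onto the k-dimensional space
X_reduced = X_std @ components
print(X_reduced.shape)  # (5, 1)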
Use of PCA in Preprocessing Stage
PCA is used because high-dimensional data often contains redundant features, increasing computational cost and causing the curse of dimensionality. PCA reduces features while keeping the maximum possible variance (information).
Kernel PCA (Principal Component Analysis)
Kernel PCA is an extension of PCA that allows non-linear dimensionality reduction using the kernel trick. Standard PCA fails on complex, non-linear data (like concentric circles).
- Concept: Uses a kernel function $K(x_i, x_j) = \phi(x_i) \cdot \phi(x_j)$ to compute dot products in a higher-dimensional feature space where data becomes linearly separable.
- Common Kernels: Linear kernel, Polynomial kernel, Radial Basis Function (RBF).
- Advantages: Handles non-linear data structures effectively; useful in image processing and pattern recognition.
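A Kernel PCA sketch on the concentric-circles case mentioned above: linear PCA leaves the rings intertwined, while an RBF kernel unfolds them. The gamma value is illustrative.

```python
# Kernel PCA sketch: an RBF kernel unfolds concentric circles that linear PCA cannot.
from sklearn.datasets import make_circles
from sklearn.decomposition import KernelPCA, PCA

X, _ = make_circles(n_samples=300, factor=0.3, noise=0.05, random_state=0)

linear_pca = PCA(n_components=2).fit_transform(X)      # rings remain intertwined
kernel_pca = KernelPCA(n_components=2, kernel="rbf",
                       gamma=10).fit_transform(X)      # rings become linearly separable

print(linear_pca.shape, kernel_pca.shape)
```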
Feature Selection Wrapper Methods
Types of Wrapper Methods
1. Forward Selection
Starts with no features and iteratively adds the feature that provides the greatest improvement in model performance.
- Advantage: Simple and efficient when few features are important.
- Limitation: Can miss effective feature combinations.
2. Backward Elimination
Starts with all features and iteratively removes the least important feature (whose removal least reduces performance).
- Advantage: Useful when many irrelevant features exist.
- Limitation: Computationally expensive for high-dimensional data.
3. Recursive Feature Elimination (RFE)
A systematic version of backward elimination. Trains the model, ranks features by importance, removes the least important one, and repeats until the desired number of features is reached.
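A minimal RFE sketch wrapping a logistic regression estimator; the synthetic dataset and the number of features to keep are illustrative choices.

```python
# Recursive Feature Elimination (RFE) sketch wrapping a logistic regression model.
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=10,
                           n_informative=3, random_state=0)

estimator = LogisticRegression(max_iter=5000)
rfe = RFE(estimator, n_features_to_select=3).fit(X, y)

print(rfe.support_)   # boolean mask of the 3 selected features
print(rfe.ranking_)   # rank 1 = selected, higher = eliminated earlier
```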
4. Exhaustive Search (Brute Force)
Evaluates all possible subsets of features and selects the subset yielding the best model performance.
- Advantage: Guarantees finding the best feature subset.
- Limitation: Computationally infeasible for large feature sets (exponential growth).
Recommender System Techniques
Matrix Factorization
Matrix Factorization decomposes a user-item interaction matrix (often sparse due to missing ratings) into the product of two lower-dimensional matrices ($R \approx U \times V^T$).
- Goal: Learn latent features of users ($U$) and items ($V$) to predict missing values (ratings).
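A minimal sketch of matrix factorization by gradient descent on a small invented rating matrix (0 marks a missing rating); library-based factorizations work similarly but at scale.

```python
# Matrix factorization sketch: factor a small rating matrix R (0 = missing) into
# user factors U and item factors V by gradient descent (all values invented).
import numpy as np

R = np.array([[5, 3, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [0, 1, 5, 4]], dtype=float)   # 0 marks an unknown rating

n_users, n_items, k = R.shape[0], R.shape[1], 2
rng = np.random.default_rng(0)
U = rng.normal(scale=0.1, size=(n_users, k))
V = rng.normal(scale=0.1, size=(n_items, k))

lr, reg = 0.01, 0.02
for _ in range(5000):
    for u in range(n_users):
        for i in range(n_items):
            if R[u, i] > 0:                  # train only on observed ratings
                err = R[u, i] - U[u] @ V[i]
                U[u] += lr * (err * V[i] - reg * U[u])
                V[i] += lr * (err * U[u] - reg * V[i])

print(np.round(U @ V.T, 2))   # predicted ratings, including formerly missing cells
```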
Content-Based Filtering
Recommends items to a user based on the features of the items and the preferences of the user. It relies on item descriptions/attributes (content) rather than only user ratings.
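A content-based filtering sketch: items whose TF-IDF description vectors are most similar to an item the user already liked are ranked first. The item names and descriptions are invented.

```python
# Content-based filtering sketch: recommend items whose TF-IDF description vectors
# are most similar to an item the user already liked (descriptions are invented).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

items = {
    "Movie A": "space adventure science fiction robots",
    "Movie B": "romantic comedy love story",
    "Movie C": "alien invasion science fiction space battle",
}
liked = "Movie A"

names = list(items)
tfidf = TfidfVectorizer().fit_transform(items.values())
similarity = cosine_similarity(tfidf[names.index(liked)], tfidf).ravel()

ranked = sorted(zip(names, similarity), key=lambda pair: -pair[1])
print([name for name, _ in ranked if name != liked])  # Movie C ranks above Movie B
```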
