Machine Learning Models: Regression and Classification
Regression Models and Regularization Techniques
1. Linear Regression (Ordinary Least Squares – OLS)
Linear Regression is the most basic form, aiming to model the relationship between a dependent variable (Y) and one or more independent variables (X) by fitting a straight line (or hyperplane) to the data.
- Goal: To find the coefficient values (β) that minimize the Residual Sum of Squares (RSS), which is the sum of the squared differences between the observed data points and the values predicted by the model.
- Formula (Cost Function to Minimize): RSS = Σᵢ (yᵢ − ŷᵢ)², where ŷᵢ is the predicted value for observation i.
- Use Case: When the dataset is simple, the features are not highly correlated, and there is no major risk of overfitting.
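As a concrete illustration, here is a minimal sketch of fitting an OLS model with scikit-learn; the synthetic data and the library choice are assumptions for illustration, not part of any particular dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: y = 3x + 5 plus Gaussian noise (illustrative only)
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 3 * X[:, 0] + 5 + rng.normal(0, 1, size=100)

# OLS finds the coefficients that minimize the Residual Sum of Squares
ols = LinearRegression().fit(X, y)
print("slope:", ols.coef_[0], "intercept:", ols.intercept_)
print("RSS:", np.sum((y - ols.predict(X)) ** 2))
```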
2. Ridge Regression
Ridge Regression addresses some of the problems of OLS, particularly multicollinearity (highly correlated independent variables) and overfitting, by adding a penalty term to the cost function.
- Regularization Type: L2 Regularization
- Penalty: Adds a penalty term proportional to the square of the magnitude of the coefficients.
- Formula (Cost Function to Minimize): Σᵢ (yᵢ − ŷᵢ)² + λ Σⱼ βⱼ², where λ ≥ 0 is the tuning (regularization) parameter.
- Impact on Coefficients: It shrinks the coefficients towards zero, reducing their impact, but it never sets any coefficient exactly to zero. All predictors remain in the model.
- Use Case: When you have many predictors, all of which are potentially relevant, and you want to reduce the magnitude of the coefficients to prevent overfitting and handle multicollinearity.
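A minimal sketch of how the L2 penalty behaves, assuming scikit-learn and two artificially correlated predictors (both are illustrative assumptions; the alpha value plays the role of λ):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

# Two highly correlated predictors (illustrative of multicollinearity)
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.01, size=200)   # nearly identical to x1
X = np.column_stack([x1, x2])
y = 2 * x1 + rng.normal(scale=0.5, size=200)

print("OLS coefficients:  ", LinearRegression().fit(X, y).coef_)
# The L2 penalty shrinks the coefficients toward zero
# but keeps every predictor in the model.
print("Ridge coefficients:", Ridge(alpha=1.0).fit(X, y).coef_)
```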
3. Lasso Regression
Lasso Regression (Least Absolute Shrinkage and Selection Operator) is another regularization technique that is similar to Ridge but uses a different penalty.
- Regularization Type: L1 Regularization
- Penalty: Adds a penalty term proportional to the absolute value of the magnitude of the coefficients.
- Formula (Cost Function to Minimize): Σᵢ (yᵢ − ŷᵢ)² + λ Σⱼ |βⱼ|, where λ ≥ 0 is the tuning (regularization) parameter.
- Impact on Coefficients: It shrinks the coefficients towards zero and, crucially, can set some coefficients exactly to zero. This effectively removes the corresponding features from the model.
- Use Case: When you have a large number of features and want to perform automatic feature selection to create a simpler, more interpretable model by eliminating irrelevant predictors.
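A minimal sketch of Lasso's feature-selection effect, again assuming scikit-learn and synthetic data in which only two of ten features matter (the alpha value is an arbitrary illustrative choice):

```python
import numpy as np
from sklearn.linear_model import Lasso

# 10 features, but only the first 2 actually drive the target (illustrative)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 4 * X[:, 0] - 3 * X[:, 1] + rng.normal(scale=0.5, size=200)

# The L1 penalty can set irrelevant coefficients exactly to zero,
# performing automatic feature selection.
lasso = Lasso(alpha=0.1).fit(X, y)
print("coefficients :", np.round(lasso.coef_, 2))
print("features kept:", np.flatnonzero(lasso.coef_))
```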
Core Linear and Specialized Regression Models
These models form the basis of most regression techniques; the first assumes a straight-line relationship between the variables, while the others extend the same linear machinery to curved relationships and to classification.
1.1 Linear Regression (OLS)
- Goal: To find the line of best fit that minimizes the sum of squared differences between the predicted values and the actual data points (Ordinary Least Squares or OLS).
- Relationship: Assumes a linear relationship between the independent variable(s) and a continuous dependent variable.
- Use Case: Predicting a numerical value like house prices, temperature, or sales volume based on a simple, direct relationship with features.
1.2 Polynomial Regression
- Goal: To fit a non-linear relationship to the data using a linear model approach.
- Relationship: Models the relationship as an nth-degree polynomial (e.g., quadratic or cubic), allowing for curved lines. It still uses linear regression principles on transformed features.
- Use Case: When the relationship is clearly curved, such as modeling disease progression over time or population growth.
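A minimal sketch, assuming scikit-learn: the feature is expanded into polynomial terms and an ordinary linear model is fit on the transformed features.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Quadratic relationship (illustrative): y = 0.5x^2 - x + 2 plus noise
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(150, 1))
y = 0.5 * X[:, 0] ** 2 - X[:, 0] + 2 + rng.normal(scale=0.3, size=150)

# Expand x into [x, x^2], then fit an ordinary linear model on the
# transformed features -- linear in the coefficients, curved in x.
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(X, y)
print("R^2 on training data:", model.score(X, y))
```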
1.3 Logistic Regression
Note: Despite the name, Logistic Regression is primarily used for Classification, not continuous prediction.
- Goal: To estimate the probability of a binary (two-outcome) event occurring.
- Relationship: Uses the sigmoid function to map the linear combination of predictors to a probability between 0 and 1.
- Use Case: Predicting whether an email is spam (Yes/No), if a customer will churn (True/False), or if a medical test result is positive (0/1).
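A minimal sketch of binary logistic regression with scikit-learn on synthetic data (both are assumptions made purely for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Single feature, binary label: larger x -> more likely class 1 (illustrative)
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 1))
p = 1 / (1 + np.exp(-(2 * X[:, 0] - 0.5)))   # sigmoid of a linear score
y = rng.binomial(1, p)

clf = LogisticRegression().fit(X, y)
# predict_proba returns probabilities between 0 and 1; predict applies
# a 0.5 threshold to turn them into class labels.
print("P(class=1 | x=1.0):", clf.predict_proba([[1.0]])[0, 1])
print("predicted label   :", clf.predict([[1.0]])[0])
```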
Advanced and Specialized Models
- Poisson Regression: Used for Count Data (non-negative integers). It assumes the dependent variable follows a Poisson distribution. Primary Use Case: Modeling the number of events (e.g., traffic accidents, website clicks, insurance claims).
- Quantile Regression: Used for Continuous Data. It estimates the relationship at different quantiles (e.g., median, 90th percentile) of the target variable’s distribution. Primary Use Case: When you need to predict the tails of the distribution.
- Support Vector Regression (SVR): Used for Continuous Data. Based on the Support Vector Machine (SVM) algorithm; finds a function that has at most ε (epsilon) deviation from the actual values. Primary Use Case: Complex and non-linear relationships using the “kernel trick.”
- Decision Tree Regressor: Used for Continuous Data. Splits the data into branches based on feature values. The prediction is the average value in the final leaf node. Primary Use Case: Easy to interpret and handles non-linear relationships well.
- Random Forest Regressor: Used for Continuous Data. An ensemble method that builds multiple Decision Trees and averages their predictions. Primary Use Case: High-performance model for complex non-linear data (see the sketch after this list).
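The following sketch compares a single Decision Tree Regressor with a Random Forest Regressor on a synthetic non-linear target; the data and hyperparameters are illustrative assumptions, not recommendations.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Non-linear target (illustrative): y = sin(x) plus noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 6, size=(400, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.2, size=400)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# A single tree predicts the mean of the training targets in each leaf;
# the forest averages many such trees to reduce variance.
tree = DecisionTreeRegressor(random_state=0).fit(X_tr, y_tr)
forest = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_tr, y_tr)
print("Decision Tree R^2:", tree.score(X_te, y_te))
print("Random Forest R^2:", forest.score(X_te, y_te))
```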
Understanding the Bias-Variance Trade-off
The Bias-Variance Trade-off is one of the most fundamental concepts in machine learning. It describes the conflict between two sources of error that prevent a model from generalizing well to new, unseen data. The goal is to find a balance—the “sweet spot”—that minimizes the total error.
The total expected prediction error can be mathematically decomposed into three parts: Irreducible Error (noise inherent in the data), Bias², and Variance. The goal is to minimize the reducible error.
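In symbols, the standard decomposition reads as follows, where f is the true function, f̂ the fitted model, and σ² the variance of the irreducible noise:

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{Bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{Variance}}
  + \underbrace{\sigma^2}_{\text{Irreducible error}}
```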
1. Defining Bias and Variance
- High Bias (Systematic Error): Bias is the error introduced by approximating a real-world problem with a simplified model. It measures how far off the average model prediction is from the true value. Cause: Model is too simple. Symptom: Poor performance on both training and test data.
- High Variance (Random Error): Variance is the error introduced by a model’s sensitivity to small fluctuations in the training data. It measures how much predictions change if the model is trained on a different dataset. Cause: Model is too complex and fits noise. Symptom: High performance on training data but poor on test data.
2. Relationship to Underfitting and Overfitting
- Underfitting (High Bias, Low Variance): The model is too simple to capture necessary patterns. Example: Using Linear Regression for a parabolic relationship.
- Overfitting (Low Bias, High Variance): The model is too complex and flexible, fitting random fluctuations. Example: Using a very high-degree polynomial that wiggles to hit every outlier.
- Optimal Fit (Low Bias, Low Variance): The model captures the underlying trend without fitting the noise.
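A minimal sketch of the trade-off, assuming scikit-learn and a synthetic parabolic target: degree 1 underfits, degree 2 matches the underlying trend, and degree 15 is flexible enough to start fitting noise.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Parabolic ground truth (illustrative); compare three model complexities
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(100, 1))
y = X[:, 0] ** 2 + rng.normal(scale=0.5, size=100)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for degree in (1, 2, 15):   # too simple, about right, very flexible
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_tr, y_tr)
    print(f"degree {degree:2d}: train R^2 = {model.score(X_tr, y_tr):.2f}, "
          f"test R^2 = {model.score(X_te, y_te):.2f}")
```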
Machine Learning Evaluation Metrics
Evaluation metrics quantify how well a trained model performs: classification metrics are typically built from the Confusion Matrix, while regression metrics are based on prediction error. The choice of metric depends on the problem type.
1. Classification Evaluation Metrics
The Confusion Matrix summarizes performance by counting correct and incorrect predictions:
- True Positive (TP): Correctly predicted positive.
- False Negative (FN): Incorrectly predicted negative (missed case).
- False Positive (FP): Incorrectly predicted positive (false alarm).
- True Negative (TN): Correctly predicted negative.
Common Metrics (illustrated in the sketch after this list):
- Accuracy: Proportion of total correct predictions. Best for balanced datasets.
- Precision: Out of all predicted positives, how many were actually positive? Important when FP cost is high (e.g., spam detection).
- Recall (Sensitivity): Out of all actual positives, how many did the model catch? Important when FN cost is high (e.g., disease diagnosis).
- F1 Score: The harmonic mean of Precision and Recall. Excellent for imbalanced datasets.
- ROC Curve & AUC: Plots True Positive Rate against False Positive Rate. AUC closer to 1.0 indicates a better model.
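A worked sketch of these metrics using scikit-learn on small hypothetical label vectors (the numbers are made up purely for illustration):

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score, roc_auc_score)

# Hypothetical labels, predictions, and predicted probabilities
y_true  = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred  = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]
y_score = [0.9, 0.2, 0.4, 0.8, 0.1, 0.7, 0.6, 0.3, 0.95, 0.05]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("TP, FP, FN, TN:", tp, fp, fn, tn)
print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))   # TP / (TP + FP)
print("Recall   :", recall_score(y_true, y_pred))      # TP / (TP + FN)
print("F1 score :", f1_score(y_true, y_pred))
print("ROC AUC  :", roc_auc_score(y_true, y_score))
```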
2. Regression Evaluation Metrics
- Mean Absolute Error (MAE): The average of the absolute differences between predicted and actual values.
- Mean Squared Error (MSE): The average of the squared errors. It heavily penalizes large errors (outliers).
- Root Mean Squared Error (RMSE): The square root of MSE. It is in the same unit as the target variable and is one of the most commonly reported regression metrics (computed in the sketch below).
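A minimal sketch computing all three on hypothetical values, with RMSE taken as the square root of MSE via NumPy:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Hypothetical actual vs. predicted values
y_true = [3.0, 5.0, 7.5, 10.0]
y_pred = [2.5, 5.5, 7.0, 12.0]

mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)                      # same unit as the target variable
print(f"MAE = {mae:.3f}, MSE = {mse:.3f}, RMSE = {rmse:.3f}")
```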
K-Nearest Neighbors (KNN) Algorithm
The K-Nearest Neighbors (KNN) algorithm is a simple, non-parametric, supervised method used for both classification and regression. It is a “lazy learner” because it stores the training data and defers computation until prediction time.
How KNN Works
- Choose K: Select the number of nearest neighbors.
- Calculate Distance: Use metrics like Euclidean Distance (straight-line) or Manhattan Distance (city block).
- Find Neighbors: Identify the K closest data points.
- Make Prediction: For classification, use a majority vote. For regression, use the average (mean) of the neighbors.
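A minimal sketch, assuming scikit-learn, synthetic data, and K = 5; a scaler is included because KNN relies on distances between features.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data (illustrative)
X, y = make_classification(n_samples=300, n_features=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Scale the features, then let the K = 5 nearest neighbors vote
knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
knn.fit(X_tr, y_tr)
print("test accuracy:", knn.score(X_te, y_te))
```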
Advantages and Disadvantages
- Pros: Simple, intuitive, no assumptions about data distribution, handles multi-class problems naturally.
- Cons: Computationally expensive at prediction time, sensitive to the “Curse of Dimensionality,” requires feature scaling (normalization), and sensitive to irrelevant features.
Binary vs. Multiclass Classification
- Binary Classification: Categorizes data into exactly two classes (0/1, True/False). Typically modeled with a Sigmoid activation function and Binary Cross-Entropy loss.
- Multiclass Classification: Categorizes data into three or more mutually exclusive classes. Typically modeled with a Softmax activation function and Categorical Cross-Entropy loss.
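A short sketch of the two activation functions with NumPy; the scores fed into them are hypothetical and purely illustrative.

```python
import numpy as np

def sigmoid(z):
    """Map a single score to a probability in (0, 1) -- the binary case."""
    return 1 / (1 + np.exp(-z))

def softmax(z):
    """Map a score vector to probabilities that sum to 1 -- the multiclass case."""
    e = np.exp(z - np.max(z))   # subtract the max for numerical stability
    return e / e.sum()

print("sigmoid(2.0)        :", sigmoid(2.0))
print("softmax([2, 1, 0.1]):", softmax(np.array([2.0, 1.0, 0.1])))
```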
Outlier Analysis and Handling Techniques
Outlier analysis involves identifying data points that significantly deviate from the majority of observations.
1. Detection Techniques
- Statistical Methods: Z-Score (standard deviations from the mean) and the IQR Method (using the 25th and 75th percentiles); see the sketch after this list.
- Proximity-Based: KNN (isolated points) and Local Outlier Factor (LOF) (density-based).
- Machine Learning: Isolation Forest (isolates anomalies in fewer splits) and DBSCAN (points not belonging to any cluster).
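A minimal sketch of the two statistical rules with NumPy on a small hypothetical sample; the 1.5·IQR and |z| > 3 thresholds are conventional choices, not requirements.

```python
import numpy as np

# Small sample with one obvious outlier (illustrative)
x = np.array([10, 12, 11, 13, 12, 11, 14, 95], dtype=float)

# IQR rule: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
iqr_outliers = (x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)

# Z-score rule: flag points more than 3 standard deviations from the mean.
# Note: in tiny samples the outlier inflates the standard deviation,
# so this rule can miss it (the IQR rule is more robust here).
z = (x - x.mean()) / x.std()
z_outliers = np.abs(z) > 3

print("IQR outliers    :", x[iqr_outliers])
print("Z-score outliers:", x[z_outliers])
```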
2. Treatment Techniques
- Removing (Trimming): Dropping the data points if they are errors.
- Imputing (Capping): Replacing outliers with the mean or median, or using Winsorization (capping values at a specific percentile); see the sketch after this list.
- Transforming: Using Log or Square Root transformations to reduce the influence of extreme values.
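A minimal sketch of capping and transforming with NumPy on the same kind of hypothetical sample; the 5th/95th percentile bounds are an arbitrary illustrative choice.

```python
import numpy as np

x = np.array([10, 12, 11, 13, 12, 11, 14, 95], dtype=float)

# Capping (winsorizing) at the 5th and 95th percentiles
low, high = np.percentile(x, [5, 95])
capped = np.clip(x, low, high)

# Log transform compresses the influence of extreme values
logged = np.log1p(x)

print("capped:", np.round(capped, 1))
print("log1p :", np.round(logged, 2))
```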
Clustering: K-Means and Hierarchical
K-Means Clustering
An unsupervised algorithm that partitions data into K distinct clusters by minimizing the Within-Cluster Sum of Squares (WCSS). It iteratively assigns points to the nearest centroid and updates the centroid position until convergence.
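A minimal sketch, assuming scikit-learn and synthetic blob data; K = 3 and the other settings are illustrative choices.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Three synthetic blobs (illustrative)
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# K-Means assigns each point to the nearest centroid and re-computes
# the centroids until the assignments stop changing.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("cluster sizes :", [int((km.labels_ == k).sum()) for k in range(3)])
print("WCSS (inertia):", round(km.inertia_, 1))
```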
Hierarchical Clustering
Builds a hierarchy of clusters represented by a Dendrogram. It does not require a pre-defined number of clusters.
- Agglomerative (Bottom-Up): Starts with each point as a cluster and merges the closest pairs.
- Divisive (Top-Down): Starts with one cluster and recursively splits it.
- Linkage Methods: Single (closest points), Complete (farthest points), Average, or Ward’s Method (minimizing WCSS).
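A minimal sketch of agglomerative clustering with Ward's linkage, assuming SciPy and the same kind of synthetic blobs; cutting the tree into three clusters is an illustrative choice.

```python
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.datasets import make_blobs

# Synthetic blobs (illustrative)
X, _ = make_blobs(n_samples=50, centers=3, random_state=0)

# Agglomerative (bottom-up) clustering with Ward's linkage
Z = linkage(X, method="ward")
labels = fcluster(Z, t=3, criterion="maxclust")   # cut the tree into 3 clusters
print("cluster sizes:", [int((labels == k).sum()) for k in (1, 2, 3)])

# scipy.cluster.hierarchy.dendrogram(Z) draws the merge tree (requires matplotlib)
```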
