Machine Learning Fundamentals: From Regression to Clustering
NOTEBOOK 8
Machine learning, a branch of artificial intelligence and computer science, utilizes data and algorithms to mimic human learning. The data used during the learning phase, known as training data, serves as the guiding principle for the machine learning system.
A machine learning model is an algorithm or mathematical expression that defines the relationship between a target variable and one or more predictor variables.
Types of Machine Learning
- Supervised Learning: Training data includes both predictor and target variables.
- Unsupervised Learning: Training data contains only predictor variables.
- Semi-Supervised Learning: A combination of supervised and unsupervised learning.
- Reinforcement Learning: Requires no pre-labeled training data; an agent learns through trial and error, guided by rewards and penalties.
Based on the Target Variable
- Regression: The target is a numerical variable, predicting values like salary or temperature.
- Classification: The target is a categorical variable, predicting classes like positive or negative.
Regression
Definition
A regression model defines the relationship between one or more independent variables (predictors) and a dependent variable (target). Its purpose is to understand how the dependent variable changes in response to changes in the independent variable(s). The target variable in regression is always numerical.
Pearson’s Correlation Coefficient
Pearson’s Correlation Coefficient ranges from -1 to 1: -1 indicates a perfect negative linear relationship, 1 indicates a perfect positive linear relationship, and 0 indicates no linear relationship between variables.
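As a quick sketch, Pearson's r can be computed with NumPy (the paired values below are made up for illustration):

```python
import numpy as np

# Hypothetical paired measurements, roughly linear in x
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# np.corrcoef returns the 2x2 correlation matrix; the off-diagonal
# entry is Pearson's r between x and y
r = np.corrcoef(x, y)[0, 1]
print(round(r, 3))  # close to 1: strong positive linear relationship
```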
Disadvantages of Linear Regression
- Limited to Linear Relationships: Linear regression only considers linear relationships between variables.
- Sensitive to Outliers: Outliers can significantly impact model performance and accuracy.
- Data Must Be Independent: Multicollinearity (correlation between independent variables) must be addressed before applying linear regression.
Multiple Linear Regression
Multiple linear regression defines the relationship between two or more independent variables and a dependent variable. The relationship is represented by the equation: Y = β0 + β1X1 + β2X2 + … + βnXn
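A minimal sketch of fitting that equation with scikit-learn, using noise-free toy data so the coefficients are recovered exactly:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data generated from y = 1 + 2*x1 + 3*x2 (no noise), purely illustrative
X = np.array([[0, 0], [1, 0], [0, 1], [1, 1], [2, 3]], dtype=float)
y = 1 + 2 * X[:, 0] + 3 * X[:, 1]

model = LinearRegression().fit(X, y)
print(model.intercept_)  # recovers β0 = 1
print(model.coef_)       # recovers [β1, β2] = [2, 3]
```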
Multicollinearity
Multicollinearity occurs when two or more predictor variables are highly correlated, making it difficult to isolate their individual effects on the dependent variable.
Pipelines
Pipelines streamline machine learning workflows by encapsulating multiple steps into a single object. They offer convenience, prevent data leakage, and ensure consistent data processing.
Key Features
- Initialization: Create a Pipeline object with a list of tuples, each containing a name and an estimator or transformer.
- Execution: The `fit()` method sequentially applies transformations, fitting each step with the data.
- Chaining: Each step (except the last) must be a transformer with `fit()` and `transform()` methods. Output from one step becomes input for the next.
- Final Estimator: The last step is an estimator with `fit()` and `predict()` methods, typically the model for training and prediction.
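A minimal Pipeline sketch chaining a scaler (transformer) into a classifier (final estimator), on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Each tuple is (name, step); every step except the last must implement
# fit/transform, while the last step only needs fit/predict.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression()),
])

pipe.fit(X, y)           # fits the scaler, transforms, then fits the model
preds = pipe.predict(X)  # applies the same scaling before predicting
print(preds[:5])
```

Because scaling happens inside the pipeline, cross-validation on `pipe` refits the scaler on each training fold, which is what prevents data leakage.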
Classification
Classification categorizes input data into predefined classes. The goal is to develop a model that accurately predicts the class of new instances based on learned patterns.
Key Concepts
- Categories: Distinct labels or groups for data categorization (binary or multiclass).
- Features: Data characteristics used for training and prediction.
- Training Data: Labeled data with input features and corresponding class labels.
- Model: Mathematical representation of learned relationships (e.g., logistic regression, SVM, decision trees).
- Prediction: Using the trained model to predict categories of new instances.
- Evaluation Metrics: Assessing model performance (e.g., accuracy, precision, recall, F1 score).
Types of Classification
- Binary Classification: Predicting one of two classes (e.g., spam detection).
- Multiclass Classification: Predicting one of more than two classes (e.g., image recognition).
Logistic Regression
Despite its name, logistic regression is primarily used for classification, predicting the probability of an instance belonging to a specific class.
Support Vector Machines (SVM)
SVMs find the optimal decision boundary (hyperplane) that maximally separates data points into different classes.
Key Concepts
- Hyperplane: Decision boundary separating data into classes.
- Support Vectors: Data points closest to the hyperplane, influencing its position.
- Margin: Distance between the hyperplane and the nearest data point.
- Kernel Trick: Handling non-linear decision boundaries by mapping data into a higher-dimensional space.
Types of SVM
- Hard Margin SVM: Assumes perfectly separable data, sensitive to outliers.
- Soft Margin SVM: Allows for misclassifications, handles non-separable data, and uses a "C" parameter to control the trade-off between margin width and misclassifications.
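A sketch of the C trade-off with scikit-learn's `SVC` on synthetic blobs (the C values here are arbitrary, chosen only to show the contrast):

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, random_state=0)

# Small C -> wider margin, more misclassifications tolerated (softer);
# large C -> narrower margin, misclassifications penalized heavily (harder).
soft = SVC(kernel="linear", C=0.01).fit(X, y)
hard_ish = SVC(kernel="linear", C=1000.0).fit(X, y)

# A looser margin typically encloses more points, so it relies on
# more support vectors than a tight one
print(len(soft.support_), len(hard_ish.support_))
```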
K-Nearest Neighbors (KNN)
KNN is a simple algorithm used for classification and regression, based on the proximity of similar items.
KNN Classification
The class of a data point is determined by the most common class among its k nearest neighbors.
KNN Regression
The value of a query point is predicted by averaging the values of its k nearest neighbors.
Selecting Optimal K
- Cross-validation: Helps determine the k value with the highest accuracy.
- Square Root Rule: Suggests an initial k value approximately equal to the square root of the dataset size.
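Both rules above can be sketched on the iris dataset: the square-root rule gives a starting point, and cross-validation then compares candidate k values directly:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Square-root rule: sqrt(150) ≈ 12 as an initial guess for k
start_k = int(np.sqrt(len(X)))

# 5-fold cross-validation accuracy for each candidate k
scores = {
    k: cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
    for k in range(1, 21)
}
best_k = max(scores, key=scores.get)
print(start_k, best_k, round(scores[best_k], 3))
```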
Limitations of KNN
- Slow with large datasets due to distance calculations.
- Performance degrades with increasing features.
- Memory-intensive as it stores the entire training dataset.
- Struggles with binary or categorical variables.
Clustering
Clustering is an unsupervised technique that groups similar data points into clusters based on inherent patterns. The center of a cluster is called a centroid.
Objective
Partition data into groups with minimal intra-cluster distance and maximal inter-cluster distance.
Clustering Methods
- Partitioning Methods: Divide data into separate clusters (e.g., KMeans).
- Hierarchical Clustering: Forms cluster hierarchies (agglomerative or divisive).
- Density-based Methods: Identify clusters as areas of higher density (e.g., DBSCAN).
Elbow Method
Helps determine the optimal number of clusters (k) by analyzing the sum of squared distances (SSD) for different k values.
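A minimal elbow-method sketch on synthetic blobs; scikit-learn exposes the SSD as the fitted model's `inertia_` attribute:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 4 true clusters, for illustration only
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

# inertia_ is the sum of squared distances of samples to their
# nearest centroid — the SSD used by the elbow method
ssd = []
for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    ssd.append(km.inertia_)

# SSD decreases as k grows; the "elbow" is where the drop flattens
# (here, around k = 4, the true number of clusters)
print([round(s) for s in ssd])
```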
Decision Trees
Decision trees are tree-like structures used for classification and regression, making decisions based on a series of rules.
Gini Impurity
Measures the degree of impurity or disorder in a set of examples.
Entropy
Measures the uncertainty of a set of instances.
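Both impurity measures follow directly from the class proportions; a minimal NumPy sketch:

```python
import numpy as np

def gini(labels):
    # Gini impurity: 1 - sum(p_i^2) over class proportions p_i
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(labels):
    # Entropy: -sum(p_i * log2(p_i)) over class proportions p_i
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

pure = ["a", "a", "a", "a"]    # one class: no impurity, no uncertainty
mixed = ["a", "a", "b", "b"]   # 50/50 split: maximal for two classes
print(gini(pure), gini(mixed))        # 0 and 0.5
print(entropy(pure), entropy(mixed))  # 0 and 1 bit
```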
Splitting Criterion in Regression
Minimizes mean squared error (MSE) to find splits resulting in subsets with minimal target variable variability.
Model Complexity and Bias-Variance Trade-off
Model complexity refers to a model’s capacity to capture data patterns. Balancing complexity is crucial to avoid overfitting or underfitting.
Overfitting
Occurs when a model is too complex, capturing noise as patterns, leading to high variance.
Underfitting
Occurs when a model is too simple, failing to capture data structure, leading to high bias.
Factors Influencing Complexity
- Number of features
- Model structure
- Algorithm’s learning capacity
Considerations
- Interpretability: Simpler models are easier to understand.
- Computational Efficiency: Complex models require more resources.
Ensemble Learning
Ensemble learning combines multiple models ("weak learners") to improve performance. It leverages diversity to reduce errors.
Types of Ensemble Methods
- Bagging (Bootstrap Aggregating): Trains multiple models on random data subsets, combining predictions through averaging or majority voting (e.g., Random Forest).
- Boosting: Sequentially trains models, focusing on correcting previous errors, combining predictions with weighted accuracy (e.g., AdaBoost).
- Stacking (Stacked Generalization): Trains a meta-model on the predictions of multiple base models.
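Bagging and boosting can be sketched with scikit-learn's built-in ensembles on synthetic data (the accuracy values depend on this made-up dataset):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, random_state=0)

models = {
    # Bagging: many trees on bootstrap samples, combined by majority vote
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
    # Boosting: learners trained sequentially, reweighted toward prior errors
    "adaboost": AdaBoostClassifier(n_estimators=100, random_state=0),
}

scores = {name: cross_val_score(m, X, y, cv=5).mean() for name, m in models.items()}
for name, s in scores.items():
    print(name, round(s, 3))
```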
Model Selection
Choosing the best model involves approaches such as subset selection, shrinkage, and dimension reduction.
Subset Selection
Identifies a subset of relevant predictors and fits a model using least squares.
Shrinkage (Regularization)
Fits a model with all predictors but shrinks coefficients towards zero, reducing variance and potentially performing variable selection.
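A sketch of shrinkage with ridge and lasso regression on synthetic data where only one of ten predictors matters (the alpha values are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))
y = 3.0 * X[:, 0] + rng.normal(scale=0.5, size=50)  # only feature 0 matters

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)  # shrinks all coefficients toward zero
lasso = Lasso(alpha=0.5).fit(X, y)   # can set coefficients exactly to zero

print(np.abs(ols.coef_).sum(), np.abs(ridge.coef_).sum())   # ridge is smaller
print((lasso.coef_ == 0).sum(), "coefficients zeroed by the lasso")
```

The lasso's exact zeros are what "potentially performing variable selection" refers to: irrelevant predictors drop out of the model entirely.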
Dimension Reduction
Projects predictors into a lower-dimensional subspace.
Best Subset Selection
Iteratively evaluates models with increasing numbers of predictors, selecting the best based on RSS or R².
Forward Stepwise Selection
Starts with a null model and iteratively adds predictors based on RSS improvement.
Backward Stepwise Selection
Starts with a full model and iteratively removes predictors based on RSS improvement.
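The forward stepwise procedure above can be sketched directly: starting from the null model, greedily add whichever predictor most reduces RSS. The data here is synthetic, with only two informative features:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 6))
y = 2 * X[:, 0] - 3 * X[:, 2] + rng.normal(scale=0.1, size=100)

def rss(features):
    """RSS of a least-squares fit on the given feature subset."""
    if not features:
        return np.sum((y - y.mean()) ** 2)  # null model: predict the mean
    pred = LinearRegression().fit(X[:, features], y).predict(X[:, features])
    return np.sum((y - pred) ** 2)

selected, remaining = [], list(range(6))
for _ in range(2):  # add the two best predictors, one at a time
    best = min(remaining, key=lambda j: rss(selected + [j]))
    selected.append(best)
    remaining.remove(best)

print(selected)  # the two truly informative features
```

Backward stepwise selection is the mirror image: start with all six features and repeatedly drop the one whose removal increases RSS the least.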
Choosing the Optimal Model
Training error (RSS, R²) is not a reliable indicator of test error. Cross-validation and other techniques are used to estimate test error and select the best model.
Sigmoid Function
The sigmoid function is used in logistic regression to predict the probability of an instance belonging to the positive class: σ(x) = 1 / (1 + exp(-x))
Prediction Formula
P(y = 1 | x) = σ(β0 + β1x)
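A minimal sketch of the two formulas above; the coefficients β0 and β1 are hypothetical values standing in for a fitted model:

```python
import numpy as np

def sigmoid(x):
    # σ(x) = 1 / (1 + exp(-x)); maps any real number into (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical fitted coefficients, for illustration only
beta0, beta1 = -1.0, 2.0
x = 1.5
p = sigmoid(beta0 + beta1 * x)  # P(y = 1 | x)
print(round(p, 3))  # 0.881
```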
Decision Threshold
The probability cutoff (commonly 0.5) above which an instance is assigned to the positive class; lowering the threshold trades precision for recall.
Precision and Recall
- Precision: TP / (TP + FP)
- Recall: TP / (TP + FN)
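Both metrics follow from counting the confusion-matrix cells; a minimal sketch on made-up predictions:

```python
import numpy as np

# Hypothetical true labels and model predictions, for illustration
y_true = np.array([1, 1, 1, 0, 0, 0, 1, 0])
y_pred = np.array([1, 0, 1, 0, 1, 0, 1, 0])

tp = np.sum((y_true == 1) & (y_pred == 1))  # true positives: 3
fp = np.sum((y_true == 0) & (y_pred == 1))  # false positives: 1
fn = np.sum((y_true == 1) & (y_pred == 0))  # false negatives: 1

precision = tp / (tp + fp)  # 3 / 4 = 0.75
recall = tp / (tp + fn)     # 3 / 4 = 0.75
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
print(precision, recall, round(f1, 2))
```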
