Machine Learning Fundamentals: From Regression to Clustering

NOTEBOOK 8

Machine learning, a branch of artificial intelligence and computer science, utilizes data and algorithms to mimic human learning. The data used during the learning phase, known as training data, serves as the guiding principle for the machine learning system.

A machine learning model is an algorithm or mathematical expression that defines the relationship between a target variable and one or more predictor variables.

Types of Machine Learning

  • Supervised Learning: Training data includes both predictor and target variables.
  • Unsupervised Learning: Training data contains only predictor variables.
  • Semi-Supervised Learning: A combination of supervised and unsupervised learning.
  • Reinforcement Learning: Requires no labeled training data; an agent learns by interacting with an environment through rewards and penalties.

Based on the Target Variable

  • Regression: The target is a numerical variable, predicting values like salary or temperature.
  • Classification: The target is a categorical variable, predicting classes like positive or negative.

Regression

Definition

A regression model defines the relationship between one or more independent variables (predictors) and a dependent variable (target). Its purpose is to understand how the dependent variable changes in response to changes in the independent variable(s). The target variable in regression is always numerical.

Pearson’s Correlation Coefficient

Pearson’s Correlation Coefficient ranges from -1 to 1: -1 indicates a perfect negative linear relationship, 1 indicates a perfect positive linear relationship, and 0 indicates no linear relationship between variables.
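
A minimal sketch of computing Pearson's r with NumPy; the sample arrays are illustrative, chosen to show a near-perfect positive linear relationship:

```python
import numpy as np

# Two sample variables with a strong positive linear relationship.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])

# np.corrcoef returns the 2x2 correlation matrix; the off-diagonal
# entry is Pearson's r between x and y.
r = np.corrcoef(x, y)[0, 1]
print(round(r, 4))  # close to 1: near-perfect positive linear relationship
```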

Disadvantages of Linear Regression

  • Limited to Linear Relationships: Linear regression only considers linear relationships between variables.
  • Sensitive to Outliers: Outliers can significantly impact model performance and accuracy.
  • Data Must Be Independent: Multicollinearity (correlation between independent variables) must be addressed before applying linear regression.

Multiple Linear Regression

Multiple linear regression defines the relationship between two or more independent variables and a dependent variable. The relationship is represented by the equation: Y = β0 + β1X1 + β2X2 + … + βnXn
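
A sketch of fitting this equation with scikit-learn; the data is generated from known coefficients (β0 = 3, β1 = 2, β2 = −1, no noise) so the fit should recover them exactly:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data generated from Y = 3 + 2*X1 - 1*X2 (no noise).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 3 + 2 * X[:, 0] - 1 * X[:, 1]

# LinearRegression estimates the intercept (beta0) and
# coefficients (beta1, beta2) by least squares.
model = LinearRegression().fit(X, y)
print(model.intercept_)  # ~3.0
print(model.coef_)       # ~[2.0, -1.0]
```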

Multicollinearity

Multicollinearity occurs when two or more predictor variables are highly correlated, making it difficult to isolate their individual effects on the dependent variable.
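
One simple way to spot multicollinearity is to inspect the pairwise correlation matrix of the predictors; a sketch with synthetic data where one predictor is a near-copy of another:

```python
import numpy as np

rng = np.random.default_rng(1)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.01, size=200)  # near-duplicate of x1
x3 = rng.normal(size=200)                   # independent predictor

X = np.column_stack([x1, x2, x3])
corr = np.corrcoef(X, rowvar=False)

# |r| close to 1 between two predictors signals multicollinearity.
print(round(corr[0, 1], 3))  # near 1.0: x1 and x2 are collinear
print(round(corr[0, 2], 3))  # near 0.0: x1 and x3 are unrelated
```

In practice, one of a highly correlated pair is usually dropped (or a regularized model is used) before fitting linear regression.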

Pipelines

Pipelines streamline machine learning workflows by encapsulating multiple steps into a single object. They offer convenience, prevent data leakage, and ensure consistent data processing.

Key Features

  • Initialization: Create a Pipeline object with a list of tuples, each containing a name and an estimator or transformer.
  • Execution: The fit() method sequentially applies transformations, fitting each step with the data.
  • Chaining: Each step (except the last) must be a transformer with fit() and transform() methods. Output from one step becomes input for the next.
  • Final Estimator: The last step is an estimator with fit() and predict() methods, typically the model for training and prediction.
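
The features above can be sketched with scikit-learn's Pipeline; the Iris dataset, StandardScaler transformer, and LogisticRegression final estimator are illustrative choices:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Initialization: a list of (name, estimator) tuples. All steps but
# the last must be transformers; the last is the final estimator.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# fit() scales the training data, then trains the classifier;
# score() applies the same fitted scaler to the test data first,
# which prevents data leakage.
pipe.fit(X_train, y_train)
acc = pipe.score(X_test, y_test)
print(round(acc, 3))
```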

Classification

Classification categorizes input data into predefined classes. The goal is to develop a model that accurately predicts the class of new instances based on learned patterns.

Key Concepts

  • Categories: Distinct labels or groups for data categorization (binary or multiclass).
  • Features: Data characteristics used for training and prediction.
  • Training Data: Labeled data with input features and corresponding class labels.
  • Model: Mathematical representation of learned relationships (e.g., logistic regression, SVM, decision trees).
  • Prediction: Using the trained model to predict categories of new instances.
  • Evaluation Metrics: Assessing model performance (e.g., accuracy, precision, recall, F1 score).

Types of Classification

  • Binary Classification: Predicting one of two classes (e.g., spam detection).
  • Multiclass Classification: Predicting one of more than two classes (e.g., image recognition).

Logistic Regression

Despite its name, logistic regression is primarily used for classification, predicting the probability of an instance belonging to a specific class.

Support Vector Machines (SVM)

SVMs find the optimal decision boundary (hyperplane) that maximally separates data points into different classes.

Key Concepts

  • Hyperplane: Decision boundary separating data into classes.
  • Support Vectors: Data points closest to the hyperplane, influencing its position.
  • Margin: Distance between the hyperplane and the nearest data point.
  • Kernel Trick: Handling non-linear decision boundaries by mapping data into a higher-dimensional space.

Types of SVM

  • Hard Margin SVM: Assumes perfectly separable data, sensitive to outliers.
  • Soft Margin SVM: Allows for misclassifications, handles non-separable data, and uses a regularization parameter (commonly denoted C) to control the trade-off between margin width and misclassifications.
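
A sketch of the soft-margin trade-off using scikit-learn's SVC, where the regularization parameter is called C; the blob data is an illustrative choice of slightly overlapping classes:

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two slightly overlapping clusters.
X, y = make_blobs(n_samples=200, centers=2, cluster_std=2.0, random_state=0)

# Small C -> wider margin, more misclassifications tolerated;
# large C -> narrower margin, misclassifications penalized heavily.
soft = SVC(kernel="linear", C=0.01).fit(X, y)
hard = SVC(kernel="linear", C=100.0).fit(X, y)

# A wider margin typically relies on more support vectors.
print(soft.n_support_.sum(), hard.n_support_.sum())
```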

K-Nearest Neighbors (KNN)

KNN is a simple algorithm used for classification and regression, based on the proximity of similar items.

KNN Classification

The class of a data point is determined by the most common class among its k nearest neighbors.

KNN Regression

The value of a query point is predicted by averaging the values of its k nearest neighbors.

Selecting Optimal K

  • Cross-validation: Helps determine the k value with the highest accuracy.
  • Square Root Rule: Suggests an initial k value approximately equal to the square root of the dataset size.
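
The square root rule can be sketched with scikit-learn's KNeighborsClassifier; the Iris dataset and 70/30 split are illustrative choices:

```python
import math

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Square root rule: start with k ~ sqrt(number of training samples).
k = int(math.sqrt(len(X_train)))  # 105 training samples -> k = 10
knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
acc = knn.score(X_test, y_test)
print(k, round(acc, 3))
```

Cross-validation over a range of k values around this starting point would then refine the choice.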

Limitations of KNN

  • Slow with large datasets due to distance calculations.
  • Performance degrades with increasing features.
  • Memory-intensive as it stores the entire training dataset.
  • Struggles with binary or categorical variables.

Clustering

Clustering is an unsupervised technique that groups similar data points into clusters based on inherent patterns. The center of a cluster is called a centroid.

Objective

Partition data into groups with minimal intra-cluster distance and maximal inter-cluster distance.

Clustering Methods

  • Partitioning Methods: Divide data into separate clusters (e.g., KMeans).
  • Hierarchical Clustering: Forms cluster hierarchies (agglomerative or divisive).
  • Density-based Methods: Identify clusters as areas of higher density (e.g., DBSCAN).

Elbow Method

Helps determine the optimal number of clusters (k) by analyzing the sum of squared distances (SSD) for different k values.
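
A sketch of the elbow method with scikit-learn's KMeans, whose `inertia_` attribute is the SSD; the blob data with three true clusters is an illustrative choice:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Data with 3 well-separated clusters.
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.7, random_state=42)

# inertia_ is the sum of squared distances (SSD) from each sample to
# its nearest centroid. Plotting SSD against k shows an "elbow":
# SSD drops sharply up to the true k, then flattens.
ssd = []
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    ssd.append(km.inertia_)

print([round(s, 1) for s in ssd])  # sharp drop until k=3, then flat
```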

Decision Trees

Decision trees are tree-like structures used for classification and regression, making decisions based on a series of rules.

Gini Impurity

Measures the degree of impurity or disorder in a set of examples: Gini = 1 − Σ pᵢ², where pᵢ is the proportion of examples belonging to class i.

Entropy

Measures the uncertainty of a set of instances: Entropy = −Σ pᵢ log₂(pᵢ), where pᵢ is the proportion of instances belonging to class i.
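
Both measures can be sketched directly from class proportions; the label arrays below are illustrative:

```python
import numpy as np

def gini(labels):
    # Gini impurity: 1 - sum(p_i^2) over class proportions p_i.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(labels):
    # Entropy: -sum(p_i * log2(p_i)) over class proportions p_i.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

pure = np.array([1, 1, 1, 1])   # one class: both measures are 0
mixed = np.array([0, 0, 1, 1])  # 50/50 split: maximum impurity

print(gini(mixed), entropy(mixed))  # 0.5 and 1.0 for a binary 50/50 split
```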

Splitting Criterion in Regression

Minimizes mean squared error, MSE = (1/n) Σ (yᵢ − ŷ)², to find splits resulting in subsets with minimal target variable variability.

Model Complexity and Bias-Variance Trade-off

Model complexity refers to a model’s capacity to capture data patterns. Balancing complexity is crucial to avoid overfitting or underfitting.

Overfitting

Occurs when a model is too complex, capturing noise as patterns, leading to high variance.

Underfitting

Occurs when a model is too simple, failing to capture data structure, leading to high bias.

Factors Influencing Complexity

  • Number of features
  • Model structure
  • Algorithm’s learning capacity

Considerations

  • Interpretability: Simpler models are easier to understand.
  • Computational Efficiency: Complex models require more resources.

Ensemble Learning

Ensemble learning combines multiple models ("weak learners") to improve performance. It leverages diversity to reduce errors.

Types of Ensemble Methods

  • Bagging (Bootstrap Aggregating): Trains multiple models on random data subsets, combining predictions through averaging or majority voting (e.g., Random Forest).
  • Boosting: Sequentially trains models, focusing on correcting previous errors, combining predictions with weighted accuracy (e.g., AdaBoost).
  • Stacking (Stacked Generalization): Trains a meta-model on the predictions of multiple base models.
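
A sketch of the bagging idea, comparing a single decision tree against a Random Forest in scikit-learn; the synthetic dataset and 5-fold cross-validation are illustrative choices:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# A single tree (one "weak learner") vs. a bagged ensemble of trees.
tree_acc = cross_val_score(
    DecisionTreeClassifier(random_state=0), X, y, cv=5
).mean()
forest_acc = cross_val_score(
    RandomForestClassifier(n_estimators=100, random_state=0), X, y, cv=5
).mean()

print(round(tree_acc, 3), round(forest_acc, 3))
```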

Model Selection

Choosing the best model involves considering factors like subset selection, shrinkage, and dimension reduction.

Subset Selection

Identifies a subset of relevant predictors and fits a model using least squares.

Shrinkage (Regularization)

Fits a model with all predictors but shrinks coefficients towards zero, reducing variance and potentially performing variable selection.

Dimension Reduction

Projects predictors into a lower-dimensional subspace.

Best Subset Selection

Iteratively evaluates models with increasing numbers of predictors, selecting the best based on RSS or R².

Forward Stepwise Selection

Starts with a null model and iteratively adds predictors based on RSS improvement.

Backward Stepwise Selection

Starts with a full model and iteratively removes predictors based on RSS improvement.
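
The forward stepwise procedure can be sketched in NumPy; the synthetic data, in which only predictors 0 and 3 actually influence the target, is an illustrative choice:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 6
X = rng.normal(size=(n, p))
# Only predictors 0 and 3 actually matter.
y = 2.0 * X[:, 0] - 3.0 * X[:, 3] + rng.normal(scale=0.1, size=n)

def rss(X_sub, y):
    # Least-squares fit (with intercept) and residual sum of squares.
    A = np.column_stack([np.ones(len(y)), X_sub])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return resid @ resid

# Forward stepwise: start with the null model, then greedily add the
# predictor that most reduces RSS at each step.
selected = []
remaining = list(range(p))
for _ in range(2):  # pick the two best predictors
    best = min(remaining, key=lambda j: rss(X[:, selected + [j]], y))
    selected.append(best)
    remaining.remove(best)

print(sorted(selected))  # recovers predictors 0 and 3
```

Backward stepwise selection runs the same greedy loop in reverse, starting from the full model and removing one predictor per step.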

Choosing the Optimal Model

Training error (RSS, R²) is not a reliable indicator of test error. Cross-validation and other techniques are used to estimate test error and select the best model.

Sigmoid Function

The sigmoid function is used in logistic regression to predict the probability of an instance belonging to the positive class: σ(x) = 1 / (1 + exp(-x))

Prediction Formula

P(y = 1 | x) = σ(β0 + β1x)
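
A sketch of this prediction formula in NumPy; the coefficients β0 = −1.5 and β1 = 0.8 are hypothetical values standing in for a fitted model:

```python
import numpy as np

def sigmoid(x):
    # sigma(x) = 1 / (1 + exp(-x)), mapping any real value into (0, 1).
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical fitted coefficients.
beta0, beta1 = -1.5, 0.8

x = 3.0
p = sigmoid(beta0 + beta1 * x)  # P(y = 1 | x)
print(round(p, 4))
```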

Decision Threshold

A threshold (commonly 0.5) converts the predicted probability into a class label: if P(y = 1 | x) ≥ 0.5, predict class 1; otherwise predict class 0. Lowering the threshold favors recall; raising it favors precision.

Precision and Recall

Precision: TP / (TP + FP)     Recall: TP / (TP + FN)
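
These formulas can be sketched by counting true positives, false positives, and false negatives directly; the label arrays are illustrative:

```python
import numpy as np

y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0])
y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 0])

tp = np.sum((y_pred == 1) & (y_true == 1))  # predicted 1, actually 1
fp = np.sum((y_pred == 1) & (y_true == 0))  # predicted 1, actually 0
fn = np.sum((y_pred == 0) & (y_true == 1))  # predicted 0, actually 1

precision = tp / (tp + fp)  # 3 / (3 + 1) = 0.75
recall = tp / (tp + fn)     # 3 / (3 + 1) = 0.75
print(precision, recall)
```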
