Regression, Regularization and Time Series Concepts for ML
1. Univariate linear regression assumptions
- Linearity: The relationship between X and Y is linear.
- Independence: The residuals (errors) are independent of each other.
- Homoscedasticity: The variance of residuals is constant across all levels of X.
- Normality of errors: The residuals follow a normal distribution.
These assumptions matter because they underpin the reliability of the model and of the inference drawn from it. If the relationship between X and Y is not truly linear, the model won’t capture the real pattern. If the residuals are not independent (for example, autocorrelated), the coefficient estimates may still be unbiased, but their standard errors are understated and hypothesis tests become misleading. Without homoscedasticity, interval estimates and significance tests are unreliable. Finally, normality of the errors is needed for valid confidence intervals and hypothesis tests, especially in small samples.
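The sketch below shows one way these assumptions might be checked in practice, using statsmodels and scipy on synthetic data; the data, the diagnostics chosen, and the usual p-value cutoffs are illustrative rather than prescriptive:

```python
# Minimal sketch: checking linear-regression assumptions on synthetic data.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson
from statsmodels.stats.diagnostic import het_breuschpagan
from scipy import stats

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 200)
y = 2.0 + 1.5 * x + rng.normal(0, 1, 200)   # linear relationship, constant-variance noise

X = sm.add_constant(x)                       # add the intercept column
model = sm.OLS(y, X).fit()
resid = model.resid

# Linearity is usually checked visually with a residuals-vs-fitted plot (not shown here).

# Independence: a Durbin-Watson statistic near 2 suggests uncorrelated residuals.
print("Durbin-Watson:", durbin_watson(resid))

# Homoscedasticity: a Breusch-Pagan p-value above ~0.05 suggests constant variance.
print("Breusch-Pagan p-value:", het_breuschpagan(resid, X)[1])

# Normality of errors: a Shapiro-Wilk p-value above ~0.05 suggests roughly normal residuals.
print("Shapiro-Wilk p-value:", stats.shapiro(resid).pvalue)
```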
2. Bias vs. variance: effect on model performance
Bias is the error introduced when a model is too simple to learn the true pattern in the data. A high-bias model fails to capture the underlying relationship and underfits: its predictions are systematically off in the same way, regardless of which training sample is used.
Variance refers to the model’s sensitivity to small changes in the training data. High-variance models are too complex and capture noise along with the signal, leading to overfitting. Such models perform well on the training set but poorly on new, unseen data.
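As a rough empirical illustration, the sketch below fits polynomials of two different degrees to many resampled synthetic datasets and estimates the squared bias and the variance of their predictions; the degrees, sample sizes, and noise level are arbitrary choices for illustration:

```python
# Rough sketch: estimating bias^2 and variance empirically for two model complexities.
import numpy as np

rng = np.random.default_rng(1)
true_f = np.sin                                # the "true" underlying function
x_test = np.linspace(0, 2 * np.pi, 50)

def simulate(degree, n_datasets=200, n_points=30, noise=0.3):
    preds = np.empty((n_datasets, x_test.size))
    for i in range(n_datasets):
        x = rng.uniform(0, 2 * np.pi, n_points)
        y = true_f(x) + rng.normal(0, noise, n_points)
        p = np.polynomial.Polynomial.fit(x, y, degree)   # fit a polynomial of the given degree
        preds[i] = p(x_test)
    avg_pred = preds.mean(axis=0)
    bias_sq = np.mean((avg_pred - true_f(x_test)) ** 2)  # systematic error of the average model
    variance = np.mean(preds.var(axis=0))                # sensitivity to the training sample
    return bias_sq, variance

for degree in (1, 12):                          # too simple vs. too flexible
    b, v = simulate(degree)
    print(f"degree={degree:2d}  bias^2={b:.3f}  variance={v:.3f}")
```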
3. Multicollinearity in multiple linear regression
Multicollinearity occurs when two or more independent variables are highly correlated with each other. This makes it difficult for the model to distinguish their individual effects on the dependent variable. It inflates the standard errors of the coefficient estimates, making them unstable and less interpretable.
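A common way to detect multicollinearity is the variance inflation factor (VIF); values above roughly 5-10 are often taken as a warning sign. The sketch below assumes statsmodels and uses made-up housing-style columns in which two predictors are deliberately correlated:

```python
# Minimal sketch: detecting multicollinearity with variance inflation factors (VIF).
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(2)
size_sqm = rng.normal(100, 20, 300)
rooms = size_sqm / 25 + rng.normal(0, 0.3, 300)   # strongly correlated with size_sqm
age_years = rng.uniform(0, 50, 300)               # roughly independent of the others

X = sm.add_constant(pd.DataFrame(
    {"size_sqm": size_sqm, "rooms": rooms, "age_years": age_years}))

for i, col in enumerate(X.columns):
    if col != "const":
        print(col, round(variance_inflation_factor(X.values, i), 2))  # high VIF for the correlated pair
```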
4. Regularization: L1 versus L2 comparison
Regularization is a technique to prevent overfitting by penalizing large weights in the model. It adds a regularization term to the cost (loss) function.
- L1 (Lasso): Adds the sum of the absolute values of the coefficients to the loss. It encourages sparsity, driving some coefficients exactly to zero, which also acts as a form of feature selection.
- L2 (Ridge): Adds the sum of the squared coefficients to the loss. It shrinks all coefficients toward zero but does not eliminate them entirely.
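A minimal scikit-learn sketch of the difference, on synthetic data where only the first two of six features matter; the alpha values are arbitrary illustrative choices:

```python
# Illustrative sketch: L1 (Lasso) zeroes out some coefficients, L2 (Ridge) only shrinks them.
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 6))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(0, 0.5, 200)   # only the first two features matter

lasso = Lasso(alpha=0.1).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

print("Lasso coefficients:", np.round(lasso.coef_, 3))   # expect exact zeros for irrelevant features
print("Ridge coefficients:", np.round(ridge.coef_, 3))   # small but non-zero values instead
```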
5. Stationarity in time series and forecasting
A stationary time series has statistical properties that do not change over time (constant mean, variance, and autocorrelation structure). If the data is non-stationary, transformations such as differencing, log transformation, or detrending are often needed first. Stationarity matters for forecasting because many classical models (for example, the ARMA component inside ARIMA) assume it in order to make valid predictions.
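The sketch below assumes statsmodels and applies the Augmented Dickey-Fuller (ADF) test to a synthetic random walk (non-stationary) and to its first difference; a small p-value indicates the series looks stationary:

```python
# Minimal sketch: ADF stationarity test and differencing on a synthetic random walk.
import numpy as np
import pandas as pd
from statsmodels.tsa.stattools import adfuller

rng = np.random.default_rng(4)
series = pd.Series(np.cumsum(rng.normal(0, 1, 300)))   # random walk: a classic non-stationary series

print("ADF p-value (raw series):  ", round(adfuller(series)[1], 4))                    # large -> non-stationary
print("ADF p-value (differenced): ", round(adfuller(series.diff().dropna())[1], 4))    # small -> stationary
```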
6. Role of the sigmoid function in logistic regression
The sigmoid function is commonly used in logistic regression and neural networks to model probabilities. It maps any real-valued number z to the range (0, 1) via σ(z) = 1 / (1 + e^(−z)), so its output can be interpreted as the probability of the positive class, making it well suited to binary classification.
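A tiny NumPy sketch of the idea; the scores and the 0.5 threshold are illustrative:

```python
# Sketch: the sigmoid squashes any real-valued score into (0, 1),
# which logistic regression interprets as the probability of the positive class.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

scores = np.array([-4.0, -1.0, 0.0, 1.0, 4.0])
probs = sigmoid(scores)
print(probs)                           # roughly 0.018, 0.269, 0.5, 0.731, 0.982

labels = (probs >= 0.5).astype(int)    # thresholding turns probabilities into class labels
print(labels)
```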
7. Converting time series into supervised learning
To convert time series data into a supervised learning format, you create input-output pairs that use past observations to predict future values. The process (a code sketch follows the list) is:
- Choose a lag: how many past time steps (lags) you want to use.
- Create input features: for each time step, use the previous values of the time series as inputs.
- Set the target (output): the value you want to predict, typically the next time step.
- Form a dataset: combine the inputs and outputs into a structured dataset where each row is a training example.
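A minimal pandas sketch of this process, using lags 1-3 of a short made-up series; the column names are illustrative:

```python
# Sketch: turning a univariate series into a supervised dataset with lag features 1-3.
import pandas as pd

series = pd.Series([10, 12, 13, 15, 16, 18, 21, 22], name="y")

frame = pd.DataFrame({"y": series})
for lag in (1, 2, 3):
    frame[f"lag_{lag}"] = frame["y"].shift(lag)   # the value from `lag` steps in the past

frame = frame.dropna()                            # the first rows have no complete history
X = frame[["lag_1", "lag_2", "lag_3"]]            # inputs: past observations
y = frame["y"]                                    # target: the current value
print(frame)                                      # each row is one training example
```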
8. Lagged features for time series forecasting
Lagged features are past values of a time series used as input variables to help predict future values. In time series forecasting, these lagged values capture temporal patterns and trends. By including them as predictors, models can learn how past behavior influences future outcomes. For instance, to predict today’s value, you might use the values from the past three days (lags 1, 2, and 3) as input features.
9. Handling categorical data in regression
In regression models, categorical data needs to be converted into numerical form.
- Label encoding: Converts each category into an integer (e.g., “High School” = 1, “University” = 2, “Master” = 3). Use it when the categorical variable has an inherent order or ranking (ordinal data), and make sure the assigned integers respect that order.
- One-hot encoding: Creates a binary column for each category. For example, if a feature has three categories (“Red”, “Blue”, “Green”), it will create three columns with 1 or 0 indicating whether the observation belongs to that category. Use when the categorical variable does not have a natural order (nominal data).
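A small pandas sketch of both encodings, with made-up data; an explicit mapping is used for the ordinal column so the ranking is preserved:

```python
# Sketch: ordinal (label-style) encoding for ordered categories, one-hot encoding for nominal ones.
import pandas as pd

df = pd.DataFrame({
    "education": ["High School", "University", "Master", "University"],   # ordinal
    "color": ["Red", "Blue", "Green", "Red"],                             # nominal
})

# Ordinal: an explicit mapping preserves the ranking.
education_order = {"High School": 1, "University": 2, "Master": 3}
df["education_encoded"] = df["education"].map(education_order)

# One-hot: one binary column per category, no implied order.
df = pd.get_dummies(df, columns=["color"], dtype=int)
print(df)
```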
10. ROC curve for classification evaluation
The ROC curve is a tool used to evaluate the performance of binary classification models. It shows how well the model distinguishes between two classes by plotting the True Positive Rate (TPR) against the False Positive Rate (FPR) at different threshold values.
The better the model, the more the ROC curve hugs the top-left corner of the plot, indicating high TPR and low FPR. A model that performs no better than random guessing produces a diagonal line from the bottom-left to the top-right. The Area Under the Curve (AUC) summarizes performance in a single number: the closer to 1 the better, while 0.5 means the model is guessing randomly. The ROC curve is useful for comparing models and choosing a decision threshold, especially when plain accuracy would be misleading (for example, with imbalanced classes).
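A short scikit-learn sketch of computing the ROC curve points and the AUC on a synthetic, mildly imbalanced problem; the dataset and model choice are illustrative:

```python
# Sketch: ROC curve and AUC with scikit-learn on a synthetic binary problem.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, weights=[0.8, 0.2], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
probs = model.predict_proba(X_test)[:, 1]          # predicted probability of the positive class

fpr, tpr, thresholds = roc_curve(y_test, probs)    # one (FPR, TPR) point per threshold
print("AUC:", round(roc_auc_score(y_test, probs), 3))
```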
11. Loss functions and model training
A loss function measures how well a machine learning model’s predictions match the actual target values. It calculates the error between the predicted output and the true output for each data point. The loss function guides the learning process: during training the model tries to minimize this loss by adjusting its parameters (like weights). A smaller loss means the model is making better predictions. Optimization algorithms (like gradient descent) use the loss function to determine which direction to move in order to improve the model.
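As a minimal illustration, the sketch below minimizes a mean squared error loss by plain gradient descent for a one-parameter model y_hat = w * x; the data, learning rate, and number of steps are arbitrary:

```python
# Sketch: MSE as a loss function, minimized by gradient descent for y_hat = w * x.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 7.8])       # roughly y = 2x

w = 0.0                                   # initial parameter
lr = 0.01                                 # learning rate
for step in range(200):
    y_hat = w * x
    grad = np.mean(2 * (y_hat - y) * x)   # derivative of the MSE loss with respect to w
    w -= lr * grad                        # move against the gradient to reduce the loss

final_loss = np.mean((w * x - y) ** 2)
print(f"learned w = {w:.3f}, final MSE = {final_loss:.4f}")
```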
12. Underfitting versus overfitting: examples
Underfitting: The model is too simple to capture patterns in the data and performs poorly on both training and test data. Example: using a straight line (linear model) to fit data that clearly follows a curve.
Overfitting: The model is too complex and captures noise in the data. It achieves high training accuracy but poor generalization on new data. Example: a decision tree that grows too deep and memorizes every detail of the training set.
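A quick scikit-learn sketch: the same decision tree under- or overfits depending on its depth; the depths and the synthetic dataset are illustrative:

```python
# Sketch: train vs. test accuracy of a decision tree at different depths.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_informative=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for depth in (1, 4, None):                # too shallow, reasonable, unlimited
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_train, y_train)
    print(f"max_depth={depth}: train={tree.score(X_train, y_train):.2f}, "
          f"test={tree.score(X_test, y_test):.2f}")
```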
13. Feature engineering impact on regression
Feature engineering is the process of transforming raw data into meaningful features that improve the predictive power of a model (a small example follows the list below). Feature engineering matters because:
- Improved model performance: Well-designed features help models make better predictions.
- Enhanced interpretability: Features that reflect real-world meaning make the model easier to understand and explain.
- Reduced overfitting: Clear, relevant features reduce noise and help the model generalize better to new data.
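A brief pandas sketch of a few typical engineered features; the raw columns (timestamp, price, quantity) are hypothetical:

```python
# Sketch: deriving interaction, temporal, and binary-flag features with pandas.
import pandas as pd

df = pd.DataFrame({
    "timestamp": pd.to_datetime(["2024-01-05 09:00", "2024-01-06 17:30", "2024-01-07 21:15"]),
    "price": [100.0, 250.0, 80.0],
    "quantity": [2, 1, 5],
})

df["total_spent"] = df["price"] * df["quantity"]         # interaction feature
df["day_of_week"] = df["timestamp"].dt.dayofweek         # temporal feature from a raw timestamp
df["is_weekend"] = (df["day_of_week"] >= 5).astype(int)  # domain-motivated binary flag
print(df)
```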
14. Why not to shuffle time series data
Shuffling time series data when splitting into train and test sets is generally a bad idea because it breaks the natural time order of the data. If you shuffle the data:
- It can lead to data leakage, where the model learns information it wouldn’t have in a real scenario.
- Evaluation metrics may become unreliable, giving a false sense of good performance.
In time series, past values influence future values, so keeping the chronological order is crucial for making realistic predictions.
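A minimal sketch of splitting time series chronologically instead of shuffling, using a simple holdout split plus scikit-learn's TimeSeriesSplit for cross-validation; the data is synthetic:

```python
# Sketch: chronological train/test split (no shuffling) for time series.
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

series = np.arange(100)                     # stands in for time-ordered observations

split = int(len(series) * 0.8)              # simple holdout: train on the past, test on the future
train, test = series[:split], series[split:]
print("last train point:", train[-1], "| first test point:", test[0])

# Expanding-window cross-validation keeps every fold's test set strictly after its training set.
for fold, (train_idx, test_idx) in enumerate(TimeSeriesSplit(n_splits=3).split(series)):
    print(f"fold {fold}: train ends at {train_idx[-1]}, test covers {test_idx[0]}-{test_idx[-1]}")
```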
15. When to use multiple linear regression
Suppose you want to predict a student’s final exam score. Using just one variable, like hours of study, would be a univariate regression. But exam performance typically depends on several factors such as hours of study, attendance rate, sleep quality, participation in class, and previous test scores. In this case, multiple linear regression is better because it allows you to include all these predictors to get a more accurate prediction.
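A short scikit-learn sketch with several hypothetical predictors of the final score; the column names and synthetic data exist only to show the mechanics:

```python
# Sketch: multiple linear regression with several predictors of an exam score.
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(5)
n = 200
df = pd.DataFrame({
    "hours_studied": rng.uniform(0, 20, n),
    "attendance_rate": rng.uniform(0.5, 1.0, n),
    "sleep_hours": rng.uniform(4, 9, n),
    "previous_score": rng.uniform(40, 100, n),
})
# Synthetic target: a weighted combination of the predictors plus noise.
df["final_score"] = (2.0 * df["hours_studied"] + 20 * df["attendance_rate"]
                     + 1.5 * df["sleep_hours"] + 0.4 * df["previous_score"]
                     + rng.normal(0, 5, n))

X = df[["hours_studied", "attendance_rate", "sleep_hours", "previous_score"]]
y = df["final_score"]
model = LinearRegression().fit(X, y)
print(dict(zip(X.columns, model.coef_.round(2))))   # one coefficient per predictor
```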
