Statistical Analysis: Regression and Probability Models
Regression Analysis and Predictive Modeling
Regression analysis is a statistical method used to model the relationship between variables and to predict the value of one variable using another.
Main Types of Regression
- Simple linear regression: One independent variable and one dependent variable.
- Multiple regression: Several independent variables predicting one dependent variable.
- Logistic regression: Used when the dependent variable is categorical (for example, yes/no outcomes); the model estimates the probability of each category.
The goal of simple linear regression is to find the best straight line that describes the relationship between variables. This line is determined using the least squares method, which minimizes the squared differences between the observed points and the regression line.
The Regression Equation
y = bx + a
- y = estimated dependent variable.
- x = independent variable.
- b = slope (how much y changes when x increases by 1).
- a = intercept (value of y when x = 0).
A stronger linear relationship between variables leads to more accurate predictions.
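The least squares fit described above can be computed directly: the slope is the sum of cross-deviations divided by the sum of squared x-deviations, and the intercept follows from the fact that the line passes through the point of means. A minimal sketch with hypothetical data (hours studied vs. exam score):

```python
# Simple linear regression via the least squares method.
# Hypothetical data: hours studied (x) vs. exam score (y).
xs = [1, 2, 3, 4, 5]
ys = [52, 55, 61, 64, 68]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Slope b = sum((x - mean_x)(y - mean_y)) / sum((x - mean_x)^2)
b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
    sum((x - mean_x) ** 2 for x in xs)
# Intercept a: the fitted line always passes through (mean_x, mean_y).
a = mean_y - b * mean_x

print(round(b, 2), round(a, 2))  # 4.1 47.7
```

With these numbers, each extra hour of study predicts about 4.1 more points, and the prediction for x hours is y = 4.1x + 47.7.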
Time Series Analysis
A time series is a set of observations collected over time at regular intervals.
- Trend: Long-term movement of data (upward or downward).
- Seasonality: Regular patterns that repeat over time (months, seasons, holidays).
- Cyclic patterns: Long-term fluctuations influenced by economic or social conditions.
- Stationarity: When the statistical properties of the series (mean and variance) stay constant over time, with no trend or seasonality.
To reduce irregular fluctuations, analysts often use moving averages, which smooth the data by averaging values around a time point.
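A centered moving average replaces each interior point with the mean of the values around it. A minimal sketch with a hypothetical monthly series and a 3-point window:

```python
# Centered 3-point moving average to smooth a short time series.
# Hypothetical monthly observations.
series = [10, 12, 9, 14, 15, 13, 18]

window = 3
smoothed = [
    sum(series[i - 1:i + 2]) / window  # average of a point and its two neighbours
    for i in range(1, len(series) - 1)  # endpoints have no full window
]
print([round(v, 2) for v in smoothed])
```

Note that smoothing shortens the series: with a 3-point window, the first and last observations have no centered average.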
Probability Theory and Random Variables
A random experiment has uncertain outcomes, but patterns appear when it is repeated many times.
Key Probability Concepts
- Sample space (S): All possible outcomes of an experiment.
- Event: A subset of outcomes from the sample space.
- Probability: The chance that an event occurs.
Fundamental Probability Rules
- 0 ≤ P(A) ≤ 1
- P(S) = 1
- If events A and B are disjoint (they cannot happen together), then P(A or B) = P(A) + P(B).
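These rules can be checked on a concrete sample space. A minimal sketch using one roll of a fair six-sided die, with events chosen here purely for illustration:

```python
from fractions import Fraction

# Sample space for one roll of a fair six-sided die.
S = {1, 2, 3, 4, 5, 6}

def prob(event):
    # Classical probability: favourable outcomes / all possible outcomes.
    return Fraction(len(event & S), len(S))

A = {1, 2}  # "roll at most 2"
B = {5, 6}  # "roll at least 5" (disjoint from A)

assert prob(S) == 1                      # P(S) = 1
assert 0 <= prob(A) <= 1                 # 0 <= P(A) <= 1
assert prob(A | B) == prob(A) + prob(B)  # addition rule for disjoint events
print(prob(A | B))  # 2/3
```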
Types of Random Variables
Random variables can be categorized as:
- Discrete: Take separate, countable values (e.g., number of coins, number of people, month of birth).
- Continuous: Can take any value within a range (e.g., height, weight, income).
Characteristics of Random Experiments
A random experiment is an experiment where the outcome is uncertain but follows patterns when repeated many times.
- Uncertainty: The exact outcome cannot be known before the experiment happens.
- Repeatability: The experiment can be repeated under the same conditions many times.
- Sample space (S): The set of all possible outcomes of a random experiment.
Example: Tossing a coin → sample space = {Heads, Tails}
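The coin-toss example also illustrates the long-run pattern: any single toss is unpredictable, but the relative frequency of Heads settles near 0.5 when the experiment is repeated many times. A minimal simulation sketch (the seed is an arbitrary choice for reproducibility):

```python
import random

random.seed(0)  # reproducible run

# Repeat the random experiment many times; the relative frequency
# of Heads stabilises near the theoretical probability 0.5.
n = 100_000
heads = sum(random.choice(["Heads", "Tails"]) == "Heads" for _ in range(n))
freq = heads / n
print(freq)
```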
Understanding the Normal Distribution
The normal distribution is a probability distribution with a bell-shaped curve that describes many natural, economic, and social phenomena.
Main Characteristics
- The curve is symmetric around the mean.
- The mean, median, and mode are equal and located at the center.
- Most values are concentrated around the mean.
- The probability decreases as we move away from the mean.
Mean (μ)
The center of the distribution represents the average value of the dataset.
Standard Deviation (σ)
Measures how spread out the data are around the mean.
- Small σ: Values are close to the mean.
- Large σ: Values are more spread out.
The Empirical Rule (68–95–99.7 Rule)
In a normal distribution:
- 68% of values are within ±1 standard deviation.
- 95% are within ±2 standard deviations.
- 99.7% are within ±3 standard deviations.
This rule helps estimate probabilities quickly.
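The 68–95–99.7 percentages can be checked by simulation. A minimal sketch drawing from a hypothetical normal distribution (μ = 100, σ = 15) and counting how many samples fall within 1, 2, and 3 standard deviations:

```python
import random

random.seed(1)
mu, sigma = 100, 15  # hypothetical distribution parameters
samples = [random.gauss(mu, sigma) for _ in range(100_000)]

def within(k):
    # Share of samples inside mu +/- k standard deviations.
    return sum(abs(x - mu) <= k * sigma for x in samples) / len(samples)

# These shares should land close to 0.68, 0.95, and 0.997.
print(round(within(1), 2), round(within(2), 2), round(within(3), 3))
```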
Standard Normal Distribution
A standardized version of the normal distribution, obtained by converting each value with z = (x − μ) / σ, where:
- Mean = 0
- Standard deviation = 1
This allows for direct comparison between different datasets.
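Standardizing with z = (x − μ) / σ puts values from different scales on a common footing. A minimal sketch comparing two hypothetical exam scores from tests with different means and spreads:

```python
# Standardisation: z = (x - mu) / sigma.
def z_score(x, mu, sigma):
    return (x - mu) / sigma

# Hypothetical scores from two differently scaled tests:
# Test A: score 85, class mean 70, standard deviation 10.
# Test B: score 480, class mean 400, standard deviation 80.
z_a = z_score(85, 70, 10)    # 1.5 standard deviations above the mean
z_b = z_score(480, 400, 80)  # 1.0 standard deviation above the mean

print(z_a, z_b)  # 1.5 1.0
```

Although 480 looks larger than 85, the z-scores show the Test A result is relatively stronger: it lies further above its own mean.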
