Core Statistical Concepts: Data, Inference & Models
Understanding Data Types
Different types of data require different analytical approaches:
- Cross-sectional data: Characteristics of many subjects or observations at the same point in time.
- Time series data: Observations of a single subject or variable recorded over several time periods.
- Cross-sectional time series data: Repeated cross-sections, i.e., observations for multiple subjects at multiple time periods (the subjects need not be the same in every period).
- Panel or longitudinal data: The same subject measured over various time periods.
Quantitative Data
Quantitative data is numerical. Usually, there are measurement units or counts associated with quantitative data.
Descriptive Statistics Essentials
Measures of Central Tendency and Skewness
- If Mean > Median: The distribution is typically right-skewed.
- If Mean < Median: The distribution is typically left-skewed.
Standard Deviation (SD) and Standard Error (SE)
Standard Deviation (SD) quantifies the variation or dispersion within a single set of measurements.
The Standard Error (SE), also known as the standard deviation of the sample means, quantifies the precision of the sample mean as an estimate of the population mean. It measures how much sample means would be expected to vary if multiple samples of the same size were drawn.
- When the population standard deviation σ is known: SE = σ/√n (where n is the sample size).
- When σ is unknown and estimated by the sample standard deviation s: SE = s/√n.
Increasing the sample size (n) leads to a smaller SE. The smaller the sample standard deviation (s), the smaller the SE.
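As a minimal numerical sketch (assuming NumPy is available and using a small made-up sample), the two quantities can be computed side by side:

```python
import numpy as np

# Hypothetical sample of measurements (made-up values for illustration)
sample = np.array([4.8, 5.1, 5.5, 4.9, 5.3, 5.0, 5.2, 4.7])
n = len(sample)

s = sample.std(ddof=1)   # sample standard deviation: spread of the individual values
se = s / np.sqrt(n)      # standard error of the mean: precision of the sample mean

print(f"sample SD s    = {s:.3f}")
print(f"standard error = {se:.3f}")   # shrinks as n grows, since SE = s / sqrt(n)
```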
Foundations of Inferential Statistics
The Empirical Rule
For a normal distribution:
- Approximately 68% of data falls within 1 standard deviation (SD) of the mean.
- Approximately 95% of data falls within 2 standard deviations (SD) of the mean.
- Approximately 99.7% of data falls within 3 standard deviations (SD) of the mean.
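A quick simulation sketch (assuming NumPy) can be used to check these percentages on normally distributed data:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=0, scale=1, size=100_000)   # simulated standard normal data

for k in (1, 2, 3):
    share = np.mean(np.abs(x) < k)             # fraction of values within k SDs of the mean
    print(f"within {k} SD: {share:.1%}")       # approximately 68%, 95%, 99.7%
```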
Central Limit Theorem (CLT)
The Central Limit Theorem states that the distribution of sample averages (means) will be approximately normal, regardless of the shape of the population distribution, provided the sample size is sufficiently large. As the sample size increases, the distribution of sample averages becomes more normal and narrower (less spread out).
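The following sketch (assuming NumPy, with a deliberately skewed made-up population) illustrates the CLT by drawing many samples and recording their means:

```python
import numpy as np

rng = np.random.default_rng(1)
population = rng.exponential(scale=2.0, size=1_000_000)   # strongly right-skewed population

for n in (5, 30, 200):
    # draw many samples of size n and record each sample mean
    means = rng.choice(population, size=(10_000, n)).mean(axis=1)
    print(f"n={n:3d}  mean of sample means={means.mean():.3f}  "
          f"SD of sample means={means.std(ddof=1):.3f}")

# The SD of the sample means shrinks roughly like sigma/sqrt(n), and a histogram
# of the means looks increasingly normal as n grows.
```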
Sample Size and Precision
- Larger samples → Smaller SE → Narrower confidence interval → More precise estimate of the population mean (μ).
- Smaller samples → Larger SE → Wider confidence interval → Less precise estimate of the population mean (μ).
Confidence Intervals
A Confidence Interval (CI) provides a range of plausible values for a population parameter.
For example, an approximate 95% confidence interval for the population mean can be calculated as: Sample Mean ± (2 * SE).
- Upper 95% limit = Sample Mean + (2 * SE)
- Lower 95% limit = Sample Mean – (2 * SE)
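A minimal sketch of this calculation (assuming NumPy and a hypothetical sample):

```python
import numpy as np

# Hypothetical sample (made-up values for illustration)
sample = np.array([12.1, 11.8, 12.5, 12.0, 11.6, 12.3, 12.2, 11.9, 12.4, 12.0])
n = len(sample)

mean = sample.mean()
se = sample.std(ddof=1) / np.sqrt(n)

lower, upper = mean - 2 * se, mean + 2 * se   # approximate 95% CI using the "2 * SE" rule
print(f"approximate 95% CI: ({lower:.2f}, {upper:.2f})")
```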
Z-scores, Z-statistics, and T-statistics
- Z-score: Standardizes an individual data point (x) from a population with known mean (μ) and standard deviation (σ). Formula: z = (x – μ) / σ. Z-scores help calculate probabilities and compare values from different normal distributions.
- Z-statistic: Used for hypothesis tests about a population mean when the population standard deviation (σ) is known. Formula: Z = (x̄ – μ) / (σ/√n), where x̄ is the sample mean.
- T-statistic: Used for hypothesis tests about a population mean when the population standard deviation (σ) is unknown and estimated by the sample standard deviation (s). Formula: T = (x̄ – μ) / (s/√n), where x̄ is the sample mean.
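A short illustration of the z-score and t-statistic formulas (assuming NumPy; the population parameters and sample values are made up):

```python
import numpy as np

# Z-score for a single data point from a population with known mu and sigma
mu, sigma = 100, 15          # assumed population parameters (illustrative)
x = 130
z_score = (x - mu) / sigma   # 2.0: the point lies 2 SDs above the mean

# T-statistic for a sample mean when sigma is unknown (estimated by s)
sample = np.array([104, 98, 110, 105, 99, 107, 102, 101])   # made-up sample
x_bar, s, n = sample.mean(), sample.std(ddof=1), len(sample)
t_stat = (x_bar - mu) / (s / np.sqrt(n))

print(f"z-score = {z_score:.2f}, t-statistic = {t_stat:.2f}")
```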
Hypothesis Testing
A statistical hypothesis is an assumption about a population parameter.
- Null Hypothesis (H0): The default belief or statement of no effect/difference (e.g., H0: μ = #).
- Alternative Hypothesis (H1): The hypothesis that the researcher aims to support, contradicting the null hypothesis (e.g., H1: μ ≠ #).
A common rule of thumb for decision making (especially with large samples, relating to an approximate 95% confidence level):
- If |t| ≥ 2, we reject H0 in favor of H1.
- If |t| < 2, we fail to reject H0.
Why the rule of thumb involving ‘2’? Because for a distribution that is approximately normal, a sample mean will be more than approximately two standard errors away from the true population mean only about 5% of the time if the null hypothesis is true.
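A minimal worked example (assuming NumPy and SciPy, with a hypothetical sample) that applies this rule of thumb via a one-sample t-test:

```python
import numpy as np
from scipy import stats

# H0: mu = 100 versus H1: mu != 100, tested on a made-up sample
sample = np.array([104, 98, 110, 105, 99, 107, 102, 101, 108, 103])

t_stat, p_value = stats.ttest_1samp(sample, popmean=100)
print(f"t = {t_stat:.2f}, p-value = {p_value:.3f}")

if abs(t_stat) >= 2:          # rule-of-thumb threshold (~95% confidence)
    print("reject H0 in favor of H1")
else:
    print("fail to reject H0")
```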
Correlation and Regression Analysis
Correlation Coefficient
The correlation coefficient is a numerical measure (typically ranging from -1 to +1) of how closely two variables move together, indicating the strength and direction of the linear relationship between X and Y.
- A positive correlation coefficient means that when we see a higher value for one variable, we also tend to see a higher value for the other variable.
- A negative correlation coefficient means that when we see a higher value for one variable, we tend to see a lower value for the other variable.
- When the correlation coefficient is 0, it means there is no linear correlation between the variables.
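A small sketch (assuming NumPy, with made-up data) of computing the correlation coefficient:

```python
import numpy as np

# Two made-up variables with a positive linear relationship
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.2])

r = np.corrcoef(x, y)[0, 1]   # Pearson correlation coefficient, between -1 and +1
print(f"correlation = {r:.3f}")
```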
Regression Line and Residuals
The Regression Line describes the linear relationship between a predictor variable (X) and a response variable (Y).
- Equation: Predicted Y = b0 + b1*X
- b0 is the intercept (predicted value of Y when X=0).
- b1 is the coefficient or slope.
- Slope (b1) = Rise/Run = Change in Y / Change in X. It represents the change in Y for a one-unit change in X.
The full model includes an error term: Y = b0 + b1*X + error.
The error (or residual) is the difference between the actual value and the predicted value: Error = Actual Value – Predicted Value.
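One way to fit this line and inspect the residuals (a sketch assuming NumPy, using np.polyfit on made-up data):

```python
import numpy as np

# Fit Predicted Y = b0 + b1*X by ordinary least squares (made-up data)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.2])

b1, b0 = np.polyfit(x, y, deg=1)     # slope and intercept of the least-squares line
predicted = b0 + b1 * x
residuals = y - predicted            # Error = Actual Value - Predicted Value

print(f"intercept b0 = {b0:.2f}, slope b1 = {b1:.2f}")
print("residuals:", np.round(residuals, 2))
```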
R-squared (R²)
R² (R-squared) indicates the percentage of the variation in the dependent variable (Y) that is explained by changes in the independent variable(s) (X) in the regression model. It is always between 0 and 1 (or 0% and 100%).
In a simple linear regression (one X variable): R² = [correlation(x,y)]².
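Continuing the made-up data above (assuming NumPy), a short check that R² equals the squared correlation in simple linear regression:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.2])

b1, b0 = np.polyfit(x, y, deg=1)
predicted = b0 + b1 * x

ss_res = np.sum((y - predicted) ** 2)    # unexplained variation
ss_tot = np.sum((y - y.mean()) ** 2)     # total variation in y
r_squared = 1 - ss_res / ss_tot

r = np.corrcoef(x, y)[0, 1]
print(f"R² = {r_squared:.3f}, correlation² = {r**2:.3f}")   # equal in simple regression
```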
High vs. Low R²
- A high R² is often found for outcomes that are strongly determined by a few observable factors (common in many time series regressions).
- A low R² is often found for outcomes that depend heavily on unobserved factors (common in surveys of people or complex systems).
Statistical Significance of Coefficients
If the p-value associated with a regression coefficient is less than a chosen significance level (e.g., 0.05), we reject the null hypothesis that the true coefficient is zero at the corresponding confidence level (e.g., 95%). The coefficient is then statistically significant, meaning the predictor variable has a statistically detectable relationship with the outcome variable.
Common Sampling Biases
Bias in sampling can lead to inaccurate conclusions:
- Non-representative Samples: If research is conducted on a group that does not accurately reflect the broader population, conclusions drawn from it cannot be generalized to the population as a whole.
- Survivorship Bias: This occurs when researchers focus on individuals or groups that have passed some sort of selection process while ignoring those who did not. Consequently, the data only includes “successful” cases, leading to overly optimistic or skewed findings.
- Self-selection Bias: This arises when people volunteer to respond to a survey or participate in a study. The characteristics and/or views of these respondents are likely to differ from those who did not volunteer, potentially skewing the results.
Interpreting Regression Model Coefficients
Example 1: Housing Prices
Consider the regression model: Price^ = b0 + b1*Size + b2*BeaconStreet
(Where Price^ is the predicted price, Size is a continuous variable, and BeaconStreet is a dummy variable, e.g., 1 if on Beacon Street, 0 otherwise).
- b1 tells us the expected change in Price for a one-unit change in Size, controlling for the effect of Beacon Street (i.e., holding location constant).
- b2 tells us the difference in predicted Price for condos on Beacon Street compared to those not on Beacon Street, controlling for the effect of Size (i.e., holding size constant).
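A sketch of estimating such a model by least squares (assuming NumPy; the condo data below are entirely made up for illustration):

```python
import numpy as np

# Made-up condo data: size in square feet, BeaconStreet dummy (1 = on Beacon St), price in $1000s
size   = np.array([800, 950, 1100, 1200, 850, 1000, 1300, 900])
beacon = np.array([0,   1,   0,    1,    0,   1,    1,    0])
price  = np.array([410, 560, 540,  690,  430, 585,  740,  455])

X = np.column_stack([np.ones_like(size), size, beacon])   # columns: intercept, Size, BeaconStreet
b0, b1, b2 = np.linalg.lstsq(X, price, rcond=None)[0]

print(f"b0 = {b0:.1f}")
print(f"b1 (price change per sq ft, holding location constant) = {b1:.3f}")
print(f"b2 (Beacon Street premium, holding size constant)      = {b2:.1f}")
```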
Example 2: Sales Model with Dummy Variables
Consider the regression model: Sales^ = 200 – 0.5*Price + 50*Spring + 90*Summer – 25*Winter
(Where Sales^ is predicted sales, Price is a continuous variable, and Spring, Summer, Winter are dummy variables for seasons, with Fall being the excluded/reference category).
Interpretation of the coefficient ’50’ for Spring: Sales are, on average, 50 units higher in the Spring than in the Fall, holding price constant.
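A small sketch (in Python) that encodes this fitted equation and confirms the Spring-versus-Fall interpretation at a fixed, illustrative price:

```python
# Predicted sales from the fitted model, for a given price and season
def predict_sales(price, spring=0, summer=0, winter=0):
    # Fall is the reference category: all three season dummies are 0
    return 200 - 0.5 * price + 50 * spring + 90 * summer - 25 * winter

print(predict_sales(price=100, spring=1))   # Spring at price 100 -> 200
print(predict_sales(price=100))             # Fall at price 100   -> 150 (50 units lower)
```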
Understanding Causality in Data
Correlation does not imply causation. When observing a relationship between X and Y, several possibilities exist:
- X causes Y: A change in X directly leads to a change in Y.
- Y causes X (Reverse Causality): A change in Y directly leads to a change in X.
- Confounding Factor: Another variable (or variables) Z causes both X and Y to change, creating an apparent relationship between X and Y.
- Simultaneity (Bidirectional Causality): X causes Y, AND Y causes X. This is also known as a feedback loop (e.g., the relationship between depression and alcohol consumption).