Essential Statistical Concepts and Tests
Simple Linear Regression
Purpose: Predict a numerical outcome (dependent variable Y) from a numerical predictor (independent variable X).
Equation: Y = a + bX
a (intercept): Predicted Y when X = 0
b (slope): For each 1-unit increase in X, predicted Y changes by b units (an increase if b > 0, a decrease if b < 0).
Example: Income = 20000 + 3000 × YearsOfEducation → Each extra year of education predicts $3,000 more income.
R² (Coefficient of Determination): Tells us how much of the variation in Y is explained by X. Ranges from 0 to 1.
Interpretation: If R² = 0.64 → 64% of the variance in income is explained by years of education.
Important Concepts: Outliers, residuals, linear relationship assumption.
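Worked sketch: the snippet below estimates a and b by ordinary least squares and then computes R² from the residuals. The function names (fit_line, r_squared) and the data are invented for illustration; they are not taken from any course material.

```python
from statistics import mean

def fit_line(xs, ys):
    """Least-squares estimates (a, b) for the line Y = a + bX."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx            # slope
    a = y_bar - b * x_bar    # intercept
    return a, b

def r_squared(xs, ys, a, b):
    """Share of the variance in Y explained by the fitted line."""
    y_bar = mean(ys)
    ss_res = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))  # residual variation
    ss_tot = sum((y - y_bar) ** 2 for y in ys)                    # total variation
    return 1 - ss_res / ss_tot

# Hypothetical data, invented for illustration only
years = [10, 12, 14, 16, 18]
income = [51000, 55000, 63000, 67000, 74000]

a, b = fit_line(years, income)
r2 = r_squared(years, income, a, b)
print(f"Income = {a:.0f} + {b:.0f} x YearsOfEducation, R^2 = {r2:.3f}")
```

With this made-up data the slope comes out at 2,900, so each extra year of education predicts $2,900 more income.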
ANOVA (Analysis of Variance)
Purpose: Compare means across 3 or more groups to see if at least one group mean is different.
Key Idea: Looks at how much total variability (SST) is due to differences between groups (SSB) vs. within groups (SSW).
Formulas:
SST = SSB + SSW
MSB = SSB / dfB, MSW = SSW / dfW (dfB = k − 1, dfW = N − k, for k groups and N total observations)
F = MSB / MSW
Interpretation: If F > critical value → Reject H₀ → At least one group mean is different.
Example: Comparing average calorie burn in 5 types of exercise.
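Worked sketch: the formulas above can be traced by hand in a few lines. The function name one_way_anova and the calorie numbers are hypothetical; three groups are used instead of five to keep the arithmetic short.

```python
def one_way_anova(groups):
    """Return SSB, SSW, SST, degrees of freedom, and F for a list of groups."""
    all_values = [v for g in groups for v in g]
    n_total, k = len(all_values), len(groups)
    grand_mean = sum(all_values) / n_total
    group_means = [sum(g) / len(g) for g in groups]

    # Between-group variation: group means around the grand mean
    ssb = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))
    # Within-group variation: values around their own group mean
    ssw = sum((v - m) ** 2 for g, m in zip(groups, group_means) for v in g)
    # Total variation, computed directly so SST = SSB + SSW can be checked
    sst = sum((v - grand_mean) ** 2 for v in all_values)

    df_b, df_w = k - 1, n_total - k
    f_stat = (ssb / df_b) / (ssw / df_w)
    return {"SSB": ssb, "SSW": ssw, "SST": sst, "dfB": df_b, "dfW": df_w, "F": f_stat}

# Hypothetical calories burned per session, invented for illustration
running = [300, 320, 310]
cycling = [250, 260, 270]
yoga = [150, 160, 170]

result = one_way_anova([running, cycling, yoga])
print(result)
```

Here F is far above the α = 0.05 critical value for (2, 6) degrees of freedom (about 5.14), so H₀ would be rejected for this made-up data.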
Correlation
Purpose: Measure the strength and direction of a linear relationship between two numeric variables.
Pearson’s r: Ranges from -1 to 1
r = 0 → No linear relationship
r = +1 → Perfect positive relationship
r = -1 → Perfect negative relationship
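R² = r²: In simple linear regression, the squared correlation equals R², the proportion of variance in Y explained by X.
Fisher Z-transformation: Converts r to an approximately normal score, z = ½ ln((1 + r) / (1 − r)), so that two correlation coefficients from independent samples can be compared.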
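Worked sketch: Pearson's r computed from its definition, followed by a Fisher Z comparison of two correlations. This assumes the two correlations come from independent samples; the function names and all numbers are hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length numeric lists."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def fisher_z(r):
    """Fisher Z-transformation of r (same as 0.5 * ln((1 + r) / (1 - r)))."""
    return math.atanh(r)

def compare_correlations(r1, n1, r2, n2):
    """z statistic for H0: the two population correlations are equal.
    Assumes the samples are independent of each other."""
    se = math.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
    return (fisher_z(r1) - fisher_z(r2)) / se

# Hypothetical data, invented for illustration
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
r = pearson_r(xs, ys)
print(f"r = {r:.3f}, r squared = {r * r:.3f}")

# Hypothetical correlations from two independent samples of 103 people each
z_stat = compare_correlations(0.6, 103, 0.3, 103)
print(f"z = {z_stat:.2f}")
```

For a two-tailed test at α = 0.05, |z| > 1.96 would suggest the two correlations differ.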
Chi-Square Tests
Use when: Your variables are categorical (e.g., yes/no, categories, ranks).
Two Types:
Test of Independence: Are two categorical variables related (e.g., Netflix hours category and fitness level)?
Goodness of Fit: Do observed frequencies match a theoretical distribution?
Formula: χ² = Σ (O − E)² / E, summed over all cells
O = Observed frequency
E = Expected frequency (for the test of independence: row total × column total / grand total)
Assumption: Expected counts should generally be ≥ 5 in each cell.
Interpretation: If chi-square statistic > critical value → Reject H₀.
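Worked sketch: a test of independence computed from a table of observed counts. The function name chi_square_independence and the 2×2 counts are hypothetical, chosen so every expected count is at least 5.

```python
def chi_square_independence(table):
    """Return (chi-square statistic, degrees of freedom, expected counts)
    for a table of observed counts given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)

    # Expected count for each cell = row total * column total / grand total
    expected = [[rt * ct / n for ct in col_totals] for rt in row_totals]

    chi2 = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return chi2, df, expected

# Hypothetical counts: rows = low / high Netflix hours, columns = fit / not fit
observed = [[30, 10],
            [20, 40]]

chi2, df, expected = chi_square_independence(observed)
print(f"chi-square = {chi2:.2f}, df = {df}, expected = {expected}")
```

With df = 1 the α = 0.05 critical value is 3.84, so a statistic this large would lead to rejecting H₀ for this made-up table.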
Measures of Center and Spread
Measures of Central Tendency
Mean: Average
Median: Middle value when ordered
Mode: Most frequent value
Use median when outliers/skewed data are present.
Measures of Dispersion
Variance: Average squared deviation from mean (sample variance divides by n − 1 instead of n) → Shows data spread
Standard Deviation: √Variance → Easier to interpret (in same units as original data)
Interpretation: Larger SD = more spread = less consistent data.
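Worked sketch: Python's standard statistics module covers all of these measures. The data is hypothetical and includes one outlier to show why the median is preferred for skewed data.

```python
import math
import statistics as st

# Hypothetical data with one outlier (50), invented for illustration
data = [4, 5, 5, 6, 7, 8, 50]
without_outlier = data[:-1]

center = {
    "mean": st.mean(data),
    "median": st.median(data),
    "mode": st.mode(data),
}

pop_var = st.pvariance(data)   # divides by n
samp_var = st.variance(data)   # divides by n - 1
samp_sd = st.stdev(data)       # square root of the sample variance

print(center)
print(f"mean without outlier   = {st.mean(without_outlier):.2f}")
print(f"median without outlier = {st.median(without_outlier)}")
print(f"sample variance = {samp_var:.2f}, sample SD = {samp_sd:.2f}")
```

The single outlier pulls the mean from about 5.8 to about 12.1, while the median only moves from 5.5 to 6.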
Scatterplots and Frequency Tables
Scatterplots
X-axis: Independent variable
Y-axis: Dependent variable
Each point: One individual
Used to visualize correlation/regression patterns.
Look for: Linear trend, outliers, direction (positive/negative).
Frequency Tables
Shows count of occurrences per category (e.g., Netflix hours by fitness level).
Used for Chi-square tests (calculate expected values).
Measures of Effect Size
Phi (φ): For 2×2 tables (calculate it!): φ = √(χ² / N)
Gamma and Tau-b (ordinal variables), Lambda and Cramér’s V (nominal variables): Know what these are used for → you don’t calculate these.
They tell us strength of association, not causation.
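Worked sketch: Phi for a 2×2 table, using the cell-count form of the formula, which gives the same magnitude as √(χ² / N). The function name phi_coefficient and the counts are hypothetical.

```python
import math

def phi_coefficient(table):
    """Phi for a 2x2 table [[a, b], [c, d]] of observed counts."""
    (a, b), (c, d) = table
    numerator = a * d - b * c
    denominator = math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    return numerator / denominator

# Hypothetical 2x2 counts, invented for illustration
observed = [[30, 10],
            [20, 40]]

phi = phi_coefficient(observed)
print(f"phi = {phi:.3f}")
```

The sign only reflects how the rows and columns are ordered; the absolute value is what indicates strength of association.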
Levels of Measurement
Level | Description | Example |
Nominal | Names/labels (no order) | Gender, Race |
Ordinal | Categories with logical order | Satisfaction rating |
Interval | Numerical, equal spacing, no true zero | Temperature (°C) |
Ratio | Interval + true zero | Age, Weight, Income |
Choosing the Right Statistical Test
If You Want To… | Use This Test |
Compare one mean to a known value (population SD known) | Z-test |
Compare one mean to a known value (population SD unknown) | One-sample t-test |
Compare two group means | Two-sample t-test |
Compare same group before and after | Paired t-test |
Compare 3+ means across groups | ANOVA |
Test association between 2 categories | Chi-square Test of Independence |
Compare frequencies with theoretical values | Chi-square Goodness of Fit |
Predict numeric outcome from numeric variable | Simple Linear Regression |
Predict binary outcome (e.g., yes/no) | Logistic Regression |
Compare 2 correlations | Fisher Z-transformation |
Key Assumptions:
T-tests: Approximately normal data, independent observations, equal variances (unless Welch’s t-test is used)
ANOVA: Normality, equal variances, independent observations
Chi-square: Expected count ≥ 5 per cell
Regression: Linearity, independence, normal residuals, constant residual variance (homoscedasticity)
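Worked sketch: the one-sample and paired t-tests from the table share one formula, t = (sample mean − μ₀) / (s / √n), because a paired t-test is a one-sample t-test on the before/after differences. Function names and scores are hypothetical.

```python
import math
from statistics import mean, stdev

def one_sample_t(sample, mu0):
    """Return (t statistic, degrees of freedom) for H0: population mean = mu0."""
    n = len(sample)
    se = stdev(sample) / math.sqrt(n)   # standard error of the mean
    return (mean(sample) - mu0) / se, n - 1

def paired_t(before, after):
    """Paired t-test: a one-sample t-test on the differences against 0."""
    diffs = [b - a for b, a in zip(before, after)]
    return one_sample_t(diffs, 0)

# Hypothetical scores for the same five people, invented for illustration
before = [80, 75, 90, 85, 70]
after = [78, 72, 85, 84, 66]

t_stat, df = paired_t(before, after)
print(f"t = {t_stat:.3f}, df = {df}")
```

For df = 4 the two-tailed α = 0.05 critical value is about 2.776, so a t statistic above that would lead to rejecting H₀.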
Hypothesis Testing, Errors, and Confidence Intervals
H₀ (Null Hypothesis): No effect, no difference
H₁ (Alternative Hypothesis): There is an effect or a difference
Alpha (α): Significance level, the probability of making a Type I Error when H₀ is true; chosen before testing (typically 0.05)
Errors:
Type I Error: Rejecting H₀ when it’s true (false positive)
Type II Error: Failing to reject H₀ when it’s false (false negative)
Confidence Intervals (CI):
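Definition: A range of values, computed from the sample, that is likely to contain the population parameter at a stated confidence level
95% CI: If we repeated this study 100 times and built a CI each time, about 95 of those CIs would be expected to contain the true value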
Narrow CI = more precision
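Worked sketch: a 95% CI for a mean, plus a small simulation of the "repeat the study many times" idea. This uses the normal (z) critical value, which is an assumption that suits larger samples; small samples would normally use a t critical value instead. Function names, the seed, and the population values (mean 100, SD 15) are hypothetical.

```python
import math
import random
from statistics import NormalDist, mean, stdev

def mean_ci(sample, confidence=0.95):
    """Approximate CI for the mean using a z critical value (large-sample assumption)."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)   # about 1.96 for 95%
    m = mean(sample)
    se = stdev(sample) / math.sqrt(len(sample))
    return m - z * se, m + z * se

def coverage(true_mean=100, sd=15, n=50, repeats=1000, seed=0):
    """Fraction of simulated studies whose 95% CI contains the true mean."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(repeats):
        sample = [rng.gauss(true_mean, sd) for _ in range(n)]
        low, high = mean_ci(sample)
        hits += low <= true_mean <= high
    return hits / repeats

# Hypothetical sample, invented for illustration
scores = [98, 105, 110, 92, 101, 99, 107, 95, 103, 100]
low, high = mean_ci(scores)
print(f"95% CI: ({low:.1f}, {high:.1f})")
print(f"simulated coverage: {coverage():.3f}")
```

The simulated coverage should land close to 0.95. Raising the confidence level widens the interval, and a larger sample narrows it.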