Essential Statistical Concepts and Tests

Simple Linear Regression

  • Purpose: Predict a numerical outcome (dependent variable Y) from a numerical predictor (independent variable X).

  • Equation: Y = a + bX

    • a (intercept): Predicted Y when X = 0

    • b (slope): For each 1-unit increase in X, predicted Y changes by b units (an increase if b > 0, a decrease if b < 0).

  • Example: Income = 20000 + 3000 × YearsOfEducation → Each extra year of education predicts $3,000 more income.

  • R² (Coefficient of Determination): The proportion of the variation in Y that is explained by X. Ranges from 0 to 1.

  • Interpretation: If R² = 0.64 → 64% of the variance in income is explained by years of education.

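  • Important Concepts: Residuals (observed Y minus predicted Y), outliers (points that can pull the line strongly), and the assumption that the relationship between X and Y is linear.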
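
To make the equation and R² concrete, here is a minimal sketch of a least-squares fit in plain Python. The function name fit_line and the income figures are made up for illustration; the data were chosen so that the fitted line comes out close to the Income example above.

```python
def fit_line(xs, ys):
    """Return (a, b, r_squared) for the least-squares line Y = a + bX."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: co-variation of X and Y divided by the variation in X
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    # R² = 1 - (unexplained variation / total variation)
    ss_total = sum((y - mean_y) ** 2 for y in ys)
    ss_resid = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    return a, b, 1 - ss_resid / ss_total


# Hypothetical data: years of education and annual income
years = [10, 12, 14, 16, 18]
income = [50500, 55000, 63000, 67000, 74500]

a, b, r2 = fit_line(years, income)
print(f"Income = {a:.0f} + {b:.0f} x YearsOfEducation, R² = {r2:.3f}")
```

The residuals are the differences between each observed income and the value the line predicts; R² is close to 1 here only because the invented points lie near a straight line.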

ANOVA (Analysis of Variance)

  • Purpose: Compare means across 3 or more groups to see if at least one group mean is different.

  • Key Idea: Looks at how much total variability (SST) is due to differences between groups (SSB) vs. within groups (SSW).

  • Formulas:

    • SST = SSB + SSW

    • MSB = SSB / dfB, MSW = SSW / dfW (with k groups and N total observations: dfB = k − 1, dfW = N − k)

    • F = MSB / MSW

  • Interpretation: If F > critical value → Reject H₀ → At least one group mean is different.

  • Example: Comparing average calorie burn in 5 types of exercise.
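
The formulas above can be sketched in a few lines of Python. The function name one_way_anova and the calorie numbers are hypothetical, and only three exercise types are used to keep the arithmetic easy to follow.

```python
def one_way_anova(groups):
    """Return (ssb, ssw, f_stat) for a list of groups of numbers."""
    all_values = [v for g in groups for v in g]
    n_total = len(all_values)
    k = len(groups)
    grand_mean = sum(all_values) / n_total
    group_means = [sum(g) / len(g) for g in groups]

    # Between-group variation: how far each group mean is from the grand mean
    ssb = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))
    # Within-group variation: how far each value is from its own group mean
    ssw = sum((v - m) ** 2 for g, m in zip(groups, group_means) for v in g)

    msb = ssb / (k - 1)        # dfB = k - 1
    msw = ssw / (n_total - k)  # dfW = N - k
    return ssb, ssw, msb / msw


# Hypothetical calories burned in three types of exercise
calories = [
    [240, 250, 260],
    [260, 270, 280],
    [290, 300, 310],
]
ssb, ssw, f_stat = one_way_anova(calories)
print(f"SSB = {ssb:.0f}, SSW = {ssw:.0f}, F = {f_stat:.1f}")
```

For these numbers F works out to 19 with dfB = 2 and dfW = 6. A standard F table gives a critical value of about 5.14 at α = 0.05, so H₀ would be rejected.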

Correlation

  • Purpose: Measure the strength and direction of a linear relationship between two numeric variables.

  • Pearson’s r: Ranges from -1 to 1

    • r = 0 → No linear relationship

    • r = +1 → Perfect positive linear relationship

    • r = -1 → Perfect negative linear relationship

  • R² = r²: In simple linear regression, squaring r gives R², the proportion of variance in Y explained by X (e.g., r = 0.8 → R² = 0.64).

  • Fisher Z-transformation: Converts r to z = ½ ln((1 + r) / (1 − r)), which is approximately normally distributed. Used to test whether two correlation coefficients differ.
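
As a rough sketch, Pearson's r and a Fisher Z comparison can be computed with the standard library alone. The function names and the example correlations are assumptions made for illustration, and the comparison assumes the two correlations come from independent samples.

```python
import math
from statistics import NormalDist


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def fisher_z(r):
    """Fisher Z-transformation of r (undefined for r = +1 or r = -1)."""
    return 0.5 * math.log((1 + r) / (1 - r))


def compare_correlations(r1, n1, r2, n2):
    """Two-sided z test of H0: the two population correlations are equal."""
    standard_error = math.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
    z = (fisher_z(r1) - fisher_z(r2)) / standard_error
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical: r = 0.8 in one sample of 103 people, r = 0.5 in another
z, p = compare_correlations(0.8, 103, 0.5, 103)
print(f"z = {z:.2f}, p = {p:.4f}")
```

If |z| exceeds 1.96, the two correlations differ significantly at α = 0.05 (two-sided).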

Chi-Square Tests

  • Use when: Your variables are categorical (e.g., yes/no, categories, ranks).

  • Two Types:

    • Test of Independence: Are two variables related (e.g., Netflix hours and fitness level)?

    • Goodness of Fit: Do observed frequencies match a theoretical distribution?

  • Formula: χ² = Σ (O − E)² / E, summed over all cells

    • O = Observed frequency

    • E = Expected frequency (for the Test of Independence: row total × column total / grand total)

  • Assumption: Expected counts should generally be ≥ 5 in each cell.

  • Interpretation: If chi-square statistic > critical value → Reject H₀.
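
The Test of Independence can be sketched as follows. The function name and the 2×2 counts are invented for illustration; expected counts come from the row and column totals, and the statistic is the sum of (O − E)² / E over all cells.

```python
def chi_square_independence(table):
    """Return (chi2, df, expected) for a table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)

    chi2 = 0.0
    expected = []
    for i, row in enumerate(table):
        expected_row = []
        for j, observed in enumerate(row):
            # E = row total x column total / grand total
            e = row_totals[i] * col_totals[j] / grand_total
            expected_row.append(e)
            chi2 += (observed - e) ** 2 / e
        expected.append(expected_row)

    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, df, expected


# Hypothetical counts: rows = low/high Netflix hours, columns = fit/not fit
observed = [
    [30, 20],
    [20, 30],
]
chi2, df, expected = chi_square_independence(observed)
print(f"chi-square = {chi2:.2f}, df = {df}, expected = {expected}")
```

Every expected count here is 25, so the ≥ 5 assumption holds. The statistic is 4.0 with df = 1, which exceeds the critical value of 3.841 at α = 0.05, so H₀ (independence) would be rejected.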


Measures of Center and Spread

Measures of Central Tendency

  • Mean: Average

  • Median: Middle value when ordered

  • Mode: Most frequent value

  • Use median when outliers/skewed data are present.

Measures of Dispersion

  • Variance: Average squared deviation from mean → Shows data spread

  • Standard Deviation: √Variance → Easier to interpret (in same units as original data)

Interpretation: Larger SD = more spread = less consistent data.
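
A short illustration using Python's built-in statistics module, with made-up numbers. One caveat: "average squared deviation" (dividing by n) is the population variance, while the sample variance divides by n − 1. The module provides both.

```python
import statistics

data = [2, 3, 3, 4, 5]          # hypothetical scores
with_outlier = data + [40]      # same scores plus one extreme value

print("mean:  ", statistics.mean(data))     # 3.4
print("median:", statistics.median(data))   # 3
print("mode:  ", statistics.mode(data))     # 3

# Population versions divide by n; sample versions divide by n - 1
print("population variance:", statistics.pvariance(data))
print("sample variance:    ", statistics.variance(data))
print("sample SD:          ", statistics.stdev(data))

# One outlier drags the mean upward but barely moves the median
print("mean with outlier:  ", statistics.mean(with_outlier))    # 9.5
print("median with outlier:", statistics.median(with_outlier))  # 3.5
```

The last two lines show why the median is preferred for skewed data or data with outliers.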


Scatterplots and Frequency Tables

Scatterplots

  • X-axis: Independent variable

  • Y-axis: Dependent variable

  • Each point: One individual

  • Used to visualize correlation/regression patterns.

  • Look for: Linear trend, outliers, direction (positive/negative).

Frequency Tables

  • Shows count of occurrences per category (e.g., Netflix hours by fitness level).

  • Used for Chi-square tests (calculate expected values).


Measures of Effect Size

  • Phi: For 2×2 tables (calculate it!)

  • Gamma, Tau-b, Lambda, Cramér’s V: Know that Gamma and Tau-b are used for ordinal variables, and Lambda and Cramér’s V for nominal variables → you don’t calculate these.

  • They tell us strength of association, not causation.
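
Since Phi is the one effect size to calculate by hand, here is a minimal sketch for a 2×2 table with cells a, b (first row) and c, d (second row). The function name and the counts are hypothetical.

```python
import math


def phi_coefficient(table):
    """Phi for a 2x2 table [[a, b], [c, d]] of observed counts."""
    (a, b), (c, d) = table
    numerator = a * d - b * c
    # Product of the two row totals and the two column totals
    denominator = math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    return numerator / denominator


# Hypothetical 2x2 table of counts
observed = [
    [30, 20],
    [20, 30],
]
print(f"phi = {phi_coefficient(observed):.2f}")
```

For a 2×2 table, the absolute value of Phi also equals √(χ² / N). These counts give χ² = 4 with N = 100, so Phi = 0.2, which is a weak association.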


Levels of Measurement

Level    | Description                            | Example
---------|----------------------------------------|--------------------
Nominal  | Names/labels (no order)                | Gender, Race
Ordinal  | Categories with logical order          | Satisfaction rating
Interval | Numerical, equal spacing, no true zero | Temperature (°C)
Ratio    | Interval + true zero                   | Age, Weight, Income


Choosing the Right Statistical Test

If You Want To…                                          | Use This Test
---------------------------------------------------------|--------------------------------
Compare one mean to a known value (population SD known)  | Z-test
Compare one mean to a known value (population SD unknown)| One-sample t-test
Compare two independent group means                      | Two-sample t-test
Compare same group before and after                      | Paired t-test
Compare 3+ means across groups                           | ANOVA
Test association between 2 categorical variables         | Chi-square Test of Independence
Compare frequencies with theoretical values              | Chi-square Goodness of Fit
Predict numeric outcome from numeric variable            | Simple Linear Regression
Predict binary outcome (e.g., yes/no)                    | Logistic Regression
Compare 2 correlations                                   | Fisher Z transformation

Key Assumptions:

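  • t-tests: Independent observations, approximately normal data, equal variances (unless Welch’s t-test is used)

  • ANOVA: Independent observations, normality, equal variances across groups

  • Chi-square: Expected count ≥ 5 per cell

  • Regression: Linearity, independent observations, normally distributed residuals with constant variance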
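
To illustrate the "unless Welch used" note, the sketch below computes Welch's t statistic and its approximate degrees of freedom, which do not require equal variances. The function name and data are made up, and a p-value would still need a t table or a statistics package.

```python
import math
import statistics


def welch_t(sample1, sample2):
    """Return (t, df) for Welch's two-sample t-test (unequal variances)."""
    n1, n2 = len(sample1), len(sample2)
    # Squared standard error contributed by each group
    v1 = statistics.variance(sample1) / n1
    v2 = statistics.variance(sample2) / n2
    t = (statistics.mean(sample1) - statistics.mean(sample2)) / math.sqrt(v1 + v2)
    # Welch-Satterthwaite approximation to the degrees of freedom
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df


# Hypothetical scores for two independent groups
group_a = [1, 2, 3, 4, 5]
group_b = [2, 3, 4, 5, 6]
t, df = welch_t(group_a, group_b)
print(f"t = {t:.2f}, df = {df:.1f}")
```

When the two groups have the same size and the same variance, as in this toy data, Welch's df equals the usual n1 + n2 − 2. It shrinks as the variances become more unequal.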


Hypothesis Testing, Errors, and Confidence Intervals

  • H₀ (Null Hypothesis): No effect, no difference

  • H₁ (Alternative Hypothesis): There is an effect or a difference

  • Alpha (α): The significance level, i.e., the probability of a Type I Error that we agree to accept before testing (typically 0.05)

Errors:

  • Type I Error: Rejecting H₀ when it’s true (false positive)

  • Type II Error: Failing to reject H₀ when it’s false (false negative)

Confidence Intervals (CI):

  • Definition: A range of plausible values for the population parameter, calculated from sample data

  • 95% CI: If we repeated this study many times and built a 95% CI each time, about 95% of those intervals would contain the true value

  • Narrow CI = more precision
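
The "repeated study" reading of a 95% CI can be checked with a small simulation. This is a sketch under stated assumptions: the population (mean 100, SD 15), the sample size, and the function names are all invented, and the interval uses the normal approximation (z ≈ 1.96), which suits reasonably large samples. Small samples would use a t critical value instead.

```python
import math
import random
import statistics
from statistics import NormalDist


def mean_ci(sample, confidence=0.95):
    """Approximate CI for a mean: sample mean ± z × standard error."""
    standard_error = statistics.stdev(sample) / math.sqrt(len(sample))
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    mean = statistics.mean(sample)
    return mean - z * standard_error, mean + z * standard_error


def coverage(true_mean=100.0, true_sd=15.0, n=50, studies=1000, seed=0):
    """Share of simulated studies whose CI contains the true mean."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(studies):
        sample = [rng.gauss(true_mean, true_sd) for _ in range(n)]
        low, high = mean_ci(sample)
        if low <= true_mean <= high:
            hits += 1
    return hits / studies


print(f"Share of intervals containing the true mean: {coverage():.3f}")
```

The printed share should land near 0.95. Increasing n makes each interval narrower (more precision) without changing the coverage.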