Statistical Fundamentals and Key Concepts Reference

Hypothesis Testing and P-Values

P-Value Definition

The p-value is the probability of obtaining a test statistic at least as extreme as the one observed in the sample, assuming the null hypothesis (H₀) is true.

Interpretation

  • Large p-value: Little or no evidence against H₀ (Null Hypothesis); the data are consistent with H₀.
  • Small p-value: Evidence against H₀ and in favor of Hₐ (Alternative Hypothesis).

Types of Errors

  • Type I Error (α): Rejecting H₀ when H₀ is true.
  • Type II Error (β): Failing to reject H₀ when H₀ is false.

Study Design Fundamentals

  • Population: The entire group of interest.
  • Sample: A subset of the population that is observed or measured.

Variables

  • Explanatory Variable (x): The factor believed to influence the outcome.
  • Response Variable (y): The outcome that is measured.

Study Types

  • Observational Study: Researchers observe subjects without assigning treatments.
  • Experimental Study: Treatments are assigned randomly to subjects to determine causality.

Causality: Causal relationships can only be inferred from well-designed randomized experiments; observational studies can establish association but not causation.

Descriptive Statistics

Distribution Shape and Measures of Center

  • Symmetric: Mean ≈ Median
  • Right Skewed: Mean > Median
  • Left Skewed: Mean < Median

Robustness to Outliers

  • Mean: Sensitive to outliers.
  • Median: Robust to outliers.
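
Both points, how skewness orders the mean and median and how a single outlier affects each, can be checked directly. A minimal sketch in Python with made-up values:

```python
import numpy as np

# Right-skewed sample (hypothetical data): most values are small, a few are large.
skewed = np.array([2, 3, 3, 4, 4, 5, 6, 25, 40])
print(np.mean(skewed), np.median(skewed))       # mean > median (right skew)

# Adding one extreme outlier shifts the mean but barely moves the median.
with_outlier = np.append(skewed, 500)
print(np.mean(with_outlier), np.median(with_outlier))
```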

Z-Score

A Z-score measures the number of standard deviations an observation lies from the mean: z = (x − x̄) / s for a sample (or z = (x − μ) / σ for a population). It quantifies how unusual an observation is relative to the rest of the distribution.
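
A quick numeric illustration of the formula (the observation, mean, and standard deviation below are made up):

```python
# Hypothetical values: an observation of 85 from a distribution
# with mean 70 and standard deviation 10 lies 1.5 SDs above the mean.
x, mean, sd = 85, 70, 10
z = (x - mean) / sd
print(z)  # 1.5
```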

Bootstrap and Randomization Methods

Bootstrap

  • Resamples the original sample with replacement.
  • Used to estimate variability and construct Confidence Intervals (CI).
  • The resulting distribution is centered at the observed sample statistic (e.g., the sample mean).

Randomization (Permutation Test)

  • Simulates the null distribution.
  • The resulting distribution is centered at the H₀ parameter (null value).

Note on Sample Size: Each bootstrap sample must have the same size as the original sample.
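
A minimal sketch of both resampling procedures using NumPy; the sample values, group labels, and number of resamples are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(42)

# --- Bootstrap: resample the ORIGINAL sample, with replacement, same size ---
sample = np.array([4.2, 5.1, 3.8, 6.0, 5.5, 4.9, 5.3, 4.4])  # hypothetical data
boot_means = np.array([
    rng.choice(sample, size=sample.size, replace=True).mean()
    for _ in range(10_000)
])
# Percentile 95% CI; the bootstrap distribution is centered near the sample mean.
ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])
print(f"Bootstrap 95% CI for the mean: ({ci_low:.2f}, {ci_high:.2f})")

# --- Permutation (randomization) test: simulate the null distribution ---
group_a = np.array([5.0, 6.1, 5.7, 6.3])   # hypothetical group data
group_b = np.array([4.1, 4.8, 4.5, 5.0])
observed_diff = group_a.mean() - group_b.mean()

pooled = np.concatenate([group_a, group_b])
null_diffs = np.empty(10_000)
for i in range(10_000):
    shuffled = rng.permutation(pooled)     # reshuffle labels under H0: no difference
    null_diffs[i] = shuffled[:group_a.size].mean() - shuffled[group_a.size:].mean()

# Two-sided p-value: proportion of null differences at least as extreme as observed.
p_value = np.mean(np.abs(null_diffs) >= abs(observed_diff))
print(f"Observed difference: {observed_diff:.2f}, permutation p-value: {p_value:.3f}")
```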

Key Probability Rules

  • Joint Probability: P(A ∩ B) = P(A) * P(B | A)
  • Total Probability: P(B) = P(A ∩ B) + P(Ā ∩ B)
  • Conditional Probability: P(A | B) = P(A ∩ B) / P(B)
  • Independence: If A and B are independent, P(A ∩ B) = P(A) * P(B)
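
A short numeric check of these rules, using made-up probabilities P(A) = 0.30, P(B | A) = 0.50, and P(B | Ā) = 0.20:

```python
# Hypothetical probabilities used only to illustrate the rules above.
p_a = 0.30
p_b_given_a = 0.50
p_b_given_not_a = 0.20

p_a_and_b = p_a * p_b_given_a                      # joint probability
p_b = p_a_and_b + (1 - p_a) * p_b_given_not_a      # total probability
p_a_given_b = p_a_and_b / p_b                      # conditional probability

print(p_a_and_b, p_b, round(p_a_given_b, 3))       # 0.15 0.29 0.517
```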

Chi-Square Test

Use: Testing for association between two categorical variables.

Formulas

  • Expected Counts (E): E = (Row Total * Column Total) / n
  • Degrees of Freedom (df): df = (r - 1)(c - 1) (where r is rows, c is columns)

Hypotheses

  • Null Hypothesis (H₀): No association exists between the variables.
  • Alternative Hypothesis (Hₐ): An association exists between the variables.
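
A minimal sketch using scipy.stats.chi2_contingency on a hypothetical 2 × 3 table of observed counts:

```python
import numpy as np
from scipy import stats

# Hypothetical observed counts for two categorical variables (2 rows x 3 columns).
observed = np.array([
    [30, 45, 25],
    [20, 35, 45],
])

chi2, p_value, df, expected = stats.chi2_contingency(observed)

print(f"chi-square = {chi2:.2f}, df = {df}, p-value = {p_value:.4f}")
print("Expected counts (row total * column total / n):")
print(expected)          # check that all expected counts are >= 5
# df = (r - 1)(c - 1) = (2 - 1)(3 - 1) = 2
```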

Regression and Correlation Analysis

Simple Linear Regression Model

Equation: ŷ = b₀ + b₁x

  • Slope (b₁): The predicted change in the response variable (y) for every one-unit increase in the explanatory variable (x).
  • Intercept (b₀): The predicted value of y when x equals 0.

Correlation Coefficient (r)

  • Range: -1 ≤ r ≤ 1
  • Direction: Indicated by the sign (Positive or Negative).
  • Strength:
    • |r| close to 1 indicates a strong linear relationship.
    • |r| close to 0 indicates a weak linear relationship.

Prediction: To predict y, plug the value of x into the regression equation.
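
A minimal sketch using scipy.stats.linregress on made-up (x, y) data, recovering the slope, intercept, correlation, and a prediction as described above:

```python
import numpy as np
from scipy import stats

# Hypothetical explanatory (x) and response (y) values.
x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1, 6.8, 8.2, 8.9])

fit = stats.linregress(x, y)
print(f"slope b1 = {fit.slope:.3f}, intercept b0 = {fit.intercept:.3f}")
print(f"correlation r = {fit.rvalue:.3f}, R^2 = {fit.rvalue**2:.3f}")

# Prediction: plug a new x value into y-hat = b0 + b1 * x.
x_new = 5.5
y_hat = fit.intercept + fit.slope * x_new
print(f"predicted y at x = {x_new}: {y_hat:.2f}")
```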

Discrete Random Variables

Valid Probability Function Conditions

  • The probability of any outcome must be between 0 and 1: 0 ≤ P(X=x) ≤ 1
  • The sum of all probabilities must equal 1: ∑P(X=x) = 1

Example (Mutually Exclusive Events): P(X=3 or X=4) = P(X=3) + P(X=4)
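
A quick check of both conditions on a hypothetical probability function:

```python
# Hypothetical probability function for a discrete random variable X.
pmf = {1: 0.10, 2: 0.25, 3: 0.40, 4: 0.25}

# Condition 1: every probability is between 0 and 1.
# Condition 2: the probabilities sum to 1.
valid = all(0 <= p <= 1 for p in pmf.values()) and abs(sum(pmf.values()) - 1) < 1e-9
print(valid)              # True: both conditions hold

# Mutually exclusive outcomes: P(X=3 or X=4) = P(X=3) + P(X=4)
print(pmf[3] + pmf[4])    # 0.65
```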

Binomial Distribution Conditions (BINS)

  • Binary: Only two outcomes (success/failure).
  • Independence: Each trial is independent.
  • Number: Fixed number (n) of trials.
  • Success: Constant probability (p) of success per trial.
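
A minimal sketch using scipy.stats.binom with hypothetical values n = 10 and p = 0.3:

```python
from scipy import stats

n, p = 10, 0.3                      # hypothetical number of trials and success probability

# P(X = 4): exactly 4 successes in 10 independent trials.
print(stats.binom.pmf(4, n, p))     # ~0.200

# P(X <= 4): at most 4 successes (cumulative probability).
print(stats.binom.cdf(4, n, p))     # ~0.850

# Mean and standard deviation: np and sqrt(np(1 - p)).
print(stats.binom.mean(n, p), stats.binom.std(n, p))
```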

Confidence Intervals (CI)

Interpretation

“I am 95% confident that the true population parameter lies within this calculated interval.”

Formulas

  • CI for Proportion: p̂ ± z* * √[p̂(1 - p̂) / n]
  • CI for Mean (t-distribution): x̄ ± t* * (s / √n)
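
A minimal sketch of both formulas with hypothetical summary statistics (p̂ = 0.42 from n = 200; x̄ = 24.5, s = 6.2 from n = 35):

```python
import numpy as np
from scipy import stats

# 95% CI for a proportion: p-hat +/- z* * sqrt(p-hat(1 - p-hat) / n)
p_hat, n_prop = 0.42, 200                  # hypothetical sample proportion and size
z_star = stats.norm.ppf(0.975)             # ~1.96 for 95% confidence
me_prop = z_star * np.sqrt(p_hat * (1 - p_hat) / n_prop)
print(f"Proportion CI: ({p_hat - me_prop:.3f}, {p_hat + me_prop:.3f})")

# 95% CI for a mean: x-bar +/- t* * s / sqrt(n), with df = n - 1
x_bar, s, n_mean = 24.5, 6.2, 35           # hypothetical sample mean, SD, and size
t_star = stats.t.ppf(0.975, df=n_mean - 1)
me_mean = t_star * s / np.sqrt(n_mean)
print(f"Mean CI: ({x_bar - me_mean:.2f}, {x_bar + me_mean:.2f})")
```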

Essential Statistical Tips

Conditions to Check Before Running Tests

  • Proportion Tests: Check that np ≥ 10 and n(1-p) ≥ 10 (Success/Failure Condition).
  • T-Tests: Check that the data come from an approximately normal population or that the sample size is large (commonly n ≥ 30).
  • Chi-Square Test: Ensure all expected counts are ≥ 5.

Hypothesis Test Decision Rule

  • If p-value ≤ α (Significance Level), Reject H₀.
  • If p-value > α, Fail to Reject H₀.
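
A minimal sketch of the decision rule using a one-sample t-test on made-up data, testing H₀: μ = 5 at α = 0.05:

```python
import numpy as np
from scipy import stats

alpha = 0.05
sample = np.array([5.2, 4.8, 6.1, 5.9, 5.5, 6.3, 5.7, 6.0])  # hypothetical data

# One-sample t-test of H0: mu = 5 against a two-sided alternative.
t_stat, p_value = stats.ttest_1samp(sample, popmean=5)
print(f"t = {t_stat:.2f}, p-value = {p_value:.4f}")

if p_value <= alpha:
    print("Reject H0")            # sufficient evidence against the null hypothesis
else:
    print("Fail to reject H0")    # insufficient evidence against the null hypothesis
```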

Applied Statistics Examples

  1. Correlation Interpretation (r = 0.66)

    Question: The correlation between the variables is approximately r = 0.66. Explain what this correlation tells us about the strength and direction of the association between the variables.

    Answer: A correlation of r = 0.66 indicates that the relationship between social media engagement score and average onsite spend is moderately strong and positive.

  2. Identifying Variables

    Question: Which variable is the explanatory variable, and which is the response variable for this linear regression model?

    • Explanatory Variable (x-axis): Social media engagement score.
    • Response Variable (y-axis): Average onsite spend.

  3. Y-Intercept Interpretation

    Question: Interpret the y-intercept of the linear regression model in context.

    Answer: When the social media engagement score is 0, the average onsite spend is predicted to be $33.40.

  4. Using the Regression Equation for Prediction

    Question: Show how you would use the regression equation to calculate the predicted onsite spend for a festival attendee with a social media engagement score of 50. (You do NOT need to work this out.)

    Calculation: Predicted Average Onsite Spend = 33.40 + 1.67 * (50)

  5. R-Squared Interpretation (R² = 0.548)

    Question: The R² (R-squared) value for this linear regression model is 0.548. Interpret this value in context.

    Answer: 54.8% of the variability in the average onsite spend is explained by the social media engagement score (the model).