Statistical Fundamentals and Key Concepts Reference
Hypothesis Testing and P-Values
P-Value Definition
The p-value is the probability of observing a statistic as extreme (or more extreme) as the sample statistic, assuming the null hypothesis (H₀) is true.
Interpretation
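- Large p-value: The data are consistent with H₀ (Null Hypothesis); there is not enough evidence to reject it. This is not proof that H₀ is true.
- Small p-value: Evidence against H₀ and in favor of Hₐ (Alternative Hypothesis).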
Types of Errors
- Type I Error (α): Rejecting H₀ when H₀ is true.
- Type II Error (β): Failing to reject H₀ when H₀ is false.
Study Design Fundamentals
- Sample: A subset of the population that is observed or measured.
- Population: The entire group of interest.
Variables
- Explanatory Variable (x): The factor believed to influence the outcome.
- Response Variable (y): The outcome that is measured.
Study Types
- Observational Study: Researchers observe subjects without assigning treatments.
- Experimental Study: Treatments are assigned randomly to subjects to determine causality.
Causality: Causal relationships can only be inferred from well-designed randomized experiments. Observational studies can show association but not causation.
Descriptive Statistics
Distribution Shape and Measures of Center
- Symmetric: Mean ≈ Median
- Right Skewed: Mean > Median
- Left Skewed: Mean < Median
Robustness to Outliers
- Mean: Sensitive to outliers.
- Median: Robust to outliers.
Z-Score
A Z-score measures the number of standard deviations an observation is from the mean. It quantifies how extreme an observation is compared to the mean.
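As a small illustration of the definition above, the sketch below computes z = (x - mean) / SD. The function name `z_score` and the numbers are hypothetical and chosen only for the example.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / sd

# Hypothetical values: an observation of 90 when the mean is 80 and the SD is 5.
z = z_score(90, mean=80, sd=5)
print(z)  # 2.0, i.e. two standard deviations above the mean
```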
Bootstrap and Randomization Methods
Bootstrap
- Resamples the original sample with replacement.
- Used to estimate variability and construct Confidence Intervals (CI).
- The resulting distribution is centered at the original sample statistic (for example, the sample mean).
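The bootstrap steps above can be sketched as follows. This is a minimal illustration, assuming a percentile-style interval (the notes do not say which interval method is intended). The function name `bootstrap_ci`, its parameters, and the data are hypothetical.

```python
import random
from statistics import mean

def bootstrap_ci(sample, stat=mean, n_resamples=5000, level=0.95, seed=0):
    """Percentile bootstrap interval for a statistic (assumed method)."""
    rng = random.Random(seed)
    n = len(sample)
    # Each resample is drawn WITH replacement and has the same size as the original.
    stats = sorted(stat(rng.choices(sample, k=n)) for _ in range(n_resamples))
    tail = round((1 - level) / 2 * n_resamples)  # resamples cut from each tail
    return stats[tail], stats[n_resamples - tail - 1]

# Hypothetical sample of 10 measurements.
sample = [12, 15, 9, 20, 14, 17, 11, 16, 13, 18]
low, high = bootstrap_ci(sample)
print(low, high)  # interval endpoints; the sample mean (14.5) lies between them
```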
Randomization (Permutation Test)
- Simulates the null distribution.
- The resulting distribution is centered at the H₀ parameter (null value).
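A randomization test can be sketched in the same spirit. The example below assumes a two-group comparison of means with a two-sided p-value, so the null value is a difference of 0. The function name `permutation_test` and the data are hypothetical.

```python
import random
from statistics import mean

def permutation_test(group_a, group_b, n_shuffles=5000, seed=0):
    """Two-sided randomization test for a difference in means (null value 0)."""
    rng = random.Random(seed)
    observed = mean(group_a) - mean(group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_shuffles):
        # Shuffling the labels simulates H0: group membership does not matter.
        rng.shuffle(pooled)
        diff = mean(pooled[:n_a]) - mean(pooled[n_a:])
        if abs(diff) >= abs(observed):
            extreme += 1
    # Proportion of shuffled differences at least as extreme as the observed one.
    return observed, extreme / n_shuffles

# Hypothetical measurements for two groups.
group_a = [23, 25, 28, 30, 27]
group_b = [20, 22, 19, 24, 21]
observed, p_value = permutation_test(group_a, group_b)
print(observed, p_value)
```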
Note on Sample Size: Each bootstrap sample must have the same size as the original sample.
Key Probability Rules
- Joint Probability: P(A ∩ B) = P(A) * P(B | A)
- Total Probability: P(B) = P(A ∩ B) + P(Ā ∩ B)
- Conditional Probability: P(A | B) = P(A ∩ B) / P(B)
- Independence: If A and B are independent, P(A ∩ B) = P(A) * P(B)
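The rules above can be chained together in a short numeric check. All probabilities below are hypothetical values picked for illustration.

```python
import math

# Hypothetical inputs.
p_a = 0.30              # P(A)
p_b_given_a = 0.50      # P(B | A)
p_b_given_not_a = 0.20  # P(B | Ā)

p_a_and_b = p_a * p_b_given_a                # joint probability: 0.15
p_not_a_and_b = (1 - p_a) * p_b_given_not_a  # P(Ā ∩ B): 0.14
p_b = p_a_and_b + p_not_a_and_b              # total probability: 0.29
p_a_given_b = p_a_and_b / p_b                # conditional probability

# Independence would require P(A ∩ B) = P(A) * P(B); here 0.15 != 0.087.
independent = math.isclose(p_a_and_b, p_a * p_b)
print(p_b, p_a_given_b, independent)
```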
Chi-Square Test
Use: Testing for association between two categorical variables.
Formulas
- Expected Counts (E): E = (Row Total * Column Total) / n
- Degrees of Freedom (df): df = (r - 1)(c - 1), where r is the number of rows and c is the number of columns
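The two formulas above can be applied to a small table as a sketch. The test statistic itself is not given in these notes; the code uses the standard form χ² = Σ (O - E)² / E. The function name `chi_square_table` and the counts are hypothetical.

```python
def chi_square_table(observed):
    """Expected counts, degrees of freedom, and chi-square statistic for a two-way table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    # E = (Row Total * Column Total) / n for every cell.
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    # Standard statistic (not stated in the notes): sum of (O - E)^2 / E over all cells.
    stat = sum(
        (o - e) ** 2 / e
        for o_row, e_row in zip(observed, expected)
        for o, e in zip(o_row, e_row)
    )
    return expected, df, stat

# Hypothetical 2x2 table of observed counts.
observed_counts = [[30, 20], [20, 30]]
expected, df, stat = chi_square_table(observed_counts)
print(expected, df, stat)  # [[25.0, 25.0], [25.0, 25.0]] 1 4.0
```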
Hypotheses
- Null Hypothesis (H₀): No association exists between the variables.
- Alternative Hypothesis (Hₐ): An association exists between the variables.
Regression and Correlation Analysis
Simple Linear Regression Model
Equation: ŷ = b₀ + b₁x
- Slope (b₁): The predicted change in the response variable (y) for every one-unit increase in the explanatory variable (x).
- Intercept (b₀): The predicted value of y when x equals 0.
Correlation Coefficient (r)
- Range: -1 ≤ r ≤ 1
- Direction: Indicated by the sign (Positive or Negative).
- Strength: |r| close to 1 indicates a strong linear relationship; |r| close to 0 indicates a weak linear relationship.
Prediction: To predict y, plug the value of x into the regression equation.
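As an illustration of the slope, intercept, correlation, and prediction steps, the sketch below fits a least-squares line by hand. The notes do not state the least-squares formulas; the ones used here (b₁ = Sxy / Sxx, b₀ = ȳ - b₁x̄, r = Sxy / √(Sxx·Syy)) are the standard ones. The function name `fit_line` and the data are hypothetical.

```python
from math import sqrt
from statistics import mean

def fit_line(xs, ys):
    """Least-squares intercept b0, slope b1, and correlation r."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    b1 = sxy / sxx           # slope
    b0 = y_bar - b1 * x_bar  # intercept
    r = sxy / sqrt(sxx * syy)
    return b0, b1, r

# Hypothetical data.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
b0, b1, r = fit_line(xs, ys)

# Prediction: plug x into y-hat = b0 + b1 * x.
y_hat = b0 + b1 * 6
print(b0, b1, r, y_hat)  # approximately 2.2, 0.6, 0.775, 5.8
```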
Discrete Random Variables
Valid Probability Function Conditions
- The probability of any outcome must be between 0 and 1: 0 ≤ P(X=x) ≤ 1
- The sum of all probabilities must equal 1: ∑P(X=x) = 1
Example (Mutually Exclusive Events): P(X=3 or 4) = P(X=3) + P(X=4)
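Both conditions, and the addition rule for mutually exclusive outcomes, can be checked with a few lines. The function name `is_valid_pmf` and the probability table are hypothetical.

```python
import math

def is_valid_pmf(pmf):
    """Check both conditions for a discrete probability function."""
    in_range = all(0 <= p <= 1 for p in pmf.values())
    sums_to_one = math.isclose(sum(pmf.values()), 1.0)
    return in_range and sums_to_one

# Hypothetical distribution of X.
pmf = {1: 0.1, 2: 0.2, 3: 0.3, 4: 0.4}
valid = is_valid_pmf(pmf)

# X = 3 and X = 4 cannot both happen, so their probabilities add.
p_3_or_4 = pmf[3] + pmf[4]
print(valid, p_3_or_4)
```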
Binomial Distribution Conditions (BINS)
- Binary: Only two outcomes (success/failure).
- Independence: Each trial is independent.
- Number: Fixed number (n) of trials.
- Success: Constant probability (p) of success per trial.
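When all four conditions hold, probabilities can be computed with the binomial formula P(X = k) = C(n, k) · pᵏ · (1 - p)ⁿ⁻ᵏ. The formula is not stated in these notes; it is the standard one. The function name `binomial_pmf` and the inputs are hypothetical.

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(X = k) for n independent trials, each with success probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# Hypothetical example: exactly 3 successes in 10 trials with p = 0.5.
prob = binomial_pmf(3, 10, 0.5)
print(prob)  # 0.1171875 (= 120 / 1024)
```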
Confidence Intervals (CI)
Interpretation
“I am 95% confident that the true population parameter lies within this calculated interval.”
Formulas
- CI for Proportion: p̂ ± z* * √[p̂(1 - p̂) / n]
- CI for Mean (t-distribution): x̄ ± t* * (s / √n)
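The two interval formulas can be sketched as below. The critical value z* is taken from the standard normal distribution. Because the Python standard library has no t-distribution, t* is passed in as a value read from a t-table, which is an assumption of this sketch. Function names and data are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

def proportion_ci(successes, n, level=0.95):
    """p-hat ± z* * sqrt(p-hat * (1 - p-hat) / n)."""
    p_hat = successes / n
    z_star = NormalDist().inv_cdf(1 - (1 - level) / 2)  # about 1.96 for 95%
    margin = z_star * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin

def mean_ci(data, t_star):
    """x-bar ± t* * (s / sqrt(n)); t_star comes from a t-table with df = n - 1."""
    n = len(data)
    margin = t_star * stdev(data) / sqrt(n)
    return mean(data) - margin, mean(data) + margin

# Hypothetical data: 60 successes in 100 trials.
prop_low, prop_high = proportion_ci(60, 100)

# Hypothetical sample of 5 values; t* = 2.776 for 95% confidence with df = 4.
mean_low, mean_high = mean_ci([10, 12, 14, 16, 18], t_star=2.776)
print(prop_low, prop_high, mean_low, mean_high)
```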
Essential Statistical Tips
Conditions to Check Before Running Tests
- Proportion Tests: Check that np ≥ 10 and n(1-p) ≥ 10 (Success/Failure Condition).
- T-Tests: Check for approximate normality of the sample data or a large sample size.
- Chi-Square Test: Ensure all expected counts are ≥ 5.
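The two numeric conditions above lend themselves to quick checks in code. This is a minimal sketch with hypothetical function names and inputs; the normality check for t-tests is usually done graphically and is not covered here.

```python
def success_failure_ok(n, p, threshold=10):
    """Success/Failure Condition: np and n(1 - p) are both at least the threshold."""
    return n * p >= threshold and n * (1 - p) >= threshold

def expected_counts_ok(expected, threshold=5):
    """Chi-square condition: every expected count is at least the threshold."""
    return all(e >= threshold for row in expected for e in row)

# Hypothetical inputs.
large_sample = success_failure_ok(100, 0.3)  # 30 and 70: condition met
small_sample = success_failure_ok(20, 0.3)   # 6 and 14: condition not met
table_ok = expected_counts_ok([[25.0, 25.0], [25.0, 25.0]])
print(large_sample, small_sample, table_ok)  # True False True
```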
Hypothesis Test Decision Rule
- If p-value ≤ α (Significance Level), Reject H₀.
- If p-value > α, Fail to Reject H₀.
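The decision rule is a single comparison, sketched here with a hypothetical function name `decide` and an assumed default of α = 0.05.

```python
def decide(p_value, alpha=0.05):
    """Apply the decision rule: reject H0 when p-value <= alpha."""
    return "Reject H0" if p_value <= alpha else "Fail to reject H0"

print(decide(0.03))  # Reject H0
print(decide(0.20))  # Fail to reject H0
```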
Applied Statistics Examples
Correlation Interpretation (r = 0.66)
Question: The correlation between the variables is approximately r = 0.66. Explain what this correlation tells us about the strength and direction of the association between the variables.
Answer: A correlation of r = 0.66 indicates that the relationship between social media engagement score and average onsite spend is moderately strong and positive.
Identifying Variables
Question: Which variable is the explanatory variable, and which is the response variable for this linear regression model?
- Explanatory Variable (x-axis): Social media engagement score.
- Response Variable (y-axis): Average onsite spend.
Y-Intercept Interpretation
Question: Interpret the y-intercept of the linear regression model in context.
Answer: When the social media engagement score is 0, the average onsite spend is predicted to be $33.40.
Using the Regression Equation for Prediction
Question: Show how you would use the regression equation to calculate the predicted onsite spend for a festival attendee with a social media engagement score of 50. (You do NOT need to work this out.)
Calculation:
Predicted Average Onsite Spend = 33.40 + 1.67 * (50)
R-Squared Interpretation (R² = 0.548)
Question: The R² (R-squared) for this linear regression model is 0.548. Interpret this value in context.
Answer: 54.8% of the variability in the average onsite spend is explained by the social media engagement score (the model).
