Statistical Analysis Methods and Data Interpretation
Box Plot Analysis for Distribution Comparison
Box plots are used for comparing distributions between groups.
Key Components:
- Median: The line inside the box, marking the middle value (50th percentile) of the data.
- Q1/Q3: The bottom and top of the box, respectively (the 25th and 75th percentiles).
- IQR (Interquartile Range): Calculated as Q3 − Q1, representing the middle 50% of the data.
- Narrow IQR: Indicates low variability and good reproducibility.
- Wide IQR: Indicates high variability.
- Whiskers: Extend from the box to the most extreme values that are not classed as outliers (commonly, values within 1.5 × IQR of Q1 and Q3).
- Outliers: Points outside the whiskers; these may indicate biological variation, error, or contamination.
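The components above can be computed directly from raw data. The following is a minimal sketch, assuming the common 1.5 × IQR rule for whiskers and outliers and the "inclusive" quartile method; other software may use slightly different conventions. The function name and the sample data are hypothetical.

```python
import statistics

def box_plot_summary(values):
    """Return the main box plot components for one group of numbers."""
    # Quartile convention is an assumption; other methods give slightly different Q1/Q3.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    # Assumed convention: fences sit 1.5 * IQR beyond the edges of the box.
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    inside = [v for v in values if lower_fence <= v <= upper_fence]
    outliers = [v for v in values if v < lower_fence or v > upper_fence]
    return {
        "median": median,
        "q1": q1,
        "q3": q3,
        "iqr": iqr,
        "whiskers": (min(inside), max(inside)),  # most extreme non-outlier values
        "outliers": outliers,
    }

# Hypothetical measurements for one group; 30 is a deliberately extreme value.
group = [4, 5, 5, 6, 6, 7, 7, 8, 30]
summary = box_plot_summary(group)
print(summary)
```

For this sample the box runs from 5 to 7 (IQR = 2), the whiskers reach 4 and 8, and 30 is flagged as an outlier.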
Interpreting Overlap:
- Little to no overlap between boxes: The groups likely differ, but a statistical test is still needed to confirm it.
- Large overlap: Any difference is unclear from the plot alone; a statistical test is required.
Template Sentence:
The box plot compares [variable] between [groups]. The median is highest in [group] and lowest in [group], suggesting [group] has the highest typical value. The IQR is [narrow/wide], indicating [low/high] variability. Outliers are [present/absent]. The groups show [little/large] overlap, suggesting [clear/uncertain] differences. A statistical test is needed to confirm significance.
One-Way ANOVA for Comparing Multiple Means
Use a One-Way ANOVA to compare the means of three or more groups.
- H0 (Null Hypothesis): All group means are equal.
- H1 (Alternative Hypothesis): At least one group mean differs.
Key Rules:
- Large F-value: F is the ratio of between-group to within-group variability, so a large F means between-group variability is large relative to within-group variability.
- p < 0.05: Reject H0; there is a significant difference.
- p ≥ 0.05: Fail to reject H0; there is no significant evidence of a difference.
- ***: Denotes a highly significant result (conventionally p < 0.001).
Note: ANOVA indicates that at least one group differs, but not which specific one. If the result is significant, use a Tukey post hoc test.
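To make the F-value concrete, here is a minimal sketch that computes it by hand as (between-group mean square) / (within-group mean square). The function name and data are hypothetical. Turning F into a p-value requires the F distribution, which the Python standard library does not provide; a statistics package would normally be used for that step.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of groups of numbers."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [sum(g) / len(g) for g in groups]

    # Between-group sum of squares: how far each group mean is from the grand mean.
    ss_between = sum(
        len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means)
    )
    # Within-group sum of squares: spread of the values around their own group mean.
    ss_within = sum(
        (x - m) ** 2 for g, m in zip(groups, group_means) for x in g
    )

    df_between = k - 1
    df_within = n_total - k
    f_value = (ss_between / df_between) / (ss_within / df_within)
    return f_value, df_between, df_within

# Hypothetical data: three groups of three observations each.
groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
f_value, df_between, df_within = one_way_anova_f(groups)
print(f_value, df_between, df_within)
```

Here the between-group mean square is 13 and the within-group mean square is 1, so F is about 13 with 2 and 6 degrees of freedom.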
Template Sentence:
A one-way ANOVA was used to test whether [outcome] differs between [groups]. The F-value is [F], showing between-group variability is [greater/not greater] than within-group variability. The p-value is [p], which is [below/above] 0.05, so H0 is [rejected/not rejected]. Therefore, [factor] [has/does not have] a significant effect on [outcome]. If significant, a Tukey test is needed to identify which groups differ.
Linear Regression and Predictive Modeling
Use Linear Regression to test whether one continuous variable predicts another.
- x: Independent variable or predictor.
- y: Dependent variable or response.
- Equation: y = a + bx (where a is the intercept and b is the slope).
Understanding the Slope:
- Positive slope: As x increases, y increases.
- Negative slope: As x increases, y decreases.
- For each 1-unit increase in x, y changes by b units.
Prediction and Extrapolation:
To predict a value, insert the given x into the equation. If x is outside the observed range, this is extrapolation and should be interpreted with caution.
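A minimal sketch of fitting y = a + bx by ordinary least squares and flagging extrapolation is shown here. The helper names and the data are hypothetical, and least squares is assumed as the fitting method since the notes do not name one.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = a + b*x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx            # slope: change in y per 1-unit increase in x
    a = mean_y - b * mean_x  # intercept: predicted y when x = 0
    return a, b

def predict(a, b, x, observed_xs):
    """Return (predicted y, True if x lies outside the observed x range)."""
    is_extrapolation = x < min(observed_xs) or x > max(observed_xs)
    return a + b * x, is_extrapolation

# Hypothetical data.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.0]
a, b = fit_line(xs, ys)

inside = predict(a, b, 3.5, xs)   # x within the observed range 1..5
outside = predict(a, b, 8, xs)    # x beyond the observed range: extrapolation
print(a, b, inside, outside)
```

For this sample the fitted line is approximately y = 0.09 + 1.97x, and the prediction at x = 8 is marked as extrapolation.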
Template Sentence:
Linear regression was used to assess whether [x] predicts [y]. The slope is [positive/negative], meaning that as [x] increases, [y] tends to [increase/decrease]. For each 1-unit increase in [x], the model predicts a change of [b] units in [y].
Correlation Coefficients and R-Squared
Correlation measures the strength and direction of a linear relationship.
- r (Correlation Coefficient):
  - r close to +1: Strong positive correlation.
  - r close to −1: Strong negative correlation.
  - r close to 0: Weak or no linear correlation.
- R² (Coefficient of Determination): Represents the proportion of the variability in y explained by x. For example, an R² of 0.9915 means 99.15% of the variance is explained, while 0.85% is unexplained.
Note: R² does not indicate direction; the slope or r value provides direction. In simple linear regression (one predictor), r = ±√R², taking the sign of the slope.
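The relationship between R², r, and the sign of the slope can be checked numerically. This is a sketch for simple linear regression only, with hypothetical data chosen to have a negative slope; the function name is an assumption for illustration.

```python
import math

def correlation_summary(xs, ys):
    """Return (slope, R squared, r) for a simple linear regression of y on x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)   # total variability in y
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))

    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Unexplained variability: squared distances from the fitted line.
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))

    r_squared = 1 - ss_res / syy
    # R squared has no sign, so r takes its sign from the slope.
    r = math.copysign(math.sqrt(r_squared), slope)
    return slope, r_squared, r

# Hypothetical data with a decreasing trend.
xs = [1, 2, 3, 4, 5]
ys = [10, 8, 7, 4, 1]
slope, r_squared, r = correlation_summary(xs, ys)
print(slope, r_squared, r)
```

Here R² is about 0.968 (96.8% of the variability explained) and, because the slope is negative, r is about −0.984.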
Template Sentence:
The R² value is [R²], meaning [percentage]% of the variability in [y] is explained by [x], while the remaining variability is unexplained. Since the slope is [positive/negative], the correlation is [positive/negative].
t-Values and p-Values in Regression
These values test whether the slope is significantly different from zero.
- H0: Slope = 0 (no linear relationship).
- H1: Slope ≠ 0 (a linear relationship exists).
Decision Rules:
- Large absolute t-value: Strong evidence that the slope is not zero. A larger absolute t usually results in a smaller p.
- p < 0.05: The slope is significantly different from zero.
- p ≥ 0.05: The slope is not significantly different from zero.
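As an illustration of where the t-value comes from, the sketch below computes t = slope / SE(slope) for a simple linear regression, using the standard error formula for a least-squares slope. The function name and data are hypothetical. The p-value comes from a t distribution with n − 2 degrees of freedom, which is not in the Python standard library, so only t and the degrees of freedom are computed here.

```python
import math

def slope_t_value(xs, ys):
    """Return (slope, standard error of slope, t-value, degrees of freedom)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    # Residual variance uses n - 2 because two parameters (a and b) are estimated.
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    df = n - 2
    residual_variance = ss_res / df

    se_slope = math.sqrt(residual_variance / sxx)
    t_value = slope / se_slope  # tests H0: slope = 0
    return slope, se_slope, t_value, df

# Hypothetical data.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.0]
slope, se_slope, t_value, df = slope_t_value(xs, ys)
print(slope, se_slope, t_value, df)
```

For this sample the slope is about 1.97 with a standard error of about 0.055, giving a large t-value (about 35.8) on 3 degrees of freedom.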
Template Sentence:
The t-value tests whether the regression coefficient differs from zero. A large absolute t-value gives stronger evidence of a significant relationship. If p < 0.05, [x] is a significant predictor of [y].
Residuals and Model Diagnostic Plots
A residual is the difference between the observed value and the predicted value (Observed − Predicted).
- Residual close to 0: The prediction is close to the observed value.
- Positive residual: Observed > Predicted.
- Negative residual: Observed < Predicted.
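The sign convention (Observed − Predicted) is easy to get backwards, so a short sketch may help. The function names and the values are hypothetical.

```python
def residuals(observed, predicted):
    """Residual = observed - predicted, one per data point."""
    return [o - p for o, p in zip(observed, predicted)]

def describe(residual, tolerance=1e-9):
    """Translate the sign of a residual into words."""
    if abs(residual) <= tolerance:
        return "observed equals predicted"
    if residual > 0:
        return "observed > predicted"
    return "observed < predicted"

# Hypothetical observed values and model predictions.
observed = [5.0, 7.5, 9.0]
predicted = [5.0, 7.0, 9.5]

res = residuals(observed, predicted)
labels = [describe(r) for r in res]
print(res)
print(labels)
```

Plotting these residuals against the predicted (fitted) values gives the residuals vs. fitted plot.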
Residuals vs. Fitted Plot:
This plot checks if linear regression is appropriate.
- Good: Random scatter around 0, no discernible pattern, and similar spread.
- Bad: A curve indicates a non-linear relationship; a funnel shape indicates unequal variance (heteroscedasticity); distant points indicate outliers.
Template Sentence:
In the residuals vs. fitted plot, residuals should be randomly scattered around zero. If there is no pattern, the linear model is appropriate. A curve, funnel, or extreme points suggest assumption problems.
Normal Q-Q Plot:
This plot checks the normality of residuals.
- Good: Points stay close to the diagonal line, including at the extremes.
- Bad: Points deviate from the line. Deviations at the extremes indicate non-normality, outliers, or heavy tails.
- Upper right of the plot: Points here are the largest positive residuals.
- Lower left of the plot: Points here are the largest negative residuals.
Template Sentence:
In the Q-Q plot, if points follow the diagonal line, residuals are approximately normal. If points deviate strongly, especially at the extremes, residuals may not be normal and outliers or heavy tails may be present.
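The points of a Q-Q plot can be generated without plotting software. The sketch below pairs each sorted residual with the quantile a standard normal distribution would give at the same rank. The plotting position (i − 0.5) / n is an assumed convention (several exist), and the function name and residuals are hypothetical.

```python
from statistics import NormalDist

def qq_points(residuals):
    """Return (theoretical quantile, observed residual) pairs, smallest first."""
    n = len(residuals)
    ordered = sorted(residuals)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    # Assumed plotting positions: (i - 0.5) / n for ranks i = 1..n.
    theoretical = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, ordered))

# Hypothetical residuals; 3.5 is a deliberately large positive residual.
residuals = [-1.2, -0.4, 0.1, 0.3, 0.2, -0.1, 3.5]
points = qq_points(residuals)
for theoretical, observed in points:
    print(round(theoretical, 3), observed)
```

The last pair (upper right of the plot) holds the largest positive residual; if it sits far above the line through the other points, that suggests an outlier or a heavy right tail.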
Shapiro-Wilk Test for Normality:
- H0: Residuals are normally distributed.
- H1: Residuals are not normally distributed.
- p ≥ 0.05: Fail to reject H0; the residuals are compatible with a normal distribution.
- p < 0.05: Reject H0; the residuals deviate significantly from normality.
Odds Ratios and Likelihood Ratios
- Odds Ratio (OR): Odds in the exposed/treatment group divided by the odds in the control group.
  - OR = 1: No association.
  - OR > 1: Higher odds in the exposed group.
  - OR < 1: Lower odds in the exposed group (a possible protective effect).
- Positive Likelihood Ratio (LR+): Sensitivity / (1 − Specificity). When LR+ > 1, a positive test increases the probability of disease; the larger the LR+, the greater the increase.
- Negative Likelihood Ratio (LR−): (1 − Sensitivity) / Specificity. When LR− < 1, a negative test decreases the probability of disease; the closer to 0, the greater the decrease.
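These ratios are simple arithmetic, as the following sketch shows. The 2 × 2 counts, sensitivity, and specificity are hypothetical values, and the function names are assumptions for illustration.

```python
def odds_ratio(exposed_cases, exposed_noncases, control_cases, control_noncases):
    """Odds in the exposed group divided by odds in the control group."""
    odds_exposed = exposed_cases / exposed_noncases
    odds_control = control_cases / control_noncases
    return odds_exposed / odds_control

def likelihood_ratios(sensitivity, specificity):
    """Return (LR+, LR-) from sensitivity and specificity given as proportions."""
    lr_pos = sensitivity / (1 - specificity)
    lr_neg = (1 - sensitivity) / specificity
    return lr_pos, lr_neg

# Hypothetical study: 20 of 100 exposed and 10 of 100 controls develop the outcome.
or_value = odds_ratio(20, 80, 10, 90)

# Hypothetical test with 90% sensitivity and 80% specificity.
lr_pos, lr_neg = likelihood_ratios(0.90, 0.80)
print(or_value, lr_pos, lr_neg)
```

Here the OR is 2.25 (higher odds in the exposed group), LR+ is about 4.5, and LR− is about 0.125.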
t-Tests and Confidence Intervals
t-Test:
Used to compare means between two groups.
- H0: μ1 = μ2.
- H1: μ1 ≠ μ2.
- p < 0.05: Reject H0; the two means differ significantly.
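As a sketch of the test statistic behind the comparison, the code below computes a two-sample t-value. It assumes Welch's form (no equal-variance assumption), which is one of several t-test variants; the notes do not specify which to use. The function name and data are hypothetical, and the p-value step (a t distribution lookup) is left to a statistics package.

```python
import math
import statistics

def welch_t(group1, group2):
    """Two-sample t statistic: difference in means over its standard error."""
    n1, n2 = len(group1), len(group2)
    mean_diff = statistics.mean(group1) - statistics.mean(group2)
    # Standard error of the difference, using each group's own sample variance.
    se = math.sqrt(
        statistics.variance(group1) / n1 + statistics.variance(group2) / n2
    )
    return mean_diff / se

# Hypothetical measurements for two groups.
group1 = [5.0, 6.0, 7.0, 8.0, 9.0]
group2 = [1.0, 2.0, 3.0, 4.0, 5.0]
t_value = welch_t(group1, group2)
print(t_value)
```

The means are 7 and 3 and the standard error of the difference is 1, so t is 4.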
Confidence Interval (CI):
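For large samples, a 95% CI for the mean is calculated as: x̄ ± 1.96 × (s/√n). With small samples, 1.96 is replaced by the critical t-value with n − 1 degrees of freedom, which gives a wider interval.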
- If the CI for a difference includes 0, the result is not significant.
- If the CI for a difference does not include 0, the result is significant.
- If the CI for OR/RR includes 1, the result is not significant.
- If the CI for OR/RR does not include 1, the result is significant.
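The interval formula and the "does the CI include the null value" rules can be sketched as follows. The 1.96 multiplier is the large-sample value from the formula above; the function names, the data, and the example intervals are hypothetical.

```python
import math
import statistics

def mean_ci_95(values):
    """Large-sample 95% CI for the mean: mean +/- 1.96 * s / sqrt(n)."""
    n = len(values)
    mean = statistics.mean(values)
    s = statistics.stdev(values)  # sample standard deviation
    margin = 1.96 * s / math.sqrt(n)
    return mean - margin, mean + margin

def is_significant(ci, null_value):
    """True if the interval excludes the null value (0 for differences, 1 for OR/RR)."""
    low, high = ci
    return not (low <= null_value <= high)

# Hypothetical sample of 36 measurements.
values = [49.0] * 18 + [51.0] * 18
low, high = mean_ci_95(values)
print(low, high)

# Hypothetical intervals reported by a study.
diff_significant = is_significant((0.4, 2.1), 0)  # CI for a difference in means
or_significant = is_significant((0.8, 1.3), 1)    # CI for an odds ratio
print(diff_significant, or_significant)
```

The sample CI is roughly 49.67 to 50.33. The difference interval excludes 0 and is significant, while the odds ratio interval includes 1 and is not.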
The Normal Distribution and Z-Scores
A Normal Curve is symmetric, unimodal, and mesokurtic. The mean, median, and mode are equal, and the total area under the curve is 1.
Empirical Rule:
- μ ± 1 SD ≈ 68%
- μ ± 2 SD ≈ 95%
- μ ± 3 SD ≈ 99.7%
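The empirical rule percentages can be verified from the normal cumulative distribution function. This is a short sketch using Python's standard library; the helper name is hypothetical.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0, sigma=1)

def area_within(k):
    """Area under the standard normal curve between -k and +k standard deviations."""
    return std_normal.cdf(k) - std_normal.cdf(-k)

coverage = {k: area_within(k) for k in (1, 2, 3)}
for k, area in coverage.items():
    print(f"within {k} SD: {area:.4f}")
```

The areas come out at about 0.6827, 0.9545, and 0.9973, matching the rounded 68%, 95%, and 99.7%.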
The Standard Normal Distribution N(0,1) has a mean of 0 and a standard deviation of 1.
Z-Score Calculation:
z = (x − μ) / σ. This represents how many standard deviations x is from the mean.
- z > 0: Above the mean.
- z < 0: Below the mean.
- z = 2: Two standard deviations above the mean.
- z = −2: Two standard deviations below the mean.
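A brief sketch of the z-score calculation follows, with the standard normal distribution used to convert z into the proportion of values below x. The function name and the numbers (mean 100, SD 15) are hypothetical.

```python
from statistics import NormalDist

def z_score(x, mu, sigma):
    """Number of standard deviations that x lies from the mean."""
    return (x - mu) / sigma

# Hypothetical population with mean 100 and standard deviation 15.
z_high = z_score(130, mu=100, sigma=15)  # above the mean
z_low = z_score(70, mu=100, sigma=15)    # below the mean

# Proportion of a normal population expected to fall below x = 130.
proportion_below = NormalDist().cdf(z_high)
print(z_high, z_low, proportion_below)
```

Here 130 is two standard deviations above the mean (z = 2) and 70 is two below (z = −2); about 97.7% of values fall below z = 2.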
Choosing the Right Statistical Test
- 1 Qualitative Variable: Use frequencies and proportions.
- 1 Quantitative Variable: Use mean, median, SD, histograms, or box plots.
- 2 Qualitative Variables: Use a Chi-square test.
- 1 Qualitative + 1 Quantitative Variable: Compare means (t-test for two groups, ANOVA for three or more).
- 2 Quantitative Variables: Use correlation or regression.
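The selection rules above amount to a lookup on the number of qualitative and quantitative variables. The sketch below encodes only the cases listed in these notes; the function name is hypothetical, and real analyses also depend on assumptions such as normality and sample size.

```python
def suggest_method(n_qualitative, n_quantitative):
    """Map counts of qualitative and quantitative variables to a suggested method."""
    table = {
        (1, 0): "frequencies and proportions",
        (0, 1): "mean, median, SD, histogram, or box plot",
        (2, 0): "chi-square test",
        (1, 1): "compare means (t-test or ANOVA)",
        (0, 2): "correlation or regression",
    }
    # Combinations not listed in the notes are reported as not covered.
    return table.get((n_qualitative, n_quantitative), "not covered by this guide")

print(suggest_method(2, 0))
print(suggest_method(1, 1))
print(suggest_method(0, 2))
```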
