Statistical Foundations for Data Analysis
PPDAC Cycle: Data Problem-Solving
Problem: Clearly define your research question.
Plan: Choose a sampling method and variables.
Data: Collect and clean data (e.g., remove errors, handle missing values).
Analysis: Use EDA (plots & statistics) and model relationships (e.g., regression).
Conclusion: Answer your research question. Be cautious about generalizing!
Essential Sampling Methods
Method | Description | Pros | Cons |
---|---|---|---|
Simple Random | Each unit has an equal chance of selection (like a lucky draw) | Unbiased | Needs a full list of the population (sampling frame) |
Systematic | Pick every kth unit after a random start | Simple to do | Biased if the list has a repeating pattern that lines up with k |
Stratified | Divide into groups (strata) and randomly sample from each | More representative | Need info about groups |
Cluster | Randomly choose entire clusters and include every unit in them | Cheaper and faster | Less precise if clusters differ a lot from one another |
Tip: Always prefer Probability Sampling when generalizing to the population!
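The four methods in the table can be sketched in a few lines of Python. This is a minimal illustration, not a survey-grade implementation: the population of unit IDs 1 to 100, the two strata, the ten clusters, and all function names are hypothetical choices made for the example.

```python
import random

def simple_random_sample(units, n, rng):
    """Every unit has the same chance of being selected."""
    return rng.sample(units, n)

def systematic_sample(units, k, rng):
    """Pick every k-th unit after a random start within the first k units."""
    start = rng.randrange(k)
    return units[start::k]

def stratified_sample(strata, n_per_stratum, rng):
    """Draw a simple random sample of fixed size from each stratum."""
    sample = []
    for members in strata.values():
        sample.extend(rng.sample(members, n_per_stratum))
    return sample

def cluster_sample(clusters, n_clusters, rng):
    """Randomly choose whole clusters and keep every unit inside them."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [unit for name in chosen for unit in clusters[name]]

rng = random.Random(0)            # fixed seed so the sketch is reproducible
population = list(range(1, 101))  # hypothetical sampling frame: unit IDs 1..100

srs = simple_random_sample(population, 10, rng)
systematic = systematic_sample(population, 10, rng)

# Hypothetical strata: the lower and upper half of the IDs.
strata = {"low": population[:50], "high": population[50:]}
stratified = stratified_sample(strata, 5, rng)

# Hypothetical clusters: ten blocks of ten consecutive IDs.
clusters = {f"c{i}": population[i * 10:(i + 1) * 10] for i in range(10)}
clustered = cluster_sample(clusters, 2, rng)
```

Note that the stratified sample is guaranteed to contain units from both strata, while the cluster sample may miss large parts of the population, which is the precision trade-off listed in the table.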
Common Biases in Data Collection
Selection Bias: Sampling method misses important groups (e.g., only interviewing shoppers in malls).
Non-response Bias: Some groups do not reply (e.g., sensitive surveys), making the sample misleading.
Understanding Data Variables
Type | Examples | Note |
---|---|---|
Categorical – Nominal | Gender, Race | No natural order |
Categorical – Ordinal | Satisfaction rating (Low, Medium, High) | Has a logical order |
Numerical – Discrete | Number of children | Countable whole numbers |
Numerical – Continuous | Height, Weight | Measured, can take any value within a range |
Independent Variable (X): What you change or categorize.
Dependent Variable (Y): What you measure.
Key Summary Statistics
Central Tendency Measures
Mean: Sum of values ÷ number of values. (Affected by outliers)
Median: Middle value when sorted. (Resistant to outliers)
Mode: Most frequent value.
Measures of Data Spread
Variance (s²): Sum of squared differences from the mean ÷ (n − 1) for a sample. (For a population, divide by N and write σ².)
Standard Deviation (s): Square root of variance.
Interquartile Range (IQR): Difference between Q3 and Q1 (middle 50% spread).
Tip: Use Median + IQR when data is skewed or has outliers!
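A short sketch of these measures using Python's standard `statistics` module. The data values are hypothetical, with one deliberate outlier to show why the median and IQR are preferred for skewed data. `statistics.variance` and `statistics.stdev` use the sample (n − 1) formulas, and quartile conventions differ between textbooks and software, so Q1 and Q3 may not match a hand calculation exactly.

```python
import statistics

data = [2, 3, 3, 4, 5, 6, 40]  # hypothetical values; 40 is an outlier

mean = statistics.mean(data)      # 9: pulled upward by the outlier
median = statistics.median(data)  # 4: unaffected by the outlier
mode = statistics.mode(data)      # 3: the most frequent value

s2 = statistics.variance(data)    # sample variance (divides by n - 1)
s = statistics.stdev(data)        # sample standard deviation, sqrt of s2

# Quartiles with the module's default ("exclusive") method.
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1                     # spread of the middle 50%

print(f"mean={mean}, median={median}, mode={mode}")
print(f"variance={s2:.2f}, sd={s:.2f}, IQR={iqr}")
```

Here the mean (9) is larger than six of the seven values, while the median (4) still describes a typical value.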
Types of Study Designs
Experimental Study: Researcher controls the independent variable (e.g., drug vs. placebo).
Observational Study: Observe without interfering (e.g., studying coffee drinkers’ health naturally).
Important: Well-designed experiments (with random assignment) can support causal conclusions. Observational studies can only show association, because confounders may explain the link.
Advanced Statistical Concepts & Tips
Essential Statistical Formulas
Probability Formulas
Conditional Probability:
P(A|B) = P(A and B) / P(B)
(Probability of A given B has occurred.)
Independence Rule:
P(A and B) = P(A) × P(B)
(If A and B are independent.)
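Both formulas can be checked by counting outcomes. The sketch below assumes two fair dice (36 equally likely outcomes); the events A, B, and C are hypothetical examples chosen for illustration.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely outcomes of rolling two fair dice.
outcomes = list(product(range(1, 7), repeat=2))

def prob(event):
    """Probability of an event = favourable outcomes / all outcomes."""
    favourable = sum(1 for o in outcomes if event(o))
    return Fraction(favourable, len(outcomes))

def A(o): return o[0] == 6           # first die shows 6
def B(o): return o[0] + o[1] >= 10   # total is at least 10
def C(o): return o[1] % 2 == 0       # second die is even

# Conditional probability: P(A|B) = P(A and B) / P(B)
p_a_and_b = prob(lambda o: A(o) and B(o))
p_a_given_b = p_a_and_b / prob(B)

# Independence check: does P(A and X) equal P(A) * P(X)?
a_b_independent = p_a_and_b == prob(A) * prob(B)
a_c_independent = prob(lambda o: A(o) and C(o)) == prob(A) * prob(C)

print(p_a_given_b)      # 1/2
print(a_b_independent)  # False: knowing B changes the chance of A
print(a_c_independent)  # True: the second die tells us nothing about the first
```

P(A) is 1/6 on its own but rises to 1/2 once B is known, so A and B are not independent.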
Confidence Interval (CI)
CI = Estimate ± Margin of Error
Interpretation: “We are 95% confident that the true population parameter lies within this range.” (About 95% of intervals built this way from repeated samples would contain the true value.)
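As a sketch of how the margin of error might be computed for a sample mean: the example below assumes a 95% level and uses the normal critical value (about 1.96), which is an approximation suited to larger samples; with a small sample a t critical value would give a wider interval. The simulated heights and all variable names are hypothetical.

```python
import math
import random
import statistics

rng = random.Random(42)  # fixed seed for reproducibility
# Hypothetical sample: 40 simulated heights in cm.
sample = [rng.gauss(170, 8) for _ in range(40)]

n = len(sample)
estimate = statistics.mean(sample)              # point estimate
se = statistics.stdev(sample) / math.sqrt(n)    # standard error of the mean

# Critical value for a two-sided 95% interval under the normal approximation.
z = statistics.NormalDist().inv_cdf(0.975)

margin = z * se                                 # margin of error
ci = (estimate - margin, estimate + margin)     # estimate ± margin of error

print(f"estimate = {estimate:.2f}, margin of error = {margin:.2f}")
print(f"95% CI = ({ci[0]:.2f}, {ci[1]:.2f})")
```

A larger sample shrinks the standard error and therefore the margin of error.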
Hypothesis Testing Basics
Null Hypothesis (H₀): No effect, no difference.
Alternative Hypothesis (H₁): There is an effect/difference.
Decision Rule: Reject H₀ if p-value < α (the significance level, commonly 0.05). Otherwise, fail to reject H₀.
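One way to see where a p-value comes from is an exact permutation test, sketched below. It is only one of many possible tests and is used here because it needs no distributional formulas. H₀ says the group labels make no difference, so every way of splitting the pooled values into two groups is equally likely. The two small groups of scores are hypothetical.

```python
from itertools import combinations
from statistics import mean

treatment = [12, 14, 15, 16]  # hypothetical scores, treated group
control = [8, 9, 10, 11]      # hypothetical scores, control group

observed = mean(treatment) - mean(control)  # observed difference in means
pooled = treatment + control
n_t = len(treatment)

count = 0  # splits at least as extreme as the observed one
total = 0  # all possible splits
for idx in combinations(range(len(pooled)), n_t):
    group_a = [pooled[i] for i in idx]
    group_b = [pooled[i] for i in range(len(pooled)) if i not in idx]
    diff = mean(group_a) - mean(group_b)
    # Two-sided test: compare absolute differences (small tolerance for floats).
    if abs(diff) >= abs(observed) - 1e-12:
        count += 1
    total += 1

p_value = count / total
alpha = 0.05
reject_h0 = p_value < alpha

print(f"p-value = {count}/{total} = {p_value:.4f}, reject H0: {reject_h0}")
```

Only 2 of the 70 possible splits are as extreme as the observed one, so the p-value is 2/70 ≈ 0.029 and H₀ is rejected at the 0.05 level.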
Linear Regression Equation
y = a + b×x
a = Intercept (predicted value of y when x = 0)
b = Slope (predicted change in y for a 1-unit increase in x)
Correlation Coefficient (r)
Ranges from -1 (perfect negative) through 0 (no linear relation) to +1 (perfect positive).
Only measures linear relationships!
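The slope, intercept, and r can be computed directly from sums of squares. The sketch below uses the usual least-squares formulas; the data sets and the function names `fit_line` and `pearson_r` are hypothetical. The second data set shows why r only captures linear relationships: y depends perfectly on x, yet r is 0.

```python
import math

def sums_of_squares(x, y):
    """Return (Sxx, Syy, Sxy) around the means of x and y."""
    x_bar = sum(x) / len(x)
    y_bar = sum(y) / len(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return sxx, syy, sxy

def fit_line(x, y):
    """Least-squares line y = a + b*x; returns (a, b)."""
    sxx, _, sxy = sums_of_squares(x, y)
    b = sxy / sxx                             # slope
    a = sum(y) / len(y) - b * sum(x) / len(x) # intercept
    return a, b

def pearson_r(x, y):
    """Correlation coefficient r = Sxy / sqrt(Sxx * Syy)."""
    sxx, syy, sxy = sums_of_squares(x, y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical, roughly linear data.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
a, b = fit_line(x, y)
r = pearson_r(x, y)
print(f"y = {a:.2f} + {b:.2f}x, r = {r:.3f}")

# A perfect but non-linear relationship: y = x squared.
x_curve = [-2, -1, 0, 1, 2]
y_curve = [xi ** 2 for xi in x_curve]
r_curve = pearson_r(x_curve, y_curve)
print(f"r for y = x^2: {r_curve}")
```

For the first data set the fitted slope is about 1.99, so y is predicted to rise by roughly 2 for each 1-unit increase in x.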
Data Visualization Charts
Chart Type | Purpose |
---|---|
Bar Plot | Compare categories |
Histogram | Show distribution (shape, center, spread) |
Boxplot | Show medians, quartiles, outliers |
Scatter Plot | Show relationships between 2 numerical variables |
Tip: Always draw plots first during Exploratory Data Analysis (EDA)!
Common Statistical Fallacies
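Prosecutor’s Fallacy: Confusing P(A|B) with P(B|A), e.g., treating P(evidence | innocent) as if it were P(innocent | evidence). In general, P(A|B) ≠ P(B|A).
Base Rate Fallacy: Ignoring how common something is in the overall population when judging a specific case.
Conjunction Fallacy: Believing a specific combination of conditions is more likely than one of those conditions alone. It never is: P(A and B) ≤ P(A).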
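A small worked example shows how the first two fallacies arise with a screening test. All three input rates below are hypothetical numbers chosen for illustration; the calculation itself is Bayes' rule, which follows from the conditional probability formula.

```python
from fractions import Fraction

# Hypothetical rates for a screening test.
prevalence = Fraction(1, 100)          # P(condition): the base rate
sensitivity = Fraction(95, 100)        # P(positive | condition)
false_positive_rate = Fraction(5, 100) # P(positive | no condition)

# Total probability of a positive result.
p_positive = (sensitivity * prevalence
              + false_positive_rate * (1 - prevalence))

# Bayes' rule: P(condition | positive)
p_condition_given_positive = sensitivity * prevalence / p_positive

print(p_condition_given_positive)         # 19/118
print(float(p_condition_given_positive))  # about 0.16
```

P(positive | condition) is 95%, but P(condition | positive) is only about 16%, because the condition is rare. Swapping the two is the prosecutor's fallacy, and forgetting the 1% prevalence is the base rate fallacy.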
Data Analysis Problem-Solving Tips
Check if it’s population or sample: Use the correct notation and formula (e.g., x̄ and s with n − 1 for a sample; μ and σ with N for a population).
Sampling method matters: Prefer probability sampling.
Use the mean if the distribution is symmetric; use the median if it is skewed or has outliers.
Watch for confounders: Especially in observational studies.
Always define H₀ and H₁ properly before testing hypotheses.
Check linearity before using correlation or regression.
Graphs first, then statistics! Plots often reveal hidden insights.
Answer interpretation questions carefully: Always relate back to the context.