Statistical Foundations for Data Analysis

PPDAC Cycle: Data Problem-Solving

  • Problem: Clearly define your research question.

  • Plan: Choose a sampling method and variables.

  • Data: Collect and clean data (e.g., remove errors, handle missing values).

  • Analysis: Use Exploratory Data Analysis (EDA: plots & statistics) and model relationships (e.g., regression).

  • Conclusion: Answer your research question. Be cautious about generalizing!


Essential Sampling Methods

Method | Description | Pros | Cons
Simple Random | Each unit has equal chance (like a lucky draw) | Unbiased | May need full list of population
Systematic | Pick every kth unit after a random start | Simple to do | Risk of hidden patterns
Stratified | Divide into groups (strata) and sample each | More representative | Need info about groups
Cluster | Randomly choose entire clusters | Cheaper and faster | High variability risk

Tip: Always prefer Probability Sampling when generalizing to the population!
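
The four methods in the table can be sketched with Python's standard random module. This is a minimal illustration, not a survey-grade implementation: the population of unit IDs 1 to 100, the strata "A" and "B", the ten clusters, and all function names are hypothetical.

```python
import random

def simple_random(population, n, rng):
    """Every unit has the same chance of being selected."""
    return rng.sample(population, n)

def systematic(population, k, rng):
    """Random start within the first k units, then every k-th unit."""
    start = rng.randrange(k)
    return population[start::k]

def stratified(strata, n_per_stratum, rng):
    """Draw a simple random sample inside each stratum."""
    return {name: rng.sample(units, n_per_stratum)
            for name, units in strata.items()}

def cluster(clusters, n_clusters, rng):
    """Pick whole clusters at random and keep every unit in them."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [unit for name in chosen for unit in clusters[name]]

rng = random.Random(42)           # fixed seed so the sketch is repeatable
population = list(range(1, 101))  # hypothetical unit IDs 1..100
strata = {"A": population[:60], "B": population[60:]}
clusters = {f"c{i}": population[i * 10:(i + 1) * 10] for i in range(10)}

srs = simple_random(population, 10, rng)
sys_sample = systematic(population, 10, rng)
strat = stratified(strata, 5, rng)
clus = cluster(clusters, 2, rng)
```

Note that the stratified sketch takes the same number of units from each stratum; sampling in proportion to stratum size is another common choice.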


Common Biases in Data Collection

  • Selection Bias: Sampling method misses important groups (e.g., only interviewing shoppers in malls).

  • Non-response Bias: Some groups do not reply (e.g., sensitive surveys), making the sample misleading.


Understanding Data Variables

Type | Examples | Note
Categorical – Nominal | Gender, Race | No natural order
Categorical – Ordinal | Satisfaction rating (Low, Medium, High) | Has a logical order
Numerical – Discrete | Number of children | Countable whole numbers
Numerical – Continuous | Height, Weight | Infinite possibilities

  • Independent Variable (X): What you change or categorize.

  • Dependent Variable (Y): What you measure.


Key Summary Statistics

Central Tendency Measures

  • Mean: Sum of values ÷ number of values. (Affected by outliers)

  • Median: Middle value when sorted. (Resistant to outliers)

  • Mode: Most frequent value.

Measures of Data Spread

  • Variance (s²): Sum of squared differences from the mean ÷ (n − 1) for a sample. (For a whole population, divide by N instead.)

  • Standard Deviation (s): Square root of variance.

  • Interquartile Range (IQR): Difference between Q3 and Q1 (middle 50% spread).

Tip: Use Median + IQR when data is skewed or has outliers!
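
A short sketch of these measures using Python's standard statistics module. The data set is hypothetical and includes one outlier (40) to show why the median and IQR are the safer summary here. Quartile conventions differ between tools; this uses the module's default ("exclusive") method.

```python
import statistics

data = [2, 3, 3, 4, 5, 6, 40]  # hypothetical sample; 40 is an outlier

mean = statistics.mean(data)          # pulled upward by the outlier
median = statistics.median(data)      # middle value, unaffected by 40
mode = statistics.mode(data)          # most frequent value
variance = statistics.variance(data)  # sample variance, divides by n - 1
std_dev = statistics.stdev(data)      # square root of the sample variance

# Quartiles with the module's default "exclusive" method
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

print(f"mean={mean}, median={median}, mode={mode}")
print(f"variance={variance:.2f}, sd={std_dev:.2f}, IQR={iqr}")
```

The mean comes out at 9 while the median is 4, so a single extreme value more than doubles the "typical" figure if the mean is reported alone.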


Types of Study Designs

  • Experimental Study: Researcher controls the independent variable (e.g., drug vs. placebo).

  • Observational Study: Observe without interfering (e.g., studying coffee drinkers’ health naturally).

Important: Well-designed experiments (with random assignment) can support causal conclusions. Observational studies can only show association, because confounders may explain the link.


Advanced Statistical Concepts & Tips


Essential Statistical Formulas

Probability Formulas

  • Conditional Probability:

    P(A|B) = P(A and B) / P(B)

    (Probability of A given B has occurred.)

  • Independence Rule:

    P(A and B) = P(A) × P(B)

    (If A and B are independent.)
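
Both formulas can be checked by counting outcomes. The sketch below uses a standard 52-card deck as an assumed example (it is not from the notes above): A = "king", B = "face card", C = "heart". Exact fractions avoid rounding issues.

```python
from fractions import Fraction
from itertools import product

ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["hearts", "diamonds", "clubs", "spades"]
deck = list(product(ranks, suits))  # 52 equally likely (rank, suit) pairs

def prob(event):
    """Probability of an event = favourable cards / all cards."""
    return Fraction(sum(1 for card in deck if event(card)), len(deck))

def is_king(card):
    return card[0] == "K"

def is_face(card):
    return card[0] in {"J", "Q", "K"}

def is_heart(card):
    return card[1] == "hearts"

# Conditional probability: P(king | face) = P(king and face) / P(face)
p_king_given_face = prob(lambda c: is_king(c) and is_face(c)) / prob(is_face)

# Independence check: does P(A and B) equal P(A) x P(B)?
king_heart_independent = (
    prob(lambda c: is_king(c) and is_heart(c)) == prob(is_king) * prob(is_heart)
)
king_face_independent = (
    prob(lambda c: is_king(c) and is_face(c)) == prob(is_king) * prob(is_face)
)
```

Here "king" and "heart" satisfy the independence rule, while "king" and "face card" do not: knowing the card is a face card raises the chance of a king from 1/13 to 1/3.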


Confidence Interval (CI)

  • CI = Estimate ± Margin of Error

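  • Interpretation: “We are 95% confident that the true population parameter lies within this range.” (The 95% describes the method: about 95% of intervals built this way would contain the true value.)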
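
As a worked example of Estimate ± Margin of Error, here is a minimal sketch for a sample mean. It assumes the normal approximation (margin = z × s / √n), which is only one common way to get the margin; for small samples a t critical value would give a somewhat wider interval. The sample values and the function name are hypothetical.

```python
import math
import statistics

def mean_confidence_interval(sample, confidence=0.95):
    """Normal-approximation CI for a mean: estimate +/- z * s / sqrt(n)."""
    n = len(sample)
    estimate = statistics.mean(sample)
    standard_error = statistics.stdev(sample) / math.sqrt(n)
    # z is about 1.96 for 95% confidence
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)
    margin = z * standard_error
    return estimate - margin, estimate + margin

sample = [168, 172, 165, 180, 175, 170, 169, 174, 171, 176]  # hypothetical heights (cm)
low, high = mean_confidence_interval(sample)
print(f"95% CI for the mean: ({low:.1f}, {high:.1f})")
```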


Hypothesis Testing Basics

  • Null Hypothesis (H₀): No effect, no difference.

  • Alternative Hypothesis (H₁): There is an effect/difference.

  • Decision Rule: Reject H₀ if p-value < α, the significance level chosen in advance (commonly 0.05). Otherwise, fail to reject H₀.
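
To make the decision rule concrete, the sketch below computes a p-value with an exact permutation test for a difference in means. The notes do not name a specific test, so this choice is an assumption made because it needs no distribution tables; the two groups and the function name are hypothetical. H₀: the group labels make no difference. H₁: the means differ.

```python
from itertools import combinations
from statistics import mean

def permutation_p_value(group_a, group_b):
    """Exact two-sided permutation test for a difference in means."""
    pooled = group_a + group_b
    observed = abs(mean(group_a) - mean(group_b))
    extreme = total = 0
    # Try every way of relabelling len(group_a) of the pooled values as "A"
    for idx in combinations(range(len(pooled)), len(group_a)):
        chosen = set(idx)
        a = [pooled[i] for i in chosen]
        b = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        total += 1
        # Small tolerance guards against floating-point ties
        if abs(mean(a) - mean(b)) >= observed - 1e-12:
            extreme += 1
    return extreme / total

treatment = [10, 11, 12, 13, 14]  # hypothetical measurements
control = [1, 2, 3, 4, 5]

alpha = 0.05
p_value = permutation_p_value(treatment, control)
reject_h0 = p_value < alpha
print(f"p-value = {p_value:.4f}, reject H0: {reject_h0}")
```

Enumerating every relabelling is only practical for small samples; with larger groups a random subset of relabellings is typically used instead.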


Linear Regression Equation

  • y = a + b×x

    • a = Intercept (starting point when x=0)

    • b = Slope (how much y changes for a 1-unit increase in x)
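
A minimal sketch of how a and b can be estimated, assuming ordinary least squares (the usual fitting method, though the notes above do not name one): b = Sxy / Sxx and a = ȳ − b × x̄. The study-hours data and the function name are hypothetical.

```python
from statistics import mean

def fit_line(xs, ys):
    """Least-squares intercept a and slope b for y = a + b*x."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx          # change in y per 1-unit increase in x
    a = y_bar - b * x_bar  # value of y when x = 0
    return a, b

hours = [1, 2, 3, 4, 5]        # hypothetical study hours (x)
scores = [52, 55, 61, 64, 68]  # hypothetical test scores (y)

a, b = fit_line(hours, scores)
predicted = a + b * 6  # prediction just outside the observed range
print(f"y = {a:.1f} + {b:.1f}x, predicted score at 6 hours: {predicted:.1f}")
```

Predictions far outside the observed x range (extrapolation) should be treated with caution, since the linear pattern may not continue.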

Correlation Coefficient (r)

  • Ranges from -1 (perfect negative) through 0 (no linear relation) to +1 (perfect positive).

  • Only measures linear relationships!
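
The sketch below computes Pearson's r from its definition and illustrates the warning above: a perfect but curved relationship (y = x²) can still give r = 0. The data points and the function name are made up for illustration.

```python
import math
from statistics import mean

def pearson_r(xs, ys):
    """Pearson correlation: Sxy / sqrt(Sxx * Syy)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

xs = [-2, -1, 0, 1, 2]
r_linear = pearson_r(xs, [2 * x + 1 for x in xs])  # points on a rising line
r_curved = pearson_r(xs, [x ** 2 for x in xs])     # perfect U-shape, not linear
print(f"linear: r = {r_linear:.2f}, curved: r = {r_curved:.2f}")
```

This is why a scatter plot should be checked before trusting r: a value near 0 rules out a linear pattern only, not a relationship in general.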


Data Visualization Charts

Chart Type | Purpose
Bar Plot | Compare categories
Histogram | Show distribution (shape, center, spread)
Boxplot | Show medians, quartiles, outliers
Scatter Plot | Show relationships between 2 numerical variables

Tip: Always draw plots first during Exploratory Data Analysis (EDA)!


Common Statistical Fallacies

  • Prosecutor’s Fallacy: Misinterpreting conditional probabilities (P(A|B) ≠ P(B|A)).

  • Base Rate Fallacy: Ignoring how common something is in the overall population when judging a specific case.

  • Conjunction Fallacy: Believing a specific combination of conditions is more likely than one of those conditions alone (it never is: P(A and B) ≤ P(A)).
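
A numeric illustration of the first two fallacies, applying Bayes' rule (which follows from the conditional probability formula). All numbers are assumed for the example: a condition affecting 1% of people, a test that detects it 99% of the time, and a 5% false positive rate.

```python
def posterior(prior, sensitivity, false_positive_rate):
    """P(condition | positive test) via Bayes' rule."""
    true_positives = prior * sensitivity
    false_positives = (1 - prior) * false_positive_rate
    return true_positives / (true_positives + false_positives)

# Assumed numbers, for illustration only
p_condition_given_positive = posterior(
    prior=0.01,                # base rate: 1% of the population
    sensitivity=0.99,          # P(positive | condition)
    false_positive_rate=0.05,  # P(positive | no condition)
)
print(f"P(condition | positive) = {p_condition_given_positive:.3f}")
```

Even though P(positive | condition) is 0.99, P(condition | positive) is only about 1 in 6, because the low base rate means most positives come from the much larger healthy group. Treating the two probabilities as interchangeable is the prosecutor's fallacy.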


Data Analysis Problem-Solving Tips

  1. Check if it’s population or sample: Use the correct formula and notation (e.g., sample variance s² divides by n − 1; population variance σ² divides by N).

  2. Sampling method matters: Prefer probability sampling.

  3. Use mean if symmetric; median if skewed or outliers exist.

  4. Watch for confounders: Especially in observational studies.

  5. Always define H₀ and H₁ properly before testing hypotheses.

  6. Check linearity before using correlation or regression.

  7. Graphs first, then statistics! Plots often reveal hidden insights.

  8. Answer interpretation questions carefully: Always relate back to the context.