Foundational Statistical Concepts for Data Analysis

Types of Research Questions

  • Making an Estimate About the Population

    • What is the average number of hours that students study each week?
    • What proportion of all Singaporean students is enrolled in a university?
  • Testing a Claim About the Population

    • Does the majority of students qualify for student loans?
    • Is the average course load for a university student greater than 20 units?
  • Comparing Two Sub-Populations or Investigating a Relationship Between Two Variables in the Population

    • In University X, do female students have a higher GPA score than male students?
    • Does drinking coffee help students pass the math exam?

Sampling Methods in Statistics

  • Simple Random Sampling

    • Units are selected randomly from the sampling frame by a random number generator.
    • Sample results still vary from sample to sample, but not haphazardly: the variability is due to chance alone.

    Advantages

    • The sample tends to be a good representation of the population.

    Disadvantages

    • Subject to non-response (individuals may choose to opt out of the study).
    • Possible limited access to information as selected individuals may be located in different geographical locations.
  • Systematic Sampling

    • A method of selecting units from a list through the application of a selection interval K, so that every Kth unit on the list, following a random start, is included in the sample.

    Advantages

    • More straightforward and simpler selection process than simple random sampling.
    • Does not require knowing the exact population size at the planning stage.

    Disadvantages

    • May not be representative of the population if the sampling list is non-random.
  • Stratified Sampling

    • The population is broken down into strata, where units within each stratum are similar in nature, though stratum sizes may vary.
    • A simple random sample is then employed from every stratum.

    Advantages

    • Able to get a representative sample from every stratum.

    Disadvantages

    • Quite complicated and time-consuming.
    • Requires information about the sampling frame and strata, which can be hard to define.
  • Cluster Sampling

    • Breaking down the population into clusters, then randomly sampling a fixed number of clusters.
    • All observations from the selected clusters are then included.

    Advantages

    • Less tedious, costly, and time-consuming as opposed to other sampling methods (e.g., stratified sampling).

    Disadvantages

    • High variability due to dissimilar clusters or small numbers of clusters.
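The four sampling methods above can be sketched with Python's standard random module. This is a minimal illustration on a made-up sampling frame of 100 student IDs; the strata labels and cluster boundaries are hypothetical.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible
population = list(range(1, 101))  # hypothetical sampling frame of 100 student IDs

# Simple random sampling: 10 units chosen uniformly without replacement
srs = random.sample(population, 10)

# Systematic sampling: every K-th unit on the list after a random start
K = 10
start = random.randrange(K)        # random start in [0, K)
systematic = population[start::K]  # yields exactly 10 units here

# Stratified sampling: split the frame into strata, then SRS within each stratum
strata = {"year1": population[:50], "year2": population[50:]}
stratified = [unit for s in strata.values() for unit in random.sample(s, 5)]

# Cluster sampling: partition the frame into 10 clusters of 10,
# randomly select 2 clusters, and keep every unit in the selected clusters
clusters = [population[i:i + 10] for i in range(0, 100, 10)]
cluster_sample = [unit for c in random.sample(clusters, 2) for unit in c]

# srs, systematic, and stratified each contain 10 units; cluster_sample contains 20
```

Note how cluster sampling keeps whole clusters (hence its lower cost), while stratified sampling samples within every stratum (hence its representativeness).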

Criteria for Generalizability

  • Good sampling frame.
  • Probability-based sampling employed.
  • Large sample size considered.
  • Minimum non-response.

Understanding Data Types and Variables

  • Categorical Variables

    • Take category or label values (each observation is placed into exactly one category).
    • Example: Smoking status.
    • Categorical Ordinal

      • Categories come with a natural ordering, and numbers may be used to represent that ordering.
      • Example: Happiness index rated 0-10 in order of increasing happiness.
    • Categorical Nominal

      • No intrinsic ordering among the categories.
      • Example: Eye color.
  • Numerical Variables

    • Take numerical values for which arithmetic operations such as adding and averaging make sense.
    • Example: Age and mass.
    • Numerical Discrete

      • Possible values of the variable form a set of numbers with ‘gaps’.
      • Example: Population count.
    • Numerical Continuous

      • Can take on all possible numerical values within a given range or interval.
      • Example: Time, length.

Descriptive Statistics: Measures & Properties

  • Properties of the Mean

    • Adding a constant value to all data points changes the mean by that constant value.
    • Multiplying a constant number c to all data points will result in the mean also being multiplied by c.

    Limitations of the Mean

    • The mean alone does not tell us how the values are distributed across the n data points.
    • It does not tell us how frequently each value of the numerical variable occurs.
  • Properties of the Median

    • Adding a constant value to all data points changes the median by that constant value.
    • Multiplying all data points by a constant value, c, results in the median being multiplied by c.

    Limitations of the Median

    • Median alone does not tell us about the total value, the frequency of occurrence, or the distribution of data points of the numerical data (similar to the mean).
    • Knowing the median of subgroups does not tell us anything about the overall median, apart from the fact that it must be between the medians of the subgroups (unlike the mean, for which a weighted average can be calculated).
  • Properties of the Standard Deviation

    • Always non-negative with the same units as the numerical variable.
    • Adding a constant value, c, to all data points does not change the standard deviation.
    • Multiplying all data points by a constant value c results in the standard deviation being multiplied by the absolute value of c.
  • Interquartile Range (IQR)

    • The difference between the third and first quartile.
    • A small IQR value means that the middle 50% of the data values have a narrow spread, and vice versa.

    Properties of IQR

    • It is always non-negative.
    • Adding a constant value to all data points does not change the IQR.
    • Multiplying all data points by a constant value c results in the IQR being multiplied by the absolute value of c.
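These shift-and-scale properties of the mean, median, standard deviation, and IQR can be verified directly with Python's standard statistics module. The data set is made up; quartiles follow statistics.quantiles' default ("exclusive") method, so exact IQR values may differ from hand methods, but the properties hold either way.

```python
import statistics as st

data = [2, 4, 4, 4, 5, 5, 7, 9]  # small hypothetical data set

def iqr(xs):
    """Interquartile range: Q3 - Q1."""
    q1, _, q3 = st.quantiles(xs, n=4)
    return q3 - q1

shifted = [x + 10 for x in data]   # add a constant c = 10 to every point
scaled  = [x * -3 for x in data]   # multiply every point by c = -3

# Adding a constant shifts the mean and median but leaves SD and IQR unchanged
assert st.mean(shifted) == st.mean(data) + 10
assert st.median(shifted) == st.median(data) + 10
assert abs(st.stdev(shifted) - st.stdev(data)) < 1e-9
assert iqr(shifted) == iqr(data)

# Multiplying by c scales the mean and median by c, and SD and IQR by |c| = 3
assert st.mean(scaled) == -3 * st.mean(data)
assert st.median(scaled) == -3 * st.median(data)
assert abs(st.stdev(scaled) - 3 * st.stdev(data)) < 1e-9
assert iqr(scaled) == 3 * iqr(data)
```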

Experimental Study

  • Primary goal is to provide evidence for a cause-and-effect relationship.

Rates, Associations, and Simpson’s Paradox

  • Marginal Rate

    • Example (counts from a worked 2×2 example with groups X and Y, 1050 subjects in total): Rate(Y) = 350/1050 = 33.33%, or Rate(success) = 831/1050 = 79.1%.
  • Conditional Rate

    • Rate(success | X) = 542/700 = 77.4%.
  • Joint Rate

    • Rate(Y and failure) = 61/1050 = 5.81%.
    • Note: This is not a conditional rate.
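The three kinds of rates can be computed from a 2×2 table of counts. The table below is reconstructed to be consistent with the worked rates in this section (542 + 289 = 831 successes, 350 in group Y, 1050 in total); the group and outcome labels are illustrative.

```python
# Reconstructed 2x2 table: group (X or Y) vs. outcome (success or failure)
table = {
    ("X", "success"): 542, ("X", "failure"): 158,
    ("Y", "success"): 289, ("Y", "failure"): 61,
}
total = sum(table.values())  # 1050

# Marginal rate: Rate(Y) = size of group Y / grand total
rate_Y = (table[("Y", "success")] + table[("Y", "failure")]) / total

# Conditional rate: Rate(success | X) = X's successes / X's total
x_total = table[("X", "success")] + table[("X", "failure")]
rate_success_given_X = table[("X", "success")] / x_total

# Joint rate: Rate(Y and failure) = that single cell / grand total
rate_Y_and_failure = table[("Y", "failure")] / total

print(round(rate_Y * 100, 2), round(rate_success_given_X * 100, 1),
      round(rate_Y_and_failure * 100, 2))
# prints: 33.33 77.4 5.81
```

The distinction to notice: a conditional rate divides by a subgroup total, while marginal and joint rates divide by the grand total.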

Understanding Association

  • Association Absent

    • Rate(A|B) = Rate(A|NB)
    • The rate of A is not affected by the presence or absence of B; hence, A and B are not associated.
  • Association Present

    • Rate(A|B) > Rate(A|NB)
    • The rate of A is higher when B is present than when B is absent; hence, there is a positive association between A and B.
    • Rate(A|B) < Rate(A|NB)
    • The rate of A is lower when B is present than when B is absent; hence, there is a negative association between A and B.

Rules of Rate

  1. Symmetry Rule

    • Rate(A|B) > Rate(A|NB) ⇔ Rate(B|A) > Rate(B|NA)
    • Rate(A|B) < Rate(A|NB) ⇔ Rate(B|A) < Rate(B|NA)
    • Rate(A|B) = Rate(A|NB) ⇔ Rate(B|A) = Rate(B|NA)
  2. Basic Rule of Rate

    • The overall Rate(A) will always lie between Rate(A|B) and Rate(A|NB).
  3. The closer Rate(B) is to 100%, the closer Rate(A) is to Rate(A|B).
  4. If Rate(B) = 50%, then Rate(A) = {Rate(A|B) + Rate(A|NB)}/2.
  5. If Rate(A|B) = Rate(A|NB), then Rate(A) = Rate(A|B) = Rate(A|NB).
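Rules 2-4 follow from the fact that the overall rate is a weighted average of the two conditional rates: Rate(A) = Rate(A|B)·Rate(B) + Rate(A|NB)·Rate(NB). A quick check with made-up conditional rates:

```python
# Hypothetical conditional rates
rate_A_given_B = 0.80
rate_A_given_NB = 0.40

for rate_B in (0.10, 0.50, 0.90):
    # Basic rule of rate: overall Rate(A) is a weighted average of the two
    rate_A = rate_A_given_B * rate_B + rate_A_given_NB * (1 - rate_B)
    assert rate_A_given_NB <= rate_A <= rate_A_given_B       # rule 2
    print(f"Rate(B)={rate_B:.0%}  Rate(A)={rate_A:.2f}")

# Rule 4: when Rate(B) = 50%, Rate(A) is the simple average of the two
simple_avg = (rate_A_given_B + rate_A_given_NB) / 2
assert abs((0.80 * 0.5 + 0.40 * 0.5) - simple_avg) < 1e-12
```

As Rate(B) grows from 10% to 90%, Rate(A) moves from near Rate(A|NB) toward Rate(A|B), which is rule 3.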

Simpson’s Paradox

  • A phenomenon in which a trend appears in several groups of data but disappears or reverses when the groups are combined.
  • Occurs when the association observed within the individual subgroups runs opposite to the association seen once the subgroups are combined.
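A classic numeric illustration is the well-known kidney-stone treatment data: treatment A beats treatment B within each stone-size subgroup, yet loses once the subgroups are aggregated, because stone size confounds the comparison.

```python
# (treatment, stone size) -> (successes, total), from the classic example
counts = {
    ("A", "small"): (81, 87),   ("A", "large"): (192, 263),
    ("B", "small"): (234, 270), ("B", "large"): (55, 80),
}

def rate(successes, total):
    return successes / total

# Within each subgroup, treatment A has the higher success rate
for size in ("small", "large"):
    assert rate(*counts[("A", size)]) > rate(*counts[("B", size)])

# ...but after aggregating over stone size, treatment B looks better
overall_A = rate(81 + 192, 87 + 263)   # 273/350, about 78.0%
overall_B = rate(234 + 55, 270 + 80)   # 289/350, about 82.6%
assert overall_B > overall_A
```

The reversal happens because A was given mostly to the harder (large-stone) cases: slicing the data by the confounder recovers the subgroup-level picture.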

Confounders

  • A third variable that is associated with both the independent and the dependent variable whose relationship we are investigating.
  • Simpson’s paradox implies the presence of confounders, but the converse is NOT TRUE.
  • Slicing is a method to control confounders (randomization is not always possible in studies).

Data Visualization Techniques

Univariate Data Visualization

  • Histograms

    • Give a general idea of the distribution's shape, spread/variance, and outliers, and of the relative positions of the mean, median, and mode depending on the direction of skew.
  • Box Plots

    • Gives a general idea of IQR, median, obvious outliers, and estimated spread/skew (e.g., max-median, median-min).
    • Does NOT provide information on exact data points, proportions, averages, or standard deviations.

Bivariate Data Visualization

  • Scatterplots and Regression Lines

    • Used to visualize: Direction (positive/negative/neither), Form (linear/nonlinear), and Strength (how tightly the points cluster around the form; for linear forms, summarized by the r-value).

R-Value: Linear Correlation Coefficient

  • A measure of linear correlation; closer to +1 or -1 means stronger correlation.
  • Not affected by adding or subtracting constants to all X and Y values (e.g., +6 to all X and -3 from all Y).
  • Affected if you multiply X OR Y by -1 (flips r-value sign), but NOT affected if you multiply both by -1.
  • Not affected by multiplying all values of one or both variables by a positive constant.

Regression Line Properties

  • Draw a regression line using the least squares method, following Y = mX + b.
  • m = regression slope = (SDY/SDX)r.
  • The regression line predicts the average value of Y for a given X, not the exact Y of any individual, because the relationship is statistical (an association) rather than deterministic; likewise a line fitted to predict Y from X cannot simply be inverted to predict X from Y.
  • The regression line always passes through the point (mean of X, mean of Y); if this point lies above the line y = x, the mean of Y exceeds the mean of X, although individual points may still fall below y = x.
  • Cannot predict beyond the observed X or Y range.

Key Probability Concepts

  • Independent Events

    • If A and B are independent, P(A∩B) = P(A)P(B).
    • Equivalently, P(A|B) = P(A) when P(B) > 0. Example: when drawing socks with replacement, the colour of the second draw is independent of the first.
  • Mutually Exclusive Events

    • If A and B are mutually exclusive, P(A∪B) = P(A) + P(B) (e.g., when adding different outcomes in a probability tree).
    • Mutually exclusive events (with nonzero probabilities) cannot be independent, since P(A∩B) = 0 ≠ P(A)P(B).
    • For any two events, P(A∪B) = P(A) + P(B) – P(A∩B) (General Addition Rule); the mutually exclusive case follows because P(A∩B) = 0.
  • Notation: ∩ = ‘and’ (overlap), ∪ = ‘or’ (union of A and B).
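The rules above can be checked exactly with fractions. The sock-bag composition and the die-roll events are made up for illustration.

```python
from fractions import Fraction

# Drawing WITH replacement from a bag of 3 green and 2 red socks (hypothetical)
p_green = Fraction(3, 5)
p_red = Fraction(2, 5)

# Independence: P(green first AND red second) = P(green) * P(red)
p_green_then_red = p_green * p_red
assert p_green_then_red == Fraction(6, 25)

# Mutually exclusive outcomes of ONE draw: a sock cannot be both colours,
# so P(green OR red) = P(green) + P(red)
assert p_green + p_red == 1

# General Addition Rule with overlapping events: one fair die roll,
# A = "even" {2, 4, 6}, B = "at most 3" {1, 2, 3}, so A∩B = {2}
pA, pB, pAB = Fraction(3, 6), Fraction(3, 6), Fraction(1, 6)
p_union = pA + pB - pAB
assert p_union == Fraction(5, 6)  # A∪B = {1, 2, 3, 4, 6}
```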

Normal Distribution

  • Notation: X ~ N(mean, variance), where variance = SD². Therefore, if variance = 4, then SD = 2.
  • Normal Distribution curve: Symmetrical about the mean, with its peak at the mean, so mean = median = mode.
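The standard library's statistics.NormalDist makes these facts easy to confirm; the parameters below (mean 10, variance 4, so SD 2) are an arbitrary example.

```python
from statistics import NormalDist

# X ~ N(mean = 10, variance = 4), so sigma = sqrt(4) = 2
X = NormalDist(mu=10, sigma=2)

# Symmetry about the mean: half the distribution lies below it,
# and the mean coincides with the median
assert abs(X.cdf(10) - 0.5) < 1e-9
assert X.median == X.mean == 10

# About 95% of values lie within 2 SDs of the mean
p_within_2sd = X.cdf(10 + 2 * 2) - X.cdf(10 - 2 * 2)
assert 0.95 < p_within_2sd < 0.96
```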

Statistical Inference: Confidence Intervals

Confidence Intervals

  • Formula: Sample Mean ± Margin of Error.
  • A larger sample size results in a smaller margin of error and hence a narrower interval.
  • Increasing the confidence level (e.g., from 95% to 99%) results in a wider interval. Decreasing the confidence level (e.g., from 99% to 95%) results in a narrower interval.
  • Interpretation: 95% confident means that if the same sampling method were repeated many times, approximately 95 out of 100 constructed intervals would contain the true population mean.
  • Correct Statement: We are 95% confident that the true population mean is between [lower bound] and [upper bound].
  • Incorrect Interpretations:
    • It is NOT the probability that the interval contains the sample mean (every constructed interval contains its own sample mean by construction).
    • It is NOT a 95% chance that the population mean is within the interval (the population mean is fixed; it’s either in or out).
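A large-sample 95% confidence interval follows the formula sample mean ± margin of error, with the margin z·s/√n. This sketch uses a made-up sample of n = 30 and the normal critical value (for smaller samples, a t critical value would be used instead).

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

sample = [6.2, 7.1, 5.8, 6.9, 7.4, 6.0, 6.6, 7.2, 6.8, 6.5,
          7.0, 6.3, 6.7, 7.3, 6.1, 6.4, 6.9, 7.1, 6.2, 6.8,
          6.6, 7.0, 6.5, 6.9, 6.3, 7.2, 6.7, 6.1, 6.8, 6.4]  # hypothetical, n = 30

n = len(sample)
z = NormalDist().inv_cdf(0.975)        # about 1.96 for a 95% confidence level
margin = z * stdev(sample) / sqrt(n)   # margin of error
ci = (mean(sample) - margin, mean(sample) + margin)
print(f"95% CI: ({ci[0]:.2f}, {ci[1]:.2f})")

# A 99% interval uses a larger critical value, so it is wider
z99 = NormalDist().inv_cdf(0.995)      # about 2.576
assert z99 > z
```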

Statistical Inference: Hypothesis Testing

Hypothesis Testing Fundamentals

  • Null Hypothesis (H₀)

    • Assumes the current state or no effect.
    • Example: A coin is not biased (P(Heads) = 0.5), or the average age of children is 8.
  • Alternative Hypothesis (H₁)

    • What we are trying to prove.
    • Example: A coin is biased (P(Heads) ≠ 0.5), or the average age of children is not 8 (μ ≠ 8).
  • P-Value Interpretation

    • The probability of observing an outcome at least as extreme as the one obtained, in the direction of the alternative hypothesis, assuming the null hypothesis is true.
  • Decision Rules for Hypothesis Testing

    • If p-value < significance level (α), we reject the null hypothesis in favor of the alternative hypothesis (conclude there is sufficient evidence for the alternative hypothesis).
    • If p-value ≥ significance level (α), we fail to reject the null hypothesis; the test is inconclusive (there is insufficient evidence for the alternative hypothesis).
  • Note: Hypothesis testing is not needed for a census.
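The decision rule can be made concrete with an exact binomial computation for the coin example. The observed result (60 heads in 100 flips) is made up; under H₀, each count k of heads has probability C(100, k)/2¹⁰⁰.

```python
from math import comb

# H0: the coin is fair, P(Heads) = 0.5. Observed: 60 heads in 100 flips.
n, observed = 100, 60

def prob_heads(k, n):
    """P(exactly k heads in n fair flips)."""
    return comb(n, k) / 2 ** n

# Two-sided p-value: "equally or more extreme" means k >= 60 or k <= 40
p_value = sum(prob_heads(k, n) for k in range(n + 1)
              if k >= observed or k <= n - observed)
print(f"p-value = {p_value:.4f}")  # about 0.0569

alpha = 0.05
assert p_value > alpha  # cannot reject H0 at the 5% significance level
```

With α = 0.05 the test is inconclusive here: 60 heads out of 100 is not quite extreme enough to reject fairness.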

T-Test

  • Mainly used when testing for a significant difference between a sample mean and a known/hypothesized mean.
  • The population distribution should be approximately normal when the sample size is small (n < 30); for larger samples this requirement is relaxed.
  • Data is acquired randomly.
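The one-sample t statistic is t = (sample mean − hypothesized mean) / (s/√n). A sketch on a made-up small sample of course loads, testing H₀: population mean = 20 units; the p-value would then come from the t distribution with n − 1 degrees of freedom (e.g. via scipy.stats.ttest_1samp, not shown here to keep the sketch standard-library only).

```python
from math import sqrt
from statistics import mean, stdev

sample = [21, 23, 19, 24, 22, 20, 25, 23]  # hypothetical course loads, n = 8
mu0 = 20                                   # hypothesized population mean

n = len(sample)
# t = (sample mean - mu0) / (sample SD / sqrt(n))
t = (mean(sample) - mu0) / (stdev(sample) / sqrt(n))
print(f"t = {t:.2f} with {n - 1} degrees of freedom")
```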

Chi-Square Test

  • Mainly used when testing for a significant association between two categorical variables.
  • Data consists of counts for categories of categorical variables.
  • Data is acquired randomly.
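The chi-square statistic compares observed counts with the counts expected under "no association" (expected = row total × column total / grand total). The 2×2 table below is made-up data for the coffee-and-exam question; for 1 degree of freedom, the 5% critical value is 3.841.

```python
# Rows = coffee drinker (yes/no), columns = passed exam (yes/no); hypothetical counts
observed = [[30, 10],
            [20, 20]]

row_totals = [sum(row) for row in observed]        # [40, 40]
col_totals = [sum(col) for col in zip(*observed)]  # [50, 30]
grand = sum(row_totals)                            # 80

def expected(i, j):
    """Expected count in cell (i, j) under no association."""
    return row_totals[i] * col_totals[j] / grand

# Chi-square statistic: sum of (observed - expected)^2 / expected over all cells
chi_sq = sum((observed[i][j] - expected(i, j)) ** 2 / expected(i, j)
             for i in range(2) for j in range(2))
print(f"chi-square = {chi_sq:.2f}")

# Compare with the 5% critical value for 1 degree of freedom
assert chi_sq > 3.841  # evidence of association at the 5% level in this example
```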