Statistical Foundations for Data Analysis

Numerical Descriptive Statistics

In statistics, μ (mu) represents the population mean, while x̄ (x-bar) represents the sample mean.

Key Distribution Shapes:

  • Symmetric Distribution: Mean = Median = Mode
  • Right-Skewed Distribution (Positive Skew): Mean > Median > Mode
  • Left-Skewed Distribution (Negative Skew): Mean < Median < Mode

Calculating Standard Deviation:

Population Standard Deviation (σ):

  1. Find the mean (μ) of the data.
  2. For each data point (x), calculate the squared difference from the mean: (x – μ)².
  3. Find the mean of these squared differences (this is the variance).
  4. Take the square root of the variance.

Sample Standard Deviation (s): The calculation is similar to the population standard deviation, but when finding the mean of the squared differences, you divide by (n-1) instead of n, where n is the sample size. This is known as Bessel’s correction.
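
A minimal Python sketch of both calculations, using only the standard library (the data values are made up):

  import math

  data = [4, 8, 6, 5, 3, 7]          # example values (made up)
  n = len(data)
  mean = sum(data) / n

  # Mean of the squared differences -> population variance (divide by n)
  pop_var = sum((x - mean) ** 2 for x in data) / n
  pop_sd = math.sqrt(pop_var)        # population standard deviation (sigma)

  # Bessel's correction: divide by (n - 1) -> sample variance
  sample_var = sum((x - mean) ** 2 for x in data) / (n - 1)
  sample_sd = math.sqrt(sample_var)  # sample standard deviation (s)

  print(pop_sd, sample_sd)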

Probability and Probability Distributions

The sample space (S) is the set of all possible outcomes: S = {O₁, O₂, …, Oₙ}.

Axioms of Probability:

  1. For each outcome Oᵢ, its probability P(Oᵢ) must be between 0 and 1, inclusive: 0 ≤ P(Oᵢ) ≤ 1.
  2. The sum of the probabilities of all possible outcomes in the sample space must equal 1: Σ P(Oᵢ) = 1.

Key Probability Concepts:

  • Union (A ∪ B): The probability that event A occurs, or event B occurs, or both occur.
  • Intersection (A ∩ B): The probability that both event A and event B occur.
  • Complement (Ā): The probability that event A does not occur.
  • Conditional Probability (P(A|B)): The probability that event A occurs given that event B has already occurred.

Important Definitions:

  • Mutually Exclusive Events: Two events are mutually exclusive if they cannot both occur at the same time on any single trial.
  • Independent Events: Two events A and B are independent if the occurrence of one does not affect the probability of the other. Mathematically, P(A|B) = P(A) or P(B|A) = P(B).
  • Complement Rule: P(Ā) = 1 - P(A). This holds because every simple event must belong to either event A or its complement, Ā.

Advanced Topics:

  • Probability Trees: Visual tools used to represent and calculate probabilities for sequential events.
  • Bayes’ Theorem: A formula used to update the probability of a hypothesis as more evidence or information becomes available.
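
In symbols, Bayes’ Theorem is P(A|B) = P(B|A) * P(A) / P(B). A minimal Python sketch of one such update, with made-up probabilities for a hypothetical screening test (the numbers are illustrative only):

  # Hypothetical inputs: 1% prevalence, 95% sensitivity, 10% false-positive rate
  p_disease = 0.01                    # P(A): prior probability of the hypothesis
  p_pos_given_disease = 0.95          # P(B|A): probability of the evidence if A holds
  p_pos_given_healthy = 0.10          # P(B|not A)

  # Total probability of the evidence: P(B) = P(B|A)P(A) + P(B|not A)P(not A)
  p_pos = (p_pos_given_disease * p_disease
           + p_pos_given_healthy * (1 - p_disease))

  # Bayes’ Theorem: posterior probability of A given the evidence B
  p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
  print(round(p_disease_given_pos, 4))   # ~0.0876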

Population Mean / Expected Value:

For a discrete random variable X with possible values x₁, x₂, …, the population mean (or expected value, E[X]) is calculated as the sum of each value multiplied by its probability.
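
In symbols, E[X] = Σ xᵢ · P(xᵢ). A short Python sketch with a made-up distribution:

  # Hypothetical discrete distribution: values and their probabilities
  values = [0, 1, 2, 3]
  probs = [0.1, 0.4, 0.3, 0.2]        # probabilities must sum to 1

  # E[X] = sum of each value times its probability
  expected_value = sum(x * p for x, p in zip(values, probs))
  print(expected_value)               # 1.6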

Continuous Probability, Sampling, & Distributions

Normal Distribution Characteristics:

  • Increasing the mean shifts the curve to the right.
  • Increasing the standard deviation flattens and widens the curve.

Standard Normal Random Variable (Z-score):

The Z-score standardizes a value from a normal distribution, allowing comparison across different datasets:

Z = (X - μ) / σ

Where X is the value, μ is the population mean, and σ is the population standard deviation. Z-scores convert raw values to a common standard scale so that probabilities can be read from a Z-table.

Probability Calculations with Z-scores:

  • P(Z ≥ z) = 1 - P(Z < z) (for right-tail probabilities)
  • P(Z < z) (for left-tail probabilities, directly from Z-table)
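
If SciPy is available, these lookups can be done directly instead of via a printed Z-table; a minimal sketch (the z value is arbitrary):

  from scipy.stats import norm

  z = 1.28                        # example z-score (arbitrary)
  left_tail = norm.cdf(z)         # P(Z < z), what a Z-table gives directly
  right_tail = 1 - norm.cdf(z)    # P(Z >= z) = 1 - P(Z < z)
  print(left_tail, right_tail)    # ~0.8997, ~0.1003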

Sampling Distribution of the Sample Mean:

The Z-score for a sample mean (x̄) is calculated as:

Z = (x̄ - μ) / (σ / √n)

Where x̄ is the sample mean, μ is the population mean, σ is the population standard deviation, and n is the sample size.

Central Limit Theorem (CLT):

If a random sample is drawn from any population, the sampling distribution of the sample mean (x̄) will be approximately normal for a sufficiently large sample size (usually n > 30). As the sample size (n) gets larger, the sampling distribution of x̄ becomes increasingly bell-shaped.
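
A quick NumPy simulation makes the CLT visible: sample means drawn from a strongly skewed (exponential) population still cluster in a bell shape. A sketch with arbitrary simulation sizes:

  import numpy as np

  rng = np.random.default_rng(0)
  n = 40                                   # sample size (> 30)
  num_samples = 10_000                     # number of repeated samples (arbitrary)

  # Draw many samples from a skewed population and record each sample mean
  sample_means = rng.exponential(scale=2.0, size=(num_samples, n)).mean(axis=1)

  # CLT: mean of x-bar is near mu, sd of x-bar is near sigma / sqrt(n)
  print(sample_means.mean())               # close to mu = 2.0
  print(sample_means.std(ddof=1))          # close to 2 / sqrt(40) ~ 0.316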

Types of Errors in Data Collection:

  • Sampling Error:

    Refers to differences between a sample and the population due to the specific observations selected. It is expected to occur when making statements about a population based on a sample. It is the difference between the true (unknown) value of the population average (μ) and the sample estimate (x̄).

  • Non-Sampling Error:

    Errors that arise from mistakes made during data acquisition, from failure to obtain responses, or from improper selection of sample observations. Unlike sampling error, they are not caused by the act of sampling itself.

    1. Errors in Acquisition: Includes recording incorrect responses, inaccurate measurements, faulty equipment, misinterpretation, or wrong answers.
    2. Non-Response Error: A bias that occurs when responses are not obtained from all members of the sample, potentially leading to a sample that is not representative of the target population (biased results). For example, an interviewer being unable to contact a person from the sample.
    3. Selection Bias: Occurs when members of the target population cannot possibly be selected for inclusion in the sample, leading to a non-representative sample.

Calculating X from a Z-score:

To find a specific value (X) given a Z-score, mean, and standard deviation:

X = μ + Z * σ

This is useful for finding a score corresponding to a certain percentile or probability from a Z-table.
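
With SciPy, the percentile lookup and the back-conversion can be combined; a sketch with made-up parameters (the value cutting off the top 10% of a N(100, 15²) distribution):

  from scipy.stats import norm

  mu, sigma = 100, 15             # hypothetical mean and standard deviation

  # Z-score that leaves 10% in the right tail (the 90th percentile)
  z = norm.ppf(0.90)              # ~1.2816

  # X = mu + Z * sigma
  x = mu + z * sigma
  print(x)                        # ~119.2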

Estimation and Hypothesis Testing

Estimators:

  • Point Estimator: Estimates the value of an unknown population parameter using a single value.
  • Interval Estimator: Draws inferences about a population parameter using a range (interval) of values.

Desirable characteristics of estimators include unbiasedness, consistency, and relative efficiency.

Estimating Population Mean (μ) when Population Variance (σ²) is Known:

The Z-statistic for the sample mean is:

Z = (x̄ - μ) / (σ / √n)

Confidence Interval (CI) for μ:

  • Lower Confidence Limit: x̄ - Zα/2 * (σ / √n)
  • Upper Confidence Limit: x̄ + Zα/2 * (σ / √n)

Where (1-α) is the confidence level.

Common Z-values for Confidence Levels:

  • 90% Confidence (α = 0.10): Zα/2 = Z0.05 = 1.645
  • 95% Confidence (α = 0.05): Zα/2 = Z0.025 = 1.96
  • 99% Confidence (α = 0.01): Zα/2 = Z0.005 = 2.575

Note: When constructing a confidence interval with α/2, the alpha value is split equally between the two tails of the distribution on a graph.
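
A minimal sketch of the known-σ confidence interval in Python (SciPy assumed available; the numbers are made up):

  import math
  from scipy.stats import norm

  x_bar, sigma, n = 52.0, 8.0, 64     # hypothetical sample mean, known sigma, n
  alpha = 0.05                        # 95% confidence level

  z_crit = norm.ppf(1 - alpha / 2)    # Z_{alpha/2} = 1.96
  margin = z_crit * sigma / math.sqrt(n)

  print(x_bar - margin, x_bar + margin)   # (50.04, 53.96)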

Estimating Population Mean (μ) when Population Variance is Unknown:

The t-statistic for the sample mean is used:

t = (x̄ - μ) / (s / √n)

Where s is the sample standard deviation and the degrees of freedom (df) = n-1. The t-distribution table is used for critical values.

Confidence Interval (CI) for μ:

  • Lower Confidence Limit: x̄ - tα/2, n-1 * (s / √n)
  • Upper Confidence Limit: x̄ + tα/2, n-1 * (s / √n)
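
The unknown-σ version swaps in s and the t critical value with n-1 degrees of freedom; a sketch with made-up numbers:

  import math
  from scipy.stats import t

  x_bar, s, n = 52.0, 8.0, 25         # hypothetical sample mean, sample sd, n
  alpha = 0.05

  t_crit = t.ppf(1 - alpha / 2, df=n - 1)   # t_{alpha/2, n-1} ~ 2.064
  margin = t_crit * s / math.sqrt(n)

  print(x_bar - margin, x_bar + margin)     # ~(48.70, 55.30)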

Selecting Sample Size for Estimating μ:

To determine the required sample size (n) for estimating the population mean with a specified margin of error (B):

n = ((Zα/2 * σ) / B)²

Where B is the maximum allowable sampling error (the desired precision, or “within B units”).
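
A worked sketch: to estimate μ to within B = 2 units with 95% confidence when σ = 10 (all values hypothetical):

  import math
  from scipy.stats import norm

  sigma, B = 10.0, 2.0           # hypothetical sigma and margin of error
  z = norm.ppf(0.975)            # Z_{0.025} = 1.96 for 95% confidence

  n = (z * sigma / B) ** 2       # n = ((Z_{alpha/2} * sigma) / B)^2
  print(math.ceil(n))            # always round UP to a whole observation: 97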

Hypothesis Testing:

Hypothesis testing involves making inferences about a population parameter based on sample data. There are four possible outcomes:

  1. Correct Decision: Do not reject the null hypothesis (H₀) when H₀ is true.
  2. Type I Error (α): Reject H₀ when H₀ is true. The probability of this error is denoted by alpha (α).
  3. Type II Error (β): Do not reject H₀ when H₀ is false. The probability of this error is denoted by beta (β).
  4. Correct Decision: Reject H₀ when H₀ is false.

Six Steps of Hypothesis Testing:

  1. Formulate Hypotheses: State the null hypothesis (H₀) and the alternative hypothesis (H₁).
  2. Determine Critical Value(s): Find the Z-critical or t-critical value(s) based on the chosen significance level (α for one-tailed tests, α/2 for two-tailed tests).
  3. Calculate Test Statistic: Compute the Z-calculated or t-calculated value from the sample data.
  4. Formulate Decision Rule: Define the conditions for rejecting H₀. For example, reject H₀ if |test-calc| > |critical-value|.
  5. Perform Calculation: Apply the formulas and compute the test statistic.
  6. Draw Conclusion: Based on the decision rule, state whether to reject or not reject H₀, and interpret the findings in the context of the problem.

Test Statistics for Population Mean:

  • Population Variance (σ²) is Known: Z = (x̄ - μ) / (σ / √n)
  • Population Variance (σ²) is Unknown: t = (x̄ - μ) / (s / √n) (with df = n-1)
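
A sketch of the six steps for a two-tailed test of H₀: μ = 50 when σ² is unknown (all numbers are made up):

  import math
  from scipy.stats import t

  # Step 1: H0: mu = 50 vs H1: mu != 50 (two-tailed)
  mu0 = 50.0
  x_bar, s, n = 53.0, 9.0, 36          # hypothetical sample results
  alpha = 0.05

  # Step 2: critical value t_{alpha/2, n-1}
  t_crit = t.ppf(1 - alpha / 2, df=n - 1)       # ~2.030

  # Steps 3 and 5: t = (x_bar - mu0) / (s / sqrt(n))
  t_calc = (x_bar - mu0) / (s / math.sqrt(n))   # 2.0

  # Steps 4 and 6: reject H0 if |t-calc| > t-critical
  print("reject H0" if abs(t_calc) > t_crit else "do not reject H0")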

Simple Linear Regression and Correlation

The simple linear regression model describes the relationship between a dependent variable (y) and an independent variable (x):

y = β₀ + β₁x + ε

The fitted prediction equation omits the error term: ŷ = β̂₀ + β̂₁x.

  • β₀ (beta-naught): The y-intercept, representing the expected value of y when x is 0.
  • β₁ (beta-one): The slope coefficient, indicating the change in y for a one-unit increase in x.
  • ε (epsilon): The error term, representing the difference between the actual and predicted values.
  • ŷ (y-hat): The predicted value of the dependent variable.

Testing the Slope (β₁):

To determine if a linear relationship exists between x and y, we test the significance of the slope coefficient:

  • Null Hypothesis (H₀): β₁ = 0 (No linear relationship)
  • Alternative Hypothesis (H₁): β₁ ≠ 0 (A linear relationship exists)

The test statistic is a t-value:

t = (β̂₁ - β₁) / Sβ̂₁

Where β̂₁ is the estimated slope coefficient, β₁ is the hypothesized slope (usually 0), and Sβ̂₁ is the standard error of the slope estimate. The degrees of freedom (d.f.) = n-2.
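
With SciPy, scipy.stats.linregress fits the line and reports this slope test in one call; a sketch on made-up (x, y) data:

  from scipy.stats import linregress

  # Hypothetical paired observations
  x = [1, 2, 3, 4, 5, 6]
  y = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9]

  fit = linregress(x, y)
  print(fit.intercept, fit.slope)   # estimates of beta0 and beta1
  print(fit.stderr)                 # standard error of the slope estimate
  print(fit.pvalue)                 # p-value for H0: beta1 = 0 (df = n - 2)
  print(fit.rvalue ** 2)            # R-squared (= r² in simple regression)

Note that fit.rvalue is the correlation coefficient r discussed below, and squaring it gives R².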

Coefficient of Determination (R-squared, R²):

R-squared measures the proportion of the variance in the dependent variable that is predictable from the independent variable(s). It indicates the strength of the linear relationship:

  • R² = 1: Perfect fit; 100% of the variation in y is explained by x.
  • R² = 0: No linear relationship, x explains none of the variation in y.

For example, if R² = 0.75, then 75% of the variation in the dependent variable is explained by the model, and the remaining 25% is unexplained.

Correlation Coefficient (r):

The correlation coefficient measures the strength and direction of the linear relationship between two variables. It ranges from -1 to +1:

  • r = -1: Perfect negative linear relationship.
  • r = 1: Perfect positive linear relationship.
  • r = 0: No linear relationship.

The closer the absolute value of r is to 1, the stronger the linear relationship. In multiple regression, the analogous quantity is often reported as “Multiple R”.

Time Series Analysis and Forecasting

Time series data can be decomposed into several components:

  • Trend: A long-term, relatively smooth pattern or direction in the data that persists, usually for more than one year (e.g., consistent growth or decline).
  • Cycle: A wavelike pattern describing long-term behavior, typically lasting for more than one year, often associated with economic cycles (e.g., boom and recession).
  • Seasonal: Exhibits a short-term (less than one year) repetitive pattern tied to a calendar (e.g., quarterly sales peaks, monthly variations).
  • Random Variation (Irregular Component): Unpredictable, unsystematic fluctuations that remain after accounting for trend, cycle, and seasonal components. This component can mask the other, more predictable patterns.

Moving Average:

A smoothing technique used to reduce random variation and highlight trends or cycles. A k-period moving average averages each run of k consecutive data points; for example, the first value of a 3-year moving average is the mean of the first three observations.

  • Too much smoothing may eliminate patterns of interest.
  • Too little smoothing leaves too much random variation, which can disguise real patterns.

Seasonality can often be removed or smoothed using a centered moving average, such as a 5-period moving average (5-PMA).
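
A minimal sketch of a k-period moving average in plain Python (the series is made up):

  # Hypothetical annual series
  series = [12, 15, 11, 18, 20, 17, 22, 25]
  k = 3                                  # window length

  # k-period moving average: mean of each run of k consecutive points
  moving_avg = [sum(series[i - k + 1 : i + 1]) / k
                for i in range(k - 1, len(series))]
  print(moving_avg)                      # first value is the mean of the first 3 points

If pandas is available, pd.Series(series).rolling(k, center=True).mean() gives the centered version used to smooth seasonality.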

Seasonal Index:

A measure of how a particular season (e.g., month, quarter) compares to the average season. It helps quantify the seasonal effect.

Seasonal Index = (Average Ratio for the Season * Number of Seasons) / (Sum of the Average Ratios across All Seasons)

  • An index of 100% indicates an average seasonal effect.
  • An index above 100% indicates a positive seasonal effect (above average).
  • An index below 100% indicates a negative seasonal effect (below average).

Seasonally Adjusted Time Series:

A time series from which the seasonal component has been removed, allowing for better analysis of trend and cyclical components.

Seasonally Adjusted Time Series = Actual Time Series / Seasonal Index
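
A sketch of deseasonalizing quarterly data with given seasonal indices (the series and indices are made up; the indices average to 1.0, i.e. 100%):

  # Hypothetical quarterly sales (two years, Q1..Q4) and seasonal indices
  actual = [120, 90, 80, 150, 126, 95, 84, 158]
  indices = [1.25, 0.90, 0.75, 1.10]     # one index per quarter, averaging 1.0

  # Seasonally adjusted value = actual value / seasonal index for that quarter
  adjusted = [value / indices[i % 4] for i, value in enumerate(actual)]
  print([round(v, 1) for v in adjusted])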

Index Numbers and Economic Indicators

Consumer Price Index (CPI):

Measures the changes in the total price of a basket of goods and services typically purchased by consumers. It is calculated primarily using a variation of the Laspeyres method. Price changes for various expenditure classes are combined with their corresponding weights (based on consumption patterns).

  • Laspeyres Price Index (LPI):
    • Advantage (+): Requires consumption data only for the base period, which is often readily available.
  • Paasche Price Index (PPI):
    • Disadvantage (-): Requires consumption data for the current period, which is often difficult to obtain; for this reason the Paasche form is rarely used for official indices like the CPI.

Simple Price Index:

Compares the price of a single item in the current period (P₁) to its price in a base period (P₀):

Simple Price Index = (P₁ / P₀) * 100

Real Income:

Measures the purchasing power of nominal income, adjusted for inflation:

Real Income = (Nominal Income / Current CPI) * Base CPI

Real GDP:

Measures the value of goods and services produced in an economy, adjusted for price changes (inflation or deflation):

Real GDP = (Nominal GDP / Current CPI) * 100

Purchasing Power:

Indicates the value of money in terms of what it can buy:

Purchasing Power = (Current Value / Current CPI) * 100
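
The formulas above all follow the same deflation pattern; a combined sketch with made-up figures:

  # Hypothetical prices and index values
  p0, p1 = 2.50, 3.10                   # base-period and current price of one item
  simple_price_index = p1 / p0 * 100    # (P1 / P0) * 100 = 124.0

  nominal_income, current_cpi, base_cpi = 60_000, 130.0, 100.0
  real_income = nominal_income / current_cpi * base_cpi    # ~46,153.85

  nominal_gdp = 2_400.0                 # say, in billions
  real_gdp = nominal_gdp / current_cpi * 100               # ~1,846.15

  purchasing_power = 100 / current_cpi * 100               # ~76.92 per 100 units

  print(simple_price_index, real_income, real_gdp, purchasing_power)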

Key Concepts:

  • Indexing: Expressing a value in one period as a percentage of a value in a base period. This can be done on any type of time series.
  • Deflating: Removing the price-change effect from a monetary time series to express values in constant prices. This can only be done on a monetary time series.