Statistical Foundations for Data Analysis
Numerical Descriptive Statistics
In statistics, μ (mu) represents the population mean, while x̄ (x-bar) represents the sample mean.
Key Distribution Shapes:
- Symmetric Distribution: Mean = Median = Mode
- Right-Skewed Distribution (Positive Skew): Mean > Median > Mode
- Left-Skewed Distribution (Negative Skew): Mean < Median < Mode
Calculating Standard Deviation:
Population Standard Deviation (σ):
- Find the mean (μ) of the data.
- For each data point (x), calculate the squared difference from the mean: (x – μ)².
- Find the mean of these squared differences (this is the variance).
- Take the square root of the variance.
Sample Standard Deviation (s): The calculation is similar to the population standard deviation, but when finding the mean of the squared differences, you divide by (n-1) instead of n, where n is the sample size. This is known as Bessel’s correction.
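A minimal Python sketch of both calculations (the data values here are made up for illustration):

```python
import math

data = [4, 8, 6, 5, 3, 7]  # hypothetical sample values

n = len(data)
mean = sum(data) / n

# Sum of squared differences from the mean: Σ(x - mean)²
ss = sum((x - mean) ** 2 for x in data)

pop_sd = math.sqrt(ss / n)         # population: divide by n
samp_sd = math.sqrt(ss / (n - 1))  # sample: divide by n-1 (Bessel's correction)

print(f"Population SD: {pop_sd:.4f}")
print(f"Sample SD:     {samp_sd:.4f}")
```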
Probability and Probability Distributions
The sample space (S) is the set of all possible outcomes: S = {O₁, O₂, …, Oₙ}.
Axioms of Probability:
- For each outcome Oᵢ, its probability P(Oᵢ) must be between 0 and 1, inclusive: 0 ≤ P(Oᵢ) ≤ 1.
- The sum of the probabilities of all possible outcomes in the sample space must equal 1: Σ P(Oᵢ) = 1.
Key Probability Concepts:
- Union (A ∪ B): The probability that event A occurs, or event B occurs, or both occur.
- Intersection (A ∩ B): The probability that both event A and event B occur.
- Complement (Ā): The probability that event A does not occur.
- Conditional Probability (P(A|B)): The probability that event A occurs given that event B has already occurred.
Important Definitions:
- Mutually Exclusive Events: Two events are mutually exclusive if they cannot both occur at the same time on any single trial.
- Independent Events: Two events A and B are independent if the occurrence of one does not affect the probability of the other. Mathematically, P(A|B) = P(A) or P(B|A) = P(B).
- Complement Rule: Every simple event must belong to either event A or its complement, Ā, so P(Ā) = 1 - P(A).
Advanced Topics:
- Probability Trees: Visual tools used to represent and calculate probabilities for sequential events.
- Bayes’ Theorem: A formula used to update the probability of a hypothesis as more evidence or information becomes available.
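Bayes' theorem in its simplest form is P(H|E) = P(E|H) · P(H) / P(E). A small numerical sketch, using hypothetical prevalence and test-accuracy figures:

```python
# Bayes' theorem: P(H|E) = P(E|H) * P(H) / P(E)
# Hypothetical numbers: a disease with 1% prevalence and a test
# that is 95% sensitive with a 10% false positive rate.
p_h = 0.01              # prior: P(disease)
p_e_given_h = 0.95      # sensitivity: P(positive | disease)
p_e_given_not_h = 0.10  # false positive rate: P(positive | no disease)

# Total probability of a positive test (law of total probability)
p_e = p_e_given_h * p_h + p_e_given_not_h * (1 - p_h)

posterior = p_e_given_h * p_h / p_e
print(f"P(disease | positive) = {posterior:.3f}")  # ≈ 0.088
```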
Population Mean / Expected Value:
For a discrete random variable X with possible values x₁, x₂, …, the population mean (or expected value, E[X]) is the sum of each value multiplied by its probability: E[X] = Σ xᵢ · P(xᵢ).
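A minimal sketch of this calculation, with a hypothetical probability distribution:

```python
# E[X] = Σ xᵢ · P(xᵢ) for a discrete random variable
# Hypothetical distribution: number of sales per day
values = [0, 1, 2, 3]
probs  = [0.1, 0.3, 0.4, 0.2]  # must sum to 1

expected_value = sum(x * p for x, p in zip(values, probs))
print(f"E[X] = {expected_value}")  # 0*0.1 + 1*0.3 + 2*0.4 + 3*0.2 = 1.7
```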
Continuous Probability, Sampling, & Distributions
Normal Distribution Characteristics:
- Increasing the mean shifts the curve to the right.
- Increasing the standard deviation flattens and widens the curve.
Standard Normal Random Variable (Z-score):
The Z-score standardizes a value from a normal distribution, allowing comparison across different datasets:
Z = (X - μ) / σ
Where X is the value, μ is the population mean, and σ is the population standard deviation. Z-scores convert raw values on any scale to a standard scale so that probabilities can be read from a Z-table.
Probability Calculations with Z-scores:
P(Z ≥ z) = 1 - P(Z < z)
(for right-tail probabilities)P(Z < z)
(for left-tail probabilities, directly from Z-table)
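A quick sketch of both lookups using scipy.stats.norm in place of a printed Z-table (the values for X, μ, and σ are hypothetical):

```python
from scipy.stats import norm

# Hypothetical values: X = 85 from a population with μ = 70, σ = 10
x, mu, sigma = 85, 70, 10
z = (x - mu) / sigma  # 1.5

left_tail = norm.cdf(z)       # P(Z < z), what a Z-table gives you
right_tail = 1 - norm.cdf(z)  # P(Z >= z); norm.sf(z) is equivalent

print(f"z = {z}, P(Z < z) = {left_tail:.4f}, P(Z >= z) = {right_tail:.4f}")
```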
Sampling Distribution of the Sample Mean:
The Z-score for a sample mean (x̄) is calculated as:
Z = (x̄ - μ) / (σ / √n)
Where x̄ is the sample mean, μ is the population mean, σ is the population standard deviation, and n is the sample size.
Central Limit Theorem (CLT):
If a random sample is drawn from any population, the sampling distribution of the sample mean (x̄) will be approximately normal for a sufficiently large sample size (usually n > 30). As the sample size (n) gets larger, the sampling distribution of x̄ becomes increasingly bell-shaped.
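A short simulation illustrates the CLT; a clearly non-normal (exponential) population is assumed here purely for demonstration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Draw repeated samples from a non-normal population (exponential with
# mean 2 and standard deviation 2) and examine the sample means.
n = 40         # sample size (> 30)
reps = 10_000  # number of repeated samples
sample_means = rng.exponential(scale=2.0, size=(reps, n)).mean(axis=1)

# The sample means should center on the population mean (2.0), with
# standard error ≈ σ/√n = 2/√40 ≈ 0.316, and look roughly bell-shaped.
print(f"mean of x̄: {sample_means.mean():.3f}")
print(f"std of x̄:  {sample_means.std(ddof=1):.3f} (theory: {2/np.sqrt(n):.3f})")
```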
Types of Errors in Data Collection:
- Sampling Error:
Refers to differences between a sample and the population due to the specific observations selected. It is expected to occur when making statements about a population based on a sample. It is the difference between the true (unknown) value of the population average (μ) and the sample estimate (x̄).
- Non-Sampling Error:
Errors that arise from mistakes made during data acquisition or improper selection of sample observations. These errors are not related to the sampling process itself.
- Errors in Acquisition: Includes recording incorrect responses, inaccurate measurements, faulty equipment, misinterpretation, or wrong answers.
- Non-Response Error: A bias that occurs when responses are not obtained from all members of the sample, potentially leading to a sample that is not representative of the target population (biased results). For example, an interviewer being unable to contact a person from the sample.
- Selection Bias: Occurs when members of the target population cannot possibly be selected for inclusion in the sample, leading to a non-representative sample.
Calculating X from a Z-score:
To find a specific value (X) given a Z-score, mean, and standard deviation:
X = μ + Z * σ
This is useful for finding a score corresponding to a certain percentile or probability from a Z-table.
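For example, a sketch that finds the value at the 90th percentile of a hypothetical normal distribution, using scipy's norm.ppf as the inverse Z-table lookup:

```python
from scipy.stats import norm

# Hypothetical: exam scores with μ = 70, σ = 10. What score marks
# the 90th percentile?
mu, sigma = 70, 10

z = norm.ppf(0.90)  # inverse of the Z-table lookup, ≈ 1.2816
x = mu + z * sigma  # X = μ + Z·σ

print(f"90th percentile score ≈ {x:.1f}")  # ≈ 82.8
```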
Estimation and Hypothesis Testing
Estimators:
- Point Estimator: Estimates the value of an unknown population parameter using a single value.
- Interval Estimator: Draws inferences about a population parameter using a range (interval) of values.
Desirable characteristics of estimators include unbiasedness, consistency, and relative efficiency.
Estimating Population Mean (μ) when Population Variance (σ²) is Known:
The Z-statistic for the sample mean is:
Z = (x̄ - μ) / (σ / √n)
Confidence Interval (CI) for μ:
- Lower Confidence Limit:
x̄ - Zα/2 * (σ / √n)
- Upper Confidence Limit:
x̄ + Zα/2 * (σ / √n)
Where (1-α) is the confidence level.
Common Z-values for Confidence Levels:
- 90% Confidence (α = 0.10): Zα/2 = Z0.05 = 1.645
- 95% Confidence (α = 0.05): Zα/2 = Z0.025 = 1.96
- 99% Confidence (α = 0.01): Zα/2 = Z0.005 = 2.575
Note: When constructing a confidence interval with α/2, the alpha value is split equally between the two tails of the distribution on a graph.
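A minimal sketch of this interval, assuming hypothetical values for x̄, σ, and n:

```python
from math import sqrt
from scipy.stats import norm

# Hypothetical sample: x̄ = 52, known σ = 8, n = 36, 95% confidence
x_bar, sigma, n = 52, 8, 36
alpha = 0.05

z_crit = norm.ppf(1 - alpha / 2)   # ≈ 1.96
margin = z_crit * sigma / sqrt(n)  # Zα/2 · σ/√n

print(f"95% CI: ({x_bar - margin:.2f}, {x_bar + margin:.2f})")
# margin = 1.96 * 8/6 ≈ 2.61, so roughly (49.39, 54.61)
```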
Estimating Population Mean (μ) when Population Variance is Unknown:
The t-statistic for the sample mean is used:
t = (x̄ - μ) / (s / √n)
Where s is the sample standard deviation and the degrees of freedom (df) = n-1. The t-distribution table is used for critical values.
Confidence Interval (CI) for μ:
- Lower Confidence Limit:
x̄ - tα/2, n-1 * (s / √n)
- Upper Confidence Limit:
x̄ + tα/2, n-1 * (s / √n)
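The same sketch with the t-distribution, again using hypothetical sample figures:

```python
from math import sqrt
from scipy.stats import t

# Hypothetical sample: x̄ = 52, sample SD s = 8, n = 16, 95% confidence
x_bar, s, n = 52, 8, 16
alpha = 0.05
df = n - 1

t_crit = t.ppf(1 - alpha / 2, df)  # t(0.025, 15) ≈ 2.131
margin = t_crit * s / sqrt(n)

print(f"95% CI: ({x_bar - margin:.2f}, {x_bar + margin:.2f})")
```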
Selecting Sample Size for Estimating μ:
To determine the required sample size (n) for estimating the population mean with a specified margin of error (B):
n = ((Zα/2 * σ) / B)²
Where B is the maximum allowable sampling error (the desired precision, or “within B units”).
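A short sketch of this calculation (σ, B, and the confidence level are assumed for illustration; note that n is always rounded up):

```python
import math
from scipy.stats import norm

# Hypothetical goal: estimate μ to within B = 2 units with 95%
# confidence, assuming σ = 10 from prior studies.
sigma, B, alpha = 10, 2, 0.05

z = norm.ppf(1 - alpha / 2)          # ≈ 1.96
n = math.ceil((z * sigma / B) ** 2)  # round up to guarantee the precision

print(f"Required sample size: n = {n}")  # (1.96*10/2)² = 96.04 → 97
```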
Hypothesis Testing:
Hypothesis testing involves making inferences about a population parameter based on sample data. There are four possible outcomes:
- Correct Decision: Do not reject the null hypothesis (H₀) when H₀ is true.
- Type I Error (α): Reject H₀ when H₀ is true. The probability of this error is denoted by alpha (α).
- Type II Error (β): Do not reject H₀ when H₀ is false. The probability of this error is denoted by beta (β).
- Correct Decision: Reject H₀ when H₀ is false.
Six Steps of Hypothesis Testing:
- Formulate Hypotheses: State the null hypothesis (H₀) and the alternative hypothesis (H₁).
- Determine Critical Value(s): Find the Z-critical or t-critical value(s) based on the chosen significance level (α or α/2 for two-tailed tests).
- Calculate Test Statistic: Compute the Z-calculated or t-calculated value from the sample data.
- Formulate Decision Rule: Define the conditions for rejecting H₀. For example, reject H₀ if |test-calc| > |critical-value|.
- Perform Calculation: Apply the formulas and compute the test statistic.
- Draw Conclusion: Based on the decision rule, state whether to reject or not reject H₀, and interpret the findings in the context of the problem.
Test Statistics for Population Mean:
- Population Variance (σ²) is Known:
Z = (x̄ - μ) / (σ / √n)
- Population Variance (σ²) is Unknown:
t = (x̄ - μ) / (s / √n)
(with df = n-1)
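The six steps above can be sketched in a few lines; the hypotheses and sample figures here are hypothetical, and σ² is taken as unknown so the t-statistic applies:

```python
from math import sqrt
from scipy.stats import t

# Hypothetical two-tailed test: H₀: μ = 50 vs H₁: μ ≠ 50 at α = 0.05,
# with sample results x̄ = 53, s = 8, n = 25 (σ² unknown, so use t).
mu0, x_bar, s, n, alpha = 50, 53, 8, 25, 0.05
df = n - 1

t_calc = (x_bar - mu0) / (s / sqrt(n))  # test statistic
t_crit = t.ppf(1 - alpha / 2, df)       # critical value for α/2

# Decision rule: reject H₀ if |t-calculated| > t-critical
print(f"t-calculated = {t_calc:.3f}, t-critical = ±{t_crit:.3f}")
print("Reject H₀" if abs(t_calc) > t_crit else "Do not reject H₀")
```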
Simple Linear Regression and Correlation
The simple linear regression model describes the relationship between a dependent variable (y) and an independent variable (x):
y = β₀ + β₁x + ε
The fitted (prediction) equation drops the error term: ŷ = β̂₀ + β̂₁x.
- ŷ (y-hat): The predicted value of the dependent variable.
- β₀ (beta-naught): The y-intercept, representing the expected value of y when x is 0.
- β₁ (beta-one): The slope coefficient, indicating the change in y for a one-unit increase in x.
- ε (epsilon): The error term, representing the difference between the actual and predicted values.
Testing the Slope (β₁):
To determine if a linear relationship exists between x and y, we test the significance of the slope coefficient:
- Null Hypothesis (H₀): β₁ = 0 (No linear relationship)
- Alternative Hypothesis (H₁): β₁ ≠ 0 (A linear relationship exists)
The test statistic is a t-value:
t = (β̂₁ - β₁) / Sβ̂₁
Where β̂₁ is the estimated slope coefficient, β₁ is the hypothesized slope (usually 0), and Sβ̂₁ is the standard error of the slope estimate. The degrees of freedom (d.f.) = n-2.
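In practice this test is rarely done by hand; scipy.stats.linregress, for instance, reports the estimated slope, its standard error, and the p-value for H₀: β₁ = 0. A sketch with made-up data:

```python
from scipy import stats

# Hypothetical paired data (x = ad spend, y = sales)
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.1, 3.9, 6.2, 7.8, 9.9, 12.1, 14.2, 15.8]

res = stats.linregress(x, y)

# t = β̂₁ / Sβ̂₁ (hypothesized slope of 0), with df = n - 2
t_calc = res.slope / res.stderr
print(f"slope = {res.slope:.3f}, SE = {res.stderr:.3f}")
print(f"t = {t_calc:.2f}, p-value = {res.pvalue:.4g}")  # two-tailed test of β₁ = 0
print(f"R² = {res.rvalue**2:.4f}")
```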
Coefficient of Determination (R-squared, R²):
R-squared measures the proportion of the variance in the dependent variable that is predictable from the independent variable(s). It indicates the strength of the linear relationship:
- R² = 1: Perfect fit; 100% of the variation in y is explained by x.
- R² = 0: No linear relationship; x explains none of the variation in y.
For example, if R² = 0.75, then 75% of the variation in the dependent variable is explained by the model, and the remaining 25% is unexplained.
Correlation Coefficient (r):
The correlation coefficient measures the strength and direction of the linear relationship between two variables. It ranges from -1 to +1:
- r = -1: Perfect negative linear relationship.
- r = +1: Perfect positive linear relationship.
- r = 0: No linear relationship.
The closer the absolute value of r is to 1, the stronger the linear relationship. In multiple regression, the analogous measure is often reported as "Multiple R".
Time Series Analysis and Forecasting
Time series data can be decomposed into several components:
- Trend: A long-term, relatively smooth pattern or direction in the data that persists, usually for more than one year (e.g., consistent growth or decline).
- Cycle: A wavelike pattern describing long-term behavior, typically lasting for more than one year, often associated with economic cycles (e.g., boom and recession).
- Seasonal: Exhibits a short-term (less than one year) repetitive pattern tied to a calendar (e.g., quarterly sales peaks, monthly variations).
- Random Variation (Irregular Component): Unpredictable, unsystematic fluctuations that remain after accounting for trend, cycle, and seasonal components. Because it is noise, this component can obscure the other, predictable patterns.
Moving Average:
A smoothing technique used to reduce random variation and highlight trends or cycles. A k-period moving average is the average of the current data point and the k-1 preceding points (e.g., the first value of a 3-year moving average is the average of the first three data points).
- Too much smoothing may eliminate patterns of interest.
- Too little smoothing leaves too much random variation, which can disguise real patterns.
Seasonality can often be removed or smoothed using a centered moving average, such as a 5-period moving average (5-PMA).
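A sketch of both a trailing 3-period and a centered 5-period moving average using pandas (the series is invented):

```python
import pandas as pd

# Hypothetical yearly series
sales = pd.Series([12, 15, 11, 18, 20, 17, 23, 25, 21, 28])

# Trailing 3-period moving average: mean of the current
# and two preceding observations
ma3 = sales.rolling(window=3).mean()

# Centered 5-period moving average (5-PMA), used for smoothing
pma5 = sales.rolling(window=5, center=True).mean()

print(pd.DataFrame({"sales": sales, "3-MA": ma3, "5-PMA": pma5}))
```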
Seasonal Index:
A measure of how a particular season (e.g., month, quarter) compares to the average season. It helps quantify the seasonal effect.
Seasonal Index = (Average Ratio for the Season × Number of Seasons) / Sum of All Average Ratios
This normalization makes the indexes average out to 100% across the seasons.
- An index of 100% indicates an average seasonal effect.
- An index above 100% indicates a positive seasonal effect (above average).
- An index below 100% indicates a negative seasonal effect (below average).
Seasonally Adjusted Time Series:
A time series from which the seasonal component has been removed, allowing for better analysis of trend and cyclical components.
Seasonally Adjusted Time Series = Actual Time Series / Seasonal Index
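A minimal sketch of normalizing seasonal indexes and adjusting an observation, with hypothetical quarterly ratios:

```python
# Hypothetical average actual/moving-average ratios for each quarter,
# computed earlier via a centered moving average
avg_ratios = {"Q1": 0.90, "Q2": 1.10, "Q3": 1.25, "Q4": 0.80}

num_seasons = len(avg_ratios)
total = sum(avg_ratios.values())  # 4.05 here; should be ≈ num_seasons

# Normalize so the indexes average to 1 (i.e., 100%)
seasonal_index = {q: r * num_seasons / total for q, r in avg_ratios.items()}

# Seasonally adjust an actual observation by dividing by its index
actual_q3 = 250.0
adjusted_q3 = actual_q3 / seasonal_index["Q3"]
print({q: round(i, 3) for q, i in seasonal_index.items()})
print(f"Q3 seasonally adjusted: {adjusted_q3:.1f}")
```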
Index Numbers and Economic Indicators
Consumer Price Index (CPI):
Measures the changes in the total price of a basket of goods and services typically purchased by consumers. It is calculated primarily using a variation of the Laspeyres method. Price changes for various expenditure classes are combined with their corresponding weights (based on consumption patterns).
- Laspeyres Price Index (LPI): Values the base-period basket at both periods' prices: LPI = (Σ p₁q₀ / Σ p₀q₀) × 100.
- Advantage (+): Requires consumption (quantity) data only for the base period, which is often readily available.
- Paasche Price Index (PPI): Values the current-period basket at both periods' prices: PPI = (Σ p₁q₁ / Σ p₀q₁) × 100.
- Disadvantage (-): Requires consumption data for the current period, which is often difficult to obtain; for this reason it is not typically used for official indices like the CPI.
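A sketch of both indices on a hypothetical three-item basket:

```python
# Hypothetical basket: base-period and current-period prices/quantities
prices_0 = [2.0, 5.0, 1.5]  # p₀
prices_1 = [2.5, 5.5, 2.0]  # p₁
qty_0    = [10, 4, 20]      # q₀ (base-period consumption)
qty_1    = [9, 5, 18]       # q₁ (current-period consumption)

# Laspeyres: values the base-period basket at both periods' prices
lpi = (sum(p1 * q0 for p1, q0 in zip(prices_1, qty_0))
       / sum(p0 * q0 for p0, q0 in zip(prices_0, qty_0)) * 100)

# Paasche: values the current-period basket at both periods' prices
ppi = (sum(p1 * q1 for p1, q1 in zip(prices_1, qty_1))
       / sum(p0 * q1 for p0, q1 in zip(prices_0, qty_1)) * 100)

print(f"Laspeyres: {lpi:.1f}, Paasche: {ppi:.1f}")
```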
Simple Price Index:
Compares the price of a single item in the current period (P₁) to its price in a base period (P₀):
Simple Price Index = (P₁ / P₀) * 100
Real Income:
Measures the purchasing power of nominal income, adjusted for inflation:
Real Income = (Nominal Income / Current CPI) * Base CPI
Real GDP:
Measures the value of goods and services produced in an economy, adjusted for price changes (inflation or deflation):
Real GDP = (Nominal GDP / Current CPI) * 100
Purchasing Power:
Indicates the value of money in terms of what it can buy:
Purchasing Power = (Current Value / Current CPI) * 100
Key Concepts:
- Indexing: Expressing a value in one period as a percentage of a value in a base period. This can be done on any type of time series.
- Deflating: Removing the price-change effect from a monetary time series to express values in constant prices. This can only be done on a monetary time series.
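A closing sketch showing both operations on a hypothetical nominal-wage series:

```python
# Hypothetical nominal wages and CPI values over three years
nominal = [40000, 42000, 44000]
cpi     = [100.0, 104.0, 109.0]  # base-year CPI = 100

# Indexing: express each wage as a percentage of the base-year wage
indexed = [w / nominal[0] * 100 for w in nominal]

# Deflating: convert nominal wages to constant (base-year) dollars
real = [w / c * 100 for w, c in zip(nominal, cpi)]

print("Indexed:", [f"{v:.1f}" for v in indexed])
print("Real:   ", [f"{v:.0f}" for v in real])
```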