Statistical Foundations for Data Analysis
Numerical Descriptive Statistics
In statistics, μ (mu) represents the population mean, while x̄ (x-bar) represents the sample mean.
Key Distribution Shapes:
- Symmetric Distribution: Mean = Median = Mode
- Right-Skewed Distribution (Positive Skew): Mean > Median > Mode
- Left-Skewed Distribution (Negative Skew): Mean < Median < Mode
Calculating Standard Deviation:
Population Standard Deviation (σ):
- Find the mean (μ) of the data.
- For each data point (x), calculate the squared difference from the mean: (x – μ)².
- Find the mean of these squared differences (this is the variance).
- Take the square root of the variance.
Sample Standard Deviation (s): The calculation is similar to the population standard deviation, but when finding the mean of the squared differences, you divide by (n-1) instead of n, where n is the sample size. This is known as Bessel’s correction.
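A minimal Python sketch of both calculations (the data values here are made up for illustration):

```python
import math

data = [4, 8, 6, 5, 3, 7]  # hypothetical sample values

n = len(data)
mean = sum(data) / n

# Sum of squared differences from the mean: Σ(x - mean)²
ss = sum((x - mean) ** 2 for x in data)

pop_sd = math.sqrt(ss / n)         # population: divide by n
samp_sd = math.sqrt(ss / (n - 1))  # sample: divide by n-1 (Bessel's correction)

print(f"Population SD: {pop_sd:.4f}")
print(f"Sample SD:     {samp_sd:.4f}")
```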
Probability and Probability Distributions
The sample space (S) is the set of all possible outcomes: S = {O₁, O₂, …, Oₙ}.
Axioms of Probability:
- For each outcome Oᵢ, its probability P(Oᵢ) must be between 0 and 1, inclusive: 0 ≤ P(Oᵢ) ≤ 1.
- The sum of the probabilities of all possible outcomes in the sample space must equal 1: Σ P(Oᵢ) = 1.
Key Probability Concepts:
- Union (A ∪ B): The probability that event A occurs, or event B occurs, or both occur.
- Intersection (A ∩ B): The probability that both event A and event B occur.
- Complement (Ā): The probability that event A does not occur.
- Conditional Probability (P(A|B)): The probability that event A occurs given that event B has already occurred.
Important Definitions:
- Mutually Exclusive Events: Two events are mutually exclusive if they cannot both occur at the same time on any single trial.
- Independent Events: Two events A and B are independent if the occurrence of one does not affect the probability of the other. Mathematically, P(A|B) = P(A) or P(B|A) = P(B).
- Complement Rule: Every simple event must belong to either event A or its complement, Ā, so P(Ā) = 1 - P(A).
Advanced Topics:
- Probability Trees: Visual tools used to represent and calculate probabilities for sequential events.
- Bayes’ Theorem: A formula used to update the probability of a hypothesis as more evidence or information becomes available.
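Bayes' theorem in its simplest form is P(H|E) = P(E|H) · P(H) / P(E). A small numerical sketch, using hypothetical prevalence and test-accuracy figures:

```python
# Bayes' theorem: P(H|E) = P(E|H) * P(H) / P(E)
# Hypothetical numbers: a disease with 1% prevalence and a test
# that is 95% sensitive with a 10% false positive rate.
p_h = 0.01              # prior: P(disease)
p_e_given_h = 0.95      # sensitivity: P(positive | disease)
p_e_given_not_h = 0.10  # false positive rate: P(positive | no disease)

# Total probability of a positive test (law of total probability)
p_e = p_e_given_h * p_h + p_e_given_not_h * (1 - p_h)

posterior = p_e_given_h * p_h / p_e
print(f"P(disease | positive) = {posterior:.3f}")  # ≈ 0.088
```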
Population Mean / Expected Value:
For a discrete random variable X with possible values x₁, x₂, …, the population mean (or expected value, E[X]) is the sum of each value multiplied by its probability: E[X] = Σ xᵢ · P(xᵢ).
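A minimal sketch of this calculation, with a hypothetical probability distribution:

```python
# E[X] = Σ xᵢ · P(xᵢ) for a discrete random variable
# Hypothetical distribution: number of sales per day
values = [0, 1, 2, 3]
probs  = [0.1, 0.3, 0.4, 0.2]  # must sum to 1

expected_value = sum(x * p for x, p in zip(values, probs))
print(f"E[X] = {expected_value}")  # 0*0.1 + 1*0.3 + 2*0.4 + 3*0.2 = 1.7
```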
Continuous Probability, Sampling, & Distributions
Normal Distribution Characteristics:
- Increasing the mean shifts the curve to the right.
- Increasing the standard deviation flattens and widens the curve.
Standard Normal Random Variable (Z-score):
The Z-score standardizes a value from a normal distribution, allowing comparison across different datasets:
Z = (X - μ) / σ
Where X is the value, μ is the population mean, and σ is the population standard deviation. Z-scores convert raw values on any scale to a standard scale so that probabilities can be read from a Z-table.
Probability Calculations with Z-scores:
P(Z ≥ z) = 1 - P(Z < z)
(for right-tail probabilities)P(Z < z)
(for left-tail probabilities, directly from Z-table)
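A quick sketch of both lookups using scipy.stats.norm in place of a printed Z-table (the values for X, μ, and σ are hypothetical):

```python
from scipy.stats import norm

# Hypothetical values: X = 85 from a population with μ = 70, σ = 10
x, mu, sigma = 85, 70, 10
z = (x - mu) / sigma  # 1.5

left_tail = norm.cdf(z)       # P(Z < z), what a Z-table gives you
right_tail = 1 - norm.cdf(z)  # P(Z >= z); norm.sf(z) is equivalent

print(f"z = {z}, P(Z < z) = {left_tail:.4f}, P(Z >= z) = {right_tail:.4f}")
```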
Sampling Distribution of the Sample Mean:
The Z-score for a sample mean (x̄) is calculated as:
Z = (x̄ - μ) / (σ / √n)
Where x̄ is the sample mean, μ is the population mean, σ is the population standard deviation, and n is the sample size.
Central Limit Theorem (CLT):
If a random sample is drawn from any population, the sampling distribution of the sample mean (x̄) will be approximately normal for a sufficiently large sample size (usually n > 30). As the sample size (n) gets larger, the sampling distribution of x̄ becomes increasingly bell-shaped.
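A short simulation illustrates the CLT; a clearly non-normal (exponential) population is assumed here purely for demonstration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Draw repeated samples from a non-normal population (exponential with
# mean 2 and standard deviation 2) and examine the sample means.
n = 40         # sample size (> 30)
reps = 10_000  # number of repeated samples
sample_means = rng.exponential(scale=2.0, size=(reps, n)).mean(axis=1)

# The sample means should center on the population mean (2.0), with
# standard error ≈ σ/√n = 2/√40 ≈ 0.316, and look roughly bell-shaped.
print(f"mean of x̄: {sample_means.mean():.3f}")
print(f"std of x̄:  {sample_means.std(ddof=1):.3f} (theory: {2/np.sqrt(n):.3f})")
```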
Types of Errors in Data Collection:
- Sampling Error:
Refers to differences between a sample and the population due to the specific observations selected. It is expected to occur when making statements about a population based on a sample. It is the difference between the true (unknown) value of the population average (μ) and the sample estimate (x̄).
- Non-Sampling Error:
Errors that arise from mistakes made during data acquisition or improper selection of sample observations. These errors are not related to the sampling process itself.
- Errors in Acquisition: Includes recording incorrect responses, inaccurate measurements, faulty equipment, misinterpretation, or wrong answers.
- Non-Response Error: A bias that occurs when responses are not obtained from all members of the sample, potentially leading to a sample that is not representative of the target population (biased results). For example, an interviewer being unable to contact a person from the sample.
- Selection Bias: Occurs when members of the target population cannot possibly be selected for inclusion in the sample, leading to a non-representative sample.
Calculating X from a Z-score:
To find a specific value (X) given a Z-score, mean, and standard deviation:
X = μ + Z * σ
This is useful for finding a score corresponding to a certain percentile or probability from a Z-table.
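For example, a sketch that finds the value at the 90th percentile of a hypothetical normal distribution, using scipy's norm.ppf as the inverse Z-table lookup:

```python
from scipy.stats import norm

# Hypothetical: exam scores with μ = 70, σ = 10. What score marks
# the 90th percentile?
mu, sigma = 70, 10

z = norm.ppf(0.90)  # inverse of the Z-table lookup, ≈ 1.2816
x = mu + z * sigma  # X = μ + Z·σ

print(f"90th percentile score ≈ {x:.1f}")  # ≈ 82.8
```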
Estimation and Hypothesis Testing
Estimators:
- Point Estimator: Estimates the value of an unknown population parameter using a single value.
- Interval Estimator: Draws inferences about a population parameter using a range (interval) of values.
Desirable characteristics of estimators include unbiasedness, consistency, and relative efficiency.
Estimating Population Mean (μ) when Population Variance (σ²) is Known:
The Z-statistic for the sample mean is:
Z = (x̄ - μ) / (σ / √n)
Confidence Interval (CI) for μ:
- Lower Confidence Limit:
x̄ - Zα/2 * (σ / √n)
- Upper Confidence Limit:
x̄ + Zα/2 * (σ / √n)
Where (1-α) is the confidence level.
Common Z-values for Confidence Levels:
- 90% Confidence (α = 0.10): Zα/2 = Z0.05 = 1.645
- 95% Confidence (α = 0.05): Zα/2 = Z0.025 = 1.96
- 99% Confidence (α = 0.01): Zα/2 = Z0.005 = 2.575
Note: When constructing a confidence interval with α/2, the alpha value is split equally between the two tails of the distribution on a graph.
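A minimal sketch of this interval, assuming hypothetical values for x̄, σ, and n:

```python
from math import sqrt
from scipy.stats import norm

# Hypothetical sample: x̄ = 52, known σ = 8, n = 36, 95% confidence
x_bar, sigma, n = 52, 8, 36
alpha = 0.05

z_crit = norm.ppf(1 - alpha / 2)   # ≈ 1.96
margin = z_crit * sigma / sqrt(n)  # Zα/2 · σ/√n

print(f"95% CI: ({x_bar - margin:.2f}, {x_bar + margin:.2f})")
# margin = 1.96 * 8/6 ≈ 2.61, so roughly (49.39, 54.61)
```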
Estimating Population Mean (μ) when Population Variance is Unknown:
The t-statistic for the sample mean is used:
t = (x̄ - μ) / (s / √n)
Where s is the sample standard deviation and the degrees of freedom (df) = n-1. The t-distribution table is used for critical values.
Confidence Interval (CI) for μ:
- Lower Confidence Limit:
x̄ - tα/2, n-1 * (s / √n)
- Upper Confidence Limit:
x̄ + tα/2, n-1 * (s / √n)
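The same sketch with the t-distribution, again using hypothetical sample figures:

```python
from math import sqrt
from scipy.stats import t

# Hypothetical sample: x̄ = 52, sample SD s = 8, n = 16, 95% confidence
x_bar, s, n = 52, 8, 16
alpha = 0.05
df = n - 1

t_crit = t.ppf(1 - alpha / 2, df)  # t(0.025, 15) ≈ 2.131
margin = t_crit * s / sqrt(n)

print(f"95% CI: ({x_bar - margin:.2f}, {x_bar + margin:.2f})")
```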
Selecting Sample Size for Estimating μ:
To determine the required sample size (n) for estimating the population mean with a specified margin of error (B):
n = ((Zα/2 * σ) / B)²
Where B is the maximum allowable sampling error (the desired precision, or “within B units”).
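A short sketch of this calculation (σ, B, and the confidence level are assumed for illustration; note that n is always rounded up):

```python
import math
from scipy.stats import norm

# Hypothetical goal: estimate μ to within B = 2 units with 95%
# confidence, assuming σ = 10 from prior studies.
sigma, B, alpha = 10, 2, 0.05

z = norm.ppf(1 - alpha / 2)          # ≈ 1.96
n = math.ceil((z * sigma / B) ** 2)  # round up to guarantee the precision

print(f"Required sample size: n = {n}")  # (1.96*10/2)² = 96.04 → 97
```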
Hypothesis Testing:
Hypothesis testing involves making inferences about a population parameter based on sample data. There are four possible outcomes:
- Correct Decision: Do not reject the null hypothesis (H₀) when H₀ is true.
- Type I Error (α): Reject H₀ when H₀ is true. The probability of this error is denoted by alpha (α).
- Type II Error (β): Do not reject H₀ when H₀ is false. The probability of this error is denoted by beta (β).
- Correct Decision: Reject H₀ when H₀ is false.
Six Steps of Hypothesis Testing:
- Formulate Hypotheses: State the null hypothesis (H₀) and the alternative hypothesis (H₁).
- Determine Critical Value(s): Find the Z-critical or t-critical value(s) based on the chosen significance level (α or α/2 for two-tailed tests).
- Calculate Test Statistic: Compute the Z-calculated or t-calculated value from the sample data.
- Formulate Decision Rule: Define the conditions for rejecting H₀. For example, reject H₀ if |test-calc| > |critical-value|.
- Perform Calculation: Apply the formulas and compute the test statistic.
- Draw Conclusion: Based on the decision rule, state whether to reject or not reject H₀, and interpret the findings in the context of the problem.
Test Statistics for Population Mean:
- Population Variance (σ²) is Known:
Z = (x̄ - μ) / (σ / √n)
- Population Variance (σ²) is Unknown:
t = (x̄ - μ) / (s / √n)
(with df = n-1)
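The six steps above can be sketched in a few lines; the hypotheses and sample figures here are hypothetical, and σ² is taken as unknown so the t-statistic applies:

```python
from math import sqrt
from scipy.stats import t

# Hypothetical two-tailed test: H₀: μ = 50 vs H₁: μ ≠ 50 at α = 0.05,
# with sample results x̄ = 53, s = 8, n = 25 (σ² unknown, so use t).
mu0, x_bar, s, n, alpha = 50, 53, 8, 25, 0.05
df = n - 1

t_calc = (x_bar - mu0) / (s / sqrt(n))  # test statistic
t_crit = t.ppf(1 - alpha / 2, df)       # critical value for α/2

# Decision rule: reject H₀ if |t-calculated| > t-critical
print(f"t-calculated = {t_calc:.3f}, t-critical = ±{t_crit:.3f}")
print("Reject H₀" if abs(t_calc) > t_crit else "Do not reject H₀")
```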
Simple Linear Regression and Correlation
The simple linear regression model describes the relationship between a dependent variable (y) and an independent variable (x):
y = β₀ + β₁x + ε
The fitted (prediction) equation drops the error term: ŷ = β̂₀ + β̂₁x.
- ŷ (y-hat): The predicted value of the dependent variable.
- β₀ (beta-naught): The y-intercept, representing the expected value of y when x is 0.
- β₁ (beta-one): The slope coefficient, indicating the change in y for a one-unit increase in x.
- ε (epsilon): The error term, representing the difference between the actual and predicted values.
Testing the Slope (β₁):
To determine if a linear relationship exists between x and y, we test the significance of the slope coefficient:
- Null Hypothesis (H₀): β₁ = 0 (No linear relationship)
- Alternative Hypothesis (H₁): β₁ ≠ 0 (A linear relationship exists)
The test statistic is a t-value:
t = (β̂₁ - β₁) / Sβ̂₁
Where β̂₁ is the estimated slope coefficient, β₁ is the hypothesized slope (usually 0), and Sβ̂₁ is the standard error of the slope estimate. The degrees of freedom (d.f.) = n-2.
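In practice this test is rarely done by hand; scipy.stats.linregress, for instance, reports the estimated slope, its standard error, and the p-value for H₀: β₁ = 0. A sketch with made-up data:

```python
from scipy import stats

# Hypothetical paired data (x = ad spend, y = sales)
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.1, 3.9, 6.2, 7.8, 9.9, 12.1, 14.2, 15.8]

res = stats.linregress(x, y)

# t = β̂₁ / Sβ̂₁ (hypothesized slope of 0), with df = n - 2
t_calc = res.slope / res.stderr
print(f"slope = {res.slope:.3f}, SE = {res.stderr:.3f}")
print(f"t = {t_calc:.2f}, p-value = {res.pvalue:.4g}")  # two-tailed test of β₁ = 0
print(f"R² = {res.rvalue**2:.4f}")
```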
Coefficient of Determination (R-squared, R²):
R-squared measures the proportion of the variance in the dependent variable that is predictable from the independent variable(s). It indicates the strength of the linear relationship:
- R² = 1: Perfect fit; 100% of the variation in y is explained by x.
- R² = 0: No linear relationship; x explains none of the variation in y.
For example, if R² = 0.75, then 75% of the variation in the dependent variable is explained by the model, and the remaining 25% is unexplained.
Correlation Coefficient (r):
The correlation coefficient measures the strength and direction of the linear relationship between two variables. It ranges from -1 to +1:
- r = -1: Perfect negative linear relationship.
- r = +1: Perfect positive linear relationship.
- r = 0: No linear relationship.
The closer the absolute value of r is to 1, the stronger the linear relationship. In multiple regression, the analogous measure is often reported as "Multiple R".
Time Series Analysis and Forecasting
Time series data can be decomposed into several components:
- Trend: A long-term, relatively smooth pattern or direction in the data that persists, usually for more than one year (e.g., consistent growth or decline).
- Cycle: A wavelike pattern describing long-term behavior, typically lasting for more than one year, often associated with economic cycles (e.g., boom and recession).
- Seasonal: Exhibits a short-term (less than one year) repetitive pattern tied to a calendar (e.g., quarterly sales peaks, monthly variations).
- Random Variation (Irregular Component): Unpredictable, unsystematic fluctuations that remain after accounting for trend, cycle, and seasonal components. Because it is noise, this component can obscure the other, predictable patterns.
Moving Average:
A smoothing technique used to reduce random variation and highlight trends or cycles. A k-period moving average is the average of the current data point and the k-1 preceding points (e.g., the first value of a 3-year moving average is the average of the first three data points).
- Too much smoothing may eliminate patterns of interest.
- Too little smoothing leaves too much random variation, which can disguise real patterns.
Seasonality can often be removed or smoothed using a centered moving average, such as a 5-period moving average (5-PMA).
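A sketch of both a trailing 3-period and a centered 5-period moving average using pandas (the series is invented):

```python
import pandas as pd

# Hypothetical yearly series
sales = pd.Series([12, 15, 11, 18, 20, 17, 23, 25, 21, 28])

# Trailing 3-period moving average: mean of the current
# and two preceding observations
ma3 = sales.rolling(window=3).mean()

# Centered 5-period moving average (5-PMA), used for smoothing
pma5 = sales.rolling(window=5, center=True).mean()

print(pd.DataFrame({"sales": sales, "3-MA": ma3, "5-PMA": pma5}))
```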
Seasonal Index:
A measure of how a particular season (e.g., month, quarter) compares to the average season. It helps quantify the seasonal effect.
Seasonal Index = (Average Ratio for the Season × Number of Seasons) / Sum of All Average Ratios
This normalization makes the indexes average out to 100% across the seasons.
- An index of 100% indicates an average seasonal effect.
- An index above 100% indicates a positive seasonal effect (above average).
- An index below 100% indicates a negative seasonal effect (below average).
Seasonally Adjusted Time Series:
A time series from which the seasonal component has been removed, allowing for better analysis of trend and cyclical components.
Seasonally Adjusted Time Series = Actual Time Series / Seasonal Index
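A minimal sketch of normalizing seasonal indexes and adjusting an observation, with hypothetical quarterly ratios:

```python
# Hypothetical average actual/moving-average ratios for each quarter,
# computed earlier via a centered moving average
avg_ratios = {"Q1": 0.90, "Q2": 1.10, "Q3": 1.25, "Q4": 0.80}

num_seasons = len(avg_ratios)
total = sum(avg_ratios.values())  # 4.05 here; should be ≈ num_seasons

# Normalize so the indexes average to 1 (i.e., 100%)
seasonal_index = {q: r * num_seasons / total for q, r in avg_ratios.items()}

# Seasonally adjust an actual observation by dividing by its index
actual_q3 = 250.0
adjusted_q3 = actual_q3 / seasonal_index["Q3"]
print({q: round(i, 3) for q, i in seasonal_index.items()})
print(f"Q3 seasonally adjusted: {adjusted_q3:.1f}")
```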
Index Numbers and Economic Indicators
Consumer Price Index (CPI):
Measures the changes in the total price of a basket of goods and services typically purchased by consumers. It is calculated primarily using a variation of the Laspeyres method. Price changes for various expenditure classes are combined with their corresponding weights (based on consumption patterns).
- Laspeyres Price Index (LPI): Values the base-period basket at both periods' prices: LPI = (Σ p₁q₀ / Σ p₀q₀) × 100.
- Advantage (+): Requires consumption (quantity) data only for the base period, which is often readily available.
- Paasche Price Index (PPI): Values the current-period basket at both periods' prices: PPI = (Σ p₁q₁ / Σ p₀q₁) × 100.
- Disadvantage (-): Requires consumption data for the current period, which is often difficult to obtain; for this reason it is not typically used for official indices like the CPI.
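A sketch of both indices on a hypothetical three-item basket:

```python
# Hypothetical basket: base-period and current-period prices/quantities
prices_0 = [2.0, 5.0, 1.5]  # p₀
prices_1 = [2.5, 5.5, 2.0]  # p₁
qty_0    = [10, 4, 20]      # q₀ (base-period consumption)
qty_1    = [9, 5, 18]       # q₁ (current-period consumption)

# Laspeyres: values the base-period basket at both periods' prices
lpi = (sum(p1 * q0 for p1, q0 in zip(prices_1, qty_0))
       / sum(p0 * q0 for p0, q0 in zip(prices_0, qty_0)) * 100)

# Paasche: values the current-period basket at both periods' prices
ppi = (sum(p1 * q1 for p1, q1 in zip(prices_1, qty_1))
       / sum(p0 * q1 for p0, q1 in zip(prices_0, qty_1)) * 100)

print(f"Laspeyres: {lpi:.1f}, Paasche: {ppi:.1f}")
```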
Simple Price Index:
Compares the price of a single item in the current period (P₁) to its price in a base period (P₀):
Simple Price Index = (P₁ / P₀) * 100
Real Income:
Measures the purchasing power of nominal income, adjusted for inflation:
Real Income = (Nominal Income / Current CPI) * Base CPI
Real GDP:
Measures the value of goods and services produced in an economy, adjusted for price changes (inflation or deflation):
Real GDP = (Nominal GDP / Current CPI) * 100
Purchasing Power:
Indicates the value of money in terms of what it can buy:
Purchasing Power = (Current Value / Current CPI) * 100
Key Concepts:
- Indexing: Expressing a value in one period as a percentage of a value in a base period. This can be done on any type of time series.
- Deflating: Removing the price-change effect from a monetary time series to express values in constant prices. This can only be done on a monetary time series.
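A closing sketch showing both operations on a hypothetical nominal-wage series:

```python
# Hypothetical nominal wages and CPI values over three years
nominal = [40000, 42000, 44000]
cpi     = [100.0, 104.0, 109.0]  # base-year CPI = 100

# Indexing: express each wage as a percentage of the base-year wage
indexed = [w / nominal[0] * 100 for w in nominal]

# Deflating: convert nominal wages to constant (base-year) dollars
real = [w / c * 100 for w, c in zip(nominal, cpi)]

print("Indexed:", [f"{v:.1f}" for v in indexed])
print("Real:   ", [f"{v:.0f}" for v in real])
```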