Advanced Statistical Analysis and Econometrics in SPSS

Skewness and Kurtosis: Distribution Shapes

What They Measure

  • Skewness measures the asymmetry of a distribution around its mean.
    • Positive (right) skew: Long right tail—most observations are on the left (e.g., income).
    • Negative (left) skew: Long left tail—most observations are on the right.
    • Skewness = 0: Symmetric distribution (necessary but not sufficient for normality).
  • Kurtosis measures tailedness (how heavy the tails are relative to a normal distribution), often loosely described as peakedness.
    • Mesokurtic: Kurtosis ≈ 3 (normal distribution).
    • Leptokurtic: Kurtosis > 3—heavy tails, more outliers, and a sharper peak.
    • Platykurtic: Kurtosis < 3—light tails and a flatter peak.
    • Excess kurtosis = kurtosis − 3 (normal = 0). Positive excess indicates a leptokurtic distribution. Note that SPSS reports excess kurtosis, so a value near 0 suggests approximately normal tails.
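The moment definitions above can be sketched in plain Python (an illustration outside SPSS; the sample values are made up):

```python
# Computing skewness and excess kurtosis by hand from the central moments.
def moments(data):
    n = len(data)
    mean = sum(data) / n
    m2 = sum((x - mean) ** 2 for x in data) / n
    m3 = sum((x - mean) ** 3 for x in data) / n
    m4 = sum((x - mean) ** 4 for x in data) / n
    skewness = m3 / m2 ** 1.5           # asymmetry around the mean
    excess_kurtosis = m4 / m2 ** 2 - 3  # 0 for a normal distribution
    return skewness, excess_kurtosis

# Right-skewed sample: one large value creates a long right tail, as with income
incomes = [22, 25, 26, 28, 30, 31, 33, 35, 40, 95]
skew, ex_kurt = moments(incomes)
print(f"skewness = {skew:.2f}, excess kurtosis = {ex_kurt:.2f}")
```

The single high value drives both statistics positive: a long right tail (positive skew) and heavy-tailed, leptokurtic behavior (positive excess kurtosis). Note that SPSS applies small-sample corrections, so its reported values differ slightly from these population-moment formulas.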

Why They Matter in Econometrics

  • Many parametric tests (t, F, OLS inference) rely on the normality of errors or large-sample approximations. If data or residuals are highly skewed or leptokurtic, small-sample inference can be invalid:
    • Skewness typically implies mean ≠ median; OLS coefficient interpretation still holds, but inference may be affected, especially for small n.
    • High kurtosis implies more extreme values or outliers; consequently, standard errors can be unreliable and tests may lose power.

How to Detect and Diagnose

  • In SPSS: Analyze → Descriptive Statistics → Explore or Descriptives provides skewness and kurtosis values along with standard errors. Also, check Graphs → Histogram and Q–Q plots.
  • Rules of thumb:
    • Skewness / SE_skewness: Compute z = skewness/SE; |z| > 1.96 suggests significant skew at the 5% level.
    • Kurtosis / SE_kurtosis: Compute the analogous z and apply the same criterion.
    • Visual checks: Use histograms, boxplots (for outliers), and Q–Q plots.
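The z-test rule of thumb can be sketched as follows (a hedged illustration: SPSS computes exact small-sample standard errors, whereas sqrt(6/n) and sqrt(24/n) are the common large-sample approximations; the input values here are made up):

```python
import math

def shape_z_tests(skewness, excess_kurtosis, n):
    se_skew = math.sqrt(6 / n)    # approximate SE of skewness
    se_kurt = math.sqrt(24 / n)   # approximate SE of (excess) kurtosis
    return skewness / se_skew, excess_kurtosis / se_kurt

z_s, z_k = shape_z_tests(skewness=1.2, excess_kurtosis=0.4, n=150)
# |z| > 1.96 suggests a significant departure from normality at the 5% level
print(f"z_skew = {z_s:.2f} (significant: {abs(z_s) > 1.96})")
print(f"z_kurt = {z_k:.2f} (significant: {abs(z_k) > 1.96})")
```

Here the skewness is clearly significant (z = 6.0) while the kurtosis is not (z = 1.0), matching the visual impression a histogram of such data would give.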

Remedies for Problematic Skewness or Kurtosis

  • Transformations: Apply log, square-root, or Box–Cox transformations to reduce skewness.
  • Robust methods: Use robust standard errors, nonparametric tests, or median regressions.
  • Winsorize or trim: Address extreme outliers carefully and document the rationale.
  • For kurtosis issues in finance, use models that allow for fat tails, such as t-distribution errors or GARCH for volatility.
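The effect of a log transform on right skew can be sketched with synthetic data (an illustration outside SPSS; the log-normal-style sample is simulated, not real):

```python
import math, random

def skewness(data):
    n = len(data)
    mean = sum(data) / n
    m2 = sum((x - mean) ** 2 for x in data) / n
    m3 = sum((x - mean) ** 3 for x in data) / n
    return m3 / m2 ** 1.5

random.seed(42)
# Exponentiated normals: strongly right-skewed, like income data
raw = [math.exp(random.gauss(10, 0.8)) for _ in range(1000)]
logged = [math.log(x) for x in raw]  # log transform pulls in the right tail

print(f"skewness before: {skewness(raw):.2f}, after log: {skewness(logged):.2f}")
```

The raw series has large positive skew; after the log transform the skewness is close to zero, which is exactly why logs are the first remedy tried for income-type variables. In SPSS the same step is Transform → Compute Variable with LN(x).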

Dummy Variables in Regression Analysis

What is a Dummy Variable?

  • A dummy (indicator) is a numeric variable representing categories. The usual coding is 1 for the presence of an attribute and 0 for its absence.
  • It is used to include categorical information, such as gender, region, or treatment, in regression models.

Determining the Number of Dummies

  • For a categorical variable with k categories, include k − 1 dummies. Leave one category as the reference or base. Including k dummies with an intercept creates perfect multicollinearity, known as the dummy variable trap.
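The k − 1 rule can be sketched in plain Python (an illustration outside SPSS; the region labels and the choice of "North" as reference are made up):

```python
# k = 4 categories -> k - 1 = 3 dummies, with "North" as the reference group
regions = ["North", "South", "East", "South", "West", "North"]
categories = ["North", "South", "East", "West"]
reference = "North"

dummies = {
    cat: [1 if r == cat else 0 for r in regions]
    for cat in categories if cat != reference
}
print(dummies)
```

A "North" observation is identified by zeros in all three columns. Adding a fourth "North" dummy would make the columns sum to 1 for every row, duplicating the intercept and producing the perfect multicollinearity of the dummy variable trap.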

Interpretation in Regression

  • If the model is Y = β₀ + β₁D + β₂X + u, where D = 1 for group A and 0 for reference group B:
    • β₀: The mean of Y for the reference group when X = 0.
    • β₁: The difference in the mean of Y between group A and the reference group, holding X constant.
  • With multiple dummies, each coefficient represents the effect relative to the base group.
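The interpretation of β₀ and β₁ can be illustrated in the simplest case, a model with only the dummy (no X): OLS then reproduces the group means, so β₀ is the reference-group mean and β₁ the difference in means. A sketch with made-up numbers:

```python
# Y = b0 + b1*D with no other regressors: OLS recovers the group means
y_group_A = [12.0, 14.0, 13.0, 15.0]   # D = 1
y_group_B = [10.0, 11.0, 9.0, 10.0]    # D = 0 (reference)

mean_A = sum(y_group_A) / len(y_group_A)
mean_B = sum(y_group_B) / len(y_group_B)

b0 = mean_B           # intercept = mean of the reference group
b1 = mean_A - mean_B  # dummy coefficient = difference in group means
print(f"b0 = {b0}, b1 = {b1}")
```

With a continuous X in the model, the same logic applies after holding X constant: β₁ is the vertical shift between two parallel regression lines.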

Interaction with Other Variables

  • Dummies can be interacted with continuous variables: Y = β₀ + β₁X + β₂D + β₃(D × X).
    • β₃ tests whether the slope for X differs across groups.
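The role of β₃ as a slope difference can be sketched with noise-free constructed data (an illustration, not SPSS output; the coefficient values b0 = 1, b1 = 2, b2 = 3, b3 = 1.5 are assumed for the example):

```python
# In Y = b0 + b1*X + b2*D + b3*(D*X), b3 is the difference in X-slopes
# between the D = 1 group and the reference group.
def simple_slope(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    var = sum((xi - mx) ** 2 for xi in x)
    return cov / var

x = [1.0, 2.0, 3.0, 4.0]
y0 = [1 + 2 * xi for xi in x]                # D = 0: intercept b0, slope b1
y1 = [(1 + 3) + (2 + 1.5) * xi for xi in x]  # D = 1: slope b1 + b3

b1 = simple_slope(x, y0)
b3 = simple_slope(x, y1) - b1  # interaction coefficient = slope difference
print(f"b1 = {b1}, b3 = {b3}")
```

Because the data are constructed exactly, fitting each group separately recovers the assumed slopes, making visible what the interaction term captures in a single pooled regression.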

Creating Dummies in SPSS

  • Use Transform → Recode into Different Variables to map categories to 0/1, or Transform → Compute Variable with an expression like D = (group = 1).
  • Set Value Labels (e.g., 1 = Male, 0 = Female) in the Variable View.

Common Exam Pitfalls and Checks

  • Do not include k dummies together with an intercept. If all k dummies must be included, drop the intercept instead.
  • Ensure the reference group is meaningful and report which group is the base.
  • If there are many categories (e.g., a region with 20 categories), consider collapsing rare categories or using fixed effects.

Factor and Cluster Analysis Techniques

Factor Analysis (FA)

Purpose

  • Reduce dimensionality by summarizing many observed variables into a few latent factors (constructs). This is useful for questionnaires and indices.

Types

  • Exploratory Factor Analysis (EFA): Used when the structure is unknown.
  • Confirmatory Factor Analysis (CFA): Used to test a hypothesized factor structure (typically part of SEM).

Assumptions and Data Requirements

  • Data should be on an interval scale (Likert scales are sometimes treated as interval for FA).
  • There must be sufficient correlations among variables.
  • A KMO (Kaiser-Meyer-Olkin) measure > 0.6 is desirable, and a significant Bartlett’s test (p < 0.05) indicates factorability.

Steps in SPSS

  1. Examine correlations and KMO/Bartlett results.
  2. Choose extraction method: Principal Components (PCA) or a common factor method such as Principal Axis Factoring.
  3. Decide number of factors: Use the eigenvalue > 1 rule, scree plots, % variance explained, or parallel analysis.
  4. Rotate loadings: Use Varimax (orthogonal) or Promax (oblique) for an interpretable structure.
  5. Interpret loadings: Loadings > 0.4 are typically meaningful; communalities show the variance explained per variable.
  6. Compute factor scores if needed for regression.
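The eigenvalue > 1 rule from step 3 can be sketched using a correlation matrix with a known closed-form spectrum: if all p variables intercorrelate equally at r (compound symmetry), the eigenvalues are 1 + (p − 1)r once and 1 − r repeated p − 1 times. The values p = 3, r = 0.5 below are assumed for illustration:

```python
# Kaiser criterion (eigenvalue > 1) on a compound-symmetry correlation matrix
p, r = 3, 0.5
eigenvalues = [1 + (p - 1) * r] + [1 - r] * (p - 1)   # [2.0, 0.5, 0.5]

retained = [ev for ev in eigenvalues if ev > 1]        # factors to keep
variance_explained = sum(retained) / sum(eigenvalues)  # eigenvalues sum to p

print(f"retain {len(retained)} factor(s), "
      f"explaining {variance_explained:.0%} of total variance")
```

One factor is retained, explaining two thirds of the variance, which mirrors the "Total Variance Explained" table SPSS produces in Analyze → Dimension Reduction → Factor.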

Output Interpretation

  • Rotated Component Matrix: Shows variable loadings; high loading indicates a variable belongs to that factor.
  • Communalities: The proportion of each variable’s variance explained by the retained factors.
  • Total variance explained: The fraction of total variance captured by the factors.

Cluster Analysis (CA)

Purpose

  • Group observations (not variables) into homogeneous clusters based on similarity. This is used for segmentation (e.g., customers or regions).

Types

  • Hierarchical: Builds a dendrogram (agglomerative or divisive).
  • Partitioning (k-means): Specify k clusters and the algorithm assigns observations.

Steps and Choices

  1. Choose variables and preprocess them (standardize if scales differ).
  2. Select distance measure: Euclidean is typical for numeric data.
  3. Choose algorithm: Hierarchical is good for small n; k-means is efficient for larger n.
  4. Decide number of clusters: Use a dendrogram, elbow method, or silhouette score.
  5. Interpret clusters: Profile cluster centers and label the clusters.
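The assignment/update loop behind k-means (step 3) can be sketched in miniature; this is a toy one-dimensional stand-in for SPSS's K-Means Cluster procedure, with made-up data and starting centers:

```python
# Minimal 1-D k-means: alternate assignment and center-update steps
def kmeans_1d(data, centers, iters=10):
    labels = []
    for _ in range(iters):
        # Assignment step: each point goes to its nearest center
        labels = [min(range(len(centers)), key=lambda j: (x - centers[j]) ** 2)
                  for x in data]
        # Update step: each center becomes the mean of its cluster
        for j in range(len(centers)):
            members = [x for x, lab in zip(data, labels) if lab == j]
            if members:
                centers[j] = sum(members) / len(members)
    return labels, centers

# Two well-separated groups (e.g., standardized spending scores)
data = [1.0, 1.2, 0.8, 1.1, 5.0, 5.3, 4.8, 5.1]
labels, centers = kmeans_1d(data, centers=[0.0, 6.0])
print(labels, [round(c, 2) for c in centers])
```

The algorithm converges immediately here because the groups are well separated; the final centers (about 1.03 and 5.05) are the cluster profiles one would label in step 5.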

Factor vs. Cluster Analysis

  • Factor analysis groups variables into latent constructs.
  • Cluster analysis groups cases or observations into segments.

Structural Equation Modeling (SEM) Theory

What is SEM?

  • SEM combines measurement models (CFA) and structural models (path analysis). It estimates multiple interrelated equations simultaneously, linking latent constructs and observed variables.

Components

  • Exogenous variables: Independent latent or observed variables.
  • Endogenous variables: Dependent latent or observed variables.
  • Measurement model: Shows how observed indicators load on latent factors.
  • Structural model: Shows causal paths between latent constructs.

Assumptions

  • Multivariate normality: Maximum likelihood estimation relies on this.
  • Linearity between variables.
  • No large outliers.
  • Correct model identification: Ensuring enough information to estimate parameters.
  • Uncorrelated error terms unless specifically modeled.

Identification

  • A model must be identified to be estimable:
    • Under-identified: Fewer unique variances and covariances (moments) than free parameters (cannot estimate).
    • Exactly identified: Moments equal parameters (trivial fit).
    • Over-identified: More moments than parameters (desirable for testing fit).
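The counting rule behind these three cases can be sketched directly: with p observed variables the data supply p(p + 1)/2 unique variances and covariances, and subtracting the number of free parameters gives the model's degrees of freedom (the example model sizes below are made up):

```python
# Classify identification status from the moment count p(p+1)/2
def sem_identification(p_observed, free_parameters):
    moments = p_observed * (p_observed + 1) // 2
    df = moments - free_parameters
    if df < 0:
        return "under-identified"    # cannot estimate
    if df == 0:
        return "exactly identified"  # trivial (perfect) fit
    return "over-identified"         # testable fit, df > 0

# e.g., 6 indicators supply 21 moments; 13 free parameters leave df = 8
print(sem_identification(6, 13))
```

Positive degrees of freedom are what make the χ² and other fit tests meaningful, which is why over-identification is the desirable case.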

Sample Size

  • SEM requires large n: A typical rule is 10–20 observations per parameter; n ≥ 200 is recommended when there are many indicators. Note that with very large samples the χ² test becomes overly sensitive, rejecting even trivially misspecified models.

Estimation and Fit

  • Estimation: Usually Maximum Likelihood (ML); alternatives include GLS or WLS for non-normal data.
  • Fit Indices:
    • χ² test: A non-significant p-value indicates acceptable fit, though the test is sensitive to sample size.
    • CFI and TLI (NNFI): Values > 0.90–0.95 indicate good fit.
    • RMSEA: < 0.06–0.08 is acceptable; lower is better.
    • SRMR: < 0.08 is desirable.

Steps to Conduct SEM

  1. Specify model: Draw a path diagram based on theory.
  2. Identify and estimate measurement model (CFA): Check loadings and validity.
  3. Estimate structural model: Test hypothesized causal paths.
  4. Assess fit: Use indices and inspect modification indices for improvements.
  5. Report results: Include standardized coefficients, SEs, p-values, and R².

Common Pitfalls and Remedies

  • Avoid blindly adding paths to improve fit, as this leads to overfitting.
  • If multivariate normality is violated, use robust estimation or bootstrap SEs.
  • Use SPSS to prepare data and AMOS to specify and estimate the SEM graphically.