Advanced Statistical Analysis and Econometrics in SPSS
Posted on Jan 3, 2026 in Statistics
Skewness and Kurtosis: Distribution Shapes
What They Measure
- Skewness measures the asymmetry of a distribution around its mean.
- Positive (right) skew: Long right tail—most observations are on the left (e.g., income).
- Negative (left) skew: Long left tail—most observations are on the right.
- Skewness = 0: Symmetric distribution (ideally normal).
- Kurtosis measures tailedness and peakedness—how heavy the tails are relative to a normal distribution.
- Mesokurtic: Kurtosis ≈ 3 (normal distribution).
- Leptokurtic: Kurtosis > 3—heavy tails, more outliers, and a sharper peak.
- Platykurtic: Kurtosis < 3—light tails and a flatter peak.
- Excess kurtosis = kurtosis − 3 (normal = 0). Positive excess indicates a leptokurtic distribution. Note that SPSS reports excess kurtosis, so a value near 0 suggests normal-like tails.
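As a quick illustration outside SPSS, the same shape statistics can be computed in Python on synthetic data (a sketch, not the SPSS output itself; note that scipy, like SPSS, reports kurtosis as excess kurtosis):

```python
# Illustrative check of skewness and excess kurtosis on synthetic data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

normal = rng.normal(size=10_000)      # symmetric, mesokurtic
income = rng.lognormal(size=10_000)   # right-skewed, heavy right tail

print(stats.skew(normal), stats.kurtosis(normal))   # both near 0
print(stats.skew(income), stats.kurtosis(income))   # both clearly positive
```

The lognormal sample mimics an income variable: its long right tail produces large positive skewness and large positive excess kurtosis.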
Why They Matter in Econometrics
- Many parametric tests (t, F, OLS inference) rely on the normality of errors or large-sample approximations. If data or residuals are highly skewed or leptokurtic, small-sample inference can be invalid:
- Skewness implies mean ≠ median; OLS coefficient interpretation still holds, but inference may be affected, especially for small n.
- High kurtosis implies more extreme values or outliers; consequently, standard errors can be unreliable and tests may lose power.
How to Detect and Diagnose
- In SPSS:
Analyze → Descriptive Statistics → Explore or Descriptives provides skewness and kurtosis values along with their standard errors. Also check Graphs → Histogram and Q–Q plots.
- Rules of thumb:
- Skewness / SE_skewness: Compute z = skewness/SE; |z| > 1.96 indicates significant skewness at the 5% level.
- Kurtosis / SE_kurtosis: Perform the analogous z-test on the kurtosis statistic.
- Visual checks: Use histograms, boxplots (for outliers), and Q–Q plots.
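The z-tests above can be sketched in Python using the large-sample approximations SE_skew ≈ √(6/n) and SE_kurt ≈ √(24/n) (an assumption for illustration—SPSS uses exact small-sample formulas, so its standard errors differ slightly):

```python
# Rough z-tests for skewness and excess kurtosis.
# SE_skew ≈ sqrt(6/n) and SE_kurt ≈ sqrt(24/n) are large-sample approximations.
import numpy as np
from scipy import stats

def shape_z_tests(x):
    n = len(x)
    z_skew = stats.skew(x) / np.sqrt(6 / n)
    z_kurt = stats.kurtosis(x) / np.sqrt(24 / n)   # excess kurtosis
    return z_skew, z_kurt

rng = np.random.default_rng(0)
z_s, z_k = shape_z_tests(rng.lognormal(size=500))
print(f"z_skew = {z_s:.1f}, z_kurt = {z_k:.1f}")   # |z| > 1.96 flags a problem
```

For the skewed sample above, both z-values far exceed 1.96, so normality would be rejected.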
Remedies for Problematic Skewness or Kurtosis
- Transformations: Apply log, square-root, or Box–Cox transformations to reduce skewness.
- Robust methods: Use robust standard errors, nonparametric tests, or median regressions.
- Winsorize or trim: Address extreme outliers carefully and document the rationale.
- For kurtosis issues in finance, use models that allow for fat tails, such as t-distribution errors or GARCH for volatility.
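The effect of a log transform on right skew can be demonstrated on synthetic data (a minimal sketch; the log is defined for strictly positive values only):

```python
# A log transform often tames right skew in positive-valued data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
income = rng.lognormal(mean=10, sigma=1, size=5_000)   # strongly right-skewed

print(stats.skew(income))            # large positive skewness
print(stats.skew(np.log(income)))    # near 0 after the transform
```

Here the raw variable is lognormal, so its log is exactly normal; real data rarely behave this cleanly, but the direction of the improvement is typical.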
Dummy Variables in Regression Analysis
What is a Dummy Variable?
- A dummy (indicator) is a numeric variable representing categories. The usual coding is 1 for the presence of an attribute and 0 for its absence.
- It is used to include categorical information, such as gender, region, or treatment, in regression models.
Determining the Number of Dummies
- For a categorical variable with k categories, include k − 1 dummies. Leave one category as the reference or base. Including k dummies with an intercept creates perfect multicollinearity, known as the dummy variable trap.
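The k − 1 rule can be illustrated with pandas (shown as a Python analogue of SPSS recoding; the variable name `region` is made up for the example):

```python
# Coding k - 1 dummies for a k-category variable with pandas.
import pandas as pd

df = pd.DataFrame({"region": ["North", "South", "East", "South", "North"]})

# drop_first=True leaves one category (alphabetically first: "East") as the
# reference group, avoiding the dummy variable trap when an intercept is used.
dummies = pd.get_dummies(df["region"], prefix="region", drop_first=True)
print(dummies.columns.tolist())   # ['region_North', 'region_South']
```

Three categories yield two dummies; "East" becomes the base against which the others are compared.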
Interpretation in Regression
- If the model is Y = β₀ + β₁D + β₂X + u, where D = 1 for group A and 0 for reference group B:
- β₀: The mean of Y for the reference group when X = 0.
- β₁: The difference in the mean of Y between group A and the reference group, holding X constant.
- With multiple dummies, each coefficient represents the effect relative to the base group.
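The interpretation of β₀ and β₁ can be verified numerically on synthetic data (a sketch with made-up true values; for simplicity the model omits X, so β₀ is exactly the reference-group mean):

```python
# Recovering beta0 and beta1 from Y = beta0 + beta1*D + u (synthetic data).
import numpy as np

rng = np.random.default_rng(7)
D = np.repeat([0, 1], 500)                        # 0 = reference, 1 = group A
Y = 10 + 3 * D + rng.normal(scale=1, size=1000)   # true beta0 = 10, beta1 = 3

X = np.column_stack([np.ones_like(D), D])
beta, *_ = np.linalg.lstsq(X, Y, rcond=None)
print(beta)  # approx. [10, 3]: reference-group mean and group-A difference
```

With only a dummy in the model, the OLS estimate of β₀ equals the reference group's sample mean exactly, and β₁ equals the difference in group means.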
Interaction with Other Variables
- Dummies can be interacted with continuous variables: Y = β₀ + β₁X + β₂D + β₃(D × X).
- β₃ tests whether the slope for X differs across groups.
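A slope-difference test can be sketched the same way (synthetic data with made-up coefficients; the estimate of β₃ should recover the built-in slope gap):

```python
# Testing whether the slope on X differs by group via the D x X interaction.
import numpy as np

rng = np.random.default_rng(3)
n = 1000
X = rng.normal(size=n)
D = rng.integers(0, 2, size=n)
# True slopes: 2 for the reference group, 2 + 1.5 for group D = 1.
Y = 1 + 2 * X + 0.5 * D + 1.5 * D * X + rng.normal(scale=0.5, size=n)

Z = np.column_stack([np.ones(n), X, D, D * X])
beta, *_ = np.linalg.lstsq(Z, Y, rcond=None)
print(beta)  # beta3 approx. 1.5: the slope difference between groups
```

A significant β₃ means the X effect is not the same across groups, so reporting a single pooled slope would be misleading.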
Creating Dummies in SPSS
- Use Transform → Recode into Different Variables to map categories to 0/1, or Transform → Compute Variable with an expression like D = (group = 1).
- Set Value Labels (e.g., 1 = Male, 0 = Female) in the Variable View.
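The Compute Variable expression has a direct Python analogue—a boolean comparison cast to 0/1 (the column name `group` is hypothetical):

```python
# Python analogue of SPSS's Compute Variable D = (group = 1):
# a True/False comparison converted to 1/0.
import pandas as pd

df = pd.DataFrame({"group": [1, 2, 1, 3]})
df["D"] = (df["group"] == 1).astype(int)
print(df["D"].tolist())  # [1, 0, 1, 0]
```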
Common Exam Pitfalls and Checks
- Do not include all k dummies together with an intercept; if you include all k, drop the intercept.
- Ensure the reference group is meaningful and report which group is the base.
- If there are many categories (e.g., a region with 20 categories), consider collapsing rare categories or using fixed effects.
Factor and Cluster Analysis Techniques
Factor Analysis (FA)
Purpose
- Reduce dimensionality by summarizing many observed variables into a few latent factors (constructs). This is useful for questionnaires and indices.
Types
- Exploratory Factor Analysis (EFA): Used when the structure is unknown.
- Confirmatory Factor Analysis (CFA): Used to test a hypothesized factor structure (typically part of SEM).
Assumptions and Data Requirements
- Data should be on an interval scale (Likert scales are sometimes treated as interval for FA).
- There must be sufficient correlations among variables.
- A KMO (Kaiser-Meyer-Olkin) measure > 0.6 is desirable, and a significant Bartlett’s test (p < 0.05) indicates factorability.
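Bartlett's test can be computed by hand from the correlation matrix using its standard chi-square statistic, −(n − 1 − (2p + 5)/6)·ln|R| with p(p − 1)/2 degrees of freedom (a sketch on synthetic correlated indicators, not SPSS output):

```python
# Bartlett's test of sphericity from a correlation matrix.
import numpy as np
from scipy import stats

def bartlett_sphericity(data):
    n, p = data.shape
    R = np.corrcoef(data, rowvar=False)
    chi2 = -(n - 1 - (2 * p + 5) / 6) * np.log(np.linalg.det(R))
    df = p * (p - 1) / 2
    return chi2, stats.chi2.sf(chi2, df)

rng = np.random.default_rng(5)
base = rng.normal(size=(300, 1))
data = base + 0.5 * rng.normal(size=(300, 4))   # four correlated indicators
chi2, p_value = bartlett_sphericity(data)
print(p_value)  # p < 0.05 -> correlations are strong enough to factor
```

A significant result rejects the hypothesis that R is an identity matrix, i.e., the variables are correlated enough for factor analysis to be meaningful.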
Steps in SPSS
- Examine correlations and KMO/Bartlett results.
- Choose extraction method: Principal Components (PCA) or common factor (Principal Axis).
- Decide number of factors: Use the eigenvalue > 1 rule, scree plots, % variance explained, or parallel analysis.
- Rotate loadings: Use Varimax (orthogonal) or Promax (oblique) for an interpretable structure.
- Interpret loadings: Loadings > 0.4 are typically meaningful; communalities show the variance explained per variable.
- Compute factor scores if needed for regression.
Output Interpretation
- Rotated Component Matrix: Shows variable loadings; high loading indicates a variable belongs to that factor.
- Communalities: The proportion of each variable’s variance explained by the retained factors.
- Total variance explained: The fraction of total variance captured by the factors.
Cluster Analysis (CA)
Purpose
- Group observations (not variables) into homogeneous clusters based on similarity. This is used for segmentation (e.g., customers or regions).
Types
- Hierarchical: Builds a dendrogram (agglomerative or divisive).
- Partitioning (k-means): Specify k clusters and the algorithm assigns observations.
Steps and Choices
- Choose variables and preprocess them (standardize if scales differ).
- Select distance measure: Euclidean is typical for numeric data.
- Choose algorithm: Hierarchical is good for small n; k-means is efficient for larger n.
- Decide number of clusters: Use a dendrogram, elbow method, or silhouette score.
- Interpret clusters: Profile cluster centers and label the clusters.
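The partitioning steps above can be sketched with k-means in scikit-learn on synthetic data with two well-separated groups (SPSS offers the same procedure via Analyze → Classify → K-Means Cluster; the standardization step matters whenever variable scales differ):

```python
# k-means sketch: standardize, cluster, then inspect cluster sizes.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
# Two well-separated groups on two numeric variables.
X = np.vstack([rng.normal(0, 1, size=(100, 2)),
               rng.normal(6, 1, size=(100, 2))])

X_std = StandardScaler().fit_transform(X)   # standardize if scales differ
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_std)
print(np.bincount(labels))   # roughly 100 observations per cluster
```

Profiling would then compare the variable means within each cluster to label the segments.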
Factor vs. Cluster Analysis
- Factor analysis groups variables into latent constructs.
- Cluster analysis groups cases or observations into segments.
Structural Equation Modeling (SEM) Theory
What is SEM?
- SEM combines measurement models (CFA) and structural models (path analysis). It estimates multiple interrelated equations simultaneously, linking latent constructs and observed variables.
Components
- Exogenous variables: Independent latent or observed variables.
- Endogenous variables: Dependent latent or observed variables.
- Measurement model: Shows how observed indicators load on latent factors.
- Structural model: Shows causal paths between latent constructs.
Assumptions
- Multivariate normality: Maximum likelihood estimation relies on this.
- Linearity between variables.
- No large outliers.
- Correct model identification: Ensuring enough information to estimate parameters.
- Uncorrelated error terms unless specifically modeled.
Identification
- A model must be identified to be estimable:
- Under-identified: Fewer unique moments than parameters (cannot estimate).
- Exactly identified: Moments equal parameters (zero degrees of freedom, so fit cannot be tested).
- Over-identified: More moments than parameters (desirable for testing fit).
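The bookkeeping behind these categories is simple: with p observed variables there are p(p + 1)/2 unique variances and covariances to fit. A small helper (illustrative, with made-up parameter counts) makes the comparison explicit:

```python
# Counting unique moments versus free parameters to check identification.
def identification_status(p, n_params):
    moments = p * (p + 1) // 2   # unique variances/covariances
    if n_params > moments:
        return "under-identified"
    if n_params == moments:
        return "exactly identified"
    return f"over-identified ({moments - n_params} df for testing fit)"

print(identification_status(4, 8))   # 10 moments vs 8 parameters
```

With 4 observed variables and 8 free parameters, the model is over-identified with 2 degrees of freedom available for testing fit.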
Sample Size
- SEM requires large n: A typical rule is 10–20 observations per parameter; n ≥ 200 is recommended for models with many indicators. Conversely, with very large samples the χ² test becomes overly sensitive, rejecting models for trivial misfit.
Estimation and Fit
- Estimation: Usually Maximum Likelihood (ML); alternatives include GLS or WLS for non-normal data.
- Fit Indices:
- χ² test: A non-significant p-value is preferred, indicating the model-implied and observed covariances do not differ significantly.
- CFI and TLI (NNFI): Values > 0.90–0.95 indicate good fit.
- RMSEA: < 0.06–0.08 is acceptable; lower is better.
- SRMR: < 0.08 is desirable.
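RMSEA can be computed by hand from the model χ², its degrees of freedom, and the sample size, using the common formula √(max(χ² − df, 0) / (df · (n − 1))) (a sketch with made-up fit values, not output from a fitted model):

```python
# RMSEA from the model chi-square, its df, and the sample size.
import math

def rmsea(chi2, df, n):
    return math.sqrt(max(chi2 - df, 0) / (df * (n - 1)))

print(round(rmsea(chi2=85.0, df=40, n=300), 3))  # 0.061 -> borderline acceptable
```

The max(·, 0) clamp means a model whose χ² falls below its df gets RMSEA = 0, the best possible value.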
Steps to Conduct SEM
- Specify model: Draw a path diagram based on theory.
- Identify and estimate measurement model (CFA): Check loadings and validity.
- Estimate structural model: Test hypothesized causal paths.
- Assess fit: Use indices and inspect modification indices for improvements.
- Report results: Include standardized coefficients, SEs, p-values, and R².
Common Pitfalls and Remedies
- Avoid blindly adding paths to improve fit, as this leads to overfitting.
- If multivariate normality is violated, use robust estimation or bootstrap SEs.
- Use SPSS to prepare data and AMOS to specify and estimate the SEM graphically.