Statistics Fundamentals: A Comprehensive Guide


Introduction to Statistics

Statistics refers to the body of techniques used for collecting, organizing, analyzing, and interpreting data. It is used in various fields to help make better decisions by understanding variation, patterns, and relationships in data.

Types of Data

Data can be classified into two main types:

  1. Quantitative: Values expressed numerically.
    1. Continuous: Can take any value within a range, including decimals. Example: Grade on an exam (8.5).
    2. Discrete: Whole numbers obtained by counting. Example: Number of people in a class (25 students).
  2. Qualitative: Characteristics that are categorized rather than measured.
    1. Nominal: Categories with no inherent order. Example: Gender (we can't say one category ranks above another).
    2. Ordinal: Categories with a meaningful order from lowest to highest. Example: Education level (high school, master's degree), where one grade can be ranked above another.

Obtaining Data for Statistical Purposes

  1. Direct Observation: Data are collected by systematically assessing samples of output, as in statistical process control (a statistical experiment).
  2. Indirect Observation: When data cannot be collected directly, the information is obtained from individual respondents (survey or questionnaire).

Sampling Methods

  1. Random Sampling: Every item in the target population has a known, and usually equal, chance of being chosen for inclusion in the sample.
    1. Simple Random Sample: Each of the N items in the population has an equal chance of selection, like drawing n numbers in a lottery.
    2. Systematic Random Sample: After a random starting point, every k-th item is selected, e.g., "select every 3rd person who enters the class".
    3. Stratified Sampling: The population is divided into subgroups (strata) defined by a characteristic, e.g., "people with blue eyes", and a random sample is drawn from each stratum.
    4. Cluster Sampling: The population is divided into groups (clusters) that may combine several characteristics, e.g., "people with glasses and blue eyes"; clusters are selected at random and their members sampled.
  2. Nonrandom Sampling: An individual selects the items to be included in the sample based on judgment, e.g., choosing people one already knows (family and friends).
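The four random sampling methods above can be sketched with Python's standard random module. The population of 100 numbered items, the stratum split, and the cluster size are all hypothetical choices for illustration:

```python
import random

random.seed(42)  # fixed seed so the illustration is reproducible

population = list(range(1, 101))  # hypothetical population of 100 IDs

# Simple random sample: every item has an equal chance ("lottery").
simple = random.sample(population, 10)

# Systematic sample: pick every k-th item after a random start.
k = 10
start = random.randrange(k)
systematic = population[start::k]

# Stratified sample: split the population into strata (here, odd vs. even
# IDs stand in for a real characteristic) and draw randomly from each.
strata = {
    "odd": [x for x in population if x % 2 == 1],
    "even": [x for x in population if x % 2 == 0],
}
stratified = [x for group in strata.values() for x in random.sample(group, 5)]

# Cluster sample: partition into clusters of 10, then pick whole clusters.
clusters = [population[i:i + 10] for i in range(0, 100, 10)]
chosen_clusters = random.sample(clusters, 2)
cluster_sample = [x for c in chosen_clusters for x in c]

print(len(simple), len(systematic), len(stratified), len(cluster_sample))
```

Note the structural difference: stratified sampling draws from every subgroup, while cluster sampling takes entire subgroups and ignores the rest.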

Quantitative Research

Quantitative research aims to quantify the collection and analysis of data. Its objective is to develop and employ mathematical models, theories, and hypotheses based on existing studies or known intuitions.

  • Exploratory: Investigates a problem that is not yet clearly defined, generating ideas and hypotheses.
  • Descriptive: Describes a given phenomenon as it is, without manipulating it.
  • Causal: Identifies cause-and-effect relationships between variables.

Validity in Research

  1. Internal Validity: Refers to the extent to which a study can demonstrate a causal relationship between the independent and dependent variables. Is there a relationship between the selected variables and a cause-effect association? Example: A study claims that lack of sleep causes cancer, but the sleep-deprived people in the sample are also smokers; smoking, not lack of sleep, may be the real cause (Cause: smoking, Effect: cancer).
  2. External Validity: Refers to the extent to which the results of a study can be generalized or applied to other contexts, populations, and times. How well does the conducted study or used sample relate to the general population?
    • Validity and reliability in the measurement of variables.
    • Representativeness of the sample.

Data Representation and Analysis

  1. Frequency Distribution: A table in which possible values for a variable are grouped into classes, and the number of observed values that fall into each class is recorded. Data organized in a frequency distribution are called grouped data.
  2. Histogram: A bar graph of a frequency distribution. The number of observations is listed along the vertical axis, and the exact class limits are presented along the horizontal axis.
  3. Frequency Polygon: A line graph of a frequency distribution.
  4. Frequency Curve: A smoothed frequency polygon.
    • Negatively Skewed: Nonsymmetrical with the “tail” to the left.
    • Positively Skewed: Nonsymmetrical with the “tail” to the right.
    • Symmetrical: Same amount to the right and left.
  5. Time Series: A set of observed values for a sequentially ordered series of time periods.
  6. Measures of Location: Values that are calculated for a group of data and used to describe the data in some way. The value calculated should be representative of all of the values in the group. Some kind of average is strongly desired. An average is a measure of central tendency for a collection of data points.
  7. Mode: The value that occurs most frequently in a set of values. For a dataset in which no measured values are repeated, there is no mode.
  8. Measures of Dispersion: Deviation, Variance, Standard Deviation, Coefficient of Variation.
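The mode and the measures of dispersion listed above can be computed directly. A minimal sketch on a small hypothetical sample, using the population formulas for variance and standard deviation:

```python
from collections import Counter
from math import sqrt

data = [4, 8, 6, 5, 3, 8, 9, 5, 8]  # hypothetical sample

# Mode: the value that occurs most frequently (8 appears three times).
mode, freq = Counter(data).most_common(1)[0]

# Mean and deviations from the mean.
n = len(data)
mean = sum(data) / n
deviations = [x - mean for x in data]

# Variance: mean of squared deviations; SD: its square root.
variance = sum(d * d for d in deviations) / n
std_dev = sqrt(variance)

# Coefficient of variation: SD relative to the mean (unit-free),
# useful for comparing spread across datasets with different scales.
cv = std_dev / mean

print(mode, round(mean, 2), round(std_dev, 2), round(cv, 3))
```

Note that the deviations always sum to zero, which is why dispersion measures square them (variance/SD) or take their absolute value.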

Standard Deviation vs. Deviation

  • Standard Deviation (SD): Measures the spread of data points around the mean by squaring deviations. Higher sensitivity to outliers. Use when the distribution is roughly normal and the mean is the best measure of center (e.g., heights of adults).
  • Mean Absolute Deviation (MAD): Measures the average of the absolute differences from the mean. Lower sensitivity to outliers. Use when the data contain distant outliers or are not normally distributed (e.g., incomes with a few extremely high values).
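The difference in outlier sensitivity can be seen numerically. A small sketch with hypothetical income data, before and after adding one extreme value:

```python
from math import sqrt

def spread(data):
    """Return (standard deviation, mean absolute deviation), population form."""
    n = len(data)
    mean = sum(data) / n
    sd = sqrt(sum((x - mean) ** 2 for x in data) / n)
    mad = sum(abs(x - mean) for x in data) / n
    return sd, mad

incomes = [30, 32, 31, 29, 33]      # hypothetical, tightly grouped
with_outlier = incomes + [300]      # one extremely high income

sd1, mad1 = spread(incomes)
sd2, mad2 = spread(with_outlier)

# Squaring magnifies large deviations, so the outlier inflates the SD
# by a larger factor than it inflates the MAD.
print(round(sd1, 2), round(mad1, 2))
print(round(sd2, 2), round(mad2, 2))
```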

Regression Analysis

Objective: To estimate the value of a random variable (dependent variable) given that the value of an associated variable (independent variable) is known. Linear relationship between the dependent and independent variables.

Scatter Plot

A graph in which each plotted point represents an observed pair of values for the independent and dependent variables.

  • Independent Variable (X: horizontal axis): Factors that researchers manipulate or observe to understand the impact on other variables. Are not influenced by other variables. Example: Income per month / educational level.
  • Dependent Variable (Y: vertical axis): Are the outcomes or responses that researchers measure or observe. Are influenced by changes in the independent variable. Example: Cultural expenditures, happiness level.
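The income/cultural-expenditure example above can be fitted by least squares. A minimal sketch with hypothetical data, using the standard formulas b = cov(x, y) / var(x) and a = mean(y) - b * mean(x):

```python
# Hypothetical data: monthly income (X, independent) and
# cultural expenditures (Y, dependent).
x = [1000, 1500, 2000, 2500, 3000]
y = [50, 80, 100, 130, 150]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Least-squares slope: covariance of x and y over variance of x.
b = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
     / sum((xi - mean_x) ** 2 for xi in x))
# Intercept: the fitted line passes through (mean_x, mean_y).
a = mean_y - b * mean_x

def predict(xi):
    """Estimate the dependent variable for a new income value."""
    return a + b * xi

print(round(b, 4), round(a, 2), round(predict(2200), 2))
```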

Multiple Regression Analysis

An extension of simple regression analysis. It involves the use of two or more independent variables to estimate the value of the dependent variable. The multiple regression equation identifies the best fit based on the method of least squares. With more than one predictor, the best fit is no longer a line but a plane (or hyperplane) through n-dimensional space.
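With two predictors, the least-squares coefficients can be found by solving the 2x2 normal equations on mean-centered data. A sketch with hypothetical values chosen so the fit is exact (the data follow y = 5 + 4*x1 + 3*x2):

```python
# Two hypothetical predictors (x1, x2) and a response y.
x1 = [1, 2, 3, 4, 5]
x2 = [2, 1, 4, 3, 5]
y  = [15, 16, 29, 30, 40]

n = len(y)
m1, m2, my = sum(x1) / n, sum(x2) / n, sum(y) / n

# Centered sums of squares and cross-products.
s11 = sum((a - m1) ** 2 for a in x1)
s22 = sum((b - m2) ** 2 for b in x2)
s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
s1y = sum((a - m1) * (c - my) for a, c in zip(x1, y))
s2y = sum((b - m2) * (c - my) for b, c in zip(x2, y))

# Normal equations (a 2x2 linear system), solved by Cramer's rule:
#   s11*b1 + s12*b2 = s1y
#   s12*b1 + s22*b2 = s2y
det = s11 * s22 - s12 ** 2
b1 = (s1y * s22 - s2y * s12) / det
b2 = (s2y * s11 - s1y * s12) / det
b0 = my - b1 * m1 - b2 * m2  # intercept from the means

print(round(b0, 2), round(b1, 2), round(b2, 2))
```

In practice a library solver is used, but the structure is the same: one equation per coefficient, built from sums of squares and cross-products.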


Variables

A variable is any characteristic, number, or quantity that can be measured or quantified.

  1. Discrete Variable: Can have observed values only at isolated points along a scale of values. They are typically whole numbers and occur through the process of counting. Example: Number of students in a class “25”.
  2. Continuous Variable: Can assume a value at any fractional point along a specified interval of values. These values result from measuring and can include fractions and decimals. Example: Grade on an exam “8.5”.

Levels of Measurement

Levels of measurement refer to the different ways that variables and data can be quantified and analyzed in statistical research. They determine the types of statistical analyses that can be performed.

  1. Nominal: Involves data that can be categorized but not ranked. Categories that are non-ordered. Example: Gender (male, female, we can’t say that one is better than the other).
  2. Ordinal: Involves data that can be categorized and ranked. Some categories are higher than others. Example: Educational levels (high school, master’s degree, we can order from lower to higher).
  3. Interval: Data that can be categorized, ranked, and the intervals between values are meaningful and equal. Example: Temperature (the difference between 10°C and 20°C is the same as between 20°C and 30°C).
  4. Continuous / Discrete:
    • Continuous: Can take any value within a given range and are often measurements. Include fractions and decimals. Example: Height, weight, time.
    • Discrete: Can take only specific values, whole numbers, and result from counting. Example: Number of students in a class / Number of cars in a parking lot.


Collinearity

Collinearity occurs in multiple regression analysis when two or more independent variables are highly correlated with each other. This high correlation means that the variables share a significant amount of information, making it difficult to determine the individual effect of each variable on the dependent variable. Collinearity can lead to unreliable and unstable estimates of regression coefficients, making it challenging to identify the true relationship between the predictors and the outcome.

How to Determine Collinearity

  1. Correlation Matrix.
  2. Variance Inflation Factor.
  3. Tolerance.
  4. Eigenvalues and Condition Index.
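The first two checks above are straightforward to compute. A sketch with hypothetical engine-size and horsepower data; with exactly two predictors, the Variance Inflation Factor reduces to VIF = 1 / (1 - r^2), where r is their Pearson correlation:

```python
from math import sqrt

# Hypothetical predictors: engine size (litres) and horsepower
# tend to move together, which is the collinearity concern.
engine_size = [1.2, 1.6, 2.0, 2.5, 3.0, 3.5]
horsepower  = [75, 105, 130, 170, 200, 240]

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

r = pearson_r(engine_size, horsepower)

# VIF above roughly 10 is a common rule of thumb for serious collinearity.
vif = 1 / (1 - r ** 2)
print(round(r, 3), round(vif, 1))
```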

How to Eliminate Collinearity

  1. Remove highly correlated predictors.
  2. Combine predictors.
  3. Regularization techniques.
  4. Centering the variables.
  5. Increasing sample size.

By identifying and addressing collinearity, researchers can improve the accuracy and stability of their multiple regression models, ensuring more reliable and interpretable results.


Example: Predicting car prices using engine size and horsepower. Engine size and horsepower are highly correlated, making it difficult for the regression model to determine their individual effects on car price. To address collinearity, remove one variable or use regularization techniques.

Dummy Variable

A dummy variable is used in regression analysis to represent categorical data with two or more categories. It allows these categories to be included in the model by converting them into numerical values, typically 0 and 1. This conversion makes it possible to use categorical data in mathematical models that require numerical input.

Importance of Dummy Variables

  1. Simplify Interpretation: By coding categories as 0 or 1, dummy variables make the model easier to interpret.
  2. Enable the Inclusion of Categorical Data: Allow categorical variables to be included in regression models, which can enhance the model’s explanatory power.
  3. Facilitate Interaction Effects: They allow the examination of interaction effects between categorical and continuous variables.


Example: A categorical variable for gender with two categories, male and female, coded as Female = 0, Male = 1.
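The 0/1 coding above is a one-line transformation. A minimal sketch on hypothetical records, producing a dummy column that could then enter a regression model:

```python
# Hypothetical records with a two-category variable ("gender").
people = [
    {"gender": "female", "income": 2800},
    {"gender": "male",   "income": 3100},
    {"gender": "female", "income": 2950},
]

# Dummy coding, as in the example above: female -> 0, male -> 1.
for person in people:
    person["gender_dummy"] = 1 if person["gender"] == "male" else 0

dummies = [p["gender_dummy"] for p in people]
print(dummies)  # -> [0, 1, 0]
```

The coefficient a regression assigns to this dummy is then interpreted as the shift in the dependent variable for the category coded 1 relative to the category coded 0.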

Cluster Analysis

Cluster analysis is a statistical method for processing data. It organizes items into groups (clusters) based on how similar they are.

Objective: Find groups of subjects such that subjects within the same group are similar to each other across a set of characteristics, and dissimilar to subjects in other groups.

Importance of Cluster Analysis

  • Helps to identify distinct groups of customers with similar behaviors or characteristics for targeted marketing.
  • Helps to identify outliers that may indicate defects, fraud, or other significant events.

Types of Distances in Cluster Analysis

  • Intra-Cluster Distance: Distance between the data points inside the cluster (minimize).
  • Inter-Cluster Distance: Distance between data points in different clusters (maximize).
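Both distances can be measured directly on the data. A sketch with two hypothetical 2-D clusters, using Euclidean distance (math.dist, Python 3.8+): a good clustering keeps the first number small and the second large.

```python
from math import dist  # Euclidean distance between two points
from itertools import combinations, product

# Two hypothetical, well-separated clusters of 2-D points.
cluster_a = [(1, 1), (2, 1), (1, 2)]
cluster_b = [(8, 8), (9, 8), (8, 9)]

def mean_intra(cluster):
    """Average distance between pairs of points inside one cluster (minimize)."""
    pairs = list(combinations(cluster, 2))
    return sum(dist(p, q) for p, q in pairs) / len(pairs)

def mean_inter(c1, c2):
    """Average distance between points in different clusters (maximize)."""
    pairs = list(product(c1, c2))
    return sum(dist(p, q) for p, q in pairs) / len(pairs)

intra = mean_intra(cluster_a)
inter = mean_inter(cluster_a, cluster_b)
print(round(intra, 2), round(inter, 2))
```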

Coding a Cluster

With two categories, male and female, each observation is represented by two binary values: Male is coded as (1, 0) and Female as (0, 1).

When coding clusters, each category or cluster is represented by a binary variable. This allows the inclusion of categorical information in a numerical format suitable for statistical analysis and modeling. By coding clusters in this manner, you can incorporate categorical data into your analyses, allowing for more nuanced and detailed insights into the structure and relationships within your data set.
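This binary (one-hot) coding generalizes to any number of categories: each category gets its own 0/1 indicator. A minimal sketch on hypothetical observations:

```python
# One-hot coding of a categorical variable: each category becomes
# its own 0/1 indicator, so "male" -> (1, 0) and "female" -> (0, 1).
observations = ["male", "female", "female", "male"]
categories = ["male", "female"]

one_hot = [
    tuple(1 if obs == cat else 0 for cat in categories)
    for obs in observations
]
print(one_hot)  # -> [(1, 0), (0, 1), (0, 1), (1, 0)]
```

Exactly one indicator is 1 in each tuple, which is what makes the coding usable as numerical input for clustering or regression.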

Types of Clustering

  1. Partitional Clustering: Division of data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset.
  2. Hierarchical Clustering: A set of nested clusters organized as a hierarchical tree.
  3. Exclusive vs. Non-Exclusive:
    • Exclusive: Points belong to only one cluster.
    • Non-Exclusive: Points may belong to multiple clusters.
  4. Fuzzy vs. Non-Fuzzy:
    • Fuzzy: Points belong to every cluster with some weight between 0 and 1.
    • Non-Fuzzy: Points belong to clusters with a binary membership.
  5. Partial vs. Complete:
    • Partial: Only a portion of the data is clustered.
    • Complete: The entire dataset is clustered.
  6. Heterogeneous vs. Homogeneous:
    • Heterogeneous: Clusters exhibit diversified characteristics among the observations within them.
    • Homogeneous: Clusters demonstrate relatively similar characteristics within the categories being clustered.