Statistics Fundamentals

Posted on May 31, 2024 in Mathematics

Statistics

Statistics involves techniques for collecting, organizing, analyzing, and interpreting data. It helps in checking data for accuracy, spotting unusual patterns, making informed decisions, and understanding relationships within data.

Types of Statistics

Descriptive Statistics: Summarizes and describes numerical data for easier interpretation, providing basic information like averages.
Inferential Statistics: Draws conclusions and makes predictions about a population based on a sample.

Variables

Quantitative Variables

Quantitative variables represent numerical data suitable for analysis.

Continuous Variables: Can take any value within a given range, including fractional values (e.g., height, weight, average exam grade).
Discrete Variables: Take only specific, countable values, typically whole numbers (e.g., number of students in a class, number of cars in a city).

Qualitative Variables

Qualitative variables represent non-numeric data.

Nominal Variables: Categories without inherent order, where one category is not superior to another (e.g., gender, types of fruit).
Ordinal Variables: Categories with a logical order, often ranked from best to worst (e.g., education level).

Sampling Methods

Random Sampling

Selection is left to chance: every member of the population has a known, nonzero chance of being selected (an equal chance in simple random sampling).

Simple Random Sampling: Randomly selecting individuals (e.g., using random numbers).
Systematic Random Sampling: Selecting every nth individual from a list (e.g., every 3rd person entering a room).
Stratified Random Sampling: Dividing the population into subgroups (strata) based on a characteristic and randomly selecting from each stratum (e.g., grouping people by eye color and randomly selecting some from each group).
Cluster Random Sampling: Dividing the population into clusters, randomly selecting some clusters, and including every member of the selected clusters (e.g., randomly choosing several classrooms and surveying every student in them).
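
The four random sampling methods above can be sketched in a few lines of Python. This is a minimal illustration, not a survey-grade implementation: the population of 30 people, the eye_color field, the classroom clusters, and all function names are hypothetical and chosen only for the example.

```python
import random

def simple_random_sample(population, k, rng):
    """Pick k distinct members, each equally likely to be chosen."""
    return rng.sample(population, k)

def systematic_sample(population, step, rng):
    """Pick every step-th member after a random starting offset."""
    start = rng.randrange(step)
    return population[start::step]

def stratified_sample(population, key, per_stratum, rng):
    """Group members by key(member), then sample randomly within each group."""
    strata = {}
    for member in population:
        strata.setdefault(key(member), []).append(member)
    sample = []
    for members in strata.values():
        sample.extend(rng.sample(members, min(per_stratum, len(members))))
    return sample

def cluster_sample(clusters, n_clusters, rng):
    """Randomly pick whole clusters and keep every member of each one."""
    chosen = rng.sample(clusters, n_clusters)
    return [member for cluster in chosen for member in cluster]

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical population: 30 people with an eye color each.
colors = ["blue", "brown", "green"]
people = [{"id": i, "eye_color": colors[i % 3]} for i in range(30)]

# Hypothetical clusters: six "classrooms" of five people.
classrooms = [people[i:i + 5] for i in range(0, 30, 5)]

simple = simple_random_sample(people, 6, rng)
systematic = systematic_sample(people, 3, rng)
stratified = stratified_sample(people, lambda p: p["eye_color"], 2, rng)
clustered = cluster_sample(classrooms, 2, rng)

print(len(simple), len(systematic), len(stratified), len(clustered))
```

Note the difference between the last two: stratified sampling takes a few members from every group, while cluster sampling takes every member from a few groups.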

Non-Random Sampling

Individuals are selected based on convenience or other non-random criteria.

Convenience Sampling: Selecting individuals who are easily accessible (e.g., friends or family).

Quantitative Research

Quantitative research collects and analyzes numerical data. It uses mathematical models and hypotheses to understand phenomena. Measuring variables can be challenging when they capture subjective concepts (e.g., different perceptions of poverty).

Types of Quantitative Research

Exploratory Research: Investigates a research question without pre-defined hypotheses.
Descriptive Research: Describes a phenomenon or population.
Causal Research: Investigates cause-and-effect relationships between variables (e.g., does income influence pet ownership?).

Levels of Measurement

Nominal: Categories without order or ranking (e.g., race, types of fruit).
Ordinal: Categories with a meaningful order (e.g., social class, education level).
Interval: Meaningful differences between values, but zero does not indicate absence (e.g., temperature in Celsius or Fahrenheit, IQ scores).
Ratio: Meaningful differences between values and a true zero that indicates absence. Values may be continuous (allowing decimals) or discrete (whole numbers) (e.g., income, number of children).

Validity

Internal Validity: The degree to which a study shows that the observed effect is caused by the independent variable rather than by other factors. Controlling for extraneous variables is crucial to ensure internal validity.
External Validity: The generalizability of findings to other populations, settings, or times.

Measures of Dispersion

Measures of dispersion describe the spread or variability of data.

Mean Absolute Deviation (MAD): The average absolute distance of the data points from the mean. It is less sensitive to outliers than the variance or standard deviation.
Variance: The average squared difference from the mean.
Standard Deviation: The square root of the variance, indicating the spread of data around the mean. Useful for bell-shaped distributions.
Coefficient of Variation: The standard deviation divided by the mean, expressed as a percentage. Higher values indicate greater dispersion relative to the mean.
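
As a rough worked example, the four measures above can be computed with Python's standard statistics module. The data values are made up for illustration, and the sketch assumes population formulas (dividing by n); for a sample, statistics.variance and statistics.stdev divide by n - 1 instead.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical data, treated as a whole population

mean = statistics.fmean(data)

# Mean absolute deviation: average absolute distance from the mean.
mad = sum(abs(x - mean) for x in data) / len(data)

# Population variance and standard deviation (divide by n).
variance = statistics.pvariance(data)
std_dev = statistics.pstdev(data)

# Coefficient of variation: standard deviation relative to the mean, in percent.
cv = std_dev / mean * 100

print(f"mean={mean}, MAD={mad}, variance={variance}, sd={std_dev}, CV={cv}%")
```

For this data the mean is 5, the MAD is 1.5, the variance is 4, the standard deviation is 2, and the coefficient of variation is 40%.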

Normal Distribution

In a normal distribution, approximately 68% of data falls within one standard deviation of the mean, 95% within two standard deviations, and 99.7% within three standard deviations.
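
These percentages can be checked with a short sketch using NormalDist from Python's standard library. The helper name share_within is an assumption made for this example.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
standard_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Proportion of values lying within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {share_within(k):.4f}")
```

The printed proportions are roughly 0.6827, 0.9545, and 0.9973, which round to the 68%, 95%, and 99.7% figures.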

Skewness

Pearson’s coefficient of skewness measures the asymmetry of a distribution.

Coefficient of 0: Symmetrical distribution.
Positive Value: Right-skewed distribution (long tail on the right).
Negative Value: Left-skewed distribution (long tail on the left).
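
The text does not give the formula, so the sketch below assumes the median-based form, often called Pearson's second coefficient of skewness: 3 × (mean − median) / standard deviation. The function name and the three small datasets are hypothetical.

```python
import statistics

def pearson_skewness(data):
    """Median-based Pearson skewness: 3 * (mean - median) / standard deviation."""
    mean = statistics.fmean(data)
    median = statistics.median(data)
    sd = statistics.stdev(data)  # sample standard deviation
    return 3 * (mean - median) / sd

symmetric = [1, 2, 3, 4, 5]           # mean equals median
right_skewed = [1, 2, 2, 3, 12]       # a large value pulls the mean above the median
left_skewed = [-12, -3, -2, -2, -1]   # a small value pulls the mean below the median

for name, values in [("symmetric", symmetric),
                     ("right-skewed", right_skewed),
                     ("left-skewed", left_skewed)]:
    print(name, round(pearson_skewness(values), 3))
```

The sign of the result follows the position of the mean relative to the median: zero when they coincide, positive when the mean is larger, negative when it is smaller.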

Measures of Central Tendency

Mean: The average of all values in a dataset.
Weighted Mean: Assigns weights to values based on their importance before calculating the average.
Median: The middle value in a sorted dataset (or the average of the two middle values when the number of values is even).
Mode: The most frequent value in a dataset.
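
A small worked example of the four measures, using Python's statistics module. The grades and weights are made-up values; the weights stand in for something like credit hours.

```python
import statistics

grades = [70, 85, 85, 90, 100]  # hypothetical exam grades
weights = [1, 1, 2, 2, 4]       # hypothetical importance of each grade

mean = statistics.fmean(grades)

# Weighted mean: multiply each value by its weight, then divide by the total weight.
weighted_mean = sum(g * w for g, w in zip(grades, weights)) / sum(weights)

median = statistics.median(grades)
mode = statistics.mode(grades)

print(mean, weighted_mean, median, mode)
```

Here the mean is 86, the weighted mean is 90.5 (the heavily weighted grade of 100 pulls it up), and both the median and the mode are 85.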

Regression Analysis

Regression analysis estimates the value of a dependent variable from one or more independent variables. Simple linear regression uses a single independent variable and assumes a linear relationship between the two.
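
A minimal sketch of fitting a straight line by ordinary least squares, assuming a single independent variable. The function name fit_line and the advertising and sales figures are hypothetical; the data is built to follow y = 2x + 1 exactly so the result is easy to check by hand.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = sum of products of deviations / sum of squared x deviations.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    # The fitted line always passes through (mean_x, mean_y).
    intercept = mean_y - slope * mean_x
    return slope, intercept

ad_spend = [1, 2, 3, 4, 5]   # independent variable (hypothetical units)
sales = [3, 5, 7, 9, 11]     # dependent variable, exactly 2 * ad_spend + 1

slope, intercept = fit_line(ad_spend, sales)
predicted_sales = slope * 6 + intercept  # estimate for an unseen x value

print(slope, intercept, predicted_sales)
```

The fit recovers a slope of 2 and an intercept of 1, so the estimate for an advertising spend of 6 is 13. Real data will not fall exactly on a line, and the fitted line then minimizes the squared vertical distances to the points.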

Scatter Plot

A scatter plot visualizes the relationship between two quantitative variables. The independent variable is plotted on the x-axis, and the dependent variable on the y-axis. The slope of the trend line through the points indicates the direction of the relationship (positive, negative, or, when the line is flat, no linear relationship).

Correlation Analysis

Correlation analysis measures the strength and direction of the relationship between two variables.

Coefficient of Determination (R-squared): The proportion of the variation in the dependent variable that the regression model explains, ranging from 0 to 1. Higher values represent a better fit.
Correlation Coefficient (r): Ranges from -1 to +1, indicating the strength and direction of the relationship. Values closer to -1 or +1 represent stronger relationships.
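
Both quantities can be sketched in plain Python. The function name pearson_r and the temperature and ice cream figures are hypothetical. The sketch also relies on the fact that, for a regression with a single independent variable, R-squared equals the square of r.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

temperature = [20, 25, 30, 35]  # independent variable (hypothetical)
ice_cream = [10, 14, 15, 21]    # dependent variable (hypothetical)

r = pearson_r(temperature, ice_cream)
r_squared = r ** 2  # valid as R-squared only for one independent variable

# A perfectly linear, decreasing relationship gives r = -1.
r_negative = pearson_r([1, 2, 3], [6, 4, 2])

print(round(r, 3), round(r_squared, 3), r_negative)
```

For the temperature data, r is about 0.97 and R-squared about 0.93, a strong positive relationship.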

Examples of Independent and Dependent Variables

Independent Variables: Income, distance, advertising expenses, temperature.
Dependent Variables: Cultural experiences, delivery time, sales, ice cream consumption, press expenses.

Box Plot

A box plot displays the distribution of data using quartiles.

Lowest Point: The minimum value in the dataset.
First Quartile (Q1): 25% of the data falls below this value.
Second Quartile (Q2 or Median): 50% of the data falls below this value.
Third Quartile (Q3): 75% of the data falls below this value.
Highest Point: The maximum value in the dataset.
Interquartile Range (IQR): The distance between the first and third quartiles (Q3 – Q1).
Outliers: Data points that fall far outside the box, commonly defined as more than 1.5 × IQR below Q1 or above Q3.
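
The quantities behind a box plot can be computed as follows. This is a sketch under two assumptions: quartiles use the "inclusive" method of statistics.quantiles (other quartile conventions give slightly different values), and outliers follow the common 1.5 × IQR rule. The data is made up.

```python
import statistics

data = [1, 3, 4, 5, 6, 7, 8, 9, 10, 11, 40]  # hypothetical data with one extreme value

lowest, highest = min(data), max(data)

# Quartiles: the three cut points that split the sorted data into four parts.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

# Fences of the 1.5 * IQR rule; anything beyond them is flagged as an outlier.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print(lowest, q1, q2, q3, highest, iqr, outliers)
```

Here the median is 7, the IQR is 5, and the only value flagged as an outlier is 40.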