Essential Statistics Concepts and Formulas

Fundamental Statistical Definitions

1. Define Mean: The Mean is the arithmetic average of a dataset. It is calculated as the sum of all values ÷ the number of values.

2. Find Mean of First Ten Natural Numbers: The first 10 natural numbers are 1 to 10. The sum is 55. Mean = 55 ÷ 10 = 5.5.

3. Define Median: The Median is the middle value when data is arranged in ascending or descending order.

4. Find Median of First Ten Even Numbers: The first 10 even numbers are 2, 4, 6, 8, 10, 12, 14, 16, 18, 20. For an even count (10 values), the median is the average of the 5th and 6th values. 5th = 10, 6th = 12. Median = (10 + 12) / 2 = 11.

5. Define Mode: The Mode is the value that repeats most frequently in a dataset.
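
A minimal Python sketch, using the standard-library statistics module, that reproduces the worked results from items 2 and 4 above; the list used for the mode is purely illustrative.

```python
import statistics

naturals = list(range(1, 11))      # first ten natural numbers
evens = list(range(2, 21, 2))      # first ten even numbers
sample = [3, 5, 5, 7, 9]           # illustrative list with a repeated value

print(statistics.mean(naturals))   # 5.5
print(statistics.median(evens))    # 11.0, the average of the 5th and 6th values
print(statistics.mode(sample))     # 5, the most frequent value
```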

6. Define Mean Deviation: Mean Deviation is the average absolute distance of all values from the Mean or Median. In simple terms, it represents the average gap between each value and the central value.

7. Define Standard Deviation: Standard Deviation (SD) is the square root of the average of squared deviations from the mean. It measures how spread out, or variable, the data is.
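
As a sketch, both measures can be computed directly from their definitions; the data list is illustrative, and the standard deviation below uses the population form (dividing by n) to match the "average of squared deviations" wording above.

```python
import math

data = [4, 8, 6, 5, 7]                     # illustrative values
n = len(data)
mean = sum(data) / n

# Mean Deviation about the mean: average absolute distance from the mean
mean_deviation = sum(abs(x - mean) for x in data) / n

# Standard Deviation: square root of the average squared deviation
variance = sum((x - mean) ** 2 for x in data) / n
std_dev = math.sqrt(variance)

print(mean, mean_deviation, std_dev)       # 6.0, 1.2, about 1.414
```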

8. Define Bivariate Data: Bivariate data refers to data involving two variables measured at the same time. Example: The height and weight of students.

9. Define Correlation: Correlation measures the strength and direction of the relationship between two variables.

10. Define Rank Correlation: Rank Correlation is a relationship measure using ranks rather than actual values.

Regression, Time Series, and Index Numbers

11. Regression Lines (Two Equations):
Regression of Y on X: Y − Ȳ = bᵧₓ (X − X̄)
Regression of X on Y: X − X̄ = bₓᵧ (Y − Ȳ)
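
A small sketch of how the two regression coefficients can be obtained from sample data, using bᵧₓ = Σ(x − x̄)(y − ȳ) ÷ Σ(x − x̄)² and bₓᵧ = Σ(x − x̄)(y − ȳ) ÷ Σ(y − ȳ)²; the paired observations are illustrative.

```python
# Illustrative paired observations
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_yy = sum((yi - y_bar) ** 2 for yi in y)

b_yx = s_xy / s_xx   # slope of the regression of Y on X
b_xy = s_xy / s_yy   # slope of the regression of X on Y

# Regression of Y on X: Y - y_bar = b_yx * (X - x_bar)
print(b_yx, b_xy)
```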

12. What is Time Series? A Time Series is data collected in chronological order. Example: Monthly sales data.

13. What are Index Numbers? Index Numbers are statistics that compare changes in price or quantity from a base year.

Probability and Event Types

14. Define an Event: An Event is an outcome, or a set of outcomes, of an experiment. Example: Rolling a 4 on a die.

15. Probability of an Event: Probability is the chance of an event occurring. Formula: Favourable outcomes ÷ Total outcomes.

16. Mutually Exclusive Events: These are events that cannot occur at the same time. Example: A coin toss cannot result in both heads and tails simultaneously.

17. Independent Events: These are events where the result of one does not affect the other. Example: Tossing two separate coins.

18. Sure Event: A Sure Event is one that is certain to occur (probability = 1). Example: A die roll resulting in a number between 1 and 6.

19. Impossible Event: An Impossible Event is one that can never occur (probability = 0). Example: Rolling a 7 on a standard die.

20. Addition Theorem on Probability: P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
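
For example, the probability of drawing a heart or a king from a standard 52-card deck is 13/52 + 4/52 − 1/52 = 16/52, because the king of hearts belongs to both events and would otherwise be counted twice.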

Advanced Probability and Hypotheses

21. Conditional Probability: This is the probability of Event A occurring given that Event B has already occurred. Formula: P(A|B) = P(A ∩ B) / P(B)

22. Multiplication Theorem on Probability: P(A ∩ B) = P(A) × P(B|A)

23. Bayes’ Theorem: P(A|B) = [P(B|A) × P(A)] ÷ P(B)
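
A short numeric sketch of Bayes' Theorem using hypothetical values (a 1% base rate, a 95% true-positive rate, and a 10% false-positive rate), purely to show how the formula is applied.

```python
# Hypothetical inputs for illustration only
p_a = 0.01                     # P(A): prior probability of the condition
p_b_given_a = 0.95             # P(B|A): probability of a positive test given A
p_b_given_not_a = 0.10         # P(B|not A): false-positive rate

# Total probability of B (a positive test)
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

# Bayes' Theorem: P(A|B) = P(B|A) * P(A) / P(B)
p_a_given_b = p_b_given_a * p_a / p_b
print(round(p_a_given_b, 3))   # about 0.088
```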

24. What is Hypothesis? A Hypothesis is an assumption or idea that is tested statistically.

25. Null & Alternative Hypothesis:
Null Hypothesis (H₀): There is no change or effect.
Alternative Hypothesis (H₁): There is a change or effect.

Measures of Central Tendency (Detailed)

Meaning

Measures of central tendency are statistical values that summarize a dataset using a single representative value.

Purpose

Their main purpose is to show the center or typical value of the data so that the entire dataset can be understood through one number.

Types

There are three main measures: Mean, Median, and Mode.

  • Mean: The arithmetic average (sum of values ÷ total count). It indicates the mathematical center.
  • Median: The middle value when data is arranged in ascending order. It is not affected by extreme outliers.
  • Mode: The value that repeats most often. It is useful for qualitative data.

Usefulness

These measures simplify data, make comparisons easier, and assist in decision-making within business, economics, and research.

Understanding Central Tendency and Dispersion

Measures of Central Tendency

Measures of Central Tendency are single summary values that describe a whole set of data by identifying the central position. Colloquially referred to as averages, they provide a representative value for the distribution.

Measures of Dispersion

Measures of Dispersion, or variability, describe how much data values are scattered around the central value. A small dispersion indicates data is clustered closely, while a large dispersion suggests the average may not be representative.

Common Measures of Dispersion

  • Range: The difference between the maximum and minimum values.
  • Variance: The average of squared differences from the mean, giving greater weight to extreme deviations.
  • Standard Deviation: The square root of variance. It returns the measure to the original units, making it easier to interpret.
  • Interquartile Range (IQR): The difference between the third quartile (Q3) and the first quartile (Q1), representing the middle 50% of data.
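
A sketch computing the four measures with NumPy (assumed available); the data are illustrative, np.var defaults to the population variance, and the IQR uses NumPy's default quartile interpolation.

```python
import numpy as np

data = np.array([2, 4, 4, 4, 5, 5, 7, 9])     # illustrative values

data_range = data.max() - data.min()          # Range
variance = data.var()                         # population variance (ddof=0)
std_dev = data.std()                          # Standard Deviation
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1                                 # Interquartile Range

print(data_range, variance, std_dev, iqr)
```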

Correlation and Regression Analysis

What is Correlation?

Correlation is a statistical measure expressing the extent to which two variables are linearly related. It describes direction and strength but does not imply causation.

Types of Correlation

  • Positive Correlation: Variables move in the same direction (e.g., study hours and test scores).
  • Negative Correlation: Variables move in opposite directions (e.g., price and demand).
  • Zero Correlation: No apparent linear relationship exists.

Measuring Correlation

Correlation is measured using the Correlation Coefficient (r), specifically the Pearson Product-Moment Correlation Coefficient. It ranges from -1 to +1. Other methods include Scatter Diagrams and Spearman’s Rank Correlation Coefficient.
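
A minimal sketch of Pearson's r computed from its definition (covariance divided by the product of the deviations' root sums of squares); the paired data are illustrative.

```python
import math

x = [2, 4, 6, 8, 10]       # e.g., study hours (illustrative)
y = [55, 60, 68, 72, 85]   # e.g., test scores (illustrative)

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

cov = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
ss_x = math.sqrt(sum((xi - x_bar) ** 2 for xi in x))
ss_y = math.sqrt(sum((yi - y_bar) ** 2 for yi in y))

r = cov / (ss_x * ss_y)    # always lies between -1 and +1
print(round(r, 3))
```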

Rank Correlation

Rank correlation assesses the monotonic association between two sets of data using rank order rather than raw values. It is a non-parametric alternative to Pearson correlation, useful for ordinal data or datasets with outliers.
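
A small sketch of Spearman's coefficient using the shortcut formula ρ = 1 − 6Σd² ÷ n(n² − 1), which applies when there are no tied ranks; the rank lists are illustrative.

```python
# Illustrative ranks assigned by two judges (no ties)
rank_x = [1, 2, 3, 4, 5]
rank_y = [2, 1, 4, 3, 5]

n = len(rank_x)
d_squared = sum((rx - ry) ** 2 for rx, ry in zip(rank_x, rank_y))

rho = 1 - (6 * d_squared) / (n * (n ** 2 - 1))
print(rho)   # 0.8 for these ranks
```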

Regression Analysis

Regression Analysis investigates the relationship between a Dependent Variable (Y) and one or more Independent Variables (X). It is used for prediction and forecasting.

  • Linear Regression: Models continuous relationships.
  • Logistic Regression: Estimates the probability of binary outcomes (e.g., Yes/No).
  • Polynomial Regression: Models non-linear relationships.
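
A brief sketch of simple linear regression fitted by least squares with NumPy's polyfit (assumed available); the data and the forecast point are illustrative.

```python
import numpy as np

# Illustrative data: advertising spend (X) vs. sales (Y)
x = np.array([1, 2, 3, 4, 5])
y = np.array([3, 5, 6, 8, 11])

# Fit Y = a + bX by least squares (degree-1 polynomial); polyfit returns slope first
b, a = np.polyfit(x, y, 1)

predicted = a + b * 6          # forecast for X = 6
print(a, b, predicted)
```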

Index Numbers and Hypothesis Testing

Weighted vs. Unweighted Index Numbers

Unweighted Index Numbers give equal importance to every item, which can be unrealistic in economic contexts. Weighted Index Numbers assign explicit weights based on relative importance (e.g., quantity consumed). Common formulas include Laspeyres, Paasche, and Fisher indices, used for the Consumer Price Index (CPI).
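
A sketch of the three weighted price indices computed from their standard formulas; the prices and quantities are illustrative (0 = base year, 1 = current year).

```python
import math

# Illustrative prices and quantities for three items
p0 = [10, 20, 5]    # base-year prices
q0 = [4, 2, 10]     # base-year quantities
p1 = [12, 22, 6]    # current-year prices
q1 = [5, 2, 9]      # current-year quantities

# Laspeyres: base-year quantities as weights
laspeyres = 100 * sum(p * q for p, q in zip(p1, q0)) / sum(p * q for p, q in zip(p0, q0))

# Paasche: current-year quantities as weights
paasche = 100 * sum(p * q for p, q in zip(p1, q1)) / sum(p * q for p, q in zip(p0, q1))

# Fisher: geometric mean of the two
fisher = math.sqrt(laspeyres * paasche)

print(round(laspeyres, 2), round(paasche, 2), round(fisher, 2))
```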

Null and Alternative Hypotheses

The Null Hypothesis (H₀) represents the status quo or a statement of no effect. The Alternative Hypothesis (Hₐ or H₁) is the claim the researcher seeks to prove. Tests can be Two-Tailed (checking for any difference) or One-Tailed (checking for a difference in a specific direction).
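
A short sketch of a two-tailed one-sample t-test with SciPy (assumed available); the sample values and the hypothesized mean of 50 are illustrative, and the p-value is compared against a 5% significance level.

```python
from scipy import stats

sample = [52, 48, 51, 53, 49, 50, 54, 47, 52, 51]   # illustrative observations
mu_0 = 50                                           # value stated by H0

# Two-tailed test of H0: population mean = 50
t_stat, p_value = stats.ttest_1samp(sample, mu_0)

if p_value < 0.05:
    print("Reject H0 at the 5% level")
else:
    print("Fail to reject H0 at the 5% level")
```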

Estimation and Sampling

Point vs. Interval Estimation

Point Estimation provides a single best-guess value (e.g., sample mean). Interval Estimation provides a range of values, known as a Confidence Interval, which accounts for variability and provides a measure of precision.
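
A sketch of a 95% confidence interval for a mean using the normal (z) approximation with z ≈ 1.96; the sample values are illustrative, and for small samples a t-multiplier would be more appropriate.

```python
import math

sample = [23, 25, 21, 28, 24, 26, 22, 27, 25, 24]   # illustrative data
n = len(sample)

mean = sum(sample) / n                                          # point estimate
s = math.sqrt(sum((x - mean) ** 2 for x in sample) / (n - 1))   # sample SD
margin = 1.96 * s / math.sqrt(n)                                # z-based margin of error

print(f"Point estimate: {mean:.2f}")
print(f"95% interval: ({mean - margin:.2f}, {mean + margin:.2f})")
```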

Scatter Diagrams

A scatter diagram plots two variables on Cartesian coordinates. It visually identifies the direction and strength of correlation and helps spot outliers that might affect statistical calculations.
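
A minimal plotting sketch with Matplotlib (assumed available); the paired height and weight values are illustrative.

```python
import matplotlib.pyplot as plt

height = [150, 155, 160, 165, 170, 175]   # illustrative X values
weight = [50, 54, 59, 63, 68, 72]         # illustrative Y values

plt.scatter(height, weight)               # one point per (X, Y) pair
plt.xlabel("Height (cm)")
plt.ylabel("Weight (kg)")
plt.title("Scatter diagram of height vs. weight")
plt.show()
```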

Components of Time Series

A Time Series is a sequence of data points collected over time. Its components include:

  • Trend (Tₜ): Long-term direction (growth or decline).
  • Seasonal Variation (Sₜ): Predictable patterns repeating within a year.
  • Cyclical Variation (Cₜ): Long-term oscillations linked to business cycles.
  • Irregular Variation (Iₜ): Unpredictable random fluctuations.
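
A small sketch that estimates the trend component with a centered moving average; the monthly sales figures and the 3-period window are illustrative.

```python
sales = [120, 130, 125, 140, 150, 145, 160, 170]   # illustrative monthly sales

window = 3
trend = []
for i in range(1, len(sales) - 1):
    # centered 3-period moving average smooths out short-term fluctuations
    trend.append(sum(sales[i - 1:i + 2]) / window)

print(trend)
```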

Sampling and Sampling Distribution

Sampling is the process of selecting a representative subset of a population. A Sampling Distribution is the probability distribution of a statistic (like the mean) derived from many random samples. It is essential for calculating standard error and performing statistical inference.
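
A sketch simulating the sampling distribution of the mean with NumPy (assumed available): repeated random samples are drawn from a hypothetical population, and the spread of the sample means approximates the standard error σ ÷ √n.

```python
import numpy as np

rng = np.random.default_rng(0)
population = rng.normal(loc=100, scale=15, size=100_000)   # hypothetical population

sample_size = 30
sample_means = [rng.choice(population, size=sample_size, replace=False).mean()
                for _ in range(1_000)]

# The standard deviation of the sample means approximates sigma / sqrt(n)
print(np.std(sample_means))
print(population.std() / np.sqrt(sample_size))
```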