Data Analysis Fundamentals: Central Tendency & Variability
Descriptive Statistics: Central Tendency & Dispersion
Measures of Central Tendency
Understanding the Mean
The mean of the weights is their arithmetic average: the sum of all the weights in the table divided by the number of weights.
Remarks on the Mean
- Very easy to compute.
- Takes into consideration all values in the dataset.
- Highly sensitive to extreme values among the data (outliers).
There are some variations of the mean (harmonic mean, geometric mean…) which we will not study in this course.
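As a small illustration of the remarks above, the following sketch computes the mean by hand and shows how one extreme value shifts it. The weights are hypothetical values chosen for the example, and `arithmetic_mean` is an illustrative helper name, not something defined in the course.

```python
def arithmetic_mean(values):
    """Return the sum of the values divided by how many there are."""
    if not values:
        raise ValueError("the mean of an empty dataset is undefined")
    return sum(values) / len(values)


# Hypothetical weights in kg (not taken from the course table).
weights = [60, 65, 70, 75, 80]
# The same data with one extreme value (an outlier) appended.
weights_with_outlier = weights + [250]

mean_plain = arithmetic_mean(weights)
mean_outlier = arithmetic_mean(weights_with_outlier)

print(mean_plain)    # 70.0
print(mean_outlier)  # 100.0
```

A single outlier moves the mean from 70 to 100, above every one of the original weights.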
Understanding the Median
The median is the number in the middle of the list of weights, once we have sorted them in increasing order.
Computing the Median
To compute the median, sort the data by increasing values, then find the middle point:
- If the sample has an odd size, then the median is the value in the middle of the list.
- If the sample has an even size, then the median is the average of the two values in the middle of the list.
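The two cases above can be sketched as a short function. This is a minimal illustration with hypothetical data. The helper name `median_of` is an assumption made for the example.

```python
def median_of(values):
    """Median following the odd/even rule described in the text."""
    if not values:
        raise ValueError("the median of an empty dataset is undefined")
    ordered = sorted(values)           # sort by increasing values
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        # Odd size: the single value in the middle of the list.
        return ordered[middle]
    # Even size: the average of the two values in the middle.
    return (ordered[middle - 1] + ordered[middle]) / 2


print(median_of([80, 60, 70]))      # 70
print(median_of([80, 60, 70, 75]))  # 72.5
```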
Remarks on the Median
- Also very easy to compute.
- Depends only on the rank order of the data, not on the magnitude of each value: making the largest value even larger leaves the median unchanged.
- Not much affected by extreme values among the data (outliers).
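To make these remarks concrete, the sketch below replaces the largest value of a hypothetical dataset with a much larger one and compares the median and the mean before and after. It uses the standard library `statistics` module. The data are invented for the example.

```python
from statistics import mean, median

# Hypothetical weights in kg.
weights = [60, 65, 70, 75, 80]
# Same data, but the largest value is replaced by an extreme one.
weights_extreme = [60, 65, 70, 75, 8000]

median_before = median(weights)
median_after = median(weights_extreme)
mean_before = mean(weights)
mean_after = mean(weights_extreme)

print(median_before, median_after)  # 70 70
print(mean_before, mean_after)      # 70 1654
```

The order of the values is unchanged, so the median stays at 70, while the mean jumps to 1654.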
Measures of Statistical Dispersion
Finding the central values of the data is just the first step in studying its distribution. The next step is to analyze the variability or dispersion of the data.
Depending on the central value chosen (mean or median), there are different ways to analyze the dispersion of the data.
Dispersion Measures Relative to the Mean
Variance, standard deviation, and Pearson’s coefficient of variation measure data dispersion with respect to the mean.
Variance and Standard Deviation
These measures quantify the dispersion of the data with respect to the mean (as the central value of the data).
- The standard deviation is measured in the same units as the original data. This is useful when interpreting it as a measure of dispersion.
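- Taking the square root to define the standard deviation introduces a technical inconvenience: even when the sample variance is an unbiased estimate of the population variance, its square root is a biased estimate of the population standard deviation. For this reason the variance is often preferred in theoretical work, although it is more difficult to interpret because it is expressed in squared units.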
- Both the variance and the standard deviation are sensitive to extreme values of the data (because they are based on the mean, which was already sensitive to such values).
- What happens if the standard deviation is zero?
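To compute the variance of a sample $x_1, \dots, x_n$ with mean $\bar{x}$, a standard formula is $s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$ (dividing by $n$ instead of $n-1$ gives the variance of a whole population).
To compute the standard deviation, take the square root of the variance: $s = \sqrt{s^2}$.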
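A minimal sketch, assuming the usual definitions: the variance is the average squared deviation from the mean, with divisor n - 1 for a sample and n for a whole population, and the standard deviation is its square root. The function names, the `sample` flag, and the data are illustrative assumptions.

```python
import math


def variance(values, sample=True):
    """Average squared deviation from the mean.

    sample=True divides by n - 1 (sample variance);
    sample=False divides by n (population variance).
    """
    n = len(values)
    if n < 2 and sample:
        raise ValueError("the sample variance needs at least two values")
    if n == 0:
        raise ValueError("the variance of an empty dataset is undefined")
    center = sum(values) / n
    squared_deviations = sum((x - center) ** 2 for x in values)
    return squared_deviations / (n - 1 if sample else n)


def standard_deviation(values, sample=True):
    """Square root of the variance, in the same units as the data."""
    return math.sqrt(variance(values, sample))


weights = [60, 65, 70, 75, 80]             # hypothetical weights in kg
print(variance(weights))                   # 62.5
print(variance(weights, sample=False))     # 50.0
print(standard_deviation([70, 70, 70]))    # 0.0
```

The last line suggests an answer to the question raised above: the standard deviation is zero exactly when every value equals the mean, that is, when the data show no variability at all.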
Pearson’s Coefficient of Variation
This is the ratio of the standard deviation to the mean (consider it as a percentage of variation).
To compute Pearson’s coefficient of variation, divide the standard deviation $s$ by the mean $\bar{x}$: $CV = s / \bar{x}$ (multiply by 100 to express it as a percentage). It is only defined when the mean is not zero.
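The sketch below computes the coefficient of variation as the standard deviation divided by the mean. It assumes the sample standard deviation (`statistics.stdev`, divisor n - 1). The helper name and the data are hypothetical. It also checks that the ratio does not depend on the units, which is what makes it useful for comparing datasets.

```python
from statistics import mean, stdev


def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean."""
    center = mean(values)
    if center == 0:
        raise ValueError("the coefficient of variation is undefined for a zero mean")
    return stdev(values) / center


weights_kg = [60, 65, 70, 75, 80]            # hypothetical weights in kg
weights_g = [w * 1000 for w in weights_kg]   # the same weights in grams

cv_kg = coefficient_of_variation(weights_kg)
cv_g = coefficient_of_variation(weights_g)

print(f"{cv_kg:.1%}")  # about 11.3%
print(f"{cv_g:.1%}")   # same ratio, regardless of the unit
```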
Dispersion Measures Relative to the Median
The quartiles and percentiles provide a notion of dispersion with respect to the median.
Quartiles
Quartiles are milestones, marking the 25%, 50%, and 75% points of the dataset. The median marks the 50% point.
- The first quartile (Q1) is the 25th percentile.
- The second quartile (Q2) is the 50th percentile (the median).
- The third quartile (Q3) is the 75th percentile.
Percentiles
Percentiles are similar to quartiles but mark any percentage point.
Computing the k-th Percentile
To compute the k-th percentile, follow these steps:
- Sort the data in increasing order.
- Compute the index i = (k/100) · n, where n is the size of the sample.
- If i is NOT a whole number, round it up to the next whole number. The k-th percentile is the value at that position in the sorted data (counting from left to right).
- If i is a whole number, the k-th percentile is the average of the value at position i in the sorted data and the value that directly follows it.
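The procedure above can be sketched as follows. This is one illustrative implementation of the steps as written, with hypothetical data. It restricts k to values strictly between 0 and 100 (an assumption, so that a following value always exists in the whole-number case). Statistical libraries often use other interpolation rules and may return slightly different results.

```python
def percentile(values, k):
    """k-th percentile using the round-up / average rule from the text."""
    if not values:
        raise ValueError("the dataset is empty")
    if not 0 < k < 100:
        raise ValueError("k must be strictly between 0 and 100")
    ordered = sorted(values)                 # sort in increasing order
    n = len(ordered)
    whole, remainder = divmod(k * n, 100)    # (k/100) * n, split into parts
    whole = int(whole)
    if remainder != 0:
        # Not a whole number: round up to position whole + 1 (1-based),
        # which is index `whole` in a 0-based list.
        return ordered[whole]
    # Whole number: average the value at that position and the next one.
    return (ordered[whole - 1] + ordered[whole]) / 2


data = [16, 2, 10, 4, 14, 6, 12, 8]          # hypothetical sample, n = 8
print(percentile(data, 25))   # 5.0  (first quartile, Q1)
print(percentile(data, 50))   # 9.0  (median, Q2)
print(percentile(data, 75))   # 13.0 (third quartile, Q3)
print(percentile(data, 90))   # 16   (index 7.2 rounds up to position 8)
```

With k = 50 the rule reproduces the median, and k = 25 and k = 75 give the first and third quartiles.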