Essential Statistical Concepts and Formulas Reference

Posted on Oct 7, 2025 in Statistics

Descriptive Measures: Center and Variability

Measures of Variation

Standard Deviation (SD): The square root of the variance. It indicates roughly how far the data points lie, on average, from the mean.
- Calculation: Find the variance and take its square root.
Coefficient of Variation (CV): Used to compare the variability of two different data sets, especially when their means or units differ. Expressed as a percentage, it measures the standard deviation relative to the mean.
- Formula: CV = (Standard Deviation / Mean) * 100

Fundamental Concepts

Counting Numbers: Considered discrete data.
Descriptive Measures: A single number that comes from sample data.
Central Tendency: The middle or common value of the data (Mean, Median, Mode).
Variation (Scatter, Variability, Dispersion): How spread out and different the data is (Range, SD, CV).
- Variance Calculation Steps (Sample):
  1. Find the distance from each sample point to the mean.
  2. Square these distances.
  3. Sum the squares.
  4. Divide by $n-1$.
Measures of Position: Compare one value relative to the entire group (Percentiles, Quartiles).
Measures of Variation: Include the Range, Standard Deviation (SD), and Coefficient of Variation (CV).
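
The sample variance steps listed above can be sketched in Python. This is a minimal illustration, not a reference implementation; the function name `sample_variance` and the data values are hypothetical.

```python
import math

def sample_variance(values):
    """Sample variance computed with the four steps listed above."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two data points")
    mean = sum(values) / n
    deviations = [x - mean for x in values]   # 1. distance from each point to the mean
    squared = [d ** 2 for d in deviations]    # 2. square these distances
    total = sum(squared)                      # 3. sum the squares
    return total / (n - 1)                    # 4. divide by n - 1

data = [4, 8, 6, 5, 7]            # hypothetical sample
variance = sample_variance(data)
sd = math.sqrt(variance)          # SD is the square root of the variance
print(variance, sd)
```

Python's standard library offers `statistics.variance` and `statistics.stdev` for the same sample calculations, which is usually the better choice outside of a teaching example.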

Boxplots, Percentiles, and Quartiles

Percentiles

Definition: A value below which a certain percentage of scores fall.
- Position Formula: Value Position = (Percentile / 100) * ($n+1$)
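
As a rough sketch, the position formula can be applied to a sorted list as follows. The function names and the scores are hypothetical, and the linear interpolation used for fractional positions is an assumption, since the formula above only gives the position. This $(n+1)$ convention generally gives slightly different results from Excel's PERCENTILE.INC.

```python
def percentile_position(percentile, n):
    """Position (1-based) of the requested percentile in a sorted list of n values."""
    return (percentile / 100) * (n + 1)

def percentile_value(sorted_values, percentile):
    """Value at the percentile position, interpolating between neighbors (assumption)."""
    n = len(sorted_values)
    pos = percentile_position(percentile, n)
    pos = min(max(pos, 1), n)        # keep the position inside 1..n
    lower = int(pos)                 # whole part of the position
    frac = pos - lower               # fractional part of the position
    if lower >= n:
        return sorted_values[-1]
    below = sorted_values[lower - 1]
    above = sorted_values[lower]
    return below + frac * (above - below)

scores = sorted([12, 15, 18, 20, 22, 25, 28, 30, 35])   # hypothetical scores
print(percentile_position(25, len(scores)))   # position of the 25th percentile
print(percentile_value(scores, 25))
```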

Box Plots (Box and Whiskers)

A box plot visually displays the Interquartile Range (IQR) and quartiles.

Components: L (Lowest value), Q1 (First Quartile), Q2 (Median), Q3 (Third Quartile), IQR, H (Highest value).
Interquartile Range (IQR): $Q3 - Q1$. This is a measure of variability.
Fences (Outlier Detection):
- Lower Inner Fence: $Q1 - (1.5 \times \text{IQR})$
- Lower Outer Fence: $Q1 - (3 \times \text{IQR})$
- Upper Inner Fence: $Q3 + (1.5 \times \text{IQR})$
- Upper Outer Fence: $Q3 + (3 \times \text{IQR})$
Excel Calculation for Quartiles: Use the formula =QUARTILE.INC(array, quart) where quart is 1, 2, or 3.
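
The fence formulas can be sketched in Python as below. The data are hypothetical, and the "mild" and "extreme" labels are a common convention that is assumed here rather than taken from the text above. The `method="inclusive"` option of `statistics.quantiles` is used because it is intended to follow the same interpolation as Excel's inclusive quartile functions; treat that match as an assumption and verify it against your own spreadsheet.

```python
import statistics

def fences(values):
    """Return the quartiles, IQR, and the inner and outer fences."""
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return {
        "q1": q1, "q2": q2, "q3": q3, "iqr": iqr,
        "lower_inner": q1 - 1.5 * iqr,
        "lower_outer": q1 - 3 * iqr,
        "upper_inner": q3 + 1.5 * iqr,
        "upper_outer": q3 + 3 * iqr,
    }

def classify(value, f):
    """Label a value relative to the fences (labels are an assumed convention)."""
    if value < f["lower_outer"] or value > f["upper_outer"]:
        return "extreme outlier"
    if value < f["lower_inner"] or value > f["upper_inner"]:
        return "mild outlier"
    return "not an outlier"

data = [10, 12, 13, 14, 15, 16, 17, 18, 40]   # hypothetical data
f = fences(data)
labels = {x: classify(x, f) for x in data}
print(f)
print(labels[40])
```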

Statistical Sampling Techniques

Types of Random Sampling

Simple Random Sampling: Each member of the population has an equal chance of being selected.
- Strengths: Simple, provides good representation.
- Weakness: Requires access to the entire population list.
Stratified Random Sampling: Dividing the population into non-overlapping groups (strata) based on characteristics (e.g., age, gender, income) and then sampling randomly from each stratum.
- Strengths: Ensures all groups relevant to the variable of interest are represented, often more accurate than simple random sampling.
- Weakness: More complex; requires defining non-overlapping groups.
Cluster Sampling: Dividing the population into clusters and randomly selecting entire clusters to sample.
- Strengths: Inexpensive and simpler to implement.
- Weakness: Items within clusters tend to have similar traits, potentially increasing sampling error.
Systematic Random Sampling: Selecting every $k^{th}$ item from a list until $n$ values are obtained.
- Strengths: Similar to simple random sampling, but ensures items are not clustered together.
- Weakness: Samples are no longer independent.
Convenience Sampling: Selecting observational units that are readily available (not technically random).
- Strengths: Convenient, simple, low cost.
- Weakness: Not random, so the sample may not be representative of the population.
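
The four random sampling techniques can be sketched with Python's `random` module. Everything here is illustrative: the population of 100 ID numbers, the two strata, the cluster size, and the seed are all hypothetical choices.

```python
import random

rng = random.Random(42)               # fixed seed so the sketch is repeatable
population = list(range(1, 101))      # hypothetical population: IDs 1..100

# Simple random sampling: every member has an equal chance of selection.
simple = rng.sample(population, 10)

# Stratified random sampling: sample randomly within each non-overlapping stratum.
strata = {
    "low": [i for i in population if i <= 50],    # hypothetical stratum
    "high": [i for i in population if i > 50],    # hypothetical stratum
}
stratified = [x for group in strata.values() for x in rng.sample(group, 5)]

# Cluster sampling: randomly choose whole clusters and keep every member.
clusters = [population[i:i + 10] for i in range(0, len(population), 10)]
cluster_sample = [x for cluster in rng.sample(clusters, 2) for x in cluster]

# Systematic sampling: random start, then every k-th item until n values are obtained.
n = 10
k = len(population) // n
start = rng.randrange(k)
systematic = population[start::k][:n]

print(simple, stratified, cluster_sample, systematic, sep="\n")
```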

Statistical Analysis Using Excel

Descriptive Statistics in Excel

Click on the Data tab, then Data Analysis.
Select Descriptive Statistics.
Select the data range in the Input Range field.
Specify the Output Range where the results should appear.

Measures of Position (Percentiles) in Excel

Insert function and select PERCENTILE.INC.
In the Array field, select all data.
In the K field, select the percentile sought (a decimal value between 0 and 1).
The resulting percentile value will be displayed.

Binomial Distribution

Five Conditions for a Binomial Distribution

The variable being studied is random.
The outcomes of the variable are being counted.
There is a fixed number of trials, denoted by $n$.
There are only two possible outcomes (“success” and “failure”) for each trial. $\pi$ denotes the probability of success, and $1-\pi$ denotes the probability of failure.
The $n$ trials are independent and repeated under identical conditions. The outcome of one trial does not influence the outcome of another.
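
When these conditions hold, the probability of exactly $k$ successes in $n$ trials is $\binom{n}{k}\pi^k(1-\pi)^{n-k}$. A minimal Python sketch of that formula follows; the function name `binomial_pmf` and the example values are hypothetical.

```python
from math import comb

def binomial_pmf(k, n, pi):
    """Probability of exactly k successes in n independent trials with success probability pi."""
    return comb(n, k) * pi ** k * (1 - pi) ** (n - k)

# Hypothetical example: 10 trials, success probability 0.5
n, pi = 10, 0.5
probabilities = [binomial_pmf(k, n, pi) for k in range(n + 1)]
print(probabilities[5])      # probability of exactly 5 successes
print(sum(probabilities))    # the probabilities over all outcomes sum to 1
```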

The Normal Distribution and Empirical Rule

Characteristics of the Normal Distribution

It is bell-shaped and symmetrical.
It is defined by the mean ($\mu$) and standard deviation ($\sigma$).
The Mean, Median, and Mode are all equal.
Z-Score Formula: $Z = (X - \text{Mean}) / \text{SD}$
Checking for Normality: Examine the histogram and compare the mean, median, and mode.
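
The Z-score formula is a one-line calculation. In this small sketch the function name and the numbers are hypothetical.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    if sd <= 0:
        raise ValueError("standard deviation must be positive")
    return (x - mean) / sd

# Hypothetical values: a score of 85 when the mean is 70 and the SD is 10
z = z_score(85, mean=70, sd=10)
print(z)
```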

The Empirical Rule (68-95-99.7 Rule)

This rule applies specifically to normal distributions:

Approximately 68% of the data falls within 1 standard deviation of the mean.
Approximately 95% of the data falls within 2 standard deviations of the mean.
Approximately 99.7% of the data falls within 3 standard deviations of the mean.
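
The three percentages can be checked numerically with `statistics.NormalDist` from the Python standard library (Python 3.8 or later). The helper name `share_within` is hypothetical.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Share of a normal distribution lying within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

shares = {k: share_within(k) for k in (1, 2, 3)}
for k, share in shares.items():
    print(f"within {k} SD: {share:.4f}")
```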

Sampling Distributions and Central Limit Theorem

The Central Limit Theorem (CLT): If we take the means of samples of size $n$ and plot the frequencies of these means, the resulting distribution (the sampling distribution of the mean) will approach a normal distribution, regardless of the shape of the population distribution (provided $n$ is large enough).
Law of Large Numbers: States that as an experiment is performed a large number of times, the observed relative frequency (empirical probability) approaches the theoretical probability.
Standard Error (SE): The standard deviation of the sampling distribution of the mean.
- Formula: $\text{SE} = \sigma / \sqrt{n}$
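
A small simulation can illustrate both the Central Limit Theorem and the Standard Error formula. The choices below are assumptions made for the sketch: a right-skewed exponential population (mean 1, standard deviation 1), samples of size 36, 2000 repetitions, and a fixed seed.

```python
import math
import random
import statistics

rng = random.Random(0)       # fixed seed so the sketch is repeatable
n = 36                       # size of each sample
num_samples = 2000           # number of samples drawn
sigma = 1.0                  # SD of an exponential population with rate 1

# Draw many samples from a skewed population and record each sample mean.
sample_means = []
for _ in range(num_samples):
    sample = [rng.expovariate(1.0) for _ in range(n)]
    sample_means.append(statistics.fmean(sample))

theoretical_se = sigma / math.sqrt(n)             # SE = sigma / sqrt(n)
observed_se = statistics.stdev(sample_means)      # SD of the simulated sample means

print(f"theoretical SE: {theoretical_se:.4f}")
print(f"observed SE:    {observed_se:.4f}")
print(f"mean of sample means: {statistics.fmean(sample_means):.4f}")
```

Plotting a histogram of `sample_means` should show a roughly bell-shaped curve even though the population itself is strongly skewed.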

Key Statistical Considerations

Level of Significance ($\alpha$)

A 1% level of significance ($\alpha=0.01$) sets a strict standard: the null hypothesis is rejected only when the evidence is strong, so the tolerance for a Type I error (rejecting a true null hypothesis) is low.
A higher level of significance (e.g., 10% or $\alpha=0.10$) tolerates a greater chance of a Type I error, so weaker evidence is enough to reject the null hypothesis and the resulting conclusion carries less certainty.

Consistency and Comparison

Consistency: Measured by the Coefficient of Variation (CV). A lower CV percentage indicates greater consistency (lower SD relative to the mean).
When comparing the variability of two different data sets, use the Coefficient of Variation (CV), not the Standard Deviation (SD).
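
A short sketch shows why the CV is preferred for comparisons. The two data sets are hypothetical and were chosen so that they share the same SD but have very different means.

```python
import statistics

def coefficient_of_variation(values):
    """CV as a percentage: (sample standard deviation / mean) * 100."""
    return statistics.stdev(values) / statistics.mean(values) * 100

set_a = [98, 100, 102, 100, 100]   # hypothetical data, mean 100
set_b = [8, 10, 12, 10, 10]        # hypothetical data, mean 10

sd_a, sd_b = statistics.stdev(set_a), statistics.stdev(set_b)
cv_a = coefficient_of_variation(set_a)
cv_b = coefficient_of_variation(set_b)

print(f"SD: {sd_a:.3f} vs {sd_b:.3f}")      # identical spread in absolute terms
print(f"CV: {cv_a:.2f}% vs {cv_b:.2f}%")    # set_a is more consistent relative to its mean
```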

Data Types Summary

Discrete Data: Typically whole numbers (counts).
Continuous Data: Can take on any value within a range (decimals/measurements).
Quantitative Data: Numerical data (counts, percents, measurements).
Categorical Data: Data that falls into groups or labels.

Evaluating Statistical Evidence (P-Value Interpretation)

To evaluate evidence against a null hypothesis (assumption), we determine the probability (P-value) of observing the evidence (or more extreme evidence) assuming the null hypothesis is true. We then assess if this event is unlikely:

If the P-value is small (e.g., less than 1%), the observed evidence would be unlikely under the assumption, so we reject the assumption and conclude that the data support the claim.
If the P-value is not small (e.g., greater than 10%), the evidence is not unusual under the assumption, so we cannot conclude that the assumption is wrong, and thus cannot conclude the claim is true.

Sampling Definitions

Sampling Variability: The natural variation observed among random samples of size $n$ drawn from the same population.
Sampling Distribution: A probability distribution that describes some aspect of sampling variability.
Law of Large Numbers (Effect of Increasing Sample Size): As the sample size increases, the distribution of the sample approaches the shape of the population distribution. The sample mean and standard deviation get closer to the population mean and standard deviation.
Sampling Distribution of the Sample Means: Its mean equals the population mean ($\mu$) for any sample size. As the sample size increases, its shape becomes approximately normal and its standard deviation (the Standard Error, $\sigma/\sqrt{n}$) decreases.

Case Study: Analyzing WestCoast Air No-Show Rates

1. Identifying the Random Variable

Question: Betty thinks the random variable is the number of people who show up. Tina thinks it is whether the flight is overbooked. Who is correct?

Answer: Betty is correct. The core concern is the no-show rate, which is determined by how many people show up. Each passenger's outcome (show or no-show) is categorical; the random variable counts those outcomes, so it is a discrete count.

2. Required Statistical Assumption

Question: What assumption (including a probability) must be made for the statistical analysis?

Answer: They must assume the opposite of what they are trying to prove (the null hypothesis). They must assume that WestCoast Air’s no-show rate is the same as the industry standard: 8%. Alternatively, they assume the show-up rate is 92%.

Note: The 8% is the proportion of passengers who do not show up, not an average.

3. Binomial Distribution Parameters

Question: State the parameters for the binomial distribution given the situation (180 tickets sold).

Answer:

$n$ (number of trials/tickets) = 180
$\pi$ (probability of success/no-show) = 0.08

(Alternatively, $\pi$ could be 0.92 for the probability of showing up, provided consistency is maintained.)

4. Validity Check Based on Center and Variability

Question: Tina observed a 5.6% no-show rate (10 out of 180). Based on the center and variability of the distribution, is her argument that the rate is less than 8% valid?

Analysis using Center and Variability:

Assuming an 8% no-show rate ($\pi=0.08$ and $n=180$):

Expected Mean ($\mu$): $180 \times 0.08 = 14.4$ no-shows.
Standard Deviation ($\sigma$): $\sqrt{n\pi(1-\pi)} = \sqrt{180 \times 0.08 \times 0.92} \approx 3.64$.

The typical range (Mean $\pm$ 1 SD) is $14.4 \pm 3.64$, or 10.76 to 18.04 no-shows. Tina’s observation of 10 no-shows is slightly below the typical range. Therefore, having ten people not show up is atypical but not significantly abnormal. It is not yet clear whether the true no-show rate is less than 8%.
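
The center and variability figures above can be reproduced with a few lines of Python. The variable names are illustrative, and the Z-score at the end is an extra check that is not part of the original analysis.

```python
import math

n = 180          # tickets sold
pi = 0.08        # assumed no-show rate
observed = 10    # no-shows Tina observed

mean = n * pi                          # expected number of no-shows
sd = math.sqrt(n * pi * (1 - pi))      # binomial standard deviation
low, high = mean - sd, mean + sd       # typical range: mean +/- 1 SD
z = (observed - mean) / sd             # how many SDs below the mean (extra check)

print(f"mean = {mean:.1f}, SD = {sd:.2f}")
print(f"typical range: {low:.2f} to {high:.2f}")
print(f"z for {observed} no-shows: {z:.2f}")
```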

5. Calculating and Interpreting the P-Value

Calculation: $P(X \le 10) = 0.14$

Interpretation: This is the probability of obtaining a random sample of 180 tickets resulting in 10 or fewer no-shows, given the assumption that the population no-show rate is 8%.

Conclusion on Tina’s Evidence: The probability (P-value) is 14%. Since 14% is greater than common significance levels (such as 1% or 10%), it is not unlikely that this evidence would be observed even if the true no-show rate is 8%. Therefore, Tina has not provided strong enough evidence to conclude that the no-show rate is less than 8% for WestCoast Air.
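
The P-value can be reproduced by summing binomial probabilities from 0 through 10 no-shows. This is a minimal sketch using only the standard library; the function name `binomial_cdf` is hypothetical.

```python
from math import comb

def binomial_cdf(k, n, pi):
    """P(X <= k) for a binomial random variable with n trials and success probability pi."""
    return sum(comb(n, i) * pi ** i * (1 - pi) ** (n - i) for i in range(k + 1))

# Assumed no-show rate of 8% with 180 tickets sold; 10 no-shows observed
p_value = binomial_cdf(10, n=180, pi=0.08)
print(f"P(X <= 10) = {p_value:.2f}")

for alpha in (0.01, 0.10):
    decision = "reject" if p_value < alpha else "do not reject"
    print(f"alpha = {alpha:.2f}: {decision} the 8% assumption")
```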

6. Comparing Evidence Sources

Question: Betty used industry data; Tina used a convenience sample specific to WestCoast Air. Which evidence is more appropriate for management?

Answer: Tina’s sample is a convenience sample. Although her sample size is larger, because it is not a random sample, it is inherently biased and statistically inappropriate for making population inferences. Betty’s industry data, while potentially not specific to WestCoast Air, is likely based on more rigorous sampling methods.

Normal Distribution Application: Tolerance Intervals

Probability of Falling Outside Tolerance

Question: What is the probability that a randomly selected item (e.g., a bolt) will not fit within the specified tolerance interval?

Solution Steps:

Define the lower tolerance limit ($X_1$) and the upper tolerance limit ($X_2$), along with the mean ($\mu$) and standard deviation ($\sigma$).
Calculate the probability of being below the lower limit: $P(X < X_1)$.
Calculate the probability of being above the upper limit: $P(X > X_2)$.
The total probability of not fitting is the sum of these two probabilities: $P(X < X_1) + P(X > X_2)$.
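
These steps can be sketched with `statistics.NormalDist`. The bolt dimensions below are hypothetical, since the notes give no numbers, and the sketch assumes that the item's measurement is normally distributed.

```python
from statistics import NormalDist

def prob_outside_tolerance(mu, sigma, x1, x2):
    """P(X < x1) + P(X > x2) for a normal distribution with mean mu and SD sigma."""
    dist = NormalDist(mu, sigma)
    below = dist.cdf(x1)          # probability of falling below the lower limit
    above = 1 - dist.cdf(x2)      # probability of falling above the upper limit
    return below + above

# Hypothetical bolt: mean diameter 10.0 mm, SD 0.1 mm, tolerance 9.8 mm to 10.2 mm
p_not_fit = prob_outside_tolerance(mu=10.0, sigma=0.1, x1=9.8, x2=10.2)
print(f"probability of not fitting: {p_not_fit:.4f}")
```

Because the limits in this example sit about 2 standard deviations from the mean, the result agrees with the Empirical Rule: roughly 5% of items fall outside.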