Statistical Inference: Sampling and Confidence Intervals
Chapter 9: Sampling Distributions
Quantile-Quantile Plot (QQ-Plot)
Empirical Rule: This property states that approximately 68%, 95%, and 99.7% of data falls within 1, 2, and 3 standard deviations of the mean, respectively.
Standard Normal Distribution
The Standard Normal distribution has a mean of 0 and a standard deviation of 1.
Example: If you want to know the percentage of babies that weigh less than 95 ounces at birth, you must first convert the value 95 to a standardized score (z-score).
Based on the Empirical Rule, we know this probability should fall between 2.5% and 16%.
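The standardization step can be sketched in R, using the birth-weight mean (119.6 ounces) and standard deviation (18.2 ounces) from the pnorm() example that follows:

```r
# Convert a raw birth weight to a z-score: z = (x - mean) / sd
x     <- 95      # birth weight of interest, in ounces
mu    <- 119.6   # population mean birth weight
sigma <- 18.2    # population standard deviation

z <- (x - mu) / sigma
z  # approximately -1.35: 95 oz is about 1.35 SDs below the mean
```

Since -1.35 falls between -2 and -1, the Empirical Rule brackets the probability between 2.5% (below -2 SD) and 16% (below -1 SD).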
Calculating Probabilities with pnorm()
We can calculate the exact probability of a baby weighing less than 95 ounces using the pnorm() function in R:
pnorm(q = 95, mean = 119.6, sd = 18.2)
Note: The pnorm() function calculates the area to the left of a specified value, i.e., the probability of observing a value less than it.
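Because pnorm() returns the left-tail area, the probability of a baby weighing *more* than 95 ounces is the complement; equivalently, pnorm() accepts a lower.tail argument:

```r
# Left-tail probability: P(X < 95)
pnorm(q = 95, mean = 119.6, sd = 18.2)   # ~0.088, consistent with the 2.5%-16% bracket

# Right-tail probability: P(X > 95), two equivalent ways
1 - pnorm(q = 95, mean = 119.6, sd = 18.2)
pnorm(q = 95, mean = 119.6, sd = 18.2, lower.tail = FALSE)
```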
Calculating Values with qnorm()
Example: What if we want to calculate a value based on a percentage? For instance, 25% of babies weigh less than what weight? This is the reverse of our previous question.
qnorm(p = 0.25, mean = 119.6, sd = 18.2)
This returns a weight of 107.3 ounces (a z-score of -0.67). This means 25% of babies weigh less than 107.3 ounces, i.e., less than 0.67 standard deviations below the mean.
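Since qnorm() and pnorm() are inverses, the result can be checked by feeding the weight back into pnorm():

```r
w <- qnorm(p = 0.25, mean = 119.6, sd = 18.2)   # ~107.3 ounces
pnorm(q = w, mean = 119.6, sd = 18.2)           # recovers 0.25
(w - 119.6) / 18.2                              # the z-score, ~ -0.67
```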
The T-Distribution
The T-distribution is used when the standard deviation of the population (σ) is unknown and must be estimated from the sample.
Note: In R, TRUE is treated as 1 and FALSE as 0, so sum() over a logical vector counts the TRUE values (used below to count captured intervals and "yes" responses).
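A quick way to see the T-distribution's heavier tails is to compare its critical values against the standard Normal's as the degrees of freedom grow (a sketch, not from the chapter):

```r
# 2.5th percentile of the t-distribution for increasing degrees of freedom
sapply(c(5, 20, 100, 1000), function(df) qt(p = 0.025, df = df))
# compared with the standard Normal value
qnorm(p = 0.025)   # -1.96
# the t critical values start further out (e.g., ~ -2.57 at df = 5)
# and shrink toward -1.96 as df increases
```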
Analyzing Sampling Distributions
Let’s draw 10,000 repeated samples of n = 5 countries from this population. The data for the first two samples (replicates) is shown in section 9.4.
samples <- rep_sample_n(gapminder_2016, size = 5, reps = 10000) %>%
  select(replicate, country, year, life_expectancy, continent, region)
n <- 5
We can then calculate the variance for each sample (replicate) using two different formulas:
variances <- samples %>%
  group_by(replicate) %>%
  summarise(s2_n = sum((life_expectancy - mean(life_expectancy))^2) / n,
            s2   = sum((life_expectancy - mean(life_expectancy))^2) / (n - 1))
Table 9.5 shows the results for the first 10 samples. Let’s look at the average of s²n and s² across all 10,000 samples.
variances %>% summarize(mean_s2_n = mean(s2_n), mean_s2 = mean(s2))
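The same bias can be reproduced without the gapminder data. In this self-contained sketch (synthetic Normal data, my own choice of variance), dividing by n systematically underestimates the true variance, while dividing by n − 1 does not:

```r
set.seed(1)     # for reproducibility (not part of the original analysis)
n      <- 5
reps   <- 10000
sigma2 <- 4     # true population variance of the simulated data

# Draw 10,000 samples of size n and compute each variance estimator
s2_n <- replicate(reps, { x <- rnorm(n, sd = sqrt(sigma2)); sum((x - mean(x))^2) / n })
s2   <- replicate(reps, { x <- rnorm(n, sd = sqrt(sigma2)); sum((x - mean(x))^2) / (n - 1) })

mean(s2_n)  # close to (n - 1) / n * sigma2 = 3.2: biased low
mean(s2)    # close to sigma2 = 4: unbiased
```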
Visualizing Estimator Bias
ggplot(variances) +
  geom_histogram(aes(x = s2_n, fill = "red"), color = "white", alpha = 0.5) +
  geom_histogram(aes(x = s2, fill = "blue"), color = "white", alpha = 0.5) +
  geom_vline(xintercept = mean(variances$s2_n), color = "red", size = 1) +
  geom_vline(xintercept = mean(variances$s2), color = "blue", size = 1) +
  geom_vline(xintercept = var(gapminder_2016$life_expectancy), linetype = 2, size = 1) +
  scale_fill_manual(name = "Estimator", values = c('blue' = 'blue', 'red' = 'red'),
                    labels = expression(s^2, s[n]^2)) +
  xlab("Sample variance estimate") +
  ylab("Number of samples")
Understanding Standard Error
- Large Standard Error: An estimate may be far from the average of its sampling distribution, meaning the estimate is imprecise.
- Small Standard Error: An estimate is likely to be close to the average of its sampling distribution, meaning the estimate is precise.
- An estimator that is both precise and unbiased is preferred.
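The precision idea can be illustrated by simulating the sampling distribution of the mean at several sample sizes (a sketch with synthetic standard-Normal data, not from the chapter):

```r
set.seed(1)
sizes <- c(5, 25, 100)
ses <- sapply(sizes, function(n) {
  xbars <- replicate(10000, mean(rnorm(n)))  # sampling distribution of the mean
  sd(xbars)                                  # empirical standard error
})
ses                 # shrinks as n grows
1 / sqrt(sizes)     # theoretical standard errors, for comparison
```

Larger samples give a smaller standard error, so each estimate is likely to sit closer to the center of its sampling distribution.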
Chapter 10: Confidence Intervals
Constructing Confidence Intervals
Suppose we have a sample of n = 20 and are using a T-distribution to construct a 95% confidence interval for the mean.
qt(p = .025, df = 19, lower.tail = TRUE)
| Confidence Level | Area in Each Tail | p-value for qt() |
|---|---|---|
| 90% | 5% = 0.05 | p = 0.05 |
| 95% | 2.5% = 0.025 | p = 0.025 |
| 99% | 0.5% = 0.005 | p = 0.005 |
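The p-values in the table plug directly into qt(). For the n = 20 example (df = 19):

```r
# Critical t-values for a sample of n = 20 (df = 19)
qt(p = 0.05,  df = 19)   # 90% CI, about -1.73
qt(p = 0.025, df = 19)   # 95% CI, about -2.09
qt(p = 0.005, df = 19)   # 99% CI, about -2.86
```

The negative sign just reflects the lower tail; the interval uses the magnitude on both sides of the sample mean.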
Example: Football Fan Age Statistics
Returning to our football fans example, let’s assume we don’t know the true population standard deviation (σ). Instead, we estimate it using s, the standard deviation calculated from our sample.
mean_age_stats_unknown_sd <- sample_100_fans %>%
  summarize(n = n(), xbar = mean(age), s = sd(age), SE_xbar = s / sqrt(n))
Results: n = 100, xbar = 29.7, s = 7.78, SE_xbar = 0.778
qt(p = .025, df = 99)
Result: -1.98
CI <- mean_age_stats_unknown_sd %>% summarize(lower = xbar - 1.98*SE_xbar, upper = xbar + 1.98*SE_xbar)
Confidence Interval: Lower = 28.1, Upper = 31.2
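If sample_100_fans holds the raw ages (as assumed here), base R's t.test() reproduces essentially the same interval in one call; it uses the exact qt() critical value rather than the rounded 1.98:

```r
# One-call check on the manual calculation above
t.test(sample_100_fans$age, conf.level = 0.95)$conf.int
# approximately (28.1, 31.2), matching the manual interval
```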
Example: 90% Confidence Intervals
What if we constructed 90% confidence intervals instead? We would use 1.645 as our critical value instead of 1.96.
CIs_90_football_fans <- samp_means_football_fans %>%
  mutate(lower = xbar - 1.645*SE_xbar,
         upper = xbar + 1.645*SE_xbar,
         captured_90 = lower <= mu & mu <= upper)
CIs_90_football_fans %>% summarize(sum(captured_90)/n())
Answer: 0.901. Approximately 90% of the confidence intervals contain the true mean.
Confidence Intervals for Proportions
We can use the critical value 1.96 to construct a 95% confidence interval for a proportion.
CI <- prop_red_stats %>% summarize(lower = pi_hat - 1.96*SE_pi_hat, upper = pi_hat + 1.96*SE_pi_hat)
Result: Lower = 0.283, Upper = 0.557. We are 95% confident that the true proportion of red balls in the bowl is between 0.283 and 0.557.
| Confidence Level | Z-Critical Value |
|---|---|
| 95% | 1.96 |
| 99% | 2.576 |
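These z critical values come straight from qnorm(), using the area to the left of the upper tail boundary:

```r
qnorm(p = 0.975)   # 1.96  -> 95% confidence
qnorm(p = 0.995)   # 2.576 -> 99% confidence
qnorm(p = 0.95)    # 1.645 -> 90% confidence
```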
Example: Computing the Point Estimate
Step 1: Calculate sample proportions and their variances.
prop_yes_stats <- mythbusters_yawn %>%
  group_by(group) %>%
  summarize(n = n(),
            pi_hat = sum(yawn == "yes") / n,
            var_pi_hat = pi_hat * (1 - pi_hat) / n)
Step 2: Calculate the point estimate for the difference in proportions and standard error.
diff_prop_yes_stats <- prop_yes_stats %>%
  summarize(diff_in_props = diff(pi_hat), SE_diff = sqrt(sum(var_pi_hat)))
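A natural final step (not shown above) combines the point estimate and standard error into a 95% confidence interval, using the z critical value 1.96 as in the earlier proportion example. A sketch, assuming diff_prop_yes_stats as computed in Step 2:

```r
# Step 3: 95% confidence interval for the difference in proportions
diff_prop_yes_stats %>%
  mutate(lower = diff_in_props - 1.96 * SE_diff,
         upper = diff_in_props + 1.96 * SE_diff)
```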
