Statistical Inference: Sampling and Confidence Intervals

Chapter 9: Sampling Distributions


Quantile-Quantile Plot (QQ-Plot)

Empirical Rule: For data that are approximately normally distributed, about 68%, 95%, and 99.7% of values fall within 1, 2, and 3 standard deviations of the mean, respectively.
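The percentages in the Empirical Rule can be checked directly against the normal curve. The following is a minimal sketch using base R's pnorm(); the variable names are illustrative only.

```r
# Area within k standard deviations of the mean under a normal curve.
# Using the standard normal is enough, because the rule is stated in SD units.
k <- 1:3
within_k_sd <- pnorm(k) - pnorm(-k)

# Rounded, these are close to 0.68, 0.95, and 0.997.
round(within_k_sd, 4)
```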


Standard Normal Distribution

The Standard Normal distribution has a mean of 0 and a standard deviation of 1.


Example: If you want to know the percentage of babies that weigh less than 95 ounces at birth, you must first convert the value 95 to a standardized score (STAT).


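Based on the Empirical Rule, we know this probability should fall between 2.5% and 16%. With a mean of 119.6 ounces and a standard deviation of 18.2 ounces, the STAT is (95 - 119.6) / 18.2 ≈ -1.35, which lies between 2 and 1 standard deviations below the mean, and the areas to the left of those two points are about 2.5% and 16%.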
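The bounds above can be reproduced with a few lines of arithmetic. This sketch assumes the mean (119.6) and standard deviation (18.2) used in the pnorm() call in these notes; the variable names are made up for illustration.

```r
x <- 95
mu <- 119.6
sigma <- 18.2

# Standardized score: how many standard deviations x is from the mean.
stat <- (x - mu) / sigma   # about -1.35

# Empirical Rule: 95% of values lie within 2 SDs, so about 2.5% lie below -2 SDs;
# 68% lie within 1 SD, so about 16% lie below -1 SD.
lower_bound <- (1 - 0.95) / 2
upper_bound <- (1 - 0.68) / 2

c(stat = stat, lower_bound = lower_bound, upper_bound = upper_bound)
```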

Calculating Probabilities with pnorm()

We can calculate the exact probability of a baby weighing less than 95 ounces using the pnorm() function in R:

pnorm(q = 95, mean = 119.6, sd = 18.2)

Note: By default, pnorm() returns the area to the left of (less than or equal to) the specified value.
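As a rough illustration of how the left-tail default behaves, the sketch below computes the same area in two ways and then the complementary right-tail area with lower.tail = FALSE. The variable names are illustrative.

```r
# Area to the left of 95 ounces, on the original scale.
p_less <- pnorm(q = 95, mean = 119.6, sd = 18.2)

# Same area using the standardized score and the standard normal distribution.
p_less_z <- pnorm(q = (95 - 119.6) / 18.2)

# Area to the right of 95 ounces (babies weighing more than 95 ounces).
p_more <- pnorm(q = 95, mean = 119.6, sd = 18.2, lower.tail = FALSE)

c(p_less = p_less, p_less_z = p_less_z, p_more = p_more)
```

The two left-tail values agree, and the left and right areas sum to 1.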

Calculating Values with qnorm()

Example: What if we want to calculate a value based on a percentage? For instance, 25% of babies weigh less than what weight? This is the reverse of our previous question.

qnorm(p = 0.25, mean = 119.6, sd = 18.2)

This returns a weight of 107.3 (or a STAT of -0.67). This means 25% of babies weigh less than 107.3 ounces, or are 0.67 standard deviations below the mean.
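One way to see that qnorm() reverses pnorm() is to run the result back through pnorm(). The sketch below also recovers the weight from the STAT; variable names are illustrative.

```r
# Weight below which 25% of babies fall.
w <- qnorm(p = 0.25, mean = 119.6, sd = 18.2)

# The same quantile on the standardized scale (the STAT).
z <- qnorm(p = 0.25)

# Converting the STAT back to ounces: mean + STAT * sd.
w_from_z <- 119.6 + z * 18.2

# Feeding the weight back into pnorm() recovers the original percentage.
p_back <- pnorm(q = w, mean = 119.6, sd = 18.2)

round(c(w = w, z = z, w_from_z = w_from_z, p_back = p_back), 3)
```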

The T-Distribution

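The T-distribution is used in place of the Standard Normal distribution when the standard deviation of the population (σ) is unknown and must be estimated by the sample standard deviation (s). A separate note on R programming: in arithmetic, TRUE is treated as 1 and FALSE is treated as 0, so summing a logical vector counts its TRUE values.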
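To get a feel for how the T-distribution relates to the Standard Normal, the sketch below compares 97.5th-percentile critical values for a few degrees of freedom. The particular df values are arbitrary choices for illustration.

```r
# Degrees of freedom to compare (arbitrary illustrative values).
df <- c(4, 19, 99, 999)

# Upper critical values that leave 2.5% in the right tail.
t_crit <- qt(p = 0.975, df = df)
z_crit <- qnorm(p = 0.975)

# The t critical values are larger than the normal one and shrink toward it
# as the degrees of freedom grow.
round(c(setNames(t_crit, paste0("t_df", df)), z = z_crit), 3)
```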

Analyzing Sampling Distributions

Let’s draw 10,000 repeated samples of n = 5 countries from this population. The data for the first two samples (replicates) is shown in section 9.4.

samples <- rep_sample_n(gapminder_2016, size = 5, reps = 10000) %>% select(replicate, country, year, life_expectancy, continent, region)

n <- 5

We can then calculate the variance for each sample (replicate) using two different formulas:

variances <- samples %>%
  group_by(replicate) %>%
  summarise(
    s2_n = sum((life_expectancy - mean(life_expectancy))^2) / n,        # divides by n
    s2 = sum((life_expectancy - mean(life_expectancy))^2) / (n - 1)     # divides by n - 1
  )

Table 9.5 shows the results for the first 10 samples. Let’s look at the average of s2_n and s2 across all 10,000 samples.

variances %>% summarize(mean_s2_n = mean(s2_n), mean_s2 = mean(s2))
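The same comparison can be sketched in base R without the gapminder data. Everything below is an assumption made for illustration: the population is simulated with rnorm(), and the seed, population size, mean, and standard deviation are arbitrary.

```r
set.seed(123)                                  # arbitrary seed for reproducibility
population <- rnorm(200, mean = 70, sd = 8)    # hypothetical stand-in population
n <- 5
reps <- 10000

# Draw one sample without replacement and compute both variance estimates.
one_sample <- function() {
  x <- sample(population, size = n)
  ss <- sum((x - mean(x))^2)
  c(s2_n = ss / n, s2 = ss / (n - 1))
}

sim <- replicate(reps, one_sample())   # 2 x reps matrix
means <- rowMeans(sim)                 # average of each estimator

# Compare the averages with the population variance.
c(means, pop_var = var(population))
```

In this sketch the average of s2 lands near the population variance, while the average of s2_n is smaller by the factor (n - 1) / n, which is why dividing by n gives a biased estimator.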

Visualizing Estimator Bias

# Overlay the sampling distributions of the two variance estimators.
ggplot(variances) +
  geom_histogram(aes(x = s2_n, fill = "red"), color = "white", alpha = 0.5) +
  geom_histogram(aes(x = s2, fill = "blue"), color = "white", alpha = 0.5) +
  # Solid lines: average of each estimator; dashed line: population variance.
  geom_vline(xintercept = mean(variances$s2_n), color = "red", size = 1) +
  geom_vline(xintercept = mean(variances$s2), color = "blue", size = 1) +
  geom_vline(xintercept = var(gapminder_2016$life_expectancy), linetype = 2, size = 1) +
  scale_fill_manual(name = "Estimator", values = c('blue' = 'blue', 'red' = 'red'), labels = expression(s^2, s[n]^2)) +
  xlab("Sample variance estimate") +
  ylab("Number of samples")

Understanding Standard Error

  • Large Standard Error: An estimate may be far from the average of its sampling distribution, meaning the estimate is imprecise.
  • Small Standard Error: An estimate is likely to be close to the average of its sampling distribution, meaning the estimate is precise.
  • An estimator that is both precise and unbiased is preferred.
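The link between sample size and standard error can be illustrated with a small simulation. This is a sketch under stated assumptions: the population is simulated, and the seed, sample sizes, and number of replicates are arbitrary; se_of_mean() is a hypothetical helper, not a function from any package.

```r
set.seed(42)                                      # arbitrary seed
population <- rnorm(10000, mean = 50, sd = 10)    # hypothetical population

# Standard error of the sample mean, estimated as the standard deviation
# of many simulated sample means.
se_of_mean <- function(n, reps = 5000) {
  sd(replicate(reps, mean(sample(population, size = n))))
}

se_small_n <- se_of_mean(5)     # small samples: less precise estimates
se_large_n <- se_of_mean(100)   # larger samples: more precise estimates

# Compare with the usual formula sd / sqrt(n).
round(c(se_small_n = se_small_n, formula_5 = sd(population) / sqrt(5),
        se_large_n = se_large_n, formula_100 = sd(population) / sqrt(100)), 2)
```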

Chapter 10: Confidence Intervals


Constructing Confidence Intervals

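Suppose we have a sample of n = 20 and are using a T-distribution to construct a 95% confidence interval for the mean. The degrees of freedom are n - 1 = 19, and a 95% interval leaves 2.5% in each tail, so the call below returns the lower critical value (about -2.09); by symmetry, the upper critical value is about +2.09.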

qt(p = .025, df = 19, lower.tail = TRUE)

Confidence Level    Area in Each Tail    p argument for qt()
90%                 5% = 0.05            p = 0.05
95%                 2.5% = 0.025         p = 0.025
99%                 0.5% = 0.005         p = 0.005
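The tail areas in the table follow from the confidence level as (1 - level) / 2. A minimal sketch, reusing df = 19 from the n = 20 example; the variable names are illustrative.

```r
conf_level <- c(0.90, 0.95, 0.99)

# Area left in each tail once the middle conf_level is removed.
tail_area <- (1 - conf_level) / 2

# Lower and upper critical values from the T-distribution with 19 df.
t_lower <- qt(p = tail_area, df = 19)
t_upper <- qt(p = 1 - tail_area, df = 19)

data.frame(conf_level, tail_area, t_lower = round(t_lower, 3), t_upper = round(t_upper, 3))
```

Because the T-distribution is symmetric around 0, the lower and upper critical values differ only in sign.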

Example: Football Fan Age Statistics

Returning to our football fans example, let’s assume we don’t know the true population standard deviation (σ). Instead, we estimate it using s, the standard deviation calculated from our sample.

mean_age_stats_unknown_sd <- sample_100_fans %>% summarize(n = n(), xbar = mean(age), s = sd(age), SE_xbar = s/sqrt(n))

Results: n = 100, xbar = 29.7, s = 7.78, SE_xbar = 0.778

qt(p = .025, df = 99)
Result: -1.98

CI <- mean_age_stats_unknown_sd %>% summarize(lower = xbar - 1.98*SE_xbar, upper = xbar + 1.98*SE_xbar)

Confidence Interval: Lower = 28.1, Upper = 31.2
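The interval can also be rebuilt from the summary statistics alone. This sketch starts from the rounded values reported above (n = 100, xbar = 29.7, s = 7.78), so the last digit may differ slightly from an interval computed on the unrounded sample.

```r
n <- 100
xbar <- 29.7
s <- 7.78

# Standard error of the sample mean.
se_xbar <- s / sqrt(n)

# Upper critical value for a 95% interval with n - 1 degrees of freedom.
t_crit <- qt(p = 0.975, df = n - 1)

# Point estimate plus or minus the margin of error.
ci <- c(lower = xbar - t_crit * se_xbar, upper = xbar + t_crit * se_xbar)
ci
```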

Example: 90% Confidence Intervals

What if we constructed 90% confidence intervals instead? We would use 1.645 as our critical value instead of 1.96.

CIs_90_football_fans <- samp_means_football_fans %>% mutate(lower = xbar - 1.645*SE_xbar, upper = xbar + 1.645*SE_xbar, captured_90 = lower <= mu & mu <= upper)

CIs_90_football_fans %>% summarize(sum(captured_90)/n())

Answer: 0.901. Approximately 90% of the confidence intervals contain the true mean.
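The capture rate can be imitated with simulated data. The sketch below is not the football fans data: the population mean, standard deviation, sample size, number of replicates, and seed are all hypothetical values chosen for illustration.

```r
set.seed(2024)   # arbitrary seed
mu <- 30         # hypothetical true mean
sigma <- 8       # hypothetical true standard deviation
n <- 100
reps <- 2000

# For each simulated sample, build a 90% interval and record whether it
# captures mu (TRUE) or not (FALSE).
captured_90 <- replicate(reps, {
  x <- rnorm(n, mean = mu, sd = sigma)
  se <- sd(x) / sqrt(n)
  lower <- mean(x) - 1.645 * se
  upper <- mean(x) + 1.645 * se
  lower <= mu & mu <= upper
})

# TRUE counts as 1 and FALSE as 0, so the sum is the number of captures.
coverage <- sum(captured_90) / length(captured_90)
coverage
```

The resulting proportion should be close to 0.90, with some variation from run to run if the seed is changed.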

Confidence Intervals for Proportions

We can use the critical value 1.96 to construct a 95% confidence interval for a proportion.

CI <- prop_red_stats %>% summarize(lower = pi_hat - 1.96*SE_pi_hat, upper = pi_hat + 1.96*SE_pi_hat)

Result: Lower = 0.283, Upper = 0.557. We are 95% confident that the true proportion of red balls in the bowl is between 0.283 and 0.557.
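For reference, the standard error of a sample proportion is sqrt(pi_hat * (1 - pi_hat) / n). The sketch below assumes, purely for illustration, a sample of 50 balls of which 21 are red; these counts are not stated in the notes above.

```r
n <- 50          # hypothetical sample size
n_red <- 21      # hypothetical number of red balls

pi_hat <- n_red / n
se_pi_hat <- sqrt(pi_hat * (1 - pi_hat) / n)

# 95% confidence interval using the normal critical value 1.96.
ci <- c(lower = pi_hat - 1.96 * se_pi_hat, upper = pi_hat + 1.96 * se_pi_hat)
round(ci, 3)
```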

Confidence Level    Z-Critical Value
95%                 1.96
99%                 2.576
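The Z-critical values in the table can be recovered with qnorm(). A minimal sketch with illustrative variable names:

```r
conf_level <- c(0.95, 0.99)

# Upper critical value: the point with (1 - conf_level) / 2 area to its right.
z_crit <- qnorm(p = 1 - (1 - conf_level) / 2)

round(z_crit, 3)
```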

Example: Computing the Point Estimate

Step 1: Calculate sample proportions and their variances.

prop_yes_stats <- mythbusters_yawn %>% group_by(group) %>% summarize(n = n(), pi_hat = sum(yawn == "yes") / n, var_pi_hat = pi_hat * (1 - pi_hat) / n)

Step 2: Calculate the point estimate for the difference in proportions and standard error.

diff_prop_yes_stats <- prop_yes_stats %>% summarize(diff_in_props = diff(pi_hat), SE_diff = sqrt(sum(var_pi_hat)))
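The two steps can be traced by hand in base R. The group sizes and counts below are hypothetical values chosen for illustration, not figures taken from these notes, and the group labels are likewise assumed.

```r
# Hypothetical group sizes and numbers of "yes" responses.
n <- c(control = 16, seed = 34)
yes <- c(control = 4, seed = 10)

# Step 1: sample proportion and its variance in each group.
pi_hat <- yes / n
var_pi_hat <- pi_hat * (1 - pi_hat) / n

# Step 2: difference in proportions (second group minus first) and its
# standard error. The variances add because the groups are independent.
diff_in_props <- unname(diff(pi_hat))
se_diff <- sqrt(sum(var_pi_hat))

round(c(diff_in_props = diff_in_props, se_diff = se_diff), 4)
```

Note that diff() subtracts the first row from the second, so the sign of the difference depends on the order of the groups.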