Statistics Concepts: Variables, Distributions, and Inference

Lesson 1: Variables

  • Explanatory Variable – aka Independent Variable; explains variations in the response variable (x-axis). This is the predictor.
    • Example: “Can quiz scores be used to predict exam scores?” (Explanatory = Quiz scores)
  • Response Variable – aka Dependent Variable; its value is predicted or its variation is explained by the explanatory variable (y-axis). This is the outcome.

Lesson 2: Variable Types and Data Visualization

  • Categorical vs. Quantitative Variables
    • Categorical Variables = names, labels, categories. They have no meaningful numerical order.
    • Quantitative Variables = numerical measurements or counts for which arithmetic (e.g., averaging) makes sense.

Visuals/Graphs: Categorical vs. Quantitative

Summary: Which Plot/Chart/Table Corresponds to Each Variable Type

ONE CATEGORICAL

  • Frequency Table
  • Bar Chart
  • Pie Chart

REMINDERS:

→ Pie charts: Best for only a few categories

→ Frequency table & Bar Chart: Use when there are many categories

ONE QUANTITATIVE

  • Dotplot
  • Histogram
  • Boxplot

TWO CATEGORICAL

  • Contingency/2-Way Table
  • Stacked/Segmented Bar Chart
  • Clustered Bar Chart

TWO QUANTITATIVE

  • Scatterplot

THREE QUANTITATIVE

  • Bubble Plot

MIXED VARIABLES:

1 Quantitative + 1 Categorical

  • Histogram w/ Groups

2 Quantitative + 1 Categorical

  • Scatterplot w/ Groups


  • Proportion Formula: Proportion = $\frac{\text{Number in the Category}}{\text{Total Number}}$
    • Proportions are used for categorical variables.
    • P-hat ($\hat{p}$) is sample proportion; $p$ is a population proportion.
  • Probability
    • Notation for Probability: $P(A)$
      • “A” is the event we are looking to find the probability of.
    • Probability is usually written as a decimal.
    • Cannot have a negative probability or probability greater than 1.

Use the same formula for probability & proportions: $P(A)=\frac{\text{Number in group A}}{\text{Total number}}$

  • Risk
    • Risk – the likelihood of an event occurring.
      • Same as probability or proportion (can use the same formula!).
  • Risk Formula: $\text{Risk}=\frac{\text{number w/ the outcome}}{\text{total number of outcomes}}$

    Ex: 60 out of 1000 teens have asthma. Risk = $\frac{60}{1000} = 0.06 = 6\%$. So 6% of teens have asthma.

  • Odds
    • Odds – express risk by comparing the likelihood of an event happening to the likelihood it does not happen.
      • Often interpreted in relation to 1 (e.g., the odds of passing a test were 5.667 to 1).
  • Odds Formula: $\text{Odds}=\frac{\text{number w/ the outcome}}{\text{number WITHOUT the outcome}}$ OR $\text{Odds}=\frac{p}{1-p}$
    • ODDS DO NOT NEED TO BE BETWEEN 0 AND 1.00 (CAN BE ANY NON-NEGATIVE NUMBER).
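Using the asthma example above (60 of 1,000 teens), a minimal Python sketch of the proportion/risk formula and both forms of the odds formula:

```python
# Proportion, risk, and odds computed from the same counts (60 of 1000 teens).
with_outcome = 60
total = 1000

proportion = with_outcome / total              # proportion = risk = probability
odds = with_outcome / (total - with_outcome)   # with outcome vs. WITHOUT outcome
odds_from_p = proportion / (1 - proportion)    # equivalent p / (1 - p) form

print(proportion)      # 0.06 -> 6% of teens have asthma
print(round(odds, 4))  # odds need not be between 0 and 1 in general
```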

Distributions (Skewed vs. Normal)

Skewed Distribution – A distribution in which values are more spread out on one side of the center than the other.

Right Skewed Distribution

  1. A distribution in which the higher values are more spread out than the lower values (AKA positively skewed).
    1. Direction of the skew is based on where the “tail” is pulled.

Left Skewed Distribution

A distribution in which the lower values are more spread out than the higher values (AKA negatively skewed).

Normal Distribution

A specific type of symmetrical distribution with a bell-shape. If the sample size is large, and the sample is a random sample, then the sampling distribution is approximately NORMAL.

Measures of Central Tendency

3 Measures of Central Tendency:

  • Mean
    • Average.
    • How to Calculate: The sum of ALL data values divided by the number of values.
    • Symbols: Population Mean $\mu$ (“mu”); Sample Mean $\bar{x}$ (“x-bar”).
  • Median
    • Middle of the distribution sorted from small to large.
    • How to Calculate: Put the values in order and find the middle.
      • Note: If it falls between 2 values, add them and divide by 2.
  • Mode
    • Most frequently occurring number(s).

**Mean is most influenced by outliers/skewness.**

**Median is most resistant to outliers.**
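A quick sketch of all three measures with Python's standard library, using made-up data with one outlier to show the mean being pulled while the median resists:

```python
import statistics

data = [2, 3, 3, 4, 5, 100]  # illustrative values; 100 is an outlier

mean = statistics.mean(data)      # sum of all values / count -> pulled up by 100
median = statistics.median(data)  # middle of sorted values (average of 3 and 4 here)
mode = statistics.mode(data)      # most frequent value

print(mean, median, mode)  # 19.5 3.5 3
```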

For SYMMETRICAL Distributions:

  1. Mean, Median, Mode are EQUAL. The Mean is the preferred measure in this case.

For SKEWED/OUTLIER Distributions:

  1. Mean, Median, Mode are not equal. The Median is the preferred measure in this case.
  • Median is more resistant to outliers (Mean gets pulled in the direction of the tail).
    • For LEFT Skew $\rightarrow$ Mean is pulled to the left and is the LOWEST measure out of the 3.
    • For RIGHT Skew $\rightarrow$ Mean is pulled to the right and is the HIGHEST.

Population & Sample Means

Population Mean Formula: $\mu=\frac{\sum x}{N}$

Sample Mean Formula: $\bar{x}=\frac{\sum x}{n}$

Empirical Rule

Empirical/95% Rule – a statement about normal distributions that approximately 95% of observations fall within 2 standard deviations of the mean.

  • 68% of data will be within 1 standard deviation of the mean.
  • 95% will be within 2 standard deviations.
  • 99.7% will be within 3 standard deviations.

Symbols to Know: Population mean is ($\mu$); Population standard deviation is ($\sigma$)

Empirical Rule/95% Rule Formulas



  • Middle 68% $\rightarrow \mu \pm 1\sigma$
    • Ex: $100 \pm 1(15) \rightarrow 100-15=85$ and $100+15=115$. What This Means: 68% of scores fall between 85 and 115, i.e., [85, 115].
  • Middle 95% $\rightarrow \mu \pm 2\sigma$
  • Middle 99.7% $\rightarrow \mu \pm 3\sigma$
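A sketch generating all three Empirical Rule intervals for the IQ-style example (mean 100, SD 15):

```python
# Empirical Rule intervals for a normal distribution with mean 100, SD 15.
mu, sigma = 100, 15

intervals = {pct: (mu - k * sigma, mu + k * sigma)
             for k, pct in [(1, "68%"), (2, "95%"), (3, "99.7%")]}

print(intervals)  # {'68%': (85, 115), '95%': (70, 130), '99.7%': (55, 145)}
```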


  • Z-Scores
    • Z-Score – Distance between an individual score and the mean in standard deviation units (also called standardized score).

POSITIVE Z-SCORES $\rightarrow$ ABOVE the mean

NEGATIVE Z-SCORES $\rightarrow$ BELOW the mean

FORMULA: Z-Score (sample): $z=\frac{x-\bar{x}}{s}$ Variables:

  • x = original data value
  • $\bar{x}$ = MEAN of the original
  • s = STANDARD DEVIATION of the original distribution

FORMULA: Z-Score (population): $z=\frac{x-\mu}{\sigma}$ Variables:

  • x = original data value
  • $\mu$ = MEAN of the original
  • $\sigma$ = STANDARD DEVIATION of the original distribution
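Both z-score formulas have the same shape; a sketch using the population version with illustrative IQ-style numbers ($\mu = 100$, $\sigma = 15$):

```python
def z_score(x, mean, sd):
    """Distance of x from the mean, in standard deviation units."""
    return (x - mean) / sd

print(z_score(130, 100, 15))  # 2.0 -> 2 SDs ABOVE the mean (positive z)
print(z_score(85, 100, 15))   # -1.0 -> 1 SD BELOW the mean (negative z)
```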
  • Range & Interquartile Range

Range – the difference between the max and min values.

  • *Heavily influenced by outliers*

Range = Max – Min

Five Number Summary – composed of the following 5 values: Minimum, Q1, Median, Q3, Maximum

  • Minimum – smallest value
  • 1st Quartile (Q1) – 25th percentile
    • Value that separates the bottom 25% from the top 75%.
  • Median – MIDDLE value
  • 3rd Quartile (Q3) – 75th percentile
    • Value that separates the bottom 75% from the top 25%.
  • Maximum – largest value

Interquartile Range (IQR) – the difference between the 1st & 3rd quartiles.

  • *Resistant to outliers*

IQR = Q3 – Q1

*Represents the middle 50% of observations*


  • IQR Method for Identifying Outliers

IQR Method – setting up a “fence” outside of Q1 and Q3.

How it Works:

Building the Fence

  1. Find IQR (i.e., Q3 – Q1).
  2. Calculate $1.5 \times \text{IQR}$.
  3. Subtract this value from Q1 to get the lower fence & add it to Q3 to get the upper fence.
  4. Result $\rightarrow$ min & max fence posts (anything outside the fences are outliers).
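The four fence-building steps above can be sketched directly (Q1 and Q3 here are illustrative values, assumed already computed):

```python
def iqr_fences(q1, q3):
    """Steps 1-4: find IQR, take 1.5*IQR, then build lower/upper fence posts."""
    iqr = q3 - q1
    step = 1.5 * iqr
    return q1 - step, q3 + step

lower, upper = iqr_fences(q1=10, q3=20)  # IQR = 10, step = 15
print(lower, upper)                      # -5.0 35.0
# Anything outside the fences is flagged as an outlier:
print([x for x in [2, -8, 15, 40] if x < lower or x > upper])  # [-8, 40]
```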

Lesson 3: Data Interpretation and Relationships

  • Box & Whisker/Boxplot Interpretation

Boxplot – uses the five number summary values to make a graph.

  • Parts of a BoxPlot:
    • Outliers are given an asterisk (*).
    • Bottom Whisker $\rightarrow$ extends to the lowest value that is not an outlier.
    • Upper Whisker $\rightarrow$ extends to the highest value that is not an outlier.
    • Box $\rightarrow$ represents the middle 50%.
      • Lower end = 25th percentile (Q1).
      • Upper end = 75th percentile (Q3).
    • Line in Middle of Box $\rightarrow$ Median.


  • Scatterplots
    • Scatterplots – display the relationship between explanatory & response variables.

Must Consider the Following in Scatterplots:

  1. Direction (+ or -)
    • Positive/Direct Relationship $\rightarrow$ both variables increase together.
    • Negative/Inverse Relationship $\rightarrow$ one variable increases while the other decreases.
    • Flat Line $\rightarrow$ no relationship.
  2. Form: linear, nonlinear, or flat.
  3. Strength: weak, moderate, or strong.
    • Data points in a straight line = very strong.
    • More spread-out data points = weaker.
  4. Bivariate outliers.


  • Correlation

Pearson’s r – a measure of the linear relationship between 2 variables (heavily influenced by outliers).

Symbols:

  • Sample $\rightarrow$ use “r”
  • Population $\rightarrow$ use “$\rho$” (rho)

2 Criteria Needed to Use Pearson’s r:

  1. 2 quantitative variables.
  2. Linear relationship.

For a POSITIVE ASSOCIATION $\rightarrow r > 0$

For a NEGATIVE ASSOCIATION $\rightarrow r < 0$

NO relationship $\rightarrow r = 0$

The closer $r$ is to 0, the weaker the relationship.

The closer $r$ is to +1 or -1, the stronger the relationship (e.g., -0.8 is a stronger relationship than +0.6).

*The sign is for direction, not strength. SO… JUST LOOK AT THE NUMBER, NOT THE SIGN!*

GUIDELINES FOR EVALUATING CORRELATION COEFFICIENTS

Influential Outliers – points in a data set that substantially change (increase or decrease) the correlation coefficient.
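Pearson's r can be computed by hand from the deviations about each mean; a self-contained sketch (the data values are made up for illustration):

```python
import math

def pearson_r(x, y):
    """Sample correlation: co-deviation of x and y, scaled by both spreads."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

r = pearson_r([1, 2, 3, 4, 5], [2, 4, 5, 4, 6])  # positive association
print(round(r, 3))  # 0.853 -> fairly strong positive relationship
```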

  • Simple Linear Regressions (heavily influenced by outliers)

Formulas:

Simple Linear Regression Line: For a SAMPLE ($\hat{y} = b_0 + b_1 x$)

Simple Linear Regression Line: For a POPULATION ($y = \beta_0 + \beta_1 x$)

Residual – the difference between an observed $y$ value and the predicted $\hat{y}$ value (AKA THE ERROR ($e$))

For Scatterplots: it is the vertical distance between the line of best fit and the observation.

Symbols:

  • Sample $\rightarrow e$ (or $\hat{e}$, “e-hat”)
  • Population $\rightarrow \epsilon$ (epsilon)

Residual Formula: $e = y - \hat{y}$

If a point is ABOVE the regression line $\rightarrow$ has a POSITIVE RESIDUAL

If a point is BELOW the regression line $\rightarrow$ has a NEGATIVE RESIDUAL


EXAMPLE: Residuals

  1. The plot shows the line $y=6.5+1.8x$.

-Identify & interpret the y-intercept.

-Identify & interpret the slope.

-Compute & interpret the residual for the point (-0.2, 5.1).


*Identify the y-int:*

y-intercept = 6.5 $\rightarrow$ means that when $x=0$ the predicted value of $y$ is 6.5.

*Identify the slope:*

Slope = 1.8 $\rightarrow$ means for every 1 unit increase in $x$, the predicted value of $y$ increases by 1.8.


*Compute & Interpret the Residual at (-0.2, 5.1)*

  • Observed $x$ value is -0.2.
  • Observed $y$ value is 5.1.

Residual Formula: $e=y-\hat{y}$

Compute $\hat{y}$ using the regression equation & $x$:

$\hat{y}=6.5+1.8(-0.2)$

$\hat{y}=6.14$

Meaning: given an $x$ value of -0.2, we predict $y$ to be 6.14.

Find the Residual (difference between $y$ and $\hat{y}$):

$e=y-\hat{y}$

$e=5.1-6.14$

$e= -1.04$

Interpretation: This observation’s $y$ value is 1.04 less than predicted based on its $x$ value.
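The worked example above as a sketch:

```python
# Residual for the point (-0.2, 5.1) on the line y-hat = 6.5 + 1.8x.
b0, b1 = 6.5, 1.8
x_obs, y_obs = -0.2, 5.1

y_hat = b0 + b1 * x_obs    # predicted y given x = -0.2
residual = y_obs - y_hat   # e = y - y-hat; negative -> point is BELOW the line

print(round(y_hat, 2), round(residual, 2))  # 6.14 -1.04
```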

Lesson 4: Inference with Confidence Intervals

  • Confidence Intervals

Confidence Interval (CI) – a range of values that are thought to be reasonable estimates of the population parameter.

Types of Parameters:

  • Single proportion ($p$)
  • Difference in 2 proportions ($p_1-p_2$)
  • Single mean ($\mu$)
  • Difference in 2 means ($\mu_1-\mu_2$)
  • Correlation ($\rho$)
  • Simple linear regression slope ($\beta_1$)

IMPORTANT SYMBOLS

  • Population Parameters = FIXED values
    • We rarely know the parameter values since it’s hard to measure an entire population (e.g., mean age of all world campus students).
  • Sample Statistics = KNOWN values
    • Random variables since they vary between samples (e.g., mean age of a portion of WC students).

Margin of Error – half of the width of a confidence interval.

Margin of Error Formula: $\text{margin of error} = \text{multiplier} \times (\text{standard error})$

2 Factors Margin of Error Depends On:

  1. Level of confidence (which determines the multiplier).
  2. Value of the standard error.

Confidence Interval Formula: $\text{sample stat} \pm \text{margin of error}$

Sample Stat Can Be any of the following:

  • $\hat{p}$ (P-hat aka a proportion) $\rightarrow$ categorical
  • $\bar{x}$ (x-bar aka mean) $\rightarrow$ quantitative
  • r
  • $p_1-p_2$
  • $\bar{x}_1-\bar{x}_2$
  • $b_1$ (slope)

*Values outside of the CI are NOT reasonable estimates for the population parameter.*

*For correlations & difference of means…* Ask yourself: Is the entire CI greater than 0?

  • CI ALL POSITIVE $\rightarrow$ Convincing evidence of positive correlation (For Difference in Means $\rightarrow$ Group A > Group B).
  • CI Includes ZERO $\rightarrow$ Not significant.
  • CI ALL NEGATIVE $\rightarrow$ Convincing evidence of negative correlation (For Difference in Means $\rightarrow$ Group B > Group A).


Empirical Rule/95% Rule (for confidence intervals)

Empirical Rule/95% Rule Formulas (for confidence intervals)

*The sample stat can be a proportion, mean, correlation, etc.*

  • 68% CI $\rightarrow$ sample stat $\pm 1(\text{standard error})$
  • 95% CI $\rightarrow$ sample stat $\pm 2(\text{standard error})$
  • 99.7% CI $\rightarrow$ sample stat $\pm 3(\text{standard error})$
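These formulas can be sketched in one helper; the numbers below are an illustrative sample mean of 29 with standard error 1.5, which reproduces the [26, 32] interval used in the example that follows:

```python
def empirical_rule_ci(stat, se, multiplier):
    """Empirical Rule CI: sample stat +/- multiplier * SE (multiplier 1, 2, or 3)."""
    return stat - multiplier * se, stat + multiplier * se

# Illustrative: sample mean 29, standard error 1.5
print(empirical_rule_ci(29, 1.5, 2))  # (26.0, 32.0) -> approximate 95% CI
```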

Format for Making Conclusions of Confidence Intervals

“I am [CI level] confident that the [population parameter] is between [L] and [U]”

Example (using a CI of [26, 32]):

$\rightarrow$ “We are 95% confident that the mean anxiety score of all students is between 26 and 32.”


  • Sampling Distributions
    • Uses either MEAN (quantitative variables) or PROPORTION (categorical variables).

Standard Error (SE) – the standard deviation of a sampling distribution.

IMPACT OF SAMPLE SIZE ON STANDARD ERROR & DISTRIBUTION

INVERSE relationship between sample size & standard error.

As sample size INCREASES, standard error DECREASES.

As sample size INCREASES, shape of the distribution becomes MORE NORMAL.

Distribution of SMALLER vs LARGER sample size

  • Randomization vs Bootstrap Distributions
    • Bootstrap distribution $\rightarrow$ centered at the sample statistic (from the data).
    • Randomization distribution $\rightarrow$ centered at the null hypothesized parameter.


  • Bootstrap Distributions

Bootstrapping – A resampling procedure for constructing a sampling distribution using DATA FROM A SAMPLE.

  1. AKA used when population values are unknown.


2 Methods for Constructing a CI for a Bootstrap:

  1. Standard Error Method
    1. Use the standard deviation (standard error) of the bootstrap distribution to construct a CI.
    2. Formula: $\text{sample statistic} \pm 2(\text{SE})$
    3. *Can only use when the distribution is NORMAL.*
  2. Percentile Method
    1. *Preferred (the shape of the distribution doesn’t matter).*
    2. For a 95% CI, find the middle 95% of bootstrap statistics.
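A minimal sketch of the percentile method (the sample data are illustrative; resampling WITH replacement is what makes it a bootstrap):

```python
import random

random.seed(1)                   # for reproducibility
sample = [4, 8, 15, 16, 23, 42]  # illustrative sample data

boot_stats = []
for _ in range(5000):
    resample = random.choices(sample, k=len(sample))  # resample WITH replacement
    boot_stats.append(sum(resample) / len(resample))  # record each bootstrap mean

boot_stats.sort()
n = len(boot_stats)
lower = boot_stats[int(0.025 * n)]       # cut 2.5% off the low tail
upper = boot_stats[int(0.975 * n) - 1]   # cut 2.5% off the high tail
print(f"95% CI: [{lower:.1f}, {upper:.1f}]")
```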

IMPACT OF SAMPLE SIZE ON CONFIDENCE INTERVALS

LARGER SAMPLE SIZE, NARROWER CI


OTHER REMINDERS:

  • Going from 95% to 99% CI $\rightarrow$ Interval gets WIDER!
    • (e.g.). [67, 73] becomes [66, 74]
  • Going from 95% to 90% CI $\rightarrow$ Interval gets NARROWER.
  • Inverse relationship between sample size and width of CI.
    • As SAMPLE SIZE INCREASES, STANDARD ERROR DECREASES.
    • As SAMPLE SIZE INCREASES, CI BECOMES NARROWER.

Example: Mean of Bootstrap Distribution

  • Suppose annual sugar consumption in the United States is normally distributed. In a sample of $n=50$ the mean sugar consumption was computed to be 160 pounds with a standard deviation of 56 pounds per person. If we were to take 5,000 bootstrap samples and record the mean of each, what would we expect the mean of that bootstrap distribution to be?
    • ANSWER: 160 pounds (bootstrap and sample means are the same).

Lesson 5: Hypothesis Testing

  • Hypothesis Testing

How Hypothesis Testing Differs from Confidence Intervals:

Hypothesis Testing $\rightarrow$ tests a specific hypothesis (a statement about a population parameter).

  • Distribution is built from the hypothesized population parameter (*assumes a “known” population*).
  • Uses p-values.

Confidence Intervals $\rightarrow$ estimate an unknown population parameter.

  • Distribution is centered at the sample statistic (*population unknown*).
  • Reports an interval of reasonable values.

*Hypotheses are ALWAYS written in terms of POPULATION PARAMETERS.*

P-value – the probability that a population with the specified parameter would produce a sample statistic as extreme or more extreme than the one we observed in our sample.

  • P-value is compared to the alpha level (commonly 0.05).

3 Things We Need to Know When Writing Hypotheses:

  1. The parameter we are testing
    1. Single mean
    2. Paired means
    3. Single proportion
    4. Difference between 2 means
    5. Difference between 2 proportions
    6. Simple linear regression slope
    7. Correlation
  2. The direction of the test (look at the research question)
    • Non-directional $\rightarrow$ “different from” or “not equal to” ($\ne$)
    • Right-tailed $\rightarrow$ “greater than” or “more than” ($>$)
    • Left-tailed $\rightarrow$ “less than” or “fewer than” ($<$)
  3. The value of the hypothesized parameter (look at the research question)
    1. AKA the number that goes in the hypothesis statements.
    2. *Usually 0 for the following parameters:* regression, correlation, difference between 2 groups.

5 STEP HYPOTHESIS TESTING PROCEDURE

Step 1: Determine what population parameter you need & write hypotheses.

2 types of hypotheses:

Null (NOTHING) Hypothesis – the statement that there is not a difference in the populations. Symbol: $H_0$. *Always associated w/ the equality (e.g., =).*

Alternative Hypothesis – the statement that there is some difference in the populations. Symbol: $H_a$ or $H_1$.

1st Ask Yourself: Is the variable quantitative or categorical?

2nd Ask Yourself: Is “more than” ($>$), “less than” ($<$), or “different” ($\ne$) used?

3rd Ask Yourself: What specific value is involved? $\rightarrow$ *Correlations, comparing 2 groups, & regressions always use 0.*

Step 2: Construct a randomization distribution, given that the null is true.

  • The hypothesized population parameter is the center of our sampling distribution.

Step 3: Use the randomization distribution to find the p-value.

  • Done in StatKey.

Step 4: Decide if you should reject or fail to reject the null hypothesis.

Step 5: State a real-world conclusion in relation to the original research question.

Null Hypothesis – no effect/difference/relationship between variables. Symbols: ALWAYS $=, \le, \text{ or } \ge$.

Alternative Hypothesis – an effect/difference/relationship exists. Symbols: $> , <, \text{ or } \ne$.


  • Determining Significance (alpha level)

How to Determine Statistical Significance:

$\rightarrow$ Compare the p-value to the alpha level.

Are the Results Statistically Significant? (Smaller p-values mean more evidence against the null.)

If $p > \alpha$:

  • We fail to reject the null hypothesis.
  • Conclusion: Not enough evidence of a difference in the population from the null.
  • $\rightarrow$ RESULTS NOT SIGNIFICANT
  • *This just means we do not have sufficient evidence to say the null hypothesis is likely false.*

If $p \le \alpha$:

  • We REJECT the null hypothesis.
  • Conclusion: There IS a difference in the population from the null (convincing evidence AGAINST THE NULL).
  • $\rightarrow$ RESULTS ARE SIGNIFICANT

On StatKey the p-value is located in the red “tails.”

$\rightarrow$ Must add together the 2 tail areas if it’s a 2-tailed test (e.g., $p=0.103+0.103=0.206$).

Lesson 6: Errors and Power in Testing

  • Type I & II Errors

2 Possibilities When We REJECT the Null:

  1. We are correct about rejecting the null
    1. AKA there really is a difference in the population.
  2. We made a Type I error
    1. AKA there is not a difference in the population and the null is actually true.

2 Possibilities When We FAIL to REJECT the Null:

  1. We are correct about failing to reject the null
    1. AKA there is NOT a difference in the population.
  2. We made a Type II error
    1. AKA there IS a difference in the population and the null is actually FALSE.

TYPE 1 vs TYPE 2 ERRORS

Type I Error – rejecting the null ($H_0$) when it is really true.

  • Symbol $\rightarrow \alpha$ (“alpha”)
  • TYPE I = REJECT & ACTUALLY TRUE (FALSE POSITIVE)

Type II Error – failing to reject the null ($H_0$) when it is really false.

  • Symbol $\rightarrow \beta$ (“beta”)
  • TYPE II = DON’T REJECT & ACTUALLY FALSE (FALSE NEGATIVE)

Definition to Know:

Alpha level $\rightarrow$ the largest p-value at which we reject the null; equivalently, the tolerable probability of making a Type I error.


  • Correct Decision
    • Rejected the null and it was indeed false.

OR

  • Failed to reject the null and it was indeed true.
  • Alpha Level (Big vs Small)

USE SMALLER ALPHA LEVEL WHEN…

there are big consequences if a Type 1 error occurs (e.g., MEDICAL STUDIES).

  • Note: Smaller $\alpha$ levels require smaller p-values to reject the null.
  • Makes it more difficult to reject the null BUT reduces the probability of a Type I error.

USE LARGER ALPHA LEVEL WHEN…

there are big consequences if a Type 2 error occurs (e.g., PILOT STUDIES).


  • Multiple Testing

The MORE tests done, the HIGHER the overall chance of a Type I error $\rightarrow$ issue: multiple testing increases the expected number of false positives.

Formula: To Find # of Tests that are False Positives: $(\text{alpha level}) \times (\text{# of tests})$
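A sketch of that expected-false-positives arithmetic (the numbers are illustrative):

```python
# Expected false positives when many true-null hypotheses are each tested at alpha.
alpha = 0.05
n_tests = 100  # e.g., running 100 independent tests

expected_false_positives = alpha * n_tests
print(expected_false_positives)  # about 5 tests come out "significant" by chance alone
```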

  • Practical vs Statistical Significance

Practical Significance – the magnitude of the difference between the hypothesized population parameter and the observed sample statistic.

  • Results are practically significant when the results are meaningful in real life.
  • Effect Size – measure of the difference between the hypothesis and observation.

Practical $\rightarrow$ assess if the result has real-world significance.

  1. Uses effect size to determine the importance of the effect.
  2. NOT directly impacted by sample size.

Statistical $\rightarrow$ tells us if a result is unlikely due to random chance.

  1. Uses p-values to determine if an effect exists.
  2. Directly impacted by sample size.

If a confidence interval does not contain 0 $\rightarrow$ results are statistically significant.

  • Effect Size

Interpreting Measures of Effect Size (using Cohen’s d)

Formulas for Calculating Cohen’s d:

For Difference in 2 Means: $d=\frac{\bar{x}_1-\bar{x}_2}{s_p}$

  • Numerator = difference of 2 means.
  • Denominator = pooled standard deviation.

For a Single Mean: $d=\frac{\bar{x}-\mu_0}{s}$

  • Numerator = difference between observed sample mean & hypothesized mean.
  • Denominator = standard deviation.

For Correlation & Regression: $r^2$


Formula for Calculating Pooled SD: $s_p=\sqrt{\frac{(n_1-1)s_1^2+(n_2-1)s_2^2}{n_1+n_2-2}}$

Interpreting Cohen’s d:

Ex: $d = 0.5$ $\rightarrow$ the sample is 0.5 SDs higher than the null hypothesized value.
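The pooled-SD and two-mean Cohen's d formulas as a sketch (the group summaries are illustrative, using the standard pooled-SD formula):

```python
import math

def pooled_sd(s1, n1, s2, n2):
    """Pooled standard deviation of two groups with SDs s1, s2 and sizes n1, n2."""
    return math.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))

def cohens_d_two_means(xbar1, xbar2, sp):
    """d = difference in the 2 means / pooled SD."""
    return (xbar1 - xbar2) / sp

sp = pooled_sd(s1=10, n1=30, s2=10, n2=30)         # equal SDs -> sp = 10.0
d = cohens_d_two_means(xbar1=55, xbar2=50, sp=sp)
print(sp, d)  # 10.0 0.5 -> the means differ by half a standard deviation
```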


  • Power

Power – the probability of correctly rejecting the null $\rightarrow$ AKA the correct decision was made.

Power $= 1 – \beta$ ($\beta$ is the probability of committing a Type II error).

Ways to INCREASE Power:

  1. Increase sample size
    1. P-value decreases which makes it more likely we reject the null.
  2. For a mean or difference in means, decrease sample SDs
    1. This will decrease the standard error, making the sampling distribution narrower.
  3. Increase effect size
    1. Look at the alternative hypothesis.
    2. Make that difference as large as possible.
  4. Increase alpha level
    1. Larger level means the p-value doesn’t need to be as small to be rejected.

*Discouraged due to error.*

Relationship Between Alpha and Beta

If the sample size is FIXED: as $\alpha$ increases, $\beta$ decreases (and vice versa).

*If we want to decrease the likelihood of Type 1 and 2 Errors, we should increase the sample size.*


  • When to Use CI vs Hypothesis Test

Use a CONFIDENCE INTERVAL When…

  • Estimating a population parameter ($\rightarrow$ aka not given a specific value to test).

Use a HYPOTHESIS TEST When…

  • Given a specific population parameter value.
  • You need to determine the likelihood that a population with that parameter would produce a sample as different as ours.

Response Bias – when a representative sample is selected but respondents give answers different from their true opinions (what they think the researchers want to hear).