# Statistics Final Exam Study Guide: Key Concepts & Formulas


### Histograms

**Appropriate for quantitative data**

- X-axis is quantitative
- Y-axis is the frequency of the data in the bin
- Bins are a range of values for collecting data, should be the same size

#### Shape

- Symmetric/Bell-shaped
- Normal
- Skewed
- Uniform

#### Center

- Mean – useful when data are symmetric
- Median – useful when data are skewed/outliers
- Mode – useful for categorical data
- Unimodal
- Bimodal

#### Spread

- Range – easy to calculate, but not very informative because it uses only the two most extreme values
- Interquartile range – easy to calculate, useful when data are skewed/outliers
- Standard deviation – useful when data are symmetric
- Outliers/Extreme values – often show up as gaps in the histogram
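
The binning idea above can be sketched in a few lines; the exam scores, bin width, and starting value here are all hypothetical.

```python
# Sketch: equal-width histogram bins for a small quantitative data set
# (hypothetical exam scores), using only the standard library.
data = [55, 61, 64, 68, 70, 71, 73, 75, 78, 80, 82, 85, 88, 91, 97]

bin_width = 10          # every bin covers the same range of values
start = 50              # chosen so the bins cover min(data)..max(data)
bins = {}
for low in range(start, 100, bin_width):
    # count how many values fall in [low, low + bin_width)
    bins[(low, low + bin_width)] = sum(low <= x < low + bin_width for x in data)

for (low, high), freq in bins.items():
    print(f"[{low}, {high}): {'*' * freq}  ({freq})")
```

Printing a star per observation gives a rough text histogram, which makes the shape, center, and gaps visible at a glance.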

### Boxplots

**Appropriate for quantitative data**

- Visual representation of 5-number summary
- Easy to construct, conveys information easily
- Min, Q1, Med, Q3, Max
- Identifies outliers
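
The 5-number summary and the usual 1.5 × IQR fences for flagging outliers can be computed directly; the data set below is hypothetical, with one deliberately extreme value.

```python
# Sketch: 5-number summary and the 1.5*IQR outlier fences a boxplot is
# built from, using only the standard library.
import statistics

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 30]     # hypothetical; 30 is extreme

q1, med, q3 = statistics.quantiles(data, n=4)  # quartiles (exclusive method)
five_number = (min(data), q1, med, q3, max(data))

iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print("Min, Q1, Med, Q3, Max:", five_number)
print("IQR:", iqr, "fences:", (lower_fence, upper_fence))
print("Outliers:", outliers)                   # flags the value 30
```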

### Descriptive Statistics

#### Mean

- Uses all values in calculation
- Sensitive to outliers/skewness
- Moves in the direction of the skew
- Always paired with standard deviation

#### Median

- Uses only the positions (ranks) of the data values
- Resistant to outliers/skewness
- Always paired with interquartile range (IQR)

#### Mode

- The only measure of center appropriate for categorical data
- Not paired with a measure of spread

#### Standard Deviation

- Measure of variability in the data
- Uses all values in calculation
- Sensitive to outliers/skewness
- Increases with skew
- Always a positive value

#### Interquartile Range (IQR)

- Measure of variability in the data
- Uses positions in calculation
- Resistant to outliers/skewness
- Q3-Q1
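
The pairings above (mean with standard deviation, median with IQR) come down to sensitivity to outliers, which is easy to see numerically; the data below are hypothetical.

```python
# Sketch: how one extreme value affects each measure (hypothetical data).
import statistics

data = [10, 12, 13, 14, 15, 16, 18]        # roughly symmetric
skewed = data + [90]                       # same data plus one extreme value

print(statistics.mean(data), statistics.median(data))      # both 14 - they agree
print(statistics.mean(skewed), statistics.median(skewed))  # mean jumps to 23.5, median barely moves
print(statistics.stdev(data), statistics.stdev(skewed))    # sd inflated by the outlier
```

The resistant pair (median, IQR) barely moves, while the sensitive pair (mean, sd) gets pulled toward the extreme value, which is why each center is reported with its matching spread.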

### Research Designs

#### Observational Studies

##### Random Selection

- Avoids bias and makes sample representative of population
- Simple Random Sample – name out of a hat
- Stratified Sample – organized by a similar trait
- Cluster Sample – organized by location: classroom, neighborhood, etc.
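
The three selection methods can be sketched with the `random` module; the roster, strata labels, and "classroom" clusters below are all made up.

```python
# Sketch of the three random-selection methods (hypothetical roster).
import random

random.seed(1)
roster = [f"student{i}" for i in range(1, 21)]

# Simple random sample: every set of 5 names equally likely ("name out of a hat")
srs = random.sample(roster, k=5)

# Stratified sample: split by a shared trait, then sample within each stratum
strata = {"freshman": roster[:10], "senior": roster[10:]}
stratified = [name for group in strata.values() for name in random.sample(group, k=2)]

# Cluster sample: pick whole groups (e.g., classrooms), keep everyone in them
clusters = [roster[i:i + 5] for i in range(0, 20, 5)]   # four "classrooms"
chosen_cluster = random.choice(clusters)                # sample everyone inside

print(srs, stratified, chosen_cluster, sep="\n")
```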

**Observe what is naturally occurring**

- No manipulation
- Survey for attitudes, beliefs
- Measurements for physical, mental, skills

##### Association

- Two variables can be associated in an observational study
- Correlation does not prove causation

#### Experimental Design

##### Random Assignment

- The intent is to create groups that are similar for comparison

**Change something**

- Manipulate one variable – treatment group
- Do nothing to another group – control group
- Blind the control group to the fact that nothing is happening – placebo
- The control group experiences some change – placebo effect

##### Causation

- Causation can be established through a well-designed experiment
- Only generalizable to the same population as the volunteers

##### Ethical Questions

- You are asking/forcing participants to do something they wouldn’t normally do
- It is unethical to have participants do something harmful
- Tuskegee Experiment
- Prisoner Experiments
- Skinner’s fear experiment
- IRB (Institutional Review Board) protects against unethical experiments

### Distributions

#### Normal Distributions

- Used with proportions and means when population standard deviation (σ) is known
- Unimodal
- Symmetric
- Mean=Median=Mode

- Area under the curve = 1
- This means we can calculate percentages, proportions, and probabilities for a given range of values

- The values mathematically extend out to infinity, but practically – there is no area under the curve after 10 standard deviations

##### Standard Normal Distribution

- Center = 0
- Standard Deviation = 1

##### Empirical Rule

- 68% of all values are within ±1 standard deviation
- 95% of all values are within ±2 standard deviations
- 99.7% of all values are within ±3 standard deviations
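
The three percentages can be checked against the exact Normal areas, since P(|Z| ≤ k) = erf(k/√2) for the standard Normal:

```python
# Check the Empirical Rule against exact standard Normal areas.
import math

for k in (1, 2, 3):
    area = math.erf(k / math.sqrt(2))   # P(-k <= Z <= k)
    print(f"within ±{k} sd: {area:.4f}")
# within ±1 sd: 0.6827, ±2 sd: 0.9545, ±3 sd: 0.9973
```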

#### Student t Distributions

- Used with means when σ is unknown
- Unimodal
- Symmetric
- Mean=Median=Mode

- Area under the curve = 1
- This means we can calculate percentages, proportions, and probabilities for a given range of values

- The values mathematically extend out to infinity, but practically – there is no area under the curve after 10 standard deviations
**No Standard t Distribution**

- Family of curves based on degrees of freedom (d.f.)
- As the d.f. approaches infinity, it approaches the Normal Distribution
- Center = 0, but the standard deviation is greater than 1 and depends on the d.f.

**No Empirical Rule**

- The area in the tails is higher than in the Normal Distribution
- The tail areas change with every d.f.
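
The heavier tails can be seen by simulation: a t-distributed value can be generated as Z divided by √(χ²/d.f.), where the χ² value is a sum of squared standard Normals. This is a stdlib sketch; the seed, d.f., and number of draws are arbitrary choices.

```python
# Simulation sketch: t-distributed draws spread out more than the
# standard Normal (whose sd is exactly 1).
import math
import random
import statistics

random.seed(0)
df = 5

def t_draw(df):
    z = random.gauss(0, 1)
    chi2 = sum(random.gauss(0, 1) ** 2 for _ in range(df))
    return z / math.sqrt(chi2 / df)

draws = [t_draw(df) for _ in range(20_000)]
print(statistics.stdev(draws))   # noticeably above 1 for small d.f.
```

For d.f. = 5 the theoretical standard deviation is √(5/3) ≈ 1.29, so the simulated spread should land near that rather than near 1.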

#### Sampling Distributions

- A sampling distribution is the distribution of a statistic (such as the sample mean or sample proportion) computed from all possible samples of a given size

##### What does the Central Limit Theorem tell us about Sampling Distributions?

- The center of the distribution is the same as the center of the population
- The amount of variability decreases as sample size increases
- An individual varies the most
- When we use population values (p or σ) in the calculation, it is called the standard deviation of the sampling distribution
- s.d.(p-hat) = sqrt(p(1-p)/n) or s.d.(x-bar) = σ/sqrt(n)

- When we use sample values (p-hat or s) in the calculation, it is called the standard error. The different name helps people keep them straight, BUT it still measures the variability of the sample statistic
- s.e.(p-hat) = sqrt(p-hat(1-p-hat)/n) or s.e.(x-bar) = s/sqrt(n)

- Even if the population trait is not normally distributed, the sampling distribution will be if the conditions are met.

**When you use proportions or the σ is known**

- Use Normal Distribution

**When the population standard deviation is not known**

- Use Student t Distribution
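
These Central Limit Theorem facts can be checked by simulation. The population below is a hypothetical Exponential(1) trait, chosen because it is strongly skewed with mean 1 and σ = 1, so sd(x-bar) should be close to 1/√n.

```python
# CLT simulation sketch: sample means from a skewed population are
# centered at the population mean, with spread shrinking as n grows.
import random
import statistics

random.seed(0)

def mean_of_sample(n):
    return statistics.mean(random.expovariate(1.0) for _ in range(n))

for n in (4, 25, 100):
    means = [mean_of_sample(n) for _ in range(2_000)]
    print(n, round(statistics.mean(means), 3), round(statistics.stdev(means), 3))
    # centers stay near 1.0; sd of the means is close to 1/sqrt(n)
```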

### Linear Regression

- Only for two variables that are quantitative
- Explanatory variable is on the x-axis
- It is the one we think explains changes in the response

- Response variable is on the y-axis
- It is the one we think responds to changes in the explanatory

- Appropriate graph/display is a scatterplot
- The slope measures how much the response changes per one-unit increase in the explanatory variable
- The intercept is the expected response when the explanatory variable is zero
- R^2 measures the proportion of variability in the response explained by changes in the explanatory variable
- Value from 0 to 1
- Think of it like a %

- The line of best fit, the linear regression, is best because it minimizes the error
- Smallest Sum of Squared Errors (SSE)
- The error is a residual
- The residual is the difference between the actual and predicted
- Residual = actual – predicted

### Inferential Statistics

#### Confidence Intervals

- Useful when you don’t have a population estimate
- Because of the Central Limit Theorem, samples are the best estimate
- But samples vary, so there is a margin of error
- This creates the lower endpoint and upper endpoint
- m.e. = critical value (based on confidence level) × standard error (based on sample size)
- Decreasing confidence decreases the interval, makes it smaller
- Because the critical value gets smaller

- Increasing sample size decreases the interval, makes it smaller
- Because the s.d.(p-hat) or the s.d.(x-bar) gets smaller

- 95% confidence means that about 95 out of 100 intervals built this way will contain the true population parameter
- All values inside the interval are plausible values for the parameter
- Values outside the interval are unlikely

- Supports a claim if it is within the interval
- This means it can be used like a hypothesis test
- Better, because it gives a range (effect size) of how much different the value is from the claim
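
Putting the pieces together for one proportion (a hypothetical poll: 540 "yes" answers out of 1000 people):

```python
# Sketch: 95% confidence interval for a proportion.
# m.e. = critical value * standard error.
import math

successes, n = 540, 1000                 # hypothetical poll results
p_hat = successes / n

z_star = 1.96                            # critical value for 95% confidence
se = math.sqrt(p_hat * (1 - p_hat) / n)  # s.e.(p-hat)
margin = z_star * se

lower, upper = p_hat - margin, p_hat + margin
print(f"95% CI: ({lower:.3f}, {upper:.3f})")   # roughly (0.509, 0.571)
```

Because 0.5 is outside this interval, the interval doubles as a test of the claim p = 0.5, and it also shows how far from 0.5 the plausible values sit.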

#### Hypothesis Test

- Useful when you do have a population estimate
- You are testing a claim that the value/trait of the sample is significantly different
- Significantly different – the value is farther from the expected value than random variation alone would explain
- The process tests a sample (evidence) against the null hypothesis (nothing changed, no difference)
- If the P-value is small, there is only a small chance of seeing results this extreme when nothing actually happened
- Therefore, we reject the null hypothesis in favor of the alternative (something did happen)
- If this is a mistake (we reject the null when it is true), we call it a Type I error
- If the P-value is large, results like these are quite likely even when nothing happened
- Therefore, we do not reject (fail to reject) the null hypothesis. Probably nothing happened
- If this is a mistake (we do not reject the null when it is false), we call it a Type II error
- How large or small depends on your level of significance (α). This becomes the cut-off for random variation to something happened
- Alpha (α) is directly related to the probability of making a Type I error
- α is exactly the probability of making a Type I error
- Alpha (α) is inversely related to the probability of making a Type II error
- Increasing α, decreases your probability of making a Type II error
- A confidence interval can be used to test a hypothesis
- If the null hypothesis value is in the confidence interval, then do not reject
- If the null hypothesis value is NOT in the confidence interval, then reject
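
The whole process can be sketched as a one-proportion z-test; the data (540 successes in 1000 trials against the claim p = 0.5) are hypothetical, and the two-sided P-value comes from the Normal model via erfc.

```python
# Sketch: one-proportion z-test of H0: p = 0.5 (hypothetical data).
import math

successes, n, p0 = 540, 1000, 0.5
p_hat = successes / n

se0 = math.sqrt(p0 * (1 - p0) / n)           # standard deviation under the null
z = (p_hat - p0) / se0                       # how many sd's from the claim
p_value = math.erfc(abs(z) / math.sqrt(2))   # two-sided tail area

alpha = 0.05                                 # level of significance
print(f"z = {z:.2f}, P-value = {p_value:.4f}")
print("reject H0" if p_value < alpha else "fail to reject H0")
```

Here the P-value is about 0.011, which is below α = 0.05, so the sample is significantly different from the claim and we reject the null hypothesis.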