
  • Standard deviation: the square root of the variance. It measures how far the data is, on average, from the mean.

    • Solve: Find the variance and take the square root of it.

  • Coefficient of Variation: the standard deviation relative to the mean, shown as a percentage. Use it to compare the variation of two sets of data side by side, even when their means differ.

    • Solve: CV = (standard deviation / mean) x 100

  • Counts (counting numbers) are discrete data
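As a rough sketch of the CV formula above, the Python below compares two made-up data sets that have the same standard deviation but different means. The data values and the function name `coefficient_of_variation` are assumptions for illustration only, and the sample SD (dividing by n - 1) is assumed.

```python
import statistics

def coefficient_of_variation(values):
    """Return the CV as a percentage: (sample SD / mean) * 100."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample SD, divides by n - 1
    return sd / mean * 100

# Hypothetical data: same spread (SD = 2), very different means.
set_a = [8, 10, 12]
set_b = [98, 100, 102]

cv_a = coefficient_of_variation(set_a)
cv_b = coefficient_of_variation(set_b)
print(f"CV of set A: {cv_a:.1f}%")
print(f"CV of set B: {cv_b:.1f}%")
```

Both sets have the same SD, but set A varies far more relative to its mean, which is why the CV is the better tool for side-by-side comparisons.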


Center and Variability:

  • Descriptive Measures: a single number that comes from the sample data

  • Central tendency: Middle of data and common value (mean, median, mode)

  • Variation (scatter, variability, dispersion): How spread out and different the data is (Range, S.D., C.V.)

    • Solve (sample variance): 1) find the distance from each sample value to the mean, 2) square each distance, 3) add the squares, 4) divide by n - 1. Take the square root of the result to get the S.D.

  • Measures of Position: One value compared to one full group (Percentiles, quartiles)

  • Measures of Variation: Include the Range, SD, and CofV
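The four "Solve" steps for the variance can be sketched in Python as follows. The data set is hypothetical and the function name `sample_variance` is just an illustrative choice.

```python
import math

def sample_variance(values):
    """Sample variance, following the four steps in the notes."""
    n = len(values)
    mean = sum(values) / n
    distances = [x - mean for x in values]   # 1) distance from each value to the mean
    squares = [d ** 2 for d in distances]    # 2) square each distance
    total = sum(squares)                     # 3) add the squares
    return total / (n - 1)                   # 4) divide by n - 1

# Hypothetical sample data.
data = [2, 4, 4, 4, 5, 5, 7, 9]

variance = sample_variance(data)
sd = math.sqrt(variance)             # S.D. is the square root of the variance
data_range = max(data) - min(data)   # Range = highest value - lowest value

print(f"Variance: {variance:.3f}")
print(f"S.D.:     {sd:.3f}")
print(f"Range:    {data_range}")
```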

Boxplots, percentiles, and quartiles:

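  • Percentiles: a value below which a given percentage of scores fall

    • Position = (Percentile / 100) x (n + 1). This gives the position of the value in the sorted data, not the value itself.

  • Box plot: shows the quartiles, with the box spanning the IQR (Q1 to Q3)

    • Charts -> Box and Whisker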

    • L, Q1, Q2, Q3, IQR, H

      • Lower Inner Fence: Q1 – (1.5 x Interquartile Range)

      • Lower Outer Fence: Q1 – (3 x Interquartile Range)

      • Upper Inner Fence: Q3 + (1.5 x Interquartile Range)

      • Upper Outer Fence: Q3 + (3 x Interquartile Range)

    • Interquartile range: Q3 – Q1.

      • is a measure of variability

    • To Solve in Excel: type “=QUARTILE.INC(”, select the first data point, then hold Shift and click the last. Then type a comma and enter 1, 2, or 3 for the quartile.
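A minimal Python sketch of the quartile, IQR, and fence calculations above. It assumes that `statistics.quantiles` with `method="inclusive"` is an acceptable stand-in for the ".inc" style of quartile; the data set and the function name `box_plot_summary` are hypothetical.

```python
import statistics

def box_plot_summary(values):
    """Return L, Q1, Q2, Q3, IQR, H and the inner/outer fences."""
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1  # Interquartile range: Q3 - Q1
    return {
        "L": min(values),
        "Q1": q1,
        "Q2": q2,
        "Q3": q3,
        "IQR": iqr,
        "H": max(values),
        "lower_inner_fence": q1 - 1.5 * iqr,
        "lower_outer_fence": q1 - 3 * iqr,
        "upper_inner_fence": q3 + 1.5 * iqr,
        "upper_outer_fence": q3 + 3 * iqr,
    }

# Hypothetical data with one very large value.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]
summary = box_plot_summary(data)

for name, value in summary.items():
    print(f"{name}: {value}")

# Values beyond the outer fences are the most extreme ones.
beyond_outer = [x for x in data
                if x < summary["lower_outer_fence"] or x > summary["upper_outer_fence"]]
print("Beyond outer fences:", beyond_outer)
```

Other quartile methods (for example one based on the n + 1 position formula) can give slightly different quartiles on small data sets, so results may not match every textbook exactly.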


Sampling techniques:

  • Simple random sampling: each member of the population has an equal chance of being selected. Strengths: simple and gives good representation. Negatives: requires a list of the entire population.

  • Stratified random sampling: divide the population into non-overlapping groups (strata) identifiable by a variable such as age, gender, or income, then sample randomly from each group. Strengths: ensures all groups are represented with respect to the variable of interest, so it is more accurate than simple random sampling. Negatives: more complex, and you must choose non-overlapping groups.

  • Cluster sampling: like stratified sampling, the population is divided into groups, but you choose whole groups at random. Strengths: inexpensive and simpler. Negatives: items within a cluster tend to have similar traits, which increases sampling error.

  • Systematic random sampling: choose every k-th item on the list until n values are obtained. Strengths: same as simple random sampling, but ensures no two selected items are next to each other. Negatives: selections are no longer independent.

  • Convenience sampling: grab a convenient set of observational units, which is not technically random. Strengths: convenient, simple, low cost. Negatives: not random, so not representative.
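The four random techniques above can be sketched with Python's `random` module. This is only an illustration under assumed inputs: the population of 100 numbered members, the group names, and all function names are hypothetical.

```python
import random

def simple_random_sample(population, n, rng):
    """Every member has an equal chance of being selected."""
    return rng.sample(population, n)

def stratified_sample(strata, per_stratum, rng):
    """Randomly sample from EVERY group (stratum)."""
    sample = []
    for members in strata.values():
        sample.extend(rng.sample(members, per_stratum))
    return sample

def cluster_sample(clusters, num_clusters, rng):
    """Randomly choose whole groups and keep all their members."""
    chosen = rng.sample(list(clusters), num_clusters)
    sample = []
    for name in chosen:
        sample.extend(clusters[name])
    return sample

def systematic_sample(population, n, rng):
    """Random starting point, then every k-th item until n are chosen."""
    k = len(population) // n
    start = rng.randrange(k)
    return population[start::k][:n]

rng = random.Random(42)           # fixed seed so the sketch is repeatable
population = list(range(1, 101))  # hypothetical members numbered 1..100
groups = {
    "group_1": population[0:25],
    "group_2": population[25:50],
    "group_3": population[50:75],
    "group_4": population[75:100],
}

print("Simple:    ", simple_random_sample(population, 10, rng))
print("Stratified:", stratified_sample(groups, 3, rng))
print("Cluster:   ", cluster_sample(groups, 1, rng))
print("Systematic:", systematic_sample(population, 10, rng))
```

Note how the stratified sample always contains members from all four groups, while the cluster sample contains every member of one randomly chosen group.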

Excel:

Descriptive Statistics Excel:

1.        Click on Data, then Data Analysis

2.        Find Descriptive Statistics

3.        Select the data as the input range

4.        Set the output range to where you want the summary table to go

Measures of Position Excel:

1.        Insert function and click “PERCENTILE.INC”

2.        In the array field, select all the data

3.        In “K”, enter the percentile you seek as a decimal (between 0 and 1)

4.        Value will be at the bottom of the window
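For comparison, here is a small Python sketch of an inclusive percentile calculation. It assumes the inclusive method places the k-th percentile at position k x (n - 1) in the sorted data and interpolates between neighbours; the function name `percentile_inc` and the data are hypothetical, and this is not guaranteed to match Excel in every case.

```python
import math

def percentile_inc(values, k):
    """Inclusive percentile with linear interpolation; k is between 0 and 1."""
    if not 0 <= k <= 1:
        raise ValueError("k must be between 0 and 1")
    data = sorted(values)
    rank = k * (len(data) - 1)              # zero-based position in the sorted data
    lower = math.floor(rank)
    upper = min(lower + 1, len(data) - 1)   # stay inside the list when k = 1
    fraction = rank - lower
    return data[lower] + fraction * (data[upper] - data[lower])

# Hypothetical data.
scores = [10, 20, 30, 40]
median = percentile_inc(scores, 0.5)
print("50th percentile:", median)
```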

Binomial distribution:

5 Conditions of Binomial:

1.   The variable being studied is random.

2. The outcomes of the variable are being counted.

3. There are a fixed number of trials. The letter n denotes the number of trials.

4. There are only two possible outcomes, called “success” and “failure,” for each trial. π denotes the probability of a success on one trial, and 1-π denotes the probability of a failure on one trial.

5. The n trials are independent and are repeated using identical conditions. Because the n trials are independent, the outcome of one trial does not help in predicting the outcome of another trial.


Normal Curve and Empirical Rule:

The Normal Distribution

  • Bell shaped and symmetrical

  • Half of the values fall on each side of the mean

  • Fully described by the mean and standard deviation

  • The mean, median and mode will be the same

  • About 68% of the data falls within 1 standard deviation of the mean

  • About 95% of the data falls within 2 standard deviations of the mean

  • About 99.7% of the data falls within 3 standard deviations of the mean

    • This is the empirical rule, and it applies only to the normal distribution

  • To check if data is normal, look at the histogram and at how much of the data falls within 1, 2, and 3 standard deviations

    • Also check that mean = median = mode (approximately)

  • Z-score = (X – Mean) / SD
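The empirical rule percentages and the z-score formula can be checked with Python's `statistics.NormalDist`. This is a small sketch; the function names and the example values (mean 100, SD 15) are hypothetical.

```python
from statistics import NormalDist

def z_score(x, mean, sd):
    """Number of standard deviations that x lies from the mean."""
    return (x - mean) / sd

def share_within(k):
    """Share of a normal distribution within k standard deviations of the mean."""
    standard_normal = NormalDist()  # mean 0, SD 1
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"Within {k} SD: {share_within(k):.4f}")

# Hypothetical example: a value of 130 when the mean is 100 and the SD is 15.
print("Z-score:", z_score(130, 100, 15))
```

The exact shares are slightly different from the rounded 68%, 95%, and 99.7% figures, which is why the rule is stated as an approximation.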

Understand the characteristics of a sampling distribution and central limit theorem:

  • The Central Limit Theorem: if we take many samples of size n and plot the frequencies of their means, we get an approximately normal distribution (when n is large enough)

  • Law of Large Numbers: describes what happens when the same experiment is performed a large number of times

    • The law of large numbers states that the longer you do an experiment, the closer you get to the theoretical probability.

  • Standard Error: Sigma / square root of n
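A quick simulation can illustrate the Central Limit Theorem and the standard error formula. The sketch below assumes a skewed population (exponential, with mean 1 and sigma 1) chosen purely for illustration; the sample size, number of samples, and seed are arbitrary.

```python
import math
import random
import statistics

def simulate_sample_means(n, num_samples, rng):
    """Draw num_samples samples of size n and return the mean of each."""
    means = []
    for _ in range(num_samples):
        # Assumed population: exponential with mean 1 and sigma 1 (not normal).
        sample = [rng.expovariate(1.0) for _ in range(n)]
        means.append(statistics.fmean(sample))
    return means

rng = random.Random(0)  # fixed seed so the sketch is repeatable
n = 36
sample_means = simulate_sample_means(n, 5000, rng)

observed_mean = statistics.fmean(sample_means)
observed_se = statistics.stdev(sample_means)
theoretical_se = 1.0 / math.sqrt(n)  # sigma / square root of n

print(f"Mean of the sample means: {observed_mean:.3f} (population mean is 1)")
print(f"SD of the sample means:   {observed_se:.3f}")
print(f"Sigma / sqrt(n):          {theoretical_se:.3f}")
```

Plotting a histogram of `sample_means` should show a roughly bell-shaped curve even though the population itself is strongly skewed.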


Level of Significance:

  • A 1% level of significance means that you are very sure and have evidence to back up your claim

    • Having a higher level of significance means that there is more room for error and that the person proving a point is not as sure

————————————————————————————————————————————————————

  • What lowers SD?: Consistency is measured by CV, so get the mean and SD for all values, then get the CV. The higher the percentage, the higher the inconsistency, so lower percentage = more consistent

  • When trying to prove that a value is less than x, we start by assuming it is equal to x

    • Need to assume x has exactly the same x at x

  • Compare the Coefficient of Variation when looking at two data sets, NOT the standard deviation

  • Discrete = Whole Numbers, Continuous = Decimals, Quantitative = Counts/percents, Categorical = Groups




  • To evaluate the evidence, we want to determine the probability of observing the evidence (or even better evidence against the assumption), assuming the assumption is true. Once we determine this probability, we need to decide whether the event is unlikely or not unlikely. That is, is it unlikely that we could have observed the evidence if the assumption is true, or is it not unlikely? If it is unlikely to have observed the evidence, then most likely there is something wrong with the assumption and the claim is likely true. If it is not unlikely to have observed the evidence, then we can’t actually conclude that there is something wrong with the assumption and we cannot conclude that the claim is true.

    • 1% is unlikely and 10% is not unlikely

  • Sampling Variability: The variability among random samples of size n from the same population

  • Sampling Distribution: A probability distribution that characterizes some aspect of sampling variability

  • Law of Large Numbers Increasing: As the sample size increases, the distribution of the sample approaches the shape of the population distribution. That is, the shape, mean, and standard deviation of the sample get closer to the shape, mean, and standard deviation of the population.

  • Sampling Distribution of the Sample Means: As the sample size increases, the shape of the sampling distribution will become approximately normal, the mean of the sampling distribution will equal the mean of the population, and the standard deviation of the sampling distribution will get smaller and smaller, specifically it will be sigma/SQRT n.

1.      Before proceeding any further, Betty and Tina need to clearly identify the random variable being studied here. Betty thinks it is the number of people who show up (with a ticket). Tina thinks it’s whether the flight is overbooked or not. Who is correct and why? (2)

Betty is correct. The concern is about the no-show rate, which is related to how many people show up.

Both the variable of people who show up and whether a flight is overbooked or not can be considered random variables. The randomness would come from how the sample is collected (i.e. randomly or not). The issue here is what is actually being studied – not the randomness of the variable.

Both variables are categorical/qualitative. For the show up variable, the data would be “passenger showed up” or “passenger did not show up”. For the overbooked variable, the data would be “flight overbooked” or “flight not overbooked”.

2.      To do a statistical analysis of the evidence that Tina has provided, what assumption (including a probability) do they need to make? Explain your answer. (2)

Betty and Tina need to start with the opposite assumption of what they are trying to prove. They must assume that their airline is the same as the rest of the industry and that their no-show rate is 8%. They don’t want to assume Tina is correct because then they would be introducing bias in their analysis.

It would also be appropriate to say that they want to assume that the show up rate is 92%.

Note: The 8% is NOT an average. It is the proportion of passengers with tickets who do not show up for their flights. As the variable is categorical, the appropriate numerical descriptive statistic is proportion.

Regardless of your answer above, use a binomial distribution to model this situation.

4.      Clearly state the parameters for the binomial distribution for this situation. (2)

n = 180, π = 0.08

Alternatively, n = 180, π = 0.92

Both answers are appropriate. The key is staying consistent.

5.      As stated above, Tina suspects that the no-show rate at WestCoast Air is less than 8% (i.e., the show rate is more than 92%). Her argument is that she looked at the five most recent flights between Calgary and Vancouver (all flights had sold 36 tickets) and observed that the proportion of no-shows was only 5.6% (only 10 out of 180 people). Based on the centre and variability for this distribution, do you think her argument is valid? (4)

NOTE: Most people did not recognize that the question was asking for an examination of the center (the mean) and the variability (the standard deviation).

If the probability of a no-show actually is 8%, then it would not be unusual to expect some fluctuation from this value. Indeed, if you sell 180 tickets, you would expect 14.4 people to not show up, on average, but with a standard deviation of 3.64. When looked at in these terms, we would expect that typically between 10.76 and 18.04 passengers would not show up when 180 tickets are sold. Therefore, having ten people not show up, assuming an 8% no-show rate, is atypical but not significantly so. That is, it is only slightly below 10.76, the lower end of the typical range. Therefore, it is not yet clear whether the no-show rate is less than 8% or not, based on Tina’s evidence.

Here you are being asked to determine whether 10 out of 180 no shows is abnormal, if the no show rate is actually 8% (or alternatively whether 170 out of 180 passengers showing up is abnormal, if the show up rate is 92%). To determine this, you need to find out what the typical range of no shows would be, based on the no show rate of 8%. The hint for this is when you are asked to consider centre and variability. This hint is telling you to find the typical range (the mean plus and minus the standard deviation), then compare the 10 out of 180 to the typical range.
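The centre and variability figures in this answer can be reproduced with a short Python sketch, assuming the usual binomial formulas mean = n x π and SD = square root of n x π x (1 - π). The function name is illustrative only.

```python
import math

def binomial_center_and_spread(n, pi):
    """Mean and standard deviation of a binomial distribution."""
    mean = n * pi
    sd = math.sqrt(n * pi * (1 - pi))
    return mean, sd

# Parameters from the question: 180 tickets sold, assumed no-show rate of 8%.
mean, sd = binomial_center_and_spread(180, 0.08)
low, high = mean - sd, mean + sd  # typical range: mean plus and minus one SD

observed_no_shows = 10
print(f"Expected no-shows: {mean:.1f}")
print(f"Standard deviation: {sd:.2f}")
print(f"Typical range: {low:.2f} to {high:.2f}")
print("Observed value inside typical range?", low <= observed_no_shows <= high)
```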

6.

P(X<= 10) = 0.14

This is the probability of obtaining a random sample of 180 where 10 or fewer did not show up, given the assumption that within the population, 8% would not show up.

Another way to look at this: Perfect proof that our no-show rate is less than 8% would be zero no-shows in our sample. How close is 10 no-shows to perfect evidence against the initial assumption that we have an 8% no-show rate? Our sample evidence is 0.14 away from perfect evidence, which in statistical terms is not very strong evidence.
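The probability P(X <= 10) can be computed by adding up the binomial probabilities for 0 through 10 no-shows. The Python below is a minimal sketch using n = 180 and π = 0.08 from the question; the function names are hypothetical.

```python
import math

def binomial_pmf(k, n, pi):
    """Probability of exactly k successes in n independent trials."""
    return math.comb(n, k) * pi ** k * (1 - pi) ** (n - k)

def binomial_cdf(k, n, pi):
    """Probability of k or fewer successes: P(X <= k)."""
    return sum(binomial_pmf(i, n, pi) for i in range(k + 1))

# Here a "success" is a no-show, assuming the no-show rate really is 8%.
p_at_most_10 = binomial_cdf(10, 180, 0.08)
print(f"P(X <= 10) = {p_at_most_10:.2f}")
```

The result should be close to the 0.14 reported in the answer above.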

7.      In sentence form, explain what probability you have found in the context of the question. In addition to the explanation, comment on whether Tina has provided enough evidence to suggest that the no-show rate is in fact less than 8%. Explain your answer. (6)

The probability that at most 10 passengers will not show up for flights when 180 tickets are sold, under the assumption that the no-show rate is 8%, is 14%.

If the probability was less than 1%, then it is unlikely that we would have observed this evidence if the no-show rate is indeed 8% for WestCoast Airlines. If the probability was greater than 10%, then it is not unlikely that we would have observed this evidence if the no-show rate is indeed 8% for WestCoast Airlines. Here the probability is 14%, which is greater than 10%. Therefore, Tina has not provided enough evidence to support her claim, and the no-show rate may indeed be 8% for WestCoast Airlines.

Note: It would be wrong to compare the 8% no show rate and the 14% probability. They represent very different things and cannot be compared. The 8% no show rate is the proportion of passengers who do not show up for a flight. The 14% is the probability we observed our evidence (or even better evidence against the assumption), assuming π = 0.08. Therefore, 14% is NOT the no show rate.

8.  Betty and Tina have collected their evidence differently. Which evidence is more appropriate to show to management? Why? (2)

Betty’s evidence came from industry, so it is fair to assume it is accurate for the industry, but it may not apply to WestCoast Air. Tina’s evidence is specific to WestCoast Air, but it is a convenience sample.

Tina’s sample is a convenience sample. It does not matter that her sample size is bigger. As it is not a random sample, it is never the right choice.

Normal Distribution:

  • What is the probability that a randomly selected bolt will not fit in the tolerance interval?

    • Enter the lower and upper tolerance limits as x1 and x2, along with the mean and SD

      • Then find P(x < x1) and P(x > x2)

        • Add the two probabilities
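The bolt tolerance steps above can be sketched with `statistics.NormalDist`. The numbers used here (mean 10.0, SD 0.02, tolerance of plus or minus 0.05) are made up for illustration and do not come from the notes.

```python
from statistics import NormalDist

def prob_outside_tolerance(mean, sd, x1, x2):
    """P(X < x1) + P(X > x2) for a normal distribution."""
    dist = NormalDist(mu=mean, sigma=sd)
    below = dist.cdf(x1)       # probability the bolt is too small
    above = 1 - dist.cdf(x2)   # probability the bolt is too large
    return below + above

# Hypothetical bolt: mean 10.0, SD 0.02, tolerance interval 9.95 to 10.05.
p_not_fit = prob_outside_tolerance(mean=10.0, sd=0.02, x1=9.95, x2=10.05)
print(f"P(bolt does not fit) = {p_not_fit:.4f}")
```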