Statistics Terminology and Concepts

Basic Terminology

Data Types

Sample: The subset of population units that is actually selected and measured to gather information.

Target Population: The entire group of individuals or objects that the researcher is interested in studying.

Continuous Data: Data that can take on any possible value within a range.

Discrete Data: Data that can take only distinct, separated values, such as whole-number counts.

Categorical Measurements

Measurements where a unit is placed into a category based on an observed attribute or quality.

Nominal Labels: Categories with no inherent order (e.g., male, female).

Ordinal Labels: Categories with a natural order that do not correspond to specific numbers (e.g., small, medium, large).

Probability

Addition Rule

P(A or B) = P(A) + P(B) – P(A & B)

If A & B are mutually exclusive/disjoint: P(A or B) = P(A) + P(B).

Multiplication Rule

P(A & B) = P(A)P(B|A) = P(B)P(A|B)

Conditional Probability

P(A|B) = P(A & B)/P(B), P(B|A) = P(A & B)/P(A)

If A and B are independent: P(A & B) = P(A)P(B), or equivalently P(A|B) = P(A).
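The rules above can be checked on a small worked example. The sketch below assumes one roll of a fair six-sided die and two illustrative events, A (even roll) and B (roll greater than 3); the events and the helper name prob are chosen here for illustration only.

```python
from fractions import Fraction

def prob(event):
    """Probability of an event (a set of faces) for one roll of a fair die."""
    return Fraction(len(event), 6)

A = {2, 4, 6}   # hypothetical event: roll is even
B = {4, 5, 6}   # hypothetical event: roll is greater than 3

p_a_and_b = prob(A & B)                      # set intersection: {4, 6}
p_a_or_b = prob(A) + prob(B) - p_a_and_b     # addition rule
p_b_given_a = p_a_and_b / prob(A)            # conditional probability
p_product = prob(A) * p_b_given_a            # multiplication rule

# A and B are independent only if P(A & B) equals P(A)P(B).
independent = p_a_and_b == prob(A) * prob(B)

print(p_a_or_b, p_b_given_a, p_product, independent)
```

Here the addition rule agrees with counting the union directly, and the two events are not independent because P(A & B) differs from P(A)P(B).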

Probability Distributions

Cumulative Probability Distribution (CDF)

F(x) = P(X ≤ x) for -∞ < x < ∞ (e.g., P(1.60 < X ≤ 1.75) = P(X ≤ 1.75) - P(X ≤ 1.60))
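As a rough illustration of the CDF difference, the sketch below assumes X is normally distributed with mean 1.70 and standard deviation 0.10 (for example, heights in metres). These parameter values are assumptions made up for the example, not values from the notes.

```python
from statistics import NormalDist

# Assumed distribution for illustration: X ~ N(1.70, 0.10^2)
X = NormalDist(mu=1.70, sigma=0.10)

# P(1.60 < X <= 1.75) as a difference of two CDF values
p_between = X.cdf(1.75) - X.cdf(1.60)

print(round(p_between, 4))
```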

Bernoulli (Binary) Random Variable

Y = 1 if success is observed, and Y = 0 if not.

P(Y = y) = p^y (1-p)^(1-y) for y = 0 or 1, where p is the probability of success.

Binomial Distribution

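P(X = x) = C(n, x) p^x (1-p)^(n-x) for x = 0, 1, ..., n, where X counts the successes in n independent Bernoulli trials, each with success probability p, and C(n, x) = n!/(x!(n-x)!).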

Expected Value

E(X) of a random variable X is the long-run average value expected over infinitely many repeated draws. For a binomial random variable, E(X) = np.

Variance and Standard Deviation

Variance: V(X) = np(1-p)

Standard Deviation: SD(X) = sqrt(np(1-p))

Both measure the expected variability of a binomial random variable around its mean.
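The binomial mean and variance formulas can be verified numerically. The sketch below uses the standard binomial probability mass function, C(n, x) p^x (1-p)^(n-x), with illustrative values n = 10 and p = 0.3 chosen for this example; binom_pmf is a helper name introduced here.

```python
from math import comb, sqrt

def binom_pmf(x, n, p):
    """P(X = x) for a binomial random variable with n trials and success probability p."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

n, p = 10, 0.3   # illustrative values, not from the notes
pmf = [binom_pmf(x, n, p) for x in range(n + 1)]

# Mean and variance computed directly from the distribution
mean = sum(x * px for x, px in enumerate(pmf))
variance = sum((x - mean) ** 2 * px for x, px in enumerate(pmf))
sd = sqrt(variance)

print(mean, variance, sd)   # compare with np, np(1-p), sqrt(np(1-p))
```

Setting n = 1 reduces the same function to the Bernoulli case.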

Normal Distribution

PDF of Normal Distribution: f(x) = (1/(σ sqrt(2π))) exp(-(x-μ)^2/(2σ^2)) for -∞ < x < ∞

Standard Normal Distribution: μ = 0 and σ = 1: Z ~ N(0,1)

Standardization: Z = (X-μ)/σ

Inverse Standardization: X = σZ+μ

Standard Error (of the sample mean): (Standard Deviation)/sqrt(n), where n is the sample size.

Empirical Rule

  • 68%: Data falls within 1 standard deviation of the mean.
  • 95%: Data falls within 2 standard deviations of the mean.
  • 99.7%: Data falls within 3 standard deviations of the mean.
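Standardization and the empirical rule can be checked with the standard normal CDF. In the sketch below, the values μ = 100, σ = 15, and x = 130 are assumptions chosen for illustration.

```python
from statistics import NormalDist

Z = NormalDist(0, 1)           # standard normal, Z ~ N(0, 1)

mu, sigma = 100.0, 15.0        # hypothetical population mean and SD
x = 130.0

z = (x - mu) / sigma           # standardization
x_back = sigma * z + mu        # inverse standardization recovers x

# Probability of falling within k standard deviations of the mean
within = {k: Z.cdf(k) - Z.cdf(-k) for k in (1, 2, 3)}

print(z, x_back)
print({k: round(v, 4) for k, v in within.items()})
```

The exact normal probabilities are close to, but not exactly, the rounded 68%, 95%, and 99.7% figures.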

Sampling Distributions

Mean: The average value of the statistic.

Standard Deviation: How spread out the statistic is from its mean.

Influencing factors: Shape of the distribution, sample size, statistic being computed.

Central Limit Theorem (CLT)

The distribution of sample means approximates a normal distribution as the sample size gets larger, regardless of the population’s distribution. The approximation improves with larger sample sizes and a parent distribution closer to normal.

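Mean of the sample mean: μ (the population mean)

Variance of the sample mean: σ^2/n, where σ^2 is the population variance and n is the sample size.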
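A small simulation can illustrate the CLT. The sketch below assumes a skewed parent distribution (exponential with mean 1 and variance 1), a sample size of 50, and 5000 repetitions; all of these settings are choices made for the example.

```python
import random
from statistics import fmean, pvariance

random.seed(0)     # fixed seed so the run is repeatable

lam = 1.0          # exponential rate: population mean 1, population variance 1
n = 50             # size of each sample
reps = 5000        # number of samples drawn

# Draw many samples and record the mean of each one
sample_means = [
    fmean([random.expovariate(lam) for _ in range(n)])
    for _ in range(reps)
]

mean_of_means = fmean(sample_means)      # should be near mu = 1
var_of_means = pvariance(sample_means)   # should be near sigma^2 / n = 0.02

print(round(mean_of_means, 3), round(var_of_means, 4))
```

A histogram of sample_means would look roughly bell-shaped even though the exponential parent distribution is strongly skewed.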

Confidence Intervals

Inference: Drawing conclusions about a population using data from a sample.

Example: With 95% confidence, the interval (9.4, 15.0) covers the true mean PSA value for men awaiting radical prostatectomy.
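One common way to build such an interval is estimate ± critical value × standard error. The sketch below is a minimal version that assumes a normal critical value (a t critical value would give a slightly wider interval for small samples); the data values and the function name mean_ci are made up for illustration and are not the PSA data.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

def mean_ci(data, conf=0.95):
    """Approximate confidence interval for a mean using a normal critical value."""
    n = len(data)
    se = stdev(data) / sqrt(n)                   # standard error of the sample mean
    z = NormalDist().inv_cdf(0.5 + conf / 2)     # about 1.96 when conf = 0.95
    m = mean(data)
    return m - z * se, m + z * se

# Made-up measurements for illustration only
values = [8.2, 11.5, 9.9, 14.1, 12.3, 10.8, 13.6, 9.4, 12.9, 11.0]

low, high = mean_ci(values)
print(round(low, 2), round(high, 2))
```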

Hypothesis Testing

Procedure

  1. State question and define parameters: Let μ0 and μy be the mean log-PSA for men 65 or older and for men under 65, respectively, awaiting radical prostatectomy.
  2. State Hypothesis: H0 (null hypothesis): μ0 = μy vs HA (alternate hypothesis): μ0 ≠ μy
  3. State the Type I error rate (significance level): α = 0.05
  4. State test statistic (T statistic): T = (difference in sample means) / (standard error of the difference)
  5. Compute test statistic and p-value
  6. Decision: If p-value > α, do not reject H0. Else, reject H0.
  7. Conclusion: Interpret the results in the context of the problem.
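The procedure above can be sketched in code. This version assumes an unequal-variance (Welch) form of the T statistic and, to stay within the standard library, approximates the two-sided p-value with the standard normal distribution rather than a t distribution, which is crude for small samples. The data are made up, and the function names are introduced here for illustration.

```python
from math import sqrt
from statistics import NormalDist, mean, variance

def welch_t(a, b):
    """T statistic: difference in sample means over its standard error."""
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se

def two_sided_p_normal(t):
    """Two-sided p-value using a normal approximation (an assumption of this sketch)."""
    return 2 * (1 - NormalDist().cdf(abs(t)))

# Made-up log-PSA values for two age groups, for illustration only
older = [2.9, 3.1, 2.4, 3.6, 2.8, 3.3, 3.0, 2.7]
younger = [2.5, 2.2, 2.9, 2.6, 2.1, 2.8, 2.4, 2.3]

alpha = 0.05
t_stat = welch_t(older, younger)
p_value = two_sided_p_normal(t_stat)

# Decision rule from step 6
decision = "do not reject H0" if p_value > alpha else "reject H0"
print(round(t_stat, 3), round(p_value, 4), decision)
```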

Contingency Tables

Proportion: A part of the whole (e.g., 15 out of 1453).

Rate: A proportion rescaled for better interpretation (e.g., 5 out of 100).

binom.confint() (R package binom): Computes confidence intervals for a binomial proportion, including the Wilson interval (methods = "wilson").
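For reference, the Wilson score interval can also be computed by hand. The sketch below is a Python version of the standard Wilson formula, not the R function itself; wilson_interval is a name introduced here, and the counts reuse the 15 out of 1453 example.

```python
from math import sqrt
from statistics import NormalDist

def wilson_interval(x, n, conf=0.95):
    """Wilson score confidence interval for a binomial proportion x/n."""
    z = NormalDist().inv_cdf(0.5 + conf / 2)
    p_hat = x / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half = (z / denom) * sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2))
    return center - half, center + half

low, high = wilson_interval(15, 1453)
print(round(low, 4), round(high, 4))
```

Unlike the simple estimate ± z × SE interval, the Wilson interval is not centred on the sample proportion and stays within [0, 1] even when x is 0 or n.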

In a two-way table, rows represent the explanatory variable (X) and columns represent the response variable (Y).

Diagnostic Testing

Truth: Unknown and possibly unverifiable.

Test Result: The binary outcome of a test (positive or negative).

Sensitivity: Probability of a positive test for a person who has the condition (true positive rate).

Specificity: Probability of a negative test for a person who does not have the condition (true negative rate).

False Positive Rate = 1 - Specificity; False Negative Rate = 1 - Sensitivity

Prevalence: Proportion of the population with the condition.

High Sensitivity: Detects most cases.

Low Specificity: May produce false alarms.

Perfect Test: Sensitivity = 1, Specificity = 1.

Good Threshold: A cutoff beyond which sensitivity increases only slowly as (1-Specificity), the false positive rate, increases.
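These quantities follow directly from the four cells of a table of truth against test result. The counts in the sketch below are hypothetical and assume the true status of every person is known.

```python
# Hypothetical counts, for illustration only
tp = 90    # condition present, test positive
fn = 10    # condition present, test negative
tn = 80    # condition absent, test negative
fp = 20    # condition absent, test positive

sensitivity = tp / (tp + fn)            # true positive rate
specificity = tn / (tn + fp)            # true negative rate
false_positive_rate = fp / (fp + tn)    # equals 1 - specificity
prevalence = (tp + fn) / (tp + fn + tn + fp)

print(sensitivity, specificity, false_positive_rate, prevalence)
```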

Relative Risk (RR): Ratio of probabilities of an event in two groups.

RR = p1/p2, where p1 = n11/(n11 + n12), p2 = n21/(n21 + n22).

Odds Ratio: Ratio of odds of success in two groups.

Odds of Success: P(success)/P(failure) = p/(1-p)
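Relative risk and the odds ratio can be compared on the same two-way table. The cell counts below are hypothetical, with rows as groups and columns as success and failure, matching the n11, n12, n21, n22 notation.

```python
# Hypothetical 2x2 table: rows are groups, columns are (success, failure)
n11, n12 = 15, 85
n21, n22 = 5, 95

p1 = n11 / (n11 + n12)     # probability of success in group 1
p2 = n21 / (n21 + n22)     # probability of success in group 2

relative_risk = p1 / p2

odds1 = p1 / (1 - p1)      # odds of success in group 1
odds2 = p2 / (1 - p2)      # odds of success in group 2
odds_ratio = odds1 / odds2 # algebraically equal to (n11 * n22) / (n12 * n21)

print(round(relative_risk, 3), round(odds_ratio, 3))
```

When the event is rare in both groups, the odds ratio is close to the relative risk; here the event is not very rare, so the two differ noticeably.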

Pearson Chi-Squared Test

(aa = chisq.test(x = c.table, correct = FALSE))

  1. Assume the null hypothesis H0 (independence) is true.
  2. Estimate expected cell counts under H0.
  3. Compare observed and expected cell counts to assess evidence against H0.
  4. Interpret p-value: If p-value < α, reject H0 (there is evidence that the variables are associated). If p-value > α, do not reject H0 (there is insufficient evidence of an association).
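The steps above can be sketched for a 2x2 table without a continuity correction. This is a Python illustration rather than the R call; it assumes 1 degree of freedom, for which the chi-squared tail probability can be written with the complementary error function. The table values and the function name are hypothetical.

```python
from math import erfc, sqrt

def pearson_chi2_2x2(table):
    """Pearson chi-squared statistic and p-value for a 2x2 table (no continuity correction)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    # Expected counts under independence: (row total * column total) / n
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    stat = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )
    # With 1 degree of freedom, P(chi-squared > stat) = erfc(sqrt(stat / 2))
    p_value = erfc(sqrt(stat / 2))
    return stat, p_value, expected

c_table = [[15, 85], [5, 95]]    # hypothetical counts
stat, p_value, expected = pearson_chi2_2x2(c_table)
print(round(stat, 3), round(p_value, 4), expected)
```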

Fisher’s Exact Test

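fisher.test(): An alternative test for small samples, such as tables with low expected cell counts. It computes an exact p-value by considering every table that shares the observed row and column totals.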
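A minimal sketch of the idea for a 2x2 table follows. It assumes the two-sided p-value is the total probability of all tables, with the same margins, that are no more likely than the observed one; this is a common convention, stated here as an assumption rather than a description of the R internals. The table values and the function name are hypothetical.

```python
from fractions import Fraction
from math import comb

def fisher_exact_two_sided(table):
    """Two-sided exact p-value for a 2x2 table with fixed row and column totals."""
    (a, b), (c, d) = table
    r1, r2, c1 = a + b, c + d, a + c
    total = comb(r1 + r2, c1)

    def prob(k):
        # Hypergeometric probability that the top-left cell equals k
        return Fraction(comb(r1, k) * comb(r2, c1 - k), total)

    k_min, k_max = max(0, c1 - r2), min(r1, c1)
    p_obs = prob(a)
    # Sum over every table that is no more likely than the observed one
    return float(sum(prob(k) for k in range(k_min, k_max + 1) if prob(k) <= p_obs))

p_value = fisher_exact_two_sided([[3, 1], [1, 3]])   # hypothetical small table
print(round(p_value, 4))
```

Exact fractions are used so that ties between equally likely tables are compared without floating-point error.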