Statistics Cheat Sheet: Key Concepts and Formulas

Topic 2: Data Analysis and Descriptive Statistics

Types of Variables

Variables can be categorized as either categorical (non-numerical values, e.g., hair color) or numerical. Numerical variables can be further classified as discrete (countable values, usually integers, e.g., goals scored) or continuous (any value within an interval, e.g., height or weight).

Data Classification

  • Qualitative Data:
    • Nominal: Categories with no inherent order (e.g., hair color).
    • Ordinal: Ordered categories (e.g., education level).
  • Quantitative Data:
    • Interval: Numerical data where the zero point is arbitrary (e.g., temperature in Celsius).
    • Ratio: Numerical data with a meaningful zero point (e.g., weight).

Frequency Distributions and Graphs

  • Frequency Distribution Tables: Summarize the frequency of each value or category of a variable, including absolute frequency (ni), relative frequency (fi), cumulative absolute frequency (Ni), and cumulative relative frequency (Fi).
  • Graphs for Categorical Variables:
    • Bar Chart: Represents categories with bars.
    • Pie Chart: Displays categories as slices of a circle.
  • Graphs for Continuous Variables:
    • Histogram: Shows the distribution of continuous data using bars.
    • Line Chart: Plots data points connected by lines over time or another continuous variable.
  • Graphs for Two or More Variables:
    • Scatter Plot: Displays the relationship between two numerical variables.
    • Contingency Table: Analyzes the association between categorical variables.
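
As an illustration of the frequency distribution table described above, the following sketch builds the four columns (ni, fi, Ni, Fi) for a small list of values. The function name `frequency_table` and the goals data are hypothetical and chosen only for this example.

```python
from collections import Counter
from itertools import accumulate


def frequency_table(values):
    """Return rows (value, ni, fi, Ni, Fi), sorted by value."""
    counts = Counter(values)
    n = len(values)
    keys = sorted(counts)
    ni = [counts[k] for k in keys]   # absolute frequencies
    fi = [c / n for c in ni]         # relative frequencies
    Ni = list(accumulate(ni))        # cumulative absolute frequencies
    Fi = [c / n for c in Ni]         # cumulative relative frequencies
    return list(zip(keys, ni, fi, Ni, Fi))


# Hypothetical data: goals scored in 10 matches.
goals = [0, 1, 1, 2, 0, 3, 1, 2, 1, 0]
table = frequency_table(goals)
for row in table:
    print(row)
```

The last row always ends with Ni equal to the number of observations and Fi equal to 1, which is a quick check that the table is complete.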

Measures of Central Tendency

  • Mean (X̄): The average of all values in a dataset.
  • Median (Me): The middle value when data is ordered from least to greatest.
  • Mode (Mo): The value that occurs most frequently.

Measures of Dispersion

  • Range (Rg): The difference between the highest and lowest values.
  • Variance (S^2x): Measures the spread of data around the mean.
  • Standard Deviation (Sx): The square root of the variance.
  • Coefficient of Variation: A relative measure of dispersion, expressed as a percentage of the mean.
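
The measures of central tendency and dispersion can be computed with Python's standard `statistics` module, as in the sketch below. The data are made up, and dividing by n in the variance (`pvariance`) is an assumption, since the notes do not say whether S^2x divides by n or by n-1.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical dataset

mean = statistics.mean(data)            # mean
median = statistics.median(data)        # Me: middle value of the ordered data
mode = statistics.mode(data)            # Mo: most frequent value
value_range = max(data) - min(data)     # Rg: highest minus lowest
variance = statistics.pvariance(data)   # S^2x, dividing by n (assumption)
std_dev = statistics.pstdev(data)       # Sx: square root of the variance
cv = std_dev / abs(mean) * 100          # coefficient of variation, in % of the mean

print(mean, median, mode, value_range, variance, std_dev, cv)
```

If the notes intend the n-1 divisor instead, `statistics.variance` and `statistics.stdev` are the corresponding functions.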

Measures of Association

  • Covariance (Sxy): Measures the direction of the linear relationship between two variables; its magnitude depends on the units of measurement.
  • Correlation Coefficient (r): A standardized measure of the linear relationship between two variables, ranging from -1 (perfect negative correlation) to 1 (perfect positive correlation).
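
A minimal sketch of both measures, written out by hand so the formulas are visible. It assumes the divide-by-n version of the covariance, and the hours and scores data are invented for illustration.

```python
import math


def covariance(xs, ys):
    """Sxy: average product of deviations from the means (divides by n)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n


def correlation(xs, ys):
    """r = Sxy / (Sx * Sy), always between -1 and 1."""
    sx = math.sqrt(covariance(xs, xs))  # covariance of x with itself is its variance
    sy = math.sqrt(covariance(ys, ys))
    return covariance(xs, ys) / (sx * sy)


# Hypothetical data: hours studied and exam scores.
hours = [1, 2, 3, 4, 5]
scores = [52, 55, 61, 64, 68]

s_xy = covariance(hours, scores)
r = correlation(hours, scores)
print(s_xy, r)
```

Rescaling one variable (for example, minutes instead of hours) changes the covariance but leaves r unchanged, which is why r is the standardized measure.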

Topic 3: Probability

Basic Concepts

  • Sample Space (E): The set of all possible outcomes of a random experiment.
  • Events: Subsets of the sample space.
    • Incompatible Events: Events that cannot occur simultaneously.
    • Exhaustive Events: Events that cover all possible outcomes.
    • Basic/Complementary Events: Events that are both incompatible and exhaustive.

Approaches to Measuring Probability

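  1. Classical Approach (Laplace): P(s) = number of favorable outcomes / number of possible outcomes, assuming all outcomes are equally likely.
  2. Frequentist Approach: P(s) = lim (n→∞) ns/n, the limit of the relative frequency of the event s as the number of trials n increases (ns is the number of trials in which s occurs).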
  3. Subjective Approach: Based on personal belief or judgment.
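
The first two approaches can be compared numerically. The sketch below computes the classical probability of rolling an even number on a fair die and then estimates the same probability as a relative frequency over many simulated rolls. The function names and the fixed seed are assumptions made for the example.

```python
import random


def classical_probability(event, sample_space):
    """Favorable outcomes / possible outcomes (equally likely outcomes assumed)."""
    favorable = [outcome for outcome in sample_space if event(outcome)]
    return len(favorable) / len(sample_space)


def relative_frequency(event, sample_space, trials, seed=0):
    """Share of simulated trials in which the event occurs."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    hits = sum(1 for _ in range(trials) if event(rng.choice(sample_space)))
    return hits / trials


def is_even(outcome):
    return outcome % 2 == 0


die = [1, 2, 3, 4, 5, 6]
p_classical = classical_probability(is_even, die)
p_frequentist = relative_frequency(is_even, die, trials=100_000)
print(p_classical, p_frequentist)
```

The relative frequency will not equal the classical value exactly, but it should settle close to it as the number of trials grows.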

Probability Rules and Formulas

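Lowercase letters (a, b) denote events and the matching uppercase letters (A, B) denote their complements; ∅ is the impossible event and E is the sample space.

  • Bounds: 0≤P(s)≤1, P(∅)=0, P(E)=1
  • Complement: P(A)=1-P(a), so P(a)+P(A)=1
  • Union: P(a∪b)=P(a)+P(b)-P(a∩b)
  • Total probability: P(a)=P(a∩b)+P(a∩B)
  • Conditional probability: P(a|b)=P(a∩b)/P(b) and P(a|B)=P(a∩B)/P(B)
  • De Morgan's laws: P(A∪B)=1-P(a∩b) and P(A∩B)=1-P(a∪b)
  • Bayes' Theorem: P(H|E)=P(E|H)*P(H)/P(E), where H is a hypothesis and E is the observed evidence (not the sample space)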
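
A worked sketch of Bayes' theorem may help. It obtains the probability of the evidence from the total-probability rule and then applies the theorem. The function name and the three input probabilities are hypothetical values picked for illustration.

```python
def bayes_posterior(prior_h, p_e_given_h, p_e_given_not_h):
    """Probability of H given E, from the prior and the two likelihoods."""
    prior_not_h = 1 - prior_h  # complement rule
    # Total probability of the evidence: E with H plus E without H.
    p_e = p_e_given_h * prior_h + p_e_given_not_h * prior_not_h
    return p_e_given_h * prior_h / p_e


# Hypothetical numbers: H holds in 1% of cases; the evidence appears
# in 95% of cases where H holds and in 5% of cases where it does not.
posterior = bayes_posterior(prior_h=0.01, p_e_given_h=0.95, p_e_given_not_h=0.05)
print(round(posterior, 4))
```

With these inputs the posterior stays well below one half, because the hypothesis is rare to begin with even though the evidence is quite reliable.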

Topic 4: Random Variables and Probability Distributions

Random Variables

  • Discrete Random Variable: Takes on a finite or countable number of values.
  • Continuous Random Variable: Can take on any value within a given range.

Probability Function

P(X=x) gives the probability that a discrete random variable X takes on the value x.

Properties:

  • 0≤P(X=x)≤1
  • Σx P(X=x)=1 (the sum runs over all possible values x)

Cumulative Probability Function

F(xo)=P(X≤xo) gives the probability that a random variable X is less than or equal to a given value xo.

Properties:

  • 0≤F(xo)≤1
  • if B>A then F(B)≥F(A)
  • P(A<X≤B)=F(B)-F(A)
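
For a discrete variable, the cumulative probability function is the running sum of the probability function. The sketch below uses a hypothetical variable (number of heads in three fair coin tosses) and also checks the standard identity that the probability of an interval is a difference of two cumulative values.

```python
def cdf(pmf, x0):
    """F(x0) = P(X <= x0) for a discrete variable given as {value: probability}."""
    return sum(p for x, p in pmf.items() if x <= x0)


# Hypothetical variable: number of heads in 3 fair coin tosses.
heads = {0: 0.125, 1: 0.375, 2: 0.375, 3: 0.125}

f_1 = cdf(heads, 1)               # P(X <= 1)
f_3 = cdf(heads, 3)               # P(X <= 3), covers every value
p_between = f_3 - f_1             # P(1 < X <= 3) = F(3) - F(1)
print(f_1, f_3, p_between)
```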

Expected Value

E(X)=Mux=Σx x*P(X=x) represents the long-run average value of a discrete random variable X.

Properties:

  • If X=k (a constant), then E(k)=k
  • E(a+bX)=a+b*E(X), with a and b constants
  • For any two random variables X and Y: E(X+Y)=E(X)+E(Y)
  • For two independent random variables X and Y: E(X*Y)=E(X)*E(Y)
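
The definition and two of the properties can be checked on a fair die, as in this sketch. The helper name `expected_value` and the constants a and b are assumptions for the example.

```python
def expected_value(pmf):
    """E(X): sum of x * P(X=x) over all values of a discrete variable."""
    return sum(x * p for x, p in pmf.items())


die = {x: 1 / 6 for x in range(1, 7)}  # fair six-sided die
e_x = expected_value(die)              # approximately 3.5

# Linearity: E(a + bX) = a + b * E(X)
a, b = 2, 3
e_linear = expected_value({a + b * x: p for x, p in die.items()})

# Independence: for two independent dice, E(X*Y) = E(X) * E(Y).
# Under independence the joint probability is the product px * py.
e_product = sum(x * y * px * py
                for x, px in die.items()
                for y, py in die.items())

print(e_x, e_linear, e_product)
```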

Variance and Standard Deviation

V(X)=S^2x=Σx (x-Mux)^2*P(X=x)=E[(X-Mux)^2] measures the spread of a random variable around its mean. The standard deviation Sx is the square root of the variance.

Properties:

  • If X=k (a constant), then V(k)=0
  • V(a+bX)=b^2*V(X), with a and b constants
  • For two independent random variables X and Y: V(X±Y)=V(X)+V(Y)
  • For any two random variables X and Y: V(X±Y)=V(X)+V(Y)±2*Cov(X,Y)
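
The sketch below computes the variance of a fair die from the definition and checks two of the properties numerically. The helper names and the constants a and b are hypothetical.

```python
def expected_value(pmf):
    return sum(x * p for x, p in pmf.items())


def variance(pmf):
    """V(X): probability-weighted squared deviations from the mean."""
    mu = expected_value(pmf)
    return sum((x - mu) ** 2 * p for x, p in pmf.items())


die = {x: 1 / 6 for x in range(1, 7)}  # fair six-sided die
v_x = variance(die)                    # equals 35/12 up to rounding

# Scaling: V(a + bX) = b^2 * V(X); the shift a has no effect.
a, b = 2, 3
v_linear = variance({a + b * x: p for x, p in die.items()})

# Independence: distribution of the sum of two independent dice,
# built by adding px * py to the probability of each total x + y.
sum_pmf = {}
for x, px in die.items():
    for y, py in die.items():
        sum_pmf[x + y] = sum_pmf.get(x + y, 0) + px * py
v_sum = variance(sum_pmf)              # should match V(X) + V(Y)

print(v_x, v_linear, v_sum)
```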

Discrete Random Variable Models

  • Binomial (X~Bin(n,p)): Models the number of successes in n independent trials, each with probability p of success.
    • P(X=x)=C(n,x)*p^x*(1-p)^(n-x), for x=0,1,...,n, where C(n,x)=n!/(x!(n-x)!)
    • E(X)=Mux=n*p
    • V(X)=S^2x=n*p*(1-p)
  • Poisson (X~Poisson(L)): Models the number of events occurring in a fixed interval of time or space, given an average rate of occurrence L. When it is used to approximate a binomial, L=n*p.
    • P(X=x)=e^(-L)*L^x/x!, for x=0,1,2,...
    • E(X)=Mux=L
    • V(X)=S^2x=L
  • Bernoulli Trial (X~Bin(1,p)): A special case of the binomial distribution with only one trial.
    • P(X=x)=p^x*(1-p)^(1-x), for x=0 or 1
    • E(X)=p
    • V(X)=p(1-p)

*When n is very large and p is very small, the Poisson distribution can be used to approximate the binomial distribution.
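
The probability functions of these models, and the Poisson approximation to the binomial, can be sketched with the standard `math` module. The values n=1000 and p=0.002 are hypothetical, chosen so that n is large and p is small.

```python
import math


def binomial_pmf(x, n, p):
    """P(X=x) for X ~ Bin(n, p)."""
    return math.comb(n, x) * p ** x * (1 - p) ** (n - x)


def poisson_pmf(x, rate):
    """P(X=x) for a Poisson variable with average rate L = rate."""
    return math.exp(-rate) * rate ** x / math.factorial(x)


# Hypothetical case with large n and small p, so L = n * p = 2.
n, p = 1000, 0.002
rate = n * p

for x in range(5):
    print(x, round(binomial_pmf(x, n, p), 5), round(poisson_pmf(x, rate), 5))
```

Setting n=1 in `binomial_pmf` gives the Bernoulli case, where the two possible values have probabilities 1-p and p.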

Topic 5: Statistical Inference

Classic Statistics

  • Descriptive Statistics: Summarizes and describes data using graphical and numerical methods.
  • Inferential Statistics: Uses sample data to make inferences about a larger population, with a degree of uncertainty or error.

Estimation and Hypothesis Testing

  • Estimation: Involves estimating population parameters (e.g., mean weight of Spanish women) using sample data.
  • Hypothesis Testing: Tests claims about population parameters (e.g., testing whether the population mean weight is 60 kg).

*Inference is the process of drawing conclusions about a population based on sample results.

Population vs. Sample

  • Population (N): The entire set of individuals or items of interest.
  • Population Variables (E,n,o): Random features of interest in the population.
  • Parameter: A numerical characteristic of a population (e.g., population mean (Mu), population variance (S^2)).
  • Sample: A subset of the population used to collect data.
  • Statistic: A numerical characteristic of a sample (e.g., sample mean (X̄), sample variance (S^2x)).

Properties of Simple Random Sampling

  • The mean of the sampling distribution of the sample mean equals the population mean: Mu(X̄)=Mu
  • The standard deviation of the sampling distribution of the sample mean decreases as the sample size n increases: S(X̄)=S/√n

Central Limit Theorem

If the population is normally distributed, the sampling distribution of the sample mean (X̄) is normal with mean Mu and standard deviation S/√n. If it is not, the same result holds approximately when the sample size n is large enough, and the approximation improves as n→∞.
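
A small simulation can illustrate both sampling properties. The sketch draws many samples of size n from a uniform population on (0, 1), which is not normal, and compares the mean and standard deviation of the sample means with Mu and S/√n. The population, the sample size, the number of replicates, and the seed are all assumptions chosen for the example.

```python
import math
import random
import statistics


def simulate_sample_means(n, replicates, seed=0):
    """Sample means of `replicates` samples of size n from Uniform(0, 1)."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    return [statistics.fmean([rng.random() for _ in range(n)])
            for _ in range(replicates)]


mu = 0.5                    # population mean of Uniform(0, 1)
sigma = math.sqrt(1 / 12)   # population standard deviation of Uniform(0, 1)
n = 30

means = simulate_sample_means(n, replicates=5000)
observed_mean = statistics.fmean(means)
observed_sd = statistics.pstdev(means)
expected_sd = sigma / math.sqrt(n)

print(observed_mean, observed_sd, expected_sd)
```

Plotting a histogram of `means` would show the bell shape predicted by the theorem, even though individual draws are uniformly distributed.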

Sampling Distribution of p^

For dichotomous data (coded 0 or 1), the sampling distribution of the sample proportion (p^) is approximately normal, with mean p and standard deviation √(p(1-p)/n), provided the sample size n is large enough.
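
The standard deviation of the sample proportion is easy to evaluate directly. This sketch uses a hypothetical function name and a hypothetical population proportion of 0.3 to show how the spread shrinks as n grows.

```python
import math


def proportion_std_dev(p, n):
    """Standard deviation of the sample proportion: sqrt(p * (1 - p) / n)."""
    return math.sqrt(p * (1 - p) / n)


p = 0.3  # hypothetical population proportion
sd_100 = proportion_std_dev(p, 100)
sd_400 = proportion_std_dev(p, 400)
print(sd_100, sd_400)
```

Quadrupling the sample size halves the standard deviation, because n enters through its square root.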

Topic 6: Estimation

Estimators and Estimates

  • Estimator (Ô): A random variable that depends on sample information and provides an approximation to an unknown population parameter.
  • Estimate (Ôo): A specific value of the estimator obtained from an observed sample.

Confidence Interval

A range of values that is likely to contain the true population parameter with a certain level of confidence (Y%).

Confidence interval at the Y% level (Y = confidence level): Mu ∈ (X̄-Za/2*S/√n, X̄+Za/2*S/√n)

Confidence Factor (Za/2)

The value from the standard normal distribution that leaves a probability of a/2 to its right, where Y=1-a (with Y written as a proportion rather than a percentage).

Margin of Error (ME)

ME=Za/2*S/√n
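
The confidence factor, margin of error, and interval can be computed together using `statistics.NormalDist` from the standard library. This is a sketch under the assumption that the population standard deviation S is known; the function names and the numbers (sample mean 60, S=10, n=100) are hypothetical.

```python
import math
from statistics import NormalDist


def confidence_factor(confidence):
    """Standard normal value leaving a/2 to its right, where a = 1 - confidence."""
    a = 1 - confidence
    return NormalDist().inv_cdf(1 - a / 2)


def mean_confidence_interval(sample_mean, s, n, confidence=0.95):
    """Return (lower, upper, margin_of_error) for the population mean."""
    margin = confidence_factor(confidence) * s / math.sqrt(n)
    return sample_mean - margin, sample_mean + margin, margin


# Hypothetical sample: mean 60, known S = 10, n = 100, 95% confidence.
lower, upper, margin = mean_confidence_interval(60, 10, 100, confidence=0.95)
print(round(lower, 2), round(upper, 2), round(margin, 2))
```

Raising the confidence level widens the interval, while raising n narrows it, since the margin of error is proportional to 1/√n.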

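Approximate probability that a standard normal variable falls within common intervals:

  • (-1,1) → about 68%
  • (-2,2) → about 95% (exactly 95% corresponds to Za/2≈1.96)
  • (-3,3) → about 99.7% (exactly 99% corresponds to Za/2≈2.576)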
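
These percentages can be checked against the standard normal distribution. The sketch below computes the probability inside (-k, k) for k = 1, 2, 3 with `statistics.NormalDist`; the helper name `coverage` is hypothetical.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1


def coverage(k):
    """Probability that a standard normal variable lies in (-k, k)."""
    return std_normal.cdf(k) - std_normal.cdf(-k)


for k in (1, 2, 3):
    print(k, round(coverage(k), 4))

# Going the other way: the factor for an exact confidence level.
z_95 = std_normal.inv_cdf(0.975)  # leaves 2.5% in each tail
z_99 = std_normal.inv_cdf(0.995)  # leaves 0.5% in each tail
print(round(z_95, 3), round(z_99, 3))
```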