Pearson Correlation and Linear Regression Formulas

Correlation Definition

Correlation measures the strength and direction of the linear relationship between two variables.

Warning: Correlation does NOT imply causation.


Types of Correlation

Type                  Description                               Graph Trend
Positive Correlation  Variables increase/decrease together      Upward slope
Negative Correlation  One increases while the other decreases   Downward slope
Zero Correlation      No predictable relationship               Random scatter

Strength of Correlation

Correlation Coefficient (r)  Interpretation
0.00                         No correlation
±0.01 – ±0.20                Very low
±0.21 – ±0.40                Slight
±0.41 – ±0.70                Moderate
±0.71 – ±0.90                High
±0.91 – ±0.99                Very high
±1.00                        Perfect correlation
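
The bands in the table above can be expressed as a small lookup. This is a minimal sketch: the function name `correlation_strength` is hypothetical, and values that fall between two printed bands (for example 0.205) are assigned to the higher band, which is an assumption the table itself does not settle.

```python
def correlation_strength(r: float) -> str:
    """Map a correlation coefficient to the verbal label used in the table."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie between -1 and 1")
    size = abs(r)  # the labels depend only on magnitude, not on sign
    if size == 0.0:
        return "No correlation"
    if size <= 0.20:
        return "Very low"
    if size <= 0.40:
        return "Slight"
    if size <= 0.70:
        return "Moderate"
    if size <= 0.90:
        return "High"
    if size < 1.0:
        return "Very high"
    return "Perfect correlation"


print(correlation_strength(0.92))   # Very high
print(correlation_strength(-0.35))  # Slight
```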

Pearson Correlation Coefficient Formula

r=\frac{n\sum xy-(\sum x)(\sum y)}{\sqrt{\left[n\sum x^2-(\sum x)^2\right]\left[n\sum y^2-(\sum y)^2\right]}}

Formula Variables

  • r = correlation coefficient
  • n = number of observations
  • x, y = paired values of the two variables
  • ∑ = sum over all n pairs
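
When the raw pairs are available rather than the sums, the formula can be applied directly. The sketch below uses only the standard library; the function name `pearson_r` and the toy data (`hours`, `marks`) are illustrative assumptions and do not come from this sheet.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson's r via the raw-sums formula shown above."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x**2) * (n * sum_y2 - sum_y**2))
    if denominator == 0:
        raise ValueError("r is undefined when either variable is constant")
    return numerator / denominator


# Hypothetical toy data, not taken from the text
hours = [1, 2, 3, 4, 5]
marks = [2, 4, 5, 4, 5]
print(round(pearson_r(hours, marks), 3))  # 0.775
```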

Interpreting the Correlation Coefficient (r)

Value of r  Meaning
r > 0       Positive relationship
r < 0       Negative relationship
r = 0       No linear relationship
r → ±1      Strong relationship

Example: Math & English Scores

Given data (x = Math score, y = English score):

Quantity  Value
∑x        48
∑y        50
∑x²       296
∑y²       310
∑xy       298
n         10

Solution and Interpretation

r=\frac{10(298)-48(50)}{\sqrt{(10(296)-48^2)(10(310)-50^2)}}=\frac{580}{\sqrt{(656)(600)}}\approx0.92

Interpretation: There is a very high positive correlation between Math and English scores.
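
The arithmetic in this example can be checked step by step from the given sums. The variable names below are illustrative; the numbers are the ones listed under "Given data".

```python
from math import sqrt

# Summary statistics from the Math & English example
n, sum_x, sum_y = 10, 48, 50
sum_x2, sum_y2, sum_xy = 296, 310, 298

numerator = n * sum_xy - sum_x * sum_y  # 2980 - 2400 = 580
x_term = n * sum_x2 - sum_x**2          # 2960 - 2304 = 656
y_term = n * sum_y2 - sum_y**2          # 3100 - 2500 = 600

r = numerator / sqrt(x_term * y_term)
print(round(r, 2))  # 0.92
```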


Coefficient of Determination (r²)

Formula

r^2=(0.92)^2=0.8464\approx0.85

Interpretation

  • About 85% of the variation in one variable is accounted for by its linear relationship with the other.
  • The remaining 15% is attributable to other factors or random variation.
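
As a cross-check, r² can also be computed directly from the sums without rounding r first. In this sketch the unrounded value comes out near 0.855 rather than 0.8464, and both round to 0.85, so the interpretation is unchanged.

```python
# Same sums as the Math & English example
n, sum_x, sum_y = 10, 48, 50
sum_x2, sum_y2, sum_xy = 296, 310, 298

numerator = n * sum_xy - sum_x * sum_y
# Squaring the r formula removes the square root in the denominator
r_squared = numerator**2 / ((n * sum_x2 - sum_x**2) * (n * sum_y2 - sum_y**2))
unexplained = 1 - r_squared

print(f"r^2 = {r_squared:.4f}, unexplained share = {unexplained:.4f}")
```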

Scatterplot Patterns and Meanings

Pattern                 Meaning
Tight upward cluster    Strong positive correlation
Tight downward cluster  Strong negative correlation
Random dots             No correlation

Linear Regression Analysis

Simple linear regression fits a straight line to paired data and uses it to predict a dependent variable (y) from an independent variable (x).

General Equation

\hat{y}=a+bx

Where:

  • ŷ = predicted value
  • a = y-intercept
  • b = slope

Slope Formula and Example

b=\frac{n\sum xy-(\sum x)(\sum y)}{n\sum x^2-(\sum x)^2}

Example Calculation

b=\frac{10(298)-48(50)}{10(296)-48^2}=\frac{580}{656}\approx0.88

Interpretation: Each 1-point increase in Math score is associated with a predicted increase of about 0.88 points in English score.


Y-Intercept Formula and Example

a=\frac{\sum y}{n}-b\frac{\sum x}{n}

Example Calculation

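a=\frac{50}{10}-\frac{580}{656}\left(\frac{48}{10}\right)\approx0.76

This uses the unrounded slope b = 580/656; substituting the rounded value 0.88 gives 0.776 instead.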


Regression Equation Result

\hat{y}=0.76+0.88x
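
The slope and intercept can be reproduced from the sums in a few lines. This sketch also shows how rounding affects the intercept: carrying the unrounded slope gives about 0.76, while rounding b to 0.88 first gives about 0.78. Variable names are illustrative.

```python
# Sums from the Math & English example
n, sum_x, sum_y = 10, 48, 50
sum_x2, sum_xy = 296, 298

b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x**2)  # 580 / 656
a = sum_y / n - b * sum_x / n  # intercept from the unrounded slope

# Intercept if the slope is rounded to two decimals before substituting
a_from_rounded_b = sum_y / n - round(b, 2) * sum_x / n

print(round(b, 2), round(a, 2), round(a_from_rounded_b, 2))  # 0.88 0.76 0.78
```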


Making Predictions with Regression

Example 1: If x = 95

\hat{y}=0.76+0.88(95)=84.36

Predicted English score = 84.36


Example 2: If x = 80

\hat{y}=0.76+0.88(80)=71.16

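Predicted English score = 71.16

Note: The sums in the example give mean scores of 4.8 (Math) and 5.0 (English), so x = 95 and x = 80 lie far outside the observed data. Predictions this far beyond the data range are extrapolations and illustrate the arithmetic only.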
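
Once a and b are known, a prediction is a single evaluation of the line. The helper name `predict_english` is hypothetical, and its defaults are the rounded coefficients 0.76 and 0.88 from the regression equation.

```python
def predict_english(math_score: float, a: float = 0.76, b: float = 0.88) -> float:
    """Evaluate y-hat = a + b * x for a given Math score."""
    return a + b * math_score


for x in (95, 80):
    print(x, round(predict_english(x), 2))
# 95 84.36
# 80 71.16
```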


Important Limitations and Notes

When Correlation May Fail

  • The relationship is non-linear
  • Outliers exist in the dataset
  • The data range is restricted
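
A small constructed case illustrates the first and third points. Here y is fully determined by x (y = x²), yet r is 0 over a symmetric range and about 0.96 when only x ≥ 0 is observed. The data are made up for illustration, and the helper uses the deviations-from-the-mean form of r, which is algebraically equivalent to the raw-sums formula.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson's r computed from deviations about the means."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x**2 for x in xs]  # perfectly related to x, but not linearly

r_full = pearson_r(xs, ys)                # whole symmetric range
r_right_half = pearson_r(xs[3:], ys[3:])  # only the points with x >= 0

print(round(r_full, 3), round(r_right_half, 3))  # 0.0 0.958
```

A value of r near zero therefore does not rule out a relationship, and the same underlying curve can yield very different r values depending on which part of the range is sampled.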

Assumptions of Pearson Correlation

Variables should be:

  • Continuous (interval or ratio scale)
  • Approximately normally distributed (this matters mainly for significance tests on r)
  • Linearly related

Real-World Applications

Field          Application
Education      Predict student performance
Business       Sales forecasting
Real Estate    Price prediction
Manufacturing  Quality control
Healthcare     Risk analysis

Quick Memory Tricks

Concept      Shortcut
Correlation  Measures relationship
Regression   Predicts values
r            Strength + direction
r²           Explained variation
Positive r   Variables move together
Negative r   Variables move opposite

Formula Summary Reference

Pearson Correlation

r=\frac{n\sum xy-(\sum x)(\sum y)}{\sqrt{\left[n\sum x^2-(\sum x)^2\right]\left[n\sum y^2-(\sum y)^2\right]}}

Coefficient of Determination

r^2

Regression Equation

\hat{y}=a+bx

Slope

b=\frac{n\sum xy-(\sum x)(\sum y)}{n\sum x^2-(\sum x)^2}

Y-Intercept

a=\frac{\sum y}{n}-b\frac{\sum x}{n}


Final Takeaways

Correlation tells us:

  • Whether variables are related
  • How strong the relationship is
  • The direction of the relationship

Linear Regression helps us:

  • Model relationships
  • Predict future values
  • Make data-driven decisions

Strong correlation = r close to ±1
Strong prediction power = high r²