# Regression

Regression is a well-known statistical technique to model the predictive relationship between several independent variables (IVs) and one dependent variable (DV). The objective is to find the best-fitting curve for a dependent variable in a multidimensional space, with each independent variable being a dimension. The curve could be a straight line, or it could be a nonlinear curve. The quality of fit of the curve to the data can be measured by a coefficient of correlation (*r*), which is the square root of the proportion of variance explained by the curve.

The key steps for regression are simple:

1. List all the variables available for making the model.

2. Establish a dependent variable of interest.

3. Examine the relationships between variables of interest, visually if possible.

4. Find a way to predict the dependent variable using the other variables.

**Caselet: Data-Driven Prediction Markets**

*Traditional pollsters still seem to be using methodologies that worked well a decade or two ago. Nate Silver belongs to a new breed of data-based political forecasters who are steeped in **big data** and advanced analytics. In the 2012 elections, he predicted that Obama would win the election with 291 electoral votes, compared to 247 for Mitt Romney, giving the President a 62-percent lead and re-election. He stunned the political forecasting world by correctly predicting the presidential winner in all 50 states, including all 9 swing states. He also correctly predicted the winner in 31 of the 33 U.S. Senate races.*

*Nate Silver brings a different view to the world of forecasting political elections, viewing it as a scientific discipline: state the hypothesis scientifically, gather all available information, analyze the data and extract insights using sophisticated models and algorithms, and finally, apply human judgment to interpret those insights. The results are likely to be much more grounded and successful. (Source: The Signal and the Noise: Why Most Predictions Fail but Some Don’t, by Nate Silver, 2012)*

Q1. *What is the impact of this story on traditional pollsters and commentators?*

**Correlations and Relationships**

Statistical relationships are about which elements of data hang together, and which ones hang separately. It is about categorizing variables that have a relationship with one another, and categorizing variables that are distinct and unrelated to other variables. It is about describing significant positive relationships and significant negative relationships.

The first and foremost measure of the strength of a relationship is co-relation (or correlation). The strength of a correlation is expressed on a normalized scale from 0 (zero) to 1. A strength of 1 indicates a perfect relationship, where the two variables are in perfect sync. A strength of 0 indicates that there is no linear relationship between the variables.

The relationship can be positive, or it can be inverse; that is, the variables may move together in the same direction or in opposite directions. The correlation coefficient, called *r*, therefore carries a sign as well as a magnitude, and ranges from -1 to +1. The magnitude of *r* gives the strength of the relationship, and the sign gives its direction. An *r* value of 0 signifies no linear relationship. An *r* value of 1 shows a perfect relationship in the same direction, and an *r* value of -1 shows a perfect relationship but moving in opposite directions.

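Given two numeric variables *x* and *y*, the coefficient of correlation *r* is computed by the following equation, where $\bar{x}$ is the mean of *x* and $\bar{y}$ is the mean of *y*.

$$r = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\sum (x - \bar{x})^2 \, \sum (y - \bar{y})^2}}$$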
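As a minimal sketch, the coefficient of correlation can be computed directly from its standard (Pearson) definition: the sum of the products of the deviations of *x* and *y* from their means, divided by the square root of the product of the two sums of squared deviations. The function name `pearson_r` and the sample values are illustrative assumptions, not data from this chapter.

```python
import math


def pearson_r(xs, ys):
    """Coefficient of correlation r between two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of the same length (at least 2 points)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Numerator: sum of products of deviations from the means.
    co_dev = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Denominator: square root of the product of the sums of squared deviations.
    dev_x = sum((x - mean_x) ** 2 for x in xs)
    dev_y = sum((y - mean_y) ** 2 for y in ys)
    if dev_x == 0 or dev_y == 0:
        raise ValueError("r is undefined when a variable never changes")
    return co_dev / math.sqrt(dev_x * dev_y)


# Hypothetical values, chosen only to show the two extremes of r.
x = [1, 2, 3, 4, 5]
r_same_direction = pearson_r(x, [2, 4, 6, 8, 10])      # y rises in step with x
r_opposite_direction = pearson_r(x, [10, 8, 6, 4, 2])  # y falls in step with x
print(r_same_direction, r_opposite_direction)
```

The two calls illustrate the ends of the range: a perfect relationship in the same direction and a perfect relationship in opposite directions.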

*Figure 6.1 Scatter plots showing types of relationships among two variables*

(Source: Groebner et al. 2013)

A scatter plot (or scatter diagram) is a simple exercise for plotting all data points between two variables on a two-dimensional graph. It provides a visual layout of where all the data points are placed in that two-dimensional space. The scatter plot can be useful for graphically intuiting the relationship between two variables.

Here is a picture that shows many possible patterns in scatter diagrams (Figure 6.1).

Chart (a) shows a very strong linear relationship between the variables *x* and *y*. That means the value of *y* increases at a steady rate as *x* increases. Chart (b) also shows a strong linear relationship between the variables *x* and *y*. Here it is an inverse relationship. That means the value of *y* decreases at a steady rate as *x* increases.

Chart (c) shows a curvilinear relationship. It is an inverse relationship, which means that the value of *y* decreases as *x* increases, though not at a constant rate. However, it seems a relatively well-defined relationship, like an arc of a circle, which can be represented by a simple quadratic equation (quadratic means the power of two, that is, using terms like *x*² and *y*²). Chart (d) shows a positive curvilinear relationship. However, it does not seem to resemble a regular shape, and thus would not be a strong relationship. Charts (e) and (f) show no apparent relationship. That suggests variables *x* and *y* are independent of each other.

Charts (a) and (b) are good candidates for a simple linear regression model (the terms regression model and regression equation can be used interchangeably). Chart (c) too could be modeled with a slightly more complex, quadratic regression equation. Chart (d) might require an even higher order polynomial regression equation to represent the data. Charts (e) and (f) show no relationship; thus, the variables cannot be modeled together, by regression or using any other modeling tool.

The regression model is described by the linear equation that follows. *y* is the dependent variable, that is, the variable being predicted. *x* is the independent variable, or the predictor variable. *β*₀ is the intercept, *β*₁ is the coefficient (slope) of *x*, and *ε* is the random error term. There could be many predictor variables (such as *x*₁, *x*₂, . . .) in a regression equation. However, there can be only one dependent variable (*y*) in the regression equation.

$$y = \beta_0 + \beta_1 x + \varepsilon$$
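The chapter does not say how the coefficients are estimated. A common choice is ordinary least squares, which picks the intercept and slope that minimize the sum of squared errors. The sketch below assumes that method; the function name `fit_simple_linear` and the data points are hypothetical.

```python
def fit_simple_linear(xs, ys):
    """Return (beta0, beta1) of the least-squares line y = beta0 + beta1 * x."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of the same length (at least 2 points)")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x must take at least two distinct values")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    beta1 = sxy / sxx                # slope
    beta0 = mean_y - beta1 * mean_x  # intercept: the line passes through the means
    return beta0, beta1


# Hypothetical observations that lie close to, but not exactly on, a line.
xs = [1, 2, 3, 4, 5]
ys = [3.1, 4.9, 7.2, 8.8, 11.0]
beta0, beta1 = fit_simple_linear(xs, ys)

# The residuals play the role of the error term in the regression equation.
residuals = [y - (beta0 + beta1 * x) for x, y in zip(xs, ys)]
print(f"y = {beta0:.2f} + {beta1:.2f} x")
```

With a least-squares fit that includes an intercept, the residuals always sum to zero, which is a quick sanity check on any implementation.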

A simple example of a regression equation would be to predict a house price from the size of the house. Here are sample house data:

*Figure 6.2 Scatter plot and regression equation between House price and house size*

With one predictor and one outcome variable, the data has two dimensions and can be plotted on a scatter diagram. A scatter plot with a best-fitting line looks like the graph that follows (Figure 6.2).

Visually, one can see a positive correlation between house price and size (sqft). However, the relationship is not perfect. Running a regression model between the two variables produces the following output (truncated).

It shows the coefficient of correlation is 0.891. *r*², the proportion of total variance explained by the equation, is 0.794, or 79 percent. That means the two variables are strongly, though not perfectly, positively correlated. Regression coefficients help create the following equation for predicting house prices.

**House Price ($) = 139.48 × Size (sqft) – 54,191**
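As a small sketch, the one-predictor equation can be written as a function, and the variance explained can be checked by squaring the coefficient of correlation. The function name and the 2,000 sqft query are illustrative assumptions; the coefficients and *r* come from the text.

```python
def predict_price_from_size(size_sqft):
    """House price ($) from the one-predictor regression equation in the text."""
    return 139.48 * size_sqft - 54_191


# The variance explained is the square of the coefficient of correlation.
r = 0.891
r_squared = r ** 2  # about 0.794, that is, 79 percent

# Hypothetical query: a 2,000 sqft house, chosen only for illustration.
example_price = predict_price_from_size(2000)
print(f"r^2 = {r_squared:.3f}, predicted price = ${example_price:,.0f}")
```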

This equation explains only 79 percent of the variance in house prices. If other predictor variables were made available, such as the number of rooms in the house, they might help improve the regression model.

The house data now looks like this:

While it is possible to make a three-dimensional scatter plot, one can alternatively examine the correlation matrix among the variables.

It shows that the house price has a strong correlation with number of rooms (0.944) as well. Thus, it is likely that adding this variable to the regression model will add to the strength of the model.
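A correlation matrix simply holds the coefficient of correlation for every pair of variables. The sketch below builds one with the standard library only. The house records are hypothetical stand-ins, not the chapter's data, so the printed values will not match the 0.944 quoted above.

```python
import math


def pearson_r(xs, ys):
    """Coefficient of correlation between two equal-length numeric sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    co_dev = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    dev_x = sum((x - mean_x) ** 2 for x in xs)
    dev_y = sum((y - mean_y) ** 2 for y in ys)
    return co_dev / math.sqrt(dev_x * dev_y)


def correlation_matrix(columns):
    """Map each pair of column names to its coefficient of correlation."""
    names = list(columns)
    return {a: {b: pearson_r(columns[a], columns[b]) for b in names} for a in names}


# Hypothetical house records (NOT the chapter's data set).
house_data = {
    "price": [200_000, 250_000, 300_000, 340_000, 410_000],
    "size": [1_500, 1_800, 2_100, 2_300, 2_900],
    "rooms": [2, 3, 3, 4, 5],
}

matrix = correlation_matrix(house_data)
for name, row in matrix.items():
    print(name.ljust(6), "  ".join(f"{row[other]:6.3f}" for other in matrix))
```

The matrix is symmetric with ones on the diagonal, so only the entries on one side of the diagonal need to be read.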

Running a regression model between these three variables produces the following output.

It shows the coefficient of correlation of this regression model is 0.984. *r*², the proportion of total variance explained by the equation, is 0.968, or 97 percent. That means the variables are positively and very strongly correlated. Adding a new relevant variable has helped improve the strength of the regression model.

Using the regression coefficients helps create the following equation for predicting house prices.

**House Price ($) = 65.6 × Size (sqft) + 23,613 × Rooms + 12,924**

This equation shows a 97-percent goodness-of-fit with the data, which is very good for business and economic data. There is always some random variation in naturally occurring business data, and it is not desirable to overfit the model to the data.

This predictive equation can be used for future transactions. For example, it can be used to calculate the price of a house with 2,000 sqft and 3 rooms.

**House Price ($) = 65.6 × 2,000 (sqft) + 23,613 × 3 + 12,924 = $214,963**
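The same calculation can be expressed as a short function, which makes it easy to price other houses with the same equation. The function name is a hypothetical choice; the coefficients are the ones given in the text.

```python
def predict_house_price(size_sqft, rooms):
    """House price ($) from the two-predictor regression equation in the text."""
    return 65.6 * size_sqft + 23_613 * rooms + 12_924


# The worked example from the text: 2,000 sqft and 3 rooms.
price = predict_house_price(2000, 3)
print(f"${price:,.0f}")
```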

The predicted values should be compared to the actual values to see how close the model is able to predict the actual value. As new data points become available, there are opportunities to fine-tune and improve the model.
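One simple way to make this comparison is to compute the prediction error for each house and then average the absolute errors. The sketch below assumes that measure (mean absolute error); the text does not prescribe one. The observed sale prices are hypothetical.

```python
def predict_house_price(size_sqft, rooms):
    """House price ($) from the two-predictor regression equation in the text."""
    return 65.6 * size_sqft + 23_613 * rooms + 12_924


# Hypothetical new transactions: (size in sqft, rooms, actual sale price in $).
new_sales = [
    (2000, 3, 220_000),
    (1500, 2, 155_000),
]

# Error = actual minus predicted; positive means the model under-predicted.
errors = [actual - predict_house_price(size, rooms) for size, rooms, actual in new_sales]
mean_absolute_error = sum(abs(e) for e in errors) / len(errors)

for (size, rooms, actual), error in zip(new_sales, errors):
    print(f"{size} sqft, {rooms} rooms: actual ${actual:,}, error ${error:,.0f}")
print(f"Mean absolute error: ${mean_absolute_error:,.1f}")
```

If the errors grow or lean consistently in one direction as new sales arrive, that is a signal to refit the model with the additional data.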

The relationship between the variables may also be curvilinear. For example, given past data on electricity consumption (kWh) and temperature (temp), the objective is to predict the electricity consumption from the temperature value. Here are a dozen past observations.

With one predictor and one outcome variable, the data can be plotted on a scatter diagram. A scatter plot with a best-fitting line looks like the graph that follows (Figure 6.3).

*Figure 6.3 Scatter plots showing regression between (a) Kwatts and temp, and (b) Kwatts and temp-sq*

It is visually clear that the first line does not fit the data well. The relationship between temperature and Kwatts follows a curvilinear model, where consumption hits bottom at a certain value of temperature. The regression model confirms the weak linear fit, since *r* is only 0.77 and *r*² is only 0.60. Thus, only 60 percent of the variance is explained.

The regression model can then be enhanced using a temp-sq variable in the equation. The second line is the relationship between kWh and temp-sq. Visually plotting the energy consumption shows a strong linear relationship with the quadratic temp-sq variable.
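Adding a squared term keeps the model linear in its coefficients, so the same least-squares idea still applies, now with three coefficients. The sketch below is one way to do this with the standard library, assuming ordinary least squares solved through the normal equations. All function names are hypothetical. The temperatures are centered on their mean before squaring, a choice made here only for numerical stability; it does not change the fitted curve. The five data points are illustrative values placed on a smooth U-shaped curve, not the dozen observations from the chapter.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system; need more distinct temperatures")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        known = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - known) / m[r][r]
    return x


def fit_quadratic(temps, usage):
    """Least-squares fit of usage on temp and temp squared; returns a predictor."""
    center = sum(temps) / len(temps)
    # Each row holds: constant term, centered temp, centered temp squared.
    rows = [[1.0, t - center, (t - center) ** 2] for t in temps]
    k = 3
    xtx = [[sum(row[i] * row[j] for row in rows) for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * y for row, y in zip(rows, usage)) for i in range(k)]
    b0, b1, b2 = solve_linear_system(xtx, xty)

    def predict(temp):
        u = temp - center
        return b0 + b1 * u + b2 * u * u

    return predict


# Illustrative points on a U-shaped curve (NOT the chapter's observations).
temps = [50, 60, 70, 80, 90]
usage = [11_370, 9_717, 11_238, 15_933, 23_802]

predict_kwh = fit_quadratic(temps, usage)
print(round(predict_kwh(72), 2))
```

Because these illustrative points lie exactly on a quadratic curve, the fitted model passes through all of them; real observations would leave some residual error.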

Running the regression model after adding the quadratic variable leads to the following results:

It shows that the coefficient of correlation of the regression model is now 0.99. *r*², the proportion of total variance explained by the equation, is 0.985, or 98.5 percent. That means the variables are very strongly and positively correlated. The regression coefficients help create the following equation for predicting energy consumption.

**Energy Consumption = 15.87 × temp-sq – 1,911 × Temp + 67,245**

This equation shows a 98.5-percent fit, which is very good for business and economic contexts. Now one can predict the Kwatts value when the temperature is 72 degrees.

Energy consumption = (15.87 × 72 × 72) – (1,911 × 72) + 67,245 = 11,923 Kwatts
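The quadratic equation can also be evaluated in code, and the temperature at which consumption hits bottom follows from a standard property of a parabola *a*·*t*² + *b*·*t* + *c* with *a* > 0: its minimum lies at *t* = −*b* / (2*a*). The function name is a hypothetical choice, and the minimum is derived here from the equation's coefficients rather than quoted from the chapter.

```python
def energy_consumption(temp):
    """Predicted energy consumption from the quadratic equation in the text."""
    return 15.87 * temp ** 2 - 1_911 * temp + 67_245


# The worked example from the text: temperature of 72 degrees.
at_72 = energy_consumption(72)

# Minimum of a*t^2 + b*t + c (a > 0) is at t = -b / (2a); here b = -1,911.
temp_at_minimum = 1_911 / (2 * 15.87)
lowest_consumption = energy_consumption(temp_at_minimum)

print(f"At 72 degrees: {at_72:,.0f}")
print(f"Consumption bottoms out near {temp_at_minimum:.1f} degrees")
```

By this equation, consumption bottoms out at roughly 60 degrees and rises on either side, which matches the curvilinear shape described earlier.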