Bus Q600 Web Page

Chapter 11: Correlation Coefficient and Simple Linear Regression Analysis

Correlation

Correlation between your height and your handspan

Your height and handspan data (2013): All Sections

Here are the scatter plot and regression results for these data. Do you see a correlation between these two variables?

Correlation between fathers' and sons' heights

We will use this graph to investigate this correlation. (This data set contains the heights of 1078 fathers and their grown sons in about the year 1900.)

Can we predict, with certainty, a newborn son's adult height from his father's height? Can we predict it using "regression"?

Regression towards the mean: "Extreme characteristics (e.g., height) in parents are not passed on completely to their offspring. Rather, the characteristics in the offspring regress towards the mean." (Identified by Sir Francis Galton.)

Correlation between economic power and gold medals

Do gold medals collected at the Olympics mirror a country's economic power? See this graph and this Excel file. We will calculate the covariance and the correlation coefficient for these two variables (economic power and gold medals).
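As a rough sketch of the calculation, the sample covariance and the correlation coefficient can be computed directly from their definitions. The numbers below are made up for illustration (they are not the Olympic data in the Excel file), and the function and variable names are only illustrative.

```python
import math

def sample_covariance(x, y):
    """Sample covariance: sum of cross-deviations divided by n - 1."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    return sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)

def correlation(x, y):
    """Correlation coefficient r = s_xy / (s_x * s_y)."""
    s_xy = sample_covariance(x, y)
    s_x = math.sqrt(sample_covariance(x, x))  # sample standard deviation of x
    s_y = math.sqrt(sample_covariance(y, y))  # sample standard deviation of y
    return s_xy / (s_x * s_y)

# Hypothetical values, NOT the data from the course file.
economic_power = [1.0, 2.0, 3.0, 4.0, 5.0]
gold_medals = [2, 4, 5, 4, 5]

print(sample_covariance(economic_power, gold_medals))       # 1.5
print(round(correlation(economic_power, gold_medals), 4))   # 0.7746
```

Note that the covariance depends on the units of the two variables, while r is unit-free and always lies between -1 and 1.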

Warning! Correlation does not imply causation

Consider the following cases: (i) Cigarette smoking vs. lung health, (ii) Ice cream sales vs. the number of drowning deaths, (iii) Number of pirates vs. global warming. There may be a correlation between the two variables in each case, but which of these cases corresponds to causation?

Look at this! Dilbert knows about correlation, too.

When do we have r = 1?

This happens when all the sample points (x_i, y_i) fall on the same straight line with a positive slope. (If the line has a negative slope, then r = -1.) See the explanation in this file.
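A quick numerical check of this fact, as a sketch: if y_i = a + b x_i exactly, the correlation comes out as 1 when b > 0 and as -1 when b < 0. The x values, intercept and slope below are arbitrary choices made for illustration.

```python
import math

def correlation(x, y):
    """Correlation coefficient computed from sums of deviations."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    ss_xx = sum((xi - x_bar) ** 2 for xi in x)
    ss_yy = sum((yi - y_bar) ** 2 for yi in y)
    return ss_xy / math.sqrt(ss_xx * ss_yy)

x = [1.0, 2.0, 4.0, 7.0]                   # arbitrary x values
y_up = [3.0 + 2.0 * xi for xi in x]        # points on a line with positive slope
y_down = [3.0 - 2.0 * xi for xi in x]      # points on a line with negative slope

r_up = correlation(x, y_up)
r_down = correlation(x, y_down)
print(r_up, r_down)  # equal to 1 and -1 up to floating-point rounding
```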

Simple linear regression

"Harvey's" Restaurants

The real Harvey's has almost 300 restaurants across Canada.

"Harvey's" restaurants data and regression results: (We have a sample of 10 in the examples.) Student population vs. monthly sales. What can we expect as the monthly sales if we open a new restaurant in a town with 10,000 students?

This file has the complete solution with my comments.
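For readers who want to see the arithmetic behind the regression output, here is a minimal least-squares sketch. The five data points are made up for illustration (they are not the 10 observations in the course file), with student population in thousands and monthly sales in thousands of dollars; the function name fit_line is hypothetical.

```python
def fit_line(x, y):
    """Least-squares intercept b0 and slope b1 for the line y = b0 + b1 * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    ss_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = ss_xy / ss_xx
    b0 = y_bar - b1 * x_bar
    return b0, b1

# Hypothetical data, NOT the course data set.
students = [2, 4, 6, 8, 10]          # student population, in thousands
sales = [70, 85, 95, 115, 125]       # monthly sales, in $1000s

b0, b1 = fit_line(students, sales)
predicted = b0 + b1 * 10             # town with 10,000 students
print(b0, b1, predicted)             # 56.0 7.0 126.0

# Coefficient of determination: share of the variation in sales explained by the line.
y_bar = sum(sales) / len(sales)
sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(students, sales))
sst = sum((yi - y_bar) ** 2 for yi in sales)
r_squared = 1 - sse / sst
print(round(r_squared, 4))           # 0.9899
```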

CI for true but unknown mean and PI for a predicted value

Note that there is only one true but unknown mean for the dependent variable (for a given value of the independent variable). An individual value, however, also varies around that mean, so predicting it carries extra uncertainty on top of the sampling error in the estimated line. So the CI for the unknown mean is narrower than the PI for the predicted value. This MegaStat file calculates the required intervals. (Note: MegaStat's "Leverage" is our "Distance Value.")
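The two intervals can be sketched as follows, using the usual simple-regression formulas: the half-width is t * s * sqrt(distance value) for the CI and t * s * sqrt(1 + distance value) for the PI, where distance value = 1/n + (x0 - x̄)² / SSxx. The data are made up for illustration, the function name is hypothetical, and the critical value t = 3.182 (95% confidence, n - 2 = 3 degrees of freedom) is read from a t table rather than computed.

```python
import math

def mean_ci_and_pi(x, y, x0, t_crit):
    """Return (y_hat, CI half-width, PI half-width) at x0 for simple linear regression."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    ss_xx = sum((xi - x_bar) ** 2 for xi in x)
    ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = ss_xy / ss_xx
    b0 = y_bar - b1 * x_bar
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    s = math.sqrt(sse / (n - 2))                       # standard error of the estimate
    distance = 1 / n + (x0 - x_bar) ** 2 / ss_xx       # "Distance Value" (leverage)
    y_hat = b0 + b1 * x0
    ci_half = t_crit * s * math.sqrt(distance)         # for the mean of y at x0
    pi_half = t_crit * s * math.sqrt(1 + distance)     # for an individual y at x0
    return y_hat, ci_half, pi_half

# Hypothetical data, NOT the course data set.
x = [2, 4, 6, 8, 10]
y = [70, 85, 95, 115, 125]

# t_crit = 3.182 is the tabulated t value for 95% confidence with 3 degrees of freedom.
y_hat, ci_half, pi_half = mean_ci_and_pi(x, y, x0=10, t_crit=3.182)
print(round(y_hat, 3), round(ci_half, 3), round(pi_half, 3))
```

Because of the extra "1 +" under the square root, the PI is wider than the CI at every value of x0, and both widen as x0 moves away from x̄.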

Real Estate Data

Real Estate Data: This was discussed in Chapter 2. Let's do a simple regression using square feet as the independent variable (x) and sale price as the dependent variable (y). If your home is 2,800 sqft, what seems to be the "right" price for your home?

The regression equation is obtained as

Price = 55 + 0.09*SqrFt,

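so if SqrFt = 2,800, then Price = $314,579. (The coefficients shown above are evidently rounded, presumably with price in thousands of dollars: 55 + 0.09*2,800 gives only about 307.)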

Is this a good fit? In this problem we have only r² = 0.69. Is it reasonable to expect square footage to be the only predictor of price? Other variables certainly play a role. This is the topic of the next chapter (Multiple Regression).

Residual Analysis

This is an important part of linear regression analysis which can help us determine whether or not our model is correct and its assumptions are satisfied.

We will discuss the elements of residual analysis by referring to the handspan/height data you supplied in the first class. Here we can assume that the population is the set of all 150 students in our first-year MBA program, so the regression line MegaStat computes is the true model's line.

In this data set, look at the tab called Residuals, where MegaStat calculates the errors (residuals) in the regression. Do these errors satisfy the requirements?

Is the variance constant?
Is the normality assumption satisfied?
Is the independence assumption satisfied? Check the Durbin-Watson statistic. (Not in the exam, though.)
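For the independence check, the Durbin-Watson statistic is the sum of squared differences between successive residuals divided by the sum of squared residuals. A minimal sketch, with made-up residuals (not the ones in the MegaStat output) and an illustrative function name:

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic: sum (e_t - e_{t-1})^2 / sum e_t^2."""
    numerator = sum(
        (residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals))
    )
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator

# Hypothetical residuals, listed in the order the observations were collected.
residuals = [0.0, 1.0, -3.0, 3.0, -1.0]
d = durbin_watson(residuals)
print(d)  # 3.45
```

The statistic always lies between 0 and 4. Values near 2 suggest no first-order autocorrelation, values well below 2 suggest positive autocorrelation, and values well above 2 suggest negative autocorrelation. It is only meaningful when the observations have a natural order, such as time.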