Session 2

II. REGRESSION: R functions for multiple regression and its ramifications

Topic 1 (Data modelling - Basics)
Topic 2 (Data modelling - Making models more flexible)
Topic 3 (Data modelling - Making models more selective)

(a) Topic 1: Data modelling - Basics [Chapter 3 in Jank]

In this section we will look at sales/advertising/income data and consider simple and multiple linear regression models.

a.1 Sales vs. advertising


Here is a cool animation for finding the regression line: Animation for regression. [Sorry, this stopped working after I installed the new version of Java. No idea why this is happening with Java.]

Here is a graphical explanation of what linear regression does. Note the meaning of SST, SSE and SSR from which we obtain R^2 = SSR/SST which is always between 0 and 1. (Could SST, SSE and SSR be interpreted as Total Cholesterol, Bad Cholesterol and Good Cholesterol, resp.?)
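To make the SST/SSE/SSR bookkeeping concrete, here is a minimal sketch in plain R. The data are simulated (an assumption made purely for illustration; this is not the course dataset), and the variable names x and y are made up.

```r
# Simulated data (an assumption; not the sales/advertising file)
set.seed(1)
x <- runif(50, 0, 10)
y <- 5 + 2 * x + rnorm(50, sd = 3)

fit <- lm(y ~ x)

SST <- sum((y - mean(y))^2)            # total variation in y
SSE <- sum(residuals(fit)^2)           # variation left unexplained by the line
SSR <- sum((fitted(fit) - mean(y))^2)  # variation explained by the line

# SST splits into SSR + SSE, so R^2 = SSR/SST must lie between 0 and 1
c(SST = SST, SSR_plus_SSE = SSR + SSE)
c(manual_R2 = SSR / SST, R2_from_summary = summary(fit)$r.squared)
```

Both pairs of numbers should agree, which is why R^2 can be read as the share of the total variation explained by the regression.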

Example: Let's use the sales and advertising data [Table3.1Sales-Advertising.csv] and analyze it.

Here is the graph of this data set as obtained by Rcmdr.
How did Rcmdr find the green line on the graph? This is the regression line, and here is what we do to find it. (To find the regression equation, you will need to use Rcmdr's Statistics > Fit Models > Linear Regression.) So, the regression line is Sales = 51.489 + 7.527 x Advertising with R-squared = 0.24. What do these numbers mean?
It is very important to look at the Pr(> | t |) values to get a feel for how significant a coefficient is. (This is the p-value we introduced in the Coke-Pepsi hypothesis testing experiment. Here, we have H0: The coefficient of a variable = 0.)
You can consider the ratio of "Estimate to Std. Error" (which gives the t value) as a Signal-to-Noise ratio. If this ratio is high in absolute value [and the Pr(> | t |) is low] then we have strong evidence that the coefficient is non-zero, and so the result is significant. The symbols *, **, etc., are the significance codes (the more stars, the smaller the p-value and the more significant the coefficient.)
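The Signal-to-Noise reading of the t value can be checked directly from the coefficient table. The sketch below uses simulated data (the names advt and sales and all numbers are assumptions for illustration only) and rebuilds the t value and the two-sided p-value by hand.

```r
# Simulated data (an assumption; not the course dataset)
set.seed(2)
advt  <- runif(40, 5, 15)
sales <- 50 + 7 * advt + rnorm(40, sd = 20)

fit <- lm(sales ~ advt)
ct  <- summary(fit)$coefficients  # columns: Estimate, Std. Error, t value, Pr(>|t|)
ct

# "Signal" divided by "noise" reproduces the t value column
signal_to_noise <- ct[, "Estimate"] / ct[, "Std. Error"]
all.equal(signal_to_noise, ct[, "t value"])

# Two-sided p-value for H0: coefficient = 0, from the t distribution
p_manual <- 2 * pt(abs(signal_to_noise), df = df.residual(fit), lower.tail = FALSE)
all.equal(p_manual, ct[, "Pr(>|t|)"])
```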

How do we predict the sales for given advertising levels and also find prediction and confidence intervals?

The next two commands are used to do the predictions:
newPoint <- data.frame(ADVT=10.25); newPoint
predict(RegModel.1,newPoint,interval="prediction")
With the help of Rcmdr, we get these results. (The highlighted commands are entered manually in the Rcmdr window. The package UsingR and its function simple.lm is optional.) Note that with a single R line we can easily obtain the prediction results.
Good news! Basically, the only time you will need to enter commands in Rcmdr is when you are doing predictions. We will do this one more time with multiple regression below. We will also look at a nice correlation plot (corrplot) package where you will enter a few simple commands. All other commands are generated by Rcmdr automatically.
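For readers who want to see the difference between the two kinds of interval, here is a small sketch. The data are simulated and the response column name SALES is hypothetical (only ADVT and RegModel.1 come from the commands above). It asks predict() for both a confidence interval (for the mean sales at that advertising level) and a prediction interval (for one new observation).

```r
# Simulated stand-in for the sales/advertising data (an assumption)
set.seed(3)
Dataset <- data.frame(ADVT = runif(30, 5, 15))
Dataset$SALES <- 50 + 7.5 * Dataset$ADVT + rnorm(30, sd = 15)  # SALES is a made-up column name

RegModel.1 <- lm(SALES ~ ADVT, data = Dataset)
newPoint   <- data.frame(ADVT = 10.25)

conf_int <- predict(RegModel.1, newPoint, interval = "confidence")  # interval for the mean response
pred_int <- predict(RegModel.1, newPoint, interval = "prediction")  # interval for a single new value

rbind(confidence = conf_int[1, ], prediction = pred_int[1, ])
```

The fitted value is the same in both rows, but the prediction interval is wider because it also accounts for the scatter of an individual observation around the line.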

Once the regression model is computed, we need to check certain things. Perhaps most importantly, the errors must be approximately normally distributed. This can be done graphically as follows:

Models > Graphs > Basic diagnostic plots: The top two tell us how errors look.
Models > Graphs > Residual quantile comparison plot: If the dots are all approximately on or near the line, then the assumption of normally distributed errors is reasonable.
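The same checks can be sketched outside the menus with base R. This is only an approximation of what the Rcmdr menu items draw (the exact plots Rcmdr produces may differ), and the data are simulated for illustration.

```r
# Simulated data (an assumption; not a course dataset)
set.seed(4)
x <- runif(60, 0, 10)
y <- 3 + 1.5 * x + rnorm(60)

fit <- lm(y ~ x)
res <- residuals(fit)

# Residuals vs fitted values: look for a shapeless band around zero
plot(fitted(fit), res, xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# Normal quantile plot: points near the line support normal errors
qqnorm(res)
qqline(res)

# A formal companion to the picture (Shapiro-Wilk test of normality)
shapiro.test(res)
```

A large p-value from the Shapiro-Wilk test means the data give no evidence against normality; it does not prove that the errors are normal.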

R documentation for linear model lm().

Background material on linear regression from Wikipedia.

Example : Here's an amazing example of four problems with very different datasets which give the same results. Anscombe's quartet.
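Anscombe's quartet ships with R as the built-in data frame anscombe (columns x1..x4 and y1..y4), so the claim is easy to check. The sketch below fits the four simple regressions and tabulates the intercept, slope and R-squared of each.

```r
# anscombe is a built-in R dataset; no file import is needed
fits <- lapply(1:4, function(i) {
  lm(reformulate(paste0("x", i), response = paste0("y", i)), data = anscombe)
})

coefs <- t(sapply(fits, coef))
dimnames(coefs) <- list(paste("set", 1:4), c("intercept", "slope"))
r2 <- sapply(fits, function(f) summary(f)$r.squared)

# All four rows are nearly identical (intercept near 3, slope near 0.5)
round(cbind(coefs, r.squared = r2), 3)
```

Plotting each pair (for example plot(anscombe$x2, anscombe$y2)) shows how different the four datasets really are, which is the reason to always look at the scatterplot before trusting the numbers.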

Example : Your height-handspan data [T711-Height-Handspan.xlsx]. We will see if this model works well to estimate someone's height from his/her handspan.

New <- data.frame(Handspan=xx)
predict(RegModel.1,New,interval="prediction")

Exercise: Consider this house price dataset [HousePrices-Data.csv] for 124 houses. Do a simple regression with Price as dependent variable and SqrFt as independent variable. Perform analysis similar to what we did above with the advertising/sales problem. (Answer: Regression equation is Price = 259.88 + 120.16 x SqrFt.)

a.2 Sales vs. advertising and income

R documentation for linear model lm().


Example: Now we use the sales vs. advertising and income data [Table 3.1 Sales-Advertising-Income.csv]. So, this is a multiple regression problem which is solved as follows:

Rcmdr can plot the 3D graph of the dataset and even show the fitted linear surface. It also finds the coefficients of the regression equation. The results are here without the 3D graph. (To find the regression equation, you will need to use Rcmdr's Statistics > Fit Models > Linear Regression.) The regression equation is then Sales = 36.894 + 5.069 x Advertising + 0.808 x Income with R-squared = 0.45.
The comments made about the significance of the coefficients also apply here.

We should be able to predict the Sales from the given values of Advertising and Income as follows:

new <- data.frame(ADVT=8, INCOME=53)
new
predict(RegModel.1, new, interval="prediction")
This is quite easy with R and the results are found here. (Note again that the highlighted commands are entered manually in the Rcmdr window.) The confint() command used here gives the 95% confidence intervals for the regression coefficients.

Background material on linear regression from Wikipedia.

Exercise: Consider the same house price dataset [HousePrices-Data.csv] for 124 houses used above. Now use SqrFt, LotSize, Bedrooms and Bathrooms as independent variables and Price as dependent variable and perform a multiple regression.

a.3 Customers' Spending Patterns (Direct Marketing)

Example: Let's return to the same dataset we looked at the first day, i.e., the direct marketing data. Recall that this dataset has four numerical variables and the rest are qualitative. Let's do a multiple regression on these four variables with AmountSpent as the dependent variable. Here's the Direct marketing data set [Table 2.6 DirectMarketing.csv] again.

The model now is: RegModel.1 <- lm(AmountSpent~Catalogs+Children+Salary, data=Dataset)
The regression equation is found as AmountSpent = -442 + .02 x Salary + 47.7 x Catalogs - 198 x Children
We also have R^2 = .6584 which is reasonably high.
The following command gives the 95% CI for all coefficients. Interpretation?
> confint(RegModel.1)
                    2.5 %       97.5 %
(Intercept) -548.17071899 -337.3451739
Catalogs      42.28849708   53.1021521
Children    -232.22618388 -165.1631964
Salary         0.01924234    0.0215693
The intercept is the predicted value of the response variable when every independent variable is set to zero. In this case, however, the negative intercept (-442) has no practical meaning.
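One way to read the confint() output is sketched below. The numbers are typed in from the output shown above (nothing is re-estimated); the check asks whether each 95% interval excludes zero, and the midpoint of each interval recovers the coefficient estimate in the regression equation.

```r
# Interval endpoints copied from the confint(RegModel.1) output above
ci <- rbind(
  "(Intercept)" = c(-548.17071899, -337.3451739),
  Catalogs      = c(  42.28849708,   53.1021521),
  Children      = c(-232.22618388, -165.1631964),
  Salary        = c(   0.01924234,    0.0215693)
)
colnames(ci) <- c("2.5 %", "97.5 %")

# An interval that does not contain 0 corresponds to a coefficient
# that is significant at the 5% level
excludes_zero <- ci[, "2.5 %"] > 0 | ci[, "97.5 %"] < 0

# These t-based intervals are symmetric, so the midpoint is the estimate
midpoint <- rowMeans(ci)

cbind(ci, midpoint, excludes_zero)  # excludes_zero prints as 1 (TRUE) or 0 (FALSE)
```

For example, the Catalogs interval says that, holding Children and Salary fixed, each additional catalog is associated with roughly 42 to 53 more in AmountSpent.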

Example : How long would it take to generate 1,000,000 observations (standard normal) for 100 variables plus the means? Once that is done, how long does it take to run a 100 independent variable multiple regression problem with the 1,000,000 observations? (Just a minute or two!) We will do this in class.
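Here is a scaled-down sketch of that timing experiment. It uses 10,000 observations instead of 1,000,000 so that it finishes in seconds (the sample size n, the seed and the object names are assumptions for illustration); change n to 1e6 to try the full-size version if your machine has enough memory.

```r
set.seed(5)
n <- 10000  # scaled down; the class example uses 1,000,000
p <- 100    # number of independent variables

# Time the data generation and the column means
gen_time <- system.time({
  X <- matrix(rnorm(n * p), nrow = n, ncol = p)  # standard normal observations
  col_means <- colMeans(X)
})

# Time a multiple regression with all 100 independent variables
y <- rnorm(n)
fit_time <- system.time(fit <- lm(y ~ X))

gen_time["elapsed"]
fit_time["elapsed"]
length(coef(fit))  # intercept plus 100 slopes
```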

Exercise: The problem statement for this Education Level/Gender/Income problem is here. You will need this Excel data file [Education-Gender-Income.xlsx] to import into Rcmdr and do the calculations.

a.4 House Prices (Optional)

R documentation for linear model lm().

Example: (Optional) This is a more challenging problem with qualitative factors (such as Yes/No, North/West/East) for some variables in the dataset [Table2.1HousePrices.csv]. To find the regression equation, you will need to use Rcmdr's Statistics > Fit Models > Linear Model as LinearModel.1 <- lm(Price ~ Bathrooms + Bedrooms + Brick + Neighborhood + Offers + SqFt, data=Dataset). The confint(LinearModel.1) command will produce the confidence intervals for coefficients. Here are the results.

Prediction of the dependent variable when factors are present (such as Brick, Neighborhood) can be a little tricky. Here is what you need to do if you want to predict the price of a house with 2 bathrooms, 5 bedrooms, brick, north neighborhood, 2 offers and 1000 sqft. As usual, the highlighted commands are entered manually.
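A sketch of such a prediction is given below. The column names follow the model formula above, but the data are simulated and the factor level spellings ("Yes"/"No", "East"/"North"/"West") are assumptions; check levels(Dataset$Brick) and levels(Dataset$Neighborhood) in the real file. The key point is that factor values in the new data frame are given as text that matches the levels exactly.

```r
# Simulated stand-in for the house price data (an assumption)
set.seed(6)
n <- 120
Dataset <- data.frame(
  SqFt         = round(runif(n, 900, 2600)),
  Bedrooms     = sample(2:5, n, replace = TRUE),
  Bathrooms    = sample(2:4, n, replace = TRUE),
  Offers       = sample(1:6, n, replace = TRUE),
  Brick        = factor(sample(c("No", "Yes"), n, replace = TRUE)),
  Neighborhood = factor(sample(c("East", "North", "West"), n, replace = TRUE))
)
Dataset$Price <- 20000 + 50 * Dataset$SqFt +
  15000 * (Dataset$Brick == "Yes") + rnorm(n, sd = 8000)

LinearModel.1 <- lm(Price ~ Bathrooms + Bedrooms + Brick + Neighborhood + Offers + SqFt,
                    data = Dataset)

# Factor values are supplied as strings matching the factor levels
New <- data.frame(Bathrooms = 2, Bedrooms = 5, Brick = "Yes",
                  Neighborhood = "North", Offers = 2, SqFt = 1000)
pred <- predict(LinearModel.1, New, interval = "prediction")
pred
```

A misspelled level (say "north" instead of "North") makes predict() stop with an error about a new factor level, which is the usual source of trouble here.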

I used the corrplot package to plot the correlations between different numerical variables in this example. Here is the result.


(b) Topic 2: Data modelling - Making models more flexible [Chapter 4 in Jank]

Dummy variables are needed when some of the variables assume binary values, or they are categorical in nature.
Including interaction terms helps the analyst obtain more accurate results in regression.

b.1 "Dummy" variables

R documentation for linear model lm().
Background material for dummy variables on Wikipedia.

Example (very simple): Suppose we have this dataset [Trucking-Factor.csv] where we want to estimate the travel time, given, (i) distance travelled, (ii) number of deliveries and (iii) truck type. Note that truck type (Pickup, or Van) is not given as a numerical value. In previous regression examples, the variables were all numerical, but now the truck type is a categorical variable. So, what do we do now?

We can introduce a dummy (indicator) variable and indicate, say, Pickup by 0 and Van by 1. (So, Pickup is the base here.) This gives us the new (re-coded) dataset [Trucking-Dummy.csv]. Now, we can solve the problem the usual way and find Time = 0.5222 + 0.0464 x Km + 0.7102 x Deliveries + 0.9 x TruckType. (Here, if we decide to use a Van rather than a Pickup, our expected travel time goes up by 0.9 hours.)
What if we had three types of vehicles? Pickup, Van and Scooters (for small deliveries)? Now we would have to consider a more complex set of dummy variables. (I will show in class.)
But there is an easier way to do these things. Use the original dataset [Trucking-Factor.csv]. Rcmdr will recognize that TruckType is a factor and will let you use the Linear Model option.
The result now will be Time = 0.5222 + 0.0464 x Km + 0.7102 x Deliveries + 0.9 x TruckType[T.Van]. Once again, if we move from the base Pickup to Van, the coefficient 0.9 of TruckType[T.Van] tells us that we spend about 0.9 hours longer on the road.
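The equivalence of the two approaches can be sketched as follows. The data are simulated (an assumption; the column names mirror the trucking example, and VanDummy is a made-up name for the 0/1 column). The sketch fits the model once with a hand-made dummy and once with the factor, and then shows the dummy columns R would build for three vehicle types.

```r
# Simulated stand-in for the trucking data (an assumption)
set.seed(7)
n <- 40
Trucking <- data.frame(
  Km         = round(runif(n, 20, 120)),
  Deliveries = sample(1:6, n, replace = TRUE),
  TruckType  = factor(sample(c("Pickup", "Van"), n, replace = TRUE))
)
Trucking$Time <- 0.5 + 0.05 * Trucking$Km + 0.7 * Trucking$Deliveries +
  0.9 * (Trucking$TruckType == "Van") + rnorm(n, sd = 0.5)

# Approach 1: code the dummy by hand (Pickup = 0 is the base, Van = 1)
Trucking$VanDummy <- as.numeric(Trucking$TruckType == "Van")
fit_dummy <- lm(Time ~ Km + Deliveries + VanDummy, data = Trucking)

# Approach 2: let R build the dummy from the factor
fit_factor <- lm(Time ~ Km + Deliveries + TruckType, data = Trucking)

# Same numbers; row names come from the first model
cbind(dummy = coef(fit_dummy), factor = coef(fit_factor))

# Three vehicle types need two dummy columns (Pickup remains the base)
vehicle <- factor(c("Pickup", "Van", "Scooter"), levels = c("Pickup", "Van", "Scooter"))
three_level_design <- model.matrix(~ vehicle)
three_level_design
```

In plain R the factor coefficient is labelled TruckTypeVan; the TruckType[T.Van] label seen in the Rcmdr output is just a different naming style for the same term.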

Prediction

New <- data.frame(Km=45, Deliveries=3, TruckType="Van")
predict(LinearModel.1, New, interval="prediction")

Example: Let's consider the dataset for salaries [Table4.1GenderDiscrimination-Factor.csv]. When we ignore the gender and plot the dataset, we are missing out on the information inherent in the gender differences. Only when we plot the data according to gender group do we see a clearer picture. The Rcmdr results are here. So, if we just consider the Experience and Salary columns, the regression equation is found as Salary = 59033.1 + 1727.3 x Experience. (R-squared = 0.31.) But is this accurate?

Example (continued): Here, we need to use a dummy variable to distinguish between males and females. We define Gender.Male = 1 if gender is "male" and 0 otherwise. The new dataset with this information is here [Table4.1GenderDiscrimination-Dummy.csv]. Using RegModel.3 <- lm(Salary~Experience+Gender.Male, data=Dataset), we find Salary = 53260.0 + 1744 x Experience + 17020 x Gender.Male. (R-squared = 0.44.)

Note: Of course, there is an easier way to do this with Rcmdr without using the Gender.Male construct. Just use the "Linear model". Rcmdr knows that Gender is a factor. If you pick it as an independent variable, R figures out the rest and you get the same result as Salary = 53260 + 1744 x Experience + 17020 x Gender[T.Male].

So, here is the summary of what we have found so far:

Starting salary for females is 53,260.
Starting salary for males is 53,260 + 17,020 = 70,280. (Discrimination?)
Each additional year of experience is worth 1,744 for either gender. (Seems strange! We will fix this soon with an interaction term.)
The Rcmdr results of this analysis.

b.2 Interaction terms

This is somewhat difficult material. So, let me give you a few real-life examples of interaction (from Wikipedia).

Interaction between adding sugar to coffee and stirring the coffee. Neither of the two individual variables has much effect on sweetness but a combination of the two does.

Interaction between smoking and inhaling asbestos fibres: Both raise lung carcinoma risk, but exposure to asbestos multiplies the cancer risk in smokers and non-smokers. Here, the joint effect of inhaling asbestos and smoking is higher than the sum of both effects.

Now, back to our discussion...

R documentation for linear model lm().
Background material for interaction on Wikipedia.

Example: We noted above that each additional year of experience is worth 1,744 for either gender. But this is not quite logical. Could it be more for males? We analyse this using an interaction term in the form Gender.Exp.Int = Gender.Male x Experience. The dataset with the interaction term is here [Table 4.1 Gender Discrimination-Dummy-Interaction.csv]. The regression equation is obtained as Salary = 66,333 + 666 x Experience - 8,034 x Gender.Male + 2,086 x Gender.Exp.Int. (R-squared = 0.55.)

Now what happens?

For females, Gender.Male = 0, so
- Salary = 66,333 + 666 x Experience. (Meaning?)
For males, it is more complicated: Since Gender.Male = 1,
- Salary = 66,333 + 666 x Experience - 8,034 x 1 + 2,086 x 1 x Experience. Simplifying we have,
- Salary = 58,299 + 2,752 x Experience. (Meaning?)
Rcmdr results.

Suppose you want to estimate a fair salary for a male with 13 years of experience. We do this:

New <- data.frame(Experience=13, Gender.Male=1, Gender.Exp.Int=13); New
predict(RegModel.1,New,interval="prediction")
This gives us
       fit      lwr    upr
1 94087.71 64075.43 124100
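The fitted value can be checked by hand from the estimated coefficients. The sketch below plugs the one-decimal estimates from the Rcmdr coefficient table for this model (66333.6, 666.7, -8034.3, 2086.2) into the regression equation; the object names are made up for illustration.

```r
# Coefficient estimates taken from the Rcmdr output for the interaction model
b0 <- 66333.6   # intercept
b1 <- 666.7     # Experience
b2 <- -8034.3   # Gender.Male
b3 <- 2086.2    # Gender.Exp.Int = Gender.Male x Experience

experience  <- 13
gender_male <- 1

# Salary = b0 + b1*Experience + b2*Gender.Male + b3*(Gender.Male*Experience)
salary_hat <- b0 + b1 * experience + b2 * gender_male + b3 * gender_male * experience
salary_hat
```

The result is about 94,087, which agrees with the fit column above up to the rounding of the coefficients. The lwr and upr values cannot be rebuilt this way because they also depend on the residual variance and on the data.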

Note 1: Of course, as before, there is an easier way to do this with Rcmdr without using the Gender.Exp.Int construct. Just use the "Linear model" and incorporate the product of Gender and Exp as a new variable. Rcmdr figures out what to do and finds exactly the same result. Here again is the dataset for salaries [Table4.1GenderDiscrimination-Factor.csv].

Here, the model is,

LinearModel.1 <- lm(Salary ~ Experience + Gender +Experience*Gender, data=Dataset).

The output is obtained as follows. (The coefficients are shown in bold font, and Experience:Gender[T.Male] is the interaction term Gender.Exp.Int.)

Coefficients:
                           Estimate Std. Error t value Pr(>|t|)
(Intercept)                 66333.6     2811.7  23.592  < 2e-16 ***
Experience                    666.7      206.5   3.228  0.00145 **
Gender[T.Male]              -8034.3     4110.6  -1.955  0.05201 .
Experience:Gender[T.Male]    2086.2      287.3   7.261 7.95e-12 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Note 2: There is an even easier way to get the same results. Just enter Gender=="Male" in the subset box in Linear model, and R gives the regression results for males only. You can do this as an Exercise.
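Why the subset trick gives the same line can be sketched with simulated data (an assumption; the numbers only loosely imitate the salary example). A model with a full Experience x Gender interaction fits a separate line for each gender, so the male line built from its coefficients matches a regression run on males alone.

```r
# Simulated stand-in for the salary data (an assumption)
set.seed(8)
n <- 200
Dataset <- data.frame(
  Experience = round(runif(n, 0, 30)),
  Gender     = factor(sample(c("Female", "Male"), n, replace = TRUE))
)
male <- as.numeric(Dataset$Gender == "Male")
Dataset$Salary <- 66000 + 700 * Dataset$Experience - 8000 * male +
  2000 * male * Dataset$Experience + rnorm(n, sd = 10000)

# Interaction model: Experience*Gender expands to both main effects plus their product
full <- lm(Salary ~ Experience * Gender, data = Dataset)

# Males only, as with Gender=="Male" in the Rcmdr subset box
males <- lm(Salary ~ Experience, data = Dataset, subset = Gender == "Male")

# Coefficient order: intercept, Experience, GenderMale, Experience:GenderMale
b <- unname(coef(full))
male_line_from_full <- c(intercept = b[1] + b[3], slope = b[2] + b[4])

rbind(from_interaction_model = male_line_from_full,
      from_males_only        = unname(coef(males)))
```

The coefficients agree, but the standard errors generally do not, because the interaction model pools the residual variance over both genders while the subset model uses males only.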

Note 3: [Caveat! Advanced material] What exactly do we mean by "interaction"? Let me make it clearer by comparing two functions (assuming you know some calculus), (i) f(x,y) = ax + by, and (ii) g(x,y) = ax + by + cxy.

In the first case, there is no interaction between x and y, but in the second there is!

Here's why: holding y fixed, df/dx = a, so that df = a*dx, i.e., the effect of a change in x on f does not depend on the value of y.

But for the second function, dg/dx = a + cy, so that dg = (a + cy)*dx, i.e., the effect of a change in x on g does depend on the value of y!

b.3 Nonlinear relationships and logarithmic data transformations (Optional)

Background material for data transformations from Wikipedia.

Example: (Optional) In every model we discussed, we assumed that the relationship between the independent variable(s) and the dependent variable was essentially linear. This allowed us to use linear regression. But, what if the relationship is not linear? Does using linear regression produce inaccurate results? In fact, it does, and we need to revise our approach to solving such problems.

To illustrate, consider this dataset [Table2.6DirectMarketing.csv] where we examine the relationship between Salary and AmountSpent variables. In this output, we first look at the scatterplot and do a simple regression. The data seem to be approximating a funnel, and we get poor results with an R^2 = 0.4894. Next, we transform both variables using logarithms, and this gives a better looking scatterplot. Running the regression, we find R^2 = 0.5816. We suspect that the AmountSpent also depends on the Location variable (as seen in the new scatterplot), so we include Location as a factor and obtain even better results with R^2 = 0.645.

The interpretation of the estimates of the coefficients will be provided in the next simple example.

But, first a bit of "theory."

Suppose you are not convinced that a linear relationship between price (P) and demand (D) is justifiable; so you decide to use a nonlinear one, e.g., D = exp(a)*P^b where a and b are your coefficients. If we differentiate both sides with respect to P, we get

dD/dP = exp(a)*b*P^(b-1) = b*exp(a)*P^b*(1/P) = b*D/P.

Rewriting, we have,

dD/D = b*dP/P,

which says that if you increase price P by 1%, demand D will change by approximately b%. (For example, if P = 100, dP = 1, then dP/P = 1%. Now, if b = -2.5, this means that dD/D = -2.5%, so demand will go down by about 2.5%.)

Now, taking the logarithms of both sides of D = exp(a)*P^b, we get log(D) = a + b*log(P), which is a nice linear equation, if you write y = log(D) and x = log(P). This means that we can do whatever we were doing as before on the log transformed problem and use the results keeping in mind the meaning of the coefficient b. This is what we will do next.
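The "theory" can be tried out on simulated data. In the sketch below, the true values a = 12 and b = -2.5, the price range and the noise level are all assumptions chosen for illustration. Demand is generated from D = exp(a)*P^b with multiplicative noise, and a regression of log(D) on log(P) recovers b.

```r
# Simulated price/demand data (an assumption; not Table 4.2)
set.seed(9)
n <- 200
a <- 12
b <- -2.5
P <- runif(n, 50, 150)
D <- exp(a) * P^b * exp(rnorm(n, sd = 0.1))  # multiplicative noise

# Linear regression on the log-transformed variables
fit <- lm(log(D) ~ log(P))
coef(fit)  # intercept estimates a, slope estimates b

b_hat <- unname(coef(fit)[2])

# Exact percentage change in demand for a 1% price increase,
# compared with the "b percent" rule of thumb
pct_change <- (1.01^b_hat - 1) * 100
c(rule_of_thumb = b_hat, exact = pct_change)
```

The two numbers in the last line are close but not identical, since the "1% in price gives b% in demand" reading is a first-order approximation.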

Example : (Optional) Here is a fascinating example with a dataset from the United Nations on a country's GDP and the infant mortality. (The data file is in native R format, i.e., it is not an Excel file.) The data is highly nonlinear and a linear fit gives terrible results. But a log transformation of both variables produces a reasonably linear cloud of points through which we fit a line quite accurately. Here are the results.

Example: (Optional) Let's consider this data set [Table 4.2 PriceAndDemand.csv] for this example. The scatterplot reveals that there may be a nonlinear relationship between price and quantity. Initially, we fit a line to the data and obtain an R^2 = 0.6236. The graphs obtained for the model reveal that the assumption of errors being normal is not satisfied. When we further do a scatterplot with log x and log y axes, the graph of transformed data appears more linear. A fit using log(Price) and log(Qty) gives the coefficient b = -1.1810 which means that for a 1% increase in price, quantity demanded would fall by about 1.181%. Here is the output of the results.


(c) Topic 3: Data modelling - Making models more selective [Chapter 5 in Jank]

c.1 Multicollinearity

R documentation for linear model lm().
Background material on multicollinearity on Wikipedia.

Is it always a good idea to include as many independent variables as we can in a regression problem? No! Let's see why.

Example: Consider the dataset [Table5.1Sales-and-Assets.csv] for a problem with sales and assets as independent variables and profit as the dependent variable.

If we include both sales and assets in our problem we find very high p-values for both (0.340 and 0.643), indicating that neither is individually significant, and so they look useless! However, the p-value of the overall F-test for the model is very small (0.002), indicating that the model with these two variables is significant. There is also a high R-squared value (0.63). How could this be?
There is more to this. When we run the regression with a single variable (Profit vs. Sales, and Profit vs. Assets) we get logical results. Individually, both variables are significant! So, all of this tells us that jointly the two variables do not contribute to our understanding, but individually, they do.
The reason is that these variables are highly correlated as we see in the correlation matrix.
Here are the results from Rcmdr.
How do we cure this problem? By "Variable Selection."
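The mechanism behind this puzzle can be sketched with simulated data. This is an assumption for illustration: the numbers below are made up and will not reproduce the p-values quoted above. Two highly correlated predictors are built, and the standard error of the Sales coefficient is compared with and without Assets in the model.

```r
# Simulated stand-in for the sales/assets data (an assumption)
set.seed(10)
n <- 30
Sales  <- rnorm(n, mean = 5000, sd = 1500)
Assets <- 0.8 * Sales + rnorm(n, sd = 100)          # nearly a copy of Sales
Profit <- -100 + 0.03 * Sales + rnorm(n, sd = 60)
Dataset <- data.frame(Profit, Sales, Assets)

cor(Dataset)  # note the Sales-Assets correlation

both       <- lm(Profit ~ Sales + Assets, data = Dataset)
sales_only <- lm(Profit ~ Sales, data = Dataset)

# Standard error of the Sales coefficient in each model
se_both   <- summary(both)$coefficients["Sales", "Std. Error"]
se_single <- summary(sales_only)$coefficients["Sales", "Std. Error"]
c(with_Assets = se_both, Sales_alone = se_single)

# Variance inflation factor for Sales: 1 / (1 - R^2 of Sales regressed on Assets)
r2_sales  <- summary(lm(Sales ~ Assets, data = Dataset))$r.squared
vif_sales <- 1 / (1 - r2_sales)
vif_sales
```

When the two predictors carry nearly the same information, the model cannot tell which one deserves the credit, so the standard errors blow up (the "noise" in the Signal-to-Noise ratio) even though the pair together explains Profit well.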

Exercise: Use this dataset [Butler-x1-x2-Multicollinear.csv] to find a regression equation for Time as dependent variable and Km, Deliveries and Gas (consumed) as independent variables. Is there a high correlation between Km and Gas? What can go wrong with such problems?

c.2 Variable selection by stepwise regression

R documentation for stepwise regression.
Background material on variable selection on Wikipedia.

R has a nice way of dealing with the multicollinearity problem using stepwise regression.

Example: We use the same dataset [Table5.1Sales-and-Assets.csv] as above. The stepwise procedure is applied after the regression problem is solved by using Models > Stepwise Model Selection... Here is the result.

We use the Akaike Information Criterion (AIC) and the backward/forward direction.
The smaller the AIC, the better the model.
Start: AIC = 175.64 and Profit ~ Assets + Sales
If we remove Assets, we get the smallest AIC value 173.92.
Any further removal (of Sales) results in a higher AIC, so we stop.
Thus, we have Profit = -124.85492 + 0.02918 x Sales.
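A command-line sketch of the same idea uses base R's step() function with direction = "both". This is a stand-in that may differ in detail from what the Rcmdr menu runs, and the data are simulated (an assumption), so the AIC values will not match the ones listed above.

```r
# Simulated stand-in for the sales/assets data (an assumption)
set.seed(11)
n <- 30
Sales  <- rnorm(n, mean = 5000, sd = 1500)
Assets <- 0.8 * Sales + rnorm(n, sd = 100)
Profit <- -100 + 0.03 * Sales + rnorm(n, sd = 60)
Dataset <- data.frame(Profit, Sales, Assets)

# Start from the full model and let step() drop or re-add terms by AIC
full    <- lm(Profit ~ Assets + Sales, data = Dataset)
reduced <- step(full, direction = "both", trace = 1)  # trace = 1 prints each step

formula(reduced)

# extractAIC() returns the AIC on the scale that step() prints
c(start = extractAIC(full)[2], final = extractAIC(reduced)[2])
```

The printed trace has the same layout as the Rcmdr result: a "Start: AIC=" line, a table showing the AIC after removing each candidate term, and a stop as soon as no single change lowers the AIC.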

Exercise: Consider again the house price dataset [HousePrices-Data.csv] for 124 houses.

Generate a correlation plot using corrplot for all the numerical variables.
Run a regression with Price as the dependent variable and everything else (except SubDiv) as independent variable.
Next, use Models > Stepwise Model Selection... and reduce the model with Akaike Information Criterion (AIC) and the backward/forward direction.
How many independent variables do you have left and what is the AIC? (Answer: We have 5 variables left as Price ~ Bathrooms + Bedrooms + Distance + LotSize + SqrFt, and AIC is 736.41.)