Chapter 3: Multiple Regression Analysis

Below you will find the script for all examples in chapter 3 of Wooldridge (2013). The only new element compared to the last chapter is the use of multiple independent variables in the lm function. This is easily achieved by separating the exogenous variables that should enter the model with plus signs, e.g. lm(colGPA ~ hsGPA + ACT).

Two other commands are new and worth taking a minute to think about:

  • rm(list = ls()): Sometimes it is helpful to keep your memory clean in order to keep an overview of the data you use. This command removes all objects from your current workspace (the list shown in the upper-right pane of RStudio), i.e. it deletes the data that R has temporarily stored in memory. ls() returns the names of all objects in the global environment, and the list argument tells rm to delete every object in that vector of names. In our case this is quite a long list. If you are interested, type ls(1), ls(2), etc. to list the objects at other positions of R's search path.
  • #: This is the sign you put in front of text to indicate that the rest of the line is a comment. R treats it as text that is not executed. Whenever you work with scripts you should comment what you do, both to make it easier for others to understand how your script works and to remind yourself later of what you did when you wrote the code.
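The effect of these two commands can be tried out in a minimal sketch (the object names a and b are arbitrary throwaway examples):

```r
# Create two throwaway objects and list what is in the workspace.
a <- 1:3
b <- "hello"
ls()             # in a fresh session: "a" "b"

# rm() deletes every object whose name appears in the ls() vector.
rm(list = ls())
ls()             # character(0): the workspace is empty again
```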

Before we estimate the models of the chapter we clear the memory and load the wooldridge package.

rm(list = ls())

library(wooldridge)

Example 3.1

data("gpa1")

lm_1 <- lm(colGPA ~ hsGPA + ACT, data = gpa1)
summary(lm_1)
## 
## Call:
## lm(formula = colGPA ~ hsGPA + ACT, data = gpa1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.85442 -0.24666 -0.02614  0.28127  0.85357 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1.286328   0.340822   3.774 0.000238 ***
## hsGPA       0.453456   0.095813   4.733 5.42e-06 ***
## ACT         0.009426   0.010777   0.875 0.383297    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3403 on 138 degrees of freedom
## Multiple R-squared:  0.1764, Adjusted R-squared:  0.1645 
## F-statistic: 14.78 on 2 and 138 DF,  p-value: 1.526e-06
lm_2 <- lm(colGPA ~ ACT, data = gpa1)
summary(lm_2)
## 
## Call:
## lm(formula = colGPA ~ ACT, data = gpa1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.85251 -0.25251 -0.04426  0.26400  0.89336 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.40298    0.26420   9.095  8.8e-16 ***
## ACT          0.02706    0.01086   2.491   0.0139 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3656 on 139 degrees of freedom
## Multiple R-squared:  0.04275,    Adjusted R-squared:  0.03586 
## F-statistic: 6.207 on 1 and 139 DF,  p-value: 0.0139

Example 3.2

data("wage1")

lm_3_2 <- lm(lwage ~ educ + exper + tenure, data = wage1)
summary(lm_3_2)
## 
## Call:
## lm(formula = lwage ~ educ + exper + tenure, data = wage1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.05802 -0.29645 -0.03265  0.28788  1.42809 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.284360   0.104190   2.729  0.00656 ** 
## educ        0.092029   0.007330  12.555  < 2e-16 ***
## exper       0.004121   0.001723   2.391  0.01714 *  
## tenure      0.022067   0.003094   7.133 3.29e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4409 on 522 degrees of freedom
## Multiple R-squared:  0.316,  Adjusted R-squared:  0.3121 
## F-statistic: 80.39 on 3 and 522 DF,  p-value: < 2.2e-16

Example 3.3

data("k401k")

lm_3_3 <- lm(prate ~ mrate + age, data = k401k)
summary(lm_3_3)
## 
## Call:
## lm(formula = prate ~ mrate + age, data = k401k)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -81.162  -8.067   4.787  12.474  18.256 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  80.1191     0.7790  102.85  < 2e-16 ***
## mrate         5.5213     0.5259   10.50  < 2e-16 ***
## age           0.2432     0.0447    5.44 6.21e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.94 on 1531 degrees of freedom
## Multiple R-squared:  0.09225,    Adjusted R-squared:  0.09106 
## F-statistic: 77.79 on 2 and 1531 DF,  p-value: < 2.2e-16

Example 3.4

data("gpa1")

lm_3_4 <- lm(colGPA ~ hsGPA + ACT, data = gpa1)
summary(lm_3_4)
## 
## Call:
## lm(formula = colGPA ~ hsGPA + ACT, data = gpa1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.85442 -0.24666 -0.02614  0.28127  0.85357 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1.286328   0.340822   3.774 0.000238 ***
## hsGPA       0.453456   0.095813   4.733 5.42e-06 ***
## ACT         0.009426   0.010777   0.875 0.383297    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3403 on 138 degrees of freedom
## Multiple R-squared:  0.1764, Adjusted R-squared:  0.1645 
## F-statistic: 14.78 on 2 and 138 DF,  p-value: 1.526e-06

Example 3.5

data("crime1")

lm_3_5_1 <- lm(narr86 ~ pcnv + ptime86 + qemp86, data = crime1)
lm_3_5_2 <- lm(narr86 ~ pcnv + avgsen + ptime86 + qemp86, data = crime1)
summary(lm_3_5_1)
## 
## Call:
## lm(formula = narr86 ~ pcnv + ptime86 + qemp86, data = crime1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.7118 -0.4031 -0.2953  0.3452 11.4358 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.711772   0.033007  21.565  < 2e-16 ***
## pcnv        -0.149927   0.040865  -3.669 0.000248 ***
## ptime86     -0.034420   0.008591  -4.007 6.33e-05 ***
## qemp86      -0.104113   0.010388 -10.023  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8416 on 2721 degrees of freedom
## Multiple R-squared:  0.04132,    Adjusted R-squared:  0.04027 
## F-statistic:  39.1 on 3 and 2721 DF,  p-value: < 2.2e-16
summary(lm_3_5_2)
## 
## Call:
## lm(formula = narr86 ~ pcnv + avgsen + ptime86 + qemp86, data = crime1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.9330 -0.4247 -0.2934  0.3506 11.4403 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.706756   0.033151  21.319  < 2e-16 ***
## pcnv        -0.150832   0.040858  -3.692 0.000227 ***
## avgsen       0.007443   0.004734   1.572 0.115993    
## ptime86     -0.037391   0.008794  -4.252 2.19e-05 ***
## qemp86      -0.103341   0.010396  -9.940  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8414 on 2720 degrees of freedom
## Multiple R-squared:  0.04219,    Adjusted R-squared:  0.04079 
## F-statistic: 29.96 on 4 and 2720 DF,  p-value: < 2.2e-16

Example 3.6

data("wage1")

lm_3_6 <- lm(lwage ~ educ, data = wage1)
summary(lm_3_6)
## 
## Call:
## lm(formula = lwage ~ educ, data = wage1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.21158 -0.36393 -0.07263  0.29712  1.52339 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.583773   0.097336   5.998 3.74e-09 ***
## educ        0.082744   0.007567  10.935  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4801 on 524 degrees of freedom
## Multiple R-squared:  0.1858, Adjusted R-squared:  0.1843 
## F-statistic: 119.6 on 1 and 524 DF,  p-value: < 2.2e-16

Regression through the origin (no intercept)

This is achieved by adding -1 to the formula in the lm function. Only do this if you have a good reason for it, e.g. if the population model passes through the origin.

data("crime1")

lm_no_intercept <- lm(narr86 ~ pcnv + ptime86 + qemp86 - 1, data = crime1)
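To see that the - 1 really drops the intercept, here is a self-contained sketch using R's built-in mtcars data instead of the textbook data (the variable choice mpg ~ wt is purely illustrative):

```r
# Fit the same simple model with and without an intercept.
with_int <- lm(mpg ~ wt, data = mtcars)
no_int   <- lm(mpg ~ wt - 1, data = mtcars)

names(coef(with_int))  # "(Intercept)" "wt"
names(coef(no_int))    # "wt" only: the intercept has been removed
```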

An “alternative” way to obtain slope coefficients

The textbook (p. 74) mentions an additional way to obtain a slope coefficient: regress x1 on all the other independent variables, multiply the residuals of this regression by the dependent variable, sum these products, and divide by the sum of the squared residuals. Expressing and checking this in R poses no problem, since we already know the necessary commands:

data("gpa1")

lm_alt <- lm(colGPA ~ hsGPA + ACT, data = gpa1)
summary(lm_alt)
## 
## Call:
## lm(formula = colGPA ~ hsGPA + ACT, data = gpa1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.85442 -0.24666 -0.02614  0.28127  0.85357 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1.286328   0.340822   3.774 0.000238 ***
## hsGPA       0.453456   0.095813   4.733 5.42e-06 ***
## ACT         0.009426   0.010777   0.875 0.383297    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3403 on 138 degrees of freedom
## Multiple R-squared:  0.1764, Adjusted R-squared:  0.1645 
## F-statistic: 14.78 on 2 and 138 DF,  p-value: 1.526e-06
lm_x <- lm(hsGPA ~ ACT, data = gpa1)
sum(lm_x$residuals * gpa1$colGPA) / sum(lm_x$residuals^2)
## [1] 0.4534559

Compare this value with the coefficient on hsGPA in the summary above: they are the same.
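This is the partialling-out result behind the textbook formula. A self-contained sketch with the built-in mtcars data (so it runs without the wooldridge package; the model mpg ~ wt + hp is purely illustrative) shows the same equivalence:

```r
# Full model: mpg regressed on wt and hp.
full <- lm(mpg ~ wt + hp, data = mtcars)

# Partial out hp from wt, then apply the textbook formula:
# slope = sum(r * y) / sum(r^2), where r are the residuals
# from regressing wt on the other regressor.
aux   <- lm(wt ~ hp, data = mtcars)
r     <- residuals(aux)
slope <- sum(r * mtcars$mpg) / sum(r^2)

all.equal(unname(coef(full)["wt"]), slope)  # TRUE
```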

Admittedly, this was kind of boring. Let’s see if chapter 4 on statistical inference is more exciting.