Statistical Inference

Before we start, let’s clear our memory, set the working directory and load the wooldridge data package.

rm(list = ls())

setwd("C:/path/to/the/working/directory")

library(wooldridge)
data(wage1)

lm.1 <- lm(lwage ~ educ + exper + tenure, data = wage1)
s.1 <- summary(lm.1)
s.1
## 
## Call:
## lm(formula = lwage ~ educ + exper + tenure, data = wage1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.05802 -0.29645 -0.03265  0.28788  1.42809 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.284360   0.104190   2.729  0.00656 ** 
## educ        0.092029   0.007330  12.555  < 2e-16 ***
## exper       0.004121   0.001723   2.391  0.01714 *  
## tenure      0.022067   0.003094   7.133 3.29e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4409 on 522 degrees of freedom
## Multiple R-squared:  0.316,  Adjusted R-squared:  0.3121 
## F-statistic: 80.39 on 3 and 522 DF,  p-value: < 2.2e-16

In order to be able to use the values of the coefficients for further calculations, you can use copy and paste with the numbers provided in the summary s.1. Alternatively, and for the sake of more precision, you can access the coefficients directly. If you use the names function and run names(s.1), you will notice a subdirectory called coefficients, which contains the coefficients, standard errors, t-statistics and p-values of the regression. Entering s.1$coefficients or coefficients(s.1) gives those values for all coefficients, but you can also extract them separately using the [row, column] signs, where the first value defines the row index and the second the column index. If no value is specified for a row or column, R will use all available values. This happens in the third line of the following code, where the t-values are calculated from the betas and their respective standard errors. The forth line of the code specifies a certain row which means that the t-value is calculated only for the parameter listed third in the coefficient list.

names(s.1)
##  [1] "call"          "terms"         "residuals"     "coefficients" 
##  [5] "aliased"       "sigma"         "df"            "r.squared"    
##  [9] "adj.r.squared" "fstatistic"    "cov.unscaled"
s.1$coefficients
##                Estimate  Std. Error   t value     Pr(>|t|)
## (Intercept) 0.284359541 0.104190379  2.729230 6.562466e-03
## educ        0.092028988 0.007329923 12.555246 8.824197e-32
## exper       0.004121109 0.001723277  2.391437 1.713562e-02
## tenure      0.022067218 0.003093649  7.133070 3.294407e-12
s.1$coefficient[, 1] / s.1$coefficient[, 2]
## (Intercept)        educ       exper      tenure 
##    2.729230   12.555246    2.391437    7.133070
s.1$coefficient[3, 1] / s.1$coefficient[3, 2]
## [1] 2.391437

The following code illustrates the differences in statistical significance that can result from the use of log-values.

Example 4.2

data("meap93")

lm.1 <- lm(math10 ~ totcomp + staff + enroll, data = meap93)
summary(lm.1)
## 
## Call:
## lm(formula = math10 ~ totcomp + staff + enroll, data = meap93)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -22.235  -7.008  -0.807   6.097  40.689 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.2740209  6.1137938   0.372    0.710    
## totcomp      0.0004586  0.0001004   4.570 6.49e-06 ***
## staff        0.0479199  0.0398140   1.204    0.229    
## enroll      -0.0001976  0.0002152  -0.918    0.359    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.24 on 404 degrees of freedom
## Multiple R-squared:  0.05406,    Adjusted R-squared:  0.04704 
## F-statistic: 7.697 on 3 and 404 DF,  p-value: 5.179e-05
lm.2 <- lm(math10 ~ ltotcomp + lstaff + lenroll, data = meap93)
summary(lm.2)
## 
## Call:
## lm(formula = math10 ~ ltotcomp + lstaff + lenroll, data = meap93)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -22.735  -6.838  -0.835   6.139  39.718 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -207.6648    48.7031  -4.264 2.50e-05 ***
## ltotcomp      21.1550     4.0555   5.216 2.92e-07 ***
## lstaff         3.9800     4.1897   0.950   0.3427    
## lenroll       -1.2680     0.6932  -1.829   0.0681 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.18 on 404 degrees of freedom
## Multiple R-squared:  0.06538,    Adjusted R-squared:  0.05844 
## F-statistic:  9.42 on 3 and 404 DF,  p-value: 4.974e-06

Example 4.3

data("gpa1")

lm.1 <- lm(colGPA ~ hsGPA + ACT + skipped, data = gpa1)
summary(lm.1)
## 
## Call:
## lm(formula = colGPA ~ hsGPA + ACT + skipped, data = gpa1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.85698 -0.23200 -0.03935  0.24816  0.81657 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.38955    0.33155   4.191 4.95e-05 ***
## hsGPA        0.41182    0.09367   4.396 2.19e-05 ***
## ACT          0.01472    0.01056   1.393  0.16578    
## skipped     -0.08311    0.02600  -3.197  0.00173 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3295 on 137 degrees of freedom
## Multiple R-squared:  0.2336, Adjusted R-squared:  0.2168 
## F-statistic: 13.92 on 3 and 137 DF,  p-value: 5.653e-08

The next examples illustrate how the t-statistic is calculated in R if you want to know whether a coefficient is different from a value other than zero. This works by obtaining the coefficients from the regression summary, subtracting the value against which it should be tested and diving the result by the standard error.

Example 4.4

data("campus")

lm.1 <- lm(lcrime ~ lenroll, data = campus)
s.1 <- summary(lm.1)
s.1
## 
## Call:
## lm(formula = lcrime ~ lenroll, data = campus)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.5136 -0.3858  0.1174  0.4363  2.5782 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -6.6314     1.0335  -6.416 5.44e-09 ***
## lenroll       1.2698     0.1098  11.567  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8946 on 95 degrees of freedom
## Multiple R-squared:  0.5848, Adjusted R-squared:  0.5804 
## F-statistic: 133.8 on 1 and 95 DF,  p-value: < 2.2e-16
(s.1$coefficients[2,1] - 1) / s.1$coefficients[2,2]
## [1] 2.45737

Example 4.5

data("hprice2")

names(hprice2)
##  [1] "price"    "crime"    "nox"      "rooms"    "dist"     "radial"  
##  [7] "proptax"  "stratio"  "lowstat"  "lprice"   "lnox"     "lproptax"
ldist <- log(hprice2$dist)

lm.1 <- lm(lprice ~ lnox + ldist + rooms + stratio, data = hprice2)
s.1 <- summary(lm.1)
s.1
## 
## Call:
## lm(formula = lprice ~ lnox + ldist + rooms + stratio, data = hprice2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.05890 -0.12427  0.02128  0.12882  1.32531 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 11.083862   0.318111  34.843  < 2e-16 ***
## lnox        -0.953539   0.116742  -8.168 2.57e-15 ***
## ldist       -0.134339   0.043103  -3.117  0.00193 ** 
## rooms        0.254527   0.018530  13.736  < 2e-16 ***
## stratio     -0.052451   0.005897  -8.894  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.265 on 501 degrees of freedom
## Multiple R-squared:  0.584,  Adjusted R-squared:  0.5807 
## F-statistic: 175.9 on 4 and 501 DF,  p-value: < 2.2e-16
(s.1$coefficients[2, 1] + 1) / s.1$coefficients[2, 2]
## [1] 0.3979827

Example 4.6

data("k401k")

lm.1 <- lm(prate ~ mrate + age + totemp, data = k401k)
summary(lm.1)
## 
## Call:
## lm(formula = prate ~ mrate + age + totemp, data = k401k)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -77.698  -8.074   4.716  12.505  30.307 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  8.029e+01  7.777e-01 103.242  < 2e-16 ***
## mrate        5.442e+00  5.244e-01  10.378  < 2e-16 ***
## age          2.692e-01  4.514e-02   5.963 3.07e-09 ***
## totemp      -1.291e-04  3.666e-05  -3.521 0.000443 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.88 on 1530 degrees of freedom
## Multiple R-squared:  0.09954,    Adjusted R-squared:  0.09778 
## F-statistic: 56.38 on 3 and 1530 DF,  p-value: < 2.2e-16

The following code limits the used data to those observations, where the year is 1987 and the value for union is zero. The specification of data = tells R to take only those observations into consideration which meet these conditions. You could read this command in a way that R takes data from jtrain, where in each row - note that the second part of the [,] sign is empty! - the conditions x and y have to hold.

Example 4.7

data("jtrain")

lm.1 <- lm(lscrap ~ hrsemp + lsales + lemploy, data = jtrain[jtrain$year==1987 & jtrain$union==0,])
summary(lm.1)
## 
## Call:
## lm(formula = lscrap ~ hrsemp + lsales + lemploy, data = jtrain[jtrain$year == 
##     1987 & jtrain$union == 0, ])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.6301 -0.7523 -0.4016  0.8697  2.8273 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept) 12.45837    5.68677   2.191   0.0380 *
## hrsemp      -0.02927    0.02280  -1.283   0.2111  
## lsales      -0.96203    0.45252  -2.126   0.0436 *
## lemploy      0.76147    0.40743   1.869   0.0734 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.376 on 25 degrees of freedom
##   (97 observations deleted due to missingness)
## Multiple R-squared:  0.2624, Adjusted R-squared:  0.1739 
## F-statistic: 2.965 on 3 and 25 DF,  p-value: 0.05134

The next example shows how you can calculate confidence intervals from estimation results. For this purpose you can use the confint function which needs the stored regression results (here lm.1) and the level of confidence specified by the argument level = something.

Example 4.8

data("rdchem")

lm.1 <- lm(lrd ~ lsales + profmarg, data = rdchem)
summary(lm.1)
## 
## Call:
## lm(formula = lrd ~ lsales + profmarg, data = rdchem)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.97681 -0.31502 -0.05828  0.39020  1.21783 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -4.37827    0.46802  -9.355 2.93e-10 ***
## lsales       1.08422    0.06020  18.012  < 2e-16 ***
## profmarg     0.02166    0.01278   1.694    0.101    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5136 on 29 degrees of freedom
## Multiple R-squared:  0.918,  Adjusted R-squared:  0.9123 
## F-statistic: 162.2 on 2 and 29 DF,  p-value: < 2.2e-16
# 10% confidence
confint(lm.1, level = .90)
##                       5 %      95 %
## (Intercept) -5.173496e+00 -3.583051
## lsales       9.819409e-01  1.186499
## profmarg    -6.362296e-05  0.043375
# 5% confidence
confint(lm.1, level = .95)
##                    2.5 %      97.5 %
## (Intercept) -5.335478679 -3.42106808
## lsales       0.961107260  1.20733251
## profmarg    -0.004487725  0.04779911

Equations 4.21 & 4.27

data("twoyear")

lm.1 <- lm(lwage ~ jc + univ + exper, data = twoyear)
s.1 <- summary(lm.1)

lm.2 <- lm(lwage ~ jc + totcoll + exper, data = twoyear)
s.2 <- summary(lm.2)

(s.1$coefficient[2, 1] - s.1$coefficient[3, 1]) / s.2$coefficient[2, 2]
## [1] -1.467657

Next, we want to calculate the F-test to evaluate the joint significance of independent variables. For this purpose you have to exert the regression of the restricted and the unrestricted model and store the results. Then, use the anova function. Specify which two regressions have to be used and let R do the rest.

Equations 4.31 & 4.33

data("mlb1")

# unrestricted model
lm.1 <- lm(lsalary ~ years + gamesyr + bavg + hrunsyr + rbisyr, data = mlb1)
# restricted model
lm.2 <- lm(lsalary ~ years + gamesyr, data = mlb1)
# f-test
anova(lm.1, lm.2)
## Analysis of Variance Table
## 
## Model 1: lsalary ~ years + gamesyr + bavg + hrunsyr + rbisyr
## Model 2: lsalary ~ years + gamesyr
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1    347 183.19                                  
## 2    350 198.31 -3   -15.125 9.5503 4.474e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Example 4.9

data("bwght")

lm.1 <- lm(bwght ~ cigs + parity + faminc + motheduc + fatheduc, data = bwght)
lm.2 <- lm(bwght ~ cigs + parity + faminc, data = bwght[is.na(bwght$motheduc) != TRUE & is.na(bwght$fatheduc) != TRUE,])
anova(lm.1, lm.2)
## Analysis of Variance Table
## 
## Model 1: bwght ~ cigs + parity + faminc + motheduc + fatheduc
## Model 2: bwght ~ cigs + parity + faminc
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1   1185 464041                           
## 2   1187 465167 -2   -1125.7 1.4373  0.238

Equation 4.48

data("hprice1")

lm.1 <- lm(lprice ~ lassess + llotsize + lsqrft + bdrms, data = hprice1)

lprice <- hprice1$lprice - hprice1$lassess
lm.2 <- lm(lprice ~ 1)
anova(lm.1, lm.2)
## Analysis of Variance Table
## 
## Model 1: lprice ~ lassess + llotsize + lsqrft + bdrms
## Model 2: lprice ~ 1
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1     83 1.8215                           
## 2     87 1.8801 -4  -0.05862 0.6678 0.6162

Example 4.10

data("meap93")

bs <- meap93$benefits / meap93$salary
lm.1 <- lm(lsalary ~ bs, data = meap93)
lm.2 <- lm(lsalary ~ bs + lenroll + lstaff, data = meap93)
lm.3 <- lm(lsalary ~ bs + lenroll + lstaff + droprate + gradrate, data = meap93)

summary(lm.1)
## 
## Call:
## lm(formula = lsalary ~ bs, data = meap93)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.43386 -0.10537 -0.00963  0.09673  0.51864 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 10.52318    0.04156 253.203  < 2e-16 ***
## bs          -0.82540    0.19989  -4.129 4.42e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1514 on 406 degrees of freedom
## Multiple R-squared:  0.0403, Adjusted R-squared:  0.03794 
## F-statistic: 17.05 on 1 and 406 DF,  p-value: 4.422e-05
summary(lm.2)
## 
## Call:
## lm(formula = lsalary ~ bs + lenroll + lstaff, data = meap93)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.29251 -0.08885 -0.01611  0.07210  0.39151 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 10.843840   0.251643  43.092  < 2e-16 ***
## bs          -0.604772   0.165368  -3.657 0.000289 ***
## lenroll      0.087397   0.007346  11.897  < 2e-16 ***
## lstaff      -0.222033   0.050077  -4.434 1.19e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1246 on 404 degrees of freedom
## Multiple R-squared:  0.3527, Adjusted R-squared:  0.3479 
## F-statistic: 73.39 on 3 and 404 DF,  p-value: < 2.2e-16
summary(lm.3)
## 
## Call:
## lm(formula = lsalary ~ bs + lenroll + lstaff + droprate + gradrate, 
##     data = meap93)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.28711 -0.08902 -0.01949  0.07255  0.39754 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 10.7384651  0.2582651  41.579  < 2e-16 ***
## bs          -0.5893195  0.1648739  -3.574 0.000394 ***
## lenroll      0.0881204  0.0073240  12.032  < 2e-16 ***
## lstaff      -0.2182779  0.0499503  -4.370 1.58e-05 ***
## droprate    -0.0002826  0.0016145  -0.175 0.861113    
## gradrate     0.0009674  0.0006625   1.460 0.145032    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1241 on 402 degrees of freedom
## Multiple R-squared:  0.361,  Adjusted R-squared:  0.3531 
## F-statistic: 45.43 on 5 and 402 DF,  p-value: < 2.2e-16