Chapter 2: Simple OLS
In R the function for basic linear regression models is lm, which is short for linear model. Its first argument is a formula describing the regression model, which has the form y ~ a. The tilde between y and a indicates that y is the dependent variable and a is the explanatory variable. It is also possible to add a further explanatory variable - for example b - to the regression by appending a plus sign followed by the name of the additional variable to the formula. In our example this would result in y ~ a + b. But since this chapter only covers models with a single explanatory variable, we postpone this issue to the next chapter.
Besides the formula, the lm function also requires you to specify the data used for the estimation. This can be done in multiple ways, but it is quite common practice to specify a data frame as the data argument, e.g. data = my_data_frame, which contains the variables that are mentioned in the model formula.
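To make the pattern concrete, here is a minimal sketch of such a call; y, a, b and my_data_frame are only the placeholder names used above, not objects that actually exist in our workspace:
# Hypothetical call: regress y on the explanatory variables a and b,
# taking all three variables from the data frame my_data_frame
lm(y ~ a + b, data = my_data_frame)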
With this knowledge we can already estimate a simple model. In the textbook this is a regression of the variable salary on the variable roe, where both variables are contained in the ceosal1 data set. We can access this data set by loading the wooldridge data package with library(wooldridge) and then loading the sample with data("ceosal1"):
library(wooldridge) # Load the data package
data("ceosal1") # Load the data
Then we can proceed with the model estimation:
lm(salary ~ roe, data = ceosal1)
##
## Call:
## lm(formula = salary ~ roe, data = ceosal1)
##
## Coefficients:
## (Intercept) roe
## 963.2 18.5
This yields the values for the constant and the coefficient on roe as they are given in the textbook. Note also that R automatically adds a constant term to the model, since linear regression models usually contain such a term. However, if it is appropriate to drop the constant, this can be done by adding 0 or -1 to the formula in lm. More information on this topic will be given in the next chapter. For now it is sufficient to know that the constant term is included automatically and, hence, it should not come as a surprise that it is shown in the estimation results.
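As a quick illustration of this formula syntax - not something we need for the textbook example - the same regression without a constant could be specified in either of the following two equivalent ways:
# Regression through the origin: suppress the constant by adding 0 ...
lm(salary ~ 0 + roe, data = ceosal1)
# ... or equivalently by adding -1 to the formula
lm(salary ~ roe - 1, data = ceosal1)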
However, the results shown above are not very conclusive, since we only see the estimates of the coefficients, but no test statistics or other information. In order to change this, execute
summary(lm(salary ~ roe, data = ceosal1))
##
## Call:
## lm(formula = salary ~ roe, data = ceosal1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1160.2 -526.0 -254.0 138.8 13499.9
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 963.19 213.24 4.517 1.05e-05 ***
## roe 18.50 11.12 1.663 0.0978 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1367 on 207 degrees of freedom
## Multiple R-squared: 0.01319, Adjusted R-squared: 0.008421
## F-statistic: 2.767 on 1 and 207 DF, p-value: 0.09777
The summary function takes the output of lm and transforms its contents so that the estimation results are more informative. This is helpful, since you can get information on the significance of your parameters as well as the general fit of your model as indicated by the Multiple R-squared value. But more on this in later chapters.
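Incidentally, the object returned by summary can itself be saved and queried. The following lines are only a sketch of this idea; sum_1 is just a name chosen here for the stored summary object:
# Save the summary object and extract single components from it
sum_1 <- summary(lm(salary ~ roe, data = ceosal1))
sum_1$coefficients # coefficient table with standard errors, t values and p values
sum_1$r.squared    # the Multiple R-squared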
Digression (you can skip this rather technical part)
Note that the lm function basically does nothing else than generate a vector of values of the dependent variable y and a matrix of explanatory variables x. If the model contains a constant, a vector of ones is added to the matrix x. After that, the coefficients are obtained with the standard OLS formula $(x'x)^{-1}x'y$ (internally, lm uses a numerically more stable QR decomposition, but the result is the same). To see that this is true, compare the result of the following code to the model results from above. They should be exactly the same.
# Extract y values and store them as a matrix
y <- matrix(ceosal1[,"salary"], ncol = 1)
# Extract x values and store them as a matrix
x <- as.matrix(cbind(1, ceosal1[,"roe"]))
# Apply the estimation formula (x'x)^(-1) x'y
solve(t(x) %*% x) %*% t(x) %*% y
## [,1]
## [1,] 963.19134
## [2,] 18.50119
The operator %*% tells R that it has to multiply vectors/matrices with each other, t(x) generates the transposed matrix of x, and the solve function calculates the inverse of a square matrix. And as you can see, the values are the same.
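If you prefer to let R check the equivalence instead of comparing the numbers by eye, one possible (purely illustrative) check is the following; beta_hat is just a name chosen here for the manually computed coefficient vector:
# Compare the manual OLS estimate with the coefficients computed by lm
beta_hat <- solve(t(x) %*% x) %*% t(x) %*% y
all.equal(as.numeric(beta_hat), unname(coef(lm(salary ~ roe, data = ceosal1))))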
End of the digression - back to the main text.
A very common way to do econometric analyses in R is to estimate a model, save it and let R give you the summary of the saved results. This looks like the following:
# Estimate and save the model as object lm_1
lm_1 <- lm(salary ~ roe, data = ceosal1)
# Show summary statistics
summary(lm_1)
##
## Call:
## lm(formula = salary ~ roe, data = ceosal1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1160.2 -526.0 -254.0 138.8 13499.9
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 963.19 213.24 4.517 1.05e-05 ***
## roe 18.50 11.12 1.663 0.0978 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1367 on 207 degrees of freedom
## Multiple R-squared: 0.01319, Adjusted R-squared: 0.008421
## F-statistic: 2.767 on 1 and 207 DF, p-value: 0.09777
The first command generates a new object that appears in the upper right window (the environment pane) of RStudio, and the summary function produces the same output as before, i.e. it adds standard errors, R-squared and other statistics to the plain coefficient estimates.
Example 2.4
Example 2.4 works exactly in the same way as the last example. Load the data set wage1 from the already loaded wooldridge package and regress wage on educ:
# Load data
data("wage1")
# Estimate
lm_1 <- lm(wage ~ educ, data = wage1)
summary(lm_1)
##
## Call:
## lm(formula = wage ~ educ, data = wage1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.3396 -2.1501 -0.9674 1.1921 16.6085
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.90485 0.68497 -1.321 0.187
## educ 0.54136 0.05325 10.167 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.378 on 524 degrees of freedom
## Multiple R-squared: 0.1648, Adjusted R-squared: 0.1632
## F-statistic: 103.4 on 1 and 524 DF, p-value: < 2.2e-16
Example 2.5
data("vote1")
lm_1 <- lm(voteA ~ shareA, data = vote1)
summary(lm_1)
##
## Call:
## lm(formula = voteA ~ shareA, data = vote1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.8919 -4.0660 -0.1682 3.4965 29.9772
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 26.81221 0.88721 30.22 <2e-16 ***
## shareA 0.46383 0.01454 31.90 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.385 on 171 degrees of freedom
## Multiple R-squared: 0.8561, Adjusted R-squared: 0.8553
## F-statistic: 1018 on 1 and 171 DF, p-value: < 2.2e-16
Example 2.6
To get the results for table 2.2, redo the script from example 2.3, i.e. re-estimate the regression of salary on roe and save it as lm_1 again (note that lm_1 was overwritten in examples 2.4 and 2.5). R does not only save the coefficients, but also some other important values. Type in names(lm_1) to get a list of the elements that are saved under lm_1. Among them are fitted.values and residuals, which represent salaryhat and uhat from the textbook, respectively.
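A minimal way to do this - re-estimating the model first so that lm_1 again refers to the regression of salary on roe - is:
# Re-estimate the regression from example 2.3 ...
lm_1 <- lm(salary ~ roe, data = ceosal1)
# ... and list the components stored in the fitted-model object
names(lm_1)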
In order to make a table, we generate a data frame whose first two columns contain the first 15 observations from the original data set and two additional columns into which we will later paste the estimated values. We define the data frame with
df_1 <- data.frame(roe = ceosal1$roe[1:15],
salary = ceosal1$salary[1:15],
salaryhat = NA,
uhat = NA)
data.frame is the function which generates the frame. roe is the label of the first column, which is defined as the values of ceosal1$roe at the positions 1 to 15, i.e. [1:15]. The structure is the same for the following part with salary. salaryhat and uhat are not yet defined and NA (not available) is used to indicate that. NA is in general the indicator of missing values in R.
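As a small aside - purely illustrative and not needed for this example - missing values propagate through most calculations unless you explicitly tell R to drop them:
# NA propagates: the mean of a vector containing NA is NA ...
mean(c(1, 2, NA))
# ... unless missing values are removed explicitly, which yields 1.5
mean(c(1, 2, NA), na.rm = TRUE)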
As a next step we paste the fitted values and the residuals from the regression into the data frame. To this end, we paste the first 15 fitted values (lm_1$fitted.values[1:15]) from the regression into the column salaryhat (df_1$salaryhat). The same method applies to the residuals. Finally, display the table with df_1 (the estimated values are shown rounded below):
df_1$salaryhat <- lm_1$fitted.values[1:15]
df_1$uhat <- lm_1$residuals[1:15]
df_1
##     roe salary salaryhat     uhat
## 1  14.1   1095  1224.058 -129.058
## 2  10.9   1001  1164.854 -163.854
## 3  23.5   1122  1397.969 -275.969
## 4   5.9    578  1072.348 -494.348
## 5  13.8   1368  1218.508  149.492
## 6  20.0   1145  1333.215 -188.215
## 7  16.4   1078  1266.611 -188.611
## 8  16.3   1094  1264.761 -170.761
## 9  10.5   1237  1157.454   79.546
## 10 26.3    833  1449.773 -616.773
## 11 25.9    567  1442.372 -875.372
## 12 26.8    933  1459.023 -526.023
## 13 14.8   1339  1237.009  101.991
## 14 22.3    937  1375.768 -438.768
## 15 56.3   2011  2004.808    6.192
Example 2.7
For example 2.7 we have to make R recall the coefficients from the regression of wage on education. Thus, we save the regression and access the saved results via the $ operator.
data("wage1")
lm_1 <- lm(wage ~ educ, data = wage1)
lm_1$coefficients
## (Intercept) educ
## -0.9048516 0.5413593
Since lm_1$coefficients is a vector, we can access each position by [#], where # is the position of the element. So, to get the intercept value we have to access the first position, and to get the coefficient of educ we have to take the second position.
# Get value for "(Intercept)"
lm_1$coefficients[1]
## (Intercept)
## -0.9048516
# Get value for educ
lm_1$coefficients[2]
## educ
## 0.5413593
Example 2.7 calculates the fitted value for a person with an average number of years of education. Recall the function mean from the chapter on summary statistics and take into account what you have learned so far about obtaining the coefficient values of an estimation. This allows us to calculate the fitted value by typing
lm_1$coefficients[1] + lm_1$coefficients[2] * mean(wage1$educ)
## (Intercept)
## 5.896103
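An alternative way to obtain the same number - shown here only as an aside - is the predict function, which evaluates the fitted model at the values supplied in the newdata argument; it should reproduce the value above:
# Predicted wage at the average number of years of education
predict(lm_1, newdata = data.frame(educ = mean(wage1$educ)))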
Example 2.8
This is the same regression as in example 2.3, but this time it is about the R-squared. Re-estimate lm_1 <- lm(salary ~ roe, data = ceosal1) as above and let R display summary(lm_1). You will find the R-squared in the second line from the bottom, called "Multiple R-squared". Rounded, it should be the same value as in [2.39] in the book.
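If you only need the number itself, the R-squared can also be extracted directly from the summary object, for instance:
# Extract the Multiple R-squared directly from the summary
summary(lm_1)$r.squared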
Example 2.9
Repeat the script from example 2.5 and check the Multiple R-squared. It is 0.8561, just like in the book.
Example 2.10 (taking logs)
The example uses the logged wage as the dependent variable. How do we get it? Just type
data("wage1")
lwage <- log(wage1$wage)
The last line is what we are looking for. log takes the natural logarithm of the variable in parentheses. In this example R calculates this value for each element of wage and saves the result as a separate object, which I named lwage.
Now we can proceed with the regression which works in the usual manner, except that lwage is the dependent variable now:
lm_1 <- lm(lwage ~ wage1$educ)
summary(lm_1)
##
## Call:
## lm(formula = lwage ~ wage1$educ)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.21158 -0.36393 -0.07263 0.29712 1.52339
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.583773 0.097336 5.998 3.74e-09 ***
## wage1$educ 0.082744 0.007567 10.935 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4801 on 524 degrees of freedom
## Multiple R-squared: 0.1858, Adjusted R-squared: 0.1843
## F-statistic: 119.6 on 1 and 524 DF, p-value: < 2.2e-16
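As a side note, the same regression can be specified more compactly by taking the logarithm directly inside the formula and passing the data frame via the data argument; this is just an equivalent alternative to the approach above:
# Equivalent specification: take logs inside the formula
summary(lm(log(wage) ~ educ, data = wage1))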
Example 2.11
In 2.11 we proceed in the same way as in 2.10. We generate the logged values of salary and sales in the same manner and estimate the model to obtain the elasticity.
lsalary <- log(ceosal1$salary)
lsales <- log(ceosal1$sales)
lm_1 <- lm(lsalary ~ lsales)
summary(lm_1)
##
## Call:
## lm(formula = lsalary ~ lsales)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.01038 -0.28140 -0.02723 0.21222 2.81128
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.82200 0.28834 16.723 < 2e-16 ***
## lsales 0.25667 0.03452 7.436 2.7e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5044 on 207 degrees of freedom
## Multiple R-squared: 0.2108, Adjusted R-squared: 0.207
## F-statistic: 55.3 on 1 and 207 DF, p-value: 2.703e-12
Example 2.12
The only new thing in this example is the data set. So load it and estimate the model:
data("meap93")
lm_1 <- lm(math10 ~ lnchprg, data = meap93)
summary(lm_1)
##
## Call:
## lm(formula = math10 ~ lnchprg, data = meap93)
##
## Residuals:
## Min 1Q Median 3Q Max
## -24.386 -5.979 -1.207 4.865 45.845
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 32.14271 0.99758 32.221 <2e-16 ***
## lnchprg -0.31886 0.03484 -9.152 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.566 on 406 degrees of freedom
## Multiple R-squared: 0.171, Adjusted R-squared: 0.169
## F-statistic: 83.77 on 1 and 406 DF, p-value: < 2.2e-16
So, this was chapter 2, where we estimated simple OLS models. But since we would like to introduce more independent variables in order to get better estimates and to avoid spurious correlation, we move on to chapter 3 on multiple regression analysis.