Simple OLS (Chapter 2)

Since I am gradually moving this page to my new website, I recommend reading this post there.

Example 2.3

Recall the script from the page on summary statistics.

setwd("path to your working directory")

Now you can run your first regression in R and estimate the effect of the return on equity on the salary of a CEO. For this purpose we use the lm function, where “lm” stands for “linear model”. Its first argument is the regression formula, followed by further options. The tilde between “salary” and “roe” indicates a “relation”: salary is the dependent variable and roe is the independent variable. R automatically includes a constant unless you tell it not to by adding -1 to the formula. Additionally, you have to tell the function which dataset it should use, which is done here with the data argument. There are other ways to do this, but the one used in this example is the most convenient. The simple regression we want to estimate is

lm(salary ~ roe, data=ceosal1)
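To see what the formula syntax does without needing the dataset, here is a minimal sketch on simulated data. The data frame toy and its values are made up for illustration and are not the ceosal1 sample; the object names fit_const and fit_noconst are my own.

```r
# Simulated stand-in for ceosal1 (hypothetical values, for illustration only)
set.seed(1)
toy <- data.frame(roe = runif(50, 0, 30))
toy$salary <- 900 + 20 * toy$roe + rnorm(50, sd = 100)

# Default: R adds a constant to the model
fit_const <- lm(salary ~ roe, data = toy)
coef(fit_const)    # two coefficients: (Intercept) and roe

# Adding -1 to the formula drops the constant
fit_noconst <- lm(salary ~ roe - 1, data = toy)
coef(fit_noconst)  # one coefficient: roe
```

Dropping the constant forces the regression line through the origin, so the slope estimate will generally differ from the one in the model with a constant.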

This yields the values for the constant and the coefficient on ROE as they are given in the textbook. However, only a few digits are displayed. To change that, use

summary(lm(salary ~ roe, data=ceosal1))

This gives more information on the values calculated by the lm function.

Digression (you can skip this rather technical part)

Note that the lm function, conceptually, does nothing more than build a vector of values of the dependent variable y and a matrix of explanatory variables x. If the model contains a constant, a column of ones is added to the matrix x. After that, the coefficients are those given by the standard estimation formula (x'x)^{-1}x'y. (Internally, lm uses a QR decomposition rather than this formula, which is numerically more stable but yields the same estimates.) Compare the result of the following code to the estimates of the previous model. They should be the same up to rounding.

y = matrix(ceosal1[,"salary"],ncol=1)
x = as.matrix(cbind(1,ceosal1[,"roe"]))

%*% is R's operator for matrix multiplication. t(x) returns the transpose of x and solve calculates the inverse of a square matrix.
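Putting these three pieces together, the estimation formula can be sketched as follows. This is an illustration on simulated data (the data frame toy is made up and stands in for ceosal1), and the name beta_hat is my own choice, not something from the original script.

```r
# Simulated stand-in for ceosal1 (hypothetical values, for illustration only)
set.seed(1)
toy <- data.frame(roe = runif(50, 0, 30))
toy$salary <- 900 + 20 * toy$roe + rnorm(50, sd = 100)

# Dependent variable as a column vector, regressors with a column of ones
y <- matrix(toy[, "salary"], ncol = 1)
x <- as.matrix(cbind(1, toy[, "roe"]))

# (x'x)^{-1} x'y
beta_hat <- solve(t(x) %*% x) %*% t(x) %*% y
beta_hat

# Compare with lm; the two should agree up to numerical precision
fit <- lm(salary ~ roe, data = toy)
all.equal(as.numeric(beta_hat), as.numeric(coef(fit)))
```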


The usual way in which estimations are done in R is that you estimate the model, save it and let R give you the summary of the saved results. This looks like the following:

lm.1<-lm(salary ~ roe, data=ceosal1)
summary(lm.1)

The first command creates a new object, which appears in the upper right window, and the summary function returns the same coefficient estimates as before together with additional values such as standard errors and the R-squared.

Example 2.4

Example 2.4 works in exactly the same manner. Download the dta-file from here, save it in your working directory and import it into R. Then, regress “wage” on “educ” using the “wage1” sample.

setwd("path of your working directory")
lm.1<-lm(wage ~ educ, data=wage1)

Example 2.5
After downloading the data from here, example 2.5 looks like this in R:

setwd("path of your working directory")
lm.1<-lm(voteA ~ shareA, data=vote1)

Example 2.6
To get the results for table 2.2, redo the script from example 2.3. R saves not only the coefficients, but also some other important values. Type in


to get a list of the components that are stored in “lm.1” as well. Among them are “fitted.values” and “residuals”, which represent “salaryhat” and “uhat”, respectively.
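One function that returns such a list is names(); whether this is the exact command used in the original script is an assumption on my part. A minimal sketch on simulated data (the data frame toy is made up and stands in for ceosal1):

```r
# Simulated stand-in for ceosal1 (hypothetical values, for illustration only)
set.seed(1)
toy <- data.frame(roe = runif(50, 0, 30))
toy$salary <- 900 + 20 * toy$roe + rnorm(50, sd = 100)

fit <- lm(salary ~ roe, data = toy)

# Names of the components stored in the fitted model object
names(fit)

# Fitted values and residuals for the first few observations
head(fit$fitted.values)
head(fit$residuals)
```

For every observation, the fitted value plus the residual gives back the observed value of the dependent variable.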

In order to make a table, we generate a data frame whose first two columns contain the first 15 observations from the original dataset, plus two additional columns into which we will later paste the estimated values. We define the data frame with

df.1<-data.frame(roe=ceosal1$roe[1:15],salary=ceosal1$salary[1:15],salaryhat=NA,uhat=NA )

data.frame is the function which generates the frame. “roe” is the label of the first column, which is defined as the values of “ceosal1$roe” at positions 1 to 15 (“[1:15]”). The structure is the same for the following part with “salary”. “salaryhat” and “uhat” are not defined yet, and “NA” (not available) is used to indicate that. NA is in general the indicator of missing values in R.

As a next step we paste the fitted values and the residuals from the regression into the data frame. To do so, we paste the first 15 fitted values (“lm.1$fitted.values[1:15]”) from the regression into the column “salaryhat” (“df.1$salaryhat”). The same method applies to the residuals. Finally, display the table with “df.1”
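The steps just described can be sketched as follows. The example runs on simulated data (the data frame toy is made up and stands in for ceosal1), and the object names fit and tab are my own, so the numbers will not match table 2.2.

```r
# Simulated stand-in for ceosal1 (hypothetical values, for illustration only)
set.seed(1)
toy <- data.frame(roe = runif(50, 0, 30))
toy$salary <- 900 + 20 * toy$roe + rnorm(50, sd = 100)

fit <- lm(salary ~ roe, data = toy)

# Frame with the first 15 observations and two empty columns
tab <- data.frame(roe = toy$roe[1:15], salary = toy$salary[1:15],
                  salaryhat = NA, uhat = NA)

# Paste the first 15 fitted values and residuals into the empty columns
tab$salaryhat <- fit$fitted.values[1:15]
tab$uhat <- fit$residuals[1:15]

# Display the table
tab
```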


Example 2.7

For example 2.7 we have to make R recall the coefficients from the regression of wage on education. Thus, we save the regression and access the saved results via the “$” operator.

lm.1<-lm(wage ~ educ, data=wage1)

Since "lm.1$coefficients" is a named vector, we can access each position by "[#]", where "#" is the position in the vector. So, to get the intercept value we have to access the first position and to get the coefficient of educ we have to take the second position.
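A minimal sketch of this kind of access, using simulated data (the data frame toy is made up and stands in for wage1, so the estimates are not those of the textbook):

```r
# Simulated stand-in for wage1 (hypothetical values, for illustration only)
set.seed(2)
toy <- data.frame(educ = sample(8:18, 60, replace = TRUE))
toy$wage <- -1 + 0.5 * toy$educ + rnorm(60)

fit <- lm(wage ~ educ, data = toy)

fit$coefficients[1]        # intercept, accessed by position
fit$coefficients[2]        # coefficient on educ, accessed by position
fit$coefficients["educ"]   # the same coefficient, accessed by name
```

Access by name is a little more robust than access by position, since it does not depend on the order of the regressors in the formula.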


Example 2.7 calculates the fitted value for a person with an average number of years of education. Recall the function mean from the chapter on summary statistics and take into account what you have learned so far about obtaining the coefficient values of an estimation. This allows us to calculate the fitted value by typing

lm.1$coefficients[1] + lm.1$coefficients[2] * mean(wage1$educ)
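As a cross-check, the same fitted value can be obtained with the predict function. The sketch below uses simulated data (the data frame toy is made up and stands in for wage1); the names by_hand and by_predict are my own.

```r
# Simulated stand-in for wage1 (hypothetical values, for illustration only)
set.seed(2)
toy <- data.frame(educ = sample(8:18, 60, replace = TRUE))
toy$wage <- -1 + 0.5 * toy$educ + rnorm(60)

fit <- lm(wage ~ educ, data = toy)

# Fitted value at average education, computed from the coefficients
by_hand <- fit$coefficients[1] + fit$coefficients[2] * mean(toy$educ)

# The same value via predict()
by_predict <- predict(fit, newdata = data.frame(educ = mean(toy$educ)))

by_hand
by_predict

# In a regression with a constant this equals the sample mean of wage
mean(toy$wage)
```

The last line reflects a general property of OLS with a constant: the regression line passes through the point of sample means.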

Example 2.8

This is the same example as 2.3, but this time it is about the R-squared. Repeat the command from above and let R display summary(lm.1). You will find the R-squared in the second line from the bottom, labelled “Multiple R-squared”. Rounded, it should be the same value as in [2.39] in the book.
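If you want the R-squared as a number instead of reading it off the printed summary, it is stored in the summary object. The sketch below uses simulated data (the data frame toy is made up and stands in for ceosal1) and also recomputes the value from its definition; the names fit_sum, ssr and sst are my own.

```r
# Simulated stand-in for ceosal1 (hypothetical values, for illustration only)
set.seed(1)
toy <- data.frame(roe = runif(50, 0, 30))
toy$salary <- 900 + 20 * toy$roe + rnorm(50, sd = 100)

fit <- lm(salary ~ roe, data = toy)
fit_sum <- summary(fit)

# R-squared as stored in the summary object
fit_sum$r.squared

# The same value from the definition 1 - SSR/SST
ssr <- sum(residuals(fit)^2)
sst <- sum((toy$salary - mean(toy$salary))^2)
1 - ssr / sst
```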

Example 2.9

Repeat the script from example 2.5 and check the multiple R-squared. It states 0.8561, just like in the book.

Example 2.10 (taking logs)

The example uses logged values as the dependent variable. How do we get them? Just type


The last line is what we are looking for. log takes the natural logarithm of the variable in parentheses. In this example R calculates this value for each position in “wage” and saves the result as a separate vector which I have named “lwage”. Note that the dataset already contains these values in a column of the same name, so the values we compute should be identical to them.
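A minimal sketch of taking logs, on a small made-up data frame (toy and its five observations are invented for illustration and are not the wage1 sample). It also shows that log can be applied directly inside the formula, which is an alternative to creating “lwage” first.

```r
# Small made-up data frame (hypothetical values, for illustration only)
toy <- data.frame(wage = c(3.10, 3.24, 6.00, 5.30, 8.75),
                  educ = c(11, 12, 11, 8, 12))

# Natural logarithm of each wage, saved as a separate vector
lwage <- log(toy$wage)
lwage

# exp() undoes log(), which recovers the original wages
exp(lwage)

# Two equivalent ways to run the regression with logged wages
fit_a <- lm(lwage ~ toy$educ)
fit_b <- lm(log(wage) ~ educ, data = toy)
coef(fit_a)
coef(fit_b)
```

Both regressions give the same estimates; only the labels of the coefficients differ.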

Now we can proceed with the regression which works in the usual manner, except that “lwage” is the dependent variable now:

lm.1<-lm(lwage ~ wage1$educ)

Example 2.11

In 2.11 we proceed the same way as in 2.10. We generate the log values of salary and sales in the same manner and estimate the model to obtain the elasticities.

lm.1<-lm(lsalary ~ lsales)
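To illustrate why the slope of a log-log model is read as an elasticity, here is a sketch on simulated data. Everything in it is an assumption for illustration: the variables are generated artificially with a true elasticity of 0.25, and they are not the ceosal1 sample.

```r
# Simulated data with a known elasticity of 0.25 (hypothetical, for illustration only)
set.seed(3)
sales <- exp(rnorm(200, mean = 8, sd = 1))
salary <- exp(4.8 + 0.25 * log(sales) + rnorm(200, sd = 0.3))

# Generate the logged variables and estimate the log-log model
lsalary <- log(salary)
lsales <- log(sales)
fit <- lm(lsalary ~ lsales)

# The slope estimates the elasticity of salary with respect to sales;
# it should be close to the 0.25 used to generate the data
coef(fit)["lsales"]
```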

Example 2.12

The only new thing in this example is the sample. So download it from here, save it in your working directory, import the data and estimate the model.

lm.1<-lm(math10 ~ lnchprg, data=meap93)

So, this was chapter 2, where we estimated simple OLS models. But since we would like to introduce more independent variables in order to get better estimates and get rid of spurious correlation, we move on to chapter 3, multiple regression analysis.



  1. Thanks for the detailed tutorial. I think the code above should be df.1<-data.frame(roe=ceosal1$roe[1:15] or else both the first and second column will display CEO salary.
