Note: This page is a work in progress. I will continuously enhance it. Feel free to leave me a message if you would like to have something covered.
Before we start with any commands, make sure that R is running on your system. If this is not the case, just follow the instructions from the site on making R work.
Once R runs on your system, we can start with this crash course.
1. Setting the working directory
The very first thing you have to do is to set your working directory. This is the default directory that contains the data sets you want to work with and where R stores data when you save a set that you have created in R yourself. You set it with the setwd() command, which takes the path to the directory as its argument. Under Windows, write the path with forward slashes or doubled backslashes, since a single backslash is an escape character in R strings.
Note that the path to the working directory must be surrounded by quotation marks.
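As a minimal sketch, setting and checking the working directory might look like this. The Windows path in the comment is purely hypothetical. The runnable part uses R's temporary directory as a stand-in so that the example works on any machine.

```r
# Hypothetical Windows path, shown for illustration only:
# setwd("C:/Users/yourname/Documents/R")

# Runnable stand-in: remember the current directory, then switch to a temporary one
old_wd <- getwd()
setwd(tempdir())

# getwd() reports the directory R currently works in
print(getwd())

# Switch back so that the rest of the session is unaffected
setwd(old_wd)
```

Calling getwd() after setwd() is a quick way to confirm that R really points to the folder you intended.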
2. Loading data
If you estimate statistical models, you will definitely need data. The most common file formats of data I have encountered so far are .csv and .dta, where the first is a common file format of spreadsheets and the latter is the standard format of the widely used statistical package called “Stata”.
If you use a data set from a .csv file, make sure that the first line contains the variable names and that the following lines contain the respective values. Moreover, the values should be separated by commas, decimals should be indicated by a dot, and text should be enclosed in quotation marks. You might want to arrange this in a spreadsheet program before you import the data. R can then read the file without any further options for the reading command, so I do not cover those options in this crash course.
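To illustrate the layout described above, here is a small sketch that writes a made-up .csv file to a temporary location and reads it back with read.csv(). The file content and the names csv_path and toy_csv are my own assumptions for this example and are not part of any real data set.

```r
# Write a small, made-up CSV file to a temporary location
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  '"name","salary","roe"',
  '"Firm A",1095,14.1',
  '"Firm B",1001,10.9',
  '"Firm C",1122,23.5'
), csv_path)

# By default read.csv() expects a header line, commas as separators
# and dots as decimal marks, so no further options are needed here
toy_csv <- read.csv(csv_path)

# Inspect the structure of the imported data
str(toy_csv)
```

If your file uses semicolons and decimal commas instead, read.csv2() is the variant of the same command that expects this layout.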
Since this page is predominantly based on Wooldridge (2013), which uses “Stata” and therefore .dta files to store data, I will cover that format in this crash course.
First, download the data set into the working directory of your computer with the download.file() command.
The command loads the file from the given address and stores it as “ceosal.dta” in your working directory. The option mode="wb" specifies that the file is written to the hard drive in binary mode. This ensures that the file arrives unaltered and that you can read it on the majority of operating systems. (The same applies to .csv files.)
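The following sketch shows the shape of such a download.file() call. Since the real web address is not reproduced here, the example creates a small local file and addresses it through a file:// URL. This URL, the stand-in file and the name “ceosal_copy.dta” are assumptions made only to keep the example runnable without an internet connection.

```r
# Create a small binary stand-in file that plays the role of the remote data set
src <- tempfile(fileext = ".dta")
writeBin(as.raw(0:255), src)

# Build a file:// URL for it (a hypothetical stand-in for the real web address)
src_path <- normalizePath(src, winslash = "/")
src_url <- paste0(if (startsWith(src_path, "/")) "file://" else "file:///", src_path)

# Download in binary mode so that no byte is altered on the way to the disk
dest <- file.path(tempdir(), "ceosal_copy.dta")
download.file(src_url, destfile = dest, mode = "wb")

# Compare the sizes of the original and the downloaded file
file.size(src) == file.size(dest)
```

With a real data set you would replace src_url by the web address of the file and dest by the file name you want in your working directory.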
Now that the file is on your computer, we can tell R to read the data. But since .dta files were not originally developed for R, we have to extend its base system a little. This is quickly done by loading the package “foreign”:
library(foreign)
If this package is for some reason not already included in your installation of R, you can quickly install it by entering the following into the console:
install.packages("foreign")
Then read the data with
read.dta("ceosal.dta")
Well, this does not help much, since R only prints the data to the console. Instead run
ceosal <- read.dta("ceosal.dta")
The first part, ceosal <-, ensures that the data is not just displayed in the console but is also stored in memory, so that you can work with it.
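If you want to try the reading step without the original file, the following sketch might help. It writes a tiny data frame as a .dta file with write.dta() from the same package and reads it back. The numbers are made up and are not taken from the Wooldridge data set. The names toy, dta_path and toy_back are hypothetical as well.

```r
library(foreign)

# Made-up stand-in data, not the values of the real data set
toy <- data.frame(salary = c(1095, 1001, 1122, 578),
                  roe    = c(14.1, 10.9, 23.5, 5.9))

# Write the data frame as a Stata file so that there is something to read back
dta_path <- tempfile(fileext = ".dta")
write.dta(toy, dta_path)

# Without the assignment the data would only be printed,
# with it the data frame is kept in memory under the name toy_back
toy_back <- read.dta(dta_path)
head(toy_back)
```

Commands such as str(toy_back) or names(toy_back) then give a quick overview of the variables that were imported.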
3. Simple OLS
The basic command for linear regression models in R is lm(). Just fill the brackets with the model that you want to estimate and specify the data set in which R finds the values of the variables. In R the dependent and the independent variables are separated by a tilde (~), with the dependent variable on the left-hand side. The simplest case of a regression command in R is then the following:
lm(salary ~ roe, data = ceosal)
Like the data set, you can also store the results of the lm() command in memory. This is an essential feature, since you might want to see more than just the estimates of the intercept and the slope coefficient. Thus, run
lm.1 <- lm(salary ~ roe, data = ceosal)
to save the results and then tell R to give you a summary of the estimation results by typing
summary(lm.1)
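The following self-contained sketch runs the same kind of regression on simulated data, so that you can try it without the original file. The data frame toy, its coefficients and the object name fit are assumptions made for this illustration only.

```r
# Simulate a stand-in data set (made-up numbers, not the real CEO data)
set.seed(1)
n <- 100
toy <- data.frame(roe = runif(n, 0, 50))
toy$salary <- 900 + 20 * toy$roe + rnorm(n, sd = 300)

# Estimate the simple regression of salary on roe and store the result
fit <- lm(salary ~ roe, data = toy)

# The stored object can be queried in several ways
coef(fit)               # intercept and slope estimates
summary(fit)$r.squared  # R-squared of the regression
summary(fit)            # full summary table
```

Storing the result in an object is what makes these later queries possible. A bare lm() call would only print the two estimates once.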
4. Multiple regression
In order to estimate models with more than one independent variable, we just add another variable to the formula and separate it from the others with a plus sign.
lm.2 <- lm(salary ~ roe + sales, data = ceosal)
summary(lm.2)
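As a sketch on simulated data, a model with two regressors and the coefficient table from its summary might look as follows. The data frame toy and the object names fit2 and coef_table are hypothetical, and the numbers are made up.

```r
# Simulate a stand-in data set with two regressors (made-up numbers)
set.seed(1)
n <- 100
toy <- data.frame(roe = runif(n, 0, 50), sales = runif(n, 100, 10000))
toy$salary <- 900 + 20 * toy$roe + 0.01 * toy$sales + rnorm(n, sd = 300)

# Additional regressors are appended to the formula with a plus sign
fit2 <- lm(salary ~ roe + sales, data = toy)

# The coefficient table holds estimates, standard errors, t-values and p-values
coef_table <- coef(summary(fit2))
coef_table
```

Since coef_table is an ordinary matrix, single entries can be picked out by name, for example coef_table["sales", "t value"].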
5. t-tests and F-tests
If you enter summary(lm.2), R will already provide you with the t-statistic of each coefficient. However, in our research projects we might also be interested in the joint significance of two or more coefficients, and t-tests cannot be used for that. Instead we apply F-tests, for which we have to estimate a restricted and an unrestricted model. The unrestricted model contains all the variables that you think have an influence on the dependent variable. The restricted model contains the same variables as the unrestricted model but omits the predictors whose joint significance is to be tested.
lm.3 <- lm(salary ~ roe + sales + ros, data = ceosal)
summary(lm.3)
anova(lm.1, lm.3)
In this example, lm.3 is the unrestricted model and lm.1 from above is the restricted model. The summary shows that “sales” is statistically significant at the 10% level and that “ros” is not significant. Using anova() with the two models yields an F-statistic of approximately 1.9827 (P=0.1403). This indicates that the two coefficients are not jointly significant.
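To see what anova() does with the two models, the following sketch computes the F-statistic by hand from the residual sums of squares and compares it with the value reported by anova(). It uses simulated data, so the numbers will differ from the ones above. The data frame toy and all object names are assumptions made for this illustration.

```r
# Simulate a stand-in data set (made-up numbers, ros has no true effect here)
set.seed(2)
n <- 200
toy <- data.frame(roe   = runif(n, 0, 50),
                  sales = runif(n, 100, 10000),
                  ros   = rnorm(n, mean = 60, sd = 50))
toy$salary <- 900 + 20 * toy$roe + 0.01 * toy$sales + rnorm(n, sd = 400)

# Restricted model omits the two predictors under test
restricted   <- lm(salary ~ roe, data = toy)
unrestricted <- lm(salary ~ roe + sales + ros, data = toy)

# F = ((SSR_r - SSR_ur) / q) / (SSR_ur / df_ur), with q restrictions
ssr_r  <- sum(resid(restricted)^2)
ssr_ur <- sum(resid(unrestricted)^2)
q      <- 2
df_ur  <- df.residual(unrestricted)
f_manual <- ((ssr_r - ssr_ur) / q) / (ssr_ur / df_ur)
p_manual <- pf(f_manual, q, df_ur, lower.tail = FALSE)

# anova() reports the same statistic in the second row of its table
tab <- anova(restricted, unrestricted)
c(manual = f_manual, anova = tab$F[2])
```

Both entries of the printed vector should agree, which shows that anova() applied to two nested lm objects performs exactly this comparison of restricted and unrestricted sums of squared residuals.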