Especially for presentations, it is helpful to plot data in order to get an intuition about the relations of interest. In regression analysis this is very often done with scatterplots, where the values of two variables are plotted against each other for each observation. For this purpose we use the plot function.
In this example I use the data set from the previous section on summary statistics.
library(foreign)  # provides read.dta() for Stata files
ceosal1 <- read.dta("ceosal1.dta")
plot(ceosal1$salary, ceosal1$roe)
A graph appears in the lower right window of RStudio. Each point represents one observation in the sample, with its salary measured on the x-axis and its ROE value on the y-axis.
Personally, I like the command scatterplotMatrix() from the car package. It automatically produces a series of scatterplots with regression lines and smooths for all variables. But since this requires a lot of calculations, you might want to make sure that the command does not include too many variables, or that you have a good computer.
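One way to keep the computation manageable is to hand scatterplotMatrix() only the variables you care about through a one-sided formula. The following is a minimal sketch on a small simulated data frame; the name demo and its columns are made up for illustration and are not the CEO data. The plot is only drawn if the car package is installed.

```r
# Simulated stand-in data; all names and values are hypothetical.
set.seed(1)
demo <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
demo$y <- 1 + 2 * demo$x1 - demo$x2 + rnorm(100)

# Restrict the scatterplot matrix to the variables listed in the formula.
if (requireNamespace("car", quietly = TRUE)) {
  car::scatterplotMatrix(~ y + x1 + x2, data = demo)
}
```

With your own data you would replace demo and the variable names in the formula with the data frame and columns you want to inspect.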
The output of scatterplotMatrix allows you to get an intuition about the correlations between the dependent variable and the independent variables, i.e., covariates, and about the correlations between covariates. This is especially useful when you want to check for multicollinearity, which can be a serious problem for OLS estimation because it inflates the standard errors of the affected coefficients.
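A numerical complement to the visual check is the correlation matrix of the covariates. The sketch below uses base R's cor() and pairs() on simulated data. All variable names are hypothetical, and x2 is deliberately constructed to be almost collinear with x1 so that the problem shows up clearly.

```r
# Simulated covariates; names and values are assumptions for illustration.
set.seed(42)
x1 <- rnorm(200)
x2 <- x1 + rnorm(200, sd = 0.1)  # nearly collinear with x1 by construction
x3 <- rnorm(200)                 # generated independently of x1 and x2
covariates <- data.frame(x1, x2, x3)

# Pairwise Pearson correlations between the covariates.
corr <- cor(covariates)
print(round(corr, 2))

# Base-R scatterplot matrix as a lightweight alternative to car.
pairs(covariates)
```

A correlation close to 1 or -1 between two covariates, as between x1 and x2 here, is a hint that multicollinearity may be an issue. Pairwise correlations cannot detect every form of it, though.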
Time series (longitudinal) data
If you have observations of the same variable over time, you might want to plot the evolution of this variable. You can also do this with the plot function. All you have to do is set the time variable as the variable of the x-axis. Nothing else changes. So, download the data, read it into R, use the plot function with the year of the observation measured on the x-axis and see what happens.
download.file('http://fmwww.bc.edu/ec-p/data/wooldridge/phillips.dta','phillips.dta',mode="wb")
phillips <- read.dta('phillips.dta')
plot(phillips$year,phillips$inf)
Well, this series of unconnected circles does not look very satisfying, does it? The reason why we have no lines between the data points is that the plot function does not differentiate between cross-sectional and time series data in the first place. Thus, we have to make a small adaptation by using the option type="l" to tell R that we want to connect the data points by lines.
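The effect of the option can be tried out without downloading anything. The following sketch uses a simulated yearly series; the data frame demo_ts and its columns are hypothetical stand-ins for the phillips data.

```r
# Hypothetical yearly series standing in for the downloaded data.
set.seed(123)
demo_ts <- data.frame(year = 1950:1999, value = cumsum(rnorm(50)))

# Default behaviour: unconnected points.
plot(demo_ts$year, demo_ts$value)

# type = "l" connects consecutive observations by lines.
plot(demo_ts$year, demo_ts$value, type = "l")

# With the data from above the call would be:
# plot(phillips$year, phillips$inf, type = "l")
```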
Better, but since a good graph should always contain a title and axis labels, you should add the arguments main, xlab and ylab to the previous call. The argument main and the following text between the quotation marks specify the title of the graph. xlab and ylab are used to set the name of the x- and y-axis, respectively.
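Put together, a labelled line plot could look like the sketch below. It again uses a simulated series, so demo_ts and the label texts are placeholders and not the original example.

```r
# Hypothetical yearly series; replace with your own data.
set.seed(123)
demo_ts <- data.frame(year = 1950:1999, value = cumsum(rnorm(50)))

plot(demo_ts$year, demo_ts$value, type = "l",
     main = "Simulated series",  # title of the graph
     xlab = "Year",              # label of the x-axis
     ylab = "Value")             # label of the y-axis
```

For the phillips data you would pass phillips$year and phillips$inf as the first two arguments and choose a title and labels that describe the inflation series.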
If you are interested in more sophisticated time series graphs, you can also go through my post on plotting time series using the
Congratulations! You successfully went through the introduction. If you want to proceed with some simple regressions, click here. Otherwise have a nice day!