Graphical illustration is essential for good presentations, seminar papers and other forms of content-oriented communication and analysis. Humans are visual beings, and although there are those “just-show-me-numbers!” types out there, nearly everybody is happy, or at least not opposed, to see a good graph that makes it possible to grasp intuitively what you are trying to say.
The first step is always the hardest
Although there are plenty of tutorials and blog entries of undoubtedly good quality that explain how to make a nice graph in R, I have the impression that their code causes more confusion than enlightenment among students who have just started to work with it. Since I believe that I am not the only one who had this starting problem, I will try to provide a comprehensive, easy-to-follow introduction to the creation of time series graphs as they appear in (seminar) papers and other publications in the realm of economics and business.
Of course, it would be much easier to do a tutorial on the basic
plot function. But I decided to use the
ggplot2 package. Although its learning curve is steeper at first, it looks much prettier, offers a larger variety of specifications, and makes more sophisticated graphs much easier once the basic structure of the function is understood.
The three basic steps of graph creation
For the remainder of this post I suggest dividing the process of graph creation into three basic steps, which also structure what follows:
- Data acquisition: import data from your computer or the internet into R.
- Data preparation: put your data together and prepare it for plotting.
- Plotting: specify the appearance of your plot.
But before we can proceed you have to make sure that your installation of R already contains the packages which we are going to use. If you are totally new to R, visit this page, where you will find all the information needed to set up your installation properly (and hopefully quickly). Once RStudio runs on your system paste the following code into the script and run it.
install.packages(c("reshape2","ggplot2")) # If needed, install the required packages
library(reshape2) # Package for more effective data transformation
library(ggplot2) # Package for plotting
library(grid) # Package for certain plot specifications
library(scales) # Package for certain plot specifications
It is simple: in order to plot something you need data. Usually, R-related articles use random number generators to create a sample or they use built-in data sets like
data(cars). Since most students in business and economics deal with spreadsheet data, I chose a different approach and built our sample on Federal Reserve Bank data, which can be obtained directly from the internet as csv files. The following lines show how this works in R for the 3-month Treasury Bill rate and the U.S. Consumer Price Index. Since we are going to read csv files, I will use the
read.csv function. If you are not familiar with it, you might find this introduction useful. I also recommend reading this blog entry on importing data from files with different formats into R. Read the two series by running the following:
m3 <- read.csv("https://research.stlouisfed.org/fred2/data/TB3MS.csv") # Load the 3-month Treasury Bill rate
cpi <- read.csv("https://research.stlouisfed.org/fred2/data/CPIAUCSL.csv") # Load consumer price index for the U.S.
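To see what read.csv returns without depending on the download, the following sketch parses a few lines of made-up text that mimic the two-column layout (a date column and a value column) assumed for the files above. The column name VALUE and all numbers are illustrative assumptions, not actual Federal Reserve data.

```r
# Hypothetical CSV content; layout and values are assumptions for illustration
csv_text <- "DATE,VALUE
1934-01-01,0.72
1934-02-01,0.62
1934-03-01,0.24"

# read.csv accepts the content directly through its 'text' argument
toy <- read.csv(text = csv_text, stringsAsFactors = FALSE)

str(toy)          # structure: one character column, one numeric column
class(toy$DATE)   # dates arrive as character strings, not as Date objects
head(toy, 2)      # first two rows
```

Inspecting a freshly imported object with str and head in this way is a cheap check that the column names and classes are what the later steps expect.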
The next step is to make your values comparable. Note that the values of “m3” are already in percentage terms, whereas “cpi” is an index that ranges from about 21 to above 237. Thus, we have to transform the CPI sample. This is also reasonable because economists are usually interested in the CPI growth rate, i.e. the inflation rate, and not in its level, which is practically meaningless on its own. Therefore, we take logs and then the difference between each observation and the one 12 months earlier using
diff(log(cpi[,2]),12) to obtain the (approximate) percentage change of a month’s CPI value relative to its value in the same month of the previous year. The part
[,2] indicates that you only use the values of the second column of the “cpi” data frame. This is necessary because the function
diff does not work with the character strings of the first column. Multiplying the results by 100 scales the values so that they are comparable with the interest rate series, where percentages are expressed in percentage points and not in decimals, i.e. 1 percent appears as 1 instead of 0.01. Note that taking differences means that we lose 12 observations at the beginning of the series. Since we want to replace the whole second column of the “cpi” sample, we have to add 12 empty (NA) values at the beginning of the difference series with
c(rep(NA,12),...) before we can do that. This is all done in a single line:
cpi[,2] <- c(rep(NA,12),diff(log(cpi[,2]),12)*100)
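The effect of this line can be checked on a small artificial series. The sketch below uses a hypothetical index that grows by exactly 0.5 percent per month (not real CPI data) and compares the log difference with the exact year-on-year change.

```r
# Hypothetical index growing by exactly 0.5% per month over three years
index <- 100 * 1.005^(0:35)

# Log difference over 12 months, in percent; 12 values are lost at the start
log_growth <- diff(log(index), 12) * 100
length(log_growth)                      # 12 shorter than the input

# Pad with NAs so that the result lines up with the original series
padded <- c(rep(NA, 12), log_growth)
length(padded) == length(index)

# Exact year-on-year percentage change for comparison
exact_growth <- (index[13:36] / index[1:24] - 1) * 100
round(cbind(log = log_growth, exact = exact_growth)[1:3, ], 3)
```

The log difference slightly understates the exact change, and the gap widens as growth rates get larger, which is why it is called an approximation above.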
As a next step we have to re-format the data so that the
ggplot function can interpret it. For this purpose we have to put the samples together into a single data frame with one column for the date and additional columns for each variable. This is achieved with the merge function, where we have to specify three things:
- The objects we want to merge: “m3” and “cpi”.
- A variable that both objects have in common to specify the by-option. In our example this is “DATE”. It is used to assign the observations from “m3” and “cpi”, which differ in length, to the right date.
- The option
all=TRUE means that not only those observations are used for which a given date has a value for both m3 and cpi. Rather, all observations are kept, and empty values (NAs) are generated for a variable whenever only the other variable has a value for that date.
dat <- merge(m3,cpi,by="DATE",all=TRUE) # Put the sample together into a single frame and order it by date
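What all=TRUE changes is easiest to see on two tiny tables with partly overlapping dates. The data frames below are hypothetical and only mimic the assumed DATE/VALUE layout.

```r
# Two hypothetical series that overlap only in February and March
a <- data.frame(DATE = c("2000-01-01", "2000-02-01", "2000-03-01"),
                VALUE = c(5.3, 5.5, 5.7))
b <- data.frame(DATE = c("2000-02-01", "2000-03-01", "2000-04-01"),
                VALUE = c(3.2, 3.7, 3.1))

# Default: keep only dates that appear in both tables
inner <- merge(a, b, by = "DATE")

# all = TRUE: keep every date and fill the gaps with NA
outer <- merge(a, b, by = "DATE", all = TRUE)

nrow(inner)   # 2 rows
nrow(outer)   # 4 rows
outer         # value columns are suffixed .x and .y because their names clash
```

Because both tables carry a column with the same name, merge appends the suffixes .x and .y, which is one reason to rename the columns afterwards.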
Since the date will be plotted on the x-axis, it is convenient to set its class to “Date” so that
ggplot already knows how to deal with it. Further, rename the columns to keep track of the data.
dat[,1] <- as.Date(dat[,1]) # Turn class of the first column into dates.
names(dat) <- c("Date","3M","Inflation") # Rename the columns
Although the resulting object would be a standard data frame as commonly used for model estimation with
lm and more sophisticated estimators,
ggplot requires a “slightly” different table structure. Think of it as having a long sample with three columns:
- The first column contains the date series, repeated once for each variable, i.e. for each value column of the old sample.
- The second column consists of the variable names, where each name appears as often as the number of its available observations.
- The third column contains the variable values.
This is done with the
melt function, where you have to specify the object to be transformed, the id variable that identifies each observation,
id.vars="Date", and whether empty values should be omitted,
na.rm=TRUE, which is done in our example.
dat <- melt(dat,id.vars="Date",na.rm=TRUE)
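A minimal sketch of the transformation on a hypothetical three-row table shows the long structure described above. The values are made up, and check.names = FALSE is only there so that the column name 3M survives data frame creation.

```r
library(reshape2)

# Hypothetical wide table: one date column, one column per series
wide <- data.frame(
  Date = as.Date(c("2000-01-01", "2000-02-01", "2000-03-01")),
  `3M` = c(5.3, 5.5, NA),
  Inflation = c(NA, 3.2, 3.7),
  check.names = FALSE
)

# Stack the value columns; rows with missing values are dropped
long <- melt(wide, id.vars = "Date", na.rm = TRUE)

long                   # columns: Date, variable, value
levels(long$variable)  # the former column names, stored as a factor
```

Each former column name now appears once per available observation, which is exactly the layout that the plotting step expects.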
That’s it. Now we are ready for plotting.
An alternative/quicker approach:
Admittedly, we could have done all this a bit faster. I just took the longer way to include the
melt function, which I find extremely useful when data is provided column-wise. If you are only interested in plotting, the appropriate table for
ggplot can be produced with the following lines:
dat <- rbind(m3,cpi) # Append the CPI series to the interest rate series
dat <- cbind(dat,data.frame(c(rep("3M",nrow(m3)),rep("Inflation",nrow(cpi))))) # Add column with variable names
dat[,1] <- as.Date(dat[,1]) # Change class of the first column to "Date"
names(dat) <- c("Date","value","variable") # Rename the columns
dat <- na.omit(dat) # Omit empty values
Basically, we could run the following line to finish.
ggplot(dat,aes(x=Date,y=value,linetype=variable)) + geom_line()
We use the
ggplot function and specify the sample with the data. We also have to add so-called “aesthetics” with
aes(), where we set the data used for the axes – the “Date”-column in dat for the x-axis and “value”-column for the y-axis – and the column name in dat, which contains information about the series names, i.e. the “variable” column. Note that using
linetype as indicator for the series produces lines with different dashing. If we use
colour instead, we will get a graph where the series are plotted with different colours. Last, the command
+ geom_line() is added to indicate that a line should be plotted.
This is the basic structure of a
ggplot2 package plot. First, you create a general environment with the
ggplot command, which is then followed by further “subcommands” that are linked by a plus sign.
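The layered structure can be tried out on a small hypothetical data set in long format before working with the full sample. The object names and values below are made up for illustration.

```r
library(ggplot2)

# Hypothetical long-format data: two series over six months
toy <- data.frame(
  Date = rep(seq(as.Date("2000-01-01"), by = "month", length.out = 6), times = 2),
  variable = rep(c("3M", "Inflation"), each = 6),
  value = c(5.3, 5.5, 5.7, 5.7, 5.8, 5.7, 2.7, 3.2, 3.7, 3.0, 3.1, 3.7)
)

# Step 1: the general environment (data plus axis aesthetics), no layers yet
base <- ggplot(toy, aes(x = Date, y = value))

# Step 2: add a layer with +; the extra aesthetic decides how the series differ
p_dash   <- base + geom_line(aes(linetype = variable))
p_colour <- base + geom_line(aes(colour = variable))
```

Printing p_dash or p_colour (for example by typing its name in the console) draws the respective plot. Storing the base object first makes it easy to compare variants without repeating the data and axis settings.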
The result of the above code looks quite neat. But if we look at the plot a bit closer, we find some aspects, which do not look very pretty. We might not like the labels of the axes, the missing title, the look and labels of the legend, the shaded background, the grid lines, the empty space to the left and right of the series etc. Although it does not solve all of those problems, we can easily help ourselves here by using one of the themes that come with the
ggplot2 package. Here I use the “classic” theme by adding theme_classic():
ggplot(dat,aes(x=Date,y=value,linetype=variable)) + geom_line() + theme_classic()
In figure 2 we got rid of the shaded background, the grid lines and the shaded background of the legend keys. But this is still not enough. The following lines provide a sort of template that can be used to alter the aspects I consider most essential in a time series plot. It produces the plot shown in figure 3.
ggplot(dat,aes(x=Date,y=value,linetype=variable)) +
  geom_abline(intercept=0,slope=0,colour="grey80") + # Create a horizontal line with value 0
  geom_line(size=.5) + # Create line with series and specify its thickness
  labs(x="Year",y="Rate", # Rename the x- and y- axis
       title="3-month Treasury Bill Rate and U.S. Inflation\n", # Title
       linetype="Legend:") + # Title of the legend
  coord_cartesian(xlim=c(min(dat[,1]),max(dat[,1])),ylim=c(-4,17)) + # Set displayed area and get rid of unused space
  guides(linetype=guide_legend()) + # Set the variables contained in the legend
  scale_x_date(labels = date_format("%Y"), breaks = date_breaks("5 years")) + # Rescale the x-axis
  theme(legend.position="bottom", # Position of the legend
        legend.key=element_rect(fill="white"), # Set background of the legend keys
        panel.background = element_rect(fill = "white"), # Set background of the graph
        axis.line=element_line(size=.3,colour="black"), # Set the size and colour of the axes
        axis.text=element_text(colour="black"), # Set the colour of the axes text
        panel.grid=element_blank()) # Set grid lines off
geom_abline(intercept=0,slope=0,colour="grey80"): Create a horizontal line with value 0. It intercepts the y-axis at value zero and has a slope of zero, i.e. runs horizontally. The colour of the line is set to grey, where 80 selects a light shade on R's scale from grey0 (black) to grey100 (white); it does not set a transparency.
geom_line(size=.5): Create the lines of the series and specify their thickness. Note that this command has to follow the
geom_abline command, since it has to be laid above the latter. Otherwise the horizontal line would be drawn over the series.
labs(x="Year",y="Rate",title="3-month Treasury Bill Rate and U.S. Inflation\n",linetype="Legend:"): Set the labels of the axes, the title, where
\n indicates a line break, so that the title sits a bit higher above the plot area.
linetype="Legend:" specifies the title of the legend.
coord_cartesian(xlim=c(min(dat[,1]),max(dat[,1])),ylim=c(-4,17)): This helps to get rid of the unused space to the left and right of the series in the above figures. We set the limits of the displayed x-axis to the minimum and maximum values of the date series. The range of the y-axis is set manually by trial and error.
guides(linetype=guide_legend()): Set the variables contained in the legend. The legend shows the same series that are distinguished by the linetype aesthetic.
scale_x_date(labels = date_format("%Y"), breaks = date_breaks("5 years")): Alter the axis text, so that ticks indicate five year intervals by specifying the
breaks option and use
labels = date_format("%Y") so that only years appear in the axis text.
theme(): Within this command further options can be altered.
legend.position="bottom": Position of the legend.
legend.key=element_rect(fill="white"): Set the background of the legend keys.
panel.background = element_rect(fill = "white"): Set background of the graph.
axis.line=element_line(size=.3,colour="black"): Set the size and colour of the axes.
axis.text=element_text(colour="black"): Set the colour of the axes’ text.
panel.grid=element_blank(): Set the colour etc. of grid lines. In this example they are omitted.
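As a side note on the colour name used for the horizontal line, base R can show what the number in grey80 stands for. The sketch below only uses functions from the grDevices package that ships with R; the semi-transparent variant is an illustrative extra and not part of the template above.

```r
# grey0 is black and grey100 is white; grey80 is a light grey in between
col2rgb("grey80")          # red, green and blue components (0 to 255)
col2rgb("grey80") / 255    # each channel at 80% of full brightness

# Transparency is a separate alpha channel, set here with adjustcolor()
semi <- adjustcolor("grey80", alpha.f = 0.5)
semi                          # hex string with an additional alpha component
col2rgb(semi, alpha = TRUE)   # same grey, alpha below the opaque value of 255
```

A colour string like the one stored in semi can be used in place of a colour name if a see-through line is wanted.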
Feel free to play with all those options. I hope you will find the right design for your personal graphs.
If you are more interested in
ggplot2 now, you might like the homepage of the package, where you find plenty of further plot functions with good examples.