Basic plotting in R with ggplot2: A Beginner’s Guide to Time Series

Graphical illustration is essential for good presentations, good seminar papers and other forms of content-oriented communication and analysis. Humans are visual beings and, although there are those “just-show-me-numbers!” types out there, nearly everybody is happy – or at least not opposed – to see a good graph that makes it possible to grasp intuitively what you are trying to say.

The first step is always the hardest

Although there are plenty of tutorials and blog entries of undoubtedly good quality that explain how to make a nice graph in R, I have the impression that their code causes more confusion than enlightenment among students who have just started to work with it. And since I believe that I am not the only one who had this starting problem, I will try to provide a comprehensive, easy-to-follow introduction to the creation of time series graphs as they appear in (seminar) papers and other publications in the realm of economics and business.

Of course, it would be much easier to do a tutorial on the basic plot function. But I decided to use the ggplot2 package, since you make progress with it quickly, it looks much prettier, it offers a larger variety of specifications, and it is much easier to make more sophisticated graphs with it once the basic structure of the function is understood.

The three basic steps of graph creation

For the remainder of this post I suggest dividing the process of graph creation into three basic steps, which also give the following sections their structure:

  • Data acquisition: import data from your computer or the internet into R.
  • Data preparation: put your data together and prepare it for the ggplot function.
  • Plotting: specify the appearance of your plot.

But before we can proceed, you have to make sure that your installation of R already contains the packages that we are going to use. If you are totally new to R, visit this page, where you will find all the information needed to set up your installation properly (and hopefully quickly). Once RStudio runs on your system, paste the following code into a script and run it.

install.packages(c("reshape2","ggplot2")) # If needed, install the required packages
library(reshape2) # Package for more effective data transformation
library(ggplot2) # Package for plotting
library(grid) # Package for certain plot specifications
library(scales) # Package for certain plot specifications

Data acquisition

It is simple: in order to plot something you need data. Usually, R-related articles use random number generators to create a sample or they use built-in data sets like data(cars). Since most students in business and economics deal with spreadsheet data, I chose a different approach and build our sample on Federal Reserve Bank data, which can be obtained directly from the internet as csv files. The following lines show how this works in R for the 3-month Treasury Bill rate and the U.S. Consumer Price Index. Since we are going to read csv files, I will use the read.csv function. If you are not familiar with it, you might find this introduction useful. I also recommend reading this blog entry on importing data from files with different formats into R. Read the two series by running the following:

m3 <- read.csv("https://research.stlouisfed.org/fred2/data/TB3MS.csv") # Load the 3-month Treasury Bill rate
cpi <- read.csv("https://research.stlouisfed.org/fred2/data/CPIAUCSL.csv") # Load consumer price index for the U.S.
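If the download fails, or if you just want to see what read.csv returns, the following sketch reads a few made-up lines in the same two-column layout. The column name VALUE and all numbers are assumptions for illustration; only the column name DATE is relied upon later in this post.

```r
# Made-up csv text with a date column and a value column (illustrative numbers only)
csv_text <- "DATE,VALUE
2000-01-01,5.32
2000-02-01,5.55
2000-03-01,5.69"

toy <- read.csv(text = csv_text) # read.csv also accepts text directly instead of a file or URL
str(toy)  # Shows the class of each column
head(toy) # Shows the first rows
```

Running str and head on “m3” and “cpi” in the same way is a quick check that the import worked as expected.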

Data preparation

The next step is to make your values comparable. Note that the values of “m3” are already in percentage terms, whereas “cpi” is an index that ranges from about 21 to above 237. Thus, we have to transform the CPI sample. This is also reasonable because economists are usually interested in the CPI growth rate, i.e. the inflation rate, and not in its level, which is practically meaningless on its own. Therefore, we take logs and the difference between every observation and the one 12 months earlier using diff(log(cpi[,2]),12) to obtain the (approximate) percentage change of a month’s CPI value relative to its value in the same month of the previous year. The part [,2] indicates that you only use the values of the second column of the “cpi” data frame. This is necessary because the function diff does not work with the character strings of the first column. Multiplying the results by 100 scales the values so that they are comparable with the interest rate series, where percentages are expressed in percentage points and not in decimals, i.e. 1 instead of 0.01 for one percent. Note that taking differences means that we lose the first 12 observations of the series. Since we want to replace the whole second column of the “cpi” sample, the new series must have the same length as the old one, so we add 12 empty (NA) values at its beginning with c(rep(NA,12),...). This is all done in a single line:

cpi[,2] <- c(rep(NA,12),diff(log(cpi[,2]),12)*100)
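To see what this line does, here is a minimal sketch on a hypothetical index that grows by exactly 2 percent per year. The series and the name index are made up for illustration and are not part of the FRED data.

```r
# Hypothetical index that grows by exactly 2 percent per year, observed monthly for 3 years
index <- 100 * 1.02^((0:35) / 12)

# Log difference to the same month of the previous year, in percent
growth <- diff(log(index), 12) * 100

# Pad with 12 NAs so that the result has the same length as the original series
growth <- c(rep(NA, 12), growth)

length(growth) == length(index) # TRUE
round(growth[13], 2)            # About 1.98, since log(1.02) * 100 is roughly 1.98
```

The small gap between 1.98 and 2 is the reason why the log difference is called an approximate percentage change.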

As a next step we have to re-format the data so that the ggplot function can interpret it. For this purpose we have to put the samples together into a single data frame with one column for the date and additional columns for each variable. This is achieved with the merge function, where we have to specify three things:

  1. The objects we want to merge: “m3” and “cpi”.
  2. A variable that both objects have in common to specify the by-option. In our example this is “DATE”. It is used to attribute the observations of different length from “m3” and “cpi” to the right date.
  3. The option all=TRUE, which keeps all dates and not only those for which both m3 and cpi have a value. Where only one of the two variables has a value for a given date, an empty value (NA) is generated for the other.

dat <- merge(m3,cpi,by="DATE",all=TRUE) # Put the sample together into a single frame and order it by date
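The effect of all=TRUE is easiest to see on two small made-up data frames. The names a, b, x and y below are hypothetical and only serve this sketch.

```r
# Two made-up series that start and end at different dates
a <- data.frame(DATE = c("2000-01-01", "2000-02-01", "2000-03-01"), x = c(1, 2, 3))
b <- data.frame(DATE = c("2000-02-01", "2000-03-01", "2000-04-01"), y = c(10, 20, 30))

merge(a, b, by = "DATE")             # Only the two dates that appear in both frames
merge(a, b, by = "DATE", all = TRUE) # All four dates, with NA where a value is missing
```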

Since the date will be plotted on the x-axis, it is convenient to set its class to “Date” so that ggplot already knows how to deal with it. Further, rename the columns to keep track of the data.

dat[,1] <- as.Date(dat[,1]) # Turn class of the first column into dates.
names(dat) <- c("Date","3M","Inflation") # Rename the columns

Although the resulting object is a standard data frame, as commonly used for model estimation with lm and more sophisticated estimators, ggplot requires a “slightly” different table structure. Think of it as a long sample with three columns:

  • The first column contains the date series, repeated for each variable, i.e. for each additional column of the old sample.
  • The second column consists of the variable names, where each name appears as often as the number of its available observations.
  • The third column contains the variable values.

This is done with the melt function, where you have to specify the object that should be transformed, the id variable that identifies each observation (id.vars="Date") and whether empty values should be omitted (na.rm=TRUE), which is done in our example.

dat <- melt(dat,id.vars="Date",na.rm=TRUE)
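As a minimal sketch of what melt returns, consider a made-up wide table with two variables. The names wide, long, A and B are assumptions for illustration.

```r
library(reshape2)

# Made-up wide table: one date column plus one column per variable
wide <- data.frame(Date = as.Date(c("2000-01-01", "2000-02-01")),
                   A = c(1, 2),
                   B = c(NA, 4))

long <- melt(wide, id.vars = "Date", na.rm = TRUE)
long # Columns Date, variable and value; the row with the NA is dropped
```

The column names variable and value are the defaults of melt, which is why they show up in the plotting code of this post.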

That’s it. Now we are ready for plotting.

An alternative/quicker approach:

Admittedly, we could have done all this a bit faster. I just took the longer way to include the melt function, which I find extremely useful when data is provided column-wise. If you are only interested in plotting, the appropriate table for ggplot can be produced with the following lines:

dat <- rbind(m3,cpi) # Append the CPI series to the interest rate series
dat <- cbind(dat,data.frame(c(rep("3M",nrow(m3)),rep("Inflation",nrow(cpi))))) # Add column with variable names
dat[,1] <- as.Date(dat[,1]) # Change class of the first column to "Date"
names(dat) <- c("Date","value","variable") # Rename the columns
dat <- na.omit(dat) # Omit empty values

Plotting

Basically, we could run the following line to finish.

ggplot(dat,aes(x=Date,y=value,linetype=variable)) + geom_line()

Figure 1

We use the ggplot function and specify the sample with the data. We also have to add so-called “aesthetics” with aes(), where we set the data used for the axes – the “Date” column in dat for the x-axis and the “value” column for the y-axis – and the column in dat that contains the series names, i.e. the “variable” column. Note that using linetype as the indicator for the series produces lines with different dashing. If we use colour instead, we will get a graph where the series are plotted with different colours. Last, the command + geom_line() is added to indicate that a line should be plotted.
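The colour variant might look like the following sketch. To keep it self-contained it uses a small made-up table called toy with the same column names as dat; with the real data you would simply pass dat instead.

```r
library(ggplot2)

# Made-up long-format data with the same column names as dat
toy <- data.frame(Date = rep(as.Date(c("2000-01-01", "2001-01-01", "2002-01-01")), 2),
                  variable = rep(c("A", "B"), each = 3),
                  value = c(1, 2, 3, 3, 2, 1))

# Same call as before, but the series are distinguished by colour instead of dashing
p <- ggplot(toy, aes(x = Date, y = value, colour = variable)) + geom_line()
p
```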

This is the basic structure of a ggplot2 plot. First, you create a general environment with the ggplot command, which is then followed by further “subcommands” that are linked by a plus sign.
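Because the pieces are simply added up, a plot can also be stored in an object and extended step by step. The following is a sketch on made-up data; the object names toy, p0, p1 and p2 are hypothetical.

```r
library(ggplot2)

# Made-up long-format data with the same column names as dat
toy <- data.frame(Date = rep(as.Date(c("2000-01-01", "2001-01-01", "2002-01-01")), 2),
                  variable = rep(c("A", "B"), each = 3),
                  value = c(1, 2, 3, 3, 2, 1))

p0 <- ggplot(toy, aes(x = Date, y = value, linetype = variable)) # General environment, nothing drawn yet
p1 <- p0 + geom_line()                                           # Add the lines
p2 <- p1 + labs(title = "Toy example")                           # Add a further subcommand, here a title
p2 # Printing the object draws the plot
```

Storing intermediate steps like this makes it easy to try out one additional subcommand at a time.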

The result of the above code looks quite neat. But if we look at the plot a bit closer, we find some aspects that do not look very pretty. We might not like the labels of the axes, the missing title, the look and labels of the legend, the shaded background, the grid lines, the empty space to the left and right of the series etc. Although it does not solve all of those problems, we can easily help ourselves here by using one of the themes that come with the ggplot2 package. Here I use the “classic” theme by adding theme_classic().

ggplot(dat,aes(x=Date,y=value,linetype=variable)) + geom_line() + theme_classic()

Figure 2

In figure 2 we got rid of the shaded background, the grid lines and the shaded background of the legend keys. But this is still not enough. The following lines provide some sort of template, which can be used to alter the aspects that I consider most essential in a time series plot. It produces the plot shown in figure 3.

ggplot(dat,aes(x=Date,y=value,linetype=variable)) + 
  geom_abline(intercept=0,slope=0,colour="grey80") + # Create a horizontal line with value 0
  geom_line(size=.5) + # Create line with series and specify its thickness
  labs(x="Year",y="Rate", # Rename the x- and y- axis
       title="3-month Treasury Bill Rate and U.S. Inflation\n", # Title
       linetype="Legend:") + # Title of the legend
  coord_cartesian(xlim=c(min(dat[,1]),max(dat[,1])),ylim=c(-4,17)) + # Set displayed area and get rid of unused space
  guides(linetype=guide_legend()) + # Set the variables contained in the legend
  scale_x_date(labels = date_format("%Y"), breaks = date_breaks("5 years")) + # Rescale the x-axis
  theme(legend.position="bottom", # Position of the legend
        legend.key=element_rect(fill="white"), # Set background of the legend keys
        panel.background = element_rect(fill = "white"), # Set background of the graph
        axis.line=element_line(size=.3,colour="black"), # Set the size and colour of the axes
        axis.text=element_text(colour="black"), # Set the colour of the axes text
        panel.grid=element_blank()) # Set grid lines off
  • geom_abline(intercept=0,slope=0,colour="grey80"): Create a horizontal line with value 0. It intercepts the y-axis at value zero and has a slope of zero, i.e. runs horizontally. The colour of the line is set to grey80, a light shade of grey; the number indicates how light the grey is on a scale from 0 (black) to 100 (white), not its transparency.
  • geom_line(size=.5): Create the lines of the series and specify their thickness. Note that this command has to follow the geom_abline command, since layers are drawn in the order in which they are added. Otherwise the horizontal line would be drawn over the series.
  • labs(x="Year",y="Rate",title="3-month Treasury Bill Rate and U.S. Inflation\n",linetype="Legend:"): Set the labels of the axes and the title, where \n adds a line break after the title so that it sits a bit higher above the plot area. linetype="Legend:" specifies the title of the legend.
  • coord_cartesian(xlim=c(min(dat[,1]),max(dat[,1])),ylim=c(-4,17)): This helps to get rid of the unused space to the left and right of the series in the above figures. We set the limits of the displayed x-axis to the minimum and maximum values of the date series. The range of the y-axis is set manually by trial and error.
  • guides(linetype=guide_legend()): Set the variables contained in the legend. The variables in the legend are the same as those used by linetype.
  • scale_x_date(labels = date_format("%Y"), breaks = date_breaks("5 years")): Alter the axis text so that ticks indicate five-year intervals by specifying the breaks option, and use labels = date_format("%Y") so that only years appear in the axis text.
  • theme(): Within this command further options can be altered.
  • legend.position="bottom": Position of the legend.
  • legend.key=element_rect(fill="white"): Set the background of the legend keys.
  • panel.background = element_rect(fill = "white"): Set background of the graph.
  • axis.line=element_line(size=.3,colour="black"): Set the size and colour of the axes.
  • axis.text=element_text(colour="black"): Set the colour of the axes’ text.
  • panel.grid=element_blank(): Set the colour etc. of grid lines. In this example they are omitted.

Feel free to play with all those options. I hope you will find the right design for your personal graphs.

Figure 3

If you are more interested in ggplot2 now, you might like the homepage of the package, where you find plenty of further plot functions with good examples.
