Blog

Estimating a Bayesian TVP-VAR Model in R

This post walks through the estimation of a Bayesian TVP-VAR model, where the algorithm proposed by Durbin & Koopman (2002) is used to draw samples of the parameter coefficients. It is intended as a brief, hands-on introduction to the implementation of multivariate state space models in R.

Continue reading

Is There Climate Change in Austria?

In this post I analyse the development of mean temperatures in Austrian towns to find evidence for climate change. Using band-pass filters I extract a trend from the climate data and find that, although mean monthly temperatures tended to decrease during the 50s and 60s, they have persistently increased by about 1.5 degrees Celsius over the past four decades.

Continue reading

An Exercise in Growth Accounting

This post provides a short theoretical introduction to the concept of growth accounting and uses data from the Penn World Tables (8.1) to calculate TFP growth rates for a series of countries using the simple Solow method.

Based on neoclassical growth theory, the economic discipline of growth accounting, as suggested by Solow (1957), tries to assess the relative contributions of labour, capital and technology to the growth rate of the economy. As presented in the textbook by Aghion and Howitt (2009), it departs from the idea that the economy can be described by a single production function of the form Y = AK^α L^(1-α), which in per-capita terms becomes y = Ak^α, where y = Y/L is output per capita, k = K/L is capital per capita and A is total factor productivity (TFP). Taking logs and differentiating with respect to time gives an expression for the economy’s growth rate Δy, which depends on the percentage changes of TFP, ΔA, and of capital per capita, Δk, over time:

Δy=ΔA + α Δk.

Since output, capital and labour can be observed, but A and α cannot, the latter two have to be inferred. The first approach is the so-called primal method, which assumes that α is equal to the share of capital income in national income. Once this value is obtained, the TFP growth rate can be calculated as

ΔA = Δy – α Δk.

Note that since TFP growth is the residual that remains after the contribution of capital growth has been subtracted from output growth, it is also called the Solow residual.
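As a quick numerical illustration with made-up numbers: if output per capita grows by 2 percent, capital per capita grows by 1.5 percent and the capital share is one third, TFP growth is 2 - (1/3)*1.5 = 1.5 percent. In R:

dy.ex <- 2 # Hypothetical output growth in percent
dk.ex <- 1.5 # Hypothetical capital growth in percent
alpha.ex <- 1/3 # Hypothetical capital income share
dy.ex - alpha.ex * dk.ex # Solow residual: 1.5 percent TFP growth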

Applying this method in R is straightforward, since the necessary sample can be downloaded from CRAN via install.packages("pwt8") and activated with data("pwt8.1").
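In code, with the one step the sentence glosses over made explicit (the package has to be loaded before the data set can be activated):

install.packages("pwt8") # Download the package from CRAN (only needed once)
library(pwt8) # Load the package
data("pwt8.1") # Activate the Penn World Table 8.1 data set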

Next define which country and time period you want to analyse. I chose the period from 1960 to 2000, because this is also done in Aghion and Howitt (2009) and I want to reproduce the results.

i <- "AUS"
dat <- pwt8.1[pwt8.1$isocode == i & pwt8.1$year >= 1960 & pwt8.1$year <= 2000, ]

Then calculate real output and capital per worker (in national currency), take logs and first differences to obtain percentage changes. α is one minus the share of labour compensation in GDP at current national prices. The first observation of α has to be omitted so that it has the same number of observations as output and capital growth.

dy <- diff(log(dat[,"rgdpna"]/dat[,"emp"]))*100 # Growth rate of output per worker
dk <- diff(log(dat[,"rkna"]/dat[,"emp"]))*100 # Growth rate of capital per worker
a <- 1-dat[-1,"labsh"] # Capital share, first observation omitted

Finally, calculate the Solow residual and display everything in a data frame.

dtfp <- dy-a*dk # Solow residual

data.frame("country"=i,"g"=mean(dy),"tfp"=mean(dtfp),"k"=mean(a*dk),"tfp.share"=mean(dtfp)/mean(dy),"capital.share"=mean(a*dk)/mean(dy)) # Note: a*dk is used for capital's share of growth so that the shares add up to one

This result is very similar to the numbers in table 5.1 in Aghion and Howitt (2009). However, the results differ from the table when you perform the exercise for other countries.
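To perform the exercise for other countries, the steps above can be wrapped in a small function and applied to a vector of country codes (a sketch; the codes below are arbitrary examples):

growth.acc <- function(i) {
  dat <- pwt8.1[pwt8.1$isocode == i & pwt8.1$year >= 1960 & pwt8.1$year <= 2000, ]
  dy <- diff(log(dat[,"rgdpna"]/dat[,"emp"]))*100
  dk <- diff(log(dat[,"rkna"]/dat[,"emp"]))*100
  a <- 1-dat[-1,"labsh"]
  dtfp <- dy-a*dk
  data.frame("country"=i,"g"=mean(dy),"tfp"=mean(dtfp),"k"=mean(a*dk))
}
do.call(rbind,lapply(c("AUS","USA","FRA"),growth.acc)) # One row of results per country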

Literature

Aghion, Philippe; Howitt, Peter (2009): The Economics of Growth. Cambridge, MA: The MIT Press.

Solow, Robert M. (1957): Technical Change and the Aggregate Production Function. In The Review of Economics and Statistics 39 (3), pp. 312–320. Available online at http://www.jstor.org/stable/1926047.

Extracting Business Cycles From Raw Data in R

Some sort of introduction

The analysis of business cycles requires the extraction of the cyclical component of a time series that is usually also influenced by other factors such as an underlying trend or noise. This post presents some methods that are used in the recent literature to extract business cycles from a given series. It is based on the chapter on business cycles by Stock and Watson (1999) in the Handbook of Macroeconomics. I also introduce relatively new methods, like wavelet filters and empirical mode decomposition, which are not covered in the handbook. Since the focus of this post is on the implementation of certain filter techniques in R, I will not cover the maths, but rather refer to the respective literature. For the examples, quarterly data on US real GDP is used, which I obtained from FRED via Quandl.

# Download the data via Quandl and transform it
library(Quandl)
gdp <- Quandl("FRED/GDPC1",order="asc") # Download data via Quandl
names(gdp) <- c("Time","GDP") # Rename variables
gdp[,"GDP"] <- log(gdp[,"GDP"]) # Take logs

To get an intuition about what it means to extract the cyclical component of a time series, look at the development of log real GDP over time in the following figure.

library(ggplot2)
library(reshape2)
ggplot(gdp,aes(x=Time,y=GDP)) + geom_line(size=.5) + theme_classic() + labs(title="Log Real US GDP")
Figure: Log real US GDP

There is a clear increasing trend in the data, although its slope appears to become gradually flatter towards the present. Additionally, the series seems to fluctuate around this trend in a more or less regular manner. There are very small deviations from the trend which occur very frequently, but also rather large ones which can persist over several subsequent periods. The latter are the fluctuations that are relevant for business cycle analysis.

Time detrending

A first approach to exclude a trend from a series is to regress the variable of interest on a time variable and to obtain the residual values. These are plotted in the following figure, where the linear trend was removed.

time.detrend <- residuals(lm(GDP ~ Time, data=gdp)) # Regress GDP on Time and obtain residuals

# Plot
dat <- data.frame("Time"=gdp[,"Time"],"Linearly.Detrended"=time.detrend)
ggplot(dat,aes(x=Time,y=Linearly.Detrended)) + geom_hline(yintercept=0,colour="grey80") + geom_line(size=.5) + theme_classic() + labs(title="Linearly Detrended",y="")
Figure: Linearly detrended log GDP

This approach is relatively controversial, since it assumes a constant, linear time trend. As we have seen above, this is not very likely given the steady decrease of the trend’s growth rate over time. However, it is still possible to assume a different functional form of the time trend, e.g. by adding a quadratic term, to get rid of the trend. A further disadvantage of this method is that it excludes only the trend, but not the noise, i.e. the very small fluctuations in the series.
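For instance, a quadratic trend could be removed in much the same way (a sketch, reusing the gdp object from above):

quad.detrend <- residuals(lm(GDP ~ poly(as.numeric(Time),2), data=gdp)) # Regress GDP on a quadratic time trend and obtain residuals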

First difference

The next approach is to take first differences, as is commonly taught to obtain stationary time series. This assumes that the data is difference stationary. The result of taking first differences is shown in the following figure, where it is also compared to the linearly detrended series. The differenced data fluctuates much more around the zero line, but it also contains a lot of noise.

fd <- diff(gdp[,"GDP"]) # Take first difference

# Plot
dat <- data.frame("Time"=dat[,"Time"],"First.Difference"=c(NA,fd),"Linearly.Detrended"=dat[,"Linearly.Detrended"])
g <- melt(dat,id.vars="Time",na.rm=TRUE)
levels(g[,2]) <- c("First Difference","Linearly Detrended") # The factor levels follow the column order of dat

# Define plot function
plot.cycles <- function(d,t) {
  ggplot(d,aes(x=Time,y=value,linetype=variable)) +
    geom_hline(yintercept=0,colour="grey80") + # Create a horizontal line with value 0
    geom_line(size=.5) + # Create lines for the series and specify their thickness
    labs(x="Time",y="",title=t,linetype="Legend:") + # Titles of the plot and the legend
    coord_cartesian(xlim=c(min(d[,1]),max(d[,1])),expand=FALSE) + # Set displayed area
    guides(linetype=guide_legend()) + # Set the variables contained in the legend
    theme(legend.position="bottom", # Position of the legend
          legend.key=element_rect(fill="white"), # Set background of the legend keys
          panel.background = element_rect(fill = "white"), # Set background of the graph
          axis.line=element_line(size=.3,colour="black"), # Set the size and colour of the axes
          axis.text=element_text(colour="black"), # Set the colour of the axes text
          panel.grid=element_blank()) # Set grid lines off
}

# Plot
plot.cycles(d=g,t="Linearly Detrended vs. First Difference")
Figure: Linearly detrended vs. first difference

Hodrick Prescott filter

Hodrick and Prescott (1981) developed a filter which separates a time series into a trend, a cyclical and a noise component. The hpfilter function is contained in the mFilter package and requires the time series and a smoothing parameter. The literature suggests a value of 1600 for quarterly data. However, it is possible to choose a much higher value as well. The following figure plots the cyclical component of real GDP obtained by the Hodrick-Prescott filter and compares it with the linearly detrended series. Both series behave quite similarly, except that the HP series fluctuates more around zero, whereas the linearly detrended series still includes components of the trend. Additionally, the cyclical HP series still contains some noise-like components.

library(mFilter)
hp <- hpfilter(gdp[,2],freq=1600)$cycle # Apply filter and obtain data of the cycle component
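As mentioned above, a much higher smoothing value is possible, too; a quick sketch (the value is arbitrary and only meant for illustration):

hp.smooth <- hpfilter(gdp[,2],freq=100000)$cycle # A larger smoothing parameter yields a smoother trend and larger cycle swings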

# Plot
dat <- cbind(dat,data.frame("Hodrick.Prescott"=hp))
g <- melt(dat[,c(1,4,3)],id.vars="Time",na.rm=TRUE)
levels(g[,2]) <- c("Hodrick Prescott","Linearly Detrended")

plot.cycles(g,"Hodrick Prescott vs. Linearly Detrended")
Figure: Hodrick-Prescott vs. linearly detrended

Although widely used in economics, the HP filter is also heavily criticised for some of its features. See, for example, a good overview in a Bruegel blog post and here for a recent (and a bit technical) critique.

Baxter King filter

Baxter and King (1994, 1999) proposed a filter which yields results similar to the HP filter, but which removes a lot of the noise-like behaviour shown above. The function bkfilter is also contained in the mFilter package. It requires the series, a lower and an upper bound on the number of periods over which cycles are assumed to occur (pl and pu), and a truncation parameter nfix, which fixes the number of leads and lags of the filter’s moving average. The literature (cf. NBER, Stock and Watson (1999)) suggests that business cycles last from 6 to 32 quarters. These values are used to specify the lower and upper bound of the cycle periodicity. The results of the BK filter are shown in the following figure. A relatively serious drawback of this method is that the truncation leads to the loss of observations at the beginning and the end of the series. This might be a problem with small samples.

bk <- bkfilter(gdp[,2],pl=6,pu=32,nfix=12)$cycle[,1] # Apply filter and obtain cycle component

# Plot
dat <- cbind(dat,data.frame("Baxter.King"=bk))
g <- melt(dat[,c(1,5,4)],id.vars="Time",na.rm=TRUE)
levels(g[,2]) <- c("Baxter King","Hodrick Prescott")

plot.cycles(g,"Baxter King vs. Hodrick Prescott")
Figure: Baxter-King vs. Hodrick-Prescott

Wavelet filter

Yogo (2008) proposed using wavelet filters to extract business cycles from time series data. The advantage of this method is that it allows one not only to extract the trend, cycle and noise of a series, but also to be more specific about the periods within which cycles occur. However, there is not full freedom in that, since the technique can only capture periodicities that are a power of two, i.e. 2, 4, 8, 16, 32 and so on.

The method’s implementation in R is also neat, but requires some additional data transformation before it can be used. A useful function, called mra (“multiresolution analysis”), is contained in the waveslim package. It requires a differenced version of the time series and the depth of the decomposition.

The function returns multiple series, which have to be cumulated with cumsum to translate them back into data that reflects cyclical patterns. Further, some series can be combined with rowSums. This is useful when cycles that last 8 to 16 and 16 to 32 periods should be analysed together, as is done in the following figure. Not surprisingly, the wavelet filter yields a result similar to the BK filter, since the upper bound of the cycle periods is equal in both and the lower bound differs by only two periods.

library(waveslim)

wavelet <- as.data.frame(mra(diff(gdp[,2]),J=5)) # Apply filter
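Before cumulating, it can help to inspect what mra returns. With J=5, the detail series D1 to D5 capture cycles of 2-4, 4-8, 8-16, 16-32 and 32-64 periods, while S5 contains the remaining trend:

names(wavelet) # "D1" "D2" "D3" "D4" "D5" "S5"; columns 3 and 4 are thus the 8-16 and 16-32 period cycles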

# Plot
dat <- cbind(dat,data.frame("Wavelet"=c(NA,cumsum(rowSums(wavelet[,3:4])))))
g <- melt(dat[,c(1,6,5)],id.vars="Time",na.rm=TRUE)
levels(g[,2]) <- c("Wavelet","Baxter King")

plot.cycles(g,"Wavelet vs. Baxter King")
Figure: Wavelet vs. Baxter-King

Empirical mode decomposition (EMD)

Based on Huang et al. (1998), Kožić and Sever (2014) propose empirical mode decomposition as a further method of business cycle extraction. The function emd can be found in the EMD package and requires a differenced time series, a boundary condition and a rule that specifies at which point the iterative process has achieved a sufficiently satisfying result and can stop. The results of this filter method are relatively different from those of the HP, BK and wavelet filters. It is the task of every researcher to assess whether working with this method is justified or not.

library(EMD)
emd <- as.data.frame(emd(xt=diff(gdp[,2]),boundary="wave",stoprule="type2")$imf)
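Since the number of intrinsic mode functions that emd extracts is data driven, it is worth checking the dimensions of the result before choosing which IMFs to cumulate (the column indices 3:6 used below are tied to this particular decomposition):

dim(emd) # Rows are observations, columns are the extracted IMFs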

dat <- cbind(dat,data.frame("EMD"=c(NA,cumsum(rowSums(emd[,3:6])))))
g <- melt(dat[,c(1,7,4)],id.vars="Time",na.rm=TRUE)
levels(g[,2]) <- c("EMD","Hodrick Prescott")

plot.cycles(g,"EMD vs. Hodrick Prescott")
Figure: EMD vs. Hodrick-Prescott

Literature

Baxter, M., & King, R. G. (1999). Measuring Business Cycles: Approximate Band-Pass Filters for Economic Time Series. The Review of Economics and Statistics, 81(4), 575–593.

Kožić, I., & Sever, I. (2014). Measuring business cycles: Empirical Mode Decomposition of economic time series. Economics Letters, 123(3), 287–290. doi:10.1016/j.econlet.2014.03.009

Stock, J. H., & Watson, M. W. (1999). Business Cycle Fluctuations in US Macroeconomic Time Series. In J. B. Taylor & M. Woodford (Eds.), Handbook of Macroeconomics (pp. 3–64). Elsevier.

Yogo, M. (2008). Measuring business cycles: A wavelet analysis of economic time series. Economics Letters, 100(2), 208–212. doi:10.1016/j.econlet.2008.01.008

Basic plotting in R with ggplot2: A Beginner’s Guide to Time Series

Graphical illustration is essential for good presentations, good seminar papers and other forms of content-orientated communication and analysis. Humans are visual beings, and although there are those “just-show-me-numbers!” types out there, nearly everybody is happy – or at least not opposed – to see a good graph that makes it easy to intuitively grasp what you are trying to say.

The first step is always the hardest

Although there are plenty of tutorials and blog entries of undoubtedly good quality which explain how to make a nice graph in R, I have the impression that their code causes more confusion than enlightenment among students who have just started to work with R. And since I believe that I am not the only one who had this starting problem, I will try to provide a comprehensive, easy-to-follow introduction to the creation of time series graphs as they appear in (seminar) papers and other publications in the realm of economics and business.

Of course, it would be much easier to do a tutorial on the basic plot function. But I decided to use the ggplot2 package, since, even though its learning curve is a bit steeper, it looks much prettier, offers a larger variety of specifications and makes it much easier to create more sophisticated graphs once the basic structure of the function is understood.

The three basic steps of graph creation

For the remainder of this post I suggest to divide the process of graph creation into three basic steps, according to which the following will be structured:

  • Data acquisition: import data from your computer or the internet into R.
  • Data preparation: put your data together and prepare it for the ggplot function.
  • Plotting: specify the appearance of your plot.

But before we can proceed you have to make sure that your installation of R already contains the packages which we are going to use. If you are totally new to R, visit this page, where you will find all the information needed to set up your installation properly (and hopefully quickly). Once RStudio runs on your system paste the following code into the script and run it.

install.packages(c("reshape2","ggplot2")) # If needed, install the required packages
library(reshape2) # Package for more effective data transformation
library(ggplot2) # Package for plotting
library(grid) # Package for certain plot specifications
library(scales) # Package for certain plot specifications

Data acquisition

It is simple: in order to plot something you need data. Usually, R-related articles use random number generators to create a sample or they use built-in data sets like data(cars). Since most students in business and economics deal with spreadsheet data, I choose a different approach and build our sample on Federal Reserve Economic Data, which can be obtained directly from the internet as csv files. The following lines show how this works in R for the 3-month Treasury Bill rate and the U.S. Consumer Price Index. Since we are going to read csv files, I will use the read.csv function. If you are not familiar with it, you might find this introduction useful. I also recommend reading this blog entry on importing data from files with different formats into R. Read the two series by running the following:

m3 <- read.csv("https://research.stlouisfed.org/fred2/data/TB3MS.csv") # Load the 3-month Treasury Bill rate
cpi <- read.csv("https://research.stlouisfed.org/fred2/data/CPIAUCSL.csv") # Load consumer price index for the U.S.

Data preparation

The next step is to make the values comparable. Note that the values of “m3” are already in percentage terms, whereas “cpi” is an index that ranges from about 21 to above 237. Thus, we have to transform the CPI sample. This is also reasonable because economists are usually interested in the CPI growth rate, i.e. the inflation rate, and not in its level, which is practically meaningless on its own. Therefore, we take logs and the difference between every 12th observation using diff(log(cpi[,2]),12) to obtain the (approximate) percentage change of a month’s CPI value relative to its value in the previous year. The part [,2] indicates that only the values of the second column of the “cpi” data frame are used. This is necessary because the function diff does not work with the character strings of the first column. Multiplying the results by 100 scales the values so that they are comparable with the interest rate series, where percentages are expressed in whole numbers and not in decimals, i.e. 1 percent instead of 0.01. Note that since taking differences means that we lose the first 12 observations of the series and we have to replace the whole second column of the “cpi” sample, we have to prepend 12 empty (NA) values to the difference series with c(rep(NA,12),...). This is all done in a single line:

cpi[,2] <- c(rep(NA,12),diff(log(cpi[,2]),12)*100)

As a next step we have to re-format the data so that the ggplot function can interpret it. For this purpose we have to put the samples together into a single data frame with one column for the date and additional columns for each variable. This is achieved with the merge function, where we have to specify three things:

  1. The objects we want to merge: “m3” and “cpi”.
  2. A variable that both objects have in common, which is passed to the by option. In our example this is “DATE”. It is used to match the observations of different length from “m3” and “cpi” to the right date.
  3. The option all=TRUE means that not only those observations are used for which there is a value for both m3 and cpi on a given date. Rather, all observations are kept, and empty values (NAs) are generated for one variable whenever only the other variable has a value.

dat <- merge(m3,cpi,by="DATE",all=TRUE) # Put the samples together into a single frame, matched by date

Since the date will be plotted on the x-axis, it is convenient to set its class to “Date” so that ggplot already knows how to deal with it. Further, rename the columns to keep track of the data.

dat[,1] <- as.Date(dat[,1]) # Turn class of the first column into dates.
names(dat) <- c("Date","3M","Inflation") # Rename the columns

Although the resulting object would be a standard data frame as commonly used for model estimation with lm and more sophisticated estimators, ggplot requires a “slightly” different table structure. Think of it as having a long sample with three columns:

  • The first column contains the date series, repeated for each additional variable, i.e. column, of the old sample.
  • The second column consists of the variable names, where each name appears as often as the number of its available observations.
  • The third column is made up by the variable values.
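To make this concrete, here is a toy version of such a table with invented values, just to illustrate the structure:

data.frame(Date=as.Date(c("2000-01-01","2000-01-01","2000-02-01")),
           variable=c("3M","Inflation","3M"),
           value=c(5.3,2.1,5.2)) # Toy long-format table, values made up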

This is done with the melt function, where you have to specify the object to be transformed, the id variable used to sort the data (id.vars="Date") and whether the empty values should be omitted (na.rm=TRUE), which is done in our example.

dat <- melt(dat,id.vars="Date",na.rm=TRUE)

That’s it. Now we are ready for plotting.
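A quick sanity check of the reshaped object never hurts:

str(dat) # Should show a Date column, a factor with the series names and a numeric value column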

An alternative/quicker approach:

Admittedly, we could have done all this a bit faster. I just took the longer way in order to introduce the melt function, which I deem extremely useful when data is provided column-wise. If you are only interested in plotting, the appropriate table for ggplot can be produced with the following lines:

names(m3) <- names(cpi) <- c("Date","value") # Harmonise the column names so that rbind works
dat <- rbind(m3,cpi) # Append the CPI series to the interest rate series
dat <- cbind(dat,data.frame("variable"=c(rep("3M",nrow(m3)),rep("Inflation",nrow(cpi))))) # Add column with variable names
dat[,1] <- as.Date(dat[,1]) # Change class of the first column to "Date"
dat <- na.omit(dat) # Omit empty values

Plotting

Basically, we could run the following line to finish.

ggplot(dat,aes(x=Date,y=value,linetype=variable)) + geom_line()
Figure 1

We use the ggplot function with dat as its data argument. We also have to add so-called “aesthetics” with aes(), where we set the data used for the axes – the “Date” column in dat for the x-axis and the “value” column for the y-axis – and the column in dat which contains information about the series names, i.e. the “variable” column. Note that using linetype as the indicator for the series produces lines with different dashing. If we used colour instead, we would get a graph where the series are plotted in different colours. Last, the command + geom_line() is added to indicate that a line should be plotted.

This is the basic structure of a ggplot2 package plot. First, you create a general environment with the ggplot command, which is then followed by further “subcommands” that are linked by a plus sign.

The result of the above code looks quite neat. But if we look at the plot a bit closer, we find some aspects which do not look very pretty. We might not like the labels of the axes, the missing title, the look and labels of the legend, the shaded background, the grid lines, the empty space to the left and right of the series etc. Although it does not solve all of those problems, we can easily help ourselves here by using one of the themes that come with the ggplot2 package. Here I use the “classic” theme by adding theme_classic().

ggplot(dat,aes(x=Date,y=value,linetype=variable)) + geom_line() + theme_classic()
Figure 2

In figure 2 we got rid of the shaded background, the grid lines and the shaded background of the legend keys. But this is still not enough. The following lines provide a sort of template which can be used to alter the aspects that I consider most essential in a time series plot. It produces the plot shown in figure 3.

ggplot(dat,aes(x=Date,y=value,linetype=variable)) + 
  geom_abline(intercept=0,slope=0,colour="grey80") + # Create a horizontal line with value 0
  geom_line(size=.5) + # Create line with series and specify its thickness
  labs(x="Year",y="Rate", # Rename the x- and y- axis
       title="3-month Treasury Bill Rate and U.S. Inflation\n", # Title
       linetype="Legend:") + # Title of the legend
  coord_cartesian(xlim=c(min(dat[,1]),max(dat[,1])),ylim=c(-4,17)) + # Set displayed area and get rid of unused space
  guides(linetype=guide_legend()) + # Set the variables contained in the legend
  scale_x_date(labels = date_format("%Y"), breaks = date_breaks("5 years")) + # Rescale the x-axis
  theme(legend.position="bottom", # Position of the legend
        legend.key=element_rect(fill="white"), # Set background of the legend keys
        panel.background = element_rect(fill = "white"), # Set background of the graph
        axis.line=element_line(size=.3,colour="black"), # Set the size and colour of the axes
        axis.text=element_text(colour="black"), # Set the colour of the axes text
        panel.grid=element_blank()) # Set grid lines off
  • geom_abline(intercept=0,slope=0,colour="grey80"): Create a horizontal line with value 0. It intercepts the y-axis at zero and has a slope of zero, i.e. it runs horizontally. The colour of the line is set to grey, where 80 indicates its brightness, i.e. how far the grey is shifted towards white.
  • geom_line(size=.5): Create the lines of the series and specify their thickness. Note that this command has to follow the geom_abline command, since it is drawn on top of the latter. Otherwise the horizontal line would overwrite the series.
  • labs(x="Year",y="Rate",title="3-month Treasury Bill Rate and U.S. Inflation\n",linetype="Legend:"): Set the labels of the axes, the title, where \n indicates a line break, so that it is a bit more above the plot area. linetype="Legend:" specifies the title of the legend.
  • coord_cartesian(xlim=c(min(dat[,1]),max(dat[,1])),ylim=c(-4,17)): This helps to get rid of the unused space to the left and right of the series in the above figures. We set the limits of the displayed x-axis to the minimum and maximum values of the date series. The range of the y-axis is set manually by trial and error.
  • guides(linetype=guide_legend()): Set the variables contained in the legend. The variables in the legend are the same as those used by linetype.
  • scale_x_date(labels = date_format("%Y"), breaks = date_breaks("5 years")): Alter the axis text, so that ticks indicate five year intervals by specifying the breaks option and use labels = date_format("%Y") so that only years appear in the axis text.
  • theme(): Within this command further options can be altered.
  • legend.position="bottom": Position of the legend.
  • legend.key=element_rect(fill="white"): Set the background of the legend keys.
  • panel.background = element_rect(fill = "white"): Set background of the graph.
  • axis.line=element_line(size=.3,colour="black"): Set the size and colour of the axes.
  • axis.text=element_text(colour="black"): Set the colour of the axes’ text.
  • panel.grid=element_blank(): Control the appearance of the grid lines. In this example they are omitted.

Feel free to play with all those options. I hope you will find the right design for your personal graphs.

Figure 3
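One last practical note: if you want to use a finished graph outside of R, ggsave saves the most recently displayed plot to a file (the file name and size below are just examples):

ggsave("rates.pdf",width=8,height=5) # Save the last plot as an 8 x 5 inch PDF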

If you are more interested in ggplot2 now, you might like the homepage of the package, where you find plenty of further plot functions with good examples.

William Shakespeare in a Word Cloud

Word or tag clouds seem to be quite popular at the moment. Although their analytical power might be limited, they do serve an aesthetic purpose and, for example, could be put on the cover page of a thesis or a presentation using the content of your work or the literature you went through.

Inspired by Andrew Collier’s blog entry on creating word clouds from multiple PDF sources, I played around with some texts from Project Gutenberg to visualise some of the most prominent pieces of classic literature. First, I started with an analysis of William Shakespeare’s “Romeo and Juliet”, which yielded the word cloud shown in figure 1.

Figure 1: Romeo and Juliet

With “love”, “Romeo”, “Juliet”, “night” and “death” among the most frequently used words in the piece, the result isn’t really surprising, or is it?

Then I remembered my old English teacher, who said that Shakespeare’s work basically hinged on the motifs of “love”, “death” and “aristocracy”. So I became curious about how this might be reflected in a word cloud that is drawn from the complete works of William Shakespeare. Fortunately, like “Romeo and Juliet”, a complete version of the master’s work is available on the Project Gutenberg website, where you can find a myriad of free literature to download and read (legally!). But let’s postpone the reading and concentrate on how to make a word cloud from such data. I will go through this step by step, as suggested by Andrew Collier.

  1. Download the raw text file into R and read its lines.
    rawc<-readLines("http://www.gutenberg.org/files/100/100.txt") # readLines accepts the URL directly
    
  2. Take a look at the raw data. Both at the beginning and the end of the file there is a lot of information on how to use the file and copyright issues. Since such modern text might bias our results, we omit those lines by looking only at the part between them using rawc[173:124369]. Then we put the relevant lines together into a single character string.
    dat<-paste(rawc[173:124369],collapse=" ")
    
  3. Convert all letters into lower case. For the next steps you will need the tm package, which contains functions to remove numbers and punctuation.
    library(tm) # Load the text mining package
    dat<-tolower(dat) # Convert all letters into lower case
    dat<-removeNumbers(dat) # Remove all numbers
    dat<-removePunctuation(dat) # Remove punctuation
    
  4. The tm package also includes a function that removes given words from the sample – words which are used very frequently, but add no content to the analysis. First, there is a prespecified list of such words, which can be easily accessed via stopwords("english"). When it is used in removeWords(), a bunch of unnecessary words is removed from the data.
     dat<-removeWords(dat,stopwords("english")) 

    Secondly, by looking at the raw text file I also noticed that there is a paragraph of legal information between the end of each piece and the beginning of the next. I extracted one such paragraph in the manner shown above – extract, collapse into a single character string, convert to lower case, remove numbers and punctuation –, split the single character string into a list of words with strsplit(...,split=" ")[[1]] and extracted the unique words with unique(). I amended this list with another one, which I generated by iteratively running the code of this post and adding all those words that I did not want to have in the word cloud.

    exclude<-unique(strsplit(removePunctuation(removeNumbers(tolower(paste(rawc[2802:2813],collapse="")))),split=" ")[[1]])
    exclude<-append(exclude,c("shall","thee","thy","thus","will","come","know","may","upon","hath","now","well","make","let","see","tell","yet","like","put","speak","give","speak","can","comes","makes","sees","tells","likes","puts","speaks","gives","speaks","knows","say","says","take","takes","exeunt","though","hear","think","hears","thinks","listen","listens","hear","hears","follow","commercially","commercial","readable","personal","doth","membership","stand","therefore","complete","tis","electronic","prohibited","must","look","looks","call","calls","done","prove","whose","enter","one","words","thou","came","much","never","wit","leave","even","ever","distributed","keep","stay","made","scene","many","away","exit","shalt"))
    dat<-removeWords(dat,exclude) # Remove the words contained in the self created list
    
  5. Split the single character string into a list of words, get rid of the empty entries and exclude all words with fewer than three letters.
    dat<-strsplit(dat,split=" ")[[1]]
    dat<-dat[dat!=""]
    dat<-dat[nchar(dat)>2]
    
  6. For the next step you will need the dplyr package. Create a table from a data frame of the word list and count the number of instances of each word in that table. It is useful to let the count function sort the result by the number of instances.
    library(dplyr) # Load the package providing tbl_df and count
    dat.tbl<-tbl_df(as.data.frame(dat)) # Create table
    dat.tbl<-count(dat.tbl,sort=TRUE,vars=dat) # Count and sort words
    
  7. Basically, you could already make a word cloud from this data. However, since the number of distinct words is quite high (27,642), I would like to limit the set of plotted words to 300, which I deem a good size for word clouds: you are still impressed by the number of words in the cloud without losing the ability to get a basic impression of its content. (Alternatively, you could work with shares of word counts in the total sample and extract, e.g., the first third of the most frequent words. The resulting length of the data frames allows you to compare different texts with respect to the degree of concentration of specific topics, where relatively longer frames indicate less concentration.)
    dat.tbl<-dat.tbl[1:300,] # Extract the first 300 most frequent words
    
    # Alternative
    #dat.tbl[,"n"]<-dat.tbl[,"n"]/sum(dat.tbl[,"n"]) # Transform counts into shares
    #dat.tbl<-cbind(dat.tbl,cumsum(dat.tbl[,"n"])) # Add cumulated sum of shares
    #names(dat.tbl)<-c("vars","n","cumsum") # Rename columns 
    #dat.tbl<-dat.tbl[dat.tbl[,"cumsum"]<=.33,] # Extract the first third of most frequent words
    
  8. Finally, you can plot the word cloud using the function tagcloud() from the tagcloud package, where you have to specify the words and their weights, i.e. counts. Additionally, I used fvert=.3 so that 30 percent of the words in the cloud are aligned vertically.
    library(tagcloud) # Load the package providing the tagcloud function
    tagcloud(dat.tbl$vars,weights=dat.tbl$n,fvert=.3)
    
  9. Look at the result. As we can see, my English teacher was quite right, although I did not expect such a high concentration of words with a masculine, aristocratic connotation. However, since we excluded words like “thy”, “shall”, “will”, “thou” etc., much content from the sonnets is lost, where many motifs like “love” and “desire” are not explicitly mentioned, but are at least implicitly present. This might cause a bias towards motifs from the dramatic pieces.

    Figure 2: Complete Works of William Shakespeare

If you want to see the whole code of this entry, visit my GitHub repository on this.