Word or tag clouds seem to be quite popular at the moment. Although their analytical power might be limited, they do serve an aesthetic purpose and, for example, could be put on the cover page of a thesis or a presentation using the content of your work or the literature you went through.
Inspired by Andrew Collier’s blog entry on creating word clouds from multiple PDF sources, I played around with some texts from Project Gutenberg to visualize some of the most prominent pieces of classic literature. First, I started with an analysis of William Shakespeare’s “Romeo and Juliet”, which yielded the word cloud shown in figure 1.
With “love”, “Romeo”, “Juliet”, “night” and “death” as some of the most frequently used words in the piece the result isn’t really surprising, or is it?
Then I remembered my old English teacher, who said that Shakespeare’s work basically hinges on the motifs of “love”, “death” and “aristocracy”. So I became curious how this might be reflected in a word cloud drawn from the complete works of William Shakespeare. Fortunately, like “Romeo and Juliet”, a complete version of the master’s work is available on the Project Gutenberg website, where you can find a myriad of free literature to download and read (legally!). But let’s postpone the reading and concentrate on how to make a word cloud from such data. I will go through this step by step, as suggested by Andrew Collier.
- Download the raw text file into R and read its lines.
- Take a look at the raw data. At both the beginning and the end of the file there is a lot of information on how to use the file and on copyright. Since such modern text might bias our results, we omit those lines and keep only the part between them with `rawc[173:124369]`. Then we paste the relevant lines together into a single character string.
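The first two steps can be sketched as follows. Note that the URL is an assumption (check the file's current location on gutenberg.org); the line range is the one used in this post.

```r
# Download the raw text of the complete works and read it line by line
# (the URL is an assumption; check the file's current location on gutenberg.org)
rawc <- readLines("https://www.gutenberg.org/cache/epub/100/pg100.txt")

# Omit the header and footer and paste the remaining lines into one string
dat <- paste(rawc[173:124369], collapse = " ")
```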
- Convert all letters to lower case. After that you will need the `tm` package, which contains functions to remove numbers and punctuation.

```r
dat <- tolower(dat)           # Convert all letters to lower case
dat <- removeNumbers(dat)     # Remove all numbers
dat <- removePunctuation(dat) # Remove punctuation
```
- The `tm` package also includes a function to remove words from the sample that are used very frequently but add no content to the analysis. First, there is a prespecified list of such stop words, which can be accessed with `stopwords("english")`. Passing it to `removeWords()` strips a bunch of unnecessary words from the data.
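Applied to our single character string, this might look like:

```r
library(tm)

# Remove common English stop words like "the", "and" or "of"
dat <- removeWords(dat, stopwords("english"))
```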
- Secondly, by looking at the raw text file I also noticed that there is a paragraph of legal information between the end of one piece and the beginning of the next. I processed one such paragraph in the manner shown above (extract, paste into a single character string, convert to lower case, remove numbers and punctuation), split the string into a vector of words with `strsplit(..., split = " ")[[1]]` and extracted the unique words with `unique()`. I amended this list with another one, which I generated by iteratively running the code of this post and adding all those words that I did not want to have in the word cloud.
```r
# Words from one of the legal paragraphs between two pieces
exclude <- unique(strsplit(removePunctuation(removeNumbers(tolower(paste(rawc[2802:2813], collapse = "")))), split = " ")[[1]])

# Manually collected words
exclude <- append(exclude, c("shall", "thee", "thy", "thus", "will", "come", "know", "may", "upon", "hath", "now", "well", "make", "let", "see", "tell", "yet", "like", "put", "speak", "give", "can", "comes", "makes", "sees", "tells", "likes", "puts", "speaks", "gives", "knows", "say", "says", "take", "takes", "exeunt", "though", "hear", "think", "hears", "thinks", "listen", "listens", "follow", "commercially", "commercial", "readable", "personal", "doth", "membership", "stand", "therefore", "complete", "tis", "electronic", "prohibited", "must", "look", "looks", "call", "calls", "done", "prove", "whose", "enter", "one", "words", "thou", "came", "much", "never", "wit", "leave", "even", "ever", "distributed", "keep", "stay", "made", "scene", "many", "away", "exit", "shalt"))

dat <- removeWords(dat, exclude) # Remove the words contained in the self-created list
```
- Split the single character string into a vector of words, get rid of the empty entries and exclude all entries with fewer than three letters.

```r
dat <- strsplit(dat, split = " ")[[1]] # Split the string into a vector of words
dat <- dat[dat != ""]                  # Drop empty entries
dat <- dat[nchar(dat) > 2]             # Keep only words with at least three letters
```
- For the next step you will need the `dplyr` package. Create a table from a data frame of the word list and count the number of instances of each word. It is useful to let the count function sort the result by the number of instances.

```r
library(dplyr)

dat.tbl <- tbl_df(as.data.frame(dat))              # Create table
dat.tbl <- count(dat.tbl, vars = dat, sort = TRUE) # Count and sort words
```
- Basically, you could already make a word cloud from this data. However, since the number of distinct words is quite high (27,642), I would like to limit the set of plotted words to 300, which I deem a good size for word clouds: you are still impressed by the number of words in the cloud without losing the ability to get a basic impression of the content. (Alternatively, you could work with the shares of word counts in the total sample and extract, e.g., the first third of the most frequent words. The resulting lengths of the data frames allow you to compare different texts with respect to the degree of concentration of specific topics, where relatively longer frames indicate less concentration.)
```r
dat.tbl <- dat.tbl[1:300, ] # Extract the 300 most frequent words

# Alternative
# dat.tbl[, "n"] <- dat.tbl[, "n"] / sum(dat.tbl[, "n"]) # Transform counts into shares
# dat.tbl <- cbind(dat.tbl, cumsum(dat.tbl[, "n"]))      # Add the cumulated sum of shares
# names(dat.tbl) <- c("vars", "n", "cumsum")             # Rename columns
# dat.tbl <- dat.tbl[dat.tbl[, "cumsum"] <= .33, ]       # Keep the first third of the most frequent words
```
- Finally, you can plot the word cloud using the `tagcloud()` function from the `tagcloud` package, where you have to specify the words and their weights, i.e. the counts. Additionally, I used `fvert = .3` so that 30 percent of the words in the cloud are aligned vertically.
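Assuming the columns of the table are named `vars` and `n` as created above, this final step might look like:

```r
library(tagcloud)

# Plot the 300 most frequent words, weighted by their counts;
# fvert = .3 aligns roughly 30 percent of the words vertically
tagcloud(as.character(dat.tbl$vars), weights = dat.tbl$n, fvert = .3)
```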
- Look at the result. As we can see, my English teacher was quite right, although I did not expect such a high concentration of words with a masculine, aristocratic connotation. However, since we excluded words like “thy”, “shall”, “will”, “thou” etc., much content from the sonnets is lost, where many motifs like “love” and “desire” are not explicitly mentioned but are at least implicitly present. This might cause a bias towards motifs from the dramatic pieces.
If you want to see the whole code of this entry, visit my GitHub repository on this.