This is a tutorial on how to download data from Google Ngram.

Google scans books as a part of its Google Books service. The aim of the service is to allow people to search the content of books, ultimately to facilitate book sales.

A byproduct of its scanning efforts is a large corpus of words, which Google makes available to the public. From it, you can get an estimate of how often a word was used in print during a particular year. A Wikipedia article explaining the service is here:

http://en.wikipedia.org/wiki/Google_Ngram_Viewer

Google provides a website to graph word usage over time:

https://books.google.com/ngrams

You can also download their entire corpus (http://storage.googleapis.com/books/ngrams/books/datasetsv2.html), but that would involve downloading many hundreds of gigabytes of data.

If you want to look only at selected words, there is an R package that downloads just that data. We'll cover how to do that in this tutorial.

Load the ngramr package.

library(ngramr)

Create a dataframe to house the data.

We will start off with a dataframe that has one column and 109 rows, one for each year from 1900 to 2008.

data <- as.data.frame(matrix(ncol=1, nrow=109))

Create a column of years.

Next, we will list out the years in sequence, and rename the column to “Year”.

data$V1 <- seq(from=1900, to=2008)

names(data)[names(data)=="V1"] <- "Year"
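The two steps above can also be collapsed into a single call. This is a minimal sketch using base R only; the name `years_df` is illustrative and is not used elsewhere in this tutorial.

```r
# Build the one-column dataframe of years directly, with the column already named "Year"
years_df <- data.frame(Year = 1900:2008)

# One row per year from 1900 to 2008
nrow(years_df)
```

Either approach gives the same starting point for the merge step later on.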

Specify the terms to search for in Google Ngram.

search_terms <- c("arts", "business", "science")

Loop through those search terms and pull in the Google Ngram data.

# Loop over the search terms
for (i in seq_along(search_terms)) {

  # Get the current search term
  term <- search_terms[i]

  # Search for the term in the English 2012 corpus, from the year 1900 to 2008,
  # then house the output in a temporary dataframe
  temp <- ngram(term, corpus = "eng_2012", year_start = 1900, year_end = 2008,
                smoothing = 0, count = TRUE, tag = NULL, case_ins = FALSE)

  # Merge the Ngram data with the dataframe created in the first step, matching by year
  data <- merge(data, temp[, c("Year", "Count")], by = "Year", all.x = TRUE)

  # Rename the added "Count" column after the search term
  names(data)[names(data) == "Count"] <- term

  # Remove the temporary dataframe
  rm(temp)

}

# Rename the year variable from "term" to "year"
#names(data)[names(data)=="term"] <- "year"
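To see what the merge-and-rename step in the loop does without contacting Google, here is a small sketch that runs offline. The function `fake_ngram()` is a hypothetical stand-in for `ngram()`, and the counts it returns are made up purely for illustration. It deliberately omits the first year so that the effect of `all.x = TRUE` is visible.

```r
# Hypothetical stand-in for ngram(): returns made-up counts and skips the first year
fake_ngram <- function(term, years) {
  kept <- years[-1]
  data.frame(Year = kept, Count = seq_along(kept) * nchar(term))
}

# Small dataframe of years, playing the role of "data" in the tutorial
result <- data.frame(Year = 1900:1902)

for (term in c("arts", "science")) {
  temp <- fake_ngram(term, 1900:1902)

  # all.x = TRUE keeps every year in "result", filling NA where "temp" has no row
  result <- merge(result, temp[, c("Year", "Count")], by = "Year", all.x = TRUE)

  # Rename the newly added "Count" column after the search term
  names(result)[names(result) == "Count"] <- term
}

result
```

Because of `all.x = TRUE`, the year 1900 stays in the result with `NA` in both term columns, rather than being dropped.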

Inspect the data frame downloaded.

A look at the first few rows, with one column of yearly counts per search term.

head(data)
##   Year  arts business science
## 1 1900 36155   265231  119665
## 2 1901 38291   235711  104036
## 3 1902 33719   250901  106028
## 4 1903 28753   266980   97765
## 5 1904 36403   298290  121023
## 6 1905 35930   275559  110596

Reshape the data to put it in a graph-friendly (long) format.

data_long <- reshape(data, 
  varying = c("arts", "business", "science"), 
  v.names = "count",
  timevar = "search_term", 
  times = c("arts", "business", "science"), 
  direction = "long")
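To make the wide-to-long change concrete, the sketch below applies the same `reshape()` call to a tiny dataframe. The names `toy` and `toy_long` and the values in them are made up for illustration and are not Ngram data.

```r
# Tiny wide dataframe: one row per year, one column per search term (made-up values)
toy <- data.frame(Year = c(1900, 1901),
                  arts = c(1, 2),
                  business = c(3, 4),
                  science = c(5, 6))

# Same reshape call as above: stack the three term columns into a single "count" column
toy_long <- reshape(toy,
  varying = c("arts", "business", "science"),
  v.names = "count",
  timevar = "search_term",
  times = c("arts", "business", "science"),
  direction = "long")

toy_long
```

Each year now appears once per search term, so two years and three terms give six rows. `reshape()` also adds an `id` column that identifies the original row.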

Graph the data to observe trends over time.

As you can see from the graph below, the term “business” has been mentioned more often than “arts” or “science”, and the disparity in use has widened with time. One thing to think through carefully when analysing word count data is the possibility of collisions in meaning. The term “business” could refer to a for-profit organization, or to one's state of affairs, as in “mind your own business”. With only word counts, there is no way to disentangle the two senses of the word.


library(ggplot2)

p <- ggplot(data_long, aes(x=Year, y=count, group=search_term))

p +  geom_line(aes(colour = search_term))