This is a tutorial on how to download data from Google Ngram.
Google scans books as a part of its Google Books service. The aim of the service is to allow people to search the content of books, ultimately to facilitate book sales.
A byproduct of its scanning efforts is a large corpus of words, which Google makes available to the public. From it, you can estimate how often a word was used in print during a particular year. A Wikipedia article explaining the service is here:
Google provides a website to graph word usage over time:
You can also download their entire corpus (http://storage.googleapis.com/books/ngrams/books/datasetsv2.html), but that would involve downloading many hundreds of gigabytes of data.
If you only want to look at selected words, there is an R package that downloads just that data. We'll cover how to do that in this tutorial.
Load the ngramr package.
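A minimal sketch of this step, assuming the package is installed from CRAN under the name ngramr. The variable `ngramr_available` is an illustrative name, not something from the original tutorial.

```r
# One-time install from CRAN (uncomment if the package is missing):
# install.packages("ngramr")

# Attach the package only if it is installed, so a missing package
# produces a clear message instead of a hard error.
ngramr_available <- requireNamespace("ngramr", quietly = TRUE)

if (ngramr_available) {
  library(ngramr)
} else {
  message("ngramr is not installed; run install.packages(\"ngramr\") first.")
}
```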
Create a dataframe to house the data.
We will first create a dataframe to house the data. We will start off with a dataframe that has one column and 109 rows, one row for each year from 1900 to 2008.
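One way to set up such a placeholder in base R is sketched here. The names `year_start`, `year_end`, `n_years`, and `ngram_data` are assumptions chosen for illustration.

```r
# Year range described in the text.
year_start <- 1900
year_end <- 2008

# Both endpoints are included, so 2008 - 1900 + 1 = 109 rows.
n_years <- year_end - year_start + 1

# One column, filled with NA for now; it is populated in the next step.
ngram_data <- data.frame(matrix(NA, nrow = n_years, ncol = 1))

dim(ngram_data)
```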
Create a column of years.
Next, we will list out the years in sequence and rename the column to “year”.
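A minimal sketch of filling in and naming the year column. It recreates the placeholder dataframe so that it runs on its own; `ngram_data` is an assumed name.

```r
# Placeholder dataframe: one column, 109 rows (1900 to 2008 inclusive).
ngram_data <- data.frame(matrix(NA, nrow = 109, ncol = 1))

# Fill the single column with the years in sequence.
ngram_data[, 1] <- seq(1900, 2008)

# Rename the column to "year".
colnames(ngram_data)[1] <- "year"

head(ngram_data)
```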
Specify the terms to search for in Google Ngram.
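The search terms can be held in a character vector. The three terms below are the ones discussed at the end of this tutorial; the variable name `search_terms` is an assumption.

```r
# Terms to look up, one element per term.
search_terms <- c("business", "arts", "science")

length(search_terms)
```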
Loop through those search terms and pull in Google Ngram data.
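The loop might look roughly like the sketch below. `fetch_term` and `collect_terms` are hypothetical helpers written for this illustration. The sketch assumes that `ngramr::ngram()` accepts `year_start`, `year_end`, and `smoothing`, and returns a data frame with `Year` and `Frequency` columns. Setting `smoothing = 0` to get unsmoothed yearly values is also an assumption, not necessarily what the original code did.

```r
# Hypothetical helper: download yearly frequencies for a single term.
# Requires the ngramr package and a network connection when called.
fetch_term <- function(term, year_start = 1900, year_end = 2008) {
  result <- ngramr::ngram(term,
                          year_start = year_start,
                          year_end = year_end,
                          smoothing = 0)
  result[, c("Year", "Frequency")]
}

# Hypothetical helper: loop over the terms and add one column per term.
# `fetch` can be swapped out, e.g. for an offline stub.
collect_terms <- function(terms, years = 1900:2008, fetch = fetch_term) {
  out <- data.frame(year = years)
  for (term in terms) {
    term_data <- fetch(term)
    # Align on year so a missing year becomes NA instead of shifting rows.
    out[[term]] <- term_data$Frequency[match(out$year, term_data$Year)]
  }
  out
}

# Usage (needs network access, so it is left commented out):
# ngram_data <- collect_terms(c("business", "arts", "science"))
```

Matching on year rather than binding columns by position keeps the rows aligned even if the service returns fewer years for one term than for another.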
Inspect the data frame downloaded.
A look at the data.
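A few base R calls are usually enough for a first inspection. The dataframe below is a stand-in with the expected shape (a year column plus one column per term) and NA in place of frequencies, so that the sketch runs offline. It contains no real Ngram values.

```r
# Stand-in with the expected layout; frequencies are NA placeholders.
ngram_data <- data.frame(
  year = 1900:2008,
  business = NA_real_,
  arts = NA_real_,
  science = NA_real_
)

head(ngram_data)   # first six rows
str(ngram_data)    # column names and types
dim(ngram_data)    # number of rows and columns
```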
Reshape the data to put it in a graph-friendly (long) format.
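A sketch of the wide-to-long reshape using only base R. The original tutorial may well have used a reshaping package instead. The frequencies here are constant stand-in values (not real Ngram data) so that the result can be checked by eye, and the names `wide`, `long`, and `terms` are assumptions.

```r
# Wide format: one row per year, one column per term.
# Values are stand-ins, NOT real Ngram frequencies.
wide <- data.frame(
  year = 1900:2008,
  business = 1,
  arts = 2,
  science = 3
)

# Every column except "year" is a term.
terms <- setdiff(names(wide), "year")

# Long format: one row per (year, term) pair.
long <- data.frame(
  year = rep(wide$year, times = length(terms)),
  term = rep(terms, each = nrow(wide)),
  frequency = unlist(wide[terms], use.names = FALSE)
)

dim(long)
```

In the long layout each term becomes a value in the `term` column rather than a column of its own, which is the shape most plotting functions expect when drawing one line per group.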
Graph the data to observe trends over time.
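A minimal plotting sketch using base graphics, so that no extra package is assumed. The original may have used a different plotting library. `plot_term_trends` is a hypothetical helper, and the demo data are constant stand-in values rather than real Ngram frequencies.

```r
# Stand-in long-format data (NOT real Ngram frequencies).
demo_long <- data.frame(
  year = rep(1900:2008, times = 3),
  term = rep(c("business", "arts", "science"), each = 109),
  frequency = rep(c(3, 1, 2), each = 109)
)

# Hypothetical helper: draw one line per term from long-format data.
plot_term_trends <- function(long) {
  terms <- as.character(unique(long$term))
  colours <- seq_along(terms)

  # Empty plot sized to hold every series.
  plot(range(long$year), range(long$frequency, na.rm = TRUE),
       type = "n", xlab = "Year", ylab = "Relative frequency",
       main = "Term usage over time")

  # Add each term as its own line.
  for (i in seq_along(terms)) {
    rows <- long$term == terms[i]
    lines(long$year[rows], long$frequency[rows], col = colours[i])
  }

  legend("topleft", legend = terms, col = colours, lty = 1)
  invisible(terms)
}

plotted_terms <- plot_term_trends(demo_long)
```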
As you can see from the graph below, the term “business” has been mentioned more often than “arts” or “science”, and the gap in use has widened over time. One thing to think through carefully when analysing word count data is the possibility of collisions in meaning. The term “business” could refer to a for-profit organization, or to one's state of affairs, as in “mind your own business”. With only word counts, there is no way to disentangle the two meanings.