Data Science for Librarians Final Project

Generating Search Queries Using Regular Expressions and NLTK

Key Lessons

Learn how to use regular expressions--they are helpful in a variety of tasks, especially with data munging.
A bit more advanced, but still vital and time-saving if you are analyzing textual data: the Python NLTK module. With this module you can do anything from basic tokenization to different types of word frequency measurements to advanced semantic analysis.

The screen scraping method described in "Screen Scraping" could only associate NY Times articles with ADS papers if there was a direct link from the NY Times article to the website of a scientific publication. The majority of NY Times articles did not have such a link. A few articles simply had no corresponding ADS paper(s), but many of them directly described research published in one or more astronomical journals. This project used the text of these articles to try to automate the production of a series of ADS queries, with the goal of finding the ADS paper that most likely corresponded to each article.

Script Output for the NY Times article "A Planet ‘Just Right’ for Life? Perhaps, if It Exists" by Dennis Overbye
Likely names	['Hatzes', 'Nachrichten', 'Sagan', 'Vogt', 'Cruz', 'Way', 'Butler', 'Ford', 'Mayor', 'Forveille']
Bibstems	['AN', 'ApJ', 'AN']
Possibly important numbers and names	['GL 581', 'GLEE-']
Frequent words (filtered stopwords + common astronomical terms)	['forveille', 'butler', 'hatzes', 'false', 'alarm']
Frequent bigrams (no filtering)	['vogt colleagues', 'gliese 581g', 'habitable zone']
Suggested bibcodes	[('2012AN....333..561V', 12), ('2002ApJ...581L.115B', 10), ('2013AN....334..184A', 8)]

How the Script Works

The query generator took the following steps to try to extract author names, astronomical object names, and other important identifying words and phrases.

Author Names: Using the NLTK module's named entity capability, I identified as many person names in the text of the article as possible, assuming these were potential authors of the scientific paper. However, the NY Times frequently interviewed astronomers who were not involved in the research for their expert opinion. I dealt with this by excluding any person names within a certain distance of the word "not" (as in "Professor Albertson, who was not involved in the study...").
Journal Names: This was easier. I collected a list of high-profile astronomy and astrophysics journals and used regular expressions to identify any mention of them in the article. (You could frequently find sentences such as "The study, which is to be published next month in The Astrophysical Journal...".)
Astronomical Objects: For this step, I used regex yet again to look for possible names of astronomical objects--frequently identified by capital letters followed by numbers (as in "GJ1214", a red dwarf).
Important Words: After filtering out stopwords and popular astronomy terms, this step searched for words characterized by unusual frequency, assuming that these would function as keywords for the ADS search.
Important Bigrams: The same as in the above step, but for bigrams instead of lone words.
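
The steps above can be roughly sketched with nothing but the Python standard library. This is not the project's actual code: the journal list, the object-name pattern, the stopword set, and the window size are all assumptions made for illustration. The NLTK named entity step is not reproduced here, so the list of candidate names is simply passed in, and raw counts stand in for whatever "unusual frequency" measure the script used.

```python
import re
from collections import Counter

# Illustrative journal list (assumption): full name -> ADS bibstem.
JOURNALS = {
    "The Astrophysical Journal": "ApJ",
    "Astronomische Nachrichten": "AN",
}

# Assumed pattern: two or more capitals, an optional space, then digits.
OBJECT_RE = re.compile(r"\b[A-Z]{2,}\s?\d+[a-z]?\b")

# Tiny stand-in for a real stopword list plus common astronomy terms.
STOPWORDS = {"the", "a", "of", "in", "and", "is", "to", "star", "planet"}


def drop_uninvolved(names, text, window=5):
    """Keep only names that never occur within `window` words of 'not'."""
    tokens = re.findall(r"[A-Za-z']+", text)
    kept = []
    for name in names:
        near_not = any(
            "not" in (t.lower() for t in tokens[max(0, i - window):i + window + 1])
            for i, tok in enumerate(tokens) if tok == name
        )
        if not near_not:
            kept.append(name)
    return kept


def find_bibstems(text):
    """Return the bibstem of every listed journal mentioned in the text."""
    return [stem for name, stem in JOURNALS.items()
            if re.search(re.escape(name), text, flags=re.IGNORECASE)]


def find_objects(text):
    """Return the sorted, de-duplicated object-like names found in the text."""
    return sorted(set(OBJECT_RE.findall(text)))


def frequent_terms(text, n=3):
    """Return the top-n filtered words and the top-n unfiltered bigrams."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    words = Counter(t for t in tokens if t not in STOPWORDS)
    bigrams = Counter(" ".join(pair) for pair in zip(tokens, tokens[1:]))
    return words.most_common(n), bigrams.most_common(n)
```

Calling drop_uninvolved(["Albertson", "Vogt"], text) on a passage containing "Professor Albertson, who was not involved in the study" would drop "Albertson" as long as "Vogt" appears more than five words away from "not". A fixed word window is crude, and it will also discard real authors who happen to sit near an unrelated "not".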

After collecting this array of words, phrases and names, the remaining code just generates search queries from different combinations of terms. It counts how often each paper is returned across all the queries and suggests the most frequently returned papers as likely matches. Sometimes it works; sometimes it doesn't, but again, it was a great learning experience. Here is a link to the code, which is quite messy, but if you have an ADS API key, you can test the accuracy of the code for yourself.
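
The query-and-count idea can be sketched as follows. This is a minimal illustration, not the linked script: build_queries, suggest_bibcodes, and fake_search are hypothetical names, the query strings are plain space-joined terms rather than real ADS query syntax, and the search function is a stub that you would replace with a real call to the ADS API.

```python
from collections import Counter
from itertools import combinations


def build_queries(terms, size=2):
    """Join every `size`-term combination into one space-separated query."""
    return [" ".join(combo) for combo in combinations(terms, size)]


def suggest_bibcodes(terms, search, size=2, top=3):
    """Run every query through `search` and rank papers by how often they return.

    `search` is any callable that takes a query string and returns a list of
    paper identifiers; each paper is counted once per query.
    """
    counts = Counter()
    for query in build_queries(terms, size):
        counts.update(set(search(query)))
    return counts.most_common(top)


def fake_search(query):
    """Stub standing in for an ADS lookup; the identifiers are placeholders."""
    hits = []
    if "Vogt" in query or "GL 581" in query:
        hits.append("paper-A")
    if "Hatzes" in query:
        hits.append("paper-B")
    return hits


suggestions = suggest_bibcodes(["Vogt", "Hatzes", "GL 581"], fake_search)
print(suggestions)
```

The output has the same shape as the "Suggested bibcodes" line in the script output: a list of (identifier, count) pairs, highest count first.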