Data Science for Librarians Final Project

Screen Scraping to Associate News Articles with Papers in the ADS

Key Lessons

  1. After learning the basics of Python, if you are at all interested in working with information, the next thing to learn is how to work with APIs. Python provides a great module called Requests that makes this easy. You can see an intro tutorial I wrote on using Requests to query APIs here, and there is a short sketch after this list.
  2. If for some reason the information you want isn't easily accessible through an API, a bit of screen scraping knowledge will serve you well. (To use Scrapy you need to be pretty comfortable with Python.)
  3. Try to familiarize yourself with the basic types of information containers you will be working with. If you have a moderate understanding of CSV files, JSON data structures, XML, and Python dictionaries and lists, you should be good to go.
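
Here is a minimal sketch of what querying a JSON API with Requests looks like. The endpoint, parameters, and field names below come from the NY Times Article Search API purely for illustration; they are assumptions, not the exact calls used in this project.

    import requests

    API_KEY = "your-api-key-here"  # hypothetical placeholder
    URL = "https://api.nytimes.com/svc/search/v2/articlesearch.json"

    # Ask the API for articles matching a search term.
    response = requests.get(URL, params={"q": "exoplanet", "api-key": API_KEY})
    response.raise_for_status()  # stop early if the request failed

    # The JSON payload parses into ordinary Python dictionaries and lists.
    data = response.json()
    for doc in data["response"]["docs"]:
        print(doc["headline"]["main"], doc["web_url"])

Once the response is parsed with .json(), everything is just dictionaries and lists, which is why lesson 3 matters as much as lesson 1.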

Building a Spider

The New York Times "Astronomy and Astrophysics" section offers an excellent series of articles explaining recent breakthroughs in the field. At the beginning of this class, we set out to find an automated way to associate articles in the Astronomy and Astrophysics section with papers in the Astrophysics Data System (ADS).

Unfortunately, the NY Times API was too limited to allow us to associate articles easily, so I decided to use screen scraping instead. I ended up using the Python Scrapy framework to write a spider. The spider searched through every Astronomy and Astrophysics article in the NY Times, looking for a link to the scientific paper under discussion. If it found such a link, it followed it and searched the publication's page for a DOI (digital object identifier). Once the DOI was found, it was easy to automate a search in the ADS API for the bibcode that uniquely identifies the paper in the ADS (a bibcode is similar to a DOI, but specific to the ADS).
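The sketch below condenses that workflow into a Scrapy spider. The section URL, CSS selectors, and link heuristics are assumptions made for illustration; the real NY Times markup differs, so treat this as an outline of the approach rather than the project's code.

    import re
    import scrapy

    class AstroArticleSpider(scrapy.Spider):
        name = "astro_articles"
        # Hypothetical landing page for the Astronomy and Astrophysics section.
        start_urls = ["https://www.nytimes.com/topic/subject/astronomy-and-astrophysics"]

        # Loose pattern for DOIs embedded in a publication page.
        DOI_PATTERN = re.compile(r"10\.\d{4,9}/[^\s\"<>]+")

        def parse(self, response):
            # Follow every article link found on the section page.
            for href in response.css("article a::attr(href)").getall():
                yield response.follow(href, callback=self.parse_article)

        def parse_article(self, response):
            # Look for outbound links that plausibly point at the scientific paper.
            for href in response.css("a::attr(href)").getall():
                if "doi.org" in href or "arxiv.org" in href:
                    yield response.follow(
                        href,
                        callback=self.parse_publication,
                        meta={"article_url": response.url},
                    )

        def parse_publication(self, response):
            # Search the publication page for a DOI and emit one record per hit.
            match = self.DOI_PATTERN.search(response.text)
            if match:
                yield {
                    "article_url": response.meta["article_url"],
                    "doi": match.group(0),
                }

With a DOI in hand, the bibcode lookup can be scripted with Requests against the ADS search API. The endpoint and field below reflect the current ADS API and are an assumption about how the lookup was wired up; the DOI shown is a placeholder.

    import requests

    ADS_TOKEN = "your-ads-api-token"  # hypothetical placeholder
    resp = requests.get(
        "https://api.adsabs.harvard.edu/v1/search/query",
        params={"q": 'doi:"10.1234/example-doi"', "fl": "bibcode"},
        headers={"Authorization": "Bearer " + ADS_TOKEN},
    )
    bibcodes = [doc["bibcode"] for doc in resp.json()["response"]["docs"]]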

Outcome

Ultimately, this method yielded fewer matches than anticipated, because NY Times authors do not always link to the relevant scientific publication (sometimes the NY Times article is published before the scientific paper it covers, for instance). However, the project was a great introduction to the information-aggregation power of web scraping when an API falls short. To view the tab-delimited CSV file that the spider produced, look below. I am not linking to the spider code itself because it was written just as I was learning Python.
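
For reference, a tab-delimited file like this one can be loaded with Python's csv module; the file name and column names below are assumptions for illustration, not the actual output schema.

    import csv

    # Read the spider's tab-delimited output; file and column names are guesses.
    with open("nyt_ads_matches.tsv", newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f, delimiter="\t"):
            print(row["article_url"], row["bibcode"])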