The screen scraping method described in "Screen Scraping" could only associate a NY Times article with ADS papers if the article linked directly to the website of a scientific publication. Most NY Times articles had no such link. A few of these simply had no corresponding ADS paper(s), but many directly described research published in one or more astronomical journals. This project used the text of these articles to try to automate a series of ADS queries aimed at finding the paper most likely to correspond to each article.
Script output for the NY Times article "A Planet ‘Just Right’ for Life? Perhaps, if It Exists" by Dennis Overbye:

Output field | Value |
---|---|
Likely names | ['Hatzes', 'Nachrichten', 'Sagan', 'Vogt', 'Cruz', 'Way', 'Butler', 'Ford', 'Mayor', 'Forveille'] |
Bibstems | ['AN', 'ApJ', 'AN'] |
Possibly important numbers and names | ['GL 581', 'GLEE-'] |
Frequent words (filtered stopwords + common astronomical terms) | ['forveille', 'butler', 'hatzes', 'false', 'alarm'] |
Frequent bigrams (no filtering) | ['vogt colleagues', 'gliese 581g', 'habitable zone'] |
Suggested bibcodes | [('2012AN....333..561V', 12), ('2002ApJ...581L.115B', 10), ('2013AN....334..184A', 8)] |
After collecting these words, phrases, and names, the rest of the code generates search queries from different combinations of terms. It counts how often each ADS paper is returned across those queries and suggests the most frequently returned paper as the likely match. Sometimes it works and sometimes it doesn't, but again, it was a great learning experience. Here is a link to the code, which is quite messy, but if you have an ADS API key, you can test its accuracy for yourself.