This is a tutorial on how to download data from Amazon.
Amazon makes a lot of data available for systematic download through its Application Programming Interface (API). If you sign up as an Amazon associate, meaning you are willing to show Amazon advertising on your website, you can obtain a key to query Amazon directly. With this key, you can obtain, among other data, the sales ranks of various products.
Details about the associate program here:
This tutorial demonstrates how to download such data with R. In particular, we will look up the sales ranks of the items returned when searching for the term 'blender' on Amazon. We will then tabulate the sales ranks of the top ten items associated with that search term.
This tutorial borrows from the solution found on Stack Overflow:
First, define the variables needed to make the query.
These include the access key Amazon provides when you sign up for the associate program, your secret key, your associate ID, and the search term.
Details about obtaining the access key, secret key, and associate ID can be found in the API documentation link above. My actual details have been obscured, and you'll have to get your own credentials.
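As a minimal sketch of this setup step, the four values can be stored as plain character variables. The variable names and the environment variable names below are illustrative assumptions, not the ones used in the original code, and the fallback strings are placeholders rather than working credentials.

```r
# Placeholder credentials: replace each value with your own.
# Reading them from environment variables (hypothetical names) keeps
# secrets out of the script; the `unset` strings are only fallbacks.
access_key   <- Sys.getenv("AMAZON_ACCESS_KEY",   unset = "YOUR_ACCESS_KEY")
secret_key   <- Sys.getenv("AMAZON_SECRET_KEY",   unset = "YOUR_SECRET_KEY")
associate_id <- Sys.getenv("AMAZON_ASSOCIATE_ID", unset = "YOUR_ASSOCIATE_ID")

# The term we want to search for.
search_term <- "blender"
```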
Define the function used to query the API.
This is the solution suggested by user Mischa Vreeburg on Stack Overflow in the link above.
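The original function is not reproduced here. As a rough sketch of what such a function has to do, the example below builds a signed request URL: it sorts the query parameters, joins them into a canonical string, signs that string with HMAC-SHA256 using the secret key, and appends the Base64 signature. The helper name `build_signed_url`, the host and path, and the parameter values (`SearchIndex`, `ResponseGroup`) are assumptions based on the older signed-REST style of the Product Advertising API, and the `openssl` package is assumed to be installed. Check every detail against the current API documentation before relying on it.

```r
library(openssl)  # assumed installed; provides sha256() and base64_encode()

# Percent-encode a value, leaving only unreserved characters untouched.
encode <- function(x) URLencode(as.character(x), reserved = TRUE, repeated = TRUE)

# Hypothetical helper (not the Stack Overflow function): build a signed
# ItemSearch URL from the credentials and a search term.
build_signed_url <- function(access_key, secret_key, associate_id, keywords,
                             timestamp = format(Sys.time(), "%Y-%m-%dT%H:%M:%SZ",
                                                tz = "UTC"),
                             host = "webservices.amazon.com",
                             path = "/onca/xml") {
  params <- c(
    Service        = "AWSECommerceService",
    Operation      = "ItemSearch",
    AWSAccessKeyId = access_key,
    AssociateTag   = associate_id,
    SearchIndex    = "All",        # assumed value
    Keywords       = keywords,
    ResponseGroup  = "SalesRank",  # assumed value
    Timestamp      = timestamp
  )
  # Sort parameters by the byte order of their names (radix = C-locale order).
  params <- params[order(names(params), method = "radix")]
  query <- paste0(names(params), "=",
                  vapply(params, encode, character(1)),
                  collapse = "&")
  # The string to sign is the HTTP verb, host, path, and query, newline-separated.
  string_to_sign <- paste("GET", host, path, query, sep = "\n")
  # HMAC-SHA256 with the secret key, then Base64-encode the raw digest.
  signature <- base64_encode(sha256(charToRaw(string_to_sign), key = secret_key))
  paste0("https://", host, path, "?", query, "&Signature=", encode(signature))
}

# Example call with dummy credentials and a fixed timestamp.
url <- build_signed_url("AK", "SK", "tag-20", "blender",
                        timestamp = "2014-01-01T00:00:00Z")
```

With real credentials, the resulting URL could be fetched with any HTTP client and the response text handed to the XML parsing steps that follow.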
Use the function to get the returned results into an object.
The API returns structured XML data. We'll need to parse the XML to make it easier to work with.
More on XML here: http://www.informit.com/articles/article.aspx?p=2215520
Restructure the data.
We'll need to load some packages to manipulate the data, including the XML package.
We can use the xmlRoot function from the XML package to arrange the XML into a structure we can navigate.
We now have the structure of the XML, and can examine the top-level nodes of the object returned by xmlRoot.
We can also look at the content of the nodes.
The first node contains meta-data, while the second node contains the results of the query.
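The parsing and navigation steps above can be sketched on a small stand-in document. The XML string below is made up for illustration: it only mimics the two-node layout described here (meta-data first, results second), omits the XML namespace a real response would carry, and uses invented ASINs and ranks.

```r
library(XML)

# Made-up stand-in for an API response; real responses carry more fields.
sample_xml <- paste0(
  "<ItemSearchResponse>",
  "<OperationRequest><RequestId>example-request</RequestId></OperationRequest>",
  "<Items>",
  "<Item><ASIN>B000000001</ASIN><SalesRank>120</SalesRank></Item>",
  "<Item><ASIN>B000000002</ASIN><SalesRank>45</SalesRank></Item>",
  "</Items>",
  "</ItemSearchResponse>"
)

# Parse the text and take the root node as the entry point for navigation.
doc  <- xmlParse(sample_xml, asText = TRUE)
root <- xmlRoot(doc)

# Names of the top-level nodes under the root.
top_level <- names(root)
print(top_level)

# Content of the first node (meta-data) and the second node (query results).
print(root[[1]])
print(root[[2]])

# Individual values can be pulled out by node name.
request_id <- xmlValue(root[["OperationRequest"]][["RequestId"]])
n_items    <- xmlSize(root[["Items"]])
```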
Put the query results into a dataframe.
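One possible way to do this, sketched on a made-up response, is to select the item nodes and pass them to `xmlToDataFrame`. The sample XML is invented and has no namespace, so the plain XPath `//Item` works; a real response would likely need namespace handling, and the original code may have used a different approach.

```r
library(XML)

# Made-up stand-in for the parsed API response (no namespace, invented values).
sample_xml <- paste0(
  "<ItemSearchResponse>",
  "<OperationRequest><RequestId>example-request</RequestId></OperationRequest>",
  "<Items>",
  "<Item><ASIN>B000000001</ASIN><SalesRank>120</SalesRank></Item>",
  "<Item><ASIN>B000000002</ASIN><SalesRank>45</SalesRank></Item>",
  "</Items>",
  "</ItemSearchResponse>"
)
root <- xmlRoot(xmlParse(sample_xml, asText = TRUE))

# Collect every Item node, then turn each node into one row,
# with one column per child element.
item_nodes <- getNodeSet(root, "//Item")
items <- xmlToDataFrame(nodes = item_nodes, stringsAsFactors = FALSE)

print(items)
```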
Tabulate some cleaned results.
Two columns contain useful data - one called “ASIN”, and another called “SalesRank”.
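A short sketch of the cleaning and tabulation, using a small invented data frame in place of the real query results: the values arrive as text, so the rank column is converted to integers before the first ten rows are shown. The object names and the data are illustrative assumptions.

```r
# Invented stand-in for the data frame built from the query results.
results <- data.frame(
  ASIN      = c("B000000001", "B000000002", "B000000003"),
  SalesRank = c("120", "45", "980"),
  stringsAsFactors = FALSE
)

# Sales ranks are parsed as text; convert them to integers.
results$SalesRank <- as.integer(results$SalesRank)

# Keep the two useful columns and the first ten items in search order.
top_ten <- head(results[, c("ASIN", "SalesRank")], 10)
print(top_ten)

# Optionally, order the same items from best-selling to worst-selling.
by_rank <- top_ten[order(top_ten$SalesRank), ]
rownames(by_rank) <- NULL
print(by_rank)
```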
ASIN refers to a unique identifier Amazon assigns to a product, much like an ISBN. Sales rank is exactly what it sounds like: the sales rank of the product, with lower numbers indicating higher sales. More information about Amazon sales ranks and their properties can be found in these papers:
Chevalier, J. A., & Mayzlin, D. (2006). The Effect of Word of Mouth on Sales: Online Book Reviews. Journal of Marketing Research, 43(3), 345-354.
Deschatres, F., & Sornette, D. (2005). Dynamics of book sales: Endogenous versus exogenous shocks in complex networks. Physical Review E, 72(1). http://doi.org/10.1103/PhysRevE.72.016112