This is a tutorial on how to download data from Amazon.

Amazon makes a lot of data available for systematic download through its Application Programming Interface (API). If you sign up to be an Amazon associate - meaning you are willing to show Amazon advertising on your website - you can obtain a key to make queries directly to Amazon. With this key, you can obtain, among other data, sales ranks of various products.

Details about the associate program are here:

https://affiliate-program.amazon.com/gp/advertising/api/detail/main.html

This tutorial will demonstrate how to download such data with R. In particular, we will search Amazon for the term 'blender' and retrieve the sales ranks of the items returned. We will then tabulate the sales ranks of the top ten items associated with that search term.

This tutorial borrows from the solution found on Stack Overflow:

http://stackoverflow.com/questions/8251632/amazon-product-api-with-r

First, define the variables needed to make the query.

These include the access key provided to you by Amazon when you sign up for the affiliate program, your secret key, your associate ID, and the search term.

Details about obtaining the access key, secret key, and associate ID can be found in the API documentation link above. My actual details have been obscured, and you'll have to get your own credentials.

AWSAccessKeyId <- "REDACTED"

AWSsecretkey <- "REDACTED" 

AssociateTag <- "REDACTED" 

Keywords <- "blender"
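If you would rather not type your keys into a script, one option is to read them from environment variables. The sketch below is only an illustration: the helper get_credential and the variable name AMAZON_ACCESS_KEY_ID are hypothetical, not something Amazon or the original solution defines, and the Sys.setenv call is there only so the example runs on its own.

```r
# Hypothetical helper: read a credential from an environment variable
# and stop with a clear message if it is missing or empty.
get_credential <- function(var_name) {
  value <- Sys.getenv(var_name, unset = NA)
  if (is.na(value) || !nzchar(value)) {
    stop("Environment variable ", var_name, " is not set", call. = FALSE)
  }
  value
}

# For demonstration only: normally this would be set outside of R,
# for example in your .Renviron file. The variable name is an assumption.
Sys.setenv(AMAZON_ACCESS_KEY_ID = "EXAMPLE-KEY")

AWSAccessKeyId <- get_credential("AMAZON_ACCESS_KEY_ID")
```

The same helper could be reused for the secret key and the associate ID, each with its own variable name.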

Define the function used to query the API.

This is the solution suggested by user Mischa Vreeburg on Stack Overflow in the link above.

search.amazon <- function(Keywords, SearchIndex = 'All', AWSAccessKeyId, 
                          AWSsecretkey, AssociateTag, 
                          ResponseGroup = 'Small', 
                          Operation = 'ItemSearch'){
  library(digest)
  library(RCurl)

  base.html.string <- "http://ecs.amazonaws.com/onca/xml?"
  SearchIndex <- match.arg(SearchIndex, c('All',
                                          'Apparel',
                                          'Appliances',
                                          'ArtsAndCrafts',
                                          'Automotive',
                                          'Baby',
                                          'Beauty',
                                          'Blended',
                                          'Books',
                                          'Classical',
                                          'DigitalMusic',
                                          'DVD',
                                          'Electronics',
                                          'ForeignBooks',
                                          'Garden',
                                          'GourmetFood',
                                          'Grocery',
                                          'HealthPersonalCare',
                                          'Hobbies',
                                          'HomeGarden',
                                          'HomeImprovement',
                                          'Industrial',
                                          'Jewelry',
                                          'KindleStore',
                                          'Kitchen',
                                          'Lighting',
                                          'Magazines',
                                          'Marketplace',
                                          'Miscellaneous',
                                          'MobileApps',
                                          'MP3Downloads',
                                          'Music',
                                          'MusicalInstruments',
                                          'MusicTracks',
                                          'OfficeProducts',
                                          'OutdoorLiving',
                                          'Outlet',
                                          'PCHardware',
                                          'PetSupplies',
                                          'Photo',
                                          'Shoes',
                                          'Software',
                                          'SoftwareVideoGames',
                                          'SportingGoods',
                                          'Tools',
                                          'Toys',
                                          'UnboxVideo',
                                          'VHS',
                                          'Video',
                                          'VideoGames',
                                          'Watches',
                                          'Wireless',
                                          'WirelessAccessories'))

  Operation <- match.arg(Operation, c('ItemSearch',
                                      'ItemLookup',
                                      'BrowseNodeLookup',
                                      'CartAdd',
                                      'CartClear',
                                      'CartCreate',
                                      'CartGet',
                                      'CartModify',
                                      'SimilarityLookup'))

  ResponseGroup <- match.arg(ResponseGroup, c('Accessories',
                                              'AlternateVersions',
                                              'BrowseNodeInfo',
                                              'BrowseNodes',
                                              'Cart',
                                              'CartNewReleases',
                                              'CartTopSellers',
                                              'CartSimilarities',
                                              'Collections',
                                              'EditorialReview',
                                              'Images',
                                              'ItemAttributes',
                                              'ItemIds',
                                              'Large',
                                              'Medium',
                                              'MostGifted',
                                              'MostWishedFor',
                                              'NewReleases',
                                              'OfferFull',
                                              'OfferListings',
                                              'Offers',
                                              'OfferSummary',
                                              'PromotionSummary',
                                              'RelatedItems',
                                              'Request',
                                              'Reviews',
                                              'SalesRank',
                                              'SearchBins',
                                              'Similarities',
                                              'Small',
                                              'TopSellers',
                                              'Tracks',
                                              'Variations',
                                              'VariationImages',
                                              'VariationMatrix',
                                              'VariationOffers',
                                              'VariationSummary'),
                             several.ok = TRUE)

  version.request = '2011-08-01'
  Service = 'AWSECommerceService'
  if(!is.character(AWSsecretkey)){
    message('The AWSsecretkey should be entered as a character vector, i.e. be quoted')
  }

  pb.txt <- Sys.time()

  pb.date <- as.POSIXct(pb.txt, tz = Sys.timezone())

  Timestamp = strtrim(format(pb.date, tz = "GMT", usetz = TRUE, "%Y-%m-%dT%H:%M:%S.000Z"), 24)

  str = paste('GET\necs.amazonaws.com\n/onca/xml\n',
              'AWSAccessKeyId=', curlEscape(AWSAccessKeyId),
              '&AssociateTag=', AssociateTag,
              '&Keywords=', curlEscape(Keywords),
              '&Operation=', curlEscape(Operation),
              '&ResponseGroup=', curlEscape(ResponseGroup),
              '&SearchIndex=', curlEscape(SearchIndex),
              '&Service=AWSECommerceService',
              '&Timestamp=', gsub('%2E','.',gsub('%2D', '-', curlEscape(Timestamp))),
              '&Version=', version.request,
              sep = '')

  ## signature test
  Signature = curlEscape(base64(hmac( enc2utf8(AWSsecretkey), 
                                      enc2utf8(str), algo = 'sha256', 
                                      serialize = FALSE, raw = TRUE)))  

  AmazonURL <- paste(base.html.string,
                     'AWSAccessKeyId=', AWSAccessKeyId,
                     '&AssociateTag=', AssociateTag,
                     '&Keywords=', curlEscape(Keywords),
                     '&Operation=',Operation,
                     '&ResponseGroup=',ResponseGroup,
                     '&SearchIndex=', SearchIndex,
                     '&Service=AWSECommerceService',
                     '&Timestamp=', Timestamp,
                     '&Version=', version.request,
                     '&Signature=', Signature,
                     sep = '')
  AmazonResult <- getURL(AmazonURL)
  return(AmazonResult)
}
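The function above signs the request by listing the parameters in byte order, percent-encoding their values, and prepending the HTTP verb, host, and path. The sketch below isolates that step using base R only, so it does not compute the HMAC itself. The helper name canonical_query and the example values are assumptions made for illustration; they are not part of the original solution.

```r
# Hypothetical helper: build the sorted, percent-encoded query string
# that goes into the string to sign.
canonical_query <- function(params) {
  # Sort parameter names in byte (C-locale) order, so that
  # "AWSAccessKeyId" comes before "AssociateTag".
  keys <- sort(names(params), method = "radix")
  # Percent-encode every character except letters, digits and - . _ ~
  values <- vapply(
    params[keys],
    function(v) utils::URLencode(as.character(v), reserved = TRUE, repeated = TRUE),
    character(1),
    USE.NAMES = FALSE
  )
  paste(paste0(keys, "=", values), collapse = "&")
}

# A fixed timestamp in the format the function above produces.
stamp <- format(as.POSIXct("2015-04-05 07:01:53", tz = "GMT"),
                "%Y-%m-%dT%H:%M:%S.000Z", tz = "GMT")

# Example values only; the key and tag are placeholders.
params <- list(
  Service        = "AWSECommerceService",
  Keywords       = "hand blender",
  AssociateTag   = "tag-20",
  AWSAccessKeyId = "KEY",
  Timestamp      = stamp
)

query <- canonical_query(params)

# The string that would then be passed to hmac() with the secret key.
string_to_sign <- paste("GET", "ecs.amazonaws.com", "/onca/xml", query, sep = "\n")
cat(string_to_sign, "\n")
```

Building the string from a named list means the ordering is handled by the sort rather than by hand, which matters as soon as a parameter is added.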

Use the function to get the returned results into an object.

data <- search.amazon(Keywords, SearchIndex = 'All', 
                      AWSAccessKeyId, AWSsecretkey, AssociateTag, 
                      ResponseGroup = 'SalesRank', Operation = 'ItemSearch')

The API returns structured XML data. We'll need to parse the XML to make it easier to work with.

More on XML here: http://www.informit.com/articles/article.aspx?p=2215520

Restructure the data.

We'll need to load some packages to manipulate the data, including the XML package.

library("XML")
library("plyr")

We can use the xmlParse function from the XML package to parse the XML into a structure we can navigate.

xml_data = xmlParse(data)

We can now examine the top-level nodes of the XML object using the function xmlRoot.

xmltop = xmlRoot(xml_data) 
class(xmltop)

## [1] "XMLInternalElementNode" "XMLInternalNode"       
## [3] "XMLAbstractNode"
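Before indexing into the root node by position, it can help to list the names of its children. The sketch below does this on a small hand-written XML string rather than on a real response; the root element name ItemSearchResponse and the sample contents are assumptions made for illustration.

```r
library(XML)

# A tiny stand-in for the API response (structure assumed for illustration).
sample_xml <- '<ItemSearchResponse>
  <OperationRequest><RequestId>abc</RequestId></OperationRequest>
  <Items><TotalResults>1</TotalResults></Items>
</ItemSearchResponse>'

root <- xmlRoot(xmlParse(sample_xml, asText = TRUE))

# Name of the root element and of each element child.
root_name <- xmlName(root)
kids <- xmlChildren(root)
child_names <- names(kids)[names(kids) != "text"]  # ignore whitespace nodes, if any

# Children can also be reached by name instead of by position.
total_results <- xmlValue(root[["Items"]][["TotalResults"]])

print(root_name)
print(child_names)
print(total_results)
```

Accessing children by name keeps the code working even if the order of the nodes in the response changes.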

We can also look at the content of the nodes.

The first node contains meta-data, while the second node contains the results of the query.

xmltop[1]

## $OperationRequest
## <OperationRequest>
##   <RequestId>REDACTED</RequestId>
##   <Arguments>
##     <Argument Name="AWSAccessKeyId" Value="REDACTED"/>
##     <Argument Name="AssociateTag" Value="REDACTED"/>
##     <Argument Name="Keywords" Value="blender"/>
##     <Argument Name="Operation" Value="ItemSearch"/>
##     <Argument Name="ResponseGroup" Value="SalesRank"/>
##     <Argument Name="SearchIndex" Value="All"/>
##     <Argument Name="Service" Value="AWSECommerceService"/>
##     <Argument Name="Timestamp" Value="2015-04-05T07:01:53.000Z"/>
##     <Argument Name="Version" Value="2011-08-01"/>
##     <Argument Name="Signature" Value="REDACTED"/>
##   </Arguments>
##   <RequestProcessingTime>0.0652380000000000</RequestProcessingTime>
## </OperationRequest> 
## 
## attr(,"class")
## [1] "XMLInternalNodeList" "XMLNodeList"

xmltop[2]

## $Items
## <Items>
##   <Request>
##     <IsValid>True</IsValid>
##     <ItemSearchRequest>
##       <Keywords>blender</Keywords>
##       <ResponseGroup>SalesRank</ResponseGroup>
##       <SearchIndex>All</SearchIndex>
##     </ItemSearchRequest>
##   </Request>
##   <TotalResults>34415</TotalResults>
##   <TotalPages>3442</TotalPages>
##   <MoreSearchResultsUrl>REDACTED</MoreSearchResultsUrl>
##   <Item>
##     <ASIN>B00EI7DPI0</ASIN>
##     <ParentASIN>B00VK0IMYA</ParentASIN>
##     <SalesRank>74</SalesRank>
##   </Item>
##   <Item>
##     <ASIN>B003XU3C7M</ASIN>
##     <ParentASIN>B00JVGQP8U</ParentASIN>
##     <SalesRank>73</SalesRank>
##   </Item>
##   <Item>
##     <ASIN>B00KVZ27UA</ASIN>
##     <ParentASIN>B00TF5F61O</ParentASIN>
##     <SalesRank>35</SalesRank>
##   </Item>
##   <Item>
##     <ASIN>B0081PTLGU</ASIN>
##     <SalesRank>762</SalesRank>
##   </Item>
##   <Item>
##     <ASIN>B004P2OLB8</ASIN>
##     <ParentASIN>B0056Y4ZD8</ParentASIN>
##     <SalesRank>108</SalesRank>
##   </Item>
##   <Item>
##     <ASIN>B00939FV8K</ASIN>
##     <ParentASIN>B00JXG6SX0</ParentASIN>
##     <SalesRank>161</SalesRank>
##   </Item>
##   <Item>
##     <ASIN>B003ZDNILM</ASIN>
##     <ParentASIN>B00UHHHOMS</ParentASIN>
##     <SalesRank>2336</SalesRank>
##   </Item>
##   <Item>
##     <ASIN>B009NIN2UU</ASIN>
##     <ParentASIN>B00V0170K6</ParentASIN>
##     <SalesRank>2214</SalesRank>
##   </Item>
##   <Item>
##     <ASIN>B001E7XGX6</ASIN>
##     <ParentASIN>B00JTP5TPI</ParentASIN>
##     <SalesRank>317</SalesRank>
##   </Item>
##   <Item>
##     <ASIN>B00JFLVMNE</ASIN>
##     <ParentASIN>B00VKOROEK</ParentASIN>
##     <SalesRank>122</SalesRank>
##   </Item>
## </Items> 
## 
## attr(,"class")
## [1] "XMLInternalNodeList" "XMLNodeList"

Put the query results into a dataframe.

# We'll put the contents of the second node into its own object.
node.2 <- xmltop[[2]]

# The first child of this node contains the search term.
xmlChildren(node.2)[[1]]

## <Request>
##   <IsValid>True</IsValid>
##   <ItemSearchRequest>
##     <Keywords>blender</Keywords>
##     <ResponseGroup>SalesRank</ResponseGroup>
##     <SearchIndex>All</SearchIndex>
##   </ItemSearchRequest>
## </Request>

# The second child of this node contains the number of results associated with the search term.
xmlChildren(node.2)[[2]]

## <TotalResults>34415</TotalResults>

# The third child of this node contains the number of pages of results associated with the search term.  
xmlChildren(node.2)[[3]]

## <TotalPages>3442</TotalPages>

# The fourth child of this node contains a URL that points to more search results.
xmlChildren(node.2)[[4]]

## <MoreSearchResultsUrl>REDACTED</MoreSearchResultsUrl>

# The fifth child of this node contains the first result.
xmlChildren(node.2)[[5]]

## <Item>
##   <ASIN>B00EI7DPI0</ASIN>
##   <ParentASIN>B00VK0IMYA</ParentASIN>
##   <SalesRank>74</SalesRank>
## </Item>

# We'll transform the XML into a dataframe, then drop the first four rows (the request metadata) and keep the ten items.
query_results <- ldply(xmlToList(node.2), data.frame) 

query_results <- query_results[5:14,]
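Selecting rows 5 to 14 relies on the items always appearing in those positions. An alternative, sketched below on a small hand-written XML string, is to pick out the Item nodes by name and read the fields of interest from each one. The helper child_value and the sample data layout are assumptions for illustration, not part of the original solution.

```r
library(XML)

# A tiny stand-in for the Items node (the second item has no ParentASIN).
sample_xml <- '<Items>
  <TotalResults>2</TotalResults>
  <Item><ASIN>B00EI7DPI0</ASIN><ParentASIN>B00VK0IMYA</ParentASIN><SalesRank>74</SalesRank></Item>
  <Item><ASIN>B0081PTLGU</ASIN><SalesRank>762</SalesRank></Item>
</Items>'

items_node <- xmlRoot(xmlParse(sample_xml, asText = TRUE))

# Keep only the children named "Item", wherever they appear.
kids <- xmlChildren(items_node)
item_nodes <- kids[names(kids) == "Item"]

# Hypothetical helper: text of a named child, or NA if the child is absent.
child_value <- function(node, name) {
  child <- node[[name]]
  if (is.null(child)) NA_character_ else xmlValue(child)
}

items_df <- data.frame(
  ASIN = vapply(item_nodes, child_value, character(1),
                name = "ASIN", USE.NAMES = FALSE),
  SalesRank = as.integer(vapply(item_nodes, child_value, character(1),
                                name = "SalesRank", USE.NAMES = FALSE)),
  stringsAsFactors = FALSE
)

print(items_df)
```

This approach returns however many items the response contains, so it does not need to change if a page holds fewer than ten results.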

Tabulate some cleaned results.

Two columns contain useful data - one called “ASIN”, and another called “SalesRank”.

ASIN (Amazon Standard Identification Number) refers to a unique identifier Amazon assigns to a product, much like an ISBN. Sales rank is exactly what it sounds like - the sales rank of the product, with lower numbers indicating higher sales. More information about Amazon sales ranks and their properties can be found in these papers:

Chevalier, J. A., & Mayzlin, D. (2006). The Effect of Word of Mouth on Sales: Online Book Reviews. Journal of Marketing Research, 43(3), 345-354.

Deschatres, F., & Sornette, D. (2005). Dynamics of book sales: Endogenous versus exogenous shocks in complex networks. Physical Review E, 72(1). http://doi.org/10.1103/PhysRevE.72.016112

data_tab <- query_results[,c("ASIN", "SalesRank")]

# Load a package to neatly tabulate the data.
library(xtable)
# Use a name other than "table" to avoid masking base R's table() function.
results_table <- xtable(data_tab)
print(results_table, type = "html", include.rownames = FALSE)

ASIN	SalesRank
B00EI7DPI0	74
B003XU3C7M	73
B00KVZ27UA	35
B0081PTLGU	762
B004P2OLB8	108
B00939FV8K	161
B003ZDNILM	2336
B009NIN2UU	2214
B001E7XGX6	317
B00JFLVMNE	122
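The table above lists the items in the order the search returned them, not in order of sales rank. Because values extracted from XML arrive as text, the ranks need to be converted to numbers before sorting. The sketch below shows one way to do this in base R, using the first four rows of the table as sample data; the object name ranked is an assumption for illustration.

```r
# Sample data copied from the first four rows of the table above,
# with SalesRank stored as text, as it is after parsing XML.
data_tab <- data.frame(
  ASIN = c("B00EI7DPI0", "B003XU3C7M", "B00KVZ27UA", "B0081PTLGU"),
  SalesRank = c("74", "73", "35", "762"),
  stringsAsFactors = FALSE
)

# Convert to integer; as.character() first makes this safe for factors too.
data_tab$SalesRank <- as.integer(as.character(data_tab$SalesRank))

# Order rows from best-selling (lowest rank) to worst.
ranked <- data_tab[order(data_tab$SalesRank), ]

print(ranked)
```

Without the conversion, sorting would compare the ranks as text, which would place "762" before "73".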