
How to scrape and analyse your Amazon spending data

June 3, 2021 8 minute read
Phone with amazon logo on the screen
Source: Unsplash

Ever wondered just how much you've spent on Amazon since signing up? I recently read an article from Dataquest outlining how to find out, but I quickly discovered that the feature it relies on, downloading your order history as a report, is not available on the UK version of the site. I still wanted the data, so I started a small project to gather it myself. If you're interested in collecting and analysing your Amazon spending data with Python, while learning some web scraping along the way, you're in the right place.

Before starting

Before starting you will need a few things. They will also set you up to carry out other data science projects in the future.

  • Anaconda
  • Jupyter Notebooks (installed with Anaconda)
  • Selenium
  • Google Chrome (latest version)
  • Chrome Driver (latest version)

This article will not cover installing these programs in detail, but here is a starting point. Install Anaconda first: it is a distribution of the Python and R programming languages for scientific computing (data science, machine learning, large-scale data processing, predictive analytics and so on) that aims to simplify package management and deployment. Once it is installed, open Anaconda Prompt and install Selenium with pip install selenium. Selenium is a browser automation tool used to drive the browser for automated actions and testing. Finally, make sure you have the latest version of Google Chrome installed, along with the ChromeDriver release that matches your Chrome version number. On Windows, put chromedriver.exe somewhere on your PATH, such as C:\Windows.
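If you want to sanity-check the setup before going any further, a minimal sketch like the one below (assuming chromedriver.exe is on your PATH, as described above) should open a Chrome window, load Amazon and print the page title.

import time
from selenium import webdriver

# Quick check that Selenium can find ChromeDriver and control Chrome.
# Assumes chromedriver.exe is on your PATH (e.g. C:\Windows) or in the working directory.
driver = webdriver.Chrome()
driver.get("https://www.amazon.co.uk")
print(driver.title)  # prints the Amazon homepage title if everything is wired up
time.sleep(2)
driver.quit()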

There is a link to download the Jupyter Notebook at the end of this article so you can try out the code yourself. Alternatively, if you don't want to use Anaconda and Jupyter Notebooks, just use the code on this page and install the required Python packages in a virtual environment.

What will the web scraper do?

Here are the step by step actions the web scraper will perform to scrape Amazon spending data:

  • Launches a Chrome browser controlled by Selenium
  • Navigates to the Amazon login page
  • Waits 30 seconds for you to manually log in
  • After login, navigates to the Orders page
  • Scrapes Item Costs, Order IDs, and Order Dates
  • Repeats for each year in the year filter and each page in the pagination filter until finished
  • Outputs the data model to a CSV file

The result will be enough to answer questions such as:

  • How much have I spent in total?
  • How much do I spend on average per order?
  • What were the most expensive orders?
  • What is my spending like per day of the week, month, year?

Before we step into the code, let's take a look at the automated scraper in action. Pay attention to the &orderFilter= and &startIndex= parameters in the URL bar. I've blurred out personal details of course, but you'll see how the scraper moves from year to year, and then page to page to scrape all of the order data.
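To make that navigation concrete, here is a small standalone sketch (not part of the scraper class itself) that prints the URLs the scraper requests. It uses the same orderFilter and startIndex pattern: each page shows ten orders, so startIndex increases in steps of ten.

# Sketch of the order history URLs, one per year and page
base = ("https://www.amazon.co.uk/gp/your-account/order-history/"
        "ref=ppx_yo_dt_b_pagination_1_4?ie=UTF8&orderFilter=year-{year}"
        "&search=&startIndex={start_index}")

for year in range(2019, 2022):            # a couple of example years
    for start_index in range(0, 30, 10):  # first three pages of each year
        print(base.format(year=year, start_index=start_index))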


Scraping the data

Let's look at the AmazonOrderScraper class, which takes centre stage. Bear in mind that this script was accurate at the time of writing; if the Amazon website changes (id or class names, page structure or URL paths) it may no longer work and will need amending. Underneath this fairly long snippet you can simulate running the code to understand what it does and what the final dataframe will look like.

order-scraper.py
import time

import numpy as np
import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver

class AmazonOrderScraper:
    
    def __init__(self):
        self.date = np.array([])
        self.cost = np.array([])
        self.order_id = np.array([])
        
    
    def URL(self, year: int, start_index: int) -> str:
        return "https://www.amazon.co.uk/gp/your-account/order-history/" + \
                "ref=ppx_yo_dt_b_pagination_1_4?ie=UTF8&orderFilter=year-" + \
                str(year) + \
                "&search=&startIndex=" + \
                str(start_index)
    
    
    def scrape_order_data(self, start_year: int, end_year: int) -> pd.DataFrame:
        years = list(range(start_year, end_year + 1))
        driver = self.start_driver_and_manually_login_to_amazon()

        for year in years:
            driver.get(
                self.URL(year, 0)
            )
            
            number_of_pages = self.find_max_number_of_pages(driver)
            
            self.scrape_first_page_before_progressing(driver)

            # remaining pages for this year; the first page was scraped above
            for i in range(number_of_pages):
                self.scrape_page(driver, year, i)

            print(f"Order data extracted for { year }") 
            
        driver.close()
        
        print("Scraping done :)")
            
        order_data = pd.DataFrame({
            "Date": self.date,
            "Cost £": self.cost,
            "Order ID": self.order_id
        })
        
        order_data = self.prepare_dataset(order_data)
        
        order_data.to_csv(r"amazon-orders.csv")
            
        return order_data
    

    def start_driver_and_manually_login_to_amazon(self) -> webdriver:
        options = webdriver.ChromeOptions()
        options.add_argument("--start-maximized")
        
        # chromedriver.exe needs to be on your PATH (e.g. C:\Windows) or in the working directory
        driver = webdriver.Chrome("chromedriver.exe", options=options)
        amazon_sign_in_url = "https://www.amazon.co.uk/ap/signin?" + \
            "_encoding=UTF8&accountStatusPolicy=P1&" + \
            "openid.assoc_handle=gbflex&openid.claimed_id" + \
            "=http%3A%2F%2Fspecs.openid.net%2Fauth%2F2.0%2Fidentifier_select&" + \
            "openid.identity=http%3A%2F%2Fspecs.openid.net%2Fauth%2F2.0%2Fidentifier" + \
            "_select&openid.mode=checkid_setup&openid.ns=http%3A%2F%2Fspecs.openid" + \
            ".net%2Fauth%2F2.0&openid.ns.pape=http%3A%2F%2Fspecs.openid.net" + \
            "%2Fextensions%2Fpape%2F1.0&openid.pape.max_auth_age=0&openid" + \
            ".return_to=https%3A%2F%2Fwww.amazon.co.uk%2Fgp%2Fcss%2Forder-history" + \
            "%3Fie%3DUTF8%26ref_%3Dnav_orders_first&" + \
            "pageId=webcs-yourorder&showRmrMe=1"

        driver.get(amazon_sign_in_url)
        time.sleep(30)  # allows time for manual sign in - increase if you need more time

        return driver
    
    
    def find_max_number_of_pages(self, driver: webdriver) -> int:
        time.sleep(2)
        page_source = driver.page_source
        page_content = BeautifulSoup(page_source, "html.parser")

        # pagination <li> elements: the currently selected page plus the other page links
        a_normal = page_content.find_all("li", {"class": "a-normal"})
        a_selected = page_content.find_all("li", {"class": "a-selected"})
        max_pages = len(a_normal + a_selected) - 1  # pages remaining after the first

        return max_pages
    
    
    def scrape_first_page_before_progressing(self, driver: webdriver) -> None:
        time.sleep(2)
        page_source = driver.page_source
        page_content = BeautifulSoup(page_source, "html.parser")
        order_info = page_content.find_all("span", {"class": "a-color-secondary value"})

        orders = []
        for span in order_info:
            orders.append(span.text.strip())

        # values repeat in groups of three per order: date, cost, order id
        index = 0
        for value in orders:
            if index == 0:
                self.date = np.append(self.date, value)
                index += 1
            elif index == 1:
                self.cost = np.append(self.cost, value)
                index += 1
            elif index == 2:
                self.order_id = np.append(self.order_id, value)
                index = 0
    
    
    def scrape_page(self, driver: webdriver, year: int, i: int) -> None:
        # startIndex steps up in tens (ten orders per page), starting from page two
        start_index = list(range(10, 110, 10))

        driver.get(
            self.URL(year, start_index[i])
        )
        time.sleep(2)

        data = driver.page_source
        page_content = BeautifulSoup(data, "html.parser")

        order_info = page_content.find_all("span", {"class": "a-color-secondary value"})

        orders = []
        for span in order_info:
            orders.append(span.text.strip())

        # values repeat in groups of three per order: date, cost, order id
        index = 0
        for value in orders:
            if index == 0:
                self.date = np.append(self.date, value)
                index += 1
            elif index == 1:
                self.cost = np.append(self.cost, value)
                index += 1
            elif index == 2:
                self.order_id = np.append(self.order_id, value)
                index = 0
                
    
    def prepare_dataset(self, order_data: pd.DataFrame) -> pd.DataFrame:
        order_data.set_index("Order ID", inplace=True)

        # strip the pound sign, convert to float, then derive date-based columns
        order_data["Cost £"] = order_data["Cost £"].str.replace("£", "", regex=False).astype(float)
        order_data["Order Date"] = pd.to_datetime(order_data["Date"])
        order_data["Year"] = order_data["Order Date"].dt.year
        order_data["Month Number"] = order_data["Order Date"].dt.month
        order_data["Day"] = order_data["Order Date"].dt.dayofweek
        
        day_of_week = { 
            0:'Monday', 
            1:'Tuesday', 
            2:'Wednesday', 
            3:'Thursday', 
            4:'Friday', 
            5:'Saturday', 
            6:'Sunday'
        }
        
        order_data["Day Of Week"] = order_data['Order Date'].dt.dayofweek.map(day_of_week)
        
        month = { 
            1:'January', 
            2:'February', 
            3:'March', 
            4:'April', 
            5:'May', 
            6:'June', 
            7:'July', 
            8:'August', 
            9:'September', 
            10:'October', 
            11:'November', 
            12:'December'
        }

        order_data["Month"] = order_data['Order Date'].dt.month.map(month)
        
        return order_data


if __name__ == "__main__":
    aos = AmazonOrderScraper()
    order_data = aos.scrape_order_data(start_year = 2010, end_year = 2021)
    print(order_data.head(3))
user@ShedloadOfCode:~$ python order-scraper.py

Once the class is instantiated as aos, we call the scrape_order_data method and it handles everything else. You need to pass start_year and end_year as parameters, which lets you scrape the full range of years applicable to you, or just a selected range.
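Because the dataframe is also written to amazon-orders.csv, you don't have to re-run the scraper every time you want to analyse the data. A minimal sketch for loading the saved file back in a later session (assuming the CSV produced by the class above) could look like this:

import pandas as pd

# reload the scraped orders, keeping Order ID as the index and re-parsing the date column
order_data = pd.read_csv("amazon-orders.csv", index_col="Order ID", parse_dates=["Order Date"])
print(order_data.head(3))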

I used a similar method to this in How to scrape AutoTrader with Python and Selenium to search for multiple makes and models.

Analysing the data

The prepare_dataset method applied some feature engineering to enhance the dataset, making sure the data can be sliced by date, year, month and day of the week. It carried out a series of data manipulation steps, such as removing the pound sign from the cost column, ensuring data types were correct, and mapping the month and day-of-week numbers to their names, ready to use with charts.
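If you want to see those steps in isolation, here is a tiny illustration on a couple of made-up rows. It mirrors what prepare_dataset does, although as a shortcut it uses pandas' built-in .dt.month_name() and .dt.day_name() accessors instead of the explicit dictionaries:

import pandas as pd

toy = pd.DataFrame({
    "Date": ["31 March 2020", "5 December 2020"],  # made-up example orders
    "Cost £": ["£299.99", "£127.37"],
})

toy["Cost £"] = toy["Cost £"].str.replace("£", "", regex=False).astype(float)
toy["Order Date"] = pd.to_datetime(toy["Date"])
toy["Month"] = toy["Order Date"].dt.month_name()
toy["Day Of Week"] = toy["Order Date"].dt.day_name()
print(toy)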

So now you have your data, you can apply any analysis you like to it. Below is some inspiration on the kinds of questions you might wish to ask. You might find (like I did) that your spending is higher or lower than you expected, so brace yourself for surprises!

Import packages

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt 
import seaborn as sns
sns.set(rc={'figure.facecolor':'white'})

Summary statistics

order_data.describe()
            Cost £         Year  Month Number         Day
count   523.000000   523.000000    523.000000  523.000000
mean     18.695985  2015.139579      6.699809    2.797323
std      23.793675     3.276180      3.612417    2.164905
min       0.000000  2010.000000      1.000000    0.000000
25%       5.330000  2012.000000      3.500000    1.000000
50%      12.750000  2015.000000      7.000000    3.000000
75%      23.015000  2018.000000     10.000000    5.000000
max     299.990000  2021.000000     12.000000    6.000000

Total spend

total_amount_spent = order_data["Cost £"].sum()
print(f"Total amount spent: £{ total_amount_spent }")

Average spend per order

average_amount_spent_per_order = order_data["Cost £"].mean()
print(f"Average amount spent per order: £{ round(average_amount_spent_per_order, 2) }")

Most and least expensive orders

order_data.loc[order_data["Cost £"] == order_data["Cost £"].max()]
Order ID             Date           Cost £  Order Date  Year  Day Of Week  Month
205-1516165-1234567  31 March 2020  299.99  2020-03-31  2020  Tuesday      March
order_data.loc[order_data["Cost £"] == order_data["Cost £"].min()]
Order ID             Date          Cost £  Order Date  Year  Day Of Week  Month
123-5616156-1234567  21 June 2011     0.0  2011-06-21  2011  Tuesday      June

Top five most expensive orders

order_data.sort_values(ascending=False, by="Cost £").head(5)
Order ID             Date              Cost £  Order Date  Year  Day Of Week  Month
205-2452455-9123505  31 March 2020     299.99  2020-03-31  2020  Tuesday      March
204-4525421-7169117  15 November 2020  239.00  2020-11-15  2020  Sunday       November
205-5245215-9426706  28 February 2020  138.22  2020-02-28  2020  Friday       February
202-5278588-7857857  17 November 2018  135.99  2018-11-17  2018  Saturday     November
204-2542525-5654645  5 December 2020   127.37  2020-12-05  2020  Saturday     December

Total spend per year

fig, ax = plt.subplots(figsize=(15,6))
yoy_cost = order_data.groupby(["Year"], as_index=False).sum()
sns.lineplot(x=yoy_cost["Year"], y=yoy_cost["Cost £"], color="grey")
plt.title("How much spending per year?")
plt.ylabel("Spending £")

Total spend per year graph

Count of orders per year

fig, ax = plt.subplots(figsize=(15,6))
yoy_order_count = order_data.groupby(["Year"], as_index=False).count()
sns.lineplot(x=yoy_order_count["Year"], y=yoy_order_count["Cost £"], color="Grey")
plt.title("How many orders per year?")
plt.ylabel("Count of Orders")

Count of orders per year graph

Total monthly spend

months = ["January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December"]

fig, ax = plt.subplots(figsize=(15,6))
monthly_cost = order_data.groupby(["Month"], as_index=False).sum()
sns.barplot(x=monthly_cost["Month"], y=monthly_cost["Cost £"], order=months, color="Grey")
plt.ylabel("Spending £")
plt.title("How much overall spending per month?")

Total monthly spend graph

Average monthly spend

months = ["January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December"]

fig, ax = plt.subplots(figsize=(15,6))
monthly_cost = order_data.groupby(["Month"], as_index=False).mean()
sns.barplot(x=monthly_cost["Month"], y=monthly_cost["Cost £"], order=months, color="Grey")
plt.ylabel("Spending £")
plt.title("Average spending per month?")

Average monthly spend graph

Day of the week with highest spend

days_of_week = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]

fig, ax = plt.subplots(figsize=(15,6))
day_of_week_cost = order_data.groupby(["Day Of Week"], as_index=False).sum()
sns.barplot(x=day_of_week_cost["Day Of Week"], y=day_of_week_cost["Cost £"], order=days_of_week, color="Grey")
plt.ylabel("Spending £")
plt.title("Which day of the week has the highest spend?")

Day of week with highest spend graph

Full time series

fig, ax = plt.subplots(figsize=(15,6))
sns.lineplot(x=order_data['Order Date'], y=order_data["Cost £"], color="Grey")
plt.ylabel("Spending £")
plt.title("Spending Time Series")

Total spending graph

Final words and next steps

So there it is: you can now scrape and analyse your Amazon spending data using Python. Hopefully the answers to the questions we've asked in this article haven't caused too many surprises! You now have a way to monitor, track and analyse your spending to identify trends. If there are any other analytical questions you'd like to ask of this dataset, let me know in the comments below and I'll update the article. The full Jupyter notebook can be downloaded for reference.

Ideas for future development might include importing the CSV into Power BI or another analysis tool. This would allow interactive data exploration and introduce cross-filtering functionality: you could then cross-examine day of the week with year, or day of the month with month, and all other combinations, which could unlock further insights.
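As a quick taste of that kind of cross-filtering without leaving pandas, a pivot table over the saved file (assuming the amazon-orders.csv produced earlier) would show total spend for every day-of-week and year combination:

import pandas as pd

order_data = pd.read_csv("amazon-orders.csv", index_col="Order ID")

# total spend for each day-of-week / year combination
pivot = order_data.pivot_table(
    values="Cost £",
    index="Day Of Week",
    columns="Year",
    aggfunc="sum",
    fill_value=0,
)
print(pivot.round(2))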