How to scrape and analyse your data

July 10, 2021 8 minute read
Black chess pieces on a white table
Source: Unsplash

In this article I scrape the games data from my profile and analyse my historical performance in live matches, using a reproducible Python pipeline. I took up chess again at the end of 2020 after a long hiatus, so I was eager to monitor my performance and find my weaknesses. The advantage of this pipeline is that re-running the scripts refreshes the data, so I can always see what I need to improve and ask new questions about my performance.

Before starting

Before starting you will need a few things. These will also set you up for other Data Science projects in the future, such as analysing your Amazon spending data or scraping AutoTrader for multiple makes and models:

  • Anaconda
  • Jupyter Notebooks (installed with Anaconda)
  • Selenium
  • Google Chrome (latest version)
  • Chrome Driver (latest version)

This article does not cover installation in detail, but here is a starting point. Install Anaconda first. Anaconda is a distribution of the Python and R programming languages for scientific computing (data science, machine learning applications, large-scale data processing, predictive analytics, etc.) that aims to simplify package management and deployment. Once it is installed, open Anaconda Prompt and install Selenium using pip install selenium. Selenium is a browser automation tool built for automated actions in the browser and for testing. Finally, ensure you have the latest version of Google Chrome installed, along with the ChromeDriver release that matches the version of Chrome you are running. On Windows, ensure chromedriver.exe is in a suitable location such as C:\Windows.
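
As a quick sanity check before running the scraper, the setup described above can be verified from Python. This is a minimal sketch, not part of the original pipeline: check_setup is a hypothetical helper name, and it assumes the driver binary is called chromedriver and is expected on your PATH.

```python
import importlib.util
import shutil


def check_setup(driver_name="chromedriver"):
    """Report whether Selenium is importable and a driver binary is on PATH.

    check_setup and driver_name are illustrative names, not part of Selenium.
    """
    return {
        # find_spec returns None when the package cannot be found
        "selenium_installed": importlib.util.find_spec("selenium") is not None,
        # shutil.which returns the full path, or None if nothing is found
        "driver_path": shutil.which(driver_name),
    }


print(check_setup())
```

If driver_path comes back as None, the driver is probably not in a folder listed on PATH.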

What will the web scraper do?

Here are the step-by-step actions the web scraper will perform to scrape your games data:

  • Launches a Chrome browser controlled by Selenium
  • Navigates to the login page and logs in with your given details
  • After login, navigates to the My Games page
  • Scrapes all game data
  • Repeats for each page in the archive until finished
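
The last step, repeating until the archive is finished, can be sketched independently of the browser. The script in this article loops over a fixed number of pages; the version below instead stops at the first empty page. It is only a sketch: fetch_page is a hypothetical callable standing in for "load page N and return its rows", and max_pages is an assumed safety cap.

```python
def scrape_all_pages(fetch_page, max_pages=100):
    """Collect rows page by page until a page comes back empty.

    fetch_page is a hypothetical callable: page number (1-based) -> list of rows.
    """
    all_rows = []
    for page_number in range(1, max_pages + 1):
        rows = fetch_page(page_number)
        if not rows:  # an empty page means the archive is exhausted
            break
        all_rows.extend(rows)
    return all_rows


# Stand-in for a real page fetch: three pages of fake games, then nothing
fake_archive = {1: ["game1", "game2"], 2: ["game3", "game4"], 3: ["game5"]}
games_found = scrape_all_pages(lambda page: fake_archive.get(page, []))
print(games_found)
```

The cap guards against looping forever if a site keeps returning the last page for out-of-range page numbers.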

The resulting data will be enough to answer questions such as:

  • Do I win more matches as black or white?
  • Do I win shorter or longer games?
  • Am I losing to higher or lower rated players?
  • Is time-pressure affecting my wins?
  • How many of my games reach the endgame?
  • Do specific days affect my results?
  • Does seasonality affect my results?
  • How has my rating developed in 30 min games?

Scraping games data

First, scrape the required data using Selenium. You must provide your USERNAME and PASSWORD so the script can log you in, so be sure to amend these variables first.
import numpy as np
import pandas as pd
import bs4
from bs4 import BeautifulSoup
import requests
import csv
import datetime
import time
import hashlib
import os  
from selenium import webdriver  
from selenium.webdriver.common.keys import Keys  
from selenium.webdriver.chrome.options import Options

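options = webdriver.ChromeOptions()
now = datetime.datetime.now()

USERNAME = "DeadlyKnightX"
PASSWORD = "Your password here"
# Archive URL for rated live games, with today's date as the end date.
# The base address (the leading "") is not given here; set it for your own profile.
GAMES_URL = (
    ""
    + USERNAME
    + "&gameType=live&gameResult=&opponent=&opening=&color=&gameTourTeam=&"
    + "timeSort=desc&rated=rated&startDate%5Bdate%5D=08%2F01%2F2013&endDate%5Bdate%5D="
    + str(now.month) + "%2F" + str(now.day) + "%2F" + str(now.year)
    + "&page="  # assumption: the closing fragment is missing; a page parameter is expected here
)

driver = webdriver.Chrome("chromedriver.exe", options=options)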

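tables = []
game_links = []

for page_number in range(4):
    driver.get(GAMES_URL + str(page_number + 1))
    # Parse the archive table on this page into a DataFrame
    tables.append(
        pd.read_html(
            driver.page_source,
            attrs={'class':'table-component table-hover archive-games-table'}
        )[0]
    )
    # Collect the link to each game from the user cells
    table_user_cells = driver.find_elements_by_class_name('archive-games-user-cell')
    for cell in table_user_cells:
        link = cell.find_elements_by_tag_name('a')[0]
        game_links.append(link.get_attribute('href'))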

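# ignore_index gives every game a unique row label across pages
games = pd.concat(tables, ignore_index=True)

# Build a per-game identifier from players, result, moves and date
identifier = (
    games['Players'] + games['Result'].astype(str) + games['Moves'].astype(str) + games['Date']
).str.replace(" ", "")

# Hash the identifier and store it as the first column
games.insert(0, 'GameId', identifier.apply(lambda x: hashlib.sha1(x.encode("utf-8")).hexdigest()))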

GameId | Unnamed: 0 | Players | Result | Accuracy | Moves | Date | Unnamed: 6
7e0c2bc5f27e025 | 1 hour | DominikHrbaty (1319) DeadlyKnightX (1387) | 0 1 | 84.7 84.4 | 68 | Dec 22,2020 | NaN
7f6c05e773ebe23 | 30 mins | Omarricardo34 (1126) DeadlyKnightX (1359) | 0 1 | 49 57.2 | 52 | Dec 19,2020 | NaN
af2b84926911844 | 30 mins | DeadlyKnightX (1344) albert106 (1138) | 1 0 | 94.4 5.6 | 13 | Dec 19,2020 | NaN

Now that we have a games DataFrame holding the raw data, we can concentrate on transforming it by splitting columns, removing unnecessary columns, and adding calculated columns to derive more insight.

Transform games data
# Create white player, black player, white rating, black rating
new = games.Players.str.split(" ", n=5, expand=True)
new = new.drop([1,4], axis=1)
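# regex=False treats the brackets as literal characters on every pandas version
new[2] = new[2].str.replace('(', '', regex=False).str.replace(')', '', regex=False).astype(int)
new[5] = new[5].str.replace('(', '', regex=False).str.replace(')', '', regex=False).astype(int)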
games['White Player'] = new[0]
games['White Rating'] = new[2]
games['Black Player'] = new[3]
games['Black Rating'] = new[5]

# Add results
result = games.Result.str.split(" ", expand=True)
games['White Result'] = result[0]
games['Black Result'] = result[1]

# Drop unnecessary columns
games = games.rename(columns={"Unnamed: 0": "Time"})
games = games.drop(['Players', 'Unnamed: 6', 'Result', 'Accuracy'], axis=1)

# Add calculated columns for wins, losses, draws, ratings, year, game links
conditions = [
        (games['White Player'] == USERNAME) & (games['White Result'] == '1'),
        (games['Black Player'] == USERNAME) & (games['Black Result'] == '1'),
        (games['White Player'] == USERNAME) & (games['White Result'] == '0'),
        (games['Black Player'] == USERNAME) & (games['Black Result'] == '0'),
]
choices = ["Win", "Win", "Loss", "Loss"]
games['W/L'] = np.select(conditions, choices, default="Draw")

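conditions = [
        (games['White Player'] == USERNAME),
        (games['Black Player'] == USERNAME)
]
choices = ["White", "Black"]
# Explicit string default: newer NumPy cannot mix the implicit default of 0 with string choices
games['Colour'] = np.select(conditions, choices, default="Unknown")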

conditions = [
        (games['White Player'] == USERNAME),
        (games['Black Player'] == USERNAME)
]
choices = [games['White Rating'], games['Black Rating']]
games['My Rating'] = np.select(conditions, choices)

conditions = [
        (games['White Player'] != USERNAME),
        (games['Black Player'] != USERNAME)
]
choices = [games['White Rating'], games['Black Rating']]
games['Opponent Rating'] = np.select(conditions, choices)

games['Rating Difference'] = games['Opponent Rating'] - games['My Rating']

conditions = [
        (games['White Player'] == USERNAME) & (games['White Result'] == '1'),
        (games['Black Player'] == USERNAME) & (games['Black Result'] == '1')
]
choices = [1, 1]
games['Win'] = np.select(conditions, choices)

conditions = [
        (games['White Player'] == USERNAME) & (games['White Result'] == '0'),
        (games['Black Player'] == USERNAME) & (games['Black Result'] == '0')
]
choices = [1, 1]
games['Loss'] = np.select(conditions, choices)

conditions = [
        (games['White Player'] == USERNAME) & (games['White Result'] == '½'),
        (games['Black Player'] == USERNAME) & (games['Black Result'] == '½')
]
choices = [1, 1]
games['Draw'] = np.select(conditions, choices)

games['Year'] = pd.to_datetime(games['Date']).dt.to_period('Y')

games['Link'] = pd.Series(game_links)

# Optional calculated columns for indicating black or white pieces - uncomment if interested in these
# games['Is_White'] = np.where(games['White Player']==USERNAME, 1, 0)
# games['Is_Black'] = np.where(games['Black Player']==USERNAME, 1, 0)

# Correct date format
games["Date"] = pd.to_datetime(
    games["Date"].str.replace(",", "") + " 00:00", format = '%b %d %Y %H:%M'
)

GameId | Time | Moves | Date | White Player | White Rating | Black Player | Black Rating | White Result | Black Result | W/L | Colour | My Rating | Opponent Rating | Rating Difference | Win | Loss | Draw | Year | Link
7e0c2bc5f27e025b741fa464cf45a40054e0e637 | 1 hour | 68 | 22/12/2020 | DominikHrbaty | 1319 | DeadlyKnightX | 1387 | 0 | 1 | Win | Black | 1387 | 1319 | -68 | 1 | 0 | 0 | 2020 |
17f6c05e773ebe23c52164b09fec2ea9de2a9dc6 | 30 min | 52 | 19/12/2020 | Omarricardo34 | 1126 | DeadlyKnightX | 1359 | 0 | 1 | Win | Black | 1359 | 1126 | -233 | 1 | 0 | 0 | 2020 |
af2b84926911833c2e644d6400f39437f8fe0341 | 30 min | 13 | 19/12/2020 | DeadlyKnightX | 1344 | albert106 | 1138 | 1 | 0 | Win | White | 1344 | 1138 | -206 | 1 | 0 | 0 | 2020 |

Great! The data has been transformed, extended and is now ready for analysis.
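
The player and rating split above depends on the position of each token after splitting on spaces. As an alternative, here is a minimal sketch that pulls out the same four fields with one regular expression. It assumes the Players text always looks like "name (rating) name (rating)" and uses two toy rows rather than the scraped DataFrame.

```python
import pandas as pd

# Toy rows in the same shape as the scraped Players column
players = pd.Series([
    "DominikHrbaty (1319) DeadlyKnightX (1387)",
    "DeadlyKnightX (1344) albert106 (1138)",
])

# name, rating in brackets, name, rating in brackets; \s* tolerates extra spaces
pattern = r"^\s*(\S+)\s*\((\d+)\)\s*(\S+)\s*\((\d+)\)\s*$"
parsed = players.str.extract(pattern)
parsed.columns = ["White Player", "White Rating", "Black Player", "Black Rating"]

# The captured ratings are strings, so convert them to integers
parsed = parsed.astype({"White Rating": int, "Black Rating": int})
print(parsed)
```

Rows that do not match the pattern come back as missing values, in which case the integer conversion fails loudly instead of silently shifting columns.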

Analysing games data

With a solid dataset prepared, you can now apply any analysis you like to it. These are the visualisations I produced, based on what I was interested in. First let's import the key visualisation libraries, matplotlib and seaborn.

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

Overall rating

fig, ax = plt.subplots(figsize=(15,6))
plt.title(" Rating Development")
sns.lineplot(x="Date", y="My Rating", data=games.iloc[::-1], color="black")

I can quite clearly see here that I didn't play for a while, until the end of 2020 when I picked Chess back up. This was met by a few losses and a rating dip - I was certainly out of practice.

Wins, losses and draws

fig, ax = plt.subplots(figsize=(15,6))
plt.title("Wins, Losses and Draws")
sns.countplot(data=games, x='W/L', palette="Greys", edgecolor="black")

Wins, losses and draws chart

The good news from this data is that I win more than I lose... but there is plenty of room for improvement!

Wins with white vs black pieces

fig, ax = plt.subplots(figsize=(15,6))
plt.title("Wins, Losses and Draws by Colour")
sns.countplot(data=games, x='W/L', hue="Colour", palette={"Black": "Grey", "White": "White"}, edgecolor="black");

Wins, losses and draws by piece colour

This clearly shows that I am stronger playing as black.

Win rate with white vs black pieces

fig, ax = plt.subplots(figsize=(15,6))
ax.set_title("Win Rate by Colour")
sns.barplot(data=games, x='Colour', y='Win', palette={"Black": "Grey", "White": "White"}, edgecolor="black", ax=ax);

Win rate by piece colour

A higher win rate as black.
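
To put numbers next to the bar chart, the win rate per colour can be computed with a groupby. This is a sketch on made-up rows, assuming the same Colour and Win columns as the games DataFrame; the counts are included because a win rate from a handful of games means little.

```python
import pandas as pd

# Toy data standing in for the games DataFrame
toy = pd.DataFrame({
    "Colour": ["White", "White", "Black", "Black", "Black"],
    "Win":    [1, 0, 1, 1, 0],
})

# Number of games and share of wins for each colour
summary = toy.groupby("Colour")["Win"].agg(games="count", win_rate="mean")
print(summary)
```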


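# numeric_only skips text columns such as player names, which newer pandas would otherwise reject
corr = games.corr(numeric_only=True)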
fig, ax = plt.subplots(1, 1, figsize=(14, 8))
sns.heatmap(corr, cmap="Greys", annot=True, fmt='.2f', linewidths=.05, ax=ax).set_title("Chess Results Correlation Heatmap")

Correlation heat chart

There is an immediate negative correlation between Win and both Rating Difference and Moves.
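
For a feel of what a negative correlation between a 0/1 win flag and the rating difference looks like, here is a tiny sketch with invented numbers (not taken from my games). Wins cluster where the rating difference is negative, meaning the opponent is lower rated, so the coefficient comes out below zero.

```python
import numpy as np

# Invented values: opponent rating minus my rating, and whether I won
rating_difference = np.array([-200, -100, -50, 0, 50, 100, 200])
win = np.array([1, 1, 1, 0, 1, 0, 0])

# Pearson correlation coefficient between the two series
r = np.corrcoef(rating_difference, win)[0, 1]
print(round(r, 2))
```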

Moves in a typical game

fig = plt.figure(figsize=(14,8))
ax = fig.add_subplot(1,1,1)
ax.set_title("How many moves in my typical game?")

sns.histplot(games, x="Moves", hue="Colour", palette={"Black": "Black", "White": "Grey"})

Most of my games are around 25 to 30 moves in length.

Moves vs wins

fig = plt.figure(figsize=(14,8))
ax = fig.add_subplot(1,1,1)
ax.set_title("Does the amount of moves affect my win rate?")

sns.histplot(games, x="Moves", hue="W/L", multiple="stack", palette={"Loss": "Black", "Win": "Gray", "Draw": "lightgray"})

My win rate does seem to decrease as the number of moves goes up, and the 40 to 80 range is a problem. The number of draws also increases with the number of moves. I seem to win more in the sub-35 move range. Let's confirm that...

grouped_df = games.groupby(['W/L', pd.cut(games['Moves'], 10)])
grouped_df = grouped_df.size().unstack().transpose()

total_games = grouped_df["Win"] + grouped_df["Loss"] + grouped_df["Draw"]
total_wins = grouped_df["Win"]

grouped_df["Win Rate %"] = round((total_wins / total_games) * 100, 0)
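
Moves | Draw | Loss | Win | Win Rate %
(0.846, 16.4] | 1 | 5 | 12 | 67
(16.4, 31.8] | 0 | 37 | 44 | 54
(31.8, 47.2] | 2 | 19 | 29 | 58
(47.2, 62.6] | 9 | 17 | 14 | 35
(62.6, 78.0] | 0 | 4 | 3 | 43
(78.0, 93.4] | 1 | 0 | 2 | 67
(93.4, 108.8] | 0 | 0 | 0 | NaN
(108.8, 124.2] | 0 | 0 | 0 | NaN
(124.2, 139.6] | 0 | 0 | 0 | NaN
(139.6, 155.0] | 1 | 0 | 0 | 0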

As suspected, I have only a 35% win rate in the 47-63 move bin and a 43% win rate in the 63-78 move bin. Seems like a good idea to practise the endgame more, right?
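
The bins above are ten equal-width slices chosen automatically, which gives awkward edges like 16.4 and 31.8. A variation is to pass your own edges to pd.cut. The sketch below uses invented games and assumed edges of 20 moves each, and takes the mean of the 0/1 Win column as the win rate.

```python
import pandas as pd

# Invented games standing in for the games DataFrame
toy = pd.DataFrame({
    "Moves": [12, 18, 25, 33, 41, 47, 55, 62, 70, 75],
    "Win":   [1, 1, 1, 0, 1, 0, 0, 1, 0, 0],
})

# Assumed bin edges: (0, 20], (20, 40], (40, 60], (60, 80]
edges = [0, 20, 40, 60, 80]
bins = pd.cut(toy["Moves"], bins=edges)

# Mean of a 0/1 column is the share of wins; scale to a percentage
win_rate = (toy.groupby(bins, observed=False)["Win"].mean() * 100).round(0)
print(win_rate)
```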

Opponent's rating vs wins

fig = plt.figure(figsize=(14,8))
ax = fig.add_subplot(1,1,1)
ax.set_title("Does my opponent's rating affect my win rate?")

sns.histplot(games, x="Rating Difference", hue="Win", palette={0: "Black", 1: "Grey"})

Opponent's rating vs wins chart

Clearly a higher loss rate against higher rated opponents (+) which I think is to be expected.
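
How much worse should results be against stronger opponents? One rough yardstick is the expected score from the classic Elo formula. This is an assumption for illustration only: the rating system behind the scraped ratings is not described in this article and may differ from Elo, and expected_score is a hypothetical helper name.

```python
def expected_score(my_rating, opponent_rating):
    """Expected score (win = 1, draw = 0.5, loss = 0) under the Elo model.

    Assumption: Elo-style ratings; the scraped ratings may use another system.
    """
    return 1 / (1 + 10 ** ((opponent_rating - my_rating) / 400))


# Example using ratings from the first game in the table above
print(round(expected_score(1387, 1319), 2))
```

Comparing the actual win rate in each Rating Difference bucket against this expected value would show whether losses to higher rated players are larger than the ratings alone predict.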

Time pressure vs wins

fig = plt.figure(figsize=(14,8))
plt.title("How is time pressure affecting my game?")
sns.countplot(data=games, x='Time', hue="W/L", palette={"Win":"#CCCCCC", "Loss":"Grey", "Draw":"White"}, edgecolor="Black");

Time pressure vs wins chart

I am overwhelmingly better at 30 and 10 minute games, and quicker games fare much worse. There is a lesson to be learnt here: take your time and play long games.

Rating vs wins

fig = plt.figure(figsize=(14,8))
ax = fig.add_subplot(1,1,1)
ax.set_title("How does my rating affect wins?")

sns.histplot(games, x="My Rating", hue="Win", multiple="dodge", palette={0: "Black", 1: "Grey"})

Rating vs wins chart

There is a pattern of high losses, then an increase in rating, then higher wins followed by high losses again, so this must be a development pattern in action. Importantly, I need more experience playing games at the higher level, to match the number of games I have played in the 1000 - 1200 range. The number of games in the 1400 - 1600 range should be as high before I can break into the 1600 - 1800 range.
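
The questions about specific days and seasonality from the start of the article are not charted above. Here is a minimal sketch of how they could be approached, assuming a parsed Date column and a 0/1 Win column like those in the games DataFrame. The rows are toy data, so the numbers mean nothing.

```python
import pandas as pd

# Toy rows standing in for the games DataFrame
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2020-12-19", "2020-12-19", "2020-12-22", "2021-01-02"]),
    "Win":  [1, 1, 1, 0],
})

# Derive the day of the week and the calendar month from each date
toy["Weekday"] = toy["Date"].dt.day_name()
toy["Month"] = toy["Date"].dt.month

# Share of wins per weekday and per month
by_weekday = toy.groupby("Weekday")["Win"].mean()
by_month = toy.groupby("Month")["Win"].mean()
print(by_weekday)
print(by_month)
```

With real data it is worth checking the number of games in each group before reading anything into the rates.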

Final words

I hope you enjoyed this tutorial. Now you have a way to monitor, track and analyse your games archive to identify trends. Some of the actions this analysis has led me to are:

  • Concentrating on improving on the endgame.
  • Increasing my exposure to higher rated games.
  • Strengthening play with the White pieces.
  • Playing more consistently to ensure rating is accurate.

If there are any other analytical questions you'd like to ask of this dataset, let me know in the comments below and I'll update the article.

If you want to export the data to CSV you can use something like this on the games DataFrame:

path = os.path.join(os.path.dirname(os.getcwd()), 'my-chess-games-data.csv')
games.to_csv(path, index=False)
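
Since the pipeline is meant to be re-run, a later scrape will overlap with games already exported. One way to combine them, sketched under the assumption that GameId uniquely identifies a game, is to concatenate both sets and drop duplicate ids. merge_scrapes is a hypothetical helper and the frames below are toy data.

```python
import pandas as pd


def merge_scrapes(previous, latest, key="GameId"):
    """Combine two scrapes, keeping one row per game id (the latest copy)."""
    combined = pd.concat([previous, latest], ignore_index=True)
    return combined.drop_duplicates(subset=key, keep="last").reset_index(drop=True)


# Toy frames: game "b2" appears in both scrapes
previous = pd.DataFrame({"GameId": ["a1", "b2"], "Moves": [68, 52]})
latest = pd.DataFrame({"GameId": ["b2", "c3"], "Moves": [52, 13]})

merged = merge_scrapes(previous, latest)
print(merged)
```

In practice the previous frame would come from reading the exported CSV back in with pd.read_csv.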