How to scrape and analyse your Chess.com data

July 10, 2021 8 minute read


In this article I will scrape data from my Chess.com profile and analyse my historical performance in live matches, using a reproducible Python pipeline. I took up chess again at the end of 2020 after a long hiatus, so I was eager to monitor my performance and see where my weaknesses were. The advantage of this pipeline is that re-running the scripts refreshes the data, so I can always see what I need to improve on and ask new questions about my performance.

Before starting

Before starting you will need a few things. These will also set you up for other Data Science projects in the future, like analysing your Amazon spending data or scraping AutoTrader for multiple makes and models.

Anaconda
Jupyter Notebooks (installed with Anaconda)
Selenium
Google Chrome (latest version)
Chrome Driver (latest version)

This article will not cover installing programs in detail, but here is a starting point. Install Anaconda first. Anaconda is a distribution of the Python and R programming languages for scientific computing (data science, machine learning applications, large-scale data processing, predictive analytics, etc.) that aims to simplify package management and deployment. Once installed, open Anaconda Prompt and install Selenium by running pip install selenium. Selenium is a browser automation tool, commonly used for automated testing. Finally, ensure you have the latest version of Google Chrome installed, along with the ChromeDriver release that matches the version of Chrome you're running. On Windows, ensure chromedriver.exe is in a suitable location such as C:\Windows.
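Before running the scraper, it can help to confirm that these pieces are reachable from Python. The following is a minimal sketch, assuming ChromeDriver is expected to be on the system PATH; the helper names find_chromedriver and selenium_installed are made up for this example.

```python
import importlib.util
import shutil


def find_chromedriver(name="chromedriver"):
    """Return the full path to the ChromeDriver executable, or None if it is not on PATH."""
    # shutil.which searches PATH the same way the shell does
    # (on Windows it also tries extensions such as .exe).
    return shutil.which(name)


def selenium_installed():
    """Return True if the selenium package can be imported in this environment."""
    return importlib.util.find_spec("selenium") is not None


if __name__ == "__main__":
    print("selenium installed:", selenium_installed())
    print("chromedriver path:", find_chromedriver())
```

If the second line prints None, ChromeDriver is not on PATH and Selenium will need to be given its location explicitly.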

What will the web scraper do?

Here are the step-by-step actions the web scraper will perform to scrape your Chess.com games data:

Launches a Chrome browser controlled by Selenium
Navigates to the Chess.com login page and logs in with your given details
After login, navigates to the My Games page
Scrapes all game data
Repeats for each page in the archive until finished

My Games archive - the data to be scraped
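The "repeats for each page until finished" step can be sketched as a loop that stops at the first empty page. This is only an illustration of the idea, not the script used in this article: scrape_all_pages and fetch_page are hypothetical names, and a dictionary of fake rows stands in for the browser.

```python
def scrape_all_pages(fetch_page, max_pages=1000):
    """Collect rows page by page until a page comes back empty.

    fetch_page is any callable that takes a 1-based page number and
    returns a list of rows (an empty list means there are no more pages).
    max_pages is a safety limit so the loop always terminates.
    """
    all_rows = []
    for page_number in range(1, max_pages + 1):
        rows = fetch_page(page_number)
        if not rows:
            break  # past the last page of the archive
        all_rows.extend(rows)
    return all_rows


# Stand-in for the browser: three pages of fake data, then nothing.
fake_archive = {1: ["game a", "game b"], 2: ["game c"], 3: ["game d"]}
games_found = scrape_all_pages(lambda page: fake_archive.get(page, []))
print(games_found)
```

In a real run, fetch_page would load the archive URL for that page number and return the rows it finds there.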

The resulting data will be enough to answer questions such as:

Do I win more matches as black or white?
Do I win shorter or longer games?
Am I losing to higher or lower rated players?
Is time-pressure affecting my wins?
How many of my games reach the endgame?
Do specific days affect my results?
Does seasonality affect my results?
How has my rating developed in 30 min games?

Scraping games data

First, scrape the required data using Selenium. You must provide your Chess.com USERNAME and PASSWORD so the script can log you in, so be sure to amend these variables first.
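If you would rather not keep your password in the script file, one option is to read both values from environment variables. This is a minimal sketch; CHESS_USERNAME and CHESS_PASSWORD are hypothetical variable names chosen for this example, not anything Chess.com or Selenium defines.

```python
import os


def load_credentials(environ=os.environ):
    """Read the Chess.com login from environment variables.

    CHESS_USERNAME and CHESS_PASSWORD are made-up names for this
    example; use whatever names suit your setup.
    """
    username = environ.get("CHESS_USERNAME")
    password = environ.get("CHESS_PASSWORD")
    if not username or not password:
        raise RuntimeError(
            "Set CHESS_USERNAME and CHESS_PASSWORD before running the scraper"
        )
    return username, password


# Usage in the scraper, once both variables are set:
# USERNAME, PASSWORD = load_credentials()
```

The function fails early with a clear message instead of letting the login step fail inside the browser.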

chess-scraper.py

from io import StringIO
import datetime
import hashlib
import os
import time

import numpy as np
import pandas as pd
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

options = webdriver.ChromeOptions()
options.add_argument("--start-maximized")
now = datetime.datetime.now()

USERNAME = "DeadlyKnightX"
PASSWORD = "Your password here"
# Archive URL for rated live games from 08/01/2013 up to today;
# the page number is appended in the loop below
GAMES_URL = (
    "https://www.chess.com/games/archive?gameOwner=other_game&username="
    + USERNAME
    + "&gameType=live&gameResult=&opponent=&opening=&color=&gameTourTeam=&"
    + "timeSort=desc&rated=rated&startDate%5Bdate%5D=08%2F01%2F2013&endDate%5Bdate%5D="
    + str(now.month) + "%2F" + str(now.day) + "%2F" + str(now.year)
    + "&ratingFrom=&ratingTo=&page="
)
LOGIN_URL = "https://www.chess.com/login"

# Selenium 4 takes the driver path via a Service object; on Selenium 3 use
# webdriver.Chrome("chromedriver.exe", options=options) instead
driver = webdriver.Chrome(service=Service("chromedriver.exe"), options=options)

# Log in
driver.get(LOGIN_URL)
driver.find_element(By.ID, "username").send_keys(USERNAME)
driver.find_element(By.ID, "password").send_keys(PASSWORD)
driver.find_element(By.ID, "login").click()
time.sleep(5)

tables = []
game_links = []

# Scrape the first 4 archive pages; raise this number if your archive is longer
for page_number in range(4):
    driver.get(GAMES_URL + str(page_number + 1))
    time.sleep(5)
    # Parse the games table out of the rendered page
    tables.append(
        pd.read_html(
            StringIO(driver.page_source),
            attrs={'class': 'table-component table-hover archive-games-table'}
        )[0]
    )

    # Collect the link to each game from the first anchor in every user cell
    table_user_cells = driver.find_elements(By.CLASS_NAME, 'archive-games-user-cell')
    for cell in table_user_cells:
        link = cell.find_elements(By.TAG_NAME, 'a')[0]
        game_links.append(link.get_attribute('href'))

# Close the browser and stop the ChromeDriver process
driver.quit()

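# ignore_index gives every game a unique row label across pages
games = pd.concat(tables, ignore_index=True)

# Build a per-row identifier; astype(str) converts each value individually,
# whereas str(series) would produce one string for the entire column
identifier = pd.Series(
    games['Players'] + games['Result'].astype(str) + games['Moves'].astype(str) + games['Date']
).apply(lambda x: x.replace(" ", ""))

games.insert(
    0, 
    'GameId', 
    identifier.apply(lambda x: hashlib.sha1(x.encode("utf-8")).hexdigest())
)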

print(games.head(3))

GameId	Unnamed: 0	Players	Result	Accuracy	Moves	Date	Unnamed: 6
7e0c2bc5f27e025	1 hour	DominikHrbaty (1319) DeadlyKnightX (1387)	0 1	84.7 84.4	68	Dec 22,2020	NaN
7f6c05e773ebe23	30 mins	Omarricardo34 (1126) DeadlyKnightX (1359)	0 1	49 57.2	52	Dec 19,2020	NaN
af2b84926911844	30 mins	DeadlyKnightX (1344) albert106 (1138)	1 0	94.4 5.6	13	Dec 19,2020	NaN
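The GameId is a SHA-1 hash of a few fields joined together, so the same game always produces the same identifier. The idea can be tried in isolation on plain Python values; make_game_id is a hypothetical helper and the rows below are made up for this example.

```python
import hashlib


def make_game_id(players, result, moves, date):
    """Build a stable identifier from the fields that describe one game."""
    # Join the fields for this single game, then strip all spaces.
    raw = f"{players}{result}{moves}{date}".replace(" ", "")
    return hashlib.sha1(raw.encode("utf-8")).hexdigest()


# Made-up rows in the same shape as the scraped table.
row_a = ("PlayerOne (1200) PlayerTwo (1250)", "1 0", 34, "Jan 1, 2021")
row_b = ("PlayerOne (1200) PlayerTwo (1250)", "0 1", 34, "Jan 1, 2021")

print(make_game_id(*row_a))
print(make_game_id(*row_b))
```

The two rows differ only in the result, yet they hash to different identifiers, which is what makes the hash usable as a key when re-running the scraper.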

Now that we have a games DataFrame holding the raw data, we can concentrate on transforming it by splitting columns, removing unnecessary columns, and adding calculated columns to derive more insight.

Transform games data

chess-scraper.py

# Create white player, black player, white rating, black rating
new = games.Players.str.split(" ", n=5, expand=True)
new = new.drop([1,4], axis=1)
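# regex=False treats the brackets as literal characters on every pandas version
new[2] = new[2].str.replace('(', '', regex=False).str.replace(')', '', regex=False).astype(int)
new[5] = new[5].str.replace('(', '', regex=False).str.replace(')', '', regex=False).astype(int)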
games['White Player'] = new[0]
games['White Rating'] = new[2]
games['Black Player'] = new[3]
games['Black Rating'] = new[5]

# Add results
result = games.Result.str.split(" ", expand=True)
games['White Result'] = result[0]
games['Black Result'] = result[1]

# Rename the time control column and drop unnecessary columns
games = games.rename(columns={"Unnamed: 0": "Time"})
games = games.drop(['Players', 'Unnamed: 6', 'Result', 'Accuracy'], axis=1)

# Add calculated columns for wins, losses, draws, ratings, year, game links
conditions = [
        (games['White Player'] == USERNAME) & (games['White Result'] == '1'),
        (games['Black Player'] == USERNAME) & (games['Black Result'] == '1'),
        (games['White Player'] == USERNAME) & (games['White Result'] == '0'),
        (games['Black Player'] == USERNAME) & (games['Black Result'] == '0'),
        ]
choices = ["Win", "Win", "Loss", "Loss"]
games['W/L'] = np.select(conditions, choices, default="Draw")

conditions = [
        (games['White Player'] == USERNAME),
        (games['Black Player'] == USERNAME)
        ]
choices = ["White", "Black"]
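# A string default keeps the result dtype consistent (newer NumPy versions
# reject mixing string choices with the numeric default of 0)
games['Colour'] = np.select(conditions, choices, default="Unknown")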

conditions = [
        (games['White Player'] == USERNAME),
        (games['Black Player'] == USERNAME)
        ]
choices = [games['White Rating'], games['Black Rating']]
games['My Rating'] = np.select(conditions, choices)

conditions = [
        (games['White Player'] != USERNAME),
        (games['Black Player'] != USERNAME)
        ]
choices = [games['White Rating'], games['Black Rating']]
games['Opponent Rating'] = np.select(conditions, choices)

games['Rating Difference'] = games['Opponent Rating'] - games['My Rating']

conditions = [
        (games['White Player'] == USERNAME) & (games['White Result'] == '1'),
        (games['Black Player'] == USERNAME) & (games['Black Result'] == '1')
        ]
choices = [1, 1]
games['Win'] = np.select(conditions, choices)

conditions = [
        (games['White Player'] == USERNAME) & (games['White Result'] == '0'),
        (games['Black Player'] == USERNAME) & (games['Black Result'] == '0')
        ]
choices = [1, 1]
games['Loss'] = np.select(conditions, choices)

conditions = [
        (games['White Player'] == USERNAME) & (games['White Result'] == '½'),
        (games['Black Player'] == USERNAME) & (games['Black Result'] == '½')
        ]
choices = [1, 1]
games['Draw'] = np.select(conditions, choices)

games['Year'] = pd.to_datetime(games['Date']).dt.to_period('Y')

games['Link'] = pd.Series(game_links)

# Optional calculated columns for indicating black or white pieces - uncomment if interested in these
# games['Is_White'] = np.where(games['White Player']==USERNAME, 1, 0)
# games['Is_Black'] = np.where(games['Black Player']==USERNAME, 1, 0)

# Correct date format
games["Date"] = pd.to_datetime(
    games["Date"].str.replace(",", "") + " 00:00", format = '%b %d %Y %H:%M'
)

print(games.head(3))

GameId	Time	Moves	Date	White Player	White Rating	Black Player	Black Rating	White Result	Black Result	W/L	Colour	My Rating	Opponent Rating	Rating Difference	Win	Year	Link
7e0c2bc5f27e025b741fa464cf45a40054e0e637	1 hour	68	22/12/2020	DominikHrbaty	1319	DeadlyKnightX	1387	0	1	Win	Black	1387	1319	-68	1	2020	https://www.chess.com/game/live/6032087036
17f6c05e773ebe23c52164b09fec2ea9de2a9dc6	30 min	52	19/12/2020	Omarricardo34	1126	DeadlyKnightX	1359	0	1	Win	Black	1359	1126	-233	1	2020	https://www.chess.com/game/live/6009160294
af2b84926911833c2e644d6400f39437f8fe0341	30 min	13	19/12/2020	DeadlyKnightX	1344	albert106	1138	1	0	Win	White	1344	1138	-206	1	2020	https://www.chess.com/game/live/6009042670

Great! The data has been transformed, extended and is now ready for analysis.
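The positional split on spaces used above depends on the exact spacing in the Players cell. A regular expression is one alternative that tolerates extra whitespace. This is a sketch only, assuming each cell has the shape "name (rating) name (rating)" and that usernames contain no spaces; parse_players is a hypothetical helper.

```python
import re

# Assumed shape of a Players cell: "name (rating) name (rating)",
# with any amount of whitespace between the pieces.
PLAYERS_PATTERN = re.compile(r"^\s*(\S+)\s*\((\d+)\)\s*(\S+)\s*\((\d+)\)\s*$")


def parse_players(cell):
    """Split one Players cell into (white, white_rating, black, black_rating)."""
    match = PLAYERS_PATTERN.match(cell)
    if match is None:
        raise ValueError(f"Unexpected Players format: {cell!r}")
    white, white_rating, black, black_rating = match.groups()
    return white, int(white_rating), black, int(black_rating)


print(parse_players("DominikHrbaty (1319) DeadlyKnightX (1387)"))
```

It could be applied to the whole column with games['Players'].apply(parse_players), and it raises a clear error if a cell does not match instead of silently misaligning columns.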

Analysing games data

With a solid dataset prepared, you can now apply any analysis you like to it. These are the visualisations I produced based on what I was interested in. First let's import the key visualisation libraries, matplotlib and seaborn.

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(rc={'figure.facecolor':'white'})

Overall rating

fig, ax = plt.subplots(figsize=(15,6))
plt.title("Chess.com Rating Development")
sns.lineplot(x="Date", y="My Rating", data=games.iloc[::-1], color="black")
plt.xticks(rotation=0)
plt.show()

I can quite clearly see here that I didn't play for a while, until the end of 2020 when I picked Chess back up. This was met by a few losses and a rating dip - I was certainly out of practice.

Wins, losses and draws

fig, ax = plt.subplots(figsize=(15,6))
plt.title("Wins, Losses and Draws")
sns.countplot(data=games, x='W/L', palette="Greys", edgecolor="black")

Wins, losses and draws chart

The good news from this data is that I win more than I lose... but there is plenty of room for improvement!

Wins with white vs black pieces

fig, ax = plt.subplots(figsize=(15,6))
plt.title("Wins, Losses and Draws by Colour")
sns.countplot(data=games, x='W/L', hue="Colour", palette={"Black": "Grey", "White": "White"}, edgecolor="black");

Wins, losses and draws by piece colour

This clearly shows that I am stronger playing as black.

Win rate with white vs black pieces

fig, ax = plt.subplots(figsize=(15,6))
ax.set_title("Win Rate by Colour")
sns.barplot(data=games, x='Colour', y='Win', palette={"Black": "Grey", "White": "White"}, edgecolor="black", ax=ax);

Win rate by piece colour

A higher win rate as black.
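The bar chart shows the mean of the Win column per colour, and the same numbers can be read off directly with a groupby. The sketch below uses a small made-up sample in place of the real games DataFrame.

```python
import pandas as pd

# Made-up sample in the same shape as the games DataFrame.
sample = pd.DataFrame({
    "Colour": ["White", "White", "White", "White", "Black", "Black"],
    "Win": [1, 0, 0, 1, 1, 1],
})

# Win is 1 for a win and 0 otherwise, so its mean is the win rate.
summary = sample.groupby("Colour")["Win"].agg(
    games="count", wins="sum", win_rate="mean"
)
print(summary)
```

Keeping the game count next to the win rate makes it easier to judge whether a difference between colours rests on enough games to mean anything.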

Correlation

# Correlate numeric columns only; text columns cannot be correlated
corr = games.select_dtypes("number").corr()
fig, ax = plt.subplots(1, 1, figsize=(14, 8))
sns.heatmap(corr, cmap="Greys", annot=True, fmt='.2f', linewidths=.05, ax=ax).set_title("Chess Results Correlation Heatmap")
fig.subplots_adjust(top=0.93)

Correlation heat chart

The heatmap immediately shows that Win is negatively correlated with both Rating Difference and Moves.

Moves in a typical game

fig = plt.figure(figsize=(14,8))
ax = fig.add_subplot(1,1,1)
ax.set_title("How many moves in my typical game?")

sns.histplot(games, x="Moves", hue="Colour", palette={"Black": "Black", "White": "Grey"})
plt.close(2)

Moves in a typical game

Most of my games are around 25 to 30 moves in length.

Moves vs wins

fig = plt.figure(figsize=(14,8))
ax = fig.add_subplot(1,1,1)
ax.set_title("Does the amount of moves affect my win rate?")

sns.histplot(games, x="Moves", hue="W/L", multiple="stack", palette={"Loss": "Black", "Win": "Gray", "Draw": "lightgray"})
plt.close(2)

Moves vs wins chart

My win rate does seem to decrease as the number of moves goes up - around the 40 to 80 range is a problem. The number of draws also increases with the number of moves. I seem to win more in the sub-35 move range. Let's confirm that...

grouped_df = games.groupby(['W/L', pd.cut(games['Moves'], 10)])
grouped_df = grouped_df.size().unstack().transpose()

total_games = grouped_df["Win"] + grouped_df["Loss"] + grouped_df["Draw"]
total_wins = grouped_df["Win"]

grouped_df["Win Rate %"] = round((total_wins / total_games) * 100, 0)
grouped_df

W/L	Draw	Loss	Win	Win Rate %
Moves
(0.846, 16.4]	1	5	12	67
(16.4, 31.8]	0	37	44	54
(31.8, 47.2]	2	19	29	58
(47.2, 62.6]	9	17	14	35
(62.6, 78.0]	0	4	3	43
(78.0, 93.4]	1	0	2	67
(93.4, 108.8]	0	0	0	NaN
(108.8, 124.2]	0	0	0	NaN
(124.2, 139.6]	0	0	0	NaN
(139.6, 155.0]	1	0	0	0

As thought, only a 35% win rate in the 47-63 moves bin, and a 43% win rate in the 63-78 moves bin. Seems like a good idea to practice the endgame more, right?
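Equal-width bins produce awkward edges such as 16.4 and 31.8. If rounder ranges are easier to reason about, pd.cut also accepts explicit bin edges. This is a sketch on made-up data; the edges 0, 20, 40, 60, 80 are an assumption for the example, not a recommendation.

```python
import pandas as pd

# Made-up sample in the same shape as the games DataFrame.
sample = pd.DataFrame({
    "Moves": [12, 25, 28, 33, 45, 52, 58, 70],
    "W/L": ["Win", "Win", "Loss", "Win", "Loss", "Draw", "Loss", "Win"],
})

# Hand-picked bin edges: (0, 20], (20, 40], (40, 60], (60, 80]
bins = [0, 20, 40, 60, 80]
sample["Move Bin"] = pd.cut(sample["Moves"], bins=bins)

counts = pd.crosstab(sample["Move Bin"], sample["W/L"])
# reindex guarantees all three outcome columns exist, even if one never occurs
counts = counts.reindex(columns=["Win", "Loss", "Draw"], fill_value=0)
counts["Win Rate %"] = (counts["Win"] / counts.sum(axis=1) * 100).round(0)
print(counts)
```

Bins with very few games will still swing wildly, so it is worth reading the win rate alongside the counts.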

Opponent's rating vs wins

fig = plt.figure(figsize=(14,8))
ax = fig.add_subplot(1,1,1)
ax.set_title("Does my opponent's rating affect my win rate?")

sns.histplot(games, x="Rating Difference", hue="Win", palette={0: "Black", 1: "Grey"})
plt.close(2)

Clearly a higher loss rate against higher rated opponents (a positive rating difference), which I think is to be expected.
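To put a number on what the histogram suggests, the rating difference can be reduced to three groups and the win rate compared across them. This is a sketch on made-up data; the group labels are chosen for this example.

```python
import numpy as np
import pandas as pd

# Made-up sample in the same shape as the games DataFrame.
sample = pd.DataFrame({
    "Rating Difference": [-120, -40, -5, 0, 15, 60, 150],
    "Win": [1, 1, 0, 1, 1, 0, 0],
})

# np.sign gives -1 for lower rated opponents, 0 for equal, +1 for higher rated
labels = {-1: "Lower rated", 0: "Equal", 1: "Higher rated"}
sample["Opponent"] = np.sign(sample["Rating Difference"]).map(labels)

win_rate = sample.groupby("Opponent")["Win"].mean()
print(win_rate)
```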

Time pressure vs wins

fig = plt.figure(figsize=(14,8))
plt.title("How is time pressure affecting my game?")
sns.countplot(data=games, x='Time', hue="W/L", palette={"Win":"#CCCCCC", "Loss":"Grey", "Draw":"White"}, edgecolor="Black");

Time pressure vs wins chart

I am overwhelmingly better at 30 and 10 minute games, while quicker games fare much worse - a lesson to be learnt here: take your time and play long games.

Rating vs wins

fig = plt.figure(figsize=(14,8))
ax = fig.add_subplot(1,1,1)
ax.set_title("How does my rating affect wins?")

sns.histplot(games, x="My Rating", hue="Win", multiple="dodge", palette={0: "Black", 1: "Grey"})
plt.close(2)

There is a pattern of heavy losses, then a rise in rating with more wins, then heavy losses again - this looks like a development pattern in action. Importantly, I need more experience playing at the higher levels to match the number of games I have played in the 1000 - 1200 range. The game count in the 1400 - 1600 range should be just as high before I can expect to break into the 1600 - 1800 range.

Final words

I hope you enjoyed this tutorial. Now you have a way to monitor, track and analyse your Chess.com games archive to identify trends. Some of the actions this analysis has led me to are:

Concentrating on improving on the endgame.
Increasing my exposure to higher rated games.
Strengthening play with the White pieces.
Playing more consistently to ensure rating is accurate.

If there are any other analytical questions you'd like to ask of this dataset, let me know in the comments below and I'll update the article.
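Two questions from the list near the top, whether specific days or the time of year affect results, are not charted here, but both can be approached by grouping on parts of the Date column. This is a sketch on made-up data, assuming Date has already been converted to a datetime column as in the transform step.

```python
import pandas as pd

# Made-up sample in the same shape as the games DataFrame.
sample = pd.DataFrame({
    "Date": pd.to_datetime(
        ["2020-12-19", "2020-12-19", "2020-12-22", "2020-12-26", "2021-01-02"]
    ),
    "Win": [1, 1, 1, 0, 0],
})

# Derive the day of the week and the calendar month from each date
sample["Day"] = sample["Date"].dt.day_name()
sample["Month"] = sample["Date"].dt.month

by_day = sample.groupby("Day")["Win"].agg(games="count", win_rate="mean")
by_month = sample.groupby("Month")["Win"].agg(games="count", win_rate="mean")
print(by_day)
print(by_month)
```

With only a year or two of games, the monthly figures will say more about how much you played than about any real seasonal effect, so treat them with caution.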

If you want to export the data to CSV you can use something like this on the games DataFrame:

path = os.path.join(os.path.dirname(os.getcwd()), 'my-chess-games-data.csv')
games.to_csv(path, index=False)
