<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/">
    <channel>
        <title>Shedload Of Code</title>
        <link>https://shedloadofcode.com/feed.xml</link>
        <description>Shedload Of Code is a blog on all things programming, coding, building software and doing data science.</description>
        <lastBuildDate>Tue, 07 Jan 2025 16:36:51 GMT</lastBuildDate>
        <docs>https://validator.w3.org/feed/docs/rss2.html</docs>
        <generator>https://github.com/nuxt-community/feed-module</generator>
        <item>
            <title><![CDATA[Top five habits and tools to keep your data science technical skills sharp]]></title>
            <link>https://shedloadofcode.com/blog/top-five-habits-and-tools-to-keep-your-data-science-technical-skills-sharp/</link>
            <guid>https://shedloadofcode.com/blog/top-five-habits-and-tools-to-keep-your-data-science-technical-skills-sharp/</guid>
            <pubDate>Thu, 02 Jan 2025 20:30:00 GMT</pubDate>
            <description><![CDATA[Learn how to keep your data science technical skills sharp with these five habits and tools.]]></description>
            <content:encoded><![CDATA[
<affiliate-disclaimer></affiliate-disclaimer>

Recently I have been reviewing my process for continuous learning and practice to keep my data science skills sharp. To stay ahead, it's essential to adopt habits and leverage tools that foster growth in both technical and analytical skills. It's also crucial to take the time to revisit core topics and concepts you might not have used in a while.  

For each habit I've provided a companion platform and explained why these platforms like DataCamp, StrataScratch, Kaggle, O’Reilly, and DigitalOcean can help you both revisit the basics and sharpen up those skills. 

Keeping the basics sharp also helps with interview preparation - if that's something you're interested in check out [Preparing for a statistical data science interview](/blog/preparing-for-a-statistical-data-science-interview) and [Exploring coding interview topics in Python](/blog/exploring-coding-interview-topics-in-python). I think staying technically sharp makes you more confident, job-ready, interview-ready and all round a better Data Scientist.

All of the habits and tools mentioned in this article have proven immensely useful to me. Let's dive in!

## Learn with DataCamp

**Why DataCamp?**

DataCamp offers interactive courses tailored for data scientists at all levels, covering topics from data cleaning to advanced machine learning. Its bite-sized lessons and hands-on coding exercises ensure you learn concepts effectively and can apply them immediately. It also has a great mobile app which I've been using to practice in the evening during downtime. As an intermediate Data Scientist I've been using my DataCamp subscription to revisit basic topics like [Supervised Learning with scikit-learn](https://datacamp.pxf.io/q4ozOY) and advanced topics like [Retrieval Augmented Generation (RAG) with LangChain](https://datacamp.pxf.io/dOyE1Q) and [MLOps](https://https://datacamp.pxf.io/MAKW6P). The pricing is clear, there are fairly regular sales and discounts so watch out for those if interested, and there is a discount for a [yearly subscription](https://datacamp.pxf.io/LKDDx0).

Other courses I have taken recently include: 

* [Machine Learning with PySpark](https://datacamp.pxf.io/QjKKN6)
* [Introduction to Deep Learning with PyTorch](https://datacamp.pxf.io/POaarR)
* [Retrieval Augmented Generation (RAG) with LangChain](https://datacamp.pxf.io/dOyE1Q)

<article-image 
  src="https://res.cloudinary.com/dayqxxsip/image/upload/v1735838132/App%20Images/Blog%20Images/Article%20Images/Five%20Habits%20DS%20Skills/datacamp_qqfxbz.png" 
  alt="RetroPie folders" 
  loading="lazy" 
  styling=""
  caption="Interactive exercise on the Machine Learning with PySpark course" 
  captionsrc="https://res.cloudinary.com/dayqxxsip/image/upload/v1735838132/App%20Images/Blog%20Images/Article%20Images/Five%20Habits%20DS%20Skills/datacamp_qqfxbz.png" 
  :showsource="false">
</article-image>

**Key Benefits:**

* Wide range of courses covering Python, R, SQL, and more
* Skill tracks and certifications for guided learning paths
* Built-in coding environments to practice directly in the browser

**Habit:** 

Dedicate 15–30 minutes daily to progress through a course or learn a new skill. This incremental approach builds consistency over time.

**Full review:**

For a more in-depth review on DataCamp and why it's great for continuous learning check out [Developing your data science and analytical coding skills - a review of DataCamp](/blog/developing-your-data-science-and-analytical-coding-skills-a-review-of-datacamp/).

## Code with StrataScratch

**Why StrataScratch?**

[StrataScratch](https://www.stratascratch.com/) is designed to help data scientists improve their coding and problem-solving skills with real-world SQL and Python challenges. The platform features interview-style questions sourced from companies like Google, Airbnb, and Facebook. 

I really love this platform and took the option to purchase a [Lifetime membership](https://platform.stratascratch.com/pricing) - great that they offer this in the age of subscriptions! Sometimes they have sales like 30% discounts so if you're interested watch out for those. Couldn't recommend it highly enough. Their frontpage states they have 1,000+ interview questions, 200+ companies tracked, the first 50 questions are free and new interview questions are released every month. What's not to love for staying sharp?

<article-image 
  src="https://res.cloudinary.com/dayqxxsip/image/upload/v1735838132/App%20Images/Blog%20Images/Article%20Images/Five%20Habits%20DS%20Skills/stratascratch_xbxpwr.png" 
  alt="RetroPie folders" 
  loading="lazy" 
  styling=""
  caption="Easy SQL problem in StrataScratch" 
  captionsrc="https://www.stratascratch.com/" 
  :showsource="false">
</article-image>

**Key Benefits:**

* [Coding questions](https://platform.stratascratch.com/coding?code_type=5&page_size=50) - Analytical, Algorithm and Visualisation questions with optional hints and a guided Solution section if you're struggling, Solutions from Users, Discussion area and Resources for learning
* [Non-coding questions](https://platform.stratascratch.com/technical?page_size=50) - Business Case, Modelling, Probability, Product, Statistics, System Design, Technical
* [Data projects](https://platform.stratascratch.com/data-projects?page_size=50) - Business Analysis, Classification, EDA, NLP, Regression, Clustering, Data Engineering
* [Guides](https://www.stratascratch.com/guides/) - SQL and Python data manipulation plus time and date manipulation
* Covers many language options - PostgreSQL, MySQL, MS SQL Server, Oracle, Python-Pandas, Python-Polars, pySpark and also R
* Database-focused questions that mimic real-world scenarios
* Tutorials and solutions for a better learning experience
* Excellent preparation for technical interviews

**Habit:**

Solve 2–3 coding challenges daily. Focus on SQL queries and data manipulation techniques, as these are core to many data science roles. Expand into Python data manipulation alongside data structures and algorithms questions.

## Practice with Kaggle

**Why Kaggle?**

[Kaggle](https://www.kaggle.com/) is a cornerstone platform for data science practice. It offers everything from datasets and competitions to an active community where you can collaborate and learn from others.

<article-image 
  src="https://res.cloudinary.com/dayqxxsip/image/upload/v1735838132/App%20Images/Blog%20Images/Article%20Images/Five%20Habits%20DS%20Skills/kaggle_qqv492.png" 
  alt="RetroPie folders" 
  loading="lazy" 
  styling=""
  caption="Kaggle Learn has courses from beginner to intermediate" 
  captionsrc="https://www.kaggle.com/learn" 
  :showsource="false">
</article-image>

**Key Benefits:**

* Participate in competitions to solve real-world problems
* Explore extensive datasets for hands-on practice
* Learn from notebooks and solutions shared by other users

**Habit:**

Start with smaller competitions or use Kaggle’s practice problems to familiarise yourself with its tools. Build a project portfolio by publishing your own notebooks.

## Read with O’Reilly

**Why O’Reilly?**

O’Reilly’s vast library of books, videos, and live training sessions is an indispensable resource for any data scientist. Covering cutting-edge technologies, methodologies, and trends, it’s your go-to for staying informed. At the time of writing O’Reilly look to [offer a free-trial](https://www.oreilly.com/start-trial/) to try out the service.

<article-image 
  src="https://res.cloudinary.com/dayqxxsip/image/upload/v1735838132/App%20Images/Blog%20Images/Article%20Images/Five%20Habits%20DS%20Skills/oreilly_yamjuh.png" 
  alt="RetroPie folders" 
  loading="lazy" 
  styling=""
  caption="O’Reilly's features " 
  captionsrc="https://res.cloudinary.com/dayqxxsip/image/upload/v1735838132/App%20Images/Blog%20Images/Article%20Images/Five%20Habits%20DS%20Skills/oreilly_yamjuh.png" 
  :showsource="false">
</article-image>

**Key Benefits:**

* Access to thousands of books and videos on technical topics, some of my favourites include:
    * [Practical Statistics for Data Scientists, 2nd Edition](https://www.oreilly.com/library/view/practical-statistics-for/9781492072935/)
    * [Python for Data Analysis, 3rd Edition](https://www.oreilly.com/library/view/python-for-data/9781098104023/)
    * [Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow, 2nd Edition](https://www.oreilly.com/library/view/hands-on-machine-learning/9781492032632/)
* Regular live events and webinars led by industry experts
* Coverage of tools like TensorFlow, PyTorch, and more

**Habit:**

Allocate time weekly to read a chapter or watch a tutorial on a new concept.

## Deploy with DigitalOcean

**Why DigitalOcean?**

Deployment is a critical skill for data scientists. [DigitalOcean](https://digitalocean.pxf.io/zxjjeG) provides a straightforward and affordable way to host your models, dashboards, or applications in a production environment. [Simple, predictible pricing](https://digitalocean.pxf.io/09YYxN) with monthly caps and flat pricing alongside [good documentation](https://digitalocean.pxf.io/2aKKQA) make DigitalOcean one for you to strongly consider when it comes to deploying solutions to the cloud. At the time of writing, there is an offer to [try DigitalOcean free with a $200 credit](https://digitalocean.pxf.io/c/4971160/1373759/15890). A great way to begin exploring cloud infrastructure!

<article-image 
  src="https://res.cloudinary.com/dayqxxsip/image/upload/v1735838132/App%20Images/Blog%20Images/Article%20Images/Five%20Habits%20DS%20Skills/digitalocean_uobwcg.png" 
  alt="RetroPie folders" 
  loading="lazy" 
  styling=""
  caption="The range of cloud products DigitalOcean offer" 
  captionsrc="https://digitalocean.pxf.io/o4nn5n" 
  :showsource="false">
</article-image>

**Key Benefits:**

* Simple interface for setting up virtual machines and cloud resources
* Pre-configured templates for deploying apps, including ML models
* Scalable solutions for professional-grade projects

**Habit:**

Use DigitalOcean to host a personal project, such as a dashboard or API, to understand the end-to-end workflow of deploying data science solutions. I will be releasing an article covering how to quickly deploy projects to DigitalOcean in the near future.


## Enjoy the learning journey

By leveraging these platforms, DataCamp for learning, StrataScratch for coding, Kaggle for practice, O’Reilly for reading, and DigitalOcean for deploying, you can keep your data science skills sharp and versatile. Consistency is key and integrating these tools into your routine, and you’ll be well-equipped to tackle challenges and keep skills sharp in the field. 

One final bonus resource I will mention is [Ace the Data Science Interview: 201 Real Interview Questions Asked By FAANG, Tech Startups, & Wall Street](https://www.amazon.co.uk/Ace-Data-Science-Interview-Questions/dp/0578973839) which I've found to be an effective complete refresher across data science as a whole albeit focused on technical interviewing. On that note if you were preparing for a technical interview where the annoying data structures and algorithms (DSA) style questions pop up, probably should mention [Leetcode](https://leetcode.com/) to prepare for that - I don't find these DSA questions as practical as the main StrataScratch questions though!

I hope you enjoyed the article and as always be sure to check out other articles on the site. You may be interested in:

* [Creating statistical neighbours comparator benchmarking models with Python](/blog/creating-statistical-neighbours-comparator-benchmarking-models-with-python/)
* [Six tips for producing and assuring high quality analytical code](/blog/six-tips-for-producing-and-assuring-high-quality-analytical-code/)
* [Preparing for a statistical data science interview](/blog/preparing-for-a-statistical-data-science-interview/)
]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[How to learn Python for data analysis]]></title>
            <link>https://shedloadofcode.com/blog/how-to-learn-python-for-data-analysis/</link>
            <guid>https://shedloadofcode.com/blog/how-to-learn-python-for-data-analysis/</guid>
            <pubDate>Mon, 02 Dec 2024 13:45:00 GMT</pubDate>
            <description><![CDATA[Get started in learning Python for data analysis and data science. Explore the packages and concepts that will accelerate your journey to proficiency.]]></description>
            <content:encoded><![CDATA[
This article will provide you with a clear, no-nonsense, step-by-step guide to learning Python for data analysis for beginners and intermediates. The goal of this article is to show you the areas you need to know and others you can dive deeper into, from the basics to the advanced stuff. Knowing the lay of the land is half the battle and this guide will show you what you need to know vs what you can delay until needed. 

We'll explore the key practical skills and knowledge like Python's benefits for data analysis, installing Python for data analysis and it's basics. We'll discover the best Python packages for data science, data visualisation, and data manipulation workflows, along with advanced practical skills like API and SQL integration, web scraping, machine learning, natural language processing, large language models, and cloud computing. We'll start from basics to advanced, so you can progress through at your own pace.

Writing this article took me back in time to the beginning of my learning journey. I've been working in a Data Scientist role in a large organisation since 2019 and before that had other roles in analysis and digital. Although my university degree contained statistics, the course that really taught me hands-on practical skills was the [Microsoft Professional Certificate in Data Science](https://devblogs.microsoft.com/premier-developer/microsoft-professional-program-for-data-science-sharpen-your-data-science-skills/) (no longer available) and the [Data Analysis courses from DataCamp](https://www.datacamp.com/tracks/data-analyst-with-python). Let's begin.

## Why Python for data analysis?

In the analytical commmunity, Python and R seem to dominate as the go-to languages. Python has great packages for data science and ML and R does too with the [Tidyverse](https://www.tidyverse.org/). Both help to create [reproducible analytical pipelines](https://raps-with-python.dev/) or RAP, which provides a robust framework for recreating findings way better than any Excel spreadsheet ever could. 

Python is a beginner-friendly, versatile, and open-source programming language widely used for data analysis, with robust libraries and scalability for projects of any size. Its strong community support, industry-wide adoption, seamless integration with tools like SQL and Excel, and applicability to fields like machine learning and AI make it an essential tool for working with data.

## Installing Python or Anaconda
      
For those getting started with Python for data analysis, installation is the first step to run any Python code on your machine. There are two ways to install Python either the base version or Anaconda, a platform designed specifically for data science and machine learning.

**Option 1: Installing Python**

1. Download Python: Visit the [official Python website](https://www.python.org/downloads/) and download the latest version for your operating system.
2. Install Python:
    * Run the installer and follow the on-screen instructions.
    * Ensure you tick the box to add Python to your system’s PATH during installation.
3. Verify Installation:
    * Open your terminal or command prompt and type `python --version` to confirm Python is installed.

**Option 2: Using Anaconda**

Anaconda is a free, open-source distribution of Python tailored for data science. It comes preloaded with popular libraries like NumPy, Pandas, Matplotlib, and Jupyter Notebook.

1. Download Anaconda:
    * Visit the [Anaconda website](https://www.anaconda.com/download) and download the latest version for your operating system.
2. Install Anaconda:
    * Run the installer and follow the instructions for your system.
    * On Windows, ensure you choose the option to add Anaconda to your PATH (if prompted).
3. Verify installation:
    * Open a terminal and type `conda --version` to confirm the installation.

**Which to choose?**

Choose base Python if you prefer a lightweight setup and plan to install libraries manually or work on projects outside of data analysis. You can also use a virtual environment with base Python to install specific packages per project. We discuss this in the libraries section later.

Choose Anaconda if you’re focused on data science, as it provides a ready-to-use environment with most libraries pre-installed and easy access to Jupyter Notebook.

If using base Python you can install the essential libraries using pip in a new command prompt or terminal:

```bash
pip install numpy pandas matplotlib seaborn scikit-learn nltk spacy openai requests bs4
```

## Choosing an IDE for Python

An Integrated Development Environment (IDE) is a tool for writing, testing, and debugging your Python code efficiently. I really like VS Code as a simple code editor with powerful add-ons or extensions but your main options include:

* [VS Code](https://code.visualstudio.com/) - lightweight IDE with extensions, suited for both simple and complex Python workflows
* [Jupyter Notebook](https://jupyter.org/) - interactive, cell-based environment ideal for exploratory data analysis and visualisation
* [Spyder](https://www.spyder-ide.org/) - data science-focused IDE similar to [RStudio](https://posit.co/download/rstudio-desktop/) with features like variable exploration and inline plotting, perfect for Anaconda users
* [PyCharm](https://www.jetbrains.com/pycharm/) - A professional-grade IDE with advanced debugging and refactoring tools, designed for large-scale Python projects

**An alternative is to use cloud-based IDEs**

Although I believe installing Python and an IDE locally is preferable for flexibility, you can also use a cloud-based IDE like [Kaggle Notebooks](https://www.kaggle.com/code) or [Google Colab](https://colab.research.google.com/). These are convienient, beginner-friendly and remove the need for local installations. These platforms mainly provide hosted Jupyter Notebook services, pre-installed libraries, and support real-time collaboration, making them good for group projects or learning on the go.

## Python core knowledge

To use Python effectively for data analysis, you should be comfortable with a few basics. Start by learning how to import packages, which allow you to leverage pre-built functionalities (e.g., import numpy as np for numerical computations). Variables are used to store data, which can be of various types like integers, strings, and floats. Loops (for, while) enable repetitive tasks, and functions let you organise reusable code. These fundamentals provide the foundation for writing efficient and structured Python scripts.

```python
# Import packages
import numpy as np
import pandas as pd
import math 

# Variables and Data Types
name = "Data Analysis"  # String
number = 42             # Integer
pi = 3.14159            # Float

# Loop
for i in range(1, 4):
    print(f"Iteration {i}: Learning Python Basics")

# Function
def square(num):
    """Returns the square of a number."""
    return num ** 2

result = square(5)  # Call the function
print(f"The square of 5 is {result}")
```

Good resources for this stage are reading references like [The Python Standard Library](https://docs.python.org/3/library/index.html) of Python's built-in functions, alongside [W3Schools Python tutorials](https://www.w3schools.com/python/).

It's also very important to understand Python [data structures](https://docs.python.org/3/tutorial/datastructures.html) like lists, sets, dictionaries, and tuples for beginners in data science. You will rely on these to parse data, hold it in memory and pass it into the libraries we'll look at later.

```python
# Example: Using lists and dictionaries to store and Analyse student grades
students = [
    {"name": "Alice", "grades": [85, 90, 88]},
    {"name": "Bob", "grades": [72, 78, 80]},
    {"name": "Charlie", "grades": [90, 92, 94]},
]

# Calculate and display average grades for each student
for student in students:
    avg_grade = sum(student["grades"]) / len(student["grades"])
    print(f"{student['name']} - Average Grade: {avg_grade:.2f}")
```

## Python data analysis essential skills

Before we dive into the specific libraries in the next section, let's look at the essential skills needed to perform data analysis with Python:

* Basic Python programming - understand variables, data types, loops, and functions, see previous section
* Version control - understand [why it's important](https://www.freecodecamp.org/news/introduction-to-git-and-github/) to use version control with [Git](https://git-scm.com/book/ms/v2/Getting-Started-About-Version-Control)
* Virtual environments - understand [why it's important](https://www.reddit.com/r/learnpython/comments/15nuehj/why_do_i_need_a_virtual_environment/) to use either [venv](https://docs.python.org/3/library/venv.html) or [pipenv](https://pipenv.pypa.io/en/latest/) to install packages per project
* Data cleaning - handling missing values and correct inconsistencies in datasets
* Data manipulation - use tools to transform, filter, and aggregate data
* Exploratory Data Analysis (EDA) - summarise datasets and identify patterns
* Data visualisation - create graphs and charts to present data insights
* File handling - read and write data in formats like CSV, Excel, and JSON
* Time-series analysis - Analyse and summarise date and time-based data
* Statistical analysis - Perform calculations like mean, median, correlation
* Automation - Write scripts to automate repetitive analysis tasks
* Library proficiency - Familiarity with essential Python libraries such as Pandas and NumPy

Good resources for this stage are [Python for Data Analysis](https://wesmckinney.com/book/) or [Python Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook/) alongside the [pandas](https://pandas.pydata.org/docs/) and [numpy](https://numpy.org/doc/stable/) documentation. The Pandas [getting started](https://pandas.pydata.org/docs/getting_started/index.html) is very useful, so is the [W3Schools Pandas tutorial](https://www.w3schools.com/python/pandas/default.asp).

To learn more about essential skills for a Data Analyst or Data Scientist I found these fairly clear descriptions at [National Careers Service](https://nationalcareers.service.gov.uk/job-profiles/data-scientist), [Analysis Function](https://analysisfunction.civilservice.gov.uk/careers/role-profiles-and-career-pathways/role-profile-data-scientist/) and [DDaT Capability Framework](https://ddat-capability-framework.service.gov.uk/role/data-scientist) alongside my article [Preparing for a statistical data science interview](/blog/preparing-for-a-statistical-data-science-interview).

## Python libraries for data analysis

In this section we will explore the top libraries for data analysis in Python. These are vital to learn more about and in each of these I have a quick example and links to the documentation for further study. If you learn these libraries inside out, you'll be a Python data analysis pro.

**NumPy for numerical computation**

Create and manipulate arrays, perform matrix operations, and optimise performance with [NumPy](https://numpy.org/doc/stable/).

```python
import numpy as np
arr = np.array([1, 2, 3])
print(arr.mean())
```

**Pandas for data manipulation**

Read and process structured data (CSV, Excel, SQL) with [Pandas](https://pandas.pydata.org/docs/index.html). Think of this as the 'Excel equivalent' for data manipulation. Everything you could do in Excel, you can do in Pandas. Pandas is one of the most popular Python libraries for data science and is essential for tasks like data manipulation and cleaning. This will become your number one go-to library.

```python
import pandas as pd
data = pd.read_csv('data.csv')
print(data.describe())
```

**Matplotlib and Seaborn for data visualisation**

Create high-quality visualisations with [Matplotlib](https://matplotlib.org/stable/index.html) and [Seaborn](https://seaborn.pydata.org/).

```python
import matplotlib.pyplot as plt
import seaborn as sns
sns.histplot(data['column_name'])
plt.show()
```

To see great examples check out the [Seaborn gallery](https://seaborn.pydata.org/examples/index.html).

**Scikit-learn for machine learning**

Perform predictive modelling, build classification, regression, and clustering models with [scikit-learn](https://scikit-learn.org/stable/).

```python
from sklearn.linear_model import LinearRegression
model = LinearRegression().fit(X_train, y_train)
```

**NLTK or spaCy for natural language processing**

Perform text preprocessing and named entity recognition with [NLTK](https://www.nltk.org/) or [spaCy](https://spacy.io/usage/spacy-101):

```python
import nltk
from nltk.tokenize import word_tokenize
text = "Learn Python for data analysis!"
print(word_tokenize(text))
```

**BeautifulSoup for web scraping**

Scrape and parse web data using [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/). It would be useful to know the structure of web pages by [learning some basic HTML](https://www.w3schools.com/html/) for this, especially the main HTML tags alongside id, class and xpath selectors to target HTML elements.

```python
from bs4 import BeautifulSoup
import requests

response = requests.get("https://example.com")
soup = BeautifulSoup(response.content, "html.parser")
print(soup.title.text)
```

**OpenAI for large language models (LLMs)**

Use LLMs like [GPT-4o](https://openai.com/index/hello-gpt-4o/) for text and conversation generation but also analysing and summarising data with the [OpenAI](https://pypi.org/project/openai/) package.

```python
import openai
import pandas as pd

openai.api_key = "your-api-key"

df = pd.DataFrame({
    "Name": ["Alice", "Bob"], 
    "Age": [25, 30], 
    "Salary": [50000, 60000]
})

response = openai.Completion.create(
    engine="text-davinci-003",
    prompt=f"Analyse this data:\n{df.to_string(index=False)}",
    max_tokens=100
)

print(response.choices[0].text.strip())
```

**venv for virtual environments**

Isolate your Python projects and manage dependencies effectively with venv. Last but certainly not least, since it may be the first thing you do on your project but I didn't want to confuse this section by putting this first. Understand why it's important to use virtual environments. In a nutshell, it means others (and servers in the cloud) use the exact same package versions as you so the code and dependencies work.

In a command prompt window or terminal:

```bash
# Create a virtual environment
python3 -m venv myenv

# Activate the virtual environment
# On Windows
myenv\Scripts\activate

# On macOS/Linux
source myenv/bin/activate

# Install dependencies in the virtual environment
pip install pandas numpy

```

Good resources for this stage are actually getting some data like in an Excel file, reading that into pandas and having a go at analysis! This repo [awesome-public-datasets](https://github.com/awesomedata/awesome-public-datasets) contains many dataset collections organised by topic. Other good choices include [Find open data](https://www.data.gov.uk/) and [ONS](https://www.ons.gov.uk/). Later in this article we look at a practical data analysis workflow with the Titanic dataset.

If you're feeling daring try some web scraping to get a table of data from a web page! The main thing is to get hands-on and learn by doing data analysis. If you need more inspiration in this process, check out [Doing Data Science](https://www.oreilly.com/library/view/doing-data-science/9781449363871/).

## Querying APIs and SQL databases

Besides Excel and CSV files, the main data sources to interact with are APIs and SQL databases so it's very useful to understand how to get data from these.

**Querying data from APIs**

Fetch data from APIs using the requests library:

```python
import requests
import pandas as pd

response = requests.get("https://api.example.com/data")
data = response.json()
df = pd.DataFrame(data)

print(df.head())
```

**Querying data from SQL databases**

Integrate SQL with Python for database queries:

For smaller projects using SQLite use [SQLite3](https://docs.python.org/3/library/sqlite3.html):

```python
import sqlite3
conn = sqlite3.connect('example.db')
data = pd.read_sql_query("SELECT * FROM table_name", conn)
```

For larger projects using a SQL Server database use [pyodbc](https://pypi.org/project/pyodbc/):

```python
import pyodbc
import pandas as pd

# Connect to the SQL Server
conn = pyodbc.connect(
    "DRIVER={SQL Server};SERVER=your_server_name;DATABASE=your_database_name;UID=your_username;PWD=your_password"
)

# Query the database
query = "SELECT TOP 10 * FROM your_table_name;"
df = pd.read_sql(query, conn)

# Perform data analysis
print(df.describe())

# Close the connection
conn.close()
```

For larger projects using a Postgres database use [psycopg2](https://www.psycopg.org/docs/):

```python
import psycopg2
import pandas as pd

# Connect to the PostgreSQL database
conn = psycopg2.connect(
    dbname="your_database_name",
    user="your_username",
    password="your_password",
    host="your_server_name",
    port="your_port_number"  # Default is 5432
)

# Query the database
query = "SELECT * FROM your_table_name LIMIT 10;"
df = pd.read_sql(query, conn)

# Perform data analysis
print(df.describe())

# Close the connection
conn.close()
```

## Practical example of data analysis workflow

Below is a Python script that showcases a basic but realistic data analysis workflow with core data operations. This script demonstrates a practical example, analysing the well-known [Titanic dataset](https://github.com/shedloadofcode/data-files/blob/main/titanic.csv) to uncover insights about survival rates and passenger trends. This is a simple structured dataset which is easy to work with for beginners.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Step 1: Import and load data
url = "https://raw.githubusercontent.com/shedloadofcode/data-files/refs/heads/main/titanic.csv"
df = pd.read_csv(url)
print("First five rows of the dataset:")
print(df.head())

# Step 2: Explore the data
print("\nDataset Info:")
print(df.info())
print("\nSummary Statistics:")
print(df.describe())

# Step 3: Clean the data
print("\nMissing values before cleaning:")
print(df.isnull().sum())

# Drop rows with missing Embarked values
df = df.dropna(subset=['Embarked'])

# Fill missing Age values with the median
df['Age'] = df['Age'].fillna(df['Age'].median())

# Drop irrelevant columns
df = df.drop(columns=['Cabin', 'Name', 'Ticket'])

print("\nMissing values after cleaning:")
print(df.isnull().sum())

# Step 4: Transform and engineer the data by converting or adding columns
df['FamilySize'] = df['SibSp'] + df['Parch'] + 1
df['Sex'] = df['Sex'].map({'male': 0, 'female': 1})
df['AgeGroup'] = pd.cut(df['Age'], bins=[0, 12, 18, 35, 60, 80], labels=['Child', 'Teen', 'Young Adult', 'Adult', 'Senior'])

print("\nTransformed Dataset:")
print(df.head())

# Step 5: Analyse and aggregate the data
survival_rates = df.groupby(['AgeGroup', 'Sex'])['Survived'].mean().unstack()
print("\nSurvival Rates by Age Group and Sex:")
print(survival_rates)

family_size_survival = df.groupby('FamilySize')['Survived'].mean()
print("\nSurvival Rates by Family Size:")
print(family_size_survival)

# Step 6: Visualise insights
# Survival rates by AgeGroup and Sex
survival_rates_plot = survival_rates.plot(kind='bar', figsize=(10, 6), title='Survival Rates by Age Group and Sex')
plt.ylabel('Survival Rate')
plt.xlabel('Age Group')
plt.legend(title='Sex', labels=['Male', 'Female'])
plt.xticks(rotation=0)
plt.tight_layout()
plt.savefig('survival_rates_by_age_group_and_sex.png')
plt.close()

# Survival rates by Family Size
family_size_survival_plot = family_size_survival.plot(kind='line', figsize=(10, 6), marker='o', title='Survival Rates by Family Size')
plt.ylabel('Survival Rate')
plt.xlabel('Family Size')
plt.tight_layout()
plt.savefig('survival_rates_by_family_size.png')
plt.close()

# Step 7: Draw Conclusions
# Insights from survival rates by AgeGroup and Sex
highest_female_child_survival = survival_rates.loc['Child', 1] * 100
highest_female_young_adult_survival = survival_rates.loc['Young Adult', 1] * 100
lowest_male_adult_survival = survival_rates.loc['Adult', 0] * 100

print("\nConclusions:")
print(f"1. Females had a higher survival rate across all age groups, especially among children ({highest_female_child_survival:.2f}%) "
      f"and young adults ({highest_female_young_adult_survival:.2f}%).")
print(f"2. Males had significantly lower survival rates, particularly in the adult age group ({lowest_male_adult_survival:.2f}%).")

# Insights from survival rates by Family Size
highest_survival_family_size = family_size_survival.idxmax()
highest_survival_rate = family_size_survival.max() * 100
lowest_survival_family_size = family_size_survival.idxmin()
lowest_survival_rate = family_size_survival.min() * 100

print(f"3. Passengers with family size of {highest_survival_family_size} had the highest survival rate ({highest_survival_rate:.2f}%).")
print(f"4. Passengers with family size of {lowest_survival_family_size}) had the lowest survival rate ({lowest_survival_rate:.2f}%).")
```

This code can be run on your laptop or PC in an IDE or Jupyer Notebook by creating a new file in your IDE like 'analysis.py' and pasting in the code. Then using 'python analysis.py' to run it. This saves the following plots and prints the conclusions to the console.

<article-image 
  src="https://res.cloudinary.com/dayqxxsip/image/upload/v1733142573/App%20Images/Blog%20Images/Article%20Images/Learn%20Python%20Data%20Analysis/vs-code-analysis_1_fyenkk.png" 
  alt="VS Code plot outputs" 
  loading="lazy" 
  styling=""
  caption="The outputs from this script in VS Code" 
  captionsrc="https://res.cloudinary.com/dayqxxsip/image/upload/v1733142573/App%20Images/Blog%20Images/Article%20Images/Learn%20Python%20Data%20Analysis/vs-code-analysis_1_fyenkk.png" 
  :showsource="false">
</article-image>

However in an organisation, you might actually run your code in the cloud or remote servers. This allows for more power, scripts run quicker, can connect to larger datasets and enables collaboration. Learning about cloud computing and the big providers like AWS, Azure and [Azure Databricks](https://learn.microsoft.com/en-us/azure/databricks/introduction/) and Google Cloud is worthwhile but not in scope for this article. There are links in the next section to learn more about these.

## Resources, communities and projects for learning

Whether you're learning Python for data analysis, data science or tackling real-world projects, these further learning resources will guide you through key topics like analysis, visualisation, machine learning and cloud computing basics.

**Resources**
* [Python for Data Analysis](https://wesmckinney.com/book/) - great book to learn more on data analysis with Python
* [Python Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook/)
* [Practical Statistics for Data Scientists](https://www.amazon.co.uk/Practical-Statistics-Data-Scientists-Essential-dp-149207294X/dp/149207294X/ref=dp_ob_title_bk) - my most used reference book for statistical concepts
* [Kaggle Learn](https://www.kaggle.com/learn)
* [freeCodeCamp Data Analysis with Python](https://www.freecodecamp.org/learn/data-analysis-with-python/)
* [Cloud computing](https://www.kdnuggets.com/introduction-to-cloud-computing-for-data-science) - [AWS](https://docs.aws.amazon.com/whitepapers/latest/aws-overview/analytics.html), [Azure](https://azure.microsoft.com/en-gb/products/category/analytics), [Google Cloud](https://cloud.google.com/discover/what-is-cloud-analytics#related-products-and-services)
* [Data Analyst Roadmap](https://roadmap.sh/data-analyst)
* [Data Scientist Roadmap](https://roadmap.sh/ai-data-scientist)
* [Courses on YouTube](https://www.youtube.com/results?search_query=python+data+analysis+tutorial)
* [Polars](https://pola.rs/) - an upcoming [faster alternative to Pandas](https://blog.jetbrains.com/pycharm/2024/07/polars-vs-pandas/)

**Communities**
* [r/datascience](https://www.reddit.com/r/datascience/)
* [r/learnpython](https://www.reddit.com/r/learnpython/)
* [Kaggle competitions](https://www.kaggle.com/competitions)
* [Towards Data Science](https://towardsdatascience.com/)

**Project ideas**
* Analyse sales data to identify trends
> Use the [Superstore Sales Dataset](https://www.kaggle.com/datasets/rohitsahoo/sales-forecasting) to explore sales performance by region and category, and visualise monthly or yearly trends.

* Analyse global COVID-19 data
> Use the [COVID-19 Data Repository](https://github.com/owid/covid-19-data) by Our World in Data to track global case trends, vaccination progress, and create visualizations to uncover regional patterns.

* Scrape e-commerce websites to compare product prices
> Learn web scraping with tools like [BeautifulSoup](https://beautiful-soup-4.readthedocs.io/en/latest/) and [Selenium](https://selenium-python.readthedocs.io/) to collect and analyse product prices from e-commerce websites.

* Segment customers using a clustering model
> Use the [Online Retail Dataset](https://archive.ics.uci.edu/dataset/352/online+retail) to group customers based on purchasing behavior and derive actionable insights for marketing.

* Perform a time-series analysis on stock market data
> Fetch historical stock prices using the [Yahoo Finance Dataset](https://www.kaggle.com/datasets/suruchiarora/yahoo-finance-dataset-2018-2023) and analyse trends to forecast future prices with statistical models.


## Conclusion

If you made it through all of this article, well done! We covered lots of ground. This guide will serve as a useful reference to return to on your learning journey. Keep going and keep learning! The number one takeaway is to **get hands-on and do your own data analysis to solve a real problem** in either your current role or in personal projects. 

Analysis and statistics can solve all kinds of problems, but thinking through how you solved the problem or answered a question with data and statistics are the main things. These show your analytical mind working to solve problems and communicate insights clearly.

Learning Python for data analysis provides a strong foundation for working with data effectively. Start with the Python basics like data structures and libraries, practice real-world projects, and expand into advanced topics like machine learning and LLMs. With consistent practice, you'll soon be able to collect, analyse, and visualise data to derive valuable insights effectively with Python. 

If you enjoyed this article be sure to check out other articles on the site 👍 you may be interested in:

* [Developing your data science and analytical coding skills - a review of DataCamp](/blog/developing-your-data-science-and-analytical-coding-skills-a-review-of-datacamp/)
* [Concepts of Artificial Intelligence with Python - a review of CS50 AI](/blog/concepts-of-artificial-intelligence-with-python-a-review-of-cs50-ai)
* [Six tips for producing and assuring high quality analytical code](/blog/six-tips-for-producing-and-assuring-high-quality-analytical-code/)]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Making GOV.UK style plots with Python and R]]></title>
            <link>https://shedloadofcode.com/blog/making-govuk-style-plots-with-python-and-r/</link>
            <guid>https://shedloadofcode.com/blog/making-govuk-style-plots-with-python-and-r/</guid>
            <pubDate>Wed, 06 Nov 2024 12:15:00 GMT</pubDate>
            <description><![CDATA[Take inspiration in your own analysis from the clean and simple style of GOV.UK plots.]]></description>
            <content:encoded><![CDATA[
Finding examples of clear, accessible, and visually appealing charts is crucial when creating your own charts to communicate data-driven insights, especially in a context with a wide audience. I've always admired the simple, clear and accessible format of GOV.UK charts to strike the perfect balance between detail and clarity. Also a big fan of the [GOV Design System](https://design-system.service.gov.uk/) as a good example of simple yet effective styling.

This guide demonstrates how to generate GOV.UK style plots using Python and R, leveraging best practices and official guidance. The lessons learnt from producing these charts can help you to keep your own analysis and insights clear and effective. The examples include mostly static charts, including line and bar charts, scatter plots, and choropleth maps, formatted in the GOV.UK style. 

These should cover 90% of data visualisation needs and keeps things simple which is perfect for clearly communicating insights. Not to mention, this article is loaded with examples in each to cover lots of use cases, using both Python and R. Enjoy!

## Versions used

You can download the latest versions of Python, Visual Studio Code, R and RStudio below. I've also added the specific versions I used at the time of writing.

**Python version**: 3.12.2 from [https://www.python.org/downloads/](https://www.python.org/downloads/) and using [Visual Studio Code](https://code.visualstudio.com/download) as the IDE.

**R version**: 4.4.1 from [https://cran.rstudio.com/](https://cran.rstudio.com/) and [RStudio](https://posit.co/download/rstudio-desktop/) as the IDE.

## Libraries used

- **Python Libraries**: `numpy`, `pandas`, `matplotlib`, `seaborn`, `plotly`, `scikit-learn`, `folium`, `geopandas`

- **R Libraries**: `tidyverse`, `ggplot2`, `govstyle`
  
These libraries support creating both static and interactive visualisations, making them ideal for producing accessible and visually appealing charts for different media. In this article, we'll be covering mainly static charts, but if you're interested in interactive visualisations check out [How to create animated charts with Python and Plotly
](/blog/how-to-create-animated-charts-with-python-and-plotly/).

## Setting Up the Environment

**Python Setup**

For Python, install the required libraries:

```bash
pip install numpy pandas matplotlib seaborn plotly pandas
```

For maps, we would also need:

```bash
pip install folium geopandas pandas
```

**R Setup**

For R, install the `tidyverse`, `ggplot2` and optionally `govstyle` packages:

```r
install.packages("tidyverse")
install.packages("ggplot2")
```

```r
install.packages('devtools')
devtools::install_github('ukgovdatascience/govstyle')
```

During my research into this topic, the [govstyle](https://github.com/ukgovdatascience/govstyle) package seems specifically tailored for producing GOV.UK-compliant visuals, making it an appealing option for R users. This is optional though, you can still create similar charts without it.

## Downloading Data

The data used in this article is not real data, so it is all dummy data but based on real datasets for learning purposes. It also makes it easy for you to try these out yourself using the code snippets.

You can find statistics and figures at the UK’s [Office for National Statistics (ONS) site](https://www.ons.gov.uk/). For example, employment data over the past few years or regional population data.

- Download ONS datasets from [ONS website](https://www.ons.gov.uk/).
- Import using `pandas` in Python or `readr` in R.

## Line Chart

**Python Example**

<article-image 
  src="https://res.cloudinary.com/dayqxxsip/image/upload/v1728225493/App%20Images/Blog%20Images/Article%20Images/GOVUK%20Basic%20Charts/python-line_od9sdo.png" 
  alt="Python line chart" 
  loading="lazy" 
  styling=""
  caption="" 
  captionsrc="" 
  :showsource="false">
</article-image>

```python
import matplotlib.pyplot as plt

# Adjusted years list to match the length of rate lists
years = [1971, 1974, 1977, 1980, 1983, 1986, 1989, 1992, 1995, 1998, 2001, 2004, 2007, 2010, 2013, 2016, 2019, 2021]
men_rate = [4.1, 5.0, 6.0, 8.2, 11.5, 9.6, 8.3, 6.5, 5.4, 4.9, 5.1, 6.8, 7.2, 6.3, 5.1, 4.6, 3.8, 4.0]
women_rate = [4.0, 5.5, 6.2, 7.8, 9.8, 8.7, 7.5, 6.9, 5.8, 5.2, 5.3, 6.1, 6.7, 5.8, 4.9, 4.3, 3.5, 3.6] 

# Create the figure and axis
fig, ax = plt.subplots(figsize=(10, 6))

# Plotting the two lines
ax.plot(years, men_rate, label='Men', color='darkblue', linewidth=2)
ax.plot(years, women_rate, label='Women', color='orangered', linewidth=2)

# Adding labels directly on the lines
ax.text(years[-1], men_rate[-1], 'Men', fontsize=12, verticalalignment='bottom', color='darkblue')
ax.text(years[-1], women_rate[-1], 'Women', fontsize=12, verticalalignment='top', color='orangered')

# Setting axis labels and title
ax.set_ylabel('%', fontsize=14)
ax.set_xlabel('Year', fontsize=12)

# Remove top and right spines for a cleaner look
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)

# Enable horizontal gridlines only
ax.yaxis.grid(True, linestyle='--', alpha=0.5)
ax.xaxis.grid(False)

# Set y-axis limits and x-ticks
ax.set_ylim(0, 14)
ax.set_xticks(list(range(1971, 2022, 3)))  # Set x-tick intervals
ax.set_xlim(1971, 2021)

# Show the plot
plt.tight_layout()
plt.show()
```

**R Example**

<article-image 
  src="https://res.cloudinary.com/dayqxxsip/image/upload/v1728225490/App%20Images/Blog%20Images/Article%20Images/GOVUK%20Basic%20Charts/r-line_enxd9y.png" 
  alt="R line chart" 
  loading="lazy" 
  styling=""
  caption="" 
  captionsrc="" 
  :showsource="false">
</article-image>

```r
library(ggplot2)

# Define the data
years <- c(1971, 1974, 1977, 1980, 1983, 1986, 1989, 1992, 1995, 1998, 2001, 2004, 2007, 2010, 2013, 2016, 2019, 2021)
men_rate <- c(4.1, 5.0, 6.0, 8.2, 11.5, 9.6, 8.3, 6.5, 5.4, 4.9, 5.1, 6.8, 7.2, 6.3, 5.1, 4.6, 3.8, 4.1)
women_rate <- c(4.0, 5.5, 6.2, 7.8, 9.8, 8.7, 7.5, 6.9, 5.8, 5.2, 5.3, 6.1, 6.7, 5.8, 4.9, 4.3, 3.5, 3.6)

# Create a data frame to hold the values
data <- data.frame(
  Year = rep(years, 2),
  Rate = c(men_rate, women_rate),
  Gender = rep(c("Men", "Women"), each = length(years))
)

# Generate the plot using ggplot2
p <- ggplot(data, aes(x = Year, y = Rate, color = Gender, group = Gender)) +
  geom_line(size = 1.2) +  # Adjust line thickness
  geom_text(data = subset(data, Year == 2021), aes(label = Gender), hjust = -0.1, size = 5) +
  scale_color_manual(values = c("Men" = "darkblue", "Women" = "orangered")) +
  labs(x = "Year", y = "%", title = NULL) +
  theme_minimal() +  # Minimal theme
  theme(
    panel.grid.major.y = element_line(linetype = "dashed", color = "gray", size = 0.5),  # Horizontal gridlines
    panel.grid.major.x = element_blank(),  # Remove vertical gridlines
    panel.grid.minor = element_blank(),  # Remove minor gridlines
    axis.line = element_line(color = "black"),  # Add black axis lines
    axis.title = element_text(size = 14),
    axis.text = element_text(size = 12),
    legend.position = "none"  # Remove legend
  ) +
  coord_cartesian(ylim = c(0, 14)) +  # Match y-axis limits
  scale_x_continuous(limits = c(1971, 2029), breaks = seq(1971, 2025, by = 5))  # Extend x-axis to 2025

# Display the plot
print(p)
```

**Python comparison subplots example**

<article-image 
  src="https://res.cloudinary.com/dayqxxsip/image/upload/v1728225495/App%20Images/Blog%20Images/Article%20Images/GOVUK%20Basic%20Charts/python-line-subplots_qccvdc.png" 
  alt="Python comparison line chart with subplots" 
  loading="lazy" 
  styling=""
  caption="" 
  captionsrc="" 
  :showsource="false">
</article-image>

This can also be used for [interrupted time series](https://en.wikipedia.org/wiki/Interrupted_time_series#:~:text=Interrupted%20time%20series%20analysis%20) analysis to understand the effect of an intervention.

```python
import matplotlib.pyplot as plt
import pandas as pd

data = pd.DataFrame({
    'Year': list(range(2009, 2024)),
    'Employment Rate': [70, 71, 71.5, 72, 73, 74, 75, 76, 77, 78, 79, 78.5, 75.9, 75.5, 74],
    'Unemployment Rate': [8, 7.8, 7.5, 7, 6.8, 6.5, 6, 5.8, 5.0, 4.5, 3.5, 2.8, 4, 3.8, 3.5]
})

fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(12, 8), sharex=False)

ax1.plot(data['Year'], data['Employment Rate'], color='#005EA5', linewidth=2)
ax1.text(2010, 74, 'Employment rate', fontname="Arial", fontsize=12, color='black')
ax1.set_ylabel('Employment Rate (%)')
ax1.axvspan(2020, 2021, color='#b1b4b6', alpha=0.3)  # Highlight for COVID-19 period
ax1.text(2016, 80, 'At the start of the coronavirus (COVID-19)\n pandemic the employment rate fell . . .', fontsize=10)
ax1.set_ylim(62, 85)

ax2.plot(data['Year'], data['Unemployment Rate'], color='#f47738', linewidth=2)
ax2.text(2010, 8, 'Unemployment rate', fontsize=12, color='black')
ax2.set_ylabel('Unemployment Rate (%)')
ax2.axvspan(2020, 2021, color='#b1b4b6', alpha=0.3)  # Highlight for COVID-19 period
ax2.text(2021.2, 4.3, '. . . and the\n unemployment\n rate rose.', fontsize=10)
ax2.set_ylim(0, 10)

for ax in [ax1, ax2]:
    ax.spines['top'].set_visible(False)
    ax.spines['right'].set_visible(False)

plt.tight_layout()
plt.show()
```

**Python focus line chart example**

<article-image 
  src="https://res.cloudinary.com/dayqxxsip/image/upload/v1728225493/App%20Images/Blog%20Images/Article%20Images/GOVUK%20Basic%20Charts/python-line-focus_plbx4y.png" 
  alt="Python focus line chart" 
  loading="lazy" 
  styling=""
  caption="" 
  captionsrc="" 
  :showsource="false">
</article-image>

```python
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from matplotlib.ticker import FuncFormatter
from matplotlib.dates import DateFormatter

# Generating sample data to match the given plot pattern
np.random.seed(42)  # For reproducibility

# Generate dates
dates = pd.date_range(start="2021-08-01", end="2021-11-15")

# Adjusting the synthetic data lengths to match the number of dates (107 entries)
num_dates = len(dates)

# Create synthetic data with similar trends for each country
data = {
    "Austria": np.concatenate([np.linspace(300, 400, num_dates//3), np.linspace(400, 600, num_dates//3), np.linspace(600, 1400, num_dates - 2*(num_dates//3))]),
    "Belgium": np.concatenate([np.linspace(200, 300, num_dates//3), np.linspace(300, 500, num_dates//3), np.linspace(500, 1200, num_dates - 2*(num_dates//3))]),
    "Netherlands": np.concatenate([np.linspace(200, 300, num_dates//3), np.linspace(300, 500, num_dates//3), np.linspace(500, 1100, num_dates - 2*(num_dates//3))]),
    "UK": np.concatenate([np.linspace(400, 600, num_dates//3), np.linspace(600, 800, num_dates//3), np.linspace(800, 550, num_dates - 2*(num_dates//3))]),
    "Germany": np.concatenate([np.linspace(300, 500, num_dates//3), np.linspace(500, 700, num_dates//3), np.linspace(700, 650, num_dates - 2*(num_dates//3))]),
    "France": np.linspace(50, 200, num_dates),
    "Italy": np.linspace(50, 150, num_dates),
    "Spain": np.linspace(50, 120, num_dates)
}

# Adding slight variability (smaller noise) to each line to make them less smooth
for country in data:
    noise = np.random.normal(0, 7, num_dates)  # Adding random noise with mean 0 and standard deviation 7
    data[country] += noise

# Create DataFrame
df = pd.DataFrame(data, index=dates)

# Function to format y-axis labels with commas
def y_format(x, pos):
    return f'{int(x):,}'

# Plotting the data
plt.figure(figsize=(12, 6))
for country in df.columns:
    color = 'navy' if country == "UK" else 'grey'
    alpha_value = 1 if country == "UK" else 0.6  # Set higher transparency for non-UK lines
    plt.plot(df.index, df[country], label=country, linewidth=2 if country == "UK" else 1.5, color=color, alpha=alpha_value)

# Add country names at the end of each line
for country in df.columns:
    plt.text(df.index[-1], df[country].values[-1], country, fontsize=12, color='navy' if country == "UK" else 'grey', alpha=alpha_value, va='center')

# Formatting the plot to match the visual style
plt.xlabel(None)
plt.ylabel("Cases per 100,000 people")
plt.title(None)

# Customizing x-axis date format
date_format = DateFormatter("%d %b")  # Format as "01 Aug", "21 Aug", etc.
plt.gca().xaxis.set_major_formatter(date_format)

# Customizing the y-axis format to include commas
plt.gca().yaxis.set_major_formatter(FuncFormatter(y_format))

# Set x-tick intervals to match the original chart
plt.xticks(pd.to_datetime(['2021-08-01', '2021-08-21', '2021-09-10', '2021-09-30', '2021-10-20', '2021-11-09']), rotation=0)

# Customizing the horizontal gridlines to be faint grey
plt.grid(axis='y', color='lightgrey', linestyle='-', linewidth=0.5, alpha=0.5)

# Remove all spines
for spine in plt.gca().spines.values():
    spine.set_visible(False)

# Remove x and y axis tick marks
plt.tick_params(axis='both', which='both', length=0)  # Set the length of tick marks to 0

# Remove legend
plt.legend().set_visible(False)

# Display the plot
plt.tight_layout()
plt.show()
```

## Bar Chart

**Python Example**

<article-image 
  src="https://res.cloudinary.com/dayqxxsip/image/upload/v1728225491/App%20Images/Blog%20Images/Article%20Images/GOVUK%20Basic%20Charts/python-bar_sdmyqc.png" 
  alt="Python bar chart" 
  loading="lazy" 
  styling=""
  caption="" 
  captionsrc="" 
  :showsource="false">
</article-image>

```python
import matplotlib.pyplot as plt

# Data
categories = ['Dog', 'Cat', 'Hamster', 'Dolphin', 'Other']
values = [25, 22, 20, 18, 17]

# Create the plot
plt.figure(figsize=(12, 6))
plt.bar(categories, values, color='#12436D', width=0.6, zorder=3)

# Adding faint horizontal grey gridlines
plt.grid(axis='y', color='#e0e0e0', linestyle='-', linewidth=0.7, zorder=0)

# Remove all spines
for spine in plt.gca().spines.values():
    spine.set_visible(False)

# Remove x and y axis tick marks
plt.tick_params(axis='both', which='both', length=0)  # Set the length of tick marks to 0

# Adding labels and title
plt.ylabel('Number of 6 year olds')
plt.xlabel('')
plt.title('Favorite Animals of 6 Year Olds')
plt.show()
```

**R Example**

<article-image 
  src="https://res.cloudinary.com/dayqxxsip/image/upload/v1728225490/App%20Images/Blog%20Images/Article%20Images/GOVUK%20Basic%20Charts/r-bar_yg6muu.png" 
  alt="R bar chart" 
  loading="lazy" 
  styling=""
  caption="" 
  captionsrc="" 
  :showsource="false">
</article-image>

```r
# Load necessary library
library(ggplot2)

# Data
categories <- c("Dog", "Cat", "Hamster", "Dolphin", "Other")
values <- c(25, 22, 20, 18, 17)

# Create a data frame
data <- data.frame(categories, values)

# Reorder categories based on values (largest to smallest)
data$categories <- factor(data$categories, levels = data$categories[order(-data$values)])

# Create the plot
ggplot(data, aes(x = categories, y = values)) +
  geom_bar(stat = "identity", fill = "#12436D", width = 0.6) +  # Dark blue bars with custom width
  theme_minimal(base_size = 15) +  # Minimal theme with base text size
  theme(
    panel.grid.major.y = element_line(color = "#e0e0e0", size = 0.7),  # Faint horizontal gridlines
    panel.grid.minor = element_blank(),  # Remove minor gridlines
    panel.grid.major.x = element_blank(),  # Remove vertical gridlines
    axis.ticks = element_blank(),  # Remove axis tick marks
    axis.line = element_blank(),  # Remove axis lines
    plot.title = element_text(hjust = 0.5)  # Center the plot title
  ) +
  labs(
    x = NULL,  # Remove x-axis label
    y = "Number of 6 year olds",  # Set y-axis label
    title = "Favorite Animals of 6 Year Olds"  # Set plot title
  )

```

**Python Comparison Example**

<article-image 
  src="https://res.cloudinary.com/dayqxxsip/image/upload/v1728225493/App%20Images/Blog%20Images/Article%20Images/GOVUK%20Basic%20Charts/python-bar-compare_jgnn91.png" 
  alt="Python comparison bar chart" 
  loading="lazy" 
  styling=""
  caption="" 
  captionsrc="" 
  :showsource="false">
</article-image>

```python
import matplotlib.pyplot as plt
import pandas as pd
import matplotlib.ticker as mticker

# Sample DataFrame
data = {
    'County': [
        'Monmouthshire', 'Vale of Glamorgan', 'Cardiff', 'Ceredigion', 'Powys', 'Isle of Anglesey',
        'Pembrokeshire', 'Newport', 'Conwy', 'Flintshire', 'Bridgend', 'Median for all Wales',
        'Gwynedd', 'Wrexham', 'Swansea', 'Denbighshire', 'Torfaen', 'Carmarthenshire',
        'Caerphilly', 'Neath Port Talbot', 'Rhondda Cynon Taf', 'Merthyr Tydfil', 'Blaenau Gwent'
    ],
    'Ranking': [
        400000, 350000, 320000, 300000, 280000, 260000, 240000, 220000, 200000, 180000, 160000,
        140000, 130000, 120000, 110000, 100000, 90000, 80000, 70000, 60000, 50000, 40000, 30000
    ],
    'Deviation': [
        100000, 80000, 60000, 40000, 30000, 20000, 10000, 5000, 3000, 1000, 500, 0,
        -500, -1000, -3000, -5000, -7000, -9000, -10000, -20000, -30000, -40000, -50000
    ]
}

# Create DataFrame
df = pd.DataFrame(data)

# Set the county names as the index
df.set_index('County', inplace=True)

# Set the figure size with increased width to accommodate labels
fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(18, 8))

# Plot the Ranking bar chart
df.sort_values('Ranking', ascending=True).plot.barh(
    y='Ranking', ax=ax1, color='#12436D', legend=False, zorder=3
)
ax1.set_title('Ranking')
ax1.set_xlabel('£')

# Highlight the "Median for all Wales" row in the specified color (#f46a25)
median_index = df.index.get_loc('Median for all Wales')
ax1.get_children()[median_index].set_color('#f46a25')

# Remove the y-axis label ('County')
ax1.set_ylabel('')

# Adjust x-axis for the left chart to use 100,000 increments up to 425,000
ax1.set_xlim(0, 425000)  # Set the x-axis limit from 0 to 425,000
ax1.set_xticks(range(0, 426000, 100000))  # Set x-ticks at 0, 100,000, 200,000, ..., 425,000

# Plot the Deviation bar chart
df.sort_values('Deviation', ascending=True).plot.barh(
    y='Deviation', ax=ax2, color='#12436D', legend=False, zorder=3
)
ax2.set_title('Deviation')
ax2.set_xlabel('£')

# Highlight the "Median for all Wales" row in the specified color (#f46a25)
median_index_deviation = df.index.get_loc('Median for all Wales')
ax2.get_children()[median_index_deviation].set_color('#f46a25')

# Remove the y-axis label ('County')
ax2.set_ylabel('')

# Remove all spines (box borders) and add vertical gridlines
for ax in [ax1, ax2]:
    ax.spines['top'].set_visible(False)
    ax.spines['right'].set_visible(False)
    ax.spines['left'].set_visible(False)
    ax.spines['bottom'].set_visible(False)

    # Add light gray vertical gridlines behind the bars using zorder=1
    ax.grid(True, axis='x', color='lightgray', linestyle='--', linewidth=0.7, zorder=1)

    # Remove tick marks
    ax.tick_params(axis='both', which='both', length=0)

    # Set x-axis formatter to include commas
    ax.xaxis.set_major_formatter(mticker.FuncFormatter(lambda x, _: f'{int(x):,}'))

# Adjust subplot spacing and margins to prevent labels from being cut off
plt.subplots_adjust(left=0.13, right=0.976, top=0.9, bottom=0.1, wspace=0.39)

# Show the plot
plt.show()
```

## Scatter Chart

**Python Example**

<article-image 
  src="https://res.cloudinary.com/dayqxxsip/image/upload/v1728225495/App%20Images/Blog%20Images/Article%20Images/GOVUK%20Basic%20Charts/python-scatter_os14vd.png" 
  alt="Python scatter chart" 
  loading="lazy" 
  styling=""
  caption="" 
  captionsrc="" 
  :showsource="false">
</article-image>

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
import numpy as np
import matplotlib.ticker as mticker  # Import ticker for formatting

# Seed for reproducibility
np.random.seed(42)

# Create a DataFrame with a nuanced pattern
data = {
    'Category': np.random.choice(['Urban', 'Rural'], 50),  # Randomly assign categories
}

# Assign household incomes with Urban starting from 20,000 upwards
data['Household Income'] = [
    np.random.randint(20000, 100000) if cat == 'Urban' else np.random.randint(20000, 70000)
    for cat in data['Category']
]

# Assign house prices to be generally higher in Rural areas, lower in Urban areas
data['House Prices'] = [
    np.random.randint(300000, 500000) if cat == 'Rural' else np.random.randint(100000, 300000)
    for cat in data['Category']
]

# Convert to a DataFrame for easier manipulation
df = pd.DataFrame(data)

# Introduce a general correlation: Higher incomes should have higher house prices
# Modify 25% of the data points to ensure high income leads to high house prices
high_income_mask = df['Household Income'] > 70000
df.loc[high_income_mask, 'House Prices'] += np.random.randint(100000, 200000, high_income_mask.sum())

# Create a scatter plot with GOV.UK styling for two categories
fig, ax = plt.subplots(figsize=(10, 6))

# Plot scatter points by category with GOV.UK style colors
sns.scatterplot(x='Household Income', y='House Prices', hue='Category', data=df, s=100, ax=ax, palette=['#12436D', '#f46a25'])

# Calculate and plot separate linear trends and R² values for each category
for i, category in enumerate(df['Category'].unique()):
    # Filter data for each category
    category_data = df[df['Category'] == category]
    
    # Perform linear regression
    X = category_data[['Household Income']]
    y = category_data['House Prices']
    reg = LinearRegression().fit(X, y)
    y_pred = reg.predict(X)
    r2 = r2_score(y, y_pred)
    
    # Plot linear trend line for each category
    sns.regplot(x='Household Income', y='House Prices', data=category_data, scatter=False, ax=ax,
                line_kws={'lw': 2, 'linestyle': '--'}, color='#12436D' if category == 'Urban' else '#f46a25')
    
    # Display R² for each category with adjusted positions
    common_x_pos = df['Household Income'].min() - 3000  # Fixed x position for neat alignment
    y_pos = category_data['House Prices'].max()
    
    # Adjust R² label y position to avoid overlap and set x position to common_x_pos
    y_offset = 10000 if i == 0 else -10000  # Offset R² label: Up for Urban, Down for Rural
    plt.text(common_x_pos, y_pos + y_offset, f'{category} $R^2$ = {r2:.2f}', fontsize=12, color='#12436D' if category == 'Urban' else '#f46a25')

# Set the GOV.UK style labels and title
ax.set_xlabel('Household Income (£)', fontsize=14, weight='bold')
ax.set_ylabel('House Prices (£)', fontsize=14, weight='bold')

# Set the formatter for x and y axes to include commas
ax.xaxis.set_major_formatter(mticker.FuncFormatter(lambda x, _: f'{int(x):,}'))
ax.yaxis.set_major_formatter(mticker.FuncFormatter(lambda y, _: f'{int(y):,}'))

# Set the GOV.UK styling colors for axes and background
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.spines['left'].set_linewidth(1.5)
ax.spines['bottom'].set_linewidth(1.5)
ax.spines['left'].set_color('black')
ax.spines['bottom'].set_color('black')

# Adjust tick parameters for GOV.UK styling
ax.tick_params(axis='both', which='major', labelsize=12, color='black')

# Show the plot
plt.tight_layout()
plt.show()
```

**R Example**

<article-image 
  src="https://res.cloudinary.com/dayqxxsip/image/upload/v1728225490/App%20Images/Blog%20Images/Article%20Images/GOVUK%20Basic%20Charts/r-scatter_qzcsby.png" 
  alt="R scatter chart" 
  loading="lazy" 
  styling=""
  caption="" 
  captionsrc="" 
  :showsource="false">
</article-image>

```r
# Load required libraries
library(ggplot2)
library(dplyr)
library(broom)
library(scales)  # For comma formatting

# Set seed for reproducibility
set.seed(42)

# Create a DataFrame with a nuanced pattern
data <- data.frame(
  Category = sample(c("Urban", "Rural"), 50, replace = TRUE)  # Randomly assign categories
)

# Assign household incomes with Urban starting from 20,000 upwards
data$Household_Income <- ifelse(
  data$Category == "Urban", 
  sample(20000:100000, 50, replace = TRUE), 
  sample(20000:70000, 50, replace = TRUE)
)

# Assign house prices to be generally higher in Rural areas, lower in Urban areas
data$House_Prices <- ifelse(
  data$Category == "Rural", 
  sample(300000:500000, 50, replace = TRUE), 
  sample(100000:300000, 50, replace = TRUE)
)

# Introduce a general correlation: Higher incomes should have higher house prices
# Modify 25% of the data points to ensure high income leads to high house prices
high_income_mask <- data$Household_Income > 70000
data$House_Prices[high_income_mask] <- data$House_Prices[high_income_mask] + sample(100000:200000, sum(high_income_mask), replace = TRUE)

# Define the custom GOV.UK color palette
govuk_palette <- c(Urban = "#12436D", Rural = "#f46a25")

# Create a scatter plot with GOV.UK styling for two categories
ggplot(data, aes(x = Household_Income, y = House_Prices, color = Category)) +
  geom_point(size = 3, alpha = 0.7) +
  
  # Add separate linear trends and display R² values for each category
  geom_smooth(method = "lm", se = FALSE, linetype = "dashed", size = 1.2, aes(color = Category)) +
  
  # Apply the custom GOV.UK color palette
  scale_color_manual(values = govuk_palette) +
  
  # Custom labels and title
  labs(
    x = "Household Income (£)",
    y = "House Prices (£)",
    title = "House Prices vs. Household Income by Category"
  ) +
  
  # Custom styling for GOV.UK-like appearance
  theme_minimal(base_size = 14) +
  theme(
    panel.grid = element_blank(),  # Remove gridlines
    axis.line = element_line(color = "black", size = 0.5),
    legend.position = "top",
    legend.title = element_blank(),
    axis.text = element_text(color = "black"),
    axis.title = element_text(size = 14, face = "bold", color = "black")
  ) +
  
  # Format x and y axis labels with commas
  scale_x_continuous(labels = comma) +
  scale_y_continuous(labels = comma)

# Calculate separate R² values for each category and display them on the plot
r2_values <- data %>%
  group_by(Category) %>%
  summarise(R2 = summary(lm(House_Prices ~ Household_Income))$r.squared)

# Print R² values for reference
print(r2_values)
```

## Choropleth Map

To create a choropleth map, we’ll use a geographical dataset (e.g., a geojson file of local authorities in the UK) and color regions based on employment rates.

**Python Example**

<article-image 
  src="https://res.cloudinary.com/dayqxxsip/image/upload/v1728225491/App%20Images/Blog%20Images/Article%20Images/GOVUK%20Basic%20Charts/choropleth_kklhgv.png" 
  alt="Python choropleth map chart" 
  loading="lazy" 
  styling=""
  caption="" 
  captionsrc="" 
  :showsource="false">
</article-image>

```python
import folium
import requests

# URL to the GeoJSON file hosted on GitHub
geojson_url = "https://raw.githubusercontent.com/shedloadofcode/data-files/refs/heads/main/LAD_DEC_2021_GB_BFC.json"

# Fetch the GeoJSON data from the URL
response = requests.get(geojson_url)
geo_data = response.json()

# Define the custom color scale and class intervals for density
classes = [0, 10000, 20000, 30000, 40000, 50000]
colorscale = ['#cfdce3', '#9fb9c8', '#7095ac', '#407291', '#104f75']

# Create a Folium map centered around the UK with CartoDB's light background
m = folium.Map(location=[54.509865, -5.118092], zoom_start=6, tiles=None)
folium.TileLayer(
    tiles='https://{s}.basemaps.cartocdn.com/light_all/{z}/{x}/{y}{r}.png',
    attr='&copy; <a href="https://www.openstreetmap.org/copyright">OpenStreetMap</a> contributors &copy; <a href="https://carto.com/attributions">CARTO</a>'
).add_to(m)

# Define color based on density values
def style_function(feature):
    density = feature['properties'].get('density', 0)
    color = colorscale[0]  # Default color
    for i, cls in enumerate(classes):
        if density > cls:
            color = colorscale[i]
    return {
        'fillOpacity': 0.7,
        'weight': 2,
        'opacity': 1,
        'color': 'white',  # Default dashed border color
        'fillColor': color,
        'dashArray': '3'
    }

# Highlight function to show black solid lines on hover
def highlight_function(feature):
    return {
        'weight': 3,
        'color': 'black',  # Change to solid black on hover
        'dashArray': '',   # Remove dashes on hover
        'fillOpacity': 0.7
    }

# Create the GeoJson layer with tooltips
geojson_layer = folium.GeoJson(
    geo_data,
    style_function=style_function,
    highlight_function=highlight_function,
    tooltip=folium.GeoJsonTooltip(
        fields=['LAD21NM', 'density'],
        aliases=['Region:', 'Density:'],
        localize=True,
        sticky=True,
        style=("background-color: white; border: 1px solid black; border-radius: 3px; "
               "box-shadow: 3px 3px 3px rgba(0,0,0,0.25); font-size: 16px; font-family: Arial;")
    ),
    zoom_on_click=True
).add_to(m)

# Custom legend for density classes
legend_html = '''
<div style="position: fixed; 
    bottom: 50px; left: 50px; width: 300px; height: 170px; 
    background-color: white; z-index: 1000; border:2px solid grey; padding: 10px;">
    <h4 style="margin:0; text-align: center;">Population Density - Not actual figures</h4>
    <div style="padding: 5px 10px;">
        <div style="background-color: #cfdce3; width: 20px; height: 20px; display: inline-block;"></div> 0 - 10,000<br>
        <div style="background-color: #9fb9c8; width: 20px; height: 20px; display: inline-block;"></div> 10,000 - 20,000<br>
        <div style="background-color: #7095ac; width: 20px; height: 20px; display: inline-block;"></div> 20,000 - 30,000<br>
        <div style="background-color: #407291; width: 20px; height: 20px; display: inline-block;"></div> 30,000 - 40,000<br>
        <div style="background-color: #104f75; width: 20px; height: 20px; display: inline-block;"></div> 40,000 - 50,000+<br>
    </div>
</div>
'''

# Add the legend to the map
m.get_root().html.add_child(folium.Element(legend_html))

# Save the map as an HTML file
m.save('choropleth_map.html')
print("Map saved as 'choropleth_map.html'")
```

When it comes to working with GeoJSON and Shapefiles, [Mapshaper](https://mapshaper.org/) is great tool to use so keep it in mind! To learn more check out this useful guide to [Edit and Join with Mapshaper](https://handsondataviz.org/mapshaper.html).

You can alternatively perform this process in Python or R itself, but the steps are the same:

1. Get a Shapefile or GeoJSON file from the [ONS Open Geography Portal](https://geoportal.statistics.gov.uk/search?q=BDY_LAD%202024&sort=Title%7Ctitle%7Casc)
2. Get some data
3. Join the data and the Shapefile or GeoJSON file on area ID - or any ID they both contain
4. Output as a GeoJSON file
5. Use that GeoJSON file in the leaflet map - referencing the 'columns' you want

## Best Practices for GOV.UK Style Charts

To start, here are some basic tips:

- Use [GOV.UK brand colors](https://design-system.service.gov.uk/styles/colour/) (`#005EA5` for blue, `#007F3B` for green, `#FFDD00` for yellow).
- Ensure text and lines are clear and readable.
- Avoid using distracting elements such as 3D effects.
- Use contrasting colors and appropriate labeling for accessibility.
- Maintain consistent font sizes and styles.

Now going into more detailed guidance:

1. This [official GOV.UK visual content guidelines](https://www.gov.uk/government/publications/examples-of-visual-content-to-use-on-govuk/examples-of-visual-content-to-use-on-govuk) page provides examples of bar charts, line charts, tables, and various image formats used across GOV.UK web pages. It also discusses best practices for creating accessible and visually appealing charts in the GOV.UK style. 

2. The [Government Analysis Function](https://analysisfunction.civilservice.gov.uk/policy-store/data-visualisation-chart-examples/) has a detailed collection of case studies that show how to improve charts for public communication. They include practical tips like focusing on narrative, using GOV.UK color schemes, and adding annotations to guide the viewer through the data. 

3. The [ONS Data Visualisation Service Manual](https://service-manual.ons.gov.uk/data-visualisation) provides detailed guidance on best practices for creating clear and accessible charts, tables, and maps. It covers visual conventions, color usage, and building specifications, making it an excellent reference for creating high-quality and consistent visual content in line with ONS and GOV.UK standards. This resource is ideal for developers and analysts looking to create professional, standardised visualisations.

Finally, some specific examples I found of visualisations:

* This [RShiny GOV.UK styled dashboard template](https://department-for-education.shinyapps.io/dfe-shiny-template/?_inputs_&navlistPanel=%22Example%20tab%201%22&tabsetpanels=%22Line%20chart%20example%22&cookieAccept=0&cookieReject=0&cookieLink=0&hideAccept=0&hideReject=0&selectPhase=%22All%20Local%20authority%20maintained%20schools%22&selectArea=%22England%22&selectBenchLAs=null) contains examples of tables, line charts, and bar charts. [[Code]](https://github.com/dfe-analytical-services/shiny-template)
* [Slides, datasets and transcripts to accompany coronavirus press conferences](https://www.gov.uk/government/collections/slides-and-datasets-to-accompany-coronavirus-press-conferences)


## Conclusion

This guide has shown how to create accessible and visually consistent GOV.UK style charts using Python and R. Whatever analysis you are presenting, the lessons learnt from producing these visuals can help you to communicate insights simply and effectively.

I will be covering more advanced and complex charts in a future article, to cover things like detailed comparisons, forecasting, [visualising uncertainty](https://analystsuncertaintytoolkit.github.io/UncertaintyWeb/introduction.html) and more.

If you enjoyed this article be sure to check out other articles on the site 👍 you may be interested in:

* [Creating statistical neighbours comparator benchmarking models with Python](/blog/creating-statistical-neighbours-comparator-benchmarking-models-with-python/)
* [Solving real-world optimisation problems - a crash course with PuLP](/blog/solving-real-world-optimisation-problems-a-crash-course-with-pulp)
* [Developing your data science and analytical coding skills - a review of DataCamp](/blog/developing-your-data-science-and-analytical-coding-skills-a-review-of-datacamp/)
* [Six tips for producing and assuring high quality analytical code](/blog/six-tips-for-producing-and-assuring-high-quality-analytical-code/)]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Exploratory data analysis with Danfo.js and JavaScript]]></title>
            <link>https://shedloadofcode.com/blog/exploratory-data-analysis-with-danfojs-and-javascript/</link>
            <guid>https://shedloadofcode.com/blog/exploratory-data-analysis-with-danfojs-and-javascript/</guid>
            <pubDate>Sun, 15 Sep 2024 19:30:00 GMT</pubDate>
            <description><![CDATA[Learn how I built a lightweight web-based EDA tool using Danfo.js - a powerful JavaScript library for data manipulation and analysis. We will explore the key Danfo.js functions I used to achieve this.]]></description>
            <content:encoded><![CDATA[
Recently, I published a free web-based lightweight data exploration app [DataCabin](https://shedloadofcode.github.io/) using [Danfo.js](https://danfo.jsdata.org/), a powerful JavaScript library for data manipulation and analysis. I was inspired by libraries such as [SweetViz](https://github.com/fbdesignpro/sweetviz), [Pandas Profiling / ydata-profiling](https://docs.profiling.ydata.ai/latest/) and [AutoViz](https://github.com/AutoViML/AutoViz).

These libraries automate the process of Exploratory Data Analysis (EDA) by generating comprehensive reports and visualisations, I was curious to find out if something like this could be done using only JavaScript and be incorporated into a web based tool a user could interect with. This would allow a user to analyse small to medium sized datasets (maybe less than 20,000 rows) even without knowing any code.

Through this process, I learned valuable lessons about EDA, data processing and visualisation using JavaScript, which I’ll share in this article along with an explanation of the key Danfo.js functions I used in the app. Danfo.js in the wild so to speak 😆 At the start of this article I list all of the Danfo.js functions I used throughout this article and in the project.

<article-image 
  src="https://res.cloudinary.com/dayqxxsip/image/upload/v1726077541/App%20Images/Blog%20Images/App%20Images/datacabin-cover_w6ie2l.png" 
  alt="DataCabin app" 
  loading="lazy" 
  styling=""
  caption="You can upload a CSV dataset to the DataCabin app and perform EDA to get a feel for the dataset quickly. This has been very useful to understand new datasets rapidly." 
  captionsrc="https://shedloadofcode.github.io/" 
  :showsource="false">
</article-image>

## Why Danfo.js?

Danfo.js is a library similar to the [Pandas](https://pandas.pydata.org/docs/reference/general_functions.html) package in Python, making it easier to handle data manipulation in JavaScript. It’s perfect for web-based applications that require working with datasets, providing functions for reading, transforming, and analyzing data with an intuitive syntax.

You can learn more about Danfo.js in the:

* [Getting started guide](https://danfo.jsdata.org/getting-started)
* [API reference -  all public Danfo objects, functions and methods](https://danfo.jsdata.org/api-reference)

## Core features of the app

The primary features of the [DataCabin](https://shedloadofcode.github.io/) EDA app include:

- **Uploading CSV files**: The app allows users to upload CSV files and parses the data for exploration.
- **Data overview**: It provides an overview of the dataset, such as the number of rows, columns, missing values, and data types.
- **Univariate analysis**: Visualisation of individual variables, both numeric and categorical, with histograms and bar charts.
- **Correlation heatmap**: Shows the correlation between numeric variables in the dataset.
- **Relationship plots**: Scatter and table plots that compare relationships between two selected variables.
- **Key influencers**: View influencer statistical summaries of other variables in relation to the target variable and values.


For the frontend I used [Vue.js](https://vuejs.org/) since I like the ease of use for this frontend library. Once you have a Vue.js app ready for deployment, running the npm or yarn **build** command creates a static site in the dist folder, ready for deployment to a CDN or GitHub pages as a static site - very useful!

The first step was importing the package into the Vue app. This gives us a reference to it as 'dfd'. Other useful packages were Plotly for charting and Axios for HTTP requests.

```js
import * as dfd from "danfojs";
import Plotly from "plotly.js-dist-min";
import axios from "axios";
```

## Danfo functions used

Now let's take a look at some of the key functions I used from Danfo. You'll see these referenced as we progress through the article. For any others you can take a look at the [Danfo API reference](https://danfo.jsdata.org/api-reference).

- **`dfd.readCSV(this.dropFiles)`**: Reads a CSV file and converts it into a DataFrame.
- **`df.selectDtypes(["int32", "float32"])`**: Selects columns with numeric data types (integers, floats).
- **`df.selectDtypes(["string"])`**: Selects columns with string (categorical) data types.
- **`df.isNa()`**: Identifies missing values in the DataFrame.
- **`df.sum()`**: Sums values across the DataFrame, used for calculating missing values.
- **`df.iloc({ rows: [i] })`**: Retrieves data from specific rows in the DataFrame.
- **`df.ctypes`**: Gets the data types of the DataFrame's columns.
- **`df.column(column).unique()`**: Finds unique values in a specific column.
- **`df.column(column).mean()`**: Calculates the mean of a numeric column.
- **`df.query()`**: Filters the DataFrame based on conditions (e.g., matching specific values).
- **`df.column(column).valueCounts()`**: Counts occurrences of unique values in a column.
- **`df.drop({ columns: columnsToDrop, inplace: true })`**: Drops specific columns from the DataFrame.
- **`dfd.getDummies(df)`**: Converts categorical variables into dummy/indicator variables for analysis.

## Reading CSV files

```javascript
dfd.readCSV(this.dropFiles).then((df) => { ... });
```

One of Danfo.js’s most crucial features is reading CSV files. Using [readCSV](https://danfo.jsdata.org/api-reference/input-output/danfo.read_csv), I was able to load a dataset, analyse its shape, and then proceed with deeper analyses. After loading the CSV, the dataset was stored in a DataFrame object (similar to Pandas), allowing further operations like filtering, grouping, and transforming the data.

## Initial processing

After the CSV dataset is uploaded, this triggers a [Vue watcher](https://vuejs.org/guide/essentials/watchers.html) aimed at the dropFiles variable from [Buefy](https://buefy.org/documentation/upload/). This processes the data and calculates summary statistics to give a basic data overview. It's quite a long function since it does most of the heavy lifting.

* The code reads an uploaded CSV file, converts it into a DataFrame using Danfo.js, and stores the raw data for processing.
* It calculates key dataset statistics like the number of rows, columns, and data types, and separates numeric and categorical columns.
* The code identifies missing values, duplicate rows, and computes their percentages, storing this info in a dataInfo object.
* High-cardinality columns (with many unique values) and binary columns (with only two unique values) are flagged, and warnings are generated.
* The processed data and insights are then stored for further analysis within the Vue application using 'this.dataInfo' and 'this.df'

```js
watch: {
    /*
     Reads the given CSV file when the upload input
     changes. Parses the CSV and calculates dataset 
     statistics.
     */
    dropFiles() {
      if (this.dropFiles == null) {
        return;
      }

      if (this.dropFiles.length < 1) {
        return;
      }

      console.log("this.dropFiles", this.dropFiles);
      console.log("this.dropFiles[0]", this.dropFiles[0])

      const reader = new FileReader();

      reader.readAsText(this.dropFiles);

      reader.onload = () => {
        this.rawCSVData = reader.result; // contains the file content as a string
      };

      console.log("this.rawCSVData", this.rawCSVData);

      reader.onerror = () => {
        console.log(reader.error);
      };

      dfd.readCSV(this.dropFiles).then((df) => {
        let dfShape = df.shape;

        let numericColumns = df.selectDtypes(["int32", "float32"]);

        let categoricalColumns = df.selectDtypes(["string"]);
        let totalMissingValues = df
          .isNa()
          .sum()
          .values.reduce((a, b) => a + b, 0);

        let rows = [];
        let duplicateRows = [];

        for (let i = 0; i < dfShape[0]; i++) {
          let row = df.iloc({ rows: [i] }).$data[0];

          if (rows.includes(JSON.stringify(row))) {
            duplicateRows.push(row);
          }

          rows.push(JSON.stringify(row));
        }

        let dataInfo = {
          numberOfRows: dfShape[0],
          numberOfColumns: dfShape[1],
          columns: df.columns,
          dtypes: df.ctypes,
          numColumns: numericColumns,
          catColumns: categoricalColumns,
          missingValuesByColumn: df.isNa(),
          totalMissingValues: totalMissingValues,
          "missingValues%": (
            (totalMissingValues / (dfShape[0] * dfShape[1])) *
            100
          ).toFixed(2),
          duplicateRows: duplicateRows.length,
          "duplicateRows%": ((duplicateRows.length / dfShape[0]) * 100).toFixed(
            2
          ),
        };

        console.log("dataInfo", dataInfo);

        // Identify high cardinality categoric columns
        this.influencerColumns = [...dataInfo["numColumns"].columns];
        for (let column of dataInfo["catColumns"].$columns) {
          let uniqueValuesCount = df.column(column).unique().$data.length;

          if (uniqueValuesCount > 10) {
            this.highCardinalityColumns.push(column);
          } else {
            this.influencerColumns.push(column);
          }
        }

        let warnings = [];

        for (let column of dataInfo["columns"]) {
          let missingValues =
            dataInfo["missingValuesByColumn"][column].valueCounts();
          let uniqueValuesCount = df.column(column).unique().$data.length;

          if (uniqueValuesCount == 2) {
            warnings.push(
              `<b>${column}</b> appears to be a binary variable: 2 values`
            );
          }

          if (uniqueValuesCount > 10) {
            warnings.push(
              `<b>${column}</b> has high cardinality: ${uniqueValuesCount} unique values`
            );
            this.allHighCardinalityColumns.push(column);
          }

          if (missingValues.$dataIncolumnFormat.length > 1) {
            let missingValuesCount = missingValues.$dataIncolumnFormat[1];
            let missingValuesPc = (
              (missingValuesCount / dataInfo["numberOfRows"]) *
              100
            ).toFixed(2);

            warnings.push(
              `<b>${column}</b> has ${missingValuesCount} (${missingValuesPc}%) missing values`
            );
          }
        }

        this.dataInfo = dataInfo;
        this.dataInfo.warnings = warnings;
        this.df = df;
        this.activeTab = 0;

        this.isLoading = false;
      });
    },
    relationships: {
      handler: function (newValue) {
        this.generateRelationshipPlot();
      },
      deep: true,
    },
  },
```

## Basic data overview

```javascript
let numericColumns = df.selectDtypes(["int32", "float32"]);
let categoricalColumns = df.selectDtypes(["string"]);
```

I used selectDtypes() to separate numeric and categorical columns. This distinction was important because numeric data often requires statistical analyses, while categorical data usually benefits from counting unique values or generating bar charts.

```javascript
let totalMissingValues = df.isNa().sum().values.reduce((a, b) => a + b, 0);
```

The app also calculated the number of missing values using isNa() to identify columns that needed attention. This was essential for providing an overview of data quality.

<article-image 
  src="https://res.cloudinary.com/dayqxxsip/image/upload/v1726413566/App%20Images/Blog%20Images/App%20Images/summary_w0uhsg.png" 
  alt="DataCabin app data summary tab" 
  loading="lazy" 
  styling=""
  caption="The data summary tab shows the first 20 rows of data to get a feel for the contents." 
  captionsrc="https://shedloadofcode.github.io/" 
  :showsource="false">
</article-image>

## Sample table

In this section I displayed the top 20 records while ensuring to fill any missing data with a blank value. This was to ensure it rendered correctly in the HTML table.

```javascript
df.head(20).fillNa('')
```

<article-image 
  src="https://res.cloudinary.com/dayqxxsip/image/upload/v1726413566/App%20Images/Blog%20Images/App%20Images/sample_l7a58p.png" 
  alt="DataCabin app sample tab" 
  loading="lazy" 
  styling=""
  caption="The sample tab shows the first 20 rows of data to get a feel for the contents." 
  captionsrc="https://shedloadofcode.github.io/" 
  :showsource="false">
</article-image>

## Univariate analysis

```javascript
this.df[column].plot("plot_" + column.toLowerCase()).hist({ layout });
```

For numeric columns, I used plot().hist() to generate histograms. This provided a visual representation of the distribution of each variable, which is helpful for identifying trends and outliers.

For categorical variables, I computed value counts and displayed them in bar charts:

```javascript
let counts = this.df.column(column).valueCounts();
let df = new dfd.DataFrame([counts.$data], { columns: counts.$index });
df.plot("plot_" + column.toLowerCase()).bar({ layout });
```

Here, valueCounts() gives the frequency of each category, which I then plotted using bar charts to display the frequency distribution of each categorical variable.

<article-image 
  src="https://res.cloudinary.com/dayqxxsip/image/upload/v1726413566/App%20Images/Blog%20Images/App%20Images/variables_r7fwtb.png" 
  alt="DataCabin app sample tab" 
  loading="lazy" 
  styling=""
  caption="The variable tab shows summary stats for each variable in isolation." 
  captionsrc="https://shedloadofcode.github.io/" 
  :showsource="false">
</article-image>

## Correlation heatmap

The correlation heatmap was one of the most interesting features to implement. First, I created dummy variables for categorical columns using getDummies() to ensure all data could be treated numerically for correlation analysis.

```javascript
let dummies = dfd.getDummies(df);
```

I created a custom corr() function to calculate the pairwise Pearson correlation between columns.

```javascript
corr(x, y) {
    let sumX = 0, sumY = 0, sumXY = 0, sumX2 = 0, sumY2 = 0;
    x.forEach((xi, idx) => {
      let yi = y[idx];
      sumX += xi;
      sumY += yi;
      sumXY += xi * yi;
      sumX2 += xi * xi;
      sumY2 += yi * yi;
    });
    return ((x.length * sumXY - sumX * sumY) / Math.sqrt((x.length * sumX2 - sumX * sumX) * (x.length * sumY2 - sumY * sumY)));
}
```

Here, corr() is used to calculate the Pearson correlation between two arrays. The result is displayed as a [heatmap using Plotly](https://plotly.com/javascript/heatmaps/), where each cell represents the correlation between two variables. 

The image below shows what this looks like in the app followed by the code to implement it in Vue. You can read more about this in the article [How to create an interactive correlation heatmap using Danfo.js and Plotly](/blog/how-to-create-an-interactive-correlation-heatmap-using-danfojs-and-plotly/).

<article-image 
  src="https://res.cloudinary.com/dayqxxsip/image/upload/v1694358951/App%20Images/Blog%20Images/Article%20Images/Danfo%20Plotly%20Correlation%20Heatmap/heatmap_wng9xn.png" 
  alt="DataCabin app" 
  loading="lazy" 
  styling=""
  caption="The correlation heatmap shows the Pearson correlation between variables." 
  captionsrc="https://res.cloudinary.com/dayqxxsip/image/upload/v1694358951/App%20Images/Blog%20Images/Article%20Images/Danfo%20Plotly%20Correlation%20Heatmap/heatmap_wng9xn.png" 
  :showsource="false">
</article-image>

```html
<div id="correlation-heatmap">
	<!-- Plotly Heatmap -->
</div>
```

```javascript
/*
     * Creates a correlation heatmap
     * https://plotly.com/javascript/heatmaps/
     */
    generateHeatmap() {
      /**
       * This needs to be in the format of
       *  zValues = [
       *     [0.00, 0.00, 0.75, 0.75, 1.00],
       *     [0.00, 0.00, 0.75, 1.00, 0.00],
       *     [0.75, 0.75, 1.00, 0.75, 0.75],
       *     [0.00, 1.00, 0.00, 0.75, 0.00],
       *     [1.00, 0.00, 0.00, 0.75, 0.00]
       *  ];
       */
      if (this.heatmapInitialised) {
        return;
      }

      let zValues = [];
      let df = this.df.copy();
      let columnsLength = this.dataInfo.columns.length;
      let columnsToDrop = [];

      // Drop columns with high cardinality (many unique values)
      for (let i = 0; i < columnsLength; i++) {
        let column = this.dataInfo.columns[i];

        // Skip if a numeric column as it will have lots of unique values
        // but this doesn't matter :)
        if (this.dataInfo["numColumns"].$columns.includes(column)) {
          continue;
        }

        let uniqueValuesCount = df.column(column).unique().$data.length;

        if (uniqueValuesCount > 5) {
          columnsToDrop.push(column);
        }
      }

      df.drop({ columns: columnsToDrop, inplace: true });

      // Create dummy columns for categoric variables
      let dummies = dfd.getDummies(df);
      // Uncomment to debug: console.log("DUMMIES", dummies);
      columnsLength = dummies.$columns.length;

      for (let i = 0; i < columnsLength; i++) {
        let column = dummies.$columns[i];
        // Uncomment to debug: console.log("COMPARING", column);
        let correlations = [];

        for (let j = 0; j < columnsLength; j++) {
          let comparisonColumn = dummies.$columns[j];
          // Uncomment to debug: console.log("TO", comparisonColumn);

          let pearsonCorrelation = this.corr(
            dummies[column].$data,
            dummies[comparisonColumn].$data
          ).toFixed(2);

          correlations.push(pearsonCorrelation);

          this.logCorrelationMessage(
            pearsonCorrelation,
            column,
            comparisonColumn
          );
        }

        zValues.push(correlations);
      }

      var xValues = dummies.$columns;
      var yValues = dummies.$columns;

      var colorscaleValue = [
        [0, "#3D9970"],
        [1, "#001f3f"],
      ];

      var data = [
        {
          x: xValues,
          y: yValues,
          z: zValues,
          type: "heatmap",
          colorscale: colorscaleValue,
          showscale: false,
        },
      ];

      var columnWidth = document.getElementById("main-column").offsetWidth - 35;

      if (!columnWidth > 0) {
        columnWidth = window.innerWidth - 750;
      }

      var layout = {
        autosize: false,
        width: columnWidth,
        height: 700,
        annotations: [],
        xaxis: {
          ticks: "",
          side: "top",
        },
        yaxis: {
          ticks: "",
          ticksuffix: " ",
          autosize: false,
        },
      };

      for (var i = 0; i < yValues.length; i++) {
        for (var j = 0; j < xValues.length; j++) {
          var currentValue = zValues[i][j];
          if (currentValue != 0.0) {
            var textColor = "white";
          } else {
            var textColor = "black";
          }
          var result = {
            xref: "x1",
            yref: "y1",
            x: xValues[j],
            y: yValues[i],
            text: zValues[i][j],
            font: {
              family: "Arial",
              size: 12,
              color: "rgb(50, 171, 96)",
            },
            showarrow: false,
            font: {
              color: textColor,
            },
          };
          layout.annotations.push(result);
        }
      }

      Plotly.newPlot("correlation-heatmap", data, layout);

      this.heatmapInitialised = true;
    }
```

The function **generateHeatmap** handles:

* Data preprocessing and correlation calculation: The function first removes columns with high cardinality (more than five unique values) to simplify the dataset. It then creates dummy variables for categorical columns using getDummies(), and calculates Pearson correlations between pairs of columns, storing the results in a zValues array.

* Heatmap plotting with Plotly: The correlation data (zValues) is plotted using Plotly as a heatmap, where each cell represents the correlation between two variables. The heatmap is customised with specific colors and annotations that display correlation values in each cell.
* Layout and scaling: The heatmap layout dynamically adjusts to the screen size, and annotations (values in the heatmap cells) are colored based on the correlation value, providing a visually informative representation of the correlations between variables.

## Relationship plots
Another key feature was plotting relationships between pairs of variables, which could either be numeric or categorical. For numeric pairs, scatter plots were used:

```javascript
this.df.plot("relationship-plot").scatter({ config: { x: variableA, y: variableB } });
```

For categorical pairs, I created grouped tables showing the distribution:

```javascript
let group = this.df.loc({ columns: [variableA, variableB] }).groupby([variableA, variableB]).size();
group.plot("relationship-plot").table();
```

This allows users to explore relationships between two categorical variables through tabular summaries.


<article-image 
  src="https://res.cloudinary.com/dayqxxsip/image/upload/v1726077541/App%20Images/Blog%20Images/App%20Images/datacabin-cover_w6ie2l.png" 
  alt="DataCabin app relationships tab" 
  loading="lazy" 
  styling=""
  caption="The relationships tab shows the patterns between two variables in a scatter chart if numeric or table for categoric." 
  captionsrc="https://shedloadofcode.github.io/" 
  :showsource="false">
</article-image>

## Key influencers

For the key influencers the idea is that selecting a categorical variable or a variable with low cardinality we ask what influences this to be A, B, C etc. whereas if we select a numeric variable we ask what influences this to be higher or lower. 

My inspiration for this was the [Power BI key influencers](https://learn.microsoft.com/en-us/power-bi/visuals/power-bi-visualization-influencers?tabs=powerbi-desktop#features-of-the-key-influencers-visual) visual albeit a simplified version to understand averages and frequent counts in other columns.

This allows us to figure out what are the main influencers driving a particular variable. It helps to highlight further avenues for deeper analyis, and ask the right questions of our datasets.

<article-image 
  src="https://res.cloudinary.com/dayqxxsip/image/upload/v1726413566/App%20Images/Blog%20Images/App%20Images/influencers_ku1o2q.png" 
  alt="DataCabin app" 
  loading="lazy" 
  styling=""
  caption="The key influencers tab lets us ask 'what influences this to be X?' or for a continuous variable 'what drives this higher or lower?'." 
  captionsrc="https://shedloadofcode.github.io/" 
  :showsource="false">
</article-image>

The code and logic for this section is long so I won't share it here. In general I have two functions that handle categoric or continous variables:

**generateInfluencerResults()**

Trigger and setup: This function starts after a short delay and checks if a specific variable and value (influencer) are selected. If they are not, it exits early.

Querying data: It filters the dataset based on the selected influencer variable and value. For numeric values labeled "Higher" or "Lower", it calls another function, generateContinuousInfluencerResults().

Summarizing results: For each column, it calculates either the average (for numeric columns) or the most frequent value and its percentage (for categorical columns), ignoring irrelevant columns.

**generateContinuousInfluencerResults()**

Filter based on mean: This function filters rows where the selected variable is either 20% higher or lower than its mean, depending on the user's selection ("Higher" or "Lower").

Calculating averages: For numeric columns, it calculates and displays the average values of other columns in the filtered data.

Frequent value analysis: For categorical columns, it finds the most frequent value, the number of occurrences, and its percentage of the total dataset.

## Lessons learned

**Learning curve:** The main lesson learnt is that if you've used Pandas in Python, you'll likely feel at home using Danfo. This is a major strength of the package and makes it quick to get started coming from Python to try JavaScript.

**Flexibility with data types:** Danfo.js handles multiple data types well, but understanding the nuances between numeric and categorical data is crucial. Making decisions about how to process each type is key to effective data exploration.

**Visualisation matters:** While Danfo.js provides some basic plotting capabilities, and the base plots use Plotly, integrating it directly with Plotly allows for more advanced visualisations. This combination is powerful when creating interactive and informative apps.

**Custom functions are sometimes required:** This is a similar finding in most other data analysis packages. Danfo.js provides many built-in functions, but sometimes you need custom logic for specific tasks, such as calculating correlations or managing categorical relationships.

**Scalability and performance:** While Danfo.js is efficient, handling large datasets with many unique categorical values can slow things down. I learned to filter out high-cardinality columns and simplify the data when necessary for performance reasons.

**Generic data vs specific:** The data explorer app needed to handle most datasets without seeing them first, this posed it's own challenges. Whereas if you tailored Danfo.js to a specific dataset you could leverage it's capabilities more effectively. You would know what data types to expect and wouldn't need so much preprocessing. 


## Conclusion

Using Danfo.js to build this data explorer app was an exciting challenge. The library’s ability to handle data manipulation and analysis directly in JavaScript was super useful, especially for web-based data exploration. 

I am still very impressed by how far JavaScript and the Danfo package has come, I really didn't think something comparable to Pandas and Python would be available in JavaScript but there it is! By leveraging its core features and integrating it with external visualisation libraries, I was able to create a fully functional app for quickly analysing datasets, which I plan to continue improving and using myself. 

Thanks for reading, I hope this helped you in your own data analysis and visualisation projects 😄

As always, if you enjoyed this article be sure to check out [other articles](/) on the site including:

* [How to create an interactive correlation heatmap using Danfo.js and Plotly](/blog/how-to-create-an-interactive-correlation-heatmap-using-danfojs-and-plotly/)
* [How to create animated charts with Python and Plotly](/blog/how-to-create-animated-charts-with-python-and-plotly/)
* [How to match and count keywords in text using JavaScript](/blog/how-to-match-and-count-keywords-in-text-using-javascript/) 
* [Developing your data science and analytical coding skills - a review of DataCamp](/blog/developing-your-data-science-and-analytical-coding-skills-a-review-of-datacamp/) 
]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[How to sort your eBay sold items by price with DevTools]]></title>
            <link>https://shedloadofcode.com/blog/how-to-sort-your-ebay-sold-items-by-price-with-devtools/</link>
            <guid>https://shedloadofcode.com/blog/how-to-sort-your-ebay-sold-items-by-price-with-devtools/</guid>
            <pubDate>Sat, 14 Sep 2024 20:14:00 GMT</pubDate>
            <description><![CDATA[Learn how to quickly sort your sold items on eBay to find your top selling items over the past year or last 200 items. This has been a super useful hack to hone in on the most profitable or highest revenue products.]]></description>
            <content:encoded><![CDATA[
If you’re selling items on eBay UK and want to quickly sort your sold items by price, like all items sold in the last year, you might be surprised to learn that you can't do that on the orders page... only from the last 90 days! 

<article-image 
  src="https://res.cloudinary.com/dayqxxsip/image/upload/v1726339483/App%20Images/Blog%20Images/Article%20Images/Sort%20Ebay%20Sold%20Items/ebay-unable-to-sort-year_pux53s.png" 
  alt="eBay site unable to sort all current year items" 
  loading="lazy" 
  styling=""
  caption="Unable to sort all current year items" 
  captionsrc="https://res.cloudinary.com/dayqxxsip/image/upload/v1726339483/App%20Images/Blog%20Images/Article%20Images/Sort%20Ebay%20Sold%20Items/ebay-unable-to-sort-year_pux53s.png" 
  :showsource="false">
</article-image>

This guide will walk you through the simple steps to use a browser feature called "DevTools" or developer tools in the browser to quickly get a breakdown of your highest price sold items across your last year / previous year last 200 items sold using JavaScript. 

Don’t worry if you’re not tech-savvy; we’ll make it easy to follow. Finding the items that have sold for the most in your last 200 items can be a very useful piece of analysis. It allows you to hone in on the most profitable or highest revenue products. Let's begin.

## Navigate to the eBay sold items page

Before starting the steps below [head to eBay and login](https://www.ebay.co.uk/) then...

**Step 1: Open the eBay orders page**
First, you need to visit the page that shows all your [sold items on eBay](https://www.ebay.co.uk/mys/sold/rf/filter=ALL&limit=200&period=CURRENT_YEAR). The URL for this is [https://www.ebay.co.uk/mys/sold/rf/filter=ALL&limit=200&period=CURRENT_YEAR](https://www.ebay.co.uk/mys/sold/rf/filter=ALL&limit=200&period=CURRENT_YEAR) which limits to the top 200 results for the current year. You can swap 'CURRENT_YEAR' to 'LAST_YEAR' in the URL also to change periods.

**Step 2: Open DevTools**
DevTools is a tool built into your web browser that lets you view and interact with the code on a webpage. You can typically press F12 to open DevTools in a browser but here’s how you can open it in different browsers:

**Google Chrome:**

Right-click anywhere on the page and select "Inspect".
Alternatively, you can press Ctrl+Shift+I (Windows) or Cmd+Option+I (Mac).

**Mozilla Firefox:**

Right-click anywhere on the page and select "Inspect".
You can also press Ctrl+Shift+I (Windows) or Cmd+Option+I (Mac).

**Microsoft Edge:**

Right-click anywhere on the page and select "Inspect".
Press Ctrl+Shift+I (Windows) or Cmd+Option+I (Mac).

**Safari:**

You might need to enable the Develop menu first by going to Safari > Preferences > Advanced and checking "Show Develop menu in menu bar".
Then, right-click anywhere on the page and select "Inspect Element".
Or press Cmd+Option+I (Mac).

**Step 3**: Open the Console tab
Once DevTools is open, look for a tab called "Console". Click on it.

<article-image 
  src="https://res.cloudinary.com/dayqxxsip/image/upload/v1726339874/App%20Images/Blog%20Images/Article%20Images/Sort%20Ebay%20Sold%20Items/select-console-tab_fznu2a.png" 
  alt="Chrome DevTools window" 
  loading="lazy" 
  styling=""
  caption="Head to the 'Console' tab in DevTools" 
  captionsrc="https://res.cloudinary.com/dayqxxsip/image/upload/v1726339874/App%20Images/Blog%20Images/Article%20Images/Sort%20Ebay%20Sold%20Items/select-console-tab_fznu2a.png" 
  :showsource="false">
</article-image>

## Grab the code and use it

You’ll see a text area where you can type or paste code. Copy the following code and paste it into this area:

```js
// Get all elements with the class 'sold-itemcard'
const items = document.querySelectorAll('.sold-itemcard');

// Extract title and price from each item and store them in an array
const itemDetails = Array.from(items).map(item => {
  const title = item.querySelector('.item-title')?.innerText.trim();
  const priceText = item.querySelector('.item__price')?.innerText.trim();
  const price = parseFloat(priceText.replace(/[^0-9.]/g, '')); // Remove non-numeric characters
  
  return { title, price };
});

// Sort items by price in ascending order
itemDetails.sort((a, b) => a.price - b.price);

// Display the sorted items
console.log('Item Price');
itemDetails.forEach(item => {
  console.log(`${item.title} £${item.price.toFixed(2)}`);
});
```

Press Enter to run the code.

<article-image 
  src="https://res.cloudinary.com/dayqxxsip/image/upload/v1726340096/App%20Images/Blog%20Images/Article%20Images/Sort%20Ebay%20Sold%20Items/paste-code-into-devtools_ymvktw.png" 
  alt="Sort items code pasted into DevTools" 
  loading="lazy" 
  styling=""
  caption="The code pasted into DevTools - copy the code above then right click in DevTools Console > Paste" 
  captionsrc="https://res.cloudinary.com/dayqxxsip/image/upload/v1726340096/App%20Images/Blog%20Images/Article%20Images/Sort%20Ebay%20Sold%20Items/paste-code-into-devtools_ymvktw.png" 
  :showsource="false">
</article-image>

## View the results

After running the code, look at the "Console" tab again. You should see a list of your sold items, sorted by price in ascending order. The format will look something like this:

```
...
Real Madrid Home Football Shirt 2006/07 Adults Medium Adidas B567 £25.00
Real Madrid 2004-05 Home Football Shirt UK Small Adidas £28.00
Hands-On Machine Learning With Scikit-Learn, Keras & TensorFlow Aurélien Géron £30.00
England 2006 Away Football Shirt Medium Beckham 7 £35.00
Pokemon Crystal Official Perfect Guide Game Boy Nintendo Magazine With Poster £40.00
Star Wars Master Replicas .45 Scaled Darth Vader Lightsaber £50.00
```

<article-image 
  src="https://res.cloudinary.com/dayqxxsip/image/upload/v1726339483/App%20Images/Blog%20Images/Article%20Images/Sort%20Ebay%20Sold%20Items/ebay-sorted-results_emjzy9.png" 
  alt="Sort items code pasted into DevTools" 
  loading="lazy" 
  styling=""
  caption="The code pasted into DevTools - copy the code above then right click in DevTools Console > Paste" 
  captionsrc="https://res.cloudinary.com/dayqxxsip/image/upload/v1726339483/App%20Images/Blog%20Images/Article%20Images/Sort%20Ebay%20Sold%20Items/ebay-sorted-results_emjzy9.png" 
  :showsource="false">
</article-image>

That's all there is to it! The last 200 sold items sorted by price in ascending order. You can scroll upwards highest to lowest to get an idea of your which were your most profitable or highest revenue items.


## Bonus: How to update the code if class names change

Sometimes, websites update their design and the class names used in their HTML structure can change. If you find that the code provided no longer works, you can update the class names in the code to match the new ones on the page. Here’s a step-by-step guide on how to inspect the page and find the correct class names.

**Step 1: Inspect the page elements**
Open DevTools: Follow the same steps as before to open DevTools. Right-click on the page and select "Inspect" or use the shortcut Ctrl+Shift+I (Windows) or Cmd+Option+I (Mac).

Find an item: Locate an item card on your eBay sold items page. You’ll want to find the specific parts of the card that include the title and price.

Inspect the elements: Hover over the item card you’re interested in, right-click on it, and select "Inspect". This action will highlight the HTML code for that item card in the DevTools.

**Step 2: Identify the new class names**

Look at the HTML Structure: In the DevTools panel, you’ll see the HTML code related to the item card you inspected. The code might look something like this:

```html
<div class="new-item-card">
    <span class="new-item-title">Item Name</span>
    <span class="new-item-price">£10.00</span>
</div>
```

Find the correct class names: Note the class names used for the title and price. In this example, the new class names are new-item-title and new-item-price.

**Step 3: Update the code**

Modify the code: Replace the old class names in the code with the new ones you found. Here’s how you would update the code if the class names changed:

```javascript
// Get all elements with the class 'new-item-card'
const items = document.querySelectorAll('.new-item-card');

// Extract title and price from each item and store them in an array
const itemDetails = Array.from(items).map(item => {
  const title = item.querySelector('.new-item-title')?.innerText.trim();
  const priceText = item.querySelector('.new-item-price')?.innerText.trim();
  const price = parseFloat(priceText.replace(/[^0-9.]/g, '')); // Remove non-numeric characters
  
  return { title, price };
});

// Sort items by price in ascending order
itemDetails.sort((a, b) => a.price - b.price);

// Display the sorted items
console.log('Item Price');
itemDetails.forEach(item => {
  console.log(`${item.title} £${item.price.toFixed(2)}`);
});
```

Paste and run the updated code: Go back to the "Console" tab in DevTools, paste the updated code, and press Enter to run it.

**Additional tips:**

Use the search function: In DevTools, you can use Ctrl+F (Windows) or Cmd+F (Mac) to search for specific text or class names within the HTML code. This can help you quickly find the elements you need.

Look for patterns: Sometimes class names might include additional numbers or letters, but they follow a pattern. Look for similar patterns to identify the correct classes.

Check multiple items: If one item card has the new class names, it’s a good idea to check a few more items to ensure the same class names are used throughout.

By following these steps, you can adapt the code to fit any changes in the website’s structure, ensuring you always get the sorted list of sold items as needed.

## Happy sales analysis

By using this simple process, you can quickly organise your sold items and make sense of your sales data. I'm sure there might be another way to achieve this but this is fast and fairly simple, plus you learn some JavaScript in the process 😄 Happy selling!

If you enjoyed this article be sure to check out: 

* [How to scrape AutoTrader with Python and Selenium to search for multiple makes and models](/blog/how-to-scrape-autotrader-with-python-and-selenium-to-search-for-multiple-makes-and-models/)
* [How to scrape and analyse your Amazon spending data](/blog/how-to-scrape-and-analyse-your-amazon-spending-data/) 
* [How to scrape and analyse your Chess.com data](/blog/how-to-scrape-and-analyse-your-chess-com-data/)]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[How to keep track of the games you actually want to play]]></title>
            <link>https://shedloadofcode.com/blog/how-to-keep-track-of-the-games-you-actually-want-to-play/</link>
            <guid>https://shedloadofcode.com/blog/how-to-keep-track-of-the-games-you-actually-want-to-play/</guid>
            <pubDate>Fri, 30 Aug 2024 16:40:00 GMT</pubDate>
            <description><![CDATA[Use these steps to keep track of the games you are playing. Spend more time playing games you enjoy and less time worrying about tracking and missing out on others.]]></description>
            <content:encoded><![CDATA[
After recently picking back up the hobby of video gaming and retro gaming, I needed a way to keep on track with the limited time I have available.

Unfortunately I don't have unlimited time to enjoy every game I want to, and believe me there are lots! Since acquiring an [Odin 2 Pro](https://www.ayntec.com/products/odin-2?variant=44035820552384) which has been such a great device for me as a portable/retro gaming powerhouse, this has opened up my library to many consoles and potentially thousands of retro games. I didn't want gaming to be a chore or having to assemble some kind of backlog - that doesn't sound fun at all.

So the task was to answer these questions...

## Our goal 

* How to find the games I'm interested in playing 
* How to manage games in progress 
* How to stay motivated and keep it fun not like a tick list 
* Avoid the misery that comes with the [paradox of choice](https://en.wikipedia.org/wiki/The_Paradox_of_Choice) and [FOMO](https://en.wikipedia.org/wiki/Fear_of_missing_out)
* Accept that we cannot play every single game there is 
* But also accept there is a lot of joy to be had by playing some of them 
 
The mindset to this is **you are like the visitor to a library, not the librarian**. You want to spend more time playing games with your limited time than searching, selecting and recording them. 

To do that we will keep it very simple, creating a straightforward process to find, keep track of, and most importantly play the games you actually want to.

## Finding games 

In searching for games there are numerous avenues I take:

* Browsing the Microsoft store on Xbox Series X
* Looking at titles you like and related games to those - then adding to the wishlist to avoid any impulse buys
* Search Reddit for top 10 lists and pick only the ones you really like the sound of - this is especially useful for retro consoles
* Browsing the [Daijishō app](https://play.google.com/store/apps/details?id=com.magneticchen.daijishou&hl=en_GB) on the [Odin 2 Pro](https://www.ayntec.com/products/odin-2?variant=44035820552384) - I use this app to catalogue my retro games collection

We want this search to be a timed exercise so spend 10 - 30 minutes browsing these places and generally getting a feel for if a game is looking like one you really want to play. Then we can move on to...

## Keeping track of games

Nothing complicated about the system I'm using, the power is in it's simplicity. I keep a note on my phone using Google Keep, with the following headings:

* Main 
* Side 
* Play anytime 
* Next up 
* To revisit 

There are apps like [Stash](https://stash.games/) but a simple note works well and naturally limits the amounts added. You could also use a Trello board if you prefer [like in this video](https://youtu.be/S97u51n4zYQ?si=PF936jYmODn_eib7&t=483) but that could become overwhelming. Here is what my list looks like currently...

``` games.txt
Main:
Diablo 4 💀
Sheep Dog and Wolf PS1 🐑 🐺 


Side:
Wario Land 4 GBA 🍄
Pokémon Fire Red GBA 🐦‍🔥 


Play anytime:
Dead Cells ⚔️
Insurgency Sandstorm 🪖
Marble It Up Ultra 🔮


Next up:
Balders Gate 3 👿
Luigi's Mansion GC 👻
Resident Evil 3/4 🧟‍♂️
Control 🏢
Hellblade 🗡️
Dark Souls 💀


Top games to revisit:
Witcher 3 🧙‍♀️
Red Dead Redemption 2 🤠
Skyrim 🐉
Wolfenstein NO/NC 🪖
Metro 🚇
```

It's fairly short, simple and keeps me focused and enjoying the games **I picked**.

But what do each of these headings mean? What kinds of games would I include on each of them? Here is a breakdown for each heading below 👇 All of the game examples are my own views, your games in each heading may differ.  

**Main** = Games that you have to sit down, concentrate, follow the story and immerse yourself in. They may require multiple hour sessions to get the most out of them. They may be 100+ hours in total  Witcher 3, Baldurs Gate 3. No more than 2 probably only 1.

**Side** = Games that can be story driven but can be laid back too. You can play these while doing something else. You can kinda relax to these games too. Pokémon Fire Red, Stardew Valley. The [Odin 2 Pro](https://www.ayntec.com/products/odin-2?variant=44035820552384) I recently purchased has really helped me to expand the side games I can explore whenever and wherever I feel like a session - it's a portable/retro gaming powerhouse. No more than 3 depending on complexity.

**Play anytime** = Games that have great replayability but borderline addictive in some cases. These are pick up and play games mostly but they can still require lots of skill. You can play them in 10-30 minutes. They have a great feedback loop for instant gratification. I kinda want to limit these games to an extent because the dopamine rush is heavy. Insurgency Sandstorm, Dead Cells. No more than 5 but can go higher if you're ok with it.

**Next up**: Games you've selected as your next to play. Once you feel you've gotten as much as you need out of your main or side games, these can take thier place. No more than 10 and only pick the one you feel most drawn to playing next, then move it main or side.

The most important rule for these headings is to stay within the number limits. We want to limit our choice to avoid that paradox of choice and any overwhelming feelings or disatisfaction. 

The aim is play and enjoy games, so avoiding [FOMO](https://en.wikipedia.org/wiki/Fear_of_missing_out) is crucial. You selected the games to add to your next up, so it reduces your choice, but you know that they were the ones you chose and were looking forward to!


## Playing the games 

You probably don't need to many tips on this one, but there are some solid pieces of advice to keep in mind. Now that you have a simple system of identifying the games **you really want to play** and you're actually playing them, you just have to remember that to combat FOMO - that right now there is nothing better out there than this game you chose. So play it like it's the last game in the world and enjoy it. 

We live in a strange time of maximum availability and digital downloads unlike cartridges or disks which somewhat restricted the feeling that everything is available always. The [paradox of choice](https://en.wikipedia.org/wiki/The_Paradox_of_Choice) as a concept suggests that too many options aren't good for us psychologically. We can become overwhelmed with options leading to decision-making and commitment difficulty, reduced satisfaction, increased anxiety, and a decrease in wellbeing and happiness.

* Don't read too many reviews on a game... judge it yourself 
* Keep a note of what you were doing so you can pick it back it easily
* Try to save the game in a location that makes it clear what you were doing when you pick up again
* Give the game a chance - some games take a while to show their true potential
* If it works for you, pretend you're playing the game on a PS1 pre-internet, that's what I'm doing with [Sheep Dog 'n' Wolf](https://en.wikipedia.org/wiki/Sheep,_Dog_%27n%27_Wolf) and it's a really fun game with some difficult puzzles

So what happens if you start a game and it's not as good as you expected? Well, keep going, some games do get better in their later stages, but if it's really not for you then drop it. No harm in accepting if wasn't the right pick for you. Just don't get trapped into dropping things too quick, pretend you rented it from Blockbuster 😂 you may as well try it a little longer see if it gets better.

## Conclusion

These steps of how to identify, track and play games should help you combat [FOMO](https://en.wikipedia.org/wiki/Fear_of_missing_out), keep it simple, and ultimately increase your enjoyment of playing games again. 

I hope you have a great time finding and playing new or old games, enjoying the experience and feeling like you're on top of your library - or not and just going with the flow, enjoying one game at a time.

Thanks for reading 👍 If you enjoyed this article you might also like these articles:

* [Super quick setup guide to playing retro games using RetroPie, Dolphin and Redream](/blog/super-quick-setup-guide-to-playing-retro-games-using-retropie-dolphin-and-redream/)
* [Reflections on digital streaming and reducing smartphone usage](/blog/reflections-on-digital-streaming-and-reducing-smartphone-usage/)]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Azure Function to vectorise PDFs and store in Qdrant container app with OpenAI and Python]]></title>
            <link>https://shedloadofcode.com/blog/azure-function-to-vectorise-pdfs-and-store-in-qdrant-container-app-with-openai/</link>
            <guid>https://shedloadofcode.com/blog/azure-function-to-vectorise-pdfs-and-store-in-qdrant-container-app-with-openai/</guid>
            <pubDate>Sun, 30 Jun 2024 18:00:00 GMT</pubDate>
            <description><![CDATA[Learn how to code an Azure function app which is automatically triggered when a new PDF is dropped into Azure blob storage, vectorises the PDF contents and stores the vectors in an Azure Qdrant container app. Lot's of moving parts and lots to learn from it, this was a tough one!]]></description>
            <content:encoded><![CDATA[
<affiliate-disclaimer></affiliate-disclaimer>

## Introduction

In this one we'll be going through part of an [LLM](https://en.wikipedia.org/wiki/Large_language_model) / [Azure OpenAI](https://azure.microsoft.com/en-gb/products/ai-services/openai-service) project I worked on recently. This involved:
 
1. Triggering an Azure Function app when a new PDF is dropped into an Azure Blob Storage container.
2. Vectorising the document in the Azure Function app.
3. Dropping those vectors into a Qdrant database running in an Azure Container app.
4. A Python FastAPI app which [retrieved and queried](https://en.wikipedia.org/wiki/Document_retrieval) those document vectors to answer user questions.

I will be outlining the process and code on how to vectorise documents in an Azure Function, but cannot give too much detail due to the organisation's data protection policy.

I have redacted some details in the images, but you should get a good feel for how to put together a solution like this if it's something you're interested in. It was quite a lightweight solution but still had many moving parts.

There are tools that make this process easier and do the heavier lifting which I'm learning about currently including [Azure Prompt Flow](https://learn.microsoft.com/en-us/azure/ai-studio/how-to/prompt-flow) and [Azure AI Search](https://learn.microsoft.com/en-us/azure/search/search-what-is-azure-search) for documents. However this process is more customisable and provides greater control.

## Drop a PDF into blob storage

The first step was to set up a resource group in Azure, and create these components:

* Blob storage container
* Function app 
* Qdrant container app

<article-image 
  src="https://res.cloudinary.com/dayqxxsip/image/upload/v1719762421/App%20Images/Blog%20Images/Article%20Images/Azure%20Function%20Qdrant/Function_app_resource_group_atjcl7.png" 
  alt="Azure resource group" 
  loading="lazy" 
  styling=""
  caption="This resource group contains all of the vectorisation components" 
  captionsrc="" 
  :showsource="false">
</article-image>

Once a new PDF is dropped into the storage account, the function app is automatically triggered and begins to run.

<article-image 
  src="https://res.cloudinary.com/dayqxxsip/image/upload/v1719762426/App%20Images/Blog%20Images/Article%20Images/Azure%20Function%20Qdrant/Blob_storage_ve6mk9.png" 
  alt="Azure blob storage container" 
  loading="lazy" 
  styling=""
  caption="'Upload' a new PDF to trigger the function app. I have uploaded 5 (names redacted)" 
  captionsrc="" 
  :showsource="false">
</article-image>

## Function app automatically triggers
The function app is triggered now that a new PDF has been uploaded to the blob storage container. Here is the trigger configured in the function app.

<article-image 
  src="https://res.cloudinary.com/dayqxxsip/image/upload/v1719762422/App%20Images/Blog%20Images/Article%20Images/Azure%20Function%20Qdrant/Trigger_for_blob_storage_k60o6n.png" 
  alt="Azure function app automatically triggered" 
  loading="lazy" 
  styling=""
  caption="This trigger is applied to launch the function app when a new PDF is uploaded to the blob container" 
  captionsrc="" 
  :showsource="false">
</article-image>

Here is the Python code that drives the vectorisation process. The docstring at the top of the file outlines the steps the function app takes. It triggers, reads the PDF, vectorises, and stores the vectors.

```python [function_app.py]
"""
An Azure function app which:

- is triggered when a new PDF file is added to the blob container 'docupload'
- reads the PDF file and turns to vectors
- stores the vectors in Azure Qdrant

Components:

- Function app 
- Blob store 
- Qdrant container 

Prerequisites:
- Install VS Code Azure extension 
- Read getting started documentation at https://shorturl.at/59jYg
"""

import azure.functions as func
import logging
import fitz
import openai
import qdrant_client.models as models
import tiktoken
from langchain.text_splitter import RecursiveCharacterTextSplitter
from qdrant_client import QdrantClient
from qdrant_client.http.models import *
from qdrant_client.fastembed_common import *


app = func.FunctionApp()

@app.blob_trigger(arg_name="myblob", path="docupload",
                               connection="saaicdupsertvectorspoc01_STORAGE") 
def aicdfaupsertvectorspoc(myblob: func.InputStream):
    # 1. Read document
    blob_name: str = myblob.name

    logging.info(f"Python blob trigger function processed blob"
                f"Name: {myblob.name}"
                f"Blob Size: {myblob.length} bytes")
    ''
    try:
        document = fitz.open(stream=myblob.read(), filetype="pdf")
        logging.info(f'PDF read successfully: {document}')
    except:
        print("The PDF could not be read.")

    # 2. Vectorise document and upload to Qdrant
    def tiktoken_len(text: str) -> int:
        tokenizer = tiktoken.get_encoding("p50k_base")
        tokens = tokenizer.encode(text, disallowed_special=())

        return len(tokens)

    def data_upload(qdrant_index_name: str, document) -> None:
        settings = {
            "url": "https://ca-qdrant-poc.azurecontainerapps.io", # The URL to your container app
            "host": "ca-qdrant-poc.azurecontainerapps.io",
            "port": "6333",
            "openai_api_key": "", # Enter your OpenAI API key
            "openai_embedding_model": "text-embedding-ada-002"
        }

        whole_text = []
        for page in document:
            text = page.get_text()
            text = text.replace("\n", " ")
            text = text.replace("\\xc2\\xa3", "£")
            text = text.replace("\\xe2\\x80\\x93", "-")
            whole_text.append(text)

        text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=500,
            chunk_overlap=100,
            length_function=tiktoken_len,
            separators=["\n\n", "\n", " ", ""],
        )

        chunks = []

        for record in whole_text:
            text_temp = text_splitter.split_text(record)
            chunks.extend([{"text": text_temp[i]} for i in range(len(text_temp))])

            try:
                client = QdrantClient(url=settings["url"],
                                      port=None)

                collection_names = []
                collections = client.get_collections()

                for i in range(len(collections.collections)):
                    collection_names.append(collections.collections[i].name)

                if qdrant_index_name in collection_names:
                    client.get_collection(collection_name=qdrant_index_name)
                else:
                    client.create_collection(
                        collection_name=qdrant_index_name,
                        vectors_config=models.VectorParams(
                            distance=models.Distance.COSINE, size=1536
                        ),
                    )
            except Exception as e:
                logging.error("Unable to connect to QdrantClient")
                logging.error(f"Error message: {str(e)}")

        for id, observation in enumerate(chunks):
            text = observation["text"]

            try:
                openai.api_key = settings["openai_api_key"]
                res = openai.Embedding.create(
                    input=text, engine=settings["openai_embedding_model"]
                )
            except openai.AuthenticationError:
                logging.error("Invalid API key")
            except openai.APIConnectionError:
                logging.error(
                    "Issue connecting to open ai service. Check network and configuration settings"
                )
            except openai.RateLimitError:
                logging.error("You have exceeded your predefined rate limits")

            client.upsert(
                collection_name=qdrant_index_name,
                points=[
                    models.PointStruct(
                        id=id,
                        payload={"text": text},
                        vector=res.data[0].embedding,
                    )
                ],
            )
            logging.info("Text uploaded")

        logging.info("Embeddings upserted")

    file_index = blob_name \
        .strip() \
        .lower() \
        .replace(" ", "_") \
        .replace("docupload/", "") \
        .replace(".pdfblob", "") \
        .replace(".pdf", "")
    
    logging.info(f"File index: {file_index}")

    data_upload(qdrant_index_name=file_index, document=document)
```

<article-image 
  src="https://res.cloudinary.com/dayqxxsip/image/upload/v1719762421/App%20Images/Blog%20Images/Article%20Images/Azure%20Function%20Qdrant/Function_app_in_Azure_roykxx.png" 
  alt="Azure function app in Azure" 
  loading="lazy" 
  styling=""
  caption="The function app in Azure" 
  captionsrc="" 
  :showsource="false">
</article-image>

<article-image 
  src="https://res.cloudinary.com/dayqxxsip/image/upload/v1719767789/App%20Images/Blog%20Images/Article%20Images/Azure%20Function%20Qdrant/vs-code_p1sex8.png" 
  alt="Azure function app in Azure" 
  loading="lazy" 
  styling=""
  caption="The function app in VS Code - requirements.txt defines the required packages" 
  captionsrc="" 
  :showsource="false">
</article-image>

<article-image 
  src="https://res.cloudinary.com/dayqxxsip/image/upload/v1719762426/App%20Images/Blog%20Images/Article%20Images/Azure%20Function%20Qdrant/All_function_app_upsert_runs_are_below_60_seconds_bhgoxx.png" 
  alt="Azure function app run logs" 
  loading="lazy" 
  styling=""
  caption="These are the log outputs to show the function app is working ok" 
  captionsrc="" 
  :showsource="false">
</article-image>

So above we can see the Azure function in the Azure portal and in VS Code, and the run logs - yes I found 65 ways to fail here but eventually found a way to succeed! The full logs end with "Embeddings upserted" so we know it completed successfully.

<article-image 
  src="https://res.cloudinary.com/dayqxxsip/image/upload/v1719762421/App%20Images/Blog%20Images/Article%20Images/Azure%20Function%20Qdrant/Logs_for_function_app_run_inxdvg.png" 
  alt="Azure function app run logs" 
  loading="lazy" 
  styling=""
  caption="'Embeddings upserted' marks the end of the function - job completed" 
  captionsrc="" 
  :showsource="false">
</article-image>

Now to check the Qdrant container app to confirm for sure that the vector embeddings are present there.

## Vectors stored in Qdrant container app

The Qdrant container app was set up in Azure to hold the vector embeddings.

<article-image 
  src="https://res.cloudinary.com/dayqxxsip/image/upload/v1719762422/App%20Images/Blog%20Images/Article%20Images/Azure%20Function%20Qdrant/Qdrant_URL_qz8xgl.png" 
  alt="Azure Qdrant container app" 
  loading="lazy" 
  styling=""
  caption="The function app needed this Qdrant container app URL to upsert vectors" 
  captionsrc="" 
  :showsource="false">
</article-image>

If we head to that URL given for the container app and add **/dashboard/collections** we will see all of the document vector collections present in Qdrant.

<article-image 
  src="https://res.cloudinary.com/dayqxxsip/image/upload/v1719762425/App%20Images/Blog%20Images/Article%20Images/Azure%20Function%20Qdrant/All_collections_upserted_i0s22b.png" 
  alt="Azure Qdrant container app" 
  loading="lazy" 
  styling=""
  caption="Success! We have a collection for each of the 5 PDFs added (redacted names)" 
  captionsrc="" 
  :showsource="false">
</article-image>

Selected a collection shows the vector embeddings that are stored in it. By vectorising and chunking the PDF content and storing it in a Qdrant vector database this can now work with the OpenAI LLM to answer questions based on the PDF documents.

<article-image 
  src="https://res.cloudinary.com/dayqxxsip/image/upload/v1719762422/App%20Images/Blog%20Images/Article%20Images/Azure%20Function%20Qdrant/Vectors_upserted_to_Qdrant_c47dal.png" 
  alt="Azure Qdrant container app" 
  loading="lazy" 
  styling=""
  caption="All of the vectors for the given colleciton / PDF (redacted content)" 
  captionsrc="" 
  :showsource="false">
</article-image>

---

<a href="https://edx.sjv.io/c/4971160/1520396/17728">
  <article-image 
    src="https://app.impact.com/display-ad/17728-1520396?v=1" 
    alt="" 
    loading="lazy" 
    styling=""
    caption="Ready to transform your career? Start with edX" 
    captionsrc="https://edx.sjv.io/c/4971160/1520396/17728" 
    :showsource="false">
  </article-image>
</a>

---

## Bonus: Saving snapshots for backups

I found during this process a useful feature for backup and disaster recovery planning. Once a PDF has been vectorised and upserted, you can save a snapshot of the vectors in the Qdrant dashboard. 

If everything is wiped, you can just upload the snapshot to a collection and you're back up and running. This gives a secondary option to re-running the function app for all the documents.

<article-image 
  src="https://res.cloudinary.com/dayqxxsip/image/upload/v1719762421/App%20Images/Blog%20Images/Article%20Images/Azure%20Function%20Qdrant/How_to_save_download_a_vector_snapshot_maicgv.png" 
  alt="Azure Qdrant snapshot save" 
  loading="lazy" 
  styling=""
  caption="Saving a snapshot" 
  captionsrc="" 
  :showsource="false">
</article-image>

<article-image 
  src="https://res.cloudinary.com/dayqxxsip/image/upload/v1719762421/App%20Images/Blog%20Images/Article%20Images/Azure%20Function%20Qdrant/How_to_upload_a_vector_snapshot_foeczp.png" 
  alt="Azure Qdrant snapshot upload" 
  loading="lazy" 
  styling=""
  caption="Uploading a snapshot" 
  captionsrc="" 
  :showsource="false">
</article-image>


## How can I learn more about LLMs, Qdrant and OpenAI in Python?

First off, if you know nothing the freeCodeCamp course and video [A Non-Technical Introduction to Generative AI](https://www.freecodecamp.org/news/a-non-technical-introduction-to-generative-ai/) is great.

Secondly, this is a useful article from DataCamp on the [25 Top MLOps Tools You Need to Know in 2024](https://datacamp.pxf.io/Wq1KkO) which includes Qdrant and LangChain.

<article-image 
  src="https://res.cloudinary.com/dayqxxsip/image/upload/v1719762430/App%20Images/Blog%20Images/Article%20Images/Azure%20Function%20Qdrant/Datacamp_Qdrant_ftqhnk.png" 
  alt="Azure resource group" 
  loading="lazy" 
  styling=""
  caption="A good read on 25 MLOps tools" 
  captionsrc="https://datacamp.pxf.io/Wq1KkO" 
  :showsource="false">
</article-image>

Lastly, to learn more about using OpenAI with Python there is a DataCamp course [Working with the OpenAI API](https://datacamp.pxf.io/Orj5xK) or you can check out the [openai-python GitHub repo](https://github.com/openai/openai-python). 

<article-image 
  src="https://res.cloudinary.com/dayqxxsip/image/upload/v1719762428/App%20Images/Blog%20Images/Article%20Images/Azure%20Function%20Qdrant/Datacamp_course_zjxbqn.png" 
  alt="Azure resource group" 
  loading="lazy" 
  styling=""
  caption="This course from DataCamp focuses on how to work with the OpenAI API" 
  captionsrc="https://datacamp.pxf.io/Orj5xK"
  :showsource="false">
</article-image>

## Wrap up

That's everything for this one! There are lots of things to explore when it comes to LLMs and the new tools that are emerging. This was a pretty simple use case but required some discovery and learning to figure out how to do this.

I hope you enjoyed this article and it helps you out if you're planning on embarking on the same wild journey of vectorising documents in Azure from scratch! Thanks for reading. If you know even easier way to query and answer questions based on documents please share them in the comments section at the bottom of this page. I will likely write another article on the entire solution once it's fully completed. Keep an eye out for that.

Since you read this article all the way to the end you might also be interested in:

* [Concepts of Artificial Intelligence with Python - a review of CS50 AI](/blog/concepts-of-artificial-intelligence-with-python-a-review-of-cs50-ai/)
* [Developing your data science and analytical coding skills - a review of DataCamp](/blog/developing-your-data-science-and-analytical-coding-skills-a-review-of-datacamp/)
* [Creating a screen and mouse jiggler with Python](/blog/creating-a-screen-and-mouse-jiggler-with-python/)
]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Super quick setup guide to playing retro games using RetroPie, Dolphin and Redream]]></title>
            <link>https://shedloadofcode.com/blog/super-quick-setup-guide-to-playing-retro-games-using-retropie-dolphin-and-redream/</link>
            <guid>https://shedloadofcode.com/blog/super-quick-setup-guide-to-playing-retro-games-using-retropie-dolphin-and-redream/</guid>
            <pubDate>Sat, 13 Apr 2024 15:00:00 GMT</pubDate>
            <description><![CDATA[Learn how to setup RetroPie with a Raspberry Pi 3b+ to play NES, SNES, GBA plus Dolphin for GameCube and Redream for Dreamcast games on PC.]]></description>
            <content:encoded><![CDATA[
## Introduction

In this short article we'll quickly learn how to setup RetroPie with a Raspberry Pi 3b+ to play NES, SNES, GBA... and also setup Dolphin for GameCube and Redream for Dreamcast games on PC. I recently had a ton of fun setting these up and doing a little retro gaming and thought I'd share the experience I went through and what I learned 😄

This guide is designed to be no-nonsense (hopefully) - I won't be going into how to get game ROMs, the methods and ethics of acquiring ROMs are for someone else to discuss. I will also signpost to the very useful resources and videos I used and collated to get started. This should speed up the setup for you and reduce the amount of research you need to do!

With that out of the way, let's get started.

## What you’ll need

I used the following to get a good retro gaming setup:

* A [Raspberry Pi 3b+](https://www.amazon.co.uk/Raspberry-Pi-3-Model-B/dp/B07BDR5PDW)
* A [64GB SanDisk MicroSD](https://www.amazon.co.uk/gp/product/B09X7C7LL1/) card 
* A laptop i5 processor, 8GB RAM and it's default nothing special graphics card - Intel HD Graphics 4400
* The latest version of RetroPie - go to the [RetroPie Download page](https://retropie.org.uk/download/) and download the latest version for your Raspberry Pi, for me this was the "Raspberry Pi 2/3/Zero 2 W" button. This worked well for NES, SNES, GB, GBC, GBA, N64 (some games are slow though), Dreamcast (some games are slow though).
* The PS1 BIOS files - search for [PS1 BIOS files](https://www.google.com/search?q=ps1+bios+files&oq=ps1+bios+files) ... you're looking for .bin files named scph5500, scph5501 and scph5502
* The latest version of Dolphin - go to the [Dolphin Download page](https://dolphin-emu.org/download/) and download the latest version for your OS such as Windows x64 v5.0-21264
* The latest version of Redream - go to the [Redream Download page](https://redream.io/download) and download the latest version for your OS such as Windows v1.5.0 
* Some game [ROMs](https://en.wikipedia.org/wiki/ROM_image)

I found the 3b+ couldn't quite handle Dreamcast and since it's 32 bit, it couldn't install Redream. It also struggled with some N64 games and definitely wouldn't handle GameCube. Everything else was perfect including PS1. So I think Dreamcast and GameCube are best left for a half-decent laptop.

## RetroPie - Setup and adding ROMs

1. Head to the [RetroPie first installation](https://retropie.org.uk/docs/First-Installation/) page watch the video, follow the steps there to add RetroPie image to your MicroSD card
2. Insert the MicroSD card into your Raspberry Pi
3. Attached the power, HDMI and controller to your Raspberry Pi
4. EmulationStation launches on bootup, then configure your controller buttons
5. Find the device's IP by selecting the Show IP option in the RetroPie menu after booting up your Raspberry Pi. 
6. Add ROMs by copying them into the relevan folders at the IP address like \\192.168.1.113\roms for example
7. Select a game to launch it - you can then adjust the settings, change the emulator etc just before it loads

<article-image 
  src="https://res.cloudinary.com/dayqxxsip/image/upload/v1713021381/App%20Images/Blog%20Images/Article%20Images/Retropie/network-access_zv3hdw.png" 
  alt="RetroPie folders" 
  loading="lazy" 
  styling=""
  caption="Go to Network and enter the RetroPie IP address like \\192.168.1.113  ..." 
  captionsrc="https://res.cloudinary.com/dayqxxsip/image/upload/v1713021381/App%20Images/Blog%20Images/Article%20Images/Retropie/network-access_zv3hdw.png" 
  :showsource="false">
</article-image>

<article-image 
  src="https://res.cloudinary.com/dayqxxsip/image/upload/v1713021380/App%20Images/Blog%20Images/Article%20Images/Retropie/roms_folder_mjcktl.png" 
  alt="RetroPie ROMs folder" 
  loading="lazy" 
  styling=""
  caption="...  then copy ROMs into the relevant folder inside the roms folder" 
  captionsrc="https://res.cloudinary.com/dayqxxsip/image/upload/v1713021380/App%20Images/Blog%20Images/Article%20Images/Retropie/roms_folder_mjcktl.png" 
  :showsource="false">
</article-image>

You can also [transfer ROMs by using a USB stick too](https://retropie.org.uk/docs/Transferring-Roms/) if you prefer to do it that way instead of transferring over your network.

## RetroPie - Saving your game

After launching a game, *select + right bumper* saves the state, and *select + left bumper* loads the state. This saves to slot #0. 

To save to another slot, press *select + dpad left or right* to change save slot, then same as before *select + right bumper* saves the state, and *select + left bumper* loads the state.

This [video tutorial neatly covers up this process](https://www.youtube.com/watch?v=cIYwcJDShU0).

## RetroPie - Configuring PS1, NDS and DC

* You will need to [install an additional emulator for Nintendo DS](https://www.youtube.com/watch?v=IfY2FjaSaAk) called Drastic. Then add ROMs to the new "nds" folder in the RetroPie roms folder over the network.
* You will need to [install an additional emulator for DreamCast](https://www.youtube.com/watch?v=yb3kYuLnkD8) called reicast or lr-flycast. However, a much better emulator is Redream mentioned later in the article. Since Raspberry Pi 3b+ is 32 bit Redream won't work on it, so a laptop/PC seems the better option for Dreamcast against 3b+.
* You will need to add additional BIOS files to play PS1 games - you can find these with a [quick web search](https://www.google.com/search?q=ps1+bios+files&oq=ps1+bios+files). Also, ensure you add both the .bin files and a .cue file for the ROMs to the /roms/psx/ folder and ensure they are unzipped. You can take .bin files and create a .cue from them using a [cue maker](https://www.duckstation.org/cue-maker/).

<article-image 
  src="https://res.cloudinary.com/dayqxxsip/image/upload/v1713021380/App%20Images/Blog%20Images/Article%20Images/Retropie/bios_aairuu.png" 
  alt="RetroPie bios folder" 
  loading="lazy" 
  styling=""
  caption="Add the PS1 BIOS files to the RetroPie bios folder" 
  captionsrc="https://res.cloudinary.com/dayqxxsip/image/upload/v1713021380/App%20Images/Blog%20Images/Article%20Images/Retropie/bios_aairuu.png" 
  :showsource="false">
</article-image>

## Dolphin - Setup and adding ROMs

So as mentioned earlier, I found the 3b+ definitely wouldn't handle GameCube. Everything else was perfect including PS1. So I think GameCube are best left for a half-decent laptop. The setup is pretty straightforward.

* [Download the installer from the Dolplhin site](https://dolphin-emu.org/).
* Run the download to launch Dolphin
* Follow this [handy video tutorial](https://www.youtube.com/watch?v=LzOIS7KqvdM&list=PL5TuPBnwdd6h172vufklL3VU4vv8lY4c8&index=10) to get setup with your controller and ROMs
* Launch a game
* To save/load a game, go to the taskbar at the top, select Emulation > Save/Load State > Save State to Slot/Load State from Slot

## Redream - Setup and adding ROMs

So as mentioned earlier, I found the 3b+ couldn't quite handle Dreamcast and since it's 32 bit, it couldn't install Redream. Everything else was perfect including PS1. So I think Dreamcast are best left for a half-decent laptop. The setup is pretty straightforward.

* [Download the installer from the Redream site](https://redream.io/).
* Run the download to launch Redream
* Go to the library tab and add the folder containing your ROMs seen below

<article-image 
  src="https://res.cloudinary.com/dayqxxsip/image/upload/v1713021380/App%20Images/Blog%20Images/Article%20Images/Retropie/add-redream-roms_f4lsaf.png" 
  alt="Redream library tab" 
  loading="lazy" 
  styling=""
  caption="Go to the library tab and find your ROM folder" 
  captionsrc="https://res.cloudinary.com/dayqxxsip/image/upload/v1713021380/App%20Images/Blog%20Images/Article%20Images/Retropie/add-redream-roms_f4lsaf.png" 
  :showsource="false">
</article-image>

* Launch a game
* To save/load a game, while in the game hit the 'Esc' key which brings up the menu seen below, then save to a slot. You get 1 slot for free at the time of writing and to unlock more requires a payment.

<article-image 
  src="https://res.cloudinary.com/dayqxxsip/image/upload/v1713021383/App%20Images/Blog%20Images/Article%20Images/Retropie/redream-save-game_ux5beq.png" 
  alt="Redream game save tab" 
  loading="lazy" 
  styling=""
  caption="Hit 'Esc' then save your game" 
  captionsrc="https://res.cloudinary.com/dayqxxsip/image/upload/v1713021383/App%20Images/Blog%20Images/Article%20Images/Retropie/redream-save-game_ux5beq.png" 
  :showsource="false">
</article-image>

## Bonus - How to format an SD card to default code snippet

Just in case you have a MicroSD card which is already in use and you want to effectively factory reset it to it's default settings, wipe it and then add the RetroPie image:

* Search for diskpart.exe
* Open as admin in a command prompt
* Type `list disk`. ...
* Type `select disk X` where X is the SD card drive number. ...
* **WARNING!** Make sure you have selected the correct disk before proceeding as it will wipe the selected disk completely
* Type `clean` to clean the drive and wipe it

## Bonus - Overclocking your Raspberry Pi 3b+

Insert the MicroSD into your PC then in the boot folder find the "config" file. Open this in Notepad++ or Visual Studio, then edit and increase the arm_freq variable to overclock. This is the section you're looking for...

```[config.txt]
...

#uncomment to overclock the arm. 700 MHz is the default.
#arm_freq=800
arm_freq=1400

...
```

This [video tutorial neatly covers the process](https://www.youtube.com/watch?v=xXOi3xPLi6E&list=PL5TuPBnwdd6h172vufklL3VU4vv8lY4c8&index=12).

## Happy retro gaming

That's everything for this one! There are lots of things to explore when it comes to retro gaming. I really enjoyed trying to set this up and learnt a lot in the process.

I hope you enjoyed this article and found it useful, thanks for reading. Since you read this article all the way to the end you might also be interested in:

* [Concepts of Artificial Intelligence with Python - a review of CS50 AI](/blog/concepts-of-artificial-intelligence-with-python-a-review-of-cs50-ai/)
* [Developing your data science and analytical coding skills - a review of DataCamp](/blog/developing-your-data-science-and-analytical-coding-skills-a-review-of-datacamp/)
* [Creating a screen and mouse jiggler with Python](/blog/creating-a-screen-and-mouse-jiggler-with-python/)
]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Reflections on digital streaming and reducing smartphone usage]]></title>
            <link>https://shedloadofcode.com/blog/reflections-on-digital-streaming-and-reducing-smartphone-usage/</link>
            <guid>https://shedloadofcode.com/blog/reflections-on-digital-streaming-and-reducing-smartphone-usage/</guid>
            <pubDate>Wed, 13 Mar 2024 17:50:00 GMT</pubDate>
            <description><![CDATA[Find out why actually reducing your technology use might be a benefit, and how to stop using your smartphone so much. Regain your focus from the attention & subscription economy. Includes a quick review of minimalist phone and YouTube Music.]]></description>
            <content:encoded><![CDATA[
This articles covers some of the thoughts and steps I've taken to redefine my relationship with my smartphone, streaming and modern tech devices, using some nostalgic tactics.

## What is the problem with smartphones?

Smartphones and modern tech devices seemingly offer everything we could possibly need in a single device. They appear to be the perfect multi-tool. Is it a surprise we all seem to be addicted to them? Is that good for us?

I wasn't really an avid user to begin with, I don't use social media and yet still I found myself using my smartphone far too much. Checking the bank account, watching a how-to video, making a note, researching directions, looking up some information, streaming and listening to music. The heating system can be controlled via an app, the CCTV cameras can be viewed on an app... Almost everything and everyone is connected.

This is all super convinient but lately I've felt I / we spend far too much time on these devices, and with technology in general. It's verging on a minor addiction, like the compulsive checking of your phone even though you know there won't be any notifications - or none that you're interested in anyway! Could it be that hyper-convience is actually bad for us? I am starting to think so. 

<article-image 
  src="https://images.pexels.com/photos/8088495/pexels-photo-8088495.jpeg?auto=compress&cs=tinysrgb&w=1260&h=750&dpr=1" 
  alt="Model selection diagram" 
  loading="lazy" 
  styling=""
  caption="We are ALL likely more addicted than we think we are" 
  captionsrc="https://www.pexels.com/photo/photo-of-people-engaged-on-their-phones-8088495/" 
  :showsource="true">
</article-image>

Dopamine (the feel good chemical) is released not for the reward itself, but [in ancipation of the reward](https://medium.com/delasign/anticipation-is-worth-more-than-the-reward-3ed5e4883258#:~:text=It's%20not%20the%20reward%2C%20it's%20the%20anticipation.&text=A%20finding%20that%20led%20the,the%20craving%20for%20that%20reward.%E2%80%9D). This means even being in the presence of a smartphone is like having constant access to a metaphorical slot machine, it might have us on edge constantly thinking things like:

* Can I find any new information?
* Should I make a note of that?
* What's my bank balance?
* I should look up directions to that place.
* ... and many more

All of this seems to be affecting our concentration and attention spans. If [early modern humans](https://en.wikipedia.org/wiki/Early_modern_human) have survived for ~300,000 years without smartphones, why since [2007 and the first iPhone](https://youtu.be/x7qPAY9JqE4?si=oEj_RK7e3dPIWf3c&t=156) do we need them so much?

## What are the solutions to this problem?

I watched a few videos and lectures on solutions to this attention addiction problem. I am not against smartphones they are great devices that help us, but there are many dangers too which can lead to bigger issues like anxiety, depression, insomnia and more. Here are the options I gathered:

* [Live without a smartphone](https://www.youtube.com/watch?v=uNQujCwCu88) - radical and difficult in a world built for smartphones with things like QR codes, medical apps, online government services and so on.
* [Confront and redefine your relationship with them](https://www.youtube.com/watch?v=2ldLwkj4dRc) - acknowledge it is an issue and work towards improving it using intentional app time limits, using only a few apps etc.
* [Learn to look up again](https://www.youtube.com/watch?v=m1_QlV6XCNs) - use tactics to help you manage your relationship so put it away during social situations, ask others to put theirs away, don't sleep with or near your phone, and turn off notifications.

Now for my own reflections and tactics on how to reduce smartphone and redefine your relationship with them, tech usage and streaming. The results being a healthier, happier relationship with technology where you are more in control.

## Reset expectations

I think technological progress is great, but it can be harmful. I think around 2005 was a sweet spot for technology use in that:

* Landlines frequently used 
* Texting frequently used, picture messaging less so - also much harder to text on ['dumb' phones](https://en.wikipedia.org/wiki/Nokia_3310)
* Computers and laptops were bulky, slower but did the job
* Internet was available albeit much slower, more viruses, less sophisticated, but felt more free and open
* Films were on DVD, options were buying or renting from [Blockbuster](https://en.wikipedia.org/wiki/Blockbuster_(retailer)) (not sure when Blockbuster collapsed) or for some they were downloaded via piracy - resulting in this [classic ad](https://www.youtube.com/watch?v=HmZm8vNHBSU).
* Music albums were released to CD, played on portable CD players or ripped to PC and then stored on iPods / MP3 players
* To find new music and artists you checked out [Last.fm](https://www.last.fm/)
* [YouTube](https://en.wikipedia.org/wiki/History_of_YouTube) launched February 2005 
* [Facebook](https://en.wikipedia.org/wiki/History_of_Facebook) launched 2004, before that is was [Myspace](https://en.wikipedia.org/wiki/Myspace), [Bebo](https://en.wikipedia.org/wiki/Bebo) which weren't really as widely adopted 
* Endless scrolling didn't really exist in the same way it does now

So this world didn't include smartphones, and yes things were more inconvenient as a result nevertheless, **it still worked**. It still had everything we have now more or less. 

Now I'm not advocating we go back to these times, but we can certainly learn from them, reflect on what we've gained and what we've lost. Use some of those reflections to improve lives in this hyper-connected attention-seeking world.

Here we go...

## Use it like a tool (or a landline)

Smartphones are great multi-purpose tools, and that is part of the problem! One thing I've done to use it more like a tool is to use an app called [minimalist phone](https://www.minimalistphone.com/) for Android. I think there is an equivalent for iPhone too. 

It's great for keeping the phone semi-dumb and highlighting only the apps you really need while keeping some tucked away out of view just in case you need them occasionally. The best part for me is that at the time of writing there is a one-time purchase available instead of monthly / yearly! I hope this never changes.

You can see in the image below I keep a few select apps on the front page. You can rename apps to keep things really simple, I renamed...

* YouTube Music to **Music** 
* Kindle to **Books**
* YouTube to **Videos**

If you designate an app it will prompt you how long you want to spend on that app. Swiping right gives you a search bar to find your tucked away apps which I have added to folders. You can also 'hide' apps totally stashing them away and out of view completely. Once your time is up you get a prompt to 'Take me out of here' or continue with 'More time'. It's a super helpful interface.

<article-image 
  src="https://res.cloudinary.com/dayqxxsip/image/upload/v1710332888/App%20Images/Blog%20Images/Article%20Images/Digital%20Streaming/minimalist-phone-app_a4hsjd.png" 
  alt="My minimalist phone app setup" 
  loading="lazy" 
  styling=""
  caption="Minimalist phone app setup - many useful features to stay intentional" 
  captionsrc="https://res.cloudinary.com/dayqxxsip/image/upload/v1710332888/App%20Images/Blog%20Images/Article%20Images/Digital%20Streaming/minimalist-phone-app_a4hsjd.png" 
  :showsource="true">
</article-image>

Some other helpful things were:

* Keep your smartphone in the same location - on a window sill so you have to physically go to it the same as a landline. When you're done, put it back. This keeps you distanced physically and mentally.
* When you have a question you want to ask Google, **ask yourself first**, try to figure it out, use that gift of a brain! Pretend Google doesn't exist, how would you work it out? How would you find that information?
* Removing all social media apps - stick to text and WhatsApp to message people or call them.
* Pretend it's a single use device - if you're listening to music, only do that, no app switching, this requires lots of willpower!
* Try using colour contrast mode to turn the display black and white - much less distracting
* If you're not using a minimalist phone app clean up those apps, get rid of the unused and hide the infrequent ones, reduce to the **essential** tools

## Enjoy single use devices to prevent multi-tasking

My single device hacks to reduce reliance on streaming and to prevent multi-tasking are:
* A [2TB Toshiba external hard drive](https://www.amazon.co.uk/dp/B07994QL95) to play movies directly on TV or Xbox
* A [5th Gen iPod 60GB](https://www.ebay.co.uk/sch/i.html?_from=R40&_trksid=p4432023.m570.l1313&_nkw=ipod+5th+generation+A1136&_sacat=0) which I modded with a 256GB SD card and bigger battery for offline listening plus a [Bluetooth adapter](https://www.amazon.co.uk/dp/B09ZTBZHCN). I got this from eBay for £33 in great condition, even with songs loaded from the seller! Best purchase in ages. Only a small section of dead pixels on the screen.
* An old iPhone SE no SIM card to use just for music -  YouTube Music + YouTube background play
* [JBL Charge 5](https://www.amazon.co.uk/JBL-Charge-Bluetooth-waterproof-built-Black/dp/B08VDNCZT9/ref=sr_1_1) - great portable Bluetooth speaker with good battery life if you can find one on offer

These make my smartphone optional and it can be left alone sitting on the window sill, it makes me use the smartphone more like a desktop PC - I go to it, do what I need to do then leave it alone.

These also make streaming somewhat optional, it means I could unsubscribe to most media streaming services and still be entertained and would have only what I love and treasure. 

It took effort to hunt down those movies and albums - again the anticipation of the reward is greater than the reward itself! Some effort and inconvience ensures the reward is appreciated even more. 

It wasn’t mindless scrolling to hunt for them either, it was active searching, thinking, reflecting. It slowed down consumption. One you've acquired them they are yours, no one can take them away from you.

If you were lost in the jungle and had to eat anything to survive, your favourite food on return to civilisation would be the finest food you've ever tasted, and you would appreciate every bit of it. Struggle isn't always nice, but **some** struggle is a good thing - it makes us appreciate what we have instead of worrying other options might be better.

There are lots of good ideas on how to introduce some struggle into your life [like using an iPod to listen to music](https://www.youtube.com/watch?v=3mfC4WNVMec) alongside or instead of streaming.

The thing about streaming music, is that there isn't really a way to do it *without* a smartphone. They tend to go hand-in-hand. I think having a device dedicated to music is a special thing, even if like me, that's just an old iPhone SE used solely for music. It's the perfect size for this purpose and after finding a [new battery](https://www.amazon.co.uk/dp/B088TBSVSR) for it [and fitting it](https://youtu.be/x9JRqocmm24?si=G3si-0qqr8Mq8xtp) it goes for days. I keep the display black and white and only use YouTube Music, Headspace and YouTube for background listening with this device.

<article-image 
  src="https://res.cloudinary.com/dayqxxsip/image/upload/v1710333574/App%20Images/Blog%20Images/Article%20Images/Digital%20Streaming/dumb-devices_xwzrdv.jpg" 
  alt="iPhone SE and iPod 5th Gen" 
  loading="lazy" 
  styling=""
  caption="My semi-dumb devices an iPhone SE and iPod 5th Gen" 
  captionsrc="https://res.cloudinary.com/dayqxxsip/image/upload/v1710333574/App%20Images/Blog%20Images/Article%20Images/Digital%20Streaming/dumb-devices_xwzrdv.jpg/" 
  :showsource="true">
</article-image>

If I want to go totally offline I've been building a good music library to load onto an iPod 5th Generation (A1136) modded with an [iFlash Quad](https://www.iflash.xyz/store/iflash-quad/) and 256GB SD card, along with 3000mAH battery giving days upon days of usage. You can find [great guides](https://youtu.be/6bhOyLF4Co4?si=r90rGFRZA4x6QB0f) to do this on YouTube from [DankPods](https://www.youtube.com/@DankPods) and others. 

The 5th Gen iPod seems the easiest to open up, whereas the 6th and 7th Gen have fully metal cases so much harder. Plus the 6th Gen has a limit of 128GB when flash modded, whereas the 5th and 7th Gen have no limit up to 1TB - not that I've ever tested that, 256GB is more than enough.

## List your top tens for entertainment 

If you could only watch / listen to 10 movies, documentaries and artists ever again, what would they be?

Collate your own library of top 10’s whether that be MP3s, CDs, DVDs, or  files. I lived during the times of piracy where streaming wasn’t an option and individual items were expensive. The price of a CD album can now get your a monthly subscription to most of the songs ever created! I’m not an expert on the economics of streaming, but however slim, there is a chance of returning to a world one day of ‘if you like it, then buy it’. I mean if they don’t pay artists enough it’s not unfeasible. I’m not sure, that’s another topic though for someone else to debate.

Maybe use the money you spend on streaming to acquire your favourite music and films digitally or on CD / DVD, then get an external hard drive and back them up and for easier viewing. Plug the hard drive into a games console or TV and you've got your own private music / movie collection. Barring the hard drive failing you'll always have access to them. My varied lists included:

**Music:**
* Linkin Park
* Atreyu
* Avenged Sevenfold
* Five Finger Death Punch
* Queen
* ...

**Films:**
* American Psycho
* The Big Lebowski
* There Will Be Blood
* Starship Troopers
* ...

**Documentaries / Series:**
* Blue Planet I, II, III
* Planet Earth I, II, III
* Anything David Attenborough
* World War II in Colour by Robert Powell
* The Simpsons
* Futurama
* ...

## Make streaming and convenience optional

Are there benefits to streaming? Definitely, but there are dangers too.

* Too much stuff available creating decision fatigue
* Too instantly available
* Mindless scrolling vs. active thinking and searching
* Not what you treasure
* You don't own it so it can disappear
* Price increases could become unpalatable 
* Free version you are bombarded with ads - I think these have a big effect on your mental health, I avoid ads like the plague.
* Actually changing the market and how we consume music - losing any physical connection

Can you apply the principles of the old physical media world to the new streaming world? Yes, I think you can though the power of pretending.

* Pretend your music streaming app is an iPod - you can’t change apps, search for anything, receive notifications
* Pretend your Netflix app is Blockbuster or IMDB - what do you feel like watching before you load it up? 
* Pretend new episodes are released daily or weekly. So only 1 episode or film per day / week to avoid binging.
* Pretend it's the 80s or 90s and your phone and internet doesn't exist for a day - find alternatives

As discussed in the previous section, have some go-to entertainment to avoid endless streaming. My go-to before unsubscribing from Netflix was watching a film / episode then follow it with a David Attenborough documentary box set of Planet Earth, Blue Planet etc. Perfect for relaxing and winding down. 

I think having a go-to is becoming old school, a favourite film or documentary you could watch over and over.

By using pretending in combination with some go-tos we can make streaming more optional, a nice to have, but not a necessity.

## Find a middle ground

The only streaming service I used to pay for was Netflix. I recently unsubscribed from that to avoid endless scrolling and not finding anything I like.

In 2023, I subscribed to Spotify Premium for the first time, I became tired of ads and the constant bombardment from them. We are certainly in an attention economy, where so much money is spent getting your attention and convincing you to spend money on things! I recently unsubscribed from that because I only listen to certain playlists and artists. Before 2023, I kind of got by with just MP3s and occasional Spotify, back when it was ad-free on desktop and tablet.

That leaves me with only two subscriptions I have currently:
* Amazon Prime which comes with [many benefits](https://www.amazon.co.uk/b?ie=UTF8&node=14917073031) at £7.91 per month - paying yearly was £95 so at approx £5 per delivery my **household** needs to have 19 orders per year since [you can share Prime benefits with your household members](https://www.amazon.com/gp/help/customer/display.html?nodeId=GWZ7QXD2X8WL8YE8)
* YouTube Premium which comes with YouTube Music too at £12.99 per month
* Total at £20.90 per month 

Do I enjoy giving money to two market dominating tech giants? Not really, I'm against monopolies but can't argue the services they have are good quality and mostly reliable. I can live with this choice, it's my middle ground. Limiting myself to only two subscriptions feels good, both mentally and for the wallet - I get tons of use from each so very cost-effective.

I only recently subscribed to [YouTube Premium](https://www.youtube.com/premium) which comes with YouTube Music too. To confirm, I have no affliation with Google or YouTube Premium. I really enjoy watching ad-free videos and use it for everything how-tos, documentaries, lectures, guides, coding tutorials and lots more. 

I dislike having to pay to remove ads, nevertheless it's a huge platform with estimated over 800 million videos and 100 million songs so I understand that needs funding to keep it all running! 😂 Plus it keeps valuable content creators paid which is a good thing too. As a bonus too, YouTube Music is thrown into the bundle.

Here is a comparison of Youtube Music against Spotify:

**YouTube Music pros:**

- Sounds louder and crisper than Spotify to me
- Seamlessly switch between music and video version
- Fine-tune Up Next playing suggestions with Familiar, Discover, Popular, Genres
- Better Recommendation and Quick pick features in my view
- Similar size catalogue of 100 million songs but with more niche uploads from Community Playlists
- Can find more obscure songs maybe not on Spotify like very recent covers 
- Clean layout with Up Next, Lyrics, Related
- Smart downloads - when connected to Wi-Fi the app will automatically download your specified amount of favourite + recent songs in an 'Offline Mixtape' which is awesome. This has also made using an old iPhone SE with no SIM as a dedicated music streamer even easier on the go.

**YouTube Music cons:**

- No app on Xbox for background play or easy navigation 
- No desktop app - although you can [download it as a progressive web app](https://support.google.com/chrome/answer/9658361?hl=en-GB&co=GENIE.Platform%3DDesktop#:~:text=On%20your%20computer%2C%20open%20Chrome,instructions%20to%20install%20the%20PWA.) from Chrome using the 'Install' button in the address bar, which adds it to the desktop
- Adding artists is a pain, must subscribe or add albums 
- Creating playlists is a pain and are added to main YT 
- The solution to that I've found is to [create a new 'channel'](https://tinyurl.com/3wzsy3sz) to keep music seperated
- No reliable 'Spotify Connect' function like using another device as a remote
- No searching within playlists - first world problem, I know!
- Not sure how good podcasts are, don't use them
- Playlists are not as good probably due to a smaller community, they are more than ok though

**Tips I used to transition music services or to iPod:**

1. Monitor what you use your old music service for
2. Add the same artists, albums and playlists to iPod (optional) 
3. For Bluetooth use an adapter with the iPod
4. For Xbox use USB with Background Music Player or AirServer with phone
5. Use [Soundiiz](https://soundiiz.com/) to transfer any playlists from old to new service (free tier is 1 playlist at a time with 200 songs per playlist at the time of writing)
6. Unsubscribe from your old music service, use new service for discovery, repeat

## Conclusion 

I hope you enjoyed this article and it gave you the chance to reflect on your own relationship with technology and smartphones. We've covered many related topics including:

* Smartphones
* Streaming 
* Minimalism
* Consumerism and the attention / subscription economy
* How music and video consumption has changed
* How ads and distraction affects our concentration

I think we saw some common themes emerging:

* Be aware and intentional with tech
* Ensure you're controlling it and it's not controlling you
* Set your structures, boundaries and limits
* Try single use devices or a smartphone with minimal apps
* Find a way that works for you

I think moving forward those of us who create systems, apps, websites and any other digital solutions have to be aware of this stuff, and that success metrics don't focus on engagement but ethical use. It's definitely not being anti-technology, just a reflection on practices for positive human-computer interactions in an ever changing landscape.

One philosophy is that digital tools in any form should give you time back, not take it away from you, it should make life better, and easier, not harder or harmful to users. You should own it, it shouldn't own you. What smoking was to the physical health, smartphone usage is to mental health. It is the issue of our time.]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Solving real-world optimisation problems - a crash course with PuLP]]></title>
            <link>https://shedloadofcode.com/blog/solving-real-world-optimisation-problems-a-crash-course-with-pulp/</link>
            <guid>https://shedloadofcode.com/blog/solving-real-world-optimisation-problems-a-crash-course-with-pulp/</guid>
            <pubDate>Sat, 10 Feb 2024 15:58:00 GMT</pubDate>
            <description><![CDATA[Explore four optimisation scenarios applicable to the real-world and how to solve these using linear programming with Python and the PuLP library.]]></description>
            <content:encoded><![CDATA[
<affiliate-disclaimer></affiliate-disclaimer>

I’ve read a few tutorials recently to refresh my knowledge on optimal resource allocation, and either the examples were too complex or delved too far into the maths. I also enrolled on a useful course from DataCamp called [Supply Chain Analytics in Python](https://datacamp.pxf.io/KjA61e). This article focuses more on the practical steps required for you to get started quickly with some good examples.

By the end of this article, you should be able to solve simple and intermediate optimisation problems using Python and PuLP.

This is a really useful skill for [statisticians](https://www.prospects.ac.uk/job-profiles/statistician), [data scientists](https://www.prospects.ac.uk/job-profiles/data-scientist), [operational reseachers](https://www.prospects.ac.uk/job-profiles/operational-researcher) and business to make the best decisions, maximise profits, production, minimise time, costs and more. 

We will start with a small example, then build up to more complex examples as we proceed. I ran all of the code contained in this article using [Spyder IDE with Anaconda](https://docs.anaconda.com/free/working-with-conda/ide-tutorials/spyder/).

## What is optimisation and linear programming?

* Optimisation helps to find the best decision given some inputs, so aims to maximise or minimise an objective function, given a number of constraints

* Linear programming (LP), also called linear optimisation, is a method to achieve the best outcome (such as maximum profit or minimal cost) in a mathematical model whose requirements are represented by linear relationships. Linear programming is a special case of mathematical programming also known as [mathematical optimisation](https://en.wikipedia.org/wiki/Mathematical_optimization).

* [PuLP](https://coin-or.github.io/pulp/) is a library in Python to help with optimisation and linear programming tasks. PuLP stands for “Python. Linear Programming”

## What are the steps to solving an optimisation problem?

An optimisation problem that uses linear programming (LP) and PuLP typically has the following steps / components:

* **Model** - an initialised PuLP model
* **Decision variables** - what you can control
* **Objective function** - the goal to maximise or minimise like profit, cost, resources
* **Constraints** - limitations to our solution like demand, capacity, time
* **Solve model** - then view the most optimal outcome

Let's see these in action in our first example.

## Exercise routine

Use LP to decide on an exercise routine to burn as many calories as possible.

 
|           |  Pushup   	    | Running      |
|-----------|-----------------|--------------|
| Minutes	  |  0.2 per pushup	| 10 per mile  |
| Calories	|  3 per pushup	  | 130 per mile |

Constraint = only 10 minutes to exercise

```python [exercise.py]
from pulp import LpProblem, LpVariable, LpMaximize, LpMinimize, LpStatus, lpSum, value

# 1. Initialise model
model = LpProblem("Maximize Calories Burnt", LpMaximize)

# 2. Define Decision Variables: pushups and running
pushup = LpVariable('Pushup', lowBound=0, upBound=None, cat="Continuous")
running = LpVariable('Running', lowBound=0, upBound=None, cat="Continuous")

# 3. Define objective function: calories per pushup or per mile
model += 3 * pushup + 130 * running

# 4. Define constraints: our model's limitations
model += 0.2 * pushup + 10 * running <= 10  # Time constraint is 10 minutes to exercise
model += pushup >= 0 + running >= 0         # Our results must be more than 0 pushups or miles ran (so not negative)

# 5. Solve model
model.solve()
print("Run = {} miles".format(running.varValue))
print("Pushups = {}".format(pushup.varValue))
print(f"Calories burnt: {(running.varValue * 130) + (pushup.varValue * 3)}")
```

Our workflow in this code consisted of:

1. Initialising the model with the help of PuLP using LpProblem and set our goal as LpMaximize - since we want to maximise calories burnt
2. Defining our two decision variables as either pushups or running and set the category as Continuous
3. Setting the objective function in mathematical form, which were calories per pushup (3) and calories per mile of running (130)
4. Setting the constraints which were 10 minutes to exercise, and not a negative result
5. Solve the model and output the results

The results printed were:

>  Run = 0.0 miles
>
> Pushups = 50.0
>
> Calories burnt: 150.0

This has computed all possible combinations and returned the most optimal decision in miliseconds! We can see the most optimal outcome is to perform 50 pushups which burns 150 calories and is under the 10 minute constraint (0.2 * 10 = 10) 

I hope you can see the power here of quickly solving optimisation problems that would be very difficult to solve by hand accounting for all possible combinations. 

<article-image 
  src="
https://res.cloudinary.com/dayqxxsip/image/upload/v1707581850/App%20Images/Blog%20Images/Article%20Images/Optimisation%20with%20PuLP/spyder-optimisation_swph74.png" 
  alt="Spyder IDE and PuLP" 
  loading="lazy" 
  styling=""
  caption="If you're using Spyder IDE you can explore PuLP's model classes in the variable explorer tab." 
  captionsrc="https://res.cloudinary.com/dayqxxsip/image/upload/v1707581850/App%20Images/Blog%20Images/Article%20Images/Optimisation%20with%20PuLP/spyder-optimisation_swph74.png" 
  :showsource="false">
</article-image>

## Glass manufacturing 

We are tasked with planning the optimal production at a glass manufacturer to maximise profit. This manufacturer only produces wine and beer glasses:

* there is a maximum production capacity of 60 hours
* each batch of wine and beer glasses takes 6 and 5 hours respectively
* the warehouse has a maximum capacity of 150 rack spaces
* each batch of the wine and beer glasses takes 10 and 20 spaces respectively
* the production equipment can only make full batches, no partial batches
* Also, we only have orders for 6 batches of wine glasses. Therefore, we do not want to produce more than this. Each batch of the wine glasses earns a profit of $5 and the beer $4.5

```python [resources.py]
from pulp import LpProblem, LpVariable, LpMaximize, LpMinimize, LpStatus, lpSum, value

# 1. Initialise model
model = LpProblem("Maximize Glass Co. Profits", LpMaximize)

# 2. Define Decision Variables: wine and beer glasses
wine = LpVariable('Wine', lowBound=0, upBound=None, cat="Integer")
beer = LpVariable('Beer', lowBound=0, upBound=None, cat="Integer")

# 3. Define objective function: profit for both wine glasses and beer glass decision variables
model += 5 * wine + 4.5 * beer

# 4. Define constraints: our model's limitations
model += 10 * wine + 20 * beer <= 150   # Rack space cannot exceed 150
model += 6 * wine + 5 * beer <= 60      # Maximum production capacity is 60 hours
model += wine <= 6                      # Wine glasses cannot exceed 6 batches

# 5. Solve model
model.solve()
print("Produce {} batches of wine glasses".format(wine.varValue))
print("Produce {} batches of beer glasses".format(beer.varValue))
```

We followed the same pattern in this example, but defined more constraints. We also defined the category for our decision variables as Integer because we can only make full batches, no partial batches.  

Given these constraints, we calculate the optimal production outcome to maximise profit is to produce 6 batches of wine and 4 batches or beer!

> Produce 6.0 batches of wine glasses
>
> Produce 4.0 batches of beer glasses

## Warehouse stock allocation

Decide which warehouse to ship from to fulfil customer unit demand at the lowest cost.

This example is more complex so uses Python list comprehension to define many decision variables, objective functions and constraints quickly.

```python [logistics.py]
from pulp import LpProblem, LpVariable, LpMaximize, LpMinimize, LpStatus, lpSum, value

warehouses = ['New York', 'Atlanta']
customers = ['A', 'B', 'C']

costs = {
    ('New York', 'A'): 232,
    ('New York', 'B'): 255,
    ('New York', 'C'): 264,
    ('Atlanta',  'A'): 255,
    ('Atlanta',  'B'): 233,
    ('Atlanta',  'C'): 250
}

demand = {
    'A': 1500,
    'B': 900,
    'C': 800
}

# 1. Initialise model
model = LpProblem("Minimise_Transportation_Costs", LpMinimize)

# 2. Define 6 Decision Variables in a few lines of code using LpVarible.dicts
#    That's (2 warehouses * 3 customers)
key = [(w, c) for w in warehouses for c in customers]
shipments =  LpVariable.dicts('Shipments', key, lowBound=0, cat='Integer')

# 3. Define objective function: shipping costs
model += lpSum([costs[(w, c)] * shipments[(w, c)] 
                for w in warehouses for c in customers])

# 4. Define constraints: our model's limitations which is demand must be met for each customer
for c in customers:
    model += lpSum([shipments[(w, c)] for w in warehouses]) == demand[c]

# 5. Solve model
model.solve()
print("Status", LpStatus[model.status], "\n")

# 6. Print values for each decision variable - demand
print("Optimal units for each warehouse:")
for decision_variable in model.variables():
    print(decision_variable.name, "=", decision_variable.varValue)

# 7. Print value for the objective function - costs
print("\nObjective =", value(model.objective))
```

In this example we've created some dictionaries to hold our data for warehouses, customers, costs (warehouse to customer), and demand (units).

We then follow the same pattern but use list comprehension to define every combination of decision variables for warehouses and customers. We do the same thing to define all of our shipping costs. 

Finally, we can define the constraints in that the shipments for each warehouse must meet demand and solve the model.

The output from PuLP gives us:

> Status Optimal 
> 
> Optimal units for each warehouse:
> 
> Shipments_('Atlanta',_'A') = 0.0
> 
> Shipments_('Atlanta',_'B') = 900.0
> 
> Shipments_('Atlanta',_'C') = 800.0
> 
> Shipments_('New_York',_'A') = 1500.0
> 
> Shipments_('New_York',_'B') = 0.0
> 
> Shipments_('New_York',_'C') = 0.0
> 
> 
> Objective = 757700.0

We can see that to meet demand for: 
* customer B we need 900 units in Atlanta  
* customer C we need 800 units in Atlanta
* customer A we need 1500 units in New York

This results in optimal shipping costs of 757,000 and we've solve a much bigger problem with many more variables.

## C02 monitor allocation

Let's say we were tasked with allocating C02 monitors to schools in order to manage and monitor air quality similar to [this real scenario](https://www.gov.uk/guidance/using-co-monitors-and-air-cleaning-units-in-education-and-care-settings). 

We need to allocate them proportionally to have the greatest impact, with some left over for additional demand later. This is the longest example given there are a number of constraints to define.

```python [monitors.py]
"""
Allocates the optimal number of C02 monitors to schools given the constraints.
"""
from pulp import LpProblem, LpVariable, LpMaximize, LpMinimize, LpStatus, lpSum, value

# Objective function: number of monitors
available_monitors = 200

# Decision variables: a list of schools to allocate monitors
schools = ["School A", "School B", "School C", "School D"] 

# Constraints: dictionaries of size, rooms and pupil counts for each school
school_sizes = {"School A": 5000, "School B": 6000, "School C": 4000, "School D": 5500}  # in square feet
school_rooms = {"School A": 30, "School B": 40, "School C": 25, "School D": 35}
pupil_counts = {"School A": 2000, "School B": 3000, "School C": 1500, "School D": 2500}


def allocate_co2_monitors(schools, available_monitors, school_sizes, school_rooms, pupil_counts):
    # 1. Initialise model
    model = LpProblem("CO2_Monitor_Allocation", LpMinimize)

    # 2. Define the decision variables - the things we can control
    # .dicts creates a dictionary of LpVariables https://coin-or.github.io/pulp/technical/pulp.html#pulp.LpVariable.dicts
    monitors = LpVariable.dicts("Monitors", schools, lowBound=0, cat="Integer")

    # 3. Define the objective function: the thing we want to minimise or maximise so total number of monitors used per school
    # Passing a list to lpSum can add many decision variables at once
    model += lpSum(monitors)

    # 4. Define the constraints:

    # At least one monitor to each school
    for school in schools:
        model += monitors[school] >= 1 

    # The total number of allocated monitors should not exceed the available monitors
    model += lpSum(monitors) <= available_monitors - 20

    # There must be 1 monitor per 500 square feet
    for school in schools:
        model += monitors[school] >= school_sizes[school] / 500

    # There must be 1 monitor per 2 rooms
    for school in schools:
        model += monitors[school] >= school_rooms[school] / 2

    # There must be 1 monitor per 50 pupils
    for school in schools:
        model += monitors[school] >= pupil_counts[school] / 50 

    # 5. Solve the LP problem
    model.solve()

    # 6. Check the status of the solution
    if LpStatus[model.status] != "Optimal":
        print("Unable to find an optimal solution.")
        return None

    # 7. Get the model results
    allocation = {}
    for school in schools:
        allocation[school] = value(monitors[school])

    return allocation

allocation = allocate_co2_monitors(schools, available_monitors, school_sizes, school_rooms, pupil_counts)

if allocation:
    print("CO2 Monitor Allocation:")
    total_monitors_allocated = 0
    for school, monitors in allocation.items():
        total_monitors_allocated += int(monitors)
        print(f"{school}: {monitors} monitors")
    print(f"\nTotal monitors allocated: {total_monitors_allocated}")
    print(f"Total monitors leftover: {str(available_monitors - total_monitors_allocated)}")
```

Here we define the available monitors, and set our data for schools, school size, rooms, and pupil counts.

Following the same pattern, we initialise the model, and generate our decision variables from the **schools** list - our decision variable is what we can change so here it's the schools and how many monitors to assign to each.

Finally we add each of the constraints and solve:

* At least one monitor to each school
* The total number of allocated monitors should not exceed the available monitors
* There must be 1 monitor per 500 square feet
* There must be 1 monitor per 2 rooms
* There must be 1 monitor per 50 pupils

These are reasonable assumptions for the constraints but we could change them if they are too strict. In this case, we have a solved model which gives:

> CO2 Monitor Allocation:
> 
> School A: 40.0 monitors
> 
> School B: 60.0 monitors
> 
> School C: 30.0 monitors
> 
> School D: 50.0 monitors
> 
> Total monitors allocated: 180
>
> Total monitors leftover: 20

Great! So from the 200 available monitors we have allocated 180 given the constraints with 20 leftover. I find this the most impressive example as solving this scenario without LP and PuLP would take so much more work!

## Conclusion

Well done if you made it through all the examples. You should be now be able to solve simple and intermediate optimisation problems using Python and PuLP using this workflow. You just have to frame your problem as an LP problem and then modify your decision variables and constraints. 

It is worth reminding ourselves that sometimes there won't be a solution to problem. In that case, we must revisit our inputs to loosen them a little if possible. Maybe the constraints are too strict and need to be made more forgiving. This is where operations meets analysis. 

Always question the outputs and sense check them through solid quality assurance - find out more in the article [Six tips for producing and assuring high quality analytical code](/blog/six-tips-for-producing-and-assuring-high-quality-analytical-code/).

I hope you enjoyed this article, you've hopefully added a seriously useful tool to your toolkit. You may also be interested in these articles on the site:

* [How to build and visualise a Monte Carlo simulation with Python and Plotly](/blog/how-to-build-and-visualise-a-monte-carlo-simulation-with-python-and-plotly/)
* [Understanding Explainable AI (XAI) for classification, regression and clustering with Python](/blog/understanding-explainable-ai-for-classification-regression-and-clustering-with-python/)
* [Developing your data science and analytical coding skills - a review of DataCamp](/blog/developing-your-data-science-and-analytical-coding-skills-a-review-of-datacamp/)]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Improving Wi-Fi 2.4GHz and 5GHz speeds after Full Fibre (FTTP) upgrade]]></title>
            <link>https://shedloadofcode.com/blog/improving-wi-fi-24ghz-and-5ghz-speeds-after-full-fibre-fttp-upgrade/</link>
            <guid>https://shedloadofcode.com/blog/improving-wi-fi-24ghz-and-5ghz-speeds-after-full-fibre-fttp-upgrade/</guid>
            <pubDate>Tue, 09 Jan 2024 17:50:00 GMT</pubDate>
            <description><![CDATA[Discover the steps I took to increase Wi-Fi speeds from 25Mbps to 150Mbps after a recent FTTP upgrade. Maybe you can try some of these to help you to maximise your own connection speeds too!]]></description>
            <content:encoded><![CDATA[
Given I work with computers every day, and have a good understanding of computer science and networking, I recently needed a refresher to improve my Wi-Fi connection speeds.

In this article, we'll go through the steps I took to increase Wi-Fi speeds from 25Mbps to all 150Mbps after a recent internet connection upgrade. Maybe these steps can help you to maximise your own connection speeds too.

## Setting the scene - the issue

My [fibre connection](https://www.openreach.com/fibre-broadband) was previously FTTC (fibre to the cabinet) but was recently upgraded to 'Full Fibre' or FTTP (fibre to the property). Great! This meant I was able to go from 25-30Mbps ([megabits per second](https://en.wikipedia.org/wiki/Data-rate_units)) to a maximum speed of 150Mbps. 

After the installation by [CityFibre](https://cityfibre.com/homes), I was impressed with the setup and everything was working ok with the new router but Wi-Fi speeds weren't always better, suffering some drop-out and similar speeds to the prior setup. I needed to investigate this and figure out how to get the full speed throughout the property. 

The following sections go step-by-step through **what I did**, what you **need to know** and the **tactics you can try** to improve your Wi-Fi connection speeds. I'm not saying these things are guaranteed to work for you, but they've been a huge improvement for me and I wanted to share them.

## To begin, check your wired connection speed

A good starting point is to first check that you are receiving the increased speeds by connecting your device to the router with an Ethernet cable. I have an Xbox Series X console connected this way, which has a [Network connection speed test](https://support.xbox.com/en-GB/help/hardware-network/connect-network/xbox-one-connection-speed) in the settings menu.

You can compare your wired / wireless speed by using the same or another device and searching Google for "[internet speed test](https://www.google.com/search?q=internet+speed+test)".

Both of these methods give the download and upload speeds. The Xbox Series X was receiving the full 150Mbps so the upgrade was definitely working correctly through a wired connection.

## Understand the difference between 2Ghz and 5Ghz bands

Before progressing, it's important to understand what the 2.4Ghz and 5Ghz Wi-Fi bands are and their pros vs cons. 

Here is a crash course:

**2.4Ghz =** slower but larger coverage area - can also get interference from radios, bluetooth, other networks etc.

**5Ghz =** faster but smaller coverage area

Most routers auto-assign a device to a band based on how far away the device is and if the device is capable of using the 5Ghz band. More devices are assigned to the 2.4Ghz band and that can lead to crowding.

To check this out, login to your router admin panel at the local IP address http://192.168.1.1/ ... the **admin** username and password is typically found on the back of the router.

<article-image 
  src="https://res.cloudinary.com/dayqxxsip/image/upload/v1702654271/App%20Images/Blog%20Images/Article%20Images/Improving%20Wi-Fi%20Speeds/router-admin_i8yyts.png" 
  alt="Router admin" 
  loading="lazy" 
  styling=""
  caption="Your router admin panel should show which devices are connected to which band" 
  captionsrc="https://res.cloudinary.com/dayqxxsip/image/upload/v1702654271/App%20Images/Blog%20Images/Article%20Images/Improving%20Wi-Fi%20Speeds/router-admin_i8yyts.png" 
  :showsource="false">
</article-image>

In Wi-Fi Settings / Device Settings, you can then see which devices are connected to which band. If a device is connected to the 2.4Ghz band, that could be the reason for lower speeds! You can try re-connecting the device if you're close to the router to attempt to switch to the 5Ghz band.

## Move and elevate your router

Okay, starting with the basics, if your router is crammed into a cupboard or behind a huge TV, it's likely going to block the signal substantially.

You can try to move it to a higher location where it isn't blocked in. You may need to run the main cable connected to the router to a suitable spot then re-test the speeds.

<article-image 
  src="https://images.pexels.com/photos/579471/pexels-photo-579471.jpeg" 
  alt="Signal tower" 
  loading="lazy" 
  styling=""
  caption="There's a reason signal towers are very high! To boost the signal!" 
  captionsrc="https://www.pexels.com/photo/signal-tower-579471/" 
  :showsource="false">
</article-image>

## Check your device Wi-Fi network adapter

Sometimes for older devices, the built in Wi-Fi chip / receiver cannot actually connect to the faster 5Ghz channel. To check this I found a great article from Louisiana State University [Wireless: Determine if Computer Has 5GHz Network Band Capability (Windows)](https://grok.lsu.edu/article.aspx?articleid=17341). This can be summarised as:

* Search "**cmd**" in the Start Menu.
* Type "**netsh wlan show drivers**" in the Command Prompt & Press Enter.
* Look for the "**Radio types supported**" section.
* If the network adapter supports network mode **802.11ac**:
    * The computer supports both 2.4GHz and 5GHz - your network capability IS Dual-Band Compatible.
    * This is true if your computer supports both 802.11ac and 802.11n together as well.
* If the network adapter supports only network mode 802.11n:
    * The computer MAY OR MAY NOT have 2.4 GHz and 5GHz network capability and be Dual-Band Compatible.* 
* If the network adapter does not support either of these network modes, it IS NOT Dual-Band Compatible.

Where a device can only connect to the 2.4Ghz band it may still get okay speeds and have further range, but just won't be able to benefit from the much faster 5Ghz band.

<article-image 
  src="https://res.cloudinary.com/dayqxxsip/image/upload/v1702654272/App%20Images/Blog%20Images/Article%20Images/Improving%20Wi-Fi%20Speeds/before-wav-link_qrzusi.png" 
  alt="Network adapter capability before" 
  loading="lazy" 
  styling=""
  caption="Old built-in 'Wi-Fi' adapter showing no 802.11ac in Radio types supported" 
  captionsrc="https://res.cloudinary.com/dayqxxsip/image/upload/v1702654272/App%20Images/Blog%20Images/Article%20Images/Improving%20Wi-Fi%20Speeds/before-wav-link_qrzusi.png" 
  :showsource="false">
</article-image>

<article-image 
  src="https://res.cloudinary.com/dayqxxsip/image/upload/v1702654271/App%20Images/Blog%20Images/Article%20Images/Improving%20Wi-Fi%20Speeds/after-wav-link_gqajns.png" 
  alt="Network adapter capability after" 
  loading="lazy" 
  styling=""
  caption="New WAVLINK 'Wi-Fi 2' adapter now has 802.11ac in Radio types supported so supports 5Ghz band :)" 
  captionsrc="https://res.cloudinary.com/dayqxxsip/image/upload/v1702654271/App%20Images/Blog%20Images/Article%20Images/Improving%20Wi-Fi%20Speeds/after-wav-link_gqajns.png" 
  :showsource="false">
</article-image>

## Upgrade your device with an external Wi-Fi network adapter 

If it happens that your device Wi-Fi network adapter is older and incapable of connecting to the 5Ghz band then it might be a good time to upgrade with an external network adapter. Newer adapters are capable of pulling in greater speed, are dual-band so can connect to both the 2.4Ghz and 5Ghz bands and can hold the connection better for less drop-out.

The WAVLINK AC1900 USB WiFi Dongle has delivered the best improvement in speeds to my upstairs desktop PC, and seems future proof in that it's capable of pulling even greater speeds than my current maximum of 150Mbps. It should pull in up to 600Mbps on 2.4Ghz and up to 1300Mbps on 5Ghz bands respectively. So if you upgrade your plan with your ISP, you're covered - although I'm sure that would be more than you'll ever need.

<article-image 
  src="https://res.cloudinary.com/dayqxxsip/image/upload/v1702654271/App%20Images/Blog%20Images/Article%20Images/Improving%20Wi-Fi%20Speeds/wav-link_pswfom.png" 
  alt="WAVLINK Wi-Fi dongle" 
  loading="lazy" 
  styling=""
  caption="WAVLINK AC1900 USB WiFi Dongle for PC, Dual Band 1900Mbps WiFi Adapter" 
  captionsrc="https://www.amazon.co.uk/dp/B09KRK7TQT?ref=ppx_yo2ov_dt_b_product_details&th=1" 
  :showsource="false">
</article-image>

These links are on Amazon, I don't receive any commissions for these links and have used both products, you should be able to find these products elsewhere if you wish though. Both worked very well and pulled in consistent speeds close to the connection's max 150Mbps but can go even higher if your [ISP](https://en.wikipedia.org/wiki/Internet_service_provider) plan allows.

* [WAVLINK AC1900 USB WiFi Dongle for PC, Dual Band 1900Mbps WiFi Adapter for Desktop, Laptop PC with Magnetic Base, 4X 3dBi External Antennas, support Win 11/10/8/7/XP, Mac OS 10.7-10.15](https://www.amazon.co.uk/dp/B09KRK7TQT?ref=ppx_yo2ov_dt_b_product_details&th=1)

* [TP-Link AC600 High Gain USB Wi-Fi Dongle, Dual Band Wi-Fi Adapter with 5dBi Antenna for PC/Desktop/Laptop, Supports Windows11/10/8.1/8/7/XP, Mac OS X 10.9-10.14 (Archer T2U Plus)](https://www.amazon.co.uk/dp/B07PJV66CN?ref=ppx_yo2ov_dt_b_product_details&th=1)

My desktop PC hadn't moved, it was in the same location in my home office, as when I did the first test (left of the image below) however with the new WAVLINK network adapter the internet speed test had gone from 44 down 26 up to 147 down 144 up! You can see in the image below I'm connected to **Wi-Fi 2** which was the new WAVLINK external network adapter. 

<article-image 
  src="https://res.cloudinary.com/dayqxxsip/image/upload/v1702654272/App%20Images/Blog%20Images/Article%20Images/Improving%20Wi-Fi%20Speeds/speed-test-after_jvind6.png" 
  alt="Internet speed before vs after dongle" 
  loading="lazy" 
  styling=""
  caption="Internet speed tests before (left) and after (right) using external network adapter" 
  captionsrc="https://www.amazon.co.uk/dp/B09KRK7TQT?ref=ppx_yo2ov_dt_b_product_details&th=1" 
  :showsource="false">
</article-image>

It was a similar result with the lower profile [TP-Link AC600](https://www.amazon.co.uk/dp/B07PJV66CN?ref=ppx_yo2ov_dt_b_product_details&th=1) adapter on my laptop, but the WAVLINK seemed more robust and stable - with it's four prongs likely the reason! 

So is trying an external network adapter with your desktop PC and laptops worth a try? These results say absolutely!

A final point, usually with an external network adapter you must install the relevant driver for the new adapter. With WAVLINK you head to their site, download for Windows or Mac, then install. Pretty simple process and the instructions are on the box.

<article-image 
  src="https://res.cloudinary.com/dayqxxsip/image/upload/v1702654272/App%20Images/Blog%20Images/Article%20Images/Improving%20Wi-Fi%20Speeds/head-to-wavlink_hvqimp.png" 
  alt="Install external network adapter driver" 
  loading="lazy" 
  styling=""
  caption="Download and install the external network adapter driver" 
  captionsrc="https://res.cloudinary.com/dayqxxsip/image/upload/v1702654272/App%20Images/Blog%20Images/Article%20Images/Improving%20Wi-Fi%20Speeds/head-to-wavlink_hvqimp.png" 
  :showsource="false">
</article-image>

## Add a Wi-Fi mesh extender to avoid dead zones

Now we've covered using an external network adapter to improve **receiving** Wi-Fi network signal, what about improving general **outgoing** coverage to address dead-spots in the property? This proved to be a little tougher. The only thing I have tried so far is using a Wi-Fi mesh 'extender'. This effectively acts as a second router, which has the option to split the 2.4Ghz and 5Ghz bands on the extender. 

You can therefore end up with multiple access points or [SSID](https://en.wikipedia.org/wiki/Service_set_(802.11_network))s. I split up the bands and then named them something easy to understand like:

* TALKTALK-843
* TALKTALK-843_EXT_24
* TALKTALK-843_EXT_5

This gives the option to connect to the main router downstairs, or the extension upstairs either on the 2.4 or 5Ghz bands. I used the TP-Link range extender as seen below.

<article-image 
  src="https://res.cloudinary.com/dayqxxsip/image/upload/v1702654272/App%20Images/Blog%20Images/Article%20Images/Improving%20Wi-Fi%20Speeds/extender_eawyn0.png" 
  alt="WAVLINK Wi-Fi dongle" 
  loading="lazy" 
  styling=""
  caption="TP-Link AC750 Universal Dual Band Range Extender, Broadband/Wi-Fi Extender, Booster/Hotspot with Ethernet Port, Plug and Play, Smart Signal Indicator, UK Plug (RE220) ,White" 
  captionsrc="https://www.amazon.co.uk/dp/B07ZWBBPQN?ref=ppx_yo2ov_dt_b_product_details&th=1" 
  :showsource="false">
</article-image>

* [TP-Link AC750 Universal Dual Band Range Extender, Broadband/Wi-Fi Extender, Booster/Hotspot with Ethernet Port, Plug and Play, Smart Signal Indicator, UK Plug (RE220) ,White](https://www.amazon.co.uk/dp/B07ZWBBPQN?ref=ppx_yo2ov_dt_b_product_details&th=1)

This worked quite well, but still struggled in one room - must be a particular thick wall slightly blocking the signal. Still a very good improvement though with no drop.

The pros vs cons of using a Wi-Fi extender are:

Pros:
* You can choose which devices connect to which access point - spreading the network load
* You can choose which band you want to connect a device to
* It should improve coverage and reduce dead-spots

Cons:
* It is in effect still one connection just mirroring and relaying from the host router
* Can introduce interference as now there are two access points broadcasting
* It can only improve the coverage if it is still in range of the host router - ideally half way between the router and the dead-spot

It was an inexpensive option to try and it did boost coverage in certain rooms. It provides another option to try in combination with the others.


## Consider adding a wired connection for critical devices

I haven't taken this step yet, but I am considering it. A wired Ethernet connection may pull in similar speeds to a solid external network adapter, but the difference is reliability. Even the best Wi-Fi adapter may suffer drop-out at a critical moment like during a conference call or video interview. The chances of that happening with a wired connection is significantly less.

If you're not a fan of ripping open your walls to install network cable, then a DIY job of running (and hiding) flat Ethernet under the carpets or floorboards, up the stairs and along skirting boards and into your PC is an option. Is it an ideal solution? Nope. But as long as it's run where no one will disturb it this temporary fix might become a permanent one and also very reliable.

The one I'm looking to try from BUSOHE below claims to be flexible, durable and support over 30kg. Sounds tough to me. It's also flat so should be easier to lay under carpets neatly and away from footsteps.

<article-image 
  src="https://m.media-amazon.com/images/I/71grSt6AhsL._SL1500_.jpg" 
  alt="Ethernet cable" 
  loading="lazy" 
  styling=""
  caption="Consider adding a wired connection from the router to your critical devices" 
  captionsrc="https://www.amazon.co.uk/gp/product/B07QV7S2HT/ref=ox_sc_saved_title_2?smid=AJQDNWC8R613R&th=1" 
  :showsource="false">
</article-image>

* [BUSOHE Cat 8 Ethernet Cable 20m, High Speed Flat Gigabit RJ45 Lan Network Cable, 40Gbps 2000Mhz Internet Patch Cord for Switch, Router, Modem, Patch Panel, PC (White)](https://www.amazon.co.uk/gp/product/B07QV7S2HT/ref=ox_sc_saved_title_2?smid=AJQDNWC8R613R&th=1)

## Happy networking

This wasn't a typical analytical or programming article, however to write code and learn effectively, a strong stable internet connection is pretty vital! Worth giving this stuff some thought and taking the time to ensure you have the best connection possible so you can keep coding, learning and building great solutions without any worries.

Not only that, it's good to have a solid and stable setup for video calls, video streaming and screen sharing. All great tools in any digital role.

I hope this article gave you ideas and helped you to improve your network Wi-Fi speeds 😄 

If you enjoyed this article be sure to check out [other articles](/) on the site 👍 ]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Searching Markdown files for internal links and visualising with a Pyvis network graph]]></title>
            <link>https://shedloadofcode.com/blog/searching-markdown-files-for-internal-links-and-visualising-with-a-pyvis-network-graph/</link>
            <guid>https://shedloadofcode.com/blog/searching-markdown-files-for-internal-links-and-visualising-with-a-pyvis-network-graph/</guid>
            <pubDate>Fri, 08 Dec 2023 16:31:00 GMT</pubDate>
            <description><![CDATA[If you use Markdown, you can improve your web content strategy by visualising the relationships between internal links to identify your content clusters.]]></description>
            <content:encoded><![CDATA[
Lately I've been trying to improve the internal links on the site to improve the user experience. I wanted to check whether each article links to at least one other relevant article. 

I also wanted to understand what my content clusters looked like - the aim is to cover topics with a unique take or that are under-represented so they can help as many people as possible and avoid covering topics that are saturated. This also helps to keep efficient use of my time.

There is a 'related articles' section at the bottom but this works on the category and isn't in the body of the article. The articles are stored in Markdown files in GitHub to keep them backed up and version controlled, so the plan was to:

* Search the Markdown files and extract all internal links using [RegEx](https://docs.python.org/3/library/re.html)
* Produce and display a network visualisation to understand content clusters and relationships using [Pyvis](https://pyvis.readthedocs.io/en/latest/)

Pyvis is a wrapper for the popular [visJS](https://visjs.org/) JavaScript library, and it allows for easy generation of network graph visuals in Python.

If you want to follow along, a reproducible example can be [found in the GitHub repo](https://github.com/shedloadofcode/pyvis-network-graph-md) ready to clone or download. The main Python file is in the /utils folder, and the Markdown files containing internal links are in the /content/blog/ folder.

## Install packages

We'll only need to install two libraries, pyvis and pandas, so let's install those.

```
python -m pip install pyvis pandas
```

## Import libraries

In a new Python file **internal_links_graph.py**, we'll first import all libraries.

```python [internal_links_graph.py]
import os
import re
import pandas as pd
from pyvis.network import Network
```

## Searching the Markdown files

Next we need to create the edge data to feed into the network graph, by searching the Markdown files for internal links.

To do that, we need to:

* Define source (page linked from), target (page linked to), and weight (line weight) lists
* Set a regular expression to parse Markdown links
* Loop through and open each file in the given directory path, and for each:
    * Grab all links starting with **/blog/**
    * Append these to source, target and weight lists
* Zip the lists together and return

```python [internal_links_graph.py]
def get_edge_data() -> pd.DataFrame:
    source = []
    target = []
    weight = []
    pages_with_no_internal_links = set()

    count = 0
    path = "../content/blog"
    links_regex = re.compile(r'\[([^\]]+)\]\(([^)]+)\)')
    
    for filename in os.listdir(path):
        file_path = os.path.join(path, filename)
        name, extension = os.path.splitext(filename)
        count += 1

        try:
            with open(file_path, encoding="utf8") as f:
                md = f.read()
                links = list(links_regex.findall(md))
                links_added = 0

                for link in links:
                    if link[1].startswith("/blog/"):
                        source.append("/blog/" + name + "/")
                        target.append(link[1])
                        weight.append(0.4)
                        links_added += 1
                    
                if links_added == 0:
                    pages_with_no_internal_links.add(name)
        except Exception as error:
            print("An exception occurred:", error)
    
    print(f"{count} files searched.")

    print(f"{len(source)} sources and {len(target)} targets.", end="\n\n")

    print(f"{len(pages_with_no_internal_links)} pages with no internal links:")

    for link in pages_with_no_internal_links:
        print(link)

    return zip(source, target, weight)
```

## Producing the network graph

Now we have the **edge_data** of all source and target pages, we can build a network graph to visualise the nodes by:

* Defining a new **Network** with the given properties
* Add each item in **edge_data** to as a network node
* Add hover information to each node
* Output the network graph to an HTML file **links.html**

```python [internal_links_graph.py]
def display_graph(edge_data) -> None:
    net = Network(height="900px", 
                  width="100%", 
                  directed=True,
                  bgcolor="#222222", 
                  font_color="#b1b4b6",
                  select_menu=True, 
                  filter_menu=True,
                  cdn_resources="remote")
    
    net.show_buttons(filter_=["nodes", "physics"])

    for e in edge_data:
        src = e[0]
        dst = e[1]
        w = e[2]

        net.add_node(src, src, title=src)   
        net.add_node(dst, dst, title=dst)
        net.add_edge(src, dst, value=w)

    neighbor_map = net.get_adj_list()

    # add neighbor data to node hover data
    for node in net.nodes:
        node["title"] += " links to:\n" + "\n".join(neighbor_map[node["id"]])
        node["value"] = len(neighbor_map[node["id"]])

    net.show("links.html", notebook=False)
```

## Run the file

Finally, we can add the two function calls to the script to get the edge data and display the graph.

```python [internal_links_graph.py]
if __name__ == "__main__":
    edge_data = get_edge_data()
    display_graph(edge_data)
```

To run the program in a new terminal or command line we can use:

```
python internal_links_graph.py
```

## Full code

```python [internal_links_graph.py]
"""Searches the Markdown files for internal links in blog articles.

Reads in the all files in the /content/blog directory and then searches for
any link which contains /blog/.

Outputs the results of this to a graph visual 'links.html'

Install packages using `pip install pandas pyvis`
"""
import os
import re
import pandas as pd
from pyvis.network import Network

def get_edge_data() -> pd.DataFrame:
    source = []
    target = []
    weight = []
    pages_with_no_internal_links = set()

    count = 0
    path = "../content/blog"
    links_regex = re.compile(r'\[([^\]]+)\]\(([^)]+)\)')
    
    for filename in os.listdir(path):
        file_path = os.path.join(path, filename)
        name, extension = os.path.splitext(filename)
        count += 1

        try:
            with open(file_path, encoding="utf8") as f:
                md = f.read()
                links = list(links_regex.findall(md))
                links_added = 0

                for link in links:
                    if link[1].startswith("/blog/"):
                        source.append("/blog/" + name + "/")
                        target.append(link[1])
                        weight.append(0.4)
                        links_added += 1
                    
                if links_added == 0:
                    pages_with_no_internal_links.add(name)
        except Exception as error:
            print("An exception occurred:", error)
    
    print(f"{count} files searched.")

    print(f"{len(source)} sources and {len(target)} targets.", end="\n\n")

    print(f"{len(pages_with_no_internal_links)} pages with no internal links:")

    for link in pages_with_no_internal_links:
        print(link)

    return zip(source, target, weight)

def display_graph(edge_data) -> None:
    net = Network(height="900px", 
                  width="100%", 
                  directed=True,
                  bgcolor="#222222", 
                  font_color="#b1b4b6",
                  select_menu=True, 
                  filter_menu=True,
                  cdn_resources="remote")
    
    net.show_buttons(filter_=["nodes", "physics"])

    for e in edge_data:
        src = e[0]
        dst = e[1]
        w = e[2]

        net.add_node(src, src, title=src)   
        net.add_node(dst, dst, title=dst)
        net.add_edge(src, dst, value=w)

    neighbor_map = net.get_adj_list()

    # add neighbor data to node hover data
    for node in net.nodes:
        node["title"] += " links to:\n" + "\n".join(neighbor_map[node["id"]])
        node["value"] = len(neighbor_map[node["id"]])

    net.show("links.html", notebook=False)

if __name__ == "__main__":
    edge_data = get_edge_data()
    display_graph(edge_data)
```

## What I learnt about the content clusters

The main takeaway from plotting all of the content in a network graph, was that there wasn't enough internal linking throughout the site. 

I spent some time to embed relevant content links in other articles and the outcome was a collection of strong content clusters. The clusters included web scraping, automation, data science and analysis, and web app development.

The first image below shows what the network looked like before these improvements, and the second what it looks like now.

<article-image 
  src="https://res.cloudinary.com/dayqxxsip/image/upload/v1700157938/App%20Images/Blog%20Images/Article%20Images/Pyvis%20Network%20Graph/fixed-before-pvis-graph_znjitw.png" 
  alt="Before improvements" 
  loading="lazy" 
  styling=""
  caption="Before improvements content had few internal links between them" 
  captionsrc="https://res.cloudinary.com/dayqxxsip/image/upload/v1700157938/App%20Images/Blog%20Images/Article%20Images/Pyvis%20Network%20Graph/fixed-before-pvis-graph_znjitw.png" 
  :showsource="false">
</article-image>

<article-image 
  src="
https://res.cloudinary.com/dayqxxsip/image/upload/v1700157938/App%20Images/Blog%20Images/Article%20Images/Pyvis%20Network%20Graph/pyviz-after_ffpa7m.png" 
  alt="After improvements" 
  loading="lazy" 
  styling=""
  caption="After improvements strong content clusters emerge with lots more internal linking" 
  captionsrc="
https://res.cloudinary.com/dayqxxsip/image/upload/v1700157938/App%20Images/Blog%20Images/Article%20Images/Pyvis%20Network%20Graph/pyviz-after_ffpa7m.png" 
  :showsource="false">
</article-image>

You can see from the HTML file output the network graph can be searched and filtered using the top dropdowns. This is because earlier we passed **True** to both **select_menu** and **filter_menu** when creating the **Network** object.

The image below shows filtering the example from the GitHub repo by a given path. Very useful for quickly identifying and highlighting nodes in a larger network.

<article-image 
  src="
https://res.cloudinary.com/dayqxxsip/image/upload/v1702057609/App%20Images/Blog%20Images/Article%20Images/Pyvis%20Network%20Graph/searching-example_cixrya.png" 
  alt="After improvements" 
  loading="lazy" 
  styling=""
  caption="The HTML graph output can be filtered on a given node" 
  captionsrc="https://res.cloudinary.com/dayqxxsip/image/upload/v1702057609/App%20Images/Blog%20Images/Article%20Images/Pyvis%20Network%20Graph/searching-example_cixrya.png" 
  :showsource="false">
</article-image>

## Happy networking

I hope you were able to apply this methodology to your own use case. Although you might not store your content in Markdown, I am sure this could be adapted to search other formats with a similar setup.

Visualising relationships like this through nodes in a network graph is very powerful. It certainly helped to deliver more relevant internal links to articles and visualise the content clusters. Pyvis can also be [integrated with NetworkX](https://pyvis.readthedocs.io/en/latest/tutorial.html#networkx-integration). [NetworkX](https://networkx.org/documentation/stable/index.html) is a Python package for the creation, manipulation, and study of the structure, dynamics, and functions of complex networks.

If you enjoyed this article be sure to check out other articles on the site 👍 you may be interested in:

* [Searching for text in PDFs at increasing scale](/blog/searching-for-text-in-pdfs-at-increasing-scale/)
* [How to match and count keywords in text using JavaScript](/blog/how-to-match-and-count-keywords-in-text-using-javascript/)
* [Developing your data science and analytical coding skills - a review of DataCamp](/blog/developing-your-data-science-and-analytical-coding-skills-a-review-of-datacamp/) for improving your Python skills]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Record mouse and keyboard for automation scripts with Python]]></title>
            <link>https://shedloadofcode.com/blog/record-mouse-and-keyboard-for-automation-scripts-with-python/</link>
            <guid>https://shedloadofcode.com/blog/record-mouse-and-keyboard-for-automation-scripts-with-python/</guid>
            <pubDate>Sat, 02 Dec 2023 16:05:00 GMT</pubDate>
            <description><![CDATA[Learn how to record mouse clicks and keyboard input with pynput then convert that to a PyAutoGUI automation script for playback.]]></description>
            <content:encoded><![CDATA[
In this article, we'll take a look at how to record mouse clicks and keyboard input with [pynput](https://pynput.readthedocs.io/en/latest/) then convert that to a [PyAutoGUI](https://pyautogui.readthedocs.io/en/latest/index.html) automation script for playback. 

## Why build a mouse and keyboard recorder?

The short answer is to automate boring, time-consuming and repetitive tasks and let Python do them instead while you go enjoy a coffee ☕

PyAutoGUI is excellent for click and type automation tasks, but one of the weaknesses I found with it, is that it's difficult to 'record' a task and get the xy coordinates for the mouse clicks. There is an option to [take screenshots and locate images within the screen](https://pyautogui.readthedocs.io/en/latest/screenshot.html) but I could never get this to work accurately - mouse xy coordinates are much more reliable.

The [documentation](https://pyautogui.readthedocs.io/en/latest/mouse.html) features a useful program that will constantly print out the position of the mouse cursor:

```python
#! python3
import pyautogui, sys
print('Press Ctrl-C to quit.')
try:
    while True:
        x, y = pyautogui.position()
        positionStr = 'X: ' + str(x).rjust(4) + ' Y: ' + str(y).rjust(4)
        print(positionStr, end='')
        print('\b' * len(positionStr), end='', flush=True)
except KeyboardInterrupt:
    print('\n')
```

But then you'd have to find all the coordinates and script that up seperately, a tedious task!

Previously I explored [Creating a screen and mouse jiggler with Python](/blog/creating-a-screen-and-mouse-jiggler-with-python/) which was great for keeping the screen active. 

Taking this a step further, actually recording the coordinates and also keyboard input then outputting that as a script, would be far better for automation tasks. 

I checked out a few existing tools like [record-and-play-pynput](https://github.com/george-jensen/record-and-play-pynput) and [pyautogui-mouse-record](https://github.com/DepictYourself/pyautogui-mouse-record) but none really satisfied what I was looking for, but they did give me a good start and inspiration. 

## How to run the recorder

You'll need to install a few Python packages first.

```
python -m pip install pynput pyautogui
```

Now let's go through step-by-step how to use this mouse and keyboard recorder. 

* Run `python record.py` to start the recording
* To end the recording:
  - Hold right click for 2 seconds then release to end the recording for mouse.
  - Press 'ESC' to end the recording for keyboard.
  - Both are needed to finish recording.
  - The recorded mouse and keyboard actions will be saved as 'recording.json'
* Run `python convert.py` to convert 'recording.json' into a PyAutoGUI script
  - The conversion will be saved as 'play.py'
* Run `python play.py` to play back the actions 😄 

All of the code can be found below or in [the GitHub repo](https://github.com/shedloadofcode/mouse-and-keyboard-recorder). Also, at the end there is a video demo of the recorder in action.

<subscribe-form></subscribe-form>

## Record mouse and keyboard

The first step is to record the mouse and keyboard input. To do this, we are using pynput to listen for on press and on click, then storing those events as a dictionary in the **recording** list. Once both listeners are terminated, we store this in a file **recording.json**

```python [record.py]
"""
Records mouse and keyboard and outputs the actions
to a JSON file recording.json 

To begin recording:
- Run `python record.py`

To end recording:
- Hold right click for 2 seconds then release to end the recording for mouse.
- Press 'ESC' to end the recording for keyboard.
- Both are needed to finish recording.
"""
import time
import json
from pynput import mouse, keyboard

print("Hold right click for 2 seconds then release to end the recording for mouse")
print("Click 'ESC' to end the recording for keyboard")
print("Both are needed to finish recording")

recording = [] 
count = 0

def on_press(key):
    try:
        json_object = {
            'action':'pressed_key', 
            'key':key.char, 
            '_time': time.time()
        }
    except AttributeError:
        if key == keyboard.Key.esc:
            print("Keyboard recording ended.")
            return False

        json_object = {
            'action':'pressed_key', 
            'key':str(key), 
            '_time': time.time()
        }
        
    recording.append(json_object)


def on_release(key):
    try:
        json_object = {
            'action':'released_key', 
            'key':key.char, 
            '_time': time.time()
        }
    except AttributeError:
        json_object = {
            'action':'released_key', 
            'key':str(key), 
            '_time': time.time()
        }

    recording.append(json_object)
        

def on_move(x, y):
    if len(recording) >= 1:
        if (recording[-1]['action'] == "pressed" and \
            recording[-1]['button'] == 'Button.left') or \
            (recording[-1]['action'] == "moved" and \
            time.time() - recording[-1]['_time'] > 0.02):
            json_object = {
                'action':'moved', 
                'x':x, 
                'y':y, 
                '_time':time.time()
            }

            recording.append(json_object)


def on_click(x, y, button, pressed):
    json_object = {
        'action':'clicked' if pressed else 'unclicked', 
        'button':str(button), 
        'x':x, 
        'y':y, 
        '_time':time.time()
    }

    recording.append(json_object)

    if len(recording) > 1:
        if recording[-1]['action'] == 'unclicked' and \
           recording[-1]['button'] == 'Button.right' and \
           recording[-1]['_time'] - recording[-2]['_time'] > 2:
            with open('recording.json', 'w') as f:
                json.dump(recording, f)
            print("Mouse recording ended.")
            return False


def on_scroll(x, y, dx, dy):
    json_object = {
        'action': 'scroll', 
        'vertical_direction': int(dy), 
        'horizontal_direction': int(dx), 
        'x':x, 
        'y':y, 
        '_time': time.time()
    }

    recording.append(json_object)


def start_recording():
    keyboard_listener = keyboard.Listener(
        on_press=on_press,
        on_release=on_release)

    mouse_listener = mouse.Listener(
            on_click=on_click,
            on_scroll=on_scroll,
            on_move=on_move)

    keyboard_listener.start()
    mouse_listener.start()
    keyboard_listener.join()
    mouse_listener.join()


if __name__ == "__main__":
    start_recording()
    
```

## Convert JSON output to PyAutoGUI script

Now we have the **recording.json** file, we can use that to convert it into a Python script. We are excluding mouse release and scroll events as these don't really help for the purposes of conversion.

```python [convert.py]
"""
Converts the recording.json file to a Python script 
'play.py' to use with PyAutoGUI.

The 'play.py' script may require editing and adapting 
before use.

Always review 'play.py' before running with PyAutoGUI!
"""
import json

key_mappings = {
    "cmd": "win",
    "alt_l": "alt",
    "alt_r": "alt",
    "ctrl_l": "ctrl",
    "ctrl_r": "ctrl"
}


def read_json_file():
    """
    Takes the JSON output 'recording.json'

    Excludes released and scrolling events to 
    keep things simple.
    """
    with open('recording.json') as f:
        recording = json.load(f)

    def excluded_actions(object):
        return "released" not in object["action"] and \
               "scroll" not in object["action"]

    recording = list(filter(excluded_actions, recording))

    return recording


def convert_to_pyautogui_script(recording):
    """
    Converts to a Python template script 'play.py' to 
    use with PyAutoGUI.

    Converts the:

    - Mouse clicks
    - Keyboard input
    - Time between actions calculated
    """
    if not recording: 
        return
    
    output = open("play.py", "w")
    output.write("import time\n")
    output.write("import pyautogui\n\n")
    
    for i, step in enumerate(recording):
        print(step)

        not_first_element = (i - 1) > 0
        if not_first_element:
            ## compare time to previous time for the 'sleep' with a 10% buffer
            pause_in_seconds = (step["_time"] - recording[i - 1]["_time"]) * 1.1 

            output.write(f"time.sleep({pause_in_seconds})\n\n")
        else:
            output.write("time.sleep(1)\n\n")

        if step["action"] == "pressed_key":
            key = step["key"].replace("Key.", "") if "Key." in step["key"] else step["key"]

            if key in key_mappings.keys():
                key = key_mappings[key]

            output.write(f"pyautogui.press('{key}')\n")
        
        if step["action"] == "clicked":
            output.write(f"pyautogui.moveTo({step['x']}, {step['y']})\n")

            if step["button"] == "Button.right":
                output.write("pyautogui.mouseDown(button='right')\n")
            else:
                output.write("pyautogui.mouseDown()\n")

        if step["action"] == "unclicked":
            output.write(f"pyautogui.moveTo({step['x']}, {step['y']})\n")

            if step["button"] == "Button.right":
                output.write("pyautogui.mouseUp(button='right')\n")
            else:
                output.write("pyautogui.mouseUp()\n")

    print("Recording converted. Saved to 'play.py'")


if __name__ == "__main__":
    recording = read_json_file()
    convert_to_pyautogui_script(recording)
```

As some of the keys from pynput don't correspond directly to PyAutoGUI, the **key_mappings** dictionary helps out with this. If you come across any more, you can add to this dictionary taking the pynput key and mapping it to the relevant PyAutoGUI [keyboard keys](https://pyautogui.readthedocs.io/en/latest/keyboard.html#keyboard-keys).

## Play the automation script

Once the conversion ends, **play.py** will contain a PyAutoGUI script that will look something like:

```python [play.py]
import time
import pyautogui

time.sleep(1)

pyautogui.press('win')
time.sleep(1)

pyautogui.press('f')
time.sleep(0.22220540046691897)

pyautogui.press('i')
time.sleep(0.10727632045745851)

pyautogui.press('r')
time.sleep(0.08800437450408936)

pyautogui.press('e')
time.sleep(0.5824827909469605)

pyautogui.press('f')
time.sleep(0.11989445686340333)

pyautogui.press('o')
time.sleep(0.22220461368560793)

pyautogui.press('x')
time.sleep(2.674463224411011)

pyautogui.moveTo(206, 219)
pyautogui.mouseDown()
time.sleep(0.07921419143676758)

pyautogui.moveTo(206, 219)
pyautogui.mouseUp()
time.sleep(5.592976307868958)

pyautogui.moveTo(522, 68)
pyautogui.mouseDown()
time.sleep(0.11439170837402345)
```

Here is a quick end-to-end video demo recording, converting then playing back an automation process - an example of opening Firefox, navigating to W3Schools, searching for Python, copying some code, then pasting it into Visual Studio Code. This uses left click, right click and keyboard input so applicable to a real-world scenario.

<article-video 
  id="JvK0kXEgDgo" 
  title="Record mouse and keyboard for automation scripts with Python">
</article-video>

## Final cut

Okay this was another fun Python automation article, now you know how to create a mouse and keyboard recorder with Python, and have a solid start to building more advanced robotic process automation (RPA) solutions with PyAutoGUI. You can refer to the [documentation](https://pyautogui.readthedocs.io/en/latest/) for more guidance on using PyAutoGUI and think about what else you might like to build 😄

Although there is functionality for [controlling the mouse with pynput](https://pynput.readthedocs.io/en/latest/mouse.html) I still prefer to have a PyAutoGUI output script. 

This program can be modified and adapted further to your needs. You could read in some data with [pandas](https://pandas.pydata.org/) and then introduce a for loop to repeat an automation process for multiple inputs during playback.

If you enjoyed this article be sure to check out other articles on the site including:

* [Creating a screen and mouse jiggler with Python](/blog/creating-a-screen-and-mouse-jiggler-with-python/) for another Python and PyAutoGUI use case 
* [Developing your data science and analytical coding skills - a review of DataCamp](/blog/developing-your-data-science-and-analytical-coding-skills-a-review-of-datacamp/) for improving your Python skills

Finally, if you have any questions or if you decide to use or extend this program, please leave a comment below. I'd love to know what you use it for and how it's helped you out 👍]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Developing your data science and analytical coding skills - a review of DataCamp]]></title>
            <link>https://shedloadofcode.com/blog/developing-your-data-science-and-analytical-coding-skills-a-review-of-datacamp/</link>
            <guid>https://shedloadofcode.com/blog/developing-your-data-science-and-analytical-coding-skills-a-review-of-datacamp/</guid>
            <pubDate>Mon, 13 Nov 2023 11:47:00 GMT</pubDate>
            <description><![CDATA[Get a solid overview of DataCamp, what it is, who it's good for, how to get started, my experience using it, and why it might be a good fit to further your data science and career skills.]]></description>
            <content:encoded><![CDATA[
<affiliate-disclaimer></affiliate-disclaimer>

In this article, we will explore quite an in-depth overview of [DataCamp](https://datacamp.pxf.io/EKAK42), what it is, who it's for, how to get started and get the most out of it, alongside my experiences of using DataCamp to develop data science and career skills. 

I hope this review can give you a solid starting point to decide whether DataCamp is right for you. Let's begin!

## What is DataCamp?

DataCamp is an online learning platform and a powerful resource for learning how to code for data science.

> Develop in-demand data science and AI skills at your own pace with 460+ courses. Learn SQL, Python, R, Tableau, PowerBI, ChatGPT and more with interactive exercises. Follow short videos led by expert instructors and then practice what you've learned with hands-on exercises in your browser.

## Who is DataCamp good for?

* Beginners who want to learn how to code for data analysis, data science and / or data engineering
* Intermediate analysts who want to explore more complex data science topics
* Professional analysts who want to quickly refresh skills for a project or carry out continuous professional development

## My experience with DataCamp

My first use of DataCamp way back in 2018 was through the [Microsoft Professional Certificate in Data Science](https://devblogs.microsoft.com/premier-developer/microsoft-professional-program-for-data-science-sharpen-your-data-science-skills/) where it was used for the practical coding sections. I was both very impressed and hooked on data science, so subscribed for a yearly subscription to really commit to the change of career specialism.

I completed that alongside [Harvard's Professional Certificate in Computer Science for Artificial Intelligence](/blog/concepts-of-artificial-intelligence-with-python-a-review-of-cs50-ai/). Both of these courses were essential for me to break in to the field of data science and software development. Statistics accounted for 50% of my undergraduate degree but had no where near the hands on coding experience DataCamp provided.

Back then, I studied all of the introductary courses for [Python](https://datacamp.pxf.io/217ayQ), [R](https://datacamp.pxf.io/NkE9R2) and [SQL](https://datacamp.pxf.io/vNmPQj). This gave me an excellent foundation for understanding how to use code to interrogate data and solve business problems.

Since then, I joined a large employer who provides a business subscription to DataCamp. This really helps me to balance a full-time job with learning. Ongoing professional development is vital, and this also helps when a project comes up I need a refresher on or a technique I’ve not used before or in a while. 

We have recently started using Azure Databricks with PySpark for a prediction project, so the courses I am doing right now include: 

* [Introduction to Azure](https://datacamp.pxf.io/k0AOJd)
* [Introduction to PySpark](https://datacamp.pxf.io/rQWaJ3)
* [Supervised Learning with scikit-learn](https://datacamp.pxf.io/q4ozOY)

## Pricing and free tier

When it comes to [pricing](https://datacamp.pxf.io/c/4971160/1112312/13294), it is very clear and easy to select your currency from the dropdown at the top right. 

<article-image 
  src="https://res.cloudinary.com/dayqxxsip/image/upload/v1699722928/App%20Images/Blog%20Images/Article%20Images/Datacamp%20Review/pricing_pkpihf.png" 
  alt="DataCamp's pricing page" 
  loading="lazy" 
  styling=""
  caption="DataCamp's pricing page" 
  captionsrc="https://datacamp.pxf.io/c/4971160/1112312/13294" 
  :showsource="false">
</article-image>

At the time of writing, there is a discount for a yearly subscription opposed to a monthly subscription which is great if you're ready to dedicate yourself to learning data science. Much like a gym membership, I think once you commit for the long term, you stick with it and make progress. 

In terms of advancing your career, gaining access to an immense library of content and the ability to practice coding plus gain certification, I feel this price is very reasonable. When comparing the pricing to typical [undergraduate tuition fees](https://www.ucas.com/finance/undergraduate-tuition-fees-and-student-loans#how-much-are-tuition-fees), I think the yearly pricing represents exceptional value for hands-on learning.

 In the unlikely event that you try it and really don't gel with it then [you can cancel easily](https://support.datacamp.com/hc/en-us/articles/360001546054-How-do-I-cancel-my-subscription-#h_01HEAHEBB21SH0MHY7VP71Y8V4).

Also, take advantage of the limited access free tier - you get every first chapter free.

## Offers and promotions

From time to time there are promotions and discounts so be sure to take advantage of these if you decide DataCamp is right for you. 

Here is a list I will keep updated with current and upcoming promotions and discounts:

* [Student Discount - 50% Off for Students by subscribing to our Premium Student Plan!](https://datacamp.pxf.io/c/4971160/1611874/13294)
* [End of Year Sale - 50% Off](https://datacamp.pxf.io/c/4971160/2261972/13294) Dec 8, 2024 05:00 – Dec 28, 2024 04:59

## Getting started with DataCamp

After logging in to [DataCamp](https://datacamp.pxf.io/EKAK42), the Learn hub is the main place to access learning materials.

<article-image 
  src="https://res.cloudinary.com/dayqxxsip/image/upload/v1699722928/App%20Images/Blog%20Images/Article%20Images/Datacamp%20Review/leader_znzqhk.png" 
  alt="DataCamp's Learn hub" 
  loading="lazy" 
  styling=""
  caption="DataCamp's Learn hub" 
  captionsrc="https://datacamp.pxf.io/1rz7ga" 
  :showsource="false">
</article-image>

Although the video below is geared for business users, it's very helpful to everyone getting started with the basics of DataCamp including:

* Tracks - career or skill tracks currate courses into a guided track.
* Courses - interactive courses combining short videos with hands-on exercises.
* Practice - quick daily challenges to keep skills sharp.
* Assessments - test your skills to find your weak areas.
* Tutorials - lots of articles and how-to guides.
* Projects and Case Studies - solve real world problems guided or unguided.

<article-video 
  id="oO2RFvpHjDg" 
  title="DataCamp 101: Getting Started with DataCamp">
</article-video>

## Making the most of DataCamp 

* Certifications - DataCamp Certification is an official recognition and a great way to prove your skills are job-ready.
* [Workspace](https://datacamp.pxf.io/rQWaL3) - personal in-browser tool to write code, and share your data analysis. Think of this as a cloud based Jupyter notebook-like tool.
* Competitions - apply skills to a real world task and compare to other DataCamp learners.
* Code Alongs - webinars and events.
* Popular topics - learn about new and trending tech like ChatGPT.

<article-video 
  id="O8nCMV0XVdo" 
  title="DataCamp 101: 5 Ways to Make the Most of DataCamp">
</article-video>

## Does DataCamp have any weaknesses?

One of the downsides I've heard is that sometimes DataCamp can feel too much like a 'fill in the gaps' puzzle. I get this to an extent, but it's really important to not blindly go through the exercise, but to try and understand the exercise instead. 

DataCamp is excellent at providing a taste of what an aspiring data scientist needs to start with. If aspiring analysts/data scientists become very interested in what they are exposed to, they'll then complement this with other learning methods and research wider (YouTube videos, textbooks, articles and so on).

For me, DataCamp is like a flight simulator; it teaches you what you need to know in a controlled environment, where you can make mistakes but don’t forget you also need to prepare for the real thing in a business setting which includes:

* Setting up an IDE such as RStudio, Spyder, Visual Studio Code, PyCharm on your own machine
* Installing Python (Base or Anaconda) or R on your own machine
* Using cloud tools like Azure, AWS, Google Cloud Platform, Databricks
* Setting up and configuring cloud databases with SSMS, Postgres etc
* Gathering requirements from real business stakeholders
* Selecting an explainable model for classification / regression given a business problem
* Managing a project from start to finish; delivering a working solution
* Presenting analysis to real stakeholders

<article-image 
  src="https://images.pexels.com/photos/3862137/pexels-photo-3862137.jpeg?auto=compress&cs=tinysrgb&w=1260&h=750&dpr=1" 
  alt="Flight simulator" 
  loading="lazy" 
  styling=""
  caption="'For me, DataCamp is like a flight simulator, it teaches you what you need to know in a controlled environment'" 
  captionsrc="https://www.pexels.com/photo/engineer-in-flight-simulator-3862137/" 
  :showsource="false">
</article-image>

Your first day as a Data Scientist probably won't include firing up DataCamp! However, gaining the skills required to land and carry out that role, it may well provide you. 

You can check out the article [Preparing for a statistical data science interview](/blog/preparing-for-a-statistical-data-science-interview/) if you're preparing to apply.

## What do others think about DataCamp?

To get a feel for what others think and their experiences, check out the [stories](https://datacamp.pxf.io/1rz7ga) page which has lots of learner outcomes.

<article-image 
  src="https://res.cloudinary.com/dayqxxsip/image/upload/v1699722929/App%20Images/Blog%20Images/Article%20Images/Datacamp%20Review/stories_ybzlz6.png" 
  alt="Flight simulator" 
  loading="lazy" 
  styling=""
  caption="Read about learner stories..." 
  captionsrc="https://datacamp.pxf.io/1rz7ga" 
  :showsource="false">
</article-image>

I also read a really interesting article on [How One Learner Saved 1,500+ Hours of Work By Taking 200+ Courses and Amassing 1,000,000+ XP](https://www.datacamp.com/blog/how-one-learner-saved-1500-hours-of-work-by-taking-200-courses-and-amassing-1000000-xp).

<subscribe-form></subscribe-form>

## Alternatives to DataCamp

It wouldn't be fair to finish the review without acknowledging alternatives to DataCamp. Although DataCamp is excellent for data science, if you have a slightly different goal in mind, another service may be better suited to you. These might include:

* [Pluralsight](https://www.pluralsight.com/) - interactive and video courses on all areas of tech 
* [edX](https://www.edx.org/) - courses from big-name universities and colleges with optional paid certificates
* [Coursera](https://www.coursera.org/) - video courses on many topics including coding
* [Udemy](https://udemy.com/) - video courses on many topics including coding
* [freeCodeCamp](https://www.freecodecamp.org/) - free interactive coding courses 

I've used all of these in the past, my favourites were freeCodeCamp, edX and Pluralsight. My opinion is that freeCodeCamp is great for starting out, edX offers accreditation from universities and colleges like [Harvard's CS50 AI](/blog/concepts-of-artificial-intelligence-with-python-a-review-of-cs50-ai/), and Pluralsight is another enterprise favourite for tech with Microsoft usually offering a 3 month trial with their Visual Studio Enterprise / Professional subscriptions.

## Final verdict 

The overall conclusion to this review is that DataCamp is a fantastic resource for learning data science. It may not be perfect, nothing is, but it is one of the best tools out there to improve or maintain data science skills. It is no surprise that [80% of the Fortune 1000 use it](https://datacamp.pxf.io/AW5Poa).

A final thought is that every role seems to be demanding more skills in analysis, statistics and using data to make better decisions. This means that not just data scientists or data engineers need data skills, everyone does.

If you enjoyed this article be sure to check out [other articles](/) on the site. If you have any questions feel free to leave a comment 👍]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[How to scrape AutoTrader with Python and Selenium to search for multiple makes and models]]></title>
            <link>https://shedloadofcode.com/blog/how-to-scrape-autotrader-with-python-and-selenium-to-search-for-multiple-makes-and-models/</link>
            <guid>https://shedloadofcode.com/blog/how-to-scrape-autotrader-with-python-and-selenium-to-search-for-multiple-makes-and-models/</guid>
            <pubDate>Sun, 05 Nov 2023 17:31:00 GMT</pubDate>
            <description><![CDATA[Take this new AutoTrader UK web scraper for a spin! It can search for and filter multiple makes and models to help you easily compare and make the right decision quicker.]]></description>
            <content:encoded><![CDATA[
Searching for used cars can be time consuming and sometimes there isn't a good way to easily compare potential cars. AutoTrader is a great place to perform this search and comparison but as far as I can see, it does not allow to search for multiple makes and models in one search. 

Who wants to keep going back and forth between previously saved searches, right? Wouldn't it be so much easier if you could compare all of them in one list or spreadsheet? We'll explore the Python code that does just that using both [Selenium](https://selenium-python.readthedocs.io/) and [regular expressions](https://docs.python.org/3/library/re.html) (RegEx), along with a video demo of how to use it. 

## Installing required Python packages

Of course, you'll need the latest stable version of [Python](https://www.python.org/downloads/) installed on your operating system and added to path before progressing. I'm also using [Visual Studio Code](https://code.visualstudio.com/) as the code editor, this isn't essential but it's a great free lightweight IDE worth checking out.

Following that, the autotrader scraper will rely on a few Python packages so using pip, install the following for specific version I used at the time of writing:

```
python -m pip install numpy pandas==2.2.3 bs4==0.0.1 selenium==4.15.1 xlsxwriter==1.4.3
```

Or the latest versions with:

```
python -m pip install numpy pandas bs4 selenium xlsxwriter
```


The main libraries we are using here are:

* [Selenium](https://selenium-python.readthedocs.io/) to control ChromeDriver, navigate to URLs etc.
* [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) to parse and search the HTML elements
* [Pandas](https://pandas.pydata.org/) for data manipulation and calculations
* [XlsxWriter](https://xlsxwriter.readthedocs.io/) to create the Excel output including conditional formatting

All other libraries such as [os](https://docs.python.org/3/library/os.html) [re](https://docs.python.org/3/library/re.html), [time](https://docs.python.org/3/library/time.html) and [datetime](https://docs.python.org/3/library/datetime.html) come as standard with the [Python standard library](https://docs.python.org/3/library/index.html). 

## Downloading ChromeDriver

Selenium effectively 'controls' or 'drives' a web browser in an automated way. In order to do that, we need ChromeDriver, and we need the version that matches your current version of [Chrome](https://www.google.com/intl/en_uk/chrome/). 

My version of Chrome was **'Version 119.0.6045.106 (Official Build) (64-bit)'**. You can find your current version of Chrome by hitting the three dots in the top right of the browser > Help > About Google Chrome. 

You will see your current version and an option to update if it isn't the latest version.

<article-image 
  src="https://res.cloudinary.com/dayqxxsip/image/upload/v1699200976/App%20Images/Blog%20Images/Article%20Images/Autotrader%20Scraper%202023/chrome-version_wcksut.png" 
  alt="Chrome version" 
  loading="lazy" 
  styling=""
  caption="Finding the current version of Chrome" 
  captionsrc="https://res.cloudinary.com/dayqxxsip/image/upload/v1699200976/App%20Images/Blog%20Images/Article%20Images/Autotrader%20Scraper%202023/chrome-version_wcksut.png" 
  :showsource="false">
</article-image>

So based on that, I required the [latest stable version of ChromeDriver](https://googlechromelabs.github.io/chrome-for-testing/) for my machine which was ['119.0.6045.105 win64'](https://edgedl.me.gvt1.com/edgedl/chrome/chrome-for-testing/119.0.6045.105/win64/chromedriver-win64.zip).

<article-image 
  src="https://res.cloudinary.com/dayqxxsip/image/upload/v1699200976/App%20Images/Blog%20Images/Article%20Images/Autotrader%20Scraper%202023/same-version-as-chrome_uvawih.png" 
  alt="Chrome version" 
  loading="lazy" 
  styling=""
  caption="Finding the matching version of ChromeDriver" 
  captionsrc="https://res.cloudinary.com/dayqxxsip/image/upload/v1699200976/App%20Images/Blog%20Images/Article%20Images/Autotrader%20Scraper%202023/same-version-as-chrome_uvawih.png" 
  :showsource="false">
</article-image>

If you already have Version 119.0.6045.106 you can just [head to this repository](https://github.com/shedloadofcode/autotrader-selenium-scraper) where I have stored the code alongside the version of 'chromedriver.exe' I used ready for cloning / download.

## Explaining the AutoTrader scraper

To simplify the code block below and to understand the process, here is a 3 step summary of what's going on.

1. We set our `criteria` and `cars` search parameters.
2. Then we `scrape_autotrader`:
    * For each car find how many pages of results there are in `number_of_pages`
    * For each page scrape all the `articles`
    * For each article use [RegEx](https://www.w3schools.com/python/python_regex.asp) to find all the car `details`
    * Store all car details in a list `data` and return this
3. We take that, and `output_data_to_excel`
    * Ensuring the data is parsed to numeric format
    * Calculating mileage per annum
    * Sorting on distance
    * Conditional format the numeric columns red, amber, green for easier analysis

So once you've set your criteria and cars, ensure you're in the correct directory, then you can run the scraper using:

```
python autotrader-scraper.py
```

The code below then executes and begins the automated scraping in ChromeDriver.

```python [autotrader-scraper.py]
# type: ignore

"""
Enables the automation of searching for multiple makes/models on Autotrader UK using Selenium and Regex.

Set your criteria and cars makes/models.

Data is then output to an Excel file in the same directory.

Running Chrome Version 119.0.6045.106 and using Stable Win64 ChromeDriver from:
https://googlechromelabs.github.io/chrome-for-testing/
https://edgedl.me.gvt1.com/edgedl/chrome/chrome-for-testing/119.0.6045.105/win64/chromedriver-win64.zip
"""
import os
import re
import time
import datetime

import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver  
from selenium.webdriver.common.keys import Keys  
from selenium.webdriver.chrome.options import Options

criteria = {
    "postcode": "LS1 2AD",
    "radius": "20",
    "year_from": "2010",
    "year_to": "2014",
    "price_from": "3000",
    "price_to": "6500",
}


cars = [
    {
        "make": "Toyota",
        "model": "Yaris"
    },
    {
         "make": "Honda",
         "model": "Jazz"
    },
    {
         "make": "Suzuki",
         "model": "Swift"
    },
    {
         "make": "Mazda",
         "model": "Mazda2"
    }
]


def scrape_autotrader(cars, criteria):
    chrome_options = Options()
    chrome_options.add_argument("_tt_enable_cookie=1")
    driver = webdriver.Chrome()
    data = []

    for car in cars:

        # Example URL: 
        # https://www.autotrader.co.uk/car-search?advertising-location=at_cars&include-delivery-option
        # =on&make=Honda&model=Jazz&postcode=LS12AD&radius=10&sort=relevance&year-from=2011&year-to=2015
        
        url = "https://www.autotrader.co.uk/car-search?" + \
            "advertising-location=at_cars&" + \
            "include-delivery-option=on&" + \
            f"make={car['make']}&" + \
            f"model={car['model']}&" + \
            f"postcode={criteria['postcode']}&" + \
            f"radius={criteria['radius']}&" + \
            "sort=relevance&" + \
            f"year-from={criteria['year_from']}&" + \
            f"year-to={criteria['year_to']}&" + \
            f"price-from={criteria['price_from']}&" + \
            f"price-to={criteria['price_to']}"
        
        driver.get(url)

        print(f"Searching for {car['make']} {car['model']}...")

        time.sleep(5) 

        source = driver.page_source
        content = BeautifulSoup(source, "html.parser")

        try:
            pagination_next_element = content.find("a", attrs={"data-testid": "pagination-next"})
            aria_label = pagination_next_element.get("aria-label")
            number_of_pages = int(re.search(r'of (\d+)', aria_label).group(1))
        except AttributeError:
            print("No results found or couldn't determine number of pages.")
            continue
        except Exception as e:
            print(f"An error occurred while determining number of pages: {e}")
            continue  

        print(f"There are {number_of_pages} pages in total.")

        for i in range(int(number_of_pages)):
            driver.get(url + f"&page={str(i + 1)}")
            
            time.sleep(5)
            page_source = driver.page_source
            content = BeautifulSoup(page_source, "html.parser")

            articles = content.findAll("section", attrs={"data-testid": "trader-seller-listing"})

            print(f"Scraping page {str(i + 1)}...")

            for article in articles:
                details = {
                    "name": car['make'] + " " + car['model'],
                    "price": re.search("[£]\d+(\,\d{3})?", article.text).group(0),
                    "year": None,
                    "mileage": None,
                    "transmission": None,
                    "fuel": None,
                    "engine": None,
                    "owners": None,
                    "location": None,
                    "distance": None,
                    "link": article.find("a", {"href": re.compile(r'/car-details/')}).get("href")
                } 

                try:
                    seller_info = article.find("p", attrs={"data-testid": "search-listing-seller"}).text
                    location = seller_info.split("Dealer location")[1] 
                    details["location"] = location.split("(")[0]
                    details["distance"] = location.split("(")[1].replace(" mile)", "").replace(" miles)", "") 
                except:
                    print("Seller information not found.")

                specs_list = article.find("ul", attrs={"data-testid": "search-listing-specs"})
                for spec in specs_list:
                    if "reg" in spec.text:
                        details["year"] = spec.text

                    if "miles" in spec.text: 
                        details["mileage"] = spec.text

                    if spec.text in ["Manual", "Automatic"]: 
                        details["transmission"] = spec.text

                    if "." in spec.text and "L" in spec.text:
                        details["engine"] = spec.text

                    if spec.text in ["Petrol", "Diesel"]: 
                        details["fuel"] = spec.text

                    if "owner" in spec.text:
                        details["owners"] = spec.text[0]

                data.append(details)

            print(f"Page {str(i + 1)} scraped. ({len(articles)} articles)")
            time.sleep(5)

        print("\n\n")

    print(f"{len(data)} cars total found.")

    return data


def output_data_to_excel(data, criteria):
    df = pd.DataFrame(data)

    df["price"] = df["price"].str.replace("£", "").str.replace(",", "")
    df["price"] = pd.to_numeric(df["price"], errors="coerce").astype("Int64")

    df["year"] = df["year"].str.replace(r"\s(\(\d\d reg\))", "", regex=True)
    df["year"] = pd.to_numeric(df["year"], errors="coerce").astype("Int64")

    df["mileage"] = df["mileage"].str.replace(",", "").str.replace(" miles", "")
    df["mileage"] = pd.to_numeric(df["mileage"], errors="coerce").astype("Int64")

    now = datetime.datetime.now()
    df["miles_pa"] = df["mileage"] / (now.year - df["year"])
    df["miles_pa"].fillna(0, inplace=True)
    df["miles_pa"] = df["miles_pa"].astype(int)

    df["owners"] = df["owners"].fillna("-1") 
    df["owners"] = df["owners"].astype(int)

    df["distance"] = df["distance"].fillna("-1") 
    df["distance"] = df["distance"].astype(int)

    df["link"] = "https://www.autotrader.co.uk" + df["link"] 

    df = df[[
        "name",
        "link",
        "price",
        "year",
        "mileage",
        "miles_pa",
        "owners",
        "distance",
        "location",
        "engine",
        "transmission",
        "fuel",
    ]]

    df = df[df["price"] < int(criteria["price_to"])]

    df = df.sort_values(by="distance", ascending=True)

    writer = pd.ExcelWriter("cars.xlsx", engine="xlsxwriter")
    df.to_excel(writer, sheet_name="Cars", index=False)
    workbook = writer.book
    worksheet = writer.sheets["Cars"]

    worksheet.conditional_format("C2:C1000", {
        'type':      '3_color_scale',
        'min_color': '#63be7b',
        'mid_color': '#ffdc81',
        'max_color': '#f96a6c'
    })

    worksheet.conditional_format("D2:D1000", {
        'type':      '3_color_scale',
        'min_color': '#f96a6c',
        'mid_color': '#ffdc81',
        'max_color': '#63be7b'
    })

    worksheet.conditional_format("E2:E1000", {
        'type':      '3_color_scale',
        'min_color': '#63be7b',
        'mid_color': '#ffdc81',
        'max_color': '#f96a6c'
    })

    worksheet.conditional_format("F2:F1000", {
        'type':      '3_color_scale',
        'min_color': '#63be7b',
        'mid_color': '#ffdc81',
        'max_color': '#f96a6c'
    })

    writer.close() # Previously writer.save()
    print("Output saved to current directory as 'cars.xlsx'.")


if __name__ == "__main__":
    data = scrape_autotrader(cars, criteria)
    output_data_to_excel(data, criteria)

    os.system("start EXCEL.EXE cars.xlsx")
```

If you don't want an Excel file with all the conditional formatting, after the transformations in `output_data_to_excel` remove everything at and below `writer` then just output to a CSV instead using:

```python
df.to_csv("cars.csv")
```

I hope you find this code highly modifiable so you can adapt and extend it however you like. 

I was keen to calculate the mileage per annum to assess wear and tear, but you might want to include other calculations to explore other aspects and take it even further!

There is also a way to avoid having to download the correct version of Chrome to match ChromeDriver. You can install webdriver-manager with:

`pip install webdriver-manager`

Then update the code to automatically download and provide the path to Selenium webdriver for ChromeDriver using webdriver-manager instead.

```python [autotrader-scraper.py]
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

...
service = Service(executable_path=ChromeDriverManager().install()) 
driver = webdriver.Chrome(service=service, options=chrome_options)
...
```

## Taking the scraper for a test drive

Let's see the scraper in action, in this end-to-end demo. By performing this process weekly we can get the most up to date listing for a given area. In this demo, I have chosen a random postcode in Leeds. 

The formatting after scraping makes it really easy to see the trade offs in terms of price, year, mileage, miles per annum and previous owners. It also nicely allows for further filtering to narrow down your parameters.

<article-video 
  id="ak5cdSJX5A8" 
  title="How to scrape AutoTrader with Python and Selenium to search for multiple makes and models">
</article-video>

I closed the accept cookies pop up manually just so the steps taken in ChromeDriver were easily visible, but this isn't essential, you can just let it run.

<subscribe-form></subscribe-form>

## Why did the previous scraper stop working?

For those of you who tried the old scraper from a [previous article](/blog/building-an-autotrader-scraper-with-python-to-search-for-multiple-makes-and-models/) you'll know it stopped working after the AutoTrader UK website changed sometime after September 2023. All of the classes used for scraping changed and were obfuscated.

However, as we've seen in the current scraper, some attributes still allow element identification such as the `data-testid` attribute.

<article-image 
  src="https://res.cloudinary.com/dayqxxsip/image/upload/v1699200977/App%20Images/Blog%20Images/Article%20Images/Autotrader%20Scraper%202023/html-changes_sh8pdy.png" 
  alt="HTML changes" 
  loading="lazy" 
  styling=""
  caption="The HTML obfuscated after the website change" 
  captionsrc="https://res.cloudinary.com/dayqxxsip/image/upload/v1699200977/App%20Images/Blog%20Images/Article%20Images/Autotrader%20Scraper%202023/html-changes_sh8pdy.png" 
  :showsource="false">
</article-image>

The current scraper is simpler, should be more robust and less reliant on third party code other than stable libraries. 

However I have no doubt at some point it will stop working after another site change. Nevertheless, this scraper is easier to change, relying only on attribute identification followed by using regular expressions to find the required information. So by changing:

1. How we are identifying elements with [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) and
2. How we are parsing the information out of those elements with [RegEx](https://docs.python.org/3/library/re.html)

We can successfully update the code to adapt to changing needs. Selenium is a big help with this also, as it ensures that all scraping occurs after the page has loaded within Chrome. This means that anything that is dynamically added to the page using JavaScript after the page load should be captured.

## Happy car hunting again!

The only thing left for you to do is set your criteria, add the makes and models you want, and off you go! Happy car hunting. 

I hope the scraper helps you compare cars easier and find the one you're looking for as much as it helped me 👍

If you have any thoughts on this article, please leave a comment below or reach out by email at the bottom of this page. Certainly want to hear how this is being used, if it's helping others and how you've adapted it to your needs 😄

If you enjoyed this article be sure to check out: 

* [How to scrape and analyse your Amazon spending data](/blog/how-to-scrape-and-analyse-your-amazon-spending-data/) 
* [How to scrape and analyse your Chess.com data](/blog/how-to-scrape-and-analyse-your-chess-com-data/)]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[How to import a CSV from Dropbox or GitHub into Google Sheets]]></title>
            <link>https://shedloadofcode.com/blog/how-to-import-a-csv-from-dropbox-or-github-into-google-sheets/</link>
            <guid>https://shedloadofcode.com/blog/how-to-import-a-csv-from-dropbox-or-github-into-google-sheets/</guid>
            <pubDate>Thu, 02 Nov 2023 13:05:00 GMT</pubDate>
            <description><![CDATA[Learn how to use the IMPORTDATA function to automate CSV data ingestion into Google Sheets for analysis.]]></description>
            <content:encoded><![CDATA[
## Introduction

Recently I really wanted to export some of my spending data from the [Spending Tracker app](https://play.google.com/store/apps/details?hl=en&id=com.mhriley.spendingtracker) I use in CSV format to analyse it. This app exports data to Dropbox once it's linked up. So I needed a way to bring that data into Google Sheets to analyse trends etc. 

The process is quite simple once you know the steps involved, so I have documented them here! I have also documented how to do the same thing using GitHub.

As an example, we will use the Titanic dataset stored in both Dropbox and GitHub and then import that into Google Sheets from both sources 😄

## Get the CSV link from Dropbox

First things first, we need to head across to Dropbox and copy the link to the CSV file.

<article-image 
  src="https://res.cloudinary.com/dayqxxsip/image/upload/v1698855081/App%20Images/Blog%20Images/Article%20Images/Import%20CSV%20to%20Google%20Sheets/select-copy-link_xc7fbo.png" 
  alt="Select copy link" 
  loading="lazy" 
  styling=""
  caption="Select copy link" 
  captionsrc="https://res.cloudinary.com/dayqxxsip/image/upload/v1698855081/App%20Images/Blog%20Images/Article%20Images/Import%20CSV%20to%20Google%20Sheets/select-copy-link_xc7fbo.png" 
  :showsource="false">
</article-image>

This gives us the link [https://www.dropbox.com/scl/fi/dm1q4w0idefrwcv1arsxf/titanic.csv?rlkey=652khaywjcazj9h0itw47b574&dl=0](https://www.dropbox.com/scl/fi/dm1q4w0idefrwcv1arsxf/titanic.csv?rlkey=652khaywjcazj9h0itw47b574&dl=0)

<article-image 
  src="https://res.cloudinary.com/dayqxxsip/image/upload/v1698855080/App%20Images/Blog%20Images/Article%20Images/Import%20CSV%20to%20Google%20Sheets/copy-dropbox-link_ytqebb.png" 
  alt="Link is copied" 
  loading="lazy" 
  styling=""
  caption="Link is copied" 
  captionsrc="https://res.cloudinary.com/dayqxxsip/image/upload/v1698855080/App%20Images/Blog%20Images/Article%20Images/Import%20CSV%20to%20Google%20Sheets/copy-dropbox-link_ytqebb.png" 
  :showsource="false">
</article-image>

For this Dropbox link we will need to change the ending from `dl=0` to `dl=1` so that the file is downloaded rather than viewed when we try to import it later. **This is an important step**.

So the correct link is [https://www.dropbox.com/scl/fi/dm1q4w0idefrwcv1arsxf/titanic.csv?rlkey=652khaywjcazj9h0itw47b574&dl=1](https://www.dropbox.com/scl/fi/dm1q4w0idefrwcv1arsxf/titanic.csv?rlkey=652khaywjcazj9h0itw47b574&dl=1)

## Get the CSV link from GitHub

Doing the same process for GitHub I stored the CSV within a repository named [data-files](https://github.com/shedloadofcode/data-files/blob/main/titanic.csv). To ensure the CSV imports correctly we must first hit the 'Raw' button and copy that link instead. 

<article-image 
  src="https://res.cloudinary.com/dayqxxsip/image/upload/v1698855081/App%20Images/Blog%20Images/Article%20Images/Import%20CSV%20to%20Google%20Sheets/hit-raw_bb3yrs.png" 
  alt="Hit the 'Raw' button" 
  loading="lazy" 
  styling=""
  caption="View the CSV in the Repo and hit the 'Raw' button" 
  captionsrc="https://res.cloudinary.com/dayqxxsip/image/upload/v1698855081/App%20Images/Blog%20Images/Article%20Images/Import%20CSV%20to%20Google%20Sheets/hit-raw_bb3yrs.png" 
  :showsource="false">
</article-image>

This gives us the raw CSV link [https://raw.githubusercontent.com/shedloadofcode/data-files/main/titanic.csv](https://raw.githubusercontent.com/shedloadofcode/data-files/main/titanic.csv)

<article-image 
  src="https://res.cloudinary.com/dayqxxsip/image/upload/v1698855081/App%20Images/Blog%20Images/Article%20Images/Import%20CSV%20to%20Google%20Sheets/raw-csv-github_kij3yw.png" 
  alt="The raw CSV data" 
  loading="lazy" 
  styling=""
  caption="The link goes to the raw CSV data" 
  captionsrc="https://res.cloudinary.com/dayqxxsip/image/upload/v1698855081/App%20Images/Blog%20Images/Article%20Images/Import%20CSV%20to%20Google%20Sheets/raw-csv-github_kij3yw.png" 
  :showsource="false">
</article-image>

## Import CSV data from Dropbox

Now we have both links, to import that CSV data into Google Sheets, we will use the [IMPORTDATA](https://support.google.com/docs/answer/3093335) function and pass in the URL for each CSV file. Again, for the Dropbox link we will need to change the ending from `dl=0` to `dl=1` so that the file is downloaded. 

We can enter the formula and pass the link as the first argument.

<article-image 
  src="https://res.cloudinary.com/dayqxxsip/image/upload/v1698855081/App%20Images/Blog%20Images/Article%20Images/Import%20CSV%20to%20Google%20Sheets/import-dropbox-formula_gsx6cc.png" 
  alt="Use IMPORTDATA to pull data from Dropbox" 
  loading="lazy" 
  styling=""
  caption="Use IMPORTDATA to pull data from Dropbox" 
  captionsrc="https://res.cloudinary.com/dayqxxsip/image/upload/v1698855081/App%20Images/Blog%20Images/Article%20Images/Import%20CSV%20to%20Google%20Sheets/import-dropbox-formula_gsx6cc.png" 
  :showsource="false">
</article-image>

This imports the data and adds it to the current sheet.

<article-image 
  src="https://res.cloudinary.com/dayqxxsip/image/upload/v1698855081/App%20Images/Blog%20Images/Article%20Images/Import%20CSV%20to%20Google%20Sheets/import-dropbox-result_h5o5hv.png" 
  alt="The CSV data is imported into the current sheet" 
  loading="lazy" 
  styling=""
  caption="The CSV data is imported into the current sheet" 
  captionsrc="https://res.cloudinary.com/dayqxxsip/image/upload/v1698855081/App%20Images/Blog%20Images/Article%20Images/Import%20CSV%20to%20Google%20Sheets/import-dropbox-result_h5o5hv.png" 
  :showsource="false">
</article-image>

## Import CSV data from GitHub

Following the same pattern but on a new sheet, we enter the link from GitHub and hit enter.

<article-image 
  src="https://res.cloudinary.com/dayqxxsip/image/upload/v1698855081/App%20Images/Blog%20Images/Article%20Images/Import%20CSV%20to%20Google%20Sheets/import-github-formula_zd2fzu.png" 
  alt="Use IMPORTDATA to pull data from GitHub" 
  loading="lazy" 
  styling=""
  caption="Use IMPORTDATA to pull data from GitHub" 
  captionsrc="https://res.cloudinary.com/dayqxxsip/image/upload/v1698855081/App%20Images/Blog%20Images/Article%20Images/Import%20CSV%20to%20Google%20Sheets/import-github-formula_zd2fzu.png" 
  :showsource="false">
</article-image>

This imports the data and adds it to the current sheet.

<article-image 
  src="https://res.cloudinary.com/dayqxxsip/image/upload/v1698855081/App%20Images/Blog%20Images/Article%20Images/Import%20CSV%20to%20Google%20Sheets/import-github-result_l3vcau.png" 
  alt="The CSV data is imported into the current sheet" 
  loading="lazy" 
  styling=""
  caption="The CSV data is imported into the current sheet" 
  captionsrc="https://res.cloudinary.com/dayqxxsip/image/upload/v1698855081/App%20Images/Blog%20Images/Article%20Images/Import%20CSV%20to%20Google%20Sheets/import-github-result_l3vcau.png" 
  :showsource="false">
</article-image>

## Analyse the data

On either sheet, if we click any cell in the table and hit `Ctrl + A` we can select all the data, and then go to **Insert > Pivot Table** and select **'New sheet'**

<article-image 
  src="https://res.cloudinary.com/dayqxxsip/image/upload/v1698855081/App%20Images/Blog%20Images/Article%20Images/Import%20CSV%20to%20Google%20Sheets/insert-pivot-table_iqoasi.png" 
  alt="Select all the data and insert a Pivot Table" 
  loading="lazy" 
  styling=""
  caption="Select all the data and insert a Pivot Table" 
  captionsrc="https://res.cloudinary.com/dayqxxsip/image/upload/v1698855081/App%20Images/Blog%20Images/Article%20Images/Import%20CSV%20to%20Google%20Sheets/insert-pivot-table_iqoasi.png" 
  :showsource="false">
</article-image>


We can then drag in fields to analyse the data. Here were are finding the count and survival rate of males vs females.

<article-image 
  src="https://res.cloudinary.com/dayqxxsip/image/upload/v1698855080/App%20Images/Blog%20Images/Article%20Images/Import%20CSV%20to%20Google%20Sheets/pivot-table-analysis_ikrhzt.png" 
  alt="Analyse the data using the Pivot Table" 
  loading="lazy" 
  styling=""
  caption="Analyse the data using the Pivot Table" 
  captionsrc="https://res.cloudinary.com/dayqxxsip/image/upload/v1698855080/App%20Images/Blog%20Images/Article%20Images/Import%20CSV%20to%20Google%20Sheets/pivot-table-analysis_ikrhzt.png" 
  :showsource="false">
</article-image>

You can apply this methodology to any dataset, and any questions you have for that dataset! The best part is when the Google Sheet reloads then new data will be automatically pulled in creating a data pipeline.

## Import complete!

Thanks very much for reading, this was a short article covering how to import a CSV from Dropbox or GitHub into Google Sheets. 

By using this method, it creates an automated refresh when the Sheet is reloaded, ensuring analysis is always carried out on the latest data.

If you enjoyed this article be sure to check out [other articles](/) on the site. If you have any questions please leave a comment 👍 Hope this helps you out and enjoy your day!]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[How to build a random recipe selector with Python]]></title>
            <link>https://shedloadofcode.com/blog/how-to-build-a-random-recipe-selector-with-Python/</link>
            <guid>https://shedloadofcode.com/blog/how-to-build-a-random-recipe-selector-with-Python/</guid>
            <pubDate>Thu, 26 Oct 2023 15:30:00 GMT</pubDate>
            <description><![CDATA[In this article we will be building ingredirandom, a random recipe selector program which takes your library of recipes, samples X number at random, then outputs them as a shopping list with the ingredients and costs!]]></description>
            <content:encoded><![CDATA[
## Introduction

For a while I've wanted to try and mimic the setup of [HelloFresh](https://www.hellofresh.co.uk/), whereby you: 

* have a list of meals you like to cook and eat
* want to choose a random a number of recipes for the following week
* then want a shopping list for the ingredients for those recipes

Although I have never tried HelloFresh I've heard from others it's great for simplicity - you only get the ingredients you need for the recipe and it's mostly healthy stuff. Nevertheless, the principles inspired my own solution. I wanted to sharpen up my cooking skills, learn new recipes, and automate the stressful decision and procurement part of the process.

This is where ingredirandom steps in to help, it:

* defines a list of `recipes` dictionaries from various cook books I use
* defines a list of `costs` as a tuple with product codes from online shopping at ASDA
* a script `ingredirandom` which randomly selects a given number of those recipes, and outputs the shopping list to a text file

The following code blocks contain each of these steps, so please enjoy having a read through 😄

## Create a list of recipes

```python [recipes.py]
recipes = [
    {
        "name": "Beefy Mince and Pasta Bake",
        "book": "Enter cookbook name here",
        "page": 38,
        "serves": "2-3",
        "ingredients": [
            "Tin of Campbell's condensed Tomato soup",
            "500g beef mince",
            "Beef or vegetable stock cubes",
            "Grated cheese",
            "Garlic cloves",
            "Onion",
            "Butter",
            "Pasta",
            "Freeze dried basil",
            "Salt",
            "Pepper"
        ]
    },
    {
        "name": "Hoisin Chicken Noodles",
        "book": "Enter cookbook name here",
        "page": 59,
        "serves": "2",
        "ingredients": [
            "Spring onions",
            "Fresh ginger",
            "Garlic cloves",
            "Chicken breasts",
            "Mushrooms",
            "Chicken stock",
            "Soy sauce",
            "Hoisin sauce",
            "Can of sweetcorn",
            "Fresh egg noodles",
            "Olive oil"
        ]
    },
    {
        "name": "Pan Roast Chicken Breast with mustard sauce",
        "book": "Enter cookbook name here",
        "page": 64,
        "serves": "Enter cookbook name here",
        "ingredients": [
            "Cherry tomatoes",
            "Lettuce",
            "Cucumber",
            "Chicken breasts",
            "Potato wedges or new potatoes",
            "Olive oil",
            "Balsamic vinegar",
            "Sugar",
            "Salt", 
            "Pepper",
            "Mustard sauce",
            "Prosecco"
        ]
    },
    {
        "name": "Chicken and Mushroom Pasta with 'Philly' cheese and fresh basil",
        "book": "Enter cookbook name here",
        "page": 68,
        "serves": "2",
        "ingredients": [
            "500g tagliatelle",
            "Onion",
            "Garlic cloves",
            "Chicken breasts",
            "Mushrooms",
            "200g Philadelphia soft cheese",
            "Fresh basil",
            "Salt",
            "Pepper",
            "Parmesan",
            "Olive oil"
        ]
    },
    {
        "name": "Thai Salmon with coconut rice and green chilli dressing",
        "book": "Enter cookbook name here",
        "page": 200,
        "serves": "2-3",
        "ingredients": [
            "Olive oil",
            "Thai red curry paste",
            "Spring onions",
            "400g can coconut milk",
            "Fresh coriander leaves",
            "Lemon",
            "Rice",
            "Salmon steaks",
            "Hoisin sauce",
            "Sugar",
            "Green chilli"
        ]
    },
    {
        "name": "Pan Roasted Chicken with spicy fried rice",
        "book": "Enter cookbook name here",
        "page": 154,
        "serves": "2-3",
        "ingredients": [
            "Basmati rice",
            "Chicken breasts",
            "Eggs",
            "Onion",
            "Garlic cloves",
            "Red pepper",
            "Fresh ginger",
            "Red chilli",
            "Oyster sauce",
            "Soy sauce",
            "Spring onions"
        ]
    },
    {
        "name": "Tuna Noodles with honey and ginger dressing",
        "book": "Enter cookbook name here",
        "page": 153,
        "serves": "2-3",
        "ingredients": [
            "Honey",
            "Soy sauce",
            "White wine vinegar",
            "Red chilli",
            "Fresh ginger",
            "Salt",
            "Pepper",
            "Spring onions",
            "Cucumber",
            "Red pepper",
            "Can of tuna",
            "Fresh egg noodles"
        ]
    },
    {
        "name": "Zesty Tuna Steaks with chilli tagliatelle",
        "book": "Enter cookbook name here",
        "page": 87,
        "serves": "2-3",
        "ingredients": [
            "500g tagliatelle",
            "Olive oil",
            "Spring onions",
            "Red chilli",
            "Tuna steaks",
            "Black olives",
            "Fresh thyme",
            "Olive Oil",
            "Sugar",
            "Lime",
            "Salt", 
            "Pepper"
        ]
    },
    {
        "name": "Spaghetti Carbonara with Parmesan",
        "book": "Enter cookbook name here",
        "page": 80,
        "serves": "2-3",
        "ingredients": [
            "500g pack spaghetti",
            "Onion",
            "Garlic cloves",
            "200g pack pancetta lardons or strips of streaky bacon",
            "Olive oil",
            "Eggs",
            "Parmesan",
            "Fresh basil",
            "Red wine"
        ]
    },
    {
        "name": "Chorizo Spaghetti with balsamic and basil sauce",
        "book": "Enter cookbook name here",
        "page": 79,
        "serves": "2-3",
        "ingredients": [
            "500g pack spaghetti",
            "Onion",
            "Garlic cloves",
            "Red pepper",
            "Chorizo sausages",
            "Tomatoes",
            "Olives",
            "Fresh basil",
            "Olive oil",
            "Balsamic vinegar",
            "Red wine vinegar",
            "Sugar"
        ]
    },
    {
        "name": "Italian Meatballs with spaghetti",
        "book": "Enter cookbook name here",
        "page": 175,
        "serves": "2-3",
        "ingredients": [
            "Olive oil",
            "Onion",
            "Garlic cloves",
            "400g tin tomatoes",
            "Tomato puree",
            "Brown sugar",
            "Red wine vinegar",
            "Fresh basil",
            "Salt", 
            "Pepper",
            "500g beef mince",
            "Onion",
            "Red wine"
        ]
    },
    {
        "name": "Beef Chow Mein with oyster sauce",
        "book": "Enter cookbook name here",
        "page": 72,
        "serves": "2-3",
        "ingredients": [
            "Fresh ginger",
            "Garlic cloves",
            "Tomato puree",
            "Oyster sauce",
            "Soy sauce",
            "Onion",
            "Red pepper",
            "Rump steak",
            "Bean sprouts",
            "Fresh egg noodles",
            "Olive oil"
        ]
    },
    {
        "name": "Crispy Fried Duck Breast with ginger dressing and fried rice",
        "book": "Enter cookbook name here",
        "page": 180,
        "serves": "2",
        "ingredients": [
            "Basmati rice",
            "Carrots",
            "Courgette",
            "Duck breasts",
            "Eggs",
            "Spring onions",
            "Olive oil",
            "Fresh ginger",
            "Lime",
            "Soy sauce",
            "Honey"
        ]
    },
    {
        "name": "Irish Lamb Stew with colcannon",
        "book": "Enter cookbook name here",
        "page": 184,
        "serves": "4",
        "ingredients": [
            "Olive oil", 
            "Onion",
            "Garlic cloves",
            "Stewing or diced lamb",
            "Carrots",
            "Flour",
            "Vegetable stock cube",
            "Apricot jam",
            "Red wine",
            "Rosemary",
            "Mushrooms",
            "Potatoes",
            "Butter",
            "250g pack of spring greens or savoy cabbage (optional)",
            "300ml pot of soured cream (optional)"
        ]
    },
    {
        "name": "Beef Steak with balsamic onion and peppercorn sauce",
        "book": "Enter cookbook name here",
        "page": 171,
        "serves": "2",
        "ingredients": [
            "Peppercorn sauce",
            "Onion",
            "Olive oil",
            "Onion",
            "Balsamic vinegar",
            "Brown sugar",
            "Potatoes",
            "Butter",
            "Salt",
            "Pepper",
            "Rump steak"
        ]
    },
    {
        "name": "Sweet Honey Chicken with risotto rice",
        "book": "Enter cookbook name here",
        "page": 187,
        "serves": "2",
        "ingredients": [
            "Soy sauce",
            "Fresh ginger",
            "Honey",
            "Dried chives",
            "Chicken breasts",
            "Butter",
            "Garlic cloves",
            "Yellow pepper",
            "Basmati rice",
            "Chicken stock",
            "Mushrooms",
            "Spring onions",
            "Courgette"
        ]
    },
    {
        "name": "Sweet and Sour Chicken Noodles",
        "book": "Enter cookbook name here",
        "page": 196,
        "serves": "2",
        "ingredients": [
            "Soy sauce",
            "Spring onions",
            "Red pepper",
            "Garlic cloves",
            "Chicken breasts",
            "Fresh egg noodles",
            "Olive oil",
            "Tomato puree",
            "Honey",
            "White wine vinegar",
            "Fresh ginger"
        ]
    },
    {
        "name": "Corned Beef Hash with fried eggs",
        "book": "Enter cookbook name here",
        "page": 141,
        "serves": "2",
        "ingredients": [
            "Potatoes",
            "Onion",
            "Corned beef",
            "Eggs",
            "Olive oil",
            "Salt",
            "Pepper"
        ]
    },
    {
        "name": "Shiitake Mushroom Risotto with Parmesan",
        "book": "Enter cookbook name here",
        "page": 149,
        "serves": "2",
        "ingredients": [
            "Butter",
            "Onion",
            "Garlic cloves",
            "Risotto rice",
            "Vegetable stock cube",
            "Shiitake mushrooms",
            "Salt",
            "Pepper",
            "Fresh basil",
            "Parmesan"
        ]
    },
    {
        "name": "Traditional Pork Steaks with honey and mustard sauce",
        "book": "Enter cookbook name here",
        "page": 76,
        "serves": "2",
        "ingredients": [
            "Potato wedges or new potatoes",
            "Carrots",
            "Pork steaks",
            "Green beans",
            "Onion",
            "Fresh ginger",
            "Cumin",
            "Cinnamon",
            "Flour",
            "Honey",
            "Wholegrain mustard",
            "Salt",
            "Pepper",
            "Olive oil"
        ]
    },
    {
        "name": "Thai Prawn Curry with rice",
        "book": "Enter cookbook name here",
        "page": 83,
        "serves": "2",
        "ingredients": [
            "Rice",
            "Pilau rice seasoning",
            "Prawns",
            "Olive oil",
            "Thai red curry paste",
            "400g can coconut milk",
            "Mangetout",
            "Baby sweetcorn",
            "Spring onions"
        ]
    },
    {
        "name": "Crispy Parmesan Cod with fresh tomato sauce and mini roasts",
        "book": "Enter cookbook name here",
        "page": 84,
        "serves": "2",
        "ingredients": [
            "Olive oil",
            "Tomato and basil sauce",
            "Pepper",
            "Fresh basil",
            "Breadcrumbs",
            "Parmesan",
            "Lemon",
            "Potatoes",
            "Cod",
            "Eggs"
        ]
    },
    {
        "name": "Easy Cooked Breakfast",
        "book": "Enter cookbook name here",
        "page": 100,
        "serves": "2",
        "ingredients": [
            "Hash browns",
            "Salt",
            "Pepper",
            "Sausages",
            "Streaky bacon", 
            "Tomatoes",
            "Eggs",
            "Bread",
            "Olive oil",
            "Orange juice"
        ]
    },
    {
        "name": "Chicken Biryani with Naan bread",
        "book": "Enter cookbook name here",
        "page": 199,
        "serves": "2",
        "ingredients": [
            "Butter",
            "Onion",
            "Chicken thighs",
            "Korma curry paste",
            "Rice",
            "Chicken stock",
            "Yoghurt",
            "Raisins"
            "Fresh coriander leaves",
            "Flaked almonds",
            "Naan bread"
        ]
    },
    {
        "name": "Pork with apple and pear chutney",
        "book": "Enter cookbook name here",
        "page": 75,
        "serves": "2",
        "ingredients": [
            "Apple and pear chutney",
            "Pilau rice seasoning",
            "Rice",
            "Mangetout",
            "Pork steaks"
        ]
    },
    {
        "name": "Zesty Cod with rice",
        "book": "Enter cookbook name here",
        "page": 96,
        "serves": "2",
        "ingredients": [
            "Rice",
            "Pilau rice seasoning",
            "Cod fillets",
            "Eggs",
            "Onion",
            "Mushrooms",
            "Lemon",
            "Freeze dried basil"
        ]
    }
]
```

## Record ingredients and costs

```python [costs.py]
"""
A record of costs per ingredient.

Key is ingredient name, value is tuple (cost of item, ASDA product code)

Last updated: September 2023
"""

cost_lookup = {
    "Tin of Campbell's condensed Tomato soup": (1.30, "5498495"),
    "500g beef mince": (3.70, "1525219"),
    "Beef or vegetable stock cubes": (3.10, "4052433"),
    "Grated cheese": (2.55, "4639365"),
    "Garlic cloves": (2.00, "6599892"),
    "Onion": (1.00, "5737702"),
    "Butter": (3.25, "6858100"),
    "Pasta": (0.95, "6125466"),
    "Freeze dried basil": (0.80, "544353"),
    "Salt": (0.80, "4938721"),
    "Pepper": (0.90, "1352762"),
    "Spring onions": (0.75, "410212"),
    "Fresh ginger": (0.60, "6668284"),
    "Chicken breasts": (4.70, "7648521"),
    "Mushrooms": (1.29, "4110717"),
    "Chicken stock": (0.75, "2687967"),
    "Soy sauce": (1.90, "6124290"),
    "Hoisin sauce": (1.80, "6124274"),
    "Can of sweetcorn": (0.65, "5986511"),
    "Fresh egg noodles": (1.50, "5128622"),
    "500g tagliatelle": (2.00, "2207092"),
    "Olive oil": (5.90, "6722819"),
    "Red chilli": (0.55, "4928242"),
    "Tuna steaks": (5.00, "7740432"),
    "Black olives": (1.15, "951664"),
    "Fresh thyme": (0.55, "5139830"),
    "Sugar": (0.89, "217367"),
    "Lime": (1.00, "5596923"),
    "500g pack spaghetti": (0.75, "12943"),
    "200g pack pancetta lardons or strips of streaky bacon": (2.25, "6345750"),
    "Eggs": (2.95, "166781"),
    "Parmesan": (1.85, "3160573"),
    "Fresh basil": (0.55, "6753736"),
    "Red wine": (8.50, "1701819"),
    "Red pepper": (0.55, "1857059"),
    "Chorizo sausages": (2.70, "3567277"),
    "Tomatoes": (1.25, "5794643"),
    "Olives": (2.00, "6697522"),
    "Balsamic vinegar": (1.30, "1554788"),
    "Red wine vinegar": (4.50, "7681719"),
    "400g tin tomatoes": (1.25, "7675447"),
    "Tomato puree": (1.40, "7675461"),
    "Brown sugar": (1.35, "6345327"),
    "Oyster sauce": (1.75, "6124294"),
    "Rump steak": (6.20, "7357125"),
    "Bean sprouts": (0.50, "6536231"),
    "Basmati rice": (2.20, "18631"),
    "Carrots": (0.50, "150208"),
    "Courgette": (0.75, "6566770"),
    "Duck breasts": (6.00, "7443861"),
    "Honey": (1.47, "5506364"),
    "Stewing or diced lamb": (4.95, "6740100"),
    "Flour": (0.80, "11120"),
    "Vegetable stock cube": (0.75, "2687969"),
    "Apricot jam": (1.15, "6722853"),
    "Rosemary": (0.55, "5148466"),
    "Potatoes": (1.70, "1843017"),
    "250g pack of spring greens or savoy cabbage (optional)": (0.75, "150460"),
    "300ml pot of soured cream (optional)": (1.00, "5673649"),
    "Thai red curry paste": (2.30, "7563362"),
    "400g can coconut milk": (2.00, "7679943"),
    "Fresh coriander leaves": (2.00, "18695"),
    "Lemon": (0.55, "5797459"),
    "Rice": (2.70, "18802"),
    "Salmon steaks": (5.50, "6349272"),
    "Green chilli": (0.50, "1208242"),
    "Cucumber": (0.79, "152446"),
    "Can of tuna": (4.00, "6041045"),
    "White wine vinegar": (1.30, "2569207"),
    "200g Philadelphia soft cheese": (2.20, "7345715"),
    "Potato wedges or new potatoes": (1.50, "6311576/6141368"),
    "Peppercorn sauce": (1.20, "6923656"),
    "Dried chives": (0.80, "544339"),
    "Yellow pepper": (0.55, "1857071"),
    "Corned beef": (2.30, "2594051"),
    "Risotto rice": (2.40, "6125968"),
    "Shiitake mushrooms": (1.60, "4708261"),
    "Pork steaks": (3.60, "7452907"),
    "Green beans": (0.93, "7132612"),
    "Cumin": (0.80, "544313"),
    "Cinnamon": (0.80, "6684574"),
    "Wholegrain mustard": (2.65, "3667611"),
    "Pilau rice seasoning": (2.00, "59161"),
    "Prawns": (2.80, "6305703"),
    "Baby sweetcorn": (1.35, "6523635"),
    "Mangetout": (0.85, "5795246"),
    "Tomato and basil sauce": (1.15, "7458116"),
    "Breadcrumbs": (1.00, "5496030"),
    "Cod": (5.00, "6088572"),
    "Hash browns": (2.00, "3843261"),
    "Sausages": (3.50, "7600840"),
    "Streaky bacon": (2.25, "6345750"),
    "Bread": (1.30, "2160171"),
    "Orange juice": (1.15, "656042"),
    "Chicken thighs": (4.95, "6923652"),
    "Korma curry paste": (2.10, "5904835"),
    "Saffron": (2.45, "5615948"),
    "Yoghurt": (1.00, "3425334"),
    "Raisins": (1.50, "4960067"),
    "Flaked almonds": (1.50, "4960109"),
    "Naan bread": (0.75, "5215599"),
    "Apple and pear chutney": (1.95, "6210082"),
    "Cod fillets": (4.75, "6246480"),
    "Curry paste": (2.10, "5017664")
}
```

## Select random recipes

```python [ingredirandom.py]
import random
import datetime
from recipes import recipes
from costs import cost_lookup

def get_random_selections():
    k = int(input("Number of recipes to randomly choose?: "))
    selections = random.sample(recipes, k=k)

    return selections

def output_to_text_file(selections):
    now = datetime.datetime.now()
    file_path = "shopping-list-" + now.strftime("%d-%m-%Y") + ".txt"

    with open(file_path, "w") as file:
        file.write("Shopping list for ")
        file.write(now.strftime("%B %d, %Y\n\n"))
        file.write("* = Ingredient is in multiple recipes\n\n")
        total_week_cost = 0
        all_ingredients = set()

        for i, recipe in enumerate(selections):
            if (i > 0):
                file.write("\n\n")

            total_recipe_cost = 0

            file.write(f"Recipe {i + 1}\n")
            file.write("____________________________\n")
            file.write(f"{recipe['name']}\n")
            file.write(f"{recipe['book']} - Page {recipe['page']}\n\n")

            for ingredient in recipe['ingredients']:
                if ingredient in cost_lookup:
                    ingredient_cost = float(cost_lookup[ingredient][0])
                    ingredient_product_id = cost_lookup[ingredient][1]
                    
                    file.write(
                        '{:70s} {:20s} {:20s}'.format(
                            ingredient + "*" if ingredient in all_ingredients else ingredient,
                            "\u00a3" + str(ingredient_cost), 
                            str(ingredient_product_id))
                    )
                    
                    file.write("\n")

                    total_recipe_cost += ingredient_cost
                else:
                    file.write(
                        ingredient + "*" if ingredient in all_ingredients else ingredient)
                    file.write("\n")

                all_ingredients.add(ingredient)

            file.write(f"\nEstimated recipe cost: \u00a3{round(total_recipe_cost, 2)}\n")
            total_week_cost += total_recipe_cost

        file.write(f"\n\nEstimated week cost: \u00a3{round(total_week_cost, 2)}")

    print("Selections saved to shopping-list.txt", end="\n")
    print("Happy cooking :)")
   

if __name__ == "__main__":
    print("Welcome to IngrediRandom!", end="\n")
    print(f"There are {len(recipes)} recipes in total.")

    selections = get_random_selections()
    output_to_text_file(selections)
```

You run `python ingredirandom.py`, enter the number of random recipes you want selecting and the recipes, ingredients and costs are output to a text file `shopping-list.txt`. 

In the text file below we can see the output of the program, our random meals, the page number of the recipe book for the instructions along with a shopping list with the ingredients required and their costs.

Remember the below list shows the **total** cost, that is if you were starting with nothing and had to buy everything. That's not to say you don't already have most of the ingredients or can find them cheaper elsewhere. These are just suggestions, groceries are becoming increasingly expensive so always shop around and adapt the program! That's just the way I decided to build this program, so I always know the cost of the core ingredients but remembering that the expensive one off items like olive oil or butter will bring that cost upwards.

```[shopping-list-16-09-2023.txt]
Shopping list for September 16, 2023

* = Ingredient is in multiple recipes

Recipe 1
____________________________
Crispy Fried Duck Breast with ginger dressing and fried rice
Your cookbook name - Page 180

Basmati rice                                                           £2.2                 18631               
Carrots                                                                £0.5                 150208              
Courgette                                                              £0.75                6566770             
Duck breasts                                                           £6.0                 7443861             
Eggs                                                                   £2.95                166781              
Spring onions                                                          £0.75                410212              
Olive oil                                                              £5.9                 6722819             
Fresh ginger                                                           £0.6                 6668284             
Lime                                                                   £1.0                 5596923             
Soy sauce                                                              £1.9                 6124290             
Honey                                                                  £1.47                5506364             

Estimated recipe cost: £24.02


Recipe 2
____________________________
Beef Steak with balsamic onion and peppercorn sauce
Your cookbook name - Page 171

Peppercorn sauce                                                       £1.2                 6923656             
Onion                                                                  £1.0                 5737702             
Olive oil*                                                             £5.9                 6722819             
Onion*                                                                 £1.0                 5737702             
Balsamic vinegar                                                       £1.3                 1554788             
Brown sugar                                                            £1.35                6345327             
Potatoes                                                               £1.7                 1843017             
Butter                                                                 £3.25                6858100             
Salt                                                                   £0.8                 4938721             
Pepper                                                                 £0.9                 1352762             
Rump steak                                                             £6.2                 7357125             

Estimated recipe cost: £24.6


Recipe 3
____________________________
Hoisin Chicken Noodles
Your cookbook name - Page 59

Spring onions*                                                         £0.75                410212              
Fresh ginger*                                                          £0.6                 6668284             
Garlic cloves                                                          £2.0                 6599892             
Chicken breasts                                                        £4.7                 7648521             
Mushrooms                                                              £1.29                4110717             
Chicken stock                                                          £0.75                2687967             
Soy sauce*                                                             £1.9                 6124290             
Hoisin sauce                                                           £1.8                 6124274             
Can of sweetcorn                                                       £0.65                5986511             
Fresh egg noodles                                                      £1.5                 5128622             
Olive oil*                                                             £5.9                 6722819             

Estimated recipe cost: £21.84


Estimated week cost: £70.46
```

## Adapting to your needs

You can add, remove or modify recipe entries from `recipes.py`.

You can also update the costs in `costs.py` for each ingredient. If you order online this should be easy to do as you do it, or find the cost from your receipt.

I haven't added actual recipes to avoid copyright issues, however I will say my main cook books are [HelloFresh Recipes That Work](https://www.amazon.co.uk/HelloFresh-Recipes-that-step-step/dp/1784724653/), [Nosh for Students](https://www.amazon.co.uk/NOSH-Students-Student-Cookbook-Recipe/dp/0993260985) and [Nosh for Graduates](https://www.amazon.co.uk/GRADUATES-cookbook-those-graduated-student/dp/0954317955/) - yes these are simple but effective books I am no expert so simple is good for me. 

I entered my favourite recipes from these books into the `recipes.py` lookup and entered the costs from online grocery shopping at ASDA into `costs.py` lookup. Voila!

I could also almost fully automate this process by turning the cost and recipe lookups into JSON files, storing them in GitHub, then having an AWS Lambda function read them and run ingredirandom.py, then send an email to me with the recipes for the week. I might explore this in a future article.

## Bon appetit

This program works really well at mixing things up and enjoying learning new recipes. You can also adapt this code to fulfil any other random selection use case you may have. 

The major benefit is you can add more recipes you enjoy and remove the ones that you don't want to try again. 

It keeps your cooking skills sharp and I hope you find like me, eases the stressful procurement part of cooking. This leaves you to gather everything you need upfront and then just enjoy the process of preparing, cooking and eating good clean fresh food at least a few times per week! 😆]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[How to create an interactive correlation heatmap using Danfo.js and Plotly]]></title>
            <link>https://shedloadofcode.com/blog/how-to-create-an-interactive-correlation-heatmap-using-danfojs-and-plotly/</link>
            <guid>https://shedloadofcode.com/blog/how-to-create-an-interactive-correlation-heatmap-using-danfojs-and-plotly/</guid>
            <pubDate>Sun, 10 Sep 2023 17:47:00 GMT</pubDate>
            <description><![CDATA[Discover how to use Danfo.js and Plotly.js to calculate and display Pearson correlation values for a CSV dataset. This helps us to discover and display important predictor variables prior to machine learning.]]></description>
            <content:encoded><![CDATA[
In this short article, we'll look at how create a Pearson correlation heatmap visual using [Danfo.js](https://danfo.jsdata.org/) and [Plotly.js](https://plotly.com/javascript/) and then display it in an HTML page using JavaScript.

I recently came across this issue whilst building the [Data Explorer Workbench tool](https://shedloadofcode.github.io/) in which I needed to calculate and display correlation between variables in the dataset using only JavaScript. Data Explorer Workbench is a web based tool for automated exploratory data analysis (EDA) where you can upload a CSV dataset and explore descriptive statistics, relationships and correlation. I was using Vue.js as the framework here, although you can amend the steps to other frameworks or a static HTML file.

When it comes to data visualisation, heatmaps are a powerful tool for exploring relationships and patterns in your dataset. Heatmaps allow you to visualise the correlation between different variables, making it easier to identify trends and dependencies. 

## What is a Correlation Heatmap?

A correlation heatmap is a graphical representation of the correlation matrix, which shows the correlation coefficients between multiple variables in a dataset. Each cell in the heatmap represents the correlation between two variables, with colors indicating the strength and direction of the correlation. Heatmaps are commonly used in data analysis to identify relationships between variables, especially in fields like finance, healthcare, and social sciences.

## Getting Started
Before we dive into creating a correlation heatmap, you'll need to have [Node](https://nodejs.org/en) installed on your system. Additionally, you'll need to install the [danfo](https://www.npmjs.com/package/danfojs-node) and [plotly](https://www.npmjs.com/package/plotly.js) libraries. You can do this using Node and [npm](https://www.npmjs.com/):

```
npm i danfojs-node
npm i plotly.js
```

Once you have the required libraries installed, let's move on to the step-by-step process of creating a correlation heatmap.

## Step 0: Create the HTML chart placeholder

```html
<div id="correlation-heatmap">
    <!-- Plotly Heatmap will go here -->
</div>
```

This gives us a div container where the correlation heatmap will be placed.

## Step 1: Importing the libraries
The first step is to import the necessary libraries:

```js
import * as dfd from "danfojs";
import Plotly from 'plotly.js-dist-min';
```

We use danfo for data manipulation and plotly for creating interactive visualisations.

## Step 2: The corr function
You will see in the next step we require a `corr` function to calculate the Pearson correlation value for each variable.

```js
/*
* Calculates Pearson correlation between 
* two arrays x and y.
*/
corr(x, y) {
    let sumX = 0,
        sumY = 0,
        sumXY = 0,
        sumX2 = 0,
        sumY2 = 0;

    const minLength = x.length = y.length = Math.min(x.length, y.length),
            reduce = (xi, idx) => {
            const yi = y[idx];
            sumX += xi;
            sumY += yi;
            sumXY += xi * yi;
            sumX2 += xi * xi;
            sumY2 += yi * yi;
            }

    x.forEach(reduce);

    return (minLength * sumXY - sumX * sumY) / 
            Math.sqrt((minLength * sumX2 - sumX * sumX) * (minLength * sumY2 - sumY * sumY));
}
```


## Step 3: Loading the data and display the heatmap
Now, you need to load your dataset into a DataFrame using danfo. For the purpose of this tutorial, let's assume you have a CSV file named your_data.csv containing your dataset. You can load an example Titanic dataset from a GitHub repo as follows:

```js
dfd.readCSV("https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv") 
  .then(df => {
        df.head().print()

        /**
         * Generate heatmap
         * This needs to be in the format of 
         *  zValues = [
         *     [0.00, 0.00, 0.75, 0.75, 1.00],
         *     [0.00, 0.00, 0.75, 1.00, 0.00],
         *     [0.75, 0.75, 1.00, 0.75, 0.75],
         *     [0.00, 1.00, 0.00, 0.75, 0.00],
         *     [1.00, 0.00, 0.00, 0.75, 0.00]
         *  ];
         */
        let zValues = [];
        let dfCopy = df.copy();
        let columnsLength = dfCopy.shape[1];
        let columnsToDrop = [];
        let numericColumns = dfCopy.selectDtypes([
              'int32', 
              'float32', 
          ]);

        // Drop columns with high cardinality (many unique values)
        for (let i = 0; i < columnsLength; i++) {
          let column = dfCopy.columns[i];

          // Skip if a numeric column as it will have lots of unique values
          // but this doesn't matter :)
          if (numericColumns.$columns.includes(column)) {
            continue;
          }

          let uniqueValuesCount = dfCopy.column(column).unique().$data.length;

          if (uniqueValuesCount > 5) {
            columnsToDrop.push(column);
          }
        }

        dfCopy.drop({ columns: columnsToDrop, inplace: true });

        // Create dummy columns for categoric variables
        let dummies = dfCopy.getDummies(dfCopy);
        // Uncomment to debug: console.log("DUMMIES", dummies);
        columnsLength = dummies.$columns.length;

        for (let i = 0; i < columnsLength; i++) {
          let column = dummies.$columns[i];
          // Uncomment to debug: console.log("COMPARING", column);
          let correlations = [];

          for (let j = 0; j < columnsLength; j++) {
            let comparisonColumn = dummies.$columns[j];
            // Uncomment to debug: console.log("TO", comparisonColumn);
            
            let pearsonCorrelation = corr(
              dummies[column].$data,
              dummies[comparisonColumn].$data
            ).toFixed(2)

            correlations.push(
              pearsonCorrelation
            );
          }

          zValues.push(correlations);
        }

        var xValues = dummies.$columns;
        var yValues = dummies.$columns;

        var colorscaleValue = [
          [0, '#3D9970'],
          [1, '#001f3f']
        ];

        var data = [{
          x: xValues,
          y: yValues,
          z: zValues,
          type: 'heatmap',
          colorscale: colorscaleValue,
          showscale: false
        }];

        var layout = {
          autosize: false,
          width: window.innerWidth - 650,
          height: 700,
          annotations: [],
          xaxis: {
            ticks: '',
            side: 'top'
          },
          yaxis: {
            ticks: '',
            ticksuffix: ' ',
            autosize: false
          }
        };

        for ( var i = 0; i < yValues.length; i++ ) {
          for ( var j = 0; j < xValues.length; j++ ) {
            var currentValue = zValues[i][j];
            if (currentValue != 0.0) {
              var textColor = 'white';
            }else{
              var textColor = 'black';
            }
            var result = {
              xref: 'x1',
              yref: 'y1',
              x: xValues[j],
              y: yValues[i],
              text: zValues[i][j],
              font: {
                family: 'Arial',
                size: 12,
                color: 'rgb(50, 171, 96)'
              },
              showarrow: false,
              font: {
                color: textColor
              }
            };
            layout.annotations.push(result);
          }
        }

        Plotly.newPlot('correlation-heatmap', data, layout);
  }).catch(err=>{
     console.log(err);
  })
```

The length of this code can be made more concise by introducing functions. However, here we are performing a number of preprocessing steps before calculating the correlation coefficient with `corr`:

* Reading the dataset with Danfo
* Copying the dataset to work on it
* Identifying the numeric type columns in the dataset
* Dropping columns with high cardinality (many unique values)
* Creating dummy columns for categoric variables

## Bonus: Using just plain HTML and JavaScript 

That's the whole process done with the heatmap created! If you prefer not to use Node and NPM with a framework, you can give this [minimal working example](https://github.com/shedloadofcode/danfo-plotly-correlation-heatmap/blob/main/test.html) using just plain HTML and JavaScript a go. In this example we are just importing both Danfo and Plotly from a [CDN](https://en.wikipedia.org/wiki/Content_delivery_network).

```html
<!DOCTYPE html>
<html lang="en">

<head>
  <meta charset="UTF-8">
  <meta name="viewport" content="width=device-width, initial-scale=1.0">
  <meta http-equiv="X-UA-Compatible" content="ie=edge">
  <title>HTML 5 Boilerplate</title>
  <script src="https://cdn.plot.ly/plotly-2.25.2.min.js" charset="utf-8"></script>
  <script src="https://cdn.jsdelivr.net/npm/danfojs@1.1.2/lib/bundle.min.js"></script>
</head>

<body>

  <div id="correlation-heatmap" style="height: 800px; width: 1000px">
    <!-- Plotly Heatmap will go here -->
  </div>

</body>

<script>
  /*
  * Calculates Pearson correlation between 
  * two arrays x and y.
  */
  function corr(x, y) {
    let sumX = 0,
      sumY = 0,
      sumXY = 0,
      sumX2 = 0,
      sumY2 = 0;

    const minLength = x.length = y.length = Math.min(x.length, y.length),
      reduce = (xi, idx) => {
        const yi = y[idx];
        sumX += xi;
        sumY += yi;
        sumXY += xi * yi;
        sumX2 += xi * xi;
        sumY2 += yi * yi;
      }

    x.forEach(reduce);

    return (minLength * sumXY - sumX * sumY) /
      Math.sqrt((minLength * sumX2 - sumX * sumX) * (minLength * sumY2 - sumY * sumY));
  }

  dfd.readCSV("https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv")
    .then(df => {
      df.head().print()

      /**
       * Generate heatmap
       * This needs to be in the format of 
       *  zValues = [
       *     [0.00, 0.00, 0.75, 0.75, 1.00],
       *     [0.00, 0.00, 0.75, 1.00, 0.00],
       *     [0.75, 0.75, 1.00, 0.75, 0.75],
       *     [0.00, 1.00, 0.00, 0.75, 0.00],
       *     [1.00, 0.00, 0.00, 0.75, 0.00]
       *  ];
       */
      let zValues = [];
      let dfCopy = df.copy();
      let columnsLength = dfCopy.shape[1];
      let columnsToDrop = [];
      let numericColumns = dfCopy.selectDtypes([
        'int32',
        'float32',
      ]);

      // Drop columns with high cardinality (many unique values)
      for (let i = 0; i < columnsLength; i++) {
        let column = dfCopy.columns[i];

        // Skip if a numeric column as it will have lots of unique values
        // but this doesn't matter :)
        if (numericColumns.$columns.includes(column)) {
          continue;
        }

        let uniqueValuesCount = dfCopy.column(column).unique().$data.length;

        if (uniqueValuesCount > 5) {
          columnsToDrop.push(column);
        }
      }

      dfCopy.drop({ columns: columnsToDrop, inplace: true });

      // Create dummy columns for categoric variables
      let dummies = dfCopy.getDummies(dfCopy);
      // Uncomment to debug: console.log("DUMMIES", dummies);
      columnsLength = dummies.$columns.length;

      for (let i = 0; i < columnsLength; i++) {
        let column = dummies.$columns[i];
        // Uncomment to debug: console.log("COMPARING", column);
        let correlations = [];

        for (let j = 0; j < columnsLength; j++) {
          let comparisonColumn = dummies.$columns[j];
          // Uncomment to debug: console.log("TO", comparisonColumn);

          let pearsonCorrelation = corr(
            dummies[column].$data,
            dummies[comparisonColumn].$data
          ).toFixed(2)

          correlations.push(
            pearsonCorrelation
          );
        }

        zValues.push(correlations);
      }

      var xValues = dummies.$columns;
      var yValues = dummies.$columns;

      var colorscaleValue = [
        [0, '#3D9970'],
        [1, '#001f3f']
      ];

      var data = [{
        x: xValues,
        y: yValues,
        z: zValues,
        type: 'heatmap',
        colorscale: colorscaleValue,
        showscale: false
      }];

      var layout = {
        autosize: false,
        width: window.innerWidth - 650,
        height: 700,
        annotations: [],
        xaxis: {
          ticks: '',
          side: 'top'
        },
        yaxis: {
          ticks: '',
          ticksuffix: ' ',
          autosize: false
        }
      };

      for (var i = 0; i < yValues.length; i++) {
        for (var j = 0; j < xValues.length; j++) {
          var currentValue = zValues[i][j];
          if (currentValue != 0.0) {
            var textColor = 'white';
          } else {
            var textColor = 'black';
          }
          var result = {
            xref: 'x1',
            yref: 'y1',
            x: xValues[j],
            y: yValues[i],
            text: zValues[i][j],
            font: {
              family: 'Arial',
              size: 12,
              color: 'rgb(50, 171, 96)'
            },
            showarrow: false,
            font: {
              color: textColor
            }
          };
          layout.annotations.push(result);
        }
      }

      console.log(data);

      Plotly.newPlot('correlation-heatmap', data, layout);
    }).catch(err => {
      console.log(err);
    })

</script>

</html>
```

This produces the below HTML page.

<article-image 
  src="https://res.cloudinary.com/dayqxxsip/image/upload/v1694363974/App%20Images/Blog%20Images/Article%20Images/Danfo%20Plotly%20Correlation%20Heatmap/heatmap-demo_mezzpv.png" 
  alt="Working heatmap example" 
  loading="lazy" 
  styling=""
  caption="Miminal working example of the correlation heatmap" 
  captionsrc="https://res.cloudinary.com/dayqxxsip/image/upload/v1694363974/App%20Images/Blog%20Images/Article%20Images/Danfo%20Plotly%20Correlation%20Heatmap/heatmap-demo_mezzpv.png" 
  :showsource="false">
</article-image>

## Conclusion

Creating a correlation heatmap is a valuable step in data analysis and visualisation. It helps you quickly identify relationships and patterns within your dataset, which can lead to valuable insights.

In this article, we've demonstrated how to create a correlation heatmap using the Danfo and Plotly libraries in JavaScript. By following these steps, you can easily generate interactive heatmaps for your own datasets, enabling you to explore and understand your data more effectively.

Remember that data visualisation is not only about creating pretty charts but also about gaining insights and making data-driven decisions. Heatmaps are just one of the many tools at your disposal for this purpose, and they can be a powerful addition to your data analysis toolkit.

I am really excited by Danfo which brings Pandas style data manipulation and data analysis to JavaScript. I hope more articles utilising this library will be coming soon.

If you enjoyed this article be sure to check out [other articles](/) on the site. If you have any questions feel free to leave a comment 👍]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Eight ways to perform feature selection with scikit-learn]]></title>
            <link>https://shedloadofcode.com/blog/eight-ways-to-perform-feature-selection-with-scikit-learn/</link>
            <guid>https://shedloadofcode.com/blog/eight-ways-to-perform-feature-selection-with-scikit-learn/</guid>
            <pubDate>Sat, 05 Aug 2023 12:25:00 GMT</pubDate>
            <description><![CDATA[Explore how to apply feature selection techniques using Python. This is an important step in finding the most predictive features for machine learning.]]></description>
            <content:encoded><![CDATA[
Feature selection is a crucial step in machine learning that involves selecting the most relevant features from a dataset. By eliminating irrelevant or redundant features, feature selection techniques can improve model performance and efficiency. In this guide, we'll explore some common feature selection techniques and provide code examples using the Boston Housing dataset.

The Boston Housing dataset contains information about housing prices in Boston. It consists of various features such as average number of rooms per dwelling, crime rate, and pupil-teacher ratio. Our goal is to select a subset of features that have the most impact on predicting house prices.

## Univariate Feature Selection
Univariate feature selection evaluates each feature individually based on statistical tests to measure the correlation between each feature and the target variable. Let's visualise the feature scores using a bar plot. 

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.datasets import load_boston

# Load the Boston Housing dataset
data = load_boston()
X = data.data
y = data.target

# Perform univariate feature selection
selector = SelectKBest(score_func=f_regression, k=5)
X_new = selector.fit_transform(X, y)

# Get the selected feature indices
selected_indices = selector.get_support(indices=True)
selected_features = data.feature_names[selected_indices]

# Get the feature scores
scores = selector.scores_

# Plot the feature scores
plt.figure(figsize=(10, 6))
plt.bar(range(len(data.feature_names)), scores, tick_label=data.feature_names)
plt.xticks(rotation=90)
plt.xlabel('Features')
plt.ylabel('Scores')
plt.title('Univariate Feature Selection: Feature Scores')
plt.show()

print("Selected Features:")
print(selected_features)
```

Selected Features:
['INDUS' 'RM' 'TAX' 'PTRATIO' 'LSTAT']

In this example, we select the top 5 features using the f_regression score function and visualise the feature scores using a bar plot. The selected features are the ones that have the highest correlation with the target variable. If we had a categorical target instead of a continuous target we might use chi2 instead of using f_regression 

<article-image 
  src="https://res.cloudinary.com/dayqxxsip/image/upload/v1692207922/App%20Images/Blog%20Images/Article%20Images/Feature%20Selection/ufs_su8uz2.png" 
  alt="Selected feature importance" 
  loading="lazy" 
  styling=""
  caption="Selected feature importance" 
  captionsrc="https://res.cloudinary.com/dayqxxsip/image/upload/v1692207922/App%20Images/Blog%20Images/Article%20Images/Feature%20Selection/ufs_su8uz2.png" 
  :showsource="false">
</article-image>

## Recursive Feature Elimination (RFE)
Recursive Feature Elimination (RFE) is an iterative method that starts with all features and recursively eliminates the least important features based on the model's performance. Let's visualise the feature rankings using a line plot.

```python
import matplotlib.pyplot as plt
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
from sklearn.datasets import load_boston

# Load the Boston Housing dataset
data = load_boston()
X = data.data
y = data.target

# Perform Recursive Feature Elimination
estimator = LinearRegression()
selector = RFE(estimator, n_features_to_select=5)
X_new = selector.fit_transform(X, y)

# Get the selected feature indices
selected_indices = selector.get_support(indices=True)
selected_features = data.feature_names[selected_indices]

# Get the feature rankings
rankings = selector.ranking_

# Plot the feature rankings
plt.figure(figsize=(10, 6))
plt.plot(range(1, len(rankings) + 1), rankings, marker='o')
plt.xticks(range(1, len(rankings) + 1), data.feature_names, rotation=90)
plt.xlabel('Features')
plt.ylabel('Rankings')
plt.title('Recursive Feature Elimination: Feature Rankings')
plt.show()

print("Selected Features:")
print(selected_features)
```

Selected Features:
['CHAS' 'NOX' 'RM' 'DIS' 'PTRATIO']

Here, we use LinearRegression as the estimator and select the top 5 features. We visualise the feature rankings using a line plot. Lower ranks indicate more important features.

<article-image 
  src="https://res.cloudinary.com/dayqxxsip/image/upload/v1692207923/App%20Images/Blog%20Images/Article%20Images/Feature%20Selection/rfe_pfcvvx.png" 
  alt="Selected feature importance" 
  loading="lazy" 
  styling=""
  caption="Selected feature importance" 
  captionsrc="https://res.cloudinary.com/dayqxxsip/image/upload/v1692207923/App%20Images/Blog%20Images/Article%20Images/Feature%20Selection/rfe_pfcvvx.png" 
  :showsource="false">
</article-image>

## L1 Regularisation (Lasso)
L1 regularisation, also known as Lasso regularisation, applies a penalty term to the linear regression model, encouraging sparse feature weights. This results in some feature weights being driven to zero, effectively selecting only the most relevant features. Let's visualise the feature coefficients using a horizontal bar plot.

```python
import matplotlib.pyplot as plt
from sklearn.linear_model import Lasso
from sklearn.datasets import load_boston

# Load the Boston Housing dataset
data = load_boston()
X = data.data
y = data.target

# Perform L1 regularisation (Lasso)
lasso = Lasso(alpha=0.1)
lasso.fit(X, y)

# Get the non-zero feature coefficients
nonzero_coefs = lasso.coef_
selected_indices = nonzero_coefs != 0
selected_features = data.feature_names[selected_indices]
nonzero_coefs = nonzero_coefs[selected_indices]

# Plot the feature coefficients
plt.figure(figsize=(10, 6))
plt.barh(range(len(nonzero_coefs)), nonzero_coefs, tick_label=selected_features)
plt.xlabel('Coefficient Values')
plt.ylabel('Features')
plt.title('L1 Regularisation (Lasso): Feature Coefficients')
plt.show()

print("Selected Features:")
print(selected_features)
```

Selected Features:
['CRIM' 'ZN' 'INDUS' 'CHAS' 'RM' 'AGE' 'DIS' 'RAD' 'TAX' 'PTRATIO' 'B'
 'LSTAT']

In this example, we apply L1 regularisation with a regularisation strength (alpha) of 0.1. We visualise the non-zero feature coefficients using a horizontal bar plot. The selected features are the ones with non-zero coefficients in the Lasso model.

<article-image 
  src="https://res.cloudinary.com/dayqxxsip/image/upload/v1692207923/App%20Images/Blog%20Images/Article%20Images/Feature%20Selection/l1_skucse.png" 
  alt="Selected feature importance" 
  loading="lazy" 
  styling=""
  caption="Selected feature importance" 
  captionsrc="https://res.cloudinary.com/dayqxxsip/image/upload/v1692207923/App%20Images/Blog%20Images/Article%20Images/Feature%20Selection/l1_skucse.png" 
  :showsource="false">
</article-image>

## Tree-Based Methods
Tree-based methods, such as Random Forest and Gradient Boosting, inherently perform feature selection by evaluating the importance of each feature in the tree construction process. Let's visualise the feature importances using a bar plot.

```python
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor
from sklearn.datasets import load_boston

# Load the Boston Housing dataset
data = load_boston()
X = data.data
y = data.target

# Perform feature selection using Random Forest
forest = RandomForestRegressor(n_estimators=100)
forest.fit(X, y)

# Get feature importances
importances = forest.feature_importances_

# Sort feature importances in descending order
sorted_indices = importances.argsort()[::-1]

# Select the top k features
k = 5
selected_features = data.feature_names[sorted_indices[:k]]
top_importances = importances[sorted_indices[:k]]

# Plot the feature importances
plt.figure(figsize=(10, 6))
plt.bar(range(len(top_importances)), top_importances, tick_label=selected_features)
plt.xticks(rotation=90)
plt.xlabel('Features')
plt.ylabel('Importance')
plt.title('Tree-Based Methods: Feature Importances')
plt.show()

print("Selected Features:")
print(selected_features)
```

Selected Features:
['RM' 'LSTAT' 'DIS' 'CRIM' 'NOX']

In this example, we use a Random Forest model with 100 estimators to calculate feature importances. We select the top 5 features based on their importance scores and visualise them using a bar plot.

<article-image 
  src="https://res.cloudinary.com/dayqxxsip/image/upload/v1692207923/App%20Images/Blog%20Images/Article%20Images/Feature%20Selection/tree_zvx2cl.png" 
  alt="Selected feature importance" 
  loading="lazy" 
  styling=""
  caption="Selected feature importance" 
  captionsrc="https://res.cloudinary.com/dayqxxsip/image/upload/v1692207923/App%20Images/Blog%20Images/Article%20Images/Feature%20Selection/tree_zvx2cl.png" 
  :showsource="false">
</article-image>

## Principal Component Analysis (PCA)
Principal Component Analysis (PCA) is a dimensionality reduction technique that transforms the original features into a new set of uncorrelated variables called principal components. Let's visualise the explained variance ratio using a bar plot.

```python
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.datasets import load_boston

# Load the Boston Housing dataset
data = load_boston()
X = data.data
y = data.target

# Perform PCA
pca = PCA(n_components=5)
X_new = pca.fit_transform(X)

# Get the explained variance ratio
explained_variance = pca.explained_variance_ratio_

# Plot the explained variance ratio
plt.figure(figsize=(10, 6))
plt.bar(range(1, len(explained_variance) + 1), explained_variance)
plt.xlabel('Principal Components')
plt.ylabel('Explained Variance Ratio')
plt.title('Principal Component Analysis (PCA): Explained Variance Ratio')
plt.show()

# Get the loadings (principal component vectors)
loadings = pca.components_

# Create a loading plot
plt.figure(figsize=(10, 6))
for i, (loading, feature_name) in enumerate(zip(loadings, data.feature_names)):
    plt.arrow(0, 0, loading[0], loading[1], head_width=0.05, head_length=0.1, fc='blue', ec='blue')
    plt.text(loading[0], loading[1], feature_name, fontsize=12, ha='center', va='center', color='black')
plt.axhline(y=0, color='gray', linestyle='--', linewidth=0.8)
plt.axvline(x=0, color='gray', linestyle='--', linewidth=0.8)
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title('Loading Plot: Feature Contributions to Principal Components')
plt.grid(True)
plt.show()
```

Selected Features:
['CHAS' 'INDUS' 'CRIM', 'ZN', 'NOX']

In this example, we select the top 5 principal components that capture the most variance in the data. We visualise the explained variance ratio of these components using a bar plot for the 5 principal components, and a summary in two dimension space with 2 principal components to view the loadings / feature importances.

<article-image 
  src="https://res.cloudinary.com/dayqxxsip/image/upload/v1692207922/App%20Images/Blog%20Images/Article%20Images/Feature%20Selection/pca_yoiumv.png" 
  alt="PCA explained variance" 
  loading="lazy" 
  styling=""
  caption="PCA explained variance" 
  captionsrc="https://res.cloudinary.com/dayqxxsip/image/upload/v1692207922/App%20Images/Blog%20Images/Article%20Images/Feature%20Selection/pca_yoiumv.png" 
  :showsource="false">
</article-image>

<article-image 
  src="https://res.cloudinary.com/dayqxxsip/image/upload/v1692207922/App%20Images/Blog%20Images/Article%20Images/Feature%20Selection/pca-loadings_aul1re.png" 
  alt="PCA loadings for feature importance" 
  loading="lazy" 
  styling=""
  caption="PCA loadings for feature importance" 
  captionsrc="https://res.cloudinary.com/dayqxxsip/image/upload/v1692207922/App%20Images/Blog%20Images/Article%20Images/Feature%20Selection/pca-loadings_aul1re.png" 
  :showsource="false">
</article-image>

In the PCA example with the bar chart, the importance of variables is not directly represented by the bar heights as in feature importance plots. Instead, PCA focuses on transforming the original features into a new set of uncorrelated variables called principal components. These principal components are linear combinations of the original features and are sorted in descending order of the amount of variance they capture.

The explained variance ratio plot in the PCA example shows the proportion of the total variance in the dataset that each principal component explains. While this plot doesn't directly indicate which original features are the most important, it does help us understand the overall contribution of each principal component to the variability in the data.

In general, when you perform PCA, the first few principal components tend to capture most of the variance in the dataset. Therefore, the original features that contribute the most to these early principal components can be considered more important in terms of explaining the dataset's variability.

However, identifying which specific original features contribute most to a particular principal component can be challenging due to the linear combination nature of principal components. If you need to understand the relationship between the original features and specific principal components, you might need to perform further analysis, such as looking at the loadings of the principal components, which represent the contribution of each original feature to the construction of the principal component.

In summary, in a PCA analysis, the focus is more on understanding the variability and relationships between variables rather than directly identifying the "most important" variables as you would in other feature selection methods.



## Correlation-based Feature Selection
Correlation-based feature selection measures the correlation between each feature and the target variable, as well as the correlation between different features. Let's visualise the feature correlations using a heatmap.

```python
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from sklearn.datasets import load_boston

# Load the Boston Housing dataset
data = load_boston()
X = data.data
y = data.target

# Calculate feature correlations with target variable
correlations = np.abs(np.corrcoef(X.T, y)[:X.shape[1], -1])
sorted_indices = correlations.argsort()[::-1]

# Select the top k features
k = 5
selected_features = data.feature_names[sorted_indices[:k]]
top_correlations = correlations[sorted_indices[:k]]

print("Selected Features with correlation:")
print(selected_features)
print(top_correlations)

```

Selected Features with correlation:

['LSTAT' 'RM' 'PTRATIO' 'INDUS' 'TAX']

[0.73766273 0.69535995 0.50778669 0.48372516 0.46853593]

In this example, we calculate the absolute correlations between each feature and the target variable. We select the top 5 features with the highest correlations to the target variable y.

## Mutual Information
Mutual information measures the statistical dependency between two variables. In the context of feature selection, it quantifies the amount of information that one feature provides about the target variable. Let's visualise the feature scores using a bar plot.

```python
import matplotlib.pyplot as plt
from sklearn.feature_selection import SelectKBest, mutual_info_regression
from sklearn.datasets import load_boston

# Load the Boston Housing dataset
data = load_boston()
X = data.data
y = data.target

# Perform mutual information feature selection
selector = SelectKBest(score_func=mutual_info_regression, k=5)
X_new = selector.fit_transform(X, y)

# Get the selected feature indices
selected_indices = selector.get_support(indices=True)
selected_features = data.feature_names[selected_indices]

# Get the feature scores
scores = selector.scores_

# Plot the feature scores
plt.figure(figsize=(10, 6))
plt.bar(range(len(data.feature_names)), scores, tick_label=data.feature_names)
plt.xticks(rotation=90)
plt.xlabel('Features')
plt.ylabel('Scores')
plt.title('Mutual Information: Feature Scores')
plt.show()

print("Selected Features:")
print(selected_features)
```

Selected Features:
['INDUS' 'NOX' 'RM' 'PTRATIO' 'LSTAT']

In this example, we select the top 5 features based on mutual information scores using the mutual_info_regression score function. We visualise the feature scores using a bar plot. This method is also good for datasets with a categorical target but instead of using 'mutual_info_regression' as the `score_func` we would import and use 'mutual_info_classif' instead.

<article-image 
  src="https://res.cloudinary.com/dayqxxsip/image/upload/v1692208757/App%20Images/Blog%20Images/Article%20Images/Feature%20Selection/mutual_iu2rbc.png" 
  alt="Selected feature importance" 
  loading="lazy" 
  styling=""
  caption="Selected feature importance" 
  captionsrc="https://res.cloudinary.com/dayqxxsip/image/upload/v1692208757/App%20Images/Blog%20Images/Article%20Images/Feature%20Selection/mutual_iu2rbc.png" 
  :showsource="false">
</article-image>

## Sequential Feature Selection
Sequential Feature Selection is a method that combines multiple feature subsets and evaluates their performance using a machine learning model. Let's visualise the feature performance using a line plot.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression
from sklearn.datasets import load_boston

# Load the Boston Housing dataset
data = load_boston()
X = data.data
y = data.target

# Perform sequential feature selection
estimator = LinearRegression()
selector = SequentialFeatureSelector(estimator, n_features_to_select=5, direction='forward')
selector.fit(X, y)

# Get the selected feature indices
selected_indices = np.where(selector.support_)[0]
selected_features = data.feature_names[selected_indices]

# Get the feature performance (manually store performance scores)
performance = []

for step in range(1, len(selected_indices) + 1):
    subset_indices = selected_indices[:step]
    X_subset = X[:, subset_indices]
    score = -np.mean(np.abs(np.mean(LinearRegression().fit(X_subset, y).predict(X_subset) - y)))
    performance.append(score)

# Plot the feature performance
plt.figure(figsize=(10, 6))
plt.plot(range(1, len(performance) + 1), performance, marker='o')
plt.xticks(range(1, len(performance) + 1), selected_features, rotation=90)
plt.xlabel('Features')
plt.ylabel('Performance')
plt.title('Sequential Feature Selection: Feature Performance')
plt.show()

print("Selected Features:")
print(selected_features)
```

Selected Features:
['CRIM' 'CHAS' 'RM' 'PTRATIO' 'LSTAT']

In this example, we use LinearRegression as the estimator and select the top 5 features using the forward selection approach. We visualise the feature performance using a line plot.

<article-image 
  src="https://res.cloudinary.com/dayqxxsip/image/upload/v1692207922/App%20Images/Blog%20Images/Article%20Images/Feature%20Selection/sequential_fqdily.png" 
  alt="Selected feature importance" 
  loading="lazy" 
  styling=""
  caption="Selected feature importance" 
  captionsrc="https://res.cloudinary.com/dayqxxsip/image/upload/v1692207922/App%20Images/Blog%20Images/Article%20Images/Feature%20Selection/sequential_fqdily.png" 
  :showsource="false">
</article-image>

## Conclusion 

In conclusion, feature selection techniques are essential for improving machine learning models by selecting the most relevant features and reducing dimensionality. In this guide, we explored various techniques and applied them to the Boston Housing dataset.

By incorporating these feature selection techniques into your machine learning workflow, you can enhance model performance, reduce overfitting, and gain better insights into the underlying data patterns. Consider experimenting with different techniques and evaluating their impact on your specific dataset and task to identify the most effective feature subset. Ultimately, feature selection empowers you to build more robust, interpretable models that deliver accurate predictions and valuable insights.]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Understanding Explainable AI (XAI) for classification, regression and clustering with Python]]></title>
            <link>https://shedloadofcode.com/blog/understanding-explainable-ai-for-classification-regression-and-clustering-with-python/</link>
            <guid>https://shedloadofcode.com/blog/understanding-explainable-ai-for-classification-regression-and-clustering-with-python/</guid>
            <pubDate>Sat, 08 Jul 2023 17:47:00 GMT</pubDate>
            <description><![CDATA[We will explore the concepts of XAI in the context of classification, regression, and clustering models, and understand how these techniques can enhance the interpretability and trustworthiness of AI and ML systems.]]></description>
            <content:encoded><![CDATA[
## Introduction

Artificial Intelligence (AI) has become an integral part of our lives, with its applications spanning across various domains. However, one major concern associated with AI is its lack of transparency and explainability. In recent years, there has been a growing demand for Explainable AI (XAI) techniques that aim to shed light on the decision-making processes of AI models. In this blog post, we will explore the concepts of XAI in the context of classification, regression, and clustering, and understand how these techniques can enhance the interpretability and trustworthiness of AI systems. 

The primary goal of AI and machine learning is to build models that can analyse and interpret complex data, recognise patterns, make predictions or decisions, detect anomalies, optimise processes and adapt their behavior based on new information without too much human expertise or explicit programming.

Use the contents menu above to jump to classification, regression or clustering examples based on your interests. I carried out these analyses in the Spyder IDE.

## Classification and Explainable AI

Classification is a fundamental task in AI that involves assigning input data points to predefined categories or classes. Explainable AI techniques in classification aim to provide insights into how a model arrived at a particular classification decision. Let's take a closer look at some XAI methods commonly used in classification:

* Feature Importance: Feature importance techniques help identify which input features contribute the most to the classification decision. These methods assign scores or weights to each feature, allowing us to understand the relative importance of different inputs.

* Rule Extraction: Rule extraction methods attempt to extract a set of human-interpretable rules from a trained classification model. These rules provide a transparent representation of how the model makes decisions, enabling easier comprehension.

* Local Explanations: Local explanation methods focus on explaining individual predictions by highlighting the relevant features and their impact on the decision. Techniques like LIME (Local Interpretable Model-agnostic Explanations) generate locally faithful explanations that explain model behavior at specific instances.

Explainable models aim to address the "black box" nature of traditional classification models by providing insights into the underlying factors and reasoning behind each classification prediction. Here are some popular explainable classification models:

* Decision Trees: Decision trees are intuitive and transparent models that make decisions based on a sequence of rules. Each internal node represents a decision based on a specific feature, and each leaf node represents a class label. Decision trees provide a clear path of decision-making, making them inherently explainable.

* Rule-Based Models: Rule-based models generate a set of if-then rules that define the decision boundaries of the classification model. These rules are typically human-readable and provide a transparent representation of the decision-making process.

* Logistic Regression with L1 Regularisation: Logistic regression models with L1 regularisation can result in sparse solutions where only a subset of the input features is used for classification. This sparsity property allows for feature selection, indicating which features are most important for the classification decision.

## Decision Tree Classifier example

To demonstrate the process, we will use a scikit-learn Decision Tree Classifier with the Titanic dataset to train a model to predict whether a passenger survived the disaster. It's a common and well known dataset, so perfect for learning the XAI process.

We first import packages, read and prepare the dataset for the model, and split the data into training and testing sets. The training set (80% of the data) will be used to train the model, and the test set (20% of the data) acts as 'unseen data' to see how well the model works. Finally, we create a Decision Tree Classifier and train the model on the training set.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report
import matplotlib.pyplot as plt
import seaborn as sns

# Load the dataset
url = 'https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv'
data = pd.read_csv(url)

# Handle missing values
data.fillna(value={'Age': data['Age'].median()}, inplace=True)
data.fillna(value={'Embarked': data['Embarked'].mode()[0]}, inplace=True)

# Remove unnecessary columns
data.drop(['PassengerId', 'Name', 'Ticket', 'Cabin'], axis=1, inplace=True)

# Encode categorical variables
data = pd.get_dummies(data, columns=['Sex', 'Embarked'])

# Split the data into features and target variable
X = data.drop('Survived', axis=1)
y = data['Survived']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, 
                                                    y, 
                                                    test_size=0.2, 
                                                    random_state=42)

# Initialize the decision tree classifier
model = DecisionTreeClassifier(max_depth=3)

# Train the model
model.fit(X_train, y_train)
```

The prepared data looks like this.

<article-image 
  src="https://res.cloudinary.com/dayqxxsip/image/upload/v1688562889/App%20Images/Blog%20Images/Article%20Images/Explainable%20AI/titanic-prepared_tlirdx.png" 
  alt="Prepared dataset" 
  loading="lazy" 
  styling=""
  caption="Prepared dataset" 
  captionsrc="https://res.cloudinary.com/dayqxxsip/image/upload/v1688562889/App%20Images/Blog%20Images/Article%20Images/Explainable%20AI/titanic-prepared_tlirdx.png" 
  :showsource="false">
</article-image>

We now really want to explain how well this model has performed, what features are important to the model's decison making and how well we expect it to perform on new data. We can first check training set accuracy as a benchmark and feature importances. Later we will check the testing set (unseen data) accuracy.

```python
# Assess accuracy
train_accuracy = round(model.score(X_train, y_train) * 100, 2)

# Plot the feature importances
importances = model.feature_importances_
indices = np.argsort(importances)[::-1]
feature_names = X_train.columns
sorted_feature_names = feature_names[indices]

plt.figure()
plt.title("Feature importance")
plt.bar(range(X_train.shape[1]), importances[indices], align="center")
plt.xticks(range(X_train.shape[1]), sorted_feature_names, rotation='vertical')
plt.xlabel("Feature")
plt.ylabel("Importance")
plt.show()
```

<article-image 
  src="https://res.cloudinary.com/dayqxxsip/image/upload/v1688562647/App%20Images/Blog%20Images/Article%20Images/Explainable%20AI/decision_tree_feature_importance_dmtccb.png" 
  alt="Decision tree visual" 
  loading="lazy" 
  styling=""
  caption="Decision tree visual" 
  captionsrc="https://res.cloudinary.com/dayqxxsip/image/upload/v1688562647/App%20Images/Blog%20Images/Article%20Images/Explainable%20AI/decision_tree_feature_importance_dmtccb.png" 
  :showsource="false">
</article-image>

The training accuracy returns 83.43% and the feature importances show that Sex_female, Pclass, Age has the largest importance on the model's decisions. So this model can correctly classify 83.43% of this dataset. Not a bad start.

We can further break this down by visualising the decision tree.

```python
# Plot the decision tree 
fig = plt.figure(figsize=(35, 15))
plot = tree.plot_tree(model, 
                      feature_names=X.columns, 
                      class_names=['Not Survived', 'Survived'], 
                      filled=True,
                      fontsize=18)

plt.suptitle(f"Model accuracy score = {train_accuracy}%\nTraining sample = {len(X_train)} rows", 
             fontsize=18)
plt.savefig("tree.png")
```

<article-image 
  src="https://res.cloudinary.com/dayqxxsip/image/upload/v1688561837/App%20Images/Blog%20Images/Article%20Images/Explainable%20AI/decision_tree_p52vix.png" 
  alt="Decision tree visual" 
  loading="lazy" 
  styling=""
  caption="Decision tree visual" 
  captionsrc="https://res.cloudinary.com/dayqxxsip/image/upload/v1688561837/App%20Images/Blog%20Images/Article%20Images/Explainable%20AI/decision_tree_p52vix.png" 
  :showsource="false">
</article-image>

Let's interpret how the decision tree would classify a 30 year old male named Mike who was in passenger class 3.

> 1st condition
>
> Sex_female (Mike=1) <= 0.5 ~ True
>
> Mike fulfils the condition; we move to the left side of the tree.

> 2nd condition
>
> Age (Mike=30.0) <= 6.5 ~ False
>
> Mike doesn't fulfil the condition; we move to the right side of the tree.

> 3rd condition
>
> Pclass (Mike=3) <= 1.5 ~ False
>
> Mike doesn't fulfil the condition; we move to the right side of the tree.

> Last node
>
> The ultimate node, the leaf, tells us that the training dataset contained 354 males with a passenger class more than 1.5 of which > 42 survived (1) but 312 (0) didn't survive. 

Therefore, the chances of Mike surviving according to this model are 42 divided by 354:

42 / 354 = 0.1186440677966102

We get the answer that Mike had a 11.86% chance of surviving the Titanic accident and can understand how the model arrived at such a decision. We can confirm this later when passing in brand new data for the model to predict on.

Things to remember when interpreting decision tree diagrams:

* Nodes: Each node in the tree represents a decision point based on a specific feature and threshold. The topmost node is the root node, and subsequent nodes are internal nodes. The leaf nodes represent the final predictions.

* Splits: The edges or branches between nodes indicate the splits based on the feature and threshold values. For example, if a sample's feature value is greater than the threshold, it follows the right branch; otherwise, it follows the left branch.

* Gini Impurity or Information Gain: The plot_tree visual may also include measures such as Gini impurity or information gain. These metrics reflect the impurity or the amount of information gained by the split at each node. Lower values indicate more homogeneous child nodes, indicating better splits. In general, the Gini impurity ranges from 0 to 1, where 0 represents a perfectly pure node (all elements belong to the same class) and 1 represents a maximally impure node (elements are evenly distributed across all classes).

* Colors: By setting `filled=True` in the `plot_tree` function, the plot is filled with colors to represent the majority class in each node. The color intensity reflects the class distribution or the probability of each class.

* Samples: The plot may display the number of samples or observations that reach each node. It provides insights into the data distribution and the number of instances at different decision points.

* Value: Refers to the target or output variable that the decision tree is trying to predict or classify at each node. At each internal node of the tree, a decision is made based on a feature and its threshold, leading to a different branch depending on whether the condition is satisfied or not. Eventually, the tree reaches the leaf nodes, which correspond to the final predicted classes.

* Class: Refers to the distribution or count of samples belonging to each class at a specific node or leaf of the decision tree. This provides a breakdown of the samples in that node or leaf based on their class labels. It indicates the number of instances or the distribution of classes within that particular subset of the data. For example, the top node shows class=[444, 268] which means 444 did not survive and 268 survived.

* Feature Importance: The decision tree visual allows you to infer feature importance based on the position and depth of the features within the tree. Features closer to the root node are more influential in the decision-making process.

We can now also get a sense for how the model performed overall on the testing set by using a confusion matrix. We can see that the prediction success drops to 79.89% when applied to the test set (unseen data). We can see of the total test set of 179 records the model predicted 143 (92 + 51) correctly and 36 (13 + 23) incorrectly. This is an accuracy score of 143 / 179 = 0.79888 which confirms our score.

```python
# Plot a confusion matrix to assess prediction success
y_pred = model.predict(X_test)
test_accuracy = round(accuracy_score(y_test, y_pred) * 100, 2)
cm = confusion_matrix(y_test, y_pred)

plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues")
plt.title(f"Accuracy score = {test_accuracy}%\nTest sample = {len(X_test)} rows")
plt.xlabel("Predicted Labels")
plt.ylabel("True Labels")
plt.show()
```

<article-image 
  src="https://res.cloudinary.com/dayqxxsip/image/upload/v1688562760/App%20Images/Blog%20Images/Article%20Images/Explainable%20AI/confusion_matrix_ff9vxp.png" 
  alt="Confusion matrix" 
  loading="lazy" 
  styling=""
  caption="Confusion matrix" 
  captionsrc="https://res.cloudinary.com/dayqxxsip/image/upload/v1688562760/App%20Images/Blog%20Images/Article%20Images/Explainable%20AI/confusion_matrix_ff9vxp.png" 
  :showsource="false">
</article-image>

The same information can be found in the classification report. The classification report in scikit-learn provides a clear and concise summary of the model's performance for each class, as well as overall performance metrics.

```python
# Produce a classification report
report = classification_report(
    y_true=y_test,
    y_pred=y_pred,
    output_dict=True
)

report = pd.DataFrame(report)
```

|           | 0        | 1        | accuracy | macro avg | weighted avg |
| --------- | -------- | -------- | -------- | --------- | ------------ |
| precision | 0.8      | 0.796875 | 0.798883 | 0.798438  | 0.798708     |  |
| recall    | 0.87619  | 0.689189 | 0.798883 | 0.78269   | 0.798883     |  |
| f1-score  | 0.836364 | 0.73913  | 0.798883 | 0.787747  | 0.796167     |  |
| support   | 105      | 74       | 0.798883 | 179       | 179          |  |

This classification report shows:

* Precision: The precision for each class is the ratio of true positives (correctly predicted instances) to the sum of true positives and false positives (instances incorrectly predicted as positive). It measures the accuracy of positive predictions. Precision is reported for each class.

* Recall: The recall, also known as sensitivity or true positive rate, for each class is the ratio of true positives to the sum of true positives and false negatives (instances incorrectly predicted as negative). It measures the model's ability to correctly identify positive instances. Recall is reported for each class.

* F1-score: The F1-score is the harmonic mean of precision and recall. It provides a single metric that balances both precision and recall. The F1-score is reported for each class. The closer it is to 1, the better the model.

* Support: The support indicates the number of occurrences of each class in the true labels. It represents the number of samples belonging to each class.

* Accuracy: The accuracy is the proportion of correctly classified instances (both true positives and true negatives) to the total number of instances. It provides an overall measure of the model's performance.

* Macro average: The macro average is the average of precision, recall, and F1-score across all classes. It treats all classes equally, regardless of class imbalance.

* Weighted average: The weighted average is the average of precision, recall, and F1-score across all classes, weighted by the support (number of samples) of each class. It considers the class imbalance and provides a more representative evaluation metric.

We can apply this model to brand new unseen data. In this example we have 4 new passengers. 2 males and 2 females.

```python
# Pass in new unseen data to the model and get a prediction
columns = ["Pclass", "Age","SibSp","Parch","Fare","Sex_female",
           "Sex_male","Embarked_C","Embarked_Q", "Embarked_S"]

unseen_data = {
    "Pclass": [3, 1, 2, 1],
    "Age": [30, 15, 50, 28],
    "SibSp": [1, 2, 0, 0],
    "Parch": [0, 0, 0, 0],
    "Fare": [20.0, 20.0, 20.0, 35.5],
    "Sex_female": [0, 1, 1, 0],
    "Sex_male": [1, 0, 0, 1],
    "Embarked_C": [0, 1, 0, 0],
    "Embarked_Q": [0, 0, 1, 0],
    "Embarked_S": [1, 0, 0, 1]
}

unseen_df = pd.DataFrame(unseen_data, columns=columns)
predictions = model.predict(unseen_df)
probability = pd.DataFrame(model.predict_proba(unseen_df), 
                           columns=["Did Not Survive %", "Survived %"])

unseen_df["Survived Prediction"] = predictions
unseen_df["Survived Probability"] = probability["Survived %"]
```

Here are the results showing both female passengers are predicted to survive with a 96.87% probability, whereas both male passengers are not predicted to survive, with 11.86% (this profile matches Mike from earlier!) and 32.96% probability.

<article-image 
  src="https://res.cloudinary.com/dayqxxsip/image/upload/v1688562895/App%20Images/Blog%20Images/Article%20Images/Explainable%20AI/unseen_data_prediction_rknvak.png" 
  alt="Predictions on new unseen data" 
  loading="lazy" 
  styling=""
  caption="Predictions on new unseen data" 
  captionsrc="https://res.cloudinary.com/dayqxxsip/image/upload/v1688562895/App%20Images/Blog%20Images/Article%20Images/Explainable%20AI/unseen_data_prediction_rknvak.png" 
  :showsource="false">
</article-image>

Decision trees can be prone to overfitting as there is only one 'tree'. A Random Forest model can overcome this by assessing many trees using subsets of the data to avoid overfitting. You can still ouput feature importances with a Random Forest model, and they are generally more accurate, but are harder to explain to others! 

I will cover Logistic Regression and Random Forest models for classification in another article. Both are good alternative options.

The maximum depth of a decision tree determines the number of levels in the tree and directly impacts the complexity of the decision boundary. By setting a higher maximum depth, the decision tree can capture more complex relationships in the data, potentially resulting in a more intricate decision boundary. Conversely, reducing the maximum depth can lead to a simpler decision boundary.

## Rules-based Classifier example

An alternative approach to the Titanic classification problem is to use a rules based approach. Rule-based models typically provide deterministic predictions (0 or 1) based on the conditions of the rules. They do not inherently provide probabilistic outputs or confidence levels associated with predictions, which can be valuable for certain applications.

Despite these limitations, rule-based models can still be valuable in certain scenarios, especially when interpretability and explainability are essential requirements. They are often used in domains where human-understandable decision rules are preferred, such as expert systems, regulatory compliance, or auditing. Here's an example of implementing a rules based model in Python using the Titanic dataset:

```python
data = pd.read_csv('https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv')

# Define rules
rules = [
    {'condition': (data['Sex'] == 'female') & (data['Pclass'] <= 2) & (data['Age'] <= 50), 'prediction': 1},
    {'condition': (data['Sex'] == 'female') & (data['Pclass'] <= 2) & (data['Age'] > 50), 'prediction': 0},
    {'condition': (data['Sex'] == 'female') & (data['Pclass'] > 2), 'prediction': 1},
    {'condition': (data['Sex'] == 'male') & (data['Age'] <= 10), 'prediction': 1},
    {'condition': (data['Sex'] == 'male') & (data['Age'] > 10) & (data['Fare'] > 20), 'prediction': 1}
]

# Apply rules to make predictions
predictions = []
for rule in rules:
    condition = rule['condition']
    prediction = rule['prediction']
    predictions.append(condition & (data['Survived'] == prediction))

# Combine predictions
final_prediction = pd.concat(predictions, axis=1).any(axis=1)
data["Predicted"] = final_prediction.replace({True: 1, False: 0})

# Calculate accuracy
rules_based_model_accuracy = sum(final_prediction == data['Survived']) / len(data)
```

We define a list of rules, where each rule consists of a condition and a prediction. The condition is a boolean expression based on the features in the dataset, and the prediction represents the outcome if the condition is satisfied.

We then iterate over the rules and apply them to the dataset to make predictions. Each rule is evaluated as a boolean condition, and the predictions are stored in a list.

Finally, we combine the predictions using the logical OR operation, and compare the final prediction with the actual target variable ('Survived') to calculate the accuracy of the rule-based model. The accuracy of this model is 91.58% which suggests the rules are quite overfit, but that's okay if we want rigid well defined rules that are easily explainable, it's a trade off.


## Regression and Explainable AI

Regression is a type of supervised learning task that predicts continuous numerical values based on input variables. Explainable AI techniques in regression help us understand how the model estimates the relationship between the input features and the target variable. Here are some common XAI methods used in regression:

Partial Dependence Plots: Partial dependence plots visualize the relationship between a target variable and one or more input features while keeping other features fixed. These plots provide insights into how changes in the input variables impact the predicted outcome.

Feature Contribution: Feature contribution methods quantify the impact of each input feature on the regression model's predictions. They help identify the most influential features and their corresponding effects, aiding interpretability.

Model Simplification: Model simplification techniques aim to create simpler, more interpretable models that approximate the behavior of complex regression models. This simplification enhances transparency and enables easier comprehension of the underlying relationships.

## Linear Regression example

We will use a scikit-learn Linear Regression model with the Boston Housing dataset to train a model to predict house prices. It's another well known dataset. 

There are 14 attributes in each case of the dataset. They are:
* CRIM - per capita crime rate by town
* ZN - proportion of residential land zoned for lots over 25,000 sq.ft.
* INDUS - proportion of non-retail business acres per town.
* CHAS - Charles River dummy variable (1 if tract bounds river; 0 otherwise)
* NOX - nitric oxides concentration (parts per 10 million)
* RM - average number of rooms per dwelling
* AGE - proportion of owner-occupied units built prior to 1940
* DIS - weighted distances to five Boston employment centres
* RAD - index of accessibility to radial highways
* TAX - full-value property-tax rate per $10,000
* PTRATIO - pupil-teacher ratio by town
* B - 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
* LSTAT - % lower status of the population
* MEDV / target - Median value of owner-occupied homes in $1000's

We follow the same pattern as our first example.

```python
import math
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import (
    mean_squared_error,
    mean_absolute_error,
    median_absolute_error,
    r2_score
)
import matplotlib.pyplot as plt
import seaborn as sns

# Load the Boston Housing dataset
from sklearn.datasets import load_boston
boston = load_boston()

# Create a DataFrame from the dataset
data = pd.DataFrame(boston.data, columns=boston.feature_names)
data['target'] = boston.target

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    data.drop('target', axis=1), 
    data['target'], 
    test_size=0.2, 
    random_state=42
)

# Create and train the linear regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = model.predict(X_test)
```

The correlation coefficient ranges from -1 to 1. If the value is close to 1, it means that there is a strong positive correlation between the two variables. When it is close to -1, the variables have a strong negative correlation.

We can now evaluate the accuracy of the model.

```python 
# Calculate the residuals
residuals = y_test - y_pred

results = pd.DataFrame({'Actual': y_test, 
                        'Predicted': y_pred, 
                        'Residuals': residuals,
                        'Absolute Residuals': abs(residuals)})

# Identify incorrect predictions
results['Prediction Status'] = results['Absolute Residuals'] <= 5
close_predictions_count = len(results[results['Absolute Residuals'] <= 5])
results['Prediction Status'] = results['Prediction Status'].replace({
    True: 'Prediction +/- $5000',
    False: 'Prediction > $5000'
})

# Evaluate the model
print('Mean Square Error      = ' + str(mean_squared_error(y_test, y_pred)))
print('Root Mean Square Error = ' + str(math.sqrt(mean_squared_error(y_test, y_pred))))
print('Mean Absolute Error    = ' + str(mean_absolute_error(y_test, y_pred)))
print('Median Absolute Error  = ' + str(median_absolute_error(y_test, y_pred)))
print('R2                     = ' + str(r2_score(y_test, y_pred)))
print('')
print('% within +/- $5000     = ' + str(close_predictions_count / len(results)))
```

| Evaluation metric      | Value    |
| ---------------------- | -------- |
| Mean Square Error      | 24.29112 |
| Root Mean Square Error | 4.928602 |
| Mean Absolute Error    | 3.189092 |
| Median Absolute Error  | 2.324332 |
| R2                     | 0.668759 |
| % within +/- $5000     | 0.862745 |

In general, an R2 value of 0.66 means that approximately 66% of the variation in the target variable is explained by the regression model. This implies that the model captures a substantial portion of the underlying patterns in the data and performs better than simply using the mean value of the target variable for prediction. However, it also indicates that there is still some unexplained variation in the target variable that the model does not account for.

* The Mean Squared Error (MSE) is a measure of how close a fitted line is to data points. The Root Mean Squared Error (RMSE) is just the square root of the mean square error. That is probably the most easily interpreted statistic, since it has the same units as the quantity plotted on the vertical axisd.
* Root Mean Squared Error (RMSE): RMSE is the square root of the mean squared error and provides an interpretable metric in the same unit as the target variable. It penalizes larger errors more heavily compared to MSE.
* Mean Absolute Percentage Error (MAPE): MAPE measures the average percentage difference between the predicted and actual values. It is particularly useful when the scale of the target variable varies significantly.
* Coefficient of Determination (Adjusted R-squared): R-squared measures the proportion of the variance in the target variable explained by the regression model. Adjusted R-squared adjusts for the number of features in the model, penalizing the addition of irrelevant features.

We can now use the scatter plot below to compare the actual target values with the predicted values. This visualisation helps assess how closely the model's predictions align with the true values. I have highlighted those predictions the were within +/- $5000 as these can be assumed to be accurate.

```python
# Visualize actual vs predicted plot
plt.figure(figsize=(15, 6))
sns.scatterplot(x='Actual', 
                y='Predicted', 
                hue='Prediction Status', 
                data=results)
sns.lineplot(x=results['Actual'], 
             y=results['Actual'], 
             color='black', 
             label='Perfect Prediction')
plt.title(f'Testing set = {len(y_test)} rows\nActual vs. Predicted')
plt.xlabel('Actual Values ($1000)')
plt.ylabel('Predicted Values ($1000)')
plt.show()
```

<article-image 
  src="https://res.cloudinary.com/dayqxxsip/image/upload/v1688654172/App%20Images/Blog%20Images/Article%20Images/Explainable%20AI/actual_vs_predicted_mxc6eg.png" 
  alt="Actual vs predicted median house price values" 
  loading="lazy" 
  styling=""
  caption="Actual vs predicted median house price values ($1000)" 
  captionsrc="https://res.cloudinary.com/dayqxxsip/image/upload/v1688654172/App%20Images/Blog%20Images/Article%20Images/Explainable%20AI/actual_vs_predicted_mxc6eg.png" 
  :showsource="false">
</article-image>

Residuals, in the context of regression analysis, refer to the differences between the observed (actual) values and the predicted values obtained from a regression model. By examining the residuals, we can assess how well the regression model captures the patterns and trends in the data. A desirable regression model should have residuals that exhibit certain properties, such as being normally distributed around zero, showing no systematic patterns or trends, and having consistent variability across the range of the predicted values.

We can create a residuals plot like the one below. The `residuals` were calculated by subtracting the predicted values `y_pred` from the actual values `y_test`.

The `residplot()` function from seaborn is used to create the residuals vs. predicted values plot. It automatically fits and plots a linear regression line to the data points. The plot displays the relationship between the predicted values and the residuals.

The horizontal line at y=0 serves as a reference line to indicate where the residuals should ideally be centered. Residuals above the line indicate overestimation, while residuals below the line indicate underestimation.

```python
# Create the residuals vs. predicted values plot using seaborn
plt.figure(figsize=(15, 6))
sns.residplot(x=y_pred, y=residuals)
plt.axhline(y=0, color='red', linestyle='--')
plt.title('Residuals vs. Predicted Values')
plt.xlabel('Predicted Values ($1000)')
plt.ylabel('Residuals ($1000)')
plt.show()
```

<article-image 
  src="https://res.cloudinary.com/dayqxxsip/image/upload/v1688654237/App%20Images/Blog%20Images/Article%20Images/Explainable%20AI/residuals_vs_predictions_ksfqdj.png" 
  alt="Residuals vs predicted values plot" 
  loading="lazy" 
  styling=""
  caption="Residuals vs predicted values plot" 
  captionsrc="https://res.cloudinary.com/dayqxxsip/image/upload/v1688654237/App%20Images/Blog%20Images/Article%20Images/Explainable%20AI/residuals_vs_predictions_ksfqdj.png" 
  :showsource="false">
</article-image>

Visualising the distribution of residuals can help too. A histogram or a kernel density plot of the residuals can help assess if they are normally distributed. Deviations from normality may indicate model misspecification or the presence of outliers. We can see in this distribution that most residuals are within $5000 either way which we also found in our earlier actual vs predicted chart.

```python
# Create a histogram of residuals using seaborn
plt.figure(figsize=(15, 6))
sns.histplot(residuals, kde=True)
plt.title('Distribution of Residuals')
plt.xlabel('Residuals ($1000)')
plt.ylabel('Frequency')
plt.show()
```

<article-image 
  src="https://res.cloudinary.com/dayqxxsip/image/upload/v1688654214/App%20Images/Blog%20Images/Article%20Images/Explainable%20AI/residuals_distribution_ylakrs.png" 
  alt="Residuals vs predicted values plot" 
  loading="lazy" 
  styling=""
  caption="Residuals vs predicted values plot" 
  captionsrc="https://res.cloudinary.com/dayqxxsip/image/upload/v1688654214/App%20Images/Blog%20Images/Article%20Images/Explainable%20AI/residuals_distribution_ylakrs.png" 
  :showsource="false">
</article-image>

Finally, to figure out which features are most important to this model's predictions we can examine their coefficients to provide insights into the relationships between the features and the target variable. I have used `abs()` to rank absolute coefficients, regardless of whether they were positive or negative relationships.

```python
# Interpret the model
coefficients = pd.DataFrame({
    'Feature': list(X_train.columns.values), 
    'Coefficient': model.coef_,
    'Absolute Coefficient': abs(model.coef_)
})

feature_importance = coefficients.sort_values('Absolute Coefficient', 
                                              ascending=False).reset_index(drop=True)
```

| Feature | Coefficient | Absolute Coefficient |
| ------- | ----------- | -------------------- |
| NOX     | \-17.2026   | 17.20263             |
| RM      | 4.438835    | 4.438835             |
| CHAS    | 2.784438    | 2.784438             |
| DIS     | \-1.44787   | 1.447865             |
| PTRATIO | \-0.91546   | 0.915456             |
| LSTAT   | \-0.50857   | 0.508571             |
| RAD     | 0.26243     | 0.26243              |
| CRIM    | \-0.11306   | 0.113056             |
| INDUS   | 0.040381    | 0.040381             |
| ZN      | 0.03011     | 0.03011              |
| B       | 0.012351    | 0.012351             |
| TAX     | \-0.01065   | 0.010647             |
| AGE     | \-0.0063    | 0.006296             |

We can confirm these relationships using a pairplot with high coefficient features plotted against the target (median house price). 

```python
# Confirm feature importance with correlation pairplot
plt.figure(figsize=(30, 20))
sns.pairplot(data, 
             y_vars = ['target'],
             x_vars = ['PTRATIO', 'NOX', 'RM', 'LSTAT', 'AGE'])
plt.show()
```

<article-image 
  src="https://res.cloudinary.com/dayqxxsip/image/upload/v1688655477/App%20Images/Blog%20Images/Article%20Images/Explainable%20AI/feature_pairplot_bewhj8.png" 
  alt="Pairplot of target vs features" 
  loading="lazy" 
  styling=""
  caption="Pairplot of target vs high coefficient features" 
  captionsrc="https://res.cloudinary.com/dayqxxsip/image/upload/v1688655477/App%20Images/Blog%20Images/Article%20Images/Explainable%20AI/feature_pairplot_bewhj8.png" 
  :showsource="false">
</article-image>

We can further enhance our understanding using a correlation heatmap for all features. I have opted to set a threshold of more than 0.4 or less than -0.4 here to only display important correlations which makes this visual much easier to read. You can just pass in `correlation` instead of `masked_corr_matrix` if you want to view them all.

```python
# Check this against a Pearson correlation heatmap
# Only keep important correlations (more than 0.4 or less than -0.4)
correlation = data.corr()
masked_corr_matrix = correlation[(correlation > 0.4) | (correlation < -0.4)]
plt.figure(figsize=(20, 10))
sns.heatmap(masked_corr_matrix, 
            cmap="coolwarm", 
            annot=True, 
            fmt='.2f', 
            linewidths=.05).set_title("Correlation Heatmap")
plt.show()
```

<article-image 
  src="https://res.cloudinary.com/dayqxxsip/image/upload/v1688656867/App%20Images/Blog%20Images/Article%20Images/Explainable%20AI/masked_heatmap_extate.png" 
  alt="Correlation heatmap of all features" 
  loading="lazy" 
  styling=""
  caption="Correlation heatmap of correlated features. 0 is negative correlation. 1 is positive correlation." 
  captionsrc="https://res.cloudinary.com/dayqxxsip/image/upload/v1688656867/App%20Images/Blog%20Images/Article%20Images/Explainable%20AI/masked_heatmap_extate.png" 
  :showsource="false">
</article-image>

Another important point in selecting features for a linear regression model is to check for multicolinearity. The features RAD, TAX have a correlation of 0.91. These feature pairs are strongly correlated to each other. This can affect the model. Same goes for the features DIS and AGE which have a correlation of -0.75. We kept all the features in this example for simplicity.

## Clustering and Explainable AI

Clustering is an unsupervised learning task that involves grouping similar data points together based on their inherent patterns or characteristics. Although clustering lacks explicit labels, XAI techniques can still play a crucial role in understanding and validating the clustering results. Here are a few XAI methods in clustering:

Cluster Visualization: Visualizing the clustering results helps us understand how the data points are grouped together. Techniques like scatter plots, heatmaps, or dendrograms provide a visual representation of the clusters, aiding in interpretation.

Cluster Profiling: Cluster profiling techniques analyze the characteristics of each cluster, such as mean values, distribution, or other statistical measures. These profiles provide insights into the defining features of each cluster, enhancing interpretability.

Dimensionality Reduction: Dimensionality reduction methods, such as Principal Component Analysis (PCA) or t-SNE (t-Distributed Stochastic Neighbor Embedding), can help reduce the high-dimensional input space to a lower-dimensional representation that is more easily understandable and interpretable.

## K-means example

For this example we will use the Palmer Penguins dataset. It is created by Dr. Kristen Gorman and the Palmer Station, Antarctica LTER. This dataset contains the data of 344 penguins. Just like in the Iris dataset, there are 3 different species of penguins coming from 3 islands in the Palmer Archipelago. These three classes are Adelie, Chinstrap, and Gentoo. So we could use this dataset for classification supervised learning (labelled data).

But unlike the other examples we've seen, since clustering and dimensionality reduction are unsupervised methods, we will pretend we don't know what the classes are. We are only interested in grouping similar data points together based on their characteristics, helping us discover patterns and structure in data without pre-defined categories. This has real world uses including customer segmentation, market research and social network analysis. 

We will use  K-means clustering which is an unsupervised algorithm that groups data points into K distinct clusters based on their proximity to the cluster centroids. It iteratively assigns data points to the nearest centroid and updates the centroids until convergence, aiming to minimize the within-cluster sum of squares.

It's important to note that the clustering model is a tool to assist us in organizing and understanding data, but it doesn't provide definitive answers or predictions.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score
import matplotlib.pyplot as plt
import seaborn as sns

# Load the Palmer Penguin dataset
url = 'https://raw.githubusercontent.com/mwaskom/seaborn-data/master/penguins.csv'
data = pd.read_csv(url)  

# Drop missing
data = data.dropna()

# Keep species as known labels
known_labels = data['species'].values

# Select relevant features for clustering
features = data[['bill_length_mm', 
                 'bill_depth_mm', 
                 'flipper_length_mm', 
                 'body_mass_g']]

# Scale the data
scaler = StandardScaler()
scaled_features = pd.DataFrame(scaler.fit_transform(features),
                               columns=features.columns)

# Perform clustering using K-means
kmeans = KMeans(n_clusters=3, random_state=42)
kmeans.fit(scaled_features)
labels = kmeans.predict(scaled_features)
centroids = kmeans.cluster_centers_
```

This will give us `labels` as our clusters.

Note that in this example, we have used three clusters (n_clusters=3) but you can adjust the number of clusters as per your requirements in other datasets and experiment with different cluster sizes.

```python
# Add the cluster labels to the dataset
data['cluster'] = labels

# Profile each cluster using feature analysis
features_profile = data.groupby('cluster')\
    .agg(['mean', 'median', 'std'])

mean_features = data.groupby('cluster').mean()
    
# Compute the silhouette score to evaluate cluster quality
silhouette_avg = silhouette_score(scaled_features, labels)
```

After running this code, we add the predicted cluster `labels` back to the original data, and obtain the `features_profile` and `mean_features` values for each cluster, which will provide insights into the characteristics of the clusters by mean, median, and standard deviation. The mean feature values can help identify the statistical differences between clusters. 

**Mean feature values by cluster**
| cluster | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g |
| ------- | -------------- | ------------- | ----------------- | ----------- |
| 0       | 38.27674       | 18.12171      | 188.6279          | 3593.798    |
| 1       | 47.56807       | 14.99664      | 217.2353          | 5092.437    |
| 2       | 47.66235       | 18.74824      | 196.9176          | 3898.235    |

The silhouette score returned 0.58 before scaling and 1.00 after, which is a great silhouette score, they range between -1 and 1, with values closer to 1 indicating well-separated clusters and values closer to -1 indicating overlapping or poorly separated clusters. 

We can produce a scatter plot visualisation using PCA to display the clusters in a two-dimensional space. By reducing the dimensionality of the data using PCA, we can project the data onto these principal components, effectively creating a lower-dimensional representation of the original data. This lower-dimensional representation allows us to visualise the data in a more manageable and interpretable way.

```python
# Visualization using PCA
pca = PCA(n_components=2)
reduced_data = pca.fit_transform(scaled_features)

# Visualize the clusters
clustered_data = pd.DataFrame({'PCA Component 1': reduced_data[:, 0],
                               'PCA Component 2': reduced_data[:, 1], 
                               'Cluster': labels})
plt.figure(figsize=(15, 10))
sns.scatterplot(data=clustered_data, 
                x='PCA Component 1', 
                y='PCA Component 2', 
                hue='Cluster', 
                palette='viridis')
plt.xlabel('PCA Component 1')
plt.ylabel('PCA Component 2')
plt.title('PCA Results')
plt.show()
```

<article-image 
  src="https://res.cloudinary.com/dayqxxsip/image/upload/v1688660293/App%20Images/Blog%20Images/Article%20Images/Explainable%20AI/pca_cluster_spmo0s.png" 
  alt="PCA clusters chart" 
  loading="lazy" 
  styling=""
  caption="Principal components analysis (PCA) plot" 
  captionsrc="https://res.cloudinary.com/dayqxxsip/image/upload/v1688660293/App%20Images/Blog%20Images/Article%20Images/Explainable%20AI/pca_cluster_spmo0s.png" 
  :showsource="false">
</article-image>

Although the axes in a PCA plot do not directly correspond to individual features, the contributions of the original features to each principal component can be quantified. The loadings of the features on the principal components indicate their relative importance in explaining the variability in the data. This information can be used to assess which features have the most influence on the overall patterns observed in the PCA plot. 

```python
loadings = pca.components_

# Calculate the squared loadings (squared weights) for each feature
feature_importance = np.square(loadings)

# Sum the squared loadings across principal components to get the total importance for each feature
total_importance = np.sum(feature_importance, axis=0)

feature_importance_df = pd.DataFrame({'Feature': scaled_features.columns, 'Importance': total_importance})
feature_importance_df = feature_importance_df.sort_values('Importance', ascending=False)
feature_importance_df = feature_importance_df.reset_index(drop=True)
```

| Feature           | Importance |
| ----------------- | ---------- |
| bill_depth_mm     | 0.793125   |
| bill_length_mm    | 0.566126   |
| flipper_length_mm | 0.332761   |
| body_mass_g       | 0.307989   |

By calculating the squared loadings, we obtain the importance of each feature for each principal component. Summing the squared loadings across principal components provides the total importance for each feature. Finally, we sort the features based on their total importance to determine their ranking. As a general guideline, a common approach is to consider a total_importance value that captures a substantial amount of the variance in the data. For instance, a threshold of 0.80 or higher is often used, suggesting that the selected principal components account for at least 80% of the variance in the data.

At a higher level, we cannot view all of the features in two-dimensional space, but we can select two features to explore.

```python
# Visualize the clusters with two variables
plt.figure(figsize=(15, 10))
sns.scatterplot(data=data, 
                x='bill_length_mm', 
                y='flipper_length_mm', 
                hue='cluster', 
                palette='viridis')
plt.xlabel('Bill Length (mm)')
plt.ylabel('Flipper Length (mm)')
plt.title('Clustering Results')
sns.scatterplot(x=centroids[:, 0], 
                y=centroids[:, 2], 
                hue=range(3), 
                marker='X', 
                s=200, 
                palette=['black', 'black', 'black'],
                legend=False)
plt.show()
```

<article-image 
  src="https://res.cloudinary.com/dayqxxsip/image/upload/v1688660293/App%20Images/Blog%20Images/Article%20Images/Explainable%20AI/scatter_cluster_gatpdq.png" 
  alt="PCA clusters chart" 
  loading="lazy" 
  styling=""
  caption="Clusters and centroids for Flipper Length (mm) by " 
  captionsrc="https://res.cloudinary.com/dayqxxsip/image/upload/v1688660293/App%20Images/Blog%20Images/Article%20Images/Explainable%20AI/scatter_cluster_gatpdq.png" 
  :showsource="false">
</article-image>

The following two images below show the clusters identified by the model after scaling the data, and the actual penguin groupings. We can see that the images almost perfectly align which shows this model is performing very well at identifying distinct cluster groupings. It was signficantly less accurate before scaling the data.

```python
# Visualise the clusters in a pairplot
plt.figure(figsize=(30, 20))
sns.pairplot(data, 
             hue="cluster",
             palette='viridis',
             vars = ['bill_length_mm', 
                     'bill_depth_mm', 
                     'flipper_length_mm', 
                     'body_mass_g'])
plt.suptitle('Clusters after scaling')
plt.show()

# Visualise the actual penguin relationships and groupings
plt.figure(figsize=(30, 20))
sns.pairplot(data, 
             hue="species",
             vars = ['bill_length_mm', 
                     'bill_depth_mm', 
                     'flipper_length_mm', 
                     'body_mass_g'])
plt.suptitle('Actual Penguin groupings')
plt.show()
```

<article-image 
  src="https://res.cloudinary.com/dayqxxsip/image/upload/v1688662101/App%20Images/Blog%20Images/Article%20Images/Explainable%20AI/clustering_pairplot_rrdtli.png" 
  alt="PCA clusters chart" 
  loading="lazy" 
  styling=""
  caption="Clusters identified by the model after scaling the data" 
  captionsrc="https://res.cloudinary.com/dayqxxsip/image/upload/v1688662101/App%20Images/Blog%20Images/Article%20Images/Explainable%20AI/clustering_pairplot_rrdtli.png" 
  :showsource="false">
</article-image>

<article-image 
  src="https://res.cloudinary.com/dayqxxsip/image/upload/v1688662101/App%20Images/Blog%20Images/Article%20Images/Explainable%20AI/penguin_groupings_pairplot_zjf246.png" 
  alt="PCA clusters chart" 
  loading="lazy" 
  styling=""
  caption="Actual known labels of penguin groupings" 
  captionsrc="https://res.cloudinary.com/dayqxxsip/image/upload/v1688662101/App%20Images/Blog%20Images/Article%20Images/Explainable%20AI/penguin_groupings_pairplot_zjf246.png" 
  :showsource="false">
</article-image>

As a final piece of quality assurance, we can use a crosstab to examine the cluster labels vs known labels (species) to see how they align. The known labels won't usually be available in clustering problems because we're not trying to make a prediction, so this is a nice sense check whilst learning about using clustering.

```python
# Quality assure (QA) our clusters against known penguin species
qa = pd.DataFrame({'labels': labels, 'species': known_labels})
qa = pd.crosstab(qa['labels'], qa['species'])
```

| Cluster | Adelie | Chinstrap | Gentoo |
| ------- | ------ | --------- | ------ |
| 0       | 124    | 5         | 0      |
| 1       | 0      | 0         | 119    |
| 2       | 22     | 63        | 0      |

We can see that Cluster 1 contains all Gentroo! Cluster 0 is mostly Adelie. Cluster 2 is the weakest with mostly Chinstrap but some Adelie. On the whole though, this suggests a very well performing clustering model. Once again, we're not trying to predict anything with clustering, only to identify clear groupings in the data, and ensure those groupings are explainable.

## Benefits and importance of Explainable AI

The integration of XAI techniques into classification, regression, and clustering models offers several benefits:

* Transparency: XAI methods provide transparency by revealing the inner workings of AI models, making them more understandable to users and stakeholders.

* Trust: Enhanced explainability builds trust by enabling users to comprehend and verify the decisions made by AI systems.

* Bias Detection: XAI techniques can help identify and mitigate biases present in AI models, ensuring fair and unbiased decision-making.

* Compliance: In regulated industries, explainability is crucial for compliance with legal and ethical standards.

An article I found really interesting on all of these topics was [6 Lessons from a Data Scientist in the Banking Industry](https://towardsdatascience.com/6-lessons-from-a-data-scientist-in-the-banking-industry-11dc4a8a7234). A quote that really hit me during that article was:

> I exclusively build models using logistic regression. I am not alone. From banking to insurance, much of the financial world runs on regression. Why?
>
>
> Because these models work.
>
> ...
>
>
> With regression, I ended up with models that had 8 to 10 features. Each of these features had to be thoroughly explained. A non-technical colleague had to agree they captured a relationship that existed in reality.
>
>...
>
> This was a source of disappointment. Leaving uni, I had learned so much about random forests, XGBoost and neural networks. I was excited to apply these techniques. In the first week, I remember one of my senior colleagues saying:
> 
> “Forget about all those fancy models”

This echoes that a simple model that is easy to explain to a non-technical audience, is better than a more accurate but more complex model that is much harder to explain.

## Conclusion

Explainable AI is a rapidly evolving field that aims to make AI models more transparent and interpretable. By incorporating XAI techniques into classification, regression, and clustering, we can gain insights into the decision-making processes of these models. Enhanced transparency not only facilitates user understanding but also promotes trust, fairness, and accountability in AI systems. As AI continues to shape our world, it becomes imperative to prioritise explainability.

As always, if you enjoyed this article, be sure to check out [other articles on the site](/). You may be interested in [Concepts of Artificial Intelligence with Python - a review of CS50 AI](/blog/concepts-of-artificial-intelligence-with-python-a-review-of-cs50-ai/).]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[How to match and count keywords in text using JavaScript]]></title>
            <link>https://shedloadofcode.com/blog/how-to-match-and-count-keywords-in-text-using-javascript/</link>
            <guid>https://shedloadofcode.com/blog/how-to-match-and-count-keywords-in-text-using-javascript/</guid>
            <pubDate>Tue, 04 Jul 2023 20:30:00 GMT</pubDate>
            <description><![CDATA[Keywords play a crucial role in analyzing and extracting information from text data. Whether you're building a search functionality or conducting text analysis, being able to match and count keywords in JavaScript can be a valuable skill.]]></description>
            <content:encoded><![CDATA[
## Introduction

Keywords play a crucial role in analysing and extracting information from text data. Whether you're building a search functionality or conducting text analysis, being able to match and count keywords in JavaScript can be a valuable skill. In this article, we will explore a step-by-step approach to achieving this using JavaScript. I used this approach whilst creating an interactive JavaScript tool [Job Application Keyword Checker](/tools/job-application-keyword-calculator/). Be sure to check it out!

## Define your keywords and text

The first step is to define the keywords you want to search for in the text. Create an array and populate it with the keywords you wish to match. Next, you need to obtain the text in which you want to search for the keywords. This can be any string of text you have or even user input. For demonstration purposes, let's assume we have the following:

```js
const keywords = ["this", "where", "keywords", "none"];

const text = "This is the input text where we will search for keywords.";
```

Be sure to customise this array with your own set of keywords and you own text input.

## Match and count keywords using regex

Now that we have our keywords and text ready, let's proceed with the matching and counting process. We will iterate over each keyword and utilise regular expressions to find matches in the text. We'll also count the occurrences of each keyword.

```js
let keywordCount = 0;
const keywordCounts = {};

keywords.forEach(keyword => {
  const regex = new RegExp(keyword, "gi");
  const matches = text.match(regex);

  if (matches) {
    keywordCount[keyword] = matches.length;
    keywordCount++;
  } else {
    keywordCount[keyword] = 0;
  }
});

console.log(keywordCount);
```

In the code above, we iterate over each keyword and create a regular expression using the keyword and the "gi" flags. The "g" flag enables a global search to find all occurrences of the keyword, while the "i" flag ensures case-insensitive matching.

Using the match method on the text with the regular expression, we find all the matches. If matches are found, we store the count in the keywordCount object; otherwise, we set the count to 0.

Finally, we log the keywordCount object to the console, which displays the count of each keyword in the text.

## Match and count keywords using array.includes()

An alternative and more iterative approach is to transform the input text to an array and then match words from each array. We first would need a function to transform a string into an array.

```js
/**
 * Parses an input string and transforms it into 
 * an array of words
 */
function getWords(str) {
    let words = str.toLowerCase().split(" ");
    let uniqueWords = [...new Set(words)];
    
    for (let i = 0; i < uniqueWords.length; i++) {
        uniqueWords[i] = uniqueWords[i].replace(/-/g, " ");
    } 

    return uniqueWords;
}
```

We can then use this to match the keywords. We use `toLowerCase` to avoid case sensitive mismatches.

```js
let textArray = getWords(text);
let matchedWords = [];

// Go over each word in the text array and find matches
for (let i = 0; i < textArray.length; i++) {
    let word = textArray[i].toLowerCase();

    if (!matchedWords.includes(word)) {
        if (keywords.includes(word)) {
            matchedWords.push(word);
        }
    }
}

// Then go over all keywords to cross check
for (let i = 0; i < keywords.length; i++) {
    let term = keywords[i];

    if (!matchedWords.includes(term)) {
        if (text.toLowerCase().includes(term)) {
            matchedWords.push(term);
        }
    }
} 

console.log(matchedWords.length);
```

## Conclusion

Matching and counting keywords in text is a useful technique when working with JavaScript and textual data. By following the steps outlined in this article, you can easily implement this functionality into your own projects. Remember to customize the keywords and text variables to match your specific use case.

Feel free to experiment and enhance this code further by considering variations of keywords, such as plural forms or different tenses. Advanced techniques like stemming or lemmatisation can be employed to achieve more comprehensive keyword matching. Stemming is the process of reducing words to their base or root form, disregarding variations like tense or plural forms, to improve keyword matching and analysis in text data whereas lemmatisation is the process of reducing words to their base or dictionary form. So a good example would be the word "running" becomes "run".

Harnessing the power of JavaScript and keyword matching opens up possibilities for creating powerful search engines, text analysis tools, and much more. Start exploring and leveraging this technique to unlock the potential within your own solutions!

As always, if you enjoyed this article, be sure to check out [other articles on the site](/). 

If you are interested in finding out how to search for keywords using Python, then check out [Using PyPDF2 to score keywords in a job application](/blog/using-PyPDF2-to-score-keywords-in-a-job-application/).

]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Using PyPDF2 to score keywords in a job application]]></title>
            <link>https://shedloadofcode.com/blog/using-PyPDF2-to-score-keywords-in-a-job-application/</link>
            <guid>https://shedloadofcode.com/blog/using-PyPDF2-to-score-keywords-in-a-job-application/</guid>
            <pubDate>Wed, 28 Jun 2023 20:30:00 GMT</pubDate>
            <description><![CDATA[Count essential and desirable keywords to score, sift and rank job applications. A model like this can bring a better quantitative assessment, whereas a human reviewer can bring a better qualitative assessment. Both are valuable.]]></description>
            <content:encoded><![CDATA[
## Introduction

AI and automated models will be used alongside human expertise more and more in the future. This article will explore a simple but useful example of this by counting and assessing keywords in job applications using Python. A model can bring a better quantitative assessment, whereas a human reviewer can bring a better qualitative assessment. Both are valuable.

## What are the benefits?

I sit on interview panels to select and onboard apprentices, degree apprenticeships alongside junior and intermediate staff at a large organisation, for both the software / web development and the data science sides of the business. Managing this in combination with the day job, using AI and automation is super helpful. Sifting 100+ applicants can take many hours from many people.

It helps take a more objective approach and to be more critical. Did the candidate just load up on buzzwords without any real substance? Did the candidate use only a few target words but have solid examples that demonstrated the skills required? Did the candidate give solid examples which also included the target words? Which candidate would you invite to interview?

## Understanding the PDF input

I can't share the individual job applications of course due to data protection, but I can share the model code and show what the outputs look like. You can then use this approach and tailor it to your specific needs.

The way that the organisation processes job applications means that only a single PDF is given to the panel with all of them combined. This enables anonymity and fairness in that you only see a candidate number and the application itself. No identifiable information given. It also meant that the model would first need to separate this large PDF file into the constituate applications. You will see in the code, I achieved this by splitting the text of the file on 'Application ID:'.

This is what the large PDF file looked like. I have censored all text for privacy.

<article-image 
  src="https://res.cloudinary.com/dayqxxsip/image/upload/v1685293993/App%20Images/Blog%20Images/Article%20Images/PyPDF2%20Job%20Application%20Siftbot/sift-example_nhshk0.png" 
  alt="PDF job applications" 
  loading="lazy" 
  styling=""
  caption="A single PDF containing 109 job applications" 
  captionsrc="https://res.cloudinary.com/dayqxxsip/image/upload/v1685293993/App%20Images/Blog%20Images/Article%20Images/PyPDF2%20Job%20Application%20Siftbot/sift-example_nhshk0.png" 
  :showsource="false">
</article-image>

If your situation requires many files instead of just one that requires splitting up, you can adapt this code using the approach found in [Searching for text in PDFs at increasing scale](/blog/searching-for-text-in-pdfs-at-increasing-scale/).

## Creating the model

Before taking a look at the model, here is a brief summary of what's going on:

* We define essential and desirable `criteria` keywords to look for.
* We then use PyPDF2 to `read_applications` from the PDF using the filepath to it.
* After splitting the text into separate applications we then `score_applications` using regex to count keyword matches.
* Finally, we use `scores.describe()` to provide summary statistics.

```python [siftbot.py]
# -*- coding: utf-8 -*-
"""
A scoring model to help with job application sifting.

Enter keywords for the role essential and desirable criteria, then run the program.

The outputs will be saved in the 'applications', 'scores' and 'statistics' variables.

Documentation for PyPDF2: https://pypdf2.readthedocs.io/en/3.0.0/
Migration guide for PyPDF2: https://pypdf2.readthedocs.io/en/3.0.0/user/migration-1-to-2.html
"""
import re
import time
import PyPDF2
import pandas as pd


def criteria():
    return {
        "essential": [
            "maths",
            "a level",
            "numeric",
            "analytical",
            "technologi",
            "language",
            "data",
            "business challenge",
            "problem solving",
            "communicat"
        ],
        "desirable": [
            "programming skills",
            "analysis",
            "data manipulation",
            "analytical software",
            "software packages",
            "RStudio",
            "SQL",
            "Power BI",
            "Excel",
            "mathematical models",
            "infrastructure",
            "security",
            "web design",
            "agile",
            "agile project methodology",
            "customer facing",
            "technical and non-technical",
            "data architecture",
            "innovative"
        ]
    }


def read_applications(filepath: str) -> list:
    pdf_reader = PyPDF2.PdfReader(filepath) # Formerly PyPDF2.PdfFileReader(filepath)
    number_of_pages = pdf_reader.getNumPages()
    all_text = ""
    
    for i in range(0, number_of_pages):
        pages = pdf_reader.pages[i] # Formerly reader.getPage(pageNumber)
        text = pages.extractText()
        all_text += text
    
    applications = all_text.split("Application ID:")
    
    return applications


def score_applications(applications: list, criteria: dict):
    scores = []
    
    for application in applications:
        score = {
            "application_id": application[1:8],
            "word_count": 0,
            "essential": 0,
            "desirable": 0,
            "matched_terms": ""
        }
        
        for term in criteria["essential"]:
            if re.search(term, application):
                print(f"Matched '{term}' in application {score['application_id']}")
                score["essential"] += 1
                score["matched_terms"] += (term + " ")
                
        for term in criteria["desirable"]:
            if re.search(term, application):
                print(f"Matched '{term}' in application {score['application_id']}")
                score["desirable"] += 1
                score["matched_terms"] += (term + " ")
                
        score["word_count"] = len(application.split())
                
        scores.append([
            score["application_id"],
            score["word_count"],
            score["essential"],
            score["desirable"],
            score["essential"] + score["desirable"],
            score["matched_terms"]
        ])
    
    columns = ["Application ID", "Word Count", "Essential", 
               "Desirable",      "Combined",   "Matched Terms"]
                
    return pd.DataFrame(scores, columns=columns)
      

if __name__ == "__main__":
    start = time.time()
    
    applications = read_applications("C:\\Users\\shedloadofcode\\OneDrive\\Documents\\Recruitment\\Recruitment Jan 2023\\Applications\\applications for sift (109).pdf")
    scores = score_applications(applications, criteria())
    statistics = scores.describe()
    
    print(f"Bot finished in {round(time.time() - start, 2)} seconds")
```

I created the model using Spyder IDE and the key variables are then stored as outputs in the variable explorer - the top right window.

<article-image 
  src="https://res.cloudinary.com/dayqxxsip/image/upload/v1687996752/App%20Images/Blog%20Images/Article%20Images/PyPDF2%20Job%20Application%20Siftbot/image_yrefpn_brefzu.png" 
  alt="PDF job applications" 
  loading="lazy" 
  styling=""
  caption="Using Spyder IDE to build the model and view the outputs" 
  captionsrc="https://res.cloudinary.com/dayqxxsip/image/upload/v1687996752/App%20Images/Blog%20Images/Article%20Images/PyPDF2%20Job%20Application%20Siftbot/image_yrefpn_brefzu.png" 
  :showsource="false">
</article-image>

## Viewing the outputs

Using the variable explorer the outputs can be analysed. We can first sense check that all of the applications were split up correctly on 'Application ID:' and that there are 109 records as expected in `applications`.

<article-image 
  src="https://res.cloudinary.com/dayqxxsip/image/upload/v1685293993/App%20Images/Blog%20Images/Article%20Images/PyPDF2%20Job%20Application%20Siftbot/applications_output_ykygcz.png" 
  alt="Applications variable output" 
  loading="lazy" 
  styling=""
  caption="The applications variable output" 
  captionsrc="https://res.cloudinary.com/dayqxxsip/image/upload/v1685293993/App%20Images/Blog%20Images/Article%20Images/PyPDF2%20Job%20Application%20Siftbot/applications_output_ykygcz.png" 
  :showsource="false">
</article-image>

The results of the scoring is shown in `scores` which is super helpful by providing an application word count, a count of essential and desirable keywords matched, a combined count, and the matched terms themselves. 

<article-image 
  src="https://res.cloudinary.com/dayqxxsip/image/upload/v1685293993/App%20Images/Blog%20Images/Article%20Images/PyPDF2%20Job%20Application%20Siftbot/sift-scores_abmxe7.png" 
  alt="Scores variable output" 
  loading="lazy" 
  styling=""
  caption="The scores variable output" 
  captionsrc="https://res.cloudinary.com/dayqxxsip/image/upload/v1685293993/App%20Images/Blog%20Images/Article%20Images/PyPDF2%20Job%20Application%20Siftbot/sift-scores_abmxe7.png" 
  :showsource="false">
</article-image>

This means you can sort by essential, desirable or total keywords matched. It also opens up further insights, such as 'did a candidate have a high word count, but didn't use many keywords?'.

To aid with these kinds of questions, we can view the `statistics` output to find out min, max, mean and median (50%) word counts, essential, desirable and combined counts. This helps us to assess how a particular candidate compares to the average in terms of word count vs matched terms.

<article-image 
  src="https://res.cloudinary.com/dayqxxsip/image/upload/v1685293993/App%20Images/Blog%20Images/Article%20Images/PyPDF2%20Job%20Application%20Siftbot/statistics_output_xjf3bp.png" 
  alt="Summary statistics output" 
  loading="lazy" 
  styling=""
  caption="The summary statistics variable output" 
  captionsrc="https://res.cloudinary.com/dayqxxsip/image/upload/v1685293993/App%20Images/Blog%20Images/Article%20Images/PyPDF2%20Job%20Application%20Siftbot/statistics_output_xjf3bp.png" 
  :showsource="false">
</article-image>

The `scores` DataFrame could also be saved to a CSV file to share with other panel members.

## Final thoughts and interactive tool

Thanks for reading 😄 I hope you found this article interesting. 

I used the logic from the code in this article to create an interactive JavaScript tool [Job Application Keyword Checker](/tools/job-application-keyword-calculator/). Be sure to check it out and give it a go, you can get started by hitting 'Show me an example' and take it from there!

My final thought is that we should never blindly trust a model, especially in cases such as these where we are assessing suitability for a job position. A quantitative model can only get us so far. We should always carry out a human review and ask critical questions such as: 

* Is the candidate strong even though they didn't directly match many keywords?
* Did the candidate just drop all the keywords into their application without really understanding them?

This ensures fairness and avoids simple keyword matching bias, whilst also allowing the model to aid in decision making and speed up reviews.

As always, if you enjoyed this article, be sure to check out [other articles on the site](/).

]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[How to create animated charts with Python and Plotly]]></title>
            <link>https://shedloadofcode.com/blog/how-to-create-animated-charts-with-python-and-plotly/</link>
            <guid>https://shedloadofcode.com/blog/how-to-create-animated-charts-with-python-and-plotly/</guid>
            <pubDate>Fri, 14 Apr 2023 20:30:00 GMT</pubDate>
            <description><![CDATA[Learn how to build animated bar, line and scatter charts alongside animated maps.]]></description>
            <content:encoded><![CDATA[
## Introduction

In this short article we'll learn how to create animated charts using Python and Plotly. This follows on from the theme of the previous article [How to build and visualise a Monte Carlo simulation with Python and Plotly](/blog/how-to-build-and-visualise-a-monte-carlo-simulation-with-python-and-plotly/) Using these techniques can better help to tell the story when it comes to communicating data insights and changes over time periods or stages.

The best way to get started quickly with animated charts, is to learn from examples and then start to apply your own datasets to them. All the examples in this article will follow the pattern 'show me the code that generates the chart' then 'show me what that chart looks like'. 

You'll see that [Plotly makes generating animated charts](https://plotly.com/python/animations/) in Python relatively straightforward, and attaches a play and stop button to the chart as standard. Auto play is enabled by default, but for the charts embedded on this page, I set this to false, so you need to hit the play button on the chart to start the animation.

There are of course options to customise Plotly charts further. All of the code and outputs for the charts can be [found on GitHub](https://github.com/shedloadofcode/animated-plotly-charts).

## Animated bar chart

```python 
import plotly.express as px

df = px.data.gapminder()

fig = px.bar(df, 
             x="continent", 
             y="pop", 
             animation_frame="year", 
             animation_group="country", 
             hover_name="country",
             range_y=[0,4000000000],
             color="continent",
             color_discrete_map={
                'Asia': '#1d70b8',
                'Europe': '#f47738',
                'Africa': '#28a197',
                'Americas': '#6f72af',
                'Oceania': '#d53880'
            })

fig.update_layout(
        title="Global population growth over time.",
        xaxis_title="Continent",
        yaxis_title="Population",
        legend_title="Legend Title",
        showlegend=False,
        font=dict(
            family="Arial",
            size=14
        ),
        paper_bgcolor='rgba(0,0,0,0)',
        plot_bgcolor='rgba(0,0,0,0)')

fig.write_html("outputs/animated_bar.html", auto_play=False)
```

<iframe height="800" width="100%" loading="lazy" src="https://htmlpreview.github.io/?https://raw.githubusercontent.com/shedloadofcode/animated-plotly-charts/main/outputs/animated_bar.html"></iframe>

## Animated line chart

```python
import plotly.graph_objects as go
import pandas as pd

dates = ["2022-12-03", "2022-12-04", "2022-12-05", "2022-12-06", "2022-12-07", "2022-12-08", "2022-12-09"]
school_a = [86.77, 80.74, 79.48, 76.47, 75.44, 74.49, 70.41]
school_b = [92.77, 91.64, 90.68, 92.37, 92.84, 90.29, 92.71]

df = pd.DataFrame(list(zip(dates, school_a, school_b)),
                  columns=['date', 'school_a', 'school_b'])

fig = go.Figure(
    layout=go.Layout(
        updatemenus=[dict(type="buttons", direction="right", x=0.9, y=1.16), ],
        xaxis=dict(range=["2022-12-02", "2022-12-10"],
                   autorange=False, tickwidth=2,
                   title_text="Time"),
        yaxis=dict(range=[0, 100],
                   autorange=False,
                   title_text="Price")
    ))

# Add traces
i = 1

fig.add_trace(
    go.Scatter(x=df.date[:i],
               y=df.school_a[:i],
               name="School A",
               visible=True,
               line=dict(color="#f47738", dash="dash")))

fig.add_trace(
    go.Scatter(x=df.date[:i],
               y=df.school_b[:i],
               name="School B",
               visible=True,
               line=dict(color="#1d70b8", dash="dash")))

# Animation
fig.update(frames=[
    go.Frame(
        data=[
            go.Scatter(x=df.date[:k], y=df.school_a[:k]),
            go.Scatter(x=df.date[:k], y=df.school_b[:k])]
    )
    for k in range(i, len(df) + 1)])

fig.update_xaxes(ticks="outside", tickwidth=2, tickcolor='white', ticklen=10)
fig.update_yaxes(ticks="outside", tickwidth=2, tickcolor='white', ticklen=1)
fig.update_layout(yaxis_tickformat=',')
fig.update_layout(legend=dict(x=0, y=1.1), legend_orientation="h")

# Buttons
fig.update_layout(title="Attendance % of two schools over time.",
                  xaxis_title="Date",
                  yaxis_title="Attendance %",
                  legend_title="Legend Title",
                  showlegend=False,
                  font=dict(
                      family="Arial",
                      size=14
                  ),
                  paper_bgcolor='rgba(0,0,0,0)',
                  plot_bgcolor='rgba(0,0,0,0)',
                  hovermode="x",
                  updatemenus=[
                        dict(
                            buttons=list([
                                dict(label="Play",
                                     method="animate",
                                     args=[None, {"frame": {"duration": 500}}]),
                                dict(label="School A",
                                     method="update",
                                     args=[{"visible": [False, True]},
                                           {"showlegend": True}]),
                                dict(label="School B",
                                     method="update",
                                     args=[{"visible": [True, False]},
                                          {"showlegend": True}]),
                                dict(label="All",
                                     method="update",
                                     args=[{"visible": [True, True, True]},
                                          {"showlegend": True}]),
                            ]))
                        ]
                    )

fig.write_html("outputs/animated_line.html", auto_play=False)
```

<iframe height="800" width="100%" loading="lazy" src="https://htmlpreview.github.io/?https://raw.githubusercontent.com/shedloadofcode/animated-plotly-charts/main/outputs/animated_line.html"></iframe>

## Animated scatter chart

```python
import plotly.express as px


df = px.data.gapminder()


fig = px.scatter(df, 
                 x="gdpPercap", 
                 y="lifeExp", 
                 animation_frame="year", 
                 animation_group="country",
                 size="pop", 
                 color="continent", 
                 hover_name="country",
                 log_x=True,
                 size_max=55, 
                 range_x=[100,100000], 
                 range_y=[25,90])

fig.add_hline(y=72, 
              line_width=2, 
              line_dash='dash', 
              line_color='lightgray',
              annotation_text='',
              annotation_font=dict(
                family="Arial",
                size=15,
                color="lightgray"
               ),
               annotation_font_size=15,
               annotation_position='bottom left',
               fillcolor='lightgray')

fig.add_shape(type="line",
              x0=12000, 
              y0=0, 
              x1=12000, 
              y1=100,
              line_width=2,
              line_color='lightgray',
              line_dash='dash')

fig.add_annotation(x=0.15,
                   xref='paper',
                   yref='paper',
                   xanchor='left',
                   y=0.15,
                   yanchor='top',
                   text="Below average",
                   font=dict(
                        color="black",
                        size=20,
                        family="Arial"
                    ),
                    showarrow=False)

fig.add_annotation(x=0.85,
                   xref='paper',
                   yref='paper',
                   xanchor='left',
                   y=0.95,
                   yanchor='top',
                   text="Above average",
                   font=dict(
                        color="black",
                        size=20,
                        family="Arial"
                    ),
                    showarrow=False)


fig.add_annotation(x=.99,
                   xref='paper',
                   xanchor='right',
                   y=27,
                   yanchor='bottom',
                   text="<b>Data last updated 2008</b>",
                   font=dict(
                       color="gray",
                       size=14
                   ),
                   showarrow=False)


fig.update_layout(
        title="Global life expectancy and GDP per capita over time.",
        xaxis_title="GDP per capita",
        yaxis_title="Life expectancy",
        legend_title="Legend Title",
        showlegend=False,
        font=dict(
            family="Arial",
            size=14
        ),
        paper_bgcolor='rgba(0,0,0,0)',
        plot_bgcolor='rgba(0,0,0,0)')


fig.write_html("outputs/animated_scatter.html", auto_play=False)
```

<iframe height="800" width="100%" loading="lazy" src="https://htmlpreview.github.io/?https://raw.githubusercontent.com/shedloadofcode/animated-plotly-charts/main/outputs/animated_scatter.html"></iframe>

## Bonus: Animated choropleth map

For this chart I had to use a [Jupyter Notebook launched with Anaconda](https://medium.com/analytics-vidhya/fastest-way-to-install-geopandas-in-jupyter-notebook-on-windows-8f734e11fa2b) with an environment dedicated to GeoPandas. I had a few issues installing GeoPandas but this method worked ok. You can view the [entire notebook on GitHub](https://github.com/shedloadofcode/animated-plotly-charts/blob/main/animated%20map%20choropleth.ipynb).

You'll see in the Jupyter Notebook, the final cell creates the choropleth map using [mapbox](https://www.mapbox.com/) - you will need to sign up to get a free API key to use this.

```python
fig = px.choropleth_mapbox(df_crime_final,
                           geojson=geojson,
                           featureidkey='properties.name',
                           locations='NEIGHBOURHOOD',
                           color='Count',
                           hover_name='NEIGHBOURHOOD',
                           hover_data=['Count'],
                           color_continuous_scale='Reds',
                           animation_frame='Date',
                           mapbox_style='carto-positron',
                           title='Cumulative Numbers of Crimes in Vancouver Neighborhoods',
                           center={'lat':49.25, 'lon':-123.13},
                           zoom=11,
                           opacity=0.75,
                           labels={'Count':'Count'},
                           width=1200,
                           height=800)

fig.write_html("outputs/animated_choropleth.html", auto_play=False)
```

<iframe height="800" width="100%" loading="lazy" src="https://htmlpreview.github.io/?https://raw.githubusercontent.com/shedloadofcode/animated-plotly-charts/main/outputs/animated_choropleth_yoy.html"></iframe>

## Deployment options

Now you've seen some examples of animated charts you can start putting together your own, but what's the best way to share these charts with others? Well you could export to an HTML file the same as in this article, and then even embed that into a web page. My [previous article](/blog/how-to-build-and-visualise-a-monte-carlo-simulation-with-python-and-plotly/) discussed this approach, here's the code snippet which uses [htmlpreview.github.io](https://htmlpreview.github.io).

```html
<iframe height="800" width="100%" loading="lazy" src="https://htmlpreview.github.io/?https://raw.githubusercontent.com/shedloadofcode/monte-carlo-simulation/main/outputs/mc-percentiles.html"></iframe>
```

## Final note

I usually post an article every month (at least) but I missed February and March as I was busy preparing to bring my new German Shepherd puppy home. His name is Kaiser and he's settled in to the home very well 😄 I've been doing lots of training with him, teaching commands like sit, stay, down, come, leave it, out and heel. Maybe I'll write a fun article soon on that since I guess it's related to coding and logic - 'Programming my German Shepherd' 😆

I should now be back on track with my (at least) monthly new article releases. Anyway, I hope you enjoyed the article and as always be sure to check out other articles on the site. You may be interested in:

* [Creating statistical neighbours comparator benchmarking models with Python](/blog/creating-statistical-neighbours-comparator-benchmarking-models-with-python/)
* [Six tips for producing and assuring high quality analytical code](/blog/six-tips-for-producing-and-assuring-high-quality-analytical-code/)
* [Preparing for a statistical data science interview](/blog/preparing-for-a-statistical-data-science-interview/)
]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[How to build and visualise a Monte Carlo simulation with Python and Plotly]]></title>
            <link>https://shedloadofcode.com/blog/how-to-build-and-visualise-a-monte-carlo-simulation-with-python-and-plotly/</link>
            <guid>https://shedloadofcode.com/blog/how-to-build-and-visualise-a-monte-carlo-simulation-with-python-and-plotly/</guid>
            <pubDate>Fri, 06 Jan 2023 18:30:00 GMT</pubDate>
            <description><![CDATA[Learn how to construct a Monte Carlo simulation with Python and plot the results using Plotly.]]></description>
            <content:encoded><![CDATA[
<affiliate-disclaimer></affiliate-disclaimer>

*This article does not constitute financial advice and is for educational purposes only.*

## What are Monte Carlo simulations?

[Monte Carlo simulations](https://en.wikipedia.org/wiki/Monte_Carlo_method) are used to model the probabilities of different outcomes where those outcomes are hard to predict due to random variables. The [Law of large numbers](https://en.wikipedia.org/wiki/Law_of_large_numbers) states that as a sample size grows, its mean gets closer to the average of the whole population. This is due to the sample being more representative of the population as the sample become larger. In other words, with a Monte Carlo simulation the goal is to simulate the collection of all or many possible paths (using random sampling) in order to find the possibilities and the most likely or theoretical solution. 

In summary:

* A Monte Carlo simulation is a model used to predict the probability of a variety of outcomes when the potential for   random variables is present.
* Monte Carlo simulations help to explain the impact of risk and uncertainty in prediction and forecasting models.
* A Monte Carlo simulation requires assigning multiple values to an uncertain variable to achieve multiple results and then averaging the results to obtain an estimate.
* A Monte Carlo model is a [stochastic model](https://www.investopedia.com/terms/s/stochastic-modeling.asp#:~:text=our%20editorial%20policies-,What%20Is%20Stochastic%20Modeling%3F,different%20conditions%2C%20using%20random%20variables.), meaning that due to randomness the results may differ each time, as opposed to a deterministic model where given the same inputs you'll get the same result every time.

## A quick example to illustrate

Monte Carlo simulations are named after the [Monte Carlo casino](https://en.wikipedia.org/wiki/Monte_Carlo_Casino) in Monaco, so let's ask a casino based question. 

"If we always pick red at roulette, how often would we win?"

The roulette wheel has 18 red slots, 18 black slots, and 1 green slot for a total of 37 slots. 

```python [roulette.py]
import random

def play_roulette():
    total_slots = 37
    red_probability   = (18 / total_slots) * 100
    black_probability = (18 / total_slots) * 100
    green_probability = (1 / total_slots) * 100
    
    possible_outcomes = ["red", "black", "green"]
    probabilities = [red_probability, black_probability, green_probability]
    
    outcome = random.choices(
        possible_outcomes,
        weights=probabilities,
        k=1
    )[0]
    
    return outcome
    
    
def perform_simulation(n_times=1000, choice="red"):   
    results = { "red": 0, "black": 0, "green": 0 }
    
    for i in range(n_times):
        outcome = play_roulette()
        results[outcome] += 1
           
    win_percentage = results[choice] / n_times
        
    return results, win_percentage

        
if __name__ == "__main__":
    results, win_percentage = perform_simulation(n_times=1000000, 
                                                 choice="red")
    
    print(results)
    print(win_percentage)
```

<code-runner :output="['{\'red\': 486196, \'black\': 486594, \'green\': 27210}', '0.487112']" 
  filename="roulette.py" 
  language="Python">
</code-runner>

So after 1 million simulations, we can say we win just less than half of the time with a 48.71% probability. We've proven that the extra green pocket gives the house an edge over the long run.

## Building the Monte Carlo model with Python

Now we have an idea of what a Monte Carlo simulation is and have seen a short example, we can build a more complex model. The challenge I have set here is to recreate an awesome [Monte Carlo retirement simulation](https://engaging-data.com/fire-calculator/?age=32&initsav=25000&spend=45000&initinc=60000&wr=4&ir=1&retspend=40000&stockpct=80&fixpct=18&cashpct=2&graph=mc&secgraph=0&stockrtn=8.1&bondrtn=2.4&MCstockrtn=8.1&MCbondrtn=2.4&tax=7&income=0&incstart=50&incend=70&expense=0&expstart=50&expend=70) from [engaging-data.com](https://engaging-data.com) using Python and Plotly. After playing around with this calculator I wondered how this could be re-created in Python with a few individual touches. I got quite close and there's lots to learn from the code.

The question this time is "If I invest a set amount for a number of years, how much might I have?"

All of the code for this model can be found on [GitHub](https://github.com/shedloadofcode/monte-carlo-simulation).

```python [model.py]
"""
Monte Carlo model to simulate the growth 
of an investment portfolio over time.
"""
import numpy as np
from helpers import (
    get_random_returns, 
    get_confidence_levels,
    get_yearly_percentiles)

from plots import (
    plot_histogram,
    plot_yearly_percentiles)


def perform_simulation(inputs: dict):
    """
    Performs a simulation to find out how much
    the pot is worth in £ after years of growth.
    
    Returns:
        pot (float)     - the final amount at the end 
        history (list)  - the yearly history of results 
                          [10000, 11000, 12000, ...]
        
    """
    years = inputs['end_age'] - inputs['start_age']
    pot = inputs['starting_pot']
    returns = get_random_returns(years=years) 
    mean_return = (np.mean(returns) - 1) * 100
    
    history = []
    
    for i in range(years):
        annual_return = returns[i]
        pot *= annual_return
        pot += inputs['annual_contributions']
        history.append(int(pot))
        
    return pot, history, mean_return
    
    
def perform_monte_carlo(inputs: dict, n: int = 1000):
    pot_sizes = []
    results = []
    mean_returns = []
    
    for i in range(n):
        final_amount, history, mean_return = perform_simulation(inputs)
        pot_sizes.append(final_amount)
        results.append(history)
        mean_returns.append(mean_return)
        
    lower_confidence, upper_confidence = get_confidence_levels(pot_sizes)
    
    print('Monte carlo model done :)', end='\n')
    print('Plots saved to /outputs folder')
    print('Mean return across all simulations: ', end='')
    print(f'{round(np.mean(mean_returns), 1)}%')
    
    return {
        'pot_sizes': pot_sizes,
        'results': results, 
        'yearly_percentiles': get_yearly_percentiles(results, inputs),
        'lower_confidence': lower_confidence,
        'upper_confidence': upper_confidence,
        'mean_returns': mean_returns
    }
    
    
if __name__ == "__main__":
    inputs = {
        'start_age': 20,
        'end_age': 65,
        'starting_pot': 5000,
        'annual_contributions': 500 * 12, 
        'target_amount': 300000,
        'n_simulations': 10000
    }
    
    mc = perform_monte_carlo(inputs, 
                             n=inputs['n_simulations'])
    
    plot_histogram(mc['pot_sizes'], 
                   mc['upper_confidence'], 
                   mc['lower_confidence'])
    
    plot_yearly_percentiles(inputs=inputs,
                            df=mc['yearly_percentiles'])
```

This model takes a dictionary 'inputs' which you can change to adapt the simulation. The 'perform_monte_carlo' function carries out a given number of simulations and returns the final 'pot_sizes' with other useful information like the history and mean returns of each simulation, the yearly percentiles, alongside upper and lower confidence intervals.

For this example our starting age is 20 and end age is 65. We start with £5,000 and our annual contributions are £6,000 (or £500 per month) and we're aiming for a £300,000 pot! We will run this simulation 10,000 times. You might be thinking, how do we simulate the randomness of what our returns might be each year? 

A quick Google search tells us that the historic [annual average return](https://www.google.com/search?q=s%26p+500+average+return) of the S&P500 is 10% per year. Sorry but I'm much more pessimistic and expect lower. I have modelled a range of returns and assigned them probability weights in the file helpers.py below. This means I've assumed low returns are more likely, but there is also a chance of higher returns, or negative returns. Nobody knows what the markets will do, and that's why randomness will help us with this uncertainty and view the outcomes of many simulations.

```python [helpers.py]
import random
import numpy as np
import pandas as pd


def get_random_returns(years: int):
    """
    Generates a list of random return percentages
    for the length of years required.
    """
    random_returns = []
    
    for i in range(years):
        high_negative_returns = (random.randint(-20, -8) / 1000) + 1
        low_negative_returns = (random.randint(-7, -1) / 1000) + 1
        low_returns = (random.randint(0, 4) / 100) + 1
        medium_returns = (random.randint(5, 9) / 100) + 1
        high_returns = (random.randint(10, 20) / 100) + 1
        
        possible_returns = [        # Weights
            high_negative_returns,  # 5  % chance
            low_negative_returns,   # 25 % chance
            low_returns,            # 40 % chance
            medium_returns,         # 25 % chance
            high_returns            # 5  % chance
        ]
        
        random_return = random.choices(
            possible_returns,
            weights=(5, 25, 40, 25, 5),
            k=1
        )[0]
        
        random_returns.append(
            random_return
        )
    
    return random_returns


def get_confidence_levels(pot_sizes):    
    upper_confidence = round(np.quantile(pot_sizes, 0.975), 2)
    lower_confidence = round(np.quantile(pot_sizes, 0.025), 2)
    
    return lower_confidence, upper_confidence


def get_yearly_percentiles(results, inputs) -> pd.DataFrame:
    """
    Finds the percentiles for each year.
    """
    results_rotated = list(zip(*results[::-1]))

    year = []
    age = []
    ninetieth_percentile = []
    seventy_fifth_percentile = []
    median = []
    twenty_fifth_percentile = []
    tenth_percentile = []
    
    for i, year_results in enumerate(results_rotated):
        new_age = (inputs['start_age'] + 1) + i
        ninetieth_percentile_value = np.percentile(year_results, 90)
        seventy_fifth_percentile_value = np.percentile(year_results, 75)
        median_value = np.median(year_results)
        twenty_fifth_percentile_value = np.percentile(year_results, 25)
        tenth_percentile_value = np.percentile(year_results, 10)
        
        year.append(i + 1)
        age.append(new_age)
        ninetieth_percentile.append(ninetieth_percentile_value)
        seventy_fifth_percentile.append(seventy_fifth_percentile_value)
        median.append(median_value)
        twenty_fifth_percentile.append(twenty_fifth_percentile_value)
        tenth_percentile.append(tenth_percentile_value)
        

    return pd.DataFrame(
        list(
            zip(year,
                age,
                ninetieth_percentile, 
                seventy_fifth_percentile,
                median, 
                twenty_fifth_percentile,
                tenth_percentile)
        ),
        columns=[
            'year',
            'age',
            '90th_percentile',
            '75th_percentile',
            'median', 
            '25th_percentile',
            '10th_percentile']
    )
    
```

The randomness we've introduced here is for every year in each of the 10,000 or more simulations a:

* 5%  chance of negative returns between -20% and -8%
* 25% chance of negative returns between -7%  and -1%
* 40% chance of low returns between       0%  and  4%
* 25% chance of medium returns between    5%  and  9%
* 5%  chance of high returns between      10% and 20%

If you think these are too pessimistic or optimistic please go ahead change the values or weights 👍

The 'get_yearly_percentiles' function takes the 2D list 'results' (all of the histories for all simulations year by year), [rotates it](https://stackoverflow.com/questions/8421337/rotating-a-two-dimensional-array-in-python) to line up year 1, year 2, year 3 and so on, and then finds the percentiles (10th, 25th, median, 75th, 90th) for each year. This effectively shows the range of results from all simulations for each year in a DataFrame:

| year | age | 90th_percentile | 75th_percentile | median   | 25th_percentile | 10th_percentile |
| ---- | --- | --------------- | --------------- | -------- | --------------- | --------------- |
| 1    | 21  | 11450           | 11300           | 11100    | 10990           | 10970           |
| 2    | 22  | 18153           | 17804           | 17399    | 17110           | 16957           |
| 3    | 23  | 25296.1         | 24570.25        | 23919    | 23395.75        | 23051           |
| 4    | 24  | 32631           | 31605           | 30632    | 29832           | 29279.8         |
| 5    | 25  | 40342.1         | 38841.25        | 37513.5  | 36407.75        | 35624.9         |
| ...  | ... | ...             | ...             | ...      | ...             | ...             |
| 41   | 61  | 618714.6        | 553043.8        | 492832   | 442674.5        | 403963.4        |
| 42   | 62  | 645462.7        | 578133          | 514355   | 461004.3        | 420718          |
| 43   | 63  | 673703          | 602295          | 535547   | 478970          | 437741.3        |
| 44   | 64  | 703538.1        | 629788.8        | 557324.5 | 498292.3        | 453481.4        |
| 45   | 65  | 739303.7        | 656680.5        | 580414.5 | 517842          | 470615.8        |

This can then be plotted using Plotly along with the final pot sizes.

## Plotting the Monte Carlo results with Plotly

I was using [Spyder](https://www.spyder-ide.org/) to carry out this analysis, and saved the plots as html files in the /output directory. You'll need to install Plotly with `python -m pip install plotly`

```python [plots.py]
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
import plotly.io as pio

pio.renderers.default='svg'


def plot_histogram(pot_sizes: list, 
                   upper_confidence:float, 
                   lower_confidence: float):
    """
    Plots the frequencies of the final pot sizes.
    """
    fig = px.histogram(pot_sizes, 
                       title=f"The final pot size after {len(pot_sizes)} simulations.")
    
    fig.add_vline(x=lower_confidence, 
                  line_width=3, 
                  line_dash="dash", 
                  line_color="green")
    
    fig.add_vline(x=upper_confidence, 
                  line_width=3, 
                  line_dash="dash", 
                  line_color="green")
    
    fig.add_vline(x=np.median(pot_sizes), 
                  line_width=3, 
                  line_dash="dash", 
                  line_color="black",
                  annotation_text="median",
                  annotation_font_size=15)
    
    fig.add_vrect(x0=lower_confidence, 
                  x1=upper_confidence, 
                  line_width=0, 
                  fillcolor="green",
                  opacity=0.2,
                  annotation_text="95% confidence interval",
                  annotation_font_size=15)
    
    fig.update_layout(
        xaxis_title="Amount (£)",
        yaxis_title="Count",
        showlegend=False,
        font=dict(
            family="Arial",
            size=14
        ),
        paper_bgcolor='rgba(0,0,0,0)',
        plot_bgcolor='rgba(0,0,0,0)',
    )
    
    fig.write_html('outputs/mc-histogram.html', auto_open=False)


def plot_yearly_percentiles(inputs, df):
    """
    Plots the year by year percentile graph.
    """
    exact_np = df[df['90th_percentile'] > inputs['target_amount']].iloc[0]
    exact_sfp = df[df['75th_percentile'] > inputs['target_amount']].iloc[0]
    exact_median = df[df['median'] > inputs['target_amount']].iloc[0]
    exact_tfp = df[df['25th_percentile'] > inputs['target_amount']].iloc[0]
    exact_tp = df[df['10th_percentile'] > inputs['target_amount']].iloc[0]
    
    fig = go.Figure()
    
    fig.add_traces(go.Scatter(x=df['age'], 
                              y=df['10th_percentile'],
                              line = dict(color='#FFA502'),
                              mode='lines',
                              name='10th %tile',
                              fill='none', 
                              fillcolor = '#F7CA77'))
    
    fig.add_traces(go.Scatter(x=df['age'], 
                              y=df['25th_percentile'],
                              line = dict(color='#7BE56E'),
                              mode='lines',
                              name='25th %tile',
                              fill='tonexty', 
                              fillcolor = '#F7CA77'))

    
    fig.add_traces(go.Scatter(x=df['age'], 
                              y=df['median'],
                              line=dict(color='black'),
                              line_width=3,
                              mode='lines',
                              name="median", 
                              fill='tonexty',
                              fillcolor='#00FF66'))
    
    fig.add_traces(go.Scatter(x=df['age'], 
                              y=df['75th_percentile'],
                              line = dict(color='#7BE56E'),
                              mode='lines',
                              name="75th %tile",
                              fill='tonexty', 
                              fillcolor = '#00FF66'))
    
    fig.add_traces(go.Scatter(x=df['age'], 
                              y=df['90th_percentile'],
                              line = dict(color='#FFA502'),
                              mode='lines',
                              name="90th %tile",
                              fill='tonexty', 
                              fillcolor = '#F7CA77'))
    
    fig.update_layout(hovermode="x")
    
    fig.update_xaxes(tickangle=0, 
                     dtick=1,
                     showticklabels=True, 
                     gridcolor='lightgray',
                     type='category')
    
    fig.update_yaxes(gridcolor='lightgray',
                     rangemode="tozero")
    
    fig.add_hline(y=inputs['target_amount'], 
                  line_width=2, 
                  line_dash='dash', 
                  line_color='red',
                  annotation_text='Target amount',
                  annotation_font=dict(
                    family="Arial",
                    size=15,
                    color="red"
                  ),
                  annotation_font_size=15,
                  annotation_position='bottom left',
                  fillcolor='red')
    
    fig.add_shape(type="line",
                  x0=int(exact_median['year'] - 1), 
                  y0=0, 
                  x1=int(exact_median['year'] - 1), 
                  y1=inputs['target_amount'],
                  line_width=2,
                  line_color='gray',
                  line_dash='dash')
    
    fig.add_shape(type="line",
                  x0=int(exact_tp['year'] - 1), 
                  y0=0, 
                  x1=int(exact_tp['year'] - 1), 
                  y1=inputs['target_amount'],
                  line_width=2,
                  line_color='orange',
                  line_dash='dash')
    
    fig.add_shape(type="line",
                  x0=int(exact_np['year'] - 1), 
                  y0=0, 
                  x1=int(exact_np['year'] - 1), 
                  y1=inputs['target_amount'],
                  line_width=2,
                  line_color='orange',
                  line_dash='dash')
    
    fig.add_shape(type="line",
                  x0=int(exact_tfp['year'] - 1), 
                  y0=0, 
                  x1=int(exact_tfp['year'] - 1), 
                  y1=inputs['target_amount'],
                  line_width=2,
                  line_color='green',
                  line_dash='dash')
    
    fig.add_shape(type="line",
                  x0=int(exact_sfp['year'] - 1), 
                  y0=0, 
                  x1=int(exact_sfp['year'] - 1), 
                  y1=inputs['target_amount'],
                  line_width=2,
                  line_color='green',
                  line_dash='dash')
      
    fig.add_annotation(x=int(exact_median['year'] - 1), 
                       y=inputs['target_amount'] * 1.45,
                       text=f"<b>{int(exact_median['year'])} years</b>",
                       font=dict(
                            color="black",
                            size=21
                       ),
                       showarrow=False,
                       yshift=10)
    
    fig.add_annotation(x=int(exact_median['year'] - 1), 
                       y=inputs['target_amount'] * 1.3,
                       text=f"<b>(Age {int(exact_median['age'])})</b>",
                       font=dict(
                            color="black",
                            size=21
                       ),
                       showarrow=False,
                       yshift=10)
    
    fig.add_annotation(x=inputs['end_age'] - inputs['start_age'] - 1.2, 
                       y=df['10th_percentile'].max() - 8000,
                       text="<b>10%</b>",
                       font=dict(
                            color="black",
                            size=12
                       ),
                       showarrow=False,
                       yshift=10)
    
    fig.add_annotation(x=inputs['end_age'] - inputs['start_age'] - 1.2, 
                       y=df['25th_percentile'].max() - 8000,
                       text="<b>25%</b>",
                       font=dict(
                            color="black",
                            size=12
                       ),
                       showarrow=False,
                       yshift=10)
        
    fig.add_annotation(x=inputs['end_age'] - inputs['start_age'] - 1.25, 
                       y=df['median'].max() - 5000,
                       text="<b>median</b>",
                       font=dict(
                            color="black",
                            size=12
                       ),
                       showarrow=False,
                       yshift=10)
    
    fig.add_annotation(x=inputs['end_age'] - inputs['start_age'] - 1.2, 
                       y=df['75th_percentile'].max() - 4000,
                       text="<b>75%</b>",
                       font=dict(
                            color="black",
                            size=12
                       ),
                       showarrow=False,
                       yshift=10)
    
    fig.add_annotation(x=inputs['end_age'] - inputs['start_age'] - 1.2, 
                       y=df['90th_percentile'].max() - 5000,
                       text="<b>90%</b>",
                       font=dict(
                            color="black",
                            size=12
                       ),
                       showarrow=False,
                       yshift=10)
    
    fig.add_annotation(x=.99,
                       xref='paper',
                       xanchor='right',
                       y=0,
                       yanchor='bottom',
                       text="<b>shedloadofcode.com</b>",
                       font=dict(
                            color="gray",
                            size=14
                       ),
                       showarrow=False)
    
    fig.add_annotation(x=0.01,
                       xref='paper',
                       yref='paper',
                       xanchor='left',
                       y=0.99,
                       yanchor='top',
                       text=f"In <b>{inputs['n_simulations']}</b> simulations " +
                            f"<b>{int(exact_median['age'])}</b> " +
                            f"is the median age ({int(exact_median['year'])} years)<br>",
                       font=dict(
                            color="black",
                            size=15
                       ),
                       showarrow=False)
    
    fig.add_annotation(x=0.01,
                       xref='paper',
                       yref='paper',
                       xanchor='left',
                       y=0.96,
                       yanchor='top',
                       text="<span style=\"color:orange\">10th to 90th %ile: " +
                            f"<b>{int(exact_np['year'])} to {int(exact_tp['year'])} " + 
                             "years</b> to target</span>",  
                       font=dict(
                            color="black",
                            size=15
                       ),
                       showarrow=False)
    
    fig.add_annotation(x=0.01,
                       xref='paper',
                       yref='paper',
                       xanchor='left',
                       y=0.93,
                       yanchor='top',
                       text="<span style=\"color:green\">25th to 75th %ile: " +
                            f"<b>{int(exact_sfp['year'])} to {int(exact_tfp['year'])} " + 
                             "years</b> to target</span>",
                       font=dict(
                            color="black",
                            size=15
                       ),
                       showarrow=False)
     
    
    fig.update_layout(
        title=f"Percentiles by year after {inputs['n_simulations']} simulations.",
        xaxis_title="Age",
        yaxis_title="Amount (£)",
        legend_title="Legend Title",
        showlegend=False,
        font=dict(
            family="Arial",
            size=14
        ),
        paper_bgcolor='rgba(0,0,0,0)',
        plot_bgcolor='rgba(0,0,0,0)',
    )
    
    fig.write_html('outputs/mc-percentiles.html', auto_open=False)
```

This outputs the year by year percentiles to 'outputs/mc-percentiles.html'. The good part about the Plotly HTML output is that after [uploading to GitHub](https://raw.githubusercontent.com/shedloadofcode/monte-carlo-simulation/main/outputs/mc-percentiles.html) it can be viewed via [htmlpreview.github.io](https://htmlpreview.github.io)

Go ahead and take a look at the [Monte Carlo percentile graph](https://htmlpreview.github.io/?https://raw.githubusercontent.com/shedloadofcode/monte-carlo-simulation/main/outputs/mc-percentiles.html).

You can also use this to embed the interactive plot in a web page using an [iframe](https://developer.mozilla.org/en-US/docs/Web/HTML/Element/iframe) like the one below!

<iframe height="800" width="100%" loading="lazy" src="https://htmlpreview.github.io/?https://raw.githubusercontent.com/shedloadofcode/monte-carlo-simulation/main/outputs/mc-percentiles.html"
loading="lazy"></iframe>

```html
<iframe height="800" width="100%" loading="lazy" src="https://htmlpreview.github.io/?https://raw.githubusercontent.com/shedloadofcode/monte-carlo-simulation/main/outputs/mc-percentiles.html"></iframe>
```

We can see that the median amount crosses the target at age 51 given our inputs. However, given a better or worse outcome it could cross the target amount between ages 47 and 54. 

Monte Carlo simulations are a great way to deal with uncertainty when we simply don't know what the expected values (in this case investment returns) will be.

We can also take a look at the frequencies of the final pot sizes at the end age 65 in a histogram.

<iframe height="800" width="100%" loading="lazy" src="https://htmlpreview.github.io/?https://raw.githubusercontent.com/shedloadofcode/monte-carlo-simulation/main/outputs/mc-histogram.html"></iframe>

We can see the 95% confidence interval is between £420k - £850k with the median pot size being £581k. This demonstrates the power of compounding and starting investing from an early age.

<subscribe-form></subscribe-form>

## Comparing the results

I have used numerous scenarios as inputs to test this model against the calculator from [engaging-data.com](https://engaging-data.com) (ED) to see how the results align, which has been pretty fun. I set the average return on the ED calculator to **4%** as my model is a bit more pessimistic. As mentioned earlier, you can change the probability weights for a given set of returns in the 'get_random_returns' function if you feel more optimistic.

Here are my findings in three scenarios:

---

**Scenario 1 inputs**

| Input                | Value   | 
| -------------------  | -----   | 
| Start age            |   20    |       
| End age              |   65    |      
| Starting pot         |  5,000  |   
| Annual contributions |  6,000  |
| Target amount        | 300,000 | 
| Simulations          | 10,000  |       

**Scenario 1 results** [View calculator results](https://engaging-data.com/fire-calculator/?age=20&initsav=5000&spend=6000&initinc=12000&wr=4&ir=0&retspend=12000&stockpct=80&fixpct=18&cashpct=2&graph=mc&secgraph=0&stockrtn=8.1&bondrtn=2.4&MCstockrtn=4&MCbondrtn=2&tax=0&income=0&incstart=50&incend=70&expense=0&expstart=50&expend=70)

| Model         | Years | Age    | 
| ----------    | ----- | ------ |  
| This model    |  30   |  50    |   
| ED calculator |  30.3 |  50    | 

---

**Scenario 2 inputs**

| Input                | Value   | 
| -------------------  | -----   | 
| Start age            |   30    |       
| End age              |   65    |      
| Starting pot         |  10,000 |   
| Annual contributions |  10,000 |
| Target amount        | 400,000 | 
| Simulations          | 10,000  |       

**Scenario 2 results** [View calculator results](https://engaging-data.com/fire-calculator/?age=30&initsav=10000&spend=10000&initinc=20000&wr=4&ir=0&retspend=16000&stockpct=80&fixpct=18&cashpct=2&graph=mc&secgraph=0&stockrtn=8.1&bondrtn=2.4&MCstockrtn=4&MCbondrtn=2&tax=0&income=0&incstart=50&incend=70&expense=0&expstart=50&expend=70)

| Model         | Years | Age    | 
| ----------    | ----- | ------ |  
| This model    |  26   |  56    |   
| ED calculator |  25.2 |  55    | 

---

**Scenario 3 inputs**

| Input                | Value     | 
| -------------------  | -----     | 
| Start age            |   25      |       
| End age              |   65      |      
| Starting pot         |  20,000   |   
| Annual contributions |  20,000   |
| Target amount        | 1,000,000 | 
| Simulations          | 10,000    |       

**Scenario 3 results** [View calculator results](https://engaging-data.com/fire-calculator/?age=25&initsav=20000&spend=20000&initinc=40000&wr=4&ir=0&retspend=40000&stockpct=80&fixpct=18&cashpct=2&graph=mc&secgraph=0&stockrtn=8.1&bondrtn=2.4&MCstockrtn=4&MCbondrtn=2&tax=0&income=0&incstart=50&incend=70&expense=0&expstart=50&expend=70)

| Model         | Years | Age    | 
| ----------    | ----- | ------ |  
| This model    |   30  |  55    |   
| ED calculator |  30.2 |  55    |

---

As you can see the results are very closely aligned, so I'm very pleased with how well this model is performing. Of course, as [George Box said](/blog/programming-quotes-that-offer-wisdom-and-motivation/#using-statistics) *All models are wrong, but some are useful*. We should not forget that models and simulations can only give us an indication of possible outcomes, we should never blindly trust them but use them as tools. I think it's also important to keep them realistic and not introduce too much bias or ego into our assumptions. It would be great to get 10% returns every year, but is that realistically going to happen? 

Lowering our model's assumptions ensures we are closer to a 'worst case scenario' and any over-performance is a bonus!

## Conclusion

We've learned lots on both Monte Carlo methods and creating / embedding Plotly visualisations with Python.

Some of the techniques used in this article with Plotly can also be used for variations of [fan charts](https://analystsuncertaintytoolkit.github.io/UncertaintyWeb/chapter_6.html#fan-charts) typically used for forecasting and acknowledging uncertainty.

I didn't quite get around to incrementing the results by 0.1 and plotting the circles which you can [achieve with Plotly shapes](https://plotly.com/python/shapes/#circles-positioned-relative-to-the-axis-data). Maybe this is something you can try to replicate if you want to. I actually preferred seeing the vertical lines show which age the amount goes above the target rather than the exact intersection - this also is the foundation of statistical process control charts to make variation in the results explicit.

Thanks to [engaging-data.com](https://engaging-data.com) for giving me the inspiration to try and re-create this awesome model and visualisation with Python and Plotly.

Finally, it's worth mentioning that [DataCamp](https://datacamp.pxf.io/EKAK42) has an interactive course [Monte Carlo Simulations in Python](https://datacamp.pxf.io/rQWmd5) and many other great courses on data science and machine learning. Read the full review [Developing your data science and analytical coding skills - a review of DataCamp](/blog/developing-your-data-science-and-analytical-coding-skills-a-review-of-datacamp).

If you enjoyed this article you may also be interested in:

* [Creating statistical neighbours comparator benchmarking models with Python](/blog/creating-statistical-neighbours-comparator-benchmarking-models-with-python/)
* [Preparing for a statistical data science interview](/blog/preparing-for-a-statistical-data-science-interview)
* [Six tips for producing and assuring high quality analytical code](/blog/six-tips-for-producing-and-assuring-high-quality-analytical-code)
* [MIT OpenCourseWare Monte Carlo Simulation](https://www.youtube.com/watch?v=OgO1gpXSUzU)
* [Uncertainty Toolkit for Analysts](https://analystsuncertaintytoolkit.github.io/UncertaintyWeb/introduction.html)
]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Six tips for producing and assuring high quality analytical code]]></title>
            <link>https://shedloadofcode.com/blog/six-tips-for-producing-and-assuring-high-quality-analytical-code/</link>
            <guid>https://shedloadofcode.com/blog/six-tips-for-producing-and-assuring-high-quality-analytical-code/</guid>
            <pubDate>Thu, 15 Dec 2022 10:40:00 GMT</pubDate>
            <description><![CDATA[Use these best practices when building analytical models and assuring code quality.]]></description>
            <content:encoded><![CDATA[
In this article we'll look at six tips on producing solid analytical code and ensuring it is of high quality. As with all software engineering the goal is to solve the problem alongside reducing complexity, creating useful abstractions, and keeping it simple!

These tips are inspired by two excellent resources [Quality assurance of code for analysis and research](https://best-practice-and-impact.github.io/qa-of-code-guidance/intro.html) and [The Turing Way](https://the-turing-way.netlify.app/welcome).

## Begin with the end in mind

Analysis can get complicated without a good roadmap of where you want to get to. What is the purpose of the analysis? What does the end result look like? It's worth asking questions like this first. You want to be able to describe it to someone who's never heard of your project in one sentence.

* A model to identify our most valuable customers.
* A model to allocate the correct amount of stock to each store. 
* A model to forecast product sales.

This helps people understand 'what it does'. To explain to those more curious 'how it does it' we might require a simple and clear solution diagram. It is the A to B summary - I find this helps newcomers understand the technical big picture. It doesn't even have to be a diagram it can be as simple something like this in the README file:

```
Read sales data 
|
---> Apply forecasting model 
|
------> Output daily predicted sales for each product 
|
---------> Email output to store manager
```

Without looking at any code I know what this model should do. By writing this before writing the code it allows you plan at a high level what the solution should actually do and avoids coding parts that aren't actually needed. If you want to improve your system design skills more generally, check out the article [Five ways to improve your system design and software architecture skills](/blog/five-ways-to-improve-your-system-design-and-software-architecture-skills/).

## Structure your project neatly

This enables you and others to find the files they need quickly, and to make sense of the overall solution. [cookiecutter](https://drivendata.github.io/cookiecutter-data-science/) and [govcookiecutter](https://github.com/best-practice-and-impact/govcookiecutter) provide useful Data Science project structures.

```
├── LICENSE
├── Makefile           <- Makefile with commands like `make data` or `make train`
├── README.md          <- The top-level README for developers using this project.
├── data
│   ├── external       <- Data from third party sources.
│   ├── interim        <- Intermediate data that has been transformed.
│   ├── processed      <- The final, canonical data sets for modeling.
│   └── raw            <- The original, immutable data dump.
│
├── docs               <- A default Sphinx project; see sphinx-doc.org for details
│
├── models             <- Trained and serialized models, model predictions, or model summaries
│
├── notebooks          <- Jupyter notebooks. Naming convention is a number (for ordering),
│                         the creator's initials, and a short `-` delimited description, e.g.
│                         `1.0-jqp-initial-data-exploration`.
│
├── references         <- Data dictionaries, manuals, and all other explanatory materials.
│
├── reports            <- Generated analysis as HTML, PDF, LaTeX, etc.
│   └── figures        <- Generated graphics and figures to be used in reporting
│
├── requirements.txt   <- The requirements file for reproducing the analysis environment, e.g.
│                         generated with `pip freeze > requirements.txt`
│
├── setup.py           <- Make this project pip installable with `pip install -e`
├── src                <- Source code for use in this project.
│   ├── __init__.py    <- Makes src a Python module
│   │
│   ├── data           <- Scripts to download or generate data
│   │   └── make_dataset.py
│   │
│   ├── features       <- Scripts to turn raw data into features for modeling
│   │   └── build_features.py
│   │
│   ├── models         <- Scripts to train models and then use trained models to make
│   │   │                 predictions
│   │   ├── predict_model.py
│   │   └── train_model.py
│   │
│   └── visualization  <- Scripts to create exploratory and results oriented visualizations
│       └── visualize.py
│
└── tox.ini            <- tox file with settings for running tox; see tox.readthedocs.io
```

This project structure might be too complex for simpler projects, but it gives you a start and you can reduce or repurpose from there. Just a 'data', 'models', 'notebooks', 'output' and 'tests' folder might be enough with a 'src' directory for helper modules/functions and a good README. This structure can also be replicated for projects [where R is used](https://www.r-bloggers.com/2018/08/structuring-r-projects/) instead of Python.

## Use version control always

You may think it's only a small project and that using version control is too complex for it. Always use version control! Your future self will thank you 😄 It enables the ability to back up your work, collaborate with others using branches, revert to previous versions, and more. Plus, there's usually no good reason not to use it!

First create a repository with a repository hosting provider such as [GitHub](https://www.github.com).

Then in your working directory initialise the directory as a repo and push your initial commit.

```
git init
git commit -m "Initial commit"
git branch -M main
git remote add origin https://github.com/your-username/repo-name.git
git push -u origin main
```

Then every time you make a change, commit again. Keep commits short and often, rather than committing lots of changes all in one go. Then push to the remote repository every once in a while so your changes are backed up.

```
git add .
git commit -m "Add new percentage calculations to model"
git push
```

There are [many commands with Git](https://git-scm.com/docs) you should explore, the most useful are to `revert` to a previous commit, and to create a new `branch` to work on something separately before you `merge` it back to the main branch. You can see the whole history of the project with every commit using `git log --graph`.

## Keep it reproducible with a virtual environment and README

A virtual environment is a collection of packages / dependencies that gives you everything you need to run a project. It solves 'but it works on my machine' problems. You want your analysis to be reproducible, which means someone should be able to clone your repo, install the package dependencies and run your code successfully. For Python there is the [venv](https://docs.python.org/3/library/venv.html) and pipenv packages and for R there is the [renv](https://rstudio.github.io/renv/articles/renv.html) and [packrat](https://rstudio.github.io/packrat/) packages. I prefer [venv](https://docs.python.org/3/library/venv.html) and [renv](https://rstudio.github.io/renv/articles/renv.html). 

When someone first clones your repo, there may be other steps they have to go through to run your code too. There may be environment variables that need adding to a `.env` file or sensitive data files adding that could not be stored in version control. A [good README.md file](https://www.freecodecamp.org/news/how-to-write-a-good-readme-file/) helps with the setup steps. Here I have used some setup steps from an analytical web app I worked on recently which used the [Django](https://www.djangoproject.com/) Python web framework.

```text [README.md]
# My data visualisation app

    This app presents data visualisation in a web interface.

## Features

    * Security and user login
    * HTTPS Let's Encrypt
    * Object-relational mapping
    * Integration to Google Sheets API

## Running locally

    * Create and activate a virtual environment 

        python -m venv venv
        .\venv\Scripts\activate
        python -m pip install <package-name>
        python -m pip install -r requirements.txt

    * To deactivate use:

        deactivate

    * To install new packages use:

        python -m pip install <package-name>

    * To register newly installed packages use:

        python -m pip freeze > requirements.txt

    * Create the database 'db.sqlite3' and migrate the latest schema using:

        python manage.py migrate

    * Create a superuser account to login using:

        python manage.py createsuperuser
        Username: admin
        Email address: <your-email-address>
        Password: admin
        Bypass password validation and create user anyway? [y/N]: y

    * Pre-populate the database with some testing data (optional):

        python manage.py loaddata responses.json

    * Add environment variable file '.env' in /home directory with:

        ENVIRONMENT='Development'
        SECRET_KEY=''
        EMAIL_HOST=''
        EMAIL_HOST_USER=''
        EMAIL_HOST_PASSWORD=''
        DEFAULT_FROM_EMAIL=''

    * Run the application using:

        python manage.py runserver
```

## Keep code modular, adaptable, documented and simple

Some problems do sometimes call for quite complex solutions, but by abstracting away some of that complexity into easy to understand classes, methods, functions and variables we can make it simpler. The main characteristics of high quality code are:

* Clean and consistent style
* Functional 
* Easy to understand for others
* Efficient
* Testable
* Easy to maintain
* Easy to change and adapt 
* Well documented

We can achieve most of these things by creating well defined classes, methods and functions that do what they say they will, are well documented and are testable. We can also refactor early and often to ensure the code is the most readable it can be - we write code for humans more so than computers! Following a style guide such as the [Google Python style guide](https://google.github.io/styleguide/pyguide.html) or the [Tidyverse R style guide](https://style.tidyverse.org/index.html) can also keep the code standardised.

Files should start with a docstring describing the contents and usage of the module:

```python 
"""A one line summary of the module or program, terminated by a period.

Leave one blank line.  The rest of this docstring should contain an
overall description of the module or program.  Optionally, it may also
contain a brief description of exported classes and functions and/or usage
examples.

Typical usage example:

  foo = ClassFoo()
  bar = foo.FunctionBar()
"""
```

R function docstring:

```r
#' Short title for function
#'
#' @description
#' Longer description of the function
#'
#' @param first An object of class "?". Description of parameter
#' @param second An object of class "?". Description of parameter
#' @return Returns an object of class "?". Description of what the function returns
#' @examples
#' # Add some code illustrating how to use the function
my_new_function <- function(first, second) {
	return("hello world")
}
```

JavaScript function docstring:

```js 
/**
 * Summary. (use period)
 *
 * Description. (use period)
 *
 * @see  Function/class relied on
 * @link URL
 *
 * @param {type}   var           Description.
 * @param {type}   [var]         Description of optional variable.
 * @param {type}   [var=default] Description of optional variable with default variable.
 * @param {Object} objectVar     Description.
 * @param {type}   objectVar.key Description of a key in the objectVar parameter.
 *
 * @yield {type} Yielded value description.
 *
 * @return {type} Return value description.
 */
function myNewFunction () {
  return "hello world";
}
```

Python function docstring:

```python
def my_new_function(first: str, second: int) -> str:
    """Short title for function.

    Longer description of the function.

    Args:
        first (str): A description of the first argument.
        second (int): A description of the second argument.

    Returns:
        result (str): A description of the return value.

    Raises:
        IOError: A description of the error raised.
    """
    result = first + str(second)

    return result
```

Not only do docstrings make your code easier for yourself and others to understand, the best part is that you can auto-generate documentation using [Sphinx for Python](https://www.sphinx-doc.org/en/master/) and using [Roxygen for R](https://cran.r-project.org/web/packages/roxygen2/vignettes/roxygen2.html)! These require another article to go through but are really useful for keeping documentation up to date.

We can also make any code more adaptable by not hardcoding configuration values and instead putting them in a YAML or JSON config file. This makes input parameters easier to quickly change and see the result of that change on the outputs.

```yaml [config.yaml]
input_path: "C:/a/very/specific/path/to/input_data.csv"
output_path: "outputs/predictions.csv"

test_split_proportion: 0.3
random_seed: 42

prediction_parameters:
    constant_a: 7
    max_v: 1000
```

```python [model.py]
import yaml

with open("./config.yaml") as file:
    config = yaml.load(file)

data = read_csv(config["input_path"])
...
```

```r [model.R]
config <- yaml::yaml.load_file("config.yaml")

data <- read.csv(config$input_path)
...
```

## Use automated unit tests and peer review

Using a unit testing framework like [pytest](https://docs.pytest.org/en/7.2.x/), [unittest](https://docs.python.org/3/library/unittest.html), [testthat](https://testthat.r-lib.org/) or [Runit](https://www.rdocumentation.org/packages/RUnit/versions/0.4.32) will help you to check whether those nicely documented functions you wrote actually do what they say they should. Test driven development to me, simply means you are the first user of your own code. If all your functions, classes and methods do what they are expected to do, we can be very sure the overall program will behave as expected. 

These same frameworks can be used to write higher level acceptance tests too like 'does the whole program produce somewhat expected results?'. This tests the overall behaviour of the code as opposed to the implementation. Don't aim for 100% test coverage, I think testing the critical functions and most realistic use cases of your code are the most important. Create your first tests and build your library of tests from there. A unit test should be small, it should run fast and it should test one unit of code.

Below is an example of a unit test with pytest. This one fails as the function does not return the number multiplied by 3 but by 2! All test files must begin 'test_' before running the `pytest` command in the same directory. It also helps readability to use the [arrange, act, assert pattern](https://automationpanda.com/2020/07/07/arrange-act-assert-a-pattern-for-writing-good-tests/).

```python [test_calculations.py]
def times_number_by_three(number: float):
    return number * 2

def test_times_number_by_three():
    # Arrange
    value = 3
    
    # Act
    result = times_number_by_three(value)

    # Assert
    expected = 9
    assert result == expected
```

<code-runner :output="['test_calculations.py:5: AssertionError',
  '',
  'test_calculations.py ⨯         ',
  '=== short test summary info ===',
  'FAILED test_calculations.py::test_times_number_by_three - assert 6 == 9',
  'Results (48.92s):',
  '1 failed',
  '         - test_calculations.py:5 test_times_number_by_three']" 
  filename="pytest" 
  language="Python">
</code-runner>


Next is the same example but using R and [testthat](https://testthat.r-lib.org/). RStudio will automatically recognise the `test_that` function and give a 'Run Tests' option in the top right. Alternatively you can use the command `testthat::test_file("test_calculations.R")` to test a single file.

```r [test_calculations.R]
library(testthat)

time_number_by_three <- function(number) {
  return(number * 2)
}

test_that("number_is_multiplied_by_three", {
    # Arrange 
    value <- 3

    # Act 
    result <- time_number_by_three(value)

    # Assert
    expected <- 9
    expect_equal(result, expected)
})
```

<code-runner :output="['[ FAIL 1 | WARN 0 | SKIP 0 | PASS 0 ]',
  '',
  '--- Failure (test_calculations.R:16): number_is_multiplied_by_three ---',
  'result not equal to expected',
  '1/1 mismatches',
  '[1] 6 - 9 == -3',
  ]" 
  filename='testthat::test_file("test_calculations.R")' 
  language="R">
</code-runner>

Other things to be aware of when testing are:

* The function you want to test doesn't have to be in the test file like in these examples, you can import it from elsewhere in your project making testing super simple. 
* You can also split your tests up into separate files to keep the project structure clean. 
* You can create tests to validate any outputs and check the behaviour of the code as QA and acceptance tests.
* You can run all test files in a directory with both [pytest](https://docs.pytest.org/en/7.1.x/getting-started.html#run-multiple-tests) and [testthat](https://testthat.r-lib.org/reference/test_dir.html) fully automating your test suite.

Finally, although automation is great and having a suite of tests you can run every time you introduce a new change gives you confidence, having peer review is equally important. This is where someone else reviews your code and checks  that it is readable, understandable and actually works. When reviewing code you should ask yourself these questions:

* Can I easily understand what the code does?
  * Is the code sufficiently documented for me to understand it?
    Is there duplication in the code that could be simplified by refactoring into functions and classes?
    Are functions and class methods simple, using few parameters?
* Does the code fulfil its requirements?
* Is the required functionality tested sufficiently?
* How easy will it be to alter this code when requirements change? They always do.
  * Are high level parameters kept in dedicated configuration files? Or would somebody need to work their way through the code with lots of manual edits to reconfigure for a new run?
* Can I generate the same outputs that the analysis claims to produce?
  * Have dependencies been sufficiently documented?
  * Is the code version, input data version and configuration recorded?

In the useful site I shared at the beginning of this article, you can find [code quality assurance checklists](https://best-practice-and-impact.github.io/qa-of-code-guidance/checklists.html) for analytical projects which seem a really good starting point too.

## Conclusion

These six tips should make any analytical project you start a pleasure to work on. Spending the time to really think about the end goal, keep things simple and get your project structure set is worth it. I think it was Abraham Lincoln who said "give me six hours to chop down a tree and I will spend the first four sharpening the axe". Solid advice we should all take.

Thanks for reading 👍 If you enjoyed this article you might also like the article [Preparing for a statistical data science interview](/blog/preparing-for-a-statistical-data-science-interview/).

Here are some recommended resources for further learning:

* [The Pragmatic Programmer, The: Your journey to mastery, 20th Anniversary Edition](https://www.amazon.co.uk/Pragmatic-Programmer-journey-mastery-Anniversary/dp/0135957052/)
* [The Effective Engineer: How to Leverage Your Efforts In Software Engineering to Make a Disproportionate and Meaningful Impact](https://www.amazon.co.uk/Effective-Engineer-Engineering-Disproportionate-Meaningful/dp/0996128107/)
* [Tips for urgent quality assurance of ad-hoc statistical analysis](https://gss.civilservice.gov.uk/policy-store/top-tips-for-quality-assuring-urgent-pieces-of-ad-hoc-statistical-analysis/)
* [Tips for urgent quality assurance of data](https://gss.civilservice.gov.uk/policy-store/tips-for-urgent-quality-assurance-of-data/)]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Creating statistical neighbours comparator benchmarking models with Python]]></title>
            <link>https://shedloadofcode.com/blog/creating-statistical-neighbours-comparator-benchmarking-models-with-python/</link>
            <guid>https://shedloadofcode.com/blog/creating-statistical-neighbours-comparator-benchmarking-models-with-python/</guid>
            <pubDate>Wed, 23 Nov 2022 12:25:00 GMT</pubDate>
            <description><![CDATA[Learn how to use both filtering and scoring approaches to developing comparator benchmarking models for finding similar observations ranked by 'closeness'.]]></description>
            <content:encoded><![CDATA[
This article will explore how to get started creating a statistical neighbours model to benchmark, compare and find similar observations within a dataset. This might be comparing the sales of a store, to only other stores that are statistically similar in terms of size, budget and staffing or comparing school attendance performance for a given area to only other areas of similar size, pupil numbers and other characteristics. 

The main problem of comparator models is how to define what is considered statistically 'similar'. We will explore two approaches to solving this problem.

**All of the data used in this article is not real data. It has been adapted and modified based upon real data sources for learning purposes.**

## Filtering approach

In this dummy dataset [school_data.xlsx](https://github.com/shedloadofcode/data-files/blob/main/school_data.xlsx?raw=true) I adapted from two good open data sources [Explore education statistics](https://explore-education-statistics.service.gov.uk/find-statistics/pupil-attendance-in-schools) and [Get Information about Schools](https://www.get-information-schools.service.gov.uk/) there are around 1,800 schools but we only want to compare a school's attendance levels to it's top ten most statistically similar in terms of pupil size, alongside FSM and SEN characteristics.

| School      | Attendance% | Pupils | FSM | SEN | Phase   | LocationID |
| ----------- | ----------- | ------ | --- | --- | ------- | ---------- |
| SCHOOL-0001 | 98.2        | 63     | 5   | 6   | PHASE-1 | 855        |
| SCHOOL-0002 | 81          | 1229   | 257 | 72  | PHASE-2 | 873        |
| SCHOOL-0003 | 94.8        | 250    | 10  | 16  | PHASE-1 | 891        |
| SCHOOL-0004 | 94.5        | 653    | 78  | 89  | PHASE-1 | 856        |
| SCHOOL-0005 | 93.9        | 463    | 83  | 45  | PHASE-1 | 866        |
| SCHOOL-0006 | 94.2        | 918    | 156 | 131 | PHASE-2 | 865        |
| SCHOOL-0007 | 0           | 81     | 25  | 18  | PHASE-2 | 888        |
| SCHOOL-0008 | 91.4        | 195    | 83  | 29  | PHASE-1 | 888        |
| SCHOOL-0009 | 96.5        | 223    | 89  | 63  | PHASE-1 | 888        |
| SCHOOL-0010 | 92.5        | 719    | 253 | 130 | PHASE-2 | 209        |
| ...         | ...         | ...    | ... | ... | ...     | ...        |

For each school, we will apply a series of filters to find it's top ten comparators in terms of both pupil size and characteristics like FSM and SEN.

```python ["attendance_comparators.py"]
"""
A model to identify school comparator's based on their size and 
characteristics in order to compare attendance performance.

Assumptions:
    
- Schools will only be compared to schools of the same phase type​.
- Results will be the top ten statistically closest schools.
- The comparators will be based on attendance %​.

Functionality:
    
- Ability to compare against schools of a similar size​.
​- Ability to compare against schools with similar characteristics
"""

import os
import time
import pandas as pd


def get_data() -> pd.DataFrame():
    """
    Reads the Excel dataset into a Pandas DataFrame and adds new features such 
    as %FSM and %SEN.
    """
    df = pd.read_excel("school_data.xlsx")
    df["Attendance%"] = df["Attendance%"] * 100
    df["%FSM"] = (df["FSM"] / df["Pupils"]) * 100
    df["%SEN"] = (df["SEN"] / df["Pupils"]) * 100

    return df


def generate_all_comparators(output_all_to_csv: bool = False) -> None:
    """
    Generates the top 10 comparators for every school in the dataset, for each
    of the 2 comparator groups (size, characteristics).
    
    Optionally saves the result to CSV files where the folder name is the name 
    of the school where output_all_to_csv is set to True.
    """
    df = get_data()
    df_length = len(df)
    comparator_mappings = []

    for index, row in df.iterrows():
        school_name = row["SchoolName"]

        similar_sized_comparators = find_similar_sized_comparators(
            school_name=school_name,
            df=df
        )

        similar_characteristics_comparators = find_similar_characteristics_comparators(
            school_name=school_name,
            df=df
        )

        add_comparators_to_mappings(
            comparators=similar_sized_comparators,
            mappings=comparator_mappings,
            school_name=school_name,
            grouping="Size"
        )

        add_comparators_to_mappings(
            comparators=similar_characteristics_comparators,
            mappings=comparator_mappings,
            school_name=school_name,
            grouping="Characteristics"
        )

        if output_all_to_csv:
            if not os.path.exists("output"):
                os.mkdir("output")

            school_name = school_name.replace("/", "")
            directory = f"output/{school_name}"

            if not os.path.exists(directory):
                os.mkdir(directory)

            similar_sized_comparators.drop(
                columns=["Unnamed: 0"], inplace=True)

            similar_characteristics_comparators.drop(
                columns=["Unnamed: 0"], inplace=True)

            similar_sized_comparators.to_csv(
                directory + "/similar_sized_comparators.csv",
                index=False
            )
            similar_characteristics_comparators.to_csv(
                directory + "/similar_characteristics_comparators.csv",
                index=False
            )

        print(f"{index + 1} of {df_length} done.")

    return pd.DataFrame.from_records(comparator_mappings)


def add_comparators_to_mappings(comparators,
                                mappings,
                                school_name,
                                grouping) -> None:
    """
    Builds the final output by adding all of the comparators from
    the size and characteristics DataFrames to the mapping list
    in JSON / dictionary format:
    
    [
     {
      "School": "A", 
      "Comparator": "B", 
      "Grouping": "Size"
     },
     {
      "School": "B", 
      "Comparator": "C", 
      "Grouping": "Characteristics"
     },
    ]
        
    Which results in the final output:
        
    School   Comparator  Grouping
    A        B           Size
    A        D           Size
    B        D           Characteristics
    
    Avoids adding the target school_name as it's own comparator.
    """
    for index, row in comparators.iterrows():
        comparator_school_name = row["SchoolName"]
        if comparator_school_name != school_name:
            mappings.append({
                "School": school_name,
                "Comparator": comparator_school_name,
                "Grouping": grouping
            })


def find_similar_sized_comparators(school_name: str, df: pd.DataFrame) -> pd.DataFrame:
    """
    Finds schools of a similar size and returns as comparators.
    
    This comparator is calculated by the total number of pupils in each school,
    per organisation type. The groupings for each organisation type will be 
    calculated based on the highest and lowest pupil count for schools in that 
    category i.e. within a given % threshold
    """
    school = df[df["SchoolName"] == school_name]
    school_size = school["Pupils"].values[0]
    school_type = school["Phase"].values[0]
    schools_with_same_type = df[df["Phase"] == school_type]

    upper_size_threshold = school_size * 1.25
    lower_size_threshold = school_size * 0.75

    schools_of_similar_size = schools_with_same_type[
        (schools_with_same_type["Pupils"] >= lower_size_threshold) &
        (schools_with_same_type["Pupils"] <= upper_size_threshold)
    ].copy(deep=True)

    schools_of_similar_size["Size difference"] = (abs(
        schools_of_similar_size["Pupils"] -
        school_size
    ))

    schools_of_similar_size = schools_of_similar_size.nsmallest(
        11, "Size difference")

    schools_of_similar_size["Rank"] = (
        schools_of_similar_size["Attendance%"].rank(
            ascending=False
        )
    )

    return schools_of_similar_size


def find_similar_characteristics_comparators(school_name: str, df: pd.DataFrame) -> pd.DataFrame:
    """
    Finds schools with similar %FSM and %SEN characteristics and returns as comparators.
    """
    school = df[df["SchoolName"] == school_name]
    school_type = school["Phase"].values[0]
    school_fsm_percentage = school["%FSM"].values[0]
    school_sen_percentage = school["%SEN"].values[0]
    schools_with_same_type = df[df["Phase"] == school_type]

    upper_fsm_threshold = school_fsm_percentage * 1.1
    lower_fsm_threshold = school_fsm_percentage * 0.9

    upper_sen_threshold = school_sen_percentage * 1.1
    lower_fsm_threshold = school_sen_percentage * 0.9

    schools_with_similar_characteristics = schools_with_same_type[
        (schools_with_same_type["%FSM"] >= lower_fsm_threshold) &
        (schools_with_same_type["%FSM"] <= upper_fsm_threshold) &
        (schools_with_same_type["%SEN"] >= lower_fsm_threshold) &
        (schools_with_same_type["%SEN"] <= upper_sen_threshold)
    ].copy(deep=True)

    schools_with_similar_characteristics["Characteristics difference"] = (
        abs(schools_with_similar_characteristics["%FSM"] - school_fsm_percentage) +
        abs(schools_with_similar_characteristics["%SEN"] -
            school_sen_percentage)
    )

    schools_with_similar_characteristics = schools_with_similar_characteristics.nsmallest(
        11,
        "Characteristics difference"
    )

    schools_with_similar_characteristics["Rank"] = (
        schools_with_similar_characteristics["Attendance%"].rank(
            ascending=False
        )
    )

    return schools_with_similar_characteristics


if __name__ == "__main__":
    start = time.time()

    output = generate_all_comparators(
        output_all_to_csv=True
    )

    output.to_csv("output/comparator-mappings.csv", index=False)

    end = time.time()

    print(f"Model finished in {round(end - start, 2)} seconds.")
```

If the `output_all_to_csv` flag is set to True then for each school a folder will be created in the `output` directory for it, containing all of it's comparators for both size and pupil characteristics. An example of one of these outputs for 'SCHOOL-005' can be seen in the image below.

<article-image 
  src="https://res.cloudinary.com/dayqxxsip/image/upload/v1669030454/App%20Images/Blog%20Images/Article%20Images/Statistical%20Neighbours/individual-output_ie42rs_1_h95rqw.png" 
  alt="Example of single comparator results output" 
  loading="lazy" 
  styling=""
  caption="Example of a single comparator result output showing top ten statistical neighbours" 
  captionsrc="https://res.cloudinary.com/dayqxxsip/image/upload/v1669030454/App%20Images/Blog%20Images/Article%20Images/Statistical%20Neighbours/individual-output_ie42rs_1_h95rqw.png" 
  :showsource="false">
</article-image>

We can see within `similar_characteristics_comparators.csv` the %FSM and %SEN are within the upper and lower thresholds and within `similar_size_comparators.csv` Pupils are within the upper and lower thresholds. This shows the model is accurately filtering and ranking only those observations that fit inside these parameters.

Within the `output` directory, there is also the full list of comparators in the `comparator-mappings.csv` file.

<article-image 
  src="https://res.cloudinary.com/dayqxxsip/image/upload/v1668703755/App%20Images/Blog%20Images/Article%20Images/Statistical%20Neighbours/comparator-mappings_um4d7v.png" 
  alt="All comparator mappings" 
  loading="lazy" 
  styling=""
  caption="The full list of all comparator mappings" 
  captionsrc="https://res.cloudinary.com/dayqxxsip/image/upload/v1668703755/App%20Images/Blog%20Images/Article%20Images/Statistical%20Neighbours/comparator-mappings_um4d7v.png" 
  :showsource="false">
</article-image>

If we also had columns for 'Easting' and 'Northing' for these schools, we could also add another filter to find the top ten geospatially closest schools.

```python ["attendance_comparators.py"]
from scipy.spatial import distance

def find_similar_location_comparators(school_name: str, df: pd.DataFrame) -> pd.DataFrame:
    """
    Finds schools which are geospatially closest and returns as comparators.
    """ 
    school = df[df["School"] == school_name]
    school_location_id = school["LocationID"].values[0]
    school_type = school["Phase"].values[0]
    schools_with_same_type = df[df["Phase"] == school_type]
    school_easting = school["Easting"].values[0]
    school_northing = school["Northing"].values[0]

    location_data_available = (
        (school_easting != 0) & 
        (school_northing != 0)
    )
    
    if location_data_available:
        geo_comparators = schools_with_same_type \
            .copy(deep=True) \
            .reset_index()
        
        distances = []

        for _, row in geo_comparators.iterrows():
            a = (school_easting, school_northing)
            b = (row["Easting"], row["Northing"])
            distances.append(
                distance.euclidean(a, b)
            )

        geo_comparators["distance"] = pd.Series(distances)
        geo_comparators = geo_comparators[
            geo_comparators["Phase"] == school_type
        ]
        geo_comparators = geo_comparators.sort_values(
            by="distance", 
            ascending=True
        )
        geo_comparators = geo_comparators.head(11)
        
        return geo_comparators

    schools_in_same_area = schools_with_same_type[
        (schools_with_same_type["LocationID"] == school_location_id)
    ].copy(deep=True)
        
    if len(schools_in_same_area) <= 11:
        return schools_in_same_area
    
    
    sample =  schools_in_same_area.sample(n=10)
    sample = sample.append(school)
    
    return sample
```


## Scoring approach

In the next example, our dummy dataset [la_data.csv](https://raw.githubusercontent.com/shedloadofcode/data-files/main/la_data.csv) (adapted from a dataset taken from the [ONS](https://www.ons.gov.uk/peoplepopulationandcommunity/personalandhouseholdfinances/incomeandwealth/articles/mappingincomedeprivationatalocalauthoritylevel/2021-05-24)) is at Local Authority (area) level. 

| Local Authority District code (2019) | Local Authority District name (2019) | Profile              | Rural-urban classification                                      | Deprivation gap (percentage points) | Deprivation gap % | Deprivation gap ranking | Moran's I | Moran's I ranking | Income deprivation rate | Income deprivation rate ranking | Income deprivation rate quintile | % of households with 3 or more children | School pupils | School attendance % | Schools total spending £ | School spend per pupil £ | School Free School Meal % |
| ------------------------------------ | ------------------------------------ | -------------------- | --------------------------------------------------------------- | ----------------------------------- | ----------------- | ----------------------- | --------- | ----------------- | ----------------------- | ------------------------------- | -------------------------------- | --------------------------------------- | ------------- | ------------------- | ------------------------ | ------------------------ | ------------------------- |
| E07000223                            | Adur                                 | n-shape              | Urban with City and Town                                        | 21.70%                              | 21.70             | 233                     | 0.17      | 234               | 10.80%                  | 158                             | 3                                | 10                                      | 37437         | 76                  | 307104                   | 8.20                     | 28.70                     |
| E07000026                            | Allerdale                            | Flat                 | Mainly Rural (rural including hub towns >=80%)                  | 36.60%                              | 36.60             | 95                      | 0.29      | 157               | 12.10%                  | 130                             | 3                                | 16                                      | 40461         | 69                  | 869572                   | 21.49                    | 43.60                     |
| E07000032                            | Amber Valley                         | n-shape              | Urban with Minor Conurbation                                    | 32.90%                              | 32.90             | 121                     | 0.29      | 157               | 10.90%                  | 153                             | 3                                | 6                                       | 22981         | 44                  | 652505                   | 28.39                    | 39.90                     |
| E07000224                            | Arun                                 | n-shape              | Urban with City and Town                                        | 28.70%                              | 28.70             | 164                     | 0.31      | 139               | 10.40%                  | 171                             | 3                                | 25                                      | 34449         | 64                  | 437529                   | 12.70                    | 35.70                     |
| E07000170                            | Ashfield                             | More income deprived | Urban with City and Town                                        | 36.00%                              | 36.00             | 98                      | 0.15      | 246               | 15.20%                  | 72                              | 2                                | 11                                      | 9366          | 50                  | 770050                   | 82.22                    | 43.00                     |
| E07000105                            | Ashford                              | n-shape              | Urban with Significant Rural (rural including hub towns 26-49%) | 29.10%                              | 29.10             | 160                     | 0.34      | 116               | 11.00%                  | 150                             | 3                                | 26                                      | 38834         | 71                  | 613225                   | 15.79                    | 36.10                     |
| E07000004                            | Aylesbury Vale                       | Less income deprived | Largely Rural (rural including hub towns 50-79%)                | 19.60%                              | 19.60             | 264                     | 0.47      | 55                | 6.70%                   | 272                             | 5                                | 22                                      | 38433         | 56                  | 609848                   | 15.87                    | 26.60                     |
| E07000200                            | Babergh                              | Less income deprived | Mainly Rural (rural including hub towns >=80%)                  | 16.90%                              | 16.90             | 280                     | 0.17      | 234               | 8.00%                   | 232                             | 4                                | 21                                      | 48694         | 53                  | 146570                   | 3.01                     | 23.90                     |
| E09000002                            | Barking and Dagenham                 | More income deprived | Urban with Major Conurbation                                    | 25.40%                              | 25.40             | 195                     | 0.27      | 175               | 19.40%                  | 20                              | 1                                | 21                                      | 36548         | 89                  | 326135                   | 8.92                     | 32.40                     |
| E09000003                            | Barnet                               | n-shape              | Urban with Major Conurbation                                    | 31.90%                              | 31.90             | 132                     | 0.36      | 105               | 11.10%                  | 148                             | 3                                | 15                                      | 48851         | 33                  | 448473                   | 9.18                     | 38.90                     |

We want to compare a Local Authority area to only other statistically similar areas, but not just on one factor, but many (or all) numeric factors available and score them in terms of 'closeness'. This will find the top ten closest neighbours for comparisons and benchmarking.

```python [statistical_neighbours.py]
import pandas as pd


def find_statistical_neighbours_for(local_authority_district_code: str) -> pd.DataFrame:
    df = pd.read_csv(
        filepath_or_buffer="la_data.csv",
        encoding="cp1252"
    )
    
    df["Comparator score"] = 0
    df["Comparator variables"] = ""
    
    target_la = df.loc[
        (df["Local Authority District code (2019)"] == local_authority_district_code)
    ]
    
    comparison_variables = {
       "Deprivation gap %": 1, 
       "Deprivation gap ranking": 1, 
       "Moran's I ranking": 1, 
       "Income deprivation %": 1,
       "Income deprivation rate ranking": 1, 
       "% of households with 3 or more children ": 1, 
       "School pupils": 2, 
       "School Free School Meal %": 2
    }
    
    # compare the comparator variables for each LA against the target LA and score them
    for index, row in df.iterrows():
        is_target_la = (
            row["Local Authority District code (2019)"] == local_authority_district_code
        )
        
        if is_target_la:
            continue
            
        for variable in comparison_variables:
            if variables_are_statistically_similar(target_la[variable].values[0], row[variable]):  
                df.loc[index, 'Comparator score'] = (
                    df.loc[index, 'Comparator score'] + comparison_variables[variable]
                )

                df.loc[index, 'Comparator variables'] = (
                    df.loc[index, 'Comparator variables'] + variable + ", "
                )
        
                
    return(df.nlargest(10, "Comparator score").append(target_la))


def variables_are_statistically_similar(target: float, comparator: float) -> bool:
    upper_bound = target * 1.10
    lower_bound = target * 0.90
    
    comparator_is_within_range = (
        comparator > lower_bound and comparator < upper_bound
    )
    
    return comparator_is_within_range


def print_attendance_comparisons(df: pd.DataFrame) -> None:
    target_la = df.iloc[-1]
    df = df[: -1]
    
    la_name = target_la["Local Authority District name (2019)"]
    la_school_attendance_percentage = target_la["School attendance %"]
    
    average_comparator_attendance_percentage = df["School attendance %"].mean()
    
    print("The average school attendance percentage of your comparator LAs was ", end="")
    print(f"{average_comparator_attendance_percentage}%", end="\n")
    print(f"School attendance in {la_name} was {la_school_attendance_percentage}%", end="\n")
    
    attendance_percentage_difference = (
        la_school_attendance_percentage - average_comparator_attendance_percentage
    )
    attendance_percentage_difference = round(abs(attendance_percentage_difference), 2)

    if la_school_attendance_percentage < average_comparator_attendance_percentage:
        print(
            f"This is {attendance_percentage_difference} " 
            f"percentage points lower than your comparator LAs"
        )
    else:
        print(
            f"This is {attendance_percentage_difference} " 
            f"percentage points higher than your comparator LAs"
        )
    

def print_spending_comparisons(df: pd.DataFrame) -> None:
    target_la = df.iloc[-1]
    df = df[: -1]
    
    la_name = target_la["Local Authority District name (2019)"]
    la_school_spending = target_la["Schools total spending £"]
    
    average_comparator_spending = df["Schools total spending £"].mean()
    
    print("", end="\n\n")
    
    print("The average school spending of your comparator LAs was ", end="")
    print(f"£{average_comparator_spending}", end="\n")
    print(f"School spending in {la_name} was £{la_school_spending}", end="\n")
    
    spending_difference = (
        la_school_spending - average_comparator_spending
    )
    spending_difference = round(abs(spending_difference), 2)
    
    if la_school_spending < average_comparator_spending:
        print(f"This is £{spending_difference} lower than your comparator LAs")
    else:
        print(f"This is £{spending_difference} higher than your comparator LAs")
        
    
if __name__ == "__main__":
    comparators = find_statistical_neighbours_for("E07000150")
    
    print_attendance_comparisons(comparators)
    print_spending_comparisons(comparators)
    
    html_file = open("index.html", "w")
    html_file.write(comparators.to_html())
    html_file.close()
```

<code-runner :output="['The average school attendance percentage of your comparator LAs was 60.8%',
  'School attendance in Corby was 68%',
  'This is 7.2 percentage points higher than your comparator LAs   ',
  '',
  'The average school spending of your comparator LAs was £319554.8',
  'School spending in Corby was £671195',
  'This is £351640.2 higher than your comparator LAs']" 
  filename="statistical_neighbours.py" 
  language="Python">
</code-runner>

The scoring model works by first assigning weights in the dictionary `comparison_variables`. Then later will check each of these to see if the `variables_are_statistically_similar()` against the target Local Authority, and if so, increment the score by the weight for each.

The scoring model then first prints some summary information to the console such as comparisons between the target Local Authority's average attendance and average spending against their comparator Local Authorities. It then outputs the comparators for the target Local Authority to a HTML file 'output.html' to see which has the highest score. 

<article-image 
  src="https://res.cloudinary.com/dayqxxsip/image/upload/v1669031231/App%20Images/Blog%20Images/Article%20Images/Statistical%20Neighbours/scoring-output-html-file_cvtpbf.png" 
  alt="Scoring model comparators output" 
  loading="lazy" 
  styling=""
  caption="Scoring model comparators output" 
  captionsrc="https://res.cloudinary.com/dayqxxsip/image/upload/v1669031231/App%20Images/Blog%20Images/Article%20Images/Statistical%20Neighbours/scoring-output-html-file_cvtpbf.png" 
  :showsource="false">
</article-image>

The output could be made to look a little nicer with some styling via CSS, but it clearly shows that across all of the comparison variables which are the 'closest' and even has a column 'Comparator variables' to show which variables were the ones driving those scores.

The target Local Authority (in this example Corby) is at the bottom of the table to refer back to. Go ahead and try plugging in different Local Authority District Codes to the `find_statistical_neighbours_for(local_authority_district_code: str)` function to see how it performs!

## What we learned

We have covered using both filtering and scoring approaches to solving statistical neighbour problems. You can now apply these models to other problems in different domains. It is a very useful ability to only compare to other observations that are statistically similar - it makes the comparison analysis more tailored and as a result the conclusions and decisions are more relevant. 

Much better to compare and benchmark observations against those with similar characteristics, else you may end up making decisions that don't really apply to the school, local authority, store, or anything else the observation may be!

I did use mostly an iterative approach whilst putting these solutions together, like looping over DataFrame rows for example. If you can think of more efficient ways to solve these statistical neighbour problems for larger datasets or have any other comparator techniques you would like to share, please post a comment in the comment section below! 

As always, if you liked this article please check out [other articles](/) on the site.
]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Building an AutoTrader scraper with Python to search for multiple makes and models]]></title>
            <link>https://shedloadofcode.com/blog/building-an-autotrader-scraper-with-python-to-search-for-multiple-makes-and-models/</link>
            <guid>https://shedloadofcode.com/blog/building-an-autotrader-scraper-with-python-to-search-for-multiple-makes-and-models/</guid>
            <pubDate>Fri, 21 Oct 2022 11:35:00 GMT</pubDate>
            <description><![CDATA[Develop your Python skills by building an AutoTrader scraper that can search for and filter multiple makes and models.]]></description>
            <content:encoded><![CDATA[
**Update November 2023:** Please check out the new Autotrader scraper in the article [How to scrape AutoTrader with Python and Selenium to search for multiple makes and models](/blog/how-to-scrape-autotrader-with-python-and-selenium-to-search-for-multiple-makes-and-models/) which uses Python, Selenium and RegEx.

---

**Update September 2023:** The Autotrader UK website has since changed their layout breaking this scraper, it last worked around June 2023 since I utilised it to find a used Honda Jazz! It seems HTML classes such as 'product-card-details__title' have been changed, making scraping more difficult. Thanks to everyone for your great feedback on this scraper, I will continue to try and find an alternative way or workaround and update the article if I find one 👍 Still lots to learn from this article!

---

Searching for used cars can be difficult and time consuming. AutoTrader is a great place to perform this search but as far as I can see, it does not allow to search for multiple makes and models in one search. Who wants to keep going back and forth between previously saved searchs, right? Wouldn't it be so much easier if you could compare all of them in one list? 

## Designing the solution

A solution would need to perform the following steps for each make and model given as inputs:

1. Go to [AutoTrader](https://www.autotrader.co.uk).
2. Search for the given make and model with filters applied (price, year, mileage etc).
3. Scrape information from each of the listings.
4. Add the information to a list.

We can then output the information to CSV for further data analysis. 

Finally, the CSV output can be formatted to make it easier to read and find the most optimal car even faster.

## Installing required packages

This will rely on a few Python packages so if you are following along or are wanting to use this program yourself, install the following:

```
python -m pip install numpy pandas requests cloudscraper bs4 xlsxwriter openpyxl 
```

## Building the AutoTrader scraper

A good starting point in any project is asking 'has this been done before?' and 'is there existing open source code that can be used for this?'. Sometimes you want to code everything from scratch, and sometimes you want to get things done quickly. Building on the work of others is the foundation of computing and a testament to how far technology has come along in my view. 

I found a very useful package called [autotrader-scraper](https://pypi.org/project/autotrader-scraper/) which used [cloudscraper](https://pypi.org/project/cloudscraper/) and [beautifulsoup](https://pypi.org/project/beautifulsoup4/) to scrape data from AutoTrader given some filter arguments. I extended this code to scrape the seller details and fixed an issue where the scraper retrieved the seller page link instead of the actual vehicle link from the HTML source.

```python [autotrader-scraper.py]
import requests
import json
import csv
from bs4 import BeautifulSoup
import traceback
import cloudscraper

def get_cars(
  make="BMW", 
  model="5 SERIES", 
  postcode="SW1A 0AA", 
  radius=1500, 
  min_year=1995, 
  max_year=1995, 
  include_writeoff="include", 
  max_attempts_per_page=5, 
  verbose=False):

	# To bypass Cloudflare protection
	scraper = cloudscraper.create_scraper()

	# Basic variables
	results = []
	n_this_year_results = 0

	url = "https://www.autotrader.co.uk/results-car-search"

	keywords = {}
	keywords["mileage"] = ["miles"]
	keywords["BHP"] = ["BHP"]
	keywords["transmission"] = ["Automatic", "Manual"]
	keywords["fuel"] = [
      "Petrol", 
      "Diesel", 
      "Electric", 
      "Hybrid – Diesel/Electric Plug-in", 
      "Hybrid – Petrol/Electric", 
      "Hybrid – Petrol/Electric Plug-in"
    ]
	keywords["owners"] = ["owners"]
	keywords["body"] = [
      "Coupe", 
      "Convertible", 
      "Estate", 
      "Hatchback", 
      "MPV", 
      "Pickup", 
      "SUV", 
      "Saloon"
    ]
	keywords["ULEZ"] = ["ULEZ"]
	keywords["year"] = [" reg)"]
	keywords["engine"] = ["engine"]

	# Set up parameters for query to autotrader.co.uk
	params = {
		"sort": "relevance",
		"postcode": postcode,
		"radius": radius,
		"make": make,
		"model": model,
		"search-results-price-type": "total-price",
		"search-results-year": "select-year",
	}

	if (include_writeoff == "include"):
		params["writeoff-categories"] = "on"
	elif (include_writeoff == "exclude"):
		params["exclude-writeoff-categories"] = "on"
	elif (include_writeoff == "writeoff-only"):
		params["only-writeoff-categories"] = "on"
		
	year = min_year
	page = 1
	attempt = 1

	try:
		while year <= max_year:
			params["year-from"] = year
			params["year-to"] = year
			params["page"] = page

			r = scraper.get(url, params=params)
			if verbose:
				print("Year:     ", year)
				print("Page:     ", page)
				print("Response: ", r)

			try:
				if r.status_code != 200:   # If not successful (e.g. due to bot protection)
					attempt = attempt + 1  # Log as an attempt
					if attempt <= max_attempts_per_page:
						if verbose:
							print("Exception. Starting attempt #", attempt, "and keeping at page #", page)
					else:
						page = page + 1
						attempt = 1
						if verbose:
							print("Exception. All attempts exhausted for this page. Skipping to next page #", page)

				else:

					j = r.json()
					s = BeautifulSoup(j["html"], features="html.parser")

					articles = s.find_all("article", attrs={"data-standout-type":""})

					# If no results or reached end of results...
					if len(articles) == 0 or r.url[r.url.find("page=")+5:] != str(page):
						if verbose:
							print("Found total", n_this_year_results, "results for year", year, "across", page-1, "pages")
							if year+1 <= max_year:
								print("Moving on to year", year + 1)
								print("---------------------------------")

						# Increment year and reset relevant variables
						year = year + 1
						page = 1
						attempt = 1
						n_this_year_results = 0
					else:
						for article in articles:
							car = {}
							car["name"] = article.find("h3", {"class": "product-card-details__title"}).text.strip()				
							car["link"] = "https://www.autotrader.co.uk" + \
                                  article.find("a", {"class": "listing-fpa-link"})["href"][: article.find("a", {"class": "listing-fpa-link"})["href"] \
                                  .find("?")]
							car["price"] = article.find("div", {"class": "product-card-pricing__price"}).text.strip()

							seller_info = article.find("ul", {"class": "product-card-seller-info__specs"}).text.strip()
							car["seller"] = " ".join(seller_info.split())

							key_specs_bs_list = article.find("ul", {"class": "listing-key-specs"}).find_all("li")
							
							for key_spec_bs_li in key_specs_bs_list:

								key_spec_bs = key_spec_bs_li.text

								if any(keyword in key_spec_bs for keyword in keywords["mileage"]):
									car["mileage"] = int(key_spec_bs[:key_spec_bs.find(" miles")].replace(",",""))
								elif any(keyword in key_spec_bs for keyword in keywords["BHP"]):
									car["BHP"] = int(key_spec_bs[:key_spec_bs.find("BHP")])
								elif any(keyword in key_spec_bs for keyword in keywords["transmission"]):
									car["transmission"] = key_spec_bs
								elif any(keyword in key_spec_bs for keyword in keywords["fuel"]):
									car["fuel"] = key_spec_bs
								elif any(keyword in key_spec_bs for keyword in keywords["owners"]):
									car["owners"] = int(key_spec_bs[:key_spec_bs.find(" owners")])
								elif any(keyword in key_spec_bs for keyword in keywords["body"]):
									car["body"] = key_spec_bs
								elif any(keyword in key_spec_bs for keyword in keywords["ULEZ"]):
									car["ULEZ"] = key_spec_bs
								elif any(keyword in key_spec_bs for keyword in keywords["year"]):
									car["year"] = key_spec_bs
								elif key_spec_bs[1] == "." and key_spec_bs[3] == "L":
									car["engine"] = key_spec_bs

							results.append(car)
							n_this_year_results = n_this_year_results + 1

						page = page + 1
						attempt = 1

						if verbose:
							print("Car count: ", len(results))
							print("---------------------------------")

			except KeyboardInterrupt:
				break

			except:
				traceback.print_exc()
				attempt = attempt + 1
				if attempt <= max_attempts_per_page:
					if verbose:
						print("Exception. Starting attempt #", attempt, "and keeping at page #", page)
				else:
					page = page + 1
					attempt = 1
					if verbose:
						print("Exception. All attempts exhausted for this page. Skipping to next page #", page)

	except KeyboardInterrupt:
		pass

	return results
```

This returns results from the `get_car()` function as a list. You can leave or edit the `keywords` inputs if you would like to pull back less or more results before filtering further.

## Searching AutoTrader for multiple makes and models

Now we have a file named 'autotrader_scraper.py' we will create another file for the searcher which we'll name 'autotrader_searcher.py'.

This will use the `get_car()` function we created in the last step to retrieve information from AutoTrader for each make and model and then combine them into one list. This list can then be used to create a Pandas DataFrame for further filtering. In the `criteria` dictionary, be sure to replace the postcode with your postcode.

```python [autotrader-searcher.py]
"""
Enables the automation of multiple autotrader searches.

Based on the autotrader-scraper package:
https://github.com/suhailidrees/autotrader_scraper
"""

from autotrader_scraper import get_cars
import pandas as pd

criteria = {
    "postcode": "SW1A 0AA", 
    "min_year": 2008,
    "max_year": 2014,
    "radius": 40,
    "min_price": 2000,
    "max_price": 6000,
    "fuel": "Petrol",
    "transmission": "Manual",
    "max_mileage": 100000,
    "max_attempts_per_page": 3,
    "verbose": False
}

civic = get_cars(
    make = "Honda",
    model = "Civic",
    postcode = criteria["postcode"],
    radius = criteria["radius"],
    min_year = criteria["min_year"],
    max_year = criteria["max_year"],
    include_writeoff = "exclude",
    max_attempts_per_page = criteria["max_attempts_per_page"],
    verbose = criteria["verbose"]
)

print("Civic search done.")

jazz = get_cars(
    make = "Honda",
    model = "Jazz",
    postcode=criteria["postcode"],
    radius = criteria["radius"],
    min_year = criteria["min_year"],
    max_year = criteria["max_year"],
    include_writeoff = "exclude",
    max_attempts_per_page = criteria["max_attempts_per_page"],
    verbose = criteria["verbose"]
)

print("Jazz search done.")

auris = get_cars(
    make = "Toyota",
    model = "Auris",
    postcode=criteria["postcode"],
    radius = criteria["radius"],
    min_year = criteria["min_year"],
    max_year = criteria["max_year"],
    include_writeoff = "exclude",
    max_attempts_per_page = criteria["max_attempts_per_page"],
    verbose = criteria["verbose"]
)

print("Auris search done.")

corolla = get_cars(
    make = "Toyota",
    model = "Corolla",
    postcode=criteria["postcode"],
    radius = criteria["radius"],
    min_year = 2000,
    max_year = criteria["max_year"],
    include_writeoff = "exclude",
    max_attempts_per_page = criteria["max_attempts_per_page"],
    verbose = criteria["verbose"]
)

print("Corolla search done.")

yaris = get_cars(
    make = "Toyota",
    model = "Yaris",
    postcode=criteria["postcode"],
    radius = criteria["radius"],
    min_year = criteria["min_year"],
    max_year = criteria["max_year"],
    include_writeoff = "exclude",
    max_attempts_per_page = criteria["max_attempts_per_page"],
    verbose = criteria["verbose"]
)

print("Yaris search done.")

mazda3 = get_cars(
    make="Mazda",
    model="Mazda3",
    postcode=criteria["postcode"],
    radius=criteria["radius"],
    min_year=criteria["min_year"],
    max_year=criteria["max_year"],
    include_writeoff="exclude",
    max_attempts_per_page=criteria["max_attempts_per_page"],
    verbose=criteria["verbose"]
)

print("Mazda3 search done.")

swift = get_cars(
    make="Suzuki",
    model="Swift",
    postcode=criteria["postcode"],
    radius=criteria["radius"],
    min_year=criteria["min_year"],
    max_year=criteria["max_year"],
    include_writeoff="exclude",
    max_attempts_per_page=criteria["max_attempts_per_page"],
    verbose=criteria["verbose"]
)

print("Swift search done.")

results = (
    civic + 
    jazz +
    auris + 
    corolla +
    yaris + 
    mazda3 + 
    swift
)

print(f"Found {len(results)} total results.")

df = pd.DataFrame.from_records(results)

df["price"] = df["price"] \
    .str.replace("£", "") \
    .str.replace(",", "") \
    .astype(int)

df["distance"] = df["seller"].str.extract(r'(\d+ mile)', expand=False)
df["distance"] = df["distance"].str.replace(" mile", "")
df["distance"] = pd.to_numeric(df["distance"], errors="coerce").astype("Int64")

df["year"] = df["year"].str.replace(r"\s(\(\d\d reg\))", "", regex=True)
df["year"] = pd.to_numeric(df["year"], errors="coerce").astype("Int64")

shortlist = df[
  (df["price"] >= criteria["min_price"]) & 
  (df["price"] <= criteria["max_price"]) &
  (df["fuel"] == criteria["fuel"]) &
  (df["mileage"] <= criteria["max_mileage"]) &
  (df["transmission"] == criteria["transmission"]) &
  (df["engine"] != "1.0L") &
  (df["engine"] != "1.2L")
]

print(f"{len(shortlist)} cars met the criteria. Saving to 'autotrader-shortlist.csv'")

shortlist = shortlist.sort_values(by="distance")
shortlist.to_csv("autotrader-shortlist.csv")
```

As you can see from this code, when the time comes to replace my car I am determined to find a good condition, relatively low mileage, reliable Japanese car for less than £5000 that can get me from A to B without too many headaches! You might want to remove some of these cars and add others that are on your wish list.

<article-image 
  src="https://res.cloudinary.com/dayqxxsip/image/upload/v1663944407/App%20Images/Blog%20Images/Article%20Images/Autotrader%20Searcher/raw-output_yuhxen.png" 
  alt="Raw CSV output" 
  loading="lazy" 
  styling=""
  caption="Raw CSV output from the AutoTrader scraper" 
  captionsrc="https://res.cloudinary.com/dayqxxsip/image/upload/v1663944407/App%20Images/Blog%20Images/Article%20Images/Autotrader%20Searcher/raw-output_yuhxen.png" 
  :showsource="false">
</article-image>

## Formatting the shortlist

Now we have results returned from AutoTrader in CSV format, it would be nicer to apply some conditional formatting to this to quickly pick out the most viable vehicles - the hidden gems. Create another file named 'shortlist_formatter.py'.

```python [shortlist-formatter.py]
import openpyxl
import numpy as np
import pandas as pd
import os
import shutil
import datetime

def format_autotrader_shortlist() -> None:
  df = pd.read_csv("autotrader-shortlist.csv")

  now = datetime.datetime.now()
  df["miles_pa"] = df["mileage"] / (now.year - df["year"])
  df["miles_pa"].fillna(0, inplace=True)
  df["miles_pa"] = df["miles_pa"].astype(int)

  most_viable_cars_mask = (
      (df["mileage"] < 85000) &
      (df["miles_pa"] < 9000) &
      (df["owners"] <= 3) 
  )

  df["viable"] = np.where(
    most_viable_cars_mask, 
    "Y", 
    ""
  )

  df = add_previously_viewed_cars(df)

  df = df[[
      "viable", 
      "viewed",
      "name",
      "link",
      "price",
      "year",
      "mileage",
      "miles_pa",
      "owners",
      "engine",
      "seller",
      "distance",
  ]]

  writer = pd.ExcelWriter("autotrader-shortlist.xlsx", engine="xlsxwriter")
  df.to_excel(writer, sheet_name="Sheet1", index=False)
  workbook = writer.book
  worksheet = writer.sheets["Sheet1"]

  worksheet.conditional_format("E2:E1000", {
      'type':      '3_color_scale',
      'min_color': '#63be7b',
      'mid_color': '#ffdc81',
      'max_color': '#f96a6c'
  })

  worksheet.conditional_format("F2:F1000", {
      'type':      '3_color_scale',
      'min_color': '#f96a6c',
      'mid_color': '#ffdc81',
      'max_color': '#63be7b'
  })

  worksheet.conditional_format("G2:G1000", {
      'type':      '3_color_scale',
      'min_color': '#63be7b',
      'mid_color': '#ffdc81',
      'max_color': '#f96a6c'
  })

  worksheet.conditional_format("H2:H1000", {
      'type':      '3_color_scale',
      'min_color': '#63be7b',
      'mid_color': '#ffdc81',
      'max_color': '#f96a6c'
  })

  worksheet.conditional_format("I2:I1000", {
      'type':      '3_color_scale',
      'min_color': '#63be7b',
      'mid_color': '#ffdc81',
      'max_color': '#f96a6c'
  })

  writer.save()
  print("Shortlist formatting done.")


def add_previously_viewed_cars(df) -> pd.DataFrame:
  df["viewed"] = ""

  if not os.path.exists("Previous searches/Last search/autotrader-shortlist.xlsx"):
    return df

  viewed_cars = pd.read_excel(
    "Previous searches/Last search/autotrader-shortlist.xlsx"
  )

  for index, row in df.iterrows():
    car_in_previous_search = (
      (viewed_cars["name"] == row["name"]) &
      (viewed_cars["link"] == row["link"])
    ).any()

    if car_in_previous_search:
      df.loc[index, "viewed"] = "Y"

  return df


def update_previous_search_history():
  """
  Copies the autotrader shortlist Excel file to 
  '/Previous searches/Last search' to find cars 
  seen previously and to '/Previous searches' 
  for documenting historic searches.
  """
  if not os.path.exists("autotrader-shortlist.xlsx"):
    return 

  now = datetime.datetime.now()
  date = f"{str(now.day)}-{now.strftime('%m')}-{str(now.year)}"

  shutil.copyfile(
      src="autotrader-shortlist.xlsx",
      dst=f"Previous searches/autotrader-shortlist-{date}.xlsx"
  )

  shutil.copyfile(
      src="autotrader-shortlist.xlsx",
      dst=f"Previous searches/Last search/autotrader-shortlist.xlsx"
  )


def open_file_in_excel() -> None:
  os.system("start EXCEL.EXE autotrader-shortlist.xlsx")


if __name__ == "__main__":
  format_autotrader_shortlist()
  update_previous_search_history()
  open_file_in_excel()
```

This calculates mileage per annum which is then used in a viability check. This means that the cars with the most potential are given a 'Y' in the viable column. Of course, even a car with relatively low mileage and a low number of previous owners can still be in a poor condition if it's not been looked after or has been sat idle for long periods of time, so this only highlights the *potential* gems. Using the `most_viable_cars_mask` identifies and marks cars as viable with a 'Y' which have less than 85000 miles, less than 9000 miles per annum and with 3 previous owners or less.

<article-image 
  src="https://res.cloudinary.com/dayqxxsip/image/upload/v1663944407/App%20Images/Blog%20Images/Article%20Images/Autotrader%20Searcher/formatted-output_ko5qqq.png" 
  alt="Formatted Excel output" 
  loading="lazy" 
  styling=""
  caption="Formatted Excel output from the AutoTrader scraper" 
  captionsrc="https://res.cloudinary.com/dayqxxsip/image/upload/v1663944407/App%20Images/Blog%20Images/Article%20Images/Autotrader%20Searcher/formatted-output_ko5qqq.png" 
  :showsource="false">
</article-image>

## Taking it for a spin

Let's see the scraper, searcher, and formatter all in action one after another, in this end-to-end demo. I perform this process weekly to get the most up to date listing for my area. The formatter makes it really easy to see the trade offs in terms of price, year, mileage and previous owners.

<article-video 
  id="516RIX7zJRE" 
  title="Building an AutoTrader scraper with Python to search for multiple makes and models">
</article-video>

## Troubleshooting

On the odd occasion, the program does hang as it retries after a failed connection. The best way to correct this is to end the program using Ctrl + C, wait a short while, and then re-run it in a new console. This will establish a new connection and successfully return the results from the multiple scraping calls started by the `get_cars()` function.

## Bonus: Identify cars seen in a previous search

As you might have noticed in `shortlist-formatter.py` after the formatting is complete, the autotrader search Excel file is copied to both the '/Previous searches' and '/Previous searches/Last search' folders with the `update_previous_search_history` function. This is so that on our *next* search we can cross-reference it with this historic data to find out if we've seen a particular car before! I found this to be an extremely useful addition especially if you are running this every week.

## Finishing in first place

Spreading the search net wider to multiple makes and models and automating the search has been an excellent strategy for finding suitable cars within a reasonable distance from my location fast. I will update this section when I do go ahead a buy one to let you know what is was 😄 I am hoping my current car will last into next year, but at least I have this handy program ready to go if not.

The only thing left for you to do is set your criteria, add the makes and models you want, and off you go! Happy car hunting.

If you enjoyed this article be sure to check out [other articles](/) on the site.]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Creating a screen and mouse jiggler with Python]]></title>
            <link>https://shedloadofcode.com/blog/creating-a-screen-and-mouse-jiggler-with-python/</link>
            <guid>https://shedloadofcode.com/blog/creating-a-screen-and-mouse-jiggler-with-python/</guid>
            <pubDate>Fri, 02 Sep 2022 14:00:00 GMT</pubDate>
            <description><![CDATA[Discover how to build your own screen and mouse jiggler to prevent your computer screen from turning off and keeping your status as online.]]></description>
            <content:encoded><![CDATA[
I recently came across the idea of a mouse jiggler (keeps your mouse moving) and after some investigation realised there are [products being sold](https://www.amazon.co.uk/s?k=mouse+jiggler&crid=W92Y3XH4RRF8&sprefix=mouse+jiggl%2Caps%2C580&ref=nb_sb_noss_2) to achieve this! Yes, even after doing up to 50% of my time working from home since long before the COVID-19 pandemic I had never heard of this 😄 The more I thought about it, I figured something like this would be really useful for me for a number of good reasons.

## Why build a mouse jiggler?

Sometimes I use my personal PC or laptop to try out ideas or perform testing outside of the organisation's internal network. However, if I spend more then approximately 1 minute away from my work laptop the screen will go off, my status will appear as 'away' on instant messenger. This makes it seem like I'm not available for my team's questions when really I'm just doing work on my own machine. I'd prefer the screen to just stay on instead using the touch pad to keep the screen on. Unfortunately, the screen saver / screen off / IM settings are disabled. 

The solution could be just moving the mouse back and forth slowly on the screen to keep the active window showing, or with a function to switch windows from time to time so I can check different apps as I work.

I guess some other reasons might be balancing work and life more generally - attending appointments, having a coffee break, making lunch, or letting the dog out.

I have no doubt some people may use such a tool to avoid work and appear present at their machine, but then that's not really a mouse jiggler problem, that's a job satisfaction, productivity, motivation, wellbeing and management problem.

More generally, automation skills with Python are very good to have, and can be used in other projects like if you wanted to [record your mouse and keyboard clicks to then automate repetitive tasks](/blog/record-mouse-and-keyboard-for-automation-scripts-with-python/).

## Explaining the mouse jiggler program

The only package that this program relies on is [PyAutoGUI](https://pyautogui.readthedocs.io/en/latest/). To install with pip, run:

```
pip install pyautogui
```

Once installed create a Python file.

```python [mouse_jiggler.py]
# -*- coding: utf-8 -*-

import pyautogui
import time
import random
import sys

pyautogui.FAILSAFE = False


def switch_screens() -> None:
    """
    Switches the active screen using Alt + Tab
    a random number of times.
    """
    max_switches = random.randint(1, 5)
    pyautogui.keyDown('alt') 
    
    for _ in range(1, max_switches):
        pyautogui.press('tab')     
     
    pyautogui.keyUp('alt')   


def wiggle_mouse() -> None:
    """
    Wiggles the mouse between two coordinates.
    """
    max_wiggles = random.randint(4, 9)
    
    for _ in range(1, max_wiggles):
        coords = get_random_coords()
        pyautogui.moveTo(
            x=coords[0], 
            y=coords[1],
            duration=5
        )
        time.sleep(10)
    

def get_random_coords() -> []:
    """
    Returns a list of coordinates in the 
    format [x=1980, y=1080]
    """
    screen = pyautogui.size()
    width = screen[0]
    height = screen[1]
    
    return [
        random.randint(100, width - 200),
        random.randint(100, height - 200)
    ]


if __name__ == "__main__":
    print('Press Ctrl-C to quit.')
    try:
        while True:
            switch_screens()
            wiggle_mouse()
            sys.stdout.flush()
    except KeyboardInterrupt:
        print("\n")
```

To start the program use this command from the same directory:

```
python mouse_jiggler.py
```

To end the program use Ctrl + C.

So the program relies on two functions: 

* `switch_screens` uses Alt + Tab to switch the active screen a set number of times.
* `wiggle_mouse` moves the mouse to a random set of coordinates.

These functions are using some of the [many methods that PyAutoGUI](https://pyautogui.readthedocs.io/en/latest/quickstart.html) conveniently provides:

* `.size()` returns current screen resolution width and height
* `.moveTo(x, y, duration)` moves the mouse to XY coordinates over duration in seconds
* `.keyDown(key)` presses the key down and keeps the button pressed
* `.keyUp(key)` releases a key that was kept pressed by `keyDown()`
* `.keyPress(key)` presses the given key and combines `keyDown()` followed by `keyUp()` 

This creates a simple solution to always keep the screen active, preventing it from turning off and keeping you appearing as online. This has worked great and really has taken a burden off my mind whilst I try to innovate and prove techniques on my own personal machine that might not work on my work machine. It really is a win-win. If you only want the mouse to move and to keep the active window showing and don't want to switch screens, remove the switch_screens function call underneath `while True`.

I have heard stories of some employers using screen and keyboard tracking software for monitoring employees which I find really sad. I'm focused and I take pride in my work but I'm not always at 100% so I doubt any monitoring software would be a true reflection on how much productivity I give and how much value I bring to my workplace in terms of money and time. No one can be switched on all the time, and we all have to realise that mental health and wellbeing in general is so important. If you did find yourself in that situation and had to stick around a while and had the ability to install Python, I can see an modified version of this program being useful to spread the time between screen switches out. 

You might have noticed I set FAILSAFE to false to turn it off. This is NOT recommended [in the documentation](https://pyautogui.readthedocs.io/en/latest/index.html?highlight=failsafe#fail-safes) so consider yourself warned, however I found to reliably avoid the failsafe action when the mouse is in any of the four corners of the primary monitor, it was best to disable it. It just means you have to be extra careful with the code, and if in any doubt set it to true to re-enable it.

<subscribe-form></subscribe-form>

## Seeing the mouse jiggler in action

Here is a quick video of how the program behaves moving the mouse and switching screens a number of times.

<article-video 
  id="-jUcBH0CS1w" 
  title="Creating a screen and mouse jiggler with Python">
</article-video>

## What will you use yours for?

Okay this was a fun article, now you know how to create a screen and mouse jiggler with Python, and have a solid start to building more advanced robotic process automation (RPA) solutions with PyAutoGUI. You can refer to the [documentation](https://pyautogui.readthedocs.io/en/latest/) for more guidance on using PyAutoGUI and think about what else you might like to build 😄

If you enjoyed this article be sure to check out other articles on the site, some which also explore automation with Python and PyAutoGUI including:

* [Record mouse and keyboard for automation scripts with Python](/blog/record-mouse-and-keyboard-for-automation-scripts-with-python/) 
* [Reduce Material Design Icons Font to 7KB and automate with PyAutoGUI](/blog/reduce-material-design-icons-font-to-7kb-and-automate-with-pyautogui/) 
* [Developing your data science and analytical coding skills - a review of DataCamp](/blog/developing-your-data-science-and-analytical-coding-skills-a-review-of-datacamp/) for improving your Python skills

Finally, if you have any questions or if you decide to use or extend this program, please leave a comment below. I'd love to know what you use it for and how it's helped you out 👍]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Hide your own site visits from Cloudflare Analytics with JavaScript]]></title>
            <link>https://shedloadofcode.com/blog/hide-your-own-site-visits-from-cloudflare-analytics-with-javascript/</link>
            <guid>https://shedloadofcode.com/blog/hide-your-own-site-visits-from-cloudflare-analytics-with-javascript/</guid>
            <pubDate>Thu, 25 Aug 2022 14:32:00 GMT</pubDate>
            <description><![CDATA[Learn how to prevent counting your own site visits and skewing your figures with this simple technique.]]></description>
            <content:encoded><![CDATA[
In this short article, we'll look at how to keep your own site visits below the radar of Cloudflare Analytics so you don't skew your usage stats using JavaScript.

## Why is hiding your own visits important?

When I set up this site I wanted to test out the functionality even after [privacy-first analytics](/blog/creating-your-own-website-analytics-solution-with-aws-lambda-and-google-sheets/) had been set up. However, this would misrepresent how many users were actually visiting the site! It's really difficult to remember, "oh yeah, that was me when I tested that page a bunch of times" when viewing the usage figures. So that would be bad enough if you were doing the testing or browsing your own site as a single developer or author, but what if you were a team of 5 - 10 or more?

That would mean for each member of the team that published articles or made improvements and then viewed the page on the live site, the usage figures would go way up and be completely skewed. To solve this problem, I created a simple but effective solution by creating a private route for internal users that would disable both my custom analytics and Cloudflare Analytics by never instantiating it in the first place 😄 Effectively a route that says "don't track my visits in the usage stats" which is perfect for testing and viewing the live site.

## Disable analytics with JavaScript

In a [previous article](/blog/creating-your-own-website-analytics-solution-with-aws-lambda-and-google-sheets#bonus-avoid-tracking-your-own-activity), I covered how to 'Avoid tracking your own activity' in the bonus section. This approach set a boolean flag value in local storage when an internal user or myself hit the `/do-not-track-me` route. After this was set an internal user could go ahead and browse any pages on the site knowing they would not be adding to the usage counts. We can use a similar solution but applied to how Cloudflare Analytics is initialised.

When setting up Cloudflare Analytics you add a script to the page like:

```html
<script
  defer
  src="https://static.cloudflareinsights.com/beacon.min.js"
  data-cf-beacon='{"token": "42e216b9090ru59384ygu891dce9eecde"'
></script>
```

So as long as a user has visited the `/do-not-track-me` route first and the interim page loaded setting a value for `donottrack` as true:

```js
localStorage.setItem("donottrack", true);
window.location.href = "/";
```

We can then use a custom function to fire on page reload which checks it and disables analytics by not initialising it:

```js
initialiseCloudflareAnalytics() {
    let analyticsDeactivated = localStorage.getItem("donottrack") || false;

    if (analyticsDeactivated) {
        return;
    }

    let cloudflareScript = document.createElement("script");
    cloudflareScript.setAttribute("src", "https://static.cloudflareinsights.com/beacon.min.js");
    cloudflareScript.setAttribute("defer", true);
    cloudflareScript.setAttribute("data-cf-beacon", '{"token": "8bcfbc66e3f442149d3539d3cbfafc9b"}');
    document.body.appendChild(cloudflareScript);

    this.cloudflareScriptInitialised = true;
    console.log("Cloudflare Analytics initialised.");
},
```

If a regular user visits the site without going first through the `/donottrack` route, this flag will never be set and therefore the Cloudflare Analytics script will be appended to the document body and will work as expected.

I applied this to the mounted action in a Vue.js single page app, but you could just as easily apply this logic in any webpage using either JavaScript with `window.onload` or jQuery with `$(document).ready()`.

You might also want to provide an internal user with an option in your site to activate analytics again, you could achieve this simply by displaying a button only for users with analytics deactivated by checking the flag in local storage then firing `localStorage.removeItem("notrack");` when they click it. This will allow them to become regular users again and have their page visits logged.

## Short but sweet

I hope this article helped you to think about how you might selectively disable tracking to prevent inflating your usage figures with your own or your team's page visits and prevent headaches 😆 I think the same approach could be used with similar analytics tools such as Google Analytics too. It is a simple to implement solution, it does mean that you and your team need to remember to hit the newly added `/do-not-track` to set the cookie in local storage, but you only have to do it once. I think this tradeoff is worth it for the simplicity though, especially for small to medium sized sites and works across devices.

If you enjoyed this article be sure to check out [other articles](/) on the site. If you have any questions feel free to leave a comment 👍]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Concepts of Artificial Intelligence with Python - a review of CS50 AI]]></title>
            <link>https://shedloadofcode.com/blog/concepts-of-artificial-intelligence-with-python-a-review-of-cs50-ai/</link>
            <guid>https://shedloadofcode.com/blog/concepts-of-artificial-intelligence-with-python-a-review-of-cs50-ai/</guid>
            <pubDate>Tue, 12 Jul 2022 17:41:00 GMT</pubDate>
            <description><![CDATA[This article covers the concepts of AI in Harvard's CS50 Introduction to Artificial Intelligence with Python course, along with a review of the course itself, what I learned from it, and helpful advice if you're looking to start it yourself.]]></description>
            <content:encoded><![CDATA[
<affiliate-disclaimer></affiliate-disclaimer>

This article covers the concepts of Artificial Intelligence (AI) introduced in Harvard's [CS50 Introduction to Artificial Intelligence with Python](https://edx.sjv.io/q4oLWq) course, along with a review of the course itself, what I learned from it, and helpful advice if you're looking to start it yourself. Spoiler alert, when outlining the projects for each week I may include example code, you might want to skip over these parts if you're taking the course yourself.

## So what is CS50 AI all about?

> CS50's Introduction to Artificial Intelligence (AI) with Python explores the concepts and algorithms at the foundation of modern artificial intelligence, diving into the ideas that give rise to technologies like game-playing engines, handwriting recognition, and machine translation. Through hands-on projects, students gain exposure to the theory behind graph search algorithms, classification, optimization, reinforcement learning, and other topics in artificial intelligence and machine learning as they incorporate them into their own Python programs. By course’s end, students emerge with experience in libraries for machine learning as well as knowledge of artificial intelligence principles that enable them to design intelligent systems of their own.

The course contains seven lectures, twelve projects and seven quizzes. The lectures and projects cover key AI concepts such as search, knowledge, uncertainty, optimisation, machine learning, neural networks and natural language processing. The suggested completion time is seven weeks, at between ten to thirty hours per week. The only prerequisites for the course are either taking the [CS50 Introduction to Computer Science](https://edx.sjv.io/EKAg9W) course or prior programming experience in Python. The course is free and if you submit and receive a score of at least 70% on each of this course’s projects, you will be eligible for a [free certificate](https://cs50.harvard.edu/ai/2020/certificate/) like the one below. A nice recognition of the hard work put in to get it. 🤓 You can also choose to pay £145 (at the time of writing) to get a [verified certificate](https://edx.sjv.io/6ezQAQ) from [edX](https://edx.sjv.io/q4oLWq). This might be worthwhile if you are wanting to show to an employer or talk about in an interview. 

<article-image 
  src="https://res.cloudinary.com/dayqxxsip/image/upload/v1640097571/App%20Images/Blog%20Images/Article%20Images/CS50%20AI%20Review/certificate_khhrht.webp" 
  alt="Artificial Intelligence Technology Landscape" 
  loading="lazy" 
  styling=""
  caption="Free certificate example" 
  captionsrc="https://certificates.cs50.io/00397619-a705-4e00-9761-a578d30912e0.png" 
  :showsource="false">
</article-image>

If you've already achieved a verified certificate for [CS50 Introduction to Computer Science](https://edx.sjv.io/EKAg9W) (I completed this in 2018 and loved the course) then after completing this course in AI you in turn complete the [Professional Certificate in Computer Science for Artificial Intelligence](https://edx.sjv.io/KjEvJ7). Both of these courses combined make for a solid introduction to Computer Science. In covering programming, web development, probability, machine learning and artificial intelligence you have the foundation to enter a number of career paths including Software Engineer and Data Scientist roles. CS50 in collaboration with edX offers a few different 'pathways' as outlined below.

| Level      | Course                                                                                                                                             | Estimated Duration           | Topics                                                                                              | Languages Covered                                | Certificate                               | Final Certificate (combined with CS50)                                                                                                                                        |
|------------|----------------------------------------------------------------------------------------------------------------------------------------------------|------------------------------|-----------------------------------------------------------------------------------------------------|--------------------------------------------------|-------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Core       | [CS50's Introduction to Computer Science](https://edx.sjv.io/EKAg9W)                                 | 12 weeks 6-18 hours per week | Abstraction, Algorithms, Data Structures, Encapsulation, Software Engineering, and Web Development  | C, Python, SQL, and JavaScript plus CSS and HTML | $90 edX (What I paid, might have changed) | -                                                                                                                                                                             |
| Specialist | [CS50's Web Programming with Python and JavaScript](https://edx.sjv.io/5gzD7b)                   | 12 weeks 6-9 hours per week  | Git, Models, Migration, User Interfaces, Testing, CI/CD, Scalability, Security                      | HTML, CSS, Python, SQL, JavaScript               | $199 edX (may have now changed)                                 | [Professional Certificate in Computer Science for Web Programming](https://edx.sjv.io/q4oL2g)                |
| Specialist | [CS50's Mobile App Development with React Native](https://edx.sjv.io/jraGOZ)                       | 13 weeks 6-9 hours per week  | Components, Props, State, Views, Navigation, User Input, Performance, Shipping, Testing             | JavaScript                                       | $199 edX (may have now changed)                                 | [Professional Certificate in Computer Science for Mobile Apps](https://edx.sjv.io/eKj5AQ)                        |
| Specialist | [CS50's Introduction to Game Development](https://edx.sjv.io/oqJgyW)                                       | 12 weeks 6-9 hours per week  | 2D and 3D Graphics, Animation, Sound, Collision Detection, Unity, LOVE 2D                           | Lua, C#                                          | $199 edX (may have now changed)                                 | [Professional Certificate in Computer Science for Game Development](https://edx.sjv.io/WqEvGe)              |
| Specialist | [CS50's Introduction to Artificial Intelligence with Python](https://edx.sjv.io/q4oLWq) | 7 weeks 10-30 hours per week | Graph Search Algorithms, Knowledge Representation, Logical Inference, Probability, Machine Learning | Python                                           | $199 edX (may have now changed)                                 | [Professional Certificate in Computer Science for Artificial Intelligence](https://edx.sjv.io/KjEvJ7) |
|            |                                                                                                                                                    |                              |                                                                                                     |                                                  |                                           |                                                                                                                                                                               |

AI is the ability of a machine to display human-like capabilities such as reasoning, learning, planning and creativity. AI has completely changed the world and has the potential to continually do so. I do think however, that it can be misunderstood. I see people using the term "artificial intelligence" without realising fully what it means - particular the difference between [strong and weak AI](https://www.ibm.com/cloud/learn/strong-ai#toc-strong-ai--YaLcx8oG). The uses of AI day to day are vast, including search engines, predictive search, image recognition, games, voice assistants, email spam detection, bank fraud detection, smart devices, movie and music recommendations, chatbots, finding map directions and more. Other applications that might soon be seen more often include autonomous drones, self-driving vehicles, robots and virtual workers.

<article-image 
  src="https://res.cloudinary.com/dayqxxsip/image/upload/v1640097373/App%20Images/Blog%20Images/Article%20Images/CS50%20AI%20Review/ai-tech-landscape_ewlctj_mf8en1.webp" 
  alt="Artificial Intelligence Technology Landscape" 
  loading="lazy" 
  styling=""
  caption="Callaghan Innovation" 
  captionsrc="https://www.callaghaninnovation.govt.nz/news-and-events/ai-demystified" 
  :showsource="true">
</article-image>

<article-image 
  src="https://res.cloudinary.com/dayqxxsip/image/upload/v1640097383/App%20Images/Blog%20Images/Article%20Images/CS50%20AI%20Review/artificial-intelligence-portfolio_opdmhs.png" 
  alt="Artificial Intelligence Portfolio" 
  loading="lazy" 
  styling=""
  caption="Callaghan Innovation" 
  captionsrc="https://www.callaghaninnovation.govt.nz/news-and-events/ai-demystified" 
  :showsource="true">
</article-image>

I think the great thing about this course, is that it lifts the lid on what otherwise can be seen as a black box, to explore the concepts and algorithms that are key to implementing AI systems. It gives you the core knowledge required to build your own intelligent programs which "mimic the problem-solving and decision-making capabilities of the human mind" ([IBM](https://www.ibm.com/uk-en/cloud/learn/what-is-artificial-intelligence#toc-what-is-ar-DhYPPT4m)). Although not essential, I would recommend the book [Artificial Intelligence: A Modern Approach](https://www.amazon.co.uk/Artificial-Intelligence-Modern-Approach-Global/dp/1292401133/) as a companion to the course.

The following sections cover the core concepts covered in each lecture, and the projects completed with links to my submitted code in GitHub. If you are taking the course yourself, you should not view these solutions as it might be seen as breaking [Academic Honesty](https://cs50.harvard.edu/ai/2020/honesty/).

Okay, let's dive into the concepts covered in the course!

## Lecture 0: Search

**Concepts:**

- **Agent**: entity that perceives its environment and acts upon that environment.
- **State**: a configuration of the agent and its environment.
- **Actions**: choices that can be made in a state.
- **Transition model**: a description of what state results from performing any applicable action in any state.
- **Path cost**: numerical cost associated with a given path.
- **Evaluation function**: function that estimates the expected utility of the game from a given state.

**Algorithms:**

- [**DFS**](https://youtu.be/D5aJNFWsWew?t=1557) (depth first search): search algorithm that always expands the deepest node in the frontier.
- [**BFS**](https://www.youtube.com/watch?v=D5aJNFWsWew) (breath first search): search algorithm that always expands the shallowest node in the frontier.
- [**Greedy best-first search**](https://youtu.be/D5aJNFWsWew?t=3269): search algorithm that expands the node that is closest to the goal, as estimated by an heuristic function h(*n*).
- [**A\* search**](https://youtu.be/D5aJNFWsWew?t=3916): search algorithm that expands node with lowest value of the "cost to reach node" *g(n)* plus the "estimated goal cost" h(*n*). In other words, g(*n*) is the number of steps you had to take to get to the node you're at and the *h(n)* is the ['Manhatten distance'](https://xlinux.nist.gov/dads/HTML/manhattanDistance.html) heuristic estimate of how far a node is away from the goal. This can be expressed as *f(n) = g(n) + h(n)*.
- [**Minimax**](https://youtu.be/D5aJNFWsWew?t=4450): adversarial search algorithm.

<article-image 
  src="https://res.cloudinary.com/dayqxxsip/image/upload/v1640096660/App%20Images/Blog%20Images/Article%20Images/CS50%20AI%20Review/DFS.gif" 
  alt="Depth first search process" 
  loading="lazy" 
  styling=""
  caption="Depth first search process" 
  captionsrc="https://res.cloudinary.com/dayqxxsip/image/upload/v1640096660/App%20Images/Blog%20Images/Article%20Images/CS50%20AI%20Review/DFS.gif" 
  :showsource="false">
</article-image>

<article-image 
  src="https://res.cloudinary.com/dayqxxsip/image/upload/v1640096730/App%20Images/Blog%20Images/Article%20Images/CS50%20AI%20Review/BFS.gif" 
  alt="Breadth first search process" 
  loading="lazy" 
  styling=""
  caption="Breadth first search process" 
  captionsrc="https://res.cloudinary.com/dayqxxsip/image/upload/v1640096730/App%20Images/Blog%20Images/Article%20Images/CS50%20AI%20Review/BFS.gif" 
  :showsource="false">
</article-image>

**Data Structures**

- [**Frontier**](https://youtu.be/D5aJNFWsWew?t=993): represents all the possible nodes to search next that haven’t yet been explored
- **Stack**: last-in first-out data type used for DFS
- **Queue**: first-in first-out data type used for BFS
- [**Node**](https://youtu.be/D5aJNFWsWew?t=909): keeps track of a state, a parent (node that generated this node), an action (action applied to parent to get to node) and a path cost (from initial state to node)

**Projects**

- [**Tic-Tac-Toe**](https://cs50.harvard.edu/ai/2020/projects/0/tictactoe/) - Using [Minimax](https://en.wikipedia.org/wiki/Minimax), implement an AI to play Tic-Tac-Toe optimally. [[Solution]](https://github.com/shedloadofcode/cs50-artificial-intelligence/tree/main/0.%20Search/tictactoe)
- [**Degrees**](https://cs50.harvard.edu/ai/2020/projects/0/degrees/) - Write a program that determines how many “degrees of separation” apart two actors are. [[Solution]](https://github.com/shedloadofcode/cs50-artificial-intelligence/tree/main/0.%20Search/degrees)

```python [degrees.py]
def shortest_path(source, target):
    """
    Finds the shortest path between any two actors (source, target)
    by choosing a sequence of movies that connects them. 
    Returns the shortest list of (movie_id, person_id) pairs
    that connect the source to the target.
    If no possible path, returns None.
    """
    print(
        f"Finding shortest path between {people[source]['name']} ({source}) and {people[target]['name']} ({target})...")
    timer = time.time()

    # Start with frontier and initial node
    frontier = QueueFrontier()
    initial_node = Node(state=source, parent=None, action=None)
    frontier.add(initial_node)

    # Start with empty explored set
    explored = set()
    number_of_states_explored = 0

    while True:

        # If frontier is empty no solution
        if frontier.empty():
            return None

        # Remove a node from the frontier
        node = frontier.remove()
        number_of_states_explored += 1

        # Add the node to the explored set
        explored.add(node.state)

        # Expand node, add resulting nodes to the frontier if the aren't already
        # in the frontier or the explored set
        for movie_id, person_id in neighbors_for_person(node.state):
            if not frontier.contains_state(person_id) and person_id not in explored:
                child = Node(state=person_id, parent=node, action=movie_id)

                # If child node (neighbor) contains goal state, no need to add it to the frontier
                # instead return the solution immediately.
                if child.state == target:
                    path = []
                    node = child

                    while node.parent is not None:
                        path.append((node.action, node.state))
                        node = node.parent
                    path.reverse()

                    seconds_taken = time.time() - timer
                    print(f"Explored { number_of_states_explored } states in { seconds_taken } seconds")
                    
                    return path

                frontier.add(child)
```

There are two approaches to the order of this solution, one of them dramatically [reduces time complexity](https://youtu.be/cEnVl_xopjo?t=245).

## Submitting the first project

I started CS50 AI a while back, but other commitments got in the way. So I was really happy to dive back in. I'd already done the tictactoe project so I submitted that first (I know it was the second project, but it interested me more so I did it first 😆). The first obstacle you might hit on week 0 is "I've finished my first project... How do I submit my work?!"

I had the same question. So let's take submitting tictactoe as an example. In the main CS50 AI site in the [tictactoe project](https://cs50.harvard.edu/ai/2020/projects/0/tictactoe/) page, there is a section "Getting started" to pull the project code from. Once the project is completed, we have a section 'How to Submit' which contains a series of steps: 

* Visit [this link](https://submit.cs50.io/invites/8f7fa48876984cda98a73ba53bcf01fd), log in with your GitHub account, and click **Authorize cs50**. Then, check the box indicating that you’d like to grant course staff access to your submissions, and click **Join course**.
* [Install Git](https://git-scm.com/downloads) and, optionally, [install submit50](https://cs50.readthedocs.io/submit50/).
* If you’ve installed submit50, execute `submit50 ai50/projects/2020/x/tictactoe`
* Submit [this form](https://forms.cs50.io/4aeea18e-5aa0-4ae2-9086-5941d5556954).

I had a folder structure broken down by lecture and project:

```
0. Search
| --- degrees
| --- tictactoe
|
1. Knowledge
| --- knights
| --- minesweeper
...
```

Seems straightforward but there were a few gotchas. So here is how I stumbled through it:

* cd into project directory `Search/tictactoe`
* I tried install submit50 on Windows using `pip3 install submit50`. This is a no-no [it does not work on Windows](https://github.com/cs50/submit50/issues/196). So I launched Ubuntu (which has Python preinstalled) on a virtual machine using VirtualBox
* To install the Python packages for the project (for tictactoe it was pygame) alongside submit50 I needed to install pip using `sudo apt install python3-pip`
* I could now install submit50 using `pip3 install submit50`
* Once submit50 is installed I needed to reboot the Ubuntu virtual machine to ensure the terminal recognised it (I was getting `submit50: Command not found`)
* In the project directory I could now install all the packages using `pip3 install -r requirements.txt` - you might want to create and install packages to [a virtual environment](https://docs.python.org/3/library/venv.html) per project folder if you wish
* I was then able to run tictactoe for testing using `python3 runner.py`

These steps got me very close to my first submission. There were two more obstacles... Since I was using VS Code within Ubuntu everytime I tried to submit GitHub would open in the browser, I'd sign in but [submission would fail](https://cs50.stackexchange.com/questions/37360/using-submit50-on-vscode) when I returned to VS Code. The solution is go to File > Preferences > Settings > Extensions > GitHub and untick Git Authentication 😄

So now when using `submit50 ai50/projects/2020/x/tictactoe` to submit, the prompt for my GitHub username and password would appear within VS Code itself, much better. The final hurdle was, if you have two factor authentication turned on with GitHub, you might get this message 😧

<article-image 
  src="https://res.cloudinary.com/dayqxxsip/image/upload/v1640097428/App%20Images/Blog%20Images/Article%20Images/CS50%20AI%20Review/you-need-a-personal-access-token_lythss_dibtuj.png" 
  alt="Personal access token required error" 
  loading="lazy" 
  styling=""
  caption="Personal access token required error message" 
  captionsrc="https://res.cloudinary.com/dayqxxsip/image/upload/v1640097428/App%20Images/Blog%20Images/Article%20Images/CS50%20AI%20Review/you-need-a-personal-access-token_lythss_dibtuj.png" 
  :showsource="false">
</article-image>

The link provided in the error message https://cs50.ly/github-2fa has all the steps for creating a personal access token. Once you have it, re-submit and use that token at the password prompt. Now using `submit50 ai50/projects/2020/x/tictactoe` again, the submission for tictactoe was successfully uploaded!

<article-image 
  src="https://res.cloudinary.com/dayqxxsip/image/upload/v1640097425/App%20Images/Blog%20Images/Article%20Images/CS50%20AI%20Review/first-submission-success_h0ebh9_wdoco5.png" 
  alt="First successful submission" 
  loading="lazy" 
  styling=""
  caption="First successful submission" 
  captionsrc="https://res.cloudinary.com/dayqxxsip/image/upload/v1640097425/App%20Images/Blog%20Images/Article%20Images/CS50%20AI%20Review/first-submission-success_h0ebh9_wdoco5.png" 
  :showsource="false">
</article-image>

Hopefully this should serve as a good example of how to submit tictactoe, and you can now use the same method for submitting each of the other projects. You might find a much easier way to do this, I'm sure you could use Windows Subsystem for Linux instead, but this worked nicely for me even if there were a few headaches to overcome.

## Lecture 1: Knowledge

**Concepts**

- [**Sentence**](https://youtu.be/LucW-p6zC5c?t=104): an assertion about the world in a knowledge representation language.
- [**Knowledge base**](https://youtu.be/LucW-p6zC5c?t=975): a set of sentences known by a knowledge-based agent.
- [**Entailment**](https://youtu.be/LucW-p6zC5c?t=1022): _a_ entails _b_ if in every model in which sentence _a_ is true, sentence _b_ is also true.
- [**Inference**](https://youtu.be/LucW-p6zC5c?t=1308): the process of deriving new sentences from old ones.
- [**Conjunctive normal form**](https://youtu.be/LucW-p6zC5c?t=4985): logical sentence that is a conjunction of clauses.
- [**First order logic**](https://youtu.be/LucW-p6zC5c?t=5910): Propositional logic.
- **Second order logic**: Proposition logic with universal and existential quantification.
- **Truth table**: table showing the outputs for all possible combinations of inputs to a logic gate or circuit.

**Algorithms** 

- **Model checking**: enumerate all possible models and see if a proposition is true in every one of them.
- **Conversion to CNF** and **Inference by resolution**

**Projects**

- [**Knights**](https://cs50.harvard.edu/ai/2020/projects/1/knights/) - Write a program to solve logic puzzles [[Solution]](https://github.com/shedloadofcode/cs50-artificial-intelligence/tree/main/1.%20Knowledge/knights)
- [**Minesweeper**](https://cs50.harvard.edu/ai/2020/projects/1/minesweeper/) - Write an AI to play Minesweeper [[Solution]](https://github.com/shedloadofcode/cs50-artificial-intelligence/tree/main/1.%20Knowledge/minesweeper)

I think the [Minesweeper project](https://cs50.harvard.edu/ai/2020/projects/1/minesweeper/) was one of my favourite! The general logic of the AI was adding sentences to it's knowledge base where a sentence consisted of a set of board cells,
and a count of the number of those cells which are mines, so something like `Sentence({(0, 1), (1, 0), (1, 1)}, 3)`. This says out of cells `{(0, 1), (1, 0), (1, 1)}` exactly 3 of them are mines. We can then infer they must all be mines as the number of cells is equal to the count! On every move the following process was executed:

1. Mark the cell as a move that has been made
2. Mark the cell as safe
3. Get the neighbours of the current cell
3. Add a new sentence to the AI's knowledge base based on the cell's neighbours and count (of adjacent mines)
4. Mark any additional cells as safe or as mines if it can be concluded based on the AI's knowledge base
5. Add any new sentences to the AI's knowledge base if they can be inferred from existing knowledge

This meant that as sentences are added to the knowledge base the AI can make yet more inferences. 

<article-image 
  src="https://res.cloudinary.com/dayqxxsip/image/upload/v1640097397/App%20Images/Blog%20Images/Article%20Images/CS50%20AI%20Review/example-minesweeper-board_zw7ym4_oi1dwh.png" 
  alt="First successful submission" 
  loading="lazy" 
  styling=""
  caption="cs50.harvard.edu/ai/2020/projects/1/minesweeper/" 
  captionsrc="https://cs50.harvard.edu/ai/2020/projects/1/minesweeper/" 
  :showsource="true">
</article-image>

Given this board, we can see there is one mine next to the top row cells and two mines next to the bottom middle cell. The top row's sentence would be `{A, B, C} = 1`. the bottom middle's sentence would be `{A, B, C, D, E} = 2`. Now we have two sentences where the first sentence's set of cells are a subset of the second sentence's set of cells. We can now construct a new sentence by doing set2 - set1 = count2 - count 1 which is `{D, E} = 1`. If two of A, B, C, D, and E are mines, and only one of A, B, and C are mines, then it stands to reason that exactly one of D and E must be the other mine.

Here is a demo of the Minesweeper AI in action!

<article-video 
  id="XsZakY_sVMo" 
  title="CS50 AI Minesweeper Project Demo">
</article-video>

## Lecture 2: Uncertainty

When the answer isn't certain, we can use probability based methods to assess the knowledge available, to then make decisions.

**Concepts**

- **Unconditional probability**: degree of belief in a proposition in the absence of any other evidence.
- [**Conditional probability**](https://youtu.be/uQmYZTTqDC0?t=577): degree of belief in a proposition given some evidence that has already been revealed.
- [**Possible worlds**](https://youtu.be/uQmYZTTqDC0?t=170): every possible outcome for a given series or combination of events
- [**Random variable**](https://youtu.be/uQmYZTTqDC0?t=1040): a variable in probability theory with a domain of possible values it can take on.
- [**Independence**](https://youtu.be/uQmYZTTqDC0?t=1316): the knowledge that one event occurs does not affect the probability of the other event.
- [**Bayes' Rule**](https://youtu.be/uQmYZTTqDC0?t=1608): _P(a) P(b|a) = P(b) P(a|b)_
- [**Bayesian network**](https://youtu.be/uQmYZTTqDC0?t=2982): data structure that represents the dependencies among random variables.
- [**Markov assumption**](https://youtu.be/uQmYZTTqDC0?t=5580): the assumption that the current state depends on only a finite fixed number of previous states.
- **Markov chain**: a sequence of random variables where the distribution of each variable follows the Markov
 assumption.
- [**Hidden Markov Model**](https://youtu.be/uQmYZTTqDC0?t=6257): a Markov model for a system with hidden states that generate some observed event.

**Algorithms**

- **Inference by enumeration**
- **Sampling**
- **Likelihood weighting**

**Projects**

- [**Heredity**](https://cs50.harvard.edu/ai/2020/projects/2/heredity/) - Write an AI to assess the likelihood that a person will have a particular genetic trait. [[Solution]](https://github.com/shedloadofcode/cs50-artificial-intelligence/tree/main/2.%20Uncertainty/heredity)
- [**PageRank**](https://cs50.harvard.edu/ai/2020/projects/2/pagerank/) - Write an AI to rank web pages by importance. [[Solution]](https://github.com/shedloadofcode/cs50-artificial-intelligence/tree/main/2.%20Uncertainty/pagerank)

Having worked as a Data Scientist and Statistician, I like the following questions for starting to think about probability. Firstly, if you have two fair dice, what is the probablity of rolling a 12 (6 and 6)? The answer is 1 in 36 or a 2.778% chance because we can see out of all the 36 possible words (the possible combinations of dice throws) only one satisfies the requirement of rolling a 12.

<article-image 
  src="https://res.cloudinary.com/dayqxxsip/image/upload/v1640097394/App%20Images/Blog%20Images/Article%20Images/CS50%20AI%20Review/dice-probability-of-twelve_h1qtuc_tcvbjh.webp" 
  alt="Dice probability table" 
  loading="lazy" 
  styling=""
  caption="Dice roll possible worlds" 
  captionsrc="https://res.cloudinary.com/dayqxxsip/image/upload/v1640097394/App%20Images/Blog%20Images/Article%20Images/CS50%20AI%20Review/dice-probability-of-twelve_h1qtuc_tcvbjh.webp" 
  :showsource="false">
</article-image>

I read in the book [The Art of Statistics: Learning from Data](https://www.amazon.co.uk/Art-Statistics-Learning-Pelican-Books/dp/0241398630) about how in 2012, 97 Members of Parliament were asked 'If you spin a coin twice, what is the probablity of getting two heads?' 60 out of 97 of them couldn't give the correct answer. The answer is 1 in 4 or a 25% chance because we can see out of the 4 possible outcomes only one satisfies the requirement of flipping two heads.

<article-image 
  src="https://res.cloudinary.com/dayqxxsip/image/upload/v1640097386/App%20Images/Blog%20Images/Article%20Images/CS50%20AI%20Review/coin-toss-two-heads_ee9t75_wfdfyy.png" 
  alt="Coin toss possible worlds" 
  loading="lazy" 
  styling=""
  caption="Coin toss possible worlds" 
  captionsrc="https://res.cloudinary.com/dayqxxsip/image/upload/v1640097386/App%20Images/Blog%20Images/Article%20Images/CS50%20AI%20Review/coin-toss-two-heads_ee9t75_wfdfyy.png" 
  :showsource="false">
</article-image>

Another favourite of mine that seemingly breaks the laws of probablity is the [Monty Hall Problem](https://www.youtube.com/watch?v=4Lb-6rxZxx0). There is a [follow up explanation](https://www.youtube.com/watch?v=7u6kFlWZOWg) for this and an excellent comment on this video from Rundvelt showing the importance of looking at 'possible worlds':

> I think that if you drew out all the possibilities that would demonstrate the fact better. For example.
> 
> Scenario 1:
> Car / Goat / Goat
> 
> Scenario 2:
> Goat / Car / Goat
> 
> Scenario 3:
> Goat / Goat / Car
> 
> Let's say you pick the door on the left and do not switch.
>
> Scenario 1:   Win
>
> Scenario 2:  Lose
>
> Scenario 3:  Lose
> 
> Let's say you pick the door on the left and switch doors.
>
> Scenario 1:  Lose
>
> Scenario 2:  Win
>
> Scenario 3:  Win.
> 
> Not Switching = 1 win out of 3.
>
> Switching = 2 wins out of 3.

To learn more about statistics and probability, I recommend the book [Practical Statistics for Data Scientists](https://www.amazon.co.uk/Practical-Statistics-Data-Scientists-Essential-dp-149207294X/dp/149207294X/ref=dp_ob_title_bk) - I love using this as a reference book.

<subscribe-form></subscribe-form>

## Lecture 3: Optimisation

Optimisation can be summarised as choosing the best option from a set of options.

**Concepts**
 
- [**Local search**](https://youtu.be/TA5ZJm1ZYS4?t=104): search algorithm that maintain a single node and searches by moving to a neighbouring node, but is not concered about finding the path, just the optimal solution.
- [**State-space landscape**](https://youtu.be/TA5ZJm1ZYS4?t=252): the different configuations of possible worlds and their cost value. 
- **Objective function**: function to find the global maximum from the state space landscape.
- **Cost function**: function to find the global minimum from the state space landscape.
- **Neighbouring state**: a state that is close to the current state, but slightly different to compare objective or cost function value.

**Algorithms**

- [**Hill Climbing**](https://youtu.be/TA5ZJm1ZYS4?t=450): start at a given state, then consider the neighbours of that state and pick the highest or lowest.
    - [**steepest-ascent**](https://youtu.be/TA5ZJm1ZYS4?t=1271): choose the highest-valued neighbour.
    - **stochastic**: choose randomly from higher-valued neighbours.
    - **first-choice**: choose the first higher-valued neighbour.
    - [**random-restart**](https://youtu.be/TA5ZJm1ZYS4?t=1604): conduct hill climbing multiple times.
    - **local beam search**: chooses the _k_ highest-valued neighbours.
- [**Simulated Annealing**](https://youtu.be/TA5ZJm1ZYS4?t=1750): early on, more likely to accept worse-valued neighbours than the current state.
- [**Linear Programming**](https://youtu.be/TA5ZJm1ZYS4?t=2445): a method to achieve the best outcome (such as maximum profit or lowest cost) in a mathematical model whose requirements are represented by linear relationships.
    - [**Simplex**](https://en.wikipedia.org/wiki/Simplex_algorithm)
    - [**Interior Point**](https://en.wikipedia.org/wiki/Interior-point_method)
- [**Constraint Satisfaction problems**](https://youtu.be/TA5ZJm1ZYS4?t=3061): problems where the state has constraints or limiations.
    - [**Node Consistency**](https://youtu.be/TA5ZJm1ZYS4?t=3549): when all the values in a variable's domain satisfy the variable's unary constraints.
    - [**Arc Consistency**](https://youtu.be/TA5ZJm1ZYS4?t=3789): when all the values in a variable's domain satisfy the variable's binary constraints.
[**Backtracking Search**](https://youtu.be/TA5ZJm1ZYS4?t=4619): a search algorithm to solve a constraint satisfcation problem that incrementally builds candidates as the solution, but abandons a candidate ('backtracks') as soon as it finds the candidate cannot possibly be a valid solution.

<article-image 
  src="https://res.cloudinary.com/dayqxxsip/image/upload/v1640097406/App%20Images/Blog%20Images/Article%20Images/CS50%20AI%20Review/optimisation-problem-formulation_xpry8n_zwujlg.webp" 
  alt="Optimisation problem formulation" 
  loading="lazy" 
  styling=""
  caption="Optimisation problem formulation" 
  captionsrc="https://res.cloudinary.com/dayqxxsip/image/upload/v1640097406/App%20Images/Blog%20Images/Article%20Images/CS50%20AI%20Review/optimisation-problem-formulation_xpry8n_zwujlg.webp" 
  :showsource="false">
</article-image>

**Projects**
- [**Crossword**](https://cs50.harvard.edu/ai/2020/projects/3/crossword/) - Write an AI to generate crossword puzzles. [[Solution]](https://github.com/shedloadofcode/cs50-artificial-intelligence/tree/main/3.%20Optimisation/crossword)
  
## Lecture 4: Learning

Machine learning models focus on finding and learning from patterns in existing data, then use those patterns to predict new outcomes with a high degree of accuracy. Although accuracy is important it's also essential to build [explainable models / explainable AI (XAI)](/blog/understanding-explainable-ai-for-classification-regression-and-clustering-with-python/) so subjects, stakeholders and businesses can understand them and have more confidence in them.

**Concepts**

- [**Supervised learning**](https://youtu.be/E4M_IQG0d9g?t=75): given a data set of input-output pairs, learn a function to map inputs to outputs.
    - [**Classification**](https://youtu.be/E4M_IQG0d9g?t=493): supervised learning task of learning a function mapping an input point to a discrete category.
    - [**Regression**](https://youtu.be/E4M_IQG0d9g?t=2371): supervised learning task of learning a function mapping and input point to a continuous value.
    - [**Loss function**](https://youtu.be/E4M_IQG0d9g?t=2564): function that express how poorly our hypothesis performs (L1, L2).
    - [**Overfitting**](https://youtu.be/E4M_IQG0d9g?t=2974): when a model fits too closely to a particular data set and therefore may fail to generalize to
     future data.
    - [**Regularization**](https://youtu.be/E4M_IQG0d9g?t=3347): penalizing hypotheses that are more complex to favor simpler, more general hypotheses.
    - [**Holdout cross-validation**](https://youtu.be/E4M_IQG0d9g?t=3403): splitting data into a training set and a test set, such that learning happens on the
     training set and is evaluated on the test set.
    - [**k-fold cross-validation**](https://youtu.be/E4M_IQG0d9g?t=3497): splitting data into _k_ sets, and experimenting _k_ times, using each set as a test
     set once, and using remaining data as training set.
- [**Reinforcement learning**](https://youtu.be/E4M_IQG0d9g?t=4198): given a set of rewards or punishments, learn what actions to take in the future.
- [**Unsupervised learning**](https://youtu.be/E4M_IQG0d9g?t=5935): given input data without any additional feedback, learn patterns.
- [**Clustering**](https://youtu.be/E4M_IQG0d9g?t=6019): organizing a set of objects into groups in such a way that similar objects tend to be in the same
 group.

**Algorithms**

- [**k-nearest-neighbor classification**](https://youtu.be/E4M_IQG0d9g?t=491): given an input, chooses the most common class out of the _k_ nearest data
 points to that input.
- [**Support Vector Machines (SVM)**](https://youtu.be/E4M_IQG0d9g?t=2001): algorithm which creates a line or a hyperplane which separates the data into classes.
- **Markov decision process**: model for decision-making, representing states, actions and their rewards.
- **Q-learning**: method for learning a function _Q_(s, a), estimate of the value of performing action _a_ in state _s_.
- **Greedy decision-making**
- **epsilon-greedy**
- **k-means clustering**: clustering data based on repeatedly assigning points to clusters and updating those
 clusters' centers.

**Basic template for building a machine learning classifier model**

 ```python [ml-scaffold.py]
  import pandas as pd
  import numpy as np
  from sklearn.svm import SVC
  from sklearn.linear_model
  from sklearn.naive_bayes import GaussianNB
  from sklearn.neighbors import KNeighborsClassifier
  from sklearn.model_selection import train_test_split

  model = KNeighborsClassifier()
  data = pd.read_csv("filepath goes here.csv")

  target = data['ColumnName'].values
  features = data['ColumnNameA', 'ColumnNameB', 'ColumnNameC']

  X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=0.3
  )

  model.fit(X_train, y_train)

  predictions = model.predict(X_test)

  correct = (y_test == predictions).sum()
  incorrect = (y_testing != predictions).sum()
  total = len(predictions)

  print(f"Results for model {type(model).__name__}")
  print(f"Correct: {correct}")
  print(f"Incorrect: {incorrect}")
  print(f"Accuracy: {100 * correct / total:.2f}%")
 ```

 **Packages**

- [pandas](https://pandas.pydata.org/): fast, powerful, flexible and easy to use open source data analysis and manipulation tool,built on top of the Python programming language.
- [scikit-learn](https://scikit-learn.org/stable/): Machine learning and predictive analysis package built on NumPy, SciPy, and matplotlib. [[Lecture]](https://youtu.be/E4M_IQG0d9g?t=3582)

**Resources**
- [Google's Machine Learning Glossary](https://developers.google.com/machine-learning/glossary)

**Projects**

- [Shopping](https://cs50.harvard.edu/ai/2020/projects/4/shopping/) - Write an AI to predict whether online shopping customers will complete a purchase. [[Solution]](https://github.com/shedloadofcode/cs50-artificial-intelligence/tree/main/4.%20Learning/shopping)
- [Nim](https://cs50.harvard.edu/ai/2020/projects/4/nim/) - Write an AI that teaches itself to play Nim through reinforcement learning. [[Solution]](https://github.com/shedloadofcode/cs50-artificial-intelligence/tree/main/4.%20Learning/nim)

## Lecture 5: Neural Networks

An artificial nerual network is a mathematical model for learning inspired by biological neural networks.

**Concepts**

- [**Multilayer neural network**](https://youtu.be/mFZazxxCKbw?t=1800): artificial neural network with an input layer, an output layer, and at least one hidden layer.
- [**Deep neural network**](https://youtu.be/mFZazxxCKbw?t=1833): neural network with multiple hidden layer.
- [**Dropout**](https://youtu.be/mFZazxxCKbw?t=2238): temporarily removing units - selected at random - from a neural network to prevent over-reliance on certain units.
- [**Computer vision**](https://youtu.be/mFZazxxCKbw?t=3185): computational methods for analysing and understanding digital images.
- [**Image convolution**](https://youtu.be/mFZazxxCKbw?t=3490): applying a filter that adds each pixel value of an image to its neighbours, weighted according to a kernel matrix.
- [**Pooling**](https://youtu.be/mFZazxxCKbw?t=3988): reducing the size of an input by sampling from regions in the input.
- [**Convolutional neural network**](https://youtu.be/mFZazxxCKbw?t=4098): neural networks that use convolution, usually for analyzing images.
- [**Recurrent neural network**](https://youtu.be/mFZazxxCKbw?t=5223): neural network that generates output that feeds back into its own inputs.

**Algorithms**

- **Gradient descent**: algorithm for minimizing loss when training neural network.
- **Backpropagation**: algorithm for training neural networks with hidden layers.

 **Packages**

- [tensorflow](https://www.tensorflow.org/learn): An open source software library for high performance numerical computation. It comes with strong support for machine learning and deep learning and the flexible numerical computation core is used across many other scientific domains. See also [The Sequential model with Tensorflow Keras](https://www.tensorflow.org/guide/keras/sequential_model).
- [scikit-learn](https://scikit-learn.org/stable/): A machine learning and predictive analysis package built on NumPy, SciPy, and matplotlib.
- [opencv-python](https://pypi.org/project/opencv-python/): A library of Python bindings designed to solve computer vision problems. See [docs](https://docs.opencv.org/4.5.2/d2/d96/tutorial_py_table_of_contents_imgproc.html).

**Projects**

- [Traffic](https://cs50.harvard.edu/ai/2020/projects/5/traffic/) - Write an AI to identify which traffic sign appears in a photograph. [[Solution]](https://github.com/shedloadofcode/cs50-artificial-intelligence/tree/main/5.%20Neural%20Networks)

After downloading the distribution code, install the Python packages from the requirements file, I ran `python3 traffic.py gtsrb` as a test and received an `Illegal instruction (core dumped)` error message. I was running this on an Ubuntu Linux VM using VirtualBox. The [fix for this](https://github.com/tensorflow/tensorflow/issues/17411) was to re-install an earlier version of the tensorflow package:

```
pip3 uninstall tensorflow
pip3 install tensorflow==1.5
```

After this I ran `python3 traffic.py gtsrb` again and arrived at the line 62 not implemented error in the `load_data` function, as expected, `File "traffic.py", line 62, in load_data raise NotImplementedError`. Hope this helps you out if you find yourself getting the same error message!

Here is a demo of the Convolutional Neural Network model used for the Traffic project in action!

<article-video 
  id="cOcL6cPJ_a8" 
  title="CS50 AI Traffic Project Demo">
</article-video>

## Lecture 6: Language

Natural Language Processing or NLP aims to understand human language, both written and spoken to extract information.

**Concepts**

- [**n-gram**](https://youtu.be/_hAVVULrZ0Q?t=1681): a contiguous sequence of _n_ items inside of a text.
- [**Tokenization**](https://youtu.be/_hAVVULrZ0Q?t=1836): the task of splitting a sequence of characters into pieces (tokens).
- **Text Categorization**
    - [**Bag-of-words model**](https://youtu.be/_hAVVULrZ0Q?t=2561): represent text as an unordered collection of words.
- **Information retrieval**: the task of finding relevant documents in response to a user query.
    - [**Topic modeling**](https://youtu.be/_hAVVULrZ0Q?t=4199): models for discovering the topics for a set of documents.
    - [**Term frequency**](https://youtu.be/_hAVVULrZ0Q?t=4253): number of times a term appears in a document.
        - [**Function words**](https://youtu.be/_hAVVULrZ0Q?t=4456): words that have little meaning on their own, but are used to grammatically connect other words.
        - [**Content words**](https://youtu.be/_hAVVULrZ0Q?t=4492): words that carry meaning independently.
    - [**Inverse document frequency**](https://youtu.be/_hAVVULrZ0Q?t=4643): measure of how common or rare a word is across documents. Formula is *log(total_documents / number_of_documents_containing(word))*
- [**Information extraction**](https://youtu.be/_hAVVULrZ0Q?t=4873): the task of extracting knowledge from documents.
- [**WordNet**](https://youtu.be/_hAVVULrZ0Q?t=5413): a lexical database of semantic relations between words.
- [**Word representation**](https://youtu.be/_hAVVULrZ0Q?t=5537): looking for a way to represent the meaning of a word for further processing.
    - [**one-hot**](https://youtu.be/_hAVVULrZ0Q?t=5636): representation of meaning as a vector with a single 1, and with other values as 0.
    - [**distribution**](https://youtu.be/_hAVVULrZ0Q?t=5768): representation of meaning distributed across multiple values.

**Algorithms**

- [**Markov model applied to language**](https://youtu.be/_hAVVULrZ0Q?t=2281): generating the next word based on the previous words and a probability.
- [**Naive Bayes**](https://youtu.be/_hAVVULrZ0Q?t=2806): based on the Bayes' Rule to calculate probability of a text being in a certain category, given it contains specific words. Assuming every word is independent of each other.
  - [**Additive smoothing**](https://youtu.be/_hAVVULrZ0Q?t=3743): adding a value _a_ to each value in our distribution to smooth the data.
  - [**Laplace smoothing**](https://youtu.be/_hAVVULrZ0Q?t=3753): adding 1 to each value in our distribution (pretending we've seen each value one more time than we actually have).
- [**tf-idf**](https://youtu.be/_hAVVULrZ0Q?t=4703): ranking of what words are important in a document by multiplying term frequency (TF) by inverse document frequency (IDF).
- **Automated template generation**: giving AI some terms and let it look into a corpus for patterns where those terms show up together. Then it can use those templates to extract new knowledge from the corpus.
- [**word2vec**](https://youtu.be/_hAVVULrZ0Q?t=5873): model for generating word vectors.
- **skip-gram architecture**: neural network architecture for predicting context words given a target word.

**Packages**

- [**NLTK**](https://www.nltk.org/): Natural language toolkit or NLTK is a package for working iwth human language data. [Lecture](https://youtu.be/_hAVVULrZ0Q?t=1236])

**Projects**

- [Parser](https://cs50.harvard.edu/ai/2020/projects/6/parser/) - Write an AI to parse sentences and extract noun phrases. [[Solution]](https://github.com/shedloadofcode/cs50-artificial-intelligence/tree/main/6.%20Language/parser)
- [Questions](https://cs50.harvard.edu/ai/2020/projects/6/questions/) - Write an AI to answer questions. [[Solution]](https://github.com/shedloadofcode/cs50-artificial-intelligence/tree/main/6.%20Language/questions)

## Reflections on the course

Overall I found the course challenging yet extremely informative on the concepts and implementations of AI. It had the right balance between abstract concepts and concrete solutions in Python. I'm now much more aware of and always on the look out for applying these AI concepts to problems, or whether a problem can be framed as one of them. I think just knowing how to solve certain kinds of problems is half the battle, the other half is shaping the problem into a workable solution. To do that you need solid robust data, and a clear vision for the 'world' in which the AI agent will operate.

We have already seen widespread use of AI and this can only increase in the coming decades. I think having an understanding of the fundamentals and building your own small AI solutions is essential, especially for Software Engineers and Data Scientists. The main aim in building intelligent systems for me, is to enable the autonomous agents that operate within their 'world' to carry out tasks and make decisions at or above the accuracy a human domain expert could, but faster and more reliably. To achieve that, it may involve a combination of machine learning, statistics, software engineering, system architecture and data engineering skills, plus business domain knowledge. As shown in the below image, there is generally an overlap between roles and skills, but in my opinion all of of these skills have a benefit to any digital or data role.

<article-image 
  src="https://res.cloudinary.com/dayqxxsip/image/upload/v1640097388/App%20Images/Blog%20Images/Article%20Images/CS50%20AI%20Review/data-science-ai-skillset_h4jqcs_y0pgaa.png" 
  alt="Dice probability table" 
  loading="lazy" 
  styling=""
  caption="Insight" 
  captionsrc="https://blog.insightdatascience.com/how-emerging-ai-roles-fit-in-the-data-landscape-d4cd922c389b" 
  :showsource="true">
</article-image>

The concepts and skills learnt in this course certainly help to get you started on your journey to engineering intelligent, autonomous systems and your own AI programs that can help make other people's lives better. The certification is optional, however I opted to purchase it and have talked about it and the skills gained from it within interviews. I think it demonstrates a commitment to continuing professional development, an attitude of continuous learning and an accolade you can be proud of upon finishing the course. 

As always, if you enjoyed this article be sure to check out [other articles](/) on the site including [Developing your data science and analytical coding skills - a review of DataCamp](/blog/developing-your-data-science-and-analytical-coding-skills-a-review-of-datacamp/) 😄]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[How to upload PDF files to Azure Blob Storage with Vue and Python Flask]]></title>
            <link>https://shedloadofcode.com/blog/how-to-upload-pdf-files-to-azure-blob-storage-with-vue-and-python-flask/</link>
            <guid>https://shedloadofcode.com/blog/how-to-upload-pdf-files-to-azure-blob-storage-with-vue-and-python-flask/</guid>
            <pubDate>Thu, 24 Mar 2022 11:15:00 GMT</pubDate>
            <description><![CDATA[In this article we'll be building a simple application to upload, view and download PDF files using Azure Blob Storage, Python Flask and Vue.]]></description>
            <content:encoded><![CDATA[
In this article we’ll take a look at uploading PDF files to Azure Blob Storage using Vue and Python Flask. This is a common use case I’ve come across for document storage. Although we’ll be uploading PDFs in this article, the same approach can be used for files of any kind.

## Getting started

We’re going to use the same Vue Flask template I used from another article [How to query a database with Python Flask and download data to CSV or XLSX in Vue](/blog/query-sql-and-download-csv-and-xlsx-in-flask/). The template is in this public [GitHub repository](https://github.com/gtalarico/flask-vuejs-template) from gtalarico.

Once you’ve cloned or downloaded the repo, setup a virtual environment with pipenv and install the packages that will be needed below.

```
cd flask-vuejs-template-master
python -m pip install pipenv
python -m pipenv install --dev
python -m pipenv install flask-restx azure-storage-blob
python -m pipenv shell
```

Since this template uses [Flask-RESTPlus](https://flask-restplus.readthedocs.io/en/stable/) and we're using [Flask-RESTX](https://flask-restx.readthedocs.io/en/latest/) which is a community driven fork of it, go ahead and replace all references to 'flask_restplus' with 'flask-restx'.

<article-image 
  src="https://res.cloudinary.com/dayqxxsip/image/upload/v1647517803/App%20Images/Blog%20Images/Article%20Images/Upload%20PDF%20to%20Azure%20Blob%20Storage/replace-flask-restplus_qtkxgp.png" 
  alt="Application running at localhost" 
  loading="lazy" 
  styling=""
  caption="Replace references to flask_restplus with flask_restx" 
  captionsrc="https://res.cloudinary.com/dayqxxsip/image/upload/v1647517803/App%20Images/Blog%20Images/Article%20Images/Upload%20PDF%20to%20Azure%20Blob%20Storage/replace-flask-restplus_qtkxgp.png" 
  :showsource="false">
</article-image>

We'll be using the [azure-storage-blob](https://pypi.org/project/azure-storage-blob/) package which has a [quickstart guide](https://docs.microsoft.com/en-us/azure/storage/blobs/storage-quickstart-blobs-python#run-the-code) from Microsoft. There are similar packages available for other languages too, including Java, C# and .NET, JavaScript, C++, Go and more.

Now that the Python packages are installed, let's install and upgrade the Vue dependencies with Yarn, and build the Vue dist directory.

```
yarn install --dev
yarn upgrade
yarn build
```

If everything went smoothly, you should be able to run both the backend and frontend dev servers. Run `python run.py` and from another terminal window in the same directory run `yarn serve`. You should see the app running at http://localhost:8080/ 👍

<article-image 
  src="https://res.cloudinary.com/dayqxxsip/image/upload/v1646329432/App%20Images/Blog%20Images/Article%20Images/Upload%20PDF%20to%20Azure%20Blob%20Storage/application-running_qtafho.png" 
  alt="Application running at localhost" 
  loading="lazy" 
  styling=""
  caption="Application running at localhost:8080" 
  captionsrc="https://res.cloudinary.com/dayqxxsip/image/upload/v1646329432/App%20Images/Blog%20Images/Article%20Images/Upload%20PDF%20to%20Azure%20Blob%20Storage/application-running_qtafho.png" 
  :showsource="false">
</article-image>

## Set up Azure Blob Storage container

Beginning with the end in mind, we'll need a place to store files in Azure. So the first job is to setup a Storage Account for that in Azure. This [article](https://docs.microsoft.com/en-us/azure/storage/common/storage-account-create?tabs=azure-portal) from Microsoft goes over the process of setting one up. You head to the [Azure portal](https://portal.azure.com/#create/hub) and search for "storage account" then hit **Create**.

<article-image 
  src="https://res.cloudinary.com/dayqxxsip/image/upload/v1646329432/App%20Images/Blog%20Images/Article%20Images/Upload%20PDF%20to%20Azure%20Blob%20Storage/storage-account-in-azure_v3cxcf.png" 
  alt="Azure portal" 
  loading="lazy" 
  styling=""
  caption="Head to Azure portal and create a Storage account" 
  captionsrc="https://res.cloudinary.com/dayqxxsip/image/upload/v1646329432/App%20Images/Blog%20Images/Article%20Images/Upload%20PDF%20to%20Azure%20Blob%20Storage/storage-account-in-azure_v3cxcf.png" 
  :showsource="false">
</article-image>

<article-image 
  src="https://res.cloudinary.com/dayqxxsip/image/upload/v1646329432/App%20Images/Blog%20Images/Article%20Images/Upload%20PDF%20to%20Azure%20Blob%20Storage/create-storage-account-wizard_pqr7kk.png" 
  alt="Creating a storage account in Azure" 
  loading="lazy" 
  styling=""
  caption="Follow the steps to create a storage account in Azure" 
  captionsrc="https://res.cloudinary.com/dayqxxsip/image/upload/v1646329432/App%20Images/Blog%20Images/Article%20Images/Upload%20PDF%20to%20Azure%20Blob%20Storage/create-storage-account-wizard_pqr7kk.png" 
  :showsource="false">
</article-image>

Make sure you delete this Storage Account resource after testing as you may incur costs if you don't. If in any doubt always [check the pricing calculator](https://azure.microsoft.com/en-gb/pricing/calculator/) or [Azure Blob Storage pricing page](https://azure.microsoft.com/en-gb/pricing/details/storage/blobs/) from Microsoft.

Once finished deploying you'll need to create a container and grab the credentials as we'll need them later on! 

To create a container, go to the Storage Account resource and hit the add container button. We'll use this new container to store the PDF files.

<article-image 
  src="https://res.cloudinary.com/dayqxxsip/image/upload/v1646329432/App%20Images/Blog%20Images/Article%20Images/Upload%20PDF%20to%20Azure%20Blob%20Storage/create-a-container_lwagzo.png" 
  alt="Creating a container in the storage account in Azure" 
  loading="lazy" 
  styling=""
  caption="Create a container in the storage account in Azure" 
  captionsrc="https://res.cloudinary.com/dayqxxsip/image/upload/v1646329432/App%20Images/Blog%20Images/Article%20Images/Upload%20PDF%20to%20Azure%20Blob%20Storage/create-a-container_lwagzo.png" 
  :showsource="false">
</article-image>

Then head to **Access keys** under **Security + networking** and hit **Show keys**. Copy the storage account name, the keys and connection strings. You should only need the **Connection string** under **key1** to connect with the Python SDK though. Never hurts to have backup credentials.

<article-image 
  src="https://res.cloudinary.com/dayqxxsip/image/upload/v1646329432/App%20Images/Blog%20Images/Article%20Images/Upload%20PDF%20to%20Azure%20Blob%20Storage/get-storage-account-keys_rfpm1h.png" 
  alt="Getting the credentials for the storage account" 
  loading="lazy" 
  styling=""
  caption="Get the credentials for the storage account" 
  captionsrc="https://res.cloudinary.com/dayqxxsip/image/upload/v1646329432/App%20Images/Blog%20Images/Article%20Images/Upload%20PDF%20to%20Azure%20Blob%20Storage/get-storage-account-keys_rfpm1h.png" 
  :showsource="false">
</article-image>

## Create form to upload file in Vue

Inside src/views/Home.vue we first want a very basic outline of our file upload form. Substitute the template tags for this new template creating the form.

```html [src/views/Home.vue]
<template>
  <div class="container">
    <div>
      <label>File
        <input type="file" ref="fileInput" accept="image/*,.pdf" @change="handleFileUpload($event)"/>
      </label>
      <br/>
      <br/>
      <button v-on:click="submitFile()">Submit</button>
    </div>
  </div>
</template>
``` 

We then need to implement the `handleFileUpload` and `submitFile` methods. The first will allow us to **stage** a file, and the second will allow us to **submit** and send that file with Axios to the Python API endpoint at /api/upload we'll create later.

```js [src/views/Home.vue]
<script>
/* eslint-disable */
import axios from 'axios'

export default {
  data () {
    return {
      file: null
    }
  },
  methods: {
    /**
     * Uploads file to server.
     * @param {Event} event The form change event with the file to be uploaded.
     */
    handleFileUpload(event) {
      this.file = event.target.files[0];
    },
    /**
     * Uploads the file to the server.
     */
    submitFile() {
      if (this.file == null) {
        return;
      }

      console.log("Submitting file for upload...");
      let formData = new FormData();
      formData.append('file', this.file);

      axios.post(`api/upload`, formData, {
        headers: { 
          'Content-Type': 'multipart/form-data' 
        },
        timeout: 5000
      })
        .then(response => {
          console.log("File upload successful!");
          this.$refs.fileInput.value = "";
          console.log(response);
        }).catch(error => {
          console.log("File upload failed.");
          console.error(error);
        });
    }
  }
}
</script>
```

## Handle file upload in Flask

Now we have a basic form to upload the file to the server with Axios, let's create an API endpoint that will actually upload the file to Azure Blob Storage. 

Inside app/api/resources.py we'll add a route to handle this operation.

```python [app/api/resources.py]
"""
REST API Resource Routing
http://flask-restplus.readthedocs.io
"""


import io
from datetime import datetime
from flask import request, jsonify, send_file
from flask_restx import Resource
from azure.storage.blob import BlobServiceClient, ContainerClient
from .security import require_auth
from . import api_rest


AZURE_CONNECTION_STRING = "DefaultEndpointsProtocol=https;" + \
    "AccountName=vueflaskstorageaccount;" + \
    "AccountKey=m6Vegjl44F28CnuujeYI27kZblp7pQBRftsuDXGLUN0PkfuRxAkY3MqJogwu2FShclWFWHfD3n4hJYeQEmk3GQ==;" + \
    "EndpointSuffix=core.windows.net"


@api_rest.route('/upload')
class UploadFile(Resource):
    """ Uploads file to Azure Blob Storage """

    def post(self):
        f = request.files["file"]

        try:
            service_client = BlobServiceClient.from_connection_string(AZURE_CONNECTION_STRING)
            container_client = service_client.get_container_client("pdf-container")
            blob_client = container_client.get_blob_client(f.filename)
            blob_client.upload_blob(f)

            return jsonify(success=True)
        except:
            return jsonify(success=False)
```

After assigning the connection string from earlier to `AZURE_CONNECTION_STRING` (in production you don't want to hardcode this sensitive connection string, instead use an environment variable) we initialise a service client which gets the container and uploads the file inside of it. 

The [sample](https://github.com/Azure/azure-sdk-for-python/blob/main/sdk/storage/azure-storage-blob/samples/blob_samples_hello_world.py) and [introduction](https://docs.microsoft.com/en-us/python/api/overview/azure/storage-blob-readme?view=azure-python#getting-started) from Microsoft are useful for learning more about working with the Azure Blob Storage SDK for Python.

## Listing all files in the container

Whilst in app/api/resources.py we're gonna add in two more routes, one to get all files in the container, and another to download a given file by name. This will allow us to list all files in the application and select one to download. Add these under the `UploadFile` class.

```python [app/api/resources.py]
@api_rest.route('/all-files')
class GetAllFiles(Resource):
    """ Gets all filenames in the Azure Blob Storage container """

    def get(self):
        container = ContainerClient.from_connection_string(
            conn_str=AZURE_CONNECTION_STRING, 
            container_name="pdf-container"
        )

        all_filenames = []
        blob_list = container.list_blobs()
        for blob in blob_list:
            all_filenames.append(blob.name)

        return {
            "filenames": all_filenames
        }


@api_rest.route('/download/<string:filename>')
class DownloadFile(Resource):
    """ Downloads a file from Azure Blob Storage by filename """


    def post(self, filename):
        service_client = BlobServiceClient.from_connection_string(AZURE_CONNECTION_STRING)
        container_client = service_client.get_container_client("pdf-container")
        blob_client = container_client.get_blob_client(filename)

        bytes_stream = io.BytesIO()

        blob_data = blob_client.download_blob()
        blob_data.readinto(bytes_stream)

        bytes_stream.seek(0)

        return send_file(bytes_stream,
                          attachment_filename=blob_data.name,
                          mimetype="application/pdf",
                          as_attachment=True)
```

## Downloading a file

Now we have API endpoints to handle both returning the list of files in the container, and to actually download a file, let's make a very simple UI to do both. We'll repurpose `src/views/Api.vue` for this. 

```html [src/views/Api.vue]
<template>
  <div class="all-files-container">
    <ul>
      <li v-for="filename in files" :key="filename.id">
        <a href="#" @click.prevent="downloadFile(filename)">{{ filename }}</a>
      </li>
    </ul>
  </div>
</template>


<script>
/* eslint-disable */
import axios from 'axios'

export default {
  name: 'all-files',
  data () {
    return {
      files: [],
    }
  },
  mounted () {
    this.getAllFiles();
  },
  methods: {
    getAllFiles () {
      axios.get(`api/all-files`)
        .then(response => {
          this.files = response.data.filenames;
        }).catch(error => {
          console.error(error);
        });
    },
    downloadFile(filename) {
      axios({
        url: `api/download/${filename}`,
        method: "POST",
        responseType: "blob",
      }).then(response => {
          const blob = new Blob([response.data], { 
            type: 'application/pdf'
          });
          const url = window.URL.createObjectURL(blob);
          const link = document.createElement("a");
          link.target = "_blank";
          link.href = url;
          link.download = filename;
          link.click();
          window.URL.revokeObjectURL(url);
        }).catch(error => {
          console.error(error);
        });
    }
  }
}
</script>

<style lang="scss">
</style>
```

## What we learned

We now have a small but working application that can handle file uploads and downloads. We learned how to build an upload form in Vue.js and then configure and work with Axios and the Azure Blob Storage Python package. Let's take a look at a short video of the application in action! Let's upload three PDF files from my Downloads folder to Azure Blob Storage then download them via the app.

<article-video 
  id="pxDCZqRfNw0"
  title="Uploading PDFs to Azure Blob Storage with Flask and Vue Demo">
</article-video>

I hope you enjoyed this article and can put what you learned here into practice in your own projects. If you're interested in deploying a Vue Flask app be sure to check out the article [Automated deployment of a Vue Flask app using Azure Pipelines](/blog/automated-deployment-of-a-vue-flask-app-using-azure-pipelines/).

If you enjoyed this article be sure to check out [other articles](/) on the site. ]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Preparing for a statistical data science interview]]></title>
            <link>https://shedloadofcode.com/blog/preparing-for-a-statistical-data-science-interview/</link>
            <guid>https://shedloadofcode.com/blog/preparing-for-a-statistical-data-science-interview/</guid>
            <pubDate>Wed, 09 Feb 2022 17:45:00 GMT</pubDate>
            <description><![CDATA[Ready yourself for a data science interview using the same process I used to successfully prepare for a recent Senior Data Scientist role.]]></description>
            <content:encoded><![CDATA[
In this article, I'll cover the steps I followed during my recent application for a Senior Data Scientist role. I hope this helps you organise your own preparation for data science and any other roles 😄 

## Start with the job requirements

To write a good application and prepare for interview, you need to know your selling points against the job criteria. Below are criteria for three different Senior Data Scientist roles I found. They all have similarities but can vary in tooling and technology used. Select each of the panels to view the requirements.

<list-panel title='Job Criteria 1' :items="jobCriteriaOne"></list-panel>

<list-panel title='Job Criteria 2' :items="jobCriteriaTwo"></list-panel>

<list-panel title='Job Criteria 3' :items="jobCriteriaThree"></list-panel>

In preparing an application, you want to be ready to stress where the requirements of the role and your strong areas collide.  

## Write and submit an application

When it comes to applications, you've either worked as a Data Scientist or you haven't. If you have, showcase your experience and projects. If you haven't, apply Data Science techniques and solve data problems in your current role or in your own time and showcase those projects. Either way, whilst sifting applications they're gonna be looking for **relevant** experience. Give them what they want. 

When I've been on the other side of recruitment sifting applications, the number one thing that marks candidates down is not providing relevant evidence of using the skills required for the job they're applying for. 

All of the examples in this article are made up and light on detail but the structure and format of them are the same as what I use for real. You'll need to expand on them but they give a solid framework. 

For applications, the style should be direct, punchy, quick to scan, easy for a hiring manager to sift and shouldn't include anything that makes you look bad. If you don't have any direct data science experience, show data science techniques you use in your current role. If you don't have any data science examples in your current role, start 'doing data science' in your current role! No one just starts out as a data scientist, but the good news is, you don't have to be a data scientist to apply analytical techniques. The CV and personal statement below cover both angles.

<text-panel title="CV" body="2020 – Present <br/>Data Scientist at Shedload Of Code :) <br/><li>Acted as technical lead on numerous projects</li><li>Combined expert knowledge on both data science and software engineering to advise on, build, test, maintain and release complex data-driven web applications to support analysis and decision making, increase productivity and reduce costs</li><li>Ownership and continued enhancement of established projects, including a key analytical dissemination portal with a user base over 100,000</li><br/><br/>2017 - 2020 <br/>Data Scientist at The Next Big Search Engine Company <br/><li>Developed complex analytical web applications</li><li>Delivered high profile projects under time pressure in an agile working environment</li><li>Delivered analytical projects to tight deadlines in response to the coronavirus pandemic – prioritised tasks to ensure projects remained on track</li><li>Produced in-depth analytics for internal and external clients for data-driven decision making</li><li>Created robotic process automation solutions with Python and PyAutoGUI to increase productivity</li><li>Developed analytical dashboards using Power BI (Embedded) and R Shiny</li><li>Performed geospatial analysis with R Shiny and Leaflet.js by combining datasets with ONS Geo-Portal shape filesand GeoJSON files</li><li>Utilised machine learning algorithms for classification, regression and reinforcement learning using scikit-learn and tidymodels packages</li><li>Built databases and ETL processes with SQL and Azure Data Factory to support applications and tools</li><li> Applied K-nearest neighbours algorithm to identify closest ‘statistical comparator groups’ in a client project</li><li>Built a custom segmentation analysis script to find key influencers on target variables</li><li>Produced high quality peer-reviewed code using Python, R, JavaScript, C# and T-SQL</li><br/><br/>2012 - 2017 <br/>Office Assistant at Dunder Mifflin Paper Company<br/><li>Supervised a small team within a regional office</li><li>Applied data science techniques to improve business performance in addition to main role</li><li>Developed a sales forecasting tool using Python and scikit-learn which improved sales revenue predictions by 17%</li><li>Carried out analysis into stock levels and presented to department head prompting a better ordering process</li><li>Created monte carlo models and 'what if' analyses to support branch decision making</li><li>Improved knowledge of data science by completing Data Science qualification with Harvard and EdX</li>"></text-panel>

<text-panel title="Personal statement" body="My motivation to pursue this position is due to my passion for using insight to enable decision making. I think I would be a great fit for the role both for my keen interest in the company's mission and the skills and experience I can bring.<br/><br/>I bring over 5 years of experience working as a Data Scientist delivering technical and analytical projects end-to-end. I have delivered analysis and predictive modelling on web usage whilst working in my current role at Shedload Of Code.<br/><br/>Before that, I improved a search recommendation algorithm at The Next Big Search Engine Company resulting in increased adoption. <br/><br/>Whilst working at Dundler Mifflin Paper Company, I led my own project to build a forecasting model which increased sales predictions by 17%. Although this role is not directly related to data science, I have successfully applied data science and analytical techniques to solve problems and improve the business.<br/><br/>I am committed to ongoing professional development, and have recently completed a course with Harvard and EdX in Data Science. I built a convolutional neural network to classify images in a large road signs dataset. I think this would be very useful experience in the role because your company uses image classification techniques.<br/><br/>In conclusion, I would make an excellent addition to the team, and would bring strong demonstrable experience to the role. I would appreciate the opportunity to discuss my skills in relation to this role further at interview."></text-panel>

## Prepare competency answers using STAR

Also called behaviour questions, these require prepared answers that tell a story about your behaviour. Always try to start sentences with '**I**' rather than '**we**'. The assessor is interested in your contribution, not what other people did. Approach these questions like you are trying to tell the assessor how great you are, and what an asset you'll be by proving you've handled tough situations before, and delivered strong results. Look at the competencies you'll be assessed on, think about what you've done in the past, and start writing up an answer in the STAR format. 

* Situation = briefly outline the context
* Task = briefly outline what you needed to do and why
* Action = go into detail about what you did and your thinking process (why you did those things)
* Result = drive home that as a result of your actions, X was the outcome, quantify results if possible (saved X%)

Here is a short example that follows the STAR format. 

<text-panel title="Describe a time you used analysis to influence others to make a decision." body="Whilst working at Dundler Mifflin Paper Company, I decided to carry out a project to increase the accuracy in predicting sales. This had been a major issue at the regional office and without accurate and realistic targets to aim for, staff morale became low.<br/><br/>I looked at the current difference between actual and forecasted sales as a baseline success measure. I created a forecasting tool using Python which ingested all of the historical sales, then used an ARIMA model adjusting for seasonal trends to predict future results. <br/><br/>I overcame obstacles in trying to adapt the model to each individual employee so their forecast reflected the customer buying trends of the clients they managed. The model was 35% more accurate in predicting sales than the baseline.<br/><br/>The result was more accurate sales predictions, improved staff morale, and praise from head office. The model is now being adapted and used in other regional offices and influencing planning decisions taken by management. In successfully applying data science techniques in this project, I am now being asked to explore other challenging problems to help the business succeed.">
</text-panel>

For the real thing I would expand on the action paragraphs, adding in: 

* Why you chose that analytical method
* What alternatives and options you explored 
* How you handled messy data 
* How you evaluated the model
* How you updated the model to include new data
* What obstacles and issues you overcame
* How you got buy-in from others
* How you validated your methods
* How you handled conflict and disagreements
* Did you need to delegate any tasks
* Was there time pressure
* How did you prioritise conflicting tasks
* How did you avoid burnout juggling extra responsibilities
* How you ensured standards were high
* How you ensured you were meeting the customer's needs
* How you measured success
* How did you disseminate the analysis to non-technical people
* How you deployed and maintained the model

I would spend *some* time thinking through possible hypothetical questions that might test your understanding of the company, business area or sector. These might include questions like:

* Imagine we gave you data on X, what kind of analysis would you perform on it?
* How would you use data science techniques to improve products and services in our sector?
* We do X analysis here, why do you think that's important for us?

## Prepare technical presentation

Main thing for any presentation is to keep it clear and concise. Address any points you're asked to, otherwise stick to the [rule of three](https://medium.com/swlh/the-rule-of-three-how-to-use-it-9c67219364f6). Keep it mostly high level, but be prepared to drill into details. I was asked to present a recent analytical project. I don't like PowerPoint but created some slides as talking points:

* Introduction - quick about me then into the problem statement
* Research - how I researched the issue
* Development - how I built the solution
* User journey - to understand the product
* Challenges and solutions - how I overcame obstacles
* Analytical techniques - drill into key data science techniques used
* Launch - releasing a working product or model into production
* Outcomes - the value that was added and success metrics

<article-image 
  src="https://res.cloudinary.com/dayqxxsip/image/upload/v1642607893/App%20Images/Blog%20Images/Article%20Images/Preparing%20for%20DS%20Interview/presentation-slide-x_zyo39c.png" 
  alt="Presentation slide on development" 
  loading="lazy" 
  styling=""
  caption="Development slide - only talking points not a mountain of text" 
  captionsrc="https://res.cloudinary.com/dayqxxsip/image/upload/v1642607893/App%20Images/Blog%20Images/Article%20Images/Preparing%20for%20DS%20Interview/presentation-slide-x_zyo39c.png" 
  :showsource="false">
</article-image>

## Review statistical concepts

Statistics underpin almost all of data science. I think data science can even be referred to as statistical learning. So going back to basics can never hurt. I wouldn't get too bogged down during this part, but certainly don't neglect it.

* Descriptive statistics
* Inferential statistics
* Distributions (normal, binomial, poisson, exponential)
* Sampling (random, stratified, cluster)  
* Hypothesis testing
* Statistical significance
* Regression
* Confidence Intervals
* Correlation
* P-values
* Probability (Bayes theorem)
* Bias
* Testing (z-score, t-test, Chi-square, ANOVA)

A great book for brushing up on statistical concepts is [Practical Statistics for Data Scientists: 50+ Essential Concepts Using R and Python](https://www.amazon.com/Practical-Statistics-Data-Scientists-Essential/dp/149207294X/ref=sxin_13_mbs_w_global_sims). Another extremely useful book I mention later that covers lots of topics including statistics is [Ace the Data Science Interview: 201 Real Interview Questions Asked By FAANG, Tech Startups, & Wall Street](https://www.amazon.com/Ace-Data-Science-Interview-Questions/dp/0578973839/ref=sr_1_1).

## Review machine learning concepts

As with statistical concepts it's always good to refresh your knowledge of machine learning algorithms and when to use each before heading into a data science interview. These can be broadly categorised as:

* Supervised learning - classification, regression algorithms
* Unsupervised learning - clustering algorithms
* Reinforcement learning

<article-image 
  src="https://scikit-learn.org/stable/_static/ml_map.png" 
  alt="Model selection diagram" 
  loading="lazy" 
  styling=""
  caption="Model selection diagram from scikit-learn documentation" 
  captionsrc="https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html" 
  :showsource="true">
</article-image>

A book I refer back to again and again on machine learning is [Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems](https://www.amazon.com/Hands-Machine-Learning-Scikit-Learn-TensorFlow/dp/1492032646/ref=sr_1_1)

## Complete a practice take-home test

Although I don't really agree with take-home tests or coding challenges, it is out there, it exists, and if I find a role I really want but there's a test attached I'll consider it based upon time investment. A test might mean spending significant time investment brushing up on concepts you've not used in a while. Nevertheless, putting aside the LeetCode style coding challenges covered in my [coding interview topics in Python article](/blog/exploring-coding-interview-topics-in-python/), I figure for data science there will be only one of two possibilities for a test. It will either be an analytical (tell us something interesting about this data) or modelling (predict X outcome with this data, model this data to calculate X outcome) project.

<article-image 
  src="https://res.cloudinary.com/dayqxxsip/image/upload/v1642607694/App%20Images/Blog%20Images/Article%20Images/Preparing%20for%20DS%20Interview/take-home-test-plan_sb8zj1_1_qayjbc.png" 
  alt="How to approach a data science take home exercise" 
  loading="lazy" 
  styling=""
  caption="How to approach a data science take home exercise" 
  captionsrc="https://res.cloudinary.com/dayqxxsip/image/upload/v1642607694/App%20Images/Blog%20Images/Article%20Images/Preparing%20for%20DS%20Interview/take-home-test-plan_sb8zj1_1_qayjbc.png" 
  :showsource="false">
</article-image>

For analytical I'd use Jupyter Notebook or Jupyter Lab and for modelling I'd use Visual Studio Code with the [cookiecutter package](https://drivendata.github.io/cookiecutter-data-science/) for good code organisation. This makes creating a new machine learning project as easy as:

```
pip install cookiecutter
```

```
cookiecutter https://github.com/drivendata/cookiecutter-data-science 
```

I found a [Reddit post](https://www.reddit.com/r/datascience/comments/a0xi77/practice_takehome_case_study_datasetscode_included/) linking to a [practice case study with code](https://www.interviewqs.com/blog/case-study-example-1) applicable to any analytical role - and with some modifications to a data science role. This is the kind of thing any take-home might look like:

> Build a simple model based on insights you've found and describe how its predictions add value to the company. Present
> the model you fitted, why you chose it, explain the model as if speaking to a non-technical audience and how the 
> predictions could have an impact on the business processes going forward.

I have also come across statistical and numerical aptitude tests but they are usually only at entry-level or for large recruitment campaigns. For the last numerical aptitude test I had I used [Assessment Day](https://www.assessmentday.co.uk/aptitudetests_numerical.htm) to prepare. They are usually fast pace like one minute per question so the winning formula is, read the question, look for the specific data the question is talking about, perform calculations, sense check, select the answer, then move on (you don't have time to double check answer). For statistical tests, here is a [good practice test](https://files.civilservicejobs.service.gov.uk/admin/fairs/apptrack/download.cgi?SID=b3duZXI9NTA3MDAwMCZvd25lcnR5cGU9ZmFpciZkb2NfdHlwZT12YWMmZG9jX2lkPTk4OTUyMCZ2ZXJpZnk9NDliZDZiMTQ1ZDQ2NjZlMDkyYWRmZDBlMGM3MDZhYmYmcmVxc2lnPTE2NDI2MTE5NTgtODJhMDg1Y2Q2MmU2ODJjN2NjZjUwNGMzOGM3ZjE3NzVlNTU1NGIzNw==) with answers at the end.

A book that guided me through the case study and coding aspects of data science interviews [Ace the Data Science Interview: 201 Real Interview Questions Asked By FAANG, Tech Startups, & Wall Street](https://www.amazon.com/Ace-Data-Science-Interview-Questions/dp/0578973839/ref=sr_1_1). This is an invaluable resource considering sheer breadth of it's contents. I took a risk on this one and decided to try and get it delivered ASAP before my interview. It was just what I needed and covers almost all topics you'd need to be aware of. This includes chapters on probability, statistics, machine learning, coding / data structures and algorithms, SQL and database design, product sense and case studies.

## On the day

The main things you should do on the day of the interview include:

* Stay relaxed!
* Be yourself
* Enjoy the process as much as possible
* Let your passion for data science show
* Tell them how amazing you are in your answers (not arrogant but confident)
* Like in the application say '**I**' not '**we**' (they are interested in your actions)
* Remember the STAR format (it will keep your answers on track)
* Remember the data science lifecycle (it will keep your answers on track)
* Ask questions (you're interviewing them too!)
* Don't be afraid to say if you don't know something (How would you learn about it?)

## After the interview

Well done! You did it! The interview is over and you can breathe a sigh of relief. My last major tip might not be what you're expecting... Now that the interview is over, write down the questions you were asked (to practice in future) and then forget about the interview completely! Don't dwell on things that you could have said, mistakes you think you made, or even what went well. Just resign it to the history books. 

Yes, celebrate that it's over and done with and that you gave it a solid effort, but expect the answer to be 'sorry we went with another candidate, but we thought you were great'. Expect the worst, hope for the best. By doing this, you'll force yourself to view applications and interviews as opportunities and won't over-invest yourself emotionally in them. Everyone fails interviews for many different reasons. If you did great, it's their loss. If you stumbled, see it as practice and improve for next time. The key to getting what you want in anything is to never stop trying, failing, then improving, then trying again.

I hope this article helped you prepare for your own data science interview and wish you the best of luck with it!

If you enjoyed this article, be sure to check out [other articles](/) on the site.]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Reduce Material Design Icons Font to 7KB and automate with PyAutoGUI]]></title>
            <link>https://shedloadofcode.com/blog/reduce-material-design-icons-font-to-7kb-and-automate-with-pyautogui/</link>
            <guid>https://shedloadofcode.com/blog/reduce-material-design-icons-font-to-7kb-and-automate-with-pyautogui/</guid>
            <pubDate>Wed, 05 Jan 2022 16:10:00 GMT</pubDate>
            <description><![CDATA[In this article, we'll self-host and reduce the size of the Material Design Icons woff2 font file from 361KB to 7KB, keeping only the icons actually used, and then automate the entire optimisation process.]]></description>
            <content:encoded><![CDATA[
This article will cover how I reduced the total size of loading Material Design Icons Font from 361KB to 7KB, and then automated that process using PyAutoGUI. We’ll go through a full end-to-end tutorial of the process. If you want to follow along you can [download the distribution code](https://github.com/shedloadofcode/reduce-mdi-icons-font) before continuing.

## Why optimise Material Design Icons Font?

Website performance is critical for delivering a solid user experience. It is especially important for serving web pages to mobile devices and/or locations with poor network connectivity. Not only that, lower page sizes saves bandwidth usage and in turn saves money. Every byte and millisecond counts. I try to identify any opportunity to improve performance and page speed I can. It’s a continuous process of improvement. I recently optimised this site and ticked off the following improvements:

* Compressing images with tools like TinyPNG 
* Using smaller image formats like WebP
* Lazy loading images and videos below the fold
* Lazy hydration for SPA’s
* Minifying JavaScript and CSS
* Reducing payload sizes for data requests
* Reducing webpack bundle size 
* Eliminating any redirects
* Caching or precomputing results for expensive operations

There was something still bothering me though, I had a Lighthouse error indicating “ensure text remains visible during webfont load”. 

<article-image 
  src="https://res.cloudinary.com/dayqxxsip/image/upload/v1640012416/App%20Images/Blog%20Images/Article%20Images/Optimise%20Icons%20Font/ensure-text-remains-visible-error_onhug0.png" 
  alt="Ensure text remains visible Lighthouse error" 
  loading="lazy" 
  styling=""
  caption="Ensure text remains visible Lighthouse error" 
  captionsrc="https://res.cloudinary.com/dayqxxsip/image/upload/v1640012416/App%20Images/Blog%20Images/Article%20Images/Optimise%20Icons%20Font/ensure-text-remains-visible-error_onhug0.png" 
  :showsource="false">
</article-image>

This was because I was using a CDN to pull in the Material Design icons stylesheet from [cdn.jsdelivr.net](https://cdn.jsdelivr.net/npm/@mdi/font@5.8.55/css/materialdesignicons.min.css) which then downloaded the [woff2 font](https://cdn.jsdelivr.net/npm/@mdi/font@5.8.55/fonts/materialdesignicons-webfont.woff2?v=5.8.55). From the CDN, the font file weighed in at 320KB and the stylesheet was 43.5KB.

<article-image 
  src="https://res.cloudinary.com/dayqxxsip/image/upload/v1640012416/App%20Images/Blog%20Images/Article%20Images/Optimise%20Icons%20Font/font-size-before-self-hosting_y79jn5.png" 
  alt="File sizes for stylesheet and font using CDN" 
  loading="lazy" 
  styling=""
  caption="File sizes for stylesheet and font using CDN" 
  captionsrc="https://res.cloudinary.com/dayqxxsip/image/upload/v1640012416/App%20Images/Blog%20Images/Article%20Images/Optimise%20Icons%20Font/font-size-before-self-hosting_y79jn5.png" 
  :showsource="false">
</article-image>

The solution recommended was to add [font-display swap](https://web.dev/font-display/?utm_source=lighthouse&utm_medium=devtools) to the font stylesheet selector. I know that seems silly for an icon font as there is no 'fallback' for icons really, a better suggestion for icon fonts might be to use [font-display block instead](https://stackoverflow.com/questions/49461308/correct-font-display-value-for-icon-fonts). This was impossible to achieve using a CDN although I didn’t mind the idea of self-hosting icon web fonts. I knew there were trade offs between using a CDN opposed to self hosting, but in the interests of site reliability (who wants to use a site without icons if the CDN stops working, right?) I decided to self-host the icon font. This is where my optimisation experiment began!

## Download the Material Design Icon pack

With the distribution code I’m starting with a simple HTML page, alongside empty style and font folders. We can see the page has 19 Material Design icons and they are coming from the CDN source to begin with. 

<article-image 
  src="https://res.cloudinary.com/dayqxxsip/image/upload/v1640014313/App%20Images/Blog%20Images/Article%20Images/Optimise%20Icons%20Font/webpage-with-icons_1_xi1iue.png" 
  alt="Icons on a web page" 
  loading="lazy" 
  styling=""
  caption="Icons on our one-page site" 
  captionsrc="https://res.cloudinary.com/dayqxxsip/image/upload/v1640014313/App%20Images/Blog%20Images/Article%20Images/Optimise%20Icons%20Font/webpage-with-icons_1_xi1iue.png" 
  :showsource="false">
</article-image>

The Material Icons are being loaded via the stylesheet link tag in the head section of the document, which then loads the 320KB woff2 font file which you will see by hitting F12 and inspecting the Network tab in Chrome (or a different browser's) DevTools. To make viewing this information easier, you can filter the Network tab to just 'CSS' and 'Font' like in the image above.

```html [index.html]
<!DOCTYPE html>
<html lang="en">

<head>
  <meta charset="UTF-8">
  <meta name="viewport" content="width=device-width, initial-scale=1.0">
  <meta http-equiv="X-UA-Compatible" content="ie=edge">
  <title>Icons Site</title>
  <link rel="stylesheet preload" href="https://cdn.jsdelivr.net/npm/@mdi/font@5.8.55/css/materialdesignicons.min.css">
</head>

<body>
  <span class="icon"><i class="mdi mdi-48px mdi-language-cpp"></i></span>
  <span class="icon"><i class="mdi mdi-48px mdi-language-cpp"></i></span>
  <span class="icon"><i class="mdi mdi-48px mdi-language-csharp"></i></span>
  <span class="icon"><i class="mdi mdi-48px mdi-language-python"></i></span>
  <span class="icon"><i class="mdi mdi-48px mdi-language-javascript"></i></span>
  <span class="icon"><i class="mdi mdi-48px mdi-language-ruby"></i></span>
  <span class="icon"><i class="mdi mdi-48px mdi-language-html5"></i></span>
  <span class="icon"><i class="mdi mdi-48px mdi-language-css3"></i></span>
  <span class="icon"><i class="mdi mdi-48px mdi-language-fortran"></i></span>
  <span class="icon"><i class="mdi mdi-48px mdi-language-go"></i></span>
  <span class="icon"><i class="mdi mdi-48px mdi-language-java"></i></span>
  <span class="icon"><i class="mdi mdi-48px mdi-language-kotlin"></i></span>
  <span class="icon"><i class="mdi mdi-48px mdi-language-lua"></i></span>
  <span class="icon"><i class="mdi mdi-48px mdi-language-markdown"></i></span>
  <span class="icon"><i class="mdi mdi-48px mdi-language-php"></i></span>
  <span class="icon"><i class="mdi mdi-48px mdi-language-r"></i></span>
  <span class="icon"><i class="mdi mdi-48px mdi-language-web"></i></span>
  <span class="icon"><i class="mdi mdi-48px mdi-cpu-64-bit"></i></span>
  <span class="icon"><i class="mdi mdi-48px mdi-server"></i></span>
  <span class="icon"><i class="mdi mdi-48px mdi-access-point-network"></i></span>
</body>

</html>
```

We’re going to swap this out for a locally hosted icon font and stylesheet. Download the [Material Design icon font](https://github.com/Templarian/MaterialDesign-Webfont) and extract the contents. 

<article-image 
  src="https://res.cloudinary.com/dayqxxsip/image/upload/v1640014407/App%20Images/Blog%20Images/Article%20Images/Optimise%20Icons%20Font/download-mdi-font_1_jqyx4v.png" 
  alt="Downloading zip file of Material Design Icons fonts" 
  loading="lazy" 
  styling=""
  caption="Downloading zip file of Material Design Icons fonts" 
  captionsrc="https://res.cloudinary.com/dayqxxsip/image/upload/v1640014407/App%20Images/Blog%20Images/Article%20Images/Optimise%20Icons%20Font/download-mdi-font_1_jqyx4v.png" 
  :showsource="false">
</article-image>

Move all of the font files in the 'fonts' folder to our project's 'fonts' folder. Then move 'materialdesignicons.css' in the 'css' folder to our project's 'css' folder. At the end, we'll only be using the .woff2 file as it provides [improved compression and is supported by major browsers](https://stackoverflow.com/questions/11002820/why-should-we-include-ttf-eot-woff-svg-in-a-font-face). 

<article-image 
  src="https://res.cloudinary.com/dayqxxsip/image/upload/v1640014895/App%20Images/Blog%20Images/Article%20Images/Optimise%20Icons%20Font/css-and-fonts-copied-to-project-folder_lwlgln.png" 
  alt="Fonts and CSS copied to our project's folders" 
  loading="lazy" 
  styling=""
  caption="Fonts and CSS copied to our project's folders" 
  captionsrc="https://res.cloudinary.com/dayqxxsip/image/upload/v1640014895/App%20Images/Blog%20Images/Article%20Images/Optimise%20Icons%20Font/css-and-fonts-copied-to-project-folder_lwlgln.png" 
  :showsource="false">
</article-image>

With the stylesheet and the font files in the correct folders, let’s hook up the stylesheet and remove the CDN by updating the head section. 

```html [index.html]
...
<head>
  <meta charset="UTF-8">
  <meta name="viewport" content="width=device-width, initial-scale=1.0">
  <meta http-equiv="X-UA-Compatible" content="ie=edge">
  <title>Icons Site</title>
  <link rel="stylesheet preload" href="css/materialdesignicons.css">
</head>
...
```

To eliminate the pesky 'ensure text remains visible' Lighthouse error I also added 'font-display block' to 'materialdesignicons.css'.

```css [materialdesignicons.css]
/* MaterialDesignIcons.com */
@font-face {
  font-family: "Material Design Icons";
  src: url("../fonts/materialdesignicons-webfont.eot?v=6.5.95");
  src: url("../fonts/materialdesignicons-webfont.eot?#iefix&v=6.5.95") format("embedded-opentype"), url("../fonts/materialdesignicons-webfont.woff2?v=6.5.95") format("woff2"), url("../fonts/materialdesignicons-webfont.woff?v=6.5.95") format("woff"), url("../fonts/materialdesignicons-webfont.ttf?v=6.5.95") format("truetype");
  font-weight: normal;
  font-style: normal;
  font-display: block;
}
```

After hard refreshing the page (Ctrl + F5) you should see the icon font is still working as expected but with the CDN removed. Checking the Network tab again we can see the icons are now being loaded locally via 'materialdesignicons.css'. 👍

<article-image 
  src="https://res.cloudinary.com/dayqxxsip/image/upload/v1640016553/App%20Images/Blog%20Images/Article%20Images/Optimise%20Icons%20Font/self-hosting-css-and-icon-font_dgdbm8.png" 
  alt="Stylesheet and font self-hosted and served locally" 
  loading="lazy" 
  styling=""
  caption="Stylesheet and font self-hosted and served locally" 
  captionsrc="https://res.cloudinary.com/dayqxxsip/image/upload/v1640016553/App%20Images/Blog%20Images/Article%20Images/Optimise%20Icons%20Font/self-hosting-css-and-icon-font_dgdbm8.png" 
  :showsource="false">
</article-image>

The major problem from the image above is that the 'materialdesignicons.css' file is over 26,000 lines of code for over 5,000 icons, and is 369KB, and the .woff2 file is 361KB, and yet we’re only using 19 icons! The page load time will be bloated, our bandwidth is being consumed and the visitor experience badly affected as a result. The average web page is around 2-3MB, but the recommended size is 1MB. This is 73% of that recommended 1MB in the icon stylesheet and font alone! We could minify 'materialdesignicons.css' to look similar to 'materialdesignicons.min.css' from the original download which is 298KB but that's still too large. Let’s embark on the next step in our efficiency quest to reduce both the stylesheet and font file sizes.

## Identify which icons are actually being used throughout the site

I first searched the site to figure out which icons were actually being used throughout it. The one page site we’re using has 19 icons, this site had around 84, mostly in the tools (especially the [System Capacity Calculator](/tools/system-capacity-calculator/)). I made a note of these by inspecting with DevTools, finding the CSS selector in the full stylesheet 'materialdesignicons.css', then copying them into a separate Notepad++ file. This can take a little time, but well worth it!

<article-image 
  src="https://res.cloudinary.com/dayqxxsip/image/upload/v1640017702/App%20Images/Blog%20Images/Article%20Images/Optimise%20Icons%20Font/combing-site-for-used-icons_vz2xm4.webp" 
  alt="Inspect the icons used in the site" 
  loading="lazy" 
  styling=""
  caption="Inspect the icons used in the site" 
  captionsrc="https://res.cloudinary.com/dayqxxsip/image/upload/v1640017702/App%20Images/Blog%20Images/Article%20Images/Optimise%20Icons%20Font/combing-site-for-used-icons_vz2xm4.webp" 
  :showsource="false">
</article-image>

## Remove unused selectors from the stylesheet 

The result of my investigation to identify the icons actually used gave me a list of 19 CSS selectors. I made a backup of the full stylesheet in case I wanted to add any more icons in the future, but after replacing the body with the condensed list this is what it looked like:

```css [materialdesignicons.css]
/* MaterialDesignIcons.com */
@font-face {
  font-family: "Material Design Icons";
  src: url("../fonts/materialdesignicons-webfont.eot?v=6.5.95");
  src: url("../fonts/materialdesignicons-webfont.eot?#iefix&v=6.5.95") format("embedded-opentype"), 
       url("../fonts/materialdesignicons-webfont.woff2?v=6.5.95") format("woff2"), 
       url("../fonts/materialdesignicons-webfont.woff?v=6.5.95") format("woff"), 
       url("../fonts/materialdesignicons-webfont.ttf?v=6.5.95") format("truetype");
  font-weight: normal;
  font-style: normal;
  font-display: block;
}

.mdi:before,
.mdi-set {
  display: inline-block;
  font: normal normal normal 24px/1 "Material Design Icons";
  font-size: inherit;
  text-rendering: auto;
  line-height: inherit;
  -webkit-font-smoothing: antialiased;
  -moz-osx-font-smoothing: grayscale;
}

.mdi-language-c::before {
  content: "\F0671";
}

.mdi-language-cpp::before {
  content: "\F0672";
}

.mdi-language-csharp::before {
  content: "\F031B";
}

.mdi-language-css3::before {
  content: "\F031C";
}

.mdi-language-fortran::before {
  content: "\F121A";
}

.mdi-language-go::before {
  content: "\F07D3";
}

.mdi-language-html5::before {
  content: "\F031D";
}

.mdi-language-java::before {
  content: "\F0B37";
}

.mdi-language-javascript::before {
  content: "\F031E";
}

.mdi-language-kotlin::before {
  content: "\F1219";
}

.mdi-language-lua::before {
  content: "\F08B1";
}

.mdi-language-markdown::before {
  content: "\F0354";
}

.mdi-language-php::before {
  content: "\F031F";
}

.mdi-language-python::before {
  content: "\F0320";
}

.mdi-language-r::before {
  content: "\F07D4";
}

.mdi-language-ruby::before {
  content: "\F0D2D";
}

.mdi-cpu-64-bit::before {
  content: "\F0EE0";
}

.mdi-server::before {
  content: "\F048B";
}

.mdi-access-point-network::before {
  content: "\F0002";
}

.mdi-18px.mdi-set, .mdi-18px.mdi:before {
  font-size: 18px;
}

.mdi-24px.mdi-set, .mdi-24px.mdi:before {
  font-size: 24px;
}

.mdi-36px.mdi-set, .mdi-36px.mdi:before {
  font-size: 36px;
}

.mdi-48px.mdi-set, .mdi-48px.mdi:before {
  font-size: 48px;
}
```

You can use this to replace the entire contents of the stylesheet. 112 lines is much better than 26,000 lines. This file now weighs in a at 2.1KB and we’re feeling lighter already. If we hard refresh again, we can see the site is still working as expected 😆

Now onto the harder part, optimising the font file.

## Remove unused selectors from the font file 

So we’ve reduced the size of stylesheet to only the icons we’re using, how do we do the same for the .woff2 font file? To do this I used a free tool called [FontForge](https://fontforge.org/en-US/). The process sounds difficult at first but this is what worked for me:

1. Download FontForge
2. Open the .ttf font file 
3. Select the icons you want to keep by searching for an icon with `Ctrl + Shift + >` then ticking 'Merge into selection' 
4. Invert the selection (selecting all the icons you want to get rid of)
5. Remove the unused icons 
6. Condense 
7. Generate the font 
7. Save as .woff2

This process feels very repetitive and I certainly wasn’t doing this for 84 icons or even for our 19 icons. Of course, if you’re only using 5 you might not mind searching for them then removing the rest, but for any more it’s tedious. I automated this step using [PyAutoGUI](https://pyautogui.readthedocs.io/en/latest/) and have a video of a robot following the process outlined above in the next section.

## Automate the minify font file process with PyAutoGUI

So we’ve decided selecting 19 or more icons is too repetitive, time consuming and prone to human error. Let’s automate the process. This video shows the IconFontMiniferRobot in action following the process we outlined earlier. I opened FontForge, loaded the .ttf file, ran the robot, switched back to FontForge and the robot takes over. I just love building robotic automation process solutions.

<article-video 
  id="OWivMT0DSLk" 
  title="Reducing Material Design Icons font file size with FontForge and PyAutoGUI">
</article-video>

The robot reads the CSS stylesheet, extracts the icons selector names used in the stylesheet using a regular expression, selects those identified icons in FontForge, selects the inverse, condenses and generates the font then saves as a .woff2 file and it’s only 2.8KB! When I did the same thing for this site it was around 7KB for 84 icons. Now we can test it still works in our site. Before that though, you want to see the code for the robot right? 

```python [icon_font_minifier_robot.py]
"""Automates removing unused material icons from a .tff font file.

Reads in a material icons scss or css file and parses applied css 
selectors such as '.mdi-close-box-multiple-outline::before'. Uses
PyAutoGUI to control FontForge in order to remove all unused icons 
from the .tff file then saves the output as a .woff2 file.

Ensure FontForge is opened and loaded with the .tff before running, 
then run the program, switch to FontForge and let the robot take over :)

  Typical usage example:

  robot = IconFontMinifierRobot()

  robot.removeUnusedIcons(
    css_filepath="css/materialdesignicons.css",
    woff2_output_path="C:\\Users\\shedloadofcode\\Documents\\icon-fonts-project\\fonts\\"
  )
"""
import pyautogui
import re
import time

class IconFontMinifierRobot():

    def removeUnusedIcons(self, css_filepath, woff2_output_folderpath):
        print(f"Opening stylesheet...")
        stylesheet = open(css_filepath, "r")
        
        print("Parsing stylesheet...")
        icons = self.get_icon_styles_from(stylesheet)

        print("Now switch active window to FontForge :)")
        time.sleep(15)

        print("Selecting icons in FontForge...")
        for icon in icons:
          self.select_icon_in_fontforge(icon)

        print("Removing icons not in CSS...")
        self.invert_selection()
        self.detach_and_remove_selected_glpyhs()
        self.make_compact()

        print("Generating .woff2 file...")
        self.generate_fonts(
            woff2_output_folderpath, 
        )
        self.confirm_generate()
        print("Font saved.")

    def get_icon_styles_from(self, stylesheet):
        pattern = re.compile(r"\.mdi-[a-z\-A-Z\-0-9]+::before")
        icon_styles = pattern.findall(stylesheet.read(), re.IGNORECASE)
        print(f"{len(icon_styles)} icons found.")
        
        return icon_styles

    def select_icon_in_fontforge(self, icon):
        pyautogui.hotkey("ctrl", "shift", ">")
        time.sleep(0.5)
        pyautogui.typewrite(
            icon.replace(".mdi-", "").replace("::before", "")
        )
        pyautogui.moveTo(927, 533)
        pyautogui.click()
        time.sleep(0.5)
        pyautogui.moveTo(915, 560)
        pyautogui.click()
        time.sleep(0.5)
        pyautogui.press('enter')
        time.sleep(2)

    def click_encoding_menu(self):
        pyautogui.moveTo(240, 35)
        pyautogui.click()
        time.sleep(2)

    def invert_selection(self):
        pyautogui.moveTo(53, 32)
        pyautogui.click()
        time.sleep(1)
        pyautogui.moveTo(112, 477)
        pyautogui.click()
        time.sleep(1)
        pyautogui.moveTo(444, 503)
        pyautogui.click()
        time.sleep(2)

    def detach_and_remove_selected_glpyhs(self):
        self.click_encoding_menu()
        time.sleep(1)
        pyautogui.moveTo(292, 182)
        pyautogui.click()
        time.sleep(2)
        pyautogui.press("enter")
        time.sleep(10)

    def make_compact(self):
        self.click_encoding_menu()
        pyautogui.moveTo(248, 80)
        pyautogui.click()
        time.sleep(2)

    def generate_fonts(self, woff2_output_folderpath):
        pyautogui.hotkey("ctrl", "shift", "g")
        time.sleep(2)
        pyautogui.hotkey("ctrl", "a")
        time.sleep(1)
        pyautogui.typewrite(
            woff2_output_folderpath + \
            "materialdesignicons-webfont-min.woff2"
        )
        pyautogui.press("enter")
        time.sleep(2)

    def confirm_generate(self):
        pyautogui.moveTo(1022, 609)
        pyautogui.click()
        time.sleep(3)


if __name__ == "__main__":
    robot = IconFontMinifierRobot()

    robot.removeUnusedIcons(
        css_filepath="css/materialdesignicons.css",
        woff2_output_folderpath="C:\\Users\\shedloadofcode\\Documents\\icon-fonts-project\\fonts\\"
    )
```

You’ll need to install PyAutoGUI to use this script with

```
pip install pyautogui 
```

You might need to update the screen coordinates in all of the `moveTo` methods too if you’re using a different resolution screen to adjust where the robot clicks to be the same as in the video. PyAutoGUI is a super useful tool but I’ve found it needs adjustments when using on different devices, so consider this your chance to practice and perfect your automation skills. You can check the screen coordinates of your current mouse position with the script below which is [from the docs](https://pyautogui.readthedocs.io/en/latest/mouse.html):

```python [get_mouse_coordinates.py]
import pyautogui, sys

print('Press Ctrl-C to quit.')
try:
    while True:
        x, y = pyautogui.position()
        positionStr = 'X: ' + str(x).rjust(4) + ' Y: ' + str(y).rjust(4)
        print(positionStr, end='')
        print('\b' * len(positionStr), end='', flush=True)
except KeyboardInterrupt:
    print('\n')
```

I also used PyAutoGUI in another interesting project [Creating a screen and mouse jiggler with Python](/blog/creating-a-screen-and-mouse-jiggler-with-python/). It really is a great lightweight automation tool.

## Replace font file with minified version

Now we have the minified font file generated and saved to the font folder as 'materialdesignicons-webfont-min.woff2', we can update our stylesheet to use the minified version instead of the bloated version as seen at the end of the video.

<article-image 
  src="https://res.cloudinary.com/dayqxxsip/image/upload/v1640023757/App%20Images/Blog%20Images/Article%20Images/Optimise%20Icons%20Font/using-the-minified-font-file-in-the-stylesheet_onzxca.png" 
  alt="Using the minified font file in the CSS stylesheet" 
  loading="lazy" 
  styling=""
  caption="Using the minified font file in the CSS stylesheet" 
  captionsrc="https://res.cloudinary.com/dayqxxsip/image/upload/v1640023757/App%20Images/Blog%20Images/Article%20Images/Optimise%20Icons%20Font/using-the-minified-font-file-in-the-stylesheet_onzxca.png" 
  :showsource="false">
</article-image>

You can see I've removed all of the other font files leaving only the .woff2 font file, and only referenced that in the stylesheet.

```css [materialdesignicons.css]
@font-face {
  font-family: "Material Design Icons";
  src: url("../fonts/materialdesignicons-webfont-min.woff2?v=6.5.95") format("woff2");
  font-weight: normal;
  font-style: normal;
  font-display: block;
}
...
```

If we hard refresh we can see the icons still worked as expected! Checking the network tab shows the CSS file using only the woff2 at 1.8KB and the font file at 2.8KB! This is a 99.22% reduction to 2.8KB in font file size from our starting 361KB!

<article-image 
  src="https://res.cloudinary.com/dayqxxsip/image/upload/v1640027488/App%20Images/Blog%20Images/Article%20Images/Optimise%20Icons%20Font/final-file-sizes_jhe0zd.png" 
  alt="Final stylesheet and font file sizes" 
  loading="lazy" 
  styling=""
  caption="Final stylesheet and font file sizes" 
  captionsrc="https://res.cloudinary.com/dayqxxsip/image/upload/v1640027488/App%20Images/Blog%20Images/Article%20Images/Optimise%20Icons%20Font/final-file-sizes_jhe0zd.png" 
  :showsource="false">
</article-image>
 
## Performance improvements summary

I am very pleased with the performance improvements as a result of this project. It means that every user doesn't have to download a 361KB font file just to see icons display on the page. This has led to a better user experience, better page load times and has reduced bandwidth consumption. The stats for file size reductions from this project can be seen in the table below:

| Type  | Starting Size KB | Final Size KB | Reduction % |
|-------|------------------|---------------|-------------|
| CSS   | 369KB            | 1.8KB         | 99.51%      |
| Woff2 | 361KB            | 2.8KB         | 99.24%      |

If you have any questions about this tutorial please leave a comment in the comments section below or feel free to reach out via the contact button at the bottom of this page 👍 I hope this has given you an insight into how you can go about self-hosting and reducing icon fonts in your own projects.

If you enjoyed this article, be sure to check out [other articles on the site](/).]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[How to do an index match with Python and Pandas]]></title>
            <link>https://shedloadofcode.com/blog/how-to-do-an-index-match-with-python-and-pandas/</link>
            <guid>https://shedloadofcode.com/blog/how-to-do-an-index-match-with-python-and-pandas/</guid>
            <pubDate>Wed, 08 Dec 2021 13:30:00 GMT</pubDate>
            <description><![CDATA[Learn how to do the Python equivalent of Excel's INDEX MATCH or VLOOKUP functions using Pandas merge.]]></description>
            <content:encoded><![CDATA[
Inspired by my previous article [How to batch rename files in folders with Python](/blog/how-to-batch-rename-files-in-folders-with-python/) and the theme of quickly solving problems with Python, let's explore how make life easier and do an index match using Pandas rather than with Excel. The code and files used are available to download via a link at the end of the article 😄

## Index Match with Excel

Let's say we have three tables, Orders,  OrderDetails and Products. All of these tables are related by either OrderID or ProductID. A typical problem might be trying to add the ProductName and TotalPrice column values to OrderDetails like this...

<article-image 
  src="https://res.cloudinary.com/dayqxxsip/image/upload/v1639840233/App%20Images/Blog%20Images/Article%20Images/Index%20Match%20Pandas/Excel_Join_Before_hbezgh_ijv5uf.png" 
  alt="Example before" 
  loading="lazy" 
  styling=""
  caption="We want to match information here from Products" 
  captionsrc="https://res.cloudinary.com/dayqxxsip/image/upload/v1639840233/App%20Images/Blog%20Images/Article%20Images/Index%20Match%20Pandas/Excel_Join_Before_hbezgh_ijv5uf.png" 
  :showsource="false">
</article-image>

Here we are effectively trying to merge / match the values based upon the ProductID column from the OrderDetails table and the ID column from the Products table.

Using the INDEX MATCH formula in Excel has become the better option vs VLOOKUP due to it not breaking if new columns are inserted. 

```
=INDEX(TargetArray, MATCH(LookupValue, LookupArray, ExactMatch=0))
```

As we can see, the ProductName and TotalPrice (ListPrice * Quantity) have been filled after dragging the formula downwards. 

<article-image 
  src="https://res.cloudinary.com/dayqxxsip/image/upload/v1639840233/App%20Images/Blog%20Images/Article%20Images/Index%20Match%20Pandas/Excel_Join_Solution_jowmh5_eemx36.png" 
  alt="Example after" 
  loading="lazy" 
  styling=""
  caption="Now we have matched information from Products on ProductID" 
  captionsrc="https://res.cloudinary.com/dayqxxsip/image/upload/v1639840233/App%20Images/Blog%20Images/Article%20Images/Index%20Match%20Pandas/Excel_Join_Solution_jowmh5_eemx36.png" 
  :showsource="false">
</article-image>

Although I am using a formatted table (using Ctrl + T) in this example, you could also use this without formatted tables by amending the index match formula, but remembering to include the $ for [fixed references](https://support.microsoft.com/en-us/office/switch-between-relative-absolute-and-mixed-references-dfec08cd-ae65-4f56-839e-5f0d8d0baca9) to the TargetArray and the LookupArray.

```
=INDEX($N$2:$N$9, MATCH(H3, $M$2:$M$9, 0))
```

## Merge with Python and Pandas

We're now going to try and do the same thing, but this time using Pandas. We're going to use [Pandas merge](https://pandas.pydata.org/docs/reference/api/pandas.merge.html). We'll need a few packages for working with Excel and of course Pandas itself.

```
pip install numpy pandas openpyxl xlrd
```

Before running this script, I placed each table into it's own sheet within the 'Index Match Python Problem.xlsx' workbook.

```python [index-match-merge-solution.py]
import pandas as pd

excel_file = pd.ExcelFile("Index Match Python Problem.xlsx")
orders = pd.read_excel(excel_file, sheet_name="Orders")
order_details = pd.read_excel(excel_file, sheet_name="OrderDetails")
products = pd.read_excel(excel_file, sheet_name="Products")

df = pd.merge(
    left=order_details,
    right=products,
    left_on="ProductID",
    right_on="ID",
    how="inner"
)

df["TotalPrice"] = df["ListPrice"] * df["Quantity"]
df.to_csv("outputs/merge-output.csv", index=False)
```

We read in each sheet from the Excel workbook, merge the OrderDetails with the Products table on the ProductID and ID columns, then calculate TotalPrice and output to CSV.

If the ID columns were named the same, we could have just used the `on=` argument, however `left_on=` and `right_on=` allows us to specify different column names to merge on. By also using the `how=` argument we can specify what kind of merge we want to perform. For those familiar with SQL JOINs, here we are using an inner join, which is the most common generally. For those unfamiliar, I find this [Visual JOIN](https://joins.spathon.com/) a great way to understand what's happening. You can also see a summary of each in the table below.

| Join Type | Description                                                                                                                                                           |
|-----------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| inner     | selects records that have matching values in both dataframes.                                                                                                         |
| left      | returns all records from the left dataframe, and the matching records from the right dataframe. The result is null records from the right side, if there is no match. |
| right     | returns all records from the right dataframe, and the matching records from the left dataframe. The result is null records from the left side, if there is no match.  |
| outer     | returns all records when there is a match in left or right dataframe records.                                                                                         |
| cross     | returns cartesian product of both dataframes (number of rows in the first dataframe multiplied by the number of rows in the second dataframe).                        |

Be aware cross merges can result in very large result sets, you also don't need the `on=` argument, since both tables are merged on every record.

This script produces the CSV we can see below in the output folder.

<article-image 
  src="https://res.cloudinary.com/dayqxxsip/image/upload/v1639840234/App%20Images/Blog%20Images/Article%20Images/Index%20Match%20Pandas/Python_Inner_Merge_uhi3ks_jkiblu.png" 
  alt="Inner merge solution" 
  loading="lazy" 
  styling=""
  caption="The output from the inner merge, matched on ProductID and ID" 
  captionsrc="https://res.cloudinary.com/dayqxxsip/image/upload/v1639840234/App%20Images/Blog%20Images/Article%20Images/Index%20Match%20Pandas/Python_Inner_Merge_uhi3ks_jkiblu.png" 
  :showsource="false">
</article-image>

All done! We have the output for OrderDetails showing the ProductName and TotalPrice. You might notice that this isn't sorted the same way as in our original Excel file. This is because by using an inner merge, we are using intersection of keys from both dataframes (ProductID and ID). We can change this to a 'left' merge to use only the keys from the left dataframe.

```python
df = pd.merge(
    left=order_details,
    right=products,
    left_on="ProductID",
    right_on="ID",
    how="left"
)
```

<article-image 
  src="https://res.cloudinary.com/dayqxxsip/image/upload/v1639840233/App%20Images/Blog%20Images/Article%20Images/Index%20Match%20Pandas/Python_Left_Merge_drsavo_hfxnrg.png" 
  alt="Left merge solution" 
  loading="lazy" 
  styling=""
  caption="The output from the left merge - this keeps the same order as before on OrderID" 
  captionsrc="https://res.cloudinary.com/dayqxxsip/image/upload/v1639840233/App%20Images/Blog%20Images/Article%20Images/Index%20Match%20Pandas/Python_Left_Merge_drsavo_hfxnrg.png" 
  :showsource="false">
</article-image>

If you want to try out a 'right' merge, I added a Product called 'Robot' in 'Index Match Python Problem.xlsx' that isn't included in any orders so wouldn't show up using a left or inner join as there is no match. 

If you wanted to drop any unneeded columns, like the ID and ListPrice columns from the right dataframe you can add a line before outputting to CSV.

```python
df.drop(columns=["ID", "ListPrice"], axis=1, inplace=True)
```

## Merging multiple tables

Using the same dataset, we will now look at a more advanced example to demonstrate the power of merging. We'll write a function to retrieve order information for a given OrderID and CustomerName. This merges together all tables Orders, OrderDetails and Products. 

```python [index-match-order-lookup.py]
import pandas as pd

def load_data():
    excel_file = pd.ExcelFile("Index Match Python Problem.xlsx")
    orders = pd.read_excel(excel_file, sheet_name="Orders")
    order_details = pd.read_excel(excel_file, sheet_name="OrderDetails")
    products = pd.read_excel(excel_file, sheet_name="Products")

    return orders, order_details, products


def get_order_information(id, customer_name):
    orders, order_details, products = load_data()

    order = orders.loc[
        (orders['OrderID'] == id) & 
        (orders['Customer'] == customer_name)
    ]

    order_info = pd.merge(
        left=order,
        right=order_details,
        on="OrderID",
        how="inner"
    )

    order_info = pd.merge(
        left=order_info,
        right=products,
        left_on="ProductID",
        right_on="ID"
    )

    order_info["TotalPrice"] = order_info["ListPrice"] * order_info["Quantity"]
    order_info.drop(columns=["ID", "ListPrice"], inplace=True)
    products = order_info.groupby(["OrderID"])["ProductName"].agg(list)

    order_info = order_info \
        .groupby(["OrderID", "Customer"])['ProductName', 'TotalPrice'].agg(sum) \
        .reset_index()

    order_info["Products"] = products.values
    
    print(order_info)

    order_info.to_csv(f"outputs/order-information-for-id-{id}.csv", index=False)


if __name__ == "__main__":
    get_order_information(id=4, customer_name="Mike")
```
<code-runner :output="['OrderID  Customer  TotalPrice  Products',
  '4  Mike  110  [Desk Lamp, Mousemat]']" 
  filename="index-match-order-lookup.py" 
  language="Python">
</code-runner>

We load the dataframes, use [loc](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html) to find the rows in OrderDetails where where OrderID and Customer is a match with the inputs giving us the `order` itself. We inner merge `order` with `order_details`, then merge that with `products`. We calculate TotalPrice, drop any columns not required, and aggregate the `products` into a list. Finally, we group by the OrderID and calculate the sum of each OrderDetail, and add the Products for the order. 

Going back to verify we can see for OrderID 4, Mike did indeed purchase three Desk Lamps and a Mousemat for a combined total of £110! He must really like Desk Lamps!

This is a script I will keep coming back to, as it provides so many useful things you might want to do. Particularly if you don't necessarily want to merge you just want to 'lookup' or 'filter' the dataframe by one or more criteria - for this example we filtered on both OrderID and Customer name to demonstrate. The line using [loc](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html) can be applied to other datasets to achieve this. You can also filter without using [loc](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html) like in the alternative below but [this post](https://stackoverflow.com/questions/38886080/python-pandas-series-why-use-loc) explains why it might be better to use it.

```python 
order = orders[
  (orders["OrderID"] == id) & 
  (orders["Customer"] == customer_name)
]
```

We could also do something like this to lookup a single value like the name of the customer for the given OrderID.

```python
orders, order_details, products = load_data()
customer_name = orders.at[orders.loc[orders["OrderID"] == 4].index[0], "Customer"]  
print(customer_name)
```
<code-runner :output="['Mike']" 
  filename="" 
  language="Python">
</code-runner>

An alternative to Pandas merge is to use [join](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.join.html#pandas.DataFrame.join) which is very similar. The [Pandas documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html#database-style-dataframe-or-named-series-joining-merging) gives a comparison for those who wish to learn more.

## Bonus: Stacking multiple tables

As a bonus, what if we're not trying to merge multiple tables, but stack multiple tables? First of all, this is what I mean by stack. Let's say you have two or more tables that all need 'stacking' on top of one another.

<article-image 
  src="https://res.cloudinary.com/dayqxxsip/image/upload/v1639840234/App%20Images/Blog%20Images/Article%20Images/Index%20Match%20Pandas/Python_Concat_Before_rtcnm7_ahmew2.png" 
  alt="Concat example before" 
  loading="lazy" 
  styling=""
  caption="We want to 'stack' or concatenate these tables" 
  captionsrc="https://res.cloudinary.com/dayqxxsip/image/upload/v1639840234/App%20Images/Blog%20Images/Article%20Images/Index%20Match%20Pandas/Python_Concat_Before_rtcnm7_ahmew2.png" 
  :showsource="false">
</article-image>

It might be hundreds of different CSV files that need bringing together! We can use [Pandas concat](https://pandas.pydata.org/docs/reference/api/pandas.concat.html) to handle this. This script targets the 'logs' folder and stacks all 12 CSV files into one file. Each CSV has 37 rows, so after combining we should expect 444 rows.

```python [stack-with-concat-solution.py]
import glob
import pandas as pd
from pandas.core.reshape.concat import concat

csv_files = glob.glob("logs/*.csv")
dataframes = []

for filename in csv_files:
  df = pd.read_csv(filename, index_col=None, header=0)
  dataframes.append(df)

concatenated_df = pd.concat(dataframes, axis=0, ignore_index=True)
print(concatenated_df.shape)

concatenated_df.to_csv(f"logs/concatenated.csv", index=False)
```
<code-runner :output="['(444, 9)']" 
  filename="stack-with-concat-solution.py" 
  language="Python">
</code-runner>

Now all files have been saved to the 'logs' folder in the file 'concatenated.csv' which we can see in the image below. Perfect! This is a super fast way to bring similar but dispersed datasets together and 'stack' them on top of one another. The main thing your source files need, are to all have the same column names so they all align whilst concatenating. 

<article-image 
  src="https://res.cloudinary.com/dayqxxsip/image/upload/v1639840234/App%20Images/Blog%20Images/Article%20Images/Index%20Match%20Pandas/Python_Concat_After_tjep5h_z0kw9l.png" 
  alt="Concat example after" 
  loading="lazy" 
  styling=""
  caption="All 12 files concatenated into one with 444 rows" 
  captionsrc="https://res.cloudinary.com/dayqxxsip/image/upload/v1639840234/App%20Images/Blog%20Images/Article%20Images/Index%20Match%20Pandas/Python_Concat_After_tjep5h_z0kw9l.png" 
  :showsource="false">
</article-image>

A similar option is to use [Pandas append](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.append.html#pandas-dataframe-append), however my understanding is concat is faster as the append method will add rows of the second dataframe to the first dataframe iteratively one at a time. However, the concat function will do a single operation, which makes it faster than append.

<subscribe-form></subscribe-form>

## What we learned

Using Pandas [merge](https://pandas.pydata.org/docs/reference/api/pandas.merge.html) brings the power of SQL database-style joins to Excel, it gives you many more options than an index match ever could and with greater simplicity and scalability. We can also lookup rows and values by given criteria using [loc](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html) and easily 'stack' data from many files using [concat](https://pandas.pydata.org/docs/reference/api/pandas.concat.html). In my opinion, it's essential to keep each on your Data Science toolbelt as you never known when you'll need them!

As always, if you have any questions leave a comment in the comments section, or use the contact button at the bottom of the page to get in touch. You can [download all of the code and files](https://github.com/shedloadofcode/index-match-with-python-and-pandas) used in this article to try things out yourself.

I hope this article helped you out. If you enjoyed this article be sure to check out:

* [How to build a random recipe selector with Python](/blog/how-to-build-a-random-recipe-selector-with-Python/)
* [Developing your data science and analytical coding skills - a review of DataCamp](/blog/developing-your-data-science-and-analytical-coding-skills-a-review-of-datacamp/)
]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[How to batch rename files in folders with Python]]></title>
            <link>https://shedloadofcode.com/blog/how-to-batch-rename-files-in-folders-with-python/</link>
            <guid>https://shedloadofcode.com/blog/how-to-batch-rename-files-in-folders-with-python/</guid>
            <pubDate>Sat, 04 Dec 2021 15:30:00 GMT</pubDate>
            <description><![CDATA[Learn how to conditionally batch rename multiple files in a folder or subfolders recursively.]]></description>
            <content:encoded><![CDATA[
Although there are many tutorials on renaming files with Python, most don’t go into how to create flexible logic to tailor that batch file rename job to your needs. This is a situation I found myself in recently, a seemingly simple request to help rename a few hundred files in a folder. However, not all of the renaming followed a set pattern! Nor did it follow any real pattern at all, so using regex probably wasn’t going to help. This called for a custom script to help out a fellow engineer. 

## Problem

The problem given was that during an automation process hundreds of files had been produced but using the wrong names. These now all needed changing. The files names on the left needed to look like the file names on the right (this is a small sample but there were hundreds of files). As you can see it isn’t a straight up find and replace job, we will need some logic to match a search term to a replacement. For example, if the file name includes X then replace with Y. To trim the identifier at the beginning of the file name we’ll use string slicing.

<article-image 
  src="https://res.cloudinary.com/dayqxxsip/image/upload/v1639924613/App%20Images/Blog%20Images/Article%20Images/Batch%20Renaming%20Files/files-before-and-after_kwpx2l.png" 
  alt="File names before and after" 
  loading="lazy" 
  styling=""
  caption="File names before (left) and what they needed to look like after (right)" 
  captionsrc="https://res.cloudinary.com/dayqxxsip/image/upload/v1639924613/App%20Images/Blog%20Images/Article%20Images/Batch%20Renaming%20Files/files-before-and-after_kwpx2l.png" 
  :showsource="false">
</article-image>

## Solution 

This script makes use of the [os module](https://docs.python.org/3/library/os.html). We provide a folder path and then loop over all of the files within it, renaming with the replacements where the file name contains the search term.

```python [renamer.py]
import os

def rename_files(path):
    replacements = ["_dualforecast", "_narrative", "_pf1", "_summary", "_txn"]
    search_terms = ["CLAIM", "NARRATIVE", "PF1", "SUMMARY", "Txn"]
    count = 0
    
    for filename in os.listdir(path):
        file_path = os.path.join(path, filename)
        name, extension = os.path.splitext(filename)

        for i, term in enumerate(search_terms):
            if term in name:
                prefix = name[:11]
                postfix = replacements[i]
                new_name = os.path.join(path, prefix + postfix + extension)
                os.rename(file_path, new_name)
                continue

        count += 1
    
    print(f"{count} files in folder {path} were renamed.")


if __name__ == "__main__":
    rename_files(r"C:\\Users\\shedloadofcode\\Documents\\TestFolder")
```

Success! All of the files were renamed according to the logic applied and are now in the format like on the right side of the image shown earlier. This logic will also rename any subfolders in the directory too if you were wanting to rename folders rather than files. In this script I have also seperated the file `name` from the `extension` so if you were wanting to say change hundreds of txt files to csv format you can do that with just one change `new_name = os.path.join(path, prefix + postfix + ".csv")`.

If you want to give this script a test drive, download the [test folder](https://github.com/shedloadofcode/batch-rename-files-in-folders), then extract the contents and place the directory 'TestFolder' in your Documents folder ensuring it has the name 'TestFolder'. Then update the path given to the `rename_files` function with your username before running 😄

## Bonus: Recursive batch renaming

You might be thinking, but what if I have files within folders within folders? Do I have to run this in each folder one at a time? Hell no 😆 we can adapt the function to go through every subfolder and perform the rename operation in each recursively. Let's say we have folders A, B and C in the TestFolder directory.

<article-image 
  src="https://res.cloudinary.com/dayqxxsip/image/upload/v1639924613/App%20Images/Blog%20Images/Article%20Images/Batch%20Renaming%20Files/renamer-recursive-before_ewwj0a_juodyq.png" 
  alt="Subfolders before renaming" 
  loading="lazy" 
  styling=""
  caption="Subfolders within folders - we need recursion!" 
  captionsrc="https://res.cloudinary.com/dayqxxsip/image/upload/v1639924613/App%20Images/Blog%20Images/Article%20Images/Batch%20Renaming%20Files/renamer-recursive-before_ewwj0a_juodyq.png" 
  :showsource="false">
</article-image>

Now let's take a look at the recursive function we'll run against the TestFolder directory path.

```python [recursive-renamer.py]
import os

def rename_files_recursively(root_path):
    replacements = ["_dualforecast", "_narrative", "_pf1", "_summary", "_txn"]
    search_terms = ["CLAIM", "NARRATIVE", "PF1", "SUMMARY", "Txn"]
    count = 0

    for path, subdirs, files in os.walk(root_path):
        for filename in files:
            file_path = os.path.join(path, filename)
            name, extension = os.path.splitext(filename)

            for i, term in enumerate(search_terms):
                if term in name:
                    prefix = name[:11]
                    postfix = replacements[i]
                    new_name = os.path.join(path, prefix + postfix + extension)
                    os.rename(file_path, new_name)
                    continue

            count += 1

    print(f"{count} files were renamed recursively from root {root_path}")


if __name__ == "__main__":
    rename_files_recursively(r"C:\\Users\\shedloadofcode\\Documents\\TestFolder")
```

Now we can see every file in every subfolder is renamed in one operation. This will also work to any folder tree depth.. subfolders within subfolders within subfolders.. everything. Isn't recursion wonderful? Here are folders A, B and C after the operation completes:

<article-image 
  src="https://res.cloudinary.com/dayqxxsip/image/upload/v1639924613/App%20Images/Blog%20Images/Article%20Images/Batch%20Renaming%20Files/renamer-recursive-after_qrljid_ygchz8.png" 
  alt="Subfolders after renaming" 
  loading="lazy" 
  styling=""
  caption="All files renamed in every folder" 
  captionsrc="https://res.cloudinary.com/dayqxxsip/image/upload/v1639924613/App%20Images/Blog%20Images/Article%20Images/Batch%20Renaming%20Files/renamer-recursive-after_qrljid_ygchz8.png" 
  :showsource="false">
</article-image>

## Adapting to your needs

There we are, two short adaptable and extendable functions that give us everything we need to get the job done! My colleague was certainly happy with the result, they said it worked like a dream. You can easily adapt these functions to your own needs by changing or adding to the conditional logic in the inner loop that processes each file name. Not only does this script apply to conditionally renaming files, but also conditionally deleting files. You could use `os.remove(file_path)` instead of the `os.rename(file_path, new_name)` we used.

Thanks very much for reading, this was a very short article covering how to effectively batch rename files in folders with Python. If you have any questions feel free to leave a comment 👍 

If you enjoyed this article be sure to check out other articles on the site, you may be interested in:

* [How to do an index match with Python and Pandas](/blog/how-to-do-an-index-match-with-python-and-pandas/)
* [How to build a random recipe selector with Python](/blog/how-to-build-a-random-recipe-selector-with-Python/)
* [Developing your data science and analytical coding skills - a review of DataCamp](/blog/developing-your-data-science-and-analytical-coding-skills-a-review-of-datacamp/)]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Five ways to improve your system design and software architecture skills]]></title>
            <link>https://shedloadofcode.com/blog/five-ways-to-improve-your-system-design-and-software-architecture-skills/</link>
            <guid>https://shedloadofcode.com/blog/five-ways-to-improve-your-system-design-and-software-architecture-skills/</guid>
            <pubDate>Thu, 11 Nov 2021 19:38:00 GMT</pubDate>
            <description><![CDATA[Explore five things you can do to improve your system design skills so you can build more solid technical solutions.]]></description>
            <content:encoded><![CDATA[
I always thought working in software development and data science, and building systems more generally, that writing code would take up the majority of the time. I mean you 'learn to code' right? You don't 'learn to design systems'. I thought coding creative solutions to problems would be the main task, and [data structures and algorithms](/blog/exploring-coding-interview-topics-in-python/) would be the main skills. Although these things are very important, it didn’t really turn out to be the case. I found the majority of the time would be spent in meetings explaining to stakeholders and others, how systems worked or would work. I learnt quickly that when features were suggested or requested, it wasn’t coding ability that would allow you to discuss them, it was the ability to discuss whether there was a workable design. This can be particularly hard when you’re just starting out.

After these conceptual discussions of feasibility with the business stakeholders, there would be technical discussions like estimation and planning how to divide the tasks, whether they could be performed concurrently, or whether any exploration was required. Still no code had been written at this point. So the main skill being used here is one of system design. How to either create a new system or extend an existing one. It’s almost like a mini system design interview process. Knowing what’s possible from a system design point of view can really help you zone in on whether an ask is feasible, technically possible, and most importantly, if it’s even needed in the first place! The main reason to constantly improve your architecture aptitude is to always ensure you are building solid, inexpensive, maintainable, scalable and speedy technical solutions or features. Whether that be a machine learning model, a web application, an automated process or anything else, they will all benefit from these things. So if you’ve focused a little too much on the code and neglected your system design and software architecture skills, read on to find out five things you can do to improve them.

## Know the core concepts of system design

Technopedia describes [system design](https://www.techopedia.com/definition/29998/system-design) as:

> “the process of defining the elements of a system such as the architecture, modules and components, the different interfaces of those components and the data that goes through that system. It is meant to satisfy specific needs and requirements of a business or organization through the engineering of a coherent and well-running system.”

System design is a vast subject that includes the following topics:

* Programming paradigms - Object oriented, Functional
* Programming design patterns - [Gang of Four](https://en.wikipedia.org/wiki/Design_Patterns)
* Code organisation
* Frameworks
* Dependencies
* Design principles
* Components
* [Functional](https://en.wikipedia.org/wiki/Functional_requirement) vs. [Non-functional requirements](https://en.wikipedia.org/wiki/Non-functional_requirement)
* N-tier Layering
* Microservices
* Messaging
* Caching 
* Load balancing
* Performance
* Relational and NoSQL databases
* [Database design](https://en.wikipedia.org/wiki/Database_design)
* [Data model design](https://en.wikipedia.org/wiki/Data_modeling)
* API design 
* Polling and Sockets
* User interface design
* Networking and Proxies
* Scaling! Both [horizontal and vertical](https://www.section.io/blog/scaling-horizontally-vs-vertically/#:~:text=Horizontal%20scaling%20means%20scaling%20by,as%20%E2%80%9Cscaling%20up%E2%80%9D)
* Capacity and demand estimations
* Storage
* Fault tolerance
* Maintainability
* Extensibility
* Accessibility - [WCAG 2.1](https://www.w3.org/TR/WCAG21/)
* Security - [OWASP Top Ten](https://owasp.org/www-project-top-ten/#)
* Analytics and Machine Learning
* Communication
* Authentication - [OIDC](https://en.wikipedia.org/wiki/OpenID), [WsFederation](https://en.wikipedia.org/wiki/WS-Federation), [JWT](https://jwt.io/)

As we can see from this list, there is so much involved in system design! I think although all of the things listed above are important, the key one to understand scalability. It is really useful to understand [how to scale a system from 100 to 1,000,000 users](https://systeminterview.com/scale-from-zero-to-millions-of-users.php). The seperation of concerns is a key factor in scalability, hence the adoption of [N-tier architecture](https://www.techopedia.com/definition/17185/n-tier-architecture#:~:text=N%2Dtier%20architecture%20is%20a,both%20logically%20and%20physically%20separated.&text=N%2Dtier%20architecture%20is%20also%20known%20as%20multi%2Dtier%20architecture.), [MVC (Model-View-Controller)](https://en.wikipedia.org/wiki/Model%E2%80%93view%E2%80%93controller) or [MVVM (Model-View-ViewModel)](https://en.wikipedia.org/wiki/Model%E2%80%93view%E2%80%93viewmodel) and [Microservices](https://en.wikipedia.org/wiki/Microservices) or [service oriented architecture](https://en.wikipedia.org/wiki/Service-oriented_architecture) patterns.

An analytics system I have been in the process of designing recently had the same pattern as the [scaling to millions of users diagram](https://systeminterview.com/imgs/top10/millions_of_users.png). I think it is a great starting point for most designs. I am a big fan of the [component template](https://systeminterview.com/drawing.php) from systeminterview.com and utilised it in this design diagram.

<article-image 
  src="https://res.cloudinary.com/dayqxxsip/image/upload/v1639839066/App%20Images/Blog%20Images/Article%20Images/Improving%20System%20Design%20Skills/system-design-example-wide.drawio_bjeg1k_hhit1q.png" 
  alt="Recent system design diagram" 
  loading="lazy" 
  styling=""
  caption="A recent system design diagram of mine" 
  captionsrc="https://res.cloudinary.com/dayqxxsip/image/upload/v1639839066/App%20Images/Blog%20Images/Article%20Images/Improving%20System%20Design%20Skills/system-design-example-wide.drawio_bjeg1k_hhit1q.png" 
  :showsource="false">
</article-image>

Just knowing the core components and concepts that go into a robust solid system, then allows you to start putting components together and building your own designs. It also means you have a general awareness of the kinds of things to start considering learning more about. Whether it be networking, databases or object oriented design, you'll no doubt find some weaker area that you can go away and read up on. 

## Learn from the designs of existing systems

You can learn a lot from systems that have already been built and most are documented online. I created this Trello board which served as my checklist and notes on designing various systems from Netflix and YouTube to Facebook and Amazon. This not only meant I would be more prepared for any system design interviews, but for building real systems within those industries such as e-commerce, video streaming, content management, social media, storage, chat and messaging applications.

<article-image 
  src="https://res.cloudinary.com/dayqxxsip/image/upload/v1639839066/App%20Images/Blog%20Images/Article%20Images/Improving%20System%20Design%20Skills/system-design-trello-board_qhc3ru_wpnn8v.webp" 
  alt="System design questions Trello board" 
  loading="lazy" 
  styling=""
  caption="System design questions Trello board" 
  captionsrc="https://res.cloudinary.com/dayqxxsip/image/upload/v1639839066/App%20Images/Blog%20Images/Article%20Images/Improving%20System%20Design%20Skills/system-design-trello-board_qhc3ru_wpnn8v.webp" 
  :showsource="false">
</article-image>

You can find walkthroughs on how to design some of these systems on [The System Design Primer](https://github.com/donnemartin/system-design-primer#system-design-interview-questions-with-solutions) Github page. There is also a page exploring [real world architectures](https://github.com/donnemartin/system-design-primer#real-world-architectures). 

Another key resource to explore are [company technical blogs](https://github.com/donnemartin/system-design-primer#company-engineering-blogs), where you can get an inside view on design decisions taken by engineering teams. These are some of my favourite engineering and data blogs at the moment:

* [Discord Engineering Blog](https://blog.discord.com/engineering-posts/home)
* [Twitter Engineering Blog](https://blog.twitter.com/engineering/en_us)
* [Instagram Engineering Blog](https://instagram-engineering.com/)
* [LinkedIn Engineering Blog](https://engineering.linkedin.com/blog)
* [Data in Government Blog](https://dataingovernment.blog.gov.uk/)
* [Dropbox Infrastructure Blog](https://dropbox.tech/infrastructure)
* [Stripe Engineering Blog](https://stripe.com/blog/engineering)
* [Government Digital Service Blog](https://gds.blog.gov.uk/)
* [Heroku Engineering Blog](https://blog.heroku.com/engineering)
* [Netflix Tech Blog](https://netflixtechblog.com/?gi=14887958ebcb)
* [Spotify Engineering Blog](https://engineering.atspotify.com/)
* [Airbnb Tech Blog](https://medium.com/airbnb-engineering)
* [Uber Engineering Blog](https://eng.uber.com/)
* [Google Developers Blog](https://developers.googleblog.com/)

## Follow a framework for practical system design

A quote by Richard Pattis I added to my [favourite quotes article](/blog/programming-quotes-that-offer-wisdom-and-motivation/) says 'If you cannot grok the overall structure of a program while taking a shower, you are not ready to code it'. This means the design is not clear enough, and for me the essence of good software design is reducing and managing complexity. That is, the ability to easily understand how a system works, and therefore easily modify it. This is the main theme of the book [A Philosophy of Software Design](https://www.amazon.co.uk/Philosophy-Software-Design-2nd/dp/173210221X/) which I found really insightful. 

To create a clear design, a framework can help to structure it and ensure nothing is missed out. I like the [PEDALS method](https://www.lewis-lin.com/blog/pedals-method) from The [System Design Interview](https://www.amazon.co.uk/System-Design-Interview-2nd/dp/B09559NJKL/ref=sr_1_4?keywords=the+system+design+interview&qid=1636480313&s=books&sr=1-4) to guide the process of architecting a system. This stands for:

* Process requirements
* Estimate
* Design the system
* Articulate the data model
* List the architectural components
* Scale

This provides a nice easy to remember process to kick off designing a system. I know this is geared to system design interviews but really the process should also be very useful on the job. I mean I’ve always thought of a system design interview as a conversation between two or more engineers that need to plan out a solution, this process facilitates that conversation very well.

Once you've mastered using the PEDALS framework, you might want to explore more 'enterprise-level' architecture frameworks. These might include [The Open Group Architecture Framework (TOGAF)](https://en.wikipedia.org/wiki/The_Open_Group_Architecture_Framework) and [The Zachman Framework](https://en.wikipedia.org/wiki/Zachman_Framework).

## Explore cloud computing providers and services

With most solutions now deployed using cloud infrastructure it helps to know the range of cloud providers and their offerings. The big players are [Microsoft Azure](https://azure.microsoft.com/en-gb/), [Amazon Web Services](https://aws.amazon.com/) (AWS) and [Google Cloud Platform](https://cloud.google.com/) (GCP). Other providers include Heroku, Linode, IBM Cloud, Digital Ocean and more. Each provider offers a whole range of services, with [Microsoft Azure](https://azure.microsoft.com/en-gb/overview/what-is-azure/) for example provides over [200 products and cloud services](https://azure.microsoft.com/en-gb/services/). These include Machine Learning, Virtual Machines, Chatbots, Web App Hosting, Storage, Databases, Serverless Functions and many other services.

<article-image 
  src="https://res.cloudinary.com/dayqxxsip/image/upload/v1639839066/App%20Images/Blog%20Images/Article%20Images/Improving%20System%20Design%20Skills/azure-products_tfeam9_teajqs.png" 
  alt="Azure cloud products" 
  loading="lazy" 
  styling=""
  caption="Azure cloud products" 
  captionsrc="https://res.cloudinary.com/dayqxxsip/image/upload/v1639839066/App%20Images/Blog%20Images/Article%20Images/Improving%20System%20Design%20Skills/azure-products_tfeam9_teajqs.png" 
  :showsource="false">
</article-image>

<article-image 
  src="https://res.cloudinary.com/dayqxxsip/image/upload/v1639839066/App%20Images/Blog%20Images/Article%20Images/Improving%20System%20Design%20Skills/aws-products_n9qpqo_ja9sie.png" 
  alt="AWS cloud products" 
  loading="lazy" 
  styling=""
  caption="AWS cloud products" 
  captionsrc="https://res.cloudinary.com/dayqxxsip/image/upload/v1639839066/App%20Images/Blog%20Images/Article%20Images/Improving%20System%20Design%20Skills/aws-products_n9qpqo_ja9sie.png" 
  :showsource="false">
</article-image>

The [benefits of cloud computing](https://www.salesforce.com/products/platform/best-practices/benefits-of-cloud-computing/) are numerous and it seems most big companies and government organisations are moving towards the cloud, so having a good understanding of the providers and services and how they fit into a robust cloud based architecture is vital. Most of the cloud services providers offer free introductory trials - usually for 12 months. This allows you to try some of their services, and build your own cloud based solutions for practice. You usually need a credit card to register, but as long as you only use the free services you shouldn't be charged. Always check your costs section though, as if you use any service not part of the free trial, it will be added to your bill. This is good practice for making sure you're provisioning cost efficient services, and keeping an eye on their cost as demand and usage increases! That skill alone is vital for a company to manage costs and prevent them from spiralling out of control. Most cloud services providers have tools to calculate product usage costs, here is the [Azure Pricing Calculator](https://azure.microsoft.com/en-gb/pricing/calculator/) as an example.

To learn more about cloud solution architecture (but geared towards AWS) a good book is [Solution Architect’s Handbook](https://www.amazon.co.uk/Solutions-Architects-Handbook-Kick-start-architecture/dp/1838645640/ref=sr_1_1).

## Study, practice then prototype

System design is a huge topic and can feel overwhelming. I think the more you implement different aspects of systems, you learn what’s possible and it becomes easier. Therefore the best way to continually improve your system design expertise is constant learning and experimenting with new ideas. I like using [diagrams.net](https://diagrams.net) for planning out a design - a free open source tool. 

<article-image 
  src="https://res.cloudinary.com/dayqxxsip/image/upload/v1639839066/App%20Images/Blog%20Images/Article%20Images/Improving%20System%20Design%20Skills/using-diagrams-draw-io_ml6udl_oedpyw.png" 
  alt="Using diagrams.net for architecture diagrams" 
  loading="lazy" 
  styling=""
  caption="Using diagrams.net for architecture diagrams" 
  captionsrc="https://res.cloudinary.com/dayqxxsip/image/upload/v1639839066/App%20Images/Blog%20Images/Article%20Images/Improving%20System%20Design%20Skills/using-diagrams-draw-io_ml6udl_oedpyw.png" 
  :showsource="false">
</article-image>

You can practice estimating system capacity using our [System Capacity Calculator](/tools/system-capacity-calculator). If you decide to read [The System Design Interview](https://www.amazon.co.uk/System-Design-Interview-2nd/dp/B09559NJKL/ref=sr_1_4?keywords=the+system+design+interview&qid=1636569963&sr=8-4) book, you can run the scenario metrics in chapter 4 (Estimates) on page 27 through the calculator. This calculator was built to help with both system design interview scenarios, alongside building real world scalable systems.

After planning the design, go ahead and try to build a small working prototype of the system in your selected tech stack. This will teach you a lot of how a more complex version of the system might work. This is an essential step and reminds me of [one of my favourite quotes](/blog/programming-quotes-that-offer-wisdom-and-motivation/) from John Gall "A complex system that works is invariably found to have evolved from a simple system that worked. A complex system designed from scratch never works and cannot be patched up to make it work. You have to start over, beginning with a working simple system".

## Conclusion

The key takeaway is to never stop learning, practising and improving your system design skills. Using the five things discussed in this article, you'll be able to improve no matter your current experience level. We've never had access to more opportunities to learn and improve skills. This morning at 11am, I observed a two minute silence for [Remembrance Day](https://en.wikipedia.org/wiki/Remembrance_Day) which is a reminder how lucky we are to have the freedom and tools to learn.

I should also explain why I chose the cover image I did for this article. It was in reference to [Margaret Hamilton](https://en.wikipedia.org/wiki/Margaret_Hamilton_(software_engineer)) who led to the team which developed the onboard flight software for the [Apollo space program](https://en.wikipedia.org/wiki/Apollo_program). Here is an [interesting interview about her journey](https://www.youtube.com/watch?v=4sKY6_nBLG0). An incredible feat of software development and engineering, and a very inspirational story on how important well designed, well built systems are when other's lives are on the line. 

Finally, here are some recommended resources for further learning:
 
* [System Design Playlist by Gaurev Sen](https://youtube.com/playlist?list=PLMCXHnjXnTnvo6alSjVkgxV-VH6EPyvoX)
* [System Design Interview](https://www.amazon.co.uk/System-Design-Interview-insiders-Second/dp/B08CMF2CQF/)
* [The System Design Interview](https://www.amazon.co.uk/System-Design-Interview-2nd/dp/B09559NJKL/)
* [Solution Architect's Handbook](https://www.amazon.co.uk/Solutions-Architects-Handbook-Kick-start-architecture/dp/1838645640/)
* [A Philosophy of Software Design](https://www.amazon.co.uk/Philosophy-Software-Design-2nd/dp/173210221X/)
* [Web Scalability for Startup Engineers](https://www.amazon.co.uk/Scalability-Startup-Engineers-Artur-Ejsmont/dp/0071843655/)
* [Release It!: Design and Deploy Production-Ready Software](https://www.amazon.co.uk/Release-Design-Deploy-Production-Ready-Software/dp/1680502395/)
* [Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems](https://www.amazon.co.uk/Designing-Data-Intensive-Applications-Reliable-Maintainable/dp/1449373321)
* [Software Engineering at Google: Lessons Learned from Programming Over Time](https://www.amazon.co.uk/Software-Engineering-Google-Lessons-Programming/dp/B08VKJXVHK)
* [The Imagineering Process: Using the Disney Theme Park Design Process to Bring Your Creative Ideas to Life](https://www.amazon.co.uk/Imagineering-Pyramid-Principles-Develop-Creative/dp/194150096X)]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Exploring coding interview topics in Python]]></title>
            <link>https://shedloadofcode.com/blog/exploring-coding-interview-topics-in-python/</link>
            <guid>https://shedloadofcode.com/blog/exploring-coding-interview-topics-in-python/</guid>
            <pubDate>Thu, 02 Sep 2021 16:22:00 GMT</pubDate>
            <description><![CDATA[Improve your understanding of algorithms, data structures and time complexity for coding interviews.]]></description>
            <content:encoded><![CDATA[
There are fundamental topics on algorithms and data structures that need to be understood for coding interviews. I am certainly no expert on coding interviews themselves, but I did embark on working through [Elements of Programming Interviews in Python](https://www.amazon.co.uk/Elements-Programming-Interviews-Python-Insiders/dp/1537713949/ref=pd_bxgy_img_2/262-9365292-3109168?pd_rd_w=Y6OlR&pf_rd_p=c7ea61ca-7168-47e3-9c8b-d84748f5b23c&pf_rd_r=D0WECF6DRCT5DPW9E23H&pd_rd_r=1f09cc37-a87b-404f-8ed5-79f25c54beb0&pd_rd_wg=zODeE&pd_rd_i=1537713949&psc=1). I used this book in combination with it’s companion [EPI-Judge](https://github.com/adnanaziz/EPIJudge) and also [LeetCode](https://leetcode.com/) problems. 

I did this to become better at programming in general and to brush up on algorithms and data structures. I’m not sure if you agree, but I feel (and have heard others say) that most of the time, the kinds of problems you find in programming interview questions are not the same as what actually occur on the job. Probably more of the job is focused on [system design and architecture](/blog/five-ways-to-improve-your-system-design-and-software-architecture-skills/) instead. Nevertheless, they have their merits and I admit they made me think more algorithmically and improved the efficiency of my code. 

With all this in mind, in this article I’ve collated what I think are good examples mostly from LeetCode that help with learning and applying the concepts in the real world. They cover the major coding interview topics. It should be a good overview for those new to these topics, and a good reminder for those wishing to recap knowledge that might not have been used for a while. This is a long article you can use as a reference again and again - you can use the contents panel above to find your way back to the relevant section more easily.

## Big-O Notation

Before diving into the topics and examples, it's important to understand [Big-O Notation](https://en.wikipedia.org/wiki/Big_O_notation#:~:text=Big%20O%20notation%20is%20a,a%20particular%20value%20or%20infinity.&text=In%20computer%20science%2C%20big%20O,as%20the%20input%20size%20grows.) first. This allows us to assess the efficiency of an algorithm. For time complexity we ask how fast does the algorithm execute it's operations as the input size scales and becomes very large. For space complexity we ask how much memory will the algorithm consume as the input size scales and becomes very large. Space complexity consists of auxiliary space (space for extra variables and data structures we declare), input space (space for the given input) and stack space (for recursion). As the input size can vary, it is referred to as *n*. Below are the common notations you will see, with complexities ordered from smallest to largest, along with examples. These notations can be applied to both time and space complexity.

<article-image 
  src="https://res.cloudinary.com/dayqxxsip/image/upload/v1639930533/App%20Images/Blog%20Images/Article%20Images/Coding%20Interview%20Topics/big-o-complexity-chart_jrrbua_wv0k6q.jpg" 
  alt="Big-O Cheat Sheet" 
  loading="lazy" 
  styling=""
  caption="https://www.bigocheatsheet.com/ (great resource!)" 
  captionsrc="https://www.bigocheatsheet.com/" 
  :showsource="true">
</article-image>

**Constant time: O(1)**

Where an algorithm does not depend on input size *n*, it runs in constant time. In this example, the loop will always run 100 times.

```python
count = 0
for i in range(100):
  count += 1
print(count)
```

Other examples include accessing an array by index, adding or removing an element from an array, looking up a value in a dictionary (hashmap) and arithmetic operations.

**Logarithmic time: O(log(n))**

Where an algorithm's run time grows in proportion to the logarithm of the input size *n*. This means the algorithm isn't really affected by the input size and still runs rapidly on large inputs.

Using Binary Search to find an element in a sorted list is a good example. The algorithm uses a "divide and conquor" approach, it jumps to the middle of the list, divides the list into two and repeats until the element is found. So the algorithm is reducing the size of the input at each step therefore doesn't need to check every value.

```python
n = 1000000
my_list = list(range(n)) # generates a list of numbers 0 through to "n"

def binary_search(array, target_value):
  list_length = len(array)
  left = 0
  right = list_length - 1
  while left <= right:
    middle = (left + right) // 2 # // performs integer division rather than floating-point division
    if target_value < array[middle]:
      right = middle - 1
    elif target_value > array[middle]:
      left = middle + 1
    else:
      return middle
  return "Search completed but value not found in array"

search_result = binary_search(my_list, 300000)
print(search_result)
```

This example searches a sorted array of integers 0 through to 1000000 (*n*). The algorithm sets a `left` and `right` index, then while the left index is lower than or equal to the right, checks whether the target value is less than the middle or greater than the middle, adjusting the left or right indexes accordingly to "split" the array. If the target value isn't less than or greater than the middle, we've found it and can just return `middle` 😄 This algorithm runs extremely fast even if the list input size *n* grows larger.

**Linear time: O(n)**

Where an algorithm depends on input size *n*, it runs in linear time.

```python
n = 10000000
count = 0
for i in range(n):
  count += 1
print(count)
```

**Linearithmic time: O(n log(n))**

Where an algorithm uses a combination of linear and logarithmic time complexity. In the first place a linear search taking O(n) occurs followed by a reduction by half which means the next operation is O(log(n)) - we saw this "divide and conquer" approach used in Binary Search earlier. Therefore it's O(n*log(n)). Examples include Merge Sort, Quick Sort and Heap Sort. Let's implement Merge Sort to sort an array.

```python
my_list = [54, 567, 26, 93, 17, 77, 31, 44, 55, 20, 44, 55, 14, 52]

def merge_sort(array: list):
  if len(array) > 1:
    # split the array into two
    print("Splitting", array)
    middle = len(array) // 2
    left = array[:middle]
    right = array[middle:]

    # recursive calls
    print("Recursing")
    merge_sort(left)
    merge_sort(right)

    # merge
    i = 0 # index to traverse the left array
    j = 0 # index to traverse the right array
    k = 0 # index for the main array

    # compare the left array and right array and 
    # overwrite the main array with the lowest value
    print("Merging ", array)
    while i < len(left) and j < len(right):
      if left[i] < right[j]:
        array[k] = left[i]
        i += 1
      else:
        array[k] = right[j]
        j += 1
      k += 1
    
    # transfer all remaining values in the left array
    while i < len(left):
      array[k] = left[i]
      i += 1
      k += 1

    # transfer all remaining values in the right array
    while j < len(right):
      array[k] = right[j]
      j += 1
      k += 1

merge_sort(my_list)
print(my_list)  # [14, 17, 20, 26, 31, 44, 44, 52, 54, 55, 55, 77, 93, 567]
```

An initial array is divided into two roughly equal parts. If the array has an odd number of elements, one of those "halves" is by one element larger than the other. The subarrays are divided over and over again into halves until you end up with arrays that have only one element each. Then you combine the pairs of one-element arrays into two-element arrays, sorting them in the process. Then these sorted pairs are merged into four-element arrays, and so on until you end up with the initial array sorted. This [CS50 video](https://www.youtube.com/watch?v=Ns7tGNbtvV4) explains the process in more detail.

You can visualise Merge Sort through it's [algorithm diagram](https://commons.wikimedia.org/wiki/File:Merge_sort_algorithm_diagram.svg). If you want some real fun you can check out what Merge Sort looks like in real time [in this video](https://youtu.be/kPRA0W1kECg?t=67).

**Quadratic time: O(n²)**  

Where an algorithm has two nested loops / iterations, it runs in quadratic time.

```python
n = 100
array_x = [42] * n  # list which is the length of "n" with all the same elements 42

def print_all_array_pairs(array_x):
  count = 0
  for i in range(len(array_x)):
    for j in range(len(array_x)):
      print(array_x[i], array_x[j])
      count += 1
  print(count)

print_all_array_pairs(array_x)
```

The final run count is 10000 for this example where *n* is 100. As 100^2 = 10000 so it's O(n²).

However a nested loop where the input sizes are different would be O(n*y). 

```python
x, y = 50, 100
array_x = [42] * x # list which is the length of "x" with all the same elements 42
array_y = [22] * y # list which is the length of "y" with all the same elements 22

def print_all_array_pairs(array_x, array_y):
  count = 0
  for i in range(len(array_x)):
    for j in range(len(array_y)):
      print(array_x[i], array_y[j])
      count += 1
  print(count) 

print_all_array_pairs(array_x, array_y)
```

This is because the inner loop with a constant number of iterations is run y times for each iteration of the outer loop that is run x times. In this example the the outer loop runs for the length of `array_x` which is 50 and the inner loop runs for the length of `array_y` which is 100. So the final `count` is 5000 which is 50 x 100 therefore O(n*y).
 
**Cubic time: O(n³)**

Where an algorithm has three nested loops or iterations, it runs in cubic time. Here is an example I made to find the sum of all the numbers with three loops:

```python
my_list = [44, 55, 63, 123, 54, 43, 34, 54] # "n" is the length of the list which is 8

def sum_all_numbers(array: list):
  sum_of_numbers = 0
  run_count = 0

  for i in my_list:
    sum_of_numbers += i
    for j in my_list:
      sum_of_numbers += j
      for k in my_list:
        sum_of_numbers += k
        run_count += 1
        print(i, j, k)

  return sum_of_numbers, run_count

result, run_count = sum_all_numbers(my_list)
print(f"Sum with three nested iterations: " + str(result)) # Summing this list across three nested loops is 34310
print("Run count: " + str(run_count)) # Run count where "n" is 8 is 8^3 so 512
```

The print statements helps to visualise what's happening as `i` `j` and `k` iterate over the list. As the list in this example has a length of 8 then n = 8. We're iterating over the list with three loops so the time complexity is O(n³), therefore the total executions stored in `run_count` is 8^3 or 8x8x8 which is 512.

**Exponential time: O(2^n)**

Where an algorithm's run time doubles with each addition to the input. Iterating through subsets comes to mind here. A good example is the use of a recursive algorithm to calculate Fibonacci numbers. The Fibonacci sequence is where each number is the sum of the two preceding numbers starting from 0 and 1. So 0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, ... So if we set *n* = 9, the 9th number in the sequence starting from 1 is 34, and this runs very fast. But what if we want the 40th number in the sequence?

```python
import time

def find_nth_number_in_fibonacci_sequence(n):
    if n <= 1:
        return n
    return find_nth_number_in_fibonacci_sequence(n - 2) + \
           find_nth_number_in_fibonacci_sequence(n - 1)

n = 40
start = time.time()
print(find_nth_number_in_fibonacci_sequence(n)) # 40th number in the fibonacci sequence is 102334155
end = time.time()
print(f"Time taken: {end - start} seconds") # This took 59.58 seconds for me
```

You can see the greater the number in the sequence you're looking for, the more recursive calls are required. You can better visualise what's happening in a [recursion diagram](https://www.google.com/search?q=fibonacci+recursive+diagram&tbm=isch&ved=2ahUKEwiZ09jiuOrxAhUHmhoKHRQxAbsQ2-cCegQIABAA&oq=fibonacci+recursive+diagram&gs_lcp=CgNpbWcQAzoECAAQQzoCCAA6BAgAEB46BggAEAgQHjoECAAQGFDIN1jbP2D3QGgAcAB4AIABVIgB9QOSAQE4mAEAoAEBqgELZ3dzLXdpei1pbWfAAQE&sclient=img&ei=VvryYJnQDYe0apTihNgL&bih=1007&biw=1920). In the Dynamic Programming / Memoization section later in the article, we'll look at how to dramatically improve this run-time.

**Factorial time: O(n!)**

Where an algorithm's run time increases factorially with the increase in input size. A quick refresher on what a factorial is:

> the product of all positive integers less than or equal to a given positive integer and denoted by that integer and an exclamation point. Thus, factorial seven is written 7!, meaning 1 × 2 × 3 × 4 × 5 × 6 × 7. - [Britannica](https://www.britannica.com/science/factorial)

So to find the factorial of a number *n* using a recursive approach has n factorial or O(n!) time complexity. We can see this algorithm will perform exponentially more operations as the input size increases (calculating the factorial of all positive integers before *n*)

```python
def find_factorial(n):
  if n == 1:
    return n
  else:
    return n * find_factorial(n - 1)
```

Now that we've covered Big-O Notation and the common time complexities, we can dive into the topics 😆

## Arrays

**Definition:** An array (list in Python) is a data structure that holds a group of elements, usually of the same data type (but not always) - like this `[1, 2, 3, 4, 5]`

**Example Problem:** Chunk an array into a given size *n*.

**Example Input:** [1, 2, 3, 4, 5, 6, 7], size=3

**Example Output:** [[1, 2, 3], [4, 5, 6], [7]]

```python
import math

def chunk(collection: list, size: int) -> list:
  result = []
  count = 0
    
  for i in range(math.ceil(len(collection) / size)):
    start = i * size
    end = start + size
    result.append(collection[start:end])
    count += 1
  
  print(count)
  return result

if __name__ == "__main__":
  print(chunk([1, 2, 3, 4, 5, 6, 7], size=2))
  print(chunk([1, 2, 3, 4, 5, 6, 7], size=3))
  print(chunk([1, 2, 3, 4, 5, 6, 7], size=4)) 
```

**Explanation:** On each iteration a new start and end index is defined, and the array is sliced then added to the new result array. The run time of this algorithm is O(n) as the loop runs the length of `collection` *n*/`size`, each loop runs a slice operation of `size` using `start` and `end`. So the time complexity is *n*/size * size which is *n*.

**Practical use:** A practical use of an algorithm like this I have seen is creating a navigation tile layout on a webpage. Of course there are libraries that fulfil this need too, but why not implement it yourself to cut down your list of dependencies 😄 This is the first example I could find to illustrate.

<article-image 
  src="https://res.cloudinary.com/dayqxxsip/image/upload/v1639930533/App%20Images/Blog%20Images/Article%20Images/Coding%20Interview%20Topics/tile-design_bew2dh_pmqaaj.webp" 
  alt="Web page tile navigation" 
  loading="lazy" 
  styling=""
  caption="https://www.nhs.uk/live-well/" 
  captionsrc="https://www.nhs.uk/live-well/" 
  :showsource="true">
</article-image>

This is how we might make use of the algorithm to achieve it.

```python
chunks = chunk(["Wellbeing", "Healthy weight", "Exercise", "Sleep", "Eat well", "Alcohol support"], size=3)
  for chunk in chunks:
    for index, item in enumerate(chunk):
      print(item, end="\n") if index == 2 else print(item, end=", ")
```

## Strings

**Definition:** Strings can be thought of as an array, but made up of characters

<article-image 
  src="https://upload.wikimedia.org/wikipedia/commons/4/45/String_Variable_Diagram_Middle_Aspect_Ratio.png" 
  alt="String diagram" 
  loading="lazy" 
  styling=""
  caption="TripleShortOfACycle via Wikimedia Commons, [Public Domain]" 
  captionsrc="https://upload.wikimedia.org/wikipedia/commons/4/45/String_Variable_Diagram_Middle_Aspect_Ratio.png" 
  :showsource="true">
</article-image>

**Example Problem:** [Reverse String (Leetcode 344)](https://leetcode.com/problems/reverse-string/)

**Example Input:** hello

**Example Output:** olleh

```python
class Solution:
    def reverseString(self, s: List[str]) -> None:
        """
        Do not return anything, modify s in-place instead.
        """
        left = 0
        right = len(s) - 1
        
        while left < right:
            temp = s[left]
            s[left] = s[right]
            s[right] = temp
            
            left += 1
            right -= 1
```

**Explanation:** Using a two pointer approach we start from the left and right, swapping each character with the help of a temporary variable, eventually meeting in the middle.

## Linked Lists

**Definition:** A linked list is a linear collection of data elements similar to an array, but the order is not given by their physical placement in memory. A linked list can be singly or doubly linked.

<article-image 
  src="https://upload.wikimedia.org/wikipedia/commons/thumb/6/6d/Singly-linked-list.svg/408px-Singly-linked-list.svg.png" 
  alt="Linked list diagram" 
  loading="lazy" 
  styling=""
  caption="Lasindi via Wikimedia Commons, [Public Domain]" 
  captionsrc="https://upload.wikimedia.org/wikipedia/commons/thumb/6/6d/Singly-linked-list.svg/408px-Singly-linked-list.svg.png" 
  :showsource="true">
</article-image>

**Example Problem:** [Merge Two Sorted Lists (Leetcode 21)](https://leetcode.com/problems/merge-two-sorted-lists/)

**Example Input:** l1 = [1,2,4], l2 = [1,3,4]

**Example Output:** [1,1,2,3,4,4]

```python
# Definition for singly-linked list.
# class ListNode:
#     def __init__(self, val=0, next=None):
#         self.val = val
#         self.next = next
class Solution:
    def mergeTwoLists(self, l1: ListNode, l2: ListNode) -> ListNode:
        dummy = ListNode(0)
        head = dummy
        
        while l1 and l2:
            if l1.val < l2.val:
                dummy.next = l1
                l1 = l1.next
            else:
                dummy.next = l2
                l2 = l2.next
                
            dummy = dummy.next
        
        if l1 != None:
            dummy.next = l1
        else:
            dummy.next = l2
            
         
        return head.next
```

**Explanation:** We are asked to return a *sorted* list by merging two sorted linked lists. We create a `dummy` head node, then while both `l1` and `l2` are not None, we assign the lower of the two as the `dummy.next` node and move them along. This builds up our new linked list in sorted order. When we break out of the while loop, we check which one has the leftover node (still not None) and assign it as the last node in the chain. Finally, returning `head.next` to avoid the first dummy node we created 😄

## Stacks

**Definition:** A stack holds an ordered, linear sequence of items. In contrast to a queue, a stack is a last in, first out (LIFO) data structure. It is also used to implement depth first search.

<article-image 
  src="https://upload.wikimedia.org/wikipedia/commons/b/b4/Lifo_stack.png" 
  alt="Stack diagram" 
  loading="lazy" 
  styling="height:430px; width:650px;"
  caption="Maxtremus via Wikimedia Commons, [Public Domain]" 
  captionsrc="https://en.wikipedia.org/wiki/Stack_(abstract_data_type)#/media/File:Lifo_stack.png" 
  :showsource="true">
</article-image>

**Example Problem:** [Min Stack (Leetcode 155)](https://leetcode.com/problems/min-stack/) 

**Example Input:** ["MinStack","push","push","push","getMin","pop","top","getMin"]

**Example Output:** [[],[-2],[0],[-3],[],[],[],[]]

```python
class MinStack:
    def __init__(self):
        """
        initialize your data structure here.
        """
        self.stack = []
        self.min_stack = []
        
    def push(self, val: int) -> None:
        self.stack.append(val)
        val = min(val, self.min_stack[-1]) if len(self.min_stack) > 0 else val
        self.min_stack.append(val)

    def pop(self) -> None:
        self.stack.pop()
        self.min_stack.pop()

    def top(self) -> int:
        return self.stack[-1]

    def getMin(self) -> int:
        return self.min_stack[-1]


# Your MinStack object will be instantiated and called as such:
# obj = MinStack()
# obj.push(val)
# obj.pop()
# param_3 = obj.top()
# param_4 = obj.getMin()
```

**Explanation:** To keep track of and lower the expense of retrieving the minimum element, we implement a two stack approach. The main stack holds the entries and the min stack holds the current minimum value, which updates during `push` with either the new value or the popped minimum value, whichever is lower.

## Queues

**Definition:** A queue holds an ordered, linear sequence of items. In contrast to a stack, a queue is a first in, first out (FIFO) data structure. It is used to implement breadth first search.

<article-image 
  src="https://upload.wikimedia.org/wikipedia/commons/thumb/5/52/Data_Queue.svg/1920px-Data_Queue.svg.png" 
  alt="Queue diagram" 
  loading="lazy" 
  styling="height:430px; width:650px;"
  caption="Vegpuff via Wikimedia Commons, [Public Domain]" 
  captionsrc="https://en.wikipedia.org/wiki/Queue_(abstract_data_type)#/media/File:Data_Queue.svg" 
  :showsource="true">
</article-image>

**Example Problem:** [Binary Tree Level Order Traversal (Leetcode 102)](https://leetcode.com/problems/binary-tree-level-order-traversal/)

**Example Input:** root = [3,9,20,null,null,15,7]

**Example Output:** [[3],[9,20],[15,7]]

```python
# Definition for a binary tree node.
# class TreeNode:
#     def __init__(self, val=0, left=None, right=None):
#         self.val = val
#         self.left = left
#         self.right = right
import collections

class Solution:
    def levelOrder(self, root: TreeNode) -> List[List[int]]:
        result: List[List[int]] = []
        
        if root == None:
            return result
        
        # Initialise queue and add first node
        queue: Deque[int] = collections.deque()
        queue.append(root)
        
        # Loop over queue
        while not len(queue) == 0:
            current_level = []
            for i in range(len(queue)):
                current_node: TreeNode = queue.popleft()
                current_level.append(current_node.val)
                if (current_node.left):
                    queue.append(current_node.left)
                    
                if (current_node.right):
                    queue.append(current_node.right)
                    
            result.append(current_level)
        
        return result
```

**Explanation:** We create a queue frontier to implement breadth first search and append the root node. Then at each iteration we clear the queue appending everything to the `current_level` before expanding the node's left and right children. To understand the difference between using a queue or stack as a frontier [watch this video](https://youtu.be/D5aJNFWsWew?t=1561) from CS50 AI.

## Heaps

**Definition:** A heap is a data structure like a tree with the interesting property that any node has a lower value than any of its children (min-heap) or any node has a higher value than any of its children (max-heap). 

<article-image 
  src="https://upload.wikimedia.org/wikipedia/commons/thumb/c/c4/Max-Heap-new.svg/800px-Max-Heap-new.svg.png" 
  alt="Heap diagram" 
  loading="lazy" 
  styling="height:430px; width:405px;"
  caption="A max-heap by Kelott via Wikimedia Commons, [Public Domain]" 
  captionsrc="https://upload.wikimedia.org/wikipedia/commons/thumb/c/c4/Max-Heap-new.svg/800px-Max-Heap-new.svg.png" 
  :showsource="true">
</article-image>

**Example Problem:** [Last Stone Weight (Leetcode 1046)](https://leetcode.com/problems/last-stone-weight/)

**Example Input:** [2,7,4,1,8,1]

**Example Output:** 1

```python
import heapq

class Solution:
    """
    See https://docs.python.org/3/library/heapq.html for heapq docs
    """
    def lastStoneWeight(self, stones: List[int]) -> int:
        heap = [-abs(x) for x in stones] # negative value as heapq is min heap by default
        heapq.heapify(heap)
        
        while len(heap) > 1:
            stone_one = abs(heapq.heappop(heap))
            stone_two = abs(heapq.heappop(heap))
            
            if stone_one != stone_two:
                heapq.heappush(heap, -abs(stone_one - stone_two))
                
        heap_is_empty = len(heap) == 0
        
        return 0 if heap_is_empty else abs(heapq.heappop(heap))
```

**Explanation:** Our brief is if x == y, both stones are destroyed, and if x != y, the stone of weight x is destroyed, and the stone of weight y has new weight y - x. At the end of the game, there is at most one stone left. We create a `heap` list with the negative value of the stones (because heapq is a min-heap by default we need to turn that into a max-heap). Then `heapq.heapify(heap)` transform the list in-place. We then pop the two heaviest stones from the max-heap and if not equal add back their difference, not forgetting to make the value negative. If they are the same we do nothing (both stones were destroyed). We then just need to check if any stones are left with `heap_is_empty` and if it is return the last stone's weight 😄

## HashMaps or Dictionaries

**Definition:** A hashmap or hashtable (dictionary in Python) is a data structure that implements an associative array abstract data type - a structure that can map keys to values, like this `{ "name": "John", "age": "44" }`

**Example Problem:** [Valid Anagram (Leetcode 242)](https://leetcode.com/problems/valid-anagram/)

**Example Input:** s = "anagram", t = "nagaram"

**Example Output:** true

There are a few valid solutions for this - we'll start with the fundamental example of using a hashmap, then simplify.

```python
class Solution:
    def isAnagram(self, s: str, t: str) -> bool:
        if len(s) != len(t):
            return False

        counter = {}

        for letter in s:
            if letter in counter.keys():
                counter[letter] += 1
            else:
                counter[letter] = 1

        for letter in t:
            if letter not in counter.keys():
                return False

            if counter[letter] < 1:
                return False

            counter[letter] -= 1


        return True
```

```python
class Solution:
    def isAnagram(self, s: str, t: str) -> bool:
        return Counter(s) == Counter(t)
```

```python
class Solution:
    def isAnagram(self, s: str, t: str) -> bool:
        return sorted(s) == sorted(t)
```

**Explanation:** Wikipedia tells us 'An anagram is a word or phrase formed by rearranging the letters of a different word or phrase, typically using all the original letters exactly once. For example, the word anagram itself can be rearranged into nagaram, also the word binary into brainy and the word adobe into abode'. We need to test if `t` is an anagram of `s`. In the first approach, implement a counter ourselves, counting each character in `s` and storing the count in a dictionary. We then go over each letter in `t` decrementing from the count. If the letter isn't in the dictionary, or the count drops below zero we know it's not a valid anagram. Approach two simplifies this to use Python's built in Counter to compare both strings. As the order of the words don't matter, we could also sort the strings then compare them as in the third approach.

The commonly used data structures for hashmaps in Python are set, dict, collections.defaultdict and collections.Counter.

## Searching

**Definition:** A search algorithm is used to find specific data within a data structure.

**Example Problem:** [Binary Search (Leetcode 704)](https://leetcode.com/problems/binary-search/)

**Example Input:** nums = [-1,0,3,5,9,12], target = 9

**Example Output:** 4

```python
class Solution:
    def search(self, nums: List[int], target: int) -> int:
        left, right = 0, len(nums) - 1
        
        while left <= right:
            middle = (left + right) // 2
            
            if nums[middle] == target:
                return middle
            
            if target > nums[middle]:
                left = middle + 1
            else:
                right = middle - 1
            
            
        return -1
```

**Explanation:** For our example input, 9 exists in `nums` and its index is 4. To satisfy a O(log n) runtime complexity, we implement binary search. We keep finding the middle, if it is the target we return it's index, else when the target is greater than the middle we replace the left index with the middle or the right index when less than the middle. This effectively cuts the array in half every time until we find the target or leave the while loop.

## Sorting

**Definition:** A sorting algorithm re-organises a data structure into a specific order, such as alphabetical, highest-to-lowest value or shortest-to-longest distance.

**Example Problem:** [Intersection of Two Sorted Arrays II (Leetcode)](https://leetcode.com/problems/intersection-of-two-arrays-ii/)

**Example Input:** nums1 = [4,9,5], nums2 = [9,4,9,8,4]

**Example Output:** [4,9] or [9,4]

```python
class Solution:
    def intersect(self, nums1: List[int], nums2: List[int]) -> List[int]:
        if len(nums1) > len(nums2):
            return self.intersect(nums2, nums1)
        
        map: dict = {}
        
        for number in nums1:
            if number in map.keys():
                map[number] += 1
            else:
                map[number] = 1
                
        intersection: List = []
        
        for number in nums2:
            count: int = map[number] if number in map.keys() else 0

            if count > 0:
                intersection.append(number)
                map[number] -= 1
                
                
        return intersection
```

or by using Counter we saw in the hashmaps section with it's [elements() method](https://docs.python.org/3/library/collections.html#collections.Counter.elements) ...

```python
class Solution:
    def intersect(self, nums1: List[int], nums2: List[int]) -> List[int]:
        if len(nums1) > len(nums2):
            return self.intersect(nums2, nums1)
        
        nums1_count = Counter(nums1)
        nums2_count = Counter(nums2)
        
        return (nums1_count & nums2_count).elements()
```

**Explanation:** The intersection is everything that `nums1` and `nums2` have in common. We must ensure each element in the result must appear as many times as it shows in both arrays. In the first approach we create our own counter `map` to count the occurance of each number in `nums1`. Then for each number in `nums2` we check if it's in the dictionary and if it is append it to the `intersection` list then decrement the count by one. We can then return the intersection as the answer. Approach two simplifies this by using Counter.

## Graphs

**Definition:** A graph represents a non-linear relationship between it's nodes which are connected by edges.

<article-image 
  src="https://upload.wikimedia.org/wikipedia/commons/thumb/2/28/6n-graph2.svg/375px-6n-graph2.svg.png" 
  alt="Graph diagram" 
  loading="lazy" 
  styling="height:395px; width:405px;"
  caption="Chris-martin via Wikimedia Commons, [Public Domain]" 
  captionsrc="https://commons.wikimedia.org/wiki/File:6n-graph2.svg" 
  :showsource="true">
</article-image>

**Example Problem:** [Clone Graph (Leetcode 133)](https://leetcode.com/problems/clone-graph/)

**Example Input:** adjList = [[2,4],[1,3],[2,4],[1,3]]

**Example Output:** [[2,4],[1,3],[2,4],[1,3]]

```python
"""
# Definition for a Node.
class Node:
    def __init__(self, val = 0, neighbors = None):
        self.val = val
        self.neighbors = neighbors if neighbors is not None else []
"""

class Solution:
    def cloneGraph(self, node: 'Node') -> 'Node':
        if not node:
            return None
        
        map: dict = {}
            
        def dfs(node, map):
            if node in map:
                return map[node]
            
            print(f"Copying node {node.val}")
            copy = Node(val=node.val)
            map[node] = copy
            
            for neighbor in node.neighbors:
                print(f"Appending neighbour node {neighbor.val} to node {copy.val}")
                copy.neighbors.append(dfs(neighbor, map))

            return copy
        
        
        return dfs(node, map)
```

**Explanation:** We are asked to effectively manually implement `copy.deepcopy(node)` to copy the contents of a graph given it's entry node. We initialise a dictionary `map` to store and return the copied nodes we've already seen. If we've not already seen the node, we create a copy and store it, then for each of it's neighbours append them as the copy's neighbours using depth first search and recursion with `dfs`. This clones every node and in turn copies the neighbors of each node.

## Bitwise manipulation

**Definition:** Bitwise manipulation performs a logical operation on each individual bit of a binary number. 

**Preparation:** To solve these problems you must first know the [bitwise operators](https://realpython.com/python-bitwise-operators/#overview-of-pythons-bitwise-operators) and converting [binary numbers to denery](https://youtu.be/q7nZbAUTSC4) and [denery numbers to binary](https://youtu.be/70lM1qAD5u4). It is also useful to understand [signed and unsigned numbers](https://www.youtube.com/watch?v=miwMEUfkqfY) and [least significant bit](https://en.wikipedia.org/wiki/Bit_numbering#Least_significant_bit).

Here is a concise bitwise operators reference table I stapled together from various sources.

| Operator | Syntax | Meaning                    | Description                                                                       | Example                                          |
| -------- | ------ | -------------------------- | --------------------------------------------------------------------------------- | ------------------------------------------------ |
| &        | a & b  | Bitwise AND                | Returns 1 if both the bits are 1 else 0                                           | 1010 & 0100 = 0000                               |
| \|       | a \| b | Bitwise OR                 | Returns 1 if either of the bit is 1 else 0                                        | 1010 \| 0100 = 1110                              |
| ^        | a ^ b  | Bitwise XOR (exclusive OR) | Returns 1 if one of the bits is 1 and the other is 0 else returns false.          | 1010 ^ 0100 = 1110                               |
| ~        | ~a     | Bitwise NOT                | Returns one’s complement of the number                                            | ~1010 = -(1010 + 1) = -(1011) = -11 (decimal)    |
| <<       | a << n | Bitwise left shift         | Shifts the bits of the number to the left and fills 0 on voids left as a result.  | 0000 0101 << 2 = 0001 0100                       |
| \>>      | a >> n | Bitwise right shift        | Shifts the bits of the number to the right and fills 0 on voids left as a result. | 0000 0101 >> 2 = 0000 0001                       |

**Example Problem:** [Counting Bits (Leetcode 338)](https://leetcode.com/problems/counting-bits/)

**Example Input:** 9

**Example Output:** 2

A good example to illustrate bitwise manipulation is counting bits set to 1 in a positive integer. The Leetcode example is an array of positive integers - so the same solution but for each item in the array.

```python 
def count_bits(x: int) -> int:
    num_bits = 0
    while x:
        num_bits += x & 1   # checks if the rightmost bit is 1 (0001 & 0001 = 1)
        x >>= 1             # shifts the number right one bit, shifting out the least significant bit

    return num_bits

count_bits(9) # Returns 2
```

**Explanation:** If we take the number 9, which in binary is 1001, then we can see there are two bits set to 1. 

* We start with 1001 and add `x & 1` (1001 & 0001 = 0001) to `num_bits` which now has a count of 1 (1 added)
* Then shift the bits right making 0100 and repeat adding `x & 1` (0100 & 0001 = 0000) to `num_bits` which now has a count of 1 (nothing added).
* Then shift the bits right making 0010 and repeat adding `x & 1` (0010 & 0001 = 0000) to `num_bits` which now has a count of 1 (nothing added).
* Then shift the bits right making 0001 and repeat adding `x & 1` (0001 & 0001 = 0001) to `num_bits` which now has a count of 2 (1 added).
* `x` is now 0 so the while loop exits and the returned count of `num_bits` is 2! 😄

## Binary Trees

**Definition:** A binary tree is a tree data structure in which each node has at most two children, which are referred to as the left child and the right child.

<article-image 
  src="https://upload.wikimedia.org/wikipedia/commons/thumb/f/f7/Binary_tree.svg/1280px-Binary_tree.svg.png" 
  alt="Binary tree diagram" 
  loading="lazy" 
  styling="height:350px; width:425px;"
  caption="Derrick Coetzee via Wikimedia Commons, [Public Domain]" 
  captionsrc="https://upload.wikimedia.org/wikipedia/commons/thumb/f/f7/Binary_tree.svg/1280px-Binary_tree.svg.png" 
  :showsource="true">
</article-image>

**Example Problem:** [Balanced Binary Tree (Leetcode 110)](https://leetcode.com/problems/balanced-binary-tree/)

**Example Input:** root = [3,9,20,null,null,15,7]

**Example Output:** true

Here I've presented the code along with the explanation in an image, to visualise what's going on.

<article-image 
  src="https://res.cloudinary.com/dayqxxsip/image/upload/v1639930533/App%20Images/Blog%20Images/Article%20Images/Coding%20Interview%20Topics/balanced-binary-tree-explanation_ymrh1g_oogc8w.png" 
  alt="Balanced binary tree explanation" 
  loading="lazy" 
  styling=""
  caption="Balanced binary tree solution with explanation" 
  captionsrc="" 
  :showsource="true">
</article-image>

**Explanation:** A binary tree is balanced when the left and right subtrees of every node differ in height by no more than 1. We use recursion (covered later) to carry out [postorder traversal](https://www.geeksforgeeks.org/tree-traversals-inorder-preorder-and-postorder/) to ensure every subtree is balanced all the way back to the top. If that's a bit confusing [this video](https://www.youtube.com/watch?v=LU4fGD-fgJQ) goes into more detail.

## Binary Search Trees

**Definition:** Binary search tree (BST), also called an ordered or sorted binary tree, is a rooted binary tree data structure whose internal nodes each store a key greater than all the keys in the node’s left subtree and less than those in its right subtree.

<article-image 
  src="https://upload.wikimedia.org/wikipedia/commons/thumb/d/da/Binary_search_tree.svg/1280px-Binary_search_tree.svg.png" 
  alt="Binary search tree diagram" 
  loading="lazy" 
  styling="height:350px; width:425px;"
  caption="Derrick Coetzee via Wikimedia Commons, [Public Domain]" 
  captionsrc="https://upload.wikimedia.org/wikipedia/commons/thumb/d/da/Binary_search_tree.svg/1280px-Binary_search_tree.svg.png" 
  :showsource="true">
</article-image>

**Example Problem:** [Validate Binary Search Tree (Leetcode 98)](https://leetcode.com/problems/validate-binary-search-tree/)

**Example Input:** [2,1,3]

**Example Output:** true

```python
# Definition for a binary tree node.
# class TreeNode:
#     def __init__(self, val=0, left=None, right=None):
#         self.val = val
#         self.left = left
#         self.right = right
class Solution:
    def isValidBST(self, root: Optional[TreeNode]) -> bool:
        
        def validate(node: TreeNode, lower_bound: float, upper_bound: float):
            if not node:
                return True
            
            node_in_bounds = node.val < upper_bound and node.val > lower_bound
            if not node_in_bounds:
                return False
            
            return (
                validate(node.left, lower_bound, node.val) and
                validate(node.right, node.val, upper_bound)
            )
            
        return validate(root, float("-inf"), float("inf"))
```

**Explanation:** We need to determine if a binary tree is a valid binary search tree (BST) given it's root node. Any node must be greater than all the keys in it's left subtree and less than those in it's right subtree. We can use depth first search and recursion to `validate` each node is within the `lower_bound` and `upper_bound`. This ensures that if a left or right subtree falls out of bounds it is not a valid BST. 

## Recursion

**Definition:** Recursion is a process in which a function calls itself as a subroutine, thereby dividing a problem into subproblems of the same type.

**Example Problem:** [Permutations (Leetcode 46)](https://leetcode.com/problems/permutations/)

**Example Input:** nums = [1,2,3]

**Example Output:** [[1,2,3],[1,3,2],[2,1,3],[2,3,1],[3,1,2],[3,2,1]]

```python
class Solution:
    def permute(self, nums: List[int]) -> List[List[int]]:
        result: List[List[int]] = []
        
        if len(nums) == 0:
            return [nums[:]]
        
        for i in range(len(nums)):
            number = nums.pop(0)
            permutations = self.permute(nums)
            
            for permutation in permutations:
                permutation.append(number)
                
            result.extend(permutations)
            nums.append(number)
            
        return result
```

**Explanation:** We are asked to return *all the possible permutations* of `nums` in any order. So for each integer in `nums` [1,2,3] we pop the first element leaving [2,3] and then call `permute` again (recursively) to get each sub-permutation. This would leave us with [3,2] and [2,3] so now we append the popped `number` back the the `permutation`, giving [3,2,1] and [2,3,1] and add both of these to the `result` using extend(). Finally, we append the popped `number` back to `nums`. This will repeat for each element giving all possible permutations. The magic of recursion right? Here is a diagram to visualise the process that will be carried out for each. We always pop the first element, then get permutations, add the element back, extend the result, append the element back.

<article-image 
  src="https://res.cloudinary.com/dayqxxsip/image/upload/v1639930533/App%20Images/Blog%20Images/Article%20Images/Coding%20Interview%20Topics/permutations-diagram_mfq9ej_qrw9lf.png" 
  alt="Permutations recursive diagram" 
  loading="lazy" 
  styling=""
  caption="Permutations recursive diagram" 
  captionsrc="" 
  :showsource="false">
</article-image>

## Dynamic Programming or Memoization

**Definition:** Dynamic programming is a technique for solving problems of recursive nature, iteratively and is applicable when the computations of the subproblems overlap. Memoization is a term describing an optimization technique where you cache previously computed results, and return the cached result when the same computation is needed again. 

**Example Problem:** [Fibonacci Number (Leetcode 509)](https://leetcode.com/problems/fibonacci-number/)

**Example Inputs:** n = 40

**Example Output:** 102334155

Earlier when discussing Big-O Notation I used an example `find_nth_number_in_fibonacci_sequence` to demonstrate exponential time O(2^n). In the example I tried to find the 40th number in the fibonacci sequence using recursion and this took a huge 59.58 seconds. The greater the number in the sequence we were looking for, the more the runtime grew exponentially. How can we improve this? What about using memoization to cache the results of each recursive call so no unnecessary repeat calls are ever made.

```python
import time

def find_nth_number_in_fibonacci_sequence(n, cached_results: dict):
    if n in cached_results.keys():
      return cached_results[n] # return result if already in cache

    if n <= 1:
        result = n
    else:
      result = find_nth_number_in_fibonacci_sequence(n - 2, cached_results) + \
               find_nth_number_in_fibonacci_sequence(n - 1, cached_results) # ensure cache is passed to all recursive calls
    
    cached_results[n] = result # cache the result
    return result

start = time.time()
n = 40
print(find_nth_number_in_fibonacci_sequence(n, dict())) # 40th number in the fibonacci sequence is 102334155
end = time.time()
print(f"Time taken: {end - start} seconds") # This took 0.0009975433349609375 seconds for me
```

**Explanation:** Modifying the code we used earlier to include a cache in the form of a dictionary, to find the 40th number in the sequence it now takes 0.0009975433349609375 seconds for me!! By caching and reusing earlier results the speed has improved dramatically. Using a hashmap (dictionary) to lookup cached results has a constant time complexity of O(1).

## Final thoughts

So now you should have a good idea of the data structures and algorithms included in a coding interview. This might not prepare you for one (only constant focused practice can do that) but it will make you aware of what you don’t know. I have been doing problems on LeetCode and EPI and trying to really understanding the solution before moving on. A key part of this strategy has been tracking performance and repeating failed problems. Here is my Trello board when I first started.

<article-image 
  src="https://res.cloudinary.com/dayqxxsip/image/upload/v1639930533/App%20Images/Blog%20Images/Article%20Images/Coding%20Interview%20Topics/coding-problem-tracker_z7x0vy_cpm0k3.webp" 
  alt="Coding problems Trello board" 
  loading="lazy" 
  styling=""
  caption="My coding problems Trello board" 
  captionsrc="https://res.cloudinary.com/dayqxxsip/image/upload/v1639930533/App%20Images/Blog%20Images/Article%20Images/Coding%20Interview%20Topics/coding-problem-tracker_z7x0vy_cpm0k3.webp" 
  :showsource="false">
</article-image>

The approach was to add problems for each topic we've covered to the Problems list, then each day move 3-5 problems across to the Doing list. Once attempted, I rated performance out of 5 (5 being perfect with no help needed, 1 being didn't finish it without help and further study) and move it to the Repeat list (worst at the top). Then on subsequent days take another 3-5 problems from the Problems list, and one from the top of the Repeat list. Rinse and repeat. The idea came from this [Engineering with Utsav video](https://youtu.be/7UlslIXHNsw?t=696) (I really like this channel).This has allowed me to focus on breath of knowledge, whilst revisiting and repeating weaker areas.

More so than passing any test, I hope this article gives you the inspiration to become an (even) better programmer and to think more algorithmically. As always if you have any thoughts let me know in the comments section below. Alternatively, you can complete the site's new [feedback form](https://forms.office.com/r/Eu2HTx8kvn) - you might have noticed the new 👍 feedback button on the navbar, so you can say how you think the site is doing and what you would like to see more of in the future 😄

## Resources

* [Python Standard Library Reference](https://docs.python.org/3/library/index.html#library-index)
* [Elements of Programming Interviews](https://elementsofprogramminginterviews.com/) | [Book](https://www.amazon.co.uk/Elements-Programming-Interviews-Python-Insiders/dp/1537713949/ref=pd_bxgy_img_2/262-9365292-3109168?pd_rd_w=Y6OlR&pf_rd_p=c7ea61ca-7168-47e3-9c8b-d84748f5b23c&pf_rd_r=D0WECF6DRCT5DPW9E23H&pd_rd_r=1f09cc37-a87b-404f-8ed5-79f25c54beb0&pd_rd_wg=zODeE&pd_rd_i=1537713949&psc=1)
* [EPI-Judge](https://github.com/adnanaziz/EPIJudge)
* [LeetCode](https://leetcode.com/)
* [Grokking Algorithms](https://www.amazon.co.uk/Grokking-Algorithms-illustrated-programmers-curious/dp/1617292230/ref=pd_sbs_1/262-9365292-3109168?pd_rd_w=ft5SM&pf_rd_p=a3a7088f-4aec-4dbd-97cc-9a059581fe7b&pf_rd_r=ZE7W9K1EBJ7VNZPCHC07&pd_rd_r=58618f0b-9890-477a-9f09-a2df9551f80d&pd_rd_wg=zdcP9&pd_rd_i=1617292230&psc=1)
* [Computer Science Distilled](https://www.amazon.co.uk/Computer-Science-Distilled-Computational-Problems/dp/0997316020/ref=sr_1_1?dchild=1&keywords=computer+science+distilled&qid=1626618696&s=books&sr=1-1)
* [Big-O Examples in Python](https://www.youtube.com/watch?v=5yJ_QLec0Lc)
* [Time Complexity Examples in Python](https://towardsdatascience.com/understanding-time-complexity-with-python-examples-2bda6e8158a7)
* [Memoization and Dynamic Programming Explained](https://www.youtube.com/watch?v=WbwP4w6TpCk)
* [Bitwise Operators](https://www.geeksforgeeks.org/python-bitwise-operators/)
* [Data Structures & Algorithms in Python](https://www.amazon.co.uk/Structures-Algorithms-Python-Michael-Goodrich/dp/1118290275) | Excellent but expensive, might be able to find an e-book version cheaper
* [10 Important Data Structures & Algorithms for Interviews](https://www.youtube.com/watch?v=RcvQagxK_9w)
* [Understanding Merge Sort in Python](https://www.youtube.com/watch?v=rAqBlKhy_oI)
]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Searching for text in PDFs at increasing scale]]></title>
            <link>https://shedloadofcode.com/blog/searching-for-text-in-pdfs-at-increasing-scale/</link>
            <guid>https://shedloadofcode.com/blog/searching-for-text-in-pdfs-at-increasing-scale/</guid>
            <pubDate>Wed, 04 Aug 2021 13:58:00 GMT</pubDate>
            <description><![CDATA[Explore multiple approaches to extract and search text from PDFs at increasing scale using Python with PyPDF2, C# with iTextSharp alongside C++ and pdftotext.]]></description>
            <content:encoded><![CDATA[
I had the interesting challenge of searching for text within a large number of PDFs recently. This was to assist a finance team in automating the organising and categorising of some of their existing documents. When I said large number, it was around 350,000 PDF documents, so quite a few! I iterated through a few different solutions and tried to focus on delivering something optimal and efficient. I tested each on a smaller scenario to benchmark how they might perform at increasing scale - the results can be found at the end of the article.

## Getting started with PyPDF2

With Python being my usual go to Swiss Army Knife for many things, I first installed this very useful package to give it a go:

```
pip install PyPDF2
```

I had read about [PyPDF2](https://pypi.org/project/PyPDF2/) in [Automate the Boring Stuff with Python](https://automatetheboringstuff.com/chapter13/) so at least I had a starting point. PyPDF2 has also has some changes in the latest version 3.0.1 which you can read about in the [documentation](https://pypdf2.readthedocs.io/en/latest/) and [migration guide](https://pypdf2.readthedocs.io/en/3.0.0/user/migration-1-to-2.html) so some of the functions have changed. I put together the following CLI tool using the PyPDF2 package:

```python
import PyPDF2
import re
import time
import sys

def main():
    if len(sys.argv) != 2:
        sys.exit("Usage: python pdf_searcher.py filename.pdf")

    filename = sys.argv[1]
    file = open(filename, "rb")

    pdf_reader = PyPDF2.PdfReader(file) # Formerly PyPDF2.PdfFileReader(file)
    number_of_pages = len(pdf_reader.pages) # Formerly pdf_reader.getNumPages()
    start = time.time()

    print("Type your search term and hit enter")
    print("You can add as many search terms as you like")
    print("Once you're done, hit enter to continue...")
    search_terms = get_search_terms_from_user(search_terms = [])
    
    for i in range(0, number_of_pages):
        page = pdf_reader.pages[i] # Formerly pdf_reader.getPage(i) 
        page_content = page.extract_text() # Formerly page.extractText()


        for search_term in search_terms:
            if re.search(search_term, page_content):
                print(f"Matched '{search_term}' on page {i}")


    print(f"Program took {time.time() - start} seconds")


def get_search_terms_from_user(search_terms: list) -> list:
    search_term = str(input("Search term: "))
    if search_term != "":
        search_terms.append(search_term)
        return get_search_terms_from_user(search_terms)
    else:
        return search_terms


if __name__ == "__main__":
    main()
```

This accepted a filename as a command line argument, followed by a prompt to enter search terms.

## Optimising the PyPDF2 script

So this was a good start and a fun program for searching a single PDF but some optimisations were needed. In addition, the program needed to search an entire directory of files so it needed extending. The program didn't need to find every word that matched the search criteria in the given document, just that it does in fact occur in there at least once. So to optimise based on that use case, once it's certain that the search term does exist for the given document, it doesn't have to look for that word again saving time.

```python [pypdfsearcher.py]
import PyPDF2
import re
import time
import sys
import os
import glob

def main():
    directory = os.path.dirname(os.path.abspath(__file__))
    pdf_filepaths = glob.glob("**/*.pdf", recursive=True)
    start = time.time()
    results = {}

    for filepath in pdf_filepaths:
        print(f"Searching document {filepath}")
        search_terms = ["hurricanes", "walt", "avenue", "disney", "mercedes"]
        filename = os.path.basename(filepath)
        found_terms = {}

        file = open(filepath, "rb")
        pdf_reader = PyPDF2.PdfReader(file) # Formerly PyPDF2.PdfFileReader(file)
        number_of_pages = len(pdf_reader.pages) # Formerly pdf_reader.getNumPages()

        for i in range(0, number_of_pages):
            page = pdf_reader.pages[i] # Formerly pdf_reader.getPage(i) 
            page_content = page.extract_text() #Formerly page.extractText()

            for term in search_terms:
                if term in found_terms.keys():
                    continue


                if re.search(term.lower(), page_content.lower()):
                    print(f"Found '{term}' in document '{filename}'")
                    found_terms[term] = 1
                    if filename in results.keys():
                        results[filename].append(term)
                    else:
                        results[filename] = [term]

    print(f"Program took {time.time() - start} seconds")
    print(results)


if __name__ == "__main__":
    main()
```

## Alternative approach with pdftotext subprocess

The second solution called the [pdftotext](https://www.xpdfreader.com/pdftotext-man.html) program in a Python subprocess to receive the text as the subprocess output. It did exactly the same thing as the previous script but might be faster - we'll compare the speed of each approach later.

```python [pdftotextsearcher.py]
import os
import subprocess
import re
import time
import glob


def main():
    directory = os.path.dirname(os.path.abspath(__file__))
    pdf_filepaths = glob.glob("**/*.pdf", recursive=True)
    start = time.time()
    results = {}

    for filepath in pdf_filepaths:
        print(f"Searching document {filepath}")
        search_terms = ["hurricanes", "epcot", "daimler", "disney", "mercedes"]
        filename = os.path.basename(filepath)
        found_terms = {}

        args = ["pdftotext",
                '-enc',
                'UTF-8',
                filepath, # Example: "pdfs/United-Kingdom-Strategic-Export-Controls-Annual-Report-2021.pdf"
                '-']
        res = subprocess.run(args, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
        output = res.stdout.decode('utf-8')

        for term in search_terms:
            if term in found_terms.keys():
                continue

            if re.search(term.lower(), output.lower()):
                print(f"Found '{term}' in document '{filename}'")
                found_terms[term] = 1
                if filename in results.keys():
                    results[filename].append(term)
                else:
                    results[filename] = [term]

    print(f"Program took {time.time() - start} seconds")
    print(results)

if __name__ == "__main__":
    main()
```

## Trying out C# and iTextSharp

I thought I'd switch to C# and investigate the [iTextSharp](https://www.nuget.org/packages/iTextSharp/) NuGet package for reading and searching PDFs. I was pleasantly surprised at how well this package worked. It was also quick to install and get started with. Here is the program I put together using it:

```csharp [Program.cs]
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

namespace PDFSearcherSharp
{
    class Program
    {
        static void Main(string[] args)
        {
            var stopwatch = new System.Diagnostics.Stopwatch();
            stopwatch.Start();

            string directory = @"C:/Users/shedloadofcode/source/repos/PDFSearcherSharp/pdfs/";
            string[] files = Directory.GetFiles(directory, "*.pdf");
            List<string> searchTerms = new List<string>() { "hurricanes", "epcot", "daimler", "disney", "mercedes" };

            foreach (var filename in files)
            {
                Console.WriteLine($"Searching document {filename}");
                StringBuilder stringBuilder = new StringBuilder();

                string filePath = System.IO.Path.Combine(directory, filename);
                using (PdfReader reader = new PdfReader(filePath))
                {
                    List<string> foundTerms = new List<string>();

                    for (int pageNumber = 1; pageNumber <= reader.NumberOfPages; pageNumber++)
                    {
                        ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
                        string text = PdfTextExtractor.GetTextFromPage(reader, pageNumber, strategy);
                        text = Encoding.UTF8.GetString(
                            ASCIIEncoding.Convert(
                                Encoding.Default,
                                Encoding.UTF8,
                                Encoding.Default.GetBytes(text)
                            )
                        );
                        stringBuilder.Append(text);

                        foreach (string term in searchTerms)
                        {
                            if (foundTerms.Contains(term))
                            {
                                continue;
                            }

                            if (text.ToLower().Contains(term.ToLower()))
                            {
                                Console.WriteLine($"Found '{term}' in document '{filename}'");
                                foundTerms.Add(term);
                            }
                        }
                    }
                }

                // Console.WriteLine(stringBuilder.ToString());
            }

            stopwatch.Stop();
            Console.WriteLine($"Program took {stopwatch.ElapsedMilliseconds / 1000} seconds");
        }
    }
}
```

## A last approach with C++ and pdftotext

The fourth and final approach involved calling the [pdftotext](https://www.xpdfreader.com/pdftotext-man.html) executable again, but this time with the main script written in C++. I was curious to see how to put together a solution for this in C++ more than anything. I couldn't figure out a way to return the output from the pdftotext executable to stdout in-process, so resorted to converting the PDFs to text files first, searching the text files and then finally deleting them - this created added overhead so will likely slow it down.

```cpp [PdfSearcher.cpp]
#include <Windows.h>
#include <fstream>
#include <iostream>
#include <string>
#include <regex>
#include <map>
#include <filesystem>
#include <vector>
#include <thread>

using namespace std;
using std::filesystem::directory_iterator;

void DeleteTextFile(string filePath)
{
    string fileName = filePath;
    fileName = fileName.substr(0, fileName.size() - 4);
    fileName = fileName + ".txt";

    const char* file = fileName.c_str();
    if (remove(file) != 0)
        cout << "Error deleting file " << fileName << endl;
    else
        cout << "File " << fileName << " successfully deleted" << endl;
}

string TransformLineToLowercase(string line)
{
    std::for_each(line.begin(), line.end(), [](char& c)
    {
        c = ::tolower(c);
    });

    return line;
}

void SearchTextFile(string fileName, string searchTerms[], int searchTermsLength)
{
    map<string, bool> foundSearchTerms;
    for (int i = 0; i < searchTermsLength; i++)
    {
        foundSearchTerms[searchTerms[i]] = false;
    }

    fstream textFile;
    textFile.open(fileName, ios::in);
    if (textFile.is_open())
    {
        int totalNumberOfMatches = 0;
        string line;
        while (getline(textFile, line))
        {
            string lowercaseLine = TransformLineToLowercase(line);

            for (int i = 0; i < searchTermsLength; i++)
            {
                bool searchTermAlreadyFound = foundSearchTerms[searchTerms[i]] == 1;
                if (searchTermAlreadyFound)
                {
                    continue;
                }

                int indexOfMatch = lowercaseLine.find(searchTerms[i]);

                if (indexOfMatch > -1)
                {
                    cout << "Found search term " << searchTerms[i] << "in " << fileName << " at ";
                    cout << "position " << indexOfMatch << " in line" << lowercaseLine << endl;
                    foundSearchTerms[searchTerms[i]] = 1;
                }  
            }
        }
 
        textFile.close();   
    }
}

vector<std::filesystem::path> GetAllFileNamesInDirectory()
{
    string path = "pdfs/";
    vector<std::filesystem::path> filePaths;

    for (const auto& file : directory_iterator(path))
    {
        filePaths.push_back(file.path());
    }

    return filePaths;
}

void GenerateTextFile(string filePath)
{
    STARTUPINFO startupInfo;
    PROCESS_INFORMATION processInformation;
    ZeroMemory(&startupInfo, sizeof(startupInfo));
    ZeroMemory(&processInformation, sizeof(processInformation));

    wstring filePathWs = wstring(filePath.begin(), filePath.end());
    wstring commandLineArgs = L"pdftotext.exe -enc UTF-8 \"" + filePathWs + L"\"";
    wstring commandLineArgsWs = wstring(commandLineArgs.begin(), commandLineArgs.end()).c_str();
    std::wstring commandLineInput(commandLineArgsWs);

    // This was the first attempt
    // wchar_t commandLineInput[] = TEXT("pdftotext.exe -enc UTF-8 \"pdfs/United-Kingdom-Strategic-Export-Controls-Annual-Report-2021 - Copy - Copy (7).pdf\"");

    bool output = CreateProcess(
        NULL,                   // Application name
        &commandLineInput[0],   // Command line arguments
        NULL,                   // Process attributes   
        NULL,                   // Thread attributes
        TRUE,                   // Inherit handles
        0,                      // No creation flags
        NULL,                   // Environment
        NULL,                   // Current directory
        &startupInfo,           // Startup information
        &processInformation     // Process information
    );

    if (output == FALSE)
    {
        cout << "Generating text file for PDF " << filePath << " failed" << endl;
    }
    else
    {
        cout << "Generating text file for PDF " << filePath << endl;
        // cout << "Process ID: " << processInformation.dwProcessId << endl;
    }

    WaitForSingleObject(processInformation.hProcess, INFINITE);

    CloseHandle(processInformation.hProcess);
    CloseHandle(processInformation.hThread);
}

int main()
{
    clock_t start = clock();

    vector<std::filesystem::path> filePaths = GetAllFileNamesInDirectory();

    for (int i = 0; i < filePaths.size(); i++)
    {
        string filePath = filePaths[i].string();
        GenerateTextFile(filePath);
    }

    for (int i = 0; i < filePaths.size(); i++)
    {
        string filePath = filePaths[i].string();
        string fileName = filePath;
        fileName = fileName.replace(0, 5, "");
        fileName = fileName.substr(0, fileName.size() - 4);

        string searchTerms[5] = { "hurricanes", "epcot", "daimler", "disney", "mercedes" };
        string textFilePath = "pdfs/" + fileName + ".txt";
        SearchTextFile(textFilePath, searchTerms, (sizeof(searchTerms) / sizeof(*searchTerms)));
    }

    for (int i = 0; i < filePaths.size(); i++)
    {
        string filePath = filePaths[i].string();
        DeleteTextFile(filePath);
    }

    double duration = (clock() - start) / (double)CLOCKS_PER_SEC;
    cout << "Program took " << duration << " seconds" << endl;
    system("pause > 0");

    return 0;
}
```

## Test exercise and speed benchmarks

So we now have four (almost) equivalent programs in terms of logic and desired output. It was time to run all of the solutions above through a scenario to see how they perform. The scenario was a directory `/pdfs` containing around 200 PDF documents inside. The program would need to search all of the PDF documents and return the names of the PDF files containing the search terms. I had placed a few PDFs I knew contained the search terms with unique file names to test it works. Most documents were around 71 - 150 pages, with the largest at 432 pages. So I was testing with quite large files. If this ever went into production the files would likely be much smaller. Ok here we go!

**Inputs**

* 200 PDF documents
* Each PDF between 71 and 432 pages
* Average PDF file size was 5MB
* Number of search terms was 5 ["hurricanes", "epcot", "daimler", "disney", "mercedes"]
* My two target files were a [Disney financial report](https://thewaltdisneycompany.com/app/uploads/2021/01/2020-Annual-Report.pdf) and a [Daimler financial report](https://www.daimler.com/documents/investors/reports/annual-report/daimler/daimler-ir-annual-report-2019-incl-combined-management-report-daimler-ag.pdf) as I knew these actually contained the search terms (no particular reason I chose these, they were just the first I could find 😆)

**Results**

| Approach                     | Found all search terms | Time in seconds |
| ---------------------------- | ---------------------- | --------------- |
| Python and PyPDF2            | Yes                    | 306             |
| Python running pdftotext.exe | Yes                    | 66              |
| C# and iTextSharp            | Yes                    | 66              |
| C++ running pdftotext.exe    | Yes                    | 72              |

As I predicted the C++ program was likely slowed down by having to convert to text files first before searching. The most performant approaches and my most preferred, are Python running pdftotext.exe (which is straightforward to receive the stdout of the child process) and C# with the iTextSharp NuGet package. Both of these solutions completed in 66 seconds in the test scenario.

**Folder structure for Python project (containing both versions)**

```
/pdfs
pdftotext.exe
pdftotextsearcher.py
pypdfsearcher.py
```

**Folder structure for C# Visual Studio project**

```
/bin
/obj
/pdfs
/PDFSearcherSharp
PDFSearcher.csproj
PDFSearcherSharp.sln
Program.cs
```

**Folder structure for C++ Visual Studio project**

```
/pdfs
PdfSearcher.cpp
PdfSearcher.sln
PdfSearcher.vcxproj
PdfSearcher.vcxproj.filters
PdfSearcher.vcxproj.user
pdftotext.exe
```

## Reflections

So I learned quite a bit about from this exercise, and this provides a good starting point to further develop a solution. It certainly needs more testing and refining to the specific use case. If searching these 200 or so fairly large files took 66 seconds, then at worst case 350,000 / 200 is 1,750 and 1750 x 66 gives 115,500 seconds. Dividing that by 60 gives 1,925 minutes. Dividing that by 60 gives 32 hours. Finally, dividing that by 24 gives 1.33 days 😄. Moving one of these scripts onto a virtual machine and letting it run until done might be the best solution depending where the files are stored. 

A caveat to note is the PDFs I used had searchable text, so if you had scanned PDF documents you might need to go down the avenue of using OCR (optical character recognition). I hear [pytesseract](https://pypi.org/project/pytesseract/) is useful for this as it acts as a  wrapper for [Google’s Tesseract-OCR Engine](https://github.com/tesseract-ocr/tesseract). I might venture into this area next if the need for it arises 😄. Altogether I hope I've shown that reading and searching many PDFs at increasing scale is possible with different approaches, if not always temperamental.

## Resources

* [Searching text in a PDF using Python](https://stackoverflow.com/questions/17098675/searching-text-in-a-pdf-using-python)
* [Using pdftotext on AWS Lambda](http://howto.philippkeller.com/2018/03/13/How-to-extract-text-from-pdf-in-python/)
* [Extract text from PDF in C#](https://www.codeproject.com/Articles/14170/Extract-Text-from-PDF-in-C-100-NET)
* [Extract text from PDF using iTextSharp](https://www.youtube.com/watch?v=y6s2mLpYfMc)
* [Searching strings in C#](https://docs.microsoft.com/en-us/dotnet/csharp/how-to/search-strings)
* [Child Process in Windows System Programming](https://www.youtube.com/watch?v=W2Qu4RDk__k)
* [Creating a Child Process with Redirected Input and Output](https://docs.microsoft.com/en-us/windows/win32/procthread/creating-a-child-process-with-redirected-input-and-output)
* [PDF parsing in C++](https://stackoverflow.com/questions/11715561/pdf-parsing-in-c-podofo)
* [C++ regex](https://www.youtube.com/watch?v=uL9Qt2v2yjk)
* [C++ list files in a directory](https://www.delftstack.com/howto/cpp/how-to-get-list-of-files-in-a-directory-cpp/)
* [C++ Wide Char Array Strings](https://www.youtube.com/watch?v=R21fh-17um0)
]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[How to scrape and analyse your Chess.com data]]></title>
            <link>https://shedloadofcode.com/blog/how-to-scrape-and-analyse-your-chess-com-data/</link>
            <guid>https://shedloadofcode.com/blog/how-to-scrape-and-analyse-your-chess-com-data/</guid>
            <pubDate>Sat, 10 Jul 2021 12:38:00 GMT</pubDate>
            <description><![CDATA[Learn how to scrape data from Chess.com, and analyse your historical game performance using a basic reproducible analytical pipeline.]]></description>
            <content:encoded><![CDATA[
In this article I will scrape data from my Chess.com profile and analyse my historical performance in live matches. This is a reproducible pipeline using Python. I took up Chess again at the end of 2020 after a long hiatus, so was eager to monitor my performance and see where the weaknesses were. The good part of this pipeline is that the data will be automatically updated so I can always see what I need to improve on and ask the interesting questions on my performance just by re-running these scripts.

## Before starting

Before starting you will need a few things. These will set you up to carry out other Data Science projects in the future too - like [analysing your Amazon spending data](/blog/how-to-scrape-and-analyse-your-amazon-spending-data/) or [scraping AutoTrader for multiple makes / models](/blog/how-to-scrape-autotrader-with-python-and-selenium-to-search-for-multiple-makes-and-models/)

* Anaconda
* Jupyter Notebooks (installed with Anaconda)
* Selenium
* Google Chrome (latest version)
* Chrome Driver (latest version)

This article will not cover installing programs in detail, but here is a starting point. Install [Anaconda](https://www.anaconda.com/distribution/) first. Anaconda is a distribution of the Python and R programming languages for scientific computing (data science, machine learning applications, large-scale data processing, predictive analytics, etc.), that aims to simplify package management and deployment. Once installed, open Anaconda Prompt and install Selenium using `pip install selenium`. Selenium is a web driver built for automated actions in the browser and testing. Finally, ensure you have the latest version of [Google Chrome](http://google.co.uk/chrome/?brand=CHBD&gclid=EAIaIQobChMI0LPsqNXl5QIVCLTtCh3pJwybEAAYASAAEgJxkvD_BwE&gclsrc=aw.ds) installed and [ChromeDriver](https://chromedriver.chromium.org/downloads) for the version number of Chrome you're running. On Windows, ensure `chromedriver.exe` is in a [suitable location](https://chromedriver.chromium.org/getting-started) such as `C:\Windows`.

## What will the web scraper do?

Here are the step by step actions the web scraper will perform to scrape Amazon spending data: 

* Launches a Chrome browser controlled by Selenium 
* Navigates to the Chess.com login page and logs in with your given details
* After login, navigates to the [My Games](https://www.chess.com/games/archive) page 
* Scrapes all game data
* Repeats for each page in the archive until finished

<article-image 
  src="https://res.cloudinary.com/dayqxxsip/image/upload/v1639925301/App%20Images/Blog%20Images/Article%20Images/Chess%20Performance/My_Games_Data_h96con_r3ver8.png" 
  alt="My Games archive" 
  loading="lazy" 
  styling=""
  caption="My Games archive - the data to be scraped" 
  captionsrc="https://res.cloudinary.com/dayqxxsip/image/upload/v1639925301/App%20Images/Blog%20Images/Article%20Images/Chess%20Performance/My_Games_Data_h96con_r3ver8.png" 
  :showsource="false">
</article-image>

The resulting data will be enough to answer questions such as:

* Do I win more matches as black or white?
* Do I win shorter or longer games?
* Am I losing to higher or lower rated players?
* Is time-pressure affecting my wins?
* How many of my games reach the endgame?
* Do specific days affect my results?
* Does seasonality affect my results?
* How has my rating developed in 30 min games?

## Scraping games data

First to scrape the required data using Selenium. You must provide your Chess.com `USERNAME` and `PASSWORD` so the script can log you in so be sure to amend these variables these first.

```python [chess-scraper.py]
import numpy as np
import pandas as pd
import bs4
from bs4 import BeautifulSoup
import requests
import csv
import datetime
import time
import hashlib
import os  
from selenium import webdriver  
from selenium.webdriver.common.keys import Keys  
from selenium.webdriver.chrome.options import Options 

options = webdriver.ChromeOptions()
options.add_argument("--start-maximized")
now = datetime.datetime.now()

USERNAME = "DeadlyKnightX"
PASSWORD = "Your password here"
GAMES_URL = "https://www.chess.com/games/archive?gameOwner=other_game&username=" + \
        USERNAME + \
        "&gameType=live&gameResult=&opponent=&opening=&color=&gameTourTeam=&" + \
        "timeSort=desc&rated=rated&startDate%5Bdate%5D=08%2F01%2F2013&endDate%5Bdate%5D=" + \ 
        str(now.month) + "%2F" + str(now.day) + "%2F" + str(now.year) + \ 
        "&ratingFrom=&ratingTo=&page="
LOGIN_URL = "https://www.chess.com/login"

driver = webdriver.Chrome("chromedriver.exe", options=options)
driver.get(LOGIN_URL)
driver.find_element_by_id("username").send_keys(USERNAME)
driver.find_element_by_id("password").send_keys(PASSWORD)
driver.find_element_by_id("login").click()
time.sleep(5)

tables = []
game_links = []

for page_number in range(4):
    driver.get(GAMES_URL + str(page_number + 1))
    time.sleep(5)
    tables.append(
        pd.read_html(
            driver.page_source, 
            attrs={'class':'table-component table-hover archive-games-table'}
        )[0]
    )
    
    table_user_cells = driver.find_elements_by_class_name('archive-games-user-cell')
    for cell in table_user_cells:
        link = cell.find_elements_by_tag_name('a')[0]
        game_links.append(link.get_attribute('href'))
        
driver.close()

games = pd.concat(tables)

identifier = pd.Series(
    games['Players'] + str(games['Result']) + str(games['Moves']) + games['Date']
).apply(lambda x: x.replace(" ", ""))

games.insert(
    0, 
    'GameId', 
    identifier.apply(lambda x: hashlib.sha1(x.encode("utf-8")).hexdigest())
)

print(games.head(3))
```

| GameId            | Unnamed: 0          | Players | Result | Accuracy | Moves | Date | Unnamed: 6 | 
|-------------------|---------------      |-------- |--------|------    |-------|------|----------- |
|7e0c2bc5f27e025 |	1 hour |	DominikHrbaty (1319) DeadlyKnightX (1387) |	0 1 | 84.7 84.4 |68 | Dec 22,2020 | NaN |
|7f6c05e773ebe23 |	30 mins |	Omarricardo34 (1126) DeadlyKnightX (1359) |	0 1 | 49 57.2 | 52 | Dec 19,2020 | NaN |
|af2b84926911844 |	30 mins |	DeadlyKnightX (1344) albert106 (1138) |	1 0 | 94.4 5.6 |13 | Dec 19,2020 | NaN |


Now we have a `games` DataFrame which holds the raw data, we can concentrate on transforming the data by splitting columns, removing unnecessary columns, and adding calculated columns to derive more insight. 

## Transform games data

```python [chess-scraper.py]
# Create white player, black player, white rating, black rating
new = games.Players.str.split(" ", n=5, expand=True)
new = new.drop([1,4], axis=1)
new[2] = new[2].str.replace('(','').str.replace(')','').astype(int)
new[5] = new[5].str.replace('(','').str.replace(')','').astype(int)
games['White Player'] = new[0]
games['White Rating'] = new[2]
games['Black Player'] = new[3]
games['Black Rating'] = new[5]

# Add results
result = games.Result.str.split(" ", expand=True)
games['White Result'] = result[0]
games['Black Result'] = result[1]

# Drop unneccessary columns
games = games.rename(columns={"Unnamed: 0": "Time"})
games = games.drop(['Players', 'Unnamed: 6', 'Result', 'Accuracy'], axis=1)

# Add calculated columns for wins, losses, draws, ratings, year, game links
conditions = [
        (games['White Player'] == USERNAME) & (games['White Result'] == '1'),
        (games['Black Player'] == USERNAME) & (games['Black Result'] == '1'),
        (games['White Player'] == USERNAME) & (games['White Result'] == '0'),
        (games['Black Player'] == USERNAME) & (games['Black Result'] == '0'),
        ]
choices = ["Win", "Win", "Loss", "Loss"]
games['W/L'] = np.select(conditions, choices, default="Draw")

conditions = [
        (games['White Player'] == USERNAME),
        (games['Black Player'] == USERNAME)
        ]
choices = ["White", "Black"]
games['Colour'] = np.select(conditions, choices)

conditions = [
        (games['White Player'] == USERNAME),
        (games['Black Player'] == USERNAME)
        ]
choices = [games['White Rating'], games['Black Rating']]
games['My Rating'] = np.select(conditions, choices)

conditions = [
        (games['White Player'] != USERNAME),
        (games['Black Player'] != USERNAME)
        ]
choices = [games['White Rating'], games['Black Rating']]
games['Opponent Rating'] = np.select(conditions, choices)

games['Rating Difference'] = games['Opponent Rating'] - games['My Rating']

conditions = [
        (games['White Player'] == USERNAME) & (games['White Result'] == '1'),
        (games['Black Player'] == USERNAME) & (games['Black Result'] == '1')
        ]
choices = [1, 1]
games['Win'] = np.select(conditions, choices)

conditions = [
        (games['White Player'] == USERNAME) & (games['White Result'] == '0'),
        (games['Black Player'] == USERNAME) & (games['Black Result'] == '0')
        ]
choices = [1, 1]
games['Loss'] = np.select(conditions, choices)

conditions = [
        (games['White Player'] == USERNAME) & (games['White Result'] == '½'),
        (games['Black Player'] == USERNAME) & (games['Black Result'] == '½')
        ]
choices = [1, 1]
games['Draw'] = np.select(conditions, choices)

games['Year'] = pd.to_datetime(games['Date']).dt.to_period('Y')

games['Link'] = pd.Series(game_links)

# Optional calculated columns for indicating black or white pieces - uncomment if interested in these
# games['Is_White'] = np.where(games['White Player']==USERNAME, 1, 0)
# games['Is_Black'] = np.where(games['Black Player']==USERNAME, 1, 0)

# Correct date format
games["Date"] = pd.to_datetime(
    games["Date"].str.replace(",", "") + " 00:00", format = '%b %d %Y %H:%M'
)

print(games.head(3))
```

| GameId                                   | Time   | Moves | Date       | White Player  | White Rating | Black Player  | Black Rating | White Result | Black Result | W/L | Colour | My Rating | Opponent Rating | Rating Difference | Win | Loss | Draw | Year | Link                                       |
| ---------------------------------------- | ------ | ----- | ---------- | ------------- | ------------ | ------------- | ------------ | ------------ | ------------ | --- | ------ | --------- | --------------- | ----------------- | --- | ---- | ---- | ---- | ------------------------------------------ |
| 7e0c2bc5f27e025b741fa464cf45a40054e0e637 | 1 hour | 68    | 22/12/2020 | DominikHrbaty | 1319         | DeadlyKnightX | 1387         | 0            | 1            | Win | Black  | 1387      | 1319            | \-68              | 1   | 0    | 0    | 2020 | https://www.chess.com/game/live/6032087036 |
| 17f6c05e773ebe23c52164b09fec2ea9de2a9dc6 | 30 min | 52    | 19/12/2020 | Omarricardo34 | 1126         | DeadlyKnightX | 1359         | 0            | 1            | Win | Black  | 1359      | 1126            | \-233             | 1   | 0    | 0    | 2020 | https://www.chess.com/game/live/6009160294 |
| af2b84926911833c2e644d6400f39437f8fe0341 | 30 min | 13    | 19/12/2020 | DeadlyKnightX | 1344         | albert106     | 1138         | 1            | 0            | Win | White  | 1344      | 1138            | \-206             | 1   | 0    | 0    | 2020 | https://www.chess.com/game/live/6009042670 |

Great! The data has been transformed, extended and is now ready for analysis.

## Analysing games data

With a solid dataset prepared, you can now apply any analysis you would like to it. These are the visualisations I produced based upon what I was interested in. First let's import the key visualisations libraries [matplotlib](https://matplotlib.org/) and [seaborn](https://seaborn.pydata.org/).

```python 
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(rc={'figure.facecolor':'white'})
```

## Overall rating

```python
fig, ax = plt.subplots(figsize=(15,6))
plt.title("Chess.com Rating Development")
sns.lineplot(x="Date", y="My Rating", data=games.iloc[::-1], color="black")
plt.xticks(rotation=0)
plt.show()
```

<article-image 
  src="https://res.cloudinary.com/dayqxxsip/image/upload/v1639925301/App%20Images/Blog%20Images/Article%20Images/Chess%20Performance/rating-development_drg7hj_ocnpjr.png" 
  alt="Overall rating development" 
  loading="lazy" 
  styling=""
  caption="" 
  captionsrc="https://res.cloudinary.com/dayqxxsip/image/upload/v1639925301/App%20Images/Blog%20Images/Article%20Images/Chess%20Performance/rating-development_drg7hj_ocnpjr.png" 
  :showsource="false">
</article-image>

I can quite clearly see here that I didn't play for a while, until the end of 2020 when I picked Chess back up. This was met by a few losses and a rating dip - I was certainly out of practice.

## Wins, losses and draws

```python
fig, ax = plt.subplots(figsize=(15,6))
plt.title("Wins, Losses and Draws")
sns.countplot(data=games, x='W/L', palette="Greys", edgecolor="black")
```

<article-image 
  src="https://res.cloudinary.com/dayqxxsip/image/upload/v1625931708/App%20Images/Blog%20Images/Article%20Images/Chess%20Performance/wins-losses-draws_z5ycdh.png" 
  alt="Wins, losses and draws chart" 
  loading="lazy" 
  styling=""
  caption="" 
  captionsrc="" 
  :showsource="false">
</article-image>

The good news from this data, is that I win more than I lose... but plenty of room for improvement!

## Wins with white vs black pieces

```python
fig, ax = plt.subplots(figsize=(15,6))
plt.title("Wins, Losses and Draws by Colour")
sns.countplot(data=games, x='W/L', hue="Colour", palette={"Black": "Grey", "White": "White"}, edgecolor="black");
```

<article-image 
  src="https://res.cloudinary.com/dayqxxsip/image/upload/v1625931708/App%20Images/Blog%20Images/Article%20Images/Chess%20Performance/wins-by-piece-colour_upf00w.png" 
  alt="Wins, losses and draws by piece colour" 
  loading="lazy" 
  styling=""
  caption="" 
  captionsrc="" 
  :showsource="false">
</article-image>

This clearly shows that I am stronger playing as black.

## Win rate with white vs black pieces

```python
fig, ax = plt.subplots(figsize=(15,6))
ax.set_title("Win Rate by Colour")
sns.barplot(data=games, x='Colour', y='Win', palette={"Black": "Grey", "White": "White"}, edgecolor="black", ax=ax);
```

<article-image 
  src="https://res.cloudinary.com/dayqxxsip/image/upload/v1625931708/App%20Images/Blog%20Images/Article%20Images/Chess%20Performance/win-rate-by-piece-colour_xkeeb2.png" 
  alt="Win rate by piece colour" 
  loading="lazy" 
  styling=""
  caption="" 
  captionsrc="" 
  :showsource="false">
</article-image>

A higher win rate as black.

## Correlation

```python
corr = games.corr()
fig, ax = plt.subplots(1, 1, figsize=(14, 8))
sns.heatmap(corr, cmap="Greys", annot=True, fmt='.2f', linewidths=.05, ax=ax).set_title("Chess Results Correlation Heatmap")
fig.subplots_adjust(top=0.93)
```

<article-image 
  src="https://res.cloudinary.com/dayqxxsip/image/upload/v1639925301/App%20Images/Blog%20Images/Article%20Images/Chess%20Performance/correlation_vostii_mkzkqg.png" 
  alt="Correlation heat chart" 
  loading="lazy" 
  styling=""
  caption="" 
  captionsrc="" 
  :showsource="false">
</article-image>

Can see an immediate negative correlation on Wins with Rating Difference and Moves.

## Moves in a typical game

```python
fig = plt.figure(figsize=(14,8))
ax = fig.add_subplot(1,1,1)
ax.set_title("How many moves in my typical game?")

sns.histplot(games, x="Moves", hue="Colour", palette={"Black": "Black", "White": "Grey"})
plt.close(2)
```

<article-image 
  src="https://res.cloudinary.com/dayqxxsip/image/upload/v1625931708/App%20Images/Blog%20Images/Article%20Images/Chess%20Performance/move-count_e4slpz.png" 
  alt="Moves in a typical game" 
  loading="lazy" 
  styling=""
  caption="" 
  captionsrc="https://res.cloudinary.com/dayqxxsip/image/upload/v1625931708/App%20Images/Blog%20Images/Article%20Images/Chess%20Performance/move-count_e4slpz.png" 
  :showsource="false">
</article-image>

Most of my games are around 25 to 30 moves in length.

## Moves vs wins

```python
fig = plt.figure(figsize=(14,8))
ax = fig.add_subplot(1,1,1)
ax.set_title("Does the amount of moves affect my win rate?")

sns.histplot(games, x="Moves", hue="W/L", multiple="stack", palette={"Loss": "Black", "Win": "Gray", "Draw": "lightgray"})
plt.close(2)
```

<article-image 
  src="https://res.cloudinary.com/dayqxxsip/image/upload/v1625934076/App%20Images/Blog%20Images/Article%20Images/Chess%20Performance/moves-vs-win-rate_qnn4f1.png" 
  alt="Moves vs wins chart" 
  loading="lazy" 
  styling=""
  caption="" 
  captionsrc="https://res.cloudinary.com/dayqxxsip/image/upload/v1625934076/App%20Images/Blog%20Images/Article%20Images/Chess%20Performance/moves-vs-win-rate_qnn4f1.png" 
  :showsource="false">
</article-image>

My win rate does seem to decrease the more moves taken - around the 40 to 80 range is a problem. The number of draws increases as moves taken goes up also. I seem to win more around the sub-35 move range. Lets confirm that...

```python
grouped_df = games.groupby(['W/L', pd.cut(games['Moves'], 10)])
grouped_df = grouped_df.size().unstack().transpose()

total_games = grouped_df["Win"] + grouped_df["Loss"] + grouped_df["Draw"]
total_wins = grouped_df["Win"]

grouped_df["Win Rate %"] = round((total_wins / total_games) * 100, 0)
grouped_df
```

| W/L             | Draw | Loss | Win | Win Rate % |
| --------------- | ---- | ---- | --- | ---------- |
| Moves           |      |      |     |            |
| (0.846, 16.4\]  | 1    | 5    | 12  | 67         |
| (16.4, 31.8\]   | 0    | 37   | 44  | 54         |
| (31.8, 47.2\]   | 2    | 19   | 29  | 58         |
| (47.2, 62.6\]   | 9    | 17   | 14  | 35         |
| (62.6, 78.0\]   | 0    | 4    | 3   | 43         |
| (78.0, 93.4\]   | 1    | 0    | 2   | 67         |
| (93.4, 108.8\]  | 0    | 0    | 0   | NaN        |
| (108.8, 124.2\] | 0    | 0    | 0   | NaN        |
| (124.2, 139.6\] | 0    | 0    | 0   | NaN        |
| (139.6, 155.0\] | 1    | 0    | 0   | 0          |

As thought, only a 35% win rate in the 47-63 moves bin, and a 43% win rate in the 62-78 move bin. Seems like a good idea to practice the endgame more right?

## Opponent's rating vs wins

```python
fig = plt.figure(figsize=(14,8))
ax = fig.add_subplot(1,1,1)
ax.set_title("Does my opponent's rating affect my win rate?")

sns.histplot(games, x="Rating Difference", hue="Win", palette={0: "Black", 1: "Grey"})
plt.close(2)
```

<article-image 
  src="https://res.cloudinary.com/dayqxxsip/image/upload/v1625931709/App%20Images/Blog%20Images/Article%20Images/Chess%20Performance/opponents-rating-vs-win-rate_kdueka.png" 
  alt="Opponent's rating vs wins chart" 
  loading="lazy" 
  styling=""
  caption="" 
  captionsrc="" 
  :showsource="false">
</article-image>

Clearly a higher loss rate against higher rated opponents (+) which I think is to be expected.

## Time pressure vs wins

```python
fig = plt.figure(figsize=(14,8))
plt.title("How is time pressure affecting my game?")
sns.countplot(data=games, x='Time', hue="W/L", palette={"Win":"#CCCCCC", "Loss":"Grey", "Draw":"White"}, edgecolor="Black");
```

<article-image 
  src="https://res.cloudinary.com/dayqxxsip/image/upload/v1625931708/App%20Images/Blog%20Images/Article%20Images/Chess%20Performance/time-pressure_owl0uf.png" 
  alt="Time pressure vs wins chart" 
  loading="lazy" 
  styling=""
  caption="" 
  captionsrc="" 
  :showsource="false">
</article-image>

Overwhelmingly better at 30 and 10 minute games, quicker games fair much worse - a lesson to be learnt here, take your time and play long games.

## Rating vs wins

```python
fig = plt.figure(figsize=(14,8))
ax = fig.add_subplot(1,1,1)
ax.set_title("How does my rating affect wins?")

sns.histplot(games, x="My Rating", hue="Win", multiple="dodge", palette={0: "Black", 1: "Grey"})
plt.close(2)
```

<article-image 
  src="https://res.cloudinary.com/dayqxxsip/image/upload/v1625931709/App%20Images/Blog%20Images/Article%20Images/Chess%20Performance/rating-vs-wins_bxqjin.png" 
  alt="Rating vs wins chart" 
  loading="lazy" 
  styling=""
  caption="" 
  captionsrc="" 
  :showsource="false">
</article-image>

There is a pattern of high losses, then an increase in rating, higher wins then high losses again - this must be a development pattern in action. Importantly, must get more experience playing games at the higher level to match the 1000 - 1200 range. The 1400 - 1600 should be as high to be able to break into the 1600 - 1800 range.

## Final words

I hope you enjoyed this tutorial. Now you have a way to monitor, track and analyse your Chess.com games archive to identify trends. Some of the actions this analysis has led me to are:

* Concentrating on improving on the endgame.
* Increasing my exposure to higher rated games.
* Strengthening play with the White pieces.
* Playing more consistently to ensure rating is accurate.

If there are any other analytical questions you'd like to ask of this dataset, let me know in the comments below and I'll update the article. 

If you want to export the data to CSV you can use something like this on the `games` DataFrame:

```python
path = os.path.join(os.path.dirname(os.getcwd()), 'my-chess-games-data.csv')
games.to_csv(path, index=False)
```]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Multiple authentication schemes with ASP.NET Core and Azure Active Directory]]></title>
            <link>https://shedloadofcode.com/blog/multiple-authentication-schemes-with-aspnet-core-and-azure-active-directory/</link>
            <guid>https://shedloadofcode.com/blog/multiple-authentication-schemes-with-aspnet-core-and-azure-active-directory/</guid>
            <pubDate>Fri, 25 Jun 2021 16:49:00 GMT</pubDate>
            <description><![CDATA[I was recently asked to add Azure Active Directory authentication to an existing ASP.NET Core application which already had two other sign in options. This became quite the challenge!]]></description>
            <content:encoded><![CDATA[
I recently came across an interesting and challenging problem. I was asked to add Azure Active Directory (AAD) authentication to an existing ASP.NET Core web app, which already had two sign in options. I had added AAD to an application as the only sign in option before, but not alongside other sign in options.

I found that within the documentation [adding AAD to an application as the only sign option](https://docs.microsoft.com/en-us/azure/active-directory/develop/quickstart-v2-aspnet-core-webapp) was fairly straightforward - as mentioned I’d done this before. However, when trying to add it as a third authentication scheme, things got a little more tricky. There was some guidance for [multiple authentication](
https://github.com/AzureAD/microsoft-identity-web/wiki/Multiple-Authentication-Schemes) but not much. Although this article is not extensive and I can’t share all the code because it was at work, hopefully it will provide enough information to help you out if you find yourself attempting the same thing. This article is certainly not a tutorial, more of a reflection on how I arrived at the solution.

## The starting point

The application I was working on already had two sign in options. There was a selection screen flow which looked something like the image below. Another option would need adding to this for internal AAD users. The first and second option would go off to the existing sign in options, the third would direct to the AAD / Microsoft Identity sign in page. Excuse the bad flow diagram 😆

<article-image 
  src="https://res.cloudinary.com/dayqxxsip/image/upload/v1639837976/App%20Images/Blog%20Images/Article%20Images/AAD%20Multiple%20Authentication/login-flow-diagram_qdqeuz_ialjar.png" 
  alt="Login flow diagram" 
  loading="lazy" 
  styling=""
  caption="Proposed login flow diagram" 
  captionsrc="https://res.cloudinary.com/dayqxxsip/image/upload/v1639837976/App%20Images/Blog%20Images/Article%20Images/AAD%20Multiple%20Authentication/login-flow-diagram_qdqeuz_ialjar.png" 
  :showsource="false">
</article-image>

The existing authentication schemes were configured in the `Startup` class using a method `AddAndConfigureExternalAuthentication`. I have only included relevant parts in the code snippets, so these are not working examples.

```csharp [Startup.cs]
using Microsoft.Identity.Web.UI;
using Microsoft.IdentityModel.Protocols.OpenIdConnect;
using Microsoft.OpenApi.Models;
...

namespace ShedloadOfCode.Web
{
    public class Startup
    {
        private readonly IConfiguration _configuration;
        private readonly IHostEnvironment _hostEnvironment;

        public Startup(IConfiguration configuration,
            IHostEnvironment hostEnvironment)
        {
            _configuration = configuration;
            _hostEnvironment = hostEnvironment;
        }

        public void ConfigureServices(IServiceCollection services)
        {
          ...
          
          services.AddAndConfigureExternalAuthentication(_configuration);
          
          ... 
        }

        ...

    }
}
```

The app handled sign in and sign out within an `AccountController`, particularly important is the `ExternalLogin` action, as when the option in the diagram is selected this action will take the given authentication scheme and issue a new challenge redirecting to the relevant identity provider: 

```csharp [AccountController.cs]
using Microsoft.AspNetCore.Authentication;
using Microsoft.AspNetCore.Authorization;
using Microsoft.AspNetCore.Mvc;
using Microsoft.Extensions.Configuration;
using Microsoft.Extensions.Options;
using System.Linq;
using System.Threading.Tasks;

namespace ShedloadOfCode.Web.Controllers
{
    public class AccountController : Controller
    {
        ...

        [HttpGet]
        [AllowAnonymous]
        public async Task<IActionResult> Login(string returnUrl = null)
        {
            var result = await _appAuthenticationHandler.SignInAsync(returnUrl, this);
            return result;
        }

        [HttpPost]
        [AllowAnonymous]
        public async Task<IActionResult> Login(
            LoginViewModel credentials, string returnUrl = null)
        {
            var result = await _appAuthenticationHandler.SignInAsync(
                credentials, returnUrl, this);
            return result;
        }

        public new IActionResult SignOut()
        {
            var callbackUrl = Url.Action("Index", "Home");
            HttpContext.ClearAllTempData();
            return _appAuthenticationHandler.SignOut(callbackUrl, this);
        }

        public IActionResult SignedOut()
        {
            if (User.Identity.IsAuthenticated)
            {
                return RedirectToAction(nameof(HomeController.Welcome), "Home");
            }

            return RedirectToAction(nameof(HomeController.Index), "Home");
        }

        [HttpGet]
        public async Task<IActionResult> Selector()
        {
            if ((await _authenticationSchemeProvider.GetRequestHandlerSchemesAsync()).Count() < 2)
            {
                return NotFound();
            }

            return View();
        }

        [HttpGet]
        [AllowAnonymous]
        public async Task<IActionResult> ExternalLogin(
            [FromQuery] string provider,
            [FromQuery] string returnUrl = "/")
        {
            if ((await _authenticationSchemeProvider.GetRequestHandlerSchemesAsync()).Count() < 2)
            {
                return NotFound();
            }

            string authenticationScheme = _appAuthenticationHandler.GetAuthenticationScheme(provider);

            if (string.IsNullOrWhiteSpace(authenticationScheme))
            {
                ModelState.AddModelError(nameof(provider), "Select a sign in option");
                return View("Selector");
            }

            var auth = new AuthenticationProperties
            {
                RedirectUri = Url.Action(nameof(LoginCallback), new { provider, returnUrl })
            };

            return new ChallengeResult(authenticationScheme, auth);
        }

        public IActionResult LoginCallback(
            string provider,
            string returnUrl = "~/")
        {
            if (User.Identity.IsAuthenticated)
            {
                return LocalRedirect(string.IsNullOrEmpty(returnUrl) ? "~/" : returnUrl);
            }

            return RedirectToAction(nameof(Selector), new { returnUrl = returnUrl });
        }
    }
}
```

As you might have noticed this controller had a few helper methods injected from a service. I added a new value 'AAD' to the `GetAuthenticationScheme` lookup method - this would return an authentication scheme called 'AzureAd':

```csharp [FederationAppAuthenticationHandler.cs]
using Microsoft.AspNetCore.Authentication;
using Microsoft.AspNetCore.Authentication.Cookies;
using Microsoft.AspNetCore.Authentication.OpenIdConnect;
using Microsoft.AspNetCore.Authentication.WsFederation;
using Microsoft.AspNetCore.Http;
using Microsoft.AspNetCore.Mvc;

namespace ShedloadOfCode.Web.Services
{
    public class FederationAppAuthenticationHandler : IAppAuthenticationHandler
    {
        private readonly IHttpContextAccessor _httpContextAccessor;

        public FederationAppAuthenticationHandler(
            IHttpContextAccessor httpContextAccessor)
        {
            _httpContextAccessor = httpContextAccessor;
        }

        public Task<IActionResult> SignInAsync(
            string returnUrl, Controller controller)
        {
            throw new NotSupportedException("No such page exists");
        }

        public Task<IActionResult> SignInAsync(
            LoginViewModel credentials, string returnUrl, Controller controller)
        {
            throw new NotSupportedException();
        }

        public IActionResult SignOut(string callbackUrl, Controller controller)
        {
            var provider = _httpContextAccessor.HttpContext.User.AuthenticationProvider();
            var authenticationScheme = GetAuthenticationScheme(provider);

            return controller.SignOut(
                new AuthenticationProperties { RedirectUri = callbackUrl },
                CookieAuthenticationDefaults.AuthenticationScheme,
                authenticationScheme);
        }

        public string GetAuthenticationScheme(string provider)
        {
            string authenticationScheme = null;

            if (String.Equals("FirstAuthenticationProviderName",
                provider, StringComparison.OrdinalIgnoreCase))
            {
                authenticationScheme = WsFederationDefaults.AuthenticationScheme;
            }
            else if (String.Equals("SecondAuthenticationProviderName",
                provider, StringComparison.OrdinalIgnoreCase))
            {
                authenticationScheme = OpenIdConnectDefaults.AuthenticationScheme;
            }
            else if (String.Equals("AAD",
                provider, StringComparison.OrdinalIgnoreCase))
            {
                authenticationScheme = "AzureAd";
            }

            return authenticationScheme;
        }
    }
}
```

## My first steps

I recalled how I had added AAD as the only sign in method to an app before, and tried those steps first:

* Create an app registration in the AAD in the Azure Portal

<article-image 
  src="https://res.cloudinary.com/dayqxxsip/image/upload/v1639837976/App%20Images/Blog%20Images/Article%20Images/AAD%20Multiple%20Authentication/create-a-new-registration-1_nkdstk_ccjfgz.png" 
  alt="Create an app registration in AAD" 
  loading="lazy" 
  styling=""
  caption="Create an app registration in AAD" 
  captionsrc="https://res.cloudinary.com/dayqxxsip/image/upload/v1639837976/App%20Images/Blog%20Images/Article%20Images/AAD%20Multiple%20Authentication/create-a-new-registration-1_nkdstk_ccjfgz.png" 
  :showsource="false">
</article-image>

* Create a sign-in and sign-out route for the new app registration, and enable ID tokens

<article-image 
  src="https://res.cloudinary.com/dayqxxsip/image/upload/v1639837976/App%20Images/Blog%20Images/Article%20Images/AAD%20Multiple%20Authentication/create-signin-and-signout-routes_jw2rvg_hfzgwe.png" 
  alt="Create a signin and signout endpoint" 
  loading="lazy" 
  styling=""
  caption="Create a signin and signout endpoint" 
  captionsrc="https://res.cloudinary.com/dayqxxsip/image/upload/v1639837976/App%20Images/Blog%20Images/Article%20Images/AAD%20Multiple%20Authentication/create-signin-and-signout-routes_jw2rvg_hfzgwe.png" 
  :showsource="false">
</article-image>

* Create a client secret for the new app registration

<article-image 
  src="https://res.cloudinary.com/dayqxxsip/image/upload/v1639837976/App%20Images/Blog%20Images/Article%20Images/AAD%20Multiple%20Authentication/create-client-secret_a3dfro_zt2htq.png" 
  alt="Create a client secret" 
  loading="lazy" 
  styling=""
  caption="Create a client secret" 
  captionsrc="https://res.cloudinary.com/dayqxxsip/image/upload/v1639837976/App%20Images/Blog%20Images/Article%20Images/AAD%20Multiple%20Authentication/create-client-secret_a3dfro_zt2htq.png" 
  :showsource="false">
</article-image>

* Install [Microsoft.Identity.Web](https://www.nuget.org/packages/Microsoft.Identity.Web) and [Microsoft.Identity.Web.UI](https://www.nuget.org/packages/Microsoft.Identity.Web.UI) Nuget packages in the project

* Update `appsettings.json` with the app registration details (found in the 'Overview' tab in the Azure portal)

```json [appsettings.json]
{
  "AzureAd": {
    "Instance": "https://login.microsoftonline.com/",
    "Domain": "yourdomain.onmicrosoft.com",
    "ClientId": "11adca46-d907-4803-945f-demoClientId",
    "TenantId": " b3b8b34a82f9-c69a-4da1-a5f2-demoTenantId",
    "ClientSecret": ".dVv3r.2g2ED6_Xb-bSaXROml~demoClientSecret",
    "MetadataAddress": "https://login.microsoftonline.com/b3b8b34a82f9-c69a-4da1-a5f2-demoTenantId/v2.0/.well-known/openid-configuration",
    "CallbackPath": "/signin-oidc",
    "SignedOutCallbackPath": "/signout-callback-oidc",
    "SignedOutRedirectUri": "/"
  }
  ...
}
```

* Add the same method I had used before for AAD authentication to `Startup.cs` called `AddMicrosoftIdentityWebApp` which is also in the [documentation](https://docs.microsoft.com/en-us/azure/active-directory/develop/quickstart-v2-aspnet-core-webapp#more-information). I also initialised the Microsoft.Identity.Web.UI package with `AddMicrosoftIdentityUI` to handle the sign in screen.

```csharp [Startup.cs]
using Microsoft.Identity.Web.UI;
using Microsoft.IdentityModel.Protocols.OpenIdConnect;
using Microsoft.OpenApi.Models;
...

namespace ShedloadOfCode.Web
{
    public class Startup
    {
        private readonly IConfiguration _configuration;
        private readonly IHostEnvironment _hostEnvironment;

        public Startup(IConfiguration configuration,
            IHostEnvironment hostEnvironment)
        {
            _configuration = configuration;
            _hostEnvironment = hostEnvironment;
        }

        public void ConfigureServices(IServiceCollection services)
        {
          ...
          
          services.AddAndConfigureExternalAuthentication(_configuration);
          
          services.AddAuthentication()
            .AddMicrosoftIdentityWebApp(_configuration,
              configSectionName: "AzureAd",
              openIdConnectScheme: "AzureAd",
              cookieScheme: "AzureAdCookies")

          services.AddRazorPages()
                .AddMicrosoftIdentityUI();

          ... 
        }

        ...

    }
}
```


I had to add a distinct `openIdConnect` and `cookieScheme` to [avoid scheme conflicts](https://stackoverflow.com/questions/56433112/system-invalidoperationexception-scheme-already-exists-identity-application) when using this approach. `configSectionName` just pulls the relevent config section `AzureAd` from `appsettings.json`. 

However, after selecting the new sign in option for AAD, being sent to the Microsoft Identity sign in page and entering credentials and clicking login, I was redirected back to the application, but wasn't authenticated! I was very confused by this, especially since it had worked so well in other apps as the only sign in method. Plus we can see quite clearly here in the docs for [single authentication](https://github.com/AzureAD/microsoft-identity-web/wiki/web-apps) and [multiple authentication](
https://github.com/AzureAD/microsoft-identity-web/wiki/Multiple-Authentication-Schemes) this is the recommended approach:

<article-image 
  src="https://res.cloudinary.com/dayqxxsip/image/upload/v1639837976/App%20Images/Blog%20Images/Article%20Images/AAD%20Multiple%20Authentication/multiple-auth-docs_oablyo_e1ez1i.png" 
  alt="Adding AAD sign in docs" 
  loading="lazy" 
  styling=""
  caption="Summary for adding AAD / Microsoft Identity sign in" 
  captionsrc="https://res.cloudinary.com/dayqxxsip/image/upload/v1639837976/App%20Images/Blog%20Images/Article%20Images/AAD%20Multiple%20Authentication/multiple-auth-docs_oablyo_e1ez1i.png" 
  :showsource="false">
</article-image>

## The solution - using AddOpenIdConnect()

So I came across this [super helpful article](https://www.codeproject.com/Articles/5297820/Azure-Active-Directory-Authentication-with-OpenID), and thought okay I should try using the `AddOpenIdConnect` method to sign in. I added the configuration for each option... and this time, after the redirect back to the application, the user was authenticated! 😄

```csharp [Startup.cs]
using Microsoft.Identity.Web.UI;
using Microsoft.IdentityModel.Protocols.OpenIdConnect;
using Microsoft.OpenApi.Models;
...

namespace ShedloadOfCode.Web
{
    public class Startup
    {
        private readonly IConfiguration _configuration;
        private readonly IHostEnvironment _hostEnvironment;

        public Startup(IConfiguration configuration,
            IHostEnvironment hostEnvironment)
        {
            _configuration = configuration;
            _hostEnvironment = hostEnvironment;
        }

        public void ConfigureServices(IServiceCollection services)
        {
          ...
          
          services.AddAndConfigureExternalAuthentication(_configuration);
          
          var azureAdConfiguration = _configuration.GetSection("AzureAd").Get<AzureAdConfigOptions>();

          services.AddAuthentication()
            .AddOpenIdConnect("AzureAd", options =>
            {
                options.SignInScheme = CookieAuthenticationDefaults.AuthenticationScheme;
                options.Authority = azureAdConfiguration.MetadataAddress;
                options.ClientId = azureAdConfiguration.ClientId;
                options.ClientSecret = _configuration.GetValue<string>(azureAdConfiguration.ClientSecret);
                options.CallbackPath = new PathString(azureAdConfiguration.CallbackPath);
                options.MetadataAddress = azureAdConfiguration.MetadataAddress;
                options.SignedOutCallbackPath = new PathString(azureAdConfiguration.SignedOutCallbackPath);
                options.SignedOutRedirectUri = new PathString(azureAdConfiguration.SignedOutRedirectUri);
                options.ResponseType = OpenIdConnectResponseType.Code;
                options.UsePkce = true;
                options.Scope.Add("openid");
                options.Scope.Add("profile");
                options.SaveTokens = true;
                options.Events.OnSignedOutCallbackRedirect += context =>
                {
                    context.Response.Redirect(azureAdConfiguration.SignedOutRedirectUri);
                    context.HandleResponse();

                    return Task.CompletedTask;
                };
                options.Events.OnTokenValidated = async (context) =>
                {
                    if (context.Principal.Identity.IsAuthenticated)
                    {
                        // Set auth provider using an extension method to facilitate logout
                        context.Principal.SetAuthenticationProvider("AAD");

                        // Get AAD username from claims
                        var emailAddress = context.Principal.Claims
                            .Where(c => c.Type == "preferred_username")
                            .Select(c => c.Value)
                            .ToList()
                            .First();

                        // Get AAD security groups from claims
                        var groups = context.Principal.Claims
                            .Where(c => c.Type == "groups")
                            .Select(c => c.Value)
                            .ToList();
                    }
                };
            });

          services.AddRazorPages()
                .AddMicrosoftIdentityUI();

          ... 
        }

        ...

    }
}
```

I set the authentication scheme as `AzureAd` so the controller knows which challenge to issue after the selection screen. After the token validates, I can see the user is authenticated and I can get the user details and claims that are returned from AAD. No separate `cookieScheme` needs setting for this approach either, it will just use `CookieAuthenticationDefaults.AuthenticationScheme` which is 'Cookies'. This code is still using the values we set in `appsettings.json` just mapping them to `AzureAdConfigOptions` and using them individually.

```csharp [AzureAdConfigOptions.cs]
namespace ShedloadOfCode.Web.Options
{
  public class AzureAdConfigOptions
  {
    public string Instance { get; set; }
    public string Domain { get; set; }
    public string ClientId { get; set; }
    public string TenantId { get; set; }
    public string ClientSecret { get; set; }
    public string MetadataAddress { get; set; }
    public string CallbackPath { get; set; }
    public string SignedOutCallbackPath { get; set; }
    public string SignedOutRedirectUrl { get; set; }
  }
}
```

I was really pleased with this outcome. Usually, when it comes to searching documentation, reading Stack Overflow and general Google-Fu, I’m quite skilled. However the answer to this one evaded me for some time! I traced back the usage of `AddOpenIdConnect` within the `AddMicrosoftIdentityWebApp` method in the Microsoft.Identity.Web [source code](https://github.com/AzureAD/microsoft-identity-web/blob/master/src/Microsoft.Identity.Web/WebAppExtensions/MicrosoftIdentityWebAppAuthenticationBuilderExtensions.cs).

## Getting AAD group information 

One requirement for authorisation was to only allow users with a specific AAD group to access the application - others needed to ask permission to be added to the AAD group. I retrieved them in the solution code in the `groups` variable, however for group claims to be returned from AAD, they need enabling in Azure.

To enable group claims, you head back to the app registration and select 'Add groups claim' inside 'Token configuration'.

<article-image 
  src="https://res.cloudinary.com/dayqxxsip/image/upload/v1639837976/App%20Images/Blog%20Images/Article%20Images/AAD%20Multiple%20Authentication/add-group-claims_ylikwq_slkjrl.png" 
  alt="Enabling group claims" 
  loading="lazy" 
  styling=""
  caption="Enabling group claims to be returned" 
  captionsrc="https://res.cloudinary.com/dayqxxsip/image/upload/v1639837976/App%20Images/Blog%20Images/Article%20Images/AAD%20Multiple%20Authentication/add-group-claims_ylikwq_slkjrl.png" 
  :showsource="false">
</article-image>

This allows the AAD group information to be returned for the authenticated user. You can then access these as claims and use specific groups a user belongs to for authorisation and access control.

## Next steps

My next steps will be code clean up. I’ll move the `ClientSecret` into [Azure Key Vault](https://azure.microsoft.com/en-gb/services/key-vault/), and move the AAD authentication code into an `AddAndConfigureAzureAdAuthentication` method to tidy things up. So now any user who selects that new option, and is part of the organisation's AAD and within the specific AAD group can access the application 😄 Well it was a tough journey, but got there in the end. I would be lying if I said I didn't nearly give up on it a few times! 

I really hope this article has helped you to avoid the issues I had trying to set this up. 

If you enjoyed this article be sure to check out [other articles](/) on the site including:

* [Searching for text in PDFs at increasing scale](/blog/searching-for-text-in-pdfs-at-increasing-scale/) with C# and Python]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Building a website analytics dashboard with Power BI and Google Sheets]]></title>
            <link>https://shedloadofcode.com/blog/building-a-website-analytics-dashboard-with-power-bi-and-google-sheets/</link>
            <guid>https://shedloadofcode.com/blog/building-a-website-analytics-dashboard-with-power-bi-and-google-sheets/</guid>
            <pubDate>Fri, 18 Jun 2021 10:40:00 GMT</pubDate>
            <description><![CDATA[Learn how to build a website usage analytics dashboard from scratch using Power BI.]]></description>
            <content:encoded><![CDATA[
In a previous article I demonstrated [creating a website analytics solution using AWS Lambda and Google Sheets](/blog/creating-your-own-website-analytics-solution-with-aws-lambda-and-google-sheets/). The data collected was then used to build a Power BI dashboard. I chose Power BI because I’ve worked with it in a professional setting, and for quickly putting together high quality, interactive dashboards, it’s very good. There are a few gotchas to watch out for with it, but on the whole it’s quite straightforward to use.

In this article, we’ll go over how I built the usage analytics dashboard for this site and in the process you’ll learn some fundamental Power BI skills. Unlike some tutorials, this is a real world use case, many workplaces have digital products and want to monitor how well they are performing to improve them for their customers. Although this tutorial will be suitable for beginners, we'll be diving straight into the skills needed to build a professional report including using the Power Query Editor and DAX (data analysis expressions). I think jumping into the deep end is a good thing, gaps in understanding can be filled in later. Being able to import website usage data and turn it into valuable information is a very useful skill to have. By the end of this article you should be able to develop an entire dashboard from scratch without any prior knowledge of Power BI. Let’s begin!


## Requirements

Firstly I set out a list of requirements of what metrics and functionality the dashboard would need to have:

* Total visitors card
* Total page views card
* Total page views by device and timezone bar chart
* Total page views by browser and operating system bar chart
* Total views by path table
* Daily page views time series
* Hourly breakdown for any given day
* Date slicer

The finished product should end up looking something like this:

<article-image 
  src="https://res.cloudinary.com/dayqxxsip/image/upload/v1623768801/App%20Images/Blog%20Images/Article%20Images/Analytics/dashboard-preview-1_bdk96s.png" 
  alt="Power BI analytics dashboard" 
  loading="lazy" 
  styling=""
  caption="" 
  captionsrc="" 
  :showsource="false">
</article-image>

## Download Power BI Desktop

Let's start by [downloading Power BI Desktop](https://powerbi.microsoft.com/en-us/downloads/). Once installed, open Power BI Desktop and you should arrive at a screen which looks like this:

<article-image 
  src="https://res.cloudinary.com/dayqxxsip/image/upload/v1624009390/App%20Images/Blog%20Images/Article%20Images/Analytics/power-bi-intro-screen_spvtq5.png" 
  alt="Power BI Desktop start screen" 
  loading="lazy" 
  styling=""
  caption="" 
  captionsrc="" 
  :showsource="false">
</article-image>

Power BI Desktop receives monthly updates from Microsoft and the layout can change slightly. If you're ever interested in keeping up to date with the monthly updates, you can find them on the [Power BI blog](https://powerbi.microsoft.com/en-us/blog/) and review previous month's updates on this [previous updates page](https://docs.microsoft.com/en-us/power-bi/fundamentals/desktop-latest-update-archive?tabs=powerbi-desktop). 

The first thing you'll always want to do from this start screen, is click 'Get data'. That will be our starting point for the next section. 


## Building the report

Power BI Desktop has a very interactive interface with lots of things to click on! So rather than writing out all the steps with screenshots, I've added a video in this section that shows the whole dashboard building process from start to finish. This should make it much easier to follow along 😄 The steps in the video are:

1. Get data from the Google Sheet using this URL: https://docs.google.com/spreadsheets/d/1jIUARNqb02c0xzTqj6AhE94M38WnAEcymSfsXfZm06s/edit?usp=sharing
2. Change the URL ending from `/edit?usp=sharing` to `/export?format=xlsx` 
3. Transform data in Power Query by removing blank rows then adding a date, hour and index column
4. Add [calculated columns](https://docs.microsoft.com/en-us/power-bi/transform-model/desktop-tutorial-create-calculated-columns) and [measures](https://docs.microsoft.com/en-us/power-bi/transform-model/desktop-tutorial-create-measures) with [DAX](https://docs.microsoft.com/en-us/dax/) to calculate visitors, page views and average time on page in seconds
5. Create report visuals 
6. Style the report
7. Add a drill through page for hourly analysis
8. Add a toggle between browser, device, OS and timezone visuals using bookmarks

The Google Sheet data we're using is test data - not the live site data. It’s the same structure, but only contains logs from early testing activity.  For step 4, you’ll find all the DAX you’ll need for it underneath the video. 

<article-video 
  id="Ocr4r7Fo2TY" 
  title="Building a website analytics dashboard with Power BI and Google Sheets">
</article-video>

**DAX statements for step 4 as promised 😄**

In step 4, you can see I'm first calculating the next row's session ID and created at date. Then if the session ID is different from the previous row, I know it's a completely different person / session. We can't predict how long that last page view event was, but for all the others we can calculate the time between dates using `DATEDIFF` to find `TimeOnPageInSeconds`. The average of that column gives the `AverageTimeOnPageInSeconds` measure - concatenated with an 's' so it displays units nicely in the visual. Allowing users to quickly interpret the units of measurement is very important.

**Visitors**

```dax [visitors.dax]
Visitors = CALCULATE(
    COUNT(
        EventsLog[EventType]
    ),      
    EventsLog[EventType] = "Visit Site"
)
```

**Page Views**

```dax [page-views.dax]
Page Views = COUNT(EventsLog[EventType]) 
```

**NextSessionId**

```dax [next-session-id.dax]
NextSessionId = 

VAR PreviousIndex =
CALCULATE(
    MAX( EventsLog[Index] ),
    FILTER(
        EventsLog,
        EventsLog[Index] < EARLIER( EventsLog[Index] )
    )
)

VAR Result =
CALCULATE(
    MAX( EventsLog[SessionId] ),
    FILTER(
        EventsLog,
        EventsLog[Index] = PreviousIndex
    )
)

RETURN Result
``` 

**NextCreatedAt**

```dax [next-created-at.dax]
NextCreatedAt = 

VAR PreviousIndex =
CALCULATE(
    MAX( EventsLog[Index] ),
    FILTER(
        EventsLog,
        EventsLog[Index] < EARLIER( EventsLog[Index] )
    )
)

VAR Result =
CALCULATE(
    MAX( EventsLog[CreatedAt] ),
    FILTER(
        EventsLog,
        EventsLog[Index] = PreviousIndex
    )
)

RETURN Result
```

**TimeOnPageInSeconds**

```dax [time-on-page.dax]
TimeOnPageInSeconds = 

IF(
    EventsLog[SessionId] <> EventsLog[NextSessionId],
    0,
    DATEDIFF(EventsLog[CreatedAt], EventsLog[NextCreatedAt], SECOND)
)
```

**AverageTimeOnPageInSeconds**

```dax [average-time-on-page.dax]
AverageTimeOnPageInSeconds = CONCATENATE(
    ROUND(
        AVERAGE(EventsLog[TimeOnPageInSeconds]),
        2
    ),
    "s"
)
```

In **step 7** I used a drill through page. To use drill through, a user must right-click a data point in another report page, and drill through to the focused page to get details that are filtered to that context. This effectively 'filters' the destination page by whichever data point you drilled through on. In our case, when you right click a `Date` data point on the time series on the 'Dashboard' page, you drill through to the 'Hourly Analysis' page, which breaks down the usage by hour for that day. This works because on the 'Hourly Analysis' we added the `Date` column to the drill through section, which enables drillthrough for any visual using that column. Once you wrap your head around that, it becomes a very powerful tool for providing deeper insight without overloading pages. You can use it to separate the main high-level visualisations from more low-level analysis. Some users might only want the high-level information, but more advanced users might want to drill through to the details. This let's you accomodate both. So usually I just want to see the day by day page views, but if I see a spike on any given day, I might drill through to see at what hours the page views happened.

In **step 8** I used bookmarks to toggle between visuals. Bookmarks are created first and then can be linked to buttons. Bookmarks sort of 'take a snapshot' of which visuals are visible or hidden and what filters have been applied (if any). So in this case we have many buttons to only show one visual at a time, whilst hiding the others. We then attach those bookmarks to the buttons as actions, so when they are clicked that bookmark 'snapshot' is applied. This can be time consuming to set up, but works well for simple show and hide or toggle functionality like this. The main use for this is to avoid overcrowding your report page. It also gives it more of an app-like feel. 


## Job done! Where to next?

I hope you’ve enjoyed this tutorial, and have picked up some knowledge of Power BI you can use in other projects. You should now be able to build a robust professional dashboard from scratch, so well done! You might have noticed that using Power BI is as much about preparing and transforming the data, as it is about the visuals themselves. The 'garbage in, garbage out' principle is very important, your report will only ever be as good as the data fed into it. So always know your underlying data inside out and question the quality of it. 

We used DAX to calculate the average time on the page in the tutorial, but did we really need that metric? Will knowing how much time a user spent on the page help to deliver a better product? It might, it might not. Knowing what to measure is the absolute key skill. Don't overcomplicate a report if you don't have to, keep it as simple as possible. Follow the quote 'Don’t include a single line in your code which you could not explain to your grandmother in a matter of two minutes' - one on [the favourite quotes list](/blog/programming-quotes-that-offer-wisdom-and-motivation/) and as applicable to analysis and reports as it is of code. 

If you just keep including measures blindly, it will crowd the report with noise, and soon you'll face the dreaded analysis paralysis - you're tracking so much stuff but it doesn't offer any insight or call to action. I wanted a way to present some simple stats on how the site is being received, which pages are popular and which need improving. When presenting data, keep the audience and purpose in mind. Although that sounds simplistic, it can be easy to forget those things.
 
I think we’ve explored some key topics, but there is a lot more to learn for those who wish to. One thing we didn't cover is [relationships](https://docs.microsoft.com/en-us/power-bi/transform-model/desktop-create-and-manage-relationships), which are important for more complex multi-source data models. Here are my top recommendations for where to go next if you want to learn more about Power BI:

* [Power BI Docs](https://docs.microsoft.com/en-us/power-bi/) - Offical Power BI docs from Microsoft
* [Analysing and Visualising Data with Power BI Course](https://www.youtube.com/watch?v=1c01r_pAZdk&list=PL1N57mwBHtN0JFoKSR0n-tBkUJHeMP2cP) - full course from Microsoft 
* [Guy in a cube](https://www.youtube.com/channel/UCFp1vaKzpfvoGai0vE5VJ0w) - great YouTube channel for Power BI tutorials
* [SQLBI](https://www.sqlbi.com/) - articles on business intelligence, Power BI, DAX and more
* [DAX reference](https://dax.guide/) - Browse DAX functions
* [DAX reference](https://docs.microsoft.com/en-us/dax/dax-function-reference) - Browse DAX functions]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Automated deployment of a Vue Flask app using Azure Pipelines]]></title>
            <link>https://shedloadofcode.com/blog/automated-deployment-of-a-vue-flask-app-using-azure-pipelines/</link>
            <guid>https://shedloadofcode.com/blog/automated-deployment-of-a-vue-flask-app-using-azure-pipelines/</guid>
            <pubDate>Tue, 15 Jun 2021 12:15:00 GMT</pubDate>
            <description><![CDATA[How to automate the deployment of a Vue Flask app to Azure App Service with Azure Pipelines.]]></description>
            <content:encoded><![CDATA[
In this article we will look at how to automate the deployment of a Vue Flask app to Azure App Service with Azure Pipelines. In a previous article I covered [building a Vue Flask app](/blog/query-sql-and-download-csv-and-xlsx-in-flask/) to query a SQL database and return data to the browser to view or download. We will start with the same template, prepare it for deployment and configure the app service and pipeline in Azure. The result will be a deployed Flask app which serves a static Vue.js frontend. If you have your own application, you can adapt these steps. Before starting, you will need an [Azure subscription](https://azure.microsoft.com/en-gb/free/) alongside Python, Node.js and Yarn installed.

## Download the template

First go to this [public repository](https://github.com/gtalarico/flask-vuejs-template) and download the project template as a zip file. Extract the folder contents and open the folder in a code editor like Visual Studio Code.

## Configure the template for deployment

One thing needs adding before we deploy. Create a new file `startup.py` at the top of the folder - same directory as `run.py`. This will be the file Azure App Service uses to start the application.

```python [/startup.py]
""" 
The startup file for Azure App Service that just imports the app object.
"""

from app import app
```

## Setting up the automated pipeline

Here are the step by step actions the video below will go through to create the automated pipeline: 

* Create a new Azure App Service in the [Azure portal](https://portal.azure.com/#home)
* Set the environment variable `SCM_DO_BUILD_DURING_DEPLOYMENT` to true in the App Service
* Create a new project in [Azure DevOps](https://dev.azure.com/)
* Create an Azure Repo in the project
* Push the application code to the Azure Repo
* Set up an Azure Pipeline in the project
* Build and deploy the app to Azure App Service 
* Check the site is deployed (had to hard refresh with Ctrl + F5) 😄

<article-video 
  id="1JxBkgqEsWY" 
  title="Automated deployment of a Vue Flask app using Azure Pipelines video">
</article-video>

Setting the `SCM_DO_BUILD_DURING_DEPLOYMENT` environment variable to true took me a while to figure out. It's in [this section of the docs](https://docs.microsoft.com/en-us/azure/devops/pipelines/ecosystems/python-webapp?view=azure-devops#run-the-pipeline) and states:

> If your app fails because of a missing dependency, then your requirements.txt file was not processed during deployment. This behavior happens if you created the web app directly on the portal rather than using the az webapp up command as shown in this article. The az webapp up command specifically sets the build action SCM_DO_BUILD_DURING_DEPLOYMENT to true. If you provisioned the app service through the portal, however, this action is not automatically set.

The YAML I used for the build and deploy steps looked like this:

```yaml [pipeline.yml]
# Python to Linux Web App on Azure
# Build your Python project and deploy it to Azure as a Linux Web App.
# Change python version to one thats appropriate for your application.
# https://docs.microsoft.com/azure/devops/pipelines/languages/python

trigger:
- master

variables:
  # Azure Resource Manager connection created during pipeline creation
  azureServiceConnectionId: 'f59ed866-b638-412b-bdce-02504965ee64'

  # Web app name
  webAppName: 'vue-flask-app'

  # Agent VM image name
  vmImageName: 'ubuntu-latest'

  # Environment name
  environmentName: 'vue-flask-app'

  # Project root folder. Point to the folder containing manage.py file.
  projectRoot: $(System.DefaultWorkingDirectory)

  # Python version: 3.6
  pythonVersion: '3.6'

stages:
- stage: Build
  displayName: Build stage
  jobs:
  - job: BuildJob
    pool:
      vmImage: $(vmImageName)
    steps:
    - task: UsePythonVersion@0
      inputs:
        versionSpec: '$(pythonVersion)'
      displayName: 'Use Python $(pythonVersion)'
    
    - task: NodeTool@0
      inputs:
        versionSpec: '10.x'
      displayName: 'Install Node.js'
      
    - script: pip install --upgrade pip
      displayName: 'Upgrade pip'
      workingDirectory: $(projectRoot)

    - script: pip install pipenv
      displayName: 'Install pipenv'

    - script: python -m pipenv install --dev
      displayName: 'Install Python dependencies'

    - script: python -m pipenv run pip freeze > requirements.txt
      displayName: 'Generate requirements.txt'

    - script: |
        curl -o- -L https://yarnpkg.com/install.sh | bash -s -- --version 1.9.4
        export PATH="$HOME/.yarn/bin:$PATH"
        yarn install
        yarn upgrade
      displayName: 'Install Node dependencies'

    - script: yarn build
      displayName: 'Build Vue app'

    - script: |
        pip install codecov
        pip install pytest
        pip install pytest-sugar
        pip install pytest-cov
        pip install pytest-azurepipelines
        python -m pipenv run pytest --junitxml=$(System.DefaultWorkingDirectory)/testResults.xml  --cov=app --cov-report=xml --cov-report=html
      displayName: 'Run tests with pytest'

    - task: PublishTestResults@2
      displayName: "Publish test results"
      inputs:
        testResultsFiles: '$(System.DefaultWorkingDirectory)/testResults.xml'
        testRunTitle: '$(Agent.OS) - $(Build.BuildNumber)[$(Agent.JobName)] - Python $(python.version)'
        failTaskOnFailedTests: true
      condition: succeededOrFailed()
      
    - task: PublishCodeCoverageResults@1
      displayName: "Publish code coverage"
      inputs:
        codeCoverageTool: Cobertura
        summaryFileLocation: '$(System.DefaultWorkingDirectory)/**/coverage.xml'
        reportDirectory: '$(System.DefaultWorkingDirectory)/**/htmlcov'

    - task: ArchiveFiles@2
      displayName: 'Archive files'
      inputs:
        rootFolderOrFile: '$(projectRoot)'
        includeRootFolder: false
        archiveType: zip
        archiveFile: $(Build.ArtifactStagingDirectory)/$(Build.BuildId).zip
        replaceExistingArchive: true

    - upload: $(Build.ArtifactStagingDirectory)/$(Build.BuildId).zip
      displayName: 'Upload package'
      artifact: drop

- stage: Deploy
  displayName: 'Deploy Web App'
  dependsOn: Build
  condition: succeeded()
  jobs:
  - deployment: DeploymentJob
    pool:
      vmImage: $(vmImageName)
    environment: $(environmentName)
    strategy:
      runOnce:
        deploy:
          steps:
          
          - task: UsePythonVersion@0
            inputs:
              versionSpec: '$(pythonVersion)'
            displayName: 'Use Python version'

          - task: AzureWebApp@1
            displayName: 'Deploy Azure Web App :  vue-flask-app'
            inputs:
              azureSubscription: $(azureServiceConnectionId)
              appName: $(webAppName)
              package: $(Pipeline.Workspace)/drop/$(Build.BuildId).zip

              startUpCommand: 'gunicorn --bind=0.0.0.0 --workers=4 --timeout 600 startup:app'
```

Your `azureServiceConnectionId` will be different so be sure to change that.

## Deployment was successful!

You now have a deployed Vue Flask app with a continuous integration pipeline configured. You can deploy new features with a simple push to the master branch which will trigger the pipeline. You could completely change the application we have deployed and take it in your own direction. Not only that, you might have noticed that this setup also publishes pytest code test coverage to the pipeline! Let me know in the comments if this helped you and if you have any questions. I know this was quite Azure specific, I think you could set up a similar pipeline using AWS or Google Cloud Platform.

I really like the Vue Flask combination for the ease of creating an interactive experience with Vue, alongside the many packages for data science that Python offers. You could separate this setup and have Vue served from a CDN and Python running as the API layer, but for a quick starter single-deploy setup this is perfect. It might need a little tailoring to your own needs, the template we used in this tutorial used Python 3.6 and pipenv, your setup might not, so adjust the Pipeline and App Service accordingly.

If you enjoyed this article be sure to check out other articles on the site 👍 you may be interested in:

* [How to query a database with Python Flask and download data to CSV or XLSX in Vue](/blog/query-sql-and-download-csv-and-xlsx-in-flask/)
* [How to upload PDF files to Azure Blob Storage with Vue and Python Flask](/blog/how-to-upload-pdf-files-to-azure-blob-storage-with-vue-and-python-flask/)
* [How to import a CSV from Dropbox or GitHub into Google Sheets](/blog/how-to-import-a-csv-from-dropbox-or-github-into-google-sheets/)]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Creating your own website analytics solution with AWS Lambda and Google Sheets]]></title>
            <link>https://shedloadofcode.com/blog/creating-your-own-website-analytics-solution-with-aws-lambda-and-google-sheets/</link>
            <guid>https://shedloadofcode.com/blog/creating-your-own-website-analytics-solution-with-aws-lambda-and-google-sheets/</guid>
            <pubDate>Mon, 14 Jun 2021 11:12:00 GMT</pubDate>
            <description><![CDATA[How I built a free lightweight website analytics solution with AWS Lambda and Google Sheets.]]></description>
            <content:encoded><![CDATA[
After creating and launching this site, I needed a way to capture some simple usage statistics. Although I could have used Google Analytics, which is free, the mantra of “if the product is free, then you are the product” was at the back of my mind. After reading other articles like [roll your own analytics](https://www.pcmaffey.com/roll-your-own-analytics/ ) and [logging sensor data to Google Sheets via AWS Lambda](https://ncd.io/logging-data-to-google-sheets-through-aws-iotlambda/) I was inspired to give it a go. 

So for this setup I wanted:

* No third party tracking
* Free or very low cost 
* Serverless
* Low maintenance
* No bloat for fast page load
* Completely anonymous data
* Only useful data collected - visitors, page views etc
* No personal data collected 
* No cookies and no cookie banner
* A simple dashboard to present the analytics

I know what you’re thinking, why not use Google Analytics when you’re using Google Sheets anyway? Well, my opinion is that the Google Sheet is my own data, controlled by me. The alternative is capturing lots of information I don’t need - bloating the page load time alongside placing tracking and ad cookies on users devices. I’m not against Google Analytics but because many sites use it, and Google runs on advertising, it gives it a powerful position - and let’s face it most users (including myself sometimes) are quick to click that ‘Accept cookies’ button without realising just how much tracking they are subjected to. However, I am impressed by the [opt-out browser add on](https://tools.google.com/dlpage/gaoptout) offered by Google which prevents any data being sent to Google Analytics.

The plan for how it would work looked like this:

* Collect events in state as the user browses the site
* The user ends their browsing session 
* Events data is sent to AWS Lambda function 
* AWS Lambda function writes the data to a Google Sheet
* The Google Sheet acts as the database 
* Consume the Google Sheet into a dashboard tool like Power BI Desktop
* Build the analytics dashboard

To determine when to send the analytics events to the AWS Lambda function, I will be adding event listeners to my Vue web app. They will listen for the [`pagehide`](https://developer.mozilla.org/en-US/docs/Web/API/Window/pagehide_event), [`beforeunload`](https://developer.mozilla.org/en-US/docs/Web/API/WindowEventHandlers/onbeforeunload),  and [`unload`](https://developer.mozilla.org/en-US/docs/Web/API/Window/unload_event) events, alongside [`visibilitychange`](https://developer.mozilla.org/en-US/docs/Web/API/Document/visibilitychange_event) and [`blur`](https://developer.mozilla.org/en-US/docs/Web/API/Element/blur_event) to handle mobile closing or switching tabs, particularly on iOS.

## Setting up the infrastructure

In the video below I replicate the setup to demo how the solution is put together. Creating a Lambda Layer is not covered in the video but I cover it in the section following the video. The step by step actions are:

* Create a Google Cloud project
* Enable the Google Sheets and Google Drive APIs for the project
* Create a service account
* Create credentials for the service account
* Create and share Google Sheet with service account email
* Create AWS Lambda function
* Add a Layer to AWS Lambda function for the [gspread](https://pypi.org/project/gspread/) package
* Create AWS API Gateway to call function
* Call the API endpoint with Postman to test it

<article-video 
  id="yg9NmP0RpCI" 
  title="Creating your own website analytics solution with AWS Lambda and Google Sheets">
</article-video>

When creating the AWS Lambda function, I added a file `google_service_account_credentials.json` and pasted in the json from the generated service account credentials. This allows the function to use the gspread Python package to read and write to the Google Sheet. I also shared the Google Sheet with the service account email to ensure it had permission to access it. 

**AWS Lambda function**

```python
import json
import gspread

def lambda_handler(event, context):
    request_body = json.loads(event["body"]) if type(event["body"]) is str else event["body"]
    write_events_to_google_sheet(request_body["events"])

    return { "statusCode": 200 }
    

def write_events_to_google_sheet(events):
    gc = gspread.service_account(filename='google_service_account_credentials.json')
    gsheet = gc.open("Website Analytics")
    
    for event in events:
        row = [
            event["sessionId"], 
            event["eventType"], 
            event["createdAt"], 
            event["device"], 
            event["userAgent"], 
            event["browser"],
            event["os"],
            event["language"],
            event["timezone"],
            event["path"]
        ]
        
        gsheet.sheet1.insert_row(row, index=2)
        
    print(
        str(len(events)) + " events logged to the Google Sheet."
    )
```

**Data used to test function**

```json
{
  "body": {
    "events": [
      {
        "sessionId": "2d885afe-dece-4d1e-829f-e08c305ab32d",
        "eventType": "visit-site",
        "createdAt": "01-01-2021 09:21:11",
        "device": "Desktop",
        "userAgent": "Chrome",
        "browser": "Safari",
        "os": "MacOS",
        "language": "en-GB",
        "timezone": "London-GMT",
        "path": "/"
      },
      {
        "sessionId": "2d885afe-dece-4d1e-829f-e08c305ab32d",
        "eventType": "visit-page",
        "createdAt": "01-01-2021 09:41:11",
        "device": "Desktop",
        "userAgent": "Chrome",
        "browser": "Safari",
        "os": "MacOS",
        "language": "en-GB",
        "timezone": "London-GMT",
        "path": "/blog/article-1"
      },
      {
        "sessionId": "2d885afe-dece-4d1e-829f-e08c305ab32d",
        "eventType": "visit-page",
        "createdAt": "01-01-2021 09:31:11",
        "device": "Desktop",
        "userAgent": "Chrome",
        "browser": "Safari",
        "os": "MacOS",
        "language": "en-GB",
        "timezone": "London-GMT",
        "path": "/blog/article-2"
      },
      {
        "sessionId": "2d885afe-dece-4d1e-829f-e08c305ab32d",
        "eventType": "visit-page",
        "createdAt": "01-01-2021 09:51:11",
        "device": "Desktop",
        "userAgent": "Chrome",
        "browser": "Safari",
        "os": "MacOS",
        "language": "en-GB",
        "timezone": "London-GMT",
        "path": "/about"
      }
    ]
  }
}
```

**Adding a Layer**

You may have seen I added a Layer to the function so it had access to the gspread package (and it’s dependencies) for interacting with the Google Sheet. This [video](https://youtu.be/3BH79Uciw5w) covers adding a Layer nicely but my steps were:

* Open command prompt
* Create a folder using `mkdir python`
* Install package and dependencies to the folder using `pip install gspread -t .`
* Zip the python folder in file explorer
* Go to AWS Lambda Layers (Image A below)
* Create a new Layer and upload your zip file (Image B below)
* You can now use that Layer with any function

<article-image 
  src="https://res.cloudinary.com/dayqxxsip/image/upload/v1623491618/App%20Images/Blog%20Images/Article%20Images/Analytics/go-to-layers_ymm6tz.png" 
  alt="Go to Lambda Layers" 
  loading="lazy" 
  styling=""
  caption="Image A: Go to Layers" 
  captionsrc="https://res.cloudinary.com/dayqxxsip/image/upload/v1623491618/App%20Images/Blog%20Images/Article%20Images/Analytics/go-to-layers_ymm6tz.png" 
  :showsource="false">
</article-image>

<article-image 
  src="https://res.cloudinary.com/dayqxxsip/image/upload/v1623491618/App%20Images/Blog%20Images/Article%20Images/Analytics/create-a-layer_z5mhuv.png" 
  alt="Create a Lambda Layer" 
  loading="lazy" 
  styling=""
  caption="Image B: Create a new Layer" 
  captionsrc="https://res.cloudinary.com/dayqxxsip/image/upload/v1623491618/App%20Images/Blog%20Images/Article%20Images/Analytics/create-a-layer_z5mhuv.png" 
  :showsource="false">
</article-image>

Now that the function is receiving data and writing it to the Google Sheet, the main thing to focus on now, is actually capturing events data in the Vue app to send to it. We’ll explore how I did that in the following section.

## Capturing and logging events in Vue

Since I was using Nuxt with Vue, I stored the events in the top level `default.vue` component state. I did consider using Vuex but this approach worked well. As the user browses the site and the page changes, the `logVisitPageOnRouteChange` method saves the events in the `analyticsEvents` array. Within the mounted hook, I listen for a number of exit events such as `beforeunload` and `pagehide`. This means whenever a user switches tabs, closes the tab, closes the browser, switches to another app on mobile or just visits another site, the `sendAnalyticsData` method is fired. This logs all the events currently stored in state to the AWS Lambda function, then clears the state to ensure it never logs duplicate records.

```html [layouts/default.vue]
<template>
  <div>
    <Navbar />
    <div class="container main-content mt-6 pt-4">
      <nuxt />
    </div>
    <Footer />
  </div>
</template>

<script>
import Navbar from "~/components/Navbar";
import Sidebar from "~/components/Sidebar";
import Footer from "~/components/Footer";
import identifyBrowser from "~/utils/identifyBrowser";
import { v4 as uuidv4 } from "uuid";

export default {
  components: {
    Navbar,
    Sidebar,
    Footer,
  },
  data() {
    return {
      uuid: null,
      sendingAnalyticsData: false,
      analyticsEvents: [],
    };
  },
  mounted() {
    this.uuid = uuidv4();
    this.listenForAllExitEvents();
    this.logVisitSiteEvent();
    this.logVisitPageOnRouteChange();
  },
  methods: {
    listenForAllExitEvents() {
      window.addEventListener("pagehide", this.sendAnalyticsData);
      window.addEventListener("beforeunload", this.sendAnalyticsData);
      window.addEventListener("unload", this.sendAnalyticsData);
      document.addEventListener("visibilitychange", this.sendAnalyticsData);
      if (this.iOS()) window.addEventListener("blur", this.sendAnalyticsData);
    },
    sendAnalyticsData() {
      let url =
        "https://f2hrck8yp5.execute-api.eu-west-1.amazonaws.com/website-analytics-logger-demo";
      let data = JSON.stringify({ events: [...this.analyticsEvents] });

      if (!this.sendingAnalyticsData) {
        this.sendBeacon(url, data);
      }
    },
    sendBeacon(url, data) {
      if (this.analyticsEvents.length > 0) {
        console.log("Sending analytics data");
        this.sendingAnalyticsData = true;

        if (
          window.navigator.sendBeacon ||
          (window.navigator.sendBeacon && document.visibilityState == "hidden")
        ) {
          const beacon = window.navigator.sendBeacon(url, data);
          this.analyticsEvents = [];
          this.sendingAnalyticsData = false;
          console.log("Analytics data sent and cleared from state");
          if (beacon) return;
        }

        const { vendor } = window.navigator;

        const async = !this.iOS();
        const request = new XMLHttpRequest();
        request.open("POST", url, async); // 'false' makes the request synchronous
        request.setRequestHeader("Content-Type", "application/json");
        request.send(data);

        if (!async || ~vendor.indexOf("Google")) return;

        const t = Date.now() + Math.max(300, latency + 200);
        while (Date.now() < t) {
          // postpone the JS loop for 300ms so that the request can complete
          // a hack necessary for Firefox and Safari refresh / back button
        }

        this.analyticsEvents = [];
        this.sendingAnalyticsData = false;
        console.log("Analytics data sent and cleared from state");
      }
    },
    iOS() {
      return (
        [
          "iPad Simulator",
          "iPhone Simulator",
          "iPod Simulator",
          "iPad",
          "iPhone",
          "iPod",
        ].includes(navigator.platform) ||
        (navigator.userAgent.includes("Mac") && "ontouchend" in document)
      );
    },
    logVisitPageOnRouteChange() {
      this.$router.beforeEach((to, from, next) => {
        let visitPageEvent = {
          sessionId: this.uuid,
          eventType: "Visit Page",
          createdAt: this.getCurrentDateTime(),
          device: this.isMobileDevice() ? "Mobile" : "Desktop",
          userAgent: navigator.userAgent,
          browser: identifyBrowser(),
          os: this.getOSName(),
          language: navigator.language,
          timezone: Intl.DateTimeFormat().resolvedOptions().timeZone,
          path: to.path === "/" ? to.path : to.path.replace(/\/$/, ""),
        };

        this.analyticsEvents.push(visitPageEvent);
        next();
      });
    },
    logVisitSiteEvent() {
      let visitSiteEvent = {
        sessionId: this.uuid,
        eventType: "Visit Site",
        createdAt: this.getCurrentDateTime(),
        device: this.isMobileDevice() ? "Mobile" : "Desktop",
        userAgent: navigator.userAgent,
        browser: identifyBrowser(),
        os: this.getOSName(),
        language: navigator.language,
        timezone: Intl.DateTimeFormat().resolvedOptions().timeZone,
        path:
          window.location.pathname === "/"
            ? window.location.pathname
            : window.location.pathname.replace(/\/$/, ""),
      };

      this.analyticsEvents.push(visitSiteEvent);
    },
    getCurrentDateTime() {
      let currentDate = new Date();
      return (
        currentDate.getDate() +
        "/" +
        (currentDate.getMonth() + 1) +
        "/" +
        currentDate.getFullYear() +
        " " +
        currentDate.getHours() +
        ":" +
        currentDate.getMinutes() +
        ":" +
        currentDate.getSeconds()
      );
    },
    isMobileDevice() {
      return (
        typeof window.orientation !== "undefined" ||
        navigator.userAgent.indexOf("IEMobile") !== -1
      );
    },
    getOSName() {
      var OSName = "Unknown OS";
      if (navigator.appVersion.indexOf("Win") != -1) OSName = "Windows";
      if (navigator.appVersion.indexOf("Mac") != -1) OSName = "MacOS";
      if (navigator.appVersion.indexOf("X11") != -1) OSName = "UNIX";
      if (navigator.appVersion.indexOf("Linux") != -1) OSName = "Linux";

      return OSName;
    },
  },
};
</script>
```

There are many helper methods in this component, mostly for identifying things like the browser, timezone, language and OS. To handle switching tabs or apps on mobile I added the `visibilitychange` exit listener. This was very effective in capturing events from iOS devices, which proved tricky at first until I [read more on the topic](https://stackoverflow.com/questions/6162188/javascript-browsers-window-close-send-an-ajax-request-or-run-a-script-on-win). I took inspiration from the article [roll your own analytics](https://www.pcmaffey.com/roll-your-own-analytics/) for the `sendBeacon` implementation. I used the [uuid](https://www.npmjs.com/package/uuid) package to generate a random identifier so it persists over tab switching, but not refreshing the page or closing the browser - no cookies, privacy first approach.

Here is my Google Sheet after sending through quite a bit of test data by interacting with the site.

<article-image 
  src="https://res.cloudinary.com/dayqxxsip/image/upload/v1623498957/App%20Images/Blog%20Images/Article%20Images/Analytics/google-spreadsheet-data_hjry5i.png" 
  alt="Google Sheet test data" 
  loading="lazy" 
  styling=""
  caption="Google Sheet test data" 
  captionsrc="https://res.cloudinary.com/dayqxxsip/image/upload/v1623498957/App%20Images/Blog%20Images/Article%20Images/Analytics/google-spreadsheet-data_hjry5i.png" 
  :showsource="false">
</article-image>

## Presenting the data in a dashboard

Now data is coming in from the Vue app, I needed a way to make sense of it in some form of dashboard. I chose to build an analytics dashboard using Power BI Desktop. It is free to download, fairly quick to create a dashboard and lots of support online to get started. 

You can get data from your Google Sheet by following these steps:

* Go to the Google Sheet
* Click Share
* Get a link as share with anyone 
* Change URL ending `/edit?usp=sharing` to `/export?format=xlsx`
* Open Power BI and select get data from Web
* Paste in the share link
* Select the name of your sheet and Power BI will load it as a table

There are many other ways to present the data held in Google Sheets, use whichever tool you like the most. This is what my dashboard looks like with test data:

<article-image 
  src="https://res.cloudinary.com/dayqxxsip/image/upload/v1623768802/App%20Images/Blog%20Images/Article%20Images/Analytics/power-bi-analytics-dashboard-1_gsijoy.png" 
  alt="Power BI analytics dashboard" 
  loading="lazy" 
  styling=""
  caption="Power BI analytics dashboard" 
  captionsrc="https://res.cloudinary.com/dayqxxsip/image/upload/v1623768802/App%20Images/Blog%20Images/Article%20Images/Analytics/power-bi-analytics-dashboard-1_gsijoy.png" 
  :showsource="false">
</article-image>

It’s simple, straightforward and easy to read. It provides all the high level and detailed information I need to see how well the site is being received, which pages are popular, and which need more work. As the data is automatically logged to the Google Sheet, all I need to do to receive the most up to date data in the Power BI dashboard is hit the Refresh button. The styling might not be amazing, but it’s for my eyes only, I’m not out to win any style awards 😆. The most important part for me, is that I'm only capturing the data I need, without any third party tracking or cookies. It's a privacy first approach. There is no personal data collected, it's all anonymous aggregated data.

There was some DAX involved to create calculated columns and measures for the average time on page calculation. I have covered the entire process for building this dashboard from scratch in the article [building a website analytics dashboard with Power BI and Google Sheets](/blog/building-a-website-analytics-dashboard-with-power-bi-and-google-sheets/).

## Bonus: Avoid tracking your own activity

As I tested and interacted with the site myself, I didn't want to track my own activity. During the site launch, I didn't want any logs of testing activity either. This would skew the usage statistics and create an inaccurate picture. I addressed this by adding a private route for internal users that saved a value in local storage then redirected back to the home page. So for any internal testing, we can use the private route URL to deactivate analytics logging.  

```html [deactivateanalytics.vue]
<template></template>

<script>
export default {
  mounted() {
    localStorage.setItem("analyticsDeactivated", true);
    window.location.href = "/";
  }
};
</script>
```

Once this value is set, I added a guard just before the `sendBeacon` method is called. So if analytics are set to deactivated, the events data won't be sent.

```javascript [layouts/default.vue]
sendAnalyticsData() {
  let url =
    "https://f2hrck8yp5.execute-api.eu-west-1.amazonaws.com/website-analytics-logger-demo";
  let data = JSON.stringify({ events: [...this.analyticsEvents] });
  let analyticsDeactivated = localStorage.getItem("analyticsDeactivated") || false;
  
  if (!this.sendingAnalyticsData && !analyticsDeactivated) {
    this.sendBeacon(url, data);
  }
}  
```

## Lessons learnt

This has been a fun project and overall I’m pleased with the outcome. Does it give me an insight into visitors and page views? Absolutely. It’s not perfect, there are some negatives but it meets most of my initial goals. I did find it difficult to handle mobile use cases such as switching tabs, closing tabs, leaving the browser and switching to another app. This was overcome with the `visibilitychange` and `blur` events - effectively creating a ‘log when you can’ approach. Whenever the `sendBeacon` method is successfully called I clear the `analyticsEvents` array held in state, so if it happens to try and send again when a user comes back, it won’t send if there are no new events to log 😄

Although I acknowledge I will be missing some sessions, I am happy with that. I only set out to get a simple overview of how the site is being received so I can improve it. This satisfies that purpose nicely. If capturing every single event was the number one priority, I would switch this setup to log each event as it happens - using the `beforeEach()` hook to call the AWS function on each page change rather than all in one call at the end of the session. This would lead to increased AWS function calls which would increase the costs at scale. The AWS Lambda [free usage tier](https://aws.amazon.com/lambda/pricing/) includes 1M free requests per month and 400,000 GB-seconds of compute time per month at the time of writing.

I can see uses for this setup beyond website analytics logging. I think it could be handy in a variety of situations when it comes to logging information. If you have adapted this setup to your own needs, I'd love to hear about it in the comments below.

## How it’s performed

I will update this section when more data on performance is available.
]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Programming quotes that offer wisdom and motivation]]></title>
            <link>https://shedloadofcode.com/blog/programming-quotes-that-offer-wisdom-and-motivation/</link>
            <guid>https://shedloadofcode.com/blog/programming-quotes-that-offer-wisdom-and-motivation/</guid>
            <pubDate>Sat, 12 Jun 2021 11:12:00 GMT</pubDate>
            <description><![CDATA[A collection of my favourite programming quotes. Some didn’t come from programmers, but they are very applicable nonetheless.]]></description>
            <content:encoded><![CDATA[
This article is a place I keep all of my favourite programming quotes. Some didn’t come from programmers, but they are very applicable nonetheless. A few wise words can go a long way in furthering understanding, I have grouped them by topic so they’re a little easier to find. Enjoy!

## Being effective

> Give me six hours to chop down a tree and I will spend the first four sharpening the axe — Abraham Lincoln

> Slow is smooth and smooth is fast - US Navy SEALs

> An investment in knowledge pays the best interest — Benjamin Franklin

> The best work happens when it doesn’t feel like you’re working at all — Shedload Of Code

> Most good programmers do programming not because they expect to get paid or get adulation by the public, but because it is fun to program — Linus Torvalds

> Simplicity is the soul of efficiency — Austin Freeman

> One of my most productive days was throwing away 1000 lines of code — Ken Thompson

> Every great developer you know got there by solving problems they were unqualified to solve until they actually did it — Patrick McKenzie

> Prolific developers don’t always write a lot of code, instead they solve a lot of problems. The two things are not the same — J. Chambers

> Measuring programming progress by the lines of code is like measuring aircraft building progress by weight - Bill Gates

> Good software, like wine, takes time — Joel Spolsky

> To be effective engineers, we need to be able to identify which activities produce more impact with smaller time investments. Not all work is created equal. Not all efforts, however well-intentioned, translate into impact ― Edmond Lau, The Effective Engineer

> Choose a job you love, and you will never have to work a day in your life — Confucius

> Delegate - work smarter not harder; do what you do best and drop the rest; get control of your calendar; do what you love because it will give you energy; work with people you like so your energy isn't depleted — John C. Maxwell

> A hacker on a roll may be able to produce-in a period of a few months - something that a small development group (say, 7-8 people) would have a hard time getting together over a year. IBM used to report that certain programmers might be as much as 100 times as productive as other workers, or more — Peter Seebach

> Better than a thousand days of diligent study is one day with a great teacher — Japanese Proverb

## Writing code

> First, solve the problem. Then, write the code — John Johnson

> Programming is a blend of gardening and surgery — Shedload Of Code

> Computer science education cannot make anybody an expert programmer any more than studying brushes and pigment can make somebody an expert painter – Eric S. Raymond 

> Programming is a skill best acquired by practice and example rather than from books — Alan Turing

> All problems in computer science can be solved by another level of indirection — David Wheeler

> Any fool can write code that a computer can understand. Good programmers write code that humans can understand — Martin Fowler

> Don’t include a single line in your code which you could not explain to your grandmother in a matter of two minutes — Unknown

> Always code as if the guy who ends up maintaining your code will be a violent psychopath who knows where you live — Martin Golding 

> Programming isn't about what you know; it's about what you can figure out - Chris Pine

> In some ways, programming is like painting. You start with a blank canvas and certain basic raw materials. You use a combination of science, art, and craft to determine what to do with them. You sketch out an overall shape, paint the underlying environment, then fill in the details. You constantly step back with a critical eye to view what you've done. Every now and then you'll throw a canvas away and start again. But artists will tell you that all the hard work is ruined if you don't know when to stop. If you add layer upon layer, detail over detail, the painting becomes lost in the paint ― Andrew Hunt, The Pragmatic Programmer: From Journeyman to Master

> Premature optimization is the root of all evil - Donald Knuth

## Building systems

> Simplicity is prerequisite for reliability — Edsger Dijkstra

> A complex system that works is invariably found to have evolved from a simple system that worked. A complex system designed from scratch never works and cannot be patched up to make it work. You have to start over, beginning with a working simple system — John Gall

> Complexity kills. It sucks the life out of developers, it makes products difficult to plan, build and test, it introduces security challenges, and it causes end-user and administrator frustration — Ray Ozzie

> Software being 'Done' is like lawn being 'Mowed' — Jim Benson

> If you cannot grok the overall structure of a program while taking a shower, you are not ready to code it — Richard Pattis

> No one in the brief history of computing has ever written a piece of perfect software. It's unlikely that you'll be the first — Andy Hunt

> It is not the strongest of the species that survive, nor the most intelligent, but the one most responsive to change — Charles Darwin

> Perfection is achieved, not when there is nothing more to add, but when there is nothing left to take away — Antoine de Saint-Exupery 

> As a programmer, it is your job to put yourself out of business. What you do today can be automated tomorrow - Doug McIlroy

> The purpose of software engineering is to control complexity, not to create it — Pamela Zave

> It’s easier to ask for forgiveness, than it is to get permission — Admiral Grace Hopper

> Computers make it easier to do a lot of things, but most of the things they make it easier to do don't need to be done — Andy Rooney

## Using statistics

> All models are wrong, but some are useful — George Box

> Not everything that counts can be counted, and not everything that can be counted counts — William Bruce Cameron

> The greatest value of a picture is when it forces us to notice what we never expected to see — John Tukey

> He uses statistics as a drunken man uses lamp posts - for support rather than for illumination — Andrew Lang

> Statistics are no substitute for judgment — Henry Clay]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[How to scrape and analyse your Amazon spending data]]></title>
            <link>https://shedloadofcode.com/blog/how-to-scrape-and-analyse-your-amazon-spending-data/</link>
            <guid>https://shedloadofcode.com/blog/how-to-scrape-and-analyse-your-amazon-spending-data/</guid>
            <pubDate>Thu, 03 Jun 2021 09:19:00 GMT</pubDate>
            <description><![CDATA[Ever wondered just how much you've spent on Amazon since signing up? This article will use web scraping and data analysis with Python on the Amazon UK site to answer that question and a few more.]]></description>
            <content:encoded><![CDATA[
Ever wondered just how much you've spent on Amazon since signing up? Well I read an article recently from Dataquest which outlined how to find out [how much you've spent on Amazon](https://www.dataquest.io/blog/how-much-spent-amazon-data-analysis/?utm_content=buffer06d87&utm_medium=social&utm_source=twitter.com&utm_campaign=dataquest_buffer). However, I quickly found out that this feature of downloading your spending in a report, is not available on the UK version of this site! I really wanted to gather this data, and started a small project to do just that. So, if you're interested in gathering and analysing your Amazon spending data with Python, while learning some web scraping, you're in the right place.

## Before starting

Before starting you will need a few things. These things will set you up to carry out other Data Science projects in the future too.

* Anaconda
* Jupyter Notebooks (installed with Anaconda)
* Selenium
* Google Chrome (latest version)
* Chrome Driver (latest version)

This article will not cover installing programs in detail, but here is a starting point. Install [Anaconda](https://www.anaconda.com/distribution/) first. Anaconda is a distribution of the Python and R programming languages for scientific computing (data science, machine learning applications, large-scale data processing, predictive analytics, etc.), that aims to simplify package management and deployment. Once installed, open Anaconda Prompt and install Selenium using `pip install selenium`. Selenium is a web driver built for automated actions in the browser and testing. Finally, ensure you have the latest version of [Google Chrome](http://google.co.uk/chrome/?brand=CHBD&gclid=EAIaIQobChMI0LPsqNXl5QIVCLTtCh3pJwybEAAYASAAEgJxkvD_BwE&gclsrc=aw.ds) installed and [ChromeDriver](https://chromedriver.chromium.org/downloads) for the version number of Chrome you're running. On Windows, ensure `chromedriver.exe` is in a [suitable location](https://chromedriver.chromium.org/getting-started) such as `C:\Windows`.

There is a link to download the Jupyter Notebook at the end of this article so you can try out the code on your own. Alternatively, just use the code you find in this page if you don't want to use Anaconda and Jupyter Notebooks, and install the required Python packages in a virtual environment.

## What will the web scraper do?

Here are the step by step actions the web scraper will perform to scrape Amazon spending data: 

* Launches a Chrome browser controlled by Selenium 
* Navigates to the Amazon login page 
* Waits 30 seconds for you to manually log in 
* After login, navigates to the Orders page 
* Scrapes Item Costs, Order IDs, and Order Dates
* Repeats for each year in the year filter and each page in the pagination filter until finished
* Outputs the data model to a CSV file

<article-image 
  src="https://res.cloudinary.com/dayqxxsip/image/upload/v1573916200/Analysis/year_filter_uvrycw.png" 
  alt="Amazon orders year filter" 
  loading="lazy" 
  styling=""
  caption="First the scraper loops through year filter" 
  captionsrc="https://res.cloudinary.com/dayqxxsip/image/upload/v1573916200/Analysis/year_filter_uvrycw.png" 
  :showsource="false">
</article-image>

<article-image 
  src="https://res.cloudinary.com/dayqxxsip/image/upload/v1622711113/Analysis/orders-pagination_ztlsnd.png" 
  alt="Amazon orders pagination filter" 
  loading="lazy" 
  styling=""
  caption="Then loops through the each page in the pagination filter" 
  captionsrc="https://res.cloudinary.com/dayqxxsip/image/upload/v1622711113/Analysis/orders-pagination_ztlsnd.png" 
  :showsource="false">
</article-image>

The result will be enough to answer questions such as:

* How much have I spent in total?
* How much do I spend on average per order?
* What were the most expensive orders?
* What is my spending like per day of the week, month, year?

Before we step into the code, let's take a look at the automated scraper in action. Pay attention to the `&orderFilter=` and `&startIndex=` parameters in the URL bar. I've blurred out personal details of course, but you'll see how the scraper moves from year to year, and then page to page to scrape all of the order data.

<article-video 
  id="tgj15h93Nvo" 
  title="Web Scraping Amazon orders with Python and Selenium demo">
</article-video>

## Scraping the data

Let's look at the `AmazonOrderScraper` class which will be center stage. Bear in mind, this script was accurate at the time of writing, however if the Amazon website changes (id or class names, page structure or url paths) this script may no longer work and will require amending. Underneath this fairly long snippet you can simulate running the code to understand what it's doing, and what the final dataframe would look like.

```python [order-scraper.py]
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup
import time

from selenium import webdriver  
from selenium.webdriver.common.keys import Keys  
from selenium.webdriver.chrome.options import Options 
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

class AmazonOrderScraper:
    
    def __init__(self):
        self.date = np.array([])
        self.cost = np.array([])
        self.order_id = np.array([])
        
    
    def URL(self, year: int, start_index: int) -> str:
        return "https://www.amazon.co.uk/gp/your-account/order-history/" + \
                "ref=ppx_yo_dt_b_pagination_1_4?ie=UTF8&orderFilter=year-" + \
                str(year) + \
                "&search=&startIndex=" + \
                str(start_index)
    
    
    def scrape_order_data(self, start_year: int, end_year: int) -> pd.DataFrame:
        years = list(range(start_year, end_year + 1))
        driver = self.start_driver_and_manually_login_to_amazon()

        for year in years:
            print(f"Scraping order data for { year }")

            driver.get(
                self.URL(year, 0)
            )
            
            number_of_pages = self.find_max_number_of_pages(driver)
            
            self.scrape_first_page_before_progressing(driver)

            for i in range(number_of_pages):
                self.scrape_page(driver, year, i)

            print(f"Order data extracted for { year }") 
            
        driver.close()
        
        print("Scraping done :)")
            
        order_data = pd.DataFrame({
            "Date": self.date,
            "Cost £": self.cost,
            "Order ID": self.order_id
        })
        
        order_data = self.prepare_dataset(order_data)
        
        order_data.to_csv(r"amazon-orders.csv")
        print("Data saved to amazon-orders.csv")
            
        return order_data
    

    def start_driver_and_manually_login_to_amazon(self) -> webdriver:
        options = webdriver.ChromeOptions()
        options.add_argument("--start-maximized")
        service = Service(executable_path=ChromeDriverManager().install()) 

        driver = webdriver.Chrome(service=service, options=options) 
        # Alternatively, provide path to chromedriver using: 
        # webdriver.Chrome("chromedriver.exe", options=options)

        amazon_sign_in_url = "https://www.amazon.co.uk/ap/signin?" + \
            "_encoding=UTF8&accountStatusPolicy=P1&" + \
            "openid.assoc_handle=gbflex&openid.claimed_id" + \
            "=http%3A%2F%2Fspecs.openid.net%2Fauth%2F2.0%2Fidentifier_select&" + \
            "openid.identity=http%3A%2F%2Fspecs.openid.net%2Fauth%2F2.0%2Fidentifier" + \
            "_select&openid.mode=checkid_setup&openid.ns=http%3A%2F%2Fspecs.openid" + \
            ".net%2Fauth%2F2.0&openid.ns.pape=http%3A%2F%2Fspecs.openid.net" + \
            "%2Fextensions%2Fpape%2F1.0&openid.pape.max_auth_age=0&openid" + \
            ".return_to=https%3A%2F%2Fwww.amazon.co.uk%2Fgp%2Fcss%2Forder-history" + \
            "%3Fie%3DUTF8%26ref_%3Dnav_orders_first&" + \
            "pageId=webcs-yourorder&showRmrMe=1"

        seconds_to_login = 30 # Allows time for manual sign in. Increase if you need more time
        print(f"You have { seconds_to_login } seconds to sign in to Amazon.")

        driver.get(amazon_sign_in_url)
        time.sleep(seconds_to_login) 
        
        
        return driver
    
    
    def find_max_number_of_pages(self, driver: webdriver) -> int:
        time.sleep(2)
        page_source = driver.page_source
        page_content = BeautifulSoup(page_source, "html.parser")

        a_normal = page_content.findAll("li", {"class": "a-normal"})
        a_selected = page_content.findAll("li", {"class": "a-selected"})
        max_pages = len(a_normal + a_selected) - 1
       

        return max_pages
    
    
    def scrape_first_page_before_progressing(self, driver: webdriver) -> None:
        time.sleep(2)
        page_source = driver.page_source
        page_content = BeautifulSoup(page_source, "html.parser")
        order_info = page_content.findAll("span", {"class": "a-color-secondary value"})

        orders = []
        for i in order_info:
            orders.append(i.text.strip())

        index = 0
        for i in orders:
            if index == 0:
                self.date = np.append(self.date, i)
                index += 1
            elif index == 1:
                self.cost = np.append(self.cost, i)
                index += 1
            elif index == 2:
                self.order_id = np.append(self.order_id, i)
                index = 0
    
    
    def scrape_page(self, driver: webdriver, year: int, i: int) -> None:
        start_index = list(range(10, 110, 10))
        
        driver.get(
            self.URL(year, start_index[i])
        )
        time.sleep(2)

        data = driver.page_source
        page_content = BeautifulSoup(data, "html.parser")

        order_info = page_content.findAll("span", {"class": "a-color-secondary value"})

        orders = []
        for i in order_info:
            orders.append(i.text.strip())

        index = 0
        for i in orders:
            if index == 0:
                self.date = np.append(self.date, i)
                index += 1
            elif index == 1:
                self.cost = np.append(self.cost, i)
                index += 1
            elif index == 2:
                self.order_id = np.append(self.order_id, i)
                index = 0
                
    
    def prepare_dataset(self, order_data: pd.DataFrame) -> pd.DataFrame:
        order_data.set_index("Order ID", inplace=True)

        order_data["Cost £"] = order_data["Cost £"].str.replace("£", "").astype(float)
        order_data['Order Date'] = pd.to_datetime(order_data['Date'])
        order_data["Year"] = pd.DatetimeIndex(order_data['Order Date']).year
        order_data['Month Number'] = pd.DatetimeIndex(order_data['Order Date']).month
        order_data['Day'] = pd.DatetimeIndex(order_data['Order Date']).dayofweek
        
        day_of_week = { 
            0:'Monday', 
            1:'Tuesday', 
            2:'Wednesday', 
            3:'Thursday', 
            4:'Friday', 
            5:'Saturday', 
            6:'Sunday'
        }
        
        order_data["Day Of Week"] = order_data['Order Date'].dt.dayofweek.map(day_of_week)
        
        month = { 
            1:'January', 
            2:'February', 
            3:'March', 
            4:'April', 
            5:'May', 
            6:'June', 
            7:'July', 
            8:'August', 
            9:'September', 
            10:'October', 
            11:'November', 
            12:'December'
        }

        order_data["Month"] = order_data['Order Date'].dt.month.map(month)
        
        return order_data


if __name__ == "__main__":
    aos = AmazonOrderScraper()
    order_data = aos.scrape_order_data(start_year = 2010, end_year = 2024)
    print(order_data.head(3))
```

<code-runner :output="['Order data extracted for 2010',
  'Order data extracted for 2011',
  'Order data extracted for 2012',
  'Order data extracted for 2013',
  'Order data extracted for 2014',
  'Order data extracted for 2015',
  'Order data extracted for 2016',
  'Order data extracted for 2017',
  'Order data extracted for 2018',
  'Order data extracted for 2019',
  'Order data extracted for 2020',
  'Order data extracted for 2021',
  'Scraping done :)',
  'Order ID            Date              Cost £  Order Date  Year  Month Number Day Day Of Week Month',
  '202-8936883-1234567 27 December 2010  9.02    2010-12-27  2010  12           0   Monday      December',
  '202-8936883-1234567 27 December 2010  4.03    2010-12-27  2010  12           0   Monday      December',
  '202-8936883-1234567 12 December 2010  4.33    2010-12-12  2010  12           6   Sunday      December']" 
  filename="order-scraper.py" 
  language="Python">
</code-runner>

Once instantiated as `aos`, we call the `scrape_order_data` method and it handles everything else. You will need to pass `start_year` and `end_year` as parameters to it, this allows for scraping the full range of years applicable to you, or a selected range. 

I also recently used this script again in 2024, this time adding the [webdriver-manager](https://pypi.org/project/webdriver-manager/) package to auto-install Chrome, avoiding having to find the correct version and provide the path to chromedriver.exe

I used similar methods to these in [How to scrape AutoTrader with Python and Selenium to search for multiple makes and models](/blog/how-to-scrape-autotrader-with-python-and-selenium-to-search-for-multiple-makes-and-models/). 



## Analysing the data

The `prepare_dataset` method applied some feature engineering to enhance the dataset. This is simply to ensure that the data is able to be sliced by date, year, month and day of the week. It carried out a series of data manipulation steps, such as removing the pound sign from the cost column, ensuring data types were correct, and mapping day and month names to their integer representations ready to use with charts.

So now you have your data, you can apply any analysis you would like to it. I will give you some inspiration on the kinds of questions you might wish to ask. You might find (like I did) your spending is higher or lower than you expected, so brace yourself for unexpected surprises! 

## Import packages

```python 
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt 
import seaborn as sns
sns.set(rc={'figure.facecolor':'white'})
```

## Summary statistics

```python
order_data.describe()
```

<div style="overflow-x:auto;">

|        | Cost £       |  Year       | Month Number  | Day         |
|--------|-------------|--------------|---------------|-----------  |
| count  | 523.000000  |  523.000000  |  523.000000   | 523.000000  |
| mean   |  18.695985  | 2015.139579  |    6.699809   |   2.797323  |
| std    |  23.793675  |    3.276180  |    3.612417   |   2.164905  |
| min    |   0.000000  | 2010.000000  |    1.000000   |   0.000000  |
| 25%    |   5.330000  | 2012.000000  |    3.500000   |   1.000000  |
| 50%    |  12.750000  | 2015.000000  |    7.000000   |   3.000000  |
| 75%    |  23.015000  | 2018.000000  |   10.000000   |   5.000000  |
| max    |  299.990000 |  2021.000000 |    12.000000  |    6.000000 |

</div>

## Total spend

```python
total_amount_spent = order_data["Cost £"].sum()
print(f"Total amount spent: £{ total_amount_spent }")
```

<code-runner :output="['Total amount spent: £9778.0']" 
  filename="" 
  language="Python">
</code-runner>

## Average spend per order

```python
average_amount_spent_per_order = order_data["Cost £"].mean()
print(f"Average amount spent per order: £{ round(average_amount_spent_per_order, 2) }")
```

<code-runner :output="['Average amount spent per order: £18.7']" 
  filename="" 
  language="Python">
</code-runner>

## Most and least expensive orders

```python
order_data.loc[order_data["Cost £"] == order_data["Cost £"].max()]
```

<div style="overflow-x:auto;">

| Order ID            | Date          | Cost £ | Order Date | Year | Day Of Week | Month | 
|---------------------|---------------|--------|------------|------|-------------|-------|
| 205-1516165-1234567 | 31 March 2020 | 299.99 | 2020-03-31	| 2020 | Tuesday     |  March| 

</div>

```python
order_data.loc[order_data["Cost £"] == order_data["Cost £"].min()]
```

<div style="overflow-x:auto;">

| Order ID            | Date          | Cost £ | Order Date | Year | Day Of Week | Month | 
|---------------------|---------------|--------|------------|------|-------------|-------|
| 123-5616156-1234567 | 21 June 2011  | 0.0    | 2011-06-21 | 2011 | Tuesday     |  June | 

</div>

## Top five most expensive orders

```python
order_data.sort_values(ascending=False, by="Cost £").head(5)
```
<div style="overflow-x:auto;">

| Order ID            | Date            | Cost £ | Order Date | Year | Day Of Week | Month    | 
|---------------------|---------------  |--------|------------|------|-------------|-------   |
| 205-2452455-9123505	| 31 March 2020	  | 299.99 | 2020-03-31	| 2020 | Tuesday	   | March    |
| 204-4525421-7169117	| 15 November 2020| 239.00 | 2020-11-15	| 2020 | Sunday	     | November |
| 205-5245215-9426706	| 28 February 2020| 138.22 | 2020-02-28	| 2020 | Friday	     | February |
| 202-5278588-7857857	| 17 November 2018| 135.99 | 2018-11-17	| 2018 | Saturday	   | November |
| 204-2542525-5654645	| 5 December 2020	| 127.37 | 2020-12-05	| 2020 | Saturday	   | December |                             

</div>

## Total spend per year

```python
fig, ax = plt.subplots(figsize=(15,6))
yoy_cost = order_data.groupby(["Year"], as_index=False).sum()
sns.lineplot(x=yoy_cost["Year"], y=yoy_cost["Cost £"], color="grey")
plt.title("How much spending per year?")
plt.ylabel("Spending £")
```

<article-image 
  src="https://res.cloudinary.com/dayqxxsip/image/upload/v1622741449/App%20Images/Blog%20Images/Article%20Images/Amazon%20Spending/spending-per-year_azrkpy.png" 
  alt="Total spend per year graph" 
  loading="lazy" 
  styling=""
  caption="" 
  captionsrc="" 
  :showsource="false">
</article-image>

## Count of orders per year

```python
fig, ax = plt.subplots(figsize=(15,6))
yoy_order_count = order_data.groupby(["Year"], as_index=False).count()
sns.lineplot(x=yoy_order_count["Year"], y=yoy_order_count["Cost £"], color="Grey")
plt.title("How many orders per year?")
plt.ylabel("Count of Orders")
```

<article-image 
  src="https://res.cloudinary.com/dayqxxsip/image/upload/v1622741449/App%20Images/Blog%20Images/Article%20Images/Amazon%20Spending/orders-per-year_emwewi.png" 
  alt="Count of orders per year graph" 
  loading="lazy" 
  styling=""
  caption="" 
  captionsrc="" 
  :showsource="false">
</article-image>

## Total monthly spend

```python
months = ["January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December"]

fig, ax = plt.subplots(figsize=(15,6))
monthly_cost = order_data.groupby(["Month"], as_index=False).sum()
sns.barplot(x=monthly_cost["Month"], y=monthly_cost["Cost £"], order=months, color="Grey")
plt.ylabel("Spending £")
plt.title("How much overall spending per month?")
```

<article-image 
  src="https://res.cloudinary.com/dayqxxsip/image/upload/v1622741449/App%20Images/Blog%20Images/Article%20Images/Amazon%20Spending/total-monthly-spend_bngkem.png" 
  alt="Total monthly spend graph" 
  loading="lazy" 
  styling=""
  caption="" 
  captionsrc="" 
  :showsource="false">
</article-image>

## Average monthly spend

```python
months = ["January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December"]

fig, ax = plt.subplots(figsize=(15,6))
monthly_cost = order_data.groupby(["Month"], as_index=False).mean()
sns.barplot(x=monthly_cost["Month"], y=monthly_cost["Cost £"], order=months, color="Grey")
plt.ylabel("Spending £")
plt.title("Average spending per month?")
```

<article-image 
  src="https://res.cloudinary.com/dayqxxsip/image/upload/v1622741449/App%20Images/Blog%20Images/Article%20Images/Amazon%20Spending/average-monthly-spend_aszzpl.png" 
  alt="Average monthly spend graph" 
  loading="lazy" 
  styling=""
  caption="" 
  captionsrc="" 
  :showsource="false">
</article-image>

## Day of the week with highest spend

```python
days_of_week = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]

fig, ax = plt.subplots(figsize=(15,6))
day_of_week_cost = order_data.groupby(["Day Of Week"], as_index=False).sum()
sns.barplot(x=day_of_week_cost["Day Of Week"], y=day_of_week_cost["Cost £"], order=days_of_week, color="Grey")
plt.ylabel("Spending £")
plt.title("Which day of the week has the highest spend?")
```

<article-image 
  src="https://res.cloudinary.com/dayqxxsip/image/upload/v1622741449/App%20Images/Blog%20Images/Article%20Images/Amazon%20Spending/day-of-week-spending_vsdpfl.png" 
  alt="Day of week with highest spend graph" 
  loading="lazy" 
  styling=""
  caption="" 
  captionsrc="" 
  :showsource="false">
</article-image>

## Full time series

```python
fig, ax = plt.subplots(figsize=(15,6))
sns.lineplot(x=order_data['Order Date'], y=order_data["Cost £"], color="Grey")
plt.ylabel("Spending £")
plt.title("Spending Time Series")
```

<article-image 
  src="https://res.cloudinary.com/dayqxxsip/image/upload/v1622739406/App%20Images/Blog%20Images/Article%20Images/Amazon%20Spending/overall-spending_onxhce.png" 
  alt="Total spending graph" 
  loading="lazy" 
  styling=""
  caption="" 
  captionsrc="" 
  :showsource="false">
</article-image>

## Final words and next steps

So there it is, you can now scrape and analyse your Amazon spending data using Python. Hopefully, the answers to the questions we've asked in this article haven't caused too many surprises! Now you have a way to monitor, track and analyse spending to identify trends. If there are any other analytical questions you'd like to ask of this dataset, let me know in the comments below and I'll update the article. The full Jupyter notebook can be [downloaded for reference](https://github.com/shedloadofcode/notebooks/blob/main/Amazon%20Orders%20Web%20Scraping.ipynb).

Ideas for future development might include importing the CSV into Power BI or other analysis tools. This would allow interactive data exploration and would introduce cross-filtering functionality. You could then cross examine day of the week with year, or day of the month with month and all other combinations. This could unlock further insights.]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Maintaining a healthy positive mindset as a programmer]]></title>
            <link>https://shedloadofcode.com/blog/maintaining-a-healthy-positive-mindset-as-a-programmer/</link>
            <guid>https://shedloadofcode.com/blog/maintaining-a-healthy-positive-mindset-as-a-programmer/</guid>
            <pubDate>Wed, 02 Jun 2021 19:42:00 GMT</pubDate>
            <description><![CDATA[Being in a positive frame of mind is so important in any profession, even more so for programmers who undergo daily mental gymnastics! Read how to overcome the negatives and keep your mental and physical health in shape.]]></description>
            <content:encoded><![CDATA[
 

Being in a positive frame of mind is so important in any profession, even more so for programmers who undergo daily mental gymnastics! Easier said than done, I know. Not only is writing code hard (whether it be for software development or data science), dealing with people can be harder. Of course I can only speak from my own experience so far, but maybe some of it will relate. This article will go over the main things that cause my positive mindset to turn negative, and how I try to overcome them to get back on track.  

## You overwork yourself 

I wanted to enter the programming world because I found programming to be fun! It’s so good to be able to have an idea and then go ahead and build it. By doing so you can make other people’s lives easier and genuinely provide value. The downside of that, is you might end up working on things for too long. I can find myself finishing the working day, only to start working on my side projects or to study at night. I do it because it’s fun, it’s almost like being paid a good amount for a hobby, which is great. The only problem is it leaves no time to wind down and do other things. This can lead to burn out and stress. These are two things you need to avoid. They will really harm you in the long run and are unsustainable. 

**Suggested remedy:** Always have a start time and an end time for the working day. You value your ‘work’ time and should value your ‘free’ time. Your free time is sacred. I love building things and writing code but there are other things in the world too 😄 You don’t want life to pass you by while coding all hours, no matter how fun it is! On that note, try to go to bed at the same time and get at least eight hours sleep. Always take periodic breaks throughout the day - maybe Pomodoro, or five minute walks every so often (your eyes will thank you for time away from a screen). Sitting for long periods is very bad for you. Drink water throughout the day and use a smaller cup so you have to get up to refill it. Don't skip breakfast or lunch, and try to eat a balanced nutritional diet. I found trying a standing desk helped with taking breaks, easier to move around when you’re already standing right? Might be worth considering. Try to get some frequent exercise in your down time too, it boosts your mood and your health is the most important thing you have.  

## You must attend scrum rituals 

I really gave agile and scrum a chance when I first started in the field. It was new to me so I thought I’ll see what it’s all about. It didn’t really leave a good impression on me (maybe I’ve just been unfortunate). The daily stand ups were way too long and more like a status report to the project manager. Those meetings felt so unnatural, everyone seemed to be justifying their existence, sometimes with what felt like busywork. It started to look like these [kinds of things](https://www.aaron-gray.com/a-criticism-of-scrum/), rather than the [agile manifesto](https://agilemanifesto.org) I had read. I didn’t get into programming to justify myself daily that’s for sure. It all felt a little belittling and hostile. It doesn't surprise me many others have [similar thoughts](https://www.quora.com/In-a-nutshell-why-do-a-lot-of-developers-dislike-Agile-What-are-better-project-management-paradigm-alternatives) that agile is fundamentally a good thing, but it's become a hindrance rather than a help. I've sat in my fair share of meetings that looked a little too much [like this](https://www.youtube.com/watch?v=BKorP55Aqvg) - containing vague requests and haphazard, irrational plans. Despite always being the voice of reason, by the end of them I had no idea what just happened much like Anderson 😆 I still work in agile teams but I handle it differently now, I’ve come to terms with what agile is and what it’s not. It’s not a silver bullet. The main ingredient in getting anything done is amazing experienced people who are team players and want to improve the product or service they’re building. 

**Suggested remedy:** Remember why you got into programming in the first place. The answer for me is to have fun, get paid for it and build amazing things that help other people. I want to manage deadlines, costs and slackers as much as the next guy, but checking up on people daily is not my idea of trust. Always stay away from the politics, and stand up for yourself if you find yourself up against hostile people who are asking too much. At the end of the day, the doer is the most important person in the room. As a doer you hold a lot of power over the talkers, and if they aren’t nice to you they can either do the work themselves or find someone else who will put up with it right? I let those people who love their meetings and rituals get on with it, I focus on building amazing products that help others and that I’m passionate about. If you want some fun counting the cost of scrum, try [running these numbers](https://www.aaron-gray.com/a-criticism-of-scrum/#count-the-cost) through our [Meeting Cost Calculator](/tools/meeting-cost-calculator).

## You find ‘how long do you think that will take’ hard to answer 

It’s a question so difficult to answer, yet asked by everyone. Entire books have been written on the subject of giving accurate estimates. The problem is it’s not always taken as an estimate, but as a commitment. I feel unless you’ve done the exact same thing a hundred times before, in a similar setting, the estimate will be wrong. That creates resentment, dysfunction and distrust after ‘missed’ deadlines. It makes people feel bad, they feel responsible because they thought it would be done quicker. They question their own ability to get things done, when it could be something outside of their own control or something unforeseen by everyone. There are many things you know you don’t know and things that surprise us when it’s too late to change course. This can lead to a very negative mood. I watched an interesting talk on [no estimates](https://youtu.be/QVBlnCTu9Ms) that seems like a great way to work. If you’re building and improving a working product consistently, on time and in budget, why do estimates matter anyway? There is strong evidence that once a task requires even rudimentary cognitive skill, rewards and other motivators (like deadlines) simply don’t work, they actually [lead to poorer performance](https://youtu.be/rrkrvAUbU9Y?t=98). 

**Suggested remedy:** Honesty is the best policy. If you’ve done something similar, use that as a starting point, and maybe double it. State your plan out loud - this will help to break down the steps and what tasks are involved. Give a range, so something like ‘worst case scenario one week, best case three days’. Don’t try to impress anyone and if you don’t know how long something will take, say so. People don’t like uncertainty, they will press for an estimate, but if you’ve never done something before how can you say how long it will take? Better to ask for time to explore the problem first, or speak to a more experienced colleague, to gauge how much effort is involved in solving it. You’ll feel much better and you won’t be pressured into accepting a timeframe you’re not comfortable with. At the end of the day, be professional, but things take as long as they take. If you ever find yourself in a disagreement, come at it from a business / economics point of view - writing subpar code and cutting corners slows things down in the long run and the costs of that can be massive. Finally, if you want to provide more robust estimates using statistical techniques be sure to check out our [Agile Task Estimation Calculator](/tools/agile-task-estimation-calculator).

## You feel like you’re not good enough 

This is referred to as ‘imposter syndrome’ and it affects everyone I think. Sometimes you get a negative feeling that you simply don’t know enough to be good. What I’ve seen is that programmers of all types are looked to for guidance. They are seen as the experts, the problem solvers and the clever people in the room. So what happens when the expert is asked a question they don’t know the answer to? Or asked a question others think they should know the answer to? They feel like a fraud or unqualified for their position. These feelings make you doubt and question yourself as to how good you are. This is true of newcomers and veterans alike, I imagine veterans have become better at handling these thoughts, but not always. I think when you arrive at a point where you’ve built some projects that others have used (production code) it helps with those doubts. You have concrete evidence that you can code, you can solve problems, and you can build working products. You might not be an expert at everything, but you know enough to get things done. 

**Suggested remedy:** Remember no one can know everything. Even experts in every field forget or don’t know something from time to time. Work hard at filling gaps in your knowledge - if you don’t understand something, read up on it. If you work alongside someone who knows way more than you do, learn from them. Never stop learning new things whenever you can. Staying inquisitive is better than assuming or pretending you know everything. It is this motivation to learn new things, and find solutions to problems that gives you immense worth, not pre-existing knowledge. 

## You are no longer learning anything new

At the beginning of a new role, learning is the main activity. You might be learning a new technology stack, a new programming language or a new way of working. This process of initial learning can last up to a year I've found. You pick up small bits of information until eventually, there isn't much that happens which surprises you. You reach a competence level in a role where you know how to solve everything (almost everything). In a good organisation, you'll be encouraged to try new things, learn new technologies and undergo any training that can help you improve professionally. In a bad organisation, you won't. Regardless, both of these situations can still lead to you feeling negative. The reason for that is no matter how much learning and development you do, if you're not using that new-found knowledge on a day-to-day basis, it won't be fully realised. Say you learn about cloud computing services with AWS, but your organisation uses Azure, you won't get to use that new skill. Was it worth learning though? Absolutely. You have a new valuable skill, but to use it day-to-day in a professional setting, it might require you to change organisations. Even worse is the scenario where your organisation doesn't encourage learning new things. I think of this erroding the value of your skills the same way [inflation](https://en.wikipedia.org/wiki/Inflation) errodes the value of money. You see, whilst you're working for an organisation that offers no time for learning, they're gaining your portfolio of skills, without giving you the time to grow that portfolio. Over time, your portfolio becomes less valuable - new technology emerges, updates are made to existing technology and frameworks, and old skills become rusty. Some may argue the portfolio should be maintained on your own time, I disagree - any organisation you work for should be very interested in the state of your skills portfolio and actively help you to grow it.

**Suggested remedy:** Always keep your skills portfolio healthy and growing. Continuously learn new things whether it be via online courses on platforms like EdX, Coursera or YouTube, or reading a technical book. The more knowledge you add to your portfolio, the more marketable, valuable and competent you become. Knowledge certainly is power, but it can also [improve your mental wellbeing](https://www.nhs.uk/mental-health/self-help/guides-tools-and-activities/five-steps-to-mental-wellbeing/#:~:text=Research%20shows%20that%20learning%20new,you%20to%20connect%20with%20others), boosting your self-confidence, self-esteem and giving you new directions and opportunities. If you're in an organisation where you're not encouraged to learn new things or you feel locked into a particular tech stack with no room for growth, consider finding another organisation or another role which does offer that support and a new challenge. 

## You don’t have anyone around you to turn to for help 

Programming is labelled as a job for introverts. However programming is very much a team game. Think about how many times you search Google or Stack Overflow to find insight and guidance - you’re consulting with the community each time. The problems that these places can’t help you with, are the problems very specific to the project or place you’re working at. You might find yourself at a loss when you face these issues, I certainly have done. You have to turn to other members of the team with internal knowledge of the company to solve these problems. On the odd occasion, particularly smaller projects whilst working as a solo developer, there is no one to really turn to. In these circumstances, I’ve had non-technical managers to turn to, but they can’t really help you fix issues within the code base. I think there should always be someone you can go to for technical guidance and support. This is true whether you are beginner, intermediate or advanced. Just because you might be advanced in most areas, doesn’t mean something won’t come up that makes you feel like a total beginner. In most cases, I’ve been fortunate enough to have someone around. On my first real project, I worked alongside an amazing senior developer who could talk the talk and walk the walk, and would always be available to guide me. When that’s not the case, it can leave you feeling isolated, with no one to turn to for support and therefore unable to deliver what’s being asked. It can make you feel like quitting, because without a mentor of any kind to guide and support you to the next level, you lose direction and focus.  

**Suggested remedy:** Remember you can’t do it all by yourself all of the time. As said before, be honest. Let it be known that you need support on something - and if you don’t get it then it then offer two choices. Either the ask is abandoned because you tried but can’t see a way to do it, or you can carry on trying for a little longer with no guarantee you can get it done. Finally, if you feel like you have no mentor to learn from and no support at all for a long period of time, the best thing to do might be to leave and find somewhere that does offer those things.  

## Key takeaways 

I hope this article has given you ways to keep your mental and physical health in shape as someone who writes code professionally. There are so many positives to programming, it’s like no other activity, a mixture of art, creativity and science that can bring real joy to those that practice it. Nevertheless, you need watch out for the negatives listed in this article and work to balance your pursuit with your health and life. Code runs the world, and I think the demand for enthusiastic dedicated programmers is only going to keep going up. Not only will they need to learn the technical topics, but also topics such as these. It will hopefully make programmers realise their worth and to prioritise their mental and physical wellbeing.  If there are any ways you use to keep a healthy mindset or overcome certain negatives, let me know in the comments below. 

Here is a recap of all the suggested mindset remedies mentioned in this article: 

* Always have a start and end time to your working day 
* Respect your free time 
* Do something other than programming in your free time sometimes 
* Don’t neglect exercise - your health is the most important thing 
* Remember why you got into programming in the first place - for me to have fun, get paid for it, learn new things and build amazing stuff 
* Be honest and professional when giving estimates - but admit if you can’t say how long something will take 
* Stand up for yourself if you’re being made to accept a timeframe that is unrealistic 
* Remember no one can know everything  
* Your motivation to find solutions to problems is what gives you worth 
* Stay inquisitive and learn new things whenever you can 
* Keep your skills portfolio healthy and growing
* Remember you can’t do it all by yourself all of the time 
* If you don’t have a mentor or any support, consider moving to a place that provides those things ]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[How to query a database with Python Flask and download data to CSV or XLSX in Vue]]></title>
            <link>https://shedloadofcode.com/blog/query-sql-and-download-csv-and-xlsx-in-flask/</link>
            <guid>https://shedloadofcode.com/blog/query-sql-and-download-csv-and-xlsx-in-flask/</guid>
            <pubDate>Thu, 27 May 2021 17:18:00 GMT</pubDate>
            <description><![CDATA[Build a Vue Flask app that will query a SQL database and return data to the browser. We'll then add the ability to view and download the data in CSV or XLSX format using Axios.]]></description>
            <content:encoded><![CDATA[
## Background

I was recently working on a project building a web app to automate viewing and downloading data. The result was a Vue - Flask app which accepted some user input, and based upon that input, sent the relevant SQL query to a data warehouse. The data could then be viewed or downloaded straight from the browser. There were a multitude of benefits from this. The queries no longer needed to be ran manually, saving time. They were indexed and easily updated. Finally, the data was more accessible via a web app to users without knowledge of SQL.

This article will cover building a simplified version of this app where we’ll go over the following:

* Optional: Setting up an Azure SQL database for testing
* Getting a Vue - Flask app set up from a template
* Creating a SQL query lookup
* Configuring Flask RESTX API endpoints
* Sending an Axios call to get data
* Building a simple form to accept user input
* Presenting the data in the browser
* Adding links to download the data
* Bonus: Displaying the SQL code nicely formatted

You can use this as a starting point to further develop a more complex and tailored solution. You’ll need either your own database set up to follow along, or you can set one up in the optional first step. I’ll be setting up and connecting to an Azure SQL database however it should be adaptable to other databases. You'll also need [Python 3.6.x](https://www.python.org/downloads/) along with [Node](https://nodejs.org/en/) and [Yarn](https://yarnpkg.com/getting-started/install) installed.

## Optional: Setting up an Azure SQL database for testing

This first step is optional as you might already have your own database you want to connect to. To facilitate an end to end tutorial, I’m setting up an Azure SQL database for testing. You can register for an [Azure account](https://azure.microsoft.com/en-gb/free/) which has some services free for 12 months. The video below starts from the [Azure portal](https://portal.azure.com/). It will guide you through the process of setting up an Azure SQL database with a sample AdventureWorks dataset, and find the connection string.

<article-video 
  id="ZKqyRdgouu0" 
  title="Creating an Azure SQL database with sample dataset">
</article-video>

Now make a note of the connection string, we’ll need that later on. It should look something like this. 

```
Driver={ODBC Driver 13 for SQL Server};Server=tcp:test-sql-server-0123.database.windows.net,1433;Database=test-sql-database-01;Uid=AdminUser;Pwd={your_password_here};Encrypt=yes;TrustServerCertificate=no;Connection Timeout=30;
```

Replace `{your_password_here}` with the password you created during setting up the SQL database.

## Setting up a Vue - Flask project

First things first, head over to this [public repository](https://github.com/gtalarico/flask-vuejs-template) and download the project template. This is a great project template from gtalarico and this will be our starting point. Use whichever editor or IDE you're comfortable with, I'm using Visual Studio Code. The general project structure should look something like this:

``` [project-structure.txt]
flask-vuejs-template-master
│   README.md
|   .flaskenv
|   .gitignore
|   app.json
|   package.json
│   Pipfile
|   Pipfile.lock
|   run.py
|   vue.config.js
|   yarn.lock
|   ...
|     
│
└───app
│   │   __init__.py
│   │   client.py
|   |   config.py
│   │
│   └───api
│       │   __init__.py
|       |   resources.py
│       │   security.py
│   
└───src
│   │   App.vue
│   │   backend.js
|   |   filters.js
|   |   main.js
|   |   router.js
|   |   store.js
│   │
│   └───assets
│   |   │   ...
│   └───components
│   |   │   HelloWorld.vue
│   └───views
│       │   Api.vue
│       │   Home.vue
```

The `app` directory contains the Flask app and the `src` directory contains the frontend Vue app. We'll now install pipenv, create a virtual environment, install the project packages to it, and activate it. The Pipfile requires Python 3.6, but you should be able to manually change this if you have a different Python version. We'll be installing `flask-restx`, a community driven fork of Flask-RESTPlus. We'll also be installing `pyodbc` to connect to the SQL database, `xlsxwriter` for downloading an excel file and `pandas` for general dataframe processing.

```
cd flask-vuejs-template-master
python -m pip install pipenv
python -m pipenv install --dev
python -m pipenv install flask-restx pyodbc xlsxwriter pandas
python -m pipenv shell
```

Now that the Python packages are installed, let's install and upgrade the Vue dependencies with Yarn, and build the Vue dist directory.

```
yarn install --dev
yarn upgrade
yarn build
```

If everything went smoothly, you should be able to run both the backend and frontend dev servers. Run `python run.py` and from another terminal window in the same directory run `yarn serve`. You should see the app running at `http://localhost:8080/#/`.

<article-image 
  src="https://res.cloudinary.com/dayqxxsip/image/upload/v1622740460/App%20Images/Blog%20Images/Article%20Images/flask-vue-template-1_cq7c2d.png" 
  alt="Vue Flask starter app" 
  loading="lazy" 
  styling=""
  caption="Vue Flask starter app running locally" 
  captionsrc="" 
  :showsource="false">
</article-image>

## Creating a SQL query lookup table

Creating a simple lookup table for the SQL queries that the database expects will be useful for later on. Of course, the queries here are specific to the AdventureWorksLT database I’m working with, so feel free to adapt them to yours. Create another folder inside the `app` folder called `data`. Then create a `lookup.csv` file and copy the data below into it. The other columns will map to the user’s input to find their chosen query.

``` csv [lookup.csv]
Query,SQL
All customers who live in Canada,"SELECT C.[FirstName],C.[LastName],A.[AddressLine1],A.[CountryRegion]FROM [SalesLT].[Customer] C JOIN [SalesLT].[CustomerAddress] CA ON CA.[CustomerId] = C.[CustomerId] JOIN [SalesLT].[Address] A ON CA.[AddressId] = A.[AddressId] WHERE  A.[CountryRegion] LIKE 'Canada'"
All products ordered by price,"SELECT TOP (1000) [ProductID],[Name],[ProductNumber],[Color],[StandardCost],[ListPrice],[Size],[Weight] FROM [SalesLT].[Product] ORDER BY [ListPrice] DESC"
Total revenue for each product,"SELECT P.Name, SUM(LineTotal) AS TotalRevenue FROM [SalesLT].[SalesOrderDetail] AS SOD JOIN [SalesLT].[Product] AS P ON SOD.[ProductID] = P.[ProductID] GROUP BY P.Name ORDER BY TotalRevenue DESC"
```

The key thing to note here are the double brackets which escape commas inside the SQL statements.


## Configuring Flask API endpoints

Since we’ll be using a Vue single page application, there will need to be endpoints for it to send requests to later on. Let’s get started building these out. Within `app/api` add a file `query.py`. This will be our main API route for handling queries. Once the file is created open `api/__init__.py` and add the `.query` import just underneath the `.resources` import, to ensure our new route is registered.

``` python [app/api/__init__.py]
...

# Import resources to ensure view is registered
from .resources import * # NOQA
from .query import *

```

Now in `query.py` add two routes, one for getting the data, and one which will download the data.

``` python [app/api/query.py]
import os
import io
from flask import request, send_file, make_response
from flask_restx import Resource
from . import api_rest
import pyodbc
import pandas as pd

connection_string = os.getenv("DB_URI")

@api_rest.route('/query/get')
class GetData(Resource):

    def post(self):
        """ Retrieves data from the database """
        # TODO

@api_rest.route('/query/download')
class DownloadData(Resource):

    def post(self):
        """ Returns data as a downloadable file """
        # TODO
```

**Adding an environment variable for DB_URI**

As you can see we're ready to hook up the connection string for our database using `os.getenv("DB_URI")`. The best and most secure way to do that is via an environment variable. This template has the `python-dotenv` package installed, so we can use a `.env` file. At the folder top level create a file called `.env` and add in your own connection string:

``` [/.env]
DB_URI="DRIVER={ODBC Driver 17 for SQL Server};SERVER=test-sql-server-0123.database.windows.net;DATABASE=test-sql-database-01;UID=AdminUser;PWD={your_password_here}"
```

Now the environment variable is added, you will have to close your current terminal, start a new one and reactivate the shell with `python -m pipenv shell`. This should show a message during start saying `Loading .env environment variables...` so we know they're registered!

**Route for getting data**

With the connection string ready, let's complete the route for retrieving data from the database. We'll be grabbing the query from the POST request, and then we'll use the lookup file we made earlier to find the correct SQL statement.

``` python [app/api/query.py]
@api_rest.route('/query/get')
class GetData(Resource):

    def post(self):
        """ Retrieves data from the database """
        query = request.get_json()['query']

        lookup = pd.read_csv(os.path.join(
            os.getcwd(), "app", "data", "lookup.csv"))
        
        sql_statement = lookup.loc[lookup["Query"] == query, "SQL"].iloc[0]

        conn = pyodbc.connect(connection_string)
        dataframe = pd.read_sql(sql_statement, conn)
        conn.close()

        return {
          "sql_statement": sql_statement,
          "data": dataframe.to_json()
        }
```

**Route for downloading data**

Next we'll complete the route which will query the database and return a downloadable file in either CSV or XLSX format.

``` python [app/api/query.py]
@api_rest.route('/query/download')
class DownloadData(Resource):

    def post(self):
        """ Returns data as a downloadable file """
        file_type = request.get_json()['fileType']
        query = request.get_json()['query']

        lookup = pd.read_csv(os.path.join(
            os.getcwd(), "app", "data", "lookup.csv"))

        sql_statement = lookup.loc[lookup["Query"] == query, "SQL"].iloc[0]

        conn = pyodbc.connect(connection_string)
        dataframe = pd.read_sql(sql_statement, conn)
        conn.close()

        if file_type == "csv":
          response = make_response(dataframe.to_csv(index=False))
          response.headers["Content-Disposition"] = "attachment; filename=data.csv"
          response.headers["Content-Type"] = "text/csv"

          return response
        elif file_type == "xlsx":
          bytes_stream = io.BytesIO()

          writer = pd.ExcelWriter(bytes_stream, mode="w", engine="xlsxwriter")
          dataframe.to_excel(writer, startrow=0, merge_cells=False,
                            sheet_name="Sheet_1", index_label=None, index=False)
          writer.save()

          bytes_stream.seek(0)

          return send_file(bytes_stream,
                          attachment_filename="data.xlsx",
                          mimetype="application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
                          as_attachment=True)
```

Now with the endpoints built we can call them from the Vue app using Axios.

## Sending an Axios request to get data

Open up `src/components/HelloWorld.vue` and delete everything so we can start from a blank template.

``` html [src/components/HelloWorld.vue]
<template>
  <div>
    <h1>Select your query</h1>
    <div>
      <select v-model="selectedQuery">
        <option disabled value="">Please select a query</option>
        <option>All customers who live in Canada</option>
        <option>All products ordered by price</option>
        <option>Total revenue for each product</option>
      </select>
      <button v-on:click="getData()">Get data!</button>
    </div>
    <div v-if="dataframe">
      {{ dataframe }}
      <button v-on:click="downloadData('csv')">Download data to CSV</button>
      <button v-on:click="downloadData('xlsx')">Download data to XLSX</button>
    </div>
  </div>
</template>

<script>
import axios from 'axios';

export default {
  data() {
    return {
      selectedQuery: null,
      dataframe: null,
      sqlStatement: null
    }
  },
  methods: {
    getData() {
      // TO DO
    },
    downloadData() {
      // TO DO
    }
  }
}
</script>
```

**Get data method**

``` javascript [src/components/HelloWorld.vue]
    getData() {
      axios.post(`api/query/get`, { query: String(this.selectedQuery) })
        .then(response => {
          this.sqlStatement = response.data.sql_statement;
          this.dataframe = JSON.parse(response.data.data);
        })
    },
```

**Download data method**


``` javascript [src/components/HelloWorld.vue]
    downloadData(fileType) {
      axios.post(`api/query/download`, {
        fileType: fileType
      }, {
        responseType: fileType === "csv" ? "text" : "arraybuffer" 
      })
        .then(response => {
          let filename = response.headers["content-disposition"].split("filename=")[1];
          
          if (fileType === "csv") {
            const csv = response.data;
            const link = document.createElement("a");
            link.target = "_blank";
            link.href = "data:text/csv;charset=utf-8," + encodeURIComponent(csv);
            link.download = filename;
            link.click();
          } 
          
          if (fileType === "xlsx") {
            const blob = new Blob([response.data], { type: 'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet' });
            const url = window.URL.createObjectURL(blob);
            const link = document.createElement("a");
            link.target = "_blank";
            link.href = url;
            link.download = filename;
            link.click();
            window.URL.revokeObjectURL(url);
          }
        })
        .catch(error => console.log(error));
    }
```

You should now see data on the page after selecting an option and clicking the get data button. It might not look too good just yet, but it will soon. Now we have the Axios calls working and ready to go, we can improve the UI and render the data to a table.

## Viewing the data in a table and downloading

To improve the UI and display the returned data in a table, let's install [Buefy](https://buefy.org/), which has lightweight UI components for Vue.js based on [Bulma](https://bulma.io/).

```
yarn add buefy
```

With Buefy installed, initialise it within `src/main.js`:

``` javascript [src/main.js]
import Vue from 'vue'
import App from './App.vue'
import router from './router'
import store from './store'
import Buefy from 'buefy'
import 'buefy/dist/buefy.css'
import './filters'

Vue.use(Buefy)

Vue.config.productionTip = false

new Vue({
  router,
  store,
  render: h => h(App)
}).$mount('#app')
```

Here is the revised `HelloWorld.vue` component to improve the UI and display our data in a table!

``` html [src/components/HelloWorld.vue]
<template>
  <div class="columns is-mobile is-centered">
    <div class="column is-half">
      <div>
        <b-field label="Select your query">
          <b-select v-model="selectedQuery" placeholder="Choose a query">
            <option disabled value="">Please select a query</option>
            <option>All customers who live in Canada</option>
            <option>All products ordered by price</option>
            <option>Total revenue for each product</option>
          </b-select>
        </b-field>
        <div class="buttons is-centered">
          <b-button v-on:click="getData()" type="is-primary">
            Get data
          </b-button>
        </div>
      </div>
      <div v-if="rows">
        <b-table
          :data="rows"
          :columns="columns"
          :sticky-header="true"
          height="600px"
        ></b-table>
        <div class="buttons is-centered mt-5">
          <b-button v-on:click="downloadData('csv')" type="is-primary">
            Download data to CSV
          </b-button>
          <b-button v-on:click="downloadData('xlsx')" type="is-primary">
            Download data to XLSX
          </b-button>
        </div>
      </div>
    </div>
  </div>
</template>

<script>
import axios from "axios";

export default {
  data() {
    return {
      selectedQuery: null,
      dataframe: null,
      sqlStatement: null,
    };
  },
  methods: {
    getData() {
      // Already implemented
    },
    downloadData(fileType) {
      // Already implemented
    },
    /**
     * Retrieves column names from the dataset.
     *
     * @param {Object} data
     * @return {Object} Object containing dataset rows.
     */
    getColumns(data) {
      let columns = Object.keys(data);
      return columns.map((name) => ({
        field: name,
        label: name,
      }));
    },
    /**
     * Transforms dataset into an object of row-data.
     *
     * @param {Object} data
     * @return {Object} Object containing dataset rows.
     */
    getRows(data) {
      let rows = [];
      let numberOfRows = this.getRowCount(data);
      let index = this.getStartIndex(data);

      for (let i = 0; i < numberOfRows; i++, index++) {
        let row = {};
        for (let col in data) {
          row[col] = data[col][index];
        }
        rows.push(row);
      }
      return rows;
    },
    /**
     * Counts the rows in each column and returns max count
     *
     * @param {Object} data
     * @return {Number} Count of rows
     */
    getRowCount(data) {
      let rowsInColumns = [];
      for (let col in data) {
        let rows = Object.keys(data[col]).length;
        rowsInColumns.push(rows);
      }
      return Math.max.apply(null, rowsInColumns);
    },
    /**
     * Gets the start index for the dataset
     *
     * @param {Object} data
     * @return {Number} Start index
     */
    getStartIndex(data) {
      for (let prop in data) {
        let column = data[prop];
        for (let row in column) {
          return row;
        }
        break;
      }
    },
  },
  computed: {
    rows() {
      if (this.dataframe) {
        return this.getRows(this.dataframe);
      }
    },
    columns() {
      if (this.dataframe) {
        return this.getColumns(this.dataframe);
      }
    }
  }
};
</script>
```

I've added helper methods to wrangle the returned data so it can be used with the [table component](https://buefy.org/documentation/table/). The table component expects a `column` prop as an array of column objects, and a `data` prop as an array of row objects.

So effectively we transform something like this:

``` json
{
   "Name": {
      "0":"Touring-1000 Blue, 60",
      "1":"Mountain-200 Black, 42",
      "2":"Road-350-W Yellow, 48",
      "3":"Mountain-200 Black, 38",
      "4":"Touring-1000 Yellow, 60",
      "5":"Touring-1000 Blue, 50",
   },
   "TotalRevenue": {
      "0":37191.492,
      "1":37178.838,
      "2":36486.2355,
      "3":35801.844,
      "4":23413.474656,
      "5":22887.072
   }
}
```

... into something like this:

``` json
[
  {
    "Name":"Touring-1000 Blue, 60",
    "TotalRevenue":37191.492
  },
  {
    "Name":"Mountain-200 Black, 42",
    "TotalRevenue":37178.838
  },
  {
    "Name":"Road-350-W Yellow, 48",
    "TotalRevenue":36486.2355
  },
  {
    "Name":"Mountain-200 Black, 38",
    "TotalRevenue":35801.844
  },
  {
    "Name":"Touring-1000 Yellow, 60",
    "TotalRevenue":23413.474656
  }
]
```

You should now see the data rendered in the table and two buttons at the bottom to download it in either CSV or XLSX format. Both use cases are now fulfilled! Great job if you made it this far!

## Bonus: Displaying the SQL query nicely formatted

What if a more advanced user is interested in what underlying SQL query was executed based upon their selections? That was the reason I added the query from the lookup to the JSON response, so it would be available for this last nice to have! I came across a package recently [sql-formatter](https://yarnpkg.com/package/sql-formatter) that formats SQL for easier reading. Using this package with [prism](https://yarnpkg.com/package/prismjs), not only will the SQL be formatted but also have syntax highlighting. First to install and configure both.

```
yarn add sql-formatter prismjs
```

Now these two packages are ready to go, add the SQL query underneath the download buttons, import both packages and they should handle the rest.

``` html [src/components/HelloWorld.vue]
          ...
          <b-button v-on:click="downloadData('xlsx')" type="is-primary">
            Download data to XLSX
          </b-button>
        </div>
      </div>
      <div v-show="sqlStatement">
        <h2 class="is-size-5 mt-5">View the SQL statement this query</h2>
        <pre class="language-sql" style="font-size: 16px">
          <code v-html="'\n' + sqlStatement">
          </code>
        </pre>
      </div>
    </div>
  </div>
</template>

<script>
import axios from "axios";
import { format } from 'sql-formatter';
import Prism from 'prismjs';
import 'prismjs/themes/prism.css';
import 'prismjs/components/prism-sql';

export default {
  data() {
    return {
      selectedQuery: null,
      dataframe: null,
      sqlStatement: null,
    };
  },
  updated() {
    Prism.highlightAll();
  },
  methods: {
    getData() {
      axios
        .post(`api/query/get`, { query: String(this.selectedQuery) })
        .then((response) => {
          this.sqlStatement = format(response.data.sql_statement, {
            language: "tsql",
            uppercase: true
          });
          this.dataframe = JSON.parse(response.data.data);
        });
    },
    ...
  }
}
</script>
```

As you can see I've wrapped the returned `response.data.sql_statement` with the `format` function and added `Prism.highlightAll()` to the `updated` lifecycle hook - so everytime the DOM updates it will highlight the new query!

## Demonstration and next steps

Here is a video of the completed project in action. It’s a simplified version of the app I worked on, however you should see the potential to make this your own and introduce additional functionalities. 

<article-video 
  id="r3DLSWC-vi4" 
  title="Querying a SQL database with Vue Flask app demo">
</article-video>

I hope you enjoyed this end to end project, let me know in the comments if you have any questions or if you've adapted this to your own needs. I think this is a very popular use case that can automate manual queries and put data in the hands of people who don’t know much SQL - they will certainly thank you for opening that door up for them! In terms of next steps and ideas for further development I suggest:

* Make the code more modular and introduce a service layer
* Generate the Form options dynamically from the SQL lookup sheet
* Pull the SQL lookup file from cloud storage like S3, Google Cloud Storage or Azure blob storage (allows admin to upload a new version with new queries easily)
* Deploying this app to a cloud hosting platform like AWS or Azure
* Adding user authentication if required
* Designing and improving the UI (this tutorial was more focused on functionality than UI design)
* Expanding the range of queries available
* Building dynamic queries into the app including where and group by clauses (always be aware of what SQL you’re allowing the user to execute to avoid SQL injection attacks)
* Connecting to multiple databases
* Adding a 'copy code' to clipboard button

If you enjoyed this article be sure to check out other articles on the site 👍 you may be interested in:

* [How to upload PDF files to Azure Blob Storage with Vue and Python Flask](/blog/how-to-upload-pdf-files-to-azure-blob-storage-with-vue-and-python-flask/)
* [Automated deployment of a Vue Flask app using Azure Pipelines](/blog/automated-deployment-of-a-vue-flask-app-using-azure-pipelines/)
* [How to import a CSV from Dropbox or GitHub into Google Sheets](/blog/how-to-import-a-csv-from-dropbox-or-github-into-google-sheets/)]]></content:encoded>
        </item>
    </channel>
</rss>