Shedload Of Code

Azure Function to vectorise PDFs and store in Qdrant container app with OpenAI and Python

Sun, 30 Jun 2024 18:00:00 GMT

## Introduction

In this one we'll be going through part of an [LLM](https://en.wikipedia.org/wiki/Large_language_model) / [Azure OpenAI](https://azure.microsoft.com/en-gb/products/ai-services/openai-service) project I worked on recently. This involved:

1. Triggering an Azure Function app when a new PDF is dropped into an Azure Blob Storage container.
2. Vectorising the document in the Azure Function app.
3. Dropping those vectors into a Qdrant database running in an Azure Container app.
4. A Python FastAPI app which [retrieved and queried](https://en.wikipedia.org/wiki/Document_retrieval) those document vectors to answer user questions.

I will be outlining the process and code on how to vectorise documents in an Azure Function, but cannot give too much detail due to the organisation's data protection policy. I have redacted some details in the images, but you should get a good feel for how to put together a solution like this if it's something you're interested in.

It was quite a lightweight solution but still had many moving parts. There are tools that make this process easier and do the heavier lifting which I'm learning about currently, including [Azure Prompt Flow](https://learn.microsoft.com/en-us/azure/ai-studio/how-to/prompt-flow) and [Azure AI Search](https://learn.microsoft.com/en-us/azure/search/search-what-is-azure-search) for documents. However, this process is more customisable and provides greater control.

## Drop a PDF into blob storage

The first step was to set up a resource group in Azure, and create these components:

* Blob storage container
* Function app
* Qdrant container app

Once a new PDF is dropped into the storage account, the function app is automatically triggered and begins to run.

## Function app automatically triggers

The function app is triggered now that a new PDF has been uploaded to the blob storage container. Here is the trigger configured in the function app. Here is the Python code that drives the vectorisation process.
The docstring at the top of the file outlines the steps the function app takes. It triggers, reads the PDF, vectorises, and stores the vectors.

```python [function_app.py]
"""
An Azure function app which:
- is triggered when a new PDF file is added to the blob container 'docupload'
- reads the PDF file and turns to vectors
- stores the vectors in Azure Qdrant

Components:
- Function app
- Blob store
- Qdrant container

Prerequisites:
- Install VS Code Azure extension
- Read getting started documentation at https://shorturl.at/59jYg
"""
import azure.functions as func
import logging
import fitz
import openai
import qdrant_client.models as models
import tiktoken
from langchain.text_splitter import RecursiveCharacterTextSplitter
from qdrant_client import QdrantClient
from qdrant_client.http.models import *
from qdrant_client.fastembed_common import *

app = func.FunctionApp()


@app.blob_trigger(arg_name="myblob", path="docupload",
                  connection="saaicdupsertvectorspoc01_STORAGE")
def aicdfaupsertvectorspoc(myblob: func.InputStream):
    # 1. Read document
    blob_name: str = myblob.name
    logging.info(f"Python blob trigger function processed blob. "
                 f"Name: {myblob.name}. "
                 f"Blob Size: {myblob.length} bytes")

    try:
        document = fitz.open(stream=myblob.read(), filetype="pdf")
        logging.info(f'PDF read successfully: {document}')
    except Exception:
        # Without a document there is nothing to vectorise, so stop here.
        logging.error("The PDF could not be read.")
        return

    # 2. Vectorise document and upload to Qdrant
    def tiktoken_len(text: str) -> int:
        # Length in tokens, used by the text splitter to size the chunks.
        tokenizer = tiktoken.get_encoding("p50k_base")
        tokens = tokenizer.encode(text, disallowed_special=())
        return len(tokens)

    def data_upload(qdrant_index_name: str, document) -> None:
        settings = {
            "url": "https://ca-qdrant-poc.azurecontainerapps.io",  # The URL to your container app
            "host": "ca-qdrant-poc.azurecontainerapps.io",
            "port": "6333",
            "openai_api_key": "",  # Enter your OpenAI API key
            "openai_embedding_model": "text-embedding-ada-002"
        }

        # Extract the text page by page and tidy up a few mis-encoded characters.
        whole_text = []
        for page in document:
            text = page.get_text()
            text = text.replace("\n", " ")
            text = text.replace("\\xc2\\xa3", "£")
            text = text.replace("\\xe2\\x80\\x93", "-")
            whole_text.append(text)

        text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=500,
            chunk_overlap=100,
            length_function=tiktoken_len,
            separators=["\n\n", "\n", " ", ""],
        )

        chunks = []
        for record in whole_text:
            chunks.extend({"text": chunk} for chunk in text_splitter.split_text(record))

        # Connect to Qdrant and create the collection if it does not exist yet.
        try:
            client = QdrantClient(url=settings["url"], port=None)
            collection_names = [c.name for c in client.get_collections().collections]
            if qdrant_index_name not in collection_names:
                client.create_collection(
                    collection_name=qdrant_index_name,
                    vectors_config=models.VectorParams(
                        distance=models.Distance.COSINE, size=1536
                    ),
                )
        except Exception as e:
            logging.error("Unable to connect to QdrantClient")
            logging.error(f"Error message: {str(e)}")
            return

        # openai.Embedding and openai.error are the pre-1.0 openai SDK names.
        openai.api_key = settings["openai_api_key"]
        for point_id, observation in enumerate(chunks):
            text = observation["text"]
            try:
                res = openai.Embedding.create(
                    input=text, engine=settings["openai_embedding_model"]
                )
            except openai.error.AuthenticationError:
                # No chunk can succeed with a bad key, so give up.
                logging.error("Invalid API key")
                return
            except openai.error.APIConnectionError:
                logging.error(
                    "Issue connecting to open ai service. "
                    "Check network and configuration settings"
                )
                continue
            except openai.error.RateLimitError:
                logging.error("You have exceeded your predefined rate limits")
                continue

            client.upsert(
                collection_name=qdrant_index_name,
                points=[
                    models.PointStruct(
                        id=point_id,
                        payload={"text": text},
                        vector=res.data[0].embedding,
                    )
                ],
            )
            logging.info("Text uploaded")
        logging.info("Embeddings upserted")

    # The collection name is derived from the blob name, e.g. 'docupload/My File.pdf' -> 'my_file'
    file_index = blob_name \
        .strip() \
        .lower() \
        .replace(" ", "_") \
        .replace("docupload/", "") \
        .replace(".pdfblob", "") \
        .replace(".pdf", "")
    logging.info(f"File index: {file_index}")

    data_upload(qdrant_index_name=file_index, document=document)
```
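The `chunk_size=500` and `chunk_overlap=100` settings in the splitter can be hard to picture, so here is a rough illustration of the idea. This is a simplified stand-in, not what `RecursiveCharacterTextSplitter` does internally: it counts whitespace-separated words instead of tiktoken tokens and ignores the separator hierarchy. The helper name `chunk_words` is made up for this sketch.

```python
def chunk_words(text: str, chunk_size: int = 500, chunk_overlap: int = 100) -> list[str]:
    """Split text into word-based chunks where neighbours share `chunk_overlap` words.

    Hypothetical helper for illustration only; the function app itself
    uses LangChain's splitter with a token-based length function.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    words = text.split()
    step = chunk_size - chunk_overlap  # how far the window moves each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the window already reached the end of the text
    return chunks


if __name__ == "__main__":
    sample = " ".join(f"w{i}" for i in range(10))
    for chunk in chunk_words(sample, chunk_size=4, chunk_overlap=1):
        print(chunk)
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, which tends to help retrieval later on at the cost of storing some text twice.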

So above we can see the Azure function in the Azure portal and in VS Code, and the run logs - yes I found 65 ways to fail here but eventually found a way to succeed! The full logs end with "Embeddings upserted" so we know it completed successfully. Now to check the Qdrant container app to confirm for sure that the vector embeddings are present there.

## Vectors stored in Qdrant container app

The Qdrant container app was set up in Azure to hold the vector embeddings. If we head to the URL given for the container app and add **/dashboard/collections** we will see all of the document vector collections present in Qdrant.

Selecting a collection shows the vector embeddings that are stored in it.

By chunking and vectorising the PDF content and storing it in a Qdrant vector database, it can now work with the OpenAI LLM to answer questions based on the PDF documents.
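If you'd rather confirm the collections from a script than from the dashboard, something like the following should work. It's a minimal sketch that assumes the container app exposes Qdrant's REST API at the same base URL with no API key configured; `collection_names` and `fetch_collection_names` are hypothetical helper names, and the URL is the placeholder from the function app settings.

```python
import json
from urllib.request import urlopen


def collection_names(payload: dict) -> list[str]:
    """Pull the collection names out of a Qdrant 'GET /collections' response body."""
    return [c["name"] for c in payload.get("result", {}).get("collections", [])]


def fetch_collection_names(base_url: str, timeout: float = 10.0) -> list[str]:
    """Call the Qdrant REST API and return the names of all collections."""
    with urlopen(f"{base_url.rstrip('/')}/collections", timeout=timeout) as response:
        return collection_names(json.load(response))


if __name__ == "__main__":
    # Placeholder URL: swap in the URL of your own container app.
    print(fetch_collection_names("https://ca-qdrant-poc.azurecontainerapps.io"))
```

Each vectorised PDF should show up as one name in the list, matching the "File index" value written to the function logs.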

---

## Bonus: Saving snapshots for backups

During this process I found a useful feature for backup and disaster recovery planning. Once a PDF has been vectorised and upserted, you can save a snapshot of the vectors in the Qdrant dashboard. If everything is wiped, you can just upload the snapshot to a collection and you're back up and running. This gives you an alternative to re-running the function app for all the documents.

## How can I learn more about LLMs, Qdrant and OpenAI in Python?

First off, if you know nothing, the freeCodeCamp course and video [A Non-Technical Introduction to Generative AI](https://www.freecodecamp.org/news/a-non-technical-introduction-to-generative-ai/) is great.

Secondly, this is a useful article from DataCamp on the [25 Top MLOps Tools You Need to Know in 2024](https://datacamp.pxf.io/Wq1KkO) which includes Qdrant and LangChain.

Lastly, to learn more about using OpenAI with Python there is a DataCamp course [Working with the OpenAI API](https://datacamp.pxf.io/Orj5xK) or you can check out the [openai-python GitHub repo](https://github.com/openai/openai-python).

## Wrap up

That's everything for this one! There are lots of things to explore when it comes to LLMs and the new tools that are emerging. This was a pretty simple use case but required some discovery and learning to figure out how to do this.

I hope you enjoyed this article and it helps you out if you're planning on embarking on the same wild journey of vectorising documents in Azure from scratch! Thanks for reading.

If you know an even easier way to query and answer questions based on documents, please share it in the comments section at the bottom of this page.

I will likely write another article on the entire solution once it's fully completed. Keep an eye out for that.
Since you read this article all the way to the end you might also be interested in:

* [Concepts of Artificial Intelligence with Python - a review of CS50 AI](/blog/concepts-of-artificial-intelligence-with-python-a-review-of-cs50-ai/)
* [Developing your data science and analytical coding skills - a review of DataCamp](/blog/developing-your-data-science-and-analytical-coding-skills-a-review-of-datacamp/)
* [Creating a screen and mouse jiggler with Python](/blog/creating-a-screen-and-mouse-jiggler-with-python/)

Super quick setup guide to playing retro games using RetroPie, Dolphin and Redream

Sat, 13 Apr 2024 15:00:00 GMT

## Introduction

In this short article we'll quickly learn how to set up RetroPie with a Raspberry Pi 3b+ to play NES, SNES, GBA... and also set up Dolphin for GameCube and Redream for Dreamcast games on PC. I recently had a ton of fun setting these up and doing a little retro gaming and thought I'd share the experience I went through and what I learned 😄

This guide is designed to be no-nonsense (hopefully) - I won't be going into how to get game ROMs; the methods and ethics of acquiring ROMs are for someone else to discuss. I will also signpost the very useful resources and videos I used and collated to get started. This should speed up the setup for you and reduce the amount of research you need to do!

With that out of the way, let's get started.

## What you’ll need

I used the following to get a good retro gaming setup:

* A [Raspberry Pi 3b+](https://www.amazon.co.uk/Raspberry-Pi-3-Model-B/dp/B07BDR5PDW)
* A [64GB SanDisk MicroSD](https://www.amazon.co.uk/gp/product/B09X7C7LL1/) card
* A laptop with an i5 processor, 8GB RAM and its default, nothing-special graphics card - Intel HD Graphics 4400
* The latest version of RetroPie - go to the [RetroPie Download page](https://retropie.org.uk/download/) and download the latest version for your Raspberry Pi, for me this was the "Raspberry Pi 2/3/Zero 2 W" button. This worked well for NES, SNES, GB, GBC, GBA, N64 (some games are slow though), Dreamcast (some games are slow though).
* The PS1 BIOS files - search for [PS1 BIOS files](https://www.google.com/search?q=ps1+bios+files&oq=ps1+bios+files) ... you're looking for .bin files named scph5500, scph5501 and scph5502
* The latest version of Dolphin - go to the [Dolphin Download page](https://dolphin-emu.org/download/) and download the latest version for your OS such as Windows x64 v5.0-21264
* The latest version of Redream - go to the [Redream Download page](https://redream.io/download) and download the latest version for your OS such as Windows v1.5.0
* Some game [ROMs](https://en.wikipedia.org/wiki/ROM_image)

I found the 3b+ couldn't quite handle Dreamcast and since it's 32 bit, it couldn't install Redream. It also struggled with some N64 games and definitely wouldn't handle GameCube. Everything else was perfect including PS1. So I think Dreamcast and GameCube are best left for a half-decent laptop.

## RetroPie - Setup and adding ROMs

1. Head to the [RetroPie first installation](https://retropie.org.uk/docs/First-Installation/) page, watch the video, and follow the steps there to add the RetroPie image to your MicroSD card
2. Insert the MicroSD card into your Raspberry Pi
3. Attach the power, HDMI and controller to your Raspberry Pi
4. EmulationStation launches on bootup, then configure your controller buttons
5. Find the device's IP by selecting the Show IP option in the RetroPie menu after booting up your Raspberry Pi.
6. Add ROMs by copying them into the relevant folders at the IP address like \\192.168.1.113\roms for example
7. Select a game to launch it - you can then adjust the settings, change the emulator etc just before it loads

You can also [transfer ROMs by using a USB stick](https://retropie.org.uk/docs/Transferring-Roms/) if you prefer to do it that way instead of transferring over your network.

## RetroPie - Saving your game

After launching a game, *select + right bumper* saves the state, and *select + left bumper* loads the state. This saves to slot #0.

To save to another slot, press *select + dpad left or right* to change save slot, then same as before *select + right bumper* saves the state, and *select + left bumper* loads the state.

This [video tutorial neatly covers this process](https://www.youtube.com/watch?v=cIYwcJDShU0).

## RetroPie - Configuring PS1, NDS and DC

* You will need to [install an additional emulator for Nintendo DS](https://www.youtube.com/watch?v=IfY2FjaSaAk) called Drastic. Then add ROMs to the new "nds" folder in the RetroPie roms folder over the network.
* You will need to [install an additional emulator for Dreamcast](https://www.youtube.com/watch?v=yb3kYuLnkD8) called reicast or lr-flycast. However, a much better emulator is Redream, mentioned later in the article. Since the Raspberry Pi 3b+ is 32 bit Redream won't work on it, so a laptop/PC seems the better option for Dreamcast than the 3b+.
* You will need to add additional BIOS files to play PS1 games - you can find these with a [quick web search](https://www.google.com/search?q=ps1+bios+files&oq=ps1+bios+files). Also, ensure you add both the .bin files and a .cue file for the ROMs to the /roms/psx/ folder and ensure they are unzipped. You can take .bin files and create a .cue from them using a [cue maker](https://www.duckstation.org/cue-maker/).

## Dolphin - Setup and adding ROMs

So as mentioned earlier, I found the 3b+ definitely wouldn't handle GameCube. Everything else was perfect including PS1. So I think GameCube games are best left for a half-decent laptop. The setup is pretty straightforward.

* [Download the installer from the Dolphin site](https://dolphin-emu.org/).
* Run the download to launch Dolphin
* Follow this [handy video tutorial](https://www.youtube.com/watch?v=LzOIS7KqvdM&list=PL5TuPBnwdd6h172vufklL3VU4vv8lY4c8&index=10) to get set up with your controller and ROMs
* Launch a game
* To save/load a game, go to the taskbar at the top, select Emulation > Save/Load State > Save State to Slot/Load State from Slot

## Redream - Setup and adding ROMs

So as mentioned earlier, I found the 3b+ couldn't quite handle Dreamcast and since it's 32 bit, it couldn't install Redream. Everything else was perfect including PS1. So I think Dreamcast games are best left for a half-decent laptop. The setup is pretty straightforward.

* [Download the installer from the Redream site](https://redream.io/).
* Run the download to launch Redream
* Go to the library tab and add the folder containing your ROMs seen below
* Launch a game
* To save/load a game, while in the game hit the 'Esc' key which brings up the menu seen below, then save to a slot. You get 1 slot for free at the time of writing and unlocking more requires a payment.

## Bonus - How to format an SD card to default

Just in case you have a MicroSD card which is already in use and you want to effectively factory reset it to its default settings, wipe it and then add the RetroPie image:

* Search for diskpart.exe
* Open as admin in a command prompt
* Type `list disk`. ...
* Type `select disk X` where X is the SD card drive number. ...
* **WARNING!** Make sure you have selected the correct disk before proceeding as it will wipe the selected disk completely
* Type `clean` to clean the drive and wipe it

## Bonus - Overclocking your Raspberry Pi 3b+

Insert the MicroSD into your PC then in the boot folder find the "config" file. Open this in Notepad++ or Visual Studio, then edit and increase the arm_freq variable to overclock. This is the section you're looking for...

```[config.txt]
...
#uncomment to overclock the arm. 700 MHz is the default.
#arm_freq=800
arm_freq=1400
...
```

This [video tutorial neatly covers the process](https://www.youtube.com/watch?v=xXOi3xPLi6E&list=PL5TuPBnwdd6h172vufklL3VU4vv8lY4c8&index=12).

## Happy retro gaming

That's everything for this one! There are lots of things to explore when it comes to retro gaming. I really enjoyed trying to set this up and learnt a lot in the process.

I hope you enjoyed this article and found it useful, thanks for reading.

Since you read this article all the way to the end you might also be interested in:

* [Concepts of Artificial Intelligence with Python - a review of CS50 AI](/blog/concepts-of-artificial-intelligence-with-python-a-review-of-cs50-ai/)
* [Developing your data science and analytical coding skills - a review of DataCamp](/blog/developing-your-data-science-and-analytical-coding-skills-a-review-of-datacamp/)
* [Creating a screen and mouse jiggler with Python](/blog/creating-a-screen-and-mouse-jiggler-with-python/)

Reflections on digital streaming and reducing smartphone usage

Wed, 13 Mar 2024 17:50:00 GMT

This article covers some of the thoughts and steps I've taken to redefine my relationship with my smartphone, streaming and modern tech devices, using some nostalgic tactics.

## What is the problem with smartphones?

Smartphones and modern tech devices seemingly offer everything we could possibly need in a single device. They appear to be the perfect multi-tool. Is it a surprise we all seem to be addicted to them? Is that good for us?

I wasn't really an avid user to begin with, I don't use social media and yet still I found myself using my smartphone far too much. Checking the bank account, watching a how-to video, making a note, researching directions, looking up some information, streaming and listening to music. The heating system can be controlled via an app, the CCTV cameras can be viewed on an app... Almost everything and everyone is connected.

This is all super convenient but lately I've felt I / we spend far too much time on these devices, and with technology in general. It's verging on a minor addiction, like the compulsive checking of your phone even though you know there won't be any notifications - or none that you're interested in anyway!

Could it be that hyper-convenience is actually bad for us? I am starting to think so. Dopamine (the feel good chemical) is released not for the reward itself, but [in anticipation of the reward](https://medium.com/delasign/anticipation-is-worth-more-than-the-reward-3ed5e4883258#:~:text=It's%20not%20the%20reward%2C%20it's%20the%20anticipation.&text=A%20finding%20that%20led%20the,the%20craving%20for%20that%20reward.%E2%80%9D). This means even being in the presence of a smartphone is like having constant access to a metaphorical slot machine. It might have us on edge, constantly thinking things like:

* Can I find any new information?
* Should I make a note of that?
* What's my bank balance?
* I should look up directions to that place.
* ... and many more

All of this seems to be affecting our concentration and attention spans.
If [early modern humans](https://en.wikipedia.org/wiki/Early_modern_human) have survived for ~300,000 years without smartphones, why since [2007 and the first iPhone](https://youtu.be/x7qPAY9JqE4?si=oEj_RK7e3dPIWf3c&t=156) do we need them so much?

## What are the solutions to this problem?

I watched a few videos and lectures on solutions to this attention addiction problem. I am not against smartphones; they are great devices that help us, but there are many dangers too which can lead to bigger issues like anxiety, depression, insomnia and more. Here are the options I gathered:

* [Live without a smartphone](https://www.youtube.com/watch?v=uNQujCwCu88) - radical and difficult in a world built for smartphones with things like QR codes, medical apps, online government services and so on.
* [Confront and redefine your relationship with them](https://www.youtube.com/watch?v=2ldLwkj4dRc) - acknowledge it is an issue and work towards improving it using intentional app time limits, using only a few apps etc.
* [Learn to look up again](https://www.youtube.com/watch?v=m1_QlV6XCNs) - use tactics to help you manage your relationship, so put it away during social situations, ask others to put theirs away, don't sleep with or near your phone, and turn off notifications.

Now for my own reflections and tactics on how to reduce smartphone use and redefine your relationship with smartphones, tech and streaming. The result is a healthier, happier relationship with technology where you are more in control.

## Reset expectations

I think technological progress is great, but it can be harmful.
I think around 2005 was a sweet spot for technology use in that:

* Landlines were frequently used
* Texting was frequently used, picture messaging less so - it was also much harder to text on ['dumb' phones](https://en.wikipedia.org/wiki/Nokia_3310)
* Computers and laptops were bulky, slower but did the job
* Internet was available albeit much slower, with more viruses, less sophisticated, but felt more free and open
* Films were on DVD, options were buying or renting from [Blockbuster](https://en.wikipedia.org/wiki/Blockbuster_(retailer)) (not sure when Blockbuster collapsed) or for some they were downloaded via piracy - resulting in this [classic ad](https://www.youtube.com/watch?v=HmZm8vNHBSU).
* Music albums were released on CD, played on portable CD players or ripped to PC and then stored on iPods / MP3 players
* To find new music and artists you checked out [Last.fm](https://www.last.fm/)
* [YouTube](https://en.wikipedia.org/wiki/History_of_YouTube) launched February 2005
* [Facebook](https://en.wikipedia.org/wiki/History_of_Facebook) launched 2004, before that it was [Myspace](https://en.wikipedia.org/wiki/Myspace) and [Bebo](https://en.wikipedia.org/wiki/Bebo), which weren't really as widely adopted
* Endless scrolling didn't really exist in the same way it does now

So this world didn't include smartphones, and yes things were more inconvenient as a result. Nevertheless, **it still worked**. It still had everything we have now, more or less.

Now I'm not advocating we go back to these times, but we can certainly learn from them, and reflect on what we've gained and what we've lost. We can use some of those reflections to improve our lives in this hyper-connected attention-seeking world. Here we go...

## Use it like a tool (or a landline)

Smartphones are great multi-purpose tools, and that is part of the problem! One thing I've done to use it more like a tool is to use an app called [minimalist phone](https://www.minimalistphone.com/) for Android. I think there is an equivalent for iPhone too.
It's great for keeping the phone semi-dumb and highlighting only the apps you really need while keeping some tucked away out of view just in case you need them occasionally. The best part for me is that at the time of writing there is a one-time purchase available instead of monthly / yearly! I hope this never changes.

You can see in the image below I keep a few select apps on the front page. You can rename apps to keep things really simple, I renamed...

* YouTube Music to **Music**
* Kindle to **Books**
* YouTube to **Videos**

If you designate an app it will prompt you for how long you want to spend on that app. Swiping right gives you a search bar to find your tucked away apps which I have added to folders. You can also 'hide' apps, stashing them away and out of view completely.

Once your time is up you get a prompt to 'Take me out of here' or continue with 'More time'. It's a super helpful interface.

Some other helpful things were:

* Keep your smartphone in the same location - on a window sill so you have to physically go to it the same as a landline. When you're done, put it back. This keeps you distanced physically and mentally.
* When you have a question you want to ask Google, **ask yourself first**, try to figure it out, use that gift of a brain! Pretend Google doesn't exist, how would you work it out? How would you find that information?
* Remove all social media apps - stick to text and WhatsApp to message people or call them.
* Pretend it's a single use device - if you're listening to music, only do that, no app switching. This requires lots of willpower!
* Try using colour contrast mode to turn the display black and white - much less distracting
* If you're not using a minimalist phone app, clean up those apps, get rid of the unused and hide the infrequent ones, reduce to the **essential** tools

## Enjoy single use devices to prevent multi-tasking

My single device hacks to reduce reliance on streaming and to prevent multi-tasking are:

* A [2TB Toshiba external hard drive](https://www.amazon.co.uk/dp/B07994QL95) to play movies directly on TV or Xbox
* A [5th Gen iPod 60GB](https://www.ebay.co.uk/sch/i.html?_from=R40&_trksid=p4432023.m570.l1313&_nkw=ipod+5th+generation+A1136&_sacat=0) which I modded with a 256GB SD card and bigger battery for offline listening plus a [Bluetooth adapter](https://www.amazon.co.uk/dp/B09ZTBZHCN). I got this from eBay for £33 in great condition, even with songs loaded from the seller! Best purchase in ages. Only a small section of dead pixels on the screen.
* An old iPhone SE with no SIM card to use just for music - YouTube Music + YouTube background play
* [JBL Charge 5](https://www.amazon.co.uk/JBL-Charge-Bluetooth-waterproof-built-Black/dp/B08VDNCZT9/ref=sr_1_1) - great portable Bluetooth speaker with good battery life if you can find one on offer

These make my smartphone optional and it can be left alone sitting on the window sill. It makes me use the smartphone more like a desktop PC - I go to it, do what I need to do then leave it alone.

These also make streaming somewhat optional. It means I could unsubscribe from most media streaming services and still be entertained, and I would have only what I love and treasure. It took effort to hunt down those movies and albums - again the anticipation of the reward is greater than the reward itself! Some effort and inconvenience ensures the reward is appreciated even more. It wasn’t mindless scrolling to hunt for them either, it was active searching, thinking, reflecting. It slowed down consumption.
Once you've acquired them they are yours, no one can take them away from you.

If you were lost in the jungle and had to eat anything to survive, your favourite food on return to civilisation would be the finest food you've ever tasted, and you would appreciate every bit of it. Struggle isn't always nice, but **some** struggle is a good thing - it makes us appreciate what we have instead of worrying other options might be better.

There are lots of good ideas on how to introduce some struggle into your life [like using an iPod to listen to music](https://www.youtube.com/watch?v=3mfC4WNVMec) alongside or instead of streaming. The thing about streaming music is that there isn't really a way to do it *without* a smartphone. They tend to go hand-in-hand.

I think having a device dedicated to music is a special thing, even if like me, that's just an old iPhone SE used solely for music. It's the perfect size for this purpose and after finding a [new battery](https://www.amazon.co.uk/dp/B088TBSVSR) for it [and fitting it](https://youtu.be/x9JRqocmm24?si=G3si-0qqr8Mq8xtp) it goes for days. I keep the display black and white and only use YouTube Music, Headspace and YouTube for background listening with this device.

If I want to go totally offline, I've been building a good music library to load onto an iPod 5th Generation (A1136) modded with an [iFlash Quad](https://www.iflash.xyz/store/iflash-quad/) and 256GB SD card, along with a 3000mAh battery giving days upon days of usage. You can find [great guides](https://youtu.be/6bhOyLF4Co4?si=r90rGFRZA4x6QB0f) to do this on YouTube from [DankPods](https://www.youtube.com/@DankPods) and others.

The 5th Gen iPod seems the easiest to open up, whereas the 6th and 7th Gen have fully metal cases so are much harder. Plus the 6th Gen has a limit of 128GB when flash modded, whereas the 5th and 7th Gen have no limit up to 1TB - not that I've ever tested that, 256GB is more than enough.
## List your top tens for entertainment

If you could only watch / listen to 10 movies, documentaries and artists ever again, what would they be? Collate your own library of top 10s whether that be MP3s, CDs, DVDs, or files.

I lived during the times of piracy where streaming wasn’t an option and individual items were expensive. The price of a CD album can now get you a monthly subscription to most of the songs ever created! I’m not an expert on the economics of streaming, but however slim, there is a chance of returning one day to a world of ‘if you like it, then buy it’. I mean if they don’t pay artists enough it’s not unfeasible. I’m not sure, that’s another topic though for someone else to debate.

Maybe use the money you spend on streaming to acquire your favourite music and films digitally or on CD / DVD, then get an external hard drive and back them up for easier viewing. Plug the hard drive into a games console or TV and you've got your own private music / movie collection. Barring the hard drive failing you'll always have access to them.

My varied lists included:

**Music:**

* Linkin Park
* Atreyu
* Avenged Sevenfold
* Five Finger Death Punch
* Queen
* ...

**Films:**

* American Psycho
* The Big Lebowski
* There Will Be Blood
* Starship Troopers
* ...

**Documentaries / Series:**

* Blue Planet I, II, III
* Planet Earth I, II, III
* Anything David Attenborough
* World War II in Colour by Robert Powell
* The Simpsons
* Futurama
* ...

## Make streaming and convenience optional

Are there benefits to streaming? Definitely, but there are dangers too.

* Too much stuff available creating decision fatigue
* Too instantly available
* Mindless scrolling vs. active thinking and searching
* Not what you treasure
* You don't own it so it can disappear
* Price increases could become unpalatable
* On the free version you are bombarded with ads - I think these have a big effect on your mental health, I avoid ads like the plague.
* Actually changing the market and how we consume music - losing any physical connection

Can you apply the principles of the old physical media world to the new streaming world? Yes, I think you can through the power of pretending.

* Pretend your music streaming app is an iPod - you can’t change apps, search for anything, receive notifications
* Pretend your Netflix app is Blockbuster or IMDB - what do you feel like watching before you load it up?
* Pretend new episodes are released daily or weekly. So only 1 episode or film per day / week to avoid binging.
* Pretend it's the 80s or 90s and your phone and internet don't exist for a day - find alternatives

As discussed in the previous section, have some go-to entertainment to avoid endless streaming. My go-to before unsubscribing from Netflix was watching a film / episode then following it with a David Attenborough documentary box set of Planet Earth, Blue Planet etc. Perfect for relaxing and winding down.

I think having a go-to is becoming old school, a favourite film or documentary you could watch over and over.

By using pretending in combination with some go-tos we can make streaming more optional, a nice to have, but not a necessity.

## Find a middle ground

The only streaming service I used to pay for was Netflix. I recently unsubscribed from that to avoid endless scrolling and not finding anything I like.

In 2023, I subscribed to Spotify Premium for the first time, as I had become tired of ads and the constant bombardment from them. We are certainly in an attention economy, where so much money is spent getting your attention and convincing you to spend money on things! I recently unsubscribed from that because I only listen to certain playlists and artists. Before 2023, I kind of got by with just MP3s and occasional Spotify, back when it was ad-free on desktop and tablet.
That leaves me with only two subscriptions currently:

* Amazon Prime which comes with [many benefits](https://www.amazon.co.uk/b?ie=UTF8&node=14917073031) at £7.91 per month - paying yearly was £95 so at approx £5 per delivery my **household** needs to have 19 orders per year since [you can share Prime benefits with your household members](https://www.amazon.com/gp/help/customer/display.html?nodeId=GWZ7QXD2X8WL8YE8)
* YouTube Premium which comes with YouTube Music too at £12.99 per month
* Total at £20.90 per month

Do I enjoy giving money to two market dominating tech giants? Not really, I'm against monopolies but can't deny the services they have are good quality and mostly reliable. I can live with this choice, it's my middle ground. Limiting myself to only two subscriptions feels good, both mentally and for the wallet - I get tons of use from each so they are very cost-effective.

I only recently subscribed to [YouTube Premium](https://www.youtube.com/premium) which comes with YouTube Music too. To confirm, I have no affiliation with Google or YouTube Premium. I really enjoy watching ad-free videos and use it for everything: how-tos, documentaries, lectures, guides, coding tutorials and lots more. I dislike having to pay to remove ads, nevertheless it's a huge platform with an estimated 800 million plus videos and 100 million songs so I understand that needs funding to keep it all running! 😂 Plus it keeps valuable content creators paid which is a good thing too. As a bonus too, YouTube Music is thrown into the bundle.
Here is a comparison of YouTube Music against Spotify:

**YouTube Music pros:**

- Sounds louder and crisper than Spotify to me
- Seamlessly switch between music and video version
- Fine-tune Up Next playing suggestions with Familiar, Discover, Popular, Genres
- Better Recommendation and Quick pick features in my view
- Similar size catalogue of 100 million songs but with more niche uploads from Community Playlists
- Can find more obscure songs maybe not on Spotify like very recent covers
- Clean layout with Up Next, Lyrics, Related
- Smart downloads - when connected to Wi-Fi the app will automatically download your specified amount of favourite + recent songs in an 'Offline Mixtape' which is awesome. This has also made using an old iPhone SE with no SIM as a dedicated music streamer even easier on the go.

**YouTube Music cons:**

- No app on Xbox for background play or easy navigation
- No desktop app - although you can [download it as a progressive web app](https://support.google.com/chrome/answer/9658361?hl=en-GB&co=GENIE.Platform%3DDesktop#:~:text=On%20your%20computer%2C%20open%20Chrome,instructions%20to%20install%20the%20PWA.) from Chrome using the 'Install' button in the address bar, which adds it to the desktop
- Adding artists is a pain, must subscribe or add albums
- Creating playlists is a pain and they are added to main YouTube
  - The solution to that I've found is to [create a new 'channel'](https://tinyurl.com/3wzsy3sz) to keep music separated
- No reliable 'Spotify Connect' function like using another device as a remote
- No searching within playlists - first world problem, I know!
- Not sure how good podcasts are, don't use them
- Playlists are not as good probably due to a smaller community, they are more than ok though

**Tips I used to transition music services or to iPod:**

1. Monitor what you use your old music service for
2. Add the same artists, albums and playlists to iPod (optional)
3. For Bluetooth use an adapter with the iPod
4. For Xbox use USB with Background Music Player or AirServer with phone
5. Use [Soundiiz](https://soundiiz.com/) to transfer any playlists from old to new service (free tier is 1 playlist at a time with 200 songs per playlist at the time of writing)
6. Unsubscribe from your old music service, use new service for discovery, repeat

## Conclusion

I hope you enjoyed this article and it gave you the chance to reflect on your own relationship with technology and smartphones. We've covered many related topics including:

* Smartphones
* Streaming
* Minimalism
* Consumerism and the attention / subscription economy
* How music and video consumption has changed
* How ads and distraction affect our concentration

I think we saw some common themes emerging:

* Be aware and intentional with tech
* Ensure you're controlling it and it's not controlling you
* Set your structures, boundaries and limits
* Try single use devices or a smartphone with minimal apps
* Find a way that works for you

I think moving forward those of us who create systems, apps, websites and any other digital solutions have to be aware of this stuff, and make sure success metrics focus on ethical use rather than engagement. It's definitely not being anti-technology, just a reflection on practices for positive human-computer interactions in an ever changing landscape.

One philosophy is that digital tools in any form should give you time back, not take it away from you. They should make life better and easier, not harder or harmful to users. You should own it, it shouldn't own you. What smoking was to physical health, smartphone usage is to mental health. It is the issue of our time.

Solving real-world optimisation problems - a crash course with PuLP

Sat, 10 Feb 2024 15:58:00 GMT

I’ve read a few tutorials recently to refresh my knowledge on optimal resource allocation, and either the examples were too complex or delved too far into the maths. I also enrolled on a useful course from DataCamp called [Supply Chain Analytics in Python](https://datacamp.pxf.io/KjA61e). This article focuses more on the practical steps required for you to get started quickly with some good examples.

By the end of this article, you should be able to solve simple and intermediate optimisation problems using Python and PuLP. This is a really useful skill for [statisticians](https://www.prospects.ac.uk/job-profiles/statistician), [data scientists](https://www.prospects.ac.uk/job-profiles/data-scientist), [operational researchers](https://www.prospects.ac.uk/job-profiles/operational-researcher) and businesses that want to make the best decisions: maximising profits and production, minimising time and costs, and more. We will start with a small example, then build up to more complex examples as we proceed.

I ran all of the code contained in this article using [Spyder IDE with Anaconda](https://docs.anaconda.com/free/working-with-conda/ide-tutorials/spyder/).

## What is optimisation and linear programming?

* Optimisation helps to find the best decision given some inputs, so aims to maximise or minimise an objective function, given a number of constraints
* Linear programming (LP), also called linear optimisation, is a method to achieve the best outcome (such as maximum profit or minimal cost) in a mathematical model whose requirements are represented by linear relationships. Linear programming is a special case of mathematical programming also known as [mathematical optimisation](https://en.wikipedia.org/wiki/Mathematical_optimization).
* [PuLP](https://coin-or.github.io/pulp/) is a library in Python to help with optimisation and linear programming tasks. PuLP stands for “Python. Linear Programming”

## What are the steps to solving an optimisation problem?
An optimisation problem that uses linear programming (LP) and PuLP typically has the following steps / components:

* **Model** - an initialised PuLP model
* **Decision variables** - what you can control
* **Objective function** - the goal to maximise or minimise like profit, cost, resources
* **Constraints** - limitations to our solution like demand, capacity, time
* **Solve model** - then view the most optimal outcome

Let's see these in action in our first example.

## Exercise routine

Use LP to decide on an exercise routine to burn as many calories as possible.

| | Pushup | Running |
|-----------|-----------------|--------------|
| Minutes | 0.2 per pushup | 10 per mile |
| Calories | 3 per pushup | 130 per mile |

Constraint = only 10 minutes to exercise

```python [exercise.py]
from pulp import LpProblem, LpVariable, LpMaximize, LpMinimize, LpStatus, lpSum, value

# 1. Initialise model
model = LpProblem("Maximize_Calories_Burnt", LpMaximize)

# 2. Define Decision Variables: pushups and running
pushup = LpVariable('Pushup', lowBound=0, upBound=None, cat="Continuous")
running = LpVariable('Running', lowBound=0, upBound=None, cat="Continuous")

# 3. Define objective function: calories per pushup or per mile
model += 3 * pushup + 130 * running

# 4. Define constraints: our model's limitations
model += 0.2 * pushup + 10 * running <= 10  # Time constraint is 10 minutes to exercise
# Our results must not be negative pushups or miles ran
# (lowBound=0 above already enforces this, so these two lines are belt and braces)
model += pushup >= 0
model += running >= 0

# 5. Solve model
model.solve()
print("Run = {} miles".format(running.varValue))
print("Pushups = {}".format(pushup.varValue))
print(f"Calories burnt: {(running.varValue * 130) + (pushup.varValue * 3)}")
```

Our workflow in this code consisted of:

1. Initialising the model with the help of PuLP using LpProblem and setting our goal as LpMaximize - since we want to maximise calories burnt
2. Defining our two decision variables, pushups and running, and setting the category as Continuous
3. Setting the objective function in mathematical form, which was calories per pushup (3) and calories per mile of running (130)
4. Setting the constraints which were 10 minutes to exercise, and not a negative result
5. Solving the model and outputting the results

The results printed were:

> Run = 0.0 miles
>
> Pushups = 50.0
>
> Calories burnt: 150.0

The solver has explored the feasible combinations and returned the optimal decision in milliseconds! We can see the most optimal outcome is to perform 50 pushups which burns 150 calories and uses exactly the 10 minutes available (0.2 * 50 = 10). That makes sense as pushups burn 15 calories per minute (3 / 0.2) against 13 per minute for running (130 / 10).

I hope you can see the power here of quickly solving optimisation problems that would be very difficult to solve by hand accounting for all possible combinations.

## Glass manufacturing

We are tasked with planning the optimal production at a glass manufacturer to maximise profit. This manufacturer only produces wine and beer glasses:

* there is a maximum production capacity of 60 hours
* each batch of wine and beer glasses takes 6 and 5 hours respectively
* the warehouse has a maximum capacity of 150 rack spaces
* each batch of the wine and beer glasses takes 10 and 20 spaces respectively
* the production equipment can only make full batches, no partial batches
* Also, we only have orders for 6 batches of wine glasses. Therefore, we do not want to produce more than this.

Each batch of the wine glasses earns a profit of $5 and the beer $4.5

```python [resources.py]
from pulp import LpProblem, LpVariable, LpMaximize, LpMinimize, LpStatus, lpSum, value

# 1. Initialise model
model = LpProblem("Maximize_Glass_Co_Profits", LpMaximize)

# 2. Define Decision Variables: wine and beer glasses
wine = LpVariable('Wine', lowBound=0, upBound=None, cat="Integer")
beer = LpVariable('Beer', lowBound=0, upBound=None, cat="Integer")

# 3. Define objective function: profit for both wine glasses and beer glass decision variables
model += 5 * wine + 4.5 * beer

# 4. Define constraints: our model's limitations
model += 10 * wine + 20 * beer <= 150  # Rack space cannot exceed 150
model += 6 * wine + 5 * beer <= 60  # Maximum production capacity is 60 hours
model += wine <= 6  # Wine glasses cannot exceed 6 batches

# 5. Solve model
model.solve()
print("Produce {} batches of wine glasses".format(wine.varValue))
print("Produce {} batches of beer glasses".format(beer.varValue))
```

We followed the same pattern in this example, but defined more constraints. We also defined the category for our decision variables as Integer because we can only make full batches, no partial batches. Given these constraints, we calculate the optimal production outcome to maximise profit is to produce 6 batches of wine and 4 batches of beer!

> Produce 6.0 batches of wine glasses
>
> Produce 4.0 batches of beer glasses

## Warehouse stock allocation

Decide which warehouse to ship from to fulfil customer unit demand at the lowest cost. This example is more complex so uses Python list comprehensions to define many decision variables, objective terms and constraints quickly.

```python [logistics.py]
from pulp import LpProblem, LpVariable, LpMaximize, LpMinimize, LpStatus, lpSum, value

warehouses = ['New York', 'Atlanta']
customers = ['A', 'B', 'C']

costs = {
    ('New York', 'A'): 232,
    ('New York', 'B'): 255,
    ('New York', 'C'): 264,
    ('Atlanta', 'A'): 255,
    ('Atlanta', 'B'): 233,
    ('Atlanta', 'C'): 250
}

demand = {
    'A': 1500,
    'B': 900,
    'C': 800
}

# 1. Initialise model
model = LpProblem("Minimise_Transportation_Costs", LpMinimize)

# 2. Define 6 Decision Variables in a few lines of code using LpVariable.dicts
# That's (2 warehouses * 3 customers)
key = [(w, c) for w in warehouses for c in customers]
shipments = LpVariable.dicts('Shipments', key, lowBound=0, cat='Integer')

# 3. Define objective function: shipping costs
model += lpSum([costs[(w, c)] * shipments[(w, c)] for w in warehouses for c in customers])

# 4. Define constraints: our model's limitations which is demand must be met for each customer
for c in customers:
    model += lpSum([shipments[(w, c)] for w in warehouses]) == demand[c]

# 5. Solve model
model.solve()
print("Status", LpStatus[model.status], "\n")

# 6. Print values for each decision variable - demand
print("Optimal units for each warehouse:")
for decision_variable in model.variables():
    print(decision_variable.name, "=", decision_variable.varValue)

# 7. Print value for the objective function - costs
print("\nObjective =", value(model.objective))
```

In this example we've created lists for warehouses and customers, and dictionaries to hold our data for costs (warehouse to customer) and demand (units). We then follow the same pattern but use list comprehension to define every combination of decision variables for warehouses and customers. We do the same thing to define all of our shipping costs. Finally, we can define the constraints, in that the shipments to each customer, summed across warehouses, must meet that customer's demand, and solve the model.

The output from PuLP gives us:

> Status Optimal
>
> Optimal units for each warehouse:
>
> Shipments_('Atlanta',_'A') = 0.0
>
> Shipments_('Atlanta',_'B') = 900.0
>
> Shipments_('Atlanta',_'C') = 800.0
>
> Shipments_('New_York',_'A') = 1500.0
>
> Shipments_('New_York',_'B') = 0.0
>
> Shipments_('New_York',_'C') = 0.0
>
> Objective = 757700.0

We can see that to meet demand for:

* customer B we need 900 units from Atlanta
* customer C we need 800 units from Atlanta
* customer A we need 1500 units from New York

This results in optimal shipping costs of 757,700 and we've solved a much bigger problem with many more variables.

## CO2 monitor allocation

Let's say we were tasked with allocating CO2 monitors to schools in order to manage and monitor air quality similar to [this real scenario](https://www.gov.uk/guidance/using-co-monitors-and-air-cleaning-units-in-education-and-care-settings).
We need to allocate them proportionally to have the greatest impact, with some left over for additional demand later. This is the longest example given there are a number of constraints to define.

```python [monitors.py]
"""
Allocates the optimal number of CO2 monitors to schools given the constraints.
"""
from pulp import LpProblem, LpVariable, LpMaximize, LpMinimize, LpStatus, lpSum, value

# Total number of monitors available
available_monitors = 200

# Decision variables: a list of schools to allocate monitors
schools = ["School A", "School B", "School C", "School D"]

# Constraints: dictionaries of size, rooms and pupil counts for each school
school_sizes = {"School A": 5000, "School B": 6000, "School C": 4000, "School D": 5500}  # in square feet
school_rooms = {"School A": 30, "School B": 40, "School C": 25, "School D": 35}
pupil_counts = {"School A": 2000, "School B": 3000, "School C": 1500, "School D": 2500}


def allocate_co2_monitors(schools, available_monitors, school_sizes, school_rooms, pupil_counts):
    # 1. Initialise model
    model = LpProblem("CO2_Monitor_Allocation", LpMinimize)

    # 2. Define the decision variables - the things we can control
    # .dicts creates a dictionary of LpVariables https://coin-or.github.io/pulp/technical/pulp.html#pulp.LpVariable.dicts
    monitors = LpVariable.dicts("Monitors", schools, lowBound=0, cat="Integer")

    # 3. Define the objective function: the thing we want to minimise or maximise so total number of monitors used per school
    # Passing a collection of variables to lpSum can add many decision variables at once
    model += lpSum(monitors)

    # 4. Define the constraints:
    # At least one monitor to each school
    for school in schools:
        model += monitors[school] >= 1

    # The total number of allocated monitors should not exceed the available monitors (keeping 20 in reserve)
    model += lpSum(monitors) <= available_monitors - 20

    # There must be 1 monitor per 500 square feet
    for school in schools:
        model += monitors[school] >= school_sizes[school] / 500

    # There must be 1 monitor per 2 rooms
    for school in schools:
        model += monitors[school] >= school_rooms[school] / 2

    # There must be 1 monitor per 50 pupils
    for school in schools:
        model += monitors[school] >= pupil_counts[school] / 50

    # 5. Solve the LP problem
    model.solve()

    # 6. Check the status of the solution
    if LpStatus[model.status] != "Optimal":
        print("Unable to find an optimal solution.")
        return None

    # 7. Get the model results
    allocation = {}
    for school in schools:
        allocation[school] = value(monitors[school])

    return allocation


allocation = allocate_co2_monitors(schools, available_monitors, school_sizes, school_rooms, pupil_counts)

if allocation:
    print("CO2 Monitor Allocation:")
    total_monitors_allocated = 0

    for school, monitors in allocation.items():
        total_monitors_allocated += int(monitors)
        print(f"{school}: {monitors} monitors")

    print(f"\nTotal monitors allocated: {total_monitors_allocated}")
    print(f"Total monitors leftover: {available_monitors - total_monitors_allocated}")
```

Here we define the available monitors, and set our data for schools, school size, rooms, and pupil counts. Following the same pattern, we initialise the model, and generate our decision variables from the **schools** list - our decision variable is what we can change so here it's the schools and how many monitors to assign to each.

Finally we add each of the constraints and solve:

* At least one monitor to each school
* The total number of allocated monitors should not exceed the available monitors, with 20 held back in reserve
* There must be 1 monitor per 500 square feet
* There must be 1 monitor per 2 rooms
* There must be 1 monitor per 50 pupils

These are reasonable assumptions for the constraints but we could change them if they are too strict. In this case, we have a solved model which gives:

> CO2 Monitor Allocation:
>
> School A: 40.0 monitors
>
> School B: 60.0 monitors
>
> School C: 30.0 monitors
>
> School D: 50.0 monitors
>
> Total monitors allocated: 180
>
> Total monitors leftover: 20

Great! So from the 200 available monitors we have allocated 180 given the constraints with 20 leftover. I find this the most impressive example as solving this scenario without LP and PuLP would take so much more work!

## Conclusion

Well done if you made it through all the examples. You should now be able to solve simple and intermediate optimisation problems using Python and PuLP using this workflow. You just have to frame your problem as an LP problem and then modify your decision variables and constraints.

It is worth reminding ourselves that sometimes there won't be a solution to a problem. In that case, we must revisit our inputs to loosen them a little if possible. Maybe the constraints are too strict and need to be made more forgiving. This is where operations meets analysis. Always question the outputs and sense check them through solid quality assurance - find out more in the article [Six tips for producing and assuring high quality analytical code](/blog/six-tips-for-producing-and-assuring-high-quality-analytical-code/).

I hope you enjoyed this article, you've hopefully added a seriously useful tool to your toolkit.
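To make the point about problems with no solution more concrete, here is a minimal sketch that sense checks the CO2 monitor example by hand before any solver is involved. It is not part of the original code: it uses plain Python rather than PuLP, reuses the same school data, and the helper names `minimum_monitors` and `is_feasible` are made up for illustration.

```python
import math

# Same data as the CO2 monitor example
school_sizes = {"School A": 5000, "School B": 6000, "School C": 4000, "School D": 5500}
school_rooms = {"School A": 30, "School B": 40, "School C": 25, "School D": 35}
pupil_counts = {"School A": 2000, "School B": 3000, "School C": 1500, "School D": 2500}


def minimum_monitors(school):
    """Smallest whole number of monitors that satisfies every per-school constraint."""
    required = max(
        1,                            # at least one monitor
        school_sizes[school] / 500,   # 1 per 500 square feet
        school_rooms[school] / 2,     # 1 per 2 rooms
        pupil_counts[school] / 50,    # 1 per 50 pupils
    )
    return math.ceil(required)


def is_feasible(available_monitors, reserve=20):
    """True if the per-school minimums fit under the cap (available minus reserve)."""
    total_needed = sum(minimum_monitors(school) for school in school_sizes)
    return total_needed <= available_monitors - reserve


minimums = {school: minimum_monitors(school) for school in school_sizes}
print(minimums)
print(is_feasible(200))  # cap of 180
print(is_feasible(190))  # cap of 170
```

With 200 monitors the per-school minimums add up to 180, which just fits under the cap of 180, so the model can be solved. With only 190 available the cap drops to 170, the minimums no longer fit, and the solver should report a status other than Optimal. That is the signal to loosen the reserve or one of the ratios.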
You may also be interested in these articles on the site: * [How to build and visualise a Monte Carlo simulation with Python and Plotly](/blog/how-to-build-and-visualise-a-monte-carlo-simulation-with-python-and-plotly/) * [Understanding Explainable AI (XAI) for classification, regression and clustering with Python](/blog/understanding-explainable-ai-for-classification-regression-and-clustering-with-python/) * [Developing your data science and analytical coding skills - a review of DataCamp](/blog/developing-your-data-science-and-analytical-coding-skills-a-review-of-datacamp/)

Improving Wi-Fi 2.4GHz and 5GHz speeds after Full Fibre (FTTP) upgrade

Tue, 09 Jan 2024 17:50:00 GMT

Even though I work with computers every day, and have a good understanding of computer science and networking, I recently needed a refresher to improve my Wi-Fi connection speeds. In this article, we'll go through the steps I took to increase Wi-Fi speeds from 25Mbps to the full 150Mbps after a recent internet connection upgrade. Maybe these steps can help you to maximise your own connection speeds too.

## Setting the scene - the issue

My [fibre connection](https://www.openreach.com/fibre-broadband) was previously FTTC (fibre to the cabinet) but was recently upgraded to 'Full Fibre' or FTTP (fibre to the property). Great! This meant I was able to go from 25-30Mbps ([megabits per second](https://en.wikipedia.org/wiki/Data-rate_units)) to a maximum speed of 150Mbps.

After the installation by [CityFibre](https://cityfibre.com/homes), I was impressed with the setup and everything was working ok with the new router but Wi-Fi speeds weren't always better, suffering some drop-out and similar speeds to the prior setup. I needed to investigate this and figure out how to get the full speed throughout the property.

The following sections go step-by-step through **what I did**, what you **need to know** and the **tactics you can try** to improve your Wi-Fi connection speeds. I'm not saying these things are guaranteed to work for you, but they've been a huge improvement for me and I wanted to share them.

## To begin, check your wired connection speed

A good starting point is to first check that you are receiving the increased speeds by connecting your device to the router with an Ethernet cable. I have an Xbox Series X console connected this way, which has a [Network connection speed test](https://support.xbox.com/en-GB/help/hardware-network/connect-network/xbox-one-connection-speed) in the settings menu. You can compare your wired / wireless speed by using the same or another device and searching Google for "[internet speed test](https://www.google.com/search?q=internet+speed+test)".
Both of these methods give the download and upload speeds. The Xbox Series X was receiving the full 150Mbps so the upgrade was definitely working correctly through a wired connection.

## Understand the difference between 2.4GHz and 5GHz bands

Before progressing, it's important to understand what the 2.4GHz and 5GHz Wi-Fi bands are and their pros vs cons. Here is a crash course:

**2.4GHz =** slower but larger coverage area - can also get interference from radios, Bluetooth, other networks etc.

**5GHz =** faster but smaller coverage area

Most routers auto-assign a device to a band based on how far away the device is and if the device is capable of using the 5GHz band. More devices are assigned to the 2.4GHz band and that can lead to crowding.

To check this out, log in to your router admin panel at the local IP address http://192.168.1.1/ ... the **admin** username and password are typically found on the back of the router. In Wi-Fi Settings / Device Settings, you can then see which devices are connected to which band. If a device is connected to the 2.4GHz band, that could be the reason for lower speeds! You can try re-connecting the device if you're close to the router to attempt to switch to the 5GHz band.

## Move and elevate your router

Okay, starting with the basics, if your router is crammed into a cupboard or behind a huge TV, it's likely going to block the signal substantially. You can try to move it to a higher location where it isn't blocked in. You may need to run the main cable connected to the router to a suitable spot then re-test the speeds.

## Check your device Wi-Fi network adapter

Sometimes for older devices, the built in Wi-Fi chip / receiver cannot actually connect to the faster 5GHz band. To check this I found a great article from Louisiana State University [Wireless: Determine if Computer Has 5GHz Network Band Capability (Windows)](https://grok.lsu.edu/article.aspx?articleid=17341). This can be summarised as:

* Search "**cmd**" in the Start Menu.
* Type "**netsh wlan show drivers**" in the Command Prompt & Press Enter.
* Look for the "**Radio types supported**" section.
* If the network adapter supports network mode **802.11ac**:
  * The computer supports both 2.4GHz and 5GHz - your network capability IS Dual-Band Compatible.
  * This is true if your computer supports both 802.11ac and 802.11n together as well.
* If the network adapter supports only network mode 802.11n:
  * The computer MAY OR MAY NOT have 2.4GHz and 5GHz network capability and be Dual-Band Compatible.
* If the network adapter does not support either of these network modes, it IS NOT Dual-Band Compatible.

Where a device can only connect to the 2.4GHz band it may still get okay speeds and have further range, but just won't be able to benefit from the much faster 5GHz band.

## Upgrade your device with an external Wi-Fi network adapter

If it happens that your device Wi-Fi network adapter is older and incapable of connecting to the 5GHz band then it might be a good time to upgrade with an external network adapter. Newer adapters are capable of pulling in greater speed, are dual-band so can connect to both the 2.4GHz and 5GHz bands and can hold the connection better for less drop-out.

The WAVLINK AC1900 USB WiFi Dongle has delivered the best improvement in speeds to my upstairs desktop PC, and seems future proof in that it's capable of pulling even greater speeds than my current maximum of 150Mbps. It should pull in up to 600Mbps on the 2.4GHz band and up to 1300Mbps on the 5GHz band. So if you upgrade your plan with your ISP, you're covered - although I'm sure that would be more than you'll ever need.

The links below are on Amazon. I don't receive any commission for them and have used both products, though you should be able to find these products elsewhere if you wish.
Both worked very well and pulled in consistent speeds close to the connection's max 150Mbps but can go even higher if your [ISP](https://en.wikipedia.org/wiki/Internet_service_provider) plan allows.

* [WAVLINK AC1900 USB WiFi Dongle for PC, Dual Band 1900Mbps WiFi Adapter for Desktop, Laptop PC with Magnetic Base, 4X 3dBi External Antennas, support Win 11/10/8/7/XP, Mac OS 10.7-10.15](https://www.amazon.co.uk/dp/B09KRK7TQT?ref=ppx_yo2ov_dt_b_product_details&th=1)
* [TP-Link AC600 High Gain USB Wi-Fi Dongle, Dual Band Wi-Fi Adapter with 5dBi Antenna for PC/Desktop/Laptop, Supports Windows11/10/8.1/8/7/XP, Mac OS X 10.9-10.14 (Archer T2U Plus)](https://www.amazon.co.uk/dp/B07PJV66CN?ref=ppx_yo2ov_dt_b_product_details&th=1)

My desktop PC hadn't moved, it was in the same location in my home office as when I did the first test (left of the image below), however with the new WAVLINK network adapter the internet speed test had gone from 44Mbps down / 26Mbps up to 147Mbps down / 144Mbps up! You can see in the image below I'm connected to **Wi-Fi 2** which was the new WAVLINK external network adapter.

It was a similar result with the lower profile [TP-Link AC600](https://www.amazon.co.uk/dp/B07PJV66CN?ref=ppx_yo2ov_dt_b_product_details&th=1) adapter on my laptop, but the WAVLINK seemed more robust and stable - with its four prongs likely the reason! So is an external network adapter for your desktop PC and laptops worth a try? These results say absolutely!

A final point, usually with an external network adapter you must install the relevant driver for the new adapter. With WAVLINK you head to their site, download for Windows or Mac, then install. Pretty simple process and the instructions are on the box.

## Add a Wi-Fi mesh extender to avoid dead zones

Now we've covered using an external network adapter to improve **receiving** Wi-Fi network signal, what about improving general **outgoing** coverage to address dead-spots in the property? This proved to be a little tougher.
The only thing I have tried so far is using a Wi-Fi mesh 'extender'. This effectively acts as a second router, which has the option to split the 2.4GHz and 5GHz bands on the extender. You can therefore end up with multiple access points or [SSID](https://en.wikipedia.org/wiki/Service_set_(802.11_network))s. I split up the bands and then named them something easy to understand like:

* TALKTALK-843
* TALKTALK-843_EXT_24
* TALKTALK-843_EXT_5

This gives the option to connect to the main router downstairs, or the extension upstairs either on the 2.4GHz or 5GHz bands. I used the TP-Link range extender as seen below.

* [TP-Link AC750 Universal Dual Band Range Extender, Broadband/Wi-Fi Extender, Booster/Hotspot with Ethernet Port, Plug and Play, Smart Signal Indicator, UK Plug (RE220) ,White](https://www.amazon.co.uk/dp/B07ZWBBPQN?ref=ppx_yo2ov_dt_b_product_details&th=1)

This worked quite well, but still struggled in one room - there must be a particularly thick wall partly blocking the signal. Still a very good improvement though with no drop-out.

The pros vs cons of using a Wi-Fi extender are:

Pros:

* You can choose which devices connect to which access point - spreading the network load
* You can choose which band you want to connect a device to
* It should improve coverage and reduce dead-spots

Cons:

* It is in effect still one connection just mirroring and relaying from the host router
* Can introduce interference as now there are two access points broadcasting
* It can only improve the coverage if it is still in range of the host router - ideally half way between the router and the dead-spot

It was an inexpensive option to try and it did boost coverage in certain rooms. It provides another option to try in combination with the others.

## Consider adding a wired connection for critical devices

I haven't taken this step yet, but I am considering it. A wired Ethernet connection may pull in similar speeds to a solid external network adapter, but the difference is reliability.
Even the best Wi-Fi adapter may suffer drop-out at a critical moment like during a conference call or video interview. The chances of that happening with a wired connection are significantly lower.

If you're not a fan of ripping open your walls to install network cable, then a DIY job of running (and hiding) flat Ethernet under the carpets or floorboards, up the stairs, along skirting boards and into your PC is an option. Is it an ideal solution? Nope. But as long as it's run where no one will disturb it this temporary fix might become a permanent one and also very reliable.

The one I'm looking to try from BUSOHE below claims to be flexible, durable and support over 30kg. Sounds tough to me. It's also flat so should be easier to lay under carpets neatly and away from footsteps.

* [BUSOHE Cat 8 Ethernet Cable 20m, High Speed Flat Gigabit RJ45 Lan Network Cable, 40Gbps 2000Mhz Internet Patch Cord for Switch, Router, Modem, Patch Panel, PC (White)](https://www.amazon.co.uk/gp/product/B07QV7S2HT/ref=ox_sc_saved_title_2?smid=AJQDNWC8R613R&th=1)

## Happy networking

This wasn't a typical analytical or programming article, however to write code and learn effectively, a strong stable internet connection is pretty vital! It's worth giving this stuff some thought and taking the time to ensure you have the best connection possible so you can keep coding, learning and building great solutions without any worries. Not only that, it's good to have a solid and stable setup for video calls, video streaming and screen sharing. All great tools in any digital role.

I hope this article gave you ideas and helped you to improve your network Wi-Fi speeds 😄

If you enjoyed this article be sure to check out [other articles](/) on the site 👍

Searching Markdown files for internal links and visualising with a Pyvis network graph

Fri, 08 Dec 2023 16:31:00 GMT

Lately I've been trying to improve the internal links on the site to improve the user experience. I wanted to check whether each article links to at least one other relevant article. I also wanted to understand what my content clusters looked like - the aim is to cover topics with a unique take or that are under-represented so they can help as many people as possible and avoid covering topics that are saturated. This also helps to keep efficient use of my time. There is a 'related articles' section at the bottom but this works on the category and isn't in the body of the article. The articles are stored in Markdown files in GitHub to keep them backed up and version controlled, so the plan was to: * Search the Markdown files and extract all internal links using [RegEx](https://docs.python.org/3/library/re.html) * Produce and display a network visualisation to understand content clusters and relationships using [Pyvis](https://pyvis.readthedocs.io/en/latest/) Pyvis is a wrapper for the popular [visJS](https://visjs.org/) JavaScript library, and it allows for easy generation of network graph visuals in Python. If you want to follow along, a reproducible example can be [found in the GitHub repo](https://github.com/shedloadofcode/pyvis-network-graph-md) ready to clone or download. The main Python file is in the /utils folder, and the Markdown files containing internal links are in the /content/blog/ folder. ## Install packages We'll only need to install two libraries, pyvis and pandas, so let's install those. ``` python -m pip install pyvis pandas ``` ## Import libraries In a new Python file **internal_links_graph.py**, we'll first import all libraries. ```python [internal_links_graph.py] import os import re import pandas as pd from pyvis.network import Network ``` ## Searching the Markdown files Next we need to create the edge data to feed into the network graph, by searching the Markdown files for internal links. 
To do that, we need to: * Define source (page linked from), target (page linked to), and weight (line weight) lists * Set a regular expression to parse Markdown links * Loop through and open each file in the given directory path, and for each: * Grab all links starting with **/blog/** * Append these to source, target and weight lists * Zip the lists together and return ```python [internal_links_graph.py] def get_edge_data() -> pd.DataFrame: source = [] target = [] weight = [] pages_with_no_internal_links = set() count = 0 path = "../content/blog" links_regex = re.compile(r'\[([^\]]+)\]\(([^)]+)\)') for filename in os.listdir(path): file_path = os.path.join(path, filename) name, extension = os.path.splitext(filename) count += 1 try: with open(file_path, encoding="utf8") as f: md = f.read() links = list(links_regex.findall(md)) links_added = 0 for link in links: if link[1].startswith("/blog/"): source.append("/blog/" + name + "/") target.append(link[1]) weight.append(0.4) links_added += 1 if links_added == 0: pages_with_no_internal_links.add(name) except Exception as error: print("An exception occurred:", error) print(f"{count} files searched.") print(f"{len(source)} sources and {len(target)} targets.", end="\n\n") print(f"{len(pages_with_no_internal_links)} pages with no internal links:") for link in pages_with_no_internal_links: print(link) return zip(source, target, weight) ``` ## Producing the network graph Now we have the **edge_data** of all source and target pages, we can build a network graph to visualise the nodes by: * Defining a new **Network** with the given properties * Adding each item in **edge_data** as a network node * Adding hover information to each node * Outputting the network graph to an HTML file **links.html** ```python [internal_links_graph.py] def display_graph(edge_data) -> None: net = Network(height="900px", width="100%", directed=True, bgcolor="#222222", font_color="#b1b4b6", select_menu=True, filter_menu=True, cdn_resources="remote") 
net.show_buttons(filter_=["nodes", "physics"]) for e in edge_data: src = e[0] dst = e[1] w = e[2] net.add_node(src, src, title=src) net.add_node(dst, dst, title=dst) net.add_edge(src, dst, value=w) neighbor_map = net.get_adj_list() # add neighbor data to node hover data for node in net.nodes: node["title"] += " links to:\n" + "\n".join(neighbor_map[node["id"]]) node["value"] = len(neighbor_map[node["id"]]) net.show("links.html", notebook=False) ``` ## Run the file Finally, we can add the two function calls to the script to get the edge data and display the graph. ```python [internal_links_graph.py] if __name__ == "__main__": edge_data = get_edge_data() display_graph(edge_data) ``` To run the program in a new terminal or command line we can use: ``` python internal_links_graph.py ``` ## Full code ```python [internal_links_graph.py] """Searches the Markdown files for internal links in blog articles. Reads in all the files in the /content/blog directory and then searches for any link which contains /blog/. 
Outputs the results of this to a graph visual 'links.html' Install packages using `pip install pandas pyvis` """ import os import re import pandas as pd from pyvis.network import Network def get_edge_data() -> pd.DataFrame: source = [] target = [] weight = [] pages_with_no_internal_links = set() count = 0 path = "../content/blog" links_regex = re.compile(r'\[([^\]]+)\]\(([^)]+)\)') for filename in os.listdir(path): file_path = os.path.join(path, filename) name, extension = os.path.splitext(filename) count += 1 try: with open(file_path, encoding="utf8") as f: md = f.read() links = list(links_regex.findall(md)) links_added = 0 for link in links: if link[1].startswith("/blog/"): source.append("/blog/" + name + "/") target.append(link[1]) weight.append(0.4) links_added += 1 if links_added == 0: pages_with_no_internal_links.add(name) except Exception as error: print("An exception occurred:", error) print(f"{count} files searched.") print(f"{len(source)} sources and {len(target)} targets.", end="\n\n") print(f"{len(pages_with_no_internal_links)} pages with no internal links:") for link in pages_with_no_internal_links: print(link) return zip(source, target, weight) def display_graph(edge_data) -> None: net = Network(height="900px", width="100%", directed=True, bgcolor="#222222", font_color="#b1b4b6", select_menu=True, filter_menu=True, cdn_resources="remote") net.show_buttons(filter_=["nodes", "physics"]) for e in edge_data: src = e[0] dst = e[1] w = e[2] net.add_node(src, src, title=src) net.add_node(dst, dst, title=dst) net.add_edge(src, dst, value=w) neighbor_map = net.get_adj_list() # add neighbor data to node hover data for node in net.nodes: node["title"] += " links to:\n" + "\n".join(neighbor_map[node["id"]]) node["value"] = len(neighbor_map[node["id"]]) net.show("links.html", notebook=False) if __name__ == "__main__": edge_data = get_edge_data() display_graph(edge_data) ``` ## What I learnt about the content clusters The main takeaway from plotting all of the 
content in a network graph, was that there wasn't enough internal linking throughout the site. I spent some time to embed relevant content links in other articles and the outcome was a collection of strong content clusters. The clusters included web scraping, automation, data science and analysis, and web app development. The first image below shows what the network looked like before these improvements, and the second what it looks like now. You can see from the HTML file output the network graph can be searched and filtered using the top dropdowns. This is because earlier we passed **True** to both **select_menu** and **filter_menu** when creating the **Network** object. The image below shows filtering the example from the GitHub repo by a given path. Very useful for quickly identifying and highlighting nodes in a larger network. ## Happy networking I hope you were able to apply this methodology to your own use case. Although you might not store your content in Markdown, I am sure this could be adapted to search other formats with a similar setup. Visualising relationships like this through nodes in a network graph is very powerful. It certainly helped to deliver more relevant internal links to articles and visualise the content clusters. Pyvis can also be [integrated with NetworkX](https://pyvis.readthedocs.io/en/latest/tutorial.html#networkx-integration). [NetworkX](https://networkx.org/documentation/stable/index.html) is a Python package for the creation, manipulation, and study of the structure, dynamics, and functions of complex networks. 
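As a rough sketch of adapting the search to another format, assuming the content is stored as HTML rather than Markdown, the standard library's `html.parser` could collect the internal links instead of a regular expression. The class name `InternalLinkParser`, the helper `get_internal_links` and the sample HTML are made up for illustration and are not part of the repo.

```python
from html.parser import HTMLParser


class InternalLinkParser(HTMLParser):
    """Collects href values of <a> tags that start with a given prefix."""

    def __init__(self, prefix: str = "/blog/") -> None:
        super().__init__()
        self.prefix = prefix
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        # Only anchor tags can carry the links we care about
        if tag != "a":
            return
        for attr_name, attr_value in attrs:
            if attr_name == "href" and attr_value and attr_value.startswith(self.prefix):
                self.links.append(attr_value)


def get_internal_links(html: str, prefix: str = "/blog/") -> list[str]:
    """Returns every internal link found in the given HTML string."""
    parser = InternalLinkParser(prefix)
    parser.feed(html)
    return parser.links


# Hypothetical page content, for illustration only
sample_html = """
<p>See <a href="/blog/first-post/">the first post</a> and
<a href="https://example.com/">an external site</a>, then
<a href="/blog/second-post/">the second post</a>.</p>
"""

internal_links = get_internal_links(sample_html)
print(internal_links)
```

The returned list could then be appended to the source, target and weight lists in the same way as the Markdown links, leaving the Pyvis part of the script unchanged.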
If you enjoyed this article be sure to check out other articles on the site 👍 you may be interested in: * [Searching for text in PDFs at increasing scale](/blog/searching-for-text-in-pdfs-at-increasing-scale/) * [How to match and count keywords in text using JavaScript](/blog/how-to-match-and-count-keywords-in-text-using-javascript/) * [Developing your data science and analytical coding skills - a review of DataCamp](/blog/developing-your-data-science-and-analytical-coding-skills-a-review-of-datacamp/) for improving your Python skills

Record mouse and keyboard for automation scripts with Python

Sat, 02 Dec 2023 16:05:00 GMT

In this article, we'll take a look at how to record mouse clicks and keyboard input with [pynput](https://pynput.readthedocs.io/en/latest/) then convert that to a [PyAutoGUI](https://pyautogui.readthedocs.io/en/latest/index.html) automation script for playback. ## Why build a mouse and keyboard recorder? The short answer is to automate boring, time-consuming and repetitive tasks and let Python do them instead while you go enjoy a coffee ☕ PyAutoGUI is excellent for click and type automation tasks, but one of the weaknesses I found with it is that it's difficult to 'record' a task and get the xy coordinates for the mouse clicks. There is an option to [take screenshots and locate images within the screen](https://pyautogui.readthedocs.io/en/latest/screenshot.html) but I could never get this to work accurately - mouse xy coordinates are much more reliable. The [documentation](https://pyautogui.readthedocs.io/en/latest/mouse.html) features a useful program that will constantly print out the position of the mouse cursor: ```python #! python3 import pyautogui, sys print('Press Ctrl-C to quit.') try: while True: x, y = pyautogui.position() positionStr = 'X: ' + str(x).rjust(4) + ' Y: ' + str(y).rjust(4) print(positionStr, end='') print('\b' * len(positionStr), end='', flush=True) except KeyboardInterrupt: print('\n') ``` But then you'd have to find all the coordinates and script that up separately, a tedious task! Previously I explored [Creating a screen and mouse jiggler with Python](/blog/creating-a-screen-and-mouse-jiggler-with-python/) which was great for keeping the screen active. Taking this a step further, actually recording the coordinates and also keyboard input then outputting that as a script, would be far better for automation tasks. 
I checked out a few existing tools like [record-and-play-pynput](https://github.com/george-jensen/record-and-play-pynput) and [pyautogui-mouse-record](https://github.com/DepictYourself/pyautogui-mouse-record) but none really satisfied what I was looking for, but they did give me a good start and inspiration. ## How to run the recorder You'll need to install a few Python packages first. ``` python -m pip install pynput pyautogui ``` Now let's go through step-by-step how to use this mouse and keyboard recorder. * Run `python record.py` to start the recording * To end the recording: - Hold right click for 2 seconds then release to end the recording for mouse. - Press 'ESC' to end the recording for keyboard. - Both are needed to finish recording. - The recorded mouse and keyboard actions will be saved as 'recording.json' * Run `python convert.py` to convert 'recording.json' into a PyAutoGUI script - The conversion will be saved as 'play.py' * Run `python play.py` to play back the actions 😄 All of the code can be found below or in [the GitHub repo](https://github.com/shedloadofcode/mouse-and-keyboard-recorder). Also, at the end there is a video demo of the recorder in action. ## Record mouse and keyboard The first step is to record the mouse and keyboard input. To do this, we are using pynput to listen for on press and on click, then storing those events as a dictionary in the **recording** list. Once both listeners are terminated, we store this in a file **recording.json** ```python [record.py] """ Records mouse and keyboard and outputs the actions to a JSON file recording.json To begin recording: - Run `python record.py` To end recording: - Hold right click for 2 seconds then release to end the recording for mouse. - Press 'ESC' to end the recording for keyboard. - Both are needed to finish recording. 
""" import time import json from pynput import mouse, keyboard print("Hold right click for 2 seconds then release to end the recording for mouse") print("Click 'ESC' to end the recording for keyboard") print("Both are needed to finish recording") recording = [] count = 0 def on_press(key): try: json_object = { 'action':'pressed_key', 'key':key.char, '_time': time.time() } except AttributeError: if key == keyboard.Key.esc: print("Keyboard recording ended.") return False json_object = { 'action':'pressed_key', 'key':str(key), '_time': time.time() } recording.append(json_object) def on_release(key): try: json_object = { 'action':'released_key', 'key':key.char, '_time': time.time() } except AttributeError: json_object = { 'action':'released_key', 'key':str(key), '_time': time.time() } recording.append(json_object) def on_move(x, y): if len(recording) >= 1: if (recording[-1]['action'] == "pressed" and \ recording[-1]['button'] == 'Button.left') or \ (recording[-1]['action'] == "moved" and \ time.time() - recording[-1]['_time'] > 0.02): json_object = { 'action':'moved', 'x':x, 'y':y, '_time':time.time() } recording.append(json_object) def on_click(x, y, button, pressed): json_object = { 'action':'clicked' if pressed else 'unclicked', 'button':str(button), 'x':x, 'y':y, '_time':time.time() } recording.append(json_object) if len(recording) > 1: if recording[-1]['action'] == 'unclicked' and \ recording[-1]['button'] == 'Button.right' and \ recording[-1]['_time'] - recording[-2]['_time'] > 2: with open('recording.json', 'w') as f: json.dump(recording, f) print("Mouse recording ended.") return False def on_scroll(x, y, dx, dy): json_object = { 'action': 'scroll', 'vertical_direction': int(dy), 'horizontal_direction': int(dx), 'x':x, 'y':y, '_time': time.time() } recording.append(json_object) def start_recording(): keyboard_listener = keyboard.Listener( on_press=on_press, on_release=on_release) mouse_listener = mouse.Listener( on_click=on_click, on_scroll=on_scroll, 
on_move=on_move) keyboard_listener.start() mouse_listener.start() keyboard_listener.join() mouse_listener.join() if __name__ == "__main__": start_recording() ``` ## Convert JSON output to PyAutoGUI script Now we have the **recording.json** file, we can use that to convert it into a Python script. We are excluding key release and scroll events as these don't really help for the purposes of conversion. ```python [convert.py] """ Converts the recording.json file to a Python script 'play.py' to use with PyAutoGUI. The 'play.py' script may require editing and adapting before use. Always review 'play.py' before running with PyAutoGUI! """ import json key_mappings = { "cmd": "win", "alt_l": "alt", "alt_r": "alt", "ctrl_l": "ctrl", "ctrl_r": "ctrl" } def read_json_file(): """ Takes the JSON output 'recording.json' Excludes released and scrolling events to keep things simple. """ with open('recording.json') as f: recording = json.load(f) def excluded_actions(object): return "released" not in object["action"] and \ "scroll" not in object["action"] recording = list(filter(excluded_actions, recording)) return recording def convert_to_pyautogui_script(recording): """ Converts to a Python template script 'play.py' to use with PyAutoGUI. Converts the: - Mouse clicks - Keyboard input - Time between actions calculated """ if not recording: return output = open("play.py", "w") output.write("import time\n") output.write("import pyautogui\n\n") for i, step in enumerate(recording): print(step) not_first_element = i > 0 if not_first_element: # compare time to previous time for the 'sleep' with a 10% buffer pause_in_seconds = (step["_time"] - recording[i - 1]["_time"]) * 1.1 output.write(f"time.sleep({pause_in_seconds})\n\n") else: output.write("time.sleep(1)\n\n") if step["action"] == "pressed_key": key = step["key"].replace("Key.", "") if "Key." 
in step["key"] else step["key"] if key in key_mappings.keys(): key = key_mappings[key] output.write(f"pyautogui.press('{key}')\n") if step["action"] == "clicked": output.write(f"pyautogui.moveTo({step['x']}, {step['y']})\n") if step["button"] == "Button.right": output.write("pyautogui.mouseDown(button='right')\n") else: output.write("pyautogui.mouseDown()\n") if step["action"] == "unclicked": output.write(f"pyautogui.moveTo({step['x']}, {step['y']})\n") if step["button"] == "Button.right": output.write("pyautogui.mouseUp(button='right')\n") else: output.write("pyautogui.mouseUp()\n") print("Recording converted. Saved to 'play.py'") if __name__ == "__main__": recording = read_json_file() convert_to_pyautogui_script(recording) ``` As some of the keys from pynput don't correspond directly to PyAutoGUI, the **key_mappings** dictionary helps out with this. If you come across any more, you can add to this dictionary taking the pynput key and mapping it to the relevant PyAutoGUI [keyboard keys](https://pyautogui.readthedocs.io/en/latest/keyboard.html#keyboard-keys). 
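To make the mapping step a little more concrete, here is a minimal standalone sketch of the same idea. The first five entries are the ones from **key_mappings** above; the extra entries and the helper name `to_pyautogui_key` are my own illustrative assumptions, so check them against the pynput and PyAutoGUI key lists before relying on them.

```python
# Mappings from convert.py, plus a few illustrative extras
KEY_MAPPINGS = {
    "cmd": "win",
    "alt_l": "alt",
    "alt_r": "alt",
    "ctrl_l": "ctrl",
    "ctrl_r": "ctrl",
    # Assumed additions, not part of the original script:
    "shift_r": "shiftright",
    "page_down": "pagedown",
    "page_up": "pageup",
    "caps_lock": "capslock",
}


def to_pyautogui_key(recorded_key: str) -> str:
    """Converts a key string stored in recording.json to a PyAutoGUI key name.

    Special keys are recorded as 'Key.<name>', character keys as the character.
    Names without a mapping are passed through unchanged.
    """
    if recorded_key.startswith("Key."):
        name = recorded_key.replace("Key.", "", 1)
    else:
        name = recorded_key
    return KEY_MAPPINGS.get(name, name)


for recorded in ["Key.cmd", "Key.ctrl_l", "Key.page_down", "Key.enter", "a"]:
    print(recorded, "->", to_pyautogui_key(recorded))
```

Using `dict.get` with the original name as the default means any key without an entry falls through untouched, which matches how the conversion script behaves for keys like 'enter' that share a name in both libraries.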
## Play the automation script Once the conversion ends, **play.py** will contain a PyAutoGUI script that will look something like: ```python [play.py] import time import pyautogui time.sleep(1) pyautogui.press('win') time.sleep(1) pyautogui.press('f') time.sleep(0.22220540046691897) pyautogui.press('i') time.sleep(0.10727632045745851) pyautogui.press('r') time.sleep(0.08800437450408936) pyautogui.press('e') time.sleep(0.5824827909469605) pyautogui.press('f') time.sleep(0.11989445686340333) pyautogui.press('o') time.sleep(0.22220461368560793) pyautogui.press('x') time.sleep(2.674463224411011) pyautogui.moveTo(206, 219) pyautogui.mouseDown() time.sleep(0.07921419143676758) pyautogui.moveTo(206, 219) pyautogui.mouseUp() time.sleep(5.592976307868958) pyautogui.moveTo(522, 68) pyautogui.mouseDown() time.sleep(0.11439170837402345) ``` Here is a quick end-to-end video demo recording, converting then playing back an automation process - an example of opening Firefox, navigating to W3Schools, searching for Python, copying some code, then pasting it into Visual Studio Code. This uses left click, right click and keyboard input so applicable to a real-world scenario. ## Final cut Okay this was another fun Python automation article, now you know how to create a mouse and keyboard recorder with Python, and have a solid start to building more advanced robotic process automation (RPA) solutions with PyAutoGUI. You can refer to the [documentation](https://pyautogui.readthedocs.io/en/latest/) for more guidance on using PyAutoGUI and think about what else you might like to build 😄 Although there is functionality for [controlling the mouse with pynput](https://pynput.readthedocs.io/en/latest/mouse.html) I still prefer to have a PyAutoGUI output script. This program can be modified and adapted further to your needs. You could read in some data with [pandas](https://pandas.pydata.org/) and then introduce a for loop to repeat an automation process for multiple inputs during playback. 
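A very rough sketch of that looping idea is below. To keep it runnable without moving the mouse or installing anything, it uses the standard library `csv` module rather than pandas, and `play_once` only records the steps it would perform. The input data, the column name `search_term` and both function names are hypothetical; in a real **play.py** each recorded step would be a PyAutoGUI call such as `pyautogui.press` or `pyautogui.write`.

```python
import csv
import io

# Hypothetical input data; in practice this could be a CSV file or a pandas DataFrame
INPUT_CSV = """search_term
python
selenium
"""


def play_once(search_term, actions):
    """Stand-in for one playback run of the recorded process.

    Each tuple represents a step that PyAutoGUI would carry out,
    with the typed text swapped for the current input value.
    """
    actions.append(("press", "win"))
    actions.append(("write", search_term))
    actions.append(("press", "enter"))


def play_for_all(rows):
    """Repeats the playback once per input row and returns the planned steps."""
    actions = []
    for row in rows:
        play_once(row["search_term"], actions)
    return actions


rows = list(csv.DictReader(io.StringIO(INPUT_CSV)))
planned_actions = play_for_all(rows)

for action in planned_actions:
    print(action)
```

Keeping the playback in a function that takes the input as a parameter is the main design point here, as it lets the same recorded steps be reused for every row.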
If you enjoyed this article be sure to check out other articles on the site including: * [Creating a screen and mouse jiggler with Python](/blog/creating-a-screen-and-mouse-jiggler-with-python/) for another Python and PyAutoGUI use case * [Developing your data science and analytical coding skills - a review of DataCamp](/blog/developing-your-data-science-and-analytical-coding-skills-a-review-of-datacamp/) for improving your Python skills Finally, if you have any questions or if you decide to use or extend this program, please leave a comment below. I'd love to know what you use it for and how it's helped you out 👍

Developing your data science and analytical coding skills - a review of DataCamp

Mon, 13 Nov 2023 11:47:00 GMT

In this article, we will explore quite an in-depth overview of [DataCamp](https://datacamp.pxf.io/EKAK42), what it is, who it's for, how to get started and get the most out of it, alongside my experiences of using DataCamp to develop data science and career skills. I hope this review can give you a solid starting point to decide whether DataCamp is right for you. Let's begin! ## What is DataCamp? DataCamp is an online learning platform and a powerful resource for learning how to code for data science. > Develop in-demand data science and AI skills at your own pace with 460+ courses. Learn SQL, Python, R, Tableau, PowerBI, ChatGPT and more with interactive exercises. Follow short videos led by expert instructors and then practice what you've learned with hands-on exercises in your browser. ## Who is DataCamp good for? * Beginners who want to learn how to code for data analysis, data science and / or data engineering * Intermediate analysts who want to explore more complex data science topics * Professional analysts who want to quickly refresh skills for a project or carry out continuous professional development ## My experience with DataCamp My first use of DataCamp way back in 2018 was through the [Microsoft Professional Certificate in Data Science](https://devblogs.microsoft.com/premier-developer/microsoft-professional-program-for-data-science-sharpen-your-data-science-skills/) where it was used for the practical coding sections. I was both very impressed and hooked on data science, so subscribed for a yearly subscription to really commit to the change of career specialism. I completed that alongside [Harvard's Professional Certificate in Computer Science for Artificial Intelligence](/blog/concepts-of-artificial-intelligence-with-python-a-review-of-cs50-ai/). Both of these courses were essential for me to break in to the field of data science and software development. 
Statistics accounted for 50% of my undergraduate degree but came nowhere near the hands-on coding experience DataCamp provided. Back then, I studied all of the introductory courses for [Python](https://datacamp.pxf.io/217ayQ), [R](https://datacamp.pxf.io/NkE9R2) and [SQL](https://datacamp.pxf.io/vNmPQj). This gave me an excellent foundation for understanding how to use code to interrogate data and solve business problems. Since then, I have joined a large employer that provides a business subscription to DataCamp. This really helps me to balance a full-time job with learning. Ongoing professional development is vital, and this also helps when a project calls for a technique I need a refresher on, or one I’ve not used before or in a while. We have recently started using Azure Databricks with PySpark for a prediction project, so the courses I am doing right now include: * [Introduction to Azure](https://datacamp.pxf.io/k0AOJd) * [Introduction to PySpark](https://datacamp.pxf.io/rQWaJ3) * [Supervised Learning with scikit-learn](https://datacamp.pxf.io/q4ozOY) ## Pricing and free tier When it comes to [pricing](https://datacamp.pxf.io/c/4971160/1112312/13294), it is very clear and easy to select your currency from the dropdown at the top right. At the time of writing, there is a discount for a yearly subscription as opposed to a monthly subscription, which is great if you're ready to dedicate yourself to learning data science. Much like a gym membership, I think once you commit for the long term, you stick with it and make progress. In terms of advancing your career, gaining access to an immense library of content and the ability to practice coding plus gain certification, I feel this price is very reasonable. When comparing the pricing to typical [undergraduate tuition fees](https://www.ucas.com/finance/undergraduate-tuition-fees-and-student-loans#how-much-are-tuition-fees), I think the yearly pricing represents exceptional value for hands-on learning. 
In the unlikely event that you try it and really don't gel with it then [you can cancel easily](https://support.datacamp.com/hc/en-us/articles/360001546054-How-do-I-cancel-my-subscription-#h_01HEAHEBB21SH0MHY7VP71Y8V4). Also, take advantage of the limited access free tier - you get every first chapter free. ## Offers and promotions From time to time there are promotions and discounts so be sure to take advantage of these if you decide DataCamp is right for you. Here is a list I will keep updated with current and upcoming promotions and discounts: * [Student Discount - 50% Off for Students by subscribing to our Premium Student Plan!](https://datacamp.pxf.io/c/4971160/1611874/13294) * [Black Friday Sale - 50% OFF](https://datacamp.pxf.io/c/4971160/1859711/13294) November 13, 2023 11:59:00 (EST) to November 26, 2023 23:59:00 (EST) * [Cyber Monday Sale - 50% Off](https://datacamp.pxf.io/c/4971160/1859718/13294) November 27, 2023 00:01:00 (EST) to December 7, 2023 23:59:00 (EST) ## Getting started with DataCamp After logging in to [DataCamp](https://datacamp.pxf.io/EKAK42), the Learn hub is the main place to access learning materials. Although the video below is geared towards business users, it's very helpful to everyone getting started with the basics of DataCamp including: * Tracks - career or skill tracks curate courses into a guided track. * Courses - interactive courses combining short videos with hands-on exercises. * Practice - quick daily challenges to keep skills sharp. * Assessments - test your skills to find your weak areas. * Tutorials - lots of articles and how-to guides. * Projects and Case Studies - solve real world problems guided or unguided. ## Making the most of DataCamp * Certifications - DataCamp Certification is an official recognition and a great way to prove your skills are job-ready. * [Workspace](https://datacamp.pxf.io/rQWaL3) - personal in-browser tool to write code, and share your data analysis. 
Think of this as a cloud based Jupyter notebook-like tool. * Competitions - apply skills to a real world task and compare to other DataCamp learners. * Code Alongs - webinars and events. * Popular topics - learn about new and trending tech like ChatGPT. ## Does DataCamp have any weaknesses? One of the downsides I've heard is that sometimes DataCamp can feel too much like a 'fill in the gaps' puzzle. I get this to an extent, but it's really important not to go through the exercises blindly, and to try to understand them instead. DataCamp is excellent at providing a taste of what an aspiring data scientist needs to start with. If aspiring analysts/data scientists become very interested in what they are exposed to, they'll then complement this with other learning methods and research more widely (YouTube videos, textbooks, articles and so on). For me, DataCamp is like a flight simulator; it teaches you what you need to know in a controlled environment where you can make mistakes, but don’t forget you also need to prepare for the real thing in a business setting which includes: * Setting up an IDE such as RStudio, Spyder, Visual Studio Code, PyCharm on your own machine * Installing Python (Base or Anaconda) or R on your own machine * Using cloud tools like Azure, AWS, Google Cloud Platform, Databricks * Setting up and configuring cloud databases with SSMS, Postgres etc * Gathering requirements from real business stakeholders * Selecting an explainable model for classification / regression given a business problem * Managing a project from start to finish; delivering a working solution * Presenting analysis to real stakeholders Your first day as a Data Scientist probably won't include firing up DataCamp! However, it may well provide you with the skills required to land and carry out that role. You can check out the article [Preparing for a statistical data science interview](/blog/preparing-for-a-statistical-data-science-interview/) if you're preparing to apply. 
## What do others think about DataCamp? To get a feel for what others think and their experiences, check out the [stories](https://datacamp.pxf.io/1rz7ga) page which has lots of learner outcomes. I also read a really interesting article on [How One Learner Saved 1,500+ Hours of Work By Taking 200+ Courses and Amassing 1,000,000+ XP](https://www.datacamp.com/blog/how-one-learner-saved-1500-hours-of-work-by-taking-200-courses-and-amassing-1000000-xp). ## Alternatives to DataCamp It wouldn't be fair to finish the review without acknowledging alternatives to DataCamp. Although DataCamp is excellent for data science, if you have a slightly different goal in mind, another service may be better suited to you. These might include: * [Pluralsight](https://www.pluralsight.com/) - interactive and video courses on all areas of tech * [edX](https://www.edx.org/) - courses from big-name universities and colleges with optional paid certificates * [Coursera](https://www.coursera.org/) - video courses on many topics including coding * [Udemy](https://udemy.com/) - video courses on many topics including coding * [freeCodeCamp](https://www.freecodecamp.org/) - free interactive coding courses I've used all of these in the past, my favourites were freeCodeCamp, edX and Pluralsight. My opinion is that freeCodeCamp is great for starting out, edX offers accreditation from universities and colleges like [Harvard's CS50 AI](/blog/concepts-of-artificial-intelligence-with-python-a-review-of-cs50-ai/), and Pluralsight is another enterprise favourite for tech with Microsoft usually offering a 3 month trial with their Visual Studio Enterprise / Professional subscriptions. ## Final verdict The overall conclusion to this review is that DataCamp is a fantastic resource for learning data science. It may not be perfect, nothing is, but it is one of the best tools out there to improve or maintain data science skills. 
It is no surprise that [80% of the Fortune 1000 use it](https://datacamp.pxf.io/AW5Poa). A final thought is that every role seems to be demanding more skills in analysis, statistics and using data to make better decisions. This means that not just data scientists or data engineers need data skills, everyone does. If you enjoyed this article be sure to check out [other articles](/) on the site. If you have any questions feel free to leave a comment 👍

How to scrape AutoTrader with Python and Selenium to search for multiple makes and models

Sun, 05 Nov 2023 17:31:00 GMT

Searching for used cars can be time-consuming and sometimes there isn't a good way to easily compare potential cars. AutoTrader is a great place to perform this search and comparison but as far as I can see, it does not allow you to search for multiple makes and models in one search. Who wants to keep going back and forth between previously saved searches, right? Wouldn't it be so much easier if you could compare all of them in one list or spreadsheet? We'll explore the Python code that does just that using both [Selenium](https://selenium-python.readthedocs.io/) and [regular expressions](https://docs.python.org/3/library/re.html) (RegEx), along with a video demo of how to use it. ## Installing required Python packages Of course, you'll need the latest stable version of [Python](https://www.python.org/downloads/) installed on your operating system and added to your PATH before progressing. I'm also using [Visual Studio Code](https://code.visualstudio.com/) as the code editor. This isn't essential but it's a great free lightweight IDE worth checking out. Following that, the AutoTrader scraper will rely on a few Python packages so using pip, install the following: ``` python -m pip install numpy pandas bs4 selenium xlsxwriter ``` The main libraries we are using here are: * [Selenium](https://selenium-python.readthedocs.io/) to control ChromeDriver, navigate to URLs etc. 
* [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) to parse and search the HTML elements * [Pandas](https://pandas.pydata.org/) for data manipulation and calculations * [XlsxWriter](https://xlsxwriter.readthedocs.io/) to create the Excel output including conditional formatting All other libraries such as [os](https://docs.python.org/3/library/os.html) [re](https://docs.python.org/3/library/re.html), [time](https://docs.python.org/3/library/time.html) and [datetime](https://docs.python.org/3/library/datetime.html) come as standard with the [Python standard library](https://docs.python.org/3/library/index.html). ## Downloading ChromeDriver Selenium effectively 'controls' or 'drives' a web browser in an automated way. In order to do that, we need ChromeDriver, and we need the version that matches your current version of [Chrome](https://www.google.com/intl/en_uk/chrome/). My version of Chrome was **'Version 119.0.6045.106 (Official Build) (64-bit)'**. You can find your current version of Chrome by hitting the three dots in the top right of the browser > Help > About Google Chrome. You will see your current version and an option to update if it isn't the latest version. So based on that, I required the [latest stable version of ChromeDriver](https://googlechromelabs.github.io/chrome-for-testing/) for my machine which was ['119.0.6045.105 win64'](https://edgedl.me.gvt1.com/edgedl/chrome/chrome-for-testing/119.0.6045.105/win64/chromedriver-win64.zip). If you already have Version 119.0.6045.106 you can just [head to this repository](https://github.com/shedloadofcode/autotrader-selenium-scraper) where I have stored the code alongside the version of 'chromedriver.exe' I used ready for cloning / download. ## Explaining the AutoTrader scraper To simplify the code block below and to understand the process, here is a 3 step summary of what's going on. 1. We set our `criteria` and `cars` search parameters. 2. 
   Then we `scrape_autotrader`:
   * For each car find how many pages of results there are in `number_of_pages`
   * For each page scrape all the `articles`
   * For each article use [RegEx](https://www.w3schools.com/python/python_regex.asp) to find all the car `details`
   * Store all car details in a list `data` and return this
3. We take that, and `output_data_to_excel`
   * Ensuring the data is parsed to numeric format
   * Calculating mileage per annum
   * Sorting on distance
   * Conditionally formatting the numeric columns red, amber, green for easier analysis

So once you've set your criteria and cars, ensure you're in the correct directory, then you can run the scraper using:

```
python autotrader-scraper.py
```

The code below then executes and begins the automated scraping in ChromeDriver.

```python [autotrader-scraper.py]
# type: ignore
"""
Enables the automation of searching for multiple makes/models on Autotrader UK
using Selenium and Regex.

Set your criteria and cars makes/models.

Data is then output to an Excel file in the same directory.

Running Chrome Version 119.0.6045.106 and using Stable Win64 ChromeDriver from:
https://googlechromelabs.github.io/chrome-for-testing/
https://edgedl.me.gvt1.com/edgedl/chrome/chrome-for-testing/119.0.6045.105/win64/chromedriver-win64.zip
"""
import os
import re
import time
import datetime

import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

criteria = {
    "postcode": "LS1 2AD",
    "radius": "20",
    "year_from": "2010",
    "year_to": "2014",
    "price_from": "3000",
    "price_to": "6500",
}

cars = [
    { "make": "Toyota", "model": "Yaris" },
    { "make": "Honda", "model": "Jazz" },
    { "make": "Suzuki", "model": "Swift" },
    { "make": "Mazda", "model": "Mazda2" }
]


def scrape_autotrader(cars, criteria):
    chrome_options = Options()
    chrome_options.add_argument("_tt_enable_cookie=1")
    # NOTE: chrome_options is built here but is not passed to webdriver.Chrome() below.
    driver = webdriver.Chrome()
    data = []

    for car in cars:
        # Example URL:
        # https://www.autotrader.co.uk/car-search?advertising-location=at_cars&include-delivery-option
        # =on&make=Honda&model=Jazz&postcode=LS12AD&radius=10&sort=relevance&year-from=2011&year-to=2015
        url = "https://www.autotrader.co.uk/car-search?" + \
            "advertising-location=at_cars&" + \
            "include-delivery-option=on&" + \
            f"make={car['make']}&" + \
            f"model={car['model']}&" + \
            f"postcode={criteria['postcode']}&" + \
            f"radius={criteria['radius']}&" + \
            "sort=relevance&" + \
            f"year-from={criteria['year_from']}&" + \
            f"year-to={criteria['year_to']}&" + \
            f"price-from={criteria['price_from']}&" + \
            f"price-to={criteria['price_to']}"

        driver.get(url)

        print(f"Searching for {car['make']} {car['model']}...")

        # Give the page time to load before reading its source
        time.sleep(5)

        source = driver.page_source
        content = BeautifulSoup(source, "html.parser")

        try:
            pagination_next_element = content.find("a", attrs={"data-testid": "pagination-next"})
            aria_label = pagination_next_element.get("aria-label")
            # Read the trailing number from the label so that 10+ pages are handled
            number_of_pages = re.search(r"\d+$", aria_label.strip()).group(0)
        except Exception:
            print("No results found.")
            continue

        print(f"There are {number_of_pages} pages in total.")

        for i in range(int(number_of_pages)):
            driver.get(url + f"&page={str(i + 1)}")

            time.sleep(5)
            page_source = driver.page_source

            content = BeautifulSoup(page_source, "html.parser")
            articles = content.find_all("section", attrs={"data-testid": "trader-seller-listing"})

            print(f"Scraping page {str(i + 1)}...")

            for article in articles:
                details = {
                    "name": car['make'] + " " + car['model'],
                    "price": re.search(r"£\d+(,\d{3})?", article.text).group(0),
                    "year": None,
                    "mileage": None,
                    "transmission": None,
                    "fuel": None,
                    "engine": None,
                    "owners": None,
                    "location": None,
                    "distance": None,
                    "link": article.find("a", {"href": re.compile(r'/car-details/')}).get("href")
                }

                try:
                    seller_info = article.find("p", attrs={"data-testid": "search-listing-seller"}).text
                    location = seller_info.split("Dealer location")[1]
                    details["location"] = location.split("(")[0]
                    details["distance"] = location.split("(")[1].replace(" mile)", "").replace(" miles)", "")
                except Exception:
                    print("Seller information not found.")

                # Each spec is a short label such as "2012 (12 reg)", "54,321 miles" or "Manual"
                specs_list = article.find("ul", attrs={"data-testid": "search-listing-specs"})
                for spec in specs_list:
                    if "reg" in spec.text:
                        details["year"] = spec.text

                    if "miles" in spec.text:
                        details["mileage"] = spec.text

                    if spec.text in ["Manual", "Automatic"]:
                        details["transmission"] = spec.text

                    if "." in spec.text and "L" in spec.text:
                        details["engine"] = spec.text

                    if spec.text in ["Petrol", "Diesel"]:
                        details["fuel"] = spec.text

                    if "owner" in spec.text:
                        details["owners"] = spec.text[0]

                data.append(details)

            print(f"Page {str(i + 1)} scraped. ({len(articles)} articles)")
            time.sleep(5)

        print("\n\n")

    print(f"{len(data)} cars total found.")

    return data


def output_data_to_excel(data, criteria):
    df = pd.DataFrame(data)

    df["price"] = df["price"].str.replace("£", "").str.replace(",", "")
    df["price"] = pd.to_numeric(df["price"], errors="coerce").astype("Int64")

    # Strip the " (12 reg)" style suffix so only the four-digit year remains
    df["year"] = df["year"].str.replace(r"\s\(\d\d reg\)", "", regex=True)
    df["year"] = pd.to_numeric(df["year"], errors="coerce").astype("Int64")

    df["mileage"] = df["mileage"].str.replace(",", "").str.replace(" miles", "")
    df["mileage"] = pd.to_numeric(df["mileage"], errors="coerce").astype("Int64")

    now = datetime.datetime.now()
    df["miles_pa"] = df["mileage"] / (now.year - df["year"])
    df["miles_pa"] = df["miles_pa"].fillna(0)
    df["miles_pa"] = df["miles_pa"].astype(int)

    df["owners"] = df["owners"].fillna("-1")
    df["owners"] = df["owners"].astype(int)

    df["distance"] = df["distance"].fillna("-1")
    df["distance"] = df["distance"].astype(int)

    df["link"] = "https://www.autotrader.co.uk" + df["link"]

    df = df[[
        "name",
        "link",
        "price",
        "year",
        "mileage",
        "miles_pa",
        "owners",
        "distance",
        "location",
        "engine",
        "transmission",
        "fuel",
    ]]

    df = df[df["price"] < int(criteria["price_to"])]

    df = df.sort_values(by="distance", ascending=True)

    writer = pd.ExcelWriter("cars.xlsx", engine="xlsxwriter")
    df.to_excel(writer, sheet_name="Cars", index=False)
    workbook = writer.book
    worksheet = writer.sheets["Cars"]

    # Price (C): low is green, high is red
    worksheet.conditional_format("C2:C1000", {
        'type': '3_color_scale',
        'min_color': '#63be7b',
        'mid_color': '#ffdc81',
        'max_color': '#f96a6c'
    })

    # Year (D): old is red, new is green
    worksheet.conditional_format("D2:D1000", {
        'type': '3_color_scale',
        'min_color': '#f96a6c',
        'mid_color': '#ffdc81',
        'max_color': '#63be7b'
    })

    # Mileage (E): low is green, high is red
    worksheet.conditional_format("E2:E1000", {
        'type': '3_color_scale',
        'min_color': '#63be7b',
        'mid_color': '#ffdc81',
        'max_color': '#f96a6c'
    })

    # Miles per annum (F): low is green, high is red
    worksheet.conditional_format("F2:F1000", {
        'type': '3_color_scale',
        'min_color': '#63be7b',
        'mid_color': '#ffdc81',
        'max_color': '#f96a6c'
    })

    # ExcelWriter.save() was removed in pandas 2.0; close() writes the file
    writer.close()
    print("Output saved to current directory as 'cars.xlsx'.")


if __name__ == "__main__":
    data = scrape_autotrader(cars, criteria)
    output_data_to_excel(data, criteria)

    os.system("start EXCEL.EXE cars.xlsx")
```

If you don't want an Excel file with all the conditional formatting, after the transformations in `output_data_to_excel` remove everything at and below `writer` then just output to a CSV instead using:

```python
df.to_csv("cars.csv")
```

I hope you find this code highly modifiable so you can adapt and extend it however you like. I was keen to calculate the mileage per annum to assess wear and tear, but you might want to include other calculations to explore other aspects and take it even further!

## Taking the scraper for a test drive

Let's see the scraper in action in this end-to-end demo. By performing this process weekly we can get the most up-to-date listings for a given area. In this demo, I have chosen a random postcode in Leeds. The formatting after scraping makes it really easy to see the trade-offs in terms of price, year, mileage, miles per annum and previous owners. It also nicely allows for further filtering to narrow down your parameters. I closed the accept cookies pop-up manually just so the steps taken in ChromeDriver were easily visible, but this isn't essential; you can just let it run.

## Why did the previous scraper stop working?

For those of you who tried the old scraper from a [previous article](/blog/building-an-autotrader-scraper-with-python-to-search-for-multiple-makes-and-models/) you'll know it stopped working after the AutoTrader UK website changed sometime after September 2023.
All of the classes used for scraping changed and were obfuscated. However, as we've seen in the current scraper, some attributes still allow element identification, such as the `data-testid` attribute. The current scraper is simpler, should be more robust and is less reliant on third party code other than stable libraries. However, I have no doubt that at some point it will stop working after another site change. Nevertheless, this scraper is easier to change, relying only on attribute identification followed by regular expressions to find the required information. So by changing:

1. How we are identifying elements with [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) and
2. How we are parsing the information out of those elements with [RegEx](https://docs.python.org/3/library/re.html)

We can successfully update the code to adapt to changing needs. Selenium is a big help with this also, as the scraping happens after the page has loaded within Chrome. This means that anything that is dynamically added to the page using JavaScript after the page load should be captured.

## Happy car hunting again!

The only thing left for you to do is set your criteria, add the makes and models you want, and off you go! Happy car hunting. I hope the scraper helps you compare cars more easily and find the one you're looking for as much as it helped me 👍

If you have any thoughts on this article, please leave a comment below or reach out by email at the bottom of this page. I'd love to hear how this is being used, whether it's helping others and how you've adapted it to your needs 😄

If you enjoyed this article be sure to check out:

* [How to scrape and analyse your Amazon spending data](/blog/how-to-scrape-and-analyse-your-amazon-spending-data/)
* [How to scrape and analyse your Chess.com data](/blog/how-to-scrape-and-analyse-your-chess-com-data/)
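
As a rough illustration of the "identify the element, then parse its text" approach described in this article, the sketch below runs the same kind of regular expressions over a hard-coded listing string instead of a live page. The sample text, the `parse_listing` helper and its `reference_year` parameter are assumptions made for illustration. They are not part of the scraper, and real AutoTrader listing text may look different.

```python
import re

# Hypothetical listing text, standing in for `article.text` from a scraped page.
SAMPLE_LISTING = "Honda Jazz 1.4 i-VTEC £4,995 2012 (12 reg) 54,321 miles Manual Petrol"


def parse_listing(text, reference_year):
    """Pull price, year and mileage out of free text and derive miles per annum."""
    price_match = re.search(r"£(\d+(?:,\d{3})*)", text)
    year_match = re.search(r"\b((?:19|20)\d{2})\s\(\d{2} reg\)", text)
    mileage_match = re.search(r"(\d+(?:,\d{3})*) miles", text)

    price = int(price_match.group(1).replace(",", "")) if price_match else None
    year = int(year_match.group(1)) if year_match else None
    mileage = int(mileage_match.group(1).replace(",", "")) if mileage_match else None

    # Treat cars registered in the reference year as one year old to avoid dividing by zero.
    miles_pa = None
    if year is not None and mileage is not None:
        age = max(reference_year - year, 1)
        miles_pa = mileage // age

    return {"price": price, "year": year, "mileage": mileage, "miles_pa": miles_pa}


result = parse_listing(SAMPLE_LISTING, reference_year=2023)
print(result)
```

Keeping the parsing in a small function like this means the patterns can be tried against saved text before pointing the scraper at the live site, which may help when the site changes again.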

How to import a CSV from Dropbox or GitHub into Google Sheets

Thu, 02 Nov 2023 13:05:00 GMT

## Introduction Recently I really wanted to export some of my spending data from the [Spending Tracker app](https://play.google.com/store/apps/details?hl=en&id=com.mhriley.spendingtracker) I use in CSV format to analyse it. This app exports data to Dropbox once it's linked up. So I needed a way to bring that data into Google Sheets to analyse trends etc. The process is quite simple once you know the steps involved, so I have documented them here! I have also documented how to do the same thing using GitHub. As an example, we will use the Titanic dataset stored in both Dropbox and GitHub and then import that into Google Sheets from both sources 😄 ## Get the CSV link from Dropbox First things first, we need to head across to Dropbox and copy the link to the CSV file. This gives us the link [https://www.dropbox.com/scl/fi/dm1q4w0idefrwcv1arsxf/titanic.csv?rlkey=652khaywjcazj9h0itw47b574&dl=0](https://www.dropbox.com/scl/fi/dm1q4w0idefrwcv1arsxf/titanic.csv?rlkey=652khaywjcazj9h0itw47b574&dl=0) For this Dropbox link we will need to change the ending from `dl=0` to `dl=1` so that the file is downloaded rather than viewed when we try to import it later. **This is an important step**. So the correct link is [https://www.dropbox.com/scl/fi/dm1q4w0idefrwcv1arsxf/titanic.csv?rlkey=652khaywjcazj9h0itw47b574&dl=1](https://www.dropbox.com/scl/fi/dm1q4w0idefrwcv1arsxf/titanic.csv?rlkey=652khaywjcazj9h0itw47b574&dl=1) ## Get the CSV link from GitHub Doing the same process for GitHub I stored the CSV within a repository named [data-files](https://github.com/shedloadofcode/data-files/blob/main/titanic.csv). To ensure the CSV imports correctly we must first hit the 'Raw' button and copy that link instead. 
This gives us the raw CSV link [https://raw.githubusercontent.com/shedloadofcode/data-files/main/titanic.csv](https://raw.githubusercontent.com/shedloadofcode/data-files/main/titanic.csv)

## Import CSV data from Dropbox

Now that we have both links, we can import the CSV data into Google Sheets using the [IMPORTDATA](https://support.google.com/docs/answer/3093335) function, passing in the URL for each CSV file. Again, for the Dropbox link we will need to change the ending from `dl=0` to `dl=1` so that the file is downloaded. We can enter the formula and pass the link as the first argument. This imports the data and adds it to the current sheet.

## Import CSV data from GitHub

Following the same pattern but on a new sheet, we enter the link from GitHub and hit enter. This imports the data and adds it to the current sheet.

## Analyse the data

On either sheet, if we click any cell in the table and hit `Ctrl + A` we can select all the data, and then go to **Insert > Pivot Table** and select **'New sheet'**. We can then drag in fields to analyse the data. Here we are finding the count and survival rate of males vs females. You can apply this methodology to any dataset, and any questions you have for that dataset! The best part is that when the Google Sheet reloads, new data is automatically pulled in, creating a simple data pipeline.

## Import complete!

Thanks very much for reading. This was a short article covering how to import a CSV from Dropbox or GitHub into Google Sheets. This method creates an automated refresh when the Sheet is reloaded, ensuring analysis is always carried out on the latest data.

If you enjoyed this article be sure to check out [other articles](/) on the site. If you have any questions please leave a comment 👍 Hope this helps you out and enjoy your day!
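
If you have many links to convert, the two link tweaks from this article can be scripted. The sketch below is a minimal Python illustration using only the standard library. The helper names `dropbox_direct_link`, `github_raw_link` and `importdata_formula` are made up for this example, and the GitHub conversion assumes the usual `github.com/<user>/<repo>/blob/<branch>/<path>` layout seen in the link above.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def dropbox_direct_link(share_url):
    """Return the share URL with dl=1 so the file downloads instead of previewing."""
    parts = urlsplit(share_url)
    query = dict(parse_qsl(parts.query, keep_blank_values=True))
    query["dl"] = "1"
    return urlunsplit(parts._replace(query=urlencode(query)))


def github_raw_link(blob_url):
    """Convert a github.com '/blob/' file URL to its raw.githubusercontent.com form."""
    parts = urlsplit(blob_url)
    # Raises ValueError if the path has fewer than four segments.
    user, repo, marker, rest = parts.path.lstrip("/").split("/", 3)
    if parts.netloc != "github.com" or marker != "blob":
        raise ValueError("Expected a github.com/<user>/<repo>/blob/... URL")
    return f"https://raw.githubusercontent.com/{user}/{repo}/{rest}"


def importdata_formula(url):
    """Build the Google Sheets formula text for a CSV URL."""
    return f'=IMPORTDATA("{url}")'


dropbox_url = dropbox_direct_link(
    "https://www.dropbox.com/scl/fi/dm1q4w0idefrwcv1arsxf/titanic.csv"
    "?rlkey=652khaywjcazj9h0itw47b574&dl=0"
)
github_url = github_raw_link(
    "https://github.com/shedloadofcode/data-files/blob/main/titanic.csv"
)

print(importdata_formula(dropbox_url))
print(importdata_formula(github_url))
```

The printed formulas can be pasted straight into cell A1 of a sheet.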

How to build a random recipe selector with Python

Thu, 26 Oct 2023 15:30:00 GMT

## Introduction

For a while I've wanted to try and mimic the setup of [HelloFresh](https://www.hellofresh.co.uk/), whereby you:

* have a list of meals you like to cook and eat
* want to choose a random number of recipes for the following week
* then want a shopping list for the ingredients for those recipes

Although I have never tried HelloFresh, I've heard from others it's great for simplicity - you only get the ingredients you need for the recipe and it's mostly healthy stuff. Nevertheless, the principles inspired my own solution. I wanted to sharpen up my cooking skills, learn new recipes, and automate the stressful decision and procurement part of the process. This is where ingredirandom steps in to help. It:

* defines a list of `recipes` dictionaries from various cook books I use
* defines a `cost_lookup` dictionary where each value is a tuple of the cost and the product code from online shopping at ASDA
* provides a script `ingredirandom` which randomly selects a given number of those recipes, and outputs the shopping list to a text file

The following code blocks contain each of these steps, so please enjoy having a read through 😄

## Create a list of recipes

```python [recipes.py]
recipes = [
    { "name": "Beefy Mince and Pasta Bake", "book": "Enter cookbook name here", "page": 38, "serves": "2-3", "ingredients": [ "Tin of Campbell's condensed Tomato soup", "500g beef mince", "Beef or vegetable stock cubes", "Grated cheese", "Garlic cloves", "Onion", "Butter", "Pasta", "Freeze dried basil", "Salt", "Pepper" ] },
    { "name": "Hoisin Chicken Noodles", "book": "Enter cookbook name here", "page": 59, "serves": "2", "ingredients": [ "Spring onions", "Fresh ginger", "Garlic cloves", "Chicken breasts", "Mushrooms", "Chicken stock", "Soy sauce", "Hoisin sauce", "Can of sweetcorn", "Fresh egg noodles", "Olive oil" ] },
    { "name": "Pan Roast Chicken Breast with mustard sauce", "book": "Enter cookbook name here", "page": 64, "serves": "Enter cookbook name here", "ingredients": [ "Cherry tomatoes", "Lettuce", "Cucumber",
"Chicken breasts", "Potato wedges or new potatoes", "Olive oil", "Balsamic vinegar", "Sugar", "Salt", "Pepper", "Mustard sauce", "Prosecco" ] }, { "name": "Chicken and Mushroom Pasta with 'Philly' cheese and fresh basil", "book": "Enter cookbook name here", "page": 68, "serves": "2", "ingredients": [ "500g tagliatelle", "Onion", "Garlic cloves", "Chicken breasts", "Mushrooms", "200g Philadelphia soft cheese", "Fresh basil", "Salt", "Pepper", "Parmesan", "Olive oil" ] }, { "name": "Thai Salmon with coconut rice and green chilli dressing", "book": "Enter cookbook name here", "page": 200, "serves": "2-3", "ingredients": [ "Olive oil", "Thai red curry paste", "Spring onions", "400g can coconut milk", "Fresh coriander leaves", "Lemon", "Rice", "Salmon steaks", "Hoisin sauce", "Sugar", "Green chilli" ] }, { "name": "Pan Roasted Chicken with spicy fried rice", "book": "Enter cookbook name here", "page": 154, "serves": "2-3", "ingredients": [ "Basmati rice", "Chicken breasts", "Eggs", "Onion", "Garlic cloves", "Red pepper", "Fresh ginger", "Red chilli", "Oyster sauce", "Soy sauce", "Spring onions" ] }, { "name": "Tuna Noodles with honey and ginger dressing", "book": "Enter cookbook name here", "page": 153, "serves": "2-3", "ingredients": [ "Honey", "Soy sauce", "White wine vinegar", "Red chilli", "Fresh ginger", "Salt", "Pepper", "Spring onions", "Cucumber", "Red pepper", "Can of tuna", "Fresh egg noodles" ] }, { "name": "Zesty Tuna Steaks with chilli tagliatelle", "book": "Enter cookbook name here", "page": 87, "serves": "2-3", "ingredients": [ "500g tagliatelle", "Olive oil", "Spring onions", "Red chilli", "Tuna steaks", "Black olives", "Fresh thyme", "Olive Oil", "Sugar", "Lime", "Salt", "Pepper" ] }, { "name": "Spaghetti Carbonara with Parmesan", "book": "Enter cookbook name here", "page": 80, "serves": "2-3", "ingredients": [ "500g pack spaghetti", "Onion", "Garlic cloves", "200g pack pancetta lardons or strips of streaky bacon", "Olive oil", "Eggs", "Parmesan", 
"Fresh basil", "Red wine" ] }, { "name": "Chorizo Spaghetti with balsamic and basil sauce", "book": "Enter cookbook name here", "page": 79, "serves": "2-3", "ingredients": [ "500g pack spaghetti", "Onion", "Garlic cloves", "Red pepper", "Chorizo sausages", "Tomatoes", "Olives", "Fresh basil", "Olive oil", "Balsamic vinegar", "Red wine vinegar", "Sugar" ] }, { "name": "Italian Meatballs with spaghetti", "book": "Enter cookbook name here", "page": 175, "serves": "2-3", "ingredients": [ "Olive oil", "Onion", "Garlic cloves", "400g tin tomatoes", "Tomato puree", "Brown sugar", "Red wine vinegar", "Fresh basil", "Salt", "Pepper", "500g beef mince", "Onion", "Red wine" ] }, { "name": "Beef Chow Mein with oyster sauce", "book": "Enter cookbook name here", "page": 72, "serves": "2-3", "ingredients": [ "Fresh ginger", "Garlic cloves", "Tomato puree", "Oyster sauce", "Soy sauce", "Onion", "Red pepper", "Rump steak", "Bean sprouts", "Fresh egg noodles", "Olive oil" ] }, { "name": "Crispy Fried Duck Breast with ginger dressing and fried rice", "book": "Enter cookbook name here", "page": 180, "serves": "2", "ingredients": [ "Basmati rice", "Carrots", "Courgette", "Duck breasts", "Eggs", "Spring onions", "Olive oil", "Fresh ginger", "Lime", "Soy sauce", "Honey" ] }, { "name": "Irish Lamb Stew with colcannon", "book": "Enter cookbook name here", "page": 184, "serves": "4", "ingredients": [ "Olive oil", "Onion", "Garlic cloves", "Stewing or diced lamb", "Carrots", "Flour", "Vegetable stock cube", "Apricot jam", "Red wine", "Rosemary", "Mushrooms", "Potatoes", "Butter", "250g pack of spring greens or savoy cabbage (optional)", "300ml pot of soured cream (optional)" ] }, { "name": "Beef Steak with balsamic onion and peppercorn sauce", "book": "Enter cookbook name here", "page": 171, "serves": "2", "ingredients": [ "Peppercorn sauce", "Onion", "Olive oil", "Onion", "Balsamic vinegar", "Brown sugar", "Potatoes", "Butter", "Salt", "Pepper", "Rump steak" ] }, { "name": "Sweet Honey 
Chicken with risotto rice", "book": "Enter cookbook name here", "page": 187, "serves": "2", "ingredients": [ "Soy sauce", "Fresh ginger", "Honey", "Dried chives", "Chicken breasts", "Butter", "Garlic cloves", "Yellow pepper", "Basmati rice", "Chicken stock", "Mushrooms", "Spring onions", "Courgette" ] }, { "name": "Sweet and Sour Chicken Noodles", "book": "Enter cookbook name here", "page": 196, "serves": "2", "ingredients": [ "Soy sauce", "Spring onions", "Red pepper", "Garlic cloves", "Chicken breasts", "Fresh egg noodles", "Olive oil", "Tomato puree", "Honey", "White wine vinegar", "Fresh ginger" ] }, { "name": "Corned Beef Hash with fried eggs", "book": "Enter cookbook name here", "page": 141, "serves": "2", "ingredients": [ "Potatoes", "Onion", "Corned beef", "Eggs", "Olive oil", "Salt", "Pepper" ] }, { "name": "Shiitake Mushroom Risotto with Parmesan", "book": "Enter cookbook name here", "page": 149, "serves": "2", "ingredients": [ "Butter", "Onion", "Garlic cloves", "Risotto rice", "Vegetable stock cube", "Shiitake mushrooms", "Salt", "Pepper", "Fresh basil", "Parmesan" ] }, { "name": "Traditional Pork Steaks with honey and mustard sauce", "book": "Enter cookbook name here", "page": 76, "serves": "2", "ingredients": [ "Potato wedges or new potatoes", "Carrots", "Pork steaks", "Green beans", "Onion", "Fresh ginger", "Cumin", "Cinnamon", "Flour", "Honey", "Wholegrain mustard", "Salt", "Pepper", "Olive oil" ] }, { "name": "Thai Prawn Curry with rice", "book": "Enter cookbook name here", "page": 83, "serves": "2", "ingredients": [ "Rice", "Pilau rice seasoning", "Prawns", "Olive oil", "Thai red curry paste", "400g can coconut milk", "Mangetout", "Baby sweetcorn", "Spring onions" ] }, { "name": "Crispy Parmesan Cod with fresh tomato sauce and mini roasts", "book": "Enter cookbook name here", "page": 84, "serves": "2", "ingredients": [ "Olive oil", "Tomato and basil sauce", "Pepper", "Fresh basil", "Breadcrumbs", "Parmesan", "Lemon", "Potatoes", "Cod", "Eggs" ] }, 
    { "name": "Easy Cooked Breakfast", "book": "Enter cookbook name here", "page": 100, "serves": "2", "ingredients": [ "Hash browns", "Salt", "Pepper", "Sausages", "Streaky bacon", "Tomatoes", "Eggs", "Bread", "Olive oil", "Orange juice" ] },
    { "name": "Chicken Biryani with Naan bread", "book": "Enter cookbook name here", "page": 199, "serves": "2", "ingredients": [ "Butter", "Onion", "Chicken thighs", "Korma curry paste", "Rice", "Chicken stock", "Yoghurt", "Raisins", "Fresh coriander leaves", "Flaked almonds", "Naan bread" ] },
    { "name": "Pork with apple and pear chutney", "book": "Enter cookbook name here", "page": 75, "serves": "2", "ingredients": [ "Apple and pear chutney", "Pilau rice seasoning", "Rice", "Mangetout", "Pork steaks" ] },
    { "name": "Zesty Cod with rice", "book": "Enter cookbook name here", "page": 96, "serves": "2", "ingredients": [ "Rice", "Pilau rice seasoning", "Cod fillets", "Eggs", "Onion", "Mushrooms", "Lemon", "Freeze dried basil" ] }
]
```

## Record ingredients and costs

```python [costs.py]
"""
A record of costs per ingredient.
Key is ingredient name, value is tuple (cost of item, ASDA product code)

Last updated: September 2023
"""
cost_lookup = {
    "Tin of Campbell's condensed Tomato soup": (1.30, "5498495"),
    "500g beef mince": (3.70, "1525219"),
    "Beef or vegetable stock cubes": (3.10, "4052433"),
    "Grated cheese": (2.55, "4639365"),
    "Garlic cloves": (2.00, "6599892"),
    "Onion": (1.00, "5737702"),
    "Butter": (3.25, "6858100"),
    "Pasta": (0.95, "6125466"),
    "Freeze dried basil": (0.80, "544353"),
    "Salt": (0.80, "4938721"),
    "Pepper": (0.90, "1352762"),
    "Spring onions": (0.75, "410212"),
    "Fresh ginger": (0.60, "6668284"),
    "Chicken breasts": (4.70, "7648521"),
    "Mushrooms": (1.29, "4110717"),
    "Chicken stock": (0.75, "2687967"),
    "Soy sauce": (1.90, "6124290"),
    "Hoisin sauce": (1.80, "6124274"),
    "Can of sweetcorn": (0.65, "5986511"),
    "Fresh egg noodles": (1.50, "5128622"),
    "500g tagliatelle": (2.00, "2207092"),
    "Olive oil": (5.90, "6722819"),
    "Red chilli": (0.55, "4928242"),
    "Tuna steaks": (5.00, "7740432"),
    "Black olives": (1.15, "951664"),
    "Fresh thyme": (0.55, "5139830"),
    "Sugar": (0.89, "217367"),
    "Lime": (1.00, "5596923"),
    "500g pack spaghetti": (0.75, "12943"),
    "200g pack pancetta lardons or strips of streaky bacon": (2.25, "6345750"),
    "Eggs": (2.95, "166781"),
    "Parmesan": (1.85, "3160573"),
    "Fresh basil": (0.55, "6753736"),
    "Red wine": (8.50, "1701819"),
    "Red pepper": (0.55, "1857059"),
    "Chorizo sausages": (2.70, "3567277"),
    "Tomatoes": (1.25, "5794643"),
    "Olives": (2.00, "6697522"),
    "Balsamic vinegar": (1.30, "1554788"),
    "Red wine vinegar": (4.50, "7681719"),
    "400g tin tomatoes": (1.25, "7675447"),
    "Tomato puree": (1.40, "7675461"),
    "Brown sugar": (1.35, "6345327"),
    "Oyster sauce": (1.75, "6124294"),
    "Rump steak": (6.20, "7357125"),
    "Bean sprouts": (0.50, "6536231"),
    "Basmati rice": (2.20, "18631"),
    "Carrots": (0.50, "150208"),
    "Courgette": (0.75, "6566770"),
    "Duck breasts": (6.00, "7443861"),
    "Honey": (1.47, "5506364"),
    "Stewing or diced lamb": (4.95, "6740100"),
    "Flour": (0.80, "11120"),
    "Vegetable stock cube": (0.75, "2687969"),
    "Apricot jam": (1.15, "6722853"),
    "Rosemary": (0.55, "5148466"),
    "Potatoes": (1.70, "1843017"),
    "250g pack of spring greens or savoy cabbage (optional)": (0.75, "150460"),
    "300ml pot of soured cream (optional)": (1.00, "5673649"),
    "Thai red curry paste": (2.30, "7563362"),
    "400g can coconut milk": (2.00, "7679943"),
    "Fresh coriander leaves": (2.00, "18695"),
    "Lemon": (0.55, "5797459"),
    "Rice": (2.70, "18802"),
    "Salmon steaks": (5.50, "6349272"),
    "Green chilli": (0.50, "1208242"),
    "Cucumber": (0.79, "152446"),
    "Can of tuna": (4.00, "6041045"),
    "White wine vinegar": (1.30, "2569207"),
    "200g Philadelphia soft cheese": (2.20, "7345715"),
    "Potato wedges or new potatoes": (1.50, "6311576/6141368"),
    "Peppercorn sauce": (1.20, "6923656"),
    "Dried chives": (0.80, "544339"),
    "Yellow pepper": (0.55, "1857071"),
    "Corned beef": (2.30, "2594051"),
    "Risotto rice": (2.40, "6125968"),
    "Shiitake mushrooms": (1.60, "4708261"),
    "Pork steaks": (3.60, "7452907"),
    "Green beans": (0.93, "7132612"),
    "Cumin": (0.80, "544313"),
    "Cinnamon": (0.80, "6684574"),
    "Wholegrain mustard": (2.65, "3667611"),
    "Pilau rice seasoning": (2.00, "59161"),
    "Prawns": (2.80, "6305703"),
    "Baby sweetcorn": (1.35, "6523635"),
    "Mangetout": (0.85, "5795246"),
    "Tomato and basil sauce": (1.15, "7458116"),
    "Breadcrumbs": (1.00, "5496030"),
    "Cod": (5.00, "6088572"),
    "Hash browns": (2.00, "3843261"),
    "Sausages": (3.50, "7600840"),
    "Streaky bacon": (2.25, "6345750"),
    "Bread": (1.30, "2160171"),
    "Orange juice": (1.15, "656042"),
    "Chicken thighs": (4.95, "6923652"),
    "Korma curry paste": (2.10, "5904835"),
    "Saffron": (2.45, "5615948"),
    "Yoghurt": (1.00, "3425334"),
    "Raisins": (1.50, "4960067"),
    "Flaked almonds": (1.50, "4960109"),
    "Naan bread": (0.75, "5215599"),
    "Apple and pear chutney": (1.95, "6210082"),
    "Cod fillets": (4.75, "6246480"),
    "Curry paste": (2.10, "5017664")
}
```

## Select random recipes

```python [ingredirandom.py]
import random
import datetime

from recipes import recipes
from costs import cost_lookup


def get_random_selections():
    k = int(input("Number of recipes to randomly choose?: "))
    selections = random.sample(recipes, k=k)
    return selections


def output_to_text_file(selections):
    now = datetime.datetime.now()
    file_path = "shopping-list-" + now.strftime("%d-%m-%Y") + ".txt"

    # UTF-8 so the pound sign is written the same way on every platform
    with open(file_path, "w", encoding="utf-8") as file:
        file.write("Shopping list for ")
        file.write(now.strftime("%B %d, %Y\n\n"))
        file.write("* = Ingredient is in multiple recipes\n\n")

        total_week_cost = 0
        all_ingredients = set()

        for i, recipe in enumerate(selections):
            if (i > 0):
                file.write("\n\n")

            total_recipe_cost = 0

            file.write(f"Recipe {i + 1}\n")
            file.write("____________________________\n")
            file.write(f"{recipe['name']}\n")
            file.write(f"{recipe['book']} - Page {recipe['page']}\n\n")

            for ingredient in recipe['ingredients']:
                if ingredient in cost_lookup:
                    ingredient_cost = float(cost_lookup[ingredient][0])
                    ingredient_product_id = cost_lookup[ingredient][1]

                    # An asterisk marks ingredients already seen earlier in the list
                    file.write(
                        '{:70s} {:20s} {:20s}'.format(
                            ingredient + "*" if ingredient in all_ingredients else ingredient,
                            "\u00a3" + str(ingredient_cost),
                            str(ingredient_product_id))
                    )
                    file.write("\n")

                    total_recipe_cost += ingredient_cost
                else:
                    file.write(
                        ingredient + "*" if ingredient in all_ingredients else ingredient)
                    file.write("\n")

                all_ingredients.add(ingredient)

            file.write(f"\nEstimated recipe cost: \u00a3{round(total_recipe_cost, 2)}\n")
            total_week_cost += total_recipe_cost

        file.write(f"\n\nEstimated week cost: \u00a3{round(total_week_cost, 2)}")

    print(f"Selections saved to {file_path}", end="\n")
    print("Happy cooking :)")


if __name__ == "__main__":
    print("Welcome to IngrediRandom!", end="\n")
    print(f"There are {len(recipes)} recipes in total.")
    selections = get_random_selections()
    output_to_text_file(selections)
```

You run `python ingredirandom.py`, enter the number of random recipes you want selecting, and the recipes, ingredients and costs are output to a dated text file such as `shopping-list-16-09-2023.txt`.
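
One thing to be aware of is that `random.sample` raises a `ValueError` if you ask for more recipes than the list contains. A possible refinement, sketched below, is to clamp the requested number first. The `choose_recipes` helper, its `seed` parameter and the three stand-in recipes are assumptions for illustration only and are not part of `ingredirandom.py`.

```python
import random

# Stand-in data; in the article the real list is imported from recipes.py.
sample_recipes = [
    {"name": "Hoisin Chicken Noodles"},
    {"name": "Corned Beef Hash with fried eggs"},
    {"name": "Thai Prawn Curry with rice"},
]


def choose_recipes(recipes, k, seed=None):
    """Pick k distinct recipes, clamping k to the range 0..len(recipes)."""
    k = max(0, min(k, len(recipes)))
    rng = random.Random(seed)  # passing a seed makes the selection repeatable
    return rng.sample(recipes, k)


# Asking for 5 when only 3 exist simply returns all 3 in a random order.
picked = choose_recipes(sample_recipes, k=5, seed=1)
print([recipe["name"] for recipe in picked])
```

The optional seed is handy if you want to regenerate the same week's selection later.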
In the text file below we can see the output of the program: our random meals, the page number of the recipe book for the instructions, and a shopping list with the ingredients required and their costs.

Remember the list below shows the **total** cost, that is, if you were starting with nothing and had to buy everything. You may well already have most of the ingredients, or be able to find them cheaper elsewhere. These are just suggestions. Groceries are becoming increasingly expensive, so always shop around and adapt the program! That's just the way I decided to build this program: I always know the cost of the core ingredients, while remembering that expensive one-off items like olive oil or butter will push that cost upwards.

```[shopping-list-16-09-2023.txt]
Shopping list for September 16, 2023

* = Ingredient is in multiple recipes

Recipe 1
____________________________
Crispy Fried Duck Breast with ginger dressing and fried rice
Your cookbook name - Page 180

Basmati rice          £2.2     18631
Carrots               £0.5     150208
Courgette             £0.75    6566770
Duck breasts          £6.0     7443861
Eggs                  £2.95    166781
Spring onions         £0.75    410212
Olive oil             £5.9     6722819
Fresh ginger          £0.6     6668284
Lime                  £1.0     5596923
Soy sauce             £1.9     6124290
Honey                 £1.47    5506364

Estimated recipe cost: £24.02


Recipe 2
____________________________
Beef Steak with balsamic onion and peppercorn sauce
Your cookbook name - Page 171

Peppercorn sauce      £1.2     6923656
Onion                 £1.0     5737702
Olive oil*            £5.9     6722819
Onion*                £1.0     5737702
Balsamic vinegar      £1.3     1554788
Brown sugar           £1.35    6345327
Potatoes              £1.7     1843017
Butter                £3.25    6858100
Salt                  £0.8     4938721
Pepper                £0.9     1352762
Rump steak            £6.2     7357125

Estimated recipe cost: £24.6


Recipe 3
____________________________
Hoisin Chicken Noodles
Your cookbook name - Page 59

Spring onions*        £0.75    410212
Fresh ginger*         £0.6     6668284
Garlic cloves         £2.0     6599892
Chicken breasts       £4.7     7648521
Mushrooms             £1.29    4110717
Chicken stock         £0.75    2687967
Soy sauce*            £1.9     6124290
Hoisin sauce          £1.8     6124274
Can of sweetcorn      £0.65    5986511
Fresh egg noodles     £1.5     5128622
Olive oil*            £5.9     6722819

Estimated recipe cost: £21.84


Estimated week cost: £70.46
```

## Adapting to your needs

You can add, remove or modify recipe entries in `recipes.py`. You can also update the costs in `costs.py` for each ingredient. If you order online, this should be easy to do as you shop; otherwise you can find the cost on your receipt.

I haven't added actual recipes to avoid copyright issues. However, I will say my main cook books are [HelloFresh Recipes That Work](https://www.amazon.co.uk/HelloFresh-Recipes-that-step-step/dp/1784724653/), [Nosh for Students](https://www.amazon.co.uk/NOSH-Students-Student-Cookbook-Recipe/dp/0993260985) and [Nosh for Graduates](https://www.amazon.co.uk/GRADUATES-cookbook-those-graduated-student/dp/0954317955/). Yes, these are simple but effective books. I am no expert, so simple is good for me. I entered my favourite recipes from these books into the `recipes.py` lookup and entered the costs from online grocery shopping at ASDA into the `costs.py` lookup. Voila!

I could also almost fully automate this process by turning the cost and recipe lookups into JSON files, storing them in GitHub, then having an AWS Lambda function read them, run ingredirandom.py and email me the recipes for the week. I might explore this in a future article.

## Bon appetit

This program works really well for mixing things up and enjoying learning new recipes. You can also adapt this code to fulfil any other random selection use case you may have. The major benefit is you can add more recipes you enjoy and remove the ones that you don't want to try again. It keeps your cooking skills sharp and, I hope you find like me, eases the stressful procurement part of cooking. This leaves you to gather everything you need upfront and then just enjoy the process of preparing, cooking and eating good clean fresh food at least a few times per week! 😆

How to create an interactive correlation heatmap using Danfo.js and Plotly

Sun, 10 Sep 2023 17:47:00 GMT

In this short article, we'll look at how to create a Pearson correlation heatmap using [Danfo.js](https://danfo.jsdata.org/) and [Plotly.js](https://plotly.com/javascript/) and then display it in an HTML page using JavaScript. I recently came across this problem whilst building the [Data Explorer Workbench tool](https://shedloadofcode.github.io/), in which I needed to calculate and display the correlation between variables in a dataset using only JavaScript. Data Explorer Workbench is a web-based tool for automated exploratory data analysis (EDA) where you can upload a CSV dataset and explore descriptive statistics, relationships and correlation. I was using Vue.js as the framework here, although you can adapt the steps to other frameworks or a static HTML file.

When it comes to data visualisation, heatmaps are a powerful tool for exploring relationships and patterns in your dataset. Heatmaps allow you to visualise the correlation between different variables, making it easier to identify trends and dependencies.

## What is a Correlation Heatmap?

A correlation heatmap is a graphical representation of the correlation matrix, which shows the correlation coefficients between multiple variables in a dataset. Each cell in the heatmap represents the correlation between two variables, with colors indicating the strength and direction of the correlation. Heatmaps are commonly used in data analysis to identify relationships between variables, especially in fields like finance, healthcare, and social sciences.

## Getting Started

Before we dive into creating a correlation heatmap, you'll need to have [Node](https://nodejs.org/en) installed on your system. Additionally, you'll need to install the [danfo](https://www.npmjs.com/package/danfojs-node) and [plotly](https://www.npmjs.com/package/plotly.js) libraries.
You can do this using Node and [npm](https://www.npmjs.com/). The package names below match the imports used in Step 1:

```
npm i danfojs
npm i plotly.js-dist-min
```

Once you have the required libraries installed, let's move on to the step-by-step process of creating a correlation heatmap.

## Step 0: Create the HTML chart placeholder

```html
<div id="correlation-heatmap"></div>
```

This gives us a div container where the correlation heatmap will be placed.

## Step 1: Importing the libraries

The first step is to import the necessary libraries:

```js
import * as dfd from "danfojs";
import Plotly from 'plotly.js-dist-min';
```

We use danfo for data manipulation and plotly for creating interactive visualisations.

## Step 2: The corr function

You will see in the next step that we require a `corr` function to calculate the Pearson correlation between each pair of variables.

```js
/*
 * Calculates Pearson correlation between
 * two arrays x and y.
 */
function corr(x, y) {
  // Compare only as many values as both arrays have,
  // without truncating (mutating) the input arrays
  const n = Math.min(x.length, y.length);
  let sumX = 0, sumY = 0, sumXY = 0, sumX2 = 0, sumY2 = 0;

  for (let i = 0; i < n; i++) {
    const xi = x[i];
    const yi = y[i];
    sumX += xi;
    sumY += yi;
    sumXY += xi * yi;
    sumX2 += xi * xi;
    sumY2 += yi * yi;
  }

  return (n * sumXY - sumX * sumY) /
    Math.sqrt((n * sumX2 - sumX * sumX) * (n * sumY2 - sumY * sumY));
}
```

## Step 3: Loading the data and displaying the heatmap

Now, you need to load your dataset into a DataFrame using danfo. For the purpose of this tutorial, let's assume you have a CSV file named your_data.csv containing your dataset.
You can load an example Titanic dataset from a GitHub repo as follows: ```js dfd.readCSV("https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv") .then(df => { df.head().print() /** * Generate heatmap * This needs to be in the format of * zValues = [ * [0.00, 0.00, 0.75, 0.75, 1.00], * [0.00, 0.00, 0.75, 1.00, 0.00], * [0.75, 0.75, 1.00, 0.75, 0.75], * [0.00, 1.00, 0.00, 0.75, 0.00], * [1.00, 0.00, 0.00, 0.75, 0.00] * ]; */ let zValues = []; let dfCopy = df.copy(); let columnsLength = dfCopy.shape[1]; let columnsToDrop = []; let numericColumns = dfCopy.selectDtypes([ 'int32', 'float32', ]); // Drop columns with high cardinality (many unique values) for (let i = 0; i < columnsLength; i++) { let column = dfCopy.columns[i]; // Skip if a numeric column as it will have lots of unique values // but this doesn't matter :) if (numericColumns.$columns.includes(column)) { continue; } let uniqueValuesCount = dfCopy.column(column).unique().$data.length; if (uniqueValuesCount > 5) { columnsToDrop.push(column); } } dfCopy.drop({ columns: columnsToDrop, inplace: true }); // Create dummy columns for categoric variables let dummies = dfCopy.getDummies(dfCopy); // Uncomment to debug: console.log("DUMMIES", dummies); columnsLength = dummies.$columns.length; for (let i = 0; i < columnsLength; i++) { let column = dummies.$columns[i]; // Uncomment to debug: console.log("COMPARING", column); let correlations = []; for (let j = 0; j < columnsLength; j++) { let comparisonColumn = dummies.$columns[j]; // Uncomment to debug: console.log("TO", comparisonColumn); let pearsonCorrelation = corr( dummies[column].$data, dummies[comparisonColumn].$data ).toFixed(2) correlations.push( pearsonCorrelation ); } zValues.push(correlations); } var xValues = dummies.$columns; var yValues = dummies.$columns; var colorscaleValue = [ [0, '#3D9970'], [1, '#001f3f'] ]; var data = [{ x: xValues, y: yValues, z: zValues, type: 'heatmap', colorscale: colorscaleValue, showscale: false }]; var 
layout = { autosize: false, width: window.innerWidth - 650, height: 700, annotations: [], xaxis: { ticks: '', side: 'top' }, yaxis: { ticks: '', ticksuffix: ' ', autosize: false } }; for ( var i = 0; i < yValues.length; i++ ) { for ( var j = 0; j < xValues.length; j++ ) { var currentValue = zValues[i][j]; if (currentValue != 0.0) { var textColor = 'white'; }else{ var textColor = 'black'; } var result = { xref: 'x1', yref: 'y1', x: xValues[j], y: yValues[i], text: zValues[i][j], font: { family: 'Arial', size: 12, color: 'rgb(50, 171, 96)' }, showarrow: false, font: { color: textColor } }; layout.annotations.push(result); } } Plotly.newPlot('correlation-heatmap', data, layout); }).catch(err=>{ console.log(err); }) ``` The length of this code can be made more concise by introducing functions. However, here we are performing a number of preprocessing steps before calculating the correlation coefficient with `corr`: * Reading the dataset with Danfo * Copying the dataset to work on it * Identifying the numeric type columns in the dataset * Dropping columns with high cardinality (many unique values) * Creating dummy columns for categoric variables ## Bonus: Using just plain HTML and JavaScript That's the whole process done with the heatmap created! If you prefer not to use Node and NPM with a framework, you can give this [minimal working example](https://github.com/shedloadofcode/danfo-plotly-correlation-heatmap/blob/main/test.html) using just plain HTML and JavaScript a go. In this example we are just importing both Danfo and Plotly from a [CDN](https://en.wikipedia.org/wiki/Content_delivery_network). ```html HTML 5 Boilerplate

``` This produces the below HTML page. ## Conclusion Creating a correlation heatmap is a valuable step in data analysis and visualisation. It helps you quickly identify relationships and patterns within your dataset, which can lead to valuable insights. In this article, we've demonstrated how to create a correlation heatmap using the Danfo and Plotly libraries in JavaScript. By following these steps, you can easily generate interactive heatmaps for your own datasets, enabling you to explore and understand your data more effectively. Remember that data visualisation is not only about creating pretty charts but also about gaining insights and making data-driven decisions. Heatmaps are just one of the many tools at your disposal for this purpose, and they can be a powerful addition to your data analysis toolkit. I am really excited by Danfo which brings Pandas style data manipulation and data analysis to JavaScript. I hope more articles utilising this library will be coming soon. If you enjoyed this article be sure to check out [other articles](/) on the site. If you have any questions feel free to leave a comment 👍

Eight ways to perform feature selection with scikit-learn

Sat, 05 Aug 2023 12:25:00 GMT

Feature selection is a crucial step in machine learning that involves selecting the most relevant features from a dataset. By eliminating irrelevant or redundant features, feature selection techniques can improve model performance and efficiency. In this guide, we'll explore some common feature selection techniques and provide code examples using the Boston Housing dataset.

The Boston Housing dataset contains information about housing prices in Boston. It consists of various features such as average number of rooms per dwelling, crime rate, and pupil-teacher ratio. Our goal is to select a subset of features that have the most impact on predicting house prices.

## Univariate Feature Selection

Univariate feature selection evaluates each feature individually, using a statistical test to score its relationship with the target variable. Let's visualise the feature scores using a bar plot.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.datasets import load_boston

# Load the Boston Housing dataset
# Note: load_boston was removed in scikit-learn 1.2, so these examples
# need an older scikit-learn release to run as written
data = load_boston()
X = data.data
y = data.target

# Perform univariate feature selection
selector = SelectKBest(score_func=f_regression, k=5)
X_new = selector.fit_transform(X, y)

# Get the selected feature indices
selected_indices = selector.get_support(indices=True)
selected_features = data.feature_names[selected_indices]

# Get the feature scores
scores = selector.scores_

# Plot the feature scores
plt.figure(figsize=(10, 6))
plt.bar(range(len(data.feature_names)), scores, tick_label=data.feature_names)
plt.xticks(rotation=90)
plt.xlabel('Features')
plt.ylabel('Scores')
plt.title('Univariate Feature Selection: Feature Scores')
plt.show()

print("Selected Features:")
print(selected_features)
```

Selected Features: ['INDUS' 'RM' 'TAX' 'PTRATIO' 'LSTAT']

In this example, we select the top 5 features using the f_regression score function and visualise the feature scores
using a bar plot. The selected features are the ones that have the strongest linear relationship with the target variable. If we had a categorical target instead of a continuous one, we might use chi2 instead of f_regression.

## Recursive Feature Elimination (RFE)

Recursive Feature Elimination (RFE) is an iterative method that starts with all features, fits the estimator, and repeatedly removes the least important feature (judged by the estimator's coefficients or feature importances) until the requested number remains. Let's visualise the feature rankings using a line plot.

```python
import matplotlib.pyplot as plt
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
from sklearn.datasets import load_boston

# Load the Boston Housing dataset
data = load_boston()
X = data.data
y = data.target

# Perform Recursive Feature Elimination
estimator = LinearRegression()
selector = RFE(estimator, n_features_to_select=5)
X_new = selector.fit_transform(X, y)

# Get the selected feature indices
selected_indices = selector.get_support(indices=True)
selected_features = data.feature_names[selected_indices]

# Get the feature rankings
rankings = selector.ranking_

# Plot the feature rankings
plt.figure(figsize=(10, 6))
plt.plot(range(1, len(rankings) + 1), rankings, marker='o')
plt.xticks(range(1, len(rankings) + 1), data.feature_names, rotation=90)
plt.xlabel('Features')
plt.ylabel('Rankings')
plt.title('Recursive Feature Elimination: Feature Rankings')
plt.show()

print("Selected Features:")
print(selected_features)
```

Selected Features: ['CHAS' 'NOX' 'RM' 'DIS' 'PTRATIO']

Here, we use LinearRegression as the estimator and select the top 5 features. We visualise the feature rankings using a line plot. Lower ranks indicate more important features, and all of the selected features share rank 1.

## L1 Regularisation (Lasso)

L1 regularisation, also known as Lasso regularisation, applies a penalty term to the linear regression model, encouraging sparse feature weights.
This results in some feature weights being driven to zero, effectively selecting only the most relevant features. Let's visualise the feature coefficients using a horizontal bar plot. ```python import matplotlib.pyplot as plt from sklearn.linear_model import Lasso from sklearn.datasets import load_boston # Load the Boston Housing dataset data = load_boston() X = data.data y = data.target # Perform L1 regularisation (Lasso) lasso = Lasso(alpha=0.1) lasso.fit(X, y) # Get the non-zero feature coefficients nonzero_coefs = lasso.coef_ selected_indices = nonzero_coefs != 0 selected_features = data.feature_names[selected_indices] nonzero_coefs = nonzero_coefs[selected_indices] # Plot the feature coefficients plt.figure(figsize=(10, 6)) plt.barh(range(len(nonzero_coefs)), nonzero_coefs, tick_label=selected_features) plt.xlabel('Coefficient Values') plt.ylabel('Features') plt.title('L1 Regularisation (Lasso): Feature Coefficients') plt.show() print("Selected Features:") print(selected_features) ``` Selected Features: ['CRIM' 'ZN' 'INDUS' 'CHAS' 'RM' 'AGE' 'DIS' 'RAD' 'TAX' 'PTRATIO' 'B' 'LSTAT'] In this example, we apply L1 regularisation with a regularisation strength (alpha) of 0.1. We visualise the non-zero feature coefficients using a horizontal bar plot. The selected features are the ones with non-zero coefficients in the Lasso model. ## Tree-Based Methods Tree-based methods, such as Random Forest and Gradient Boosting, inherently perform feature selection by evaluating the importance of each feature in the tree construction process. Let's visualise the feature importances using a bar plot. 
```python import matplotlib.pyplot as plt from sklearn.ensemble import RandomForestRegressor from sklearn.datasets import load_boston # Load the Boston Housing dataset data = load_boston() X = data.data y = data.target # Perform feature selection using Random Forest forest = RandomForestRegressor(n_estimators=100) forest.fit(X, y) # Get feature importances importances = forest.feature_importances_ # Sort feature importances in descending order sorted_indices = importances.argsort()[::-1] # Select the top k features k = 5 selected_features = data.feature_names[sorted_indices[:k]] top_importances = importances[sorted_indices[:k]] # Plot the feature importances plt.figure(figsize=(10, 6)) plt.bar(range(len(top_importances)), top_importances, tick_label=selected_features) plt.xticks(rotation=90) plt.xlabel('Features') plt.ylabel('Importance') plt.title('Tree-Based Methods: Feature Importances') plt.show() print("Selected Features:") print(selected_features) ``` Selected Features: ['RM' 'LSTAT' 'DIS' 'CRIM' 'NOX'] In this example, we use a Random Forest model with 100 estimators to calculate feature importances. We select the top 5 features based on their importance scores and visualise them using a bar plot. ## Principal Component Analysis (PCA) Principal Component Analysis (PCA) is a dimensionality reduction technique that transforms the original features into a new set of uncorrelated variables called principal components. Let's visualise the explained variance ratio using a bar plot. 
```python
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.datasets import load_boston

# Load the Boston Housing dataset
data = load_boston()
X = data.data
y = data.target

# Perform PCA
pca = PCA(n_components=5)
X_new = pca.fit_transform(X)

# Get the explained variance ratio
explained_variance = pca.explained_variance_ratio_

# Plot the explained variance ratio
plt.figure(figsize=(10, 6))
plt.bar(range(1, len(explained_variance) + 1), explained_variance)
plt.xlabel('Principal Components')
plt.ylabel('Explained Variance Ratio')
plt.title('Principal Component Analysis (PCA): Explained Variance Ratio')
plt.show()

# Get the loadings, transposed so that each row is one original feature
# and each column is one principal component
loadings = pca.components_.T

# Create a loading plot: one arrow per feature, showing its weight
# on the first (x axis) and second (y axis) principal component
plt.figure(figsize=(10, 6))
for loading, feature_name in zip(loadings, data.feature_names):
    plt.arrow(0, 0, loading[0], loading[1], head_width=0.05, head_length=0.1, fc='blue', ec='blue')
    plt.text(loading[0], loading[1], feature_name, fontsize=12, ha='center', va='center', color='black')
plt.axhline(y=0, color='gray', linestyle='--', linewidth=0.8)
plt.axvline(x=0, color='gray', linestyle='--', linewidth=0.8)
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title('Loading Plot: Feature Contributions to Principal Components')
plt.grid(True)
plt.show()
```

Selected Features: ['CHAS' 'INDUS' 'CRIM', 'ZN', 'NOX']

In this example, we keep the first 5 principal components, which capture the most variance in the data. We visualise the explained variance ratio of these components using a bar plot, and then plot the loadings of each original feature on the first two principal components to see how the features contribute to them. In the explained variance bar chart, the importance of the original variables is not directly represented by the bar heights as in feature importance plots.
Instead, PCA focuses on transforming the original features into a new set of uncorrelated variables called principal components. These principal components are linear combinations of the original features and are sorted in descending order of the amount of variance they capture. The explained variance ratio plot in the PCA example shows the proportion of the total variance in the dataset that each principal component explains. While this plot doesn't directly indicate which original features are the most important, it does help us understand the overall contribution of each principal component to the variability in the data. In general, when you perform PCA, the first few principal components tend to capture most of the variance in the dataset. Therefore, the original features that contribute the most to these early principal components can be considered more important in terms of explaining the dataset's variability. However, identifying which specific original features contribute most to a particular principal component can be challenging due to the linear combination nature of principal components. If you need to understand the relationship between the original features and specific principal components, you might need to perform further analysis, such as looking at the loadings of the principal components, which represent the contribution of each original feature to the construction of the principal component. In summary, in a PCA analysis, the focus is more on understanding the variability and relationships between variables rather than directly identifying the "most important" variables as you would in other feature selection methods. ## Correlation-based Feature Selection Correlation-based feature selection measures the correlation between each feature and the target variable, as well as the correlation between different features. Let's visualise the feature correlations using a heatmap. 
```python import matplotlib.pyplot as plt import numpy as np import seaborn as sns from sklearn.datasets import load_boston # Load the Boston Housing dataset data = load_boston() X = data.data y = data.target # Calculate feature correlations with target variable correlations = np.abs(np.corrcoef(X.T, y)[:X.shape[1], -1]) sorted_indices = correlations.argsort()[::-1] # Select the top k features k = 5 selected_features = data.feature_names[sorted_indices[:k]] top_correlations = correlations[sorted_indices[:k]] print("Selected Features with correlation:") print(selected_features) print(top_correlations) ``` Selected Features with correlation: ['LSTAT' 'RM' 'PTRATIO' 'INDUS' 'TAX'] [0.73766273 0.69535995 0.50778669 0.48372516 0.46853593] In this example, we calculate the absolute correlations between each feature and the target variable. We select the top 5 features with the highest correlations to the target variable y. ## Mutual Information Mutual information measures the statistical dependency between two variables. In the context of feature selection, it quantifies the amount of information that one feature provides about the target variable. Let's visualise the feature scores using a bar plot. 
```python import matplotlib.pyplot as plt from sklearn.feature_selection import SelectKBest, mutual_info_regression from sklearn.datasets import load_boston # Load the Boston Housing dataset data = load_boston() X = data.data y = data.target # Perform mutual information feature selection selector = SelectKBest(score_func=mutual_info_regression, k=5) X_new = selector.fit_transform(X, y) # Get the selected feature indices selected_indices = selector.get_support(indices=True) selected_features = data.feature_names[selected_indices] # Get the feature scores scores = selector.scores_ # Plot the feature scores plt.figure(figsize=(10, 6)) plt.bar(range(len(data.feature_names)), scores, tick_label=data.feature_names) plt.xticks(rotation=90) plt.xlabel('Features') plt.ylabel('Scores') plt.title('Mutual Information: Feature Scores') plt.show() print("Selected Features:") print(selected_features) ``` Selected Features: ['INDUS' 'NOX' 'RM' 'PTRATIO' 'LSTAT'] In this example, we select the top 5 features based on mutual information scores using the mutual_info_regression score function. We visualise the feature scores using a bar plot. This method is also good for datasets with a categorical target but instead of using 'mutual_info_regression' as the `score_func` we would import and use 'mutual_info_classif' instead. ## Sequential Feature Selection Sequential Feature Selection is a method that combines multiple feature subsets and evaluates their performance using a machine learning model. Let's visualise the feature performance using a line plot. 
```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression
from sklearn.datasets import load_boston

# Load the Boston Housing dataset
data = load_boston()
X = data.data
y = data.target

# Perform sequential feature selection
estimator = LinearRegression()
selector = SequentialFeatureSelector(estimator, n_features_to_select=5, direction='forward')
selector.fit(X, y)

# Get the selected feature indices
selected_indices = np.where(selector.support_)[0]
selected_features = data.feature_names[selected_indices]

# Get the feature performance (manually store performance scores).
# Features are added in column order, which is not necessarily
# the order in which the selector picked them.
performance = []
for step in range(1, len(selected_indices) + 1):
    subset_indices = selected_indices[:step]
    X_subset = X[:, subset_indices]
    predictions = LinearRegression().fit(X_subset, y).predict(X_subset)
    # Negative mean absolute error on the training data (closer to 0 is better)
    score = -np.mean(np.abs(predictions - y))
    performance.append(score)

# Plot the feature performance
plt.figure(figsize=(10, 6))
plt.plot(range(1, len(performance) + 1), performance, marker='o')
plt.xticks(range(1, len(performance) + 1), selected_features, rotation=90)
plt.xlabel('Features')
plt.ylabel('Performance')
plt.title('Sequential Feature Selection: Feature Performance')
plt.show()

print("Selected Features:")
print(selected_features)
```

Selected Features: ['CRIM' 'CHAS' 'RM' 'PTRATIO' 'LSTAT']

In this example, we use LinearRegression as the estimator and select the top 5 features using the forward selection approach. We visualise the feature performance, measured as the negative mean absolute error on the training data, using a line plot.

## Conclusion

In conclusion, feature selection techniques are essential for improving machine learning models by selecting the most relevant features and reducing dimensionality. In this guide, we explored various techniques and applied them to the Boston Housing dataset.
By incorporating these feature selection techniques into your machine learning workflow, you can enhance model performance, reduce overfitting, and gain better insights into the underlying data patterns. Consider experimenting with different techniques and evaluating their impact on your specific dataset and task to identify the most effective feature subset. Ultimately, feature selection empowers you to build more robust, interpretable models that deliver accurate predictions and valuable insights.
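One way to act on that advice is to cross-validate the same model with different numbers of selected features. The sketch below is an illustration under assumptions rather than part of the analysis above: it uses synthetic data from `make_regression` in place of the Boston data (so it also runs on scikit-learn versions where `load_boston` is no longer available), the sample sizes and noise level are arbitrary, and the helper name `compare_k` is hypothetical.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline


def compare_k(X, y, k_values, cv=5):
    """Return {k: mean cross-validated R^2} for SelectKBest + LinearRegression."""
    results = {}
    for k in k_values:
        # The selector sits inside the pipeline so it is refit on each
        # training fold and never sees the validation fold.
        model = make_pipeline(
            SelectKBest(score_func=f_regression, k=k),
            LinearRegression(),
        )
        scores = cross_val_score(model, X, y, cv=cv, scoring="r2")
        results[k] = float(np.mean(scores))
    return results


# Synthetic stand-in data: 13 features, 5 of which carry signal (assumed values).
X, y = make_regression(
    n_samples=300, n_features=13, n_informative=5, noise=10.0, random_state=42
)

if __name__ == "__main__":
    for k, r2 in compare_k(X, y, k_values=[1, 3, 5, 13]).items():
        print(f"k={k:2d}  mean R^2={r2:.3f}")
```

Swapping `SelectKBest` for `RFE` or `SequentialFeatureSelector` in the pipeline gives a like-for-like comparison between the techniques covered in this guide.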

Understanding Explainable AI (XAI) for classification, regression and clustering with Python

Sat, 08 Jul 2023 17:47:00 GMT

## Introduction Artificial Intelligence (AI) has become an integral part of our lives, with its applications spanning across various domains. However, one major concern associated with AI is its lack of transparency and explainability. In recent years, there has been a growing demand for Explainable AI (XAI) techniques that aim to shed light on the decision-making processes of AI models. In this blog post, we will explore the concepts of XAI in the context of classification, regression, and clustering, and understand how these techniques can enhance the interpretability and trustworthiness of AI systems. The primary goal of AI and machine learning is to build models that can analyse and interpret complex data, recognise patterns, make predictions or decisions, detect anomalies, optimise processes and adapt their behavior based on new information without too much human expertise or explicit programming. Use the contents menu above to jump to classification, regression or clustering examples based on your interests. I carried out these analyses in the Spyder IDE. ## Classification and Explainable AI Classification is a fundamental task in AI that involves assigning input data points to predefined categories or classes. Explainable AI techniques in classification aim to provide insights into how a model arrived at a particular classification decision. Let's take a closer look at some XAI methods commonly used in classification: * Feature Importance: Feature importance techniques help identify which input features contribute the most to the classification decision. These methods assign scores or weights to each feature, allowing us to understand the relative importance of different inputs. * Rule Extraction: Rule extraction methods attempt to extract a set of human-interpretable rules from a trained classification model. These rules provide a transparent representation of how the model makes decisions, enabling easier comprehension. 
* Local Explanations: Local explanation methods focus on explaining individual predictions by highlighting the relevant features and their impact on the decision. Techniques like LIME (Local Interpretable Model-agnostic Explanations) generate locally faithful explanations that explain model behavior at specific instances. Explainable models aim to address the "black box" nature of traditional classification models by providing insights into the underlying factors and reasoning behind each classification prediction. Here are some popular explainable classification models: * Decision Trees: Decision trees are intuitive and transparent models that make decisions based on a sequence of rules. Each internal node represents a decision based on a specific feature, and each leaf node represents a class label. Decision trees provide a clear path of decision-making, making them inherently explainable. * Rule-Based Models: Rule-based models generate a set of if-then rules that define the decision boundaries of the classification model. These rules are typically human-readable and provide a transparent representation of the decision-making process. * Logistic Regression with L1 Regularisation: Logistic regression models with L1 regularisation can result in sparse solutions where only a subset of the input features is used for classification. This sparsity property allows for feature selection, indicating which features are most important for the classification decision. ## Decision Tree Classifier example To demonstrate the process, we will use a scikit-learn Decision Tree Classifier with the Titanic dataset to train a model to predict whether a passenger survived the disaster. It's a common and well known dataset, so perfect for learning the XAI process. We first import packages, read and prepare the dataset for the model, and split the data into training and testing sets. 
The training set (80% of the data) will be used to train the model, and the test set (20% of the data) acts as 'unseen data' to see how well the model works. Finally, we create a Decision Tree Classifier and train the model on the training set.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report
import matplotlib.pyplot as plt
import seaborn as sns

# Load the dataset
url = 'https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv'
data = pd.read_csv(url)

# Handle missing values
data.fillna(value={'Age': data['Age'].median()}, inplace=True)
data.fillna(value={'Embarked': data['Embarked'].mode()[0]}, inplace=True)

# Remove unnecessary columns
data.drop(['PassengerId', 'Name', 'Ticket', 'Cabin'], axis=1, inplace=True)

# Encode categorical variables
data = pd.get_dummies(data, columns=['Sex', 'Embarked'])

# Split the data into features and target variable
X = data.drop('Survived', axis=1)
y = data['Survived']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize the decision tree classifier
model = DecisionTreeClassifier(max_depth=3)

# Train the model
model.fit(X_train, y_train)
```

The prepared data looks like this.

We now want to explain how well this model has performed, which features are important to the model's decision making and how well we expect it to perform on new data. We can first check training set accuracy as a benchmark, along with the feature importances. Later we will check the testing set (unseen data) accuracy.
```python
# Assess accuracy
train_accuracy = round(model.score(X_train, y_train) * 100, 2)

# Plot the feature importances
importances = model.feature_importances_
indices = np.argsort(importances)[::-1]
feature_names = X_train.columns
sorted_feature_names = feature_names[indices]

plt.figure()
plt.title("Feature importance")
plt.bar(range(X_train.shape[1]), importances[indices], align="center")
plt.xticks(range(X_train.shape[1]), sorted_feature_names, rotation='vertical')
plt.xlabel("Feature")
plt.ylabel("Importance")
plt.show()
```

The training accuracy returns 83.43% and the feature importances show that Sex_female, Pclass and Age have the largest influence on the model's decisions. So this model correctly classifies 83.43% of the training data. Not a bad start. We can further break this down by visualising the decision tree.

```python
# Plot the decision tree
fig = plt.figure(figsize=(35, 15))
plot = tree.plot_tree(model,
                      feature_names=X.columns,
                      class_names=['Not Survived', 'Survived'],
                      filled=True,
                      fontsize=18)
plt.suptitle(f"Model accuracy score = {train_accuracy}%\nTraining sample = {len(X_train)} rows", fontsize=18)
plt.savefig("tree.png")
```

Let's interpret how the decision tree would classify a 30 year old male named Mike who was in passenger class 3. Because Mike is male, his Sex_female value is 0.

> 1st condition
>
> Sex_female (Mike=0) <= 0.5 ~ True
>
> Mike fulfils the condition; we move to the left side of the tree.

> 2nd condition
>
> Age (Mike=30.0) <= 6.5 ~ False
>
> Mike doesn't fulfil the condition; we move to the right side of the tree.

> 3rd condition
>
> Pclass (Mike=3) <= 1.5 ~ False
>
> Mike doesn't fulfil the condition; we move to the right side of the tree.

> Last node
>
> The ultimate node, the leaf, tells us that the training dataset contained 354 males aged over 6.5 with a passenger class more than 1.5 of which
> 42 survived (1) but 312 (0) didn't survive.
Therefore, the chances of Mike surviving according to this model are 42 divided by 354:

42 / 354 = 0.1186440677966102

We get the answer that Mike had an 11.86% chance of surviving the Titanic accident and can understand how the model arrived at such a decision. We can confirm this later when passing in brand new data for the model to predict on.

Things to remember when interpreting decision tree diagrams:

* Nodes: Each node in the tree represents a decision point based on a specific feature and threshold. The topmost node is the root node, and subsequent nodes are internal nodes. The leaf nodes represent the final predictions.
* Splits: The edges or branches between nodes indicate the splits based on the feature and threshold values. If a sample's feature value is less than or equal to the threshold, it follows the left branch; otherwise, it follows the right branch.
* Gini Impurity or Entropy: The `plot_tree` visual shows the impurity of each node, using Gini impurity by default or entropy if the model was configured that way. Lower values indicate more homogeneous nodes, so a good split produces child nodes with lower impurity than their parent. The Gini impurity is 0 for a perfectly pure node (all elements belong to the same class) and reaches its maximum when elements are evenly distributed across all classes. For a two-class problem like this one that maximum is 0.5; with k classes it is 1 - 1/k.
* Colors: By setting `filled=True` in the `plot_tree` function, the plot is filled with colors to represent the majority class in each node. The color intensity reflects how strongly that class dominates the node.
* Samples: The number of samples or observations that reach each node. It provides insights into the data distribution and the number of instances at different decision points.
* Value: The count of samples belonging to each class at a specific node or leaf. This provides a breakdown of the samples in that node based on their class labels. For example, the top node shows value=[444, 268] which means 444 did not survive and 268 survived.
* Class: The class the tree would predict at that node, which is the majority class among the samples that reach it. At each internal node a decision is made based on a feature and its threshold, leading to a different branch depending on whether the condition is satisfied or not. Eventually, the tree reaches the leaf nodes, which correspond to the final predicted classes.
* Feature Importance: The decision tree visual allows you to infer feature importance based on the position and depth of the features within the tree. Features closer to the root node tend to be more influential in the decision-making process.

We can now also get a sense for how the model performed overall on the testing set by using a confusion matrix. We can see that the prediction success drops to 79.89% when applied to the test set (unseen data). Of the total test set of 179 records, the model predicted 143 (92 + 51) correctly and 36 (13 + 23) incorrectly. This is an accuracy score of 143 / 179 = 0.79888 which confirms our score.

```python
# Plot a confusion matrix to assess prediction success
y_pred = model.predict(X_test)
test_accuracy = round(accuracy_score(y_test, y_pred) * 100, 2)
cm = confusion_matrix(y_test, y_pred)

plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues")
plt.title(f"Accuracy score = {test_accuracy}%\nTest sample = {len(X_test)} rows")
plt.xlabel("Predicted Labels")
plt.ylabel("True Labels")
plt.show()
```

The same information can be found in the classification report.
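Before moving on to the report, it can help to see how the headline metrics fall out of the four confusion matrix counts quoted above. This is a minimal sketch in plain Python rather than scikit-learn. One assumption to flag: the text gives the counts 92, 51, 13 and 23 but not which error type is which, so the sketch assumes 13 false positives and 23 false negatives, which is the assignment that matches the classification report.

```python
# Counts from the test-set confusion matrix described above.
# Assumption: 13 = false positives, 23 = false negatives (the article
# does not say which is which; this choice matches the report figures).
tn, fp, fn, tp = 92, 13, 23, 51

total = tn + fp + fn + tp

# Accuracy: share of all predictions that were correct
accuracy = (tp + tn) / total

# Precision for the "survived" class: of those predicted to survive, how many did
precision = tp / (tp + fp)

# Recall for the "survived" class: of those who survived, how many were found
recall = tp / (tp + fn)

# F1: harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy  = {accuracy:.4f}")   # 0.7989
print(f"precision = {precision:.4f}")  # 0.7969
print(f"recall    = {recall:.4f}")     # 0.6892
print(f"f1-score  = {f1:.4f}")         # 0.7391
```

Working through the arithmetic once by hand like this makes the report columns easier to read, because each figure is just a ratio of these four counts.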
The classification report in scikit-learn provides a clear and concise summary of the model's performance for each class, as well as overall performance metrics.

```python
# Produce a classification report
report = classification_report(
    y_true=y_test,
    y_pred=y_pred,
    output_dict=True
)
report = pd.DataFrame(report)
```

| | 0 | 1 | accuracy | macro avg | weighted avg |
| --------- | -------- | -------- | -------- | --------- | ------------ |
| precision | 0.8 | 0.796875 | 0.798883 | 0.798438 | 0.798708 |
| recall | 0.87619 | 0.689189 | 0.798883 | 0.78269 | 0.798883 |
| f1-score | 0.836364 | 0.73913 | 0.798883 | 0.787747 | 0.796167 |
| support | 105 | 74 | 0.798883 | 179 | 179 |

This classification report shows:

* Precision: The precision for each class is the ratio of true positives (correctly predicted instances) to the sum of true positives and false positives (instances incorrectly predicted as positive). It measures the accuracy of positive predictions. Precision is reported for each class.
* Recall: The recall, also known as sensitivity or true positive rate, for each class is the ratio of true positives to the sum of true positives and false negatives (instances incorrectly predicted as negative). It measures the model's ability to correctly identify positive instances. Recall is reported for each class.
* F1-score: The F1-score is the harmonic mean of precision and recall. It provides a single metric that balances both precision and recall. The F1-score is reported for each class. The closer it is to 1, the better the model.
* Support: The support indicates the number of occurrences of each class in the true labels. It represents the number of samples belonging to each class.
* Accuracy: The accuracy is the proportion of correctly classified instances (both true positives and true negatives) to the total number of instances. It provides an overall measure of the model's performance.
* Macro average: The macro average is the unweighted average of precision, recall, and F1-score across all classes. It treats all classes equally, regardless of class imbalance.
* Weighted average: The weighted average is the average of precision, recall, and F1-score across all classes, weighted by the support (number of samples) of each class. It takes class imbalance into account, so larger classes contribute more to the result.

We can apply this model to brand new unseen data. In this example we have 4 new passengers: 2 males and 2 females.

```python
# Pass in new unseen data to the model and get a prediction
columns = ["Pclass", "Age", "SibSp", "Parch", "Fare", "Sex_female", "Sex_male", "Embarked_C", "Embarked_Q", "Embarked_S"]

unseen_data = {
    "Pclass": [3, 1, 2, 1],
    "Age": [30, 15, 50, 28],
    "SibSp": [1, 2, 0, 0],
    "Parch": [0, 0, 0, 0],
    "Fare": [20.0, 20.0, 20.0, 35.5],
    "Sex_female": [0, 1, 1, 0],
    "Sex_male": [1, 0, 0, 1],
    "Embarked_C": [0, 1, 0, 0],
    "Embarked_Q": [0, 0, 1, 0],
    "Embarked_S": [1, 0, 0, 1]
}

unseen_df = pd.DataFrame(unseen_data, columns=columns)
predictions = model.predict(unseen_df)
probability = pd.DataFrame(model.predict_proba(unseen_df), columns=["Did Not Survive %", "Survived %"])
unseen_df["Survived Prediction"] = predictions
unseen_df["Survived Probability"] = probability["Survived %"]
```

Here are the results showing both female passengers are predicted to survive with a 96.87% probability, whereas both male passengers are not predicted to survive, with 11.86% (this profile matches Mike from earlier!) and 32.96% probability.

Decision trees can be prone to overfitting as there is only one 'tree'. A Random Forest model can reduce this by combining many trees, each trained on a subset of the data. You can still output feature importances with a Random Forest model. Random Forests are generally more accurate, but are harder to explain to others! I will cover Logistic Regression and Random Forest models for classification in another article.
Both are good alternative options. The maximum depth of a decision tree determines the number of levels in the tree and directly impacts the complexity of the decision boundary. By setting a higher maximum depth, the decision tree can capture more complex relationships in the data, potentially resulting in a more intricate decision boundary. Conversely, reducing the maximum depth can lead to a simpler decision boundary. ## Rules-based Classifier example An alternative approach to the Titanic classification problem is to use a rules based approach. Rule-based models typically provide deterministic predictions (0 or 1) based on the conditions of the rules. They do not inherently provide probabilistic outputs or confidence levels associated with predictions, which can be valuable for certain applications. Despite these limitations, rule-based models can still be valuable in certain scenarios, especially when interpretability and explainability are essential requirements. They are often used in domains where human-understandable decision rules are preferred, such as expert systems, regulatory compliance, or auditing. 
Here's an example of implementing a rules based model in Python using the Titanic dataset: ```python data = pd.read_csv('https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv') # Define rules rules = [ {'condition': (data['Sex'] == 'female') & (data['Pclass'] <= 2) & (data['Age'] <= 50), 'prediction': 1}, {'condition': (data['Sex'] == 'female') & (data['Pclass'] <= 2) & (data['Age'] > 50), 'prediction': 0}, {'condition': (data['Sex'] == 'female') & (data['Pclass'] > 2), 'prediction': 1}, {'condition': (data['Sex'] == 'male') & (data['Age'] <= 10), 'prediction': 1}, {'condition': (data['Sex'] == 'male') & (data['Age'] > 10) & (data['Fare'] > 20), 'prediction': 1} ] # Apply rules to make predictions predictions = [] for rule in rules: condition = rule['condition'] prediction = rule['prediction'] predictions.append(condition & (data['Survived'] == prediction)) # Combine predictions final_prediction = pd.concat(predictions, axis=1).any(axis=1) data["Predicted"] = final_prediction.replace({True: 1, False: 0}) # Calculate accuracy rules_based_model_accuracy = sum(final_prediction == data['Survived']) / len(data) ``` We define a list of rules, where each rule consists of a condition and a prediction. The condition is a boolean expression based on the features in the dataset, and the prediction represents the outcome if the condition is satisfied. We then iterate over the rules and apply them to the dataset to make predictions. Each rule is evaluated as a boolean condition, and the predictions are stored in a list. Finally, we combine the predictions using the logical OR operation, and compare the final prediction with the actual target variable ('Survived') to calculate the accuracy of the rule-based model. The accuracy of this model is 91.58% which suggests the rules are quite overfit, but that's okay if we want rigid well defined rules that are easily explainable, it's a trade off. 
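One thing to be aware of in the loop above: each rule's condition is combined with `data['Survived'] == prediction`, so the true label feeds into the prediction itself, and that is likely to flatter the accuracy figure. Below is a minimal sketch of a variant that predicts from the features alone. It uses plain dictionaries instead of a DataFrame so that it runs without dependencies. The `predict_survival` and `rule_accuracy` helpers, the first-matching-rule-wins ordering and the default prediction of 0 are all assumptions made for illustration, not the original method, and the sample passengers are made up.

```python
# Each rule is (condition, prediction). Conditions only look at features,
# never at the 'Survived' label.
RULES = [
    (lambda p: p["Sex"] == "female" and p["Pclass"] <= 2 and p["Age"] <= 50, 1),
    (lambda p: p["Sex"] == "female" and p["Pclass"] <= 2 and p["Age"] > 50, 0),
    (lambda p: p["Sex"] == "female" and p["Pclass"] > 2, 1),
    (lambda p: p["Sex"] == "male" and p["Age"] <= 10, 1),
    (lambda p: p["Sex"] == "male" and p["Age"] > 10 and p["Fare"] > 20, 1),
]


def predict_survival(passenger, rules=RULES, default=0):
    """Return the prediction of the first rule that matches (assumed ordering)."""
    for condition, prediction in rules:
        if condition(passenger):
            return prediction
    # Assumption: passengers matched by no rule are predicted not to survive
    return default


def rule_accuracy(passengers):
    """Share of passengers whose prediction equals the known label."""
    correct = sum(predict_survival(p) == p["Survived"] for p in passengers)
    return correct / len(passengers)


# Made-up passengers purely for illustration (not rows from the Titanic dataset)
sample = [
    {"Sex": "female", "Pclass": 1, "Age": 30, "Fare": 80.0, "Survived": 1},
    {"Sex": "female", "Pclass": 2, "Age": 60, "Fare": 25.0, "Survived": 1},
    {"Sex": "male", "Pclass": 3, "Age": 8, "Fare": 15.0, "Survived": 1},
    {"Sex": "male", "Pclass": 3, "Age": 40, "Fare": 8.0, "Survived": 0},
]

sample_predictions = [predict_survival(p) for p in sample]
print(sample_predictions)      # [1, 0, 1, 0]
print(rule_accuracy(sample))   # 0.75
```

Because the label is only used when scoring, the accuracy from this version is a fairer measure of how well the rules themselves work.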
## Regression and Explainable AI Regression is a type of supervised learning task that predicts continuous numerical values based on input variables. Explainable AI techniques in regression help us understand how the model estimates the relationship between the input features and the target variable. Here are some common XAI methods used in regression: Partial Dependence Plots: Partial dependence plots visualize the relationship between a target variable and one or more input features while keeping other features fixed. These plots provide insights into how changes in the input variables impact the predicted outcome. Feature Contribution: Feature contribution methods quantify the impact of each input feature on the regression model's predictions. They help identify the most influential features and their corresponding effects, aiding interpretability. Model Simplification: Model simplification techniques aim to create simpler, more interpretable models that approximate the behavior of complex regression models. This simplification enhances transparency and enables easier comprehension of the underlying relationships. ## Linear Regression example We will use a scikit-learn Linear Regression model with the Boston Housing dataset to train a model to predict house prices. It's another well known dataset. There are 14 attributes in each case of the dataset. They are: * CRIM - per capita crime rate by town * ZN - proportion of residential land zoned for lots over 25,000 sq.ft. * INDUS - proportion of non-retail business acres per town. 
* CHAS - Charles River dummy variable (1 if tract bounds river; 0 otherwise) * NOX - nitric oxides concentration (parts per 10 million) * RM - average number of rooms per dwelling * AGE - proportion of owner-occupied units built prior to 1940 * DIS - weighted distances to five Boston employment centres * RAD - index of accessibility to radial highways * TAX - full-value property-tax rate per $10,000 * PTRATIO - pupil-teacher ratio by town * B - 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town * LSTAT - % lower status of the population * MEDV / target - Median value of owner-occupied homes in $1000's We follow the same pattern as our first example. ```python import math import numpy as np import pandas as pd from sklearn.model_selection import train_test_split from sklearn.linear_model import LinearRegression from sklearn.metrics import ( mean_squared_error, mean_absolute_error, median_absolute_error, r2_score ) import matplotlib.pyplot as plt import seaborn as sns # Load the Boston Housing dataset from sklearn.datasets import load_boston boston = load_boston() # Create a DataFrame from the dataset data = pd.DataFrame(boston.data, columns=boston.feature_names) data['target'] = boston.target # Split the data into training and testing sets X_train, X_test, y_train, y_test = train_test_split( data.drop('target', axis=1), data['target'], test_size=0.2, random_state=42 ) # Create and train the linear regression model model = LinearRegression() model.fit(X_train, y_train) # Make predictions on the test set y_pred = model.predict(X_test) ``` The correlation coefficient ranges from -1 to 1. If the value is close to 1, it means that there is a strong positive correlation between the two variables. When it is close to -1, the variables have a strong negative correlation. We can now evaluate the accuracy of the model. 
```python
# Calculate the residuals
residuals = y_test - y_pred
results = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred, 'Residuals': residuals, 'Absolute Residuals': abs(residuals)})

# Identify incorrect predictions
results['Prediction Status'] = results['Absolute Residuals'] <= 5
close_predictions_count = len(results[results['Absolute Residuals'] <= 5])
results['Prediction Status'] = results['Prediction Status'].replace({
    True: 'Prediction +/- $5000',
    False: 'Prediction > $5000'
})

# Evaluate the model
print('Mean Square Error = ' + str(mean_squared_error(y_test, y_pred)))
print('Root Mean Square Error = ' + str(math.sqrt(mean_squared_error(y_test, y_pred))))
print('Mean Absolute Error = ' + str(mean_absolute_error(y_test, y_pred)))
print('Median Absolute Error = ' + str(median_absolute_error(y_test, y_pred)))
print('R2 = ' + str(r2_score(y_test, y_pred)))
print('')
print('% within +/- $5000 = ' + str(close_predictions_count / len(results)))
```

| Evaluation metric | Value |
| ---------------------- | -------- |
| Mean Square Error | 24.29112 |
| Root Mean Square Error | 4.928602 |
| Mean Absolute Error | 3.189092 |
| Median Absolute Error | 2.324332 |
| R2 | 0.668759 |
| % within +/- $5000 | 0.862745 |

In general, an R2 value of 0.67 means that approximately 67% of the variation in the target variable is explained by the regression model. This implies that the model captures a substantial portion of the underlying patterns in the data and performs better than simply using the mean value of the target variable for prediction. However, it also indicates that there is still some unexplained variation in the target variable that the model does not account for.

* Mean Squared Error (MSE): MSE is the average of the squared residuals and measures how close the fitted line is to the data points. Because the errors are squared, it is expressed in squared units of the target variable.
* Root Mean Squared Error (RMSE): RMSE is the square root of the mean squared error and provides an interpretable metric in the same unit as the target variable. It penalizes larger errors more heavily than the Mean Absolute Error (MAE) does.
* Mean Absolute Percentage Error (MAPE): MAPE measures the average percentage difference between the predicted and actual values. It is particularly useful when the scale of the target variable varies significantly.
* Coefficient of Determination (Adjusted R-squared): R-squared measures the proportion of the variance in the target variable explained by the regression model. Adjusted R-squared adjusts for the number of features in the model, penalizing the addition of irrelevant features.

We can now use the scatter plot below to compare the actual target values with the predicted values. This visualisation helps assess how closely the model's predictions align with the true values. I have highlighted those predictions that were within +/- $5000 as these can be assumed to be accurate.

```python
# Visualize actual vs predicted plot
plt.figure(figsize=(15, 6))
sns.scatterplot(x='Actual', y='Predicted', hue='Prediction Status', data=results)
sns.lineplot(x=results['Actual'], y=results['Actual'], color='black', label='Perfect Prediction')
plt.title(f'Testing set = {len(y_test)} rows\nActual vs. Predicted')
plt.xlabel('Actual Values ($1000)')
plt.ylabel('Predicted Values ($1000)')
plt.show()
```

Residuals, in the context of regression analysis, refer to the differences between the observed (actual) values and the predicted values obtained from a regression model. By examining the residuals, we can assess how well the regression model captures the patterns and trends in the data. A desirable regression model should have residuals that exhibit certain properties, such as being normally distributed around zero, showing no systematic patterns or trends, and having consistent variability across the range of the predicted values.
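To make the residual-based metrics above more concrete, here is a small worked sketch that computes them by hand with only the Python standard library. The five actual and predicted values are made up for illustration (they are not taken from the housing dataset), and the variable names are my own.

```python
import math
from statistics import mean, median

# Made-up values in $1000s, purely for illustration
actual = [24.0, 21.5, 34.0, 18.0, 15.0]
predicted = [26.0, 20.0, 30.0, 19.0, 21.0]

# Residual = actual - predicted, matching the convention used in the article
residuals = [a - p for a, p in zip(actual, predicted)]
abs_residuals = [abs(r) for r in residuals]

mse = mean(r ** 2 for r in residuals)   # average squared error
rmse = math.sqrt(mse)                   # back in the target's units
mae = mean(abs_residuals)               # average absolute error
medae = median(abs_residuals)           # robust to a few large misses

# R2 = 1 - (residual sum of squares / total sum of squares)
ss_res = sum(r ** 2 for r in residuals)
ss_tot = sum((a - mean(actual)) ** 2 for a in actual)
r2 = 1 - ss_res / ss_tot

# Share of predictions within +/- 5 (i.e. $5000)
within_5 = sum(r <= 5 for r in abs_residuals) / len(abs_residuals)

print(f"MSE = {mse:.2f}, RMSE = {rmse:.2f}, MAE = {mae:.2f}, MedAE = {medae:.2f}")
print(f"R2 = {r2:.4f}, within +/- 5 = {within_5:.2f}")
```

In this toy data a single miss of 6 pulls the RMSE (about 3.44) well above the MAE (2.9), which shows how squaring the residuals gives large errors extra weight.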
We can create a residuals plot like the one below. The `residuals` were calculated by subtracting the predicted values `y_pred` from the actual values `y_test`. The `residplot()` function from seaborn is used to create the residuals vs. predicted values plot. It automatically fits and plots a linear regression line to the data points. The plot displays the relationship between the predicted values and the residuals. The horizontal line at y=0 serves as a reference line to indicate where the residuals should ideally be centered. Residuals above the line indicate overestimation, while residuals below the line indicate underestimation. ```python # Create the residuals vs. predicted values plot using seaborn plt.figure(figsize=(15, 6)) sns.residplot(x=y_pred, y=residuals) plt.axhline(y=0, color='red', linestyle='--') plt.title('Residuals vs. Predicted Values') plt.xlabel('Predicted Values ($1000)') plt.ylabel('Residuals ($1000)') plt.show() ``` Visualising the distribution of residuals can help too. A histogram or a kernel density plot of the residuals can help assess if they are normally distributed. Deviations from normality may indicate model misspecification or the presence of outliers. We can see in this distribution that most residuals are within $5000 either way which we also found in our earlier actual vs predicted chart. ```python # Create a histogram of residuals using seaborn plt.figure(figsize=(15, 6)) sns.histplot(residuals, kde=True) plt.title('Distribution of Residuals') plt.xlabel('Residuals ($1000)') plt.ylabel('Frequency') plt.show() ``` Finally, to figure out which features are most important to this model's predictions we can examine their coefficients to provide insights into the relationships between the features and the target variable. I have used `abs()` to rank absolute coefficients, regardless of whether they were positive or negative relationships. 
```python
# Interpret the model
coefficients = pd.DataFrame({
    'Feature': list(X_train.columns.values),
    'Coefficient': model.coef_,
    'Absolute Coefficient': abs(model.coef_)
})
feature_importance = coefficients.sort_values('Absolute Coefficient', ascending=False).reset_index(drop=True)
```

| Feature | Coefficient | Absolute Coefficient |
| ------- | ----------- | -------------------- |
| NOX | -17.2026 | 17.20263 |
| RM | 4.438835 | 4.438835 |
| CHAS | 2.784438 | 2.784438 |
| DIS | -1.44787 | 1.447865 |
| PTRATIO | -0.91546 | 0.915456 |
| LSTAT | -0.50857 | 0.508571 |
| RAD | 0.26243 | 0.26243 |
| CRIM | -0.11306 | 0.113056 |
| INDUS | 0.040381 | 0.040381 |
| ZN | 0.03011 | 0.03011 |
| B | 0.012351 | 0.012351 |
| TAX | -0.01065 | 0.010647 |
| AGE | -0.0063 | 0.006296 |

Keep in mind that these are raw coefficients on each feature's original scale, so a large value such as the one for NOX partly reflects the small units that feature is measured in.

We can confirm these relationships using a pairplot with high coefficient features plotted against the target (median house price).

```python
# Confirm feature importance with correlation pairplot
plt.figure(figsize=(30, 20))
sns.pairplot(data, y_vars=['target'], x_vars=['PTRATIO', 'NOX', 'RM', 'LSTAT', 'AGE'])
plt.show()
```

We can further enhance our understanding using a correlation heatmap for all features. I have opted to set a threshold of more than 0.4 or less than -0.4 here to only display important correlations, which makes this visual much easier to read. You can just pass in `correlation` instead of `masked_corr_matrix` if you want to view them all.

```python
# Check this against a Pearson correlation heatmap
# Only keep important correlations (more than 0.4 or less than -0.4)
correlation = data.corr()
masked_corr_matrix = correlation[(correlation > 0.4) | (correlation < -0.4)]

plt.figure(figsize=(20, 10))
sns.heatmap(masked_corr_matrix, cmap="coolwarm", annot=True, fmt='.2f', linewidths=.05).set_title("Correlation Heatmap")
plt.show()
```

Another important point in selecting features for a linear regression model is to check for multicollinearity.
The features RAD, TAX have a correlation of 0.91. These feature pairs are strongly correlated to each other. This can affect the model. Same goes for the features DIS and AGE which have a correlation of -0.75. We kept all the features in this example for simplicity. ## Clustering and Explainable AI Clustering is an unsupervised learning task that involves grouping similar data points together based on their inherent patterns or characteristics. Although clustering lacks explicit labels, XAI techniques can still play a crucial role in understanding and validating the clustering results. Here are a few XAI methods in clustering: Cluster Visualization: Visualizing the clustering results helps us understand how the data points are grouped together. Techniques like scatter plots, heatmaps, or dendrograms provide a visual representation of the clusters, aiding in interpretation. Cluster Profiling: Cluster profiling techniques analyze the characteristics of each cluster, such as mean values, distribution, or other statistical measures. These profiles provide insights into the defining features of each cluster, enhancing interpretability. Dimensionality Reduction: Dimensionality reduction methods, such as Principal Component Analysis (PCA) or t-SNE (t-Distributed Stochastic Neighbor Embedding), can help reduce the high-dimensional input space to a lower-dimensional representation that is more easily understandable and interpretable. ## K-means example For this example we will use the Palmer Penguins dataset. It is created by Dr. Kristen Gorman and the Palmer Station, Antarctica LTER. This dataset contains the data of 344 penguins. Just like in the Iris dataset, there are 3 different species of penguins coming from 3 islands in the Palmer Archipelago. These three classes are Adelie, Chinstrap, and Gentoo. So we could use this dataset for classification supervised learning (labelled data). 
But unlike the other examples we've seen, since clustering and dimensionality reduction are unsupervised methods, we will pretend we don't know what the classes are. We are only interested in grouping similar data points together based on their characteristics, helping us discover patterns and structure in data without pre-defined categories. This has real world uses including customer segmentation, market research and social network analysis. We will use K-means clustering which is an unsupervised algorithm that groups data points into K distinct clusters based on their proximity to the cluster centroids. It iteratively assigns data points to the nearest centroid and updates the centroids until convergence, aiming to minimize the within-cluster sum of squares. It's important to note that the clustering model is a tool to assist us in organizing and understanding data, but it doesn't provide definitive answers or predictions. ```python import pandas as pd from sklearn.cluster import KMeans from sklearn.decomposition import PCA from sklearn.preprocessing import StandardScaler from sklearn.metrics import silhouette_score import matplotlib.pyplot as plt import seaborn as sns # Load the Palmer Penguin dataset url = 'https://raw.githubusercontent.com/mwaskom/seaborn-data/master/penguins.csv' data = pd.read_csv(url) # Drop missing data = data.dropna() # Keep species as known labels known_labels = data['species'].values # Select relevant features for clustering features = data[['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g']] # Scale the data scaler = StandardScaler() scaled_features = pd.DataFrame(scaler.fit_transform(features), columns=features.columns) # Perform clustering using K-means kmeans = KMeans(n_clusters=3, random_state=42) kmeans.fit(scaled_features) labels = kmeans.predict(scaled_features) centroids = kmeans.cluster_centers_ ``` This will give us `labels` as our clusters. 
Note that in this example, we have used three clusters (n_clusters=3) but you can adjust the number of clusters as per your requirements in other datasets and experiment with different cluster sizes.

```python
# Add the cluster labels to the dataset
data['cluster'] = labels

# Profile each cluster using feature analysis (numeric feature columns only,
# since aggregating text columns such as species raises an error in recent pandas)
feature_cols = list(features.columns)
features_profile = data.groupby('cluster')[feature_cols].agg(['mean', 'median', 'std'])
mean_features = data.groupby('cluster')[feature_cols].mean()

# Compute the silhouette score to evaluate cluster quality
silhouette_avg = silhouette_score(scaled_features, labels)
```

After running this code, we add the predicted cluster `labels` back to the original data, and obtain the `features_profile` and `mean_features` values for each cluster, which will provide insights into the characteristics of the clusters by mean, median, and standard deviation. The mean feature values can help identify the statistical differences between clusters.

**Mean feature values by cluster**

| cluster | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g |
| ------- | -------------- | ------------- | ----------------- | ----------- |
| 0 | 38.27674 | 18.12171 | 188.6279 | 3593.798 |
| 1 | 47.56807 | 14.99664 | 217.2353 | 5092.437 |
| 2 | 47.66235 | 18.74824 | 196.9176 | 3898.235 |

The silhouette score returned 0.58 before scaling and 1.00 after, which is a great silhouette score. Silhouette scores range between -1 and 1, with values closer to 1 indicating well-separated clusters, values around 0 indicating overlapping clusters, and negative values suggesting points may have been assigned to the wrong cluster.

We can produce a scatter plot visualisation using PCA to display the clusters in a two-dimensional space. PCA finds the directions (principal components) along which the data varies the most. By projecting the data onto the first two principal components, we create a lower-dimensional representation of the original data. This lower-dimensional representation allows us to visualise the data in a more manageable and interpretable way.
```python # Visualization using PCA pca = PCA(n_components=2) reduced_data = pca.fit_transform(scaled_features) # Visualize the clusters clustered_data = pd.DataFrame({'PCA Component 1': reduced_data[:, 0], 'PCA Component 2': reduced_data[:, 1], 'Cluster': labels}) plt.figure(figsize=(15, 10)) sns.scatterplot(data=clustered_data, x='PCA Component 1', y='PCA Component 2', hue='Cluster', palette='viridis') plt.xlabel('PCA Component 1') plt.ylabel('PCA Component 2') plt.title('PCA Results') plt.show() ``` Although the axes in a PCA plot do not directly correspond to individual features, the contributions of the original features to each principal component can be quantified. The loadings of the features on the principal components indicate their relative importance in explaining the variability in the data. This information can be used to assess which features have the most influence on the overall patterns observed in the PCA plot. ```python loadings = pca.components_ # Calculate the squared loadings (squared weights) for each feature feature_importance = np.square(loadings) # Sum the squared loadings across principal components to get the total importance for each feature total_importance = np.sum(feature_importance, axis=0) feature_importance_df = pd.DataFrame({'Feature': scaled_features.columns, 'Importance': total_importance}) feature_importance_df = feature_importance_df.sort_values('Importance', ascending=False) feature_importance_df = feature_importance_df.reset_index(drop=True) ``` | Feature | Importance | | ----------------- | ---------- | | bill_depth_mm | 0.793125 | | bill_length_mm | 0.566126 | | flipper_length_mm | 0.332761 | | body_mass_g | 0.307989 | By calculating the squared loadings, we obtain the importance of each feature for each principal component. Summing the squared loadings across principal components provides the total importance for each feature. Finally, we sort the features based on their total importance to determine their ranking. 
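The squared-loadings calculation described above can be checked by hand on a tiny example. The sketch below uses plain Python and a made-up 2 x 4 loadings matrix (two components, four features). The numbers are hypothetical, chosen only so that each row has unit length and the rows are orthogonal, which is a property of the rows of `pca.components_` in scikit-learn.

```python
# Hypothetical loadings: one row per principal component, one column per feature.
# The values are invented for illustration, not the ones fitted in the article.
feature_names = ["bill_length_mm", "bill_depth_mm", "flipper_length_mm", "body_mass_g"]
loadings = [
    [0.6, 0.0, 0.8, 0.0],    # component 1
    [0.0, 0.28, 0.0, 0.96],  # component 2
]

# Square every loading, then sum down each column to get one total per feature
squared = [[value ** 2 for value in row] for row in loadings]
total_importance = [sum(column) for column in zip(*squared)]

# Rank features from most to least important
ranking = sorted(zip(feature_names, total_importance), key=lambda pair: pair[1], reverse=True)
for name, importance in ranking:
    print(f"{name:<18} {importance:.4f}")

# Each component has unit length, so its squared loadings sum to 1
component_sums = [sum(row) for row in squared]
print(component_sums)
print(sum(total_importance))  # approximately 2.0, i.e. the number of components
```

Because each component's squared loadings sum to 1, the importances always add up to the number of components kept. The table above follows the same pattern: its four values sum to roughly 2, so each figure can be read as a feature's share of the two components rather than as a share of the variance.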
Note that these importance values are not the same thing as explained variance. To judge how much of the variance in the data the selected principal components capture, look at `pca.explained_variance_ratio_`. As a general guideline, a cumulative value of 0.80 or higher is often used, suggesting that the selected principal components account for at least 80% of the variance in the data.

We cannot view all of the features in two-dimensional space, but we can select two features to explore.

```python
# Visualize the clusters with two variables
plt.figure(figsize=(15, 10))
sns.scatterplot(data=data, x='bill_length_mm', y='flipper_length_mm', hue='cluster', palette='viridis')
plt.xlabel('Bill Length (mm)')
plt.ylabel('Flipper Length (mm)')
plt.title('Clustering Results')

# The centroids were fitted on scaled features, so convert them back
# to the original units before plotting them on the same axes
centroids_original = scaler.inverse_transform(centroids)
sns.scatterplot(x=centroids_original[:, 0], y=centroids_original[:, 2], marker='X', s=200, color='black', legend=False)
plt.show()
```

The two images below show the clusters identified by the model after scaling the data, and the actual penguin groupings. We can see that the images almost perfectly align, which shows this model is performing very well at identifying distinct cluster groupings. It was significantly less accurate before scaling the data.

```python
# Visualise the clusters in a pairplot
plt.figure(figsize=(30, 20))
sns.pairplot(data, hue="cluster", palette='viridis', vars=['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g'])
plt.suptitle('Clusters after scaling')
plt.show()

# Visualise the actual penguin relationships and groupings
plt.figure(figsize=(30, 20))
sns.pairplot(data, hue="species", vars=['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g'])
plt.suptitle('Actual Penguin groupings')
plt.show()
```

As a final piece of quality assurance, we can use a crosstab to examine the cluster labels vs known labels (species) to see how they align.
The known labels won't usually be available in clustering problems because we're not trying to make a prediction, so this is a nice sense check whilst learning about using clustering.

```python
# Quality assure (QA) our clusters against known penguin species
qa = pd.DataFrame({'labels': labels, 'species': known_labels})
qa = pd.crosstab(qa['labels'], qa['species'])
```

| Cluster | Adelie | Chinstrap | Gentoo |
| ------- | ------ | --------- | ------ |
| 0 | 124 | 5 | 0 |
| 1 | 0 | 0 | 119 |
| 2 | 22 | 63 | 0 |

We can see that Cluster 1 contains all of the Gentoo penguins and nothing else! Cluster 0 is mostly Adelie. Cluster 2 is the weakest with mostly Chinstrap but some Adelie. On the whole though, this suggests a very well performing clustering model. Once again, we're not trying to predict anything with clustering, only to identify clear groupings in the data, and ensure those groupings are explainable.

## Benefits and importance of Explainable AI

The integration of XAI techniques into classification, regression, and clustering models offers several benefits:

* Transparency: XAI methods provide transparency by revealing the inner workings of AI models, making them more understandable to users and stakeholders.
* Trust: Enhanced explainability builds trust by enabling users to comprehend and verify the decisions made by AI systems.
* Bias Detection: XAI techniques can help identify and mitigate biases present in AI models, supporting fairer decision-making.
* Compliance: In regulated industries, explainability is crucial for compliance with legal and ethical standards.

An article I found really interesting on all of these topics was [6 Lessons from a Data Scientist in the Banking Industry](https://towardsdatascience.com/6-lessons-from-a-data-scientist-in-the-banking-industry-11dc4a8a7234). A quote that really hit me during that article was:

> I exclusively build models using logistic regression. I am not alone. From banking to insurance, much of the financial world runs on regression. Why?
>
> Because these models work.
>
> ...
>
> With regression, I ended up with models that had 8 to 10 features. Each of these features had to be thoroughly explained. A non-technical colleague had to agree they captured a relationship that existed in reality.
>
> ...
>
> This was a source of disappointment. Leaving uni, I had learned so much about random forests, XGBoost and neural networks. I was excited to apply these techniques. In the first week, I remember one of my senior colleagues saying:
>
> “Forget about all those fancy models”

This echoes the idea that a simple model that is easy to explain to a non-technical audience can be better than a more accurate but more complex model that is much harder to explain.

## Conclusion

Explainable AI is a rapidly evolving field that aims to make AI models more transparent and interpretable. By incorporating XAI techniques into classification, regression, and clustering, we can gain insights into the decision-making processes of these models. Enhanced transparency not only facilitates user understanding but also promotes trust, fairness, and accountability in AI systems. As AI continues to shape our world, it becomes imperative to prioritise explainability.

As always, if you enjoyed this article, be sure to check out [other articles on the site](/). You may be interested in [Concepts of Artificial Intelligence with Python - a review of CS50 AI](/blog/concepts-of-artificial-intelligence-with-python-a-review-of-cs50-ai/).

How to match and count keywords in text using JavaScript

Tue, 04 Jul 2023 20:30:00 GMT

## Introduction

Keywords play a crucial role in analysing and extracting information from text data. Whether you're building a search functionality or conducting text analysis, being able to match and count keywords in JavaScript can be a valuable skill. In this article, we will explore a step-by-step approach to achieving this using JavaScript. I used this approach whilst creating an interactive JavaScript tool [Job Application Keyword Checker](/tools/job-application-keyword-calculator/). Be sure to check it out!

## Define your keywords and text

The first step is to define the keywords you want to search for in the text. Create an array and populate it with the keywords you wish to match. Next, you need to obtain the text in which you want to search for the keywords. This can be any string of text you have or even user input. For demonstration purposes, let's assume we have the following:

```js
const keywords = ["this", "where", "keywords", "none"];
const text = "This is the input text where we will search for keywords.";
```

Be sure to customise this array with your own set of keywords and your own text input.

## Match and count keywords using regex

Now that we have our keywords and text ready, let's proceed with the matching and counting process. We will iterate over each keyword and utilise regular expressions to find matches in the text. We'll also count the occurrences of each keyword.

```js
// keywordCount maps each keyword to its number of occurrences;
// matchedKeywords tracks how many keywords appeared at least once
let matchedKeywords = 0;
const keywordCount = {};

keywords.forEach(keyword => {
  const regex = new RegExp(keyword, "gi");
  const matches = text.match(regex);

  if (matches) {
    keywordCount[keyword] = matches.length;
    matchedKeywords++;
  } else {
    keywordCount[keyword] = 0;
  }
});

console.log(keywordCount);
```

In the code above, we iterate over each keyword and create a regular expression using the keyword and the "gi" flags. The "g" flag enables a global search to find all occurrences of the keyword, while the "i" flag ensures case-insensitive matching.
Using the match method on the text with the regular expression, we find all the matches. If matches are found, we store the count in the keywordCount object; otherwise, we set the count to 0. Finally, we log the keywordCount object to the console, which displays the count of each keyword in the text. ## Match and count keywords using array.includes() An alternative and more iterative approach is to transform the input text to an array and then match words from each array. We first would need a function to transform a string into an array. ```js /** * Parses an input string and transforms it into * an array of words */ function getWords(str) { let words = str.toLowerCase().split(" "); let uniqueWords = [...new Set(words)]; for (let i = 0; i < uniqueWords.length; i++) { uniqueWords[i] = uniqueWords[i].replace(/-/g, " "); } return uniqueWords; } ``` We can then use this to match the keywords. We use `toLowerCase` to avoid case sensitive mismatches. ```js let textArray = getWords(text); let matchedWords = []; // Go over each word in the text array and find matches for (let i = 0; i < textArray.length; i++) { let word = textArray[i].toLowerCase(); if (!matchedWords.includes(word)) { if (keywords.includes(word)) { matchedWords.push(word); } } } // Then go over all keywords to cross check for (let i = 0; i < keywords.length; i++) { let term = keywords[i]; if (!matchedWords.includes(term)) { if (text.toLowerCase().includes(term)) { matchedWords.push(term); } } } console.log(matchedWords.length); ``` ## Conclusion Matching and counting keywords in text is a useful technique when working with JavaScript and textual data. By following the steps outlined in this article, you can easily implement this functionality into your own projects. Remember to customize the keywords and text variables to match your specific use case. Feel free to experiment and enhance this code further by considering variations of keywords, such as plural forms or different tenses. 
Advanced techniques like stemming or lemmatisation can be employed to achieve more comprehensive keyword matching. Stemming strips word endings to reduce a word to a root form, disregarding variations like tense or plural forms, whereas lemmatisation uses vocabulary and grammar to reduce a word to its dictionary form. For example, both would reduce the word "running" to "run". Harnessing the power of JavaScript and keyword matching opens up possibilities for creating powerful search engines, text analysis tools, and much more. Start exploring and leveraging this technique to unlock the potential within your own solutions! As always, if you enjoyed this article, be sure to check out [other articles on the site](/). If you are interested in finding out how to search for keywords using Python, then check out [Using PyPDF2 to score keywords in a job application](/blog/using-PyPDF2-to-score-keywords-in-a-job-application/).

Using PyPDF2 to score keywords in a job application

Wed, 28 Jun 2023 20:30:00 GMT

## Introduction

AI and automated models will be used alongside human expertise more and more in the future. This article will explore a simple but useful example of this by counting and assessing keywords in job applications using Python. A model can bring a better quantitative assessment, whereas a human reviewer can bring a better qualitative assessment. Both are valuable.

## What are the benefits?

I sit on interview panels to select and onboard apprentices and degree apprentices, alongside junior and intermediate staff at a large organisation, for both the software / web development and the data science sides of the business. When managing this in combination with the day job, using AI and automation is super helpful. Sifting 100+ applicants can take many hours from many people. It helps take a more objective approach and to be more critical. Did the candidate just load up on buzzwords without any real substance? Did the candidate use only a few target words but have solid examples that demonstrated the skills required? Did the candidate give solid examples which also included the target words? Which candidate would you invite to interview?

## Understanding the PDF input

I can't share the individual job applications of course due to data protection, but I can share the model code and show what the outputs look like. You can then use this approach and tailor it to your specific needs. The way that the organisation processes job applications means that only a single PDF is given to the panel with all of them combined. This enables anonymity and fairness in that you only see a candidate number and the application itself. No identifiable information is given. It also meant that the model would first need to separate this large PDF file into the constituent applications. You will see in the code, I achieved this by splitting the text of the file on 'Application ID:'. This is what the large PDF file looked like. I have censored all text for privacy.
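To make the splitting idea concrete, here is a minimal sketch using made-up text in place of the real PDF contents. The IDs, the wording and the `MARKER` name are all invented for illustration. It also shows one thing to watch for: `str.split` returns whatever comes before the first marker as its own element, so an empty leading chunk may need filtering out before counting records.

```python
# Made-up text standing in for the extracted PDF contents; the real
# applications are confidential, so the IDs and wording here are invented.
all_text = (
    "Application ID: 1000001 I enjoy solving problems with data. "
    "Application ID: 1000002 I have strong communication skills."
)

MARKER = "Application ID:"  # hypothetical constant name

# str.split returns the text before the first marker as element 0.
# Here that is an empty string, so blank chunks are dropped.
chunks = all_text.split(MARKER)
applications = [chunk for chunk in chunks if chunk.strip()]

print(len(chunks))           # includes the empty leading chunk
print(len(applications))     # one entry per application
print(applications[0][1:8])  # the seven characters after the leading space
```

The `[1:8]` slice mirrors how the model further down reads the application ID, and assumes a seven-character ID that follows a single space.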
If your situation requires many files instead of just one that requires splitting up, you can adapt this code using the approach found in [Searching for text in PDFs at increasing scale](/blog/searching-for-text-in-pdfs-at-increasing-scale/).

## Creating the model

Before taking a look at the model, here is a brief summary of what's going on:

* We define essential and desirable `criteria` keywords to look for.
* We then use PyPDF2 to `read_applications` from the PDF using the filepath to it.
* After splitting the text into separate applications we then `score_applications` using regex to count keyword matches.
* Finally, we use `scores.describe()` to provide summary statistics.

```python [siftbot.py]
# -*- coding: utf-8 -*-
"""
A scoring model to help with job application sifting.

Enter keywords for the role essential and desirable criteria,
then run the program. The outputs will be saved in the
'applications', 'scores' and 'statistics' variables.

Documentation for PyPDF2: https://pypdf2.readthedocs.io/en/3.0.0/
Migration guide for PyPDF2: https://pypdf2.readthedocs.io/en/3.0.0/user/migration-1-to-2.html
"""
import re
import time

import PyPDF2
import pandas as pd


def criteria():
    return {
        "essential": [
            "maths",
            "a level",
            "numeric",
            "analytical",
            "technologi",
            "language",
            "data",
            "business challenge",
            "problem solving",
            "communicat"
        ],
        "desirable": [
            "programming skills",
            "analysis",
            "data manipulation",
            "analytical software",
            "software packages",
            "RStudio",
            "SQL",
            "Power BI",
            "Excel",
            "mathematical models",
            "infrastructure",
            "security",
            "web design",
            "agile",
            "agile project methodology",
            "customer facing",
            "technical and non-technical",
            "data architecture",
            "innovative"
        ]
    }


def read_applications(filepath: str) -> list:
    pdf_reader = PyPDF2.PdfReader(filepath)  # Formerly PyPDF2.PdfFileReader(filepath)
    number_of_pages = len(pdf_reader.pages)  # Formerly pdf_reader.getNumPages()

    all_text = ""

    for i in range(0, number_of_pages):
        page = pdf_reader.pages[i]  # Formerly reader.getPage(pageNumber)
        text = page.extract_text()  # Formerly page.extractText()
        all_text += text

    # Each application starts with this marker in the combined PDF
    applications = all_text.split("Application ID:")

    return applications


def score_applications(applications: list, criteria: dict):
    scores = []

    for application in applications:
        score = {
            "application_id": application[1:8],
            "word_count": 0,
            "essential": 0,
            "desirable": 0,
            "matched_terms": ""
        }

        # Count each essential term once if it appears anywhere in the text
        for term in criteria["essential"]:
            if re.search(term, application):
                print(f"Matched '{term}' in application {score['application_id']}")
                score["essential"] += 1
                score["matched_terms"] += (term + " ")

        # Same again for the desirable terms
        for term in criteria["desirable"]:
            if re.search(term, application):
                print(f"Matched '{term}' in application {score['application_id']}")
                score["desirable"] += 1
                score["matched_terms"] += (term + " ")

        score["word_count"] = len(application.split())

        scores.append([
            score["application_id"],
            score["word_count"],
            score["essential"],
            score["desirable"],
            score["essential"] + score["desirable"],
            score["matched_terms"]
        ])

    columns = ["Application ID", "Word Count", "Essential",
               "Desirable", "Combined", "Matched Terms"]

    return pd.DataFrame(scores, columns=columns)


if __name__ == "__main__":
    start = time.time()

    applications = read_applications("C:\\Users\\shedloadofcode\\OneDrive\\Documents\\Recruitment\\Recruitment Jan 2023\\Applications\\applications for sift (109).pdf")
    scores = score_applications(applications, criteria())
    statistics = scores.describe()

    print(f"Bot finished in {round(time.time() - start, 2)} seconds")
```

I created the model using Spyder IDE and the key variables are then stored as outputs in the variable explorer - the top right window.

## Viewing the outputs

Using the variable explorer the outputs can be analysed. We can first sense check that all of the applications were split up correctly on 'Application ID:' and that there are 109 records as expected in `applications`.
The results of the scoring are shown in `scores`, which is super helpful: it provides an application word count, a count of essential and desirable keywords matched, a combined count, and the matched terms themselves. This means you can sort by essential, desirable or total keywords matched. It also opens up further insights, such as 'did a candidate have a high word count, but didn't use many keywords?'. To aid with these kinds of questions, we can view the `statistics` output to find out min, max, mean and median (50%) word counts, essential, desirable and combined counts. This helps us to assess how a particular candidate compares to the average in terms of word count vs matched terms. The `scores` DataFrame could also be saved to a CSV file to share with other panel members.

## Final thoughts and interactive tool

Thanks for reading 😄 I hope you found this article interesting. I used the logic from the code in this article to create an interactive JavaScript tool [Job Application Keyword Checker](/tools/job-application-keyword-calculator/). Be sure to check it out and give it a go, you can get started by hitting 'Show me an example' and take it from there! My final thought is that we should never blindly trust a model, especially in cases such as these where we are assessing suitability for a job position. A quantitative model can only get us so far. We should always carry out a human review and ask critical questions such as:

* Is the candidate strong even though they didn't directly match many keywords?
* Did the candidate just drop all the keywords into their application without really understanding them?

This ensures fairness and avoids simple keyword matching bias, whilst also allowing the model to aid in decision making and speed up reviews. As always, if you enjoyed this article, be sure to check out [other articles on the site](/).

How to create animated charts with Python and Plotly

Fri, 14 Apr 2023 20:30:00 GMT

## Introduction

In this short article we'll learn how to create animated charts using Python and Plotly. This follows on from the theme of the previous article [How to build and visualise a Monte Carlo simulation with Python and Plotly](/blog/how-to-build-and-visualise-a-monte-carlo-simulation-with-python-and-plotly/). Using these techniques can help to tell the story better when it comes to communicating data insights and changes over time periods or stages. The best way to get started quickly with animated charts is to learn from examples and then start to apply your own datasets to them. All the examples in this article will follow the pattern 'show me the code that generates the chart' then 'show me what that chart looks like'. You'll see that [Plotly makes generating animated charts](https://plotly.com/python/animations/) in Python relatively straightforward, and attaches a play and stop button to the chart as standard. Auto play is enabled by default, but for the charts embedded on this page, I set this to false, so you need to hit the play button on the chart to start the animation. There are of course options to customise Plotly charts further. All of the code and outputs for the charts can be [found on GitHub](https://github.com/shedloadofcode/animated-plotly-charts).
## Animated bar chart ```python import plotly.express as px df = px.data.gapminder() fig = px.bar(df, x="continent", y="pop", animation_frame="year", animation_group="country", hover_name="country", range_y=[0,4000000000], color="continent", color_discrete_map={ 'Asia': '#1d70b8', 'Europe': '#f47738', 'Africa': '#28a197', 'Americas': '#6f72af', 'Oceania': '#d53880' }) fig.update_layout( title="Global population growth over time.", xaxis_title="Continent", yaxis_title="Population", legend_title="Legend Title", showlegend=False, font=dict( family="Arial", size=14 ), paper_bgcolor='rgba(0,0,0,0)', plot_bgcolor='rgba(0,0,0,0)') fig.write_html("outputs/animated_bar.html", auto_play=False) ``` ## Animated line chart ```python import plotly.graph_objects as go import pandas as pd dates = ["2022-12-03", "2022-12-04", "2022-12-05", "2022-12-06", "2022-12-07", "2022-12-08", "2022-12-09"] school_a = [86.77, 80.74, 79.48, 76.47, 75.44, 74.49, 70.41] school_b = [92.77, 91.64, 90.68, 92.37, 92.84, 90.29, 92.71] df = pd.DataFrame(list(zip(dates, school_a, school_b)), columns=['date', 'school_a', 'school_b']) fig = go.Figure( layout=go.Layout( updatemenus=[dict(type="buttons", direction="right", x=0.9, y=1.16), ], xaxis=dict(range=["2022-12-02", "2022-12-10"], autorange=False, tickwidth=2, title_text="Time"), yaxis=dict(range=[0, 100], autorange=False, title_text="Price") )) # Add traces i = 1 fig.add_trace( go.Scatter(x=df.date[:i], y=df.school_a[:i], name="School A", visible=True, line=dict(color="#f47738", dash="dash"))) fig.add_trace( go.Scatter(x=df.date[:i], y=df.school_b[:i], name="School B", visible=True, line=dict(color="#1d70b8", dash="dash"))) # Animation fig.update(frames=[ go.Frame( data=[ go.Scatter(x=df.date[:k], y=df.school_a[:k]), go.Scatter(x=df.date[:k], y=df.school_b[:k])] ) for k in range(i, len(df) + 1)]) fig.update_xaxes(ticks="outside", tickwidth=2, tickcolor='white', ticklen=10) fig.update_yaxes(ticks="outside", tickwidth=2, tickcolor='white', ticklen=1) 
fig.update_layout(yaxis_tickformat=',') fig.update_layout(legend=dict(x=0, y=1.1), legend_orientation="h") # Buttons fig.update_layout(title="Attendance % of two schools over time.", xaxis_title="Date", yaxis_title="Attendance %", legend_title="Legend Title", showlegend=False, font=dict( family="Arial", size=14 ), paper_bgcolor='rgba(0,0,0,0)', plot_bgcolor='rgba(0,0,0,0)', hovermode="x", updatemenus=[ dict( buttons=list([ dict(label="Play", method="animate", args=[None, {"frame": {"duration": 500}}]), dict(label="School A", method="update", args=[{"visible": [True, False]}, {"showlegend": True}]), dict(label="School B", method="update", args=[{"visible": [False, True]}, {"showlegend": True}]), dict(label="All", method="update", args=[{"visible": [True, True]}, {"showlegend": True}]), ])) ] ) fig.write_html("outputs/animated_line.html", auto_play=False) ``` ## Animated scatter chart ```python import plotly.express as px df = px.data.gapminder() fig = px.scatter(df, x="gdpPercap", y="lifeExp", animation_frame="year", animation_group="country", size="pop", color="continent", hover_name="country", log_x=True, size_max=55, range_x=[100,100000], range_y=[25,90]) fig.add_hline(y=72, line_width=2, line_dash='dash', line_color='lightgray', annotation_text='', annotation_font=dict( family="Arial", size=15, color="lightgray" ), annotation_font_size=15, annotation_position='bottom left', fillcolor='lightgray') fig.add_shape(type="line", x0=12000, y0=0, x1=12000, y1=100, line_width=2, line_color='lightgray', line_dash='dash') fig.add_annotation(x=0.15, xref='paper', yref='paper', xanchor='left', y=0.15, yanchor='top', text="Below average", font=dict( color="black", size=20, family="Arial" ), showarrow=False) fig.add_annotation(x=0.85, xref='paper', yref='paper', xanchor='left', y=0.95, yanchor='top', text="Above average", font=dict( color="black", size=20, family="Arial" ), showarrow=False) fig.add_annotation(x=.99, xref='paper', xanchor='right', y=27, yanchor='bottom', 
text="Data last updated 2008", font=dict( color="gray", size=14 ), showarrow=False) fig.update_layout( title="Global life expectancy and GDP per capita over time.", xaxis_title="GDP per capita", yaxis_title="Life expectancy", legend_title="Legend Title", showlegend=False, font=dict( family="Arial", size=14 ), paper_bgcolor='rgba(0,0,0,0)', plot_bgcolor='rgba(0,0,0,0)') fig.write_html("outputs/animated_scatter.html", auto_play=False) ``` ## Bonus: Animated choropleth map For this chart I had to use a [Jupyter Notebook launched with Anaconda](https://medium.com/analytics-vidhya/fastest-way-to-install-geopandas-in-jupyter-notebook-on-windows-8f734e11fa2b) with an environment dedicated to GeoPandas. I had a few issues installing GeoPandas but this method worked ok. You can view the [entire notebook on GitHub](https://github.com/shedloadofcode/animated-plotly-charts/blob/main/animated%20map%20choropleth.ipynb). You'll see in the Jupyter Notebook, the final cell creates the choropleth map using [mapbox](https://www.mapbox.com/) - you will need to sign up to get a free API key to use this. ```python fig = px.choropleth_mapbox(df_crime_final, geojson=geojson, featureidkey='properties.name', locations='NEIGHBOURHOOD', color='Count', hover_name='NEIGHBOURHOOD', hover_data=['Count'], color_continuous_scale='Reds', animation_frame='Date', mapbox_style='carto-positron', title='Cumulative Numbers of Crimes in Vancouver Neighborhoods', center={'lat':49.25, 'lon':-123.13}, zoom=11, opacity=0.75, labels={'Count':'Count'}, width=1200, height=800) fig.write_html("outputs/animated_choropleth.html", auto_play=False) ``` ## Deployment options Now you've seen some examples of animated charts you can start putting together your own, but what's the best way to share these charts with others? Well you could export to an HTML file the same as in this article, and then even embed that into a web page. 
My [previous article](/blog/how-to-build-and-visualise-a-monte-carlo-simulation-with-python-and-plotly/) discussed this approach, here's the code snippet which uses [htmlpreview.github.io](https://htmlpreview.github.io). ```html ``` ## Final note I usually post an article every month (at least) but I missed February and March as I was busy preparing to bring my new German Shepherd puppy home. His name is Kaiser and he's settled in to the home very well 😄 I've been doing lots of training with him, teaching commands like sit, stay, down, come, leave it, out and heel. Maybe I'll write a fun article soon on that since I guess it's related to coding and logic - 'Programming my German Shepherd' 😆 I should now be back on track with my (at least) monthly new article releases. Anyway, I hope you enjoyed the article and as always be sure to check out other articles on the site. You may be interested in: * [Creating statistical neighbours comparator benchmarking models with Python](/blog/creating-statistical-neighbours-comparator-benchmarking-models-with-python/) * [Six tips for producing and assuring high quality analytical code](/blog/six-tips-for-producing-and-assuring-high-quality-analytical-code/) * [Preparing for a statistical data science interview](/blog/preparing-for-a-statistical-data-science-interview/)

How to build and visualise a Monte Carlo simulation with Python and Plotly

Fri, 06 Jan 2023 18:30:00 GMT

*This article does not constitute financial advice and is for educational purposes only.*

## What are Monte Carlo simulations?

[Monte Carlo simulations](https://en.wikipedia.org/wiki/Monte_Carlo_method) are used to model the probabilities of different outcomes where those outcomes are hard to predict due to random variables. The [Law of large numbers](https://en.wikipedia.org/wiki/Law_of_large_numbers) states that as a sample size grows, its mean gets closer to the average of the whole population. This is due to the sample being more representative of the population as the sample becomes larger. In other words, with a Monte Carlo simulation the goal is to simulate the collection of all or many possible paths (using random sampling) in order to find the possibilities and the most likely or theoretical solution. In summary:

* A Monte Carlo simulation is a model used to predict the probability of a variety of outcomes when the potential for random variables is present.
* Monte Carlo simulations help to explain the impact of risk and uncertainty in prediction and forecasting models.
* A Monte Carlo simulation requires assigning multiple values to an uncertain variable to achieve multiple results and then averaging the results to obtain an estimate.
* A Monte Carlo model is a [stochastic model](https://www.investopedia.com/terms/s/stochastic-modeling.asp#:~:text=our%20editorial%20policies-,What%20Is%20Stochastic%20Modeling%3F,different%20conditions%2C%20using%20random%20variables.), meaning that due to randomness the results may differ each time, as opposed to a deterministic model where given the same inputs you'll get the same result every time.

## A quick example to illustrate

Monte Carlo simulations are named after the [Monte Carlo casino](https://en.wikipedia.org/wiki/Monte_Carlo_Casino) in Monaco, so let's ask a casino based question. "If we always pick red at roulette, how often would we win?"
The roulette wheel has 18 red slots, 18 black slots, and 1 green slot for a total of 37 slots. ```python [roulette.py] import random def play_roulette(): total_slots = 37 red_probability = (18 / total_slots) * 100 black_probability = (18 / total_slots) * 100 green_probability = (1 / total_slots) * 100 possible_outcomes = ["red", "black", "green"] probabilities = [red_probability, black_probability, green_probability] outcome = random.choices( possible_outcomes, weights=probabilities, k=1 )[0] return outcome def perform_simulation(n_times=1000, choice="red"): results = { "red": 0, "black": 0, "green": 0 } for i in range(n_times): outcome = play_roulette() results[outcome] += 1 win_percentage = results[choice] / n_times return results, win_percentage if __name__ == "__main__": results, win_percentage = perform_simulation(n_times=1000000, choice="red") print(results) print(win_percentage) ``` So after 1 million simulations, we can say we win just less than half of the time with a 48.71% probability. We've proven that the extra green pocket gives the house an edge over the long run. ## Building the Monte Carlo model with Python Now we have an idea of what a Monte Carlo simulation is and have seen a short example, we can build a more complex model. The challenge I have set here is to recreate an awesome [Monte Carlo retirement simulation](https://engaging-data.com/fire-calculator/?age=32&initsav=25000&spend=45000&initinc=60000&wr=4&ir=1&retspend=40000&stockpct=80&fixpct=18&cashpct=2&graph=mc&secgraph=0&stockrtn=8.1&bondrtn=2.4&MCstockrtn=8.1&MCbondrtn=2.4&tax=7&income=0&incstart=50&incend=70&expense=0&expstart=50&expend=70) from [engaging-data.com](https://engaging-data.com) using Python and Plotly. After playing around with this calculator I wondered how this could be re-created in Python with a few individual touches. I got quite close and there's lots to learn from the code. 
The question this time is "If I invest a set amount for a number of years, how much might I have?" All of the code for this model can be found on [GitHub](https://github.com/shedloadofcode/monte-carlo-simulation). ```python [model.py] """ Monte Carlo model to simulate the growth of an investment portfolio over time. """ import numpy as np from helpers import ( get_random_returns, get_confidence_levels, get_yearly_percentiles) from plots import ( plot_histogram, plot_yearly_percentiles) def perform_simulation(inputs: dict): """ Performs a simulation to find out how much the pot is worth in £ after years of growth. Returns: pot (float) - the final amount at the end history (list) - the yearly history of results [10000, 11000, 12000, ...] """ years = inputs['end_age'] - inputs['start_age'] pot = inputs['starting_pot'] returns = get_random_returns(years=years) mean_return = (np.mean(returns) - 1) * 100 history = [] for i in range(years): annual_return = returns[i] pot *= annual_return pot += inputs['annual_contributions'] history.append(int(pot)) return pot, history, mean_return def perform_monte_carlo(inputs: dict, n: int = 1000): pot_sizes = [] results = [] mean_returns = [] for i in range(n): final_amount, history, mean_return = perform_simulation(inputs) pot_sizes.append(final_amount) results.append(history) mean_returns.append(mean_return) lower_confidence, upper_confidence = get_confidence_levels(pot_sizes) print('Monte carlo model done :)', end='\n') print('Plots saved to /outputs folder') print('Mean return across all simulations: ', end='') print(f'{round(np.mean(mean_returns), 1)}%') return { 'pot_sizes': pot_sizes, 'results': results, 'yearly_percentiles': get_yearly_percentiles(results, inputs), 'lower_confidence': lower_confidence, 'upper_confidence': upper_confidence, 'mean_returns': mean_returns } if __name__ == "__main__": inputs = { 'start_age': 20, 'end_age': 65, 'starting_pot': 5000, 'annual_contributions': 500 * 12, 'target_amount': 300000, 
'n_simulations': 10000 } mc = perform_monte_carlo(inputs, n=inputs['n_simulations']) plot_histogram(mc['pot_sizes'], mc['upper_confidence'], mc['lower_confidence']) plot_yearly_percentiles(inputs=inputs, df=mc['yearly_percentiles']) ``` This model takes a dictionary 'inputs' which you can change to adapt the simulation. The 'perform_monte_carlo' function carries out a given number of simulations and returns the final 'pot_sizes' with other useful information like the history and mean returns of each simulation, the yearly percentiles, alongside upper and lower confidence intervals. For this example our starting age is 20 and end age is 65. We start with £5,000 and our annual contributions are £6,000 (or £500 per month) and we're aiming for a £300,000 pot! We will run this simulation 10,000 times. You might be thinking, how do we simulate the randomness of what our returns might be each year? A quick Google search tells us that the historic [annual average return](https://www.google.com/search?q=s%26p+500+average+return) of the S&P500 is 10% per year. Sorry but I'm much more pessimistic and expect lower. I have modelled a range of returns and assigned them probability weights in the file helpers.py below. This means I've assumed low returns are more likely, but there is also a chance of higher returns, or negative returns. Nobody knows what the markets will do, and that's why randomness will help us with this uncertainty and view the outcomes of many simulations. ```python [helpers.py] import random import numpy as np import pandas as pd def get_random_returns(years: int): """ Generates a list of random return percentages for the length of years required. 
""" random_returns = [] for i in range(years): high_negative_returns = (random.randint(-20, -8) / 1000) + 1 low_negative_returns = (random.randint(-7, -1) / 1000) + 1 low_returns = (random.randint(0, 4) / 100) + 1 medium_returns = (random.randint(5, 9) / 100) + 1 high_returns = (random.randint(10, 20) / 100) + 1 possible_returns = [ # Weights high_negative_returns, # 5 % chance low_negative_returns, # 25 % chance low_returns, # 40 % chance medium_returns, # 25 % chance high_returns # 5 % chance ] random_return = random.choices( possible_returns, weights=(5, 25, 40, 25, 5), k=1 )[0] random_returns.append( random_return ) return random_returns def get_confidence_levels(pot_sizes): upper_confidence = round(np.quantile(pot_sizes, 0.975), 2) lower_confidence = round(np.quantile(pot_sizes, 0.025), 2) return lower_confidence, upper_confidence def get_yearly_percentiles(results, inputs) -> pd.DataFrame: """ Finds the percentiles for each year. """ results_rotated = list(zip(*results[::-1])) year = [] age = [] ninetieth_percentile = [] seventy_fifth_percentile = [] median = [] twenty_fifth_percentile = [] tenth_percentile = [] for i, year_results in enumerate(results_rotated): new_age = (inputs['start_age'] + 1) + i ninetieth_percentile_value = np.percentile(year_results, 90) seventy_fifth_percentile_value = np.percentile(year_results, 75) median_value = np.median(year_results) twenty_fifth_percentile_value = np.percentile(year_results, 25) tenth_percentile_value = np.percentile(year_results, 10) year.append(i + 1) age.append(new_age) ninetieth_percentile.append(ninetieth_percentile_value) seventy_fifth_percentile.append(seventy_fifth_percentile_value) median.append(median_value) twenty_fifth_percentile.append(twenty_fifth_percentile_value) tenth_percentile.append(tenth_percentile_value) return pd.DataFrame( list( zip(year, age, ninetieth_percentile, seventy_fifth_percentile, median, twenty_fifth_percentile, tenth_percentile) ), columns=[ 'year', 'age', '90th_percentile', 
'75th_percentile', 'median', '25th_percentile', '10th_percentile'] ) ```

The randomness we've introduced here means that, for every year in each of the 10,000 or more simulations, there is a:

* 5% chance of negative returns between -2% and -0.8%
* 25% chance of negative returns between -0.7% and -0.1%
* 40% chance of low returns between 0% and 4%
* 25% chance of medium returns between 5% and 9%
* 5% chance of high returns between 10% and 20%

If you think these are too pessimistic or optimistic, please go ahead and change the values or weights 👍

The 'get_yearly_percentiles' function takes the 2D list 'results' (all of the histories for all simulations year by year), [rotates it](https://stackoverflow.com/questions/8421337/rotating-a-two-dimensional-array-in-python) to line up year 1, year 2, year 3 and so on, and then finds the percentiles (10th, 25th, median, 75th, 90th) for each year. This effectively shows the range of results from all simulations for each year in a DataFrame:

| year | age | 90th_percentile | 75th_percentile | median | 25th_percentile | 10th_percentile |
| ---- | --- | --------------- | --------------- | -------- | --------------- | --------------- |
| 1 | 21 | 11450 | 11300 | 11100 | 10990 | 10970 |
| 2 | 22 | 18153 | 17804 | 17399 | 17110 | 16957 |
| 3 | 23 | 25296.1 | 24570.25 | 23919 | 23395.75 | 23051 |
| 4 | 24 | 32631 | 31605 | 30632 | 29832 | 29279.8 |
| 5 | 25 | 40342.1 | 38841.25 | 37513.5 | 36407.75 | 35624.9 |
| ... | ... | ... | ... | ... | ... | ... |
| 41 | 61 | 618714.6 | 553043.8 | 492832 | 442674.5 | 403963.4 |
| 42 | 62 | 645462.7 | 578133 | 514355 | 461004.3 | 420718 |
| 43 | 63 | 673703 | 602295 | 535547 | 478970 | 437741.3 |
| 44 | 64 | 703538.1 | 629788.8 | 557324.5 | 498292.3 | 453481.4 |
| 45 | 65 | 739303.7 | 656680.5 | 580414.5 | 517842 | 470615.8 |

This can then be plotted using Plotly along with the final pot sizes.
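The rotation step is easier to see on a tiny example. The sketch below uses made-up numbers (three simulations of four years each, not real model output) to show how `zip(*results)` regroups the values by year so that a statistic can be taken across simulations. The `[::-1]` reversal in the helper only changes the order of the simulations within each year, which makes no difference to percentiles, so it is left out here.

```python
import numpy as np

# Made-up toy data: three simulations, each with a four-year history.
results = [
    [100, 110, 120, 130],
    [100, 105, 115, 125],
    [100, 120, 140, 160],
]

# zip(*results) groups the values by year instead of by simulation.
by_year = list(zip(*results))

# One median per year, taken across all simulations.
medians = [float(np.median(year)) for year in by_year]

print(by_year[1])
print(medians)
```

Each tuple in `by_year` holds one year's value from every simulation, which is exactly the shape `np.percentile` and `np.median` need.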
## Plotting the Monte Carlo results with Plotly I was using [Spyder](https://www.spyder-ide.org/) to carry out this analysis, and saved the plots as html files in the /output directory. You'll need to install Plotly with `python -m pip install plotly` ```python [plots.py] import numpy as np import plotly.express as px import plotly.graph_objects as go import plotly.io as pio pio.renderers.default='svg' def plot_histogram(pot_sizes: list, upper_confidence:float, lower_confidence: float): """ Plots the frequencies of the final pot sizes. """ fig = px.histogram(pot_sizes, title=f"The final pot size after {len(pot_sizes)} simulations.") fig.add_vline(x=lower_confidence, line_width=3, line_dash="dash", line_color="green") fig.add_vline(x=upper_confidence, line_width=3, line_dash="dash", line_color="green") fig.add_vline(x=np.median(pot_sizes), line_width=3, line_dash="dash", line_color="black", annotation_text="median", annotation_font_size=15) fig.add_vrect(x0=lower_confidence, x1=upper_confidence, line_width=0, fillcolor="green", opacity=0.2, annotation_text="95% confidence interval", annotation_font_size=15) fig.update_layout( xaxis_title="Amount (£)", yaxis_title="Count", showlegend=False, font=dict( family="Arial", size=14 ), paper_bgcolor='rgba(0,0,0,0)', plot_bgcolor='rgba(0,0,0,0)', ) fig.write_html('outputs/mc-histogram.html', auto_open=False) def plot_yearly_percentiles(inputs, df): """ Plots the year by year percentile graph. 
""" exact_np = df[df['90th_percentile'] > inputs['target_amount']].iloc[0] exact_sfp = df[df['75th_percentile'] > inputs['target_amount']].iloc[0] exact_median = df[df['median'] > inputs['target_amount']].iloc[0] exact_tfp = df[df['25th_percentile'] > inputs['target_amount']].iloc[0] exact_tp = df[df['10th_percentile'] > inputs['target_amount']].iloc[0] fig = go.Figure() fig.add_traces(go.Scatter(x=df['age'], y=df['10th_percentile'], line = dict(color='#FFA502'), mode='lines', name='10th %tile', fill='none', fillcolor = '#F7CA77')) fig.add_traces(go.Scatter(x=df['age'], y=df['25th_percentile'], line = dict(color='#7BE56E'), mode='lines', name='25th %tile', fill='tonexty', fillcolor = '#F7CA77')) fig.add_traces(go.Scatter(x=df['age'], y=df['median'], line=dict(color='black'), line_width=3, mode='lines', name="median", fill='tonexty', fillcolor='#00FF66')) fig.add_traces(go.Scatter(x=df['age'], y=df['75th_percentile'], line = dict(color='#7BE56E'), mode='lines', name="75th %tile", fill='tonexty', fillcolor = '#00FF66')) fig.add_traces(go.Scatter(x=df['age'], y=df['90th_percentile'], line = dict(color='#FFA502'), mode='lines', name="90th %tile", fill='tonexty', fillcolor = '#F7CA77')) fig.update_layout(hovermode="x") fig.update_xaxes(tickangle=0, dtick=1, showticklabels=True, gridcolor='lightgray', type='category') fig.update_yaxes(gridcolor='lightgray', rangemode="tozero") fig.add_hline(y=inputs['target_amount'], line_width=2, line_dash='dash', line_color='red', annotation_text='Target amount', annotation_font=dict( family="Arial", size=15, color="red" ), annotation_font_size=15, annotation_position='bottom left', fillcolor='red') fig.add_shape(type="line", x0=int(exact_median['year'] - 1), y0=0, x1=int(exact_median['year'] - 1), y1=inputs['target_amount'], line_width=2, line_color='gray', line_dash='dash') fig.add_shape(type="line", x0=int(exact_tp['year'] - 1), y0=0, x1=int(exact_tp['year'] - 1), y1=inputs['target_amount'], line_width=2, line_color='orange', 
line_dash='dash') fig.add_shape(type="line", x0=int(exact_np['year'] - 1), y0=0, x1=int(exact_np['year'] - 1), y1=inputs['target_amount'], line_width=2, line_color='orange', line_dash='dash') fig.add_shape(type="line", x0=int(exact_tfp['year'] - 1), y0=0, x1=int(exact_tfp['year'] - 1), y1=inputs['target_amount'], line_width=2, line_color='green', line_dash='dash') fig.add_shape(type="line", x0=int(exact_sfp['year'] - 1), y0=0, x1=int(exact_sfp['year'] - 1), y1=inputs['target_amount'], line_width=2, line_color='green', line_dash='dash') fig.add_annotation(x=int(exact_median['year'] - 1), y=inputs['target_amount'] * 1.45, text=f"{int(exact_median['year'])} years", font=dict( color="black", size=21 ), showarrow=False, yshift=10) fig.add_annotation(x=int(exact_median['year'] - 1), y=inputs['target_amount'] * 1.3, text=f"(Age {int(exact_median['age'])})", font=dict( color="black", size=21 ), showarrow=False, yshift=10) fig.add_annotation(x=inputs['end_age'] - inputs['start_age'] - 1.2, y=df['10th_percentile'].max() - 8000, text="10%", font=dict( color="black", size=12 ), showarrow=False, yshift=10) fig.add_annotation(x=inputs['end_age'] - inputs['start_age'] - 1.2, y=df['25th_percentile'].max() - 8000, text="25%", font=dict( color="black", size=12 ), showarrow=False, yshift=10) fig.add_annotation(x=inputs['end_age'] - inputs['start_age'] - 1.25, y=df['median'].max() - 5000, text="median", font=dict( color="black", size=12 ), showarrow=False, yshift=10) fig.add_annotation(x=inputs['end_age'] - inputs['start_age'] - 1.2, y=df['75th_percentile'].max() - 4000, text="75%", font=dict( color="black", size=12 ), showarrow=False, yshift=10) fig.add_annotation(x=inputs['end_age'] - inputs['start_age'] - 1.2, y=df['90th_percentile'].max() - 5000, text="90%", font=dict( color="black", size=12 ), showarrow=False, yshift=10) fig.add_annotation(x=.99, xref='paper', xanchor='right', y=0, yanchor='bottom', text="shedloadofcode.com", font=dict( color="gray", size=14 ), showarrow=False) 
fig.add_annotation(x=0.01, xref='paper', yref='paper', xanchor='left', y=0.99, yanchor='top', text=f"In {inputs['n_simulations']} simulations " + f"{int(exact_median['age'])} " + f"is the median age ({int(exact_median['year'])} years)", font=dict( color="black", size=15 ), showarrow=False) fig.add_annotation(x=0.01, xref='paper', yref='paper', xanchor='left', y=0.96, yanchor='top', text="10th to 90th %ile: " + f"{int(exact_np['year'])} to {int(exact_tp['year'])} " + "years to target", font=dict( color="black", size=15 ), showarrow=False) fig.add_annotation(x=0.01, xref='paper', yref='paper', xanchor='left', y=0.93, yanchor='top', text="25th to 75th %ile: " + f"{int(exact_sfp['year'])} to {int(exact_tfp['year'])} " + "years to target", font=dict( color="black", size=15 ), showarrow=False) fig.update_layout( title=f"Percentiles by year after {inputs['n_simulations']} simulations.", xaxis_title="Age", yaxis_title="Amount (£)", legend_title="Legend Title", showlegend=False, font=dict( family="Arial", size=14 ), paper_bgcolor='rgba(0,0,0,0)', plot_bgcolor='rgba(0,0,0,0)', ) fig.write_html('outputs/mc-percentiles.html', auto_open=False) ``` This outputs the year by year percentiles to 'outputs/mc-percentiles.html'. The good part about the Plotly HTML output is that after [uploading to GitHub](https://raw.githubusercontent.com/shedloadofcode/monte-carlo-simulation/main/outputs/mc-percentiles.html) it can be viewed via [htmlpreview.github.io](https://htmlpreview.github.io). Go ahead and take a look at the [Monte Carlo percentile graph](https://htmlpreview.github.io/?https://raw.githubusercontent.com/shedloadofcode/monte-carlo-simulation/main/outputs/mc-percentiles.html). You can also use this to embed the interactive plot in a web page using an [iframe](https://developer.mozilla.org/en-US/docs/Web/HTML/Element/iframe) like the one below! ```html ``` We can see that the median amount crosses the target at age 51 given our inputs. However, given a better or worse outcome it could cross the target amount between ages 47 and 54. Monte Carlo simulations are a great way to deal with uncertainty when we simply don't know what the expected values (in this case investment returns) will be.
We can also take a look at the frequencies of the final pot sizes at the end age 65 in a histogram. We can see the 95% confidence interval is between £420k - £850k with the median pot size being £581k. This demonstrates the power of compounding and starting investing from an early age. ## Comparing the results I have used numerous scenarios as inputs to test this model against the calculator from [engaging-data.com](https://engaging-data.com) (ED) to see how the results align, which has been pretty fun. I set the average return on the ED calculator to **4%** as my model is a bit more pessimistic. As mentioned earlier, you can change the probability weights for a given set of returns in the 'get_random_returns' function if you feel more optimistic. Here are my findings in three scenarios: --- **Scenario 1 inputs** | Input | Value | | ------------------- | ----- | | Start age | 20 | | End age | 65 | | Starting pot | 5,000 | | Annual contributions | 6,000 | | Target amount | 300,000 | | Simulations | 10,000 | **Scenario 1 results** [View calculator results](https://engaging-data.com/fire-calculator/?age=20&initsav=5000&spend=6000&initinc=12000&wr=4&ir=0&retspend=12000&stockpct=80&fixpct=18&cashpct=2&graph=mc&secgraph=0&stockrtn=8.1&bondrtn=2.4&MCstockrtn=4&MCbondrtn=2&tax=0&income=0&incstart=50&incend=70&expense=0&expstart=50&expend=70) | Model | Years | Age | | ---------- | ----- | ------ | | This model | 30 | 50 | | ED calculator | 30.3 | 50 | --- **Scenario 2 inputs** | Input | Value | | ------------------- | ----- | | Start age | 30 | | End age | 65 | | Starting pot | 10,000 | | Annual contributions | 10,000 | | Target amount | 400,000 | | Simulations | 10,000 | **Scenario 2 results** [View calculator 
results](https://engaging-data.com/fire-calculator/?age=30&initsav=10000&spend=10000&initinc=20000&wr=4&ir=0&retspend=16000&stockpct=80&fixpct=18&cashpct=2&graph=mc&secgraph=0&stockrtn=8.1&bondrtn=2.4&MCstockrtn=4&MCbondrtn=2&tax=0&income=0&incstart=50&incend=70&expense=0&expstart=50&expend=70) | Model | Years | Age | | ---------- | ----- | ------ | | This model | 26 | 56 | | ED calculator | 25.2 | 55 | --- **Scenario 3 inputs** | Input | Value | | ------------------- | ----- | | Start age | 25 | | End age | 65 | | Starting pot | 20,000 | | Annual contributions | 20,000 | | Target amount | 1,000,000 | | Simulations | 10,000 | **Scenario 3 results** [View calculator results](https://engaging-data.com/fire-calculator/?age=25&initsav=20000&spend=20000&initinc=40000&wr=4&ir=0&retspend=40000&stockpct=80&fixpct=18&cashpct=2&graph=mc&secgraph=0&stockrtn=8.1&bondrtn=2.4&MCstockrtn=4&MCbondrtn=2&tax=0&income=0&incstart=50&incend=70&expense=0&expstart=50&expend=70) | Model | Years | Age | | ---------- | ----- | ------ | | This model | 30 | 55 | | ED calculator | 30.2 | 55 | --- As you can see the results are very closely aligned, so I'm very pleased with how well this model is performing. Of course, as [George Box said](/blog/programming-quotes-that-offer-wisdom-and-motivation/#using-statistics) *All models are wrong, but some are useful*. We should not forget that models and simulations can only give us an indication of possible outcomes, we should never blindly trust them but use them as tools. I think it's also important to keep them realistic and not introduce too much bias or ego into our assumptions. It would be great to get 10% returns every year, but is that realistically going to happen? Lowering our model's assumptions ensures we are closer to a 'worst case scenario' and any over-performance is a bonus! ## Conclusion We've learned lots on both Monte Carlo methods and creating / embedding Plotly visualisations with Python. 
Some of the techniques used in this article with Plotly can also be used for variations of [fan charts](https://analystsuncertaintytoolkit.github.io/UncertaintyWeb/chapter_6.html#fan-charts) typically used for forecasting and acknowledging uncertainty. I didn't quite get around to incrementing the results by 0.1 and plotting the circles which you can [achieve with Plotly shapes](https://plotly.com/python/shapes/#circles-positioned-relative-to-the-axis-data). Maybe this is something you can try to replicate if you want to. I actually preferred seeing the vertical lines show which age the amount goes above the target rather than the exact intersection - this also is the foundation of statistical process control charts to make variation in the results explicit. Thanks to [engaging-data.com](https://engaging-data.com) for giving me the inspiration to try and re-create this awesome model and visualisation with Python and Plotly. Finally, it's worth mentioning that [DataCamp](https://datacamp.pxf.io/EKAK42) has an interactive course [Monte Carlo Simulations in Python](https://datacamp.pxf.io/rQWmd5) and many other great courses on data science and machine learning. Read the full review [Developing your data science and analytical coding skills - a review of DataCamp](/blog/developing-your-data-science-and-analytical-coding-skills-a-review-of-datacamp). 
If you enjoyed this article you may also be interested in: * [Creating statistical neighbours comparator benchmarking models with Python](/blog/creating-statistical-neighbours-comparator-benchmarking-models-with-python/) * [Preparing for a statistical data science interview](/blog/preparing-for-a-statistical-data-science-interview) * [Six tips for producing and assuring high quality analytical code](/blog/six-tips-for-producing-and-assuring-high-quality-analytical-code) * [MIT OpenCourseWare Monte Carlo Simulation](https://www.youtube.com/watch?v=OgO1gpXSUzU) * [Uncertainty Toolkit for Analysts](https://analystsuncertaintytoolkit.github.io/UncertaintyWeb/introduction.html)

Six tips for producing and assuring high quality analytical code

Thu, 15 Dec 2022 10:40:00 GMT

In this article we'll look at six tips on producing solid analytical code and ensuring it is of high quality. As with all software engineering, the goal is to solve the problem while reducing complexity, creating useful abstractions, and keeping it simple! These tips are inspired by two excellent resources: [Quality assurance of code for analysis and research](https://best-practice-and-impact.github.io/qa-of-code-guidance/intro.html) and [The Turing Way](https://the-turing-way.netlify.app/welcome).

## Begin with the end in mind

Analysis can get complicated without a good roadmap of where you want to get to. What is the purpose of the analysis? What does the end result look like? It's worth asking questions like this first. You want to be able to describe it in one sentence to someone who's never heard of your project.

* A model to identify our most valuable customers.
* A model to allocate the correct amount of stock to each store.
* A model to forecast product sales.

This helps people understand 'what it does'. To explain 'how it does it' to those who are more curious, we might need a simple and clear solution diagram. It is the A to B summary, and I find it helps newcomers understand the technical big picture. It doesn't even have to be a diagram; it can be as simple as something like this in the README file:

```
Read sales data
|
---> Apply forecasting model
|
------> Output daily predicted sales for each product
|
---------> Email output to store manager
```

Without looking at any code I know what this model should do. Writing this before writing the code allows you to plan at a high level what the solution should actually do, and avoids coding parts that aren't needed. If you want to improve your system design skills more generally, check out the article [Five ways to improve your system design and software architecture skills](/blog/five-ways-to-improve-your-system-design-and-software-architecture-skills/).
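One way to carry a README outline like the sales forecasting one into code is to start with a skeleton of small functions, one per step, and fill them in later. The sketch below is only an illustration: every function name, the placeholder data and the "model" (which simply repeats today's sales) are assumptions made up for the example.

```python
def read_sales_data():
    """Placeholder: return hard-coded rows instead of reading a real source."""
    return [{"product": "widget", "units_sold": 10}]


def apply_forecasting_model(rows):
    """Placeholder model: predict that tomorrow's sales equal today's."""
    return [
        {"product": row["product"], "predicted_units": row["units_sold"]}
        for row in rows
    ]


def email_output(predictions):
    """Placeholder: build the message body rather than sending an email."""
    return "\n".join(
        f"{p['product']}: {p['predicted_units']}" for p in predictions
    )


def main():
    # Each line mirrors one step of the README outline.
    rows = read_sales_data()
    predictions = apply_forecasting_model(rows)
    return email_output(predictions)


if __name__ == "__main__":
    print(main())
```

Because `main` reads like the outline, a newcomer can follow the flow without opening any of the individual functions.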
## Structure your project neatly

A neat structure enables you and others to find the files they need quickly, and to make sense of the overall solution. [cookiecutter](https://drivendata.github.io/cookiecutter-data-science/) and [govcookiecutter](https://github.com/best-practice-and-impact/govcookiecutter) provide useful Data Science project structures.

```
├── LICENSE
├── Makefile           <- Makefile with commands like `make data` or `make train`
├── README.md          <- The top-level README for developers using this project.
├── data
│   ├── external       <- Data from third party sources.
│   ├── interim        <- Intermediate data that has been transformed.
│   ├── processed      <- The final, canonical data sets for modeling.
│   └── raw            <- The original, immutable data dump.
│
├── docs               <- A default Sphinx project; see sphinx-doc.org for details
│
├── models             <- Trained and serialized models, model predictions, or model summaries
│
├── notebooks          <- Jupyter notebooks. Naming convention is a number (for ordering),
│                         the creator's initials, and a short `-` delimited description, e.g.
│                         `1.0-jqp-initial-data-exploration`.
│
├── references         <- Data dictionaries, manuals, and all other explanatory materials.
│
├── reports            <- Generated analysis as HTML, PDF, LaTeX, etc.
│   └── figures        <- Generated graphics and figures to be used in reporting
│
├── requirements.txt   <- The requirements file for reproducing the analysis environment, e.g.
│                         generated with `pip freeze > requirements.txt`
│
├── setup.py           <- Make this project pip installable with `pip install -e`
├── src                <- Source code for use in this project.
│   ├── __init__.py    <- Makes src a Python module
│   │
│   ├── data           <- Scripts to download or generate data
│   │   └── make_dataset.py
│   │
│   ├── features       <- Scripts to turn raw data into features for modeling
│   │   └── build_features.py
│   │
│   ├── models         <- Scripts to train models and then use trained models to make
│   │   │                 predictions
│   │   ├── predict_model.py
│   │   └── train_model.py
│   │
│   └── visualization  <- Scripts to create exploratory and results oriented visualizations
│       └── visualize.py
│
└── tox.ini            <- tox file with settings for running tox; see tox.readthedocs.io
```

This project structure might be too complex for simpler projects, but it gives you a start and you can reduce or repurpose from there. Just a 'data', 'models', 'notebooks', 'output' and 'tests' folder might be enough, with a 'src' directory for helper modules/functions and a good README. This structure can also be replicated for projects [where R is used](https://www.r-bloggers.com/2018/08/structuring-r-projects/) instead of Python.

## Use version control always

You may think it's only a small project and that using version control is too complex for it. Always use version control! Your future self will thank you 😄 It lets you back up your work, collaborate with others using branches, revert to previous versions, and more. Plus, there's usually no good reason not to use it!

First create a repository with a repository hosting provider such as [GitHub](https://www.github.com). Then, in your working directory, initialise the directory as a repo, stage your files and push your initial commit.

```
git init
git add .
git commit -m "Initial commit"
git branch -M main
git remote add origin https://github.com/your-username/repo-name.git
git push -u origin main
```

Then every time you make a change, commit again. Keep commits small and frequent, rather than committing lots of changes all in one go. Then push to the remote repository every once in a while so your changes are backed up.

```
git add .
git commit -m "Add new percentage calculations to model" git push ``` There are [many commands with Git](https://git-scm.com/docs) you should explore, the most useful are to `revert` to a previous commit, and to create a new `branch` to work on something separately before you `merge` it back to the main branch. You can see the whole history of the project with every commit using `git log --graph`. ## Keep it reproducible with a virtual environment and README A virtual environment is a collection of packages / dependencies that gives you everything you need to run a project. It solves 'but it works on my machine' problems. You want your analysis to be reproducible, which means someone should be able to clone your repo, install the package dependencies and run your code successfully. For Python there is the [venv](https://docs.python.org/3/library/venv.html) and pipenv packages and for R there is the [renv](https://rstudio.github.io/renv/articles/renv.html) and [packrat](https://rstudio.github.io/packrat/) packages. I prefer [venv](https://docs.python.org/3/library/venv.html) and [renv](https://rstudio.github.io/renv/articles/renv.html). When someone first clones your repo, there may be other steps they have to go through to run your code too. There may be environment variables that need adding to a `.env` file or sensitive data files adding that could not be stored in version control. A [good README.md file](https://www.freecodecamp.org/news/how-to-write-a-good-readme-file/) helps with the setup steps. Here I have used some setup steps from an analytical web app I worked on recently which used the [Django](https://www.djangoproject.com/) Python web framework. ```text [README.md] # My data visualisation app This app presents data visualisation in a web interface. 
## Features * Security and user login * HTTPS Let's Encrypt * Object-relational mapping * Integration to Google Sheets API ## Running locally * Create and activate a virtual environment python -m venv venv .\venv\Scripts\activate python -m pip install python -m pip install -r requirements.txt * To deactivate use: deactivate * To install new packages use: python -m pip install * To register newly installed packages use: python -m pip freeze > requirements.txt * Create the database 'db.sqlite3' and migrate the latest schema using: python manage.py migrate * Create a superuser account to login using: python manage.py createsuperuser Username: admin Email address: Password: admin Bypass password validation and create user anyway? [y/N]: y * Pre-populate the database with some testing data (optional): python manage.py loaddata responses.json * Add environment variable file '.env' in /home directory with: ENVIRONMENT='Development' SECRET_KEY='' EMAIL_HOST='' EMAIL_HOST_USER='' EMAIL_HOST_PASSWORD='' DEFAULT_FROM_EMAIL='' * Run the application using: python manage.py runserver ``` ## Keep code modular, adaptable, documented and simple Some problems do sometimes call for quite complex solutions, but by abstracting away some of that complexity into easy to understand classes, methods, functions and variables we can make it simpler. The main characteristics of high quality code are: * Clean and consistent style * Functional * Easy to understand for others * Efficient * Testable * Easy to maintain * Easy to change and adapt * Well documented We can achieve most of these things by creating well defined classes, methods and functions that do what they say they will, are well documented and are testable. We can also refactor early and often to ensure the code is the most readable it can be - we write code for humans more so than computers! 
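As a small illustration of refactoring for readability, the sketch below shows the same calculation before and after. The function names and the two rates are hypothetical, chosen only for the example.

```python
# Before: one opaque expression with unexplained numbers.
def calc(x):
    return round(x * 0.8 * 1.2, 2)


# After: the same calculation, with named constants and a docstring.
DISCOUNT_RATE = 0.2  # hypothetical value for illustration
TAX_RATE = 0.2       # hypothetical value for illustration


def price_after_discount_and_tax(price: float) -> float:
    """Apply the discount, then add tax, rounded to two decimal places."""
    discounted = price * (1 - DISCOUNT_RATE)
    return round(discounted * (1 + TAX_RATE), 2)


print(calc(100), price_after_discount_and_tax(100))
```

Both functions return the same result, but only the second tells the reader what the numbers mean and is easy to change when a rate changes.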
Following a style guide such as the [Google Python style guide](https://google.github.io/styleguide/pyguide.html) or the [Tidyverse R style guide](https://style.tidyverse.org/index.html) can also keep the code standardised. Files should start with a docstring describing the contents and usage of the module: ```python """A one line summary of the module or program, terminated by a period. Leave one blank line. The rest of this docstring should contain an overall description of the module or program. Optionally, it may also contain a brief description of exported classes and functions and/or usage examples. Typical usage example: foo = ClassFoo() bar = foo.FunctionBar() """ ``` R function docstring: ```r #' Short title for function #' #' @description #' Longer description of the function #' #' @param first An object of class "?". Description of parameter #' @param second An object of class "?". Description of parameter #' @return Returns an object of class "?". Description of what the function returns #' @examples #' # Add some code illustrating how to use the function my_new_function <- function(first, second) { return("hello world") } ``` JavaScript function docstring: ```js /** * Summary. (use period) * * Description. (use period) * * @see Function/class relied on * @link URL * * @param {type} var Description. * @param {type} [var] Description of optional variable. * @param {type} [var=default] Description of optional variable with default variable. * @param {Object} objectVar Description. * @param {type} objectVar.key Description of a key in the objectVar parameter. * * @yield {type} Yielded value description. * * @return {type} Return value description. */ function myNewFunction () { return "hello world"; } ``` Python function docstring: ```python def my_new_function(first: str, second: int) -> str: """Short title for function. Longer description of the function. Args: first (str): A description of the first argument. 
second (int): A description of the second argument. Returns: result (str): A description of the return value. Raises: IOError: A description of the error raised. """ result = first + str(second) return result ```

Not only do docstrings make your code easier for yourself and others to understand, the best part is that you can auto-generate documentation using [Sphinx for Python](https://www.sphinx-doc.org/en/master/) and [Roxygen for R](https://cran.r-project.org/web/packages/roxygen2/vignettes/roxygen2.html)! These require another article to go through but are really useful for keeping documentation up to date.

We can also make any code more adaptable by not hardcoding configuration values and instead putting them in a YAML or JSON config file. This makes input parameters easier to change quickly, so you can see the effect of each change on the outputs.

```yaml [config.yaml]
input_path: "C:/a/very/specific/path/to/input_data.csv"
output_path: "outputs/predictions.csv"
test_split_proportion: 0.3
random_seed: 42
prediction_parameters:
  constant_a: 7
  max_v: 1000
```

```python [model.py]
import yaml
from pandas import read_csv  # assumes read_csv is the pandas function

with open("./config.yaml") as file:
    # safe_load needs no Loader argument and will not construct arbitrary objects
    config = yaml.safe_load(file)

data = read_csv(config["input_path"])
...
```

```r [model.R]
config <- yaml::yaml.load_file("config.yaml")
data <- read.csv(config$input_path)
...
```

## Use automated unit tests and peer review

Using a unit testing framework like [pytest](https://docs.pytest.org/en/7.2.x/), [unittest](https://docs.python.org/3/library/unittest.html), [testthat](https://testthat.r-lib.org/) or [RUnit](https://www.rdocumentation.org/packages/RUnit/versions/0.4.32) will help you to check whether those nicely documented functions you wrote actually do what they say they should. Test driven development, to me, simply means you are the first user of your own code. If all your functions, classes and methods do what they are expected to do, we can be much more confident that the overall program will behave as expected.
These same frameworks can be used to write higher level acceptance tests too like 'does the whole program produce somewhat expected results?'. This tests the overall behaviour of the code as opposed to the implementation. Don't aim for 100% test coverage, I think testing the critical functions and most realistic use cases of your code are the most important. Create your first tests and build your library of tests from there. A unit test should be small, it should run fast and it should test one unit of code. Below is an example of a unit test with pytest. This one fails as the function does not return the number multiplied by 3 but by 2! All test files must begin 'test_' before running the `pytest` command in the same directory. It also helps readability to use the [arrange, act, assert pattern](https://automationpanda.com/2020/07/07/arrange-act-assert-a-pattern-for-writing-good-tests/). ```python [test_calculations.py] def times_number_by_three(number: float): return number * 2 def test_times_number_by_three(): # Arrange value = 3 # Act result = times_number_by_three(value) # Assert expected = 9 assert result == expected ``` Next is the same example but using R and [testthat](https://testthat.r-lib.org/). RStudio will automatically recognise the `test_that` function and give a 'Run Tests' option in the top right. Alternatively you can use the command `testthat::test_file("test_calculations.R")` to test a single file. ```r [test_calculations.R] library(testthat) time_number_by_three <- function(number) { return(number * 2) } test_that("number_is_multiplied_by_three", { # Arrange value <- 3 # Act result <- time_number_by_three(value) # Assert expected <- 9 expect_equal(result, expected) }) ``` Other things to be aware of when testing are: * The function you want to test doesn't have to be in the test file like in these examples, you can import it from elsewhere in your project making testing super simple. 
* You can also split your tests up into separate files to keep the project structure clean.
* You can create tests to validate any outputs and check the behaviour of the code as QA and acceptance tests.
* You can run all test files in a directory with both [pytest](https://docs.pytest.org/en/7.1.x/getting-started.html#run-multiple-tests) and [testthat](https://testthat.r-lib.org/reference/test_dir.html) fully automating your test suite.

Finally, although automation is great and having a suite of tests you can run every time you introduce a new change gives you confidence, having peer review is equally important. This is where someone else reviews your code and checks that it is readable, understandable and actually works. When reviewing code you should ask yourself these questions:

* Can I easily understand what the code does?
* Is the code sufficiently documented for me to understand it?
* Is there duplication in the code that could be simplified by refactoring into functions and classes?
* Are functions and class methods simple, using few parameters?
* Does the code fulfil its requirements?
* Is the required functionality tested sufficiently?
* How easy will it be to alter this code when requirements change? They always do.
* Are high level parameters kept in dedicated configuration files? Or would somebody need to work their way through the code with lots of manual edits to reconfigure for a new run?
* Can I generate the same outputs that the analysis claims to produce?
* Have dependencies been sufficiently documented?
* Is the code version, input data version and configuration recorded?

In the useful site I shared at the beginning of this article, you can find [code quality assurance checklists](https://best-practice-and-impact.github.io/qa-of-code-guidance/checklists.html) for analytical projects which seem a really good starting point too.

## Conclusion

These six tips should make any analytical project you start a pleasure to work on.
Spending the time to really think about the end goal, keep things simple and get your project structure set is worth it. I think it was Abraham Lincoln who said "give me six hours to chop down a tree and I will spend the first four sharpening the axe". Solid advice we should all take. Thanks for reading 👍 If you enjoyed this article you might also like the article [Preparing for a statistical data science interview](/blog/preparing-for-a-statistical-data-science-interview/). Here are some recommended resources for further learning: * [The Pragmatic Programmer, The: Your journey to mastery, 20th Anniversary Edition](https://www.amazon.co.uk/Pragmatic-Programmer-journey-mastery-Anniversary/dp/0135957052/) * [The Effective Engineer: How to Leverage Your Efforts In Software Engineering to Make a Disproportionate and Meaningful Impact](https://www.amazon.co.uk/Effective-Engineer-Engineering-Disproportionate-Meaningful/dp/0996128107/) * [Tips for urgent quality assurance of ad-hoc statistical analysis](https://gss.civilservice.gov.uk/policy-store/top-tips-for-quality-assuring-urgent-pieces-of-ad-hoc-statistical-analysis/) * [Tips for urgent quality assurance of data](https://gss.civilservice.gov.uk/policy-store/tips-for-urgent-quality-assurance-of-data/)

Creating statistical neighbours comparator benchmarking models with Python

Wed, 23 Nov 2022 12:25:00 GMT

This article will explore how to get started creating a statistical neighbours model to benchmark, compare and find similar observations within a dataset. This might be comparing the sales of a store only to other stores that are statistically similar in terms of size, budget and staffing, or comparing school attendance performance for a given area only to other areas of similar size, pupil numbers and other characteristics. The main problem with comparator models is how to define what counts as statistically 'similar'. We will explore two approaches to solving this problem.

**None of the data used in this article is real data. It has been adapted and modified based upon real data sources for learning purposes.**

## Filtering approach

In this dummy dataset [school_data.xlsx](https://github.com/shedloadofcode/data-files/blob/main/school_data.xlsx?raw=true), which I adapted from two good open data sources, [Explore education statistics](https://explore-education-statistics.service.gov.uk/find-statistics/pupil-attendance-in-schools) and [Get Information about Schools](https://www.get-information-schools.service.gov.uk/), there are around 1,800 schools, but we only want to compare a school's attendance levels to its top ten most statistically similar schools in terms of pupil size, alongside FSM (free school meals) and SEN (special educational needs) characteristics.
| School | Attendance% | Pupils | FSM | SEN | Phase | LocationID |
| ----------- | ----------- | ------ | --- | --- | ------- | ---------- |
| SCHOOL-0001 | 98.2 | 63 | 5 | 6 | PHASE-1 | 855 |
| SCHOOL-0002 | 81 | 1229 | 257 | 72 | PHASE-2 | 873 |
| SCHOOL-0003 | 94.8 | 250 | 10 | 16 | PHASE-1 | 891 |
| SCHOOL-0004 | 94.5 | 653 | 78 | 89 | PHASE-1 | 856 |
| SCHOOL-0005 | 93.9 | 463 | 83 | 45 | PHASE-1 | 866 |
| SCHOOL-0006 | 94.2 | 918 | 156 | 131 | PHASE-2 | 865 |
| SCHOOL-0007 | 0 | 81 | 25 | 18 | PHASE-2 | 888 |
| SCHOOL-0008 | 91.4 | 195 | 83 | 29 | PHASE-1 | 888 |
| SCHOOL-0009 | 96.5 | 223 | 89 | 63 | PHASE-1 | 888 |
| SCHOOL-0010 | 92.5 | 719 | 253 | 130 | PHASE-2 | 209 |
| ... | ... | ... | ... | ... | ... | ... |

For each school, we will apply a series of filters to find its top ten comparators in terms of both pupil size and characteristics like FSM and SEN.

```python ["attendance_comparators.py"]
"""
A model to identify school comparators based on their size
and characteristics in order to compare attendance performance.

Assumptions:
- Schools will only be compared to schools of the same phase type.
- Results will be the top ten statistically closest schools.
- The comparators will be based on attendance %.

Functionality:
- Ability to compare against schools of a similar size.
- Ability to compare against schools with similar characteristics
"""
import os
import time

import pandas as pd


def get_data() -> pd.DataFrame:
    """
    Reads the Excel dataset into a Pandas DataFrame and
    adds new features such as %FSM and %SEN.
    """
    df = pd.read_excel("school_data.xlsx")

    df["Attendance%"] = df["Attendance%"] * 100
    df["%FSM"] = (df["FSM"] / df["Pupils"]) * 100
    df["%SEN"] = (df["SEN"] / df["Pupils"]) * 100

    return df


def generate_all_comparators(output_all_to_csv: bool = False) -> pd.DataFrame:
    """
    Generates the top 10 comparators for every school in the dataset,
    for each of the 2 comparator groups (size, characteristics).

    Optionally saves the result to CSV files, where the folder name
    is the name of the school, when output_all_to_csv is set to True.

    Returns the full school-to-comparator mapping as a DataFrame.
    """
    df = get_data()
    df_length = len(df)
    comparator_mappings = []

    for index, row in df.iterrows():
        school_name = row["SchoolName"]

        similar_sized_comparators = find_similar_sized_comparators(
            school_name=school_name, df=df
        )
        similar_characteristics_comparators = find_similar_characteristics_comparators(
            school_name=school_name, df=df
        )

        add_comparators_to_mappings(
            comparators=similar_sized_comparators,
            mappings=comparator_mappings,
            school_name=school_name,
            grouping="Size"
        )
        add_comparators_to_mappings(
            comparators=similar_characteristics_comparators,
            mappings=comparator_mappings,
            school_name=school_name,
            grouping="Characteristics"
        )

        if output_all_to_csv:
            if not os.path.exists("output"):
                os.mkdir("output")

            # Slashes in a school name would be read as sub-directories
            school_name = school_name.replace("/", "")
            directory = f"output/{school_name}"

            if not os.path.exists(directory):
                os.mkdir(directory)

            similar_sized_comparators.drop(
                columns=["Unnamed: 0"], inplace=True)
            similar_characteristics_comparators.drop(
                columns=["Unnamed: 0"], inplace=True)

            similar_sized_comparators.to_csv(
                directory + "/similar_sized_comparators.csv", index=False
            )
            similar_characteristics_comparators.to_csv(
                directory + "/similar_characteristics_comparators.csv", index=False
            )

        print(f"{index + 1} of {df_length} done.")

    return pd.DataFrame.from_records(comparator_mappings)


def add_comparators_to_mappings(comparators, mappings, school_name, grouping) -> None:
    """
    Builds the final output by adding all of the comparators
    from the size and characteristics DataFrames to the mapping
    list in JSON / dictionary format:

    [
        { "School": "A", "Comparator": "B", "Grouping": "Size" },
        { "School": "B", "Comparator": "C", "Grouping": "Characteristics" },
    ]

    Which results in the final output:

    School  Comparator  Grouping
    A       B           Size
    A       D           Size
    B       D           Characteristics

    Avoids adding the target school_name as its own comparator.
    """
    for index, row in comparators.iterrows():
        comparator_school_name = row["SchoolName"]

        if comparator_school_name != school_name:
            mappings.append({
                "School": school_name,
                "Comparator": comparator_school_name,
                "Grouping": grouping
            })


def find_similar_sized_comparators(school_name: str, df: pd.DataFrame) -> pd.DataFrame:
    """
    Finds schools of a similar size and returns as comparators.

    This comparator is calculated by the total number of pupils
    in each school, per organisation type. The groupings for each
    organisation type will be calculated based on the highest
    and lowest pupil count for schools in that category
    i.e. within a given % threshold
    """
    school = df[df["SchoolName"] == school_name]
    school_size = school["Pupils"].values[0]
    school_type = school["Phase"].values[0]

    schools_with_same_type = df[df["Phase"] == school_type]

    upper_size_threshold = school_size * 1.25
    lower_size_threshold = school_size * 0.75

    schools_of_similar_size = schools_with_same_type[
        (schools_with_same_type["Pupils"] >= lower_size_threshold)
        & (schools_with_same_type["Pupils"] <= upper_size_threshold)
    ].copy(deep=True)

    schools_of_similar_size["Size difference"] = (abs(
        schools_of_similar_size["Pupils"] - school_size
    ))

    # 11 rather than 10 because the target school is still in the frame
    schools_of_similar_size = schools_of_similar_size.nsmallest(
        11, "Size difference")

    schools_of_similar_size["Rank"] = (
        schools_of_similar_size["Attendance%"].rank(
            ascending=False
        )
    )

    return schools_of_similar_size


def find_similar_characteristics_comparators(school_name: str, df: pd.DataFrame) -> pd.DataFrame:
    """
    Finds schools with similar %FSM and %SEN characteristics
    and returns as comparators.
    """
    school = df[df["SchoolName"] == school_name]
    school_type = school["Phase"].values[0]
    school_fsm_percentage = school["%FSM"].values[0]
    school_sen_percentage = school["%SEN"].values[0]

    schools_with_same_type = df[df["Phase"] == school_type]

    upper_fsm_threshold = school_fsm_percentage * 1.1
    lower_fsm_threshold = school_fsm_percentage * 0.9
    upper_sen_threshold = school_sen_percentage * 1.1
    # Kept separate from the FSM bound so neither lower limit overwrites the other
    lower_sen_threshold = school_sen_percentage * 0.9

    schools_with_similar_characteristics = schools_with_same_type[
        (schools_with_same_type["%FSM"] >= lower_fsm_threshold)
        & (schools_with_same_type["%FSM"] <= upper_fsm_threshold)
        & (schools_with_same_type["%SEN"] >= lower_sen_threshold)
        & (schools_with_same_type["%SEN"] <= upper_sen_threshold)
    ].copy(deep=True)

    schools_with_similar_characteristics["Characteristics difference"] = (
        abs(schools_with_similar_characteristics["%FSM"] - school_fsm_percentage)
        + abs(schools_with_similar_characteristics["%SEN"] - school_sen_percentage)
    )

    schools_with_similar_characteristics = schools_with_similar_characteristics.nsmallest(
        11, "Characteristics difference"
    )

    schools_with_similar_characteristics["Rank"] = (
        schools_with_similar_characteristics["Attendance%"].rank(
            ascending=False
        )
    )

    return schools_with_similar_characteristics


if __name__ == "__main__":
    start = time.time()

    output = generate_all_comparators(
        output_all_to_csv=True
    )
    output.to_csv("output/comparator-mappings.csv", index=False)

    end = time.time()
    print(f"Model finished in {round(end - start, 2)} seconds.")
```

If the `output_all_to_csv` flag is set to True then a folder will be created for each school in the `output` directory, containing all of its comparators for both size and pupil characteristics. An example of one of these outputs for 'SCHOOL-005' can be seen in the image below.
We can see within `similar_characteristics_comparators.csv` that %FSM and %SEN are within the upper and lower thresholds, and within `similar_sized_comparators.csv` that Pupils are within the upper and lower thresholds. This shows the model is filtering and ranking only those observations that fit inside these parameters. Within the `output` directory, there is also the full list of comparators in the `comparator-mappings.csv` file.

If we also had columns for 'Easting' and 'Northing' for these schools, we could add another filter to find the top ten geospatially closest schools.

```python ["attendance_comparators.py"]
from scipy.spatial import distance


def find_similar_location_comparators(school_name: str, df: pd.DataFrame) -> pd.DataFrame:
    """
    Finds schools which are geospatially closest and returns
    as comparators.
    """
    # "SchoolName" to match the column used by the other functions in this file
    school = df[df["SchoolName"] == school_name]
    school_location_id = school["LocationID"].values[0]
    school_type = school["Phase"].values[0]
    schools_with_same_type = df[df["Phase"] == school_type]

    school_easting = school["Easting"].values[0]
    school_northing = school["Northing"].values[0]

    location_data_available = (
        (school_easting != 0) & (school_northing != 0)
    )

    if location_data_available:
        geo_comparators = schools_with_same_type \
            .copy(deep=True) \
            .reset_index()

        distances = []

        for _, row in geo_comparators.iterrows():
            a = (school_easting, school_northing)
            b = (row["Easting"], row["Northing"])
            distances.append(
                distance.euclidean(a, b)
            )

        geo_comparators["distance"] = pd.Series(distances)

        geo_comparators = geo_comparators[
            geo_comparators["Phase"] == school_type
        ]
        geo_comparators = geo_comparators.sort_values(
            by="distance", ascending=True
        )
        # 11 rather than 10 because the target school has a distance of 0
        geo_comparators = geo_comparators.head(11)

        return geo_comparators

    # No coordinates: fall back to schools sharing the same LocationID
    schools_in_same_area = schools_with_same_type[
        (schools_with_same_type["LocationID"] == school_location_id)
    ].copy(deep=True)

    if len(schools_in_same_area) <= 11:
        return schools_in_same_area

    sample = schools_in_same_area.sample(n=10)
    # DataFrame.append was removed in pandas 2.0, so use pd.concat instead
    sample = pd.concat([sample, school])

    return sample
```

## Scoring approach

In the next example, our dummy dataset [la_data.csv](https://raw.githubusercontent.com/shedloadofcode/data-files/main/la_data.csv) (adapted from a dataset taken from the [ONS](https://www.ons.gov.uk/peoplepopulationandcommunity/personalandhouseholdfinances/incomeandwealth/articles/mappingincomedeprivationatalocalauthoritylevel/2021-05-24)) is at Local Authority (area) level.

| Local Authority District code (2019) | Local Authority District name (2019) | Profile | Rural-urban classification | Deprivation gap (percentage points) | Deprivation gap % | Deprivation gap ranking | Moran's I | Moran's I ranking | Income deprivation rate | Income deprivation rate ranking | Income deprivation rate quintile | % of households with 3 or more children | School pupils | School attendance % | Schools total spending £ | School spend per pupil £ | School Free School Meal % |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| E07000223 | Adur | n-shape | Urban with City and Town | 21.70% | 21.70 | 233 | 0.17 | 234 | 10.80% | 158 | 3 | 10 | 37437 | 76 | 307104 | 8.20 | 28.70 |
| E07000026 | Allerdale | Flat | Mainly Rural (rural including hub towns >=80%) | 36.60% | 36.60 | 95 | 0.29 | 157 | 12.10% | 130 | 3 | 16 | 40461 | 69 | 869572 | 21.49 | 43.60 |
| E07000032 | Amber Valley | n-shape | Urban with Minor Conurbation | 32.90% | 32.90 | 121 | 0.29 | 157 | 10.90% | 153 | 3 | 6 | 22981 | 44 | 652505 | 28.39 | 39.90 |
| E07000224 | Arun | n-shape | Urban with City and Town | 28.70% | 28.70 | 164 | 0.31 | 139 | 10.40% | 171 | 3 | 25 | 34449 | 64 | 437529 | 12.70 | 35.70 |
| E07000170 | Ashfield | More income deprived | Urban with City and Town | 36.00% | 36.00 | 98 | 0.15 | 246 | 15.20% | 72 | 2 | 11 | 9366 | 50 | 770050 | 82.22 | 43.00 |
| E07000105 | Ashford | n-shape | Urban with Significant Rural (rural including hub towns 26-49%) | 29.10% | 29.10 | 160 | 0.34 | 116 | 11.00% | 150 | 3 | 26 | 38834 | 71 | 613225 | 15.79 | 36.10 |
| E07000004 | Aylesbury Vale | Less income deprived | Largely Rural (rural including hub towns 50-79%) | 19.60% | 19.60 | 264 | 0.47 | 55 | 6.70% | 272 | 5 | 22 | 38433 | 56 | 609848 | 15.87 | 26.60 |
| E07000200 | Babergh | Less income deprived | Mainly Rural (rural including hub towns >=80%) | 16.90% | 16.90 | 280 | 0.17 | 234 | 8.00% | 232 | 4 | 21 | 48694 | 53 | 146570 | 3.01 | 23.90 |
| E09000002 | Barking and Dagenham | More income deprived | Urban with Major Conurbation | 25.40% | 25.40 | 195 | 0.27 | 175 | 19.40% | 20 | 1 | 21 | 36548 | 89 | 326135 | 8.92 | 32.40 |
| E09000003 | Barnet | n-shape | Urban with Major Conurbation | 31.90% | 31.90 | 132 | 0.36 | 105 | 11.10% | 148 | 3 | 15 | 48851 | 33 | 448473 | 9.18 | 38.90 |

We want to compare a Local Authority area only to other statistically similar areas, not on just one factor but on many (or all) of the numeric factors available, and score them in terms of 'closeness'. This will find the top ten closest neighbours for comparisons and benchmarking.
```python [statistical_neighbours.py]
import pandas as pd


def find_statistical_neighbours_for(local_authority_district_code: str) -> pd.DataFrame:
    df = pd.read_csv(
        filepath_or_buffer="la_data.csv",
        encoding="cp1252"
    )

    df["Comparator score"] = 0
    df["Comparator variables"] = ""

    target_la = df.loc[
        (df["Local Authority District code (2019)"] == local_authority_district_code)
    ]

    # Variable name -> weight added to the score when the variable is similar
    comparison_variables = {
        "Deprivation gap %": 1,
        "Deprivation gap ranking": 1,
        "Moran's I ranking": 1,
        "Income deprivation %": 1,
        "Income deprivation rate ranking": 1,
        "% of households with 3 or more children ": 1,
        "School pupils": 2,
        "School Free School Meal %": 2
    }

    # compare the comparator variables for each LA against the target LA and score them
    for index, row in df.iterrows():
        is_target_la = (
            row["Local Authority District code (2019)"] == local_authority_district_code
        )

        if is_target_la:
            continue

        for variable in comparison_variables:
            if variables_are_statistically_similar(target_la[variable].values[0], row[variable]):
                df.loc[index, 'Comparator score'] = (
                    df.loc[index, 'Comparator score'] + comparison_variables[variable]
                )
                df.loc[index, 'Comparator variables'] = (
                    df.loc[index, 'Comparator variables'] + variable + ", "
                )

    # DataFrame.append was removed in pandas 2.0, so use pd.concat instead.
    # The target LA goes last so the print functions can pick it up with iloc[-1].
    return pd.concat([df.nlargest(10, "Comparator score"), target_la])


def variables_are_statistically_similar(target: float, comparator: float) -> bool:
    # "Similar" here means strictly within +/- 10% of the target value
    upper_bound = target * 1.10
    lower_bound = target * 0.90

    comparator_is_within_range = (
        comparator > lower_bound and comparator < upper_bound
    )

    return comparator_is_within_range


def print_attendance_comparisons(df: pd.DataFrame) -> None:
    target_la = df.iloc[-1]
    df = df[: -1]

    la_name = target_la["Local Authority District name (2019)"]
    la_school_attendance_percentage = target_la["School attendance %"]
    average_comparator_attendance_percentage = df["School attendance %"].mean()

    print("The average school attendance percentage of your comparator LAs was ", end="")
    print(f"{average_comparator_attendance_percentage}%", end="\n")
    print(f"School attendance in {la_name} was {la_school_attendance_percentage}%", end="\n")

    attendance_percentage_difference = (
        la_school_attendance_percentage - average_comparator_attendance_percentage
    )
    attendance_percentage_difference = round(abs(attendance_percentage_difference), 2)

    if la_school_attendance_percentage < average_comparator_attendance_percentage:
        print(
            f"This is {attendance_percentage_difference} "
            f"percentage points lower than your comparator LAs"
        )
    else:
        print(
            f"This is {attendance_percentage_difference} "
            f"percentage points higher than your comparator LAs"
        )


def print_spending_comparisons(df: pd.DataFrame) -> None:
    target_la = df.iloc[-1]
    df = df[: -1]

    la_name = target_la["Local Authority District name (2019)"]
    la_school_spending = target_la["Schools total spending £"]
    average_comparator_spending = df["Schools total spending £"].mean()

    print("", end="\n\n")
    print("The average school spending of your comparator LAs was ", end="")
    print(f"£{average_comparator_spending}", end="\n")
    print(f"School spending in {la_name} was £{la_school_spending}", end="\n")

    spending_difference = (
        la_school_spending - average_comparator_spending
    )
    spending_difference = round(abs(spending_difference), 2)

    if la_school_spending < average_comparator_spending:
        print(f"This is £{spending_difference} lower than your comparator LAs")
    else:
        print(f"This is £{spending_difference} higher than your comparator LAs")


if __name__ == "__main__":
    comparators = find_statistical_neighbours_for("E07000150")

    print_attendance_comparisons(comparators)
    print_spending_comparisons(comparators)

    # The context manager closes the file even if writing fails
    with open("index.html", "w") as html_file:
        html_file.write(comparators.to_html())
```

The scoring model works by first assigning a weight to each variable in the `comparison_variables` dictionary. For every other Local Authority, it then checks each of these variables against the target Local Authority with `variables_are_statistically_similar()` and, where the values are similar, increments that Local Authority's score by the variable's weight.
The scoring model then prints some summary information to the console, comparing the target Local Authority's school attendance and school spending against the averages for its comparator Local Authorities. It then writes the comparators for the target Local Authority to an HTML file, 'index.html', so you can see which has the highest score.

The output could be made to look a little nicer with some CSS styling, but it clearly shows which Local Authorities are the 'closest' across all of the comparison variables, and the 'Comparator variables' column shows which variables drove those scores. The target Local Authority (in this example Corby) is at the bottom of the table to refer back to. Go ahead and try plugging different Local Authority District codes into the `find_statistical_neighbours_for()` function to see how it performs!

## What we learned

We have covered using both filtering and scoring approaches to solve statistical neighbour problems. You can now apply these models to other problems in different domains. It is very useful to be able to compare only against observations that are statistically similar - it makes the comparison analysis more tailored and, as a result, the conclusions and decisions more relevant. It is much better to compare and benchmark observations against those with similar characteristics, otherwise you may end up making decisions that don't really apply to the school, local authority, store, or anything else the observation may be!

I mostly used an iterative approach whilst putting these solutions together, such as looping over DataFrame rows. If you can think of more efficient ways to solve these statistical neighbour problems for larger datasets, or have any other comparator techniques you would like to share, please post a comment in the comment section below! As always, if you liked this article please check out [other articles](/) on the site.

Building an AutoTrader scraper with Python to search for multiple makes and models

Fri, 21 Oct 2022 11:35:00 GMT

**Update November 2023:** Please check out the new Autotrader scraper in the article [How to scrape AutoTrader with Python and Selenium to search for multiple makes and models](/blog/how-to-scrape-autotrader-with-python-and-selenium-to-search-for-multiple-makes-and-models/) which uses Python, Selenium and RegEx. --- **Update September 2023:** The Autotrader UK website has since changed its layout, breaking this scraper. It last worked around June 2023, when I used it to find a used Honda Jazz! It seems HTML classes such as 'product-card-details__title' have been changed, making scraping more difficult. Thanks to everyone for your great feedback on this scraper. I will keep looking for an alternative or workaround and update the article if I find one 👍 Still lots to learn from this article! --- Searching for used cars can be difficult and time consuming. AutoTrader is a great place to perform this search but, as far as I can see, it does not allow you to search for multiple makes and models in one search. Who wants to keep going back and forth between previously saved searches, right? Wouldn't it be so much easier if you could compare all of them in one list? ## Designing the solution A solution would need to perform the following steps for each make and model given as inputs: 1. Go to [AutoTrader](https://www.autotrader.co.uk). 2. Search for the given make and model with filters applied (price, year, mileage etc). 3. Scrape information from each of the listings. 4. Add the information to a list. We can then output the information to CSV for further data analysis. Finally, the CSV output can be formatted to make it easier to read and find the best car even faster.
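The steps above can be sketched as a simple loop. This is only an outline under stated assumptions: `search_many` and `fake_fetch` are hypothetical names, and `fake_fetch` returns canned, made-up listings in place of a real request so the sketch runs without touching the network.

```python
from typing import Callable


def search_many(
    searches: list,
    fetch: Callable[[str, str], list],
) -> list:
    """Run one search per (make, model) pair and pool every listing into one list."""
    results = []
    for make, model in searches:
        for listing in fetch(make, model):
            # Tag each listing so the combined list can still be filtered by make and model
            results.append({"make": make, "model": model, **listing})
    return results


def fake_fetch(make: str, model: str) -> list:
    """Hypothetical stand-in for a scraper: returns canned data, no network access."""
    canned = {
        ("Honda", "Jazz"): [{"name": "Honda Jazz 1.4", "price": 4500}],
        ("Toyota", "Yaris"): [
            {"name": "Toyota Yaris 1.33", "price": 3995},
            {"name": "Toyota Yaris 1.0", "price": 2500},
        ],
    }
    return canned.get((make, model), [])


combined = search_many([("Honda", "Jazz"), ("Toyota", "Yaris")], fake_fetch)
print(len(combined), "listings in one list")
```

Passing the fetch function in as an argument keeps the looping and combining logic separate from the scraping itself, so the scraper can be swapped or fixed without touching the rest.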
## Installing required packages This will rely on a few Python packages so if you are following along or are wanting to use this program yourself, install the following: ``` python -m pip install numpy pandas requests cloudscraper bs4 xlsxwriter openpyxl ``` ## Building the AutoTrader scraper A good starting point in any project is asking 'has this been done before?' and 'is there existing open source code that can be used for this?'. Sometimes you want to code everything from scratch, and sometimes you want to get things done quickly. Building on the work of others is the foundation of computing and a testament to how far technology has come along in my view. I found a very useful package called [autotrader-scraper](https://pypi.org/project/autotrader-scraper/) which used [cloudscraper](https://pypi.org/project/cloudscraper/) and [beautifulsoup](https://pypi.org/project/beautifulsoup4/) to scrape data from AutoTrader given some filter arguments. I extended this code to scrape the seller details and fixed an issue where the scraper retrieved the seller page link instead of the actual vehicle link from the HTML source. 
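The link fix mentioned above comes down to keeping the listing's path and dropping its query string. As a rough illustration using only the standard library (the `listing_url` helper and the example href are made up, and this is not the exact slicing used in the scraper):

```python
from urllib.parse import urljoin, urlsplit

BASE_URL = "https://www.autotrader.co.uk"


def listing_url(href: str) -> str:
    """Return an absolute listing URL with any query string or fragment removed."""
    path = urlsplit(href).path
    return urljoin(BASE_URL, path)


# Made-up href in the general shape of a relative listing link
print(listing_url("/car-details/202301010000001?advertising-location=at_cars"))
```

Parsing the href this way also copes with links that have no query string at all, where slicing up to the position returned by `str.find("?")` would cut off the last character.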
```python [autotrader-scraper.py] import requests import json import csv from bs4 import BeautifulSoup import traceback import cloudscraper def get_cars( make="BMW", model="5 SERIES", postcode="SW1A 0AA", radius=1500, min_year=1995, max_year=1995, include_writeoff="include", max_attempts_per_page=5, verbose=False): # To bypass Cloudflare protection scraper = cloudscraper.create_scraper() # Basic variables results = [] n_this_year_results = 0 url = "https://www.autotrader.co.uk/results-car-search" keywords = {} keywords["mileage"] = ["miles"] keywords["BHP"] = ["BHP"] keywords["transmission"] = ["Automatic", "Manual"] keywords["fuel"] = [ "Petrol", "Diesel", "Electric", "Hybrid – Diesel/Electric Plug-in", "Hybrid – Petrol/Electric", "Hybrid – Petrol/Electric Plug-in" ] keywords["owners"] = ["owners"] keywords["body"] = [ "Coupe", "Convertible", "Estate", "Hatchback", "MPV", "Pickup", "SUV", "Saloon" ] keywords["ULEZ"] = ["ULEZ"] keywords["year"] = [" reg)"] keywords["engine"] = ["engine"] # Set up parameters for query to autotrader.co.uk params = { "sort": "relevance", "postcode": postcode, "radius": radius, "make": make, "model": model, "search-results-price-type": "total-price", "search-results-year": "select-year", } if (include_writeoff == "include"): params["writeoff-categories"] = "on" elif (include_writeoff == "exclude"): params["exclude-writeoff-categories"] = "on" elif (include_writeoff == "writeoff-only"): params["only-writeoff-categories"] = "on" year = min_year page = 1 attempt = 1 try: while year <= max_year: params["year-from"] = year params["year-to"] = year params["page"] = page r = scraper.get(url, params=params) if verbose: print("Year: ", year) print("Page: ", page) print("Response: ", r) try: if r.status_code != 200: # If not successful (e.g. due to bot protection) attempt = attempt + 1 # Log as an attempt if attempt <= max_attempts_per_page: if verbose: print("Exception. 
Starting attempt #", attempt, "and keeping at page #", page) else: page = page + 1 attempt = 1 if verbose: print("Exception. All attempts exhausted for this page. Skipping to next page #", page) else: j = r.json() s = BeautifulSoup(j["html"], features="html.parser") articles = s.find_all("article", attrs={"data-standout-type":""}) # If no results or reached end of results... if len(articles) == 0 or r.url[r.url.find("page=")+5:] != str(page): if verbose: print("Found total", n_this_year_results, "results for year", year, "across", page-1, "pages") if year+1 <= max_year: print("Moving on to year", year + 1) print("---------------------------------") # Increment year and reset relevant variables year = year + 1 page = 1 attempt = 1 n_this_year_results = 0 else: for article in articles: car = {} car["name"] = article.find("h3", {"class": "product-card-details__title"}).text.strip() car["link"] = "https://www.autotrader.co.uk" + \ article.find("a", {"class": "listing-fpa-link"})["href"][: article.find("a", {"class": "listing-fpa-link"})["href"] \ .find("?")] car["price"] = article.find("div", {"class": "product-card-pricing__price"}).text.strip() seller_info = article.find("ul", {"class": "product-card-seller-info__specs"}).text.strip() car["seller"] = " ".join(seller_info.split()) key_specs_bs_list = article.find("ul", {"class": "listing-key-specs"}).find_all("li") for key_spec_bs_li in key_specs_bs_list: key_spec_bs = key_spec_bs_li.text if any(keyword in key_spec_bs for keyword in keywords["mileage"]): car["mileage"] = int(key_spec_bs[:key_spec_bs.find(" miles")].replace(",","")) elif any(keyword in key_spec_bs for keyword in keywords["BHP"]): car["BHP"] = int(key_spec_bs[:key_spec_bs.find("BHP")]) elif any(keyword in key_spec_bs for keyword in keywords["transmission"]): car["transmission"] = key_spec_bs elif any(keyword in key_spec_bs for keyword in keywords["fuel"]): car["fuel"] = key_spec_bs elif any(keyword in key_spec_bs for keyword in keywords["owners"]): 
car["owners"] = int(key_spec_bs[:key_spec_bs.find(" owners")]) elif any(keyword in key_spec_bs for keyword in keywords["body"]): car["body"] = key_spec_bs elif any(keyword in key_spec_bs for keyword in keywords["ULEZ"]): car["ULEZ"] = key_spec_bs elif any(keyword in key_spec_bs for keyword in keywords["year"]): car["year"] = key_spec_bs elif key_spec_bs[1] == "." and key_spec_bs[3] == "L": car["engine"] = key_spec_bs results.append(car) n_this_year_results = n_this_year_results + 1 page = page + 1 attempt = 1 if verbose: print("Car count: ", len(results)) print("---------------------------------") except KeyboardInterrupt: break except Exception: traceback.print_exc() attempt = attempt + 1 if attempt <= max_attempts_per_page: if verbose: print("Exception. Starting attempt #", attempt, "and keeping at page #", page) else: page = page + 1 attempt = 1 if verbose: print("Exception. All attempts exhausted for this page. Skipping to next page #", page) except KeyboardInterrupt: pass return results ``` The `get_cars()` function returns its results as a list. You can leave or edit the `keywords` inputs if you would like to pull back fewer or more results before filtering further. ## Searching AutoTrader for multiple makes and models Now that we have a file named 'autotrader_scraper.py', we will create another file for the searcher which we'll name 'autotrader_searcher.py'. This will use the `get_cars()` function we created in the last step to retrieve information from AutoTrader for each make and model and then combine them into one list. This list can then be used to create a Pandas DataFrame for further filtering. In the `criteria` dictionary, be sure to replace the postcode with your postcode. ```python [autotrader-searcher.py] """ Enables the automation of multiple autotrader searches. 
Based on the autotrader-scraper package: https://github.com/suhailidrees/autotrader_scraper """ from autotrader_scraper import get_cars import pandas as pd criteria = { "postcode": "SW1A 0AA", "min_year": 2008, "max_year": 2014, "radius": 40, "min_price": 2000, "max_price": 6000, "fuel": "Petrol", "transmission": "Manual", "max_mileage": 100000, "max_attempts_per_page": 3, "verbose": False } civic = get_cars( make = "Honda", model = "Civic", postcode = criteria["postcode"], radius = criteria["radius"], min_year = criteria["min_year"], max_year = criteria["max_year"], include_writeoff = "exclude", max_attempts_per_page = criteria["max_attempts_per_page"], verbose = criteria["verbose"] ) print("Civic search done.") jazz = get_cars( make = "Honda", model = "Jazz", postcode=criteria["postcode"], radius = criteria["radius"], min_year = criteria["min_year"], max_year = criteria["max_year"], include_writeoff = "exclude", max_attempts_per_page = criteria["max_attempts_per_page"], verbose = criteria["verbose"] ) print("Jazz search done.") auris = get_cars( make = "Toyota", model = "Auris", postcode=criteria["postcode"], radius = criteria["radius"], min_year = criteria["min_year"], max_year = criteria["max_year"], include_writeoff = "exclude", max_attempts_per_page = criteria["max_attempts_per_page"], verbose = criteria["verbose"] ) print("Auris search done.") corolla = get_cars( make = "Toyota", model = "Corolla", postcode=criteria["postcode"], radius = criteria["radius"], min_year = 2000, max_year = criteria["max_year"], include_writeoff = "exclude", max_attempts_per_page = criteria["max_attempts_per_page"], verbose = criteria["verbose"] ) print("Corolla search done.") yaris = get_cars( make = "Toyota", model = "Yaris", postcode=criteria["postcode"], radius = criteria["radius"], min_year = criteria["min_year"], max_year = criteria["max_year"], include_writeoff = "exclude", max_attempts_per_page = criteria["max_attempts_per_page"], verbose = criteria["verbose"] ) 
print("Yaris search done.") mazda3 = get_cars( make="Mazda", model="Mazda3", postcode=criteria["postcode"], radius=criteria["radius"], min_year=criteria["min_year"], max_year=criteria["max_year"], include_writeoff="exclude", max_attempts_per_page=criteria["max_attempts_per_page"], verbose=criteria["verbose"] ) print("Mazda3 search done.") swift = get_cars( make="Suzuki", model="Swift", postcode=criteria["postcode"], radius=criteria["radius"], min_year=criteria["min_year"], max_year=criteria["max_year"], include_writeoff="exclude", max_attempts_per_page=criteria["max_attempts_per_page"], verbose=criteria["verbose"] ) print("Swift search done.") results = ( civic + jazz + auris + corolla + yaris + mazda3 + swift ) print(f"Found {len(results)} total results.") df = pd.DataFrame.from_records(results) df["price"] = df["price"] \ .str.replace("£", "") \ .str.replace(",", "") \ .astype(int) df["distance"] = df["seller"].str.extract(r'(\d+ mile)', expand=False) df["distance"] = df["distance"].str.replace(" mile", "") df["distance"] = pd.to_numeric(df["distance"], errors="coerce").astype("Int64") df["year"] = df["year"].str.replace(r"\s(\(\d\d reg\))", "", regex=True) df["year"] = pd.to_numeric(df["year"], errors="coerce").astype("Int64") shortlist = df[ (df["price"] >= criteria["min_price"]) & (df["price"] <= criteria["max_price"]) & (df["fuel"] == criteria["fuel"]) & (df["mileage"] <= criteria["max_mileage"]) & (df["transmission"] == criteria["transmission"]) & (df["engine"] != "1.0L") & (df["engine"] != "1.2L") ] print(f"{len(shortlist)} cars met the criteria. Saving to 'autotrader-shortlist.csv'") shortlist = shortlist.sort_values(by="distance") shortlist.to_csv("autotrader-shortlist.csv") ``` As you can see from this code, when the time comes to replace my car I am determined to find a good condition, relatively low mileage, reliable Japanese car for less than £5000 that can get me from A to B without too many headaches!
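
The repeated `get_cars()` calls above could also be driven from a list of makes and models, so that adding or removing a car becomes a one-line change. This is only a sketch under a few assumptions: `run_searches`, `SEARCHES` and the `fetch` parameter are hypothetical names that are not part of the original searcher, and `fetch` stands in for `get_cars` so the loop can be tried without hitting AutoTrader.

```python
from typing import Callable

# Hypothetical wish list. Each entry may override a shared criterion,
# as the Corolla search does with min_year in the searcher above.
SEARCHES = [
    {"make": "Honda", "model": "Civic"},
    {"make": "Honda", "model": "Jazz"},
    {"make": "Toyota", "model": "Corolla", "min_year": 2000},
]


def run_searches(searches: list, shared: dict, fetch: Callable[..., list]) -> list:
    """Calls fetch once per make/model and combines the results into one list."""
    results = []
    for search in searches:
        # Per-car values take priority over the shared criteria
        kwargs = {**shared, **search}
        results += fetch(**kwargs)
        print(f"{search['model']} search done.")
    return results


if __name__ == "__main__":
    def fake_fetch(**kwargs):
        # Stand-in for get_cars(); returns one made-up record per search
        return [{"name": f"{kwargs['make']} {kwargs['model']}", "year": kwargs["min_year"]}]

    shared = {"postcode": "SW1A 0AA", "radius": 40, "min_year": 2008, "max_year": 2014}
    print(run_searches(SEARCHES, shared, fake_fetch))
```

In the real searcher you would pass `get_cars` as `fetch` and keep only the keyword arguments it accepts in `shared`.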
You might want to remove some of these cars and add others that are on your wish list. ## Formatting the shortlist Now we have results returned from AutoTrader in CSV format, it would be nicer to apply some conditional formatting to this to quickly pick out the most viable vehicles - the hidden gems. Create another file named 'shortlist_formatter.py'. ```python [shortlist-formatter.py] import openpyxl import numpy as np import pandas as pd import os import shutil import datetime def format_autotrader_shortlist() -> None: df = pd.read_csv("autotrader-shortlist.csv") now = datetime.datetime.now() df["miles_pa"] = df["mileage"] / (now.year - df["year"]) df["miles_pa"].fillna(0, inplace=True) df["miles_pa"] = df["miles_pa"].astype(int) most_viable_cars_mask = ( (df["mileage"] < 85000) & (df["miles_pa"] < 9000) & (df["owners"] <= 3) ) df["viable"] = np.where( most_viable_cars_mask, "Y", "" ) df = add_previously_viewed_cars(df) df = df[[ "viable", "viewed", "name", "link", "price", "year", "mileage", "miles_pa", "owners", "engine", "seller", "distance", ]] writer = pd.ExcelWriter("autotrader-shortlist.xlsx", engine="xlsxwriter") df.to_excel(writer, sheet_name="Sheet1", index=False) workbook = writer.book worksheet = writer.sheets["Sheet1"] worksheet.conditional_format("E2:E1000", { 'type': '3_color_scale', 'min_color': '#63be7b', 'mid_color': '#ffdc81', 'max_color': '#f96a6c' }) worksheet.conditional_format("F2:F1000", { 'type': '3_color_scale', 'min_color': '#f96a6c', 'mid_color': '#ffdc81', 'max_color': '#63be7b' }) worksheet.conditional_format("G2:G1000", { 'type': '3_color_scale', 'min_color': '#63be7b', 'mid_color': '#ffdc81', 'max_color': '#f96a6c' }) worksheet.conditional_format("H2:H1000", { 'type': '3_color_scale', 'min_color': '#63be7b', 'mid_color': '#ffdc81', 'max_color': '#f96a6c' }) worksheet.conditional_format("I2:I1000", { 'type': '3_color_scale', 'min_color': '#63be7b', 'mid_color': '#ffdc81', 'max_color': '#f96a6c' }) writer.save() print("Shortlist 
formatting done.") def add_previously_viewed_cars(df) -> pd.DataFrame: df["viewed"] = "" if not os.path.exists("Previous searches/Last search/autotrader-shortlist.xlsx"): return df viewed_cars = pd.read_excel( "Previous searches/Last search/autotrader-shortlist.xlsx" ) for index, row in df.iterrows(): car_in_previous_search = ( (viewed_cars["name"] == row["name"]) & (viewed_cars["link"] == row["link"]) ).any() if car_in_previous_search: df.loc[index, "viewed"] = "Y" return df def update_previous_search_history(): """ Copies the autotrader shortlist Excel file to '/Previous searches/Last search' to find cars seen previously and to '/Previous searches' for documenting historic searches. """ if not os.path.exists("autotrader-shortlist.xlsx"): return now = datetime.datetime.now() date = f"{str(now.day)}-{now.strftime('%m')}-{str(now.year)}" shutil.copyfile( src="autotrader-shortlist.xlsx", dst=f"Previous searches/autotrader-shortlist-{date}.xlsx" ) shutil.copyfile( src="autotrader-shortlist.xlsx", dst=f"Previous searches/Last search/autotrader-shortlist.xlsx" ) def open_file_in_excel() -> None: os.system("start EXCEL.EXE autotrader-shortlist.xlsx") if __name__ == "__main__": format_autotrader_shortlist() update_previous_search_history() open_file_in_excel() ``` This calculates mileage per annum which is then used in a viability check. This means that the cars with the most potential are given a 'Y' in the viable column. Of course, even a car with relatively low mileage and a low number of previous owners can still be in a poor condition if it's not been looked after or has been sat idle for long periods of time, so this only highlights the *potential* gems. Using the `most_viable_cars_mask` identifies and marks cars as viable with a 'Y' which have less than 85000 miles, less than 9000 miles per annum and with 3 previous owners or less. ## Taking it for a spin Let's see the scraper, searcher, and formatter all in action one after another, in this end-to-end demo. 
I perform this process weekly to get the most up-to-date listings for my area. The formatter makes it really easy to see the trade-offs in terms of price, year, mileage and previous owners.

## Troubleshooting

On the odd occasion, the program hangs as it retries after a failed connection. The best way to correct this is to end the program using Ctrl + C, wait a short while, and then re-run it in a new console. This will establish a new connection and successfully return the results from the multiple scraping calls started by the `get_cars()` function.

## Bonus: Identify cars seen in a previous search

As you might have noticed in `shortlist-formatter.py`, after the formatting is complete the autotrader shortlist Excel file is copied to both the '/Previous searches' and '/Previous searches/Last search' folders by the `update_previous_search_history` function. This is so that on our *next* search we can cross-reference it with this historic data to find out if we've seen a particular car before! I found this to be an extremely useful addition, especially if you are running this every week.

## Finishing in first place

Spreading the search net wider to multiple makes and models and automating the search has been an excellent strategy for quickly finding suitable cars within a reasonable distance of my location. I will update this section when I do go ahead and buy one to let you know what it was 😄 I am hoping my current car will last into next year, but at least I have this handy program ready to go if not.

The only thing left for you to do is set your criteria, add the makes and models you want, and off you go! Happy car hunting.

If you enjoyed this article be sure to check out [other articles](/) on the site.

Creating a screen and mouse jiggler with Python

Fri, 02 Sep 2022 14:00:00 GMT

I recently came across the idea of a mouse jiggler (keeps your mouse moving) and after some investigation realised there are [products being sold](https://www.amazon.co.uk/s?k=mouse+jiggler&crid=W92Y3XH4RRF8&sprefix=mouse+jiggl%2Caps%2C580&ref=nb_sb_noss_2) to achieve this! Yes, even after spending up to 50% of my time working from home since long before the COVID-19 pandemic I had never heard of this 😄 The more I thought about it, the more I figured something like this would be really useful for me for a number of good reasons.

## Why build a mouse jiggler?

Sometimes I use my personal PC or laptop to try out ideas or perform testing outside of the organisation's internal network. However, if I spend more than approximately 1 minute away from my work laptop the screen will go off and my status will appear as 'away' on instant messenger. This makes it seem like I'm not available for my team's questions when really I'm just doing work on my own machine. I'd prefer the screen to just stay on instead of having to use the touchpad to keep it awake. Unfortunately, the screen saver / screen off / IM settings are disabled.

The solution could be just moving the mouse back and forth slowly on the screen to keep the active window showing, or with a function to switch windows from time to time so I can check different apps as I work.

I guess some other reasons might be balancing work and life more generally - attending appointments, having a coffee break, making lunch, or letting the dog out. I have no doubt some people may use such a tool to avoid work and appear present at their machine, but then that's not really a mouse jiggler problem, that's a job satisfaction, productivity, motivation, wellbeing and management problem.

More generally, automation skills with Python are very good to have, and can be used in other projects, for example if you wanted to [record your mouse and keyboard clicks to then automate repetitive tasks](/blog/record-mouse-and-keyboard-for-automation-scripts-with-python/).
## Explaining the mouse jiggler program

The only package that this program relies on is [PyAutoGUI](https://pyautogui.readthedocs.io/en/latest/). To install with pip, run:

```
pip install pyautogui
```

Once installed create a Python file.

```python [mouse_jiggler.py]
# -*- coding: utf-8 -*-
import pyautogui
import time
import random
import sys

# Disables the fail-safe that aborts the script when the mouse hits a screen corner
pyautogui.FAILSAFE = False


def switch_screens() -> None:
    """
    Switches the active screen using Alt + Tab
    a random number of times.
    """
    max_switches = random.randint(1, 5)

    # Hold Alt so that each Tab press moves one window further along
    pyautogui.keyDown('alt')
    for _ in range(max_switches):
        pyautogui.press('tab')
    pyautogui.keyUp('alt')


def wiggle_mouse() -> None:
    """
    Moves the mouse to a random number of random coordinates.
    """
    max_wiggles = random.randint(4, 9)

    for _ in range(max_wiggles):
        coords = get_random_coords()
        pyautogui.moveTo(
            x=coords[0],
            y=coords[1],
            duration=5
        )
        time.sleep(10)


def get_random_coords() -> list:
    """
    Returns a list of coordinates in the format [x, y]
    """
    screen = pyautogui.size()
    width = screen[0]
    height = screen[1]

    # Stay away from the edges of the screen
    return [
        random.randint(100, width - 200),
        random.randint(100, height - 200)
    ]


if __name__ == "__main__":
    print('Press Ctrl-C to quit.')
    try:
        while True:
            switch_screens()
            wiggle_mouse()
            sys.stdout.flush()
    except KeyboardInterrupt:
        print("\n")
```

To start the program use this command from the same directory:

```
python mouse_jiggler.py
```

To end the program use Ctrl + C.

So the program relies on two functions:

* `switch_screens` uses Alt + Tab to switch the active screen a random number of times.
* `wiggle_mouse` moves the mouse to a random set of coordinates.
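
The coordinate logic can also be written as a pure function that takes the screen size as arguments, which makes it easy to check without moving the real mouse. This is a sketch rather than part of the program above: `random_coords` and its `margin` parameter are hypothetical names, and it assumes the screen is larger than twice the margin.

```python
import random


def random_coords(width: int, height: int, margin: int = 100) -> tuple:
    """Returns a random (x, y) point at least `margin` pixels from every screen edge."""
    if width <= 2 * margin or height <= 2 * margin:
        raise ValueError("screen is too small for the requested margin")
    x = random.randint(margin, width - margin)
    y = random.randint(margin, height - margin)
    return x, y


if __name__ == "__main__":
    # With PyAutoGUI installed, the width and height could come from pyautogui.size()
    print(random_coords(1920, 1080))
```

Keeping the random choice separate from the `moveTo` call means the bounds can be tested on a machine with no display attached.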
These functions are using some of the [many methods that PyAutoGUI](https://pyautogui.readthedocs.io/en/latest/quickstart.html) conveniently provides:

* `.size()` returns current screen resolution width and height
* `.moveTo(x, y, duration)` moves the mouse to XY coordinates over duration in seconds
* `.keyDown(key)` presses the key down and keeps the button pressed
* `.keyUp(key)` releases a key that was kept pressed by `keyDown()`
* `.press(key)` presses the given key and combines `keyDown()` followed by `keyUp()`

This creates a simple solution to always keep the screen active, preventing it from turning off and keeping you appearing as online. This has worked great and really has taken a burden off my mind whilst I try to innovate and prove techniques on my own personal machine that might not work on my work machine. It really is a win-win.

If you only want the mouse to move and to keep the active window showing and don't want to switch screens, remove the `switch_screens()` function call underneath `while True`.

I have heard stories of some employers using screen and keyboard tracking software for monitoring employees which I find really sad. I'm focused and I take pride in my work but I'm not always at 100% so I doubt any monitoring software would be a true reflection of how productive I am and how much value I bring to my workplace in terms of money and time. No one can be switched on all the time, and we all have to realise that mental health and wellbeing in general is so important. If you did find yourself in that situation and had to stick around a while and had the ability to install Python, I can see a modified version of this program being useful to spread the time between screen switches out.

You might have noticed I set FAILSAFE to false to turn it off.
This is NOT recommended [in the documentation](https://pyautogui.readthedocs.io/en/latest/index.html?highlight=failsafe#fail-safes) so consider yourself warned, however I found to reliably avoid the failsafe action when the mouse is in any of the four corners of the primary monitor, it was best to disable it. It just means you have to be extra careful with the code, and if in any doubt set it to true to re-enable it. ## Seeing the mouse jiggler in action Here is a quick video of how the program behaves moving the mouse and switching screens a number of times. ## What will you use yours for? Okay this was a fun article, now you know how to create a screen and mouse jiggler with Python, and have a solid start to building more advanced robotic process automation (RPA) solutions with PyAutoGUI. You can refer to the [documentation](https://pyautogui.readthedocs.io/en/latest/) for more guidance on using PyAutoGUI and think about what else you might like to build 😄 If you enjoyed this article be sure to check out other articles on the site, some which also explore automation with Python and PyAutoGUI including: * [Record mouse and keyboard for automation scripts with Python](/blog/record-mouse-and-keyboard-for-automation-scripts-with-python/) * [Reduce Material Design Icons Font to 7KB and automate with PyAutoGUI](/blog/reduce-material-design-icons-font-to-7kb-and-automate-with-pyautogui/) * [Developing your data science and analytical coding skills - a review of DataCamp](/blog/developing-your-data-science-and-analytical-coding-skills-a-review-of-datacamp/) for improving your Python skills Finally, if you have any questions or if you decide to use or extend this program, please leave a comment below. I'd love to know what you use it for and how it's helped you out 👍

Hide your own site visits from Cloudflare Analytics with JavaScript

Thu, 25 Aug 2022 14:32:00 GMT

In this short article, we'll look at how to keep your own site visits below the radar of Cloudflare Analytics so you don't skew your usage stats using JavaScript. ## Why is hiding your own visits important? When I set up this site I wanted to test out the functionality even after [privacy-first analytics](/blog/creating-your-own-website-analytics-solution-with-aws-lambda-and-google-sheets/) had been set up. However, this would misrepresent how many users were actually visiting the site! It's really difficult to remember, "oh yeah, that was me when I tested that page a bunch of times" when viewing the usage figures. So that would be bad enough if you were doing the testing or browsing your own site as a single developer or author, but what if you were a team of 5 - 10 or more? That would mean for each member of the team that published articles or made improvements and then viewed the page on the live site, the usage figures would go way up and be completely skewed. To solve this problem, I created a simple but effective solution by creating a private route for internal users that would disable both my custom analytics and Cloudflare Analytics by never instantiating it in the first place 😄 Effectively a route that says "don't track my visits in the usage stats" which is perfect for testing and viewing the live site. ## Disable analytics with JavaScript In a [previous article](/blog/creating-your-own-website-analytics-solution-with-aws-lambda-and-google-sheets#bonus-avoid-tracking-your-own-activity), I covered how to 'Avoid tracking your own activity' in the bonus section. This approach set a boolean flag value in local storage when an internal user or myself hit the `/do-not-track-me` route. After this was set an internal user could go ahead and browse any pages on the site knowing they would not be adding to the usage counts. We can use a similar solution but applied to how Cloudflare Analytics is initialised. 
When setting up Cloudflare Analytics you add a script to the page like:

```html
<script defer src="https://static.cloudflareinsights.com/beacon.min.js" data-cf-beacon='{"token": "8bcfbc66e3f442149d3539d3cbfafc9b"}'></script>
```

So as long as a user has visited the `/do-not-track-me` route first and the interim page has loaded, setting a value for `donottrack` as true:

```js
localStorage.setItem("donottrack", true);
window.location.href = "/";
```

We can then use a custom function that fires on page load, checks the flag, and disables analytics by never initialising it:

```js
initialiseCloudflareAnalytics() {
  // Local storage holds strings, so any stored value counts as "deactivated"
  let analyticsDeactivated = localStorage.getItem("donottrack") || false;

  if (analyticsDeactivated) {
    return;
  }

  let cloudflareScript = document.createElement("script");
  cloudflareScript.setAttribute("src", "https://static.cloudflareinsights.com/beacon.min.js");
  cloudflareScript.setAttribute("defer", true);
  cloudflareScript.setAttribute("data-cf-beacon", '{"token": "8bcfbc66e3f442149d3539d3cbfafc9b"}');
  document.body.appendChild(cloudflareScript);

  this.cloudflareScriptInitialised = true;
  console.log("Cloudflare Analytics initialised.");
},
```

If a regular user visits the site without first going through the `/do-not-track-me` route, this flag will never be set and therefore the Cloudflare Analytics script will be appended to the document body and will work as expected.

I applied this to the mounted action in a Vue.js single page app, but you could just as easily apply this logic in any webpage using either JavaScript with `window.onload` or jQuery with `$(document).ready()`.

You might also want to provide an internal user with an option in your site to activate analytics again. You could achieve this simply by displaying a button only for users with analytics deactivated, by checking the flag in local storage, then firing `localStorage.removeItem("donottrack");` when they click it. This will allow them to become regular users again and have their page visits logged.
## Short but sweet

I hope this article helped you to think about how you might selectively disable tracking to prevent inflating your usage figures with your own or your team's page visits and prevent headaches 😆 I think the same approach could be used with similar analytics tools such as Google Analytics too.

It is a simple solution to implement. It does mean that you and your team need to remember to hit the newly added `/do-not-track-me` route to set the flag in local storage, but you only have to do it once per browser. I think this tradeoff is worth it for the simplicity though, especially for small to medium sized sites, and it works on any device.

If you enjoyed this article be sure to check out [other articles](/) on the site. If you have any questions feel free to leave a comment 👍

Concepts of Artificial Intelligence with Python - a review of CS50 AI

Tue, 12 Jul 2022 17:41:00 GMT

This article covers the concepts of Artificial Intelligence (AI) introduced in Harvard's [CS50 Introduction to Artificial Intelligence with Python](https://edx.sjv.io/q4oLWq) course, along with a review of the course itself, what I learned from it, and helpful advice if you're looking to start it yourself. Spoiler alert, when outlining the projects for each week I may include example code, you might want to skip over these parts if you're taking the course yourself. ## So what is CS50 AI all about? > CS50's Introduction to Artificial Intelligence (AI) with Python explores the concepts and algorithms at the foundation of modern artificial intelligence, diving into the ideas that give rise to technologies like game-playing engines, handwriting recognition, and machine translation. Through hands-on projects, students gain exposure to the theory behind graph search algorithms, classification, optimization, reinforcement learning, and other topics in artificial intelligence and machine learning as they incorporate them into their own Python programs. By course’s end, students emerge with experience in libraries for machine learning as well as knowledge of artificial intelligence principles that enable them to design intelligent systems of their own. The course contains seven lectures, twelve projects and seven quizzes. The lectures and projects cover key AI concepts such as search, knowledge, uncertainty, optimisation, machine learning, neural networks and natural language processing. The suggested completion time is seven weeks, at between ten to thirty hours per week. The only prerequisites for the course are either taking the [CS50 Introduction to Computer Science](https://edx.sjv.io/EKAg9W) course or prior programming experience in Python. The course is free and if you submit and receive a score of at least 70% on each of this course’s projects, you will be eligible for a [free certificate](https://cs50.harvard.edu/ai/2020/certificate/) like the one below. 
A nice recognition of the hard work put in to get it. 🤓 You can also choose to pay £145 (at the time of writing) to get a [verified certificate](https://edx.sjv.io/6ezQAQ) from [edX](https://edx.sjv.io/q4oLWq). This might be worthwhile if you are wanting to show to an employer or talk about in an interview. If you've already achieved a verified certificate for [CS50 Introduction to Computer Science](https://edx.sjv.io/EKAg9W) (I completed this in 2018 and loved the course) then after completing this course in AI you in turn complete the [Professional Certificate in Computer Science for Artificial Intelligence](https://edx.sjv.io/KjEvJ7). Both of these courses combined make for a solid introduction to Computer Science. In covering programming, web development, probability, machine learning and artificial intelligence you have the foundation to enter a number of career paths including Software Engineer and Data Scientist roles. CS50 in collaboration with edX offers a few different 'pathways' as outlined below. 
| Level | Course | Estimated Duration | Topics | Languages Covered | Certificate Price | Final Certificate (combined with CS50) |
|---|---|---|---|---|---|---|
| Core | [CS50's Introduction to Computer Science](https://edx.sjv.io/EKAg9W) | 12 weeks, 6-18 hours per week | Abstraction, Algorithms, Data Structures, Encapsulation, Software Engineering, and Web Development | C, Python, SQL, and JavaScript plus CSS and HTML | $90 edX (What I paid, might have changed) | - |
| Specialist | [CS50's Web Programming with Python and JavaScript](https://edx.sjv.io/5gzD7b) | 12 weeks, 6-9 hours per week | Git, Models, Migration, User Interfaces, Testing, CI/CD, Scalability, Security | HTML, CSS, Python, SQL, JavaScript | $199 edX (may have now changed) | [Professional Certificate in Computer Science for Web Programming](https://edx.sjv.io/q4oL2g) |
| Specialist | [CS50's Mobile App Development with React Native](https://edx.sjv.io/jraGOZ) | 13 weeks, 6-9 hours per week | Components, Props, State, Views, Navigation, User Input, Performance, Shipping, Testing | JavaScript | $199 edX (may have now changed) | [Professional Certificate in Computer Science for Mobile Apps](https://edx.sjv.io/eKj5AQ) |
| Specialist | [CS50's Introduction to Game Development](https://edx.sjv.io/oqJgyW) | 12 weeks, 6-9 hours per week | 2D and 3D Graphics, Animation, Sound, Collision Detection, Unity, LOVE 2D | Lua, C# | $199 edX (may have now changed) | [Professional Certificate in Computer Science for Game Development](https://edx.sjv.io/WqEvGe) |
| Specialist | [CS50's Introduction to Artificial Intelligence with Python](https://edx.sjv.io/q4oLWq) | 7 weeks, 10-30 hours per week | Graph Search Algorithms, Knowledge Representation, Logical Inference, Probability, Machine Learning | Python | $199 edX (may have now changed) | [Professional Certificate in Computer Science for Artificial Intelligence](https://edx.sjv.io/KjEvJ7) |

AI is the ability of a machine to display human-like capabilities such as reasoning, learning, planning and creativity. AI has completely changed the world and has the potential to continue doing so. I do think, however, that it can be misunderstood. I see people using the term "artificial intelligence" without fully realising what it means - particularly the difference between [strong and weak AI](https://www.ibm.com/cloud/learn/strong-ai#toc-strong-ai--YaLcx8oG).

The uses of AI day to day are vast, including search engines, predictive search, image recognition, games, voice assistants, email spam detection, bank fraud detection, smart devices, movie and music recommendations, chatbots, finding map directions and more. Other applications that might soon be seen more often include autonomous drones, self-driving vehicles, robots and virtual workers.

I think the great thing about this course is that it lifts the lid on what otherwise can be seen as a black box, to explore the concepts and algorithms that are key to implementing AI systems. It gives you the core knowledge required to build your own intelligent programs which "mimic the problem-solving and decision-making capabilities of the human mind" ([IBM](https://www.ibm.com/uk-en/cloud/learn/what-is-artificial-intelligence#toc-what-is-ar-DhYPPT4m)).
Although not essential, I would recommend the book [Artificial Intelligence: A Modern Approach](https://www.amazon.co.uk/Artificial-Intelligence-Modern-Approach-Global/dp/1292401133/) as a companion to the course.

The following sections cover the core concepts covered in each lecture, and the projects completed with links to my submitted code in GitHub. If you are taking the course yourself, you should not view these solutions as it might be seen as breaking [Academic Honesty](https://cs50.harvard.edu/ai/2020/honesty/).

Okay, let's dive into the concepts covered in the course!

## Lecture 0: Search

**Concepts:**

- **Agent**: entity that perceives its environment and acts upon that environment.
- **State**: a configuration of the agent and its environment.
- **Actions**: choices that can be made in a state.
- **Transition model**: a description of what state results from performing any applicable action in any state.
- **Path cost**: numerical cost associated with a given path.
- **Evaluation function**: function that estimates the expected utility of the game from a given state.

**Algorithms:**

- [**DFS**](https://youtu.be/D5aJNFWsWew?t=1557) (depth-first search): search algorithm that always expands the deepest node in the frontier.
- [**BFS**](https://www.youtube.com/watch?v=D5aJNFWsWew) (breadth-first search): search algorithm that always expands the shallowest node in the frontier.
- [**Greedy best-first search**](https://youtu.be/D5aJNFWsWew?t=3269): search algorithm that expands the node that is closest to the goal, as estimated by a heuristic function *h(n)*.
- [**A\* search**](https://youtu.be/D5aJNFWsWew?t=3916): search algorithm that expands the node with the lowest value of the "cost to reach node" *g(n)* plus the "estimated goal cost" *h(n)*.
  In other words, *g(n)* is the number of steps you had to take to get to the node you're at and *h(n)* is the heuristic estimate of how far a node is from the goal, for example the ['Manhattan distance'](https://xlinux.nist.gov/dads/HTML/manhattanDistance.html). This can be expressed as *f(n) = g(n) + h(n)*.
- [**Minimax**](https://youtu.be/D5aJNFWsWew?t=4450): adversarial search algorithm.

**Data Structures**

- [**Frontier**](https://youtu.be/D5aJNFWsWew?t=993): represents all the possible nodes to search next that haven’t yet been explored
- **Stack**: last-in first-out data type used for DFS
- **Queue**: first-in first-out data type used for BFS
- [**Node**](https://youtu.be/D5aJNFWsWew?t=909): keeps track of a state, a parent (node that generated this node), an action (action applied to parent to get to node) and a path cost (from initial state to node)

**Projects**

- [**Tic-Tac-Toe**](https://cs50.harvard.edu/ai/2020/projects/0/tictactoe/) - Using [Minimax](https://en.wikipedia.org/wiki/Minimax), implement an AI to play Tic-Tac-Toe optimally. [[Solution]](https://github.com/shedloadofcode/cs50-artificial-intelligence/tree/main/0.%20Search/tictactoe)
- [**Degrees**](https://cs50.harvard.edu/ai/2020/projects/0/degrees/) - Write a program that determines how many “degrees of separation” apart two actors are. [[Solution]](https://github.com/shedloadofcode/cs50-artificial-intelligence/tree/main/0.%20Search/degrees)

```python [degrees.py]
# Excerpt: needs `import time` at the top of degrees.py. `people`, `Node`,
# `QueueFrontier` and `neighbors_for_person` come from the project's own code.
def shortest_path(source, target):
    """
    Finds the shortest path between any two actors (source, target)
    by choosing a sequence of movies that connects them.

    Returns the shortest list of (movie_id, person_id) pairs
    that connect the source to the target.

    If no possible path, returns None.
    """
    print(
        f"Finding shortest path between {people[source]['name']} ({source}) and {people[target]['name']} ({target})...")
    timer = time.time()

    # Start with frontier and initial node
    frontier = QueueFrontier()
    initial_node = Node(state=source, parent=None, action=None)
    frontier.add(initial_node)

    # Start with empty explored set
    explored = set()
    number_of_states_explored = 0

    while True:
        # If frontier is empty no solution
        if frontier.empty():
            return None

        # Remove a node from the frontier
        node = frontier.remove()
        number_of_states_explored += 1

        # Add the node to the explored set
        explored.add(node.state)

        # Expand node, add resulting nodes to the frontier if they aren't already
        # in the frontier or the explored set
        for movie_id, person_id in neighbors_for_person(node.state):
            if not frontier.contains_state(person_id) and person_id not in explored:
                child = Node(state=person_id, parent=node, action=movie_id)

                # If child node (neighbor) contains goal state, no need to add it to the frontier
                # instead return the solution immediately.
                if child.state == target:
                    path = []
                    node = child
                    while node.parent is not None:
                        path.append((node.action, node.state))
                        node = node.parent
                    path.reverse()
                    seconds_taken = time.time() - timer
                    print(f"Explored { number_of_states_explored } states in { seconds_taken } seconds")
                    return path

                frontier.add(child)
```

There are two places the goal check can go in this solution: when a node is removed from the frontier, or when a child node is first generated. Checking when the child is generated, as above, dramatically [reduces the running time](https://youtu.be/cEnVl_xopjo?t=245).

## Submitting the first project

I started CS50 AI a while back, but other commitments got in the way. So I was really happy to dive back in. I'd already done the tictactoe project so I submitted that first (I know it was the second project, but it interested me more so I did it first 😆).

The first obstacle you might hit on week 0 is "I've finished my first project... How do I submit my work?!" I had the same question. So let's take submitting tictactoe as an example.
On the main CS50 AI site, the [tictactoe project](https://cs50.harvard.edu/ai/2020/projects/0/tictactoe/) page has a "Getting started" section to pull the project code from. Once the project is completed, there is a 'How to Submit' section which contains a series of steps:

* Visit [this link](https://submit.cs50.io/invites/8f7fa48876984cda98a73ba53bcf01fd), log in with your GitHub account, and click **Authorize cs50**. Then, check the box indicating that you’d like to grant course staff access to your submissions, and click **Join course**.
* [Install Git](https://git-scm.com/downloads) and, optionally, [install submit50](https://cs50.readthedocs.io/submit50/).
* If you’ve installed submit50, execute `submit50 ai50/projects/2020/x/tictactoe`
* Submit [this form](https://forms.cs50.io/4aeea18e-5aa0-4ae2-9086-5941d5556954).

I had a folder structure broken down by lecture and project:

```
0. Search
|
--- degrees
|
--- tictactoe
|
1. Knowledge
|
--- knights
|
--- minesweeper
...
```

Seems straightforward but there were a few gotchas. So here is how I stumbled through it:

* cd into project directory `Search/tictactoe`
* I tried to install submit50 on Windows using `pip3 install submit50`. This is a no-no, [it does not work on Windows](https://github.com/cs50/submit50/issues/196). So I launched Ubuntu (which has Python preinstalled) on a virtual machine using VirtualBox
* To install the Python packages for the project (for tictactoe it was pygame) alongside submit50 I needed to install pip using `sudo apt install python3-pip`
* I could now install submit50 using `pip3 install submit50`
* Once submit50 was installed I needed to reboot the Ubuntu virtual machine to ensure the terminal recognised it (I was getting `submit50: Command not found`)
* In the project directory I could now install all the packages using `pip3 install -r requirements.txt` - you might want to create and install packages to [a virtual environment](https://docs.python.org/3/library/venv.html) per project folder if you wish
* I was then able to run tictactoe for testing using `python3 runner.py`

These steps got me very close to my first submission. There were two more obstacles...

Since I was using VS Code within Ubuntu, every time I tried to submit, GitHub would open in the browser, I'd sign in but the [submission would fail](https://cs50.stackexchange.com/questions/37360/using-submit50-on-vscode) when I returned to VS Code. The solution is to go to File > Preferences > Settings > Extensions > GitHub and untick Git Authentication 😄 So now when using `submit50 ai50/projects/2020/x/tictactoe` to submit, the prompt for my GitHub username and password would appear within VS Code itself, much better.

The final hurdle was that, if you have two factor authentication turned on with GitHub, you might get an error message 😧 The link provided in the error message https://cs50.ly/github-2fa has all the steps for creating a personal access token. Once you have it, re-submit and use that token at the password prompt.

Now using `submit50 ai50/projects/2020/x/tictactoe` again, the submission for tictactoe was successfully uploaded!

Hopefully this should serve as a good example of how to submit tictactoe, and you can now use the same method for submitting each of the other projects.
You might find a much easier way to do this (I'm sure you could use Windows Subsystem for Linux instead), but this worked nicely for me even if there were a few headaches to overcome.

## Lecture 1: Knowledge

**Concepts**

- [**Sentence**](https://youtu.be/LucW-p6zC5c?t=104): an assertion about the world in a knowledge representation language.
- [**Knowledge base**](https://youtu.be/LucW-p6zC5c?t=975): a set of sentences known by a knowledge-based agent.
- [**Entailment**](https://youtu.be/LucW-p6zC5c?t=1022): _a_ entails _b_ if in every model in which sentence _a_ is true, sentence _b_ is also true.
- [**Inference**](https://youtu.be/LucW-p6zC5c?t=1308): the process of deriving new sentences from old ones.
- [**Conjunctive normal form**](https://youtu.be/LucW-p6zC5c?t=4985): logical sentence that is a conjunction of clauses.
- [**First order logic**](https://youtu.be/LucW-p6zC5c?t=5910): extends propositional logic with constant symbols, predicate symbols, and universal and existential quantification.
- **Second order logic**: extends first order logic by also allowing quantification over predicates.
- **Truth table**: table showing the outputs for all possible combinations of inputs to a logic gate or circuit.

**Algorithms**

- **Model checking**: enumerate all possible models and see if a proposition is true in every one of them.
- **Conversion to CNF** and **Inference by resolution**

**Projects**

- [**Knights**](https://cs50.harvard.edu/ai/2020/projects/1/knights/) - Write a program to solve logic puzzles [[Solution]](https://github.com/shedloadofcode/cs50-artificial-intelligence/tree/main/1.%20Knowledge/knights)
- [**Minesweeper**](https://cs50.harvard.edu/ai/2020/projects/1/minesweeper/) - Write an AI to play Minesweeper [[Solution]](https://github.com/shedloadofcode/cs50-artificial-intelligence/tree/main/1.%20Knowledge/minesweeper)

I think the [Minesweeper project](https://cs50.harvard.edu/ai/2020/projects/1/minesweeper/) was one of my favourites!
The general logic of the AI was adding sentences to its knowledge base, where a sentence consisted of a set of board cells and a count of the number of those cells which are mines, so something like `Sentence({(0, 1), (1, 0), (1, 1)}, 3)`. This says out of cells `{(0, 1), (1, 0), (1, 1)}` exactly 3 of them are mines. We can then infer they must all be mines as the number of cells is equal to the count!

On every move the following process was executed:

1. Mark the cell as a move that has been made
2. Mark the cell as safe
3. Get the neighbours of the current cell
4. Add a new sentence to the AI's knowledge base based on the cell's neighbours and count (of adjacent mines)
5. Mark any additional cells as safe or as mines if it can be concluded based on the AI's knowledge base
6. Add any new sentences to the AI's knowledge base if they can be inferred from existing knowledge

This meant that as sentences are added to the knowledge base the AI can make yet more inferences.

Given this board, we can see there is one mine next to the top row cells and two mines next to the bottom middle cell. The top row's sentence would be `{A, B, C} = 1`. The bottom middle's sentence would be `{A, B, C, D, E} = 2`. Now we have two sentences where the first sentence's set of cells is a subset of the second sentence's set of cells. We can now construct a new sentence by doing set2 - set1 = count2 - count1, which is `{D, E} = 1`. If two of A, B, C, D, and E are mines, and only one of A, B, and C is a mine, then it stands to reason that exactly one of D and E must be the other mine.

Here is a demo of the Minesweeper AI in action!

## Lecture 2: Uncertainty

When the answer isn't certain, we can use probability based methods to assess the knowledge available, to then make decisions.

**Concepts**

- **Unconditional probability**: degree of belief in a proposition in the absence of any other evidence.
- [**Conditional probability**](https://youtu.be/uQmYZTTqDC0?t=577): degree of belief in a proposition given some evidence that has already been revealed.
- [**Possible worlds**](https://youtu.be/uQmYZTTqDC0?t=170): every possible outcome for a given series or combination of events.
- [**Random variable**](https://youtu.be/uQmYZTTqDC0?t=1040): a variable in probability theory with a domain of possible values it can take on.
- [**Independence**](https://youtu.be/uQmYZTTqDC0?t=1316): the knowledge that one event occurs does not affect the probability of the other event.
- [**Bayes' Rule**](https://youtu.be/uQmYZTTqDC0?t=1608): _P(a) P(b|a) = P(b) P(a|b)_
- [**Bayesian network**](https://youtu.be/uQmYZTTqDC0?t=2982): data structure that represents the dependencies among random variables.
- [**Markov assumption**](https://youtu.be/uQmYZTTqDC0?t=5580): the assumption that the current state depends on only a finite fixed number of previous states.
- **Markov chain**: a sequence of random variables where the distribution of each variable follows the Markov assumption.
- [**Hidden Markov Model**](https://youtu.be/uQmYZTTqDC0?t=6257): a Markov model for a system with hidden states that generate some observed event.

**Algorithms**

- **Inference by enumeration**
- **Sampling**
- **Likelihood weighting**

**Projects**

- [**Heredity**](https://cs50.harvard.edu/ai/2020/projects/2/heredity/) - Write an AI to assess the likelihood that a person will have a particular genetic trait. [[Solution]](https://github.com/shedloadofcode/cs50-artificial-intelligence/tree/main/2.%20Uncertainty/heredity)
- [**PageRank**](https://cs50.harvard.edu/ai/2020/projects/2/pagerank/) - Write an AI to rank web pages by importance. [[Solution]](https://github.com/shedloadofcode/cs50-artificial-intelligence/tree/main/2.%20Uncertainty/pagerank)

Having worked as a Data Scientist and Statistician, I like the following questions for starting to think about probability.
Firstly, if you have two fair dice, what is the probability of rolling a 12 (6 and 6)? The answer is 1 in 36 or a 2.778% chance because we can see out of all the 36 possible worlds (the possible combinations of dice throws) only one satisfies the requirement of rolling a 12.

I read in the book [The Art of Statistics: Learning from Data](https://www.amazon.co.uk/Art-Statistics-Learning-Pelican-Books/dp/0241398630) about how in 2012, 97 Members of Parliament were asked 'If you spin a coin twice, what is the probability of getting two heads?' 60 out of 97 of them couldn't give the correct answer. The answer is 1 in 4 or a 25% chance because we can see out of the 4 possible outcomes only one satisfies the requirement of flipping two heads.

Another favourite of mine that seemingly breaks the laws of probability is the [Monty Hall Problem](https://www.youtube.com/watch?v=4Lb-6rxZxx0). There is a [follow up explanation](https://www.youtube.com/watch?v=7u6kFlWZOWg) for this and an excellent comment on this video from Rundvelt showing the importance of looking at 'possible worlds':

> I think that if you drew out all the possibilities that would demonstrate the fact better. For example.
>
> Scenario 1:
> Car / Goat / Goat
>
> Scenario 2:
> Goat / Car / Goat
>
> Scenario 3:
> Goat / Goat / Car
>
> Let's say you pick the door on the left and do not switch.
>
> Scenario 1: Win
>
> Scenario 2: Lose
>
> Scenario 3: Lose
>
> Let's say you pick the door on the left and switch doors.
>
> Scenario 1: Lose
>
> Scenario 2: Win
>
> Scenario 3: Win.
>
> Not Switching = 1 win out of 3.
>
> Switching = 2 wins out of 3.

To learn more about statistics and probability, I recommend the book [Practical Statistics for Data Scientists](https://www.amazon.co.uk/Practical-Statistics-Data-Scientists-Essential-dp-149207294X/dp/149207294X/ref=dp_ob_title_bk) - I love using this as a reference book.

## Lecture 3: Optimisation

Optimisation can be summarised as choosing the best option from a set of options.
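To make "choosing the best option from a set of options" a little more concrete, here is a minimal sketch of steepest-ascent hill climbing on a toy problem. The function names (`steepest_ascent_hill_climb`, `neighbours`, `value`) and the one-dimensional objective are my own illustrative assumptions, not code from the course.

```python
def steepest_ascent_hill_climb(start, neighbours, value, max_steps=1000):
    """Repeatedly move to the best neighbour until none improves on the current state."""
    current = start
    for _ in range(max_steps):
        candidates = neighbours(current)
        if not candidates:
            break
        best = max(candidates, key=value)
        if value(best) <= value(current):
            break  # no neighbour is better: a local (maybe global) maximum
        current = best
    return current


# Toy objective (assumed for illustration): a single peak at x = 7 on the integers 0..20
def value(x):
    return -(x - 7) ** 2


def neighbours(x):
    return [n for n in (x - 1, x + 1) if 0 <= n <= 20]


print(steepest_ascent_hill_climb(0, neighbours, value))
```

With a single peak the climb always ends at the global maximum. On a landscape with several peaks the same loop can get stuck on a local one, which is what variants such as random-restart try to address.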
**Concepts**

- [**Local search**](https://youtu.be/TA5ZJm1ZYS4?t=104): search algorithm that maintains a single node and searches by moving to a neighbouring node, but is not concerned about finding the path, just the optimal solution.
- [**State-space landscape**](https://youtu.be/TA5ZJm1ZYS4?t=252): the different configurations of possible worlds and their cost value.
- **Objective function**: function whose value we try to maximise, so the goal is the global maximum of the state-space landscape.
- **Cost function**: function whose value we try to minimise, so the goal is the global minimum of the state-space landscape.
- **Neighbouring state**: a state that is close to the current state, but slightly different, used to compare objective or cost function values.

**Algorithms**

- [**Hill Climbing**](https://youtu.be/TA5ZJm1ZYS4?t=450): start at a given state, then consider the neighbours of that state and move to the best one (highest for an objective function, lowest for a cost function).
  - [**steepest-ascent**](https://youtu.be/TA5ZJm1ZYS4?t=1271): choose the highest-valued neighbour.
  - **stochastic**: choose randomly from higher-valued neighbours.
  - **first-choice**: choose the first higher-valued neighbour.
  - [**random-restart**](https://youtu.be/TA5ZJm1ZYS4?t=1604): conduct hill climbing multiple times.
  - **local beam search**: chooses the _k_ highest-valued neighbours.
- [**Simulated Annealing**](https://youtu.be/TA5ZJm1ZYS4?t=1750): early on, more likely to accept neighbours that are worse than the current state; later on, less likely.
- [**Linear Programming**](https://youtu.be/TA5ZJm1ZYS4?t=2445): a method to achieve the best outcome (such as maximum profit or lowest cost) in a mathematical model whose requirements are represented by linear relationships.
  - [**Simplex**](https://en.wikipedia.org/wiki/Simplex_algorithm)
  - [**Interior Point**](https://en.wikipedia.org/wiki/Interior-point_method)
- [**Constraint Satisfaction problems**](https://youtu.be/TA5ZJm1ZYS4?t=3061): problems where the state has constraints or limitations.
  - [**Node Consistency**](https://youtu.be/TA5ZJm1ZYS4?t=3549): when all the values in a variable's domain satisfy the variable's unary constraints.
  - [**Arc Consistency**](https://youtu.be/TA5ZJm1ZYS4?t=3789): when all the values in a variable's domain satisfy the variable's binary constraints.
  - [**Backtracking Search**](https://youtu.be/TA5ZJm1ZYS4?t=4619): a search algorithm to solve a constraint satisfaction problem that incrementally builds candidates as the solution, but abandons a candidate ('backtracks') as soon as it finds the candidate cannot possibly be a valid solution.

**Projects**

- [**Crossword**](https://cs50.harvard.edu/ai/2020/projects/3/crossword/) - Write an AI to generate crossword puzzles. [[Solution]](https://github.com/shedloadofcode/cs50-artificial-intelligence/tree/main/3.%20Optimisation/crossword)

## Lecture 4: Learning

Machine learning models focus on finding and learning from patterns in existing data, then use those patterns to predict new outcomes with a high degree of accuracy. Although accuracy is important, it's also essential to build [explainable models / explainable AI (XAI)](/blog/understanding-explainable-ai-for-classification-regression-and-clustering-with-python/) so subjects, stakeholders and businesses can understand them and have more confidence in them.

**Concepts**

- [**Supervised learning**](https://youtu.be/E4M_IQG0d9g?t=75): given a data set of input-output pairs, learn a function to map inputs to outputs.
- [**Classification**](https://youtu.be/E4M_IQG0d9g?t=493): supervised learning task of learning a function mapping an input point to a discrete category.
- [**Regression**](https://youtu.be/E4M_IQG0d9g?t=2371): supervised learning task of learning a function mapping an input point to a continuous value.
- [**Loss function**](https://youtu.be/E4M_IQG0d9g?t=2564): function that expresses how poorly our hypothesis performs (L1, L2).
- [**Overfitting**](https://youtu.be/E4M_IQG0d9g?t=2974): when a model fits too closely to a particular data set and therefore may fail to generalize to future data.
- [**Regularization**](https://youtu.be/E4M_IQG0d9g?t=3347): penalizing hypotheses that are more complex to favor simpler, more general hypotheses.
- [**Holdout cross-validation**](https://youtu.be/E4M_IQG0d9g?t=3403): splitting data into a training set and a test set, such that learning happens on the training set and is evaluated on the test set.
- [**k-fold cross-validation**](https://youtu.be/E4M_IQG0d9g?t=3497): splitting data into _k_ sets, and experimenting _k_ times, using each set as a test set once, and using remaining data as training set.
- [**Reinforcement learning**](https://youtu.be/E4M_IQG0d9g?t=4198): given a set of rewards or punishments, learn what actions to take in the future.
- [**Unsupervised learning**](https://youtu.be/E4M_IQG0d9g?t=5935): given input data without any additional feedback, learn patterns.
- [**Clustering**](https://youtu.be/E4M_IQG0d9g?t=6019): organizing a set of objects into groups in such a way that similar objects tend to be in the same group.

**Algorithms**

- [**k-nearest-neighbor classification**](https://youtu.be/E4M_IQG0d9g?t=491): given an input, chooses the most common class out of the _k_ nearest data points to that input.
- [**Support Vector Machines (SVM)**](https://youtu.be/E4M_IQG0d9g?t=2001): algorithm which creates a line or a hyperplane which separates the data into classes.
- **Markov decision process**: model for decision-making, representing states, actions and their rewards.
- **Q-learning**: method for learning a function _Q_(s, a), estimate of the value of performing action _a_ in state _s_.
- **Greedy decision-making**
- **epsilon-greedy**
- **k-means clustering**: clustering data based on repeatedly assigning points to clusters and updating those clusters' centers.
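The k-means description above ("repeatedly assigning points to clusters and updating those clusters' centers") can be sketched in a few lines of plain Python. This is only an illustrative sketch under my own assumptions: the function name `k_means`, the 2D tuple representation and the fixed starting centres are hypothetical, and a real project would more likely reach for scikit-learn.

```python
def k_means(points, centres, iterations=10):
    """Cluster 2D points by alternating an assignment step and an update step."""
    centres = list(centres)
    clusters = [[] for _ in centres]
    for _ in range(iterations):
        # Assignment step: put each point in the cluster of its nearest centre
        clusters = [[] for _ in centres]
        for p in points:
            nearest = min(
                range(len(centres)),
                key=lambda i: (p[0] - centres[i][0]) ** 2 + (p[1] - centres[i][1]) ** 2,
            )
            clusters[nearest].append(p)

        # Update step: move each centre to the mean of its assigned points
        new_centres = []
        for centre, cluster in zip(centres, clusters):
            if cluster:
                mean_x = sum(x for x, _ in cluster) / len(cluster)
                mean_y = sum(y for _, y in cluster) / len(cluster)
                new_centres.append((mean_x, mean_y))
            else:
                new_centres.append(centre)  # keep a centre that attracted no points

        if new_centres == centres:
            break  # assignments have stabilised
        centres = new_centres
    return centres, clusters


# Two obvious groups of points and two deliberately poor starting centres
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centres, clusters = k_means(points, [(0, 0), (10, 10)])
print(centres)
print(clusters)
```

The starting centres are passed in rather than chosen at random so the result is repeatable; choosing them randomly (and re-running a few times) is the more common setup.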
**Basic template for building a machine learning classifier model**

```python [ml-scaffold.py]
import pandas as pd
import numpy as np

from sklearn.svm import SVC
from sklearn.linear_model import Perceptron  # e.g. Perceptron; any linear classifier fits here
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

# Swap in any of the imported classifiers, e.g. SVC() or GaussianNB()
model = KNeighborsClassifier()

data = pd.read_csv("filepath goes here.csv")
target = data['ColumnName'].values
# Double brackets select several columns as a DataFrame
features = data[['ColumnNameA', 'ColumnNameB', 'ColumnNameC']]

# Hold out 30% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=0.3
)

model.fit(X_train, y_train)
predictions = model.predict(X_test)

correct = (y_test == predictions).sum()
incorrect = (y_test != predictions).sum()
total = len(predictions)

print(f"Results for model {type(model).__name__}")
print(f"Correct: {correct}")
print(f"Incorrect: {incorrect}")
print(f"Accuracy: {100 * correct / total:.2f}%")
```

**Packages**

- [pandas](https://pandas.pydata.org/): fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language.
- [scikit-learn](https://scikit-learn.org/stable/): Machine learning and predictive analysis package built on NumPy, SciPy, and matplotlib. [[Lecture]](https://youtu.be/E4M_IQG0d9g?t=3582)

**Resources**

- [Google's Machine Learning Glossary](https://developers.google.com/machine-learning/glossary)

**Projects**

- [Shopping](https://cs50.harvard.edu/ai/2020/projects/4/shopping/) - Write an AI to predict whether online shopping customers will complete a purchase. [[Solution]](https://github.com/shedloadofcode/cs50-artificial-intelligence/tree/main/4.%20Learning/shopping)
- [Nim](https://cs50.harvard.edu/ai/2020/projects/4/nim/) - Write an AI that teaches itself to play Nim through reinforcement learning.
[[Solution]](https://github.com/shedloadofcode/cs50-artificial-intelligence/tree/main/4.%20Learning/nim)

## Lecture 5: Neural Networks

An artificial neural network is a mathematical model for learning inspired by biological neural networks.

**Concepts**

- [**Multilayer neural network**](https://youtu.be/mFZazxxCKbw?t=1800): artificial neural network with an input layer, an output layer, and at least one hidden layer.
- [**Deep neural network**](https://youtu.be/mFZazxxCKbw?t=1833): neural network with multiple hidden layers.
- [**Dropout**](https://youtu.be/mFZazxxCKbw?t=2238): temporarily removing units - selected at random - from a neural network to prevent over-reliance on certain units.
- [**Computer vision**](https://youtu.be/mFZazxxCKbw?t=3185): computational methods for analysing and understanding digital images.
- [**Image convolution**](https://youtu.be/mFZazxxCKbw?t=3490): applying a filter that adds each pixel value of an image to its neighbours, weighted according to a kernel matrix.
- [**Pooling**](https://youtu.be/mFZazxxCKbw?t=3988): reducing the size of an input by sampling from regions in the input.
- [**Convolutional neural network**](https://youtu.be/mFZazxxCKbw?t=4098): neural networks that use convolution, usually for analyzing images.
- [**Recurrent neural network**](https://youtu.be/mFZazxxCKbw?t=5223): neural network that generates output that feeds back into its own inputs.

**Algorithms**

- **Gradient descent**: algorithm for minimizing loss when training a neural network.
- **Backpropagation**: algorithm for training neural networks with hidden layers.

**Packages**

- [tensorflow](https://www.tensorflow.org/learn): An open source software library for high performance numerical computation. It comes with strong support for machine learning and deep learning and the flexible numerical computation core is used across many other scientific domains. See also [The Sequential model with Tensorflow Keras](https://www.tensorflow.org/guide/keras/sequential_model).
- [scikit-learn](https://scikit-learn.org/stable/): A machine learning and predictive analysis package built on NumPy, SciPy, and matplotlib.
- [opencv-python](https://pypi.org/project/opencv-python/): A library of Python bindings designed to solve computer vision problems. See [docs](https://docs.opencv.org/4.5.2/d2/d96/tutorial_py_table_of_contents_imgproc.html).

**Projects**

- [Traffic](https://cs50.harvard.edu/ai/2020/projects/5/traffic/) - Write an AI to identify which traffic sign appears in a photograph. [[Solution]](https://github.com/shedloadofcode/cs50-artificial-intelligence/tree/main/5.%20Neural%20Networks)

After downloading the distribution code and installing the Python packages from the requirements file, I ran `python3 traffic.py gtsrb` as a test and received an `Illegal instruction (core dumped)` error message. I was running this on an Ubuntu Linux VM using VirtualBox. The [fix for this](https://github.com/tensorflow/tensorflow/issues/17411) was to re-install an earlier version of the tensorflow package:

```
pip3 uninstall tensorflow
pip3 install tensorflow==1.5
```

After this I ran `python3 traffic.py gtsrb` again and arrived at the line 62 not implemented error in the `load_data` function, as expected: `File "traffic.py", line 62, in load_data raise NotImplementedError`. Hope this helps you out if you find yourself getting the same error message!

Here is a demo of the Convolutional Neural Network model used for the Traffic project in action!

## Lecture 6: Language

Natural Language Processing or NLP aims to understand human language, both written and spoken, to extract information.

**Concepts**

- [**n-gram**](https://youtu.be/_hAVVULrZ0Q?t=1681): a contiguous sequence of _n_ items inside of a text.
- [**Tokenization**](https://youtu.be/_hAVVULrZ0Q?t=1836): the task of splitting a sequence of characters into pieces (tokens).
- **Text Categorization**
- [**Bag-of-words model**](https://youtu.be/_hAVVULrZ0Q?t=2561): represent text as an unordered collection of words.
- **Information retrieval**: the task of finding relevant documents in response to a user query.
- [**Topic modeling**](https://youtu.be/_hAVVULrZ0Q?t=4199): models for discovering the topics for a set of documents.
- [**Term frequency**](https://youtu.be/_hAVVULrZ0Q?t=4253): number of times a term appears in a document.
- [**Function words**](https://youtu.be/_hAVVULrZ0Q?t=4456): words that have little meaning on their own, but are used to grammatically connect other words.
- [**Content words**](https://youtu.be/_hAVVULrZ0Q?t=4492): words that carry meaning independently.
- [**Inverse document frequency**](https://youtu.be/_hAVVULrZ0Q?t=4643): measure of how common or rare a word is across documents. Formula is *log(total_documents / number_of_documents_containing(word))*
- [**Information extraction**](https://youtu.be/_hAVVULrZ0Q?t=4873): the task of extracting knowledge from documents.
- [**WordNet**](https://youtu.be/_hAVVULrZ0Q?t=5413): a lexical database of semantic relations between words.
- [**Word representation**](https://youtu.be/_hAVVULrZ0Q?t=5537): looking for a way to represent the meaning of a word for further processing.
  - [**one-hot**](https://youtu.be/_hAVVULrZ0Q?t=5636): representation of meaning as a vector with a single 1, and with other values as 0.
  - [**distribution**](https://youtu.be/_hAVVULrZ0Q?t=5768): representation of meaning distributed across multiple values.

**Algorithms**

- [**Markov model applied to language**](https://youtu.be/_hAVVULrZ0Q?t=2281): generating the next word based on the previous words and a probability.
- [**Naive Bayes**](https://youtu.be/_hAVVULrZ0Q?t=2806): based on the Bayes' Rule to calculate probability of a text being in a certain category, given it contains specific words. Assuming every word is independent of each other.
- [**Additive smoothing**](https://youtu.be/_hAVVULrZ0Q?t=3743): adding a value _a_ to each value in our distribution to smooth the data.
- [**Laplace smoothing**](https://youtu.be/_hAVVULrZ0Q?t=3753): adding 1 to each value in our distribution (pretending we've seen each value one more time than we actually have).
- [**tf-idf**](https://youtu.be/_hAVVULrZ0Q?t=4703): ranking of what words are important in a document by multiplying term frequency (TF) by inverse document frequency (IDF).
- **Automated template generation**: giving the AI some terms and letting it look into a corpus for patterns where those terms show up together. Then it can use those templates to extract new knowledge from the corpus.
- [**word2vec**](https://youtu.be/_hAVVULrZ0Q?t=5873): model for generating word vectors.
- **skip-gram architecture**: neural network architecture for predicting context words given a target word.

**Packages**

- [**NLTK**](https://www.nltk.org/): Natural language toolkit or NLTK is a package for working with human language data. [[Lecture]](https://youtu.be/_hAVVULrZ0Q?t=1236)

**Projects**

- [Parser](https://cs50.harvard.edu/ai/2020/projects/6/parser/) - Write an AI to parse sentences and extract noun phrases. [[Solution]](https://github.com/shedloadofcode/cs50-artificial-intelligence/tree/main/6.%20Language/parser)
- [Questions](https://cs50.harvard.edu/ai/2020/projects/6/questions/) - Write an AI to answer questions. [[Solution]](https://github.com/shedloadofcode/cs50-artificial-intelligence/tree/main/6.%20Language/questions)

## Reflections on the course

Overall I found the course challenging yet extremely informative on the concepts and implementations of AI. It had the right balance between abstract concepts and concrete solutions in Python. I'm now much more aware of these AI concepts and always on the lookout for problems they can be applied to, or problems that can be framed as one of them.
I think just knowing how to solve certain kinds of problems is half the battle; the other half is shaping the problem into a workable solution. To do that you need solid robust data, and a clear vision for the 'world' in which the AI agent will operate.

We have already seen widespread use of AI and this can only increase in the coming decades. I think having an understanding of the fundamentals and building your own small AI solutions is essential, especially for Software Engineers and Data Scientists.

The main aim in building intelligent systems, for me, is to enable the autonomous agents that operate within their 'world' to carry out tasks and make decisions at or above the accuracy a human domain expert could, but faster and more reliably. To achieve that, it may involve a combination of machine learning, statistics, software engineering, system architecture and data engineering skills, plus business domain knowledge. As shown in the below image, there is generally an overlap between roles and skills, but in my opinion all of these skills have a benefit to any digital or data role.

The concepts and skills learnt in this course certainly help to get you started on your journey to engineering intelligent, autonomous systems and your own AI programs that can help make other people's lives better.

The certification is optional, however I opted to purchase it and have talked about it and the skills gained from it within interviews. I think it demonstrates a commitment to continuing professional development, an attitude of continuous learning and an accolade you can be proud of upon finishing the course.

As always, if you enjoyed this article be sure to check out [other articles](/) on the site including [Developing your data science and analytical coding skills - a review of DataCamp](/blog/developing-your-data-science-and-analytical-coding-skills-a-review-of-datacamp/) 😄

How to upload PDF files to Azure Blob Storage with Vue and Python Flask

Thu, 24 Mar 2022 11:15:00 GMT

In this article we’ll take a look at uploading PDF files to Azure Blob Storage using Vue and Python Flask. This is a common use case I’ve come across for document storage. Although we’ll be uploading PDFs in this article, the same approach can be used for files of any kind.

## Getting started

We’re going to use the same Vue Flask template I used in another article, [How to query a database with Python Flask and download data to CSV or XLSX in Vue](/blog/query-sql-and-download-csv-and-xlsx-in-flask/). The template is in this public [GitHub repository](https://github.com/gtalarico/flask-vuejs-template) from gtalarico.

Once you’ve cloned or downloaded the repo, set up a virtual environment with pipenv and install the packages that will be needed below.

```
cd flask-vuejs-template-master
python -m pip install pipenv
python -m pipenv install --dev
python -m pipenv install flask-restx azure-storage-blob
python -m pipenv shell
```

Since this template uses [Flask-RESTPlus](https://flask-restplus.readthedocs.io/en/stable/) and we're using [Flask-RESTX](https://flask-restx.readthedocs.io/en/latest/), which is a community driven fork of it, go ahead and replace all references to 'flask_restplus' with 'flask_restx'.

We'll be using the [azure-storage-blob](https://pypi.org/project/azure-storage-blob/) package which has a [quickstart guide](https://docs.microsoft.com/en-us/azure/storage/blobs/storage-quickstart-blobs-python#run-the-code) from Microsoft. There are similar packages available for other languages too, including Java, C# and .NET, JavaScript, C++, Go and more.

Now that the Python packages are installed, let's install and upgrade the Vue dependencies with Yarn, and build the Vue dist directory.

```
yarn install --dev
yarn upgrade
yarn build
```

If everything went smoothly, you should be able to run both the backend and frontend dev servers. Run `python run.py` and from another terminal window in the same directory run `yarn serve`.
You should see the app running at http://localhost:8080/ 👍

## Set up Azure Blob Storage container

Beginning with the end in mind, we'll need a place to store files in Azure. So the first job is to set up a Storage Account for that in Azure. This [article](https://docs.microsoft.com/en-us/azure/storage/common/storage-account-create?tabs=azure-portal) from Microsoft goes over the process of setting one up. You head to the [Azure portal](https://portal.azure.com/#create/hub) and search for "storage account" then hit **Create**.

Make sure you delete this Storage Account resource after testing as you may incur costs if you don't. If in any doubt always [check the pricing calculator](https://azure.microsoft.com/en-gb/pricing/calculator/) or [Azure Blob Storage pricing page](https://azure.microsoft.com/en-gb/pricing/details/storage/blobs/) from Microsoft.

Once it has finished deploying you'll need to create a container and grab the credentials as we'll need them later on!

To create a container, go to the Storage Account resource and hit the add container button. We'll use this new container to store the PDF files.

Then head to **Access keys** under **Security + networking** and hit **Show keys**. Copy the storage account name, the keys and connection strings. You should only need the **Connection string** under **key1** to connect with the Python SDK though. Never hurts to have backup credentials.

## Create form to upload file in Vue

Inside src/views/Home.vue we first want a very basic outline of our file upload form. Substitute the template tags for this new template creating the form.

```html [src/views/Home.vue]
```

We then need to implement the `handleFileUpload` and `submitFile` methods. The first will allow us to **stage** a file, and the second will allow us to **submit** and send that file with Axios to the Python API endpoint at /api/upload we'll create later.
```js [src/views/Home.vue]
```

## Handle file upload in Flask

Now we have a basic form to upload the file to the server with Axios, let's create an API endpoint that will actually upload the file to Azure Blob Storage. Inside app/api/resources.py we'll add a route to handle this operation.

```python [app/api/resources.py]
"""
REST API Resource Routing
http://flask-restplus.readthedocs.io
"""

import io
from datetime import datetime

from flask import request, jsonify, send_file
from flask_restx import Resource
from azure.storage.blob import BlobServiceClient, ContainerClient

from .security import require_auth
from . import api_rest

AZURE_CONNECTION_STRING = "DefaultEndpointsProtocol=https;" + \
    "AccountName=vueflaskstorageaccount;" + \
    "AccountKey=m6Vegjl44F28CnuujeYI27kZblp7pQBRftsuDXGLUN0PkfuRxAkY3MqJogwu2FShclWFWHfD3n4hJYeQEmk3GQ==;" + \
    "EndpointSuffix=core.windows.net"


@api_rest.route('/upload')
class UploadFile(Resource):
    """ Uploads file to Azure Blob Storage """

    def post(self):
        # The uploaded file arrives in the multipart form under the "file" key
        f = request.files["file"]

        try:
            service_client = BlobServiceClient.from_connection_string(AZURE_CONNECTION_STRING)
            container_client = service_client.get_container_client("pdf-container")
            blob_client = container_client.get_blob_client(f.filename)
            blob_client.upload_blob(f)
            return jsonify(success=True)
        except Exception:
            # Any SDK or network failure is reported back to the client as success=False
            return jsonify(success=False)
```

After assigning the connection string from earlier to `AZURE_CONNECTION_STRING` (in production you don't want to hardcode this sensitive connection string, instead use an environment variable) we initialise a service client which gets the container and uploads the file inside of it.

The [sample](https://github.com/Azure/azure-sdk-for-python/blob/main/sdk/storage/azure-storage-blob/samples/blob_samples_hello_world.py) and [introduction](https://docs.microsoft.com/en-us/python/api/overview/azure/storage-blob-readme?view=azure-python#getting-started) from Microsoft are useful for learning more about working with the Azure Blob Storage SDK for Python.
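On the point about not hardcoding the connection string, here is one possible sketch of reading it from an environment variable instead. The helper name `get_connection_string` is hypothetical, and the variable name `AZURE_STORAGE_CONNECTION_STRING` is just an assumed convention; use whatever name your deployment defines.

```python
import os


def get_connection_string(env_var="AZURE_STORAGE_CONNECTION_STRING"):
    """Read the storage connection string from the environment, failing loudly if it is unset."""
    value = os.environ.get(env_var)
    if not value:
        raise RuntimeError(f"Set the {env_var} environment variable before starting the app")
    return value


if __name__ == "__main__":
    # Placeholder value for a local run only; a real deployment sets this outside the code
    os.environ.setdefault("AZURE_STORAGE_CONNECTION_STRING", "UseDevelopmentStorage=true")
    print(get_connection_string())
```

In `app/api/resources.py` the hardcoded assignment could then become `AZURE_CONNECTION_STRING = get_connection_string()`, so the secret stays out of source control.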
## Listing all files in the container

Whilst in app/api/resources.py we're gonna add in two more routes, one to get all files in the container, and another to download a given file by name. This will allow us to list all files in the application and select one to download. Add these under the `UploadFile` class.

```python [app/api/resources.py]
@api_rest.route('/all-files')
class GetAllFiles(Resource):
    """ Gets all filenames in the Azure Blob Storage container """

    def get(self):
        container = ContainerClient.from_connection_string(
            conn_str=AZURE_CONNECTION_STRING,
            container_name="pdf-container"
        )

        all_filenames = []
        blob_list = container.list_blobs()
        for blob in blob_list:
            all_filenames.append(blob.name)

        return { "filenames": all_filenames }


# The <filename> URL variable is passed to post() as the filename argument
@api_rest.route('/download/<filename>')
class DownloadFile(Resource):
    """ Downloads a file from Azure Blob Storage by filename """

    def post(self, filename):
        service_client = BlobServiceClient.from_connection_string(AZURE_CONNECTION_STRING)
        container_client = service_client.get_container_client("pdf-container")
        blob_client = container_client.get_blob_client(filename)

        bytes_stream = io.BytesIO()
        blob_data = blob_client.download_blob()
        blob_data.readinto(bytes_stream)
        bytes_stream.seek(0)

        # Note: Flask 2.0 renamed attachment_filename to download_name
        return send_file(bytes_stream, attachment_filename=blob_data.name, mimetype="application/pdf", as_attachment=True)
```

## Downloading a file

Now we have API endpoints to handle both returning the list of files in the container, and to actually download a file, let's make a very simple UI to do both. We'll repurpose `src/views/Api.vue` for this.

```html [src/views/Api.vue]
```

## What we learned

We now have a small but working application that can handle file uploads and downloads. We learned how to build an upload form in Vue.js and then configure and work with Axios and the Azure Blob Storage Python package. Let's take a look at a short video of the application in action!
Let's upload three PDF files from my Downloads folder to Azure Blob Storage then download them via the app. I hope you enjoyed this article and can put what you learned here into practice in your own projects. If you're interested in deploying a Vue Flask app be sure to check out the article [Automated deployment of a Vue Flask app using Azure Pipelines](/blog/automated-deployment-of-a-vue-flask-app-using-azure-pipelines/). If you enjoyed this article be sure to check out [other articles](/) on the site.

Preparing for a statistical data science interview

Wed, 09 Feb 2022 17:45:00 GMT

In this article, I'll cover the steps I followed during my recent application for a Senior Data Scientist role. I hope this helps you organise your own preparation for data science and any other roles 😄 ## Start with the job requirements To write a good application and prepare for interview, you need to know your selling points against the job criteria. Below are criteria for three different Senior Data Scientist roles I found. They all have similarities but can vary in tooling and technology used. Select each of the panels to view the requirements.

In preparing an application, you want to be ready to stress where the requirements of the role and your strong areas collide. ## Write and submit an application When it comes to applications, you've either worked as a Data Scientist or you haven't. If you have, showcase your experience and projects. If you haven't, apply Data Science techniques and solve data problems in your current role or in your own time and showcase those projects. Either way, whilst sifting applications they're gonna be looking for **relevant** experience. Give them what they want. When I've been on the other side of recruitment sifting applications, the number one thing that marks candidates down is not providing relevant evidence of using the skills required for the job they're applying for. All of the examples in this article are made up and light on detail but the structure and format of them are the same as what I use for real. You'll need to expand on them but they give a solid framework. For applications, the style should be direct, punchy, quick to scan, easy for a hiring manager to sift and shouldn't include anything that makes you look bad. If you don't have any direct data science experience, show data science techniques you use in your current role. If you don't have any data science examples in your current role, start 'doing data science' in your current role! No one just starts out as a data scientist, but the good news is, you don't have to be a data scientist to apply analytical techniques. The CV and personal statement below cover both angles. ## Prepare competency answers using STAR Also called behaviour questions, these require prepared answers that tell a story about your behaviour. Always try to start sentences with '**I**' rather than '**we**'. The assessor is interested in your contribution, not what other people did. 
Approach these questions like you are trying to tell the assessor how great you are, and what an asset you'll be by proving you've handled tough situations before, and delivered strong results. Look at the competencies you'll be assessed on, think about what you've done in the past, and start writing up an answer in the STAR format. * Situation = briefly outline the context * Task = briefly outline what you needed to do and why * Action = go into detail about what you did and your thinking process (why you did those things) * Result = drive home that as a result of your actions, X was the outcome, quantify results if possible (saved X%) Here is a short example that follows the STAR format. For the real thing I would expand on the action paragraphs, adding in: * Why you chose that analytical method * What alternatives and options you explored * How you handled messy data * How you evaluated the model * How you updated the model to include new data * What obstacles and issues you overcame * How you got buy-in from others * How you validated your methods * How you handled conflict and disagreements * Did you need to delegate any tasks * Was there time pressure * How did you prioritise conflicting tasks * How did you avoid burnout juggling extra responsibilities * How you ensured standards were high * How you ensured you were meeting the customer's needs * How you measured success * How did you disseminate the analysis to non-technical people * How you deployed and maintained the model I would spend *some* time thinking through possible hypothetical questions that might test your understanding of the company, business area or sector. These might include questions like: * Imagine we gave you data on X, what kind of analysis would you perform on it? * How would you use data science techniques to improve products and services in our sector? * We do X analysis here, why do you think that's important for us? 
## Prepare technical presentation Main thing for any presentation is to keep it clear and concise. Address any points you're asked to, otherwise stick to the [rule of three](https://medium.com/swlh/the-rule-of-three-how-to-use-it-9c67219364f6). Keep it mostly high level, but be prepared to drill into details. I was asked to present a recent analytical project. I don't like PowerPoint but created some slides as talking points: * Introduction - quick about me then into the problem statement * Research - how I researched the issue * Development - how I built the solution * User journey - to understand the product * Challenges and solutions - how I overcame obstacles * Analytical techniques - drill into key data science techniques used * Launch - releasing a working product or model into production * Outcomes - the value that was added and success metrics ## Review statistical concepts Statistics underpin almost all of data science. I think data science can even be referred to as statistical learning. So going back to basics can never hurt. I wouldn't get too bogged down during this part, but certainly don't neglect it. * Descriptive statistics * Inferential statistics * Distributions (normal, binomial, poisson, exponential) * Sampling (random, stratified, cluster) * Hypothesis testing * Statistical significance * Regression * Confidence Intervals * Correlation * P-values * Probability (Bayes theorem) * Bias * Testing (z-score, t-test, Chi-square, ANOVA) A great book for brushing up on statistical concepts is [Practical Statistics for Data Scientists: 50+ Essential Concepts Using R and Python](https://www.amazon.com/Practical-Statistics-Data-Scientists-Essential/dp/149207294X/ref=sxin_13_mbs_w_global_sims). 
Another extremely useful book I mention later that covers lots of topics including statistics is [Ace the Data Science Interview: 201 Real Interview Questions Asked By FAANG, Tech Startups, & Wall Street](https://www.amazon.com/Ace-Data-Science-Interview-Questions/dp/0578973839/ref=sr_1_1). ## Review machine learning concepts As with statistical concepts it's always good to refresh your knowledge of machine learning algorithms and when to use each before heading into a data science interview. These can be broadly categorised as: * Supervised learning - classification, regression algorithms * Unsupervised learning - clustering algorithms * Reinforcement learning A book I refer back to again and again on machine learning is [Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems](https://www.amazon.com/Hands-Machine-Learning-Scikit-Learn-TensorFlow/dp/1492032646/ref=sr_1_1) ## Complete a practice take-home test Although I don't really agree with take-home tests or coding challenges, it is out there, it exists, and if I find a role I really want but there's a test attached I'll consider it based upon time investment. A test might mean spending significant time investment brushing up on concepts you've not used in a while. Nevertheless, putting aside the LeetCode style coding challenges covered in my [coding interview topics in Python article](/blog/exploring-coding-interview-topics-in-python/), I figure for data science there will be only one of two possibilities for a test. It will either be an analytical (tell us something interesting about this data) or modelling (predict X outcome with this data, model this data to calculate X outcome) project. For analytical I'd use Jupyter Notebook or Jupyter Lab and for modelling I'd use Visual Studio Code with the [cookiecutter package](https://drivendata.github.io/cookiecutter-data-science/) for good code organisation. 
This makes creating a new machine learning project as easy as: ``` pip install cookiecutter ``` ``` cookiecutter https://github.com/drivendata/cookiecutter-data-science ``` I found a [Reddit post](https://www.reddit.com/r/datascience/comments/a0xi77/practice_takehome_case_study_datasetscode_included/) linking to a [practice case study with code](https://www.interviewqs.com/blog/case-study-example-1) applicable to any analytical role - and with some modifications to a data science role. This is the kind of thing any take-home might look like: > Build a simple model based on insights you've found and describe how its predictions add value to the company. Present > the model you fitted, why you chose it, explain the model as if speaking to a non-technical audience and how the > predictions could have an impact on the business processes going forward. I have also come across statistical and numerical aptitude tests but they are usually only at entry-level or for large recruitment campaigns. For the last numerical aptitude test I had I used [Assessment Day](https://www.assessmentday.co.uk/aptitudetests_numerical.htm) to prepare. They are usually fast pace like one minute per question so the winning formula is, read the question, look for the specific data the question is talking about, perform calculations, sense check, select the answer, then move on (you don't have time to double check answer). For statistical tests, here is a [good practice test](https://files.civilservicejobs.service.gov.uk/admin/fairs/apptrack/download.cgi?SID=b3duZXI9NTA3MDAwMCZvd25lcnR5cGU9ZmFpciZkb2NfdHlwZT12YWMmZG9jX2lkPTk4OTUyMCZ2ZXJpZnk9NDliZDZiMTQ1ZDQ2NjZlMDkyYWRmZDBlMGM3MDZhYmYmcmVxc2lnPTE2NDI2MTE5NTgtODJhMDg1Y2Q2MmU2ODJjN2NjZjUwNGMzOGM3ZjE3NzVlNTU1NGIzNw==) with answers at the end. 
A book that guided me through the case study and coding aspects of data science interviews is [Ace the Data Science Interview: 201 Real Interview Questions Asked By FAANG, Tech Startups, & Wall Street](https://www.amazon.com/Ace-Data-Science-Interview-Questions/dp/0578973839/ref=sr_1_1). This is an invaluable resource considering the sheer breadth of its contents. I took a risk on this one and decided to try and get it delivered ASAP before my interview. It was just what I needed and covers almost all topics you'd need to be aware of. This includes chapters on probability, statistics, machine learning, coding / data structures and algorithms, SQL and database design, product sense and case studies.

## On the day

The main things you should do on the day of the interview include:

* Stay relaxed!
* Be yourself
* Enjoy the process as much as possible
* Let your passion for data science show
* Tell them how amazing you are in your answers (not arrogant but confident)
* Like in the application say '**I**' not '**we**' (they are interested in your actions)
* Remember the STAR format (it will keep your answers on track)
* Remember the data science lifecycle (it will keep your answers on track)
* Ask questions (you're interviewing them too!)
* Don't be afraid to say if you don't know something (How would you learn about it?)

## After the interview

Well done! You did it! The interview is over and you can breathe a sigh of relief. My last major tip might not be what you're expecting... Now that the interview is over, write down the questions you were asked (to practise in future) and then forget about the interview completely! Don't dwell on things that you could have said, mistakes you think you made, or even what went well. Just consign it to the history books. Yes, celebrate that it's over and done with and that you gave it a solid effort, but expect the answer to be 'sorry we went with another candidate, but we thought you were great'. Expect the worst, hope for the best.
By doing this, you'll force yourself to view applications and interviews as opportunities and won't over-invest yourself emotionally in them. Everyone fails interviews for many different reasons. If you did great, it's their loss. If you stumbled, see it as practice and improve for next time. The key to getting what you want in anything is to never stop trying, failing, then improving, then trying again. I hope this article helped you prepare for your own data science interview and wish you the best of luck with it! If you enjoyed this article, be sure to check out [other articles](/) on the site.

Reduce Material Design Icons Font to 7KB and automate with PyAutoGUI

Wed, 05 Jan 2022 16:10:00 GMT

This article will cover how I reduced the total size of loading Material Design Icons Font from 361KB to 7KB, and then automated that process using PyAutoGUI. We’ll go through a full end-to-end tutorial of the process. If you want to follow along you can [download the distribution code](https://github.com/shedloadofcode/reduce-mdi-icons-font) before continuing. ## Why optimise Material Design Icons Font? Website performance is critical for delivering a solid user experience. It is especially important for serving web pages to mobile devices and/or locations with poor network connectivity. Not only that, lower page sizes saves bandwidth usage and in turn saves money. Every byte and millisecond counts. I try to identify any opportunity to improve performance and page speed I can. It’s a continuous process of improvement. I recently optimised this site and ticked off the following improvements: * Compressing images with tools like TinyPNG * Using smaller image formats like WebP * Lazy loading images and videos below the fold * Lazy hydration for SPA’s * Minifying JavaScript and CSS * Reducing payload sizes for data requests * Reducing webpack bundle size * Eliminating any redirects * Caching or precomputing results for expensive operations There was something still bothering me though, I had a Lighthouse error indicating “ensure text remains visible during webfont load”. This was because I was using a CDN to pull in the Material Design icons stylesheet from [cdn.jsdelivr.net](https://cdn.jsdelivr.net/npm/@mdi/font@5.8.55/css/materialdesignicons.min.css) which then downloaded the [woff2 font](https://cdn.jsdelivr.net/npm/@mdi/font@5.8.55/fonts/materialdesignicons-webfont.woff2?v=5.8.55). From the CDN, the font file weighed in at 320KB and the stylesheet was 43.5KB. The solution recommended was to add [font-display swap](https://web.dev/font-display/?utm_source=lighthouse&utm_medium=devtools) to the font stylesheet selector. 
I know that seems silly for an icon font as there is no 'fallback' for icons really, a better suggestion for icon fonts might be to use [font-display block instead](https://stackoverflow.com/questions/49461308/correct-font-display-value-for-icon-fonts). This was impossible to achieve using a CDN although I didn’t mind the idea of self-hosting icon web fonts. I knew there were trade offs between using a CDN opposed to self hosting, but in the interests of site reliability (who wants to use a site without icons if the CDN stops working, right?) I decided to self-host the icon font. This is where my optimisation experiment began! ## Download the Material Design Icon pack With the distribution code I’m starting with a simple HTML page, alongside empty style and font folders. We can see the page has 19 Material Design icons and they are coming from the CDN source to begin with. The Material Icons are being loaded via the stylesheet link tag in the head section of the document, which then loads the 320KB woff2 font file which you will see by hitting F12 and inspecting the Network tab in Chrome (or a different browser's) DevTools. To make viewing this information easier, you can filter the Network tab to just 'CSS' and 'Font' like in the image above. ```html [index.html] Icons Site ``` We’re going to swap this out for a locally hosted icon font and stylesheet. Download the [Material Design icon font](https://github.com/Templarian/MaterialDesign-Webfont) and extract the contents. Move all of the font files in the 'fonts' folder to our project's 'fonts' folder. Then move 'materialdesignicons.css' in the 'css' folder to our project's 'css' folder. At the end, we'll only be using the .woff2 file as it provides [improved compression and is supported by major browsers](https://stackoverflow.com/questions/11002820/why-should-we-include-ttf-eot-woff-svg-in-a-font-face). 
With the stylesheet and the font files in the correct folders, let’s hook up the stylesheet and remove the CDN by updating the head section. ```html [index.html] ... Icons Site ... ``` To eliminate the pesky 'ensure text remains visible' Lighthouse error I also added 'font-display block' to 'materialdesignicons.css'. ```css [materialdesignicons.css] /* MaterialDesignIcons.com */ @font-face { font-family: "Material Design Icons"; src: url("../fonts/materialdesignicons-webfont.eot?v=6.5.95"); src: url("../fonts/materialdesignicons-webfont.eot?#iefix&v=6.5.95") format("embedded-opentype"), url("../fonts/materialdesignicons-webfont.woff2?v=6.5.95") format("woff2"), url("../fonts/materialdesignicons-webfont.woff?v=6.5.95") format("woff"), url("../fonts/materialdesignicons-webfont.ttf?v=6.5.95") format("truetype"); font-weight: normal; font-style: normal; font-display: block; } ``` After hard refreshing the page (Ctrl + F5) you should see the icon font is still working as expected but with the CDN removed. Checking the Network tab again we can see the icons are now being loaded locally via 'materialdesignicons.css'. 👍 The major problem from the image above is that the 'materialdesignicons.css' file is over 26,000 lines of code for over 5,000 icons, and is 369KB, and the .woff2 file is 361KB, and yet we’re only using 19 icons! The page load time will be bloated, our bandwidth is being consumed and the visitor experience badly affected as a result. The average web page is around 2-3MB, but the recommended size is 1MB. This is 73% of that recommended 1MB in the icon stylesheet and font alone! We could minify 'materialdesignicons.css' to look similar to 'materialdesignicons.min.css' from the original download which is 298KB but that's still too large. Let’s embark on the next step in our efficiency quest to reduce both the stylesheet and font file sizes. 
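The 73% figure above is simply the two file sizes added together and compared with the 1MB target. Here is a quick sketch of that arithmetic, assuming 1MB is treated as 1,000KB (which is what the percentage implies). The helper name `share_of_budget` is made up for this example.

```python
def share_of_budget(sizes_kb, budget_kb=1000):
    """Return the combined size as a percentage of a page-weight budget."""
    return sum(sizes_kb) * 100 / budget_kb


# Stylesheet (369KB) plus .woff2 font (361KB) against a 1MB budget
icon_assets_kb = [369, 361]
print(f"{share_of_budget(icon_assets_kb):.0f}% of the budget")
```

Re-running this with the sizes after each optimisation step is an easy way to see how much of the budget the icons still take up.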
## Identify which icons are actually being used throughout the site I first searched the site to figure out which icons were actually being used throughout it. The one page site we’re using has 19 icons, this site had around 84, mostly in the tools (especially the [System Capacity Calculator](/tools/system-capacity-calculator/)). I made a note of these by inspecting with DevTools, finding the CSS selector in the full stylesheet 'materialdesignicons.css', then copying them into a separate Notepad++ file. This can take a little time, but well worth it! ## Remove unused selectors from the stylesheet The result of my investigation to identify the icons actually used gave me a list of 19 CSS selectors. I made a backup of the full stylesheet in case I wanted to add any more icons in the future, but after replacing the body with the condensed list this is what it looked like: ```css [materialdesignicons.css] /* MaterialDesignIcons.com */ @font-face { font-family: "Material Design Icons"; src: url("../fonts/materialdesignicons-webfont.eot?v=6.5.95"); src: url("../fonts/materialdesignicons-webfont.eot?#iefix&v=6.5.95") format("embedded-opentype"), url("../fonts/materialdesignicons-webfont.woff2?v=6.5.95") format("woff2"), url("../fonts/materialdesignicons-webfont.woff?v=6.5.95") format("woff"), url("../fonts/materialdesignicons-webfont.ttf?v=6.5.95") format("truetype"); font-weight: normal; font-style: normal; font-display: block; } .mdi:before, .mdi-set { display: inline-block; font: normal normal normal 24px/1 "Material Design Icons"; font-size: inherit; text-rendering: auto; line-height: inherit; -webkit-font-smoothing: antialiased; -moz-osx-font-smoothing: grayscale; } .mdi-language-c::before { content: "\F0671"; } .mdi-language-cpp::before { content: "\F0672"; } .mdi-language-csharp::before { content: "\F031B"; } .mdi-language-css3::before { content: "\F031C"; } .mdi-language-fortran::before { content: "\F121A"; } .mdi-language-go::before { content: "\F07D3"; } 
.mdi-language-html5::before { content: "\F031D"; }
.mdi-language-java::before { content: "\F0B37"; }
.mdi-language-javascript::before { content: "\F031E"; }
.mdi-language-kotlin::before { content: "\F1219"; }
.mdi-language-lua::before { content: "\F08B1"; }
.mdi-language-markdown::before { content: "\F0354"; }
.mdi-language-php::before { content: "\F031F"; }
.mdi-language-python::before { content: "\F0320"; }
.mdi-language-r::before { content: "\F07D4"; }
.mdi-language-ruby::before { content: "\F0D2D"; }
.mdi-cpu-64-bit::before { content: "\F0EE0"; }
.mdi-server::before { content: "\F048B"; }
.mdi-access-point-network::before { content: "\F0002"; }
.mdi-18px.mdi-set, .mdi-18px.mdi:before { font-size: 18px; }
.mdi-24px.mdi-set, .mdi-24px.mdi:before { font-size: 24px; }
.mdi-36px.mdi-set, .mdi-36px.mdi:before { font-size: 36px; }
.mdi-48px.mdi-set, .mdi-48px.mdi:before { font-size: 48px; }
```

You can use this to replace the entire contents of the stylesheet. 112 lines is much better than 26,000 lines. This file now weighs in at 2.1KB and we’re feeling lighter already. If we hard refresh again, we can see the site is still working as expected 😆 Now onto the harder part, optimising the font file.

## Remove unused selectors from the font file

So we’ve reduced the size of the stylesheet to only the icons we’re using, how do we do the same for the .woff2 font file? To do this I used a free tool called [FontForge](https://fontforge.org/en-US/). The process sounds difficult at first but this is what worked for me:

1. Download FontForge
2. Open the .ttf font file
3. Select the icons you want to keep by searching for an icon with `Ctrl + Shift + >` then ticking 'Merge into selection'
4. Invert the selection (selecting all the icons you want to get rid of)
5. Remove the unused icons
6. Condense
7. Generate the font
8. Save as .woff2

This process feels very repetitive and I certainly wasn’t doing this for 84 icons or even for our 19 icons.
Of course, if you’re only using 5 you might not mind searching for them then removing the rest, but for any more it’s tedious. I automated this step using [PyAutoGUI](https://pyautogui.readthedocs.io/en/latest/) and have a video of a robot following the process outlined above in the next section.

## Automate the minify font file process with PyAutoGUI

So we’ve decided selecting 19 or more icons is too repetitive, time consuming and prone to human error. Let’s automate the process. This video shows the IconFontMinifierRobot in action following the process we outlined earlier. I opened FontForge, loaded the .ttf file, ran the robot, switched back to FontForge and the robot takes over. I just love building robotic automation process solutions.

The robot reads the CSS stylesheet, extracts the icon selector names used in the stylesheet using a regular expression, selects those identified icons in FontForge, selects the inverse, condenses and generates the font then saves as a .woff2 file and it’s only 2.8KB! When I did the same thing for this site it was around 7KB for 84 icons. Now we can test it still works in our site. Before that though, you want to see the code for the robot right?

```python [icon_font_minifier_robot.py]
"""Automates removing unused material icons from a .ttf font file.

Reads in a material icons scss or css file and parses applied css selectors
such as '.mdi-close-box-multiple-outline::before'. Uses PyAutoGUI to control
FontForge in order to remove all unused icons from the .ttf file then saves
the output as a .woff2 file.

Ensure FontForge is opened and loaded with the .ttf before running, then run
the program, switch to FontForge and let the robot take over :)

Typical usage example:

    robot = IconFontMinifierRobot()
    robot.removeUnusedIcons(
        css_filepath="css/materialdesignicons.css",
        woff2_output_folderpath="C:\\Users\\shedloadofcode\\Documents\\icon-fonts-project\\fonts\\"
    )
"""
import pyautogui
import re
import time


class IconFontMinifierRobot():

    def removeUnusedIcons(self, css_filepath, woff2_output_folderpath):
        print("Opening stylesheet...")
        # The with block closes the file once the selectors have been parsed
        with open(css_filepath, "r") as stylesheet:
            print("Parsing stylesheet...")
            icons = self.get_icon_styles_from(stylesheet)

        print("Now switch active window to FontForge :)")
        time.sleep(15)

        print("Selecting icons in FontForge...")
        for icon in icons:
            self.select_icon_in_fontforge(icon)

        print("Removing icons not in CSS...")
        self.invert_selection()
        self.detach_and_remove_selected_glyphs()
        self.make_compact()

        print("Generating .woff2 file...")
        self.generate_fonts(
            woff2_output_folderpath,
        )
        self.confirm_generate()
        print("Font saved.")

    def get_icon_styles_from(self, stylesheet):
        # The character class already covers upper and lower case, so no flags
        # are needed. (The second argument of a compiled pattern's findall()
        # is a start position, not a flag.)
        pattern = re.compile(r"\.mdi-[a-z\-A-Z\-0-9]+::before")
        icon_styles = pattern.findall(stylesheet.read())
        print(f"{len(icon_styles)} icons found.")
        return icon_styles

    def select_icon_in_fontforge(self, icon):
        # Open the search dialog and type the glyph name
        pyautogui.hotkey("ctrl", "shift", ">")
        time.sleep(0.5)
        pyautogui.typewrite(
            icon.replace(".mdi-", "").replace("::before", "")
        )
        pyautogui.moveTo(927, 533)
        pyautogui.click()
        time.sleep(0.5)
        pyautogui.moveTo(915, 560)
        pyautogui.click()
        time.sleep(0.5)
        pyautogui.press('enter')
        time.sleep(2)

    def click_encoding_menu(self):
        pyautogui.moveTo(240, 35)
        pyautogui.click()
        time.sleep(2)

    def invert_selection(self):
        pyautogui.moveTo(53, 32)
        pyautogui.click()
        time.sleep(1)
        pyautogui.moveTo(112, 477)
        pyautogui.click()
        time.sleep(1)
        pyautogui.moveTo(444, 503)
        pyautogui.click()
        time.sleep(2)

    def detach_and_remove_selected_glyphs(self):
        self.click_encoding_menu()
        time.sleep(1)
        pyautogui.moveTo(292, 182)
        pyautogui.click()
        time.sleep(2)
        pyautogui.press("enter")
        time.sleep(10)

    def make_compact(self):
        self.click_encoding_menu()
        pyautogui.moveTo(248, 80)
        pyautogui.click()
        time.sleep(2)

    def generate_fonts(self, woff2_output_folderpath):
        pyautogui.hotkey("ctrl", "shift", "g")
        time.sleep(2)
        pyautogui.hotkey("ctrl", "a")
        time.sleep(1)
        pyautogui.typewrite(
            woff2_output_folderpath + \
            "materialdesignicons-webfont-min.woff2"
        )
        pyautogui.press("enter")
        time.sleep(2)

    def confirm_generate(self):
        pyautogui.moveTo(1022, 609)
        pyautogui.click()
        time.sleep(3)


if __name__ == "__main__":
    robot = IconFontMinifierRobot()
    robot.removeUnusedIcons(
        css_filepath="css/materialdesignicons.css",
        woff2_output_folderpath="C:\\Users\\shedloadofcode\\Documents\\icon-fonts-project\\fonts\\"
    )
```

You’ll need to install PyAutoGUI to use this script with

```
pip install pyautogui
```

You might need to update the screen coordinates in all of the `moveTo` calls too if you’re using a different screen resolution, so that the robot clicks in the same places as in the video. PyAutoGUI is a super useful tool but I’ve found it needs adjustments when used on different devices, so consider this your chance to practice and perfect your automation skills. You can check the screen coordinates of your current mouse position with the script below which is [from the docs](https://pyautogui.readthedocs.io/en/latest/mouse.html):

```python [get_mouse_coordinates.py]
import pyautogui, sys

print('Press Ctrl-C to quit.')
try:
    while True:
        x, y = pyautogui.position()
        positionStr = 'X: ' + str(x).rjust(4) + ' Y: ' + str(y).rjust(4)
        print(positionStr, end='')
        print('\b' * len(positionStr), end='', flush=True)
except KeyboardInterrupt:
    print('\n')
```

I also used PyAutoGUI in another interesting project [Creating a screen and mouse jiggler with Python](/blog/creating-a-screen-and-mouse-jiggler-with-python/). It really is a great lightweight automation tool.
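The stylesheet-parsing step is the one part of the robot you can try without FontForge open. Here is a minimal sketch that runs the same regular expression over an inline sample stylesheet. The function name `extract_icon_names` and the sample CSS are illustrative only.

```python
import re

# Same pattern the robot uses to find icon selectors in the stylesheet
ICON_SELECTOR = re.compile(r"\.mdi-[a-z\-A-Z\-0-9]+::before")


def extract_icon_names(css_text):
    """Return the glyph names to search for in FontForge."""
    selectors = ICON_SELECTOR.findall(css_text)
    # Strip the class prefix and pseudo-element, as the robot does before typing
    return [s.replace(".mdi-", "").replace("::before", "") for s in selectors]


sample_css = """
.mdi:before, .mdi-set { display: inline-block; }
.mdi-language-python::before { content: "\\F0320"; }
.mdi-server::before { content: "\\F048B"; }
.mdi-18px.mdi-set, .mdi-18px.mdi:before { font-size: 18px; }
"""

print(extract_icon_names(sample_css))
```

Only the full `::before` icon selectors match, so the base `.mdi:before` rule and the size helpers are ignored. Checking this list before starting the robot is a cheap way to catch a missing icon.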
## Replace font file with minified version

Now we have the minified font file generated and saved to the font folder as 'materialdesignicons-webfont-min.woff2', we can update our stylesheet to use the minified version instead of the bloated version as seen at the end of the video. You can see I've removed all of the other font files leaving only the .woff2 font file, and only referenced that in the stylesheet.

```css [materialdesignicons.css]
@font-face {
  font-family: "Material Design Icons";
  src: url("../fonts/materialdesignicons-webfont-min.woff2?v=6.5.95") format("woff2");
  font-weight: normal;
  font-style: normal;
  font-display: block;
}

...
```

If we hard refresh we can see the icons still work as expected! Checking the network tab shows the CSS file using only the woff2 at 1.8KB and the font file at 2.8KB! This is a 99.22% reduction in font file size, from our starting 361KB down to 2.8KB!

## Performance improvements summary

I am very pleased with the performance improvements as a result of this project. It means that every user doesn't have to download a 361KB font file just to see icons display on the page. This has led to a better user experience, better page load times and has reduced bandwidth consumption. The stats for file size reductions from this project can be seen in the table below:

| Type  | Starting size (KB) | Final size (KB) | Reduction (%) |
|-------|--------------------|-----------------|---------------|
| CSS   | 369                | 1.8             | 99.51         |
| Woff2 | 361                | 2.8             | 99.22         |

If you have any questions about this tutorial please leave a comment in the comments section below or feel free to reach out via the contact button at the bottom of this page 👍 I hope this has given you an insight into how you can go about self-hosting and reducing icon fonts in your own projects. If you enjoyed this article, be sure to check out [other articles on the site](/).

How to do an index match with Python and Pandas

Wed, 08 Dec 2021 13:30:00 GMT

Inspired by my previous article [How to batch rename files in folders with Python](/blog/how-to-batch-rename-files-in-folders-with-python/) and the theme of quickly solving problems with Python, let's explore how to make life easier and do an index match using Pandas rather than with Excel. The code and files used are available to download via a link at the end of the article 😄

## Index Match with Excel

Let's say we have three tables, Orders, OrderDetails and Products. All of these tables are related by either OrderID or ProductID. A typical problem might be trying to add the ProductName and TotalPrice column values to OrderDetails like this...

Here we are effectively trying to merge / match the values based upon the ProductID column from the OrderDetails table and the ID column from the Products table. Using the INDEX MATCH formula in Excel has become the better option vs VLOOKUP because it doesn't break if new columns are inserted.

```
=INDEX(TargetArray, MATCH(LookupValue, LookupArray, ExactMatch=0))
```

As we can see, the ProductName and TotalPrice (ListPrice * Quantity) have been filled after dragging the formula downwards. Although I am using a formatted table (using Ctrl + T) in this example, you could also do this without formatted tables by amending the index match formula, remembering to include the $ for [fixed references](https://support.microsoft.com/en-us/office/switch-between-relative-absolute-and-mixed-references-dfec08cd-ae65-4f56-839e-5f0d8d0baca9) to the TargetArray and the LookupArray.

```
=INDEX($N$2:$N$9, MATCH(H3, $M$2:$M$9, 0))
```

## Merge with Python and Pandas

We're now going to try and do the same thing, but this time using Pandas. We're going to use [Pandas merge](https://pandas.pydata.org/docs/reference/api/pandas.merge.html). We'll need a few packages for working with Excel and of course Pandas itself.
```
pip install numpy pandas openpyxl xlrd
```

Before running this script, I placed each table into its own sheet within the 'Index Match Python Problem.xlsx' workbook.

```python [index-match-merge-solution.py]
import pandas as pd

excel_file = pd.ExcelFile("Index Match Python Problem.xlsx")
orders = pd.read_excel(excel_file, sheet_name="Orders")
order_details = pd.read_excel(excel_file, sheet_name="OrderDetails")
products = pd.read_excel(excel_file, sheet_name="Products")

# Match each OrderDetails row to its product via ProductID -> ID
df = pd.merge(
    left=order_details,
    right=products,
    left_on="ProductID",
    right_on="ID",
    how="inner"
)

df["TotalPrice"] = df["ListPrice"] * df["Quantity"]
df.to_csv("outputs/merge-output.csv", index=False)
```

We read in each sheet from the Excel workbook, merge the OrderDetails with the Products table on the ProductID and ID columns, then calculate TotalPrice and output to CSV. If the ID columns were named the same, we could have just used the `on=` argument, however `left_on=` and `right_on=` allow us to specify different column names to merge on. By also using the `how=` argument we can specify what kind of merge we want to perform. For those familiar with SQL JOINs, here we are using an inner join, which is generally the most common. For those unfamiliar, I find this [Visual JOIN](https://joins.spathon.com/) a great way to understand what's happening. You can also see a summary of each in the table below.

| Join Type | Description |
|-----------|-------------|
| inner | selects records that have matching values in both dataframes. |
| left | returns all records from the left dataframe, and the matching records from the right dataframe. Where there is no match, the right-side columns are null. |
| right | returns all records from the right dataframe, and the matching records from the left dataframe. Where there is no match, the left-side columns are null. |
| outer | returns all records from both dataframes, with nulls on whichever side has no match. |
| cross | returns the cartesian product of both dataframes (number of rows in the first dataframe multiplied by the number of rows in the second dataframe). |

Be aware cross merges can result in very large result sets. You also don't need the `on=` argument for a cross merge, since every record in one table is paired with every record in the other.

This script produces the CSV we can see below in the output folder.

All done! We have the output for OrderDetails showing the ProductName and TotalPrice. You might notice that this isn't sorted the same way as in our original Excel file. This is because by using an inner merge, we are using the intersection of keys from both dataframes (ProductID and ID). We can change this to a 'left' merge to use only the keys from the left dataframe.

```python
df = pd.merge(
    left=order_details,
    right=products,
    left_on="ProductID",
    right_on="ID",
    how="left"
)
```

If you want to try out a 'right' merge, I added a Product called 'Robot' in 'Index Match Python Problem.xlsx' that isn't included in any orders, so it wouldn't show up using a left or inner join as there is no match.

If you wanted to drop any unneeded columns, like the ID and ListPrice columns from the right dataframe, you can add a line before outputting to CSV.

```python
df.drop(columns=["ID", "ListPrice"], inplace=True)
```

## Merging multiple tables

Using the same dataset, we will now look at a more advanced example to demonstrate the power of merging. We'll write a function to retrieve order information for a given OrderID and CustomerName. This merges together all three tables: Orders, OrderDetails and Products.
```python [index-match-order-lookup.py]
import pandas as pd


def load_data():
    excel_file = pd.ExcelFile("Index Match Python Problem.xlsx")
    orders = pd.read_excel(excel_file, sheet_name="Orders")
    order_details = pd.read_excel(excel_file, sheet_name="OrderDetails")
    products = pd.read_excel(excel_file, sheet_name="Products")
    return orders, order_details, products


def get_order_information(id, customer_name):
    orders, order_details, products = load_data()

    # Look up the order that matches both the OrderID and the customer
    order = orders.loc[
        (orders["OrderID"] == id) & (orders["Customer"] == customer_name)
    ]

    order_info = pd.merge(
        left=order,
        right=order_details,
        on="OrderID",
        how="inner"
    )

    order_info = pd.merge(
        left=order_info,
        right=products,
        left_on="ProductID",
        right_on="ID"
    )

    order_info["TotalPrice"] = order_info["ListPrice"] * order_info["Quantity"]
    order_info.drop(columns=["ID", "ListPrice"], inplace=True)

    # Collect the product names for the order into a list
    product_lists = order_info.groupby(["OrderID"])["ProductName"].agg(list)

    # Sum the line totals (a list of columns is required here; selecting
    # with a bare tuple of names no longer works in pandas 2.0+)
    order_info = order_info \
        .groupby(["OrderID", "Customer"])[["TotalPrice"]].sum() \
        .reset_index()

    order_info["Products"] = product_lists.values

    print(order_info)
    order_info.to_csv(f"outputs/order-information-for-id-{id}.csv", index=False)


if __name__ == "__main__":
    get_order_information(id=4, customer_name="Mike")
```

We load the dataframes, then use [loc](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html) to find the row in Orders where OrderID and Customer match the inputs, giving us the `order` itself. We inner merge `order` with `order_details`, then merge that with `products`. We calculate TotalPrice, drop any columns not required, and aggregate the product names into a list. Finally, we group by the OrderID and Customer, sum the TotalPrice of each OrderDetail, and add the Products for the order.

Going back to verify, we can see that for OrderID 4, Mike did indeed purchase three Desk Lamps and a Mousemat for a combined total of £110! He must really like Desk Lamps!
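If you'd like to try the filter-then-merge steps above without the workbook, here is a minimal sketch using small in-memory dataframes. The rows and prices are made up for illustration (chosen so Mike's order comes to 110), and `order_summary` is a hypothetical helper name rather than anything from the downloadable code. It also uses named aggregation to build the product list and the total in one step.

```python
import pandas as pd

# Made-up stand-ins for the three workbook sheets (all values are assumptions)
orders = pd.DataFrame({"OrderID": [3, 4], "Customer": ["Sarah", "Mike"]})
order_details = pd.DataFrame({
    "OrderID": [3, 4, 4],
    "ProductID": [1, 2, 3],
    "Quantity": [1, 3, 1],
})
products = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "ProductName": ["Keyboard", "Desk Lamp", "Mousemat", "Robot"],
    "ListPrice": [40.0, 35.0, 5.0, 999.0],
})


def order_summary(order_id, customer_name):
    """Filter to one order, merge in its details and products, then summarise."""
    order = orders.loc[
        (orders["OrderID"] == order_id) & (orders["Customer"] == customer_name)
    ]
    info = order.merge(order_details, on="OrderID", how="inner")
    info = info.merge(products, left_on="ProductID", right_on="ID", how="inner")
    info["TotalPrice"] = info["ListPrice"] * info["Quantity"]

    # Named aggregation: one output column per (source column, function) pair
    return info.groupby(["OrderID", "Customer"], as_index=False).agg(
        Products=("ProductName", list),
        TotalPrice=("TotalPrice", "sum"),
    )


summary = order_summary(4, "Mike")
print(summary)
```

The full script above does the same thing against the real workbook and writes the result to CSV.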
This is a script I will keep coming back to, as it provides so many useful things you might want to do. Particularly if you don't necessarily want to merge you just want to 'lookup' or 'filter' the dataframe by one or more criteria - for this example we filtered on both OrderID and Customer name to demonstrate. The line using [loc](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html) can be applied to other datasets to achieve this. You can also filter without using [loc](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html) like in the alternative below but [this post](https://stackoverflow.com/questions/38886080/python-pandas-series-why-use-loc) explains why it might be better to use it. ```python order = orders[ (orders["OrderID"] == id) & (orders["Customer"] == customer_name) ] ``` We could also do something like this to lookup a single value like the name of the customer for the given OrderID. ```python orders, order_details, products = load_data() customer_name = orders.at[orders.loc[orders["OrderID"] == 4].index[0], "Customer"] print(customer_name) ``` An alternative to Pandas merge is to use [join](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.join.html#pandas.DataFrame.join) which is very similar. The [Pandas documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html#database-style-dataframe-or-named-series-joining-merging) gives a comparison for those who wish to learn more. ## Bonus: Stacking multiple tables As a bonus, what if we're not trying to merge multiple tables, but stack multiple tables? First of all, this is what I mean by stack. Let's say you have two or more tables that all need 'stacking' on top of one another. It might be hundreds of different CSV files that need bringing together! We can use [Pandas concat](https://pandas.pydata.org/docs/reference/api/pandas.concat.html) to handle this. 
This script targets the 'logs' folder and stacks all 12 CSV files into one file. Each CSV has 37 rows, so after combining we should expect 444 rows.

```python [stack-with-concat-solution.py]
import glob

import pandas as pd

# Note: the output below is also written to 'logs', so move or delete
# 'concatenated.csv' before re-running or it will be stacked as well
csv_files = glob.glob("logs/*.csv")
dataframes = []

for filename in csv_files:
    df = pd.read_csv(filename, index_col=None, header=0)
    dataframes.append(df)

concatenated_df = pd.concat(dataframes, axis=0, ignore_index=True)
print(concatenated_df.shape)
concatenated_df.to_csv("logs/concatenated.csv", index=False)
```

Now the contents of all the files have been saved to the 'logs' folder in the file 'concatenated.csv', which we can see in the image below.

Perfect! This is a super fast way to bring similar but dispersed datasets together and 'stack' them on top of one another. The main thing your source files need is to all have the same column names so they align whilst concatenating.

A similar option used to be [Pandas append](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.append.html#pandas-dataframe-append), however it was deprecated in pandas 1.4 and removed in pandas 2.0. Even before that, concat was the faster choice for many tables: each append call copies the data into a brand new dataframe, whereas concat combines everything in a single operation.

## What we learned

Using Pandas [merge](https://pandas.pydata.org/docs/reference/api/pandas.merge.html) brings the power of SQL database-style joins to your Excel data. It gives you many more options than an index match ever could, with greater simplicity and scalability. We can also look up rows and values by given criteria using [loc](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html) and easily 'stack' data from many files using [concat](https://pandas.pydata.org/docs/reference/api/pandas.concat.html). In my opinion, it's essential to keep each on your Data Science toolbelt as you never know when you'll need them!
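One thing worth practising from the stacking section is checking that every table has the same columns before concatenating, since mismatched names lead to misaligned columns filled with nulls. Below is a minimal sketch of that check using in-memory tables instead of CSV files. The column names `month` and `value` and the helper `stack_tables` are made up for illustration.

```python
import pandas as pd

# Twelve made-up 'log' tables with 37 rows each, standing in for the CSV files
frames = [
    pd.DataFrame({"month": [month] * 37, "value": range(37)})
    for month in range(1, 13)
]


def stack_tables(tables):
    """Stack tables on top of one another, insisting on identical columns."""
    expected = list(tables[0].columns)
    for position, table in enumerate(tables):
        if list(table.columns) != expected:
            raise ValueError(
                f"Table {position} has columns {list(table.columns)}, "
                f"expected {expected}"
            )
    return pd.concat(tables, axis=0, ignore_index=True)


stacked = stack_tables(frames)
print(stacked.shape)  # (444, 2)
```

Raising an error early is a design choice here. You could instead log the offending file and skip it, depending on how much you trust your source data.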
As always, if you have any questions leave a comment in the comments section, or use the contact button at the bottom of the page to get in touch. You can [download all of the code and files](https://github.com/shedloadofcode/index-match-with-python-and-pandas) used in this article to try things out yourself. I hope this article helped you out. If you enjoyed this article be sure to check out: * [How to build a random recipe selector with Python](/blog/how-to-build-a-random-recipe-selector-with-Python/) * [Developing your data science and analytical coding skills - a review of DataCamp](/blog/developing-your-data-science-and-analytical-coding-skills-a-review-of-datacamp/)

How to batch rename files in folders with Python

Sat, 04 Dec 2021 15:30:00 GMT

Although there are many tutorials on renaming files with Python, most don’t go into how to create flexible logic to tailor that batch file rename job to your needs. This is a situation I found myself in recently, a seemingly simple request to help rename a few hundred files in a folder. However, not all of the renaming followed a set pattern! Nor did it follow any real pattern at all, so using regex probably wasn’t going to help. This called for a custom script to help out a fellow engineer.

## Problem

The problem given was that during an automation process hundreds of files had been produced but using the wrong names. These now all needed changing. The file names on the left needed to look like the file names on the right (this is a small sample but there were hundreds of files).

As you can see it isn’t a straight up find and replace job, so we will need some logic to match a search term to a replacement. For example, if the file name includes X then replace with Y. To trim the identifier at the beginning of the file name we’ll use string slicing.

## Solution

This script makes use of the [os module](https://docs.python.org/3/library/os.html). We provide a folder path and then loop over all of the files within it, renaming with the replacements where the file name contains the search term.
```python [renamer.py]
import os


def rename_files(path):
    replacements = ["_dualforecast", "_narrative", "_pf1", "_summary", "_txn"]
    search_terms = ["CLAIM", "NARRATIVE", "PF1", "SUMMARY", "Txn"]
    count = 0

    for filename in os.listdir(path):
        file_path = os.path.join(path, filename)
        name, extension = os.path.splitext(filename)

        for i, term in enumerate(search_terms):
            if term in name:
                prefix = name[:11]  # keep the identifier at the start of the name
                postfix = replacements[i]
                new_name = os.path.join(path, prefix + postfix + extension)
                os.rename(file_path, new_name)
                count += 1
                # Stop at the first matching term: the old path no longer
                # exists, so a second rename attempt would fail
                break

    print(f"{count} files in folder {path} were renamed.")


if __name__ == "__main__":
    rename_files(r"C:\Users\shedloadofcode\Documents\TestFolder")
```

Success! All of the files were renamed according to the logic applied and are now in the format shown on the right side of the image earlier. This logic will also rename any subfolders in the directory whose names contain a search term, if you were wanting to rename folders rather than files. In this script I have also separated the file `name` from the `extension`, so if you wanted to, say, change hundreds of txt files to csv format you can do that with just one change: `new_name = os.path.join(path, prefix + postfix + ".csv")`.

If you want to give this script a test drive, download the [test folder](https://github.com/shedloadofcode/batch-rename-files-in-folders), then extract the contents and place the directory 'TestFolder' in your Documents folder ensuring it has the name 'TestFolder'. Then update the path given to the `rename_files` function with your username before running 😄

## Bonus: Recursive batch renaming

You might be thinking, but what if I have files within folders within folders? Do I have to run this in each folder one at a time? Hell no 😆 we can adapt the function to go through every subfolder and perform the rename operation in each recursively. Let's say we have folders A, B and C in the TestFolder directory.
Now let's take a look at the recursive function we'll run against the TestFolder directory path.

```python [recursive-renamer.py]
import os


def rename_files_recursively(root_path):
    replacements = ["_dualforecast", "_narrative", "_pf1", "_summary", "_txn"]
    search_terms = ["CLAIM", "NARRATIVE", "PF1", "SUMMARY", "Txn"]
    count = 0

    # os.walk visits root_path and every folder beneath it
    for path, subdirs, files in os.walk(root_path):
        for filename in files:
            file_path = os.path.join(path, filename)
            name, extension = os.path.splitext(filename)

            for i, term in enumerate(search_terms):
                if term in name:
                    prefix = name[:11]
                    postfix = replacements[i]
                    new_name = os.path.join(path, prefix + postfix + extension)
                    os.rename(file_path, new_name)
                    count += 1
                    # Stop at the first matching term: the old path no
                    # longer exists after the rename
                    break

    print(f"{count} files were renamed recursively from root {root_path}")


if __name__ == "__main__":
    rename_files_recursively(r"C:\Users\shedloadofcode\Documents\TestFolder")
```

Now we can see every file in every subfolder is renamed in one operation. This will also work to any folder tree depth.. subfolders within subfolders within subfolders.. everything. Isn't recursion wonderful? Here are folders A, B and C after the operation completes:

## Adapting to your needs

There we are, two short, adaptable and extendable functions that give us everything we need to get the job done! My colleague was certainly happy with the result, they said it worked like a dream. You can easily adapt these functions to your own needs by changing or adding to the conditional logic in the inner loop that processes each file name. Not only does this script apply to conditionally renaming files, but also conditionally deleting files. You could use `os.remove(file_path)` instead of the `os.rename(file_path, new_name)` we used.

Thanks very much for reading, this was a very short article covering how to effectively batch rename files in folders with Python.
If you have any questions feel free to leave a comment 👍 If you enjoyed this article be sure to check out other articles on the site, you may be interested in: * [How to do an index match with Python and Pandas](/blog/how-to-do-an-index-match-with-python-and-pandas/) * [How to build a random recipe selector with Python](/blog/how-to-build-a-random-recipe-selector-with-Python/) * [Developing your data science and analytical coding skills - a review of DataCamp](/blog/developing-your-data-science-and-analytical-coding-skills-a-review-of-datacamp/)

Five ways to improve your system design and software architecture skills

Thu, 11 Nov 2021 19:38:00 GMT

I always thought working in software development and data science, and building systems more generally, that writing code would take up the majority of the time. I mean you 'learn to code' right? You don't 'learn to design systems'. I thought coding creative solutions to problems would be the main task, and [data structures and algorithms](/blog/exploring-coding-interview-topics-in-python/) would be the main skills. Although these things are very important, it didn’t really turn out to be the case. I found the majority of the time would be spent in meetings explaining to stakeholders and others, how systems worked or would work. I learnt quickly that when features were suggested or requested, it wasn’t coding ability that would allow you to discuss them, it was the ability to discuss whether there was a workable design. This can be particularly hard when you’re just starting out. After these conceptual discussions of feasibility with the business stakeholders, there would be technical discussions like estimation and planning how to divide the tasks, whether they could be performed concurrently, or whether any exploration was required. Still no code had been written at this point. So the main skill being used here is one of system design. How to either create a new system or extend an existing one. It’s almost like a mini system design interview process. Knowing what’s possible from a system design point of view can really help you zone in on whether an ask is feasible, technically possible, and most importantly, if it’s even needed in the first place! The main reason to constantly improve your architecture aptitude is to always ensure you are building solid, inexpensive, maintainable, scalable and speedy technical solutions or features. Whether that be a machine learning model, a web application, an automated process or anything else, they will all benefit from these things. 
So if you’ve focused a little too much on the code and neglected your system design and software architecture skills, read on to find out five things you can do to improve them.

## Know the core concepts of system design

Techopedia describes [system design](https://www.techopedia.com/definition/29998/system-design) as:

> “the process of defining the elements of a system such as the architecture, modules and components, the different interfaces of those components and the data that goes through that system. It is meant to satisfy specific needs and requirements of a business or organization through the engineering of a coherent and well-running system.”

System design is a vast subject that includes the following topics:

* Programming paradigms - Object oriented, Functional
* Programming design patterns - [Gang of Four](https://en.wikipedia.org/wiki/Design_Patterns)
* Code organisation
* Frameworks
* Dependencies
* Design principles
* Components
* [Functional](https://en.wikipedia.org/wiki/Functional_requirement) vs. [Non-functional requirements](https://en.wikipedia.org/wiki/Non-functional_requirement)
* N-tier Layering
* Microservices
* Messaging
* Caching
* Load balancing
* Performance
* Relational and NoSQL databases
* [Database design](https://en.wikipedia.org/wiki/Database_design)
* [Data model design](https://en.wikipedia.org/wiki/Data_modeling)
* API design
* Polling and Sockets
* User interface design
* Networking and Proxies
* Scaling! Both [horizontal and vertical](https://www.section.io/blog/scaling-horizontally-vs-vertically/#:~:text=Horizontal%20scaling%20means%20scaling%20by,as%20%E2%80%9Cscaling%20up%E2%80%9D)
* Capacity and demand estimations
* Storage
* Fault tolerance
* Maintainability
* Extensibility
* Accessibility - [WCAG 2.1](https://www.w3.org/TR/WCAG21/)
* Security - [OWASP Top Ten](https://owasp.org/www-project-top-ten/#)
* Analytics and Machine Learning
* Communication
* Authentication - [OIDC](https://en.wikipedia.org/wiki/OpenID), [WsFederation](https://en.wikipedia.org/wiki/WS-Federation), [JWT](https://jwt.io/)

As we can see from this list, there is so much involved in system design! Although all of the things listed above are important, I think the key one to understand is scalability. It is really useful to understand [how to scale a system from 100 to 1,000,000 users](https://systeminterview.com/scale-from-zero-to-millions-of-users.php). The separation of concerns is a key factor in scalability, hence the adoption of [N-tier architecture](https://www.techopedia.com/definition/17185/n-tier-architecture#:~:text=N%2Dtier%20architecture%20is%20a,both%20logically%20and%20physically%20separated.&text=N%2Dtier%20architecture%20is%20also%20known%20as%20multi%2Dtier%20architecture.), [MVC (Model-View-Controller)](https://en.wikipedia.org/wiki/Model%E2%80%93view%E2%80%93controller) or [MVVM (Model-View-ViewModel)](https://en.wikipedia.org/wiki/Model%E2%80%93view%E2%80%93viewmodel) and [Microservices](https://en.wikipedia.org/wiki/Microservices) or [service oriented architecture](https://en.wikipedia.org/wiki/Service-oriented_architecture) patterns.

An analytics system I have been in the process of designing recently had the same pattern as the [scaling to millions of users diagram](https://systeminterview.com/imgs/top10/millions_of_users.png). I think it is a great starting point for most designs.
I am a big fan of the [component template](https://systeminterview.com/drawing.php) from systeminterview.com and utilised it in this design diagram. Just knowing the core components and concepts that go into a robust solid system, then allows you to start putting components together and building your own designs. It also means you have a general awareness of the kinds of things to start considering learning more about. Whether it be networking, databases or object oriented design, you'll no doubt find some weaker area that you can go away and read up on. ## Learn from the designs of existing systems You can learn a lot from systems that have already been built and most are documented online. I created this Trello board which served as my checklist and notes on designing various systems from Netflix and YouTube to Facebook and Amazon. This not only meant I would be more prepared for any system design interviews, but for building real systems within those industries such as e-commerce, video streaming, content management, social media, storage, chat and messaging applications. You can find walkthroughs on how to design some of these systems on [The System Design Primer](https://github.com/donnemartin/system-design-primer#system-design-interview-questions-with-solutions) Github page. There is also a page exploring [real world architectures](https://github.com/donnemartin/system-design-primer#real-world-architectures). Another key resource to explore are [company technical blogs](https://github.com/donnemartin/system-design-primer#company-engineering-blogs), where you can get an inside view on design decisions taken by engineering teams. 
These are some of my favourite engineering and data blogs at the moment: * [Discord Engineering Blog](https://blog.discord.com/engineering-posts/home) * [Twitter Engineering Blog](https://blog.twitter.com/engineering/en_us) * [Instagram Engineering Blog](https://instagram-engineering.com/) * [LinkedIn Engineering Blog](https://engineering.linkedin.com/blog) * [Data in Government Blog](https://dataingovernment.blog.gov.uk/) * [Dropbox Infrastructure Blog](https://dropbox.tech/infrastructure) * [Stripe Engineering Blog](https://stripe.com/blog/engineering) * [Government Digital Service Blog](https://gds.blog.gov.uk/) * [Heroku Engineering Blog](https://blog.heroku.com/engineering) * [Netflix Tech Blog](https://netflixtechblog.com/?gi=14887958ebcb) * [Spotify Engineering Blog](https://engineering.atspotify.com/) * [Airbnb Tech Blog](https://medium.com/airbnb-engineering) * [Uber Engineering Blog](https://eng.uber.com/) * [Google Developers Blog](https://developers.googleblog.com/) ## Follow a framework for practical system design A quote by Richard Pattis I added to my [favourite quotes article](/blog/programming-quotes-that-offer-wisdom-and-motivation/) says 'If you cannot grok the overall structure of a program while taking a shower, you are not ready to code it'. This means the design is not clear enough, and for me the essence of good software design is reducing and managing complexity. That is, the ability to easily understand how a system works, and therefore easily modify it. This is the main theme of the book [A Philosophy of Software Design](https://www.amazon.co.uk/Philosophy-Software-Design-2nd/dp/173210221X/) which I found really insightful. To create a clear design, a framework can help to structure it and ensure nothing is missed out. 
I like the [PEDALS method](https://www.lewis-lin.com/blog/pedals-method) from [The System Design Interview](https://www.amazon.co.uk/System-Design-Interview-2nd/dp/B09559NJKL/ref=sr_1_4?keywords=the+system+design+interview&qid=1636480313&s=books&sr=1-4) to guide the process of architecting a system. This stands for:

* Process requirements
* Estimate
* Design the system
* Articulate the data model
* List the architectural components
* Scale

This provides a nice, easy to remember process to kick off designing a system. I know this is geared to system design interviews but really the process should also be very useful on the job. I’ve always thought of a system design interview as a conversation between two or more engineers that need to plan out a solution, and this process facilitates that conversation very well.

Once you've mastered using the PEDALS framework, you might want to explore more 'enterprise-level' architecture frameworks. These might include [The Open Group Architecture Framework (TOGAF)](https://en.wikipedia.org/wiki/The_Open_Group_Architecture_Framework) and [The Zachman Framework](https://en.wikipedia.org/wiki/Zachman_Framework).

## Explore cloud computing providers and services

With most solutions now deployed using cloud infrastructure it helps to know the range of cloud providers and their offerings. The big players are [Microsoft Azure](https://azure.microsoft.com/en-gb/), [Amazon Web Services](https://aws.amazon.com/) (AWS) and [Google Cloud Platform](https://cloud.google.com/) (GCP). Other providers include Heroku, Linode, IBM Cloud, Digital Ocean and more. Each provider offers a whole range of services. [Microsoft Azure](https://azure.microsoft.com/en-gb/overview/what-is-azure/), for example, provides over [200 products and cloud services](https://azure.microsoft.com/en-gb/services/). These include Machine Learning, Virtual Machines, Chatbots, Web App Hosting, Storage, Databases, Serverless Functions and many other services.
The [benefits of cloud computing](https://www.salesforce.com/products/platform/best-practices/benefits-of-cloud-computing/) are numerous and it seems most big companies and government organisations are moving towards the cloud, so having a good understanding of the providers and services and how they fit into a robust cloud based architecture is vital. Most of the cloud services providers offer free introductory trials - usually for 12 months. This allows you to try some of their services, and build your own cloud based solutions for practice. You usually need a credit card to register, but as long as you only use the free services you shouldn't be charged. Always check your costs section though, as if you use any service not part of the free trial, it will be added to your bill. This is good practice for making sure you're provisioning cost efficient services, and keeping an eye on their cost as demand and usage increases! That skill alone is vital for a company to manage costs and prevent them from spiralling out of control. Most cloud services providers have tools to calculate product usage costs, here is the [Azure Pricing Calculator](https://azure.microsoft.com/en-gb/pricing/calculator/) as an example. To learn more about cloud solution architecture (but geared towards AWS) a good book is [Solution Architect’s Handbook](https://www.amazon.co.uk/Solutions-Architects-Handbook-Kick-start-architecture/dp/1838645640/ref=sr_1_1). ## Study, practice then prototype System design is a huge topic and can feel overwhelming. I think the more you implement different aspects of systems, you learn what’s possible and it becomes easier. Therefore the best way to continually improve your system design expertise is constant learning and experimenting with new ideas. I like using [diagrams.net](https://diagrams.net) for planning out a design - a free open source tool. 
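Estimation (the 'E' in PEDALS) is one part of the process that is easy to practise in code. The sketch below is a rough back-of-the-envelope calculation only. The function name `estimate_capacity`, the input figures and the peak factor of 2 are all made-up assumptions for illustration, not figures from any of the books mentioned.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def estimate_capacity(
    daily_active_users,
    requests_per_user,
    writes_per_user,
    bytes_per_write,
    peak_factor=2,
    retention_days=365,
):
    """Rough traffic and storage estimate from a handful of assumptions."""
    average_qps = daily_active_users * requests_per_user / SECONDS_PER_DAY
    peak_qps = average_qps * peak_factor
    storage_bytes = (
        daily_active_users * writes_per_user * bytes_per_write * retention_days
    )
    return {
        "average_qps": average_qps,
        "peak_qps": peak_qps,
        "storage_gb": storage_bytes / 1e9,
    }


# Assumed scenario: 1M daily users, 10 requests and 2 writes of ~1KB each per day
estimate = estimate_capacity(
    daily_active_users=1_000_000,
    requests_per_user=10,
    writes_per_user=2,
    bytes_per_write=1_000,
)
print(estimate)
```

With these inputs the average load works out at roughly 116 queries per second and a year of writes at 730GB, which is the kind of ballpark figure that helps decide whether a single server and database are enough.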
You can practice estimating system capacity using our [System Capacity Calculator](/tools/system-capacity-calculator). If you decide to read [The System Design Interview](https://www.amazon.co.uk/System-Design-Interview-2nd/dp/B09559NJKL/ref=sr_1_4?keywords=the+system+design+interview&qid=1636569963&sr=8-4) book, you can run the scenario metrics in chapter 4 (Estimates) on page 27 through the calculator. This calculator was built to help with both system design interview scenarios and building real world scalable systems.

After planning the design, go ahead and try to build a small working prototype of the system in your selected tech stack. This will teach you a lot about how a more complex version of the system might work. This is an essential step and reminds me of [one of my favourite quotes](/blog/programming-quotes-that-offer-wisdom-and-motivation/) from John Gall "A complex system that works is invariably found to have evolved from a simple system that worked. A complex system designed from scratch never works and cannot be patched up to make it work. You have to start over, beginning with a working simple system".

## Conclusion

The key takeaway is to never stop learning, practising and improving your system design skills. Using the five things discussed in this article, you'll be able to improve no matter your current experience level. We've never had access to more opportunities to learn and improve skills. This morning at 11am, I observed a two minute silence for [Remembrance Day](https://en.wikipedia.org/wiki/Remembrance_Day) which is a reminder of how lucky we are to have the freedom and tools to learn.

I should also explain why I chose the cover image I did for this article. It was in reference to [Margaret Hamilton](https://en.wikipedia.org/wiki/Margaret_Hamilton_(software_engineer)) who led the team which developed the onboard flight software for the [Apollo space program](https://en.wikipedia.org/wiki/Apollo_program).
Here is an [interesting interview about her journey](https://www.youtube.com/watch?v=4sKY6_nBLG0). It was an incredible feat of software development and engineering, and a very inspirational story of how important well designed, well built systems are when others' lives are on the line.

Finally, here are some recommended resources for further learning:

* [System Design Playlist by Gaurav Sen](https://youtube.com/playlist?list=PLMCXHnjXnTnvo6alSjVkgxV-VH6EPyvoX)
* [System Design Interview](https://www.amazon.co.uk/System-Design-Interview-insiders-Second/dp/B08CMF2CQF/)
* [The System Design Interview](https://www.amazon.co.uk/System-Design-Interview-2nd/dp/B09559NJKL/)
* [Solution Architect's Handbook](https://www.amazon.co.uk/Solutions-Architects-Handbook-Kick-start-architecture/dp/1838645640/)
* [A Philosophy of Software Design](https://www.amazon.co.uk/Philosophy-Software-Design-2nd/dp/173210221X/)
* [Web Scalability for Startup Engineers](https://www.amazon.co.uk/Scalability-Startup-Engineers-Artur-Ejsmont/dp/0071843655/)
* [Release It!: Design and Deploy Production-Ready Software](https://www.amazon.co.uk/Release-Design-Deploy-Production-Ready-Software/dp/1680502395/)
* [Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems](https://www.amazon.co.uk/Designing-Data-Intensive-Applications-Reliable-Maintainable/dp/1449373321)
* [Software Engineering at Google: Lessons Learned from Programming Over Time](https://www.amazon.co.uk/Software-Engineering-Google-Lessons-Programming/dp/B08VKJXVHK)
* [The Imagineering Process: Using the Disney Theme Park Design Process to Bring Your Creative Ideas to Life](https://www.amazon.co.uk/Imagineering-Pyramid-Principles-Develop-Creative/dp/194150096X)

Exploring coding interview topics in Python

Thu, 02 Sep 2021 16:22:00 GMT

There are fundamental topics on algorithms and data structures that need to be understood for coding interviews. I am certainly no expert on coding interviews themselves, but I did embark on working through [Elements of Programming Interviews in Python](https://www.amazon.co.uk/Elements-Programming-Interviews-Python-Insiders/dp/1537713949/ref=pd_bxgy_img_2/262-9365292-3109168?pd_rd_w=Y6OlR&pf_rd_p=c7ea61ca-7168-47e3-9c8b-d84748f5b23c&pf_rd_r=D0WECF6DRCT5DPW9E23H&pd_rd_r=1f09cc37-a87b-404f-8ed5-79f25c54beb0&pd_rd_wg=zODeE&pd_rd_i=1537713949&psc=1). I used this book in combination with its companion [EPI-Judge](https://github.com/adnanaziz/EPIJudge) and also [LeetCode](https://leetcode.com/) problems. I did this to become better at programming in general and to brush up on algorithms and data structures.

I’m not sure if you agree, but I feel (and have heard others say) that most of the time, the kinds of problems you find in programming interview questions are not the same as what actually occurs on the job. Probably more of the job is focused on [system design and architecture](/blog/five-ways-to-improve-your-system-design-and-software-architecture-skills/) instead. Nevertheless, they have their merits and I admit they made me think more algorithmically and improved the efficiency of my code.

With all this in mind, in this article I’ve collated what I think are good examples, mostly from LeetCode, that help with learning and applying the concepts in the real world. They cover the major coding interview topics. It should be a good overview for those new to these topics, and a good reminder for those wishing to recap knowledge that might not have been used for a while. This is a long article you can use as a reference again and again - you can use the contents panel above to find your way back to the relevant section more easily.
## Big-O Notation

Before diving into the topics and examples, it's important to understand [Big-O Notation](https://en.wikipedia.org/wiki/Big_O_notation#:~:text=Big%20O%20notation%20is%20a,a%20particular%20value%20or%20infinity.&text=In%20computer%20science%2C%20big%20O,as%20the%20input%20size%20grows.) first. This allows us to assess the efficiency of an algorithm. For time complexity we ask how fast the algorithm executes its operations as the input size scales and becomes very large. For space complexity we ask how much memory the algorithm will consume as the input size scales and becomes very large. Space complexity consists of auxiliary space (space for extra variables and data structures we declare), input space (space for the given input) and stack space (for recursion).

As the input size can vary, it is referred to as *n*. Below are the common notations you will see, with complexities ordered from smallest to largest, along with examples. These notations can be applied to both time and space complexity.

**Constant time: O(1)**

Where an algorithm does not depend on input size *n*, it runs in constant time. In this example, the loop will always run 100 times.

```python
count = 0
for i in range(100):
    count += 1
print(count)
```

Other examples include accessing an array by index, adding or removing an element at the end of an array, looking up a value in a dictionary (hashmap) and arithmetic operations.

**Logarithmic time: O(log(n))**

Where an algorithm's run time grows in proportion to the logarithm of the input size *n*. This means the run time grows very slowly as the input grows, so the algorithm still runs rapidly on large inputs. Using Binary Search to find an element in a sorted list is a good example. The algorithm uses a "divide and conquer" approach: it jumps to the middle of the list, divides the list into two and repeats until the element is found. So the algorithm reduces the size of the input at each step and therefore doesn't need to check every value.
```python
n = 1000000
my_list = list(range(n))  # generates a list of numbers 0 through to n - 1

def binary_search(array, target_value):
    list_length = len(array)
    left = 0
    right = list_length - 1

    while left <= right:
        middle = (left + right) // 2  # // performs integer division rather than floating-point division
        if target_value < array[middle]:
            right = middle - 1
        elif target_value > array[middle]:
            left = middle + 1
        else:
            return middle

    return "Search completed but value not found in array"

search_result = binary_search(my_list, 300000)
print(search_result)
```

This example searches a sorted array of 1000000 (*n*) integers, 0 through to 999999. The algorithm sets a `left` and `right` index, then while the left index is lower than or equal to the right, checks whether the target value is less than the middle or greater than the middle, adjusting the left or right indexes accordingly to "split" the array. If the target value isn't less than or greater than the middle, we've found it and can just return `middle` 😄 This algorithm runs extremely fast even if the list input size *n* grows larger.

**Linear time: O(n)**

Where an algorithm's run time grows in direct proportion to the input size *n*, it runs in linear time.

```python
n = 10000000
count = 0
for i in range(n):
    count += 1
print(count)
```

**Linearithmic time: O(n log(n))**

Where an algorithm combines linear and logarithmic work. Typically the input is halved repeatedly, which gives O(log(n)) levels (the same "divide and conquer" approach we saw in Binary Search earlier), and each level does O(n) work such as merging. Therefore it's O(n*log(n)). Examples include Merge Sort, Heap Sort and Quick Sort (on average). Let's implement Merge Sort to sort an array.
```python
my_list = [54, 567, 26, 93, 17, 77, 31, 44, 55, 20, 44, 55, 14, 52]

def merge_sort(array: list):
    if len(array) > 1:
        # split the array into two
        print("Splitting", array)
        middle = len(array) // 2
        left = array[:middle]
        right = array[middle:]

        # recursive calls
        print("Recursing")
        merge_sort(left)
        merge_sort(right)

        # merge
        i = 0  # index to traverse the left array
        j = 0  # index to traverse the right array
        k = 0  # index for the main array

        # compare the left array and right array and
        # overwrite the main array with the lowest value
        print("Merging ", array)
        while i < len(left) and j < len(right):
            if left[i] < right[j]:
                array[k] = left[i]
                i += 1
            else:
                array[k] = right[j]
                j += 1
            k += 1

        # transfer all remaining values in the left array
        while i < len(left):
            array[k] = left[i]
            i += 1
            k += 1

        # transfer all remaining values in the right array
        while j < len(right):
            array[k] = right[j]
            j += 1
            k += 1

merge_sort(my_list)
print(my_list)  # [14, 17, 20, 26, 31, 44, 44, 52, 54, 55, 55, 77, 93, 567]
```

An initial array is divided into two roughly equal parts. If the array has an odd number of elements, one of those "halves" is one element larger than the other. The subarrays are divided over and over again into halves until you end up with arrays that have only one element each. Then you combine the pairs of one-element arrays into two-element arrays, sorting them in the process. Then these sorted pairs are merged into four-element arrays, and so on until you end up with the initial array sorted. This [CS50 video](https://www.youtube.com/watch?v=Ns7tGNbtvV4) explains the process in more detail. You can visualise Merge Sort through its [algorithm diagram](https://commons.wikimedia.org/wiki/File:Merge_sort_algorithm_diagram.svg). If you want some real fun you can check out what Merge Sort looks like in real time [in this video](https://youtu.be/kPRA0W1kECg?t=67).

**Quadratic time: O(n²)**

Where an algorithm has two nested loops / iterations over the input, it runs in quadratic time.
```python
n = 100
array_x = [42] * n  # list which is the length of "n" with all the same elements 42

def print_all_array_pairs(array_x):
    count = 0
    for i in range(len(array_x)):
        for j in range(len(array_x)):
            print(array_x[i], array_x[j])
            count += 1
    print(count)

print_all_array_pairs(array_x)
```

The final run count is 10000 for this example where *n* is 100, as 100^2 = 10000, so it's O(n²). However a nested loop where the input sizes are different would be O(x*y).

```python
x, y = 50, 100
array_x = [42] * x  # list which is the length of "x" with all the same elements 42
array_y = [22] * y  # list which is the length of "y" with all the same elements 22

def print_all_array_pairs(array_x, array_y):
    count = 0
    for i in range(len(array_x)):
        for j in range(len(array_y)):
            print(array_x[i], array_y[j])
            count += 1
    print(count)

print_all_array_pairs(array_x, array_y)
```

This is because the inner loop runs y times for each of the x iterations of the outer loop. In this example the outer loop runs for the length of `array_x` which is 50 and the inner loop runs for the length of `array_y` which is 100. So the final `count` is 5000 which is 50 x 100, therefore O(x*y).

**Cubic time: O(n³)**

Where an algorithm has three nested loops or iterations over the input, it runs in cubic time.
Here is an example I made to find the sum of all the numbers with three loops:

```python
my_list = [44, 55, 63, 123, 54, 43, 34, 54]  # "n" is the length of the list which is 8

def sum_all_numbers(array: list):
    sum_of_numbers = 0
    run_count = 0
    # iterate over the "array" argument rather than the global list
    for i in array:
        sum_of_numbers += i
        for j in array:
            sum_of_numbers += j
            for k in array:
                sum_of_numbers += k
                run_count += 1
                print(i, j, k)
    return sum_of_numbers, run_count

result, run_count = sum_all_numbers(my_list)
print(f"Sum with three nested iterations: {result}")  # Summing this list across three nested loops is 34310
print(f"Run count: {run_count}")  # Run count where "n" is 8 is 8^3 so 512
```

The print statement helps to visualise what's happening as `i`, `j` and `k` iterate over the list. As the list in this example has a length of 8 then n = 8. We're iterating over the list with three loops so the time complexity is O(n³), therefore the total executions stored in `run_count` is 8^3 or 8x8x8 which is 512.

**Exponential time: O(2^n)**

Where an algorithm's run time doubles with each addition to the input. Iterating through subsets comes to mind here. A good example is the use of a recursive algorithm to calculate Fibonacci numbers. The Fibonacci sequence is where each number is the sum of the two preceding numbers starting from 0 and 1. So 0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, ... So if we set *n* = 9, the 9th number in the sequence starting from 1 is 34, and this runs very fast. But what if we want the 40th number in the sequence?
```python
import time

def find_nth_number_in_fibonacci_sequence(n):
    if n <= 1:
        return n
    return find_nth_number_in_fibonacci_sequence(n - 2) + \
        find_nth_number_in_fibonacci_sequence(n - 1)

n = 40
start = time.time()
print(find_nth_number_in_fibonacci_sequence(n))  # 40th number in the fibonacci sequence is 102334155
end = time.time()
print(f"Time taken: {end - start} seconds")  # This took 59.58 seconds for me
```

You can see the greater the number in the sequence you're looking for, the more recursive calls are required. You can better visualise what's happening in a [recursion diagram](https://www.google.com/search?q=fibonacci+recursive+diagram&tbm=isch&ved=2ahUKEwiZ09jiuOrxAhUHmhoKHRQxAbsQ2-cCegQIABAA&oq=fibonacci+recursive+diagram&gs_lcp=CgNpbWcQAzoECAAQQzoCCAA6BAgAEB46BggAEAgQHjoECAAQGFDIN1jbP2D3QGgAcAB4AIABVIgB9QOSAQE4mAEAoAEBqgELZ3dzLXdpei1pbWfAAQE&sclient=img&ei=VvryYJnQDYe0apTihNgL&bih=1007&biw=1920). In the Dynamic Programming / Memoization section later in the article, we'll look at how to dramatically improve this run-time.

**Factorial time: O(n!)**

Where an algorithm's run time increases factorially with the increase in input size. A quick refresher on what a factorial is:

> the product of all positive integers less than or equal to a given positive integer and denoted by that integer and an exclamation point. Thus, factorial seven is written 7!, meaning 1 × 2 × 3 × 4 × 5 × 6 × 7. - [Britannica](https://www.britannica.com/science/factorial)

Factorial time shows up when an algorithm has to work through every possible ordering of its input, because *n* items can be arranged in n! ways. Generating all permutations (see the Recursion section later) is the classic example. Note that computing the factorial of *n* itself is much cheaper: the recursive function below makes one call for each integer from *n* down to 1, so it runs in O(n) time. It is shown here to illustrate how quickly n! grows, not as an O(n!) algorithm.

```python
def find_factorial(n):
    if n <= 1:  # base case, also covers 0! = 1
        return 1
    else:
        return n * find_factorial(n - 1)
```

Now that we've covered Big-O Notation and the common time complexities, we can dive into the topics 😆

## Arrays

**Definition:** An array (list in Python) is a data structure that holds a group of elements, usually of the same data type (but not always) - like this `[1, 2, 3, 4, 5]`

**Example Problem:** Chunk an array into groups of a given `size`.

**Example Input:** [1, 2, 3, 4, 5, 6, 7], size=3

**Example Output:** [[1, 2, 3], [4, 5, 6], [7]]

```python
import math

def chunk(collection: list, size: int) -> list:
    result = []
    count = 0
    for i in range(math.ceil(len(collection) / size)):
        start = i * size
        end = start + size
        result.append(collection[start:end])
        count += 1
    print(count)  # number of loop iterations
    return result

if __name__ == "__main__":
    print(chunk([1, 2, 3, 4, 5, 6, 7], size=2))
    print(chunk([1, 2, 3, 4, 5, 6, 7], size=3))
    print(chunk([1, 2, 3, 4, 5, 6, 7], size=4))
```

**Explanation:** On each iteration a new start and end index is defined, and the array is sliced then added to the new result array. The run time of this algorithm is O(n): the loop runs *n*/`size` times (where *n* is the length of `collection`) and each iteration performs a slice of `size` elements using `start` and `end`. So the time complexity is *n*/size * size which is *n*.

**Practical use:** A practical use of an algorithm like this I have seen is creating a navigation tile layout on a webpage. Of course there are libraries that fulfil this need too, but why not implement it yourself to cut down your list of dependencies 😄 This is the first example I could find to illustrate. This is how we might make use of the algorithm to achieve it.
```python
chunks = chunk(["Wellbeing", "Healthy weight", "Exercise", "Sleep", "Eat well", "Alcohol support"], size=3)

# "row" avoids shadowing the chunk() function defined above
for row in chunks:
    print(", ".join(row))  # one comma separated row of tiles per line
```

## Strings

**Definition:** Strings can be thought of as an array, but made up of characters

**Example Problem:** [Reverse String (Leetcode 344)](https://leetcode.com/problems/reverse-string/)

**Example Input:** hello

**Example Output:** olleh

```python
from typing import List

class Solution:
    def reverseString(self, s: List[str]) -> None:
        """
        Do not return anything, modify s in-place instead.
        """
        left = 0
        right = len(s) - 1

        while left < right:
            temp = s[left]
            s[left] = s[right]
            s[right] = temp
            left += 1
            right -= 1
```

**Explanation:** Using a two pointer approach we start from the left and right, swapping each character with the help of a temporary variable, eventually meeting in the middle.

## Linked Lists

**Definition:** A linked list is a linear collection of data elements similar to an array, but the order is not given by their physical placement in memory. Instead, each element points to the next. A linked list can be singly or doubly linked.

**Example Problem:** [Merge Two Sorted Lists (Leetcode 21)](https://leetcode.com/problems/merge-two-sorted-lists/)

**Example Input:** l1 = [1,2,4], l2 = [1,3,4]

**Example Output:** [1,1,2,3,4,4]

```python
# Definition for singly-linked list.
# class ListNode:
#     def __init__(self, val=0, next=None):
#         self.val = val
#         self.next = next
class Solution:
    def mergeTwoLists(self, l1: ListNode, l2: ListNode) -> ListNode:
        dummy = ListNode(0)
        head = dummy  # keeps a reference to the start while "dummy" moves along

        while l1 and l2:
            if l1.val < l2.val:
                dummy.next = l1
                l1 = l1.next
            else:
                dummy.next = l2
                l2 = l2.next
            dummy = dummy.next

        # attach whatever is left over from either list
        if l1 is not None:
            dummy.next = l1
        else:
            dummy.next = l2

        return head.next
```

**Explanation:** We are asked to return a *sorted* list by merging two sorted linked lists.
We create a `dummy` head node, then while both `l1` and `l2` are not None, we assign the lower of the two as the `dummy.next` node and move them along. This builds up our new linked list in sorted order. When we break out of the while loop, we check which one has leftover nodes (still not None) and attach it to the end of the chain. Finally, we return `head.next` to skip the first dummy node we created 😄

## Stacks

**Definition:** A stack holds an ordered, linear sequence of items. In contrast to a queue, a stack is a last in, first out (LIFO) data structure. It is also used to implement depth first search.

**Example Problem:** [Min Stack (Leetcode 155)](https://leetcode.com/problems/min-stack/)

**Example Input:** ["MinStack","push","push","push","getMin","pop","top","getMin"] with arguments [[],[-2],[0],[-3],[],[],[],[]]

**Example Output:** [null,null,null,null,-3,null,0,-2]

```python
class MinStack:

    def __init__(self):
        """
        initialize your data structure here.
        """
        self.stack = []
        self.min_stack = []

    def push(self, val: int) -> None:
        self.stack.append(val)
        # store the minimum seen so far alongside every pushed value
        val = min(val, self.min_stack[-1]) if len(self.min_stack) > 0 else val
        self.min_stack.append(val)

    def pop(self) -> None:
        self.stack.pop()
        self.min_stack.pop()

    def top(self) -> int:
        return self.stack[-1]

    def getMin(self) -> int:
        return self.min_stack[-1]


# Your MinStack object will be instantiated and called as such:
# obj = MinStack()
# obj.push(val)
# obj.pop()
# param_3 = obj.top()
# param_4 = obj.getMin()
```

**Explanation:** To keep track of the minimum element and lower the expense of retrieving it, we implement a two stack approach. The main stack holds the entries and the min stack holds the minimum value at each depth. On every `push` we append either the new value or the current minimum (the top of the min stack), whichever is lower, so `getMin` only ever needs to look at the top of the min stack.

## Queues

**Definition:** A queue holds an ordered, linear sequence of items. In contrast to a stack, a queue is a first in, first out (FIFO) data structure. It is used to implement breadth first search.
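To make the FIFO versus LIFO difference concrete, here is a small sketch using `collections.deque` as a queue and a plain list as a stack. The item names are made up purely for illustration.

```python
from collections import deque

tasks = ["first", "second", "third"]  # hypothetical items, in arrival order

# Queue: items leave in the order they arrived (FIFO)
queue = deque(tasks)
queue_order = [queue.popleft() for _ in range(len(tasks))]

# Stack: the most recently added item leaves first (LIFO)
stack = list(tasks)
stack_order = [stack.pop() for _ in range(len(tasks))]

print(queue_order)  # ['first', 'second', 'third']
print(stack_order)  # ['third', 'second', 'first']
```

A `deque` is used for the queue because `popleft()` runs in constant time, whereas `list.pop(0)` has to shift every remaining element along.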
**Example Problem:** [Binary Tree Level Order Traversal (Leetcode 102)](https://leetcode.com/problems/binary-tree-level-order-traversal/)

**Example Input:** root = [3,9,20,null,null,15,7]

**Example Output:** [[3],[9,20],[15,7]]

```python
# Definition for a binary tree node.
# class TreeNode:
#     def __init__(self, val=0, left=None, right=None):
#         self.val = val
#         self.left = left
#         self.right = right
import collections
from typing import Deque, List

class Solution:
    def levelOrder(self, root: TreeNode) -> List[List[int]]:
        result: List[List[int]] = []
        if root is None:
            return result

        # Initialise queue and add first node
        queue: Deque[TreeNode] = collections.deque()
        queue.append(root)

        # Loop until the queue is empty
        while queue:
            current_level = []

            # len(queue) is evaluated once, so this handles exactly one level
            for i in range(len(queue)):
                current_node: TreeNode = queue.popleft()
                current_level.append(current_node.val)
                if current_node.left:
                    queue.append(current_node.left)
                if current_node.right:
                    queue.append(current_node.right)

            result.append(current_level)

        return result
```

**Explanation:** We create a queue frontier to implement breadth first search and append the root node. Then at each iteration we empty the queue of the current level's nodes, appending each value to `current_level` and adding each node's left and right children to the queue for the next level. To understand the difference between using a queue or stack as a frontier [watch this video](https://youtu.be/D5aJNFWsWew?t=1561) from CS50 AI.

## Heaps

**Definition:** A heap is a data structure like a tree with the interesting property that any node has a lower value than any of its children (min-heap) or any node has a higher value than any of its children (max-heap).
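As a quick illustration of the min-heap property, this sketch runs Python's built-in `heapq` module over a made-up list and checks that no parent is larger than its children. In the list representation used by `heapq`, the children of index `i` sit at `2*i + 1` and `2*i + 2`. The `satisfies_min_heap` helper is a hypothetical name used only for this example.

```python
import heapq

values = [9, 4, 7, 1, 8, 2]  # hypothetical sample data
heap = list(values)
heapq.heapify(heap)  # rearranges the list in place into a min-heap

def satisfies_min_heap(items: list) -> bool:
    """Check that every parent is <= both of its children."""
    for i in range(len(items)):
        for child in (2 * i + 1, 2 * i + 2):
            if child < len(items) and items[i] > items[child]:
                return False
    return True

is_heap = satisfies_min_heap(heap)
smallest = heap[0]  # the smallest value always sits at the root
drained = [heapq.heappop(heap) for _ in range(len(values))]

print(is_heap)   # True
print(smallest)  # 1
print(drained)   # [1, 2, 4, 7, 8, 9]
```

Popping repeatedly returns the values in ascending order, which is why a heap is a natural fit when you keep needing the smallest (or, with negated values, the largest) item.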
**Example Problem:** [Last Stone Weight (Leetcode 1046)](https://leetcode.com/problems/last-stone-weight/)

**Example Input:** [2,7,4,1,8,1]

**Example Output:** 1

```python
import heapq
from typing import List

class Solution:
    """
    See https://docs.python.org/3/library/heapq.html for heapq docs
    """
    def lastStoneWeight(self, stones: List[int]) -> int:
        heap = [-abs(x) for x in stones]  # negative value as heapq is min heap by default
        heapq.heapify(heap)

        while len(heap) > 1:
            stone_one = abs(heapq.heappop(heap))
            stone_two = abs(heapq.heappop(heap))
            if stone_one != stone_two:
                heapq.heappush(heap, -abs(stone_one - stone_two))

        heap_is_empty = len(heap) == 0
        return 0 if heap_is_empty else abs(heapq.heappop(heap))
```

**Explanation:** Our brief is that on each turn we smash the two heaviest stones together, with weights x <= y. If x == y, both stones are destroyed, and if x != y, the stone of weight x is destroyed, and the stone of weight y has new weight y - x. At the end of the game, there is at most one stone left. We create a `heap` list with the negative value of the stones (because heapq is a min-heap by default we need to turn that into a max-heap). Then `heapq.heapify(heap)` transforms the list in-place. We then pop the two heaviest stones from the max-heap and if they are not equal add back their difference, not forgetting to make the value negative. If they are the same we do nothing (both stones were destroyed). We then just need to check if any stones are left with `heap_is_empty`: if the heap is empty we return 0, otherwise we return the last stone's weight 😄

## HashMaps or Dictionaries

**Definition:** A hashmap or hashtable (dictionary in Python) is a data structure that implements an associative array abstract data type - a structure that can map keys to values, like this `{ "name": "John", "age": "44" }`

**Example Problem:** [Valid Anagram (Leetcode 242)](https://leetcode.com/problems/valid-anagram/)

**Example Input:** s = "anagram", t = "nagaram"

**Example Output:** true

There are a few valid solutions for this - we'll start with the fundamental example of using a hashmap, then simplify.
```python
class Solution:
    def isAnagram(self, s: str, t: str) -> bool:
        if len(s) != len(t):
            return False

        counter = {}

        for letter in s:
            if letter in counter:
                counter[letter] += 1
            else:
                counter[letter] = 1

        for letter in t:
            if letter not in counter:
                return False
            if counter[letter] < 1:
                return False
            counter[letter] -= 1

        return True
```

```python
from collections import Counter

class Solution:
    def isAnagram(self, s: str, t: str) -> bool:
        return Counter(s) == Counter(t)
```

```python
class Solution:
    def isAnagram(self, s: str, t: str) -> bool:
        return sorted(s) == sorted(t)
```

**Explanation:** Wikipedia tells us 'An anagram is a word or phrase formed by rearranging the letters of a different word or phrase, typically using all the original letters exactly once. For example, the word anagram itself can be rearranged into nagaram, also the word binary into brainy and the word adobe into abode'. We need to test if `t` is an anagram of `s`. In the first approach, we implement a counter ourselves, counting each character in `s` and storing the count in a dictionary. We then go over each letter in `t`, decrementing the count. If the letter isn't in the dictionary, or its count has already reached zero, we know it's not a valid anagram. Approach two simplifies this to use Python's built in Counter to compare both strings. As the order of the letters doesn't matter, we could also sort the strings then compare them as in the third approach. The commonly used hash-based data structures in Python are set, dict, collections.defaultdict and collections.Counter.

## Searching

**Definition:** A search algorithm is used to find specific data within a data structure.
**Example Problem:** [Binary Search (Leetcode 704)](https://leetcode.com/problems/binary-search/)

**Example Input:** nums = [-1,0,3,5,9,12], target = 9

**Example Output:** 4

```python
from typing import List

class Solution:
    def search(self, nums: List[int], target: int) -> int:
        left, right = 0, len(nums) - 1

        while left <= right:
            middle = (left + right) // 2
            if nums[middle] == target:
                return middle
            if target > nums[middle]:
                left = middle + 1
            else:
                right = middle - 1

        return -1
```

**Explanation:** For our example input, 9 exists in `nums` and its index is 4. To satisfy an O(log n) runtime complexity, we implement binary search. We keep finding the middle: if it is the target we return its index, else when the target is greater than the middle we move the left index to `middle + 1`, or when it is less than the middle we move the right index to `middle - 1`. This effectively cuts the array in half every time until we find the target or leave the while loop.

## Sorting

**Definition:** A sorting algorithm re-organises a data structure into a specific order, such as alphabetical, highest-to-lowest value or shortest-to-longest distance.

**Example Problem:** [Intersection of Two Arrays II (Leetcode 350)](https://leetcode.com/problems/intersection-of-two-arrays-ii/)

**Example Input:** nums1 = [4,9,5], nums2 = [9,4,9,8,4]

**Example Output:** [4,9] or [9,4]

```python
from typing import List

class Solution:
    def intersect(self, nums1: List[int], nums2: List[int]) -> List[int]:
        # always count the shorter list to keep the dictionary small
        if len(nums1) > len(nums2):
            return self.intersect(nums2, nums1)

        map: dict = {}
        for number in nums1:
            if number in map:
                map[number] += 1
            else:
                map[number] = 1

        intersection: List = []
        for number in nums2:
            count: int = map[number] if number in map else 0
            if count > 0:
                intersection.append(number)
                map[number] -= 1

        return intersection
```

or by using Counter we saw in the hashmaps section with its [elements() method](https://docs.python.org/3/library/collections.html#collections.Counter.elements) ...
```python
from collections import Counter
from typing import List

class Solution:
    def intersect(self, nums1: List[int], nums2: List[int]) -> List[int]:
        if len(nums1) > len(nums2):
            return self.intersect(nums2, nums1)

        nums1_count = Counter(nums1)
        nums2_count = Counter(nums2)
        # & keeps the minimum count of each element; elements() returns an
        # iterator, so wrap it in list() to match the return type
        return list((nums1_count & nums2_count).elements())
```

**Explanation:** The intersection is everything that `nums1` and `nums2` have in common. We must ensure each element in the result appears as many times as it shows in both arrays. In the first approach we create our own counter `map` to count the occurrence of each number in `nums1`. Then for each number in `nums2` we check if it's in the dictionary with a count above zero, and if it is we append it to the `intersection` list then decrement the count by one. We can then return the intersection as the answer. Approach two simplifies this by using Counter.

## Graphs

**Definition:** A graph represents a non-linear relationship between its nodes which are connected by edges.

**Example Problem:** [Clone Graph (Leetcode 133)](https://leetcode.com/problems/clone-graph/)

**Example Input:** adjList = [[2,4],[1,3],[2,4],[1,3]]

**Example Output:** [[2,4],[1,3],[2,4],[1,3]]

```python
"""
# Definition for a Node.
class Node:
    def __init__(self, val = 0, neighbors = None):
        self.val = val
        self.neighbors = neighbors if neighbors is not None else []
"""

class Solution:
    def cloneGraph(self, node: 'Node') -> 'Node':
        if not node:
            return None

        map: dict = {}  # original node -> its copy

        def dfs(node, map):
            if node in map:
                return map[node]

            print(f"Copying node {node.val}")
            copy = Node(val=node.val)
            map[node] = copy

            for neighbor in node.neighbors:
                print(f"Appending neighbour node {neighbor.val} to node {copy.val}")
                copy.neighbors.append(dfs(neighbor, map))

            return copy

        return dfs(node, map)
```

**Explanation:** We are asked to effectively manually implement `copy.deepcopy(node)` to copy the contents of a graph given its entry node. We initialise a dictionary `map` to store and return the copied nodes we've already seen.
If we've not already seen the node, we create a copy and store it, then for each of its neighbours append them as the copy's neighbours using depth first search and recursion with `dfs`. This clones every node and in turn copies the neighbors of each node.

## Bitwise manipulation

**Definition:** Bitwise manipulation performs a logical operation on each individual bit of a binary number.

**Preparation:** To solve these problems you must first know the [bitwise operators](https://realpython.com/python-bitwise-operators/#overview-of-pythons-bitwise-operators) and how to convert [binary numbers to denary](https://youtu.be/q7nZbAUTSC4) and [denary numbers to binary](https://youtu.be/70lM1qAD5u4). It is also useful to understand [signed and unsigned numbers](https://www.youtube.com/watch?v=miwMEUfkqfY) and [least significant bit](https://en.wikipedia.org/wiki/Bit_numbering#Least_significant_bit).

Here is a concise bitwise operators reference table I stapled together from various sources.

| Operator | Syntax | Meaning | Description | Example |
| -------- | ------ | -------------------------- | --------------------------------------------------------------------------------- | ------------------------------------------------ |
| & | a & b | Bitwise AND | Returns 1 if both the bits are 1 else 0 | 1010 & 0100 = 0000 |
| \| | a \| b | Bitwise OR | Returns 1 if either of the bits is 1 else 0 | 1010 \| 0100 = 1110 |
| ^ | a ^ b | Bitwise XOR (exclusive OR) | Returns 1 if one of the bits is 1 and the other is 0 else 0 | 1010 ^ 0100 = 1110 |
| ~ | ~a | Bitwise NOT | Returns one’s complement of the number | ~1010 = -(1010 + 1) = -(1011) = -11 (decimal) |
| << | a << n | Bitwise left shift | Shifts the bits of the number to the left and fills 0 on voids left as a result. | 0000 0101 << 2 = 0001 0100 |
| \>> | a >> n | Bitwise right shift | Shifts the bits of the number to the right and fills 0 on voids left as a result. | 0000 0101 >> 2 = 0000 0001 |

**Example Problem:** [Counting Bits (Leetcode 338)](https://leetcode.com/problems/counting-bits/)

**Example Input:** 9

**Example Output:** 2

A good example to illustrate bitwise manipulation is counting bits set to 1 in a positive integer. The Leetcode example asks for the count for every integer from 0 to *n* - so the same solution, but applied to each number in turn.

```python
def count_bits(x: int) -> int:
    num_bits = 0
    while x:
        num_bits += x & 1  # checks if the rightmost bit is 1 (0001 & 0001 = 1)
        x >>= 1  # shifts the number right one bit, shifting out the least significant bit
    return num_bits

count_bits(9)  # Returns 2
```

**Explanation:** If we take the number 9, which in binary is 1001, then we can see there are two bits set to 1.

* We start with 1001 and add `x & 1` (1001 & 0001 = 0001) to `num_bits` which now has a count of 1 (1 added).
* Then shift the bits right making 0100 and repeat adding `x & 1` (0100 & 0001 = 0000) to `num_bits` which now has a count of 1 (nothing added).
* Then shift the bits right making 0010 and repeat adding `x & 1` (0010 & 0001 = 0000) to `num_bits` which now has a count of 1 (nothing added).
* Then shift the bits right making 0001 and repeat adding `x & 1` (0001 & 0001 = 0001) to `num_bits` which now has a count of 2 (1 added).
* After one more shift `x` is 0, so the while loop exits and the returned count of `num_bits` is 2! 😄

## Binary Trees

**Definition:** A binary tree is a tree data structure in which each node has at most two children, which are referred to as the left child and the right child.

**Example Problem:** [Balanced Binary Tree (Leetcode 110)](https://leetcode.com/problems/balanced-binary-tree/)

**Example Input:** root = [3,9,20,null,null,15,7]

**Example Output:** true

Here I've presented the code along with the explanation in an image, to visualise what's going on.

**Explanation:** A binary tree is balanced when the left and right subtrees of every node differ in height by no more than 1.
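As the code for this one lives in an image, here is a minimal text sketch of one way to write the check. It assumes the usual Leetcode `TreeNode` definition (redefined here so the snippet runs on its own). The `height_if_balanced` helper name and the `-1` sentinel are illustrative choices and may differ from the code in the image.

```python
from typing import Optional

class TreeNode:
    def __init__(self, val=0, left=None, right=None):
        self.val = val
        self.left = left
        self.right = right

def is_balanced(root: Optional[TreeNode]) -> bool:
    def height_if_balanced(node: Optional[TreeNode]) -> int:
        """Return the subtree height, or -1 if any subtree is unbalanced."""
        if node is None:
            return 0
        left = height_if_balanced(node.left)
        if left == -1:
            return -1
        right = height_if_balanced(node.right)
        if right == -1:
            return -1
        # children are processed before the node itself
        if abs(left - right) > 1:
            return -1
        return 1 + max(left, right)

    return height_if_balanced(root) != -1

# root = [3,9,20,null,null,15,7]
balanced_tree = TreeNode(3, TreeNode(9), TreeNode(20, TreeNode(15), TreeNode(7)))
# a left-leaning chain 1 -> 2 -> 3
skewed_tree = TreeNode(1, TreeNode(2, TreeNode(3)))

print(is_balanced(balanced_tree))  # True
print(is_balanced(skewed_tree))    # False
```

Returning the height and the "unbalanced" signal from the same call means each node is visited only once, so the whole check is O(n).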
We use recursion (covered later) to carry out [postorder traversal](https://www.geeksforgeeks.org/tree-traversals-inorder-preorder-and-postorder/) to ensure every subtree is balanced all the way back to the top. If that's a bit confusing [this video](https://www.youtube.com/watch?v=LU4fGD-fgJQ) goes into more detail.

## Binary Search Trees

**Definition:** Binary search tree (BST), also called an ordered or sorted binary tree, is a rooted binary tree data structure whose internal nodes each store a key greater than all the keys in the node’s left subtree and less than those in its right subtree.

**Example Problem:** [Validate Binary Search Tree (Leetcode 98)](https://leetcode.com/problems/validate-binary-search-tree/)

**Example Input:** [2,1,3]

**Example Output:** true

```python
# Definition for a binary tree node.
# class TreeNode:
#     def __init__(self, val=0, left=None, right=None):
#         self.val = val
#         self.left = left
#         self.right = right
class Solution:
    def isValidBST(self, root: Optional[TreeNode]) -> bool:

        def validate(node: TreeNode, lower_bound: float, upper_bound: float):
            if not node:
                return True

            node_in_bounds = node.val < upper_bound and node.val > lower_bound
            if not node_in_bounds:
                return False

            return (
                validate(node.left, lower_bound, node.val) and
                validate(node.right, node.val, upper_bound)
            )

        return validate(root, float("-inf"), float("inf"))
```

**Explanation:** We need to determine if a binary tree is a valid binary search tree (BST) given its root node. Any node must be greater than all the keys in its left subtree and less than those in its right subtree. We can use depth first search and recursion to `validate` that each node is within the `lower_bound` and `upper_bound`. Going left tightens the upper bound to the current node's value, and going right tightens the lower bound, so if any node in a left or right subtree falls out of bounds the tree is not a valid BST.

## Recursion

**Definition:** Recursion is a process in which a function calls itself as a subroutine, thereby dividing a problem into subproblems of the same type.
**Example Problem:** [Permutations (Leetcode 46)](https://leetcode.com/problems/permutations/)

**Example Input:** nums = [1,2,3]

**Example Output:** [[1,2,3],[1,3,2],[2,1,3],[2,3,1],[3,1,2],[3,2,1]]

```python
class Solution:
    def permute(self, nums: List[int]) -> List[List[int]]:
        result: List[List[int]] = []

        if len(nums) == 0:
            return [nums[:]]

        for i in range(len(nums)):
            number = nums.pop(0)
            permutations = self.permute(nums)

            for permutation in permutations:
                permutation.append(number)

            result.extend(permutations)
            nums.append(number)

        return result
```

**Explanation:** We are asked to return *all the possible permutations* of `nums` in any order. So for each integer in `nums` [1,2,3] we pop the first element leaving [2,3] and then call `permute` again (recursively) to get each sub-permutation. This would leave us with [3,2] and [2,3] so now we append the popped `number` back to each `permutation`, giving [3,2,1] and [2,3,1], and add both of these to the `result` using extend(). Finally, we append the popped `number` back to `nums`. This will repeat for each element giving all possible permutations. The magic of recursion right?

Here is a diagram to visualise the process that will be carried out for each. We always pop the first element, then get permutations, add the element back, extend the result, append the element back.

## Dynamic Programming or Memoization

**Definition:** Dynamic programming is a technique for solving problems of a recursive nature iteratively, and is applicable when the computations of the subproblems overlap. Memoization is a term describing an optimization technique where you cache previously computed results, and return the cached result when the same computation is needed again.
**Example Problem:** [Fibonacci Number (Leetcode 509)](https://leetcode.com/problems/fibonacci-number/)

**Example Inputs:** n = 40

**Example Output:** 102334155

Earlier when discussing Big-O Notation I used an example `find_nth_number_in_fibonacci_sequence` to demonstrate exponential time O(2^n). In the example I tried to find the 40th number in the Fibonacci sequence using recursion and this took a huge 59.58 seconds. The greater the number in the sequence we were looking for, the more the runtime grew exponentially.

How can we improve this? What about using memoization to cache the results of each recursive call so no unnecessary repeat calls are ever made?

```python
import time


def find_nth_number_in_fibonacci_sequence(n, cached_results: dict):
    if n in cached_results.keys():
        return cached_results[n]  # return result if already in cache

    if n <= 1:
        result = n
    else:
        # ensure cache is passed to all recursive calls
        result = find_nth_number_in_fibonacci_sequence(n - 2, cached_results) + \
                 find_nth_number_in_fibonacci_sequence(n - 1, cached_results)

    cached_results[n] = result  # cache the result
    return result


start = time.time()
n = 40
print(find_nth_number_in_fibonacci_sequence(n, dict()))  # 40th number in the fibonacci sequence is 102334155
end = time.time()
print(f"Time taken: {end - start} seconds")  # This took 0.0009975433349609375 seconds for me
```

**Explanation:** Modifying the code we used earlier to include a cache in the form of a dictionary, finding the 40th number in the sequence now takes 0.0009975433349609375 seconds for me!! By caching and reusing earlier results the speed has improved dramatically. Using a hashmap (dictionary) to look up cached results has a constant time complexity of O(1), and each value of `n` is now computed only once, so the overall running time drops from O(2^n) to O(n).

## Final thoughts

So now you should have a good idea of the data structures and algorithms included in a coding interview. This might not prepare you for one (only constant focused practice can do that) but it will make you aware of what you don’t know.
I have been doing problems on LeetCode and EPI and trying to really understand the solution before moving on. A key part of this strategy has been tracking performance and repeating failed problems. Here is my Trello board when I first started.

The approach was to add problems for each topic we've covered to the Problems list, then each day move 3-5 problems across to the Doing list. Once attempted, I rated performance out of 5 (5 being perfect with no help needed, 1 being unable to finish it without help and further study) and moved it to the Repeat list (worst at the top). Then on subsequent days I took another 3-5 problems from the Problems list, and one from the top of the Repeat list. Rinse and repeat. The idea came from this [Engineering with Utsav video](https://youtu.be/7UlslIXHNsw?t=696) (I really like this channel). This has allowed me to focus on breadth of knowledge, whilst revisiting and repeating weaker areas.

More so than passing any test, I hope this article gives you the inspiration to become an (even) better programmer and to think more algorithmically. As always if you have any thoughts let me know in the comments section below.
Alternatively, you can complete the site's new [feedback form](https://forms.office.com/r/Eu2HTx8kvn) - you might have noticed the new 👍 feedback button on the navbar, so you can say how you think the site is doing and what you would like to see more of in the future 😄 ## Resources * [Python Standard Library Reference](https://docs.python.org/3/library/index.html#library-index) * [Elements of Programming Interviews](https://elementsofprogramminginterviews.com/) | [Book](https://www.amazon.co.uk/Elements-Programming-Interviews-Python-Insiders/dp/1537713949/ref=pd_bxgy_img_2/262-9365292-3109168?pd_rd_w=Y6OlR&pf_rd_p=c7ea61ca-7168-47e3-9c8b-d84748f5b23c&pf_rd_r=D0WECF6DRCT5DPW9E23H&pd_rd_r=1f09cc37-a87b-404f-8ed5-79f25c54beb0&pd_rd_wg=zODeE&pd_rd_i=1537713949&psc=1) * [EPI-Judge](https://github.com/adnanaziz/EPIJudge) * [LeetCode](https://leetcode.com/) * [Grokking Algorithms](https://www.amazon.co.uk/Grokking-Algorithms-illustrated-programmers-curious/dp/1617292230/ref=pd_sbs_1/262-9365292-3109168?pd_rd_w=ft5SM&pf_rd_p=a3a7088f-4aec-4dbd-97cc-9a059581fe7b&pf_rd_r=ZE7W9K1EBJ7VNZPCHC07&pd_rd_r=58618f0b-9890-477a-9f09-a2df9551f80d&pd_rd_wg=zdcP9&pd_rd_i=1617292230&psc=1) * [Computer Science Distilled](https://www.amazon.co.uk/Computer-Science-Distilled-Computational-Problems/dp/0997316020/ref=sr_1_1?dchild=1&keywords=computer+science+distilled&qid=1626618696&s=books&sr=1-1) * [Big-O Examples in Python](https://www.youtube.com/watch?v=5yJ_QLec0Lc) * [Time Complexity Examples in Python](https://towardsdatascience.com/understanding-time-complexity-with-python-examples-2bda6e8158a7) * [Memoization and Dynamic Programming Explained](https://www.youtube.com/watch?v=WbwP4w6TpCk) * [Bitwise Operators](https://www.geeksforgeeks.org/python-bitwise-operators/) * [Data Structures & Algorithms in Python](https://www.amazon.co.uk/Structures-Algorithms-Python-Michael-Goodrich/dp/1118290275) | Excellent but expensive, might be able to find an e-book version cheaper * [10 Important Data 
Structures & Algorithms for Interviews](https://www.youtube.com/watch?v=RcvQagxK_9w) * [Understanding Merge Sort in Python](https://www.youtube.com/watch?v=rAqBlKhy_oI)

Searching for text in PDFs at increasing scale

Wed, 04 Aug 2021 13:58:00 GMT

I had the interesting challenge of searching for text within a large number of PDFs recently. This was to assist a finance team in automating the organising and categorising of some of their existing documents. When I say a large number, it was around 350,000 PDF documents, so quite a few!

I iterated through a few different solutions and tried to focus on delivering something optimal and efficient. I tested each on a smaller scenario to benchmark how they might perform at increasing scale - the results can be found at the end of the article.

## Getting started with PyPDF2

With Python being my usual go-to Swiss Army Knife for many things, I first installed this very useful package to give it a go:

```
pip install PyPDF2
```

I had read about [PyPDF2](https://pypi.org/project/PyPDF2/) in [Automate the Boring Stuff with Python](https://automatetheboringstuff.com/chapter13/) so at least I had a starting point. PyPDF2 also has some changes in the latest version 3.0.1, which you can read about in the [documentation](https://pypdf2.readthedocs.io/en/latest/) and [migration guide](https://pypdf2.readthedocs.io/en/3.0.0/user/migration-1-to-2.html), so some of the function names in the code below have been updated.
I put together the following CLI tool using the PyPDF2 package:

```python
import PyPDF2
import re
import time
import sys


def main():
    if len(sys.argv) != 2:
        sys.exit("Usage: python pdf_searcher.py filename.pdf")

    filename = sys.argv[1]

    print("Type your search term and hit enter")
    print("You can add as many search terms as you like")
    print("Once you're done, hit enter to continue...")
    search_terms = get_search_terms_from_user(search_terms=[])

    # Start timing once the search terms are entered so typing time is not measured
    start = time.time()

    # The with block closes the file once the search has finished
    with open(filename, "rb") as file:
        pdf_reader = PyPDF2.PdfReader(file)  # Formerly PyPDF2.PdfFileReader(file)
        number_of_pages = len(pdf_reader.pages)  # Formerly pdf_reader.getNumPages()

        for i in range(0, number_of_pages):
            page = pdf_reader.pages[i]  # Formerly pdf_reader.getPage(i)
            page_content = page.extract_text()  # Formerly page.extractText()

            for search_term in search_terms:
                if re.search(search_term, page_content):
                    # Pages are indexed from 0, so add 1 for a human-readable page number
                    print(f"Matched '{search_term}' on page {i + 1}")

    print(f"Program took {time.time() - start} seconds")


def get_search_terms_from_user(search_terms: list) -> list:
    # Keep prompting (recursively) until the user submits an empty line
    search_term = str(input("Search term: "))

    if search_term != "":
        search_terms.append(search_term)
        return get_search_terms_from_user(search_terms)
    else:
        return search_terms


if __name__ == "__main__":
    main()
```

This accepted a filename as a command line argument, followed by a prompt to enter search terms.

## Optimising the PyPDF2 script

So this was a good start and a fun program for searching a single PDF but some optimisations were needed. In addition, the program needed to search an entire directory of files so it needed extending.

The program didn't need to find every occurrence of a search term in the given document, just whether it occurs in there at least once. So to optimise for that use case, once it's certain that a search term does exist in the given document, the program doesn't have to look for that term again, saving time.
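
The "stop looking once found" idea can be sketched in isolation, with plain strings standing in for extracted page text. This is an illustrative sketch only: `find_terms` is a hypothetical helper name, and breaking out of the page loop once every term has been found is an extra step assumed here rather than something the script itself does.

```python
from typing import Iterable, List, Set


def find_terms(pages: Iterable[str], search_terms: List[str]) -> Set[str]:
    """Return the search terms that occur at least once across the pages."""
    remaining = {term.lower() for term in search_terms}
    found: Set[str] = set()

    for page_content in pages:
        text = page_content.lower()

        # Only check terms that have not been found on an earlier page
        for term in list(remaining):
            if term in text:
                found.add(term)
                remaining.discard(term)

        # Assumed extra step: stop reading pages once nothing is left to find
        if not remaining:
            break

    return found


pages = ["Walt Disney annual report", "nothing of interest", "Mercedes sales figures"]
print(sorted(find_terms(pages, ["disney", "mercedes", "epcot"])))  # ['disney', 'mercedes']
```

Because `pages` can be any iterable, a generator that extracts text lazily would avoid extracting pages that are never needed.
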
```python [pypdfsearcher.py]
import PyPDF2
import re
import time
import sys
import os
import glob


def main():
    directory = os.path.dirname(os.path.abspath(__file__))
    pdf_filepaths = glob.glob("**/*.pdf", recursive=True)
    start = time.time()
    results = {}

    for filepath in pdf_filepaths:
        print(f"Searching document {filepath}")
        search_terms = ["hurricanes", "walt", "avenue", "disney", "mercedes"]
        filename = os.path.basename(filepath)
        found_terms = {}

        with open(filepath, "rb") as file:
            pdf_reader = PyPDF2.PdfReader(file)  # Formerly PyPDF2.PdfFileReader(file)
            number_of_pages = len(pdf_reader.pages)  # Formerly pdf_reader.getNumPages()

            for i in range(0, number_of_pages):
                page = pdf_reader.pages[i]  # Formerly pdf_reader.getPage(i)
                page_content = page.extract_text()  # Formerly page.extractText()

                for term in search_terms:
                    # Skip terms already found in this document
                    if term in found_terms.keys():
                        continue

                    if re.search(term.lower(), page_content.lower()):
                        print(f"Found '{term}' in document '{filename}'")
                        found_terms[term] = 1

                        if filename in results.keys():
                            results[filename].append(term)
                        else:
                            results[filename] = [term]

    print(f"Program took {time.time() - start} seconds")
    print(results)


if __name__ == "__main__":
    main()
```

## Alternative approach with pdftotext subprocess

The second solution called the [pdftotext](https://www.xpdfreader.com/pdftotext-man.html) program in a Python subprocess to receive the text as the subprocess output. It did exactly the same thing as the previous script but might be faster - we'll compare the speed of each approach later.
```python [pdftotextsearcher.py]
import os
import subprocess
import re
import time
import glob


def main():
    directory = os.path.dirname(os.path.abspath(__file__))
    pdf_filepaths = glob.glob("**/*.pdf", recursive=True)
    start = time.time()
    results = {}

    for filepath in pdf_filepaths:
        print(f"Searching document {filepath}")
        search_terms = ["hurricanes", "epcot", "daimler", "disney", "mercedes"]
        filename = os.path.basename(filepath)
        found_terms = {}

        args = ["pdftotext",
                '-enc',
                'UTF-8',
                filepath,  # Example: "pdfs/United-Kingdom-Strategic-Export-Controls-Annual-Report-2021.pdf"
                '-']  # '-' sends the extracted text to stdout instead of a file
        res = subprocess.run(args, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
        output = res.stdout.decode('utf-8')

        for term in search_terms:
            # Skip terms already found in this document
            if term in found_terms.keys():
                continue

            if re.search(term.lower(), output.lower()):
                print(f"Found '{term}' in document '{filename}'")
                found_terms[term] = 1

                if filename in results.keys():
                    results[filename].append(term)
                else:
                    results[filename] = [term]

    print(f"Program took {time.time() - start} seconds")
    print(results)


if __name__ == "__main__":
    main()
```

## Trying out C# and iTextSharp

I thought I'd switch to C# and investigate the [iTextSharp](https://www.nuget.org/packages/iTextSharp/) NuGet package for reading and searching PDFs. I was pleasantly surprised at how well this package worked. It was also quick to install and get started with.
Here is the program I put together using it:

```csharp [Program.cs]
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

namespace PDFSearcherSharp
{
    class Program
    {
        static void Main(string[] args)
        {
            var stopwatch = new System.Diagnostics.Stopwatch();
            stopwatch.Start();

            string directory = @"C:/Users/shedloadofcode/source/repos/PDFSearcherSharp/pdfs/";
            string[] files = Directory.GetFiles(directory, "*.pdf");
            List<string> searchTerms = new List<string>() { "hurricanes", "epcot", "daimler", "disney", "mercedes" };

            foreach (var filename in files)
            {
                Console.WriteLine($"Searching document {filename}");

                StringBuilder stringBuilder = new StringBuilder();
                string filePath = System.IO.Path.Combine(directory, filename);

                using (PdfReader reader = new PdfReader(filePath))
                {
                    List<string> foundTerms = new List<string>();

                    for (int pageNumber = 1; pageNumber <= reader.NumberOfPages; pageNumber++)
                    {
                        ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
                        string text = PdfTextExtractor.GetTextFromPage(reader, pageNumber, strategy);
                        text = Encoding.UTF8.GetString(
                            ASCIIEncoding.Convert(
                                Encoding.Default,
                                Encoding.UTF8,
                                Encoding.Default.GetBytes(text)
                            )
                        );
                        stringBuilder.Append(text);

                        foreach (string term in searchTerms)
                        {
                            // Skip terms already found in this document
                            if (foundTerms.Contains(term))
                            {
                                continue;
                            }

                            if (text.ToLower().Contains(term.ToLower()))
                            {
                                Console.WriteLine($"Found '{term}' in document '{filename}'");
                                foundTerms.Add(term);
                            }
                        }
                    }
                }

                // Console.WriteLine(stringBuilder.ToString());
            }

            stopwatch.Stop();
            Console.WriteLine($"Program took {stopwatch.ElapsedMilliseconds / 1000} seconds");
        }
    }
}
```

## A last approach with C++ and pdftotext

The fourth and final approach involved calling the [pdftotext](https://www.xpdfreader.com/pdftotext-man.html) executable again, but this time with the main script written in C++. I was curious to see how to put together a solution for this in C++ more than anything.
I couldn't figure out a way to return the output from the pdftotext executable to stdout in-process, so resorted to converting the PDFs to text files first, searching the text files and then finally deleting them - this adds overhead so will likely slow it down.

```cpp [PdfSearcher.cpp]
// Headers inferred from what the code below uses
#include <iostream>
#include <fstream>
#include <string>
#include <vector>
#include <map>
#include <algorithm>
#include <filesystem>
#include <ctime>
#include <windows.h>

using namespace std;
using std::filesystem::directory_iterator;

void DeleteTextFile(string filePath)
{
    string fileName = filePath;
    fileName = fileName.substr(0, fileName.size() - 4);
    fileName = fileName + ".txt";
    const char* file = fileName.c_str();

    if (remove(file) != 0)
        cout << "Error deleting file " << fileName << endl;
    else
        cout << "File " << fileName << " successfully deleted" << endl;
}

string TransformLineToLowercase(string line)
{
    std::for_each(line.begin(), line.end(), [](char& c)
    {
        c = ::tolower(c);
    });

    return line;
}

void SearchTextFile(string fileName, string searchTerms[], int searchTermsLength)
{
    map<string, bool> foundSearchTerms;

    for (int i = 0; i < searchTermsLength; i++)
    {
        foundSearchTerms[searchTerms[i]] = false;
    }

    fstream textFile;
    textFile.open(fileName, ios::in);

    if (textFile.is_open())
    {
        int totalNumberOfMatches = 0;
        string line;

        while (getline(textFile, line))
        {
            string lowercaseLine = TransformLineToLowercase(line);

            for (int i = 0; i < searchTermsLength; i++)
            {
                // Skip terms already found in this document
                bool searchTermAlreadyFound = foundSearchTerms[searchTerms[i]] == 1;
                if (searchTermAlreadyFound)
                {
                    continue;
                }

                int indexOfMatch = lowercaseLine.find(searchTerms[i]);
                if (indexOfMatch > -1)
                {
                    cout << "Found search term " << searchTerms[i] << " in " << fileName << " at ";
                    cout << "position " << indexOfMatch << " in line " << lowercaseLine << endl;
                    foundSearchTerms[searchTerms[i]] = 1;
                }
            }
        }

        textFile.close();
    }
}

vector<filesystem::path> GetAllFileNamesInDirectory()
{
    string path = "pdfs/";
    vector<filesystem::path> filePaths;

    for (const auto& file : directory_iterator(path))
    {
        filePaths.push_back(file.path());
    }

    return filePaths;
}

void GenerateTextFile(string filePath)
{
    STARTUPINFO startupInfo;
    PROCESS_INFORMATION processInformation;

    ZeroMemory(&startupInfo, sizeof(startupInfo));
    ZeroMemory(&processInformation, sizeof(processInformation));

    wstring filePathWs = wstring(filePath.begin(), filePath.end());
    wstring commandLineArgs = L"pdftotext.exe -enc UTF-8 \"" + filePathWs + L"\"";
    wstring commandLineArgsWs = wstring(commandLineArgs.begin(), commandLineArgs.end()).c_str();
    std::wstring commandLineInput(commandLineArgsWs);

    // This was the first attempt
    // wchar_t commandLineInput[] = TEXT("pdftotext.exe -enc UTF-8 \"pdfs/United-Kingdom-Strategic-Export-Controls-Annual-Report-2021 - Copy - Copy (7).pdf\"");

    bool output = CreateProcess(
        NULL,                   // Application name
        &commandLineInput[0],   // Command line arguments
        NULL,                   // Process attributes
        NULL,                   // Thread attributes
        TRUE,                   // Inherit handles
        0,                      // No creation flags
        NULL,                   // Environment
        NULL,                   // Current directory
        &startupInfo,           // Startup information
        &processInformation     // Process information
    );

    if (output == FALSE)
    {
        cout << "Generating text file for PDF " << filePath << " failed" << endl;
    }
    else
    {
        cout << "Generating text file for PDF " << filePath << endl;
        // cout << "Process ID: " << processInformation.dwProcessId << endl;
    }

    WaitForSingleObject(processInformation.hProcess, INFINITE);
    CloseHandle(processInformation.hProcess);
    CloseHandle(processInformation.hThread);
}

int main()
{
    clock_t start = clock();

    vector<filesystem::path> filePaths = GetAllFileNamesInDirectory();

    // Convert every PDF to a text file
    for (int i = 0; i < filePaths.size(); i++)
    {
        string filePath = filePaths[i].string();
        GenerateTextFile(filePath);
    }

    // Search each generated text file
    for (int i = 0; i < filePaths.size(); i++)
    {
        string filePath = filePaths[i].string();
        string fileName = filePath;
        fileName = fileName.replace(0, 5, "");
        fileName = fileName.substr(0, fileName.size() - 4);

        string searchTerms[5] = { "hurricanes", "epcot", "daimler", "disney", "mercedes" };
        string textFilePath = "pdfs/" + fileName + ".txt";

        SearchTextFile(textFilePath, searchTerms, (sizeof(searchTerms) / sizeof(*searchTerms)));
    }

    // Clean up the generated text files
    for (int i = 0; i < filePaths.size(); i++)
    {
        string filePath = filePaths[i].string();
        DeleteTextFile(filePath);
    }

    double duration = (clock() - start) / (double)CLOCKS_PER_SEC;
    cout << "Program took " << duration << " seconds" << endl;

    system("pause > 0");
    return 0;
}
```

## Test exercise and speed benchmarks

So we now have four (almost) equivalent programs in terms of logic and desired output. It was time to run all of the solutions above through a scenario to see how they perform.

The scenario was a directory `/pdfs` containing around 200 PDF documents. The program would need to search all of the PDF documents and return the names of the PDF files containing the search terms. I had placed a few PDFs I knew contained the search terms, with unique file names, to test it works. Most documents were around 71 - 150 pages, with the largest at 432 pages. So I was testing with quite large files. If this ever went into production the files would likely be much smaller.

Ok here we go!
**Inputs**

* 200 PDF documents
* Each PDF between 71 and 432 pages
* Average PDF file size was 5MB
* Number of search terms was 5 ["hurricanes", "epcot", "daimler", "disney", "mercedes"]
* My two target files were a [Disney financial report](https://thewaltdisneycompany.com/app/uploads/2021/01/2020-Annual-Report.pdf) and a [Daimler financial report](https://www.daimler.com/documents/investors/reports/annual-report/daimler/daimler-ir-annual-report-2019-incl-combined-management-report-daimler-ag.pdf) as I knew these actually contained the search terms (no particular reason I chose these, they were just the first I could find 😆)

**Results**

| Approach | Found all search terms | Time in seconds |
| ---------------------------- | ---------------------- | --------------- |
| Python and PyPDF2 | Yes | 306 |
| Python running pdftotext.exe | Yes | 66 |
| C# and iTextSharp | Yes | 66 |
| C++ running pdftotext.exe | Yes | 72 |

As I predicted, the C++ program was likely slowed down by having to convert to text files first before searching. The most performant approaches, and the ones I prefer, are Python running pdftotext.exe (where it is straightforward to receive the stdout of the child process) and C# with the iTextSharp NuGet package. Both of these solutions completed in 66 seconds in the test scenario.

**Folder structure for Python project (containing both versions)**

```
/pdfs
pdftotext.exe
pdftotextsearcher.py
pypdfsearcher.py
```

**Folder structure for C# Visual Studio project**

```
/bin
/obj
/pdfs
/PDFSearcherSharp
PDFSearcher.csproj
PDFSearcherSharp.sln
Program.cs
```

**Folder structure for C++ Visual Studio project**

```
/pdfs
PdfSearcher.cpp
PdfSearcher.sln
PdfSearcher.vcxproj
PdfSearcher.vcxproj.filters
PdfSearcher.vcxproj.user
pdftotext.exe
```

## Reflections

So I learned quite a bit from this exercise, and this provides a good starting point to further develop a solution. It certainly needs more testing and refining to the specific use case.
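
To get a feel for what the benchmark means at full scale, the 66-second result can be extrapolated to all 350,000 documents. This is only a back-of-envelope sketch: it assumes runtime scales linearly with the number of documents and that the real files resemble the test files, and `estimate_runtime` is a hypothetical helper name.

```python
def estimate_runtime(total_docs: int, sample_docs: int, sample_seconds: float) -> dict:
    """Linearly extrapolate a benchmark timing to a larger document count."""
    seconds = total_docs / sample_docs * sample_seconds
    return {
        "seconds": seconds,
        "minutes": seconds / 60,
        "hours": seconds / 3600,
        "days": seconds / 86400,
    }


estimate = estimate_runtime(total_docs=350_000, sample_docs=200, sample_seconds=66)
print(f"{estimate['seconds']:,.0f} seconds")  # 115,500 seconds
print(f"{estimate['minutes']:,.0f} minutes")  # 1,925 minutes
print(f"{estimate['hours']:.1f} hours")       # 32.1 hours
print(f"{estimate['days']:.1f} days")         # 1.3 days
```
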
If searching these 200 or so fairly large files took 66 seconds, then at worst case 350,000 / 200 is 1,750 and 1,750 x 66 gives 115,500 seconds. Dividing that by 60 gives 1,925 minutes. Dividing that by 60 gives roughly 32 hours. Finally, dividing that by 24 gives around 1.3 days 😄. Moving one of these scripts onto a virtual machine and letting it run until done might be the best solution depending on where the files are stored.

A caveat to note is that the PDFs I used had searchable text, so if you had scanned PDF documents you might need to go down the avenue of using OCR (optical character recognition). I hear [pytesseract](https://pypi.org/project/pytesseract/) is useful for this as it acts as a wrapper for [Google’s Tesseract-OCR Engine](https://github.com/tesseract-ocr/tesseract). I might venture into this area next if the need for it arises 😄.

Altogether I hope I've shown that reading and searching many PDFs at increasing scale is possible with different approaches, even if it can be temperamental at times.

## Resources

* [Searching text in a PDF using Python](https://stackoverflow.com/questions/17098675/searching-text-in-a-pdf-using-python)
* [Using pdftotext on AWS Lambda](http://howto.philippkeller.com/2018/03/13/How-to-extract-text-from-pdf-in-python/)
* [Extract text from PDF in C#](https://www.codeproject.com/Articles/14170/Extract-Text-from-PDF-in-C-100-NET)
* [Extract text from PDF using iTextSharp](https://www.youtube.com/watch?v=y6s2mLpYfMc)
* [Searching strings in C#](https://docs.microsoft.com/en-us/dotnet/csharp/how-to/search-strings)
* [Child Process in Windows System Programming](https://www.youtube.com/watch?v=W2Qu4RDk__k)
* [Creating a Child Process with Redirected Input and Output](https://docs.microsoft.com/en-us/windows/win32/procthread/creating-a-child-process-with-redirected-input-and-output)
* [PDF parsing in C++](https://stackoverflow.com/questions/11715561/pdf-parsing-in-c-podofo)
* [C++ regex](https://www.youtube.com/watch?v=uL9Qt2v2yjk)
* [C++ list files in a directory](https://www.delftstack.com/howto/cpp/how-to-get-list-of-files-in-a-directory-cpp/)
* [C++ Wide Char Array Strings](https://www.youtube.com/watch?v=R21fh-17um0)

How to scrape and analyse your Chess.com data

Sat, 10 Jul 2021 12:38:00 GMT

In this article I will scrape data from my Chess.com profile and analyse my historical performance in live matches. This is a reproducible pipeline using Python. I took up Chess again at the end of 2020 after a long hiatus, so was eager to monitor my performance and see where the weaknesses were. The good part of this pipeline is that the data can be refreshed just by re-running these scripts, so I can always see what I need to improve on and ask the interesting questions about my performance.

## Before starting

Before starting you will need a few things. These will set you up to carry out other Data Science projects in the future too - like [analysing your Amazon spending data](/blog/how-to-scrape-and-analyse-your-amazon-spending-data/) or [scraping AutoTrader for multiple makes / models](/blog/how-to-scrape-autotrader-with-python-and-selenium-to-search-for-multiple-makes-and-models/)

* Anaconda
* Jupyter Notebooks (installed with Anaconda)
* Selenium
* Google Chrome (latest version)
* Chrome Driver (latest version)

This article will not cover installing programs in detail, but here is a starting point. Install [Anaconda](https://www.anaconda.com/distribution/) first. Anaconda is a distribution of the Python and R programming languages for scientific computing (data science, machine learning applications, large-scale data processing, predictive analytics, etc.), that aims to simplify package management and deployment.

Once installed, open Anaconda Prompt and install Selenium using `pip install selenium`. Selenium is a tool for automating actions in the browser, often used for testing.

Finally, ensure you have the latest version of [Google Chrome](http://google.co.uk/chrome/?brand=CHBD&gclid=EAIaIQobChMI0LPsqNXl5QIVCLTtCh3pJwybEAAYASAAEgJxkvD_BwE&gclsrc=aw.ds) installed and the [ChromeDriver](https://chromedriver.chromium.org/downloads) that matches the version number of Chrome you're running. On Windows, ensure `chromedriver.exe` is in a [suitable location](https://chromedriver.chromium.org/getting-started) such as `C:\Windows`.

## What will the web scraper do?

Here are the step by step actions the web scraper will perform to scrape Chess.com game data:

* Launches a Chrome browser controlled by Selenium
* Navigates to the Chess.com login page and logs in with your given details
* After login, navigates to the [My Games](https://www.chess.com/games/archive) page
* Scrapes all game data
* Repeats for each page in the archive until finished

The resulting data will be enough to answer questions such as:

* Do I win more matches as black or white?
* Do I win shorter or longer games?
* Am I losing to higher or lower rated players?
* Is time-pressure affecting my wins?
* How many of my games reach the endgame?
* Do specific days affect my results?
* Does seasonality affect my results?
* How has my rating developed in 30 min games?

## Scraping games data

First we scrape the required data using Selenium. You must provide your Chess.com `USERNAME` and `PASSWORD` so the script can log you in, so be sure to amend these variables first.
```python [chess-scraper.py] import numpy as np import pandas as pd import bs4 from bs4 import BeautifulSoup import requests import csv import datetime import time import hashlib import os from selenium import webdriver from selenium.webdriver.common.keys import Keys from selenium.webdriver.chrome.options import Options options = webdriver.ChromeOptions() options.add_argument("--start-maximized") now = datetime.datetime.now() USERNAME = "DeadlyKnightX" PASSWORD = "Your password here" GAMES_URL = "https://www.chess.com/games/archive?gameOwner=other_game&username=" + \ USERNAME + \ "&gameType=live&gameResult=&opponent=&opening=&color=&gameTourTeam=&" + \ "timeSort=desc&rated=rated&startDate%5Bdate%5D=08%2F01%2F2013&endDate%5Bdate%5D=" + \ str(now.month) + "%2F" + str(now.day) + "%2F" + str(now.year) + \ "&ratingFrom=&ratingTo=&page=" LOGIN_URL = "https://www.chess.com/login" driver = webdriver.Chrome("chromedriver.exe", options=options) driver.get(LOGIN_URL) driver.find_element_by_id("username").send_keys(USERNAME) driver.find_element_by_id("password").send_keys(PASSWORD) driver.find_element_by_id("login").click() time.sleep(5) tables = [] game_links = [] for page_number in range(4): driver.get(GAMES_URL + str(page_number + 1)) time.sleep(5) tables.append( pd.read_html( driver.page_source, attrs={'class':'table-component table-hover archive-games-table'} )[0] ) table_user_cells = driver.find_elements_by_class_name('archive-games-user-cell') for cell in table_user_cells: link = cell.find_elements_by_tag_name('a')[0] game_links.append(link.get_attribute('href')) driver.close() games = pd.concat(tables) identifier = pd.Series( games['Players'] + str(games['Result']) + str(games['Moves']) + games['Date'] ).apply(lambda x: x.replace(" ", "")) games.insert( 0, 'GameId', identifier.apply(lambda x: hashlib.sha1(x.encode("utf-8")).hexdigest()) ) print(games.head(3)) ``` | GameId | Unnamed: 0 | Players | Result | Accuracy | Moves | Date | Unnamed: 6 | 
|-------------------|--------------- |-------- |--------|------ |-------|------|----------- |
|7e0c2bc5f27e025 | 1 hour | DominikHrbaty (1319) DeadlyKnightX (1387) | 0 1 | 84.7 84.4 |68 | Dec 22,2020 | NaN |
|7f6c05e773ebe23 | 30 mins | Omarricardo34 (1126) DeadlyKnightX (1359) | 0 1 | 49 57.2 | 52 | Dec 19,2020 | NaN |
|af2b84926911844 | 30 mins | DeadlyKnightX (1344) albert106 (1138) | 1 0 | 94.4 5.6 |13 | Dec 19,2020 | NaN |

Now that we have a `games` DataFrame holding the raw data, we can concentrate on transforming it by splitting columns, removing unnecessary columns, and adding calculated columns to derive more insight.

## Transform games data

```python [chess-scraper.py]
# Create white player, black player, white rating, black rating
new = games.Players.str.split(" ", n=5, expand=True)
new = new.drop([1,4], axis=1)
# regex=False treats the brackets as literal characters
new[2] = new[2].str.replace('(','', regex=False).str.replace(')','', regex=False).astype(int)
new[5] = new[5].str.replace('(','', regex=False).str.replace(')','', regex=False).astype(int)
games['White Player'] = new[0]
games['White Rating'] = new[2]
games['Black Player'] = new[3]
games['Black Rating'] = new[5]

# Add results
result = games.Result.str.split(" ", expand=True)
games['White Result'] = result[0]
games['Black Result'] = result[1]

# Rename and drop unnecessary columns
games = games.rename(columns={"Unnamed: 0": "Time"})
games = games.drop(['Players', 'Unnamed: 6', 'Result', 'Accuracy'], axis=1)

# Add calculated columns for wins, losses, draws, ratings, year, game links
conditions = [
    (games['White Player'] == USERNAME) & (games['White Result'] == '1'),
    (games['Black Player'] == USERNAME) & (games['Black Result'] == '1'),
    (games['White Player'] == USERNAME) & (games['White Result'] == '0'),
    (games['Black Player'] == USERNAME) & (games['Black Result'] == '0'),
]
choices = ["Win", "Win", "Loss", "Loss"]
games['W/L'] = np.select(conditions, choices, default="Draw")

conditions = [
    (games['White Player'] == USERNAME),
    (games['Black Player'] == USERNAME)
]
choices = ["White", "Black"]
games['Colour'] = np.select(conditions, choices)

conditions = [
    (games['White Player'] == USERNAME),
    (games['Black Player'] == USERNAME)
]
choices = [games['White Rating'], games['Black Rating']]
games['My Rating'] = np.select(conditions, choices)

conditions = [
    (games['White Player'] != USERNAME),
    (games['Black Player'] != USERNAME)
]
choices = [games['White Rating'], games['Black Rating']]
games['Opponent Rating'] = np.select(conditions, choices)

games['Rating Difference'] = games['Opponent Rating'] - games['My Rating']

conditions = [
    (games['White Player'] == USERNAME) & (games['White Result'] == '1'),
    (games['Black Player'] == USERNAME) & (games['Black Result'] == '1')
]
choices = [1, 1]
games['Win'] = np.select(conditions, choices)

conditions = [
    (games['White Player'] == USERNAME) & (games['White Result'] == '0'),
    (games['Black Player'] == USERNAME) & (games['Black Result'] == '0')
]
choices = [1, 1]
games['Loss'] = np.select(conditions, choices)

conditions = [
    (games['White Player'] == USERNAME) & (games['White Result'] == '½'),
    (games['Black Player'] == USERNAME) & (games['Black Result'] == '½')
]
choices = [1, 1]
games['Draw'] = np.select(conditions, choices)

games['Year'] = pd.to_datetime(games['Date']).dt.to_period('Y')
games['Link'] = pd.Series(game_links)

# Optional calculated columns for indicating black or white pieces - uncomment if interested in these
# games['Is_White'] = np.where(games['White Player']==USERNAME, 1, 0)
# games['Is_Black'] = np.where(games['Black Player']==USERNAME, 1, 0)

# Correct date format
games["Date"] = pd.to_datetime(
    games["Date"].str.replace(",", "") + " 00:00", format = '%b %d %Y %H:%M'
)

print(games.head(3))
```

| GameId | Time | Moves | Date | White Player | White Rating | Black Player | Black Rating | White Result | Black Result | W/L | Colour | My Rating | Opponent Rating | Rating Difference | Win | Loss | Draw | Year | Link |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 7e0c2bc5f27e025b741fa464cf45a40054e0e637 | 1 hour | 68 | 22/12/2020 | DominikHrbaty | 1319 | DeadlyKnightX | 1387 | 0 | 1 | Win | Black | 1387 | 1319 | \-68 | 1 | 0 | 0 | 2020 | https://www.chess.com/game/live/6032087036 |
| 17f6c05e773ebe23c52164b09fec2ea9de2a9dc6 | 30 min | 52 | 19/12/2020 | Omarricardo34 | 1126 | DeadlyKnightX | 1359 | 0 | 1 | Win | Black | 1359 | 1126 | \-233 | 1 | 0 | 0 | 2020 | https://www.chess.com/game/live/6009160294 |
| af2b84926911833c2e644d6400f39437f8fe0341 | 30 min | 13 | 19/12/2020 | DeadlyKnightX | 1344 | albert106 | 1138 | 1 | 0 | Win | White | 1344 | 1138 | \-206 | 1 | 0 | 0 | 2020 | https://www.chess.com/game/live/6009042670 |

Great! The data has been transformed and extended, and is now ready for analysis.

## Analysing games data

With a solid dataset prepared, you can now apply any analysis you like to it. These are the visualisations I produced based upon what I was interested in. First let's import the key visualisation libraries [matplotlib](https://matplotlib.org/) and [seaborn](https://seaborn.pydata.org/).

```python
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

sns.set(rc={'figure.facecolor':'white'})
```

## Overall rating

```python
fig, ax = plt.subplots(figsize=(15,6))
plt.title("Chess.com Rating Development")
sns.lineplot(x="Date", y="My Rating", data=games.iloc[::-1], color="black")
plt.xticks(rotation=0)
plt.show()
```

I can quite clearly see here that I didn't play for a while, until the end of 2020 when I picked chess back up. This was met by a few losses and a rating dip - I was certainly out of practice.
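The break in play visible in the chart can also be checked numerically. The following is a minimal sketch, assuming a date column like `games["Date"]` from the transformation step; the `longest_gap` helper and the sample dates are made up for illustration and are not part of the original scraper.

```python
import pandas as pd


def longest_gap(dates: pd.Series) -> pd.Timedelta:
    """Return the longest interval between consecutive game dates."""
    ordered = pd.to_datetime(dates).sort_values()
    # diff() gives the time elapsed since the previous game
    return ordered.diff().max()


# Hypothetical sample standing in for games["Date"]
sample_dates = pd.Series(["2020-01-01", "2020-01-03", "2020-02-02", "2020-02-05"])
gap = longest_gap(sample_dates)
print(f"Longest break between games: {gap.days} days")
```

On the real data you would pass `games["Date"]` instead of the sample series.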
## Wins, losses and draws

```python
fig, ax = plt.subplots(figsize=(15,6))
plt.title("Wins, Losses and Draws")
sns.countplot(data=games, x='W/L', palette="Greys", edgecolor="black")
```

The good news from this data is that I win more than I lose... but there is plenty of room for improvement!

## Wins with white vs black pieces

```python
fig, ax = plt.subplots(figsize=(15,6))
plt.title("Wins, Losses and Draws by Colour")
sns.countplot(data=games, x='W/L', hue="Colour", palette={"Black": "Grey", "White": "White"}, edgecolor="black");
```

This clearly shows that I am stronger playing as black.

## Win rate with white vs black pieces

```python
fig, ax = plt.subplots(figsize=(15,6))
ax.set_title("Win Rate by Colour")
sns.barplot(data=games, x='Colour', y='Win', palette={"Black": "Grey", "White": "White"}, edgecolor="black", ax=ax);
```

A higher win rate as black.

## Correlation

```python
# Only numeric columns can be correlated, so select those first
corr = games.select_dtypes(include="number").corr()
fig, ax = plt.subplots(1, 1, figsize=(14, 8))
sns.heatmap(corr, cmap="Greys", annot=True, fmt='.2f', linewidths=.05, ax=ax).set_title("Chess Results Correlation Heatmap")
fig.subplots_adjust(top=0.93)
```

We can see an immediate negative correlation of Win with Rating Difference and Moves.

## Moves in a typical game

```python
fig = plt.figure(figsize=(14,8))
ax = fig.add_subplot(1,1,1)
ax.set_title("How many moves in my typical game?")
sns.histplot(games, x="Moves", hue="Colour", palette={"Black": "Black", "White": "Grey"})
plt.close(2)
```

Most of my games are around 25 to 30 moves in length.

## Moves vs wins

```python
fig = plt.figure(figsize=(14,8))
ax = fig.add_subplot(1,1,1)
ax.set_title("Does the amount of moves affect my win rate?")
sns.histplot(games, x="Moves", hue="W/L", multiple="stack", palette={"Loss": "Black", "Win": "Gray", "Draw": "lightgray"})
plt.close(2)
```

My win rate does seem to decrease as more moves are taken - around the 40 to 80 range is a problem. The number of draws also increases as the number of moves goes up. I seem to win more in the sub-35 move range.
Let's confirm that...

```python
# observed=False keeps the empty move bins in the result
grouped_df = games.groupby(['W/L', pd.cut(games['Moves'], 10)], observed=False)
grouped_df = grouped_df.size().unstack().transpose()
total_games = grouped_df["Win"] + grouped_df["Loss"] + grouped_df["Draw"]
total_wins = grouped_df["Win"]
grouped_df["Win Rate %"] = round((total_wins / total_games) * 100, 0)
grouped_df
```

| W/L | Draw | Loss | Win | Win Rate % |
| --------------- | ---- | ---- | --- | ---------- |
| Moves | | | | |
| (0.846, 16.4\] | 1 | 5 | 12 | 67 |
| (16.4, 31.8\] | 0 | 37 | 44 | 54 |
| (31.8, 47.2\] | 2 | 19 | 29 | 58 |
| (47.2, 62.6\] | 9 | 17 | 14 | 35 |
| (62.6, 78.0\] | 0 | 4 | 3 | 43 |
| (78.0, 93.4\] | 1 | 0 | 2 | 67 |
| (93.4, 108.8\] | 0 | 0 | 0 | NaN |
| (108.8, 124.2\] | 0 | 0 | 0 | NaN |
| (124.2, 139.6\] | 0 | 0 | 0 | NaN |
| (139.6, 155.0\] | 1 | 0 | 0 | 0 |

As suspected, there is only a 35% win rate in the 47-63 move bin, and a 43% win rate in the 63-78 move bin. Seems like a good idea to practise the endgame more, right?

## Opponent's rating vs wins

```python
fig = plt.figure(figsize=(14,8))
ax = fig.add_subplot(1,1,1)
ax.set_title("Does my opponent's rating affect my win rate?")
sns.histplot(games, x="Rating Difference", hue="Win", palette={0: "Black", 1: "Grey"})
plt.close(2)
```

There is clearly a higher loss rate against higher rated opponents (+), which I think is to be expected.

## Time pressure vs wins

```python
fig = plt.figure(figsize=(14,8))
plt.title("How is time pressure affecting my game?")
sns.countplot(data=games, x='Time', hue="W/L", palette={"Win":"#CCCCCC", "Loss":"Grey", "Draw":"White"}, edgecolor="Black");
```

I am overwhelmingly better at 30 and 10 minute games, while quicker games fare much worse. The lesson to be learnt here: take your time and play longer games.
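To put numbers on the time control chart, the win rate per time control can be computed directly. This is a minimal sketch, assuming the `games` DataFrame has the `Time` and `Win` columns created earlier; the `win_rate_by` helper and the sample rows are hypothetical and only there to make the example self-contained.

```python
import pandas as pd


def win_rate_by(games: pd.DataFrame, column: str) -> pd.Series:
    """Percentage of games won for each value of `column`."""
    # 'Win' is 1 for a win and 0 otherwise, so its mean is the win rate
    return (games.groupby(column)["Win"].mean() * 100).round(0)


# Hypothetical sample rows, not real results
sample = pd.DataFrame({
    "Time": ["30 min", "30 min", "10 min", "5 min", "5 min"],
    "Win": [1, 1, 1, 0, 1],
})
rates = win_rate_by(sample, "Time")
print(rates)
```

The same helper should work for other grouping columns, for example `win_rate_by(games, "Colour")`.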
## Rating vs wins

```python
fig = plt.figure(figsize=(14,8))
ax = fig.add_subplot(1,1,1)
ax.set_title("How does my rating affect wins?")
sns.histplot(games, x="My Rating", hue="Win", multiple="dodge", palette={0: "Black", 1: "Grey"})
plt.close(2)
```

There is a pattern of high losses, then an increase in rating, higher wins, then high losses again - this must be a development pattern in action. Importantly, I need more experience playing at the higher levels to match the number of games in the 1000 - 1200 range. The count in the 1400 - 1600 range should be just as high before I can break into the 1600 - 1800 range.

## Final words

I hope you enjoyed this tutorial. Now you have a way to monitor, track and analyse your Chess.com games archive to identify trends. Some of the actions this analysis has led me to are:

* Concentrating on improving the endgame.
* Increasing my exposure to higher rated games.
* Strengthening play with the White pieces.
* Playing more consistently to ensure my rating is accurate.

If there are any other analytical questions you'd like to ask of this dataset, let me know in the comments below and I'll update the article. If you want to export the data to CSV you can use something like this on the `games` DataFrame:

```python
import os

# Save the CSV in the parent of the current working directory
path = os.path.join(os.path.dirname(os.getcwd()), 'my-chess-games-data.csv')
games.to_csv(path, index=False)
```

Multiple authentication schemes with ASP.NET Core and Azure Active Directory

Fri, 25 Jun 2021 16:49:00 GMT

I recently came across an interesting and challenging problem. I was asked to add Azure Active Directory (AAD) authentication to an existing ASP.NET Core web app, which already had two sign in options. I had added AAD to an application as the only sign in option before, but not alongside other sign in options.

I found that within the documentation [adding AAD to an application as the only sign in option](https://docs.microsoft.com/en-us/azure/active-directory/develop/quickstart-v2-aspnet-core-webapp) was fairly straightforward - as mentioned I’d done this before. However, when trying to add it as a third authentication scheme, things got a little more tricky. There was some guidance for [multiple authentication](https://github.com/AzureAD/microsoft-identity-web/wiki/Multiple-Authentication-Schemes) but not much.

Although this article is not extensive and I can’t share all the code because it was at work, hopefully it will provide enough information to help you out if you find yourself attempting the same thing. This article is certainly not a tutorial, more of a reflection on how I arrived at the solution.

## The starting point

The application I was working on already had two sign in options. There was a selection screen flow which looked something like the image below. Another option would need adding to this for internal AAD users. The first and second options would go off to the existing sign in providers, and the third would direct to the AAD / Microsoft Identity sign in page. Excuse the bad flow diagram 😆

The existing authentication schemes were configured in the `Startup` class using a method `AddAndConfigureExternalAuthentication`. I have only included relevant parts in the code snippets, so these are not working examples.

```csharp [Startup.cs]
using Microsoft.Identity.Web.UI;
using Microsoft.IdentityModel.Protocols.OpenIdConnect;
using Microsoft.OpenApi.Models;
...

namespace ShedloadOfCode.Web
{
    public class Startup
    {
        private readonly IConfiguration _configuration;
        private readonly IHostEnvironment _hostEnvironment;

        public Startup(IConfiguration configuration, IHostEnvironment hostEnvironment)
        {
            _configuration = configuration;
            _hostEnvironment = hostEnvironment;
        }

        public void ConfigureServices(IServiceCollection services)
        {
            ...
            services.AddAndConfigureExternalAuthentication(_configuration);
            ...
        }
        ...
    }
}
```

The app handled sign in and sign out within an `AccountController`. The `ExternalLogin` action is particularly important: when an option in the diagram is selected, this action takes the given authentication scheme and issues a new challenge, redirecting to the relevant identity provider:

```csharp [AccountController.cs]
using Microsoft.AspNetCore.Authentication;
using Microsoft.AspNetCore.Authorization;
using Microsoft.AspNetCore.Mvc;
using Microsoft.Extensions.Configuration;
using Microsoft.Extensions.Options;
using System.Linq;
using System.Threading.Tasks;

namespace ShedloadOfCode.Web.Controllers
{
    public class AccountController : Controller
    {
        ...

        [HttpGet]
        [AllowAnonymous]
        public async Task<IActionResult> Login(string returnUrl = null)
        {
            var result = await _appAuthenticationHandler.SignInAsync(returnUrl, this);
            return result;
        }

        [HttpPost]
        [AllowAnonymous]
        public async Task<IActionResult> Login(
            LoginViewModel credentials,
            string returnUrl = null)
        {
            var result = await _appAuthenticationHandler.SignInAsync(
                credentials, returnUrl, this);
            return result;
        }

        public new IActionResult SignOut()
        {
            var callbackUrl = Url.Action("Index", "Home");
            HttpContext.ClearAllTempData();
            return _appAuthenticationHandler.SignOut(callbackUrl, this);
        }

        public IActionResult SignedOut()
        {
            if (User.Identity.IsAuthenticated)
            {
                return RedirectToAction(nameof(HomeController.Welcome), "Home");
            }

            return RedirectToAction(nameof(HomeController.Index), "Home");
        }

        [HttpGet]
        public async Task<IActionResult> Selector()
        {
            if ((await _authenticationSchemeProvider.GetRequestHandlerSchemesAsync()).Count() < 2)
            {
                return NotFound();
            }

            return View();
        }

        [HttpGet]
        [AllowAnonymous]
        public async Task<IActionResult> ExternalLogin(
            [FromQuery] string provider,
            [FromQuery] string returnUrl = "/")
        {
            if ((await _authenticationSchemeProvider.GetRequestHandlerSchemesAsync()).Count() < 2)
            {
                return NotFound();
            }

            string authenticationScheme = _appAuthenticationHandler.GetAuthenticationScheme(provider);

            if (string.IsNullOrWhiteSpace(authenticationScheme))
            {
                ModelState.AddModelError(nameof(provider), "Select a sign in option");
                return View("Selector");
            }

            // Issue a challenge for the chosen scheme, returning to LoginCallback afterwards
            var auth = new AuthenticationProperties
            {
                RedirectUri = Url.Action(nameof(LoginCallback), new { provider, returnUrl })
            };

            return new ChallengeResult(authenticationScheme, auth);
        }

        public IActionResult LoginCallback(
            string provider,
            string returnUrl = "~/")
        {
            if (User.Identity.IsAuthenticated)
            {
                return LocalRedirect(string.IsNullOrEmpty(returnUrl) ? "~/" : returnUrl);
            }

            return RedirectToAction(nameof(Selector), new { returnUrl = returnUrl });
        }
    }
}
```

As you might have noticed, this controller had a few helper methods injected from a service. I added a new value 'AAD' to the `GetAuthenticationScheme` lookup method - this would return an authentication scheme called 'AzureAd':

```csharp [FederationAppAuthenticationHandler.cs]
using System;
using System.Threading.Tasks;
using Microsoft.AspNetCore.Authentication;
using Microsoft.AspNetCore.Authentication.Cookies;
using Microsoft.AspNetCore.Authentication.OpenIdConnect;
using Microsoft.AspNetCore.Authentication.WsFederation;
using Microsoft.AspNetCore.Http;
using Microsoft.AspNetCore.Mvc;

namespace ShedloadOfCode.Web.Services
{
    public class FederationAppAuthenticationHandler : IAppAuthenticationHandler
    {
        private readonly IHttpContextAccessor _httpContextAccessor;

        public FederationAppAuthenticationHandler(
            IHttpContextAccessor httpContextAccessor)
        {
            _httpContextAccessor = httpContextAccessor;
        }

        public Task<IActionResult> SignInAsync(
            string returnUrl,
            Controller controller)
        {
            throw new NotSupportedException("No such page exists");
        }

        public Task<IActionResult> SignInAsync(
            LoginViewModel credentials,
            string returnUrl,
            Controller controller)
        {
            throw new NotSupportedException();
        }

        public IActionResult SignOut(string callbackUrl, Controller controller)
        {
            var provider = _httpContextAccessor.HttpContext.User.AuthenticationProvider();
            var authenticationScheme = GetAuthenticationScheme(provider);

            // Sign out of both the local cookie and the external provider
            return controller.SignOut(
                new AuthenticationProperties { RedirectUri = callbackUrl },
                CookieAuthenticationDefaults.AuthenticationScheme,
                authenticationScheme);
        }

        public string GetAuthenticationScheme(string provider)
        {
            string authenticationScheme = null;

            if (String.Equals("FirstAuthenticationProviderName", provider, StringComparison.OrdinalIgnoreCase))
            {
                authenticationScheme = WsFederationDefaults.AuthenticationScheme;
            }
            else if (String.Equals("SecondAuthenticationProviderName", provider, StringComparison.OrdinalIgnoreCase))
            {
                authenticationScheme = OpenIdConnectDefaults.AuthenticationScheme;
            }
            else if (String.Equals("AAD", provider, StringComparison.OrdinalIgnoreCase))
            {
                authenticationScheme = "AzureAd";
            }

            return authenticationScheme;
        }
    }
}
```
## My first steps

I recalled how I had added AAD as the only sign in method to an app before, and tried those steps first:

* Create an app registration in the AAD in the Azure Portal
* Create a sign-in and sign-out route for the new app registration, and enable ID tokens
* Create a client secret for the new app registration
* Install [Microsoft.Identity.Web](https://www.nuget.org/packages/Microsoft.Identity.Web) and [Microsoft.Identity.Web.UI](https://www.nuget.org/packages/Microsoft.Identity.Web.UI) Nuget packages in the project
* Update `appsettings.json` with the app registration details (found in the 'Overview' tab in the Azure portal)

```json [appsettings.json]
{
  "AzureAd": {
    "Instance": "https://login.microsoftonline.com/",
    "Domain": "yourdomain.onmicrosoft.com",
    "ClientId": "11adca46-d907-4803-945f-demoClientId",
    "TenantId": "b3b8b34a82f9-c69a-4da1-a5f2-demoTenantId",
    "ClientSecret": ".dVv3r.2g2ED6_Xb-bSaXROml~demoClientSecret",
    "MetadataAddress": "https://login.microsoftonline.com/b3b8b34a82f9-c69a-4da1-a5f2-demoTenantId/v2.0/.well-known/openid-configuration",
    "CallbackPath": "/signin-oidc",
    "SignedOutCallbackPath": "/signout-callback-oidc",
    "SignedOutRedirectUri": "/"
  }
  ...
}
```

* Add the same method I had used before for AAD authentication to `Startup.cs`, called `AddMicrosoftIdentityWebApp`, which is also in the [documentation](https://docs.microsoft.com/en-us/azure/active-directory/develop/quickstart-v2-aspnet-core-webapp#more-information). I also initialised the Microsoft.Identity.Web.UI package with `AddMicrosoftIdentityUI` to handle the sign in screen.

```csharp [Startup.cs]
using Microsoft.Identity.Web.UI;
using Microsoft.IdentityModel.Protocols.OpenIdConnect;
using Microsoft.OpenApi.Models;
...

namespace ShedloadOfCode.Web
{
    public class Startup
    {
        private readonly IConfiguration _configuration;
        private readonly IHostEnvironment _hostEnvironment;

        public Startup(IConfiguration configuration, IHostEnvironment hostEnvironment)
        {
            _configuration = configuration;
            _hostEnvironment = hostEnvironment;
        }

        public void ConfigureServices(IServiceCollection services)
        {
            ...
            services.AddAndConfigureExternalAuthentication(_configuration);

            services.AddAuthentication()
                .AddMicrosoftIdentityWebApp(_configuration,
                    configSectionName: "AzureAd",
                    openIdConnectScheme: "AzureAd",
                    cookieScheme: "AzureAdCookies");

            services.AddRazorPages()
                .AddMicrosoftIdentityUI();
            ...
        }
        ...
    }
}
```

I had to add a distinct `openIdConnectScheme` and `cookieScheme` to [avoid scheme conflicts](https://stackoverflow.com/questions/56433112/system-invalidoperationexception-scheme-already-exists-identity-application) when using this approach. `configSectionName` just pulls the relevant config section `AzureAd` from `appsettings.json`.

However, after selecting the new sign in option for AAD, being sent to the Microsoft Identity sign in page, entering credentials and clicking login, I was redirected back to the application but wasn't authenticated! I was very confused by this, especially since it had worked so well in other apps as the only sign in method. Plus, the docs for [single authentication](https://github.com/AzureAD/microsoft-identity-web/wiki/web-apps) and [multiple authentication](https://github.com/AzureAD/microsoft-identity-web/wiki/Multiple-Authentication-Schemes) quite clearly show this as the recommended approach.

## The solution - using AddOpenIdConnect()

So I came across this [super helpful article](https://www.codeproject.com/Articles/5297820/Azure-Active-Directory-Authentication-with-OpenID), and thought okay, I should try using the `AddOpenIdConnect` method to sign in. I added the configuration for each option... and this time, after the redirect back to the application, the user was authenticated! 😄

```csharp [Startup.cs]
using Microsoft.Identity.Web.UI;
using Microsoft.IdentityModel.Protocols.OpenIdConnect;
using Microsoft.OpenApi.Models;
...

namespace ShedloadOfCode.Web
{
    public class Startup
    {
        private readonly IConfiguration _configuration;
        private readonly IHostEnvironment _hostEnvironment;

        public Startup(IConfiguration configuration, IHostEnvironment hostEnvironment)
        {
            _configuration = configuration;
            _hostEnvironment = hostEnvironment;
        }

        public void ConfigureServices(IServiceCollection services)
        {
            ...
            services.AddAndConfigureExternalAuthentication(_configuration);

            // Bind the "AzureAd" section of appsettings.json to a typed options class
            var azureAdConfiguration = _configuration.GetSection("AzureAd").Get<AzureAdConfigOptions>();

            services.AddAuthentication()
                .AddOpenIdConnect("AzureAd", options =>
                {
                    options.SignInScheme = CookieAuthenticationDefaults.AuthenticationScheme;
                    options.Authority = azureAdConfiguration.MetadataAddress;
                    options.ClientId = azureAdConfiguration.ClientId;
                    options.ClientSecret = _configuration.GetValue<string>(azureAdConfiguration.ClientSecret);
                    options.CallbackPath = new PathString(azureAdConfiguration.CallbackPath);
                    options.MetadataAddress = azureAdConfiguration.MetadataAddress;
                    options.SignedOutCallbackPath = new PathString(azureAdConfiguration.SignedOutCallbackPath);
                    options.SignedOutRedirectUri = new PathString(azureAdConfiguration.SignedOutRedirectUri);
                    options.ResponseType = OpenIdConnectResponseType.Code;
                    options.UsePkce = true;
                    options.Scope.Add("openid");
                    options.Scope.Add("profile");
                    options.SaveTokens = true;

                    options.Events.OnSignedOutCallbackRedirect += context =>
                    {
                        context.Response.Redirect(azureAdConfiguration.SignedOutRedirectUri);
                        context.HandleResponse();
                        return Task.CompletedTask;
                    };

                    options.Events.OnTokenValidated = async (context) =>
                    {
                        if (context.Principal.Identity.IsAuthenticated)
                        {
                            // Set auth provider using an extension method to facilitate logout
                            context.Principal.SetAuthenticationProvider("AAD");

                            // Get AAD username from claims
                            var emailAddress = context.Principal.Claims
                                .Where(c => c.Type == "preferred_username")
                                .Select(c => c.Value)
                                .ToList()
                                .First();

                            // Get AAD security groups from claims
                            var groups = context.Principal.Claims
                                .Where(c => c.Type == "groups")
                                .Select(c => c.Value)
                                .ToList();
                        }
                    };
                });

            services.AddRazorPages()
                .AddMicrosoftIdentityUI();
            ...
        }
        ...
    }
}
```

I set the authentication scheme as `AzureAd` so the controller knows which challenge to issue after the selection screen. After the token validates, I can see the user is authenticated and I can get the user details and claims that are returned from AAD. No separate `cookieScheme` needs setting for this approach either; it will just use `CookieAuthenticationDefaults.AuthenticationScheme`, which is 'Cookies'. This code is still using the values we set in `appsettings.json`, just mapping them to `AzureAdConfigOptions` and using them individually.

```csharp [AzureAdConfigOptions.cs]
namespace ShedloadOfCode.Web.Options
{
    public class AzureAdConfigOptions
    {
        public string Instance { get; set; }
        public string Domain { get; set; }
        public string ClientId { get; set; }
        public string TenantId { get; set; }
        public string ClientSecret { get; set; }
        public string MetadataAddress { get; set; }
        public string CallbackPath { get; set; }
        public string SignedOutCallbackPath { get; set; }
        public string SignedOutRedirectUri { get; set; }
    }
}
```

I was really pleased with this outcome. Usually, when it comes to searching documentation, reading Stack Overflow and general Google-Fu, I’m quite skilled. However the answer to this one evaded me for some time! I traced back the usage of `AddOpenIdConnect` within the `AddMicrosoftIdentityWebApp` method in the Microsoft.Identity.Web [source code](https://github.com/AzureAD/microsoft-identity-web/blob/master/src/Microsoft.Identity.Web/WebAppExtensions/MicrosoftIdentityWebAppAuthenticationBuilderExtensions.cs).
## Getting AAD group information

One requirement for authorisation was to only allow users in a specific AAD group to access the application - others needed to ask permission to be added to the AAD group. I retrieved the groups in the solution code in the `groups` variable, however for group claims to be returned from AAD, they need enabling in Azure.

To enable group claims, you head back to the app registration and select 'Add groups claim' inside 'Token configuration'. This allows the AAD group information to be returned for the authenticated user. You can then access these as claims and use the specific groups a user belongs to for authorisation and access control.

## Next steps

My next steps will be code clean up. I’ll move the `ClientSecret` into [Azure Key Vault](https://azure.microsoft.com/en-gb/services/key-vault/), and move the AAD authentication code into an `AddAndConfigureAzureAdAuthentication` method to tidy things up.

So now any user who selects that new option, is part of the organisation's AAD and is within the specific AAD group can access the application 😄

Well it was a tough journey, but I got there in the end. I would be lying if I said I didn't nearly give up on it a few times! I really hope this article has helped you to avoid the issues I had trying to set this up.

If you enjoyed this article be sure to check out [other articles](/) on the site including:

* [Searching for text in PDFs at increasing scale](/blog/searching-for-text-in-pdfs-at-increasing-scale/) with C# and Python

Building a website analytics dashboard with Power BI and Google Sheets

Fri, 18 Jun 2021 10:40:00 GMT

In a previous article I demonstrated [creating a website analytics solution using AWS Lambda and Google Sheets](/blog/creating-your-own-website-analytics-solution-with-aws-lambda-and-google-sheets/). The data collected was then used to build a Power BI dashboard. I chose Power BI because I’ve worked with it in a professional setting, and for quickly putting together high quality, interactive dashboards, it’s very good. There are a few gotchas to watch out for with it, but on the whole it’s quite straightforward to use.

In this article, we’ll go over how I built the usage analytics dashboard for this site and in the process you’ll learn some fundamental Power BI skills. Unlike some tutorials, this is a real world use case: many workplaces have digital products and want to monitor how well they are performing to improve them for their customers.

Although this tutorial will be suitable for beginners, we'll be diving straight into the skills needed to build a professional report, including using the Power Query Editor and DAX (data analysis expressions). I think jumping into the deep end is a good thing, as gaps in understanding can be filled in later. Being able to import website usage data and turn it into valuable information is a very useful skill to have. By the end of this article you should be able to develop an entire dashboard from scratch without any prior knowledge of Power BI. Let’s begin!

## Requirements

Firstly I set out a list of requirements for the metrics and functionality the dashboard would need to have:

* Total visitors card
* Total page views card
* Total page views by device and timezone bar chart
* Total page views by browser and operating system bar chart
* Total views by path table
* Daily page views time series
* Hourly breakdown for any given day
* Date slicer

The finished product should end up looking something like this:

## Download Power BI Desktop

Let's start by [downloading Power BI Desktop](https://powerbi.microsoft.com/en-us/downloads/).
Once installed, open Power BI Desktop and you should arrive at a screen which looks like this:

Power BI Desktop receives monthly updates from Microsoft and the layout can change slightly. If you're interested in keeping up to date with the monthly updates, you can find them on the [Power BI blog](https://powerbi.microsoft.com/en-us/blog/) and review previous months' updates on this [previous updates page](https://docs.microsoft.com/en-us/power-bi/fundamentals/desktop-latest-update-archive?tabs=powerbi-desktop).

The first thing you'll always want to do from this start screen is click 'Get data'. That will be our starting point for the next section.

## Building the report

Power BI Desktop has a very interactive interface with lots of things to click on! So rather than writing out all the steps with screenshots, I've added a video in this section that shows the whole dashboard building process from start to finish. This should make it much easier to follow along 😄

The steps in the video are:

1. Get data from the Google Sheet using this URL: https://docs.google.com/spreadsheets/d/1jIUARNqb02c0xzTqj6AhE94M38WnAEcymSfsXfZm06s/edit?usp=sharing
2. Change the URL ending from `/edit?usp=sharing` to `/export?format=xlsx`
3. Transform data in Power Query by removing blank rows then adding a date, hour and index column
4. Add [calculated columns](https://docs.microsoft.com/en-us/power-bi/transform-model/desktop-tutorial-create-calculated-columns) and [measures](https://docs.microsoft.com/en-us/power-bi/transform-model/desktop-tutorial-create-measures) with [DAX](https://docs.microsoft.com/en-us/dax/) to calculate visitors, page views and average time on page in seconds
5. Create report visuals
6. Style the report
7. Add a drill through page for hourly analysis
8. Add a toggle between browser, device, OS and timezone visuals using bookmarks

The Google Sheet data we're using is test data - not the live site data. It’s the same structure, but only contains logs from early testing activity. For step 4, you’ll find all the DAX you’ll need underneath the video.

**DAX statements for step 4 as promised 😄**

In step 4, you can see I'm first calculating the next row's session ID and created at date. Then if the session ID is different from the previous row, I know it's a completely different person / session. We can't predict how long that last page view event was, but for all the others we can calculate the time between dates using `DATEDIFF` to find `TimeOnPageInSeconds`. The average of that column gives the `AverageTimeOnPageInSeconds` measure - concatenated with an 's' so it displays units nicely in the visual. Allowing users to quickly interpret the units of measurement is very important.

**Visitors**

```dax [visitors.dax]
Visitors =
CALCULATE(
    COUNT( EventsLog[EventType] ),
    EventsLog[EventType] = "Visit Site"
)
```

**Page Views**

```dax [page-views.dax]
Page Views = COUNT(EventsLog[EventType])
```

**NextSessionId**

```dax [next-session-id.dax]
NextSessionId =
VAR PreviousIndex =
    CALCULATE(
        MAX( EventsLog[Index] ),
        FILTER( EventsLog, EventsLog[Index] < EARLIER( EventsLog[Index] ) )
    )
VAR Result =
    CALCULATE(
        MAX( EventsLog[SessionId] ),
        FILTER( EventsLog, EventsLog[Index] = PreviousIndex )
    )
RETURN
    Result
```

**NextCreatedAt**

```dax [next-created-at.dax]
NextCreatedAt =
VAR PreviousIndex =
    CALCULATE(
        MAX( EventsLog[Index] ),
        FILTER( EventsLog, EventsLog[Index] < EARLIER( EventsLog[Index] ) )
    )
VAR Result =
    CALCULATE(
        MAX( EventsLog[CreatedAt] ),
        FILTER( EventsLog, EventsLog[Index] = PreviousIndex )
    )
RETURN
    Result
```

**TimeOnPageInSeconds**

```dax [time-on-page.dax]
TimeOnPageInSeconds =
IF(
    EventsLog[SessionId] <> EventsLog[NextSessionId],
    0,
    DATEDIFF(EventsLog[CreatedAt], EventsLog[NextCreatedAt], SECOND)
)
```

**AverageTimeOnPageInSeconds**

```dax [average-time-on-page.dax]
AverageTimeOnPageInSeconds =
CONCATENATE(
    ROUND( AVERAGE(EventsLog[TimeOnPageInSeconds]), 2 ),
    "s"
)
```

In **step 7** I used a drill through page. To use drill through, a user right-clicks a data point on another report page and drills through to the focused page to get details that are filtered to that context. This effectively 'filters' the destination page by whichever data point you drilled through on. In our case, when you right click a `Date` data point on the time series on the 'Dashboard' page, you drill through to the 'Hourly Analysis' page, which breaks down the usage by hour for that day. This works because on the 'Hourly Analysis' page we added the `Date` column to the drill through section, which enables drill through for any visual using that column.

Once you wrap your head around that, it becomes a very powerful tool for providing deeper insight without overloading pages. You can use it to separate the main high-level visualisations from more low-level analysis. Some users might only want the high-level information, but more advanced users might want to drill through to the details. This lets you accommodate both. So usually I just want to see the day by day page views, but if I see a spike on any given day, I might drill through to see at what hours the page views happened.

In **step 8** I used bookmarks to toggle between visuals. Bookmarks are created first and then can be linked to buttons. Bookmarks sort of 'take a snapshot' of which visuals are visible or hidden and what filters have been applied (if any). So in this case we have several bookmarks that each show one visual at a time, whilst hiding the others. We then attach those bookmarks to the buttons as actions, so when a button is clicked its bookmark 'snapshot' is applied. This can be time consuming to set up, but works well for simple show and hide or toggle functionality like this. The main use for this is to avoid overcrowding your report page. It also gives it more of an app-like feel.

## Job done! Where to next?
I hope you’ve enjoyed this tutorial, and have picked up some knowledge of Power BI you can use in other projects. You should now be able to build a robust professional dashboard from scratch, so well done! You might have noticed that using Power BI is as much about preparing and transforming the data as it is about the visuals themselves. The 'garbage in, garbage out' principle is very important: your report will only ever be as good as the data fed into it. So always know your underlying data inside out and question the quality of it. We used DAX to calculate the average time on the page in the tutorial, but did we really need that metric? Will knowing how much time a user spent on the page help to deliver a better product? It might, it might not. Knowing what to measure is the absolute key skill. Don't overcomplicate a report if you don't have to; keep it as simple as possible. Follow the quote 'Don’t include a single line in your code which you could not explain to your grandmother in a matter of two minutes' - one on [the favourite quotes list](/blog/programming-quotes-that-offer-wisdom-and-motivation/) and as applicable to analysis and reports as it is to code. If you just keep including measures blindly, it will crowd the report with noise, and soon you'll face the dreaded analysis paralysis - you're tracking so much stuff but it doesn't offer any insight or call to action. I wanted a way to present some simple stats on how the site is being received, which pages are popular and which need improving. When presenting data, keep the audience and purpose in mind. Although that sounds simplistic, it can be easy to forget those things. I think we’ve explored some key topics, but there is a lot more to learn for those who wish to. One thing we didn't cover is [relationships](https://docs.microsoft.com/en-us/power-bi/transform-model/desktop-create-and-manage-relationships), which are important for more complex multi-source data models. 
Here are my top recommendations for where to go next if you want to learn more about Power BI: * [Power BI Docs](https://docs.microsoft.com/en-us/power-bi/) - Official Power BI docs from Microsoft * [Analysing and Visualising Data with Power BI Course](https://www.youtube.com/watch?v=1c01r_pAZdk&list=PL1N57mwBHtN0JFoKSR0n-tBkUJHeMP2cP) - full course from Microsoft * [Guy in a cube](https://www.youtube.com/channel/UCFp1vaKzpfvoGai0vE5VJ0w) - great YouTube channel for Power BI tutorials * [SQLBI](https://www.sqlbi.com/) - articles on business intelligence, Power BI, DAX and more * [DAX Guide](https://dax.guide/) - Browse DAX functions * [DAX function reference](https://docs.microsoft.com/en-us/dax/dax-function-reference) - Browse DAX functions in the Microsoft docs
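As a footnote to step 4: if you are more at home in Python than DAX, the time on page logic can be sketched in a few lines. This is only an illustration of the same idea, not what the report runs. The dictionary keys mirror the `SessionId` and `CreatedAt` columns, while the `Path` key, the function names and the date format are assumptions of mine.

```python
from collections import defaultdict
from datetime import datetime

# Assumed timestamp format; adjust it to match your own log.
DATE_FORMAT = "%d-%m-%Y %H:%M:%S"


def time_on_page_seconds(events):
    """Return (path, seconds) pairs, one per event.

    Within each session the events are sorted by time and each event is
    paired with the one that follows it. The last event in a session has
    nothing after it, so it gets 0, like the DAX calculated column.
    """
    by_session = defaultdict(list)
    for event in events:
        by_session[event["SessionId"]].append(event)

    results = []
    for session_events in by_session.values():
        ordered = sorted(
            session_events,
            key=lambda e: datetime.strptime(e["CreatedAt"], DATE_FORMAT),
        )
        for current, following in zip(ordered, ordered[1:]):
            start = datetime.strptime(current["CreatedAt"], DATE_FORMAT)
            end = datetime.strptime(following["CreatedAt"], DATE_FORMAT)
            results.append((current["Path"], int((end - start).total_seconds())))
        results.append((ordered[-1]["Path"], 0))
    return results


def average_time_on_page(events):
    """Average over every row (zeros included), with an 's' suffix for display."""
    seconds = [s for _, s in time_on_page_seconds(events)]
    if not seconds:
        return "0s"
    return f"{round(sum(seconds) / len(seconds), 2)}s"
```

Like the DAX version, the zeros for the final page view of each session are included in the average, so the figure will understate the real time on page.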

Automated deployment of a Vue Flask app using Azure Pipelines

Tue, 15 Jun 2021 12:15:00 GMT

In this article we will look at how to automate the deployment of a Vue Flask app to Azure App Service with Azure Pipelines. In a previous article I covered [building a Vue Flask app](/blog/query-sql-and-download-csv-and-xlsx-in-flask/) to query a SQL database and return data to the browser to view or download. We will start with the same template, prepare it for deployment and configure the app service and pipeline in Azure. The result will be a deployed Flask app which serves a static Vue.js frontend. If you have your own application, you can adapt these steps. Before starting, you will need an [Azure subscription](https://azure.microsoft.com/en-gb/free/) alongside Python, Node.js and Yarn installed. ## Download the template First go to this [public repository](https://github.com/gtalarico/flask-vuejs-template) and download the project template as a zip file. Extract the folder contents and open the folder in a code editor like Visual Studio Code. ## Configure the template for deployment One thing needs adding before we deploy. Create a new file `startup.py` at the top of the folder - same directory as `run.py`. This will be the file Azure App Service uses to start the application. ```python [/startup.py] """ The startup file for Azure App Service that just imports the app object. 
""" from app import app ``` ## Setting up the automated pipeline Here are the step by step actions the video below will go through to create the automated pipeline: * Create a new Azure App Service in the [Azure portal](https://portal.azure.com/#home) * Set the environment variable `SCM_DO_BUILD_DURING_DEPLOYMENT` to true in the App Service * Create a new project in [Azure DevOps](https://dev.azure.com/) * Create an Azure Repo in the project * Push the application code to the Azure Repo * Set up an Azure Pipeline in the project * Build and deploy the app to Azure App Service * Check the site is deployed (had to hard refresh with Ctrl + F5) 😄 Setting the `SCM_DO_BUILD_DURING_DEPLOYMENT` environment variable to true took me a while to figure out. It's in [this section of the docs](https://docs.microsoft.com/en-us/azure/devops/pipelines/ecosystems/python-webapp?view=azure-devops#run-the-pipeline) and states: > If your app fails because of a missing dependency, then your requirements.txt file was not processed during deployment. This behavior happens if you created the web app directly on the portal rather than using the az webapp up command as shown in this article. The az webapp up command specifically sets the build action SCM_DO_BUILD_DURING_DEPLOYMENT to true. If you provisioned the app service through the portal, however, this action is not automatically set. The YAML I used for the build and deploy steps looked like this: ```yaml [pipeline.yml] # Python to Linux Web App on Azure # Build your Python project and deploy it to Azure as a Linux Web App. # Change python version to one thats appropriate for your application. 
# https://docs.microsoft.com/azure/devops/pipelines/languages/python trigger: - master variables: # Azure Resource Manager connection created during pipeline creation azureServiceConnectionId: 'f59ed866-b638-412b-bdce-02504965ee64' # Web app name webAppName: 'vue-flask-app' # Agent VM image name vmImageName: 'ubuntu-latest' # Environment name environmentName: 'vue-flask-app' # Project root folder. Point to the folder containing manage.py file. projectRoot: $(System.DefaultWorkingDirectory) # Python version: 3.6 pythonVersion: '3.6' stages: - stage: Build displayName: Build stage jobs: - job: BuildJob pool: vmImage: $(vmImageName) steps: - task: UsePythonVersion@0 inputs: versionSpec: '$(pythonVersion)' displayName: 'Use Python $(pythonVersion)' - task: NodeTool@0 inputs: versionSpec: '10.x' displayName: 'Install Node.js' - script: pip install --upgrade pip displayName: 'Upgrade pip' workingDirectory: $(projectRoot) - script: pip install pipenv displayName: 'Install pipenv' - script: python -m pipenv install --dev displayName: 'Install Python dependencies' - script: python -m pipenv run pip freeze > requirements.txt displayName: 'Generate requirements.txt' - script: | curl -o- -L https://yarnpkg.com/install.sh | bash -s -- --version 1.9.4 export PATH="$HOME/.yarn/bin:$PATH" yarn install yarn upgrade displayName: 'Install Node dependencies' - script: yarn build displayName: 'Build Vue app' - script: | pip install codecov pip install pytest pip install pytest-sugar pip install pytest-cov pip install pytest-azurepipelines python -m pipenv run pytest --junitxml=$(System.DefaultWorkingDirectory)/testResults.xml --cov=app --cov-report=xml --cov-report=html displayName: 'Run tests with pytest' - task: PublishTestResults@2 displayName: "Publish test results" inputs: testResultsFiles: '$(System.DefaultWorkingDirectory)/testResults.xml' testRunTitle: '$(Agent.OS) - $(Build.BuildNumber)[$(Agent.JobName)] - Python $(python.version)' failTaskOnFailedTests: true condition: 
succeededOrFailed() - task: PublishCodeCoverageResults@1 displayName: "Publish code coverage" inputs: codeCoverageTool: Cobertura summaryFileLocation: '$(System.DefaultWorkingDirectory)/**/coverage.xml' reportDirectory: '$(System.DefaultWorkingDirectory)/**/htmlcov' - task: ArchiveFiles@2 displayName: 'Archive files' inputs: rootFolderOrFile: '$(projectRoot)' includeRootFolder: false archiveType: zip archiveFile: $(Build.ArtifactStagingDirectory)/$(Build.BuildId).zip replaceExistingArchive: true - upload: $(Build.ArtifactStagingDirectory)/$(Build.BuildId).zip displayName: 'Upload package' artifact: drop - stage: Deploy displayName: 'Deploy Web App' dependsOn: Build condition: succeeded() jobs: - deployment: DeploymentJob pool: vmImage: $(vmImageName) environment: $(environmentName) strategy: runOnce: deploy: steps: - task: UsePythonVersion@0 inputs: versionSpec: '$(pythonVersion)' displayName: 'Use Python version' - task: AzureWebApp@1 displayName: 'Deploy Azure Web App : vue-flask-app' inputs: azureSubscription: $(azureServiceConnectionId) appName: $(webAppName) package: $(Pipeline.Workspace)/drop/$(Build.BuildId).zip startUpCommand: 'gunicorn --bind=0.0.0.0 --workers=4 --timeout 600 startup:app' ``` Your `azureServiceConnectionId` will be different so be sure to change that. ## Deployment was successful! You now have a deployed Vue Flask app with a continuous integration pipeline configured. You can deploy new features with a simple push to the master branch which will trigger the pipeline. You could completely change the application we have deployed and take it in your own direction. Not only that, you might have noticed that this setup also publishes the pytest results and code coverage to the pipeline! Let me know in the comments if this helped you and if you have any questions. I know this was quite Azure specific, but I think you could set up a similar pipeline using AWS or Google Cloud Platform. 
I really like the Vue Flask combination for the ease of creating an interactive experience with Vue, alongside the many packages for data science that Python offers. You could separate this setup and have Vue served from a CDN and Python running as the API layer, but for a quick starter single-deploy setup this is perfect. It might need a little tailoring to your own needs. The template we used in this tutorial used Python 3.6 and pipenv, and your setup might not, so adjust the Pipeline and App Service accordingly. If you enjoyed this article be sure to check out other articles on the site 👍 you may be interested in: * [How to query a database with Python Flask and download data to CSV or XLSX in Vue](/blog/query-sql-and-download-csv-and-xlsx-in-flask/) * [How to upload PDF files to Azure Blob Storage with Vue and Python Flask](/blog/how-to-upload-pdf-files-to-azure-blob-storage-with-vue-and-python-flask/) * [How to import a CSV from Dropbox or GitHub into Google Sheets](/blog/how-to-import-a-csv-from-dropbox-or-github-into-google-sheets/)

Creating your own website analytics solution with AWS Lambda and Google Sheets

Mon, 14 Jun 2021 11:12:00 GMT

After creating and launching this site, I needed a way to capture some simple usage statistics. Although I could have used Google Analytics, which is free, the mantra of “if the product is free, then you are the product” was at the back of my mind. After reading other articles like [roll your own analytics](https://www.pcmaffey.com/roll-your-own-analytics/) and [logging sensor data to Google Sheets via AWS Lambda](https://ncd.io/logging-data-to-google-sheets-through-aws-iotlambda/) I was inspired to give it a go. So for this setup I wanted: * No third party tracking * Free or very low cost * Serverless * Low maintenance * No bloat for fast page load * Completely anonymous data * Only useful data collected - visitors, page views etc * No personal data collected * No cookies and no cookie banner * A simple dashboard to present the analytics I know what you’re thinking, why not use Google Analytics when you’re using Google Sheets anyway? Well, my opinion is that the Google Sheet is my own data, controlled by me. The alternative is capturing lots of information I don’t need - bloating the page load time alongside placing tracking and ad cookies on users' devices. I’m not against Google Analytics but because many sites use it, and Google runs on advertising, it gives it a powerful position - and let’s face it most users (including myself sometimes) are quick to click that ‘Accept cookies’ button without realising just how much tracking they are subjected to. However, I am impressed by the [opt-out browser add-on](https://tools.google.com/dlpage/gaoptout) offered by Google which prevents any data being sent to Google Analytics. 
The plan for how it would work looked like this: * Collect events in state as the user browses the site * The user ends their browsing session * Events data is sent to AWS Lambda function * AWS Lambda function writes the data to a Google Sheet * The Google Sheet acts as the database * Consume the Google Sheet into a dashboard tool like Power BI Desktop * Build the analytics dashboard To determine when to send the analytics events to the AWS Lambda function, I will be adding event listeners to my Vue web app. They will listen for the [`pagehide`](https://developer.mozilla.org/en-US/docs/Web/API/Window/pagehide_event), [`beforeunload`](https://developer.mozilla.org/en-US/docs/Web/API/WindowEventHandlers/onbeforeunload), and [`unload`](https://developer.mozilla.org/en-US/docs/Web/API/Window/unload_event) events, alongside [`visibilitychange`](https://developer.mozilla.org/en-US/docs/Web/API/Document/visibilitychange_event) and [`blur`](https://developer.mozilla.org/en-US/docs/Web/API/Element/blur_event) to handle mobile closing or switching tabs, particularly on iOS. ## Setting up the infrastructure In the video below I replicate the setup to demo how the solution is put together. Creating a Lambda Layer is not covered in the video but I cover it in the section following the video. The step by step actions are: * Create a Google Cloud project * Enable the Google Sheets and Google Drive APIs for the project * Create a service account * Create credentials for the service account * Create and share Google Sheet with service account email * Create AWS Lambda function * Add a Layer to AWS Lambda function for the [gspread](https://pypi.org/project/gspread/) package * Create AWS API Gateway to call function * Call the API endpoint with Postman to test it When creating the AWS Lambda function, I added a file `google_service_account_credentials.json` and pasted in the json from the generated service account credentials. 
This allows the function to use the gspread Python package to read and write to the Google Sheet. I also shared the Google Sheet with the service account email to ensure it had permission to access it. **AWS Lambda function** ```python import json import gspread def lambda_handler(event, context): request_body = json.loads(event["body"]) if type(event["body"]) is str else event["body"] write_events_to_google_sheet(request_body["events"]) return { "statusCode": 200 } def write_events_to_google_sheet(events): gc = gspread.service_account(filename='google_service_account_credentials.json') gsheet = gc.open("Website Analytics") for event in events: row = [ event["sessionId"], event["eventType"], event["createdAt"], event["device"], event["userAgent"], event["browser"], event["os"], event["language"], event["timezone"], event["path"] ] gsheet.sheet1.insert_row(row, index=2) print( str(len(events)) + " events logged to the Google Sheet." ) ``` **Data used to test function** ```json { "body": { "events": [ { "sessionId": "2d885afe-dece-4d1e-829f-e08c305ab32d", "eventType": "visit-site", "createdAt": "01-01-2021 09:21:11", "device": "Desktop", "userAgent": "Chrome", "browser": "Safari", "os": "MacOS", "language": "en-GB", "timezone": "London-GMT", "path": "/" }, { "sessionId": "2d885afe-dece-4d1e-829f-e08c305ab32d", "eventType": "visit-page", "createdAt": "01-01-2021 09:41:11", "device": "Desktop", "userAgent": "Chrome", "browser": "Safari", "os": "MacOS", "language": "en-GB", "timezone": "London-GMT", "path": "/blog/article-1" }, { "sessionId": "2d885afe-dece-4d1e-829f-e08c305ab32d", "eventType": "visit-page", "createdAt": "01-01-2021 09:31:11", "device": "Desktop", "userAgent": "Chrome", "browser": "Safari", "os": "MacOS", "language": "en-GB", "timezone": "London-GMT", "path": "/blog/article-2" }, { "sessionId": "2d885afe-dece-4d1e-829f-e08c305ab32d", "eventType": "visit-page", "createdAt": "01-01-2021 09:51:11", "device": "Desktop", "userAgent": "Chrome", "browser": 
"Safari", "os": "MacOS", "language": "en-GB", "timezone": "London-GMT", "path": "/about" } ] } } ``` **Adding a Layer** You may have seen I added a Layer to the function so it had access to the gspread package (and its dependencies) for interacting with the Google Sheet. This [video](https://youtu.be/3BH79Uciw5w) covers adding a Layer nicely but my steps were: * Open command prompt * Create a folder using `mkdir python` * Move into it with `cd python` and install the package and its dependencies there using `pip install gspread -t .` * Zip the python folder in file explorer * Go to AWS Lambda Layers (Image A below) * Create a new Layer and upload your zip file (Image B below) * You can now use that Layer with any function Now that the function is receiving data and writing it to the Google Sheet, the next thing to focus on is actually capturing events data in the Vue app to send to it. We’ll explore how I did that in the following section. ## Capturing and logging events in Vue Since I was using Nuxt with Vue, I stored the events in the top level `default.vue` component state. I did consider using Vuex but this approach worked well. As the user browses the site and the page changes, the `logVisitPageOnRouteChange` method saves the events in the `analyticsEvents` array. Within the mounted hook, I listen for a number of exit events such as `beforeunload` and `pagehide`. This means whenever a user switches tabs, closes the tab, closes the browser, switches to another app on mobile or just visits another site, the `sendAnalyticsData` method is fired. This logs all the events currently stored in state to the AWS Lambda function, then clears the state to ensure it never logs duplicate records. ```html [layouts/default.vue] ``` There are many helper methods in this component, mostly for identifying things like the browser, timezone, language and OS. To handle switching tabs or apps on mobile I added the `visibilitychange` exit listener. 
This was very effective in capturing events from iOS devices, which proved tricky at first until I [read more on the topic](https://stackoverflow.com/questions/6162188/javascript-browsers-window-close-send-an-ajax-request-or-run-a-script-on-win). I took inspiration from the article [roll your own analytics](https://www.pcmaffey.com/roll-your-own-analytics/) for the `sendBeacon` implementation. I used the [uuid](https://www.npmjs.com/package/uuid) package to generate a random session identifier that persists over tab switching, but not over refreshing the page or closing the browser - a no cookies, privacy first approach. Here is my Google Sheet after sending through quite a bit of test data by interacting with the site. ## Presenting the data in a dashboard Now data is coming in from the Vue app, I needed a way to make sense of it in some form of dashboard. I chose to build an analytics dashboard using Power BI Desktop. It is free to download, it is fairly quick to create a dashboard with, and there is lots of support online to get started. You can get data from your Google Sheet by following these steps: * Go to the Google Sheet * Click Share * Get a link shared with anyone who has the link * Change URL ending `/edit?usp=sharing` to `/export?format=xlsx` * Open Power BI and select get data from Web * Paste in the edited link * Select the name of your sheet and Power BI will load it as a table There are many other ways to present the data held in Google Sheets, use whichever tool you like the most. This is what my dashboard looks like with test data: It’s simple, straightforward and easy to read. It provides all the high level and detailed information I need to see how well the site is being received, which pages are popular, and which need more work. As the data is automatically logged to the Google Sheet, all I need to do to receive the most up to date data in the Power BI dashboard is hit the Refresh button. The styling might not be amazing, but it’s for my eyes only, I’m not out to win any style awards 😆. 
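The link rewrite in the steps above is easy to get wrong by hand, so here is a minimal Python sketch of it. The function name `to_export_url` and the `SHEET_ID` placeholder are mine for illustration, and it assumes the share link ends in `/edit?usp=sharing` exactly as described in the steps. Links with a different ending are rejected rather than guessed at.

```python
SHARE_SUFFIX = "/edit?usp=sharing"


def to_export_url(share_url: str, file_format: str = "xlsx") -> str:
    """Swap the share link ending for the export ending described in the steps."""
    if not share_url.endswith(SHARE_SUFFIX):
        raise ValueError(f"Expected a link ending in {SHARE_SUFFIX}")
    return share_url[: -len(SHARE_SUFFIX)] + f"/export?format={file_format}"


if __name__ == "__main__":
    # SHEET_ID is a placeholder, not a real spreadsheet.
    print(to_export_url("https://docs.google.com/spreadsheets/d/SHEET_ID/edit?usp=sharing"))
```

The resulting URL is the one to paste into the Power BI Web data source.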
The most important part for me, is that I'm only capturing the data I need, without any third party tracking or cookies. It's a privacy first approach. There is no personal data collected, it's all anonymous aggregated data. There was some DAX involved to create calculated columns and measures for the average time on page calculation. I have covered the entire process for building this dashboard from scratch in the article [building a website analytics dashboard with Power BI and Google Sheets](/blog/building-a-website-analytics-dashboard-with-power-bi-and-google-sheets/). ## Bonus: Avoid tracking your own activity As I tested and interacted with the site myself, I didn't want to track my own activity. During the site launch, I didn't want any logs of testing activity either. This would skew the usage statistics and create an inaccurate picture. I addressed this by adding a private route for internal users that saved a value in local storage then redirected back to the home page. So for any internal testing, we can use the private route URL to deactivate analytics logging. ```html [deactivateanalytics.vue] ``` Once this value is set, I added a guard just before the `sendBeacon` method is called. So if analytics are set to deactivated, the events data won't be sent. ```javascript [layouts/default.vue] sendAnalyticsData() { let url = "https://f2hrck8yp5.execute-api.eu-west-1.amazonaws.com/website-analytics-logger-demo"; let data = JSON.stringify({ events: [...this.analyticsEvents] }); let analyticsDeactivated = localStorage.getItem("analyticsDeactivated") || false; if (!this.sendingAnalyticsData && !analyticsDeactivated) { this.sendBeacon(url, data); } } ``` ## Lessons learnt This has been a fun project and overall I’m pleased with the outcome. Does it give me an insight into visitors and page views? Absolutely. It’s not perfect, there are some negatives but it meets most of my initial goals. 
I did find it difficult to handle mobile use cases such as switching tabs, closing tabs, leaving the browser and switching to another app. This was overcome with the `visibilitychange` and `blur` events - effectively creating a ‘log when you can’ approach. Whenever the `sendBeacon` method is successfully called I clear the `analyticsEvents` array held in state, so if it happens to try and send again when a user comes back, it won’t send if there are no new events to log 😄 Although I acknowledge I will be missing some sessions, I am happy with that. I only set out to get a simple overview of how the site is being received so I can improve it. This satisfies that purpose nicely. If capturing every single event was the number one priority, I would switch this setup to log each event as it happens - using the `beforeEach()` hook to call the AWS function on each page change rather than all in one call at the end of the session. This would lead to increased AWS function calls which would increase the costs at scale. The AWS Lambda [free usage tier](https://aws.amazon.com/lambda/pricing/) includes 1M free requests per month and 400,000 GB-seconds of compute time per month at the time of writing. I can see uses for this setup beyond website analytics logging. I think it could be handy in a variety of situations when it comes to logging information. If you have adapted this setup to your own needs, I'd love to hear about it in the comments below. ## How it’s performed I will update this section when more data on performance is available.

Programming quotes that offer wisdom and motivation

Sat, 12 Jun 2021 11:12:00 GMT

This article is a place I keep all of my favourite programming quotes. Some didn’t come from programmers, but they are very applicable nonetheless. A few wise words can go a long way in furthering understanding, I have grouped them by topic so they’re a little easier to find. Enjoy! ## Being effective > Give me six hours to chop down a tree and I will spend the first four sharpening the axe — Abraham Lincoln > Slow is smooth and smooth is fast - US Navy SEALs > An investment in knowledge pays the best interest — Benjamin Franklin > The best work happens when it doesn’t feel like you’re working at all — Shedload Of Code > Most good programmers do programming not because they expect to get paid or get adulation by the public, but because it is fun to program — Linus Torvalds > Simplicity is the soul of efficiency — Austin Freeman > One of my most productive days was throwing away 1000 lines of code — Ken Thompson > Every great developer you know got there by solving problems they were unqualified to solve until they actually did it — Patrick McKenzie > Prolific developers don’t always write a lot of code, instead they solve a lot of problems. The two things are not the same — J. Chambers > Measuring programming progress by the lines of code is like measuring aircraft building progress by weight - Bill Gates > Good software, like wine, takes time — Joel Spolsky > To be effective engineers, we need to be able to identify which activities produce more impact with smaller time investments. Not all work is created equal. Not all efforts, however well-intentioned, translate into impact ― Edmond Lau, The Effective Engineer > Choose a job you love, and you will never have to work a day in your life — Confucius > Delegate - work smarter not harder; do what you do best and drop the rest; get control of your calendar; do what you love because it will give you energy; work with people you like so your energy isn't depleted — John C. 
Maxwell > A hacker on a roll may be able to produce-in a period of a few months - something that a small development group (say, 7-8 people) would have a hard time getting together over a year. IBM used to report that certain programmers might be as much as 100 times as productive as other workers, or more — Peter Seebach > Better than a thousand days of diligent study is one day with a great teacher — Japanese Proverb ## Writing code > First, solve the problem. Then, write the code — John Johnson > Programming is a blend of gardening and surgery — Shedload Of Code > Computer science education cannot make anybody an expert programmer any more than studying brushes and pigment can make somebody an expert painter – Eric S. Raymond > Programming is a skill best acquired by practice and example rather than from books — Alan Turing > All problems in computer science can be solved by another level of indirection — David Wheeler > Any fool can write code that a computer can understand. Good programmers write code that humans can understand — Martin Fowler > Don’t include a single line in your code which you could not explain to your grandmother in a matter of two minutes — Unknown > Always code as if the guy who ends up maintaining your code will be a violent psychopath who knows where you live — Martin Golding > Programming isn't about what you know; it's about what you can figure out - Chris Pine > In some ways, programming is like painting. You start with a blank canvas and certain basic raw materials. You use a combination of science, art, and craft to determine what to do with them. You sketch out an overall shape, paint the underlying environment, then fill in the details. You constantly step back with a critical eye to view what you've done. Every now and then you'll throw a canvas away and start again. But artists will tell you that all the hard work is ruined if you don't know when to stop. 
If you add layer upon layer, detail over detail, the painting becomes lost in the paint ― Andrew Hunt, The Pragmatic Programmer: From Journeyman to Master > Premature optimization is the root of all evil - Donald Knuth ## Building systems > Simplicity is prerequisite for reliability — Edsger Dijkstra > A complex system that works is invariably found to have evolved from a simple system that worked. A complex system designed from scratch never works and cannot be patched up to make it work. You have to start over, beginning with a working simple system — John Gall > Complexity kills. It sucks the life out of developers, it makes products difficult to plan, build and test, it introduces security challenges, and it causes end-user and administrator frustration — Ray Ozzie > Software being 'Done' is like lawn being 'Mowed' — Jim Benson > If you cannot grok the overall structure of a program while taking a shower, you are not ready to code it — Richard Pattis > No one in the brief history of computing has ever written a piece of perfect software. It's unlikely that you'll be the first — Andy Hunt > It is not the strongest of the species that survive, nor the most intelligent, but the one most responsive to change — Charles Darwin > Perfection is achieved, not when there is nothing more to add, but when there is nothing left to take away — Antoine de Saint-Exupery > As a programmer, it is your job to put yourself out of business. 
What you do today can be automated tomorrow - Doug McIlroy > The purpose of software engineering is to control complexity, not to create it — Pamela Zave > It’s easier to ask for forgiveness, than it is to get permission — Admiral Grace Hopper > Computers make it easier to do a lot of things, but most of the things they make it easier to do don't need to be done — Andy Rooney ## Using statistics > All models are wrong, but some are useful — George Box > Not everything that counts can be counted, and not everything that can be counted counts — William Bruce Cameron > The greatest value of a picture is when it forces us to notice what we never expected to see — John Tukey > He uses statistics as a drunken man uses lamp posts - for support rather than for illumination — Andrew Lang > Statistics are no substitute for judgment — Henry Clay

How to scrape and analyse your Amazon spending data

Thu, 03 Jun 2021 09:19:00 GMT

Ever wondered just how much you've spent on Amazon since signing up? Well I read an article recently from Dataquest which outlined how to find out [how much you've spent on Amazon](https://www.dataquest.io/blog/how-much-spent-amazon-data-analysis/?utm_content=buffer06d87&utm_medium=social&utm_source=twitter.com&utm_campaign=dataquest_buffer). However, I quickly found out that the feature for downloading your spending as a report is not available on the UK version of the site! I really wanted to gather this data, and started a small project to do just that. So, if you're interested in gathering and analysing your Amazon spending data with Python, while learning some web scraping, you're in the right place. ## Before starting Before starting you will need a few things. These things will set you up to carry out other Data Science projects in the future too. * Anaconda * Jupyter Notebooks (installed with Anaconda) * Selenium * Google Chrome (latest version) * Chrome Driver (latest version) This article will not cover installing programs in detail, but here is a starting point. Install [Anaconda](https://www.anaconda.com/distribution/) first. Anaconda is a distribution of the Python and R programming languages for scientific computing (data science, machine learning applications, large-scale data processing, predictive analytics, etc.), that aims to simplify package management and deployment. Once installed, open Anaconda Prompt and install Selenium using `pip install selenium`. Selenium is a browser automation tool built for automated actions in the browser and testing. Finally, ensure you have the latest version of [Google Chrome](http://google.co.uk/chrome/?brand=CHBD&gclid=EAIaIQobChMI0LPsqNXl5QIVCLTtCh3pJwybEAAYASAAEgJxkvD_BwE&gclsrc=aw.ds) installed and the [ChromeDriver](https://chromedriver.chromium.org/downloads) release that matches the version number of Chrome you're running. 
On Windows, ensure `chromedriver.exe` is in a [suitable location](https://chromedriver.chromium.org/getting-started) such as `C:\Windows`. There is a link to download the Jupyter Notebook at the end of this article so you can try out the code on your own. Alternatively, just use the code you find in this page if you don't want to use Anaconda and Jupyter Notebooks, and install the required Python packages in a virtual environment. ## What will the web scraper do? Here are the step by step actions the web scraper will perform to scrape Amazon spending data: * Launches a Chrome browser controlled by Selenium * Navigates to the Amazon login page * Waits 30 seconds for you to manually log in * After login, navigates to the Orders page * Scrapes Item Costs, Order IDs, and Order Dates * Repeats for each year in the year filter and each page in the pagination filter until finished * Outputs the data model to a CSV file The result will be enough to answer questions such as: * How much have I spent in total? * How much do I spend on average per order? * What were the most expensive orders? * What is my spending like per day of the week, month, year? Before we step into the code, let's take a look at the automated scraper in action. Pay attention to the `&orderFilter=` and `&startIndex=` parameters in the URL bar. I've blurred out personal details of course, but you'll see how the scraper moves from year to year, and then page to page to scrape all of the order data. ## Scraping the data Let's look at the `AmazonOrderScraper` class which will be center stage. Bear in mind, this script was accurate at the time of writing, however if the Amazon website changes (id or class names, page structure or url paths) this script may no longer work and will require amending. Underneath this fairly long snippet you can simulate running the code to understand what it's doing, and what the final dataframe would look like. 
```python [order-scraper.py]
import time

import numpy as np
import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver


class AmazonOrderScraper:

    def __init__(self):
        self.date = np.array([])
        self.cost = np.array([])
        self.order_id = np.array([])

    def URL(self, year: int, start_index: int) -> str:
        return "https://www.amazon.co.uk/gp/your-account/order-history/" + \
               "ref=ppx_yo_dt_b_pagination_1_4?ie=UTF8&orderFilter=year-" + \
               str(year) + \
               "&search=&startIndex=" + \
               str(start_index)

    def scrape_order_data(self, start_year: int, end_year: int) -> pd.DataFrame:
        years = list(range(start_year, end_year + 1))
        driver = self.start_driver_and_manually_login_to_amazon()

        for year in years:
            driver.get(self.URL(year, 0))
            number_of_pages = self.find_max_number_of_pages(driver)
            self.scrape_first_page_before_progressing(driver)

            for i in range(number_of_pages):
                self.scrape_page(driver, year, i)

            print(f"Order data extracted for { year }")

        driver.close()
        print("Scraping done :)")

        order_data = pd.DataFrame({
            "Date": self.date,
            "Cost £": self.cost,
            "Order ID": self.order_id
        })
        order_data = self.prepare_dataset(order_data)
        order_data.to_csv(r"amazon-orders.csv")
        return order_data

    def start_driver_and_manually_login_to_amazon(self) -> webdriver.Chrome:
        options = webdriver.ChromeOptions()
        options.add_argument("--start-maximized")
        # Assumes chromedriver is on your PATH. Recent Selenium versions no
        # longer accept the driver path as a positional argument.
        driver = webdriver.Chrome(options=options)

        amazon_sign_in_url = (
            "https://www.amazon.co.uk/ap/signin?"
            "_encoding=UTF8&accountStatusPolicy=P1&"
            "openid.assoc_handle=gbflex&openid.claimed_id"
            "=http%3A%2F%2Fspecs.openid.net%2Fauth%2F2.0%2Fidentifier_select&"
            "openid.identity=http%3A%2F%2Fspecs.openid.net%2Fauth%2F2.0%2Fidentifier"
            "_select&openid.mode=checkid_setup&openid.ns=http%3A%2F%2Fspecs.openid"
            ".net%2Fauth%2F2.0&openid.ns.pape=http%3A%2F%2Fspecs.openid.net"
            "%2Fextensions%2Fpape%2F1.0&openid.pape.max_auth_age=0&openid"
            ".return_to=https%3A%2F%2Fwww.amazon.co.uk%2Fgp%2Fcss%2Forder-history"
            "%3Fie%3DUTF8%26ref_%3Dnav_orders_first&"
            "pageId=webcs-yourorder&showRmrMe=1"
        )
        driver.get(amazon_sign_in_url)
        time.sleep(30)  # allows time for manual sign in - increase if you need more time
        return driver

    def find_max_number_of_pages(self, driver: webdriver.Chrome) -> int:
        time.sleep(2)
        page_content = BeautifulSoup(driver.page_source, "html.parser")
        a_normal = page_content.find_all("li", {"class": "a-normal"})
        a_selected = page_content.find_all("li", {"class": "a-selected"})
        max_pages = len(a_normal + a_selected) - 1
        return max_pages

    def store_orders_from_current_page(self, driver: webdriver.Chrome) -> None:
        time.sleep(2)
        page_content = BeautifulSoup(driver.page_source, "html.parser")
        order_info = page_content.find_all("span", {"class": "a-color-secondary value"})
        orders = [span.text.strip() for span in order_info]

        # The values arrive in repeating groups of three: date, cost, order ID
        index = 0
        for value in orders:
            if index == 0:
                self.date = np.append(self.date, value)
                index += 1
            elif index == 1:
                self.cost = np.append(self.cost, value)
                index += 1
            elif index == 2:
                self.order_id = np.append(self.order_id, value)
                index = 0

    def scrape_first_page_before_progressing(self, driver: webdriver.Chrome) -> None:
        self.store_orders_from_current_page(driver)

    def scrape_page(self, driver: webdriver.Chrome, year: int, i: int) -> None:
        # Each page holds 10 orders, so page i (zero-based) after the first
        # starts at index 10, 20, 30, ...
        start_index = (i + 1) * 10
        driver.get(self.URL(year, start_index))
        self.store_orders_from_current_page(driver)

    def prepare_dataset(self, order_data: pd.DataFrame) -> pd.DataFrame:
        order_data.set_index("Order ID", inplace=True)

        order_data["Cost £"] = (
            order_data["Cost £"]
            .str.replace("£", "", regex=False)
            .str.replace(",", "", regex=False)  # thousands separator, e.g. 1,299.99
            .astype(float)
        )
        order_data["Order Date"] = pd.to_datetime(order_data["Date"])
        order_data["Year"] = pd.DatetimeIndex(order_data["Order Date"]).year
        order_data["Month Number"] = pd.DatetimeIndex(order_data["Order Date"]).month
        order_data["Day"] = pd.DatetimeIndex(order_data["Order Date"]).dayofweek

        day_of_week = {
            0: "Monday", 1: "Tuesday", 2: "Wednesday", 3: "Thursday",
            4: "Friday", 5: "Saturday", 6: "Sunday"
        }
        order_data["Day Of Week"] = order_data["Order Date"].dt.dayofweek.map(day_of_week)

        month = {
            1: "January", 2: "February", 3: "March", 4: "April",
            5: "May", 6: "June", 7: "July", 8: "August",
            9: "September", 10: "October", 11: "November", 12: "December"
        }
        order_data["Month"] = order_data["Order Date"].dt.month.map(month)

        return order_data


if __name__ == "__main__":
    aos = AmazonOrderScraper()
    order_data = aos.scrape_order_data(start_year=2010, end_year=2021)
    print(order_data.head(3))
```

Once instantiated as `aos`, we call the `scrape_order_data` method and it handles everything else. You will need to pass `start_year` and `end_year` as parameters to it. This allows for scraping the full range of years applicable to you, or a selected range. I used a similar method to this in [How to scrape AutoTrader with Python and Selenium to search for multiple makes and models](/blog/how-to-scrape-autotrader-with-python-and-selenium-to-search-for-multiple-makes-and-models/).

## Analysing the data

The `prepare_dataset` method applied some feature engineering to enhance the dataset.
This is simply to ensure that the data can be sliced by date, year, month and day of the week. It carried out a series of data manipulation steps, such as removing the pound sign from the cost column, ensuring data types were correct, and mapping the integer day and month values to their names ready to use with charts.

Now that you have your data, you can apply any analysis you like to it. I will give you some inspiration on the kinds of questions you might wish to ask. You might find (like I did) your spending is higher or lower than you expected, so brace yourself for surprises!

## Import packages

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

sns.set(rc={'figure.facecolor':'white'})
```

## Summary statistics

```python
order_data.describe()
```

|       | Cost £     | Year        | Month Number | Day        |
|-------|------------|-------------|--------------|------------|
| count | 523.000000 | 523.000000  | 523.000000   | 523.000000 |
| mean  | 18.695985  | 2015.139579 | 6.699809     | 2.797323   |
| std   | 23.793675  | 3.276180    | 3.612417     | 2.164905   |
| min   | 0.000000   | 2010.000000 | 1.000000     | 0.000000   |
| 25%   | 5.330000   | 2012.000000 | 3.500000     | 1.000000   |
| 50%   | 12.750000  | 2015.000000 | 7.000000     | 3.000000   |
| 75%   | 23.015000  | 2018.000000 | 10.000000    | 5.000000   |
| max   | 299.990000 | 2021.000000 | 12.000000    | 6.000000   |
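Note that `describe()` also summarises the Year, Month Number and Day columns, where a mean or standard deviation doesn't tell you much. Here is a minimal sketch of restricting the summary to the cost column. It uses a small made-up frame called `sample_orders` in place of the scraped `order_data`, so the values are purely illustrative.

```python
import pandas as pd

# Hypothetical stand-in for the scraped order_data; the values are made up.
sample_orders = pd.DataFrame({
    "Cost £": [5.00, 10.00, 15.00, 30.00],
    "Year": [2019, 2020, 2020, 2021],
})

# Summarise only the spending column rather than every numeric column.
cost_summary = sample_orders["Cost £"].describe()
print(cost_summary.round(2))
```

With your own data you would swap `sample_orders` for `order_data`.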

## Total spend

```python
total_amount_spent = order_data["Cost £"].sum()
print(f"Total amount spent: £{ round(total_amount_spent, 2) }")
```

## Average spend per order

```python
average_amount_spent_per_order = order_data["Cost £"].mean()
print(f"Average amount spent per order: £{ round(average_amount_spent_per_order, 2) }")
```

## Most and least expensive orders

```python
order_data.loc[order_data["Cost £"] == order_data["Cost £"].max()]
```

| Order ID            | Date          | Cost £ | Order Date | Year | Day Of Week | Month |
|---------------------|---------------|--------|------------|------|-------------|-------|
| 205-1516165-1234567 | 31 March 2020 | 299.99 | 2020-03-31 | 2020 | Tuesday     | March |

```python
order_data.loc[order_data["Cost £"] == order_data["Cost £"].min()]
```

| Order ID            | Date         | Cost £ | Order Date | Year | Day Of Week | Month |
|---------------------|--------------|--------|------------|------|-------------|-------|
| 123-5616156-1234567 | 21 June 2011 | 0.0    | 2011-06-21 | 2011 | Tuesday     | June  |

## Top five most expensive orders

```python
order_data.sort_values(ascending=False, by="Cost £").head(5)
```

| Order ID            | Date             | Cost £ | Order Date | Year | Day Of Week | Month    |
|---------------------|------------------|--------|------------|------|-------------|----------|
| 205-2452455-9123505 | 31 March 2020    | 299.99 | 2020-03-31 | 2020 | Tuesday     | March    |
| 204-4525421-7169117 | 15 November 2020 | 239.00 | 2020-11-15 | 2020 | Sunday      | November |
| 205-5245215-9426706 | 28 February 2020 | 138.22 | 2020-02-28 | 2020 | Friday      | February |
| 202-5278588-7857857 | 17 November 2018 | 135.99 | 2018-11-17 | 2018 | Saturday    | November |
| 204-2542525-5654645 | 5 December 2020  | 127.37 | 2020-12-05 | 2020 | Saturday    | December |
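If you only ever need the top few rows, pandas also offers `DataFrame.nlargest`, which avoids sorting the whole frame. This is a small sketch on a made-up frame (`sample_orders` and its order IDs are hypothetical), not the scraped data itself.

```python
import pandas as pd

# Hypothetical orders indexed by a made-up order ID.
sample_orders = pd.DataFrame(
    {"Cost £": [12.75, 299.99, 5.33, 239.00, 23.01]},
    index=pd.Index(["A-1", "A-2", "A-3", "A-4", "A-5"], name="Order ID"),
)

# Keep the two most expensive orders, highest first.
top_two = sample_orders.nlargest(2, "Cost £")
print(top_two)
```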

## Total spend per year

```python
fig, ax = plt.subplots(figsize=(15,6))

# Select the cost column before aggregating so that the text and date
# columns are not passed to sum(), which newer pandas versions reject.
yoy_cost = order_data.groupby(["Year"], as_index=False)["Cost £"].sum()

sns.lineplot(x=yoy_cost["Year"], y=yoy_cost["Cost £"], color="grey")
plt.title("How much spending per year?")
plt.ylabel("Spending £")
```

## Count of orders per year

```python
fig, ax = plt.subplots(figsize=(15,6))

yoy_order_count = order_data.groupby(["Year"], as_index=False)["Cost £"].count()

sns.lineplot(x=yoy_order_count["Year"], y=yoy_order_count["Cost £"], color="grey")
plt.title("How many orders per year?")
plt.ylabel("Count of Orders")
```

## Total monthly spend

```python
months = ["January", "February", "March", "April", "May", "June",
          "July", "August", "September", "October", "November", "December"]

fig, ax = plt.subplots(figsize=(15,6))

monthly_cost = order_data.groupby(["Month"], as_index=False)["Cost £"].sum()

sns.barplot(x=monthly_cost["Month"], y=monthly_cost["Cost £"], order=months, color="grey")
plt.ylabel("Spending £")
plt.title("How much overall spending per month?")
```

## Average monthly spend

```python
months = ["January", "February", "March", "April", "May", "June",
          "July", "August", "September", "October", "November", "December"]

fig, ax = plt.subplots(figsize=(15,6))

monthly_cost = order_data.groupby(["Month"], as_index=False)["Cost £"].mean()

sns.barplot(x=monthly_cost["Month"], y=monthly_cost["Cost £"], order=months, color="grey")
plt.ylabel("Spending £")
plt.title("Average spending per month?")
```

## Day of the week with highest spend

```python
days_of_week = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]

fig, ax = plt.subplots(figsize=(15,6))

day_of_week_cost = order_data.groupby(["Day Of Week"], as_index=False)["Cost £"].sum()

sns.barplot(x=day_of_week_cost["Day Of Week"], y=day_of_week_cost["Cost £"], order=days_of_week, color="grey")
plt.ylabel("Spending £")
plt.title("Which day of the week has the highest spend?")
```

## Full time series

```python
fig, ax = plt.subplots(figsize=(15,6))
sns.lineplot(x=order_data['Order Date'], y=order_data["Cost £"], color="Grey")
plt.ylabel("Spending £")
plt.title("Spending Time Series")
```

## Final words and next steps

So there it is, you can now scrape and analyse your Amazon spending data using Python. Hopefully, the answers to the questions we've asked in this article haven't caused too many surprises! Now you have a way to monitor, track and analyse spending to identify trends. If there are any other analytical questions you'd like to ask of this dataset, let me know in the comments below and I'll update the article. The full Jupyter notebook can be [downloaded for reference](https://github.com/shedloadofcode/notebooks/blob/main/Amazon%20Orders%20Web%20Scraping.ipynb).

Ideas for future development might include importing the CSV into Power BI or other analysis tools. This would allow interactive data exploration and would introduce cross-filtering functionality. You could then cross examine day of the week with year, or day of the month with month and all other combinations. This could unlock further insights.
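You can get a rough version of that cross examination in pandas before reaching for another tool. The sketch below pivots day of the week against year using `pivot_table`. The `sample_orders` frame is a made-up stand-in that only borrows the column names from the prepared dataset, so treat it as an illustration rather than real results.

```python
import pandas as pd

# Hypothetical stand-in for the prepared order_data; the values are made up.
sample_orders = pd.DataFrame({
    "Year": [2020, 2020, 2020, 2021, 2021],
    "Day Of Week": ["Monday", "Friday", "Monday", "Friday", "Sunday"],
    "Cost £": [10.0, 20.0, 5.0, 7.5, 12.5],
})

# One row per day of the week, one column per year, total spend in each cell.
spend_by_day_and_year = sample_orders.pivot_table(
    index="Day Of Week",
    columns="Year",
    values="Cost £",
    aggfunc="sum",
    fill_value=0,  # show 0 where there were no orders for that combination
)
print(spend_by_day_and_year)
```

The resulting frame can be passed straight to `sns.heatmap` if you want to see the combinations as a chart.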

Maintaining a healthy positive mindset as a programmer

Wed, 02 Jun 2021 19:42:00 GMT

Being in a positive frame of mind is so important in any profession, even more so for programmers who undergo daily mental gymnastics! Easier said than done, I know. Not only is writing code hard (whether it be for software development or data science), dealing with people can be harder. Of course I can only speak from my own experience so far, but maybe some of it will relate. This article will go over the main things that cause my positive mindset to turn negative, and how I try to overcome them to get back on track. ## You overwork yourself I wanted to enter the programming world because I found programming to be fun! It’s so good to be able to have an idea and then go ahead and build it. By doing so you can make other people’s lives easier and genuinely provide value. The downside of that, is you might end up working on things for too long. I can find myself finishing the working day, only to start working on my side projects or to study at night. I do it because it’s fun, it’s almost like being paid a good amount for a hobby, which is great. The only problem is it leaves no time to wind down and do other things. This can lead to burn out and stress. These are two things you need to avoid. They will really harm you in the long run and are unsustainable. **Suggested remedy:** Always have a start time and an end time for the working day. You value your ‘work’ time and should value your ‘free’ time. Your free time is sacred. I love building things and writing code but there are other things in the world too 😄 You don’t want life to pass you by while coding all hours, no matter how fun it is! On that note, try to go to bed at the same time and get at least eight hours sleep. Always take periodic breaks throughout the day - maybe Pomodoro, or five minute walks every so often (your eyes will thank you for time away from a screen). Sitting for long periods is very bad for you. Drink water throughout the day and use a smaller cup so you have to get up to refill it. 
Don't skip breakfast or lunch, and try to eat a balanced nutritional diet. I found trying a standing desk helped with taking breaks, easier to move around when you’re already standing right? Might be worth considering. Try to get some frequent exercise in your down time too, it boosts your mood and your health is the most important thing you have. ## You must attend scrum rituals I really gave agile and scrum a chance when I first started in the field. It was new to me so I thought I’ll see what it’s all about. It didn’t really leave a good impression on me (maybe I’ve just been unfortunate). The daily stand ups were way too long and more like a status report to the project manager. Those meetings felt so unnatural, everyone seemed to be justifying their existence, sometimes with what felt like busywork. It started to look like these [kinds of things](https://www.aaron-gray.com/a-criticism-of-scrum/), rather than the [agile manifesto](https://agilemanifesto.org) I had read. I didn’t get into programming to justify myself daily that’s for sure. It all felt a little belittling and hostile. It doesn't surprise me many others have [similar thoughts](https://www.quora.com/In-a-nutshell-why-do-a-lot-of-developers-dislike-Agile-What-are-better-project-management-paradigm-alternatives) that agile is fundamentally a good thing, but it's become a hindrance rather than a help. I've sat in my fair share of meetings that looked a little too much [like this](https://www.youtube.com/watch?v=BKorP55Aqvg) - containing vague requests and haphazard, irrational plans. Despite always being the voice of reason, by the end of them I had no idea what just happened much like Anderson 😆 I still work in agile teams but I handle it differently now, I’ve come to terms with what agile is and what it’s not. It’s not a silver bullet. The main ingredient in getting anything done is amazing experienced people who are team players and want to improve the product or service they’re building. 
**Suggested remedy:** Remember why you got into programming in the first place. The answer for me is to have fun, get paid for it and build amazing things that help other people. I want to manage deadlines, costs and slackers as much as the next guy, but checking up on people daily is not my idea of trust. Always stay away from the politics, and stand up for yourself if you find yourself up against hostile people who are asking too much. At the end of the day, the doer is the most important person in the room. As a doer you hold a lot of power over the talkers, and if they aren’t nice to you they can either do the work themselves or find someone else who will put up with it right? I let those people who love their meetings and rituals get on with it, I focus on building amazing products that help others and that I’m passionate about. If you want some fun counting the cost of scrum, try [running these numbers](https://www.aaron-gray.com/a-criticism-of-scrum/#count-the-cost) through our [Meeting Cost Calculator](/tools/meeting-cost-calculator). ## You find ‘how long do you think that will take’ hard to answer It’s a question so difficult to answer, yet asked by everyone. Entire books have been written on the subject of giving accurate estimates. The problem is it’s not always taken as an estimate, but as a commitment. I feel unless you’ve done the exact same thing a hundred times before, in a similar setting, the estimate will be wrong. That creates resentment, dysfunction and distrust after ‘missed’ deadlines. It makes people feel bad, they feel responsible because they thought it would be done quicker. They question their own ability to get things done, when it could be something outside of their own control or something unforeseen by everyone. There are many things you know you don’t know and things that surprise us when it’s too late to change course. This can lead to a very negative mood. 
I watched an interesting talk on [no estimates](https://youtu.be/QVBlnCTu9Ms) that seems like a great way to work. If you’re building and improving a working product consistently, on time and in budget, why do estimates matter anyway? There is strong evidence that once a task requires even rudimentary cognitive skill, rewards and other motivators (like deadlines) simply don’t work, they actually [lead to poorer performance](https://youtu.be/rrkrvAUbU9Y?t=98). **Suggested remedy:** Honesty is the best policy. If you’ve done something similar, use that as a starting point, and maybe double it. State your plan out loud - this will help to break down the steps and what tasks are involved. Give a range, so something like ‘worst case scenario one week, best case three days’. Don’t try to impress anyone and if you don’t know how long something will take, say so. People don’t like uncertainty, they will press for an estimate, but if you’ve never done something before how can you say how long it will take? Better to ask for time to explore the problem first, or speak to a more experienced colleague, to gauge how much effort is involved in solving it. You’ll feel much better and you won’t be pressured into accepting a timeframe you’re not comfortable with. At the end of the day, be professional, but things take as long as they take. If you ever find yourself in a disagreement, come at it from a business / economics point of view - writing subpar code and cutting corners slows things down in the long run and the costs of that can be massive. Finally, if you want to provide more robust estimates using statistical techniques be sure to check out our [Agile Task Estimation Calculator](/tools/agile-task-estimation-calculator). ## You feel like you’re not good enough This is referred to as ‘imposter syndrome’ and it affects everyone I think. Sometimes you get a negative feeling that you simply don’t know enough to be good. 
What I’ve seen is that programmers of all types are looked to for guidance. They are seen as the experts, the problem solvers and the clever people in the room. So what happens when the expert is asked a question they don’t know the answer to? Or asked a question others think they should know the answer to? They feel like a fraud or unqualified for their position. These feelings make you doubt and question yourself as to how good you are. This is true of newcomers and veterans alike, I imagine veterans have become better at handling these thoughts, but not always. I think when you arrive at a point where you’ve built some projects that others have used (production code) it helps with those doubts. You have concrete evidence that you can code, you can solve problems, and you can build working products. You might not be an expert at everything, but you know enough to get things done. **Suggested remedy:** Remember no one can know everything. Even experts in every field forget or don’t know something from time to time. Work hard at filling gaps in your knowledge - if you don’t understand something, read up on it. If you work alongside someone who knows way more than you do, learn from them. Never stop learning new things whenever you can. Staying inquisitive is better than assuming or pretending you know everything. It is this motivation to learn new things, and find solutions to problems that gives you immense worth, not pre-existing knowledge. ## You are no longer learning anything new At the beginning of a new role, learning is the main activity. You might be learning a new technology stack, a new programming language or a new way of working. This process of initial learning can last up to a year I've found. You pick up small bits of information until eventually, there isn't much that happens which surprises you. You reach a competence level in a role where you know how to solve everything (almost everything). 
In a good organisation, you'll be encouraged to try new things, learn new technologies and undergo any training that can help you improve professionally. In a bad organisation, you won't. Regardless, both of these situations can still lead to you feeling negative. The reason for that is no matter how much learning and development you do, if you're not using that new-found knowledge on a day-to-day basis, it won't be fully realised. If you learn about cloud computing services with AWS but your organisation uses Azure, you won't get to use that new skill. Was it worth learning though? Absolutely. You have a new valuable skill, but to use it day-to-day in a professional setting, it might require you to change organisations. Even worse is the scenario where your organisation doesn't encourage learning new things. I think of this as eroding the value of your skills the same way [inflation](https://en.wikipedia.org/wiki/Inflation) erodes the value of money. You see, whilst you're working for an organisation that offers no time for learning, they're gaining your portfolio of skills, without giving you the time to grow that portfolio. Over time, your portfolio becomes less valuable - new technology emerges, updates are made to existing technology and frameworks, and old skills become rusty. Some may argue the portfolio should be maintained on your own time. I disagree - any organisation you work for should be very interested in the state of your skills portfolio and actively help you to grow it.

**Suggested remedy:** Always keep your skills portfolio healthy and growing. Continuously learn new things, whether via online courses on platforms like edX, Coursera or YouTube, or by reading a technical book. The more knowledge you add to your portfolio, the more marketable, valuable and competent you become.
Knowledge certainly is power, but it can also [improve your mental wellbeing](https://www.nhs.uk/mental-health/self-help/guides-tools-and-activities/five-steps-to-mental-wellbeing/#:~:text=Research%20shows%20that%20learning%20new,you%20to%20connect%20with%20others), boosting your self-confidence, self-esteem and giving you new directions and opportunities. If you're in an organisation where you're not encouraged to learn new things or you feel locked into a particular tech stack with no room for growth, consider finding another organisation or another role which does offer that support and a new challenge. ## You don’t have anyone around you to turn to for help Programming is labelled as a job for introverts. However programming is very much a team game. Think about how many times you search Google or Stack Overflow to find insight and guidance - you’re consulting with the community each time. The problems that these places can’t help you with, are the problems very specific to the project or place you’re working at. You might find yourself at a loss when you face these issues, I certainly have done. You have to turn to other members of the team with internal knowledge of the company to solve these problems. On the odd occasion, particularly smaller projects whilst working as a solo developer, there is no one to really turn to. In these circumstances, I’ve had non-technical managers to turn to, but they can’t really help you fix issues within the code base. I think there should always be someone you can go to for technical guidance and support. This is true whether you are beginner, intermediate or advanced. Just because you might be advanced in most areas, doesn’t mean something won’t come up that makes you feel like a total beginner. In most cases, I’ve been fortunate enough to have someone around. On my first real project, I worked alongside an amazing senior developer who could talk the talk and walk the walk, and would always be available to guide me. 
When that’s not the case, it can leave you feeling isolated, with no one to turn to for support and therefore unable to deliver what’s being asked. It can make you feel like quitting, because without a mentor of any kind to guide and support you to the next level, you lose direction and focus.

**Suggested remedy:** Remember you can’t do it all by yourself all of the time. As said before, be honest. Let it be known that you need support on something, and if you don’t get it, offer two choices. Either the ask is abandoned because you tried but can’t see a way to do it, or you can carry on trying for a little longer with no guarantee you can get it done. Finally, if you feel like you have no mentor to learn from and no support at all for a long period of time, the best thing to do might be to leave and find somewhere that does offer those things.

## Key takeaways

I hope this article has given you ways to keep your mental and physical health in shape as someone who writes code professionally. There are so many positives to programming. It’s like no other activity, a mixture of art, creativity and science that can bring real joy to those who practise it. Nevertheless, you need to watch out for the negatives listed in this article and work to balance your pursuit with your health and life. Code runs the world, and I think the demand for enthusiastic, dedicated programmers is only going to keep going up. Not only will they need to learn the technical topics, but also topics such as these. It will hopefully make programmers realise their worth and prioritise their mental and physical wellbeing. If there are any ways you use to keep a healthy mindset or overcome certain negatives, let me know in the comments below.
Here is a recap of all the suggested mindset remedies mentioned in this article:

* Always have a start and end time to your working day
* Respect your free time
* Do something other than programming in your free time sometimes
* Don’t neglect exercise - your health is the most important thing
* Remember why you got into programming in the first place - for me to have fun, get paid for it, learn new things and build amazing stuff
* Be honest and professional when giving estimates - but admit if you can’t say how long something will take
* Stand up for yourself if you’re being made to accept a timeframe that is unrealistic
* Remember no one can know everything
* Your motivation to find solutions to problems is what gives you worth
* Stay inquisitive and learn new things whenever you can
* Keep your skills portfolio healthy and growing
* Remember you can’t do it all by yourself all of the time
* If you don’t have a mentor or any support, consider moving to a place that provides those things

How to query a database with Python Flask and download data to CSV or XLSX in Vue

Thu, 27 May 2021 17:18:00 GMT

## Background

I was recently working on a project building a web app to automate viewing and downloading data. The result was a Vue - Flask app which accepted some user input and, based upon that input, sent the relevant SQL query to a data warehouse. The data could then be viewed or downloaded straight from the browser. There were several benefits to this. The queries no longer needed to be run manually, saving time. They were indexed and easily updated. Finally, the data was more accessible via a web app to users without knowledge of SQL.

This article will cover building a simplified version of this app where we’ll go over the following:

* Optional: Setting up an Azure SQL database for testing
* Getting a Vue - Flask app set up from a template
* Creating a SQL query lookup
* Configuring Flask RESTX API endpoints
* Sending an Axios call to get data
* Building a simple form to accept user input
* Presenting the data in the browser
* Adding links to download the data
* Bonus: Displaying the SQL code nicely formatted

You can use this as a starting point to further develop a more complex and tailored solution. You’ll need either your own database set up to follow along, or you can set one up in the optional first step. I’ll be setting up and connecting to an Azure SQL database, however it should be adaptable to other databases. You'll also need [Python 3.6.x](https://www.python.org/downloads/) along with [Node](https://nodejs.org/en/) and [Yarn](https://yarnpkg.com/getting-started/install) installed.

## Optional: Setting up an Azure SQL database for testing

This first step is optional as you might already have your own database you want to connect to. To facilitate an end to end tutorial, I’m setting up an Azure SQL database for testing. You can register for an [Azure account](https://azure.microsoft.com/en-gb/free/) which has some services free for 12 months. The video below starts from the [Azure portal](https://portal.azure.com/).
It will guide you through the process of setting up an Azure SQL database with a sample AdventureWorks dataset and finding the connection string.

Make a note of the connection string; we’ll need it later on. It should look something like this.

```
Driver={ODBC Driver 13 for SQL Server};Server=tcp:test-sql-server-0123.database.windows.net,1433;Database=test-sql-database-01;Uid=AdminUser;Pwd={your_password_here};Encrypt=yes;TrustServerCertificate=no;Connection Timeout=30;
```

Replace `{your_password_here}` with the password you created while setting up the SQL database.

## Setting up a Vue - Flask project

First things first, head over to this [public repository](https://github.com/gtalarico/flask-vuejs-template) and download the project template. This is a great project template from gtalarico and it will be our starting point. Use whichever editor or IDE you're comfortable with; I'm using Visual Studio Code.

The general project structure should look something like this:

``` [project-structure.txt]
flask-vuejs-template-master
│   README.md
│   .flaskenv
│   .gitignore
│   app.json
│   package.json
│   Pipfile
│   Pipfile.lock
│   run.py
│   vue.config.js
│   yarn.lock
│   ...
│
└───app
│   │   __init__.py
│   │   client.py
│   │   config.py
│   │
│   └───api
│       │   __init__.py
│       │   resources.py
│       │   security.py
│
└───src
    │   App.vue
    │   backend.js
    │   filters.js
    │   main.js
    │   router.js
    │   store.js
    │
    └───assets
    │   │   ...
    └───components
    │   │   HelloWorld.vue
    └───views
        │   Api.vue
        │   Home.vue
```

The `app` directory contains the Flask app and the `src` directory contains the frontend Vue app.

We'll now install pipenv, create a virtual environment, install the project packages to it, and activate it. The Pipfile requires Python 3.6, but you should be able to manually change this if you have a different Python version. We'll be installing `flask-restx`, a community driven fork of Flask-RESTPlus.
We'll also be installing `pyodbc` to connect to the SQL database, `xlsxwriter` for downloading an Excel file and `pandas` for general dataframe processing.

```
cd flask-vuejs-template-master
python -m pip install pipenv
python -m pipenv install --dev
python -m pipenv install flask-restx pyodbc xlsxwriter pandas
python -m pipenv shell
```

Now that the Python packages are installed, let's install and upgrade the Vue dependencies with Yarn, and build the Vue dist directory.

```
yarn install --dev
yarn upgrade
yarn build
```

If everything went smoothly, you should be able to run both the backend and frontend dev servers. Run `python run.py` and, from another terminal window in the same directory, run `yarn serve`. You should see the app running at `http://localhost:8080/#/`.

## Creating a SQL query lookup table

A simple lookup table for the SQL queries that the database expects will be useful later on. Of course, the queries here are specific to the AdventureWorksLT database I’m working with, so feel free to adapt them to yours.

Create another folder inside the `app` folder called `data`. Then create a `lookup.csv` file and copy the data below into it. The `Query` column will map to the user’s input to find their chosen query, and the `SQL` column holds the statement to run.
``` csv [lookup.csv]
Query,SQL
All customers who live in Canada,"SELECT C.[FirstName],C.[LastName],A.[AddressLine1],A.[CountryRegion] FROM [SalesLT].[Customer] C JOIN [SalesLT].[CustomerAddress] CA ON CA.[CustomerId] = C.[CustomerId] JOIN [SalesLT].[Address] A ON CA.[AddressId] = A.[AddressId] WHERE A.[CountryRegion] LIKE 'Canada'"
All products ordered by price,"SELECT TOP (1000) [ProductID],[Name],[ProductNumber],[Color],[StandardCost],[ListPrice],[Size],[Weight] FROM [SalesLT].[Product] ORDER BY [ListPrice] DESC"
Total revenue for each product,"SELECT P.Name, SUM(LineTotal) AS TotalRevenue FROM [SalesLT].[SalesOrderDetail] AS SOD JOIN [SalesLT].[Product] AS P ON SOD.[ProductID] = P.[ProductID] GROUP BY P.Name ORDER BY TotalRevenue DESC"
```

The key thing to note here is the double quotes around each SQL statement, which stop the commas inside it from being read as column separators.

## Configuring Flask API endpoints

Since we’ll be using a Vue single page application, there will need to be endpoints for it to send requests to later on. Let’s get started building these out.

Within `app/api` add a file `query.py`. This will be our main API route for handling queries. Once the file is created, open `api/__init__.py` and add the `.query` import just underneath the `.resources` import, to ensure our new route is registered.

``` python [app/api/__init__.py]
...
# Import resources to ensure view is registered
from .resources import * # NOQA
from .query import *
```

Now in `query.py` add two routes, one for getting the data, and one which will download the data.

``` python [app/api/query.py]
import os
import io
from flask import request, send_file, make_response
from flask_restx import Resource
from . import api_rest
import pyodbc
import pandas as pd

# Connection string is read from the environment (see the .env file below)
connection_string = os.getenv("DB_URI")


@api_rest.route('/query/get')
class GetData(Resource):
    def post(self):
        """ Retrieves data from the database """
        # TODO


@api_rest.route('/query/download')
class DownloadData(Resource):
    def post(self):
        """ Returns data as a downloadable file """
        # TODO
```

**Adding an environment variable for DB_URI**

As you can see we're ready to hook up the connection string for our database using `os.getenv("DB_URI")`. Keeping it in an environment variable means the credentials are not hard-coded in the source. This template has the `python-dotenv` package installed, so we can use a `.env` file. At the top level of the project folder create a file called `.env` and add in your own connection string:

``` [/.env]
DB_URI="DRIVER={ODBC Driver 17 for SQL Server};SERVER=test-sql-server-0123.database.windows.net;DATABASE=test-sql-database-01;UID=AdminUser;PWD={your_password_here}"
```

Now that the environment variable is added, you will have to close your current terminal, start a new one and reactivate the shell with `python -m pipenv shell`. This should show a message during start saying `Loading .env environment variables...` so we know they're registered!

**Route for getting data**

With the connection string ready, let's complete the route for retrieving data from the database. We'll be grabbing the query from the POST request, and then we'll use the lookup file we made earlier to find the correct SQL statement.
``` python [app/api/query.py]
@api_rest.route('/query/get')
class GetData(Resource):
    def post(self):
        """ Retrieves data from the database """
        query = request.get_json()['query']

        # Find the SQL statement that matches the user's chosen query
        lookup = pd.read_csv(os.path.join(
            os.getcwd(), "app", "data", "lookup.csv"))
        sql_statement = lookup.loc[lookup["Query"] == query, "SQL"].iloc[0]

        # Run the statement and load the result into a dataframe
        conn = pyodbc.connect(connection_string)
        dataframe = pd.read_sql(sql_statement, conn)
        conn.close()

        return {
            "sql_statement": sql_statement,
            "data": dataframe.to_json()
        }
```

**Route for downloading data**

Next we'll complete the route which will query the database and return a downloadable file in either CSV or XLSX format.

``` python [app/api/query.py]
@api_rest.route('/query/download')
class DownloadData(Resource):
    def post(self):
        """ Returns data as a downloadable file """
        file_type = request.get_json()['fileType']
        query = request.get_json()['query']

        # Find the SQL statement that matches the user's chosen query
        lookup = pd.read_csv(os.path.join(
            os.getcwd(), "app", "data", "lookup.csv"))
        sql_statement = lookup.loc[lookup["Query"] == query, "SQL"].iloc[0]

        conn = pyodbc.connect(connection_string)
        dataframe = pd.read_sql(sql_statement, conn)
        conn.close()

        if file_type == "csv":
            response = make_response(dataframe.to_csv(index=False))
            response.headers["Content-Disposition"] = "attachment; filename=data.csv"
            response.headers["Content-Type"] = "text/csv"
            return response
        elif file_type == "xlsx":
            # Write the workbook to an in-memory stream rather than to disk
            bytes_stream = io.BytesIO()
            writer = pd.ExcelWriter(bytes_stream, mode="w", engine="xlsxwriter")
            dataframe.to_excel(writer, startrow=0, merge_cells=False,
                               sheet_name="Sheet_1", index_label=None, index=False)
            # close() writes the workbook; it also works on newer pandas
            # releases where save() is no longer available
            writer.close()
            bytes_stream.seek(0)
            # Note: Flask 2.0+ renames attachment_filename to download_name
            return send_file(bytes_stream,
                             attachment_filename="data.xlsx",
                             mimetype="application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
                             as_attachment=True)
```

Now with the endpoints built we can call them from the Vue app using Axios.

## Sending an Axios request to get data

Open up `src/components/HelloWorld.vue` and delete everything so we can start from a blank template.
``` html [src/components/HelloWorld.vue]
```

**Get data method**

``` javascript [src/components/HelloWorld.vue]
getData() {
  axios.post(`api/query/get`, {
    query: String(this.selectedQuery)
  })
  .then(response => {
    this.sqlStatement = response.data.sql_statement;
    // The data arrives as a JSON string, so parse it before use
    this.dataframe = JSON.parse(response.data.data);
  })
  .catch(error => console.log(error));
},
```

**Download data method**

``` javascript [src/components/HelloWorld.vue]
downloadData(fileType) {
  axios.post(`api/query/download`, {
    // The download route reads both fileType and query from the request body
    fileType: fileType,
    query: String(this.selectedQuery)
  }, {
    responseType: fileType === "csv" ? "text" : "arraybuffer"
  })
  .then(response => {
    const filename = response.headers["content-disposition"].split("filename=")[1];
    if (fileType === "csv") {
      const csv = response.data;
      const link = document.createElement("a");
      link.target = "_blank";
      link.href = "data:text/csv;charset=utf-8," + encodeURIComponent(csv);
      link.download = filename;
      link.click();
    }
    if (fileType === "xlsx") {
      const blob = new Blob([response.data], {
        type: 'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet'
      });
      const url = window.URL.createObjectURL(blob);
      const link = document.createElement("a");
      link.target = "_blank";
      link.href = url;
      link.download = filename;
      link.click();
      window.URL.revokeObjectURL(url);
    }
  })
  .catch(error => console.log(error));
}
```

You should now see data on the page after selecting an option and clicking the get data button. It might not look too good just yet, but it will soon. Now that we have the Axios calls working and ready to go, we can improve the UI and render the data to a table.

## Viewing the data in a table and downloading

To improve the UI and display the returned data in a table, let's install [Buefy](https://buefy.org/), which has lightweight UI components for Vue.js based on [Bulma](https://bulma.io/).
```
yarn add buefy
```

With Buefy installed, initialise it within `src/main.js`:

``` javascript [src/main.js]
import Vue from 'vue'
import App from './App.vue'
import router from './router'
import store from './store'
import Buefy from 'buefy'
import 'buefy/dist/buefy.css'
import './filters'

Vue.use(Buefy)
Vue.config.productionTip = false

new Vue({
  router,
  store,
  render: h => h(App)
}).$mount('#app')
```

Here is the revised `HelloWorld.vue` component to improve the UI and display our data in a table!

``` html [src/components/HelloWorld.vue]
```

I've added helper methods to wrangle the returned data so it can be used with the [table component](https://buefy.org/documentation/table/). The table component expects a `columns` prop as an array of column objects, and a `data` prop as an array of row objects. So effectively we transform something like this:

``` json
{
  "Name": {
    "0":"Touring-1000 Blue, 60",
    "1":"Mountain-200 Black, 42",
    "2":"Road-350-W Yellow, 48",
    "3":"Mountain-200 Black, 38",
    "4":"Touring-1000 Yellow, 60",
    "5":"Touring-1000 Blue, 50"
  },
  "TotalRevenue": {
    "0":37191.492,
    "1":37178.838,
    "2":36486.2355,
    "3":35801.844,
    "4":23413.474656,
    "5":22887.072
  }
}
```

... into something like this:

``` json
[
  { "Name":"Touring-1000 Blue, 60", "TotalRevenue":37191.492 },
  { "Name":"Mountain-200 Black, 42", "TotalRevenue":37178.838 },
  { "Name":"Road-350-W Yellow, 48", "TotalRevenue":36486.2355 },
  { "Name":"Mountain-200 Black, 38", "TotalRevenue":35801.844 },
  { "Name":"Touring-1000 Yellow, 60", "TotalRevenue":23413.474656 },
  { "Name":"Touring-1000 Blue, 50", "TotalRevenue":22887.072 }
]
```

You should now see the data rendered in the table and two buttons at the bottom to download it in either CSV or XLSX format. Both use cases are now fulfilled! Great job if you made it this far!

## Bonus: Displaying the SQL query nicely formatted

What if a more advanced user is interested in what underlying SQL query was executed based upon their selections?
That was the reason I added the query from the lookup to the JSON response, so it would be available for this last nice-to-have!

I recently came across a package, [sql-formatter](https://yarnpkg.com/package/sql-formatter), that formats SQL for easier reading. Using this package with [prism](https://yarnpkg.com/package/prismjs), the SQL will not only be formatted but will also have syntax highlighting. First, install both.

```
yarn add sql-formatter prismjs
```

Now that these two packages are ready to go, add the SQL query underneath the download buttons, import both packages and they should handle the rest.

``` html [src/components/HelloWorld.vue]
... Download data to XLSX

View the SQL statement this query

```

As you can see I've wrapped the returned `response.data.sql_statement` with the `format` function and added `Prism.highlightAll()` to the `updated` lifecycle hook, so every time the DOM updates it will highlight the new query!

## Demonstration and next steps

Here is a video of the completed project in action.

It’s a simplified version of the app I worked on; however, you should see the potential to make this your own and introduce additional functionality. I hope you enjoyed this end to end project. Let me know in the comments if you have any questions or if you've adapted this to your own needs. I think this is a very popular use case that can automate manual queries and put data in the hands of people who don’t know much SQL. They will certainly thank you for opening that door up for them!

In terms of next steps and ideas for further development I suggest:

* Making the code more modular and introducing a service layer
* Generating the form options dynamically from the SQL lookup sheet
* Pulling the SQL lookup file from cloud storage like S3, Google Cloud Storage or Azure Blob Storage (allows an admin to upload a new version with new queries easily)
* Deploying this app to a cloud hosting platform like AWS or Azure
* Adding user authentication if required
* Designing and improving the UI (this tutorial was more focused on functionality than UI design)
* Expanding the range of queries available
* Building dynamic queries into the app including where and group by clauses (always be aware of what SQL you’re allowing the user to execute to avoid SQL injection attacks)
* Connecting to multiple databases
* Adding a 'copy code' to clipboard button

If you enjoyed this article, be sure to check out other articles on the site 👍 You may be interested in:

* [How to upload PDF files to Azure Blob Storage with Vue and Python Flask](/blog/how-to-upload-pdf-files-to-azure-blob-storage-with-vue-and-python-flask/)
* [Automated deployment of a Vue Flask app using Azure Pipelines](/blog/automated-deployment-of-a-vue-flask-app-using-azure-pipelines/)
* [How to import a CSV from Dropbox or GitHub into Google Sheets](/blog/how-to-import-a-csv-from-dropbox-or-github-into-google-sheets/)
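
As a footnote to the dynamic queries idea in the next steps, here is a minimal sketch of one way to add a user-supplied filter without opening the door to SQL injection. It is not code from the original app. The `build_filtered_query` helper, the `ALLOWED_FILTER_COLUMNS` allow-list and the tiny `Address` table are all hypothetical, and the standard library's `sqlite3` stands in for `pyodbc` so the example runs without a database server. The idea is that column names are checked against an allow-list, while values are passed separately as `?` parameters rather than pasted into the SQL text.

```python
import sqlite3

# Hypothetical allow-list: the only columns a user may filter on.
ALLOWED_FILTER_COLUMNS = {"City", "CountryRegion"}


def build_filtered_query(base_sql, column, value):
    """Return (sql, params) for base_sql filtered on column = value.

    The column name is validated against the allow-list because
    identifiers cannot be sent as parameters. The value is never
    formatted into the SQL string; it travels as a bound parameter.
    """
    if column not in ALLOWED_FILTER_COLUMNS:
        raise ValueError(f"Filtering on {column!r} is not allowed")
    return f"{base_sql} WHERE [{column}] = ?", (value,)


# In-memory stand-in for the real database (assumption for this sketch).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Address (City TEXT, CountryRegion TEXT)")
conn.executemany(
    "INSERT INTO Address VALUES (?, ?)",
    [("Toronto", "Canada"), ("London", "United Kingdom")],
)

# A normal filter value returns the matching row.
sql, params = build_filtered_query(
    "SELECT City FROM Address", "CountryRegion", "Canada")
rows = conn.execute(sql, params).fetchall()
print(rows)  # [('Toronto',)]

# An injection attempt is treated as a plain string and matches nothing.
sql, params = build_filtered_query(
    "SELECT City FROM Address", "CountryRegion", "Canada' OR '1'='1")
injected = conn.execute(sql, params).fetchall()
print(injected)  # []

conn.close()
```

`pyodbc` uses the same `?` parameter markers, and `pd.read_sql` accepts a `params` argument, so the same pattern should carry over to the routes in this article with little change.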