Six tips for producing and assuring high quality analytical code

December 15, 2022 11 minute read

In this article we'll look at six tips for producing solid analytical code and ensuring it is of high quality. As with all software engineering, the goal is to solve the problem while reducing complexity, creating useful abstractions, and keeping things simple!

These tips are inspired by two excellent resources: Quality assurance of code for analysis and research and The Turing Way.

Begin with the end in mind

Analysis can get complicated without a good roadmap of where you want to end up. What is the purpose of the analysis? What should the end result look like? It's worth asking questions like these first. You want to be able to describe the project in one sentence to someone who has never heard of it:

  • A model to identify our most valuable customers.
  • A model to allocate the correct amount of stock to each store.
  • A model to forecast product sales.

This helps people understand 'what it does'. To explain 'how it does it' to the more curious, we might need a simple, clear solution diagram: the A to B summary. I find this helps newcomers understand the technical big picture. It doesn't even have to be a diagram; it can be something as simple as this in the README file:

Read sales data 
---> Apply forecasting model 
------> Output daily predicted sales for each product 
---------> Email output to store manager

Without looking at any code, I know what this model should do. Writing this before writing the code lets you plan at a high level what the solution should actually do, and avoids coding parts that aren't needed. If you want to improve your system design skills more generally, check out the article Five ways to improve your system design and software architecture skills.

Structure your project neatly

This enables you and others to find the files they need quickly, and to make sense of the overall solution. cookiecutter and govcookiecutter provide useful Data Science project structures.

├── Makefile           <- Makefile with commands like `make data` or `make train`
├── README.md          <- The top-level README for developers using this project.
├── data
│   ├── external       <- Data from third party sources.
│   ├── interim        <- Intermediate data that has been transformed.
│   ├── processed      <- The final, canonical data sets for modeling.
│   └── raw            <- The original, immutable data dump.
├── docs               <- A default Sphinx project; see sphinx-doc.org for details
├── models             <- Trained and serialized models, model predictions, or model summaries
├── notebooks          <- Jupyter notebooks. Naming convention is a number (for ordering),
│                         the creator's initials, and a short `-` delimited description, e.g.
│                         `1.0-jqp-initial-data-exploration`.
├── references         <- Data dictionaries, manuals, and all other explanatory materials.
├── reports            <- Generated analysis as HTML, PDF, LaTeX, etc.
│   └── figures        <- Generated graphics and figures to be used in reporting
├── requirements.txt   <- The requirements file for reproducing the analysis environment, e.g.
│                         generated with `pip freeze > requirements.txt`
├── setup.py           <- Make this project pip installable with `pip install -e`
├── src                <- Source code for use in this project.
│   ├── __init__.py    <- Makes src a Python module
│   │
│   ├── data           <- Scripts to download or generate data
│   │   └── make_dataset.py
│   │
│   ├── features       <- Scripts to turn raw data into features for modeling
│   │   └── build_features.py
│   │
│   ├── models         <- Scripts to train models and then use trained models to make
│   │   │                 predictions
│   │   ├── predict_model.py
│   │   └── train_model.py
│   │
│   └── visualization  <- Scripts to create exploratory and results oriented visualizations
│       └── visualize.py
└── tox.ini            <- tox file with settings for running tox; see tox.readthedocs.io

This project structure might be too complex for simpler projects, but it gives you a starting point that you can reduce or repurpose. Just 'data', 'models', 'notebooks', 'output' and 'tests' folders might be enough, with a 'src' directory for helper modules/functions and a good README. The same structure can also be used for projects where R is used instead of Python.
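A reduced structure like that can be set up in seconds. As a sketch (the project and folder names here are just suggestions):

```shell
# Create a minimal analytical project skeleton
mkdir -p my-project/data my-project/models my-project/notebooks
mkdir -p my-project/output my-project/tests my-project/src

# Add the README and dependency file every project needs
touch my-project/README.md my-project/requirements.txt

# Check the result
ls my-project
```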

Use version control always

You may think it's only a small project and that version control is overkill for it. Always use version control! Your future self will thank you 😄 It lets you back up your work, collaborate with others using branches, revert to previous versions, and more. There's rarely a good reason not to use it!

First create a repository with a repository hosting provider such as GitHub.

Then in your working directory initialise the directory as a repo and push your initial commit.

git init
git commit -m "Initial commit"
git branch -M main
git remote add origin <remote-url>
git push -u origin main

Then every time you make a change, commit again. Keep commits small and frequent, rather than committing lots of changes in one go. Push to the remote repository every once in a while so your changes are backed up.

git add .
git commit -m "Add new percentage calculations to model"
git push

There are many Git commands worth exploring; the most useful are reverting to a previous commit, and creating a new branch to work on something separately before merging it back into the main branch. You can see the whole history of the project, with every commit, using git log --graph.
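The branching workflow can be tried safely in a throwaway repository; the repository, file and branch names below are just examples:

```shell
# Initialise a throwaway repository to experiment in
mkdir demo-repo
git -C demo-repo init -q
git -C demo-repo config user.email "you@example.com"
git -C demo-repo config user.name "Your Name"

# Make a first commit on the main branch
echo "sales data" > demo-repo/data.csv
git -C demo-repo add .
git -C demo-repo commit -q -m "Initial commit"
git -C demo-repo branch -M main

# Create a branch for a piece of work and commit to it
git -C demo-repo checkout -q -b add-forecast-feature
echo "forecast code" > demo-repo/forecast.py
git -C demo-repo add .
git -C demo-repo commit -q -m "Add forecast feature"

# Merge it back into main, then view the history
git -C demo-repo checkout -q main
git -C demo-repo merge -q add-forecast-feature
git -C demo-repo log --graph --oneline
```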

Keep it reproducible with a virtual environment and README

A virtual environment is a collection of packages/dependencies that gives you everything you need to run a project. It solves 'but it works on my machine' problems. You want your analysis to be reproducible: someone should be able to clone your repo, install the package dependencies and run your code successfully. For Python there are the venv and pipenv packages, and for R there are the renv and packrat packages. I prefer venv and renv.
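A typical venv workflow looks something like this (a sketch; the environment folder name .venv is just a common convention):

```shell
# Create a virtual environment in the project folder
python3 -m venv .venv

# Activate it (on Windows use: .venv\Scripts\activate)
. .venv/bin/activate

# Record installed package versions so others can reproduce the environment
python -m pip freeze > requirements.txt

# Leave the environment when finished
deactivate
```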

When someone first clones your repo, there may be other steps they have to go through to run your code too. There may be environment variables that need adding to a .env file, or sensitive data files to add that could not be stored in version control. A good README file helps with the setup steps. Here are some setup steps from an analytical web app I worked on recently, which used the Django Python web framework.
# My data visualisation app

    This app presents data visualisation in a web interface.

## Features

    * Security and user login
    * HTTPS Let's Encrypt
    * Object-relational mapping
    * Integration to Google Sheets API

## Running locally

    * Create and activate a virtual environment 

        python -m venv venv
        source venv/bin/activate
        python -m pip install -r requirements.txt

    * To deactivate use:

        deactivate

    * To install new packages use:

        python -m pip install <package-name>

    * To register newly installed packages use:

        python -m pip freeze > requirements.txt

    * Create the database 'db.sqlite3' and migrate the latest schema using:

        python manage.py migrate

    * Create a superuser account to login using:

        python manage.py createsuperuser
        Username: admin
        Email address: <your-email-address>
        Password: admin
        Bypass password validation and create user anyway? [y/N]: y

    * Pre-populate the database with some testing data (optional):

        python manage.py loaddata responses.json

    * Add environment variable file '.env' in /home directory with:


    * Run the application using:

        python manage.py runserver

Keep code modular, adaptable, documented and simple

Some problems do call for quite complex solutions, but by abstracting away some of that complexity into easy-to-understand classes, methods, functions and variables we can make them simpler. The main characteristics of high quality code are:

  • Clean and consistent style
  • Functional
  • Easy to understand for others
  • Efficient
  • Testable
  • Easy to maintain
  • Easy to change and adapt
  • Well documented

We can achieve most of these things by creating well defined classes, methods and functions that do what they say they will, are well documented and are testable. We can also refactor early and often to keep the code as readable as it can be - we write code for humans more than for computers! Following a style guide such as the Google Python style guide or the Tidyverse R style guide can also keep the code standardised.
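As a small illustration of refactoring for readability (the names, rates and prices here are made up), compare an unexplained expression with a well-named, documented function:

```python
# Before: a magic formula with no explanation
p = 19.99 * 0.8 * 1.2

# After: named constants and a documented function make the intent clear
DISCOUNT_RATE = 0.2
VAT_RATE = 0.2


def final_price(list_price: float) -> float:
    """Apply the standard discount, then add VAT."""
    discounted = list_price * (1 - DISCOUNT_RATE)
    return discounted * (1 + VAT_RATE)


print(final_price(19.99))
```

Both versions compute the same number, but only the second one tells the next reader why.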

Files should start with a docstring describing the contents and usage of the module:

"""A one line summary of the module or program, terminated by a period.

Leave one blank line.  The rest of this docstring should contain an
overall description of the module or program.  Optionally, it may also
contain a brief description of exported classes and functions and/or usage

Typical usage example:

  foo = ClassFoo()
  bar = foo.FunctionBar()

R function docstring:

#' Short title for function
#' @description
#' Longer description of the function
#' @param first An object of class "?". Description of parameter
#' @param second An object of class "?". Description of parameter
#' @return Returns an object of class "?". Description of what the function returns
#' @examples
#' # Add some code illustrating how to use the function
my_new_function <- function(first, second) {
    return("hello world")
}

JavaScript function docstring:

/**
 * Summary. (use period)
 * Description. (use period)
 * @see  Function/class relied on
 * @link URL
 * @param {type}   var           Description.
 * @param {type}   [var]         Description of optional variable.
 * @param {type}   [var=default] Description of optional variable with default variable.
 * @param {Object} objectVar     Description.
 * @param {type}   objectVar.key Description of a key in the objectVar parameter.
 * @yield {type} Yielded value description.
 * @return {type} Return value description.
 */
function myNewFunction () {
  return "hello world";
}

Python function docstring:

def my_new_function(first: str, second: int) -> str:
    """Short title for function.

    Longer description of the function.

    Args:
        first (str): A description of the first argument.
        second (int): A description of the second argument.

    Returns:
        result (str): A description of the return value.

    Raises:
        IOError: A description of the error raised.
    """
    result = first + str(second)

    return result

Not only do docstrings make your code easier for yourself and others to understand, the best part is that you can auto-generate documentation using Sphinx for Python and using Roxygen for R! These require another article to go through but are really useful for keeping documentation up to date.

We can also make any code more adaptable by not hardcoding configuration values and instead putting them in a YAML or JSON config file. This makes input parameters easier to quickly change and see the result of that change on the outputs.

input_path: "C:/a/very/specific/path/to/input_data.csv"
output_path: "outputs/predictions.csv"

test_split_proportion: 0.3
random_seed: 42

parameters:
    constant_a: 7
    max_v: 1000
We can then load the config in Python:

import pandas as pd
import yaml

with open("./config.yaml") as file:
    config = yaml.safe_load(file)

data = pd.read_csv(config["input_path"])

And in R:

config <- yaml::yaml.load_file("config.yaml")

data <- read.csv(config$input_path)

Use automated unit tests and peer review

Using a unit testing framework like pytest, unittest, testthat or RUnit will help you check whether those nicely documented functions you wrote actually do what they say they should. Test-driven development, to me, simply means you are the first user of your own code. If all your functions, classes and methods do what they are expected to do, we can be much more confident the overall program will behave as expected.

These same frameworks can be used to write higher-level acceptance tests too, like 'does the whole program produce the expected results?'. These test the overall behaviour of the code as opposed to the implementation. Don't aim for 100% test coverage; testing the critical functions and the most realistic use cases of your code is what matters most. Create your first tests and build your library of tests from there. A unit test should be small, it should run fast and it should test one unit of code.
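An acceptance-style test might look like this sketch, where run_forecast stands in for a whole (hypothetical) pipeline and the test checks behaviour only: every product gets a prediction, and no prediction is negative.

```python
def run_forecast(sales_history: dict) -> dict:
    """Stand-in for the whole pipeline: forecast next-day sales per product."""
    # Naive forecast: tomorrow's sales equal the average of the history
    return {
        product: sum(values) / len(values)
        for product, values in sales_history.items()
    }


def test_forecast_produces_sensible_output():
    # Arrange
    history = {"apples": [10, 12, 11], "bread": [5, 7, 6]}

    # Act
    predictions = run_forecast(history)

    # Assert: behaviour-level checks, not implementation details
    assert set(predictions) == set(history)
    assert all(value >= 0 for value in predictions.values())
```

If the forecasting method changes later, this test keeps passing as long as the overall behaviour stays sensible.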

Below is an example of a unit test with pytest. This one fails because the function multiplies the number by 2, not by 3! All test file names must begin with 'test_' so pytest discovers them when you run the pytest command in the same directory. Using the arrange, act, assert pattern also helps readability.
def times_number_by_three(number: float):
    return number * 2

def test_times_number_by_three():
    # Arrange
    value = 3
    # Act
    result = times_number_by_three(value)

    # Assert
    expected = 9
    assert result == expected
user@ShedloadOfCode:~$ pytest
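For completeness, correcting the function to match its name makes the same test pass:

```python
def times_number_by_three(number: float):
    return number * 3  # multiply by 3, as the name promises


def test_times_number_by_three():
    # Arrange
    value = 3

    # Act
    result = times_number_by_three(value)

    # Assert
    expected = 9
    assert result == expected
```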

Next is the same example but using R and testthat. RStudio will automatically recognise the test_that function and give a 'Run Tests' option in the top right. Alternatively, you can run testthat::test_file("test_calculations.R") from the R console to test a single file.


times_number_by_three <- function(number) {
  return(number * 2)
}

test_that("number_is_multiplied_by_three", {
    # Arrange
    value <- 3

    # Act
    result <- times_number_by_three(value)

    # Assert
    expected <- 9
    expect_equal(result, expected)
})
> testthat::test_file("test_calculations.R")

Other things to be aware of when testing are:

  • The function you want to test doesn't have to live in the test file as in these examples; you can import it from elsewhere in your project, making testing super simple.
  • You can also split your tests up into separate files to keep the project structure clean.
  • You can create tests to validate any outputs and check the behaviour of the code as QA and acceptance tests.
  • You can run all test files in a directory with both pytest and testthat fully automating your test suite.

Finally, although automation is great and having a suite of tests to run every time you introduce a change gives you confidence, peer review is equally important. This is where someone else reviews your code and checks that it is readable, understandable and actually works. When reviewing code you should ask yourself these questions:

  • Can I easily understand what the code does?
    • Is the code sufficiently documented for me to understand it?
    • Is there duplication in the code that could be simplified by refactoring into functions and classes?
    • Are functions and class methods simple, using few parameters?
  • Does the code fulfil its requirements?
  • Is the required functionality tested sufficiently?
  • How easy will it be to alter this code when requirements change? They always do.
    • Are high level parameters kept in dedicated configuration files? Or would somebody need to work their way through the code with lots of manual edits to reconfigure for a new run?
  • Can I generate the same outputs that the analysis claims to produce?
    • Have dependencies been sufficiently documented?
    • Is the code version, input data version and configuration recorded?

On the useful site I shared at the beginning of this article, you can also find code quality assurance checklists for analytical projects, which are a really good starting point.


These six tips should make any analytical project you start a pleasure to work on. Spending the time to really think about the end goal, keep things simple and get your project structure right is worth it. I think it was Abraham Lincoln who said "give me six hours to chop down a tree and I will spend the first four sharpening the axe". Solid advice we should all take.

Thanks for reading 👍 If you enjoyed this article you might also like the article Preparing for a statistical data science interview.

Here are some recommended resources for further learning: