
Good practices in Scientific Computing

Browse through dozens of projects and papers and ask yourself: are they reproducible? Could you clone the repository, run the code, and get the same results? Or would you give up at the first dependency conflict? Is the project documented? Where would you even start?

To stop wasting time, and to build better projects, PLEASE consider these good practices. Remember, the code you write today will ease the life of the future you!

Project organization

What should the directory structure of an ideal project look like? Whatever you like, as long as it follows some logic.

Before we write any code, we need an src folder that contains all of the project's scripts, snippets, and shell procedures. We may split this directory in two: one part holding the common functions, utils, constants, and neural network parameters that we update often, while the other holds the scientific code - the brain of the scripts you will be creating.

Choose variable names, function definitions, and so forth with care: each directory (and each file within it!) should contain only scripts and material associated with its name.

For intermediate results, it is good practice to create a temp folder - this holds all cleaned data frames and preprocessed images. After your program finishes, you can decide whether to keep or delete them, since they can be regenerated by simply rerunning the script. The results folder should contain only publication-ready figures and simulated images that have already been analyzed.
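As a sketch, a layout along these lines can be bootstrapped with a few lines of Python. The folder names below are illustrative, not prescriptive - adapt them to your own project's logic:

```python
from pathlib import Path

# Illustrative layout; rename folders to fit your project.
FOLDERS = [
    "src/common",    # shared utils, constants, model parameters
    "src/analysis",  # the scientific core of the project
    "data/raw",      # data exactly as received, never edited
    "temp",          # regenerable intermediates (cleaned frames, etc.)
    "results",       # publication-ready figures and analyzed output
]

def scaffold(root: str) -> None:
    """Create the skeleton of a new project under `root`."""
    for folder in FOLDERS:
        Path(root, folder).mkdir(parents=True, exist_ok=True)

scaffold("my_project")
```

Keeping a script like this around also documents the intended structure for anyone joining the project later.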

How scientific software should be built

The goal of all of these approaches is to keep your software legible, reusable, and testable. There are no hard prerequisites: you can start by including a brief statement at the beginning of each program, along with a usage example that explains its parameters.

Another recommendation, which I cannot stress enough: split your programs into functions, so that they fit into the most limited memory of all - ours. Instead of duplicating code, write and reuse functions. Likewise, keep your project as reusable as possible by listing dependencies in a requirements.txt and documenting the entire environment setup.
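As a minimal illustration of the "write once, reuse everywhere" idea (the normalize helper below is a made-up example, not from any particular project):

```python
# Instead of repeating the same scaling logic inline in several places,
# write it once as a small, documented, reusable function.
def normalize(values):
    """Scale a sequence of numbers to the [0, 1] range."""
    lo, hi = min(values), max(values)
    if lo == hi:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

print(normalize([2, 4, 6]))  # [0.0, 0.5, 1.0]
```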

Now, this tip addresses something I've seen in countless repositories, and it freaks me out: unexplained commented-out sections of code. Just imagine how someone who has never seen that code before would interpret them.

I could speak a ton here and discuss software engineering practices, but for most research problems, this should be more than enough. There are a few concepts that would be pretty interesting to develop in scientific research:

Unit-test the hell out of your code! A unit test is a small test of one particular feature of a piece of software. Projects rely on unit tests to prevent regressions, i.e., to ensure that a change to one line of code doesn't break other functions. While unit tests are essential to the health of large libraries and programs, we have found that they usually aren't compelling for solo exploratory work.
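A minimal sketch of what such a test looks like in plain Python, using a polygon-area function as a hypothetical unit under test (runners like pytest discover functions named test_* automatically):

```python
def polygon_area(vertices):
    """Area of a simple polygon via the shoelace formula."""
    n = len(vertices)
    s = 0.0
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]
        s += x1 * y2 - x2 * y1
    return abs(s) / 2.0

def test_unit_square_has_area_one():
    # One small test of one particular feature.
    assert polygon_area([(0, 0), (1, 0), (1, 1), (0, 1)]) == 1.0

test_unit_square_has_area_one()
```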

Refactor and profile the code you are working on. Profiling is the act of measuring where a program spends its time, and it is an essential first step in tuning the program (i.e., making it run faster). Both are worth doing, but only when the program's performance is a bottleneck.
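As a sketch, Python's built-in cProfile and pstats modules can show where the time goes; slow_sum here is a toy stand-in for real analysis code:

```python
import cProfile
import io
import pstats

def slow_sum(n):
    """Toy workload: sum of squares below n."""
    total = 0
    for i in range(n):
        total += i * i
    return total

profiler = cProfile.Profile()
profiler.enable()
slow_sum(100_000)
profiler.disable()

# Report the five most expensive calls by cumulative time.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
print(stream.getvalue())
```

Only once a hotspot shows up in such a report is it worth refactoring that part for speed.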

Tracking knowledge base

More crucial than programming is keeping track of all project modifications and executed research ideas. The ability to refer to or retrieve a specific version of the project helps with repeatability. Git, for example, keeps track of which files were updated, by whom, and when.

While we are on the subject, each change should be accompanied by a meaningful message. If you added polygon-area computation to your software, write a commit message like feature: added polygon area formula.

Each modification should not be so significant that the change tracking becomes obsolete. A single commit such as “Revise script file” that adds or modifies several hundred lines, for example, is likely too big since it prevents changes to distinct components of analysis from being reviewed individually. As a rule of thumb, a good size for a single change is a group of edits you could imagine wanting to undo in one step in the future.

Where possible, save data as generated (i.e., by an instrument, from a survey). It is tempting to overwrite raw data files with cleaned-up versions but faithful retention is essential for rerunning analyses from start to finish, for recovery from analytical mishaps, and for experimenting without fear. Consider changing file permissions to read-only or using spreadsheet protection features so that it is harder to damage raw data by accident or to hand edit it in a moment of weakness.
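On Unix-like systems, a small Python sketch can drop the write permission on a raw file; survey_raw.csv is a hypothetical example:

```python
import os
import stat

def protect(path):
    """Remove write permission so raw data can't be edited by accident."""
    mode = os.stat(path).st_mode
    os.chmod(path, mode & ~stat.S_IWUSR & ~stat.S_IWGRP & ~stat.S_IWOTH)

# Hypothetical raw data file, created here only so the sketch is runnable.
with open("survey_raw.csv", "w") as f:
    f.write("id,answer\n1,yes\n")

protect("survey_raw.csv")
```

Any later attempt to open the file for writing will now fail loudly instead of silently corrupting the original data.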

Create the data set you wished you had received. The goal is to improve machine and human readability, but not to do vigorous data filtering or add external information. Machine readability allows automatic processing using computer programs, which is relevant when others want to reuse your data.
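A minimal sketch of machine-readable output: one observation per row, with self-describing headers that carry their units. The column names here are invented for illustration:

```python
import csv

# Tidy layout: one observation per row, headers that state the unit.
rows = [
    {"sample_id": "S1", "temperature_c": 21.5, "ph": 7.1},
    {"sample_id": "S2", "temperature_c": 22.0, "ph": 6.9},
]

with open("measurements.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["sample_id", "temperature_c", "ph"])
    writer.writeheader()
    writer.writerows(rows)
```

A file like this can be loaded by any language's CSV reader without guesswork about layout or units.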

Data manipulation is as integral to your analysis as statistical modeling and inference. If you do not document this step thoroughly, it is impossible for you or anyone else to repeat the methodology. The best way to do this is to write a script for every data processing stage. This might feel frustratingly slow, but you will get faster with practice. The immediate payoff is the ease with which you can redo data preparation when new data arrive, and you can reuse the same steps in related projects later. For large data sets, data preparation may also include writing and saving scripts that fetch the data, or subsets of it, from remote storage.
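A toy sketch of one scripted preparation stage, assuming a hypothetical raw.csv with some missing values: the script reads the raw file, drops incomplete rows, and writes the intermediate result into the temp area described earlier.

```python
import csv

def clean(raw_path, out_path):
    """One scripted preparation stage: drop rows with missing values."""
    with open(raw_path) as f:
        rows = [r for r in csv.DictReader(f) if all(r.values())]
    with open(out_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=rows[0].keys())
        writer.writeheader()
        writer.writerows(rows)

# Hypothetical raw input, created here only so the sketch is runnable.
with open("raw.csv", "w") as f:
    f.write("id,value\n1,3.2\n2,\n3,5.1\n")

clean("raw.csv", "temp_clean.csv")
```

Because the whole stage lives in a script, rerunning it on a new batch of raw data is a single command rather than a half-remembered sequence of manual edits.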

Final Remarks

These practices are pragmatic and accessible to people who consider themselves new to computing, and anyone can apply them! Most importantly, these practices make researchers more productive individually by enabling them to get more done in less time and with less pain. They also accelerate research as a whole by making computational work (which increasingly means all work) more reproducible.

However, progress will not happen by itself. Companies and journals do incentivize the practices described here, but the time and skills required to follow them are still undervalued.

This post is licensed under CC BY 4.0 by the author.