
Scientific Computing practices and workflow for MSc and PhD students

Mikael Kaandorp edited this page Jul 6, 2021 · 4 revisions

Purpose of this document

Oceanographic research projects such as MSc and PhD theses often involve months or years of modelling and data analysis. Ocean and climate scientists are often very adept at building complex software to model and analyze the physical phenomena they study. Because this code is a valuable component of the research, students are asked to meet high standards of quality and transparency and to make their work publicly available. Most of them, however, have not had extensive formal training in scientific computing, which makes it daunting to start writing months or years' worth of reliable and maintainable scientific code. To help each other develop a good workflow and good coding habits, this document collects ideas, good practices and challenging problems.

Literature

Software Carpentry https://doi.org/10.1371/journal.pbio.1001745

Tips

Version Control

All OceanParcels and TOPIOS projects have a GitHub repository to store and share the code that is created. Learning how to use this repository helps you structure your project and keep track of code you wrote months ago. Because all changes are tracked, you do not have to worry about removing code you think is unnecessary or obsolete: earlier versions can always be recovered. You can work with the repository either from the command line (useful if you also work remotely, for example on gemini or cartesius) or from GitHub Desktop (very easy to use).

If you want to access your repository from the command line, the starting point is to git clone the repository onto your system, e.g. git clone https://github.com/OceanParcels/parcels.git. After that, it helps to become familiar with the basic git commands or to work through a guide.

GitHub nowadays requires a 'Personal Access Token' or an SSH key if you want to push changes to a repository you are working on. SSH keys are convenient, as you do not have to type in your password each time you access, for example, GitHub or a remote server such as gemini. A step-by-step guide on how to create an SSH key for GitHub can be found here: https://docs.github.com/en/github/authenticating-to-github/connecting-to-github-with-ssh

You can make separate SSH keys for your laptop and gemini (and add them both to your GitHub account), or copy one key to both systems, see e.g. https://gist.github.com/stormpython/9517102#copy-the-public-key

If you cloned the GitHub repository using HTTPS, you might need to change the .git/config file inside the repository so that it uses SSH instead, see e.g. https://stackoverflow.com/questions/7773181/git-keeps-prompting-me-for-a-password

Under the '[remote "origin"]' section you have to change 'https://github.com/username/repo.git' to 'ssh://git@github.com/username/repo.git'. If you cloned the repository using 'git@github.com:username/repository.git' after creating the SSH key, this change should not be necessary.
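As a minimal sketch of the URL change described above, the helper below rewrites an HTTPS remote URL into its SSH equivalent. The function name https_to_ssh is hypothetical and not part of git or any GitHub tooling. The resulting string can be placed in .git/config by hand, or passed to git remote set-url origin <url>.

```python
from urllib.parse import urlparse


def https_to_ssh(url):
    """Turn 'https://host/user/repo.git' into 'ssh://git@host/user/repo.git'.

    Hypothetical helper for illustration only. URLs that do not use
    the https scheme are returned unchanged.
    """
    parsed = urlparse(url)
    if parsed.scheme != "https":
        return url
    # GitHub (like most git hosts) expects the fixed user name 'git' over SSH
    return f"ssh://git@{parsed.netloc}{parsed.path}"


print(https_to_ssh("https://github.com/username/repo.git"))
# prints: ssh://git@github.com/username/repo.git
```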

Structure your programming workflow

Repositories, data flow

Parcels General Structure

Working on different systems

Levels of refactoring

Version documentation

Python Environment

YML file Jupyter Extension version_information

Parcels Simulation Metadata

When you run a Parcels simulation, you are creating new data to analyse. For reproducibility and clarity, it is important that the new data include metadata about the parameters and settings that were used to create them. Good naming conventions for your data files are an important start for telling simulations apart, but a more robust way to document a simulation is to use the Parcels method ParticleFile.add_metadata().
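One possible pattern, sketched here under stated assumptions, is to collect all settings of a run in a single dictionary and attach each entry to the output file. The settings shown and the attach_metadata helper are hypothetical. The only Parcels call assumed is add_metadata(name, message) on the ParticleFile; a small stand-in class replaces the real ParticleFile so that the sketch runs without Parcels installed.

```python
class FakeParticleFile:
    """Stand-in for parcels.ParticleFile so this sketch runs without Parcels."""

    def __init__(self):
        self.metadata = {}

    def add_metadata(self, name, message):
        self.metadata[name] = message


def attach_metadata(particle_file, settings):
    """Store every setting as a string attribute on the output file."""
    for key, value in settings.items():
        particle_file.add_metadata(key, str(value))


# Hypothetical simulation settings, for illustration only
settings = {"runtime_days": 30, "dt_minutes": 10, "kernels": "AdvectionRK4"}

output_file = FakeParticleFile()
attach_metadata(output_file, settings)
print(output_file.metadata)
```

Keeping the settings in one dictionary means the same object can drive the simulation and document it, so the metadata cannot drift away from the values that were actually used.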

If you are working from within a git repository, a quick way to be able to document the version of the kernels and preprocessing software in a parcels simulation is to include the hash or tag that points to the latest commit. To do this, you first have to find the commit label and then store it in the ParticleFile:

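import subprocess

from parcels import ParticleFile

# Tag or abbreviated hash of the latest commit.
# check_output returns bytes, so decode it to a plain string.
label = subprocess.check_output(["git", "describe", "--always"]).decode().strip()

output_file = ParticleFile(name, ...)
output_file.add_metadata('git_hash', label)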
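
The snippet above assumes that the script runs inside a git repository and that git is installed. A slightly more defensive sketch is given below; the function name git_label and the "unknown" fallback are illustrative choices, not part of Parcels. The --dirty flag of git describe appends "-dirty" to the label when the working tree has uncommitted changes, which flags output produced by code that does not exactly match any commit.

```python
import subprocess


def git_label(repo_dir="."):
    """Return the tag or short hash of the latest commit in repo_dir.

    Falls back to "unknown" when git is not installed, the directory
    does not exist, or the directory is not inside a git repository.
    """
    try:
        out = subprocess.check_output(
            ["git", "describe", "--always", "--dirty"],
            cwd=repo_dir,
            stderr=subprocess.DEVNULL,  # silence git's own error message
        )
    except (subprocess.CalledProcessError, OSError):
        return "unknown"
    return out.decode().strip()


print(git_label())
```

The returned string can be passed to add_metadata in the same way as label above.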