Lifetime determination of large-scale weather regimes.
Installation of the following packages is required:
- gcc (build-essential) (required by HDBSCAN)
- libgeos and libgeos-dev (required by Cartopy)
- python3-opencv (required by opencv)
Ubuntu 20:

```shell
sudo apt-get install -y \
  build-essential \
  libgeos-3.9.0 \
  libgeos-dev \
  libopenmpi-dev \
  python3-opencv
```
Ubuntu 22:

```shell
sudo apt-get install -y \
  build-essential \
  libgeos3.10.2 \
  libgeos-dev \
  libopenmpi-dev \
  python3-dev \
  python3-opencv
```
For local development, torch-cpu can be installed:

```shell
poetry run pip install -r requirements-cpu.txt
```
The versions of pytorch and torchvision must match in all of these files:

- `pyproject.toml`
- `requirements-cpu.txt`
- `docker/a6-cuda.Dockerfile`

Otherwise, different versions might get installed, which will lead to conflicts.
- Copy the `.env.example` to a `.env` file, set the required environment variables for tracking (see the first block in `.env.example`), and then source the file: `source .env`
- Important note: Make sure to copy the `.env.example` file to a file with a `.env` extension. Such files are ignored by git (see `.gitignore`). Otherwise, you risk committing your credentials to the git repository.
- Note: Make sure to set the correct `MLFLOW_EXPERIMENT_ID` environment variable to track to the desired experiment.
- Initialize tracking with mantik: `eval $(poetry run mantik init)`
  The above command will set the `MLFLOW_TRACKING_TOKEN` environment variable, which enables tracking to mantik.
- Run the DCv2 script: `poetry run python mlflow/train_dcv2.py --enable-logging --use-cpu --epochs 1 --nmb-clusters 2`
  Note: Running with the data used by the script as the default file requires git-lfs. When executing for the first time, the data file has to be pulled via `git-lfs pull`.
- Refresh the MLflow UI to see the logged parameters, metrics, models, and artifacts.
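If git-lfs has not pulled the data yet, the `.nc` file on disk is only a small text pointer, which typically makes the script fail when opening it. A quick way to check is to look for the standard git-lfs pointer header; the helper below is a hypothetical sketch, not part of the repository.

```python
# Un-pulled git-lfs files are small text pointers starting with this header.
GIT_LFS_HEADER = b"version https://git-lfs.github.com/spec/v1"

def is_lfs_pointer(path: str) -> bool:
    """True if the file at `path` is still a git-lfs pointer rather than real data."""
    with open(path, "rb") as f:
        return f.read(len(GIT_LFS_HEADER)) == GIT_LFS_HEADER
```

If this returns `True` for the data file, run `git-lfs pull` before starting the script.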
- Build the Docker image: `make build-docker`
- Initialize tracking to mantik and set the `MLFLOW_EXPERIMENT_ID` environment variable (see above).
- Run the project:

```shell
poetry run mlflow run mlflow/a6 \
  -e cluster \
  -P weather_data=/data/temperature_level_128_daily_averages_2020.nc \
  -P config=cluster.yaml \
  -P use_varimax=false
```
Note: The a6 package is installed into the Docker container at build time. If the source code of the a6 package was modified, the Docker image has to be rebuilt (see 1.) in order to have the updated source code in the container image. The given folder (`mlflow/`), on the other hand, is copied by mlflow into the container when running the project and hence does not require rebuilding the Docker image manually if any of these files were modified (see here).
- Build the Apptainer image: `make build-cuda`
- Set the required environment variables for the Compute Backend:
  - `MANTIK_UNICORE_USERNAME`
  - `MANTIK_UNICORE_PASSWORD`
  - `MANTIK_COMPUTE_BUDGET_ACCOUNT`
- Run on HPC via mantik:

```shell
poetry run mantik runs submit \
  --run-name "<run-name>" \
  --entry-point dcv2 \
  --backend-config compute-backend-config-dcv2.yaml \
  $PWD/mlflow/
```
Note: Running with Apptainer (and not as an MLproject via `mlflow run`) does not track the git version (git commit hash), because, when creating a new run, MLflow attempts to import the git Python module and read the project repository to retrieve the commit hash. This is not possible inside the Apptainer container since:

- git is not installed within the container (the error is usually logged by MLflow, but can be silenced by setting the `GIT_PYTHON_REFRESH=quiet` environment variable inside the container).
- the repository is not available inside the container, but only the `train_kmeans.py` file. Hence, installing git inside the container does not solve the issue.

As a consequence, the version (`mlflow.source.git.commit` tag) is set to `None`.
- Prerequisites:
  - Create a private SSH key file `~/.ssh/jsc` (`~/.ssh/e4`) and upload its public counterpart to JuDoor (the E4 help center), or adjust the path in `JSC_SSH_PRIVATE_KEY_FILE` (`E4_SSH_PRIVATE_KEY_FILE`) in the `Makefile`.
  - JSC: Set the `MANTIK_UNICORE_USERNAME` and `MANTIK_UNICORE_PASSWORD` environment variables to allow uploading via SSH.
  - E4: Set the `E4_USERNAME` and `E4_SERVER_IP` environment variables. `E4_SERVER_IP` here is the IP of the E4 machine you want to use for SSH login.
- Build the Apptainer image with the package and ipykernel installed: `make build-jsc-kernel`
  For E4, use the `build-e4-kernel` target.
- Upload the image and the `kernel.json` file: `make upload-jsc-kernel`
  For E4, use the `upload-e4-kernel` target.
  Note: Alternatively, you can also execute the two above steps at once: `make deploy-jsc-kernel`
  For E4, use the `deploy-e4-kernel` target.
If this worked correctly, the kernel should be available in Jupyter JSC/on the E4 system under the name `a6`.

The Apptainer image may generally be used to run the package on e.g. JUWELS via `apptainer exec <path to image> python <path to script>`.
- Start a Jupyter lab via Jupyter JSC on the respective login node (i.e. JUWELS or JUWELS Booster).
- Select the kernel (see above).
- Run the notebook `notebooks/jsc/parallel_a6.ipynb`.
- Connect to the VPN.
- SSH onto a certain host.
- The kernel needs Apptainer (formerly Singularity), hence the module has to be loaded: `module load go-1.17.6/singularity-3.9.5`
- Start Jupyter on the host:

```shell
cd <repo directory>
poetry install -E notebooks
poetry run jupyter notebook
```

- From your local terminal, establish an SSH tunnel to the machine's Jupyter port: `ssh -fN -L <local port>:localhost:8888 <user>@<IP of the host>`
- Access Jupyter from your local browser by copying the token or URL from the output of the `poetry run jupyter notebook` command. The URL should look as follows: `http://localhost:8888/?token=<token>`.
- Run the notebook `notebooks/e4/parallel_a6.ipynb`.
Recent ERA5 data may contain an additional dimension called `expver` with levels `1` and `5`. Level 1 typically has values up to some point in the `time` dimension and is `NaN` afterwards, while level 5 is `NaN` up to that point and has values afterwards. Since the two levels are complementary, they can be reduced by taking the sum while ignoring `NaN`. This can be achieved with `np.nansum`:

```python
ds_new = ds.reduce(np.nansum, dim="expver", keep_attrs=True)
```
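Because at every time step exactly one of the two levels carries data, a NaN-ignoring sum simply selects the valid level. A small numpy sketch of the same reduction (synthetic values; the real call above reduces over the additional dimension of the xarray dataset):

```python
import numpy as np

# Two synthetic levels with complementary NaNs along time, mimicking the
# structure described above: level 1 has data early, level 5 late.
level_1 = np.array([280.0, 281.0, np.nan, np.nan])
level_5 = np.array([np.nan, np.nan, 282.0, 283.0])
stacked = np.stack([level_1, level_5])  # shape: (level, time) = (2, 4)

# Summing over the level axis while ignoring NaN picks the valid level per step.
merged = np.nansum(stacked, axis=0)
print(merged)  # → [280. 281. 282. 283.]
```

Note that `np.nansum` yields `0.0` for positions where all inputs are `NaN`, so any time step missing from both levels would appear as a zero rather than `NaN` in the result.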