Lifetime determination of large-scale weather regimes.
Installation of the following packages is required:
- gcc (build-essential) (required by HDBSCAN)
- libgeos and libgeos-dev (required by Cartopy)
- python3-opencv (required by opencv)
Ubuntu 20:

```shell
sudo apt-get install -y \
  build-essential \
  libgeos-3.9.0 \
  libgeos-dev \
  libopenmpi-dev \
  python3-opencv
```
Ubuntu 22:

```shell
sudo apt-get install -y \
  build-essential \
  libgeos3.10.2 \
  libgeos-dev \
  libopenmpi-dev \
  python3-dev \
  python3-opencv
```
For local development, torch-cpu can be installed:

```shell
poetry run pip install -r requirements-cpu.txt
```
The versions of pytorch and torchvision must match in all of these files:

- `pyproject.toml`
- `requirements-cpu.txt`
- `docker/a6-cuda.Dockerfile`

Otherwise, different versions might get installed, which will lead to conflicts.
- Copy the `.env.example` to a `.env` file, set the required environment variables for tracking (see the first block in `.env.example`), and then source the file: `source .env`
- Important note: Make sure to copy the `.env.example` file to a file with a `.env` extension. Such files are ignored by git (see `.gitignore`). Otherwise, you risk committing your credentials to the git repository.
- Note: Make sure to set the correct `MLFLOW_EXPERIMENT_ID` environment variable to track to the desired experiment.
- Initialize tracking with mantik: `eval $(poetry run mantik init)`
  The above command will set the `MLFLOW_TRACKING_TOKEN` environment variable, which enables tracking to mantik.
- Run the DCv2 script: `poetry run python mlflow/train_dcv2.py --enable-logging --use-cpu --epochs 1 --nmb-clusters 2`
  Note: Running with the data used by the script as the default file requires git-lfs. When executing for the first time, the data file has to be pulled via `git-lfs pull`.
- Refresh the MLflow UI to see the logged parameters, metrics, models, and artifacts.
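If git-lfs has not pulled the data yet, the `.nc` file on disk is only a small text pointer, which typically makes the script fail when opening it. A quick way to check is to look for the standard git-lfs pointer header; the helper below is a hypothetical sketch, not part of the repository.

```python
# Un-pulled git-lfs files are small text pointers starting with this header.
GIT_LFS_HEADER = b"version https://git-lfs.github.com/spec/v1"

def is_lfs_pointer(path: str) -> bool:
    """True if the file at `path` is still a git-lfs pointer rather than real data."""
    with open(path, "rb") as f:
        return f.read(len(GIT_LFS_HEADER)) == GIT_LFS_HEADER
```

If this returns `True` for the data file, run `git-lfs pull` before starting the script.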
- Build the Docker image: `make build-docker`
- Initialize tracking to mantik and set the `MLFLOW_EXPERIMENT_ID` environment variable (see above).
- Run the project:

```shell
poetry run mlflow run mlflow/a6 \
  -e cluster \
  -P weather_data=/data/temperature_level_128_daily_averages_2020.nc \
  -P config=cluster.yaml \
  -P use_varimax=false
```
Note: The a6 package is installed into the Docker container at build time. If the source code of the a6 package was modified, the Docker image has to be rebuilt (see 1.) in order to have the updated source code in the container image. The given folder (`mlflow/`), on the other hand, is copied by mlflow into the container when running the project and hence does not require rebuilding the Docker image manually if any of these files were modified (see here).
- Build the Apptainer image: `make build-cuda`
- Set the required environment variables for the Compute Backend:
  - `MANTIK_UNICORE_USERNAME`
  - `MANTIK_UNICORE_PASSWORD`
  - `MANTIK_COMPUTE_BUDGET_ACCOUNT`
- Run on HPC via mantik:

```shell
poetry run mantik runs submit \
  --run-name "<run-name>" \
  --entry-point dcv2 \
  --backend-config compute-backend-config-dcv2.yaml \
  $PWD/mlflow/
```
Note: Running with Apptainer (and not as an MLproject via `mlflow run`) does not track the git version (git commit hash), because, when creating a new run, MLflow attempts to import the git Python module and read the project repository to retrieve the commit hash. This is not possible inside the Apptainer container since:

- git is not installed within the container (the error is usually logged by MLflow, but can be silenced by setting the `GIT_PYTHON_REFRESH=quiet` environment variable inside the container).
- the repository is not available inside the container, but only the `train_kmeans.py` file. Hence, installing git inside the container does not solve the issue.

As a consequence, the version (`mlflow.source.git.commit` tag) is set to `None`.
- Prerequisites:
  - Create a private SSH key file `~/.ssh/jsc` (`~/.ssh/e4`) and upload its public counterpart to JuDoor (the E4 help center), or adjust the path in `JSC_SSH_PRIVATE_KEY_FILE` (`E4_SSH_PRIVATE_KEY_FILE`) in the `Makefile`.
  - JSC: Set the `MANTIK_UNICORE_USERNAME` and `MANTIK_UNICORE_PASSWORD` environment variables to allow uploading via SSH.
  - E4: Set the `E4_USERNAME` and `E4_SERVER_IP` environment variables. `E4_SERVER_IP` here is the IP of the E4 machine you want to use for SSH login.
- Build the Apptainer image with the package and ipykernel installed: `make build-jsc-kernel`
  For E4, use the `build-e4-kernel` target.
- Upload the image and the `kernel.json` file: `make upload-jsc-kernel`
  For E4, use the `upload-e4-kernel` target.
  Note: Alternatively, you can also execute the two above steps at once: `make deploy-jsc-kernel`
  For E4, use the `deploy-e4-kernel` target.
If this worked correctly, the kernel should be available in Jupyter JSC/on the E4 system under the name `a6`.

The Apptainer image may generally be used to run the package on e.g. JUWELS via `apptainer exec <path to image> python <path to script>`.
- Start a Jupyter lab via Jupyter JSC on the respective login node (i.e. JUWELS or JUWELS Booster).
- Select the kernel (see above).
- Run the notebook `notebooks/jsc/parallel_a6.ipynb`.
- Connect to the VPN.
- SSH onto a certain host.
- The kernel needs Apptainer (formerly Singularity), hence the module has to be loaded: `module load go-1.17.6/singularity-3.9.5`
- Start Jupyter on the host:

```shell
cd <repo directory>
poetry install -E notebooks
poetry run jupyter notebook
```

- From your local terminal, establish an SSH tunnel to the machine's Jupyter port: `ssh -fN -L <local port>:localhost:8888 <user>@<IP of the host>`
- Access Jupyter from your local browser by copying the token or URL from the output of the `poetry run jupyter notebook` command. The URL should look as follows: `http://localhost:8888/?token=<token>`.
- Run the notebook `notebooks/e4/parallel_a6.ipynb`.
Recent ERA5 data may contain an additional dimension called `expver` with levels `1` and `5`. Level 1 typically has values up to some point in the `time` dimension and is `NaN` afterwards, while level 5 is `NaN` up to that point and has values afterwards. Since the two levels are complementary, they can be reduced by taking the sum while ignoring `NaN`. This can be achieved with `np.nansum`:

```python
ds_new = ds.reduce(np.nansum, dim="expver", keep_attrs=True)
```
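Because at every time step exactly one of the two levels carries data, a NaN-ignoring sum simply selects the valid level. A small numpy sketch of the same reduction (synthetic values; the real call above reduces over the additional dimension of the xarray dataset):

```python
import numpy as np

# Two synthetic levels with complementary NaNs along time, mimicking the
# structure described above: level 1 has data early, level 5 late.
level_1 = np.array([280.0, 281.0, np.nan, np.nan])
level_5 = np.array([np.nan, np.nan, 282.0, 283.0])
stacked = np.stack([level_1, level_5])  # shape: (level, time) = (2, 4)

# Summing over the level axis while ignoring NaN picks the valid level per step.
merged = np.nansum(stacked, axis=0)
print(merged)  # → [280. 281. 282. 283.]
```

Note that `np.nansum` yields `0.0` for positions where all inputs are `NaN`, so any time step missing from both levels would appear as a zero rather than `NaN` in the result.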