Skip to content

Commit

Permalink
document how to install conda env on the fly
Browse files Browse the repository at this point in the history
  • Loading branch information
bast committed Aug 8, 2024
1 parent 5d269f2 commit 8514205
Showing 1 changed file with 39 additions and 39 deletions.
78 changes: 39 additions & 39 deletions content/blog/2024-umetaflow-on-cluster.md
Original file line number Diff line number Diff line change
Expand Up @@ -19,7 +19,15 @@ container side and "too much" outside of the container and decided to install
UMetaFlow and its dependencies more directly, without a container, and by using
a Conda environment.

Below is a step-by-step guide on how we did it.
We then created an install script to set up the Conda environment. However,
this took hours on the network file system because of the thousands of tiny
files. Sometimes it timed out, and sometimes it crashed with surprising
errors.

Therefore, we experimented with an approach where we set up the Conda
environment "on the fly" on the local disk of a compute node.

Below is a step-by-step guide on how we ended up using it.

<!-- toc -->

Expand Down Expand Up @@ -55,7 +63,7 @@ This script automatically places the SIRIUS binary and libraries in the
`resources` directory.


## Installing the Conda environment
## Conda environment definition file

Save the following file as `environment.yml`:
```yaml
Expand All @@ -82,39 +90,6 @@ dependencies:
- ms2query
```
The next step is to create a new environment from this definition file. This
step can take 2 hours and therefore we run it as a job on the cluster. Save
the file as `install-environment.sh`:
```bash
#!/usr/bin/env bash
#SBATCH --account=nn____k
#SBATCH --job-name=installation
#SBATCH --time=0-08:00:00
#SBATCH --mem-per-cpu=2G
#SBATCH --ntasks=1
# settings to catch errors in bash scripts
set -euf -o pipefail
# adapt these
my_conda_storage=/cluster/projects/nn____k/conda
environment_file="environment.yml"
environment_name="umetaflow"
# typically no need to modify anything below
# using mamba here since it is potentially faster than conda
# to resolve the dependencies
module load Mamba/4.14.0-0
export CONDA_PKGS_DIRS=${my_conda_storage}/package-cache
mamba env create --prefix ${my_conda_storage}/${environment_name} --file ${environment_file}
```

Adapt `--account=nn____k` and `my_conda_storage` and then submit the job with:
```bash
$ sbatch install-environment.sh
```

## Create a file that holds the credentials for SIRIUS
Expand All @@ -136,22 +111,47 @@ Save the following file as `run-umetaflow.sh`:
#SBATCH --time=0-01:00:00
#SBATCH --mem-per-cpu=6G
#SBATCH --ntasks=4
#SBATCH --gres=localscratch:20G
# settings to catch errors in bash scripts
set -euf -o pipefail
module load Miniconda3/22.11.1-1
source ${EBROOTMINICONDA3}/bin/activate
# function to install and activate the environment "on the fly"
# requires setting --gres=localscratch (above)
# advantages: no impact on the network file system or home quota, often 100x faster install
# disadvantages: a bit wasteful in terms of network and additional 1-5 minutes for each job
install_and_activate_environment() {
local environment_file=$1
# adapt this line to match the environment created further above
conda activate /cluster/projects/nn____k/conda/umetaflow
# install latest micromamba into local scratch
cd ${LOCALSCRATCH}
curl -Ls https://micro.mamba.pm/api/micromamba/linux-64/latest | tar -xvj bin/micromamba
export MAMBA_ROOT_PREFIX=${LOCALSCRATCH}/mamba-prefix
eval "$(${LOCALSCRATCH}/bin/micromamba shell hook -s posix)"
# install and activate the environment
export TMPDIR=${LOCALSCRATCH}
time micromamba -y env create --prefix ${LOCALSCRATCH}/environment/ --file ${environment_file}
micromamba activate ${LOCALSCRATCH}/environment/
cd ${SLURM_SUBMIT_DIR}
}
# install and activate
install_and_activate_environment "${SLURM_SUBMIT_DIR}/environment.yml"
echo "successful conda installation!"
# this sets email and password for sirius
source credentials.sh
snakemake --cores ${SLURM_NTASKS}
```

The job script creates a Conda environment and activates it "on the fly". This
takes only perhaps 3 or 5 minutes of the job run time. The disadvantage of this
approach is that the environment disappears after the job finishes and is
re-created at each run.

Also here you need to adapt `--account=nn____k` and the paths to the Conda
environment.

Expand Down

0 comments on commit 8514205

Please sign in to comment.