Commit: more docs
svandenhaute committed Jul 27, 2024
1 parent 3f6fd08 commit 2b1fe60
Showing 2 changed files with 228 additions and 22 deletions.
247 changes: 226 additions & 21 deletions docs/configuration.md
@@ -4,8 +4,29 @@ container engine:
- Apptainer >= 1.2
- SingularityCE >= 3.11

To detect which of these is available on your HPC, execute `apptainer --version` or
`singularity --version` in a shell on a login or compute node. Note that on some systems, the container runtime is packaged in a
module, in which case you first have to load it before it becomes available in your
shell. Check your HPC's documentation for more information.
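On module-based systems, checking for and loading the runtime typically looks something
like this; the module names below are purely illustrative and differ between clusters:
```bash
module avail apptainer      # list any apptainer-related modules
module load Apptainer       # hypothetical module name; use whatever the listing shows
apptainer --version         # should now print the version
```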
If none of these are available, contact your system administrators or set up psiflow
[manually](#manual-setup). Otherwise, proceed with the following steps.

We provide two versions of essentially the same container: one for Nvidia GPUs (based on a
PyTorch wheel for CUDA 11.8) and one for AMD GPUs (based on a PyTorch wheel for ROCm 5.6).
These images are hosted on the GitHub Container Registry (abbreviated as `ghcr`) and can
be directly downloaded and cached by the container runtime.
For example, if we wish to execute a simple command `ls` using the container image for
Nvidia GPUs, we would write:
```bash
apptainer exec oras://ghcr.io/molmod/psiflow:main_cu118 ls
```
We use `psiflow:main_cu118` to get the image which was built from the latest `main` branch
of the psiflow repository, for CUDA 11.8.
Similarly, for AMD GPUs and, for example, psiflow v4.0.0-rc0, we would use
```bash
apptainer exec oras://ghcr.io/molmod/psiflow:4.0.0-rc0_rocm5.6 ls
```
See the [Apptainer](https://apptainer.org/docs/user/latest/)/[SingularityCE](https://docs.sylabs.io/guides/4.1/user-guide/) documentation for more information.
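If you prefer not to re-download the image on every run, you can pull it once into a local
`.sif` file and execute from that; a minimal sketch, with an arbitrary output filename:
```bash
# download and convert the image once; reuse the local .sif afterwards
apptainer pull psiflow.sif oras://ghcr.io/molmod/psiflow:main_cu118
apptainer exec psiflow.sif ls
```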

## Python environment
The main Python script which defines the workflow requires a Python 3.10 / 3.11 environment
@@ -35,14 +56,22 @@ with a recent version of `pip` and [`ndcctools`](https://github.com/cooperative-
pip install git+https://github.com/molmod/psiflow.git@v4.0.0-rc0
```
_Everything else_ -- i-PI, CP2K, GPAW, Weights & Biases, PLUMED, ... -- is handled by
the container images and hence need not be installed manually.

- **(with `virtualenv`/`venv`)**: create a new environment and install psiflow from GitHub
using the same command as above. In addition, you will have to compile and install the `cctools`
package manually. See the
[documentation](https://cctools.readthedocs.io/en/stable/install/) for the appropriate
instructions.

Verify the correctness of your environment using the following commands:

```bash
python -c 'import psiflow' # python import should work
which work_queue_worker # tests whether ndcctools is available and on PATH
```
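Putting these pieces together, a typical setup with micromamba might look as follows; this
is only a sketch, and the environment name, Python version, and psiflow version tag are
illustrative choices rather than requirements:

```bash
# create an environment with pip and ndcctools from conda-forge
micromamba create -n psiflow_env -c conda-forge python=3.10 pip ndcctools
micromamba activate psiflow_env   # requires the micromamba shell hook to be initialized

# install psiflow itself from GitHub (pick the tag you need)
pip install git+https://github.com/molmod/psiflow.git@v4.0.0-rc0

# same sanity checks as above
python -c 'import psiflow'
which work_queue_worker
```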


## Execution
Psiflow scripts are executed as a simple Python process.
@@ -51,18 +80,54 @@ execute the calculations asynchronously and as fast as possible.
To achieve this, it automatically requests the compute resources it needs during
execution.

To make this work, it is necessary to define precisely (i) how ML potential training, molecular
dynamics, and QM calculations should proceed, and (ii)
how the required resources for those calculations should be obtained.
These additional parameters are to be specified in a separate 'configuration' `.yaml` file, which is
passed into the main Python workflow script as an argument.
The configuration file has a specific structure which is explained in the following
sections. In many cases, you will be able to start from one of the [example
configurations](https://github.com/molmod/psiflow/tree/main/configs)
in the repository and adapt it for your cluster.
We also suggest going through Parsl's [documentation on
execution](https://parsl.readthedocs.io/en/stable/userguide/execution.html) first as this
will improve your understanding of what follows.

There are three types of calculations:

- **ML potential training** (`ModelTraining`)
- **ML potential inference, i.e. molecular dynamics** (`ModelEvaluation`)
- **QM calculations** (`CP2K`, `GPAW`, `ORCA`)

The structure of a typical `config.yaml` consequently looks like this:
```yaml
# top-level options define the overall behavior
# see below for a full list
container_engine: <singularity or apptainer>
container_uri: <link to container, e.g. oras://ghcr.io/...>

ModelTraining:
  # specifies how ML potential training should be performed
  # and which resources it needs to use
ModelEvaluation:
  # specifies how MD / geometry optimization / hamiltonian computations are performed
  # and which resources they need to use
CP2K:
  # specifies how CP2K single points need to be performed
GPAW:
  # specifies how GPAW single points need to be performed
ORCA:
  # specifies how ORCA single points need to be performed
```
### 1. ML potential training
This defines how `model.train()` operations are performed. Since
training is necessarily performed on a GPU, the resources specified here must include a
GPU. Consider the following simple training example:
```py
# (training example collapsed in this diff view)
```

@@ -94,11 +159,16 @@ Then we can execute the script using the following command:
```bash
python train.py config.yaml
```
The `config.yaml` file should define how and where the model should be trained and
evaluated.
Next, we define how model training should be performed.
Internally, Parsl will use that information to construct the appropriate
SLURM jobscripts, send them to the scheduler, and once the resources are allocated,
start the calculation. For example, assume that the GPU partition on this cluster is
named `infinite_a100`, and it has 12 cores per GPU. Consider the following config
```yaml
container_engine: apptainer  # or singularity; check HPC docs to see which one is available
container_uri: oras://ghcr.io/molmod/psiflow:main_cu118  # built from GitHub main branch

ModelTraining:
  cores_per_worker: 12
  gpu: true
@@ -112,7 +182,7 @@ ModelTraining:
    scheduler_options: "#SBATCH --gpus=2"
```
The top-level keyword `ModelTraining` indicates that we're defining the execution of
`model.train()`. It has a number of special keywords:
- **cores_per_worker** (int): number of CPUs per GPU.
- **gpu** (bool): whether to use GPU(s) -- should almost always be true for training.
@@ -148,7 +218,7 @@ There exist a few additional keywords for `ModelTraining` which might be useful:
```yaml
  # (preceding lines collapsed in this diff view)
  OMP_PROC_BIND: spread
```
### 2. molecular dynamics
Consider the following example:
```py
import psiflow
@@ -161,14 +231,9 @@ def main():
    mace = MACEHamiltonian.mace_mp0()
    start = Geometry.load('start.xyz')
    walkers = Walker(mace, temperature=300).multiply(8)
    outputs = sample(walkers, steps=int(1e9), step=10)  # extremely long
    for i, output in enumerate(outputs):
        output.trajectory.save(f'{i}.xyz')
@@ -179,5 +244,145 @@ if __name__ == '__main__':
```
In this example, we use MACE-MP0 to run 8 molecular dynamics simulations in the NVT
ensemble. Since they are all independent of each other, psiflow will attempt to execute
them in parallel as much as possible.
The configuration section which deals with ML potential inference, including molecular
dynamics but also geometry optimization and `hamiltonian.compute()` calls, is named
`ModelEvaluation`:
```yaml
ModelEvaluation:
  cores_per_worker: 12
  gpu: true
  slurm:
    partition: "infinite_a100"
    account: "112358"
    nodes_per_block: 2
    cores_per_node: 48  # full node; sometimes granted faster than partials
    max_blocks: 1
    walltime: "01:00:00"  # small to try and skip the queue
    scheduler_options: "#SBATCH --gpus=4"
```
It is in general quite similar to `ModelTraining`. Because psiflow workflows typically
contain a large number of molecular dynamics simulations, it makes sense to ask for larger
allocations for each block (= SLURM job). In this example, we immediately ask for two full
GPU nodes, with four GPUs each. This is exactly the amount we need to execute all eight
molecular dynamics simulations in parallel, without wasting any resources.
As such, when we execute the above example using `python script.py config.yaml`, Parsl
will recognize that we need resources for eight simulations, ask for precisely one allocation
according to the above parameters, and start all eight simulations simultaneously.
Of course, we greatly overestimate the number of steps we wish to simulate.
The SLURM allocation has a walltime of one hour, which means that if a simulation does not
finish within that hour, it will be gracefully terminated and the saved trajectories will only
cover a fraction of the requested one billion steps.
Psiflow will not automatically continue the simulations on a new SLURM allocation.
The available keywords in the `ModelEvaluation` section are the same as for
`ModelTraining`, except for one:
- **max_simulation_time** (float, in minutes): maximum wall time for a single simulation;
  runs which exceed it are gracefully terminated, and the trajectory generated up to that
  point is saved.
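For instance, to leave some margin with respect to the one-hour SLURM walltime used above,
one could cap each simulation somewhat below it; the value of 50 minutes here is purely
illustrative:

```yaml
ModelEvaluation:
  cores_per_worker: 12
  gpu: true
  max_simulation_time: 50  # minutes; slightly below the 60-minute block walltime
  slurm:
    walltime: "01:00:00"
```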
### 3. QM calculations
Finally, we need to specify how QM calculations are performed.
By default, these calculations are not executed within the container image provided by
`container_uri` at the top level.
Users can choose to rely on their system-installed QM software or employ one of the
smaller and specialized container images for CP2K or GPAW. We will discuss both cases
below.
First, assume we wish to use a system-installed CP2K module, and execute each singlepoint
on 32 cores. Assume that the nodes in our CPU partition have 128 cores each:
```yaml
CP2K:
  cores_per_worker: 32
  max_evaluation_time: 30  # kill calculation after 30 mins; SCF unconverged
  launch_command: "OMP_NUM_THREADS=1 mpirun -np 32 cp2k.psmp"  # force 1 thread/rank
  slurm:
    partition: "infinite_CPU"
    account: "112358"
    nodes_per_block: 16
    cores_per_node: 128
    max_blocks: 1
    walltime: "12:00:00"
    worker_init: "ml CP2K/2024.1"  # activate CP2K module in jobscript!
```
We asked for a big allocation of 16 nodes, each with 128 cores. On each node, psiflow can
concurrently execute four singlepoints, since we specified `cores_per_worker: 32`.
Consider now the following script:
```py
import psiflow
from psiflow.data import Dataset
from psiflow.reference import CP2K


def main():
    unlabeled = Dataset.load('long_trajectory.xyz')
    with open('cp2k_input.txt', 'r') as f:
        cp2k_input = f.read()
    cp2k = CP2K(cp2k_input)

    labeled = unlabeled.evaluate(cp2k)
    labeled.save('labeled.xyz')


if __name__ == '__main__':
    with psiflow.load():
        main()
```
Assume `long_trajectory.xyz` is a large XYZ file with, say, 1,000 snapshots.
In the above script, we simply load the data, evaluate the energy and forces of each
snapshot with CP2K, and save the result as (ext)XYZ.
Again, we execute this script by running `python script.py config.yaml` within a Python
environment with psiflow and cctools available.
Even though all of these calculations can proceed in parallel, we specified `max_blocks:
1` to limit our resource usage.
As such, Parsl will request precisely one block/allocation of 16 nodes, and start
executing the singlepoint QM evaluations.
At any given moment, there will be (16 nodes x 4 calculations/node = ) 64 calculations
running.
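If your cluster allows it and you need more throughput, raising `max_blocks` lets Parsl
request additional allocations once the first one is saturated; the following variation is
hypothetical and only meant to illustrate the knob:

```yaml
CP2K:
  cores_per_worker: 32
  slurm:
    nodes_per_block: 16
    cores_per_node: 128
    max_blocks: 2  # allow up to two concurrent 16-node allocations
```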
Now assume our system administrators did not provide us with the latest and greatest
version of CP2K.
The installation process is quite long and tedious (even via tools like EasyBuild or Spack),
which is why psiflow provides **small containers which only contain the QM software**.
They are separate from the psiflow containers mentioned before in order to improve
modularity and reduce individual container sizes.
At the moment, such containers are available for CP2K 2024.1 and GPAW 24.1.
To use them, it suffices to wrap the launch command inside an `apptainer` or `singularity`
invocation, whichever is available on your system:
```yaml
CP2K:
  cores_per_worker: 32
  max_evaluation_time: 30  # kill calculation after 30 mins; SCF unconverged
  launch_command: "apptainer exec -e --no-init oras://ghcr.io/molmod/cp2k:2024.1 /opt/entry.sh mpirun -np 32 cp2k.psmp"
  slurm:
    partition: "infinite_CPU"
    account: "112358"
    nodes_per_block: 16
    cores_per_node: 128
    max_blocks: 1
    walltime: "12:00:00"
    # no more need for module load commands!
```
The command is quite long, but should be self-explanatory if you are somewhat familiar with
containers.
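To avoid re-downloading the image on every allocation, you can also pull it once and point
the launch command to the resulting local file; a sketch, with an arbitrary filename and
path:

```bash
apptainer pull cp2k_2024.1.sif oras://ghcr.io/molmod/cp2k:2024.1
# in config.yaml, the launch command would then start with:
#   apptainer exec -e --no-init /path/to/cp2k_2024.1.sif /opt/entry.sh ...
```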
## SLURM quickstart
Psiflow contains a small script which detects the available SLURM partitions and their
hardware and creates a minimal, initial `config.yaml` which you can use as a starting point
to further tune to your liking. To use it, simply activate your psiflow Python environment
and execute the following command:
```sh
python -c 'import psiflow; psiflow.setup_slurm_config()'
```
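To sanity-check what the script detects, you can also list the partitions and their
resources directly with standard SLURM tooling:

```sh
sinfo -o "%P %c %m %G"  # partition, CPUs per node, memory per node, generic resources (GPUs)
```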
## manual setup
TODO
3 changes: 2 additions & 1 deletion docs/index.md
@@ -12,7 +12,8 @@ It supports:
- **quantum mechanical calculations** at various levels of theory (GGA and hybrid DFT, post-HF methods such as MP2 or RPA, and even coupled cluster; using CP2K | GPAW | ORCA)

- **trainable interaction potentials** as well as easy-to-use universal potentials, e.g. [MACE-MP0](https://arxiv.org/abs/2401.00096)
- a wide range of **sampling algorithms**: NVE | NVT | NPT, path-integral molecular dynamics, alchemical replica exchange, metadynamics, phonon-based sampling, thermodynamic integration; using [i-PI](https://ipi-code.org/),
[PLUMED](https://www.plumed.org/), ...

Users may define arbitrarily complex workflows and execute them **automatically** on local, HPC, and/or cloud infrastructure.
To achieve this, psiflow is built using [Parsl](https://parsl-project.org/): a parallel execution library which manages job submission and workload distribution.