[Manual] Devito on ARCHER2
Before you start, read the following:
https://docs.archer2.ac.uk/quick-start/quickstart-users/
https://docs.archer2.ac.uk/user-guide/tuning/
https://docs.archer2.ac.uk/user-guide/scheduler/#interconnect-locality
https://docs.archer2.ac.uk/user-guide/energy/
Important: Parallel jobs on ARCHER2 should be run from the work file systems as the home file systems are not available on the compute nodes - you will see a chdir or file not found error if you try to access data on the home file system within a parallel job running on the compute nodes.
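A minimal sketch of a guard for the note above. `on_work_fs` is a hypothetical helper, not something ARCHER2 or Devito provides, and it assumes the work file system is mounted under `/work/`, as in the `cd` command used further down. It only inspects the path string and does not resolve symbolic links.

```shell
#!/bin/bash
# Hypothetical helper (not part of ARCHER2 or Devito): succeeds only if
# the given directory path starts with /work/.
on_work_fs() {
    case "$1" in
        /work/*) return 0 ;;
        *)       return 1 ;;
    esac
}

# Example use at the top of a job script: warn when the launch
# directory would not be visible from the compute nodes.
if ! on_work_fs "$PWD"; then
    echo "Warning: $PWD is not under /work; compute nodes cannot see it" >&2
fi
```

Placed before the `srun` line, a check like this reports the problem early rather than leaving you with a chdir or file not found error from the compute nodes.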
# After completing the registration, `ssh` to the login node (private key needed)
ssh "username"@login.archer2.ac.uk -vv
# Change directory to the work file system
cd /work/"project-number"/"project-number"/"username"/
module load cray-python
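# Create a python3 virtual env (the name "devito-env" is only an example)
python3 -m venv devito-env
# Activate it
source devito-env/bin/activate
# If devito is not cloned:
git clone https://github.com/devitocodes/devito
# Install devito in editable mode from inside the repository
cd devito
pip3 install -e .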
# Load Cray MPI (see https://docs.nersc.gov/development/programming-models/mpi/cray-mpich/)
module load cray-mpich
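# Build mpi4py using Cray's compiler wrapper
# (the craype version in the path may differ on your system; `which cc` prints the wrapper currently in use)
env MPICC=/opt/cray/pe/craype/2.7.6/bin/cc pip3 install --force-reinstall --no-cache-dir -r requirements-mpi.txt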
export OMP_PLACES=cores
Example script:
#!/bin/bash
# Slurm job options (job-name, compute nodes, job time)
#SBATCH --job-name=Example_MPI_Job
#SBATCH --time=0:20:0
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=8
#SBATCH --cpus-per-task=16
# Replace [budget code] below with your project code (e.g. t01)
#SBATCH --account=[budget code]
#SBATCH --partition=standard
#SBATCH --qos=standard
# Set the number of threads to 16 and specify placement
# There are 16 OpenMP threads per MPI process
# We want one thread per physical core
export OMP_NUM_THREADS=16
export OMP_PLACES=cores
# Launch the parallel job
# Using 32 MPI processes
# 8 MPI processes per node
# 16 OpenMP threads per MPI process
# Additional srun options to pin one thread per physical core
srun --hint=nomultithread --distribution=block:block ./my_mixed_executable.x arg1 arg2
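The rank and thread counts in the script's comments follow directly from the `#SBATCH` values. The sketch below repeats that arithmetic so the layout can be checked before submitting. It is not part of the job script, and the 128 cores per node figure is stated as an assumption about ARCHER2 compute nodes.

```shell
#!/bin/bash
# Values copied from the #SBATCH lines of the example script.
nodes=4
ntasks_per_node=8
cpus_per_task=16
# Assumption: 128 physical cores per ARCHER2 compute node.
cores_per_node=128

# Total MPI processes across the job, and cores occupied on each node.
total_ranks=$(( nodes * ntasks_per_node ))
cores_used_per_node=$(( ntasks_per_node * cpus_per_task ))

echo "MPI ranks in total:  $total_ranks"
echo "Cores used per node: $cores_used_per_node of $cores_per_node"

# Flag layouts that ask for more cores than a node has.
if [ "$cores_used_per_node" -gt "$cores_per_node" ]; then
    echo "Layout oversubscribes the node" >&2
fi
```

With the values above, 8 MPI processes of 16 threads each fill a node exactly, which is why the script asks for one thread per physical core.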
# Request an interactive job on 2 nodes (replace d011 with your budget code)
salloc --nodes=2 --ntasks-per-node=8 --cpus-per-task=16 --time=01:00:00 --partition=standard --qos=standard --account=d011
# Inside the interactive job: run on all allocated nodes
OMP_NUM_THREADS=16 DEVITO_MPI=1 DEVITO_ARCH=cray DEVITO_LANGUAGE=openmp DEVITO_LOGGING=DEBUG srun --distribution=block:block --hint=nomultithread python examples/seismic/acoustic/acoustic_example.py -d 1024 1024 1024 --tn 512 -so 12 -a aggressive
# On 1 node (8 MPI ranks)
OMP_NUM_THREADS=16 DEVITO_MPI=1 DEVITO_ARCH=cray DEVITO_LANGUAGE=openmp DEVITO_LOGGING=DEBUG srun -n 8 --distribution=block:block --hint=nomultithread python examples/seismic/acoustic/acoustic_example.py -d 512 512 512 --tn 100
# On 2 nodes (16 MPI ranks)
OMP_NUM_THREADS=16 DEVITO_MPI=1 DEVITO_ARCH=cray DEVITO_LANGUAGE=openmp DEVITO_LOGGING=DEBUG srun -n 16 --distribution=block:block --hint=nomultithread python examples/seismic/acoustic/acoustic_example.py -d 512 512 512 --tn 512 -so 8
Add autotuning (the `-a` option of the example script, e.g. `-a aggressive`). This is very important for performance!
Note: autotuning may lead to performance variance from run to run. Is the selected block shape non-standard?
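To compare runs with and without autotuning, it can help to generate both command lines from one place. The following is only a sketch: `devito_cmd` is a hypothetical helper, and the problem size and `srun` options are copied from the single-node command in this page. It prints the commands and does not run them.

```shell
#!/bin/bash
# Hypothetical helper: print the single-node Devito command line,
# optionally appending the autotuning flag (-a <mode>).
devito_cmd() {
    local ranks="$1" mode="$2"
    local cmd="srun -n $ranks --distribution=block:block --hint=nomultithread"
    cmd="$cmd python examples/seismic/acoustic/acoustic_example.py -d 512 512 512 --tn 100"
    # Only add the flag when a mode was requested.
    if [ -n "$mode" ]; then
        cmd="$cmd -a $mode"
    fi
    echo "$cmd"
}

devito_cmd 8 ""            # without autotuning
devito_cmd 8 aggressive    # with autotuning
```

Because of the run-to-run variance noted above, repeat each variant a few times before drawing conclusions.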
Interactive jobs are documented at: https://docs.archer2.ac.uk/user-guide/scheduler/#interactive-jobs
Notes:
export FI_OFI_RXM_SAR_LIMIT=524288
export FI_OFI_RXM_BUFFER_SIZE=131072
export MPICH_SMP_SINGLE_COPY_SIZE=16384
export CRAY_OMP_CHECK_AFFINITY=TRUE
To try (switch the network layer and MPI library from OFI to UCX):
module swap craype-network-ofi craype-network-ucx
module swap cray-mpich cray-mpich-ucx
For example, to place processes sequentially on nodes but round-robin on the 16-core NUMA regions in a single node, you would use the --distribution=block:cyclic option to srun. This type of process placement can be beneficial when a code is memory bound.
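The difference between `block` and `cyclic` at the node level can be illustrated with a toy model. This is a simplified illustration and not Slurm's actual placement algorithm. It assumes a single node with 8 NUMA regions of 16 cores and one core per MPI rank, and `numa_of_rank` is a hypothetical helper.

```shell
#!/bin/bash
# Simplified model (an illustration, not Slurm's actual algorithm):
# one node, 8 NUMA regions of 16 cores, one core per MPI rank.
numa_regions=8
cores_per_numa=16

# Print the NUMA region a rank lands in for the second (within-node)
# level of --distribution: "block" or "cyclic".
numa_of_rank() {
    local policy="$1" rank="$2"
    if [ "$policy" = "cyclic" ]; then
        # Round-robin: consecutive ranks go to consecutive NUMA regions.
        echo $(( rank % numa_regions ))
    else
        # Block: fill all cores of one NUMA region before the next.
        echo $(( (rank / cores_per_numa) % numa_regions ))
    fi
}

for rank in 0 1 2 3 15 16; do
    echo "rank $rank: block -> NUMA $(numa_of_rank block "$rank"), cyclic -> NUMA $(numa_of_rank cyclic "$rank")"
done
```

In this model, the first 16 ranks all share NUMA region 0 under `block`, while `cyclic` spreads them over all 8 regions. Spreading ranks this way is why the option can help memory-bound codes. In a real job the switch is made by passing `--distribution=block:cyclic` to `srun` instead of `--distribution=block:block`.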