
Process binding for pure MPI tests #138

Open
casparvl opened this issue May 3, 2024 · 2 comments

casparvl (Collaborator) commented May 3, 2024

I'm seeing some strange performance issues with the GROMACS test on our system: occasionally, it just runs 10 times slower. Looking at htop, I see individual cores sitting idle, even though I would have expected each core to be running a single process (the GROMACS test is pure MPI).

The generated job script looks like this for a 2-node test:

#!/bin/bash
#SBATCH --job-name="rfm_EESSI_GROMACS_bd8ac108"
#SBATCH --ntasks=256
#SBATCH --ntasks-per-node=128
#SBATCH --cpus-per-task=1
#SBATCH --output=rfm_job.out
#SBATCH --error=rfm_job.err
#SBATCH --time=0:30:0
#SBATCH -p rome
#SBATCH --export=None
source /cvmfs/software.eessi.io/versions/2023.06/init/bash
module load GROMACS/2024.1-foss-2023b
export OMP_NUM_THREADS=1
curl -LJO https://github.com/victorusu/GROMACS_Benchmark_Suite/raw/1.0.0/HECBioSim/Crambin/benchmark.tpr
mpirun -np 256 gmx_mpi mdrun -nb cpu -s benchmark.tpr -dlb yes -npme -1 -ntomp 1

I checked the binding of each process. To my surprise, the processes were bound to NUMA domains; I would never have expected that. According to https://www.open-mpi.org/doc/current/man1/mpirun.1.php, when the number of processes is larger than 2, the default binding should be to socket.
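For anyone who wants to double-check this on their own system: a minimal sketch of two ways to inspect the actual binding, assuming Open MPI's mpirun and a node where taskset is available (the gmx_mpi command line is the one from the job script above):

# Ask Open MPI to report the binding of every rank at launch time
# (prints one "rank N bound to ..." line per process):
mpirun -np 256 --report-bindings gmx_mpi mdrun -nb cpu -s benchmark.tpr -dlb yes -npme -1 -ntomp 1

# Or inspect the CPU affinity of already-running processes on a node:
for pid in $(pgrep gmx_mpi); do
    taskset -cp "$pid"    # prints the current affinity list for each PID
done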

Note that binding to either a NUMA domain or a socket is potentially bad for the reproducibility of test performance: to make performance predictable, I would like to bind to core. I'm wondering if we shouldn't just call the set_compact_process_binding hook for this test. I'm not sure whether this is the cause of my performance variation, but enforcing binding to core (which is essentially what set_compact_process_binding does) seems like a good idea to me for the GROMACS test (and potentially others).

Right now, set_compact_process_binding is only used in the TensorFlow test, where it is quite essential (since that is a hybrid test).
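For reference, what I'm after can also be expressed directly on the launcher command line (a sketch, assuming Open MPI or Slurm as the launcher; set_compact_process_binding essentially enforces this kind of binding through the test suite):

# Open MPI: map and bind one rank per core
mpirun -np 256 --map-by core --bind-to core gmx_mpi mdrun -nb cpu -s benchmark.tpr -dlb yes -npme -1 -ntomp 1

# Slurm: equivalent binding when launching with srun
srun --cpu-bind=cores gmx_mpi mdrun -nb cpu -s benchmark.tpr -dlb yes -npme -1 -ntomp 1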

boegel (Contributor) commented May 3, 2024

Maybe @victorusu has some experience with this for GROMACS?

casparvl (Collaborator, Author) commented May 3, 2024

See #139: I seem to get both better and more consistent performance with binding. Since reproducibility of the performance is important, I'd be in favor of enabling it (I'd probably even be in favor if the performance were worse, as long as it is more consistent :P).
