Use compact process binding for GROMACS #139
Conversation
… compact process binding, since e.g. for 2 nodes and 4 tasks it will lead to task 0 on node 0, task 1 on node 1, task 2 on node 0, and task 3 on node 1. Compact would have been tasks 0 and 1 on node 0, and tasks 2 and 3 on node 1. Thus, this was a bug. Mapping by slot (combined with specifying PE=n) IS correct: PE=n sets cpus-per-rank to n (see https://www.open-mpi.org/doc/current/man1/mpirun.1.php), while mapping by slot means that each rank is mapped to a consecutive slot. The only scenario in which mapping by slot results in non-compact mapping is if oversubscription is enabled - then the oversubscribed ranks are assigned round-robin to the slots.
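(For illustration only: a minimal sketch, not necessarily what the hook in this PR does. One way to request this mapping from a ReFrame-style hook is via Open MPI's MCA environment variables; the function name and the exact values are my assumptions.)

    # Sketch (assumed approach, not taken from this PR): ask Open MPI for a
    # '--map-by slot:PE=<n>' style mapping and core binding via MCA env vars.
    def set_compact_binding_openmpi(test):
        n = test.num_cpus_per_task
        # Map ranks to consecutive slots, reserving n processing elements per rank
        test.env_vars['OMPI_MCA_rmaps_base_mapping_policy'] = f'slot:PE={n}'
        # Bind each rank to its cores so it cannot migrate within the NUMA domain
        test.env_vars['OMPI_MCA_hwloc_base_binding_policy'] = 'core'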
…ith mpirun it is currently free to migrate between cores within a NUMA domain. On Snellius I've seen some strange issues with occasionally very slow performance (10x slower than normal), potentially due to the OS thread scheduling being silly. Process binding leads to better _and_ more reproducible results
Fixes #138
Hm, not sure why https://github.com/EESSI/test-suite/actions/runs/8942074753/job/24563848640?pr=139 is happening:
The … The solution will be to make sure that …
…reading is enabled in the GitHub CI environment, but it doesn't really matter for the dry-runs anyway.
there is a bug in srun for slurm versions >= 22.05 < 23.11 when using --cpus-per-task. would you mind adding the following to the hook?

    if test.current_partition.launcher_type().registered_name == 'srun':
        test.env_vars['SRUN_CPUS_PER_TASK'] = test.num_cpus_per_task
more testing fun with srun. by default, srun does a block distribution over nodes, but a cyclic distribution over sockets, which is different from the compact distribution we want. so:

    if test.current_partition.launcher_type().registered_name == 'srun':
        test.env_vars['SRUN_CPUS_PER_TASK'] = test.num_cpus_per_task
        test.env_vars['SLURM_DISTRIBUTION'] = 'block:block'
        test.env_vars['SLURM_CPU_BIND'] = 'verbose'
Good catch, I didn't realize! I think it makes sense what you are proposing, and to do this only if srun is used as the launcher. A few things I am wondering regarding SRUN_CPUS_PER_TASK: …
i mean SRUN_CPUS_PER_TASK, see for example the docs for v22.05: https://slurm.schedmd.com/archive/slurm-22.05.0/srun.html#OPT_cpus-per-task
it doesn't determine the allocation of the job step, only the binding of the tasks. i agree though it's better to always set it so the behavior is always the same regardless of the slurm version being used.
it's not a bug, it is intended behavior for the range of slurm versions i mentioned. they have now reverted to the pre-22.05 behavior.
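(Illustrative only, capturing the "always set it" point as a sketch; where exactly this sits in the hook is not from this thread.)

    # Sketch: always export SRUN_CPUS_PER_TASK for srun, without a Slurm version check.
    # For Slurm >= 22.05 and < 23.11 srun does not pick up --cpus-per-task from the
    # job allocation, so this is needed there; on other versions it is redundant but
    # harmless, so the behavior is the same regardless of the Slurm version.
    if test.current_partition.launcher_type().registered_name == 'srun':
        test.env_vars['SRUN_CPUS_PER_TASK'] = test.num_cpus_per_task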
i'll make a separate PR for the …
    test.env_vars['SLURM_DISTRIBUTION'] = 'block:block'
    test.env_vars['SLURM_CPU_BIND'] = 'verbose'
add a log for SLURM_DISTRIBUTION here, and move the log for SLURM_CPU_BIND into this if block?
Done. Also added a log for I_MPI_PIN_CELL, since that was missing too. I also nested the mpirun variables in a separate if. Additionally, I added a warning message if a launcher is used that we do not support in this function. It's not a blocker, so I won't abort the test, but the user should be aware that performance might be sub-par if the default binding strategy isn't great.
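(To make the described structure concrete: a hypothetical sketch of the hook layout, not the actual implementation. The function name, the Intel MPI / Open MPI values beyond those mentioned in this thread, and the use of Python's warnings module are my assumptions.)

    import warnings

    # Hypothetical sketch: one branch per supported launcher, and a warning
    # (not an abort) for launchers we do not handle.
    def set_compact_process_binding(test):
        launcher = test.current_partition.launcher_type().registered_name
        n = test.num_cpus_per_task
        if launcher == 'srun':
            test.env_vars['SRUN_CPUS_PER_TASK'] = n
            test.env_vars['SLURM_DISTRIBUTION'] = 'block:block'
            test.env_vars['SLURM_CPU_BIND'] = 'verbose'
        elif launcher == 'mpirun':
            # Values are illustrative: compact pinning for Intel MPI and Open MPI
            test.env_vars['I_MPI_PIN_CELL'] = 'core'
            test.env_vars['I_MPI_PIN_DOMAIN'] = f'{n}:compact'
            test.env_vars['OMPI_MCA_rmaps_base_mapping_policy'] = f'slot:PE={n}'
            test.env_vars['OMPI_MCA_hwloc_base_binding_policy'] = 'core'
        else:
            warnings.warn(
                f'Process binding not supported for launcher {launcher!r}; '
                'performance may be sub-par with the default binding strategy'
            )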
...
Oh, I didn't realize that. Weird change, and I'm happy they changed it back :D
Interesting, I never realized. I thought that …
But agreed, it's probably still good to set this in any case at a more general level, even if the result is a form of binding. The issue with not setting it is that we don't control what the default affinity is - and if that is a single core, a test that we think doesn't use process binding will still be bound (to a single core). That's awkward. If you can create a PR for that: great!
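(Not from this thread, just to make the idea concrete: a hypothetical sketch of what setting this at a more general level could look like. The 'verbose,none' value, i.e. report the binding but don't bind, is an assumption, not something decided here.)

    # Sketch: give every srun-launched test an explicit, known cpu-bind default,
    # so the effective affinity does not silently depend on the site's slurm.conf.
    if test.current_partition.launcher_type().registered_name == 'srun':
        if 'SLURM_CPU_BIND' not in test.env_vars:
            test.env_vars['SLURM_CPU_BIND'] = 'verbose,none'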
…cher is used that binding might not be effective
tested for openmpi with mpirun and srun, seems to work fine
Without process binding, mpirun was binding processes to NUMA domains. I'm not sure why, but occasionally I saw some cores on Snellius being empty, while others seemed to run 2 processes. These runs would be about 10x slower than normal. This must be some weird behaviour from the OS thread scheduler. With process binding, performance is more consistent: while I still see some variation, I don't see the extremely slow runs anymore. It also seems performance went up a bit (a few %) on average, though it is a bit hard to be sure given the natural variation between runs.

Depends on: