
Use compact process binding for GROMACS #139

Merged: 8 commits merged into EESSI:main on May 10, 2024

Conversation

@casparvl (Collaborator) commented May 3, 2024

Without process binding, mpirun was binding processes to NUMA domains. I'm not sure why, but occasionally I saw some cores on Snellius sitting empty while others seemed to run 2 processes, and those runs would be about 10x slower than normal. This must be some odd behaviour of the OS thread scheduler. With process binding, performance is more consistent: I still see some variation, but no longer the extremely slow runs. Average performance also seems to have gone up a bit (a few %), though that is hard to be sure of given the natural variation between runs.

Depends on:

Caspar van Leeuwen added 2 commits May 3, 2024 16:33
… compact process binding, since for e.g. 2 nodes and 4 tasks it will lead to task 0 on node 0, task 1 on node 1, task 2 on node 0, and task 3 on node 1. Compact would have been tasks 0 and 1 on node 0, and tasks 2 and 3 on node 1. Thus, this was a bug. Mapping by slot (combined with specifying PE=n) IS correct: PE=n sets cpus-per-rank to n (see https://www.open-mpi.org/doc/current/man1/mpirun.1.php), while mapping by slot means that each rank is mapped to a consecutive slot. The only scenario in which mapping by slot results in non-compact mapping is if oversubscription is enabled; then the oversubscribed ranks are assigned round robin to the slots.
…ith mpirun it is currently free to migrate between cores within a NUMA domain. On Snellius I've seen some strange issues with occasionally very slow performance (10x slower than normal), potentially due to the OS thread scheduling being silly. Process binding leads to better _and_ more reproducible results
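
A minimal sketch of how the compact mapping described in these commit messages can be expressed for Open MPI through environment variables in a ReFrame hook. The mapping-policy variable is the one discussed later in this thread; the binding-policy variable and the helper function itself are illustrative assumptions, not a copy of the actual hook:

import reframe as rfm

def set_compact_mapping_openmpi(test: rfm.RegressionTest, physical_cpus_per_task: int):
    """Illustrative sketch: compact mapping for Open MPI's mpirun via MCA environment variables."""
    # Map ranks to consecutive slots, giving each rank `physical_cpus_per_task` processing
    # elements (PE=n), which yields the compact placement described above.
    test.env_vars['OMPI_MCA_rmaps_base_mapping_policy'] = 'slot:PE=%s' % physical_cpus_per_task
    # Assumption: also bind each rank to the cores it was mapped to, so processes cannot
    # migrate within the NUMA domain (the behaviour the second commit message describes).
    test.env_vars['OMPI_MCA_hwloc_base_binding_policy'] = 'core'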
@casparvl (Collaborator, Author) commented May 3, 2024

Fixes #138

@casparvl (Collaborator, Author) commented May 3, 2024

Hm, not sure why https://github.com/EESSI/test-suite/actions/runs/8942074753/job/24563848640?pr=139 is happening:

def set_compact_process_binding(test: rfm.RegressionTest):
...
    check_proc_attribute_defined(test, 'num_cpus_per_core')
    num_cpus_per_core = test.current_partition.processor.num_cpus_per_core
    physical_cpus_per_task = int(test.num_cpus_per_task / num_cpus_per_core)
...

The check_proc_attribute_defined call should have printed a clear error that this property wasn't defined in the ReFrame config file. I'm also not sure why we wouldn't have run into this before, but I guess the CI test wasn't there yet...

The solution will be to make sure that num_cpus_per_core is defined in the CI config file... I'll check how we can achieve that.
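
A minimal sketch (with a hypothetical system/partition name and example values) of the kind of 'processor' section a ReFrame settings file can carry, so that num_cpus_per_core is defined even for dry runs in CI where topology autodetection does not happen:

# Hypothetical ReFrame settings snippet; names and values are examples, not the actual CI config.
site_configuration = {
    'systems': [
        {
            'name': 'github_ci',                  # hypothetical system name
            'descr': 'GitHub Actions runner (dry runs only)',
            'hostnames': ['.*'],
            'partitions': [
                {
                    'name': 'default',
                    'scheduler': 'local',
                    'launcher': 'mpirun',
                    'environs': ['default'],
                    'processor': {                # topology normally autodetected; hard-coded here
                        'num_sockets': 1,
                        'num_cpus': 4,
                        'num_cpus_per_socket': 4,
                        'num_cpus_per_core': 2,   # the attribute the hook needs
                    },
                },
            ],
        },
    ],
    'environments': [
        {'name': 'default'},
    ],
}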

…reading is enabled in the github CI environment, but it doesn't really matter for the dry-runs anyway.
@smoors (Collaborator) commented May 5, 2024

The check_proc_attribute_defined should have printed a clear error that this property wasn't defined in the ReFrame config file.

There is a bug in check_proc_attribute_defined: the last line (the raise) has one indent too many.
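
To make the effect concrete, a minimal sketch (assumed structure, not the actual test-suite code) of what such an over-indented raise does: with the extra indent, the raise is only reached when no partition is set at all, so a missing processor attribute falls through silently and a less clear error surfaces downstream instead.

def check_proc_attribute_defined_buggy(test, attribute) -> bool:
    """Assumed, simplified structure with the over-indented raise."""
    if test.current_partition:
        if getattr(test.current_partition.processor, attribute):
            return True
        msg = f"Processor attribute '{attribute}' is not defined in the ReFrame config file"
    else:
        msg = "Current partition is not set, so processor attributes cannot be checked"
        raise AttributeError(msg)  # bug: one indent too many, only reached in the else branch


def check_proc_attribute_defined_fixed(test, attribute) -> bool:
    """Same structure with the raise dedented, so it fires whenever the attribute is missing."""
    if test.current_partition:
        if getattr(test.current_partition.processor, attribute):
            return True
        msg = f"Processor attribute '{attribute}' is not defined in the ReFrame config file"
    else:
        msg = "Current partition is not set, so processor attributes cannot be checked"
    raise AttributeError(msg)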

@smoors (Collaborator) commented May 6, 2024

For Slurm versions >= 22.05 and < 23.11, there is another issue when using srun: --cpus-per-task is not inherited from the job environment into the srun tasks, which prevents "bind each process to subsequent domains of test.num_cpus_per_task cores".

Would you mind adding the following to the set_compact_process_binding hook to fix this?

    if test.current_partition.launcher_type().registered_name == 'srun':
        test.env_vars['SRUN_CPUS_PER_TASK'] = test.num_cpus_per_task

@smoors (Collaborator) commented May 7, 2024

More testing fun with srun.

By default, srun does a block distribution over nodes but a cyclic distribution over sockets, which is different from mpirun with test.env_vars['OMPI_MCA_rmaps_base_mapping_policy'] = 'slot:PE=%s' % physical_cpus_per_task, which does a block distribution over sockets (a toy illustration of the difference follows after the snippet below).
To get the same behavior as mpirun we can use the environment variable $SLURM_DISTRIBUTION. I propose to only set the srun-specific environment variables when srun is actually used as the launcher:

    if test.current_partition.launcher_type().registered_name == 'srun':
        test.env_vars['SRUN_CPUS_PER_TASK'] = test.num_cpus_per_task
        test.env_vars['SLURM_DISTRIBUTION'] = 'block:block'
        test.env_vars['SLURM_CPU_BIND'] = 'verbose'
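
A toy illustration of that difference (plain Python, not Slurm or test-suite code): for 4 single-core ranks on a node with 2 sockets of 4 cores each, a cyclic distribution over sockets alternates ranks between the sockets, while a block distribution fills the first socket before moving on, which is the compact placement the mpirun mapping above produces.

def place_ranks(n_ranks, n_sockets, ranks_per_socket, policy):
    """Toy model of rank-to-socket placement under a cyclic vs. a block socket distribution."""
    if policy == 'cyclic':    # srun's default over sockets
        return {rank: rank % n_sockets for rank in range(n_ranks)}
    if policy == 'block':     # what 'block:block' (and mpirun's slot:PE=n mapping) gives
        return {rank: rank // ranks_per_socket for rank in range(n_ranks)}
    raise ValueError(f'unknown policy: {policy}')


print(place_ranks(4, 2, 4, 'cyclic'))  # {0: 0, 1: 1, 2: 0, 3: 1}: ranks alternate between sockets
print(place_ranks(4, 2, 4, 'block'))   # {0: 0, 1: 0, 2: 0, 3: 0}: ranks fill socket 0 first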

@casparvl (Collaborator, Author) commented May 7, 2024

srun does a block distribution over nodes, but a cyclic distribution over sockets

Good catch, I hadn't realized! What you are proposing makes sense, including doing this only if srun is used as the launcher.

A few things I am wondering regarding

test.env_vars['SRUN_CPUS_PER_TASK'] = test.num_cpus_per_task
  1. I only see a SLURM_CPUS_PER_TASK in the srun and sbatch manuals (https://slurm.schedmd.com/srun.html, https://slurm.schedmd.com/sbatch.html). Did you mean that?
  2. As I understand it, SLURM_CPUS_PER_TASK is an input environment variable for srun, i.e. it does the same as when you would invoke srun -c $SLURM_CPUS_PER_TASK. This means it determines the size of the allocation for the job step that srun creates, doesn't it? In that case, I'm not sure if the process-binding hook is the right place to set it, it seems more general to me. Shouldn't it also be set for tests that don't care about process binding?
  3. Should we fix things that are broken in SLURM in our test suite? SLURM_CPUS_PER_TASK is a documented output environment variable for sbatch. If it's not set, that's a bug in SLURM. It means that all jobs that rely on srun picking up the correct number of cpus per task from the parent allocation suddenly fail, or behave differently. If that's the case, I'm ok with the tests failing too, no?

@smoors (Collaborator) commented May 7, 2024

I only see a SLURM_CPUS_PER_TASK in the srun and sbatch manuals https://slurm.schedmd.com/srun.html , https://slurm.schedmd.com/sbatch.html . Did you mean that?

I mean SRUN_CPUS_PER_TASK; see for example the docs for v22.05: https://slurm.schedmd.com/archive/slurm-22.05.0/srun.html#OPT_cpus-per-task
They have now reverted to the old behavior, so SRUN_CPUS_PER_TASK is no longer needed.

As I understand it, SLURM_CPUS_PER_TASK is an input environment variable for srun, i.e. it does the same as when you would invoke srun -c $SLURM_CPUS_PER_TASK. This means it determines the size of the allocation for the job step that srun creates, doesn't it? In that case, I'm not sure if the process-binding hook is the right place to set it, it seems more general to me. Shouldn't it also be set for tests that don't care about process binding?

It doesn't determine the allocation of the job step, only the binding of the tasks. I agree, though, that it's better to always set it so the behavior is the same regardless of the Slurm version being used.

Should we fix things that are broken in SLURM in our test suite? SLURM_CPUS_PER_TASK is a documented output environment variable for sbatch. If it's not set, that's a bug in SLURM. It means that all jobs that rely on srun picking up the correct number of cpus per task from the parent allocation suddenly fail, or behave differently. If that's the case, I'm ok with the tests failing too, no?

It's not a bug; it is intended behavior for the range of Slurm versions I mentioned. They have now reverted this behavior to what it was before 22.05.

@smoors (Collaborator) commented May 7, 2024

I'll make a separate PR for the SRUN_CPUS_PER_TASK issue.

Comment on lines +443 to +444
test.env_vars['SLURM_DISTRIBUTION'] = 'block:block'
test.env_vars['SLURM_CPU_BIND'] = 'verbose'
@smoors (Collaborator) commented:

Add a log message for SLURM_DISTRIBUTION here, and move the log message for SLURM_CPU_BIND into this if block?

@casparvl (Collaborator, Author) replied:

Done. I also added a log message for I_MPI_PIN_CELL, since that was missing too, and nested the mpirun variables in a separate if. Additionally, I added a warning message for launchers that this function does not support. It's not a blocker, so I won't abort the test, but the user should be aware that performance might be sub-par if the default binding strategy isn't great.
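
A simplified sketch (assumed shape, not the merged code) of the resulting structure of set_compact_process_binding: srun and mpirun each get their own branch, and any other launcher only triggers a warning. SRUN_CPUS_PER_TASK is left out here since it is being handled in a separate PR, and the I_MPI_PIN_CELL value shown is an assumption.

import warnings

import reframe as rfm


def set_compact_process_binding(test: rfm.RegressionTest):
    """Simplified sketch of the hook's structure after this review; not the merged implementation."""
    num_cpus_per_core = test.current_partition.processor.num_cpus_per_core
    physical_cpus_per_task = int(test.num_cpus_per_task / num_cpus_per_core)

    launcher = test.current_partition.launcher_type().registered_name
    if launcher == 'srun':
        # Block distribution over nodes and sockets, matching mpirun's compact mapping,
        # plus verbose binding output for debugging (logging of these values goes here).
        test.env_vars['SLURM_DISTRIBUTION'] = 'block:block'
        test.env_vars['SLURM_CPU_BIND'] = 'verbose'
    elif launcher == 'mpirun':
        # Compact mapping: consecutive slots with physical_cpus_per_task cores per rank (Open MPI),
        # and core-level pinning for Intel MPI ('core' is an assumed value for I_MPI_PIN_CELL).
        test.env_vars['OMPI_MCA_rmaps_base_mapping_policy'] = 'slot:PE=%s' % physical_cpus_per_task
        test.env_vars['I_MPI_PIN_CELL'] = 'core'
    else:
        # Not a blocker, so don't abort the test, but make the user aware that binding is not
        # controlled for this launcher and performance may be sub-par.
        warnings.warn(f"Process binding not set for launcher '{launcher}'")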

@casparvl (Collaborator, Author) commented May 7, 2024

I mean SRUN_CPUS_PER_TASK; see for example the docs for v22.05: https://slurm.schedmd.com/archive/slurm-22.05.0/srun.html#OPT_cpus-per-task
They have now reverted to the old behavior, so SRUN_CPUS_PER_TASK is no longer needed.

...

It's not a bug; it is intended behavior for the range of Slurm versions I mentioned. They have now reverted this behavior to what it was before 22.05.

Oh, I didn't realize that. Weird change, and I'm happy they changed it back :D

It doesn't determine the allocation of the job step, only the binding of the tasks. I agree, though, that it's better to always set it so the behavior is the same regardless of the Slurm version being used.

Interesting, I never realized that. I thought that srun -c 2 would create a cgroup with two CPU cores per task, just like it does for srun --mem=<something>. But you're right, it doesn't; it only sets the affinity of the process:

[casparl@int5 ~]$ salloc -n 2 --ntasks-per-node 2 -c 24 -t 10:00 -p genoa
...
[casparl@tcn928 ~]$ srun -n 2 -c 2 --mem=2G cat /sys/fs/cgroup/memory/slurm/uid_45397/job_6176002/step_0/memory.limit_in_bytes
2147483648
2147483648
[casparl@tcn928 ~]$ srun -n 2 -c 2 --mem=2G cat /sys/fs/cgroup/cpuset/slurm/uid_45397/job_6176002/step_1/cpuset.cpus
64-65,80-81
64-65,80-81
[casparl@tcn739 ~]$ srun -c 2 numactl --show
...
physcpubind: 80 81
...
physcpubind: 64 65

But agreed, it's probably still good to set this at a more general level in any case, even if the result is a form of binding. The issue if we don't set it is that we don't control what the default affinity is: if that's a single core, a test that we think doesn't use process binding will still be bound (to a single core). That's awkward. If you can create a PR for that: great!

Caspar van Leeuwen added 3 commits May 7, 2024 16:25
@smoors (Collaborator) left a comment

Tested for Open MPI with mpirun and srun, seems to work fine.

@smoors merged commit 6622529 into EESSI:main on May 10, 2024 (10 checks passed)
@boegel changed the title from "Use compact process binding for gromacs" to "Use compact process binding for GROMACS" on Jun 27, 2024
@casparvl deleted the use_compact_process_binding_gromacs branch on September 4, 2024