
Use compact process binding for GROMACS #139

Merged: 8 commits merged into EESSI:main on May 10, 2024

Conversation

@casparvl (Collaborator) commented May 3, 2024

Without process binding, mpirun was binding processes to NUMA domains. I'm not sure why, but occasionally I saw some cores on Snellius sitting empty while others seemed to run 2 processes, and those runs would be about 10x slower than normal. This must be some odd behaviour of the OS thread scheduler. With process binding, performance is more consistent: I still see some variation, but no longer the extremely slow runs. Average performance also seems to have gone up a bit (a few %), though that is hard to be sure of given the natural variation between runs.

Depends on:

Caspar van Leeuwen added 2 commits May 3, 2024 16:33
… compact process binding, since for e.g. 2 nodes and 4 tasks it will lead to task 0 on node 0, task 1 on node 1, task 2 on node 0, and task 3 on node 1. Compact would have been tasks 0 and 1 on node 0, and tasks 2 and 3 on node 1. Thus, this was a bug. Mapping by slot (combined with specifying PE=n) IS correct: PE=n sets cpus-per-rank to n (see https://www.open-mpi.org/doc/current/man1/mpirun.1.php), while mapping by slot means that each rank is mapped to a consecutive slot. The only scenario in which mapping by slot results in non-compact mapping is if oversubscription is enabled; then the oversubscribed ranks are assigned round robin to the slots.
…ith mpirun it is currently free to migrate between cores within a NUMA domain. On Snellius I've seen some strange issues with occasionally very slow performance (10x slower than normal), potentially due to the OS thread scheduling being silly. Process binding leads to better _and_ more reproducible results
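
A minimal sketch of how the compact mapping described in these commit messages can be expressed for Open MPI through environment variables in a ReFrame hook. The mapping-policy variable is the one discussed later in this thread; the binding-policy variable and the helper function itself are illustrative assumptions, not a copy of the actual hook:

import reframe as rfm

def set_compact_mapping_openmpi(test: rfm.RegressionTest, physical_cpus_per_task: int):
    """Illustrative sketch: compact mapping for Open MPI's mpirun via MCA environment variables."""
    # Map ranks to consecutive slots, giving each rank `physical_cpus_per_task` processing
    # elements (PE=n), which yields the compact placement described above.
    test.env_vars['OMPI_MCA_rmaps_base_mapping_policy'] = 'slot:PE=%s' % physical_cpus_per_task
    # Assumption: also bind each rank to the cores it was mapped to, so processes cannot
    # migrate within the NUMA domain (the behaviour the second commit message describes).
    test.env_vars['OMPI_MCA_hwloc_base_binding_policy'] = 'core'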
@casparvl (Collaborator, Author) commented May 3, 2024

Fixes #138

@casparvl (Collaborator, Author) commented May 3, 2024

Hm, not sure why https://github.com/EESSI/test-suite/actions/runs/8942074753/job/24563848640?pr=139 is happening:

def set_compact_process_binding(test: rfm.RegressionTest):
...
    check_proc_attribute_defined(test, 'num_cpus_per_core')
    num_cpus_per_core = test.current_partition.processor.num_cpus_per_core
    physical_cpus_per_task = int(test.num_cpus_per_task / num_cpus_per_core)
...

The check_proc_attribute_defined call should have printed a clear error that this property wasn't defined in the ReFrame config file. I'm also not sure why we wouldn't have run into this before, but I guess the CI test wasn't there yet...

The solution will be to make sure that num_cpus_per_core is defined in the CI config file... I'll check how we can achieve that.
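
A minimal sketch (with a hypothetical system/partition name and example values) of the kind of 'processor' section a ReFrame settings file can carry, so that num_cpus_per_core is defined even for dry runs in CI where topology autodetection does not happen:

# Hypothetical ReFrame settings snippet; names and values are examples, not the actual CI config.
site_configuration = {
    'systems': [
        {
            'name': 'github_ci',                  # hypothetical system name
            'descr': 'GitHub Actions runner (dry runs only)',
            'hostnames': ['.*'],
            'partitions': [
                {
                    'name': 'default',
                    'scheduler': 'local',
                    'launcher': 'mpirun',
                    'environs': ['default'],
                    'processor': {                # topology normally autodetected; hard-coded here
                        'num_sockets': 1,
                        'num_cpus': 4,
                        'num_cpus_per_socket': 4,
                        'num_cpus_per_core': 2,   # the attribute the hook needs
                    },
                },
            ],
        },
    ],
    'environments': [
        {'name': 'default'},
    ],
}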

…reading is enabled in the github CI environment, but it doesn't really matter for the dry-runs anyway.
@smoors (Collaborator) commented May 5, 2024

The check_proc_attribute_defined should have printed a clear error that this property wasn't defined in the ReFrame config file.

There is a bug in check_proc_attribute_defined: the last line (the raise) has one indent too many.
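
To make the effect concrete, a minimal sketch (assumed structure, not the actual test-suite code) of what such an over-indented raise does: with the extra indent, the raise is only reached when no partition is set at all, so a missing processor attribute falls through silently and a less clear error surfaces downstream instead.

def check_proc_attribute_defined_buggy(test, attribute) -> bool:
    """Assumed, simplified structure with the over-indented raise."""
    if test.current_partition:
        if getattr(test.current_partition.processor, attribute):
            return True
        msg = f"Processor attribute '{attribute}' is not defined in the ReFrame config file"
    else:
        msg = "Current partition is not set, so processor attributes cannot be checked"
        raise AttributeError(msg)  # bug: one indent too many, only reached in the else branch


def check_proc_attribute_defined_fixed(test, attribute) -> bool:
    """Same structure with the raise dedented, so it fires whenever the attribute is missing."""
    if test.current_partition:
        if getattr(test.current_partition.processor, attribute):
            return True
        msg = f"Processor attribute '{attribute}' is not defined in the ReFrame config file"
    else:
        msg = "Current partition is not set, so processor attributes cannot be checked"
    raise AttributeError(msg)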

@smoors (Collaborator) commented May 6, 2024

For Slurm versions >= 22.05 and < 23.11, there is another issue when using srun: --cpus-per-task is not inherited from the job environment into the srun tasks, which prevents "bind each process to subsequent domains of test.num_cpus_per_task cores".

Would you mind adding the following to the set_compact_process_binding hook to fix this?

    if test.current_partition.launcher_type().registered_name == 'srun':
        test.env_vars['SRUN_CPUS_PER_TASK'] = test.num_cpus_per_task

@smoors (Collaborator) commented May 7, 2024

More testing fun with srun.

By default, srun does a block distribution over nodes but a cyclic distribution over sockets, which is different from mpirun with test.env_vars['OMPI_MCA_rmaps_base_mapping_policy'] = 'slot:PE=%s' % physical_cpus_per_task, which does a block distribution over sockets (a toy illustration of the difference follows after the snippet below).
To get the same behavior as mpirun we can use the environment variable $SLURM_DISTRIBUTION. I propose to only set the srun-specific environment variables when srun is actually used as the launcher:

    if test.current_partition.launcher_type().registered_name == 'srun':
        test.env_vars['SRUN_CPUS_PER_TASK'] = test.num_cpus_per_task
        test.env_vars['SLURM_DISTRIBUTION'] = 'block:block'
        test.env_vars['SLURM_CPU_BIND'] = 'verbose'
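
A toy illustration of that difference (plain Python, not Slurm or test-suite code): for 4 single-core ranks on a node with 2 sockets of 4 cores each, a cyclic distribution over sockets alternates ranks between the sockets, while a block distribution fills the first socket before moving on, which is the compact placement the mpirun mapping above produces.

def place_ranks(n_ranks, n_sockets, ranks_per_socket, policy):
    """Toy model of rank-to-socket placement under a cyclic vs. a block socket distribution."""
    if policy == 'cyclic':    # srun's default over sockets
        return {rank: rank % n_sockets for rank in range(n_ranks)}
    if policy == 'block':     # what 'block:block' (and mpirun's slot:PE=n mapping) gives
        return {rank: rank // ranks_per_socket for rank in range(n_ranks)}
    raise ValueError(f'unknown policy: {policy}')


print(place_ranks(4, 2, 4, 'cyclic'))  # {0: 0, 1: 1, 2: 0, 3: 1}: ranks alternate between sockets
print(place_ranks(4, 2, 4, 'block'))   # {0: 0, 1: 0, 2: 0, 3: 0}: ranks fill socket 0 first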

@casparvl (Collaborator, Author) commented May 7, 2024

srun does a block distribution over nodes, but a cyclic distribution over sockets

Good catch, I hadn't realized! What you are proposing makes sense, including doing this only if srun is used as the launcher.

A few things I am wondering regarding

test.env_vars['SRUN_CPUS_PER_TASK'] = test.num_cpus_per_task
  1. I only see a SLURM_CPUS_PER_TASK in the srun and sbatch manuals (https://slurm.schedmd.com/srun.html, https://slurm.schedmd.com/sbatch.html). Did you mean that?
  2. As I understand it, SLURM_CPUS_PER_TASK is an input environment variable for srun, i.e. it does the same as when you would invoke srun -c $SLURM_CPUS_PER_TASK. This means it determines the size of the allocation for the job step that srun creates, doesn't it? In that case, I'm not sure if the process-binding hook is the right place to set it, it seems more general to me. Shouldn't it also be set for tests that don't care about process binding?
  3. Should we fix things that are broken in SLURM in our test suite? SLURM_CPUS_PER_TASK is a documented output environment variable for sbatch. If it's not set, that's a bug in SLURM. It means that all jobs that rely on srun picking up the correct number of cpus per task from the parent allocation suddenly fail, or behave differently. If that's the case, I'm ok with the tests failing too, no?

@smoors (Collaborator) commented May 7, 2024

I only see a SLURM_CPUS_PER_TASK in the srun and sbatch manuals https://slurm.schedmd.com/srun.html , https://slurm.schedmd.com/sbatch.html . Did you mean that?

I mean SRUN_CPUS_PER_TASK; see for example the docs for v22.05: https://slurm.schedmd.com/archive/slurm-22.05.0/srun.html#OPT_cpus-per-task
They have now reverted to the old behavior, so SRUN_CPUS_PER_TASK is no longer needed.

As I understand it, SLURM_CPUS_PER_TASK is an input environment variable for srun, i.e. it does the same as when you would invoke srun -c $SLURM_CPUS_PER_TASK. This means it determines the size of the allocation for the job step that srun creates, doesn't it? In that case, I'm not sure if the process-binding hook is the right place to set it, it seems more general to me. Shouldn't it also be set for tests that don't care about process binding?

It doesn't determine the allocation of the job step, only the binding of the tasks. I agree, though, that it's better to always set it so the behavior is the same regardless of the Slurm version being used.

Should we fix things that are broken in SLURM in our test suite? SLURM_CPUS_PER_TASK is a documented output environment variable for sbatch. If it's not set, that's a bug in SLURM. It means that all jobs that rely on srun picking up the correct number of cpus per task from the parent allocation suddenly fail, or behave differently. If that's the case, I'm ok with the tests failing too, no?

It's not a bug; it is intended behavior for the range of Slurm versions I mentioned. They have now reverted this behavior to what it was before 22.05.

@smoors (Collaborator) commented May 7, 2024

I'll make a separate PR for the SRUN_CPUS_PER_TASK issue.

Comment on lines +443 to +444
test.env_vars['SLURM_DISTRIBUTION'] = 'block:block'
test.env_vars['SLURM_CPU_BIND'] = 'verbose'
@smoors (Collaborator) commented:

Add a log message for SLURM_DISTRIBUTION here, and move the log message for SLURM_CPU_BIND into this if block?

@casparvl (Collaborator, Author) replied:

Done. I also added a log message for I_MPI_PIN_CELL, since that was missing too, and nested the mpirun variables in a separate if. Additionally, I added a warning message for launchers that this function does not support. It's not a blocker, so I won't abort the test, but the user should be aware that performance might be sub-par if the default binding strategy isn't great.
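
A simplified sketch (assumed shape, not the merged code) of the resulting structure of set_compact_process_binding: srun and mpirun each get their own branch, and any other launcher only triggers a warning. SRUN_CPUS_PER_TASK is left out here since it is being handled in a separate PR, and the I_MPI_PIN_CELL value shown is an assumption.

import warnings

import reframe as rfm


def set_compact_process_binding(test: rfm.RegressionTest):
    """Simplified sketch of the hook's structure after this review; not the merged implementation."""
    num_cpus_per_core = test.current_partition.processor.num_cpus_per_core
    physical_cpus_per_task = int(test.num_cpus_per_task / num_cpus_per_core)

    launcher = test.current_partition.launcher_type().registered_name
    if launcher == 'srun':
        # Block distribution over nodes and sockets, matching mpirun's compact mapping,
        # plus verbose binding output for debugging (logging of these values goes here).
        test.env_vars['SLURM_DISTRIBUTION'] = 'block:block'
        test.env_vars['SLURM_CPU_BIND'] = 'verbose'
    elif launcher == 'mpirun':
        # Compact mapping: consecutive slots with physical_cpus_per_task cores per rank (Open MPI),
        # and core-level pinning for Intel MPI ('core' is an assumed value for I_MPI_PIN_CELL).
        test.env_vars['OMPI_MCA_rmaps_base_mapping_policy'] = 'slot:PE=%s' % physical_cpus_per_task
        test.env_vars['I_MPI_PIN_CELL'] = 'core'
    else:
        # Not a blocker, so don't abort the test, but make the user aware that binding is not
        # controlled for this launcher and performance may be sub-par.
        warnings.warn(f"Process binding not set for launcher '{launcher}'")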

@casparvl (Collaborator, Author) commented May 7, 2024

I mean SRUN_CPUS_PER_TASK; see for example the docs for v22.05: https://slurm.schedmd.com/archive/slurm-22.05.0/srun.html#OPT_cpus-per-task
They have now reverted to the old behavior, so SRUN_CPUS_PER_TASK is no longer needed.

...

It's not a bug; it is intended behavior for the range of Slurm versions I mentioned. They have now reverted this behavior to what it was before 22.05.

Oh, I didn't realize that. Weird change, and I'm happy they changed it back :D

It doesn't determine the allocation of the job step, only the binding of the tasks. I agree, though, that it's better to always set it so the behavior is the same regardless of the Slurm version being used.

Interesting, I never realized that. I thought that srun -c 2 would create a cgroup with two CPU cores per task, just like it does for srun --mem=<something>. But you're right, it doesn't; it only sets the affinity of the process:

[casparl@int5 ~]$ salloc -n 2 --ntasks-per-node 2 -c 24 -t 10:00 -p genoa
...
[casparl@tcn928 ~]$ srun -n 2 -c 2 --mem=2G cat /sys/fs/cgroup/memory/slurm/uid_45397/job_6176002/step_0/memory.limit_in_bytes
2147483648
2147483648
[casparl@tcn928 ~]$ srun -n 2 -c 2 --mem=2G cat /sys/fs/cgroup/cpuset/slurm/uid_45397/job_6176002/step_1/cpuset.cpus
64-65,80-81
64-65,80-81
[casparl@tcn739 ~]$ srun -c 2 numactl --show
...
physcpubind: 80 81
...
physcpubind: 64 65

But agreed, it's probably still good to set this at a more general level in any case, even if the result is a form of binding. The issue if we don't set it is that we don't control what the default affinity is: if that's a single core, a test that we think doesn't use process binding will still be bound (to a single core). That's awkward. If you can create a PR for that: great!

Caspar van Leeuwen added 3 commits May 7, 2024 16:25
@smoors (Collaborator) left a comment

Tested for Open MPI with mpirun and srun, seems to work fine.

@smoors merged commit 6622529 into EESSI:main on May 10, 2024 (10 checks passed)
@boegel changed the title from "Use compact process binding for gromacs" to "Use compact process binding for GROMACS" on Jun 27, 2024
@casparvl deleted the use_compact_process_binding_gromacs branch on September 4, 2024