
Robust qv_scope_split_at implementation #9

Open
eleon opened this issue Oct 23, 2021 · 17 comments

@eleon (Member) commented Oct 23, 2021

Let’s say we have a compute node with 3 GPUs and 3 NUMA domains. The 3 GPUs hang off the first NUMA domain. When I use qv_scope_split_at(..., QV_HW_OBJ_GPU, ...), I would expect the subscopes to be derived from NUMA 0, but this implementation will most likely derive one subscope per NUMA (commit f2d5cee).
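
For concreteness, a minimal sketch of the call being discussed. The five-argument qv_scope_split_at shape follows the signature quoted later in this thread; ctx, base_scope, group_id, and the error handling are illustrative placeholders, not code from the repository.

    /* Illustrative sketch: a node with 3 NUMA domains and 3 GPUs,
     * with all 3 GPUs attached to NUMA 0. */
    qv_scope_t *gpu_scope = NULL;

    /* Split the base scope at the GPU level. The expectation in this
     * issue: every resulting subscope should be carved out of NUMA 0
     * (where the GPUs are attached), not one subscope per NUMA. */
    int rc = qv_scope_split_at(
        ctx,            /* qv_context_t *                     */
        base_scope,     /* scope being split                  */
        QV_HW_OBJ_GPU,  /* hardware object type to split at   */
        group_id,       /* e.g., rank % 3 to pick a GPU group */
        &gpu_scope);
    if (rc != QV_SUCCESS) {
        /* handle the error */
    }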

@GuillaumeMercier (Collaborator)

You can reuse the guided-mode implementation from Hsplit if you wish, since the functionality seems rather similar to me (this does not answer @eleon's comment, though).

@samuelkgutierrez (Member)

Please re-test to see if 2703e88 fixes this issue.

@eleon (Member, Author) commented Feb 26, 2022

Thank you, @samuelkgutierrez. Unfortunately, this is still an issue. Here's an example.

  • A node with 2 sockets (NUMAs) and 2 GPUs.
  • The 2 GPUs are attached to socket 0.
  • Socket 0 PUs: 0-17, 36-53
  • Socket 1 PUs: 18-35, 54-71

The following program performs a split_at by GPU across 2 processes. What I would expect is that each process gets a GPU and that both processes are assigned to socket (NUMA) 0, since the GPUs are attached to socket 0. However, each process is assigned to a different socket.
leon@pascal4:qv$ QV_PORT=55996 srun -N1 -n2 quo-vadis/build-pascal/tests/test-mpi-phases
[0] Base scope w/36 cores, running on 0-17
[1] Base scope w/36 cores, running on 18-35
=> [1] Split: got 18 cores, running on 18-35,54-71
[1] Doing pthread_things with 18 cores
[1] Launching 1 GPU kernels
GPU 0 PCI Bus ID = 0000:07:00.0
=> [0] Split: got 18 cores, running on 0-17,36-53
[0] Doing pthread_things with 18 cores
[0] Launching 1 GPU kernels
GPU 0 PCI Bus ID = 0000:04:00.0
[1] Popped up to 18-35
[0] Popped up to 0-17
[...]
=> [1] Split@GPU: got 1 GPUs, running on 18-35,54-71
   [1] GPU 0 PCI Bus ID = 0000:07:00.0
=> [0] Split@GPU: got 1 GPUs, running on 0-17,36-53
   [0] GPU 0 PCI Bus ID = 0000:04:00.0

Perhaps this will be solved by the affinity-preserving CPU/GPU algorithms for split.
This test used commit d54464b, since later commits have an issue.

@samuelkgutierrez (Member)

Thank you for testing, @eleon. Yes, an affinity preserving algorithm should fix this issue.

@eleon (Member, Author) commented Nov 22, 2022

Greetings, @samuelkgutierrez. There are still issues with the latest build. Same test machine as above, same command as above:

leon@pascal6:qv$ QV_PORT=55996 srun -N1 -n2 quo-vadis/build-pascal/tests/test-mpi-phases
[0] Base scope w/36 cores, running on 0-17
[1] Base scope w/36 cores, running on 18-35
=> [0] Split: got 18 cores, running on 0-17,36-53
[0] Doing pthread_things with 18 cores
[0] Launching 1 GPU kernels
GPU 0 PCI Bus ID = 0000:04:00.0
=> [1] Split: got 9 cores, running on 9-17,45-53
[1] Doing pthread_things with 9 cores
[1] Launching 1 GPU kernels
GPU 0 PCI Bus ID = 0000:07:00.0
[0] Popped up to 0-17
[1] Popped up to 18-35
=> [0] Split@NUMA: got 1 NUMAs, running on 0-17,36-53
=> [1] Split@NUMA: got 0 NUMAs, running on 9-17,45-53
=> [0] NUMA leader: Launching OMP region
[0] Doing OpenMP things with 36 PUs
=> [1] NUMA leader: Launching OMP region
[1] Doing OpenMP things with 18 PUs
[0] Popped up to 0-17
[1] Popped up to 18-35
=> [0] Split@GPU: got 1 GPUs, running on 0-17,36-53
   [0] GPU 0 PCI Bus ID = 0000:04:00.0
=> [1] Split@GPU: got 1 GPUs, running on 9-17,45-53
   [1] GPU 0 PCI Bus ID = 0000:07:00.0

@samuelkgutierrez (Member)

Can you please try again by modifying the test to use QV_SCOPE_SPLIT_AFFINITY_PRESERVING?

@samuelkgutierrez (Member)

No, you will have to modify the test code to use QV_SCOPE_SPLIT_AFFINITY_PRESERVING wherever a specific group_id is currently provided. This applies to both qv_scope_split() and qv_scope_split_at().
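
A minimal sketch of the suggested change, assuming the qv_scope_split parameter order implied elsewhere in this thread (a piece count followed by the group_id); the scope names and piece count are illustrative.

    /* Before: the test passes an explicit group id (e.g., the MPI rank). */
    rc = qv_scope_split(ctx, base_scope, 2 /* npieces */, rank, &sub_scope);

    /* After: let the runtime group tasks by their current affinity. */
    rc = qv_scope_split(ctx, base_scope, 2 /* npieces */,
                        QV_SCOPE_SPLIT_AFFINITY_PRESERVING, &sub_scope);

    /* The same substitution applies to qv_scope_split_at(). */
    rc = qv_scope_split_at(ctx, base_scope, QV_HW_OBJ_GPU,
                           QV_SCOPE_SPLIT_AFFINITY_PRESERVING, &gpu_scope);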

@eleon (Member, Author) commented Nov 22, 2022

Thank you, @samuelkgutierrez! Using QV_SCOPE_SPLIT_AFFINITY_PRESERVING:

leon@pascal6:qv$ QV_PORT=55996 srun -N1 -n2 quo-vadis/build-pascal/tests/test-mpi-phases

===Phase 1: Regular split===
[0] Base scope w/36 cores, running on 0-17
[1] Base scope w/36 cores, running on 18-35
=> [0] Split: got 18 cores, running on 0-17,36-53
[0] Doing pthread_things with 18 cores
[0] Launching 2 GPU kernels
GPU 0 PCI Bus ID = 0000:04:00.0
GPU 1 PCI Bus ID = 0000:07:00.0
=> [1] Split: got 18 cores, running on 18-35,54-71
[1] Doing pthread_things with 18 cores
[1] Launching 0 GPU kernels
[0] Popped up to 0-17
[1] Popped up to 18-35

===Phase 2: NUMA split===
[0]: #NUMAs=2 numa_scope_id=0
[1]: #NUMAs=2 numa_scope_id=0
=> [0] Split@NUMA: got 1 NUMAs, running on 0-17,36-53
=> [0] NUMA leader: Launching OMP region
[0] Doing OpenMP things with 36 PUs
=> [1] Split@NUMA: got 1 NUMAs, running on 18-35,54-71
=> [1] NUMA leader: Launching OMP region
[1] Doing OpenMP things with 36 PUs
[0] Popped up to 0-17
[1] Popped up to 18-35

===Phase 3: GPU split===
=> [0] Split@GPU: got 2 GPUs, running on 0-17,36-53
   [0] GPU 0 PCI Bus ID = 0000:04:00.0
   [0] GPU 1 PCI Bus ID = 0000:07:00.0
=> [1] Split@GPU: got 0 GPUs, running on 18-35,54-71

The issues are with the following calls (see the sketch below):

  • Phase 2: qv_scope_taskid
    numa_scope_id should be 0 and 1, but both tasks report 0.
  • Phase 3: qv_scope_split_at(..., QV_HW_OBJ_GPU, QV_SCOPE_SPLIT_AFFINITY_PRESERVING, ...)
    Each task should get one GPU, rather than one task getting 2 GPUs and the other 0.
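
A sketch of the two calls in question, with the observed and expected results from the run above noted as comments; the qv_scope_taskid argument list and the scope names are assumptions, not the actual test-mpi-phases.c code.

    /* Phase 2 (observed): both tasks report numa_scope_id == 0.
     * Expected: the two tasks report 0 and 1, respectively. */
    int numa_scope_id = -1;
    rc = qv_scope_taskid(ctx, numa_scope, &numa_scope_id);

    /* Phase 3 (observed): task 0 gets both GPUs, task 1 gets none.
     * Expected: each task gets exactly one GPU. */
    rc = qv_scope_split_at(ctx, base_scope, QV_HW_OBJ_GPU,
                           QV_SCOPE_SPLIT_AFFINITY_PRESERVING, &gpu_scope);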

@samuelkgutierrez (Member)

Thank you, @eleon. Can you push the changes you made so I can see what's going on? Regarding the second issue, are both GPUs attached to the package containing cores 0-17?

@eleon (Member, Author) commented Nov 22, 2022

Greetings, @samuelkgutierrez. I pushed the changes to test-mpi-phases.c last night. Yes, both GPUs are attached to the first socket (cores 0-17). Please let me know if I can test further. Also, I am testing natively, but the same architecture is available as cts1-pascal.xml in the mpibind repo :)

@samuelkgutierrez (Member)

Yikes! I didn't notice. My apologies.

If both GPUs are attached to the same socket, then QV_SCOPE_SPLIT_AFFINITY_PRESERVING is working properly. We will have to think about another policy that spreads the resources across tasks in a given scope to accomplish what you'd like. I'll take a closer look when I have a chance. Thank you.

@eleon (Member, Author) commented Nov 22, 2022

Sounds good, @samuelkgutierrez. Thank you!
Yes, the objective of the split_at operation is to split the resources among tasks according to a specific hardware resource. It could be NUMA, GPU, etc. For example, when splitting at GPU on pascal, this would mean binding the tasks to the first socket (the NUMA domain local to the GPUs) and distributing the available GPUs to the tasks. Hope this helps.
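
A sketch of that objective on pascal (2 tasks, 2 GPUs, both on socket 0), assuming an already-created context and base scope, and assuming a qv_bind_push(ctx, scope) call as suggested by the "Popped up to ..." lines in the logs above; names are illustrative.

    /* Desired split_at-by-GPU behavior on pascal:
     *  - each task's subscope contains exactly one GPU, and
     *  - each task's CPUs come from socket 0, local to its GPU. */
    qv_scope_t *gpu_scope = NULL;
    rc = qv_scope_split_at(ctx, base_scope, QV_HW_OBJ_GPU, rank % 2, &gpu_scope);

    /* Bind to the subscope: both tasks should end up on socket-0 cores
     * near their assigned GPUs, instead of one task per socket. */
    rc = qv_bind_push(ctx, gpu_scope);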

@eleon (Member, Author) commented Nov 22, 2022

Adding another thought before I forget. The way I see the regular split and the split_at operation is that split is simply an instance of split_at by NUMA.

@samuelkgutierrez (Member)

> Adding another thought before I forget. The way I see the regular split and the split_at operation is that split is simply an instance of split_at by NUMA.

Maybe split_at() is more like split() with a SPREAD policy (which we don't yet have implemented)? I think the idea is that we want to maximally distribute the requested resource among the tasks in the provided scope. Is that a reasonable way of thinking about this?
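
Putting the two viewpoints side by side as a sketch. QV_HW_OBJ_NUMANODE is assumed to be the NUMA object-type constant, and the SPREAD policy is, as noted above, not implemented, so it appears only as a hypothetical name inside a comment.

    /* View 1 (eleon): a regular split is an instance of split_at by NUMA,
     * i.e., this call ... */
    rc = qv_scope_split(ctx, base_scope, npieces, group_id, &sub_scope);
    /* ... would behave like this one: */
    rc = qv_scope_split_at(ctx, base_scope, QV_HW_OBJ_NUMANODE, group_id, &sub_scope);

    /* View 2 (samuelkgutierrez): split_at is split with a hypothetical,
     * not-yet-implemented SPREAD policy that maximally distributes the
     * requested resource among the tasks in the scope, e.g.:
     *   qv_scope_split(ctx, base_scope, npieces, QV_SCOPE_SPLIT_SPREAD, &sub_scope);
     */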

@eleon (Member, Author) commented Nov 23, 2022

Possibly, @samuelkgutierrez. It's just that I think about it the opposite way: split is a special case of split_at. Thinking in terms of hwloc, split operates from the root of the tree down, while split_at requires a tree whose root is different depending on the hardware resource of interest.

@eleon (Member, Author) commented Jul 21, 2023

Good morning, @samuelkgutierrez. Progress! But some issues too.

Case 1: Not using USE_AFFINITY_PRESERVING (this is the closest to the behavior we are looking for).

leon@pascal4:qv$ QV_PORT=55996 srun -n2 quo-vadis/build-pascal/tests/test-mpi-phases 

===Phase 1: Regular split===
[1] Base scope w/36 cores, running on 18-35
[0] Base scope w/36 cores, running on 0-17
=> [0] Split: got 18 cores, running on 0-17,36-53
[0] Doing pthread_things with 18 cores
[0] Launching 1 GPU kernels
GPU 0 PCI Bus ID = 0000:04:00.0
=> [1] Split: got 18 cores, running on 18-35,54-71
[1] Doing pthread_things with 18 cores
[1] Launching 1 GPU kernels
GPU 0 PCI Bus ID = 0000:07:00.0
[0] Popped up to 0-17
[1] Popped up to 18-35

===Phase 2: NUMA split===
[0]: #NUMAs=2 numa_scope_id=0
[1]: #NUMAs=2 numa_scope_id=0
=> [1] Split@NUMA: got 1 NUMAs, running on 18-35,54-71
=> [0] Split@NUMA: got 1 NUMAs, running on 0-17,36-53
=> [1] NUMA leader: Launching OMP region
[1] Doing OpenMP things with 36 PUs
=> [0] NUMA leader: Launching OMP region
[0] Doing OpenMP things with 36 PUs
[1] Popped up to 18-35
[0] Popped up to 0-17

===Phase 3: GPU split===
=> [0] Split@GPU: got 1 GPUs, running on 0-17,36-53
   [0] GPU 0 PCI Bus ID = 0000:04:00.0
=> [1] Split@GPU: got 1 GPUs, running on 18-35,54-71
   [1] GPU 0 PCI Bus ID = 0000:07:00.0

The GPUs are split correctly among the MPI workers. However, the assigned CPUs are not local to the assigned GPUs.

Case 2: Using USE_AFFINITY_PRESERVING

leon@pascal4:qv$ QV_PORT=55996 srun -n2 quo-vadis/build-pascal/tests/test-mpi-phases 

===Phase 1: Regular split===
[0] Base scope w/36 cores, running on 0-17
[1] Base scope w/36 cores, running on 18-35
=> [1] Split: got 18 cores, running on 18-35,54-71
[1] Doing pthread_things with 18 cores
[1] Launching 0 GPU kernels
=> [0] Split: got 18 cores, running on 0-17,36-53
[0] Doing pthread_things with 18 cores
[0] Launching 2 GPU kernels
GPU 0 PCI Bus ID = 0000:04:00.0
GPU 1 PCI Bus ID = 0000:07:00.0
[1] Popped up to 18-35
[0] Popped up to 0-17

===Phase 2: NUMA split===
[0]: #NUMAs=2 numa_scope_id=0
[1]: #NUMAs=2 numa_scope_id=0
=> [1] Split@NUMA: got 1 NUMAs, running on 18-35,54-71
=> [0] Split@NUMA: got 1 NUMAs, running on 0-17,36-53
=> [1] NUMA leader: Launching OMP region
[1] Doing OpenMP things with 36 PUs
=> [0] NUMA leader: Launching OMP region
[0] Doing OpenMP things with 36 PUs
[1] Popped up to 18-35
[0] Popped up to 0-17

===Phase 3: GPU split===
=> [0] Split@GPU: got 2 GPUs, running on 0-17,36-53
   [0] GPU 0 PCI Bus ID = 0000:04:00.0
   [0] GPU 1 PCI Bus ID = 0000:07:00.0
=> [1] Split@GPU: got 0 GPUs, running on 0-17,36-53

Two main issues here:

  • While the CPUs are coming from the socket with the GPUs, they are not split across the MPI workers. I would expect something like Task 0 with CPUs 0-8 (+SMT-2 threads) and Task 1 with CPUs 9-17 (+SMT-2 threads).
  • The GPUs are not split across the tasks.

Thanks!

@eleon (Member, Author) commented Sep 21, 2023

Another subtlety about the split_at operation and the group_id parameter (including USE_AFFINITY_PRESERVING).

qv_scope_split_at(ctx, scope, type, group_id, subscope)

Let's say we split at GPUs. The question I'm looking to answer is: can the resulting subscopes have more than one GPU?

Here's my desired behavior.

Let's say we have a node with 3 GPUs and a job with 2 tasks.

  • If the caller specifies a non-negative integer as the group_id, then each subscope should have exactly 1 GPU. For example:

    qv_scope_split_at(ctx, scope, QV_HW_OBJ_GPU, rank % 3, &gpu_scope);
    

    In this case, rank 0 may get GPU 0 and rank 1 may get GPU 1. GPU 2 will be idle because not enough tasks were launched.

  • If the caller specifies USE_AFFINITY_PRESERVING as the group_id, then a subscope may have more than 1 GPU. For example:

    qv_scope_split_at(ctx, scope, QV_HW_OBJ_GPU, QV_SCOPE_SPLIT_AFFINITY_PRESERVING, &gpu_scope);
    

    In this case, rank 0 may get GPUs 0 and 1 and rank 1 may get GPU 2. It is up to rank 0 to use 1 or 2 GPUs. As shown in test-mpi-phases.c, rank 0 can query the number of GPUs in the subscope and launch kernels on both GPUs accordingly (a combined sketch of both cases follows this list).
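
A combined sketch of the two cases above for a 3-GPU node and 2 tasks (pick one case or the other, not both). The qv_scope_nobjs call used to count GPUs in the resulting subscope is an assumption based on how test-mpi-phases.c is described as obtaining the GPU count; ctx, scope, and rank are illustrative.

    qv_scope_t *gpu_scope = NULL;
    int rc, ngpus = 0;

    /* Case 1: explicit group_id -> each subscope holds exactly 1 GPU.
     * With 2 tasks, rank 0 may get GPU 0, rank 1 may get GPU 1, and
     * GPU 2 stays idle. */
    rc = qv_scope_split_at(ctx, scope, QV_HW_OBJ_GPU, rank % 3, &gpu_scope);

    /* Case 2: affinity preserving -> a subscope may hold more than 1 GPU.
     * Rank 0 may get GPUs 0 and 1, while rank 1 gets GPU 2. */
    rc = qv_scope_split_at(ctx, scope, QV_HW_OBJ_GPU,
                           QV_SCOPE_SPLIT_AFFINITY_PRESERVING, &gpu_scope);

    /* Either way, a task can query how many GPUs landed in its subscope
     * and launch one kernel per GPU accordingly. */
    rc = qv_scope_nobjs(ctx, gpu_scope, QV_HW_OBJ_GPU, &ngpus);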
