ch4/ofi: refactor gpu pipeline #6891
base: main
Conversation
Force-pushed from 0aa81ec to 4632550
Force-pushed from 14efa62 to bdd903e
test:mpich/ch4/ofi ✔️
src/mpid/ch4/netmod/ofi/ofi_types.h (outdated)
    /* Bit mask for MPIR_ERR_OTHER */
    #define MPIDI_OFI_ERR_OTHER (0x1ULL)
    /* Bit mask for MPIR_PROC_FAILED */
    #define MPIDI_OFI_ERR_PROC_FAILED (0x2ULL)
    /* Bit mask for gpu pipeline */
    #define MPIDI_OFI_IDATA_PIPELINE (1ULL << 32)
Should we use `(1ULL << MPIDI_OFI_IDATA_SRC_ERROR_BITS)` in case those values change in the future?
Changed.
    {
    -    struct recv_alloc *p = MPIR_Async_thing_get_state(thing);
    +    struct recv_chunk_alloc *p = MPIR_Async_thing_get_state(thing);
         MPIR_Request *rreq = p->rreq;

         /* arbitrary threshold */
         if (MPIDI_OFI_REQUEST(rreq, pipeline_info.recv.num_inrecv) > 1) {
Just so I'm understanding, the number of outstanding pipeline recvs is capped at 1?
Yea. My thinking was that we probably don't want thousands of chunk recvs waiting since they all match the same signature. Thus we need an arbitrary threshold. 1 is the minimum, but we probably can use something bigger, like 10.
CXI is sensitive to expected vs. unexpected recvs. Did the previous code post everything at once? I'm concerned this will negatively affect performance if the number is too small.
I think previously it would keep posting until running out of genq buffers or fi_recv returned an error, then it went one at a time. I think that is equivalent to no limit here. What do you suggest? Remove the limit altogether, or set it to some reasonable number, e.g. 100?
I'm leaning towards no limit.
Force-pushed from 9a1b569 to 5acf64f
test:mpich/ch4/ofi
MPIR_gpu_req is a union type for either an MPL_gpu_request or a MPIR_Typerep_req, so it is not just for gpu. Potentially this type can be extended to include other internal async task handles. Thus we rename it to MPIR_async_req. We also establish the convention of naming the variable async_req.
Add an inline wrapper for testing MPIR_async_req. Modify the order of header inclusion due to the dependency on typerep_pre.h.
Refactor the async copy in receive events using MPIR_async facilities.
Refactor the async copy before sending a chunk.
Both gpu_send_task_queue and gpu_recv_task_queue have been ported to async things.
Pipeline send allocates chunk buffers then spawns async copy. The allocation may run out of genq buffers, thus it is designed as async tasks. The send copy is triggered upon completion of buffer alloc, thus it is renamed to spawn_send_copy and turned into an internal static function. This removes MPIDI_OFI_global.gpu_send_queue.
Pipeline recv allocates chunk buffers and then posts fi_trecv. The allocation may run out of genq buffers and we also control the number of outstanding recvs, thus it is designed as async tasks. The async recv copy is triggered in the recv event when data arrives. This removes MPIDI_OFI_global.gpu_recv_queue. All ofi-layer progress routines for gpu pipelining are now removed.
Consolidate the gpu pipeline code. MPIDI_OFI_gpu_pipeline_request is now an internal struct in ofi_gpu_pipeline.c, rename to struct chunk_req. MPIDI_OFI_gpu_pipeline_recv_copy is now an internal function, rename to start_recv_copy.
Move all gpu pipeline specific code into ofi_gpu_pipeline.c. Make a new function MPIDI_OFI_gpu_pipeline_recv that fills rreq with persistent pipeline_info data. Rename the original MPIDI_OFI_gpu_pipeline_recv into static function start_recv_chunk.
Make the code cleaner to separate the pipeline_info type into a union of send and recv.
Don't mix the usage of cc_ptr, use separate and explicit counters to track the progress and completion of chunks.
Follow a similar approach as nonblocking collectives, internal pipeline chunks use separate tag space (MPIDI_OFI_GPU_PIPELINE_SEND) and incrementing tags to avoid mismatch with regular messages.
Separate the recv tasks between the initial header and chunks since the paths clearly separate them. Use a single async item for all chunk recvs rather than unnecessarily enqueuing individual chunks, since we can track the chunks in the state.
It is needed to compile under noinline configuration.
Move these utility functions to ofi_impl.h since they are simple and non-specific. It also simplifies figuring out which file to include especially for .c files.
test:mpich/ch4/ofi ✔️ |
Remove the limit in posting gpu pipeline recv chunks. The limit is effectively controlled by the maximum chunks from MPIDI_OFI_global.gpu_pipeline_recv_pool or by libfabric returning EAGAIN. In progressing the recv_chunk_alloc, we'll issue as many chunks as we can instead of one at a time. Refactor the code to have a single exit point.
I'm eagerly awaiting this feature. Could you please provide an update on the current status of this pull request?
Pull Request Description
Port the ofi GPU pipeline code to using MPIR_async_things (#6841).
[skip warnings]
Author Checklist
Particularly focus on why, not what. Reference background, issues, test failures, xfail entries, etc.
Commits are self-contained and do not do two things at once.
Commit message is of the form:
module: short description
Commit message explains what's in the commit.
Whitespace checker. Warnings test. Additional tests via comments.
For non-Argonne authors, check contribution agreement.
If necessary, request an explicit comment from your company's PR approval manager.