ch4/ofi: refactor gpu pipeline #6891
base: main
Conversation
Force-pushed from 0aa81ec to 4632550
Force-pushed from 14efa62 to bdd903e
test:mpich/ch4/ofi ✔️
src/mpid/ch4/netmod/ofi/ofi_types.h (outdated)
    /* Bit mask for MPIR_ERR_OTHER */
    #define MPIDI_OFI_ERR_OTHER (0x1ULL)
    /* Bit mask for MPIR_PROC_FAILED */
    #define MPIDI_OFI_ERR_PROC_FAILED (0x2ULL)
    /* Bit mask for gpu pipeline */
    #define MPIDI_OFI_IDATA_PIPELINE (1ULL << 32)
Should we use `(1ULL << MPIDI_OFI_IDATA_SRC_ERROR_BITS)` in case those values change in the future?
Changed.
    {
    -    struct recv_alloc *p = MPIR_Async_thing_get_state(thing);
    +    struct recv_chunk_alloc *p = MPIR_Async_thing_get_state(thing);
         MPIR_Request *rreq = p->rreq;

         /* arbitrary threshold */
         if (MPIDI_OFI_REQUEST(rreq, pipeline_info.recv.num_inrecv) > 1) {
Just so I'm understanding, the number of outstanding pipeline recvs is capped at 1?
Yea. My thinking was that we probably don't want thousands of chunk recvs waiting since they all match the same signature. Thus we need an arbitrary threshold. 1 is the minimum, but we probably can use something bigger, like 10.
CXI is sensitive to expected vs. unexpected recvs. Did the previous code post everything at once? I'm concerned this will negatively affect performance if the number is too small.
I think previously it would keep posting until running out of genq buffers or fi_recv returned an error, then it went one at a time. I think that is equivalent to no limit here. What do you suggest? Remove the limit altogether, or set it to some reasonable number, e.g. 100?
I'm leaning towards no limit.
Force-pushed from 9a1b569 to 5acf64f
test:mpich/ch4/ofi
MPIR_gpu_req is a union type for either an MPL_gpu_request or a MPIR_Typerep_req, so it is not just for gpu. Potentially this type can be extended to include other internal async task handles. Thus we rename it to MPIR_async_req. We also establish the convention of naming the variable async_req.
Add an inline wrapper for testing MPIR_async_req. Modify the order of header inclusion due to the dependency on typerep_pre.h.
Refactor the async copy in receive events using MPIR_async facilities.
Refactor the async copy before sending a chunk.
Both gpu_send_task_queue and gpu_recv_task_queue have been ported to async things.
Pipeline send allocates chunk buffers then spawns async copy. The allocation may run out of genq buffers, thus it is designed as async tasks. The send copy is triggered upon completion of buffer alloc, thus it is renamed to spawn_send_copy and turned into an internal static function. This removes MPIDI_OFI_global.gpu_send_queue.
Pipeline recv allocates chunk buffers and then posts fi_trecv. The allocation may run out of genq buffers and we also control the number of outstanding recvs, thus it is designed as async tasks. The async recv copy is triggered in the recv event when data arrives. This removes MPIDI_OFI_global.gpu_recv_queue. All ofi-layer progress routines for gpu pipelining are now removed.
Consolidate the gpu pipeline code. MPIDI_OFI_gpu_pipeline_request is now an internal struct in ofi_gpu_pipeline.c, rename to struct chunk_req. MPIDI_OFI_gpu_pipeline_recv_copy is now an internal function, rename to start_recv_copy.
Move all gpu pipeline specific code into ofi_gpu_pipeline.c. Make a new function MPIDI_OFI_gpu_pipeline_recv that fills rreq with persistent pipeline_info data. Rename the original MPIDI_OFI_gpu_pipeline_recv into static function start_recv_chunk.
Make the code cleaner to separate the pipeline_info type into a union of send and recv.
Don't mix the usage of cc_ptr, use separate and explicit counters to track the progress and completion of chunks.
Follow a similar approach as nonblocking collectives, internal pipeline chunks use separate tag space (MPIDI_OFI_GPU_PIPELINE_SEND) and incrementing tags to avoid mismatch with regular messages.
Separate the recv tasks between the initial header and chunks since the paths clearly separate them. Use a single async item for all chunk recvs rather than unnecessarily enqueuing individual chunks, since we can track the chunks in the state.
It is needed to compile under noinline configuration.
Move these utility functions to ofi_impl.h since they are simple and non-specific. It also simplifies figuring out which file to include especially for .c files.
test:mpich/ch4/ofi ✔️ |
Remove the limit in posting gpu pipeline recv chunks. The limit is effectively controlled by the maximum chunks from MPIDI_OFI_global.gpu_pipeline_recv_pool or by libfabric returning EAGAIN. In progressing the recv_chunk_alloc, we'll issue as many chunks as we can instead of one at a time. Refactor the code to have a single exit point.
I'm eagerly awaiting this feature. Could you please provide an update on the current status of this pull request?
Pull Request Description
Port the ofi GPU pipeline code to using MPIR_async_things (#6841).
[skip warnings]
Author Checklist
Particularly focus on why, not what. Reference background, issues, test failures, xfail entries, etc.
Commits are self-contained and do not do two things at once.
Commit message is of the form:
module: short description
Commit message explains what's in the commit.
Whitespace checker. Warnings test. Additional tests via comments.
For non-Argonne authors, check contribution agreement.
If necessary, request an explicit comment from your company's PR approval manager.