
ch4/ofi: refactor gpu pipeline #6891

Open · hzhou wants to merge 17 commits into main from 2401_gpupipe
Conversation

hzhou commented Feb 1, 2024

Pull Request Description

Port the ofi GPU pipeline code to use MPIR_async_things (#6841).

[skip warnings]

Author Checklist

  • Provide Description
    Particularly focus on why, not what. Reference background, issues, test failures, xfail entries, etc.
  • Commits Follow Good Practice
    Commits are self-contained and do not do two things at once.
    Commit message is of the form: module: short description
    Commit message explains what's in the commit.
  • Passes All Tests
    Whitespace checker. Warnings test. Additional tests via comments.
  • Contribution Agreement
    For non-Argonne authors, check contribution agreement.
    If necessary, request an explicit comment from your company's PR approval manager.

hzhou marked this pull request as draft February 1, 2024 23:06
hzhou force-pushed the 2401_gpupipe branch 15 times, most recently from 0aa81ec to 4632550 on February 6, 2024 15:58
hzhou marked this pull request as ready for review February 6, 2024 15:59
hzhou force-pushed the 2401_gpupipe branch 6 times, most recently from 14efa62 to bdd903e on February 8, 2024 20:55
hzhou commented Feb 8, 2024

test:mpich/ch4/ofi ✔️
test:mpich/ch4/gpu/ofi ✔️

hzhou requested a review from raffenet February 9, 2024 14:55
/* Bit mask for MPIR_ERR_OTHER */
#define MPIDI_OFI_ERR_OTHER (0x1ULL)
/* Bit mask for MPIR_PROC_FAILED */
#define MPIDI_OFI_ERR_PROC_FAILED (0x2ULL)
/* Bit mask for gpu pipeline */
#define MPIDI_OFI_IDATA_PIPELINE (1ULL << 32)

Contributor (commenting on the hunk above):

Should we use (1ULL << MPIDI_OFI_IDATA_SRC_ERROR_BITS) in case those values change in the future?

Contributor Author (hzhou):

Changed.
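
The agreed change amounts to deriving the flag's bit position from the existing field widths rather than hard-coding bit 32. A minimal sketch, assuming MPIDI_OFI_IDATA_SRC_ERROR_BITS is the combined width of the source and error fields in the immediate data:

    /* place the pipeline flag just above the bits already consumed by the
     * source/error fields, so the mask stays valid if those widths change */
    #define MPIDI_OFI_IDATA_PIPELINE (1ULL << MPIDI_OFI_IDATA_SRC_ERROR_BITS)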

{
-    struct recv_alloc *p = MPIR_Async_thing_get_state(thing);
+    struct recv_chunk_alloc *p = MPIR_Async_thing_get_state(thing);
     MPIR_Request *rreq = p->rreq;

     /* arbitrary threshold */
     if (MPIDI_OFI_REQUEST(rreq, pipeline_info.recv.num_inrecv) > 1) {

Contributor (commenting on the hunk above):

Just so I'm understanding, the number of outstanding pipeline recvs is capped at 1?

Contributor Author (hzhou):

Yes. My thinking was that we probably don't want thousands of chunk recvs waiting, since they all match the same signature, so we need an arbitrary threshold. 1 is the minimum, but we could probably use something bigger, like 10.

Contributor:

CXI is sensitive to expected vs. unexpected recvs. Did the previous code post everything at once? I'm concerned this will negatively affect performance if the number is too small.

Contributor Author (hzhou):

I think the previous code kept posting until it ran out of genq buffers or fi_recv returned an error, then it went one at a time. I think that is equivalent to no limit here. What do you suggest? Remove the limit altogether, or set it to some reasonable number, e.g. 100?

Contributor:

I'm leaning towards no limit.
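
For readers following the new flow: the quoted hunk sits in the poll callback of the queued recv-chunk task. A minimal sketch of its shape at this point in the review is below; the result constants and post_next_chunk_recv are placeholders for the MPICH internals, and the limit itself is removed in a later commit further down.

    static int recv_chunk_alloc_poll(struct MPIR_Async_thing *thing)
    {
        struct recv_chunk_alloc *p = MPIR_Async_thing_get_state(thing);
        MPIR_Request *rreq = p->rreq;

        /* arbitrary threshold: once more than one chunk recv is outstanding
         * for this request, defer further postings to a later poll */
        if (MPIDI_OFI_REQUEST(rreq, pipeline_info.recv.num_inrecv) > 1) {
            return MPIR_ASYNC_THING_NOPROGRESS;
        }

        /* hypothetical helper: allocate a genq chunk buffer and post fi_trecv */
        int all_posted = post_next_chunk_recv(rreq);
        return all_posted ? MPIR_ASYNC_THING_DONE : MPIR_ASYNC_THING_NOPROGRESS;
    }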

hzhou force-pushed the 2401_gpupipe branch 2 times, most recently from 9a1b569 to 5acf64f on March 1, 2024 22:55
hzhou commented Mar 1, 2024

test:mpich/ch4/ofi

hzhou added 16 commits March 4, 2024 22:49
MPIR_gpu_req is a union type for either an MPL_gpu_request or a MPIR_Typerep_req, thus it is not just for GPU. Potentially this type can be extended to include other internal async task handles. Thus we rename it to MPIR_async_req.

We also establish the convention of naming the variable async_req.
Add an inline wrapper for testing MPIR_async_req.

Modify the order of header inclusion due to the dependency on
typerep_pre.h.
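
For orientation, a rough sketch of the shape described above; the discriminator names and the backend test calls (assumed here to be MPL_gpu_test and MPIR_Typerep_test) are illustrative rather than the exact definitions in the tree.

    typedef struct MPIR_async_req {
        enum { MPIR_ASYNC_REQ_GPU, MPIR_ASYNC_REQ_TYPEREP } type;
        union {
            MPL_gpu_request gpu_req;        /* asynchronous GPU copy via MPL */
            MPIR_Typerep_req typerep_req;   /* asynchronous pack/unpack via typerep */
        } u;
    } MPIR_async_req;

    /* the inline test wrapper polls whichever backend owns the request */
    static inline void MPIR_async_test(MPIR_async_req *async_req, int *is_done)
    {
        if (async_req->type == MPIR_ASYNC_REQ_GPU) {
            MPL_gpu_test(&async_req->u.gpu_req, is_done);
        } else {
            MPIR_Typerep_test(async_req->u.typerep_req, is_done);
        }
    }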
Refactor the async copy in receive events using MPIR_async facilities.
Refactor the async copy before sending a chunk.
Both gpu_send_task_queue and gpu_recv_task_queue have been ported to
async things.
Pipeline send allocates chunk buffers then spawns the async copy. The allocation may run out of genq buffers, thus it is designed as an async task.

The send copy is triggered upon completion of the buffer alloc, thus it is renamed to spawn_send_copy and turned into an internal static function.

This removes MPIDI_OFI_global.gpu_send_queue.
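
The resulting control flow, roughly: the queued task's poll callback retries the chunk-buffer allocation until it succeeds, then hands the buffer to the async copy. A sketch under assumed names; the struct fields, the send-side pool name, the result constants, and the simplified spawn_send_copy signature are illustrative.

    struct send_chunk_alloc {
        MPIR_Request *sreq;     /* the pipelined send these chunks belong to */
        int n_chunks_left;      /* chunks that still need a buffer */
    };

    static int send_chunk_alloc_poll(struct MPIR_Async_thing *thing)
    {
        struct send_chunk_alloc *p = MPIR_Async_thing_get_state(thing);

        void *chunk_buf = NULL;
        /* assumed behavior: an exhausted pool yields a NULL cell */
        MPIDU_genq_private_pool_alloc_cell(MPIDI_OFI_global.gpu_pipeline_send_pool,
                                           &chunk_buf);
        if (chunk_buf == NULL) {
            /* out of genq buffers: stay queued and retry on a later poll */
            return MPIR_ASYNC_THING_NOPROGRESS;
        }

        /* buffer secured: spawn the async GPU copy into chunk_buf; its
         * completion will issue the fi_tsend for this chunk */
        spawn_send_copy(thing, p->sreq, chunk_buf);

        p->n_chunks_left--;
        return p->n_chunks_left == 0 ? MPIR_ASYNC_THING_DONE
                                     : MPIR_ASYNC_THING_NOPROGRESS;
    }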
Pipeline recv allocates chunk buffers and then posts fi_trecv. The allocation may run out of genq buffers and we also control the number of outstanding recvs, thus it is designed as an async task.

The async recv copies are triggered in the recv event when data arrives.

This removes MPIDI_OFI_global.gpu_recv_queue.

All ofi-layer progress routines for gpu pipelining are now removed.
Consolidate the gpu pipeline code.

MPIDI_OFI_gpu_pipeline_request is now an internal struct in ofi_gpu_pipeline.c; rename it to struct chunk_req.

MPIDI_OFI_gpu_pipeline_recv_copy is now an internal function; rename it to start_recv_copy.
Move all gpu pipeline specific code into ofi_gpu_pipeline.c.

Make a new function MPIDI_OFI_gpu_pipeline_recv that fills rreq with
persistent pipeline_info data. Rename the original MPIDI_OFI_gpu_pipeline_recv to the static function start_recv_chunk.
Make the code cleaner by separating the pipeline_info type into a union of send and recv.
Don't mix the usage of cc_ptr; use separate and explicit counters to track the progress and completion of chunks.
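
That is, instead of driving everything through the request's completion counter (cc_ptr), the pipeline keeps its own explicit counts. A small illustration with assumed field names; only num_inrecv appears in the diff hunk quoted earlier.

    /* hypothetical recv-side bookkeeping kept in pipeline_info.recv */
    struct pipeline_recv_info {
        int num_inrecv;        /* chunk recvs currently posted to libfabric */
        int num_chunks_done;   /* chunks fully received and unpacked */
        int total_chunks;      /* recv completes when num_chunks_done == total_chunks */
    };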
Follow a similar approach as nonblocking collectives, internal pipeline
chunks use separate tag space (MPIDI_OFI_GPU_PIPELINE_SEND) and
incrementing tags to avoid mismatch with regular messages.
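
Conceptually, a chunk's match bits come from a reserved protocol range plus an incrementing sequence, so chunks can never match user-posted recvs and successive chunked messages cannot match each other. A conceptual sketch only: the chunk_seq field and the helper are made up, and MPIDI_OFI_GPU_PIPELINE_SEND is the protocol bit named above.

    static uint64_t next_chunk_match_bits(MPIR_Request *sreq)
    {
        /* incrementing tag within the reserved pipeline tag space */
        uint64_t seq = MPIDI_OFI_REQUEST(sreq, pipeline_info.send.chunk_seq)++;
        return MPIDI_OFI_GPU_PIPELINE_SEND | seq;
    }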
Separate the recv tasks between the initial header and the chunks, since the paths are clearly separate.

Use a single async item for all chunk recvs rather than unnecessarily
enqueuing individual chunks since we can track the chunks in the state.
This is needed to compile under the noinline configuration.
Move these utility functions to ofi_impl.h since they are simple and
non-specific. It also simplifies figuring out which file to include
especially for .c files.
hzhou commented Mar 5, 2024

test:mpich/ch4/ofi ✔️

Remove the limit on posting gpu pipeline recv chunks. The limit is effectively controlled by the maximum number of chunks in MPIDI_OFI_global.gpu_pipeline_recv_pool or by libfabric returning EAGAIN.

In progressing the recv_chunk_alloc, we'll issue as many chunks as we
can instead of one at a time.

Refactor the code to have a single exit point.
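
Roughly, the reworked poll callback keeps issuing chunk recvs until the pool or libfabric pushes back, and funnels every outcome through one exit label. A sketch with the same caveats as the earlier ones; the chunk counters and post_next_chunk_recv remain placeholders.

    static int recv_chunk_alloc_poll(struct MPIR_Async_thing *thing)
    {
        struct recv_chunk_alloc *p = MPIR_Async_thing_get_state(thing);
        int ret = MPIR_ASYNC_THING_NOPROGRESS;

        /* no explicit cap: issue as many chunk recvs as possible per poll;
         * stop when the genq pool is empty or fi_trecv returns -FI_EAGAIN */
        while (p->issued_chunks < p->total_chunks) {
            if (!post_next_chunk_recv(p->rreq))
                goto fn_exit;
            p->issued_chunks++;
        }
        ret = MPIR_ASYNC_THING_DONE;

      fn_exit:
        return ret;     /* single exit point */
    }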
@ccchen057
I'm eagerly awaiting this feature. Could you please provide an update on the current status of this pull request?
