ch4/posix: shared memory based intra-node collectives #3490
Conversation
Force-pushed from a6a3e06 to d4c8373
test:jenkins/ch3/most
Force-pushed from d4c8373 to 828f9af
The HACK patches are only there to enable these features for testing; they will be removed when the PR is ready to merge.
One HACK patch selects
Force-pushed from 828f9af to af23013
MPIR_Op_is_commutative(op)) {
    /* The release_gather based algorithm can be used only if the izem submodule
     * is built (and enabled), MPICH is not multi-threaded, and the op is
     * commutative */
#ifdef ENABLE_IZEM_ATOMIC
I think we're getting to the point where it makes sense to build and include izem by default. I'm not crazy about having to guard this code. @halimamer does building izem with all features automatically enable them in MPICH? Or are there additional configuration options?
There are still problems building all of the izem features on MacOS. Before we can enable it, we need to resolve those issues.
Got it. I made pmodels/izem#20 as a reminder.
Also, @halimamer had mentioned that izem performance is horrible when we oversubscribe the cores with threads. It might not be a common case in HPC, but jenkins will go bonkers.
My proposal is to build with izem=atomic by default, so that the intra-node collectives can be used without any special configure option. This code would still need to be protected with izem ifdefs in case izem=atomic was disabled.
Nowhere else in MPICH is the izem atomic option used, so performance will not be compromised.
I believe that's what @halimamer told me. He'll need to confirm, but IIRC he said it's so bad that all the tests simply timeout.
I ran the tests with izem=atomic enabled as part of this PR. The tests passed as they should. As far as I know, it is izem=queue or sync that results in bad performance.
I guess the oversubscribed case is the disputed one, though? It still seems odd because there is no code outside of this PR that uses izem atomics in MPICH.
Oh nevermind, I think I misunderstood. Anyway, we can discuss this outside the context of this PR.
The izem atomics themselves shouldn't have any negative performance side effects; they are just wrappers around GCC __atomic or C11 atomic operations. The oversubscription issue only happens when you busy-wait without yielding to the OS kernel, in which case calling sched_yield() or using POSIX synchronization primitives is more appropriate.
 */
#undef FCNAME
#define FCNAME MPL_QUOTE(MPIDI_POSIX_mpi_reduce_release_gather)
MPL_STATIC_INLINE_PREFIX inline int MPIDI_POSIX_mpi_reduce_release_gather(const void *sendbuf,
In case this escaped your notice: I am rebasing my PR on this code and I am getting a warning for this duplicated inline.
I will fix it.
Force-pushed from 3be4f08 to 2fa3709
Force-pushed from 795bc9a to 76c2014
Force-pushed from 76c2014 to 44f80b7
Force-pushed from 44f80b7 to 74039ad
@raffenet This PR has been rebased on master and it is compliant with the latest inlining/uninlining scheme in MPICH.
Force-pushed from b946514 to 36cdf0e
Force-pushed from fdd152c to 76210e1
test:jenkins/ch4/ofi
@yfguo The reviews have been addressed and the branch is rebased.
test:jenkins/ch4/ofi
test:jenkins/ch4/ofi
@jain-surabhi-23 I am going to remove the two "HACK" commits and merge this PR. Is that OK?
@yfguo You will have to remove the two "HACK" commits as well as the "test: Add bcast, reduce tests for newly added CVARS" commit. After that this PR is good to go.
Thank you! I will take it from here.
Awesome! Thank you for reviewing!
test:jenkins/ch4/ofi
@yfguo The patch
test:jenkins/ch4/ofi
@jain-surabhi-23 Yes. But I think we should keep the tests, since we will eventually need them once we are clear about the strategy for
@yfguo Sounds good to me then 👍
Change MPII to MPIR so that it can be used from the device. Inlining fixes the linking error in Fortran tests using gcc in debug mode when this function is used from posix. Signed-off-by: Yanfei Guo <[email protected]>
Change the prefix of the related functions and data structures to MPIR so that they can be used from the device. Signed-off-by: Yanfei Guo <[email protected]>
Signed-off-by: Yanfei Guo <[email protected]>
Signed-off-by: Yanfei Guo <[email protected]>
This change allows creating errflag in a function and propagating it further. This is needed for the init and finalize calls, which don't have errflag passed to them. Signed-off-by: Yanfei Guo <[email protected]>
Signed-off-by: Yanfei Guo <[email protected]>
Give the user the ability to choose an algorithm for intra-node bcast and reduce. Also set up infrastructure for posix_coll_init and posix_coll_finalize. Signed-off-by: Yanfei Guo <[email protected]>
The global data structures can be reused by posix-level intra-node collectives as well. Signed-off-by: Yanfei Guo <[email protected]>
Implement the release and gather building blocks which will be used to implement intra-node bcast and intra-node reduce. Shared memory is created per communicator; it holds the data to be broadcast, the data to be reduced, and flags to update the children or parent in the tree. Release is the top-down step in the tree, while gather is the bottom-up step. A shared limit counter is implemented to track and limit the amount of shared memory created per node for optimized intra-node collectives. Signed-off-by: Yanfei Guo <[email protected]>
Intra-node bcast is implemented using a release step followed by a gather step. Data movement takes place in release (the top-down step) in the tree. Gather (the bottom-up step) is used for acknowledgement: non-roots notify the root that the data was copied out of the shared bcast buffer, so the root can reuse the buffer for the next bcast call. The bcast buffer is split into multiple cells, so that the copying-in of the next chunk by the root can be overlapped with the copying-out of previous chunks by non-roots (pipelining). Large messages are split into chunks of one cell size each and pipelining is used. Signed-off-by: Yanfei Guo <[email protected]>
Intra-node reduce is implemented using a release step followed by a gather step. Data movement takes place in gather (the bottom-up step) in the tree. The release (top-down) step is used for acknowledgement: the root notifies the non-roots that the data was reduced and copied out of its reduce buffer, so children ranks can reuse the reduce buffer for the next reduce call. There is a reduce shm buffer per rank, as each rank contributes data in reduce. Each buffer is split into multiple cells, so the copying-in of the next chunk by children can be overlapped with the reduce and copy-out by the parent rank for the previous cells (pipelining). Large messages are split into chunks of one cell size each and pipelining is used. Signed-off-by: Yanfei Guo <[email protected]>
Run a few bcast and reduce tests by varying the CVARs over multiple buffer sizes and types and tree radices. Signed-off-by: Yanfei Guo <[email protected]>
The algorithm is expected to fail since izem is not used by default. This commit is a temporary measure until we decide between enabling izem by default or bringing izem functionality into OPA/MPL. No reviewer.
Set up infrastructure for implementing shared memory collectives using the release and gather building blocks. Implement intra-node bcast and intra-node reduce.