Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ch4/posix: shared memory based intra-node collectives #3490

Merged
merged 13 commits into from
Apr 24, 2019

Conversation

jain-surabhi-23
Copy link

Set up infrastructure for implementing shared memory collectives using release, gather building blocks. Implement intra-node bcast and intra-node reduce.

@jain-surabhi-23 jain-surabhi-23 changed the title Shared Memory Based Intra-node collectives posix: shared Memory Based Intra-node collectives Jan 7, 2019
@jain-surabhi-23 jain-surabhi-23 changed the title posix: shared Memory Based Intra-node collectives posix: shared memory based intra-node collectives Jan 7, 2019
@jain-surabhi-23
Copy link
Author

test:jenkins/ch3/most
test:jenkins/ch4/most

@raffenet
Copy link
Contributor

Hack patches are just to enable these for testing. Will be removed when ready to merge.

@jain-surabhi-23
Copy link
Author

One HACK patch selects release_gather algorithm for intra-node Bcast and Reduce. This patch can be dropped because we dont want these algorithms to be enabled by default, unless we have topology aware trees merged, (that is work in progress).
Second HACK patch enables izem=atomic by default. Intra-node collectives have dependence on izem due to C11 atomics. I propose we enable izem by default in MPICH, otherwise few tests in test_coll_algos.sh will fail saying release_gather algorithm was chosen for Bcast or Reduce but izem was not enabled. Any thoughts on this proposal?

MPIR_Op_is_commutative(op)) {
/* release_gather based algorithm can be used only if izem submodule is built (and enabled)
* and MPICH is not multi-threaded. Also when the op is commutative */
#ifdef ENABLE_IZEM_ATOMIC
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we're getting to the point where it makes sense to build and include izem by default. I'm not crazy about having to guard this code. @halimamer does build izem with all features automatically enable them in MPICH? Or are there additional configuration options?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are still problems building all of the izem features on MacOS. Before we can enable it, we need to resolve those issues.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it. I made pmodels/izem#20 as a reminder.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, @halimamer had mentioned that izem performance is horrible when we oversubscribe the cores with threads. It might not be a common-case in HPC, but jenkins will go bonkers.

Copy link
Author

@jain-surabhi-23 jain-surabhi-23 Jan 24, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My proposal is to build with izem=atomic by default, so that the intra-node collectives can be used without any special configure option. This code would still need to be protected with izem ifdefs in case izem=atomic was disabled.
No where else in MPICH, izem atomic option is used, so performance will not be compromised.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe that's what @halimamer told me. He'll need to confirm, but IIRC he said it's so bad that all the tests simply timeout.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I ran the tests by enabling izem=atomic as part of this PR. The tests passed as they should. As far as I know, izem=queue or sync results in bad performance.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess the oversubscribed case is the disputed one, though? It still seems odd because there is no code outside of this PR that uses izem atomics in MPICH.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh nevermind, I think I misunderstood. Anyway, we can discuss this outside the context of this PR.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The izem atomics themselves shouldn't have any performance negative side effects. They are just wrappers around GCC __atomic or C11 atomic operations. The oversubscription issue only happens when you busy wait without going to the OS-kernel, in which case calling sched_yield() or using POSIX synchronization primitives is more appropriate

@gcongiu gcongiu mentioned this pull request Jan 24, 2019
3 tasks
*/
#undef FCNAME
#define FCNAME MPL_QUOTE(MPIDI_POSIX_mpi_reduce_release_gather)
MPL_STATIC_INLINE_PREFIX inline int MPIDI_POSIX_mpi_reduce_release_gather(const void *sendbuf,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In case this slipped your sight, I am rebasing on this code for my PR and I am getting a warning for this duplicated inline.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will fix it.

@jain-surabhi-23 jain-surabhi-23 force-pushed the intra_node_colls branch 2 times, most recently from 3be4f08 to 2fa3709 Compare January 29, 2019 17:00
@jain-surabhi-23 jain-surabhi-23 force-pushed the intra_node_colls branch 2 times, most recently from 795bc9a to 76c2014 Compare February 22, 2019 23:36
@wesbland wesbland changed the title posix: shared memory based intra-node collectives ch4/posix: shared memory based intra-node collectives Apr 4, 2019
@jain-surabhi-23
Copy link
Author

@raffenet This PR has been rebased on master and it is complaint with the latest inlining/uninlining scheme in MPICH.
Shared memory regions are created in a lazy manner, that is when a collective is called on a communicator for the first time. We also a CVAR which can be used to set a limit on amount of shared memory created per node for collectives.
The performance of this framework along with the intra-node topology aware trees is good. And I believe we should enable this by default, once topology aware trees have been merged as well.
PR #3727 is created which introduces topology aware trees on top of this PR.

@jain-surabhi-23 jain-surabhi-23 force-pushed the intra_node_colls branch 3 times, most recently from b946514 to 36cdf0e Compare April 11, 2019 18:46
@raffenet raffenet requested a review from yfguo April 11, 2019 21:24
@jain-surabhi-23 jain-surabhi-23 force-pushed the intra_node_colls branch 3 times, most recently from fdd152c to 76210e1 Compare April 16, 2019 21:41
@yfguo
Copy link
Contributor

yfguo commented Apr 18, 2019

test:jenkins/ch4/ofi

@jain-surabhi-23
Copy link
Author

@yfguo The reviews have been addressed and the branch is rebased.

@yfguo
Copy link
Contributor

yfguo commented Apr 22, 2019

test:jenkins/ch4/ofi

1 similar comment
@yfguo
Copy link
Contributor

yfguo commented Apr 23, 2019

test:jenkins/ch4/ofi

@yfguo
Copy link
Contributor

yfguo commented Apr 23, 2019

@jain-surabhi-23 I am going to remove the two "HACK" commit and merge this PR. Is that OK?

@jain-surabhi-23
Copy link
Author

@yfguo You will have to remove the two "HACK" commits as well the "test: Add bcast, reduce tests for newly added CVARS" commit. After that this PR is good to go.

@yfguo
Copy link
Contributor

yfguo commented Apr 23, 2019

Thank you! I will take it from here.

@jain-surabhi-23
Copy link
Author

Awesome! Thank you for reviewing !

@yfguo yfguo force-pushed the intra_node_colls branch from 1180d34 to 59a20ba Compare April 23, 2019 18:07
@yfguo
Copy link
Contributor

yfguo commented Apr 23, 2019

test:jenkins/ch4/ofi

@jain-surabhi-23
Copy link
Author

@yfguo The patch "test: Add bcast, reduce tests for newly added CVARS" should be removed as well. Otherwise the tests added there would complain that release_gather algorithm was selected but izem is not built.

@yfguo
Copy link
Contributor

yfguo commented Apr 23, 2019

test:jenkins/ch4/ofi

@yfguo
Copy link
Contributor

yfguo commented Apr 23, 2019

@jain-surabhi-23 Yes. But I think we should keep the test and since we will eventually need them once we are clear about the strategy for izem. I have xfailed them in a separate commit which should be enough for now.

@jain-surabhi-23
Copy link
Author

@yfguo Sounds good to me then 👍

jainsura-intel and others added 13 commits April 24, 2019 11:09
Change MPII to MPIR so that it could be used from device
Inlining fixes the linking error in fortran tests using gcc, debug mode when this
function is used from posix

Signed-off-by: Yanfei Guo <[email protected]>
Changing the related functions and data structures prefix to MPIR
so that it could be used from device

Signed-off-by: Yanfei Guo <[email protected]>
This change allows to create errflag in a function and propagate it
further. Needed for init and finalize calls which don't have errflag
passed to them.

Signed-off-by: Yanfei Guo <[email protected]>
Give user ability to choose an algorithm for intra-node bcast, reduce
Also set up infrastructure for posix_coll_init and posix_coll_finalize

Signed-off-by: Yanfei Guo <[email protected]>
The global data structures can be reused by posix level intra-node
collectives as well

Signed-off-by: Yanfei Guo <[email protected]>
Implement the release and gather building blocks which will be used to
implement intra-node bcast and intra-node reduce. Shared memory is
created per communicator, which is used to place the data to be
broadcasted, the data which is to be redued, and flags to update the
children or parent in the tree. Release is top-down step in tree, while
gather is bottom-up step. A shared limit counter is implemented to track
and limit the amount of shared memory created per node for optimized
intra-node collectives.

Signed-off-by: Yanfei Guo <[email protected]>
Intra-node bcast is implemented using release step followed by gather
step. Data movement takes place in release (top-down step) in the tree.
Gather (bottom-up step) is used for acknowledgement. Non-roots notify
the root that the data was copied out of shared bcast buffer and root
can reuse the buffer for next bcast call. Bcast buffer is split into
multiple cells, so that the copying in of the next chunk by root can be
overlapped with copying out of previous chunks by non-roots
(pipelining). Large messages are split into chunks of cell size each and
pipelining is used.

Signed-off-by: Yanfei Guo <[email protected]>
Intra-node reduce is implemented using release step followed by gather
step. Data movement takes place in gather (bottom-up step) in the tree.
Release (top-down) step is used for acknowledgement. Root notifies the
non-roots that the data was reduced and copied out of its reduce buffer.
Hence, children ranks can reuse the reduce buffer for next reduce call.
There is a reduce shm buffer per rank, as each rank contributes data in
reduce. Each buffer is split into multiple cells, so the copying in of
the next chunk by children can be overlapped with reduce and copy out by
the parent rank for the previous cells (pipelining). Large messages are
split into chunks of cell size each and pipelining is used.

Signed-off-by: Yanfei Guo <[email protected]>
Run a few bcast and reduce tests by varying the CVARS for multiple
buffer sizes and type, radix of trees.

Signed-off-by: Yanfei Guo <[email protected]>
The algorithm is expected to fail since izem is not used by default.
This commit is a temporary measure until we decide what to do with
enabling izem by default or bring izem functionalities into OPA/MPL.

No reviewer.
@yfguo yfguo force-pushed the intra_node_colls branch from 420c513 to 703b78e Compare April 24, 2019 16:09
@yfguo yfguo merged commit 33f46d1 into pmodels:master Apr 24, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants