ch4/posix: shared memory based intra-node collectives #3490
Conversation
Force-pushed from a6a3e06 to d4c8373
test:jenkins/ch3/most
Force-pushed from d4c8373 to 828f9af
The HACK patches are only there to enable these features for testing; they will be removed when the PR is ready to merge.
One HACK patch selects
Force-pushed from 828f9af to af23013
MPIR_Op_is_commutative(op)) {
    /* The release_gather based algorithm can be used only if the izem submodule
     * is built (and enabled), MPICH is not multi-threaded, and the op is
     * commutative */
#ifdef ENABLE_IZEM_ATOMIC
I think we're getting to the point where it makes sense to build and include izem by default. I'm not crazy about having to guard this code. @halimamer does building izem with all features automatically enable them in MPICH? Or are there additional configuration options?
There are still problems building all of the izem features on MacOS. Before we can enable it, we need to resolve those issues.
Got it. I made pmodels/izem#20 as a reminder.
Also, @halimamer had mentioned that izem performance is horrible when we oversubscribe the cores with threads. It might not be a common case in HPC, but jenkins will go bonkers.
My proposal is to build with izem=atomic by default, so that the intra-node collectives can be used without any special configure option. This code would still need to be protected with izem ifdefs in case izem=atomic was disabled.
Nowhere else in MPICH is the izem atomic option used, so performance will not be compromised.
I believe that's what @halimamer told me. He'll need to confirm, but IIRC he said it's so bad that all the tests simply timeout.
I ran the tests with izem=atomic enabled as part of this PR. The tests passed as they should. As far as I know, it is izem=queue or sync that results in bad performance.
I guess the oversubscribed case is the disputed one, though? It still seems odd because there is no code outside of this PR that uses izem atomics in MPICH.
Oh nevermind, I think I misunderstood. Anyway, we can discuss this outside the context of this PR.
The izem atomics themselves shouldn't have any negative performance side effects; they are just wrappers around GCC __atomic or C11 atomic operations. The oversubscription issue only happens when you busy-wait without yielding to the OS kernel, in which case calling sched_yield() or using POSIX synchronization primitives is more appropriate.
 */
#undef FCNAME
#define FCNAME MPL_QUOTE(MPIDI_POSIX_mpi_reduce_release_gather)
MPL_STATIC_INLINE_PREFIX inline int MPIDI_POSIX_mpi_reduce_release_gather(const void *sendbuf,
In case this escaped your notice: I am rebasing my PR on this code and I am getting a warning for this duplicated inline.
I will fix it.
Force-pushed from 3be4f08 to 2fa3709
Force-pushed from 795bc9a to 76c2014
Force-pushed from 76c2014 to 44f80b7
Force-pushed from 44f80b7 to 74039ad
@raffenet This PR has been rebased on master and it is compliant with the latest inlining/uninlining scheme in MPICH.
Force-pushed from b946514 to 36cdf0e
Force-pushed from fdd152c to 76210e1
test:jenkins/ch4/ofi
@yfguo The reviews have been addressed and the branch is rebased.
test:jenkins/ch4/ofi
test:jenkins/ch4/ofi
@jain-surabhi-23 I am going to remove the two "HACK" commits and merge this PR. Is that OK?
@yfguo You will have to remove the two "HACK" commits as well as the "test: Add bcast, reduce tests for newly added CVARS" commit. After that this PR is good to go.
Thank you! I will take it from here.
Awesome! Thank you for reviewing!
test:jenkins/ch4/ofi
@yfguo The patch
test:jenkins/ch4/ofi
@jain-surabhi-23 Yes. But I think we should keep the tests, since we will eventually need them once we are clear about the strategy for
@yfguo Sounds good to me then 👍
Change MPII to MPIR so that it can be used from the device. Inlining fixes the linking error in Fortran tests using gcc in debug mode when this function is used from posix. Signed-off-by: Yanfei Guo <[email protected]>
Change the prefix of the related functions and data structures to MPIR so that they can be used from the device. Signed-off-by: Yanfei Guo <[email protected]>
Signed-off-by: Yanfei Guo <[email protected]>
Signed-off-by: Yanfei Guo <[email protected]>
This change allows creating errflag in a function and propagating it further. This is needed for the init and finalize calls, which don't have errflag passed to them. Signed-off-by: Yanfei Guo <[email protected]>
Signed-off-by: Yanfei Guo <[email protected]>
Give the user the ability to choose an algorithm for intra-node bcast and reduce. Also set up infrastructure for posix_coll_init and posix_coll_finalize. Signed-off-by: Yanfei Guo <[email protected]>
The global data structures can be reused by posix-level intra-node collectives as well. Signed-off-by: Yanfei Guo <[email protected]>
Implement the release and gather building blocks which will be used to implement intra-node bcast and intra-node reduce. Shared memory is created per communicator; it holds the data to be broadcast, the data to be reduced, and flags to update the children or parent in the tree. Release is the top-down step in the tree, while gather is the bottom-up step. A shared limit counter is implemented to track and limit the amount of shared memory created per node for optimized intra-node collectives. Signed-off-by: Yanfei Guo <[email protected]>
Intra-node bcast is implemented using a release step followed by a gather step. Data movement takes place in release (the top-down step) in the tree. Gather (the bottom-up step) is used for acknowledgement: non-roots notify the root that the data was copied out of the shared bcast buffer, so the root can reuse the buffer for the next bcast call. The bcast buffer is split into multiple cells, so that the copying-in of the next chunk by the root can be overlapped with the copying-out of previous chunks by non-roots (pipelining). Large messages are split into chunks of one cell size each and pipelining is used. Signed-off-by: Yanfei Guo <[email protected]>
Intra-node reduce is implemented using a release step followed by a gather step. Data movement takes place in gather (the bottom-up step) in the tree. The release (top-down) step is used for acknowledgement: the root notifies the non-roots that the data was reduced and copied out of its reduce buffer, so children ranks can reuse the reduce buffer for the next reduce call. There is a reduce shm buffer per rank, as each rank contributes data in reduce. Each buffer is split into multiple cells, so the copying-in of the next chunk by children can be overlapped with the reduce and copy-out by the parent rank for the previous cells (pipelining). Large messages are split into chunks of one cell size each and pipelining is used. Signed-off-by: Yanfei Guo <[email protected]>
Run a few bcast and reduce tests by varying the CVARs over multiple buffer sizes and types and tree radices. Signed-off-by: Yanfei Guo <[email protected]>
The algorithm is expected to fail since izem is not used by default. This commit is a temporary measure until we decide between enabling izem by default or bringing izem functionality into OPA/MPL. No reviewer.
Set up infrastructure for implementing shared memory collectives using the release and gather building blocks. Implement intra-node bcast and intra-node reduce.