Heterogeneous memory integration #3200

Closed

gcongiu
Contributor

@gcongiu gcongiu commented Jun 26, 2018

This PR introduces NUMA-awareness and heterogeneous memory support by adding the following features to MPICH:

  • Memory locality detection: detect the NUMA nodes close to the hardware thread the current MPI process is running on;
  • Shared memory partitioning: allocate a separate shared memory segment (VMA) for every NUMA node;
  • Memory binding setting: set memory binding for a shared memory segment to the requested memory node.

These three features combined allow MPICH to partition a shared memory object (e.g., fastboxes in point-to-point) by the number of NUMA nodes in the system. Processes will then have their memory objects bound to the closest NUMA node.
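As a rough illustration of the partitioning step, the sketch below groups local ranks by NUMA node and plans one segment per node that is actually in use. It is a minimal sketch under stated assumptions, not code from this PR: `plan_segments`, its parameters, and the plain integer node ids are hypothetical stand-ins for information MPICH would obtain from hwloc.

```c
#include <assert.h>

/* Hypothetical helper (not part of MPICH).
 * rank_node[r] is the NUMA node id of local rank r.
 * On return:
 *   rank_seg[r]  = index of the segment that holds rank r's objects
 *   seg_node[s]  = NUMA node that segment s should be bound to
 *   seg_count[s] = number of ranks placed in segment s
 * Returns the number of segments, or -1 if more than max_segs are needed. */
int plan_segments(const int *rank_node, int nranks,
                  int *rank_seg, int *seg_node, int *seg_count, int max_segs)
{
    int nsegs = 0;

    assert(nranks >= 0 && max_segs >= 0);
    for (int r = 0; r < nranks; r++) {
        int s;

        /* Look for an existing segment for this rank's node. */
        for (s = 0; s < nsegs; s++) {
            if (seg_node[s] == rank_node[r])
                break;
        }
        /* First rank seen on this node: open a new segment. */
        if (s == nsegs) {
            if (nsegs == max_segs)
                return -1;
            seg_node[nsegs] = rank_node[r];
            seg_count[nsegs] = 0;
            nsegs++;
        }
        rank_seg[r] = s;
        seg_count[s]++;
    }
    return nsegs;
}
```

With six local ranks on nodes `{0, 0, 1, 1, 1, 3}`, this plans three segments (for nodes 0, 1 and 3), so a node with no MPI process gets no segment.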

REF: #3511

@gcongiu gcongiu added the WIP label Jun 26, 2018
@gcongiu gcongiu force-pushed the hetero-mem branch 2 times, most recently from 9f8a50f to eb96e22 Compare July 18, 2018 14:14
@gcongiu gcongiu changed the title mpi/init: add support to detect MCDRAM nodes on KNL architectures Heterogeneous memory integration Jul 18, 2018
@gcongiu gcongiu force-pushed the hetero-mem branch 3 times, most recently from 1007b81 to 701ea10 Compare July 20, 2018 15:32
@gcongiu gcongiu force-pushed the hetero-mem branch 3 times, most recently from c838ac0 to 1cbbb67 Compare August 6, 2018 19:34
@gcongiu
Contributor Author

gcongiu commented Aug 14, 2018

test:jenkins/ch3/most

@gcongiu
Contributor Author

gcongiu commented Aug 15, 2018

test:jenkins/ch3/tcp

@gcongiu
Contributor Author

gcongiu commented Sep 2, 2018

test:jenkins/ch3/ofi

@gcongiu
Contributor Author

gcongiu commented Sep 4, 2018

test:jenkins/ch3/ofi

1 similar comment
@gcongiu
Contributor Author

gcongiu commented Sep 4, 2018

test:jenkins/ch3/ofi

@gcongiu gcongiu added this to the mpich-3.4a1 milestone Sep 26, 2018
@gcongiu
Contributor Author

gcongiu commented Oct 4, 2018

test:jenkins/ch3/most

@gcongiu gcongiu force-pushed the hetero-mem branch 2 times, most recently from 0ae6922 to 764014c Compare October 4, 2018 17:34
@gcongiu
Contributor Author

gcongiu commented Oct 4, 2018

test:jenkins/ch3/most

@gcongiu gcongiu force-pushed the hetero-mem branch 5 times, most recently from 8c6b5ea to 2fa570b Compare October 5, 2018 20:05
@gcongiu
Contributor Author

gcongiu commented Feb 12, 2019

test:jenkins/ch4/ofi

@gcongiu
Contributor Author

gcongiu commented Feb 12, 2019

@raffenet @pavanbalaji CH4 tests now complete correctly (for some reason Jenkins is failing to collect the summary and thus is not showing the tests as passed). There are still a few corrections I need to make to the code, but before doing so I would like to get your comments.

@pavanbalaji
Contributor

This exposes fastboxes and copy buffers to the MPI layer. The abstraction needs to be improved.

@gcongiu
Contributor Author

gcongiu commented Feb 12, 2019

> This exposes fastboxes and copy buffers to the MPI layer. The abstraction needs to be improved.

Should I move mpir_memkind.h to the common device layer instead (i.e., mpidu_memkind.h)? Concerning MPIR_Process, is it OK to have additional numa_info data in there? Or should this also be moved to the device layer and reside in another global data structure?

@pavanbalaji
Contributor

> Should I move mpir_memkind.h to the common device layer instead (i.e., mpidu_memkind.h)?

Not everything is device specific. So you might need to split it.

> Concerning MPIR_Process, is it OK to have additional numa_info data in there? Or should this also be moved to the device layer and reside in another global data structure?

Why not generalize it to expose topology in a more uniform fashion instead of hardcoding just "node" and "numa"? We now use hwloc at the MPI layer, so we might as well use those objects.

@gcongiu
Contributor Author

gcongiu commented Feb 13, 2019

test:jenkins/ch4/ofi

1 similar comment
@gcongiu
Contributor Author

gcongiu commented Feb 13, 2019

test:jenkins/ch4/ofi

@gcongiu
Contributor Author

gcongiu commented Feb 14, 2019

test:jenkins/ch4/ofi

@gcongiu
Contributor Author

gcongiu commented Feb 14, 2019

Running additional tests now that OFI works. @raffenet I still haven't figured out why report collection is failing.
test:jenkins/ch4/ucx
test:jenkins/ch3/tcp

@@ -50,3 +50,11 @@ dtype_send 2
recv_any 2
irecv_any 2
large_tag 2

# Heterogeneous memory tests
sendflood 8 env=MPIR_CVAR_MEMBIND_NUMA_ENABLE="YES" env=MPIR_CVAR_MEMBIND_TYPE_LIST="FASTBOXES:AUTO" env=MPIR_CVAR_MEMBIND_POLICY_LIST="FASTBOXES:BIND" env=MPIR_CVAR_MEMBIND_FLAGS_LIST="FASTBOXES:STRICT" timeLimit=600
Contributor

@raffenet raffenet Feb 14, 2019

It looks like these additional quotes are causing the XML parser to barf. Can you remove them? E.g.

sendflood 8 env=MPIR_CVAR_MEMBIND_NUMA_ENABLE=YES env=MPIR_CVAR_MEMBIND_TYPE_LIST=FASTBOXES:AUTO env=MPIR_CVAR_MEMBIND_POLICY_LIST=FASTBOXES:BIND env=MPIR_CVAR_MEMBIND_FLAGS_LIST=FASTBOXES:STRICT timeLimit=600
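The CVAR values in this test line follow an `OBJECT:VALUE` pattern (e.g., `FASTBOXES:BIND`). As a minimal sketch of how one such entry could be split, assuming a single entry per string: `parse_membind_entry` and its error convention are hypothetical and are not MPICH's actual CVAR parser, and how multiple entries would be separated is not shown in this thread.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper (not part of MPICH): split one "OBJECT:VALUE" entry.
 * Copies the part before the first ':' into obj and the rest into val.
 * Returns 0 on success, or -1 if the entry has no ':', has an empty part,
 * or does not fit into the supplied buffers. */
int parse_membind_entry(const char *entry,
                        char *obj, size_t obj_len,
                        char *val, size_t val_len)
{
    const char *colon;
    size_t olen, vlen;

    assert(entry != NULL && obj != NULL && val != NULL);
    colon = strchr(entry, ':');
    if (colon == NULL || colon == entry || colon[1] == '\0')
        return -1;

    olen = (size_t) (colon - entry);
    vlen = strlen(colon + 1);
    if (olen >= obj_len || vlen >= val_len)
        return -1;

    memcpy(obj, entry, olen);
    obj[olen] = '\0';
    memcpy(val, colon + 1, vlen + 1);   /* includes the terminating NUL */
    return 0;
}
```

Note that this sketch does not strip quotes: a quoted entry would keep the quote characters in both parts.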

@gcongiu
Contributor Author

gcongiu commented Feb 14, 2019

test:jenkins/ch4/ofi

gcongiu and others added 7 commits February 14, 2019 19:54
This is a refactoring patch for shared memory segment allocation
functions. 'MPIDU_shm_seg_alloc' now also takes a memory type parameter
defining in which target memory the requested allocation should be
placed. 'MPIDU_shm_seg_commit' now also takes an hwloc numa node
logical identifier and a shared memory object identifier. The numa node
id allows binding memory to a specific memory domain while the object
identifier allows user defined binding information for that specific
object.

The patch also introduces shared memory object definitions in
'src/mpid/common/shm/mpidu_shm_obj.h', and definitions of the memory
types to which objects can be bound in 'src/include/mpir_memtype.h'.

This patch introduces support for numa architectures, including
detection and usage of heterogeneous memory, e.g., KNL MCDRAM. The
patch adds functionality to detect numa nodes of different types and
set up information useful for binding allocated objects to different
types of memory.

This patch modifies the previous fbox segment allocation mechanism to
make it numa and heterogeneous memory-aware. This is done by counting
the number of available numa nodes used by MPI processes and creating
an equal number of shared memory segments (instead of just one). Each
of these segments will contain the fbox elements for the processes
located in the corresponding numa node and can be bound to the requested
type of memory (i.e., DDR or MCDRAM).

Similarly to the pt2pt fastbox integration, this patch decomposes the
current single shared segment into multiple segments, one per numa node,
that can then be separately bound using hwloc. Moreover, when using
symheap, either all segment allocations succeed or none of them does. If
a symheap segment allocation fails, all the previous ones should be
reverted. To accomplish this, the new function
`MPIDI_CH4R_release_shm_symheap` has been introduced.
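
The all-or-nothing behaviour described in the last commit message can be sketched as follows. This is only an illustration of the rollback pattern using `malloc`/`free`, not the implementation of `MPIDI_CH4R_release_shm_symheap`: the names `alloc_all_or_none`, `seg_alloc_fn`, `counting_alloc` and `allocs_before_failure` are hypothetical, and real symheap allocation would also involve shared memory and agreement across processes.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical allocator hook, so that failures can be injected. */
typedef void *(*seg_alloc_fn) (size_t size);

/* Allocate nsegs segments of size bytes each (e.g., one per NUMA node).
 * Either every allocation succeeds (returns 0), or all earlier ones are
 * released and every entry of segs is NULL (returns -1).
 * Segments are released with free(), so alloc_fn must be malloc-compatible. */
int alloc_all_or_none(void **segs, int nsegs, size_t size, seg_alloc_fn alloc_fn)
{
    assert(segs != NULL && alloc_fn != NULL);
    for (int i = 0; i < nsegs; i++)
        segs[i] = NULL;

    for (int i = 0; i < nsegs; i++) {
        segs[i] = alloc_fn(size);
        if (segs[i] == NULL) {
            /* Roll back everything allocated so far. */
            for (int j = 0; j < i; j++) {
                free(segs[j]);
                segs[j] = NULL;
            }
            return -1;
        }
    }
    return 0;
}

/* Failure-injecting allocator for experiments: a negative value means
 * "never fail"; otherwise that many allocations succeed before one fails. */
int allocs_before_failure = -1;

void *counting_alloc(size_t size)
{
    if (allocs_before_failure == 0)
        return NULL;
    if (allocs_before_failure > 0)
        allocs_before_failure--;
    return malloc(size);
}
```

Setting `allocs_before_failure = 2` and requesting four segments makes the third allocation fail, after which the first two are freed and the caller sees only NULL entries.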
@gcongiu
Contributor Author

gcongiu commented Feb 15, 2019

test:jenkins/ch4/most
test:jenkins/ch3/most

@gcongiu
Contributor Author

gcongiu commented Feb 15, 2019

@pavanbalaji @raffenet tests seem fine. I noticed a few minor things that I need to fix. However, if you want you can start looking at the code.

MPIR_MEMTYPE__DDR = 0,
MPIR_MEMTYPE__MCDRAM,
MPIR_MEMTYPE__NUM,
MPIR_MEMTYPE__DEFAULT = MPIR_MEMTYPE__DDR
Contributor

Is that allowed in C99?

Contributor Author

It should be allowed. The C99 standard does not say you can't do it (section 6.7.2.2):
"The expression that defines the value of an enumeration constant shall be an integer constant expression that has a value representable as an int. [...] (The use of enumerators with = may produce enumeration constants with values that duplicate other values in the same enumeration.)" My interpretation might be wrong, though.
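
For reference, a minimal sketch that a C99 compiler can check directly. The enumerators are copied from the hunk under review; the typedef name `memtype_sketch_t` and the helper `memtype_is_valid` are hypothetical and only there to make the example self-contained.

```c
#include <assert.h>

/* Enumerators as in the hunk under review. The last one reuses the value
 * of an earlier enumerator, so DEFAULT is an alias for DDR. */
typedef enum {
    MPIR_MEMTYPE__DDR = 0,
    MPIR_MEMTYPE__MCDRAM,
    MPIR_MEMTYPE__NUM,
    MPIR_MEMTYPE__DEFAULT = MPIR_MEMTYPE__DDR
} memtype_sketch_t;

/* Hypothetical helper: NUM still counts only the distinct memory types,
 * because the alias is declared after it. */
int memtype_is_valid(int t)
{
    return t >= 0 && t < MPIR_MEMTYPE__NUM;
}
```

One practical consequence of the alias: a `switch` cannot carry `case` labels for both `MPIR_MEMTYPE__DDR` and `MPIR_MEMTYPE__DEFAULT`, since two case labels with the same value are not allowed.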

MPIDU_SHM_OBJ__COPYBUFS,
MPIDU_SHM_OBJ__WIN,
MPIDU_SHM_OBJ__NUM
} MPIDU_shm_obj_t;
Contributor

The memory objects (I'd call these buffer types, instead) are different from memory types. These two should be split into different commits. Memory types are usable even without explicit buffer types, i.e., we can have all buffers be allocated on DRAM or MCDRAM or something else. Buffer types only allow us to have some buffers on one memory type and other buffers on a different memory type.

Contributor Author

Makes sense, I will split the two into different commits.

@gcongiu
Contributor Author

gcongiu commented Jun 25, 2019

I am closing this PR as the code has to be completely rewritten, and it is cleaner to start over with a new one.

@gcongiu gcongiu closed this Jun 25, 2019
@gcongiu gcongiu deleted the hetero-mem branch December 8, 2019 03:27

3 participants