Heterogeneous memory integration #3200
9f8a50f to eb96e22
1007b81 to 701ea10
c838ac0 to 1cbbb67
test:jenkins/ch3/most
test:jenkins/ch3/tcp
test:jenkins/ch3/ofi
test:jenkins/ch3/ofi
test:jenkins/ch3/ofi
test:jenkins/ch3/most
0ae6922 to 764014c
test:jenkins/ch3/most
8c6b5ea to 2fa570b
test:jenkins/ch4/ofi
@raffenet @pavanbalaji CH4 tests now complete correctly (for some reason Jenkins is failing to collect the summary and thus is not showing the tests as passed). There are still a few corrections I need to make to the code, but before making them I would like to get your comments.
This exposes fastboxes and copy buffers to the MPI layer. The abstraction needs to be improved.
Should I move |
Not everything is device specific. So you might need to split it.
Why not generalize it to expose topology in a more uniform fashion instead of hardcoding just "node" and "numa"? We now use |
test:jenkins/ch4/ofi
test:jenkins/ch4/ofi
test:jenkins/ch4/ofi
Running additional tests as OFI works. @raffenet I still haven't figured out why report collection is failing.
test/mpi/pt2pt/testlist.def
Outdated
@@ -50,3 +50,11 @@ dtype_send 2
recv_any 2
irecv_any 2
large_tag 2

# Heterogeneous memory tests
sendflood 8 env=MPIR_CVAR_MEMBIND_NUMA_ENABLE="YES" env=MPIR_CVAR_MEMBIND_TYPE_LIST="FASTBOXES:AUTO" env=MPIR_CVAR_MEMBIND_POLICY_LIST="FASTBOXES:BIND" env=MPIR_CVAR_MEMBIND_FLAGS_LIST="FASTBOXES:STRICT" timeLimit=600
It looks like these additional quotes are causing the XML parser to barf. Can you remove them? E.g.
sendflood 8 env=MPIR_CVAR_MEMBIND_NUMA_ENABLE=YES env=MPIR_CVAR_MEMBIND_TYPE_LIST=FASTBOXES:AUTO env=MPIR_CVAR_MEMBIND_POLICY_LIST=FASTBOXES:BIND env=MPIR_CVAR_MEMBIND_FLAGS_LIST=FASTBOXES:STRICT timeLimit=600
test:jenkins/ch4/ofi
This is a refactoring patch for the shared memory segment allocation functions. 'MPIDU_shm_seg_alloc' now also takes a memory type parameter that defines the target memory in which the requested allocation should be placed. 'MPIDU_shm_seg_commit' now also takes an hwloc NUMA node logical identifier and a shared memory object identifier. The NUMA node id allows binding memory to a specific memory domain, while the object identifier allows user-defined binding information for that specific object. The patch also introduces shared memory object definitions in 'src/mpid/common/shm/mpidu_shm_obj.h', and the memory type definitions to which objects can be bound in 'src/include/mpir_memtype.h'.
This patch introduces support for NUMA architectures, including detection and use of heterogeneous memory, e.g., KNL MCDRAM. The patch adds functionality to detect NUMA nodes of different types and to set up the information needed for binding allocated objects to different types of memory.
This patch modifies the previous fbox segment allocation mechanism to make it NUMA-aware and heterogeneous-memory-aware. This is done by counting the number of NUMA nodes used by MPI processes and creating an equal number of shared memory segments (instead of just one). Each of these segments contains the fbox elements for the processes located in the corresponding NUMA node and can be bound to the requested type of memory (i.e., DDR or MCDRAM).
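The counting step described in this commit message can be sketched roughly as follows. This is only an illustration, not the code from the patch: `MAX_NUMA_NODES`, `count_used_nodes`, `segment_bytes`, and the `rank_node` array are hypothetical names, and the per-rank NUMA node ids are assumed to have been detected already (for example through hwloc).

```c
#include <assert.h>
#include <stddef.h>

#define MAX_NUMA_NODES 16       /* hypothetical upper bound for this sketch */

/* rank_node[i] is the logical NUMA node id of local rank i (assumed to be
 * known already). Fills ranks_per_node[] with the number of ranks on each
 * node and returns how many nodes host at least one rank, i.e. how many
 * shared memory segments would be created. */
static int count_used_nodes(const int *rank_node, int num_ranks,
                            int ranks_per_node[MAX_NUMA_NODES])
{
    int used = 0;

    for (int n = 0; n < MAX_NUMA_NODES; n++)
        ranks_per_node[n] = 0;

    for (int i = 0; i < num_ranks; i++) {
        assert(rank_node[i] >= 0 && rank_node[i] < MAX_NUMA_NODES);
        if (ranks_per_node[rank_node[i]]++ == 0)
            used++;             /* first rank seen on this node */
    }
    return used;
}

/* Size of the segment for one NUMA node, assuming every rank on that node
 * needs the same (hypothetical) number of bytes for its fbox elements. */
static size_t segment_bytes(int ranks_on_node, size_t bytes_per_rank)
{
    return (size_t) ranks_on_node * bytes_per_rank;
}
```

Each non-empty entry of `ranks_per_node` would then correspond to one segment that can be bound to its own memory type.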
Similarly to the pt2pt fastbox integration, this patch decomposes the current single shared segment into multiple segments, one per NUMA node, which can then be bound separately using hwloc. Moreover, when using symheap, either all of the per-node segment allocations succeed or none of them does: if one symheap segment allocation fails, all the previous ones must be reverted. To accomplish this, the new function `MPIDI_CH4R_release_shm_symheap` has been introduced.
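The all-or-nothing behaviour described above follows a common rollback pattern, sketched below under loud assumptions: `malloc` stands in for the real symheap segment allocation, a zero-sized request is treated as a failure only so that the rollback path can be exercised, and `release_all` and `alloc_all_or_none` are hypothetical helpers, not the actual `MPIDI_CH4R_release_shm_symheap` implementation.

```c
#include <stddef.h>
#include <stdlib.h>

/* Free the first num_segs entries and reset them to NULL. */
static void release_all(void **segs, int num_segs)
{
    for (int i = 0; i < num_segs; i++) {
        free(segs[i]);
        segs[i] = NULL;
    }
}

/* Allocate one buffer per NUMA node. If any allocation fails, everything
 * allocated so far is released, so the caller sees either all segments
 * or none. Returns 0 on success and -1 on failure. */
static int alloc_all_or_none(void **segs, const size_t *sizes, int num_segs)
{
    for (int i = 0; i < num_segs; i++)
        segs[i] = NULL;

    for (int i = 0; i < num_segs; i++) {
        if (sizes[i] > 0)       /* zero size counts as a failed request */
            segs[i] = malloc(sizes[i]);
        if (segs[i] == NULL) {
            release_all(segs, i);   /* revert the earlier allocations */
            return -1;
        }
    }
    return 0;
}
```

In a symmetric heap setting the failure also has to be agreed on across processes, which this single-process sketch does not attempt to model.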
test:jenkins/ch4/most
@pavanbalaji @raffenet tests seem fine. I noticed a few minor things that I need to fix. However, if you want you can start looking at the code.
MPIR_MEMTYPE__DDR = 0,
MPIR_MEMTYPE__MCDRAM,
MPIR_MEMTYPE__NUM,
MPIR_MEMTYPE__DEFAULT = MPIR_MEMTYPE__DDR
Is that allowed in C99?
It should be allowed. The C99 standard does not say you can't do it (section 6.7.2.2):
"The expression that defines the value of an enumeration constant shall be an integer constant expression that has a value representable as an int. [...] (The use of enumerators with = may produce enumeration constants with values that duplicate other values in the same enumeration.)"
My interpretation might be wrong though.
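A small stand-alone illustration of the point under discussion: an enumerator that is explicitly set to the value of an earlier enumerator in the same enumeration. The names below are hypothetical stand-ins that mirror the shape of the `MPIR_MEMTYPE__*` constants; they are not taken from the patch.

```c
#include <assert.h>

/* The last enumerator deliberately duplicates the value of the first one.
 * Its defining expression is an integer constant expression representable
 * as an int, which is what section 6.7.2.2 requires. */
typedef enum {
    MEMTYPE_DDR = 0,
    MEMTYPE_MCDRAM,                 /* implicitly 1 */
    MEMTYPE_NUM,                    /* implicitly 2: number of real types */
    MEMTYPE_DEFAULT = MEMTYPE_DDR   /* alias for 0 */
} memtype_t;

/* Map a requested type to a concrete one; DEFAULT and DDR are the same
 * value, so a switch must not list both as separate case labels. */
static memtype_t resolve_memtype(memtype_t requested)
{
    assert(requested >= 0 && requested < MEMTYPE_NUM);
    return requested;
}
```

One practical consequence of the alias is that `MEMTYPE_DEFAULT` and `MEMTYPE_DDR` cannot both appear as labels in the same `switch`, since duplicate case values are not allowed.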
MPIDU_SHM_OBJ__COPYBUFS,
MPIDU_SHM_OBJ__WIN,
MPIDU_SHM_OBJ__NUM
} MPIDU_shm_obj_t;
The memory objects (I'd call these buffer types, instead) are different from memory types. These two should be split into different commits. Memory types are usable even without explicit buffer types, i.e., we can have all buffers be allocated on DRAM or MCDRAM or something else. Buffer types only allow us to have some buffers on one memory type and other buffers on a different memory type.
Makes sense, I will split the two into different commits.
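The separation discussed in this thread could look roughly like the sketch below: memory types are defined on their own, and buffer types only add a per-buffer placement table on top of them. All names here (`mem_type_t`, `buf_type_t`, `place_all`, `place_one`, `placement_of`) are hypothetical and chosen for illustration only.

```c
#include <assert.h>

/* Memory types: usable on their own, without any notion of buffer types. */
typedef enum { MEM_DDR = 0, MEM_MCDRAM, MEM_NUM } mem_type_t;

/* Buffer types: an optional, finer-grained layer on top of memory types. */
typedef enum { BUF_FASTBOXES = 0, BUF_COPYBUFS, BUF_WIN, BUF_NUM } buf_type_t;

/* Placement table; static storage is zero-initialized, so every buffer
 * type starts out on MEM_DDR. */
static mem_type_t buf_placement[BUF_NUM];

/* Memory types alone: put every buffer on the same memory type. */
static void place_all(mem_type_t m)
{
    assert(m >= 0 && m < MEM_NUM);
    for (int b = 0; b < BUF_NUM; b++)
        buf_placement[b] = m;
}

/* Buffer types: override the placement of a single kind of buffer. */
static void place_one(buf_type_t b, mem_type_t m)
{
    assert(b >= 0 && b < BUF_NUM);
    assert(m >= 0 && m < MEM_NUM);
    buf_placement[b] = m;
}

static mem_type_t placement_of(buf_type_t b)
{
    assert(b >= 0 && b < BUF_NUM);
    return buf_placement[b];
}
```

With this layering, `place_all` covers the "all buffers on one memory type" case, and `place_one` is needed only when different buffers should live on different memory types.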
I am closing this PR as the code has to be completely rewritten and it is cleaner to start over with a new one.
This PR introduces NUMA-awareness and heterogeneous memory support by adding the following features to MPICH:
- Memory locality detection: detect the NUMA nodes close to the hardware thread the current MPI process is running on;
- Shared memory partitioning: allocate a separate shared memory segment (VMA) for every NUMA node;
- Memory binding setting: set the memory binding for a shared memory segment to the requested memory node.

These three features combined allow MPICH to partition a shared memory object (e.g., fastboxes in point-to-point) by the number of NUMA nodes in the system. Processes will then have their memory objects bound to the closest NUMA node.
REF: #3511