Final Report
The Mochi project explores a software defined storage approach for composing storage services that provides new levels of functionality, performance, and reliability for science applications at extreme scale. One key component of Mochi is the use of OpenFabrics Interfaces (OFI), a framework focused on exporting fabric communication services to applications. Libfabric is a core component of OFI. We have developed a test suite, fabtsuite, to assess libfabric for the features that the Mochi project requires.
The Mochi project [1] provides a set of software building blocks for rapid construction of custom, composable HPC data services. These building blocks rely upon the libfabric [2] network transport library, by way of the Mercury RPC library [3], for high-performance network support on most HPC platforms.
Mochi and Mercury impose specific requirements on network transport libraries, and libfabric’s OFI interface meets all of these requirements. We have found, however, that many of the features that we rely upon are not as thoroughly tested as the features that are more routinely used in other environments (e.g., by message-passing libraries). The objective of the Mochi libfabric test suite, fabtsuite, is to evaluate the specific libfabric features required by Mochi in a controlled, reproducible manner, with minimal external dependencies, so that stakeholders can quickly assess the suitability of different providers for use in HPC data services.
We identify seven features that Mochi requires but that are not thoroughly tested in exascale HPC environments.
FI_WAIT_FD provides a file descriptor that users can block on (without busy spinning) until there are events to process in the libfabric interface. This allows data services to minimize their use of compute resources when they are idle while still responding in a timely manner to new requests.
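The blocking pattern can be sketched with plain POSIX calls. This is an analogy under stated assumptions, not fabtsuite code: a pipe stands in for the wait descriptor (which libfabric returns through `fi_control()` with `FI_GETWAIT` on an object opened with `FI_WAIT_FD`), and the helper names `wait_for_event` and `demo_wait` are hypothetical. A real libfabric consumer would also call `fi_trywait()` before blocking.

```c
#include <assert.h>
#include <poll.h>
#include <unistd.h>

/* Hypothetical helper: sleep in the kernel until `fd` is readable or
 * `timeout_ms` elapses. Returns 1 if an event is ready, 0 on timeout,
 * -1 on error. The caller does not spin while waiting. */
static int wait_for_event(int fd, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN, .revents = 0 };
    int rc;

    assert(fd >= 0);
    rc = poll(&pfd, 1, timeout_ms);
    if (rc < 0)
        return -1;
    if (rc == 0)
        return 0;
    return (pfd.revents & POLLIN) ? 1 : -1;
}

/* Demonstration: a pipe stands in for the fabric wait descriptor.
 * Returns 0 if the idle wait timed out and the second wait woke up. */
static int demo_wait(void)
{
    int fds[2];
    char byte = 'x';
    int idle, ready = -1;

    if (pipe(fds) != 0)
        return -1;

    /* Nothing has been written yet, so a short wait should time out. */
    idle = wait_for_event(fds[0], 10);

    /* Simulate an arriving event, then wait again. */
    if (write(fds[1], &byte, 1) == 1)
        ready = wait_for_event(fds[0], 1000);

    close(fds[0]);
    close(fds[1]);
    return (idle == 0 && ready == 1) ? 0 : -1;
}
```

Calling `demo_wait()` from a small `main` exercises both the timeout path and the wake-up path. The point of the design is that an idle service sleeps inside `poll()` rather than repeatedly polling a completion queue.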
fi_cancel() is used to cancel pending asynchronous operations and permit the reuse of memory buffers that were previously associated with those operations. This is necessary for graceful shutdown of persistent services that have pending buffers posted. It is also necessary for fault tolerant use cases, in which a process must cancel pending operations before re-issuing them to a secondary server.
HPC applications, or collections of processes, are typically launched with a parallel job launcher such as mpiexec, aprun, or srun. For many data service use cases, we need to be able to launch service daemons separately from applications. The separately launched collections of processes must be able to communicate with each other, exchanging both messages and RMA transfers.
Mercury uses connectionless RDM endpoints in libfabric, but some providers internally allocate resources or memory buffers on a per-peer basis. We need to make sure that these resources are correctly released when those peers exit. In the data service use case, a single daemon may be relatively long-lived as a large number of clients connect to it and exit over time.
HPC systems are trending towards many cores sharing a single network interface. Data services will likely use multithreading to make more efficient use of these systems. We need to ensure that multithreaded workloads operate correctly and do not suffer significant performance degradation.
HPC systems readily generate structured data. For example, a column of a multidimensional array can describe particles in a molecular dynamics code that generates trillions of rows. Describing such a non-contiguous request with a list of offset-length vectors, and then transferring the request in a single operation, gives the network transport a chance to deliver data with lower latency and higher bandwidth.
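The offset-length description can be sketched with the POSIX `struct iovec`, which is also the type that libfabric's vector calls such as `fi_writev` take to describe local buffers. This is a local analogy under stated assumptions, not fabtsuite code: `writev()` into a pipe stands in for a single vectored network transfer, and the names `particle`, `describe_x_column`, and `demo_gather` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/uio.h>
#include <unistd.h>

#define NROWS 4

/* Hypothetical row layout: one particle per row, so any single
 * column is strided (non-contiguous) in memory. */
struct particle {
    double x, y, z;
};

/* Describe column `x` of `rows` as one base-length segment per row. */
static void describe_x_column(struct particle *rows, struct iovec *iov,
                              size_t n)
{
    assert(rows != NULL && iov != NULL);
    for (size_t i = 0; i < n; i++) {
        iov[i].iov_base = &rows[i].x;
        iov[i].iov_len = sizeof(rows[i].x);
    }
}

/* Send every segment with a single writev() call and read the data
 * back contiguously into `out`. Returns 0 on success, -1 on failure. */
static int demo_gather(double out[NROWS])
{
    struct particle rows[NROWS];
    struct iovec iov[NROWS];
    ssize_t want = (ssize_t)(NROWS * sizeof(double));
    int fds[2];
    int rc = -1;

    for (size_t i = 0; i < NROWS; i++) {
        rows[i].x = (double)i;
        rows[i].y = -1.0;
        rows[i].z = -2.0;
    }
    describe_x_column(rows, iov, NROWS);

    if (pipe(fds) != 0)
        return -1;

    /* One call transfers all segments; the payload is small enough
     * to fit in the pipe buffer without blocking. */
    if (writev(fds[1], iov, NROWS) == want &&
        read(fds[0], out, (size_t)want) == want)
        rc = 0;

    close(fds[0]);
    close(fds[1]);
    return rc;
}
```

After a successful `demo_gather(out)`, `out` holds only the `x` column, packed contiguously. The alternative, one transfer per row, would issue `NROWS` separate operations for the same data.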
When an application uses both MPI and libfabric, MPI may use libfabric in a way that conflicts with the application's use of libfabric.
libfabric supports many providers.
A provider may use a memory-registration cache.
libfabric provides a test suite [4] for different providers.
It does not test any application that calls MPI APIs.
Multi-node tests are run using the ssh command.
Memory leak tests are run using valgrind.
The Mercury NA layer has some basic tests [5]. Mercury uses a plugin architecture for libfabric and MPI. Mercury uses the CMake build and testing system. Mercury tests require kwsys. AddressSanitizer is integrated with CMake to check for resource leaks.
We have developed fabtsuite. It has one test program that either runs as a transmitter (fabtput) or a receiver (fabtget). Using command-line options, a user can select different operating modes to compare single- and multi-threaded (MT) libfabric operation, compare contiguous (fi_write) and vector (fi_writev) transfers, and so on.
fabtput RDMA-writes multiple non-contiguous buffer segments at once using solitary fi_writemsg(3) calls.
We wrote a single test script that tests all options on a single machine. This script will help users to ensure that the two applications can communicate properly.
We wrote parallel batch job scripts for multi-node testing. They can be submitted through CTest. We support both PBS and SLURM.
We tested one local system (A) and two HPC systems (B, C) for six of the features. The following table summarizes the pass (P) / fail (F) results.
Feature | A | B | C |
---|---|---|---|
wait | P | P | P |
cancel | P | P | P |
cross | P | F | P |
thread | P | P | P |
vector | P | P | P |
MPI-IO | N | N | N |
The MPI-IO feature test is not implemented (N) yet.
The wait test requires a slightly longer time allocation than the other tests. Otherwise, the receiver job will not finish in time to generate output.
The cross test failure on B is due to a system queue issue that does not allow jobs to run on more than 3 machines.
Although libfabric provides an extensive set of unit tests, our test suite provides an alternative set of tests that can scale to exascale HPC systems.
Our test suite is fully open source, and anyone can extend it. To test a new feature, one can add a new command-line option.
We provide sample test scripts for HPC systems.
This project was funded by a Department of Energy ECP grant.
CQ: Completion Queue
MR: Memory Registration
NA: Network Abstraction
RDMA: Remote Direct Memory Access
RMA: Remote Memory Access
rcvr: receiver
xmtr: transmitter