-
Notifications
You must be signed in to change notification settings - Fork 3
Final Report
The Mochi project explores a software defined storage approach for composing storage services that provides new levels of functionality, performance, and reliability for science applications at extreme scale. One key component of Mochi is the use of OpenFabrics Interfaces (OFI), a framework focused on exporting fabric communication services to applications. Libfabric is a core component of OFI. We have developed a test suite, fabtsuite, to assess libfabric for the features that the Mochi project requires.
The Mochi project [1] provides a set of software building blocks for rapid construction of custom, composable HPC data services. These building blocks rely upon the libfabric [2] network transport library, by way of the Mercury RPC library [3], for high-performance network support on most HPC platforms.
Mochi and Mercury impose specific requirements on network transport libraries, and libfabric’s OFI interface meets all of these requirements. We have found, however, that many of the features that we rely upon are not as thoroughly tested as the features that are more routinely used in other environments (e.g., by message-passing libraries). The objective of the Mochi libfabric test suite, fabtsuite, is to evaluate the specific libfabric features required by Mochi in a controlled, reproducible manner, with minimal external dependencies, so that stakeholders can quickly assess the suitability of different providers for use in HPC data services.
We identify 7 features that are required for Mochi but not thoroughly tested in Exascale HPC environment.
FI_WAIT_FD provides a file descriptor that users can block on (without busy spinning) until there are events to process in the libfabric interface. This allows data services to minimize compute resources when they are idle while they arestill responding in a timely manner to new requests.
fi_cancel() is used to cancel pending asynchronous operations and permit the reuse of memory buffers that were previously associated with those operations. This is necessary for graceful shutdown of persistent services that have pending buffers posted. It is also necessary for fault tolerant use cases, in which a process must cancel pending operations before re-issuing them to a secondary server.
HPC applications or collections of processes are typically launched with a parallel job launcher, like mpiexec, aprun, or srun. For many data service use cases, we need to be able to launch service daemons separately from applications. The separately launched collections of processes must be able to communicate with each other exchanging both messages and RMA transfers.
Mercury uses connectionless RDM endpoints in libfabric, but some providers internally allocate resources or memory buffers on a per-peer basis. We need to make sure that these resources are correctly released when those peers exit. In the data service use case, a single daemon may be relatively long-lived as a large number of clients connect to it and exit over time.
HPC systems are trending towards systems with many cores sharing a single network interface. Data services will likely utilize multithreading to make more efficient use of these systems. We need to ensure that multithreaded workloads operate correctly and do not present a significant performance degradation.
HPC systems easily generate structured data. For example, a column of a multidimensional array can describe particles in a molecular dynamics code that generate trillions of rows. Being able to describe those non-contiguous requests with a list of offset-length vectors, and then transfering that request in a single operation gives the network transport a chance to deliver data with lower latency and higher bandwidth.
When an application uses both MPI and libfabric, MPI may use libfabric in a way that conflicts with the appication's use of libfabric.
We have developed fabtsuite. It has one test program that either runs as a transmitter (fabtput) or a receiver (fabtget).
Using command-line options, a user can select different operating modes to compare single- and multi-threaded (MT) libfabric operation, compare contiguous (fi_write) and vector (fi_writev) transfers, and so on.
option | fabtget | fabtput |
---|---|---|
cancel | -c | -c |
contiguous | -g | |
reregister | -r | -r |
wait | -w | -w |
session | -n | -n, -k |
thread | -p 'i-j' | -p 'i-j' |
fabtput
RDMA-writes multiple non-contiguous buffer segments at once
using solitary fi_writemsg(3)
calls.
We wrote a single test script that tests all options on a single machine. This script will help users to ensure that the two applications can communicate properly.
The script uses several environment variables to adjust test parameters.
Name | Purpose |
---|---|
FABTSUITE_CANCEL_TIMEOUT | Cancel transfer after s seconds. Default value is 2 seconds. |
FABTSUITE_RANDOM_FAIL | Fail randomly. Default value is no. |
FI_MR_CACHE_MAX_SIZE | Disable memory-registration cache when it's set to 0. |
The script tests a set of command line options in two phases -
- get and 2) put. The following table summarizes the option set for each phase.
option set | get phase | put phase |
---|---|---|
default | Y | Y |
cacheless | Y | Y |
reregister | Y | Y |
cacheless,reregister | Y | Y |
wait | Y | Y |
contiguous | N | Y |
contiguous,reregister | N | Y |
contiguous,reregister,cacheless | N | Y |
As you can see from the above table,
the script tests contiguous option in put phase only.
When cacheless option is tested,
the FI_MR_CACHE_MAX_SIZE
environment variable is set to 0.
Command line options are supplied to either faptget during the get phase
or faptput during the put phase except for cancel test.
That is, if faptget starts with a -w
option, faptput
runs without -w
option.
However, there is one exception.
The cancel requires both commands to have -c
option.
CTest is a free (unit) testing framework that is integrated with CMake. A CTest test is any command returning an exit code. It does not really matter how the command is issued or what is run.
The goal of using CTest is to allow users to compile the test suite with CMake, run the front-end CTest script on a login node, and dispatch batch jobs to the compute nodes to execute individual tests. This architecture will make it easy to to split off individual test cases for more detailed debugging/analysis.
CTest allows users to run a specific test easily.
For example, you can run wait
feature test only using -I
option.
$ ctest -I 2,2
Test project /home/user/fabtsuite/build
Start 2: FI_WAIT_FD
1/1 Test #2: FI_WAIT_FD ....................... Passed 74.32 sec
100% tests passed, 0 tests failed out of 1
The -I
option can specify a range of tests as well.
For example, if you use -I 2,3
,
it will run both wait
and cancel
feature tests.
We wrote one test script for single-node testing and
some parallel batch job scripts for multi-node testing.
All scripts can be submitted by CTest using a single command make test
after compiling the test suite with make
command.
For batch jobs, we support both PBS and SLURM.
Each multi-node testing job submits fabtget job first on one node and 1 or more fabtput job(s) on other node(s). The diagram below illustrates the mult-node testing workflow.
We tested 5 features on one local system (A) for single-node test and on two ECP systems (B,C) for the multi-node test. The following table summarizes pass (P) / fail (F) result.
Feature | A | B | C |
---|---|---|---|
wait | P | P | P |
cancel | P | F | F |
cross | P | F | P |
thread | P | P | P |
vector | P | P | P |
MPI-IO | N | N | N |
The MPI-IO
feature test, which means MPI interoperability,
is not implemented (N) yet.
However, we included a provisional script in CTest for future work.
At this stage, the test suite is functionally correct and can execute on a baseline TCP provider with libfabric 1.15.
The wait
test requires a slightly longer time allocation than other tests.
Otherwise, receiver job will not finish on time and generate output.
The cancel
test failure on C was caused by the interrupt signal.
fabtget: caught a signal, exiting.
real 1.99
user 0.53
sys 1.38
1
srun: error: SystemB: task 0: Exited with exit code 1
srun: launch/slurm: _step_signal: Terminating StepId=186257.0
The cross
test failure on B is due to system queue issue
that doesn't allow to run jobs on more than 3 machines.
Resource leak was checked using AddressSanitzer once although it is not a part of test suite.
Although libfabric provides an extensive set of unit testing, our test suite can provide an alternative set of exa-scalable tests in ECP systems.
Our work suggests that it is necessary to get the framework itself working with a broader collection of libraries and additional (RDMA-capable) providers, employ this in existing mochi regression testing, and expand the test cases over time.
Additional work is needed to validate the test suite and its results on other versions of libfabric and with other libfabric providers.
Our test suite is fully open-source, and anyone can extend it. One can add a new command line option for a new feature testing. Or one can add a new C code with MPI APIs to test MPI inteoperability.
We provide sample test scripts for ECP systems that use PBS and SLURM only. They can be modified further to run on different schedulers like Cobalt.
This research was supported by the Exascale Computing Project (17-SC-20-SC), a collaborative effort of two U.S. Department of Energy organizations (Office of Science and the National Nuclear Security Administration) responsible for the planning and preparation of a capable exascale ecosystem, including software, applications, hardware, advanced system engineering, and early testbed platforms, in support of the nation’s exascale computing imperative.
ECP: Exascale Computing Project
MPI: Message Passing Interface
MR: Memory Registration
RDMA: Remote Direct Memory Access
RMA: Remote Memory Access
PBS: Portable Batch System
RPC: Remote Procedure Call
SLURM: Simple Linux Utility for Resource Management
TCP: Transmission Control Protocol