This project brings together a collection of utilities that have been factored out from different projects. In certain cases we need specific functionality that does not appear to be available in existing packages. We therefore develop the required code in-house, and if it turns out to be general enough, we add it to this collection in the hope that it may be useful to others.
This library has moved from https://gitlab.jsc.fz-juelich.de/hpc4ns/hpc4neuro to GitHub and will, where possible, be supported by the Multiscale team.
This library was written by Fahad Khalid at Forschungszentrum Juelich GmbH.
MIT License Copyright (c) 2019 Forschungszentrum Juelich GmbH
Questions can be posted at the main repository, https://github.com/multiscale-cosim/EBRAINS-cosim, or sent by email to [email protected]
The hpc4neuro package requires Python 3.6 or above. To install, please use the following command:
python -m pip install git+https://github.com/multiscale-cosim/HPC4Neuro.git
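As a quick, purely illustrative sanity check, the installed package should then be importable from Python:

# Verify that the installed package can be imported
import hpc4neuro
print(hpc4neuro.__name__)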
The following modules are available at this time:
- hpc4neuro.distribution
Note: This module requires mpi4py. To install mpi4py, please follow the installation instructions available here.
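In many environments mpi4py can alternatively be installed via pip, assuming a working MPI implementation and build tools are already available on the system:

python -m pip install mpi4py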
This module exposes the following two classes:
- DataDistributor
- ErrorHandler
Any function that returns a sized iterable (i.e., an object that supports iter() and len(), e.g., list) can be decorated by DataDistributor to seamlessly distribute the items of the resulting object across all participating MPI ranks. Moreover, ErrorHandler implements exception-handling functions that ensure graceful application termination via synchronization of all MPI ranks.
The primary motivation for creating this module was to hide the details of distributing training/validation data amongst MPI ranks when training deep artificial neural networks in a data-parallel fashion using Horovod. Even though Horovod hides the intricate details of distributed training, proper distribution of training/validation data is only possible via MPI programming.
The hpc4neuro.distribution module provides a high-level interface for data distribution with MPI, without requiring the user to write any explicit MPI code. The following examples show what the module does and how it can be useful.
Note: All examples are available in the hpc4neuro.examples.distribution package.
Consider the following code, which defines a simple function that returns a list of the filenames in a given directory.
import os

def get_filenames(path):
    return os.listdir(path)

# List of the filenames in the 'hpc4neuro' directory
filenames = get_filenames('./hpc4neuro')
Now consider a scenario in which we need to run this code on multiple processors across multiple nodes in a cluster, and distribute the returned filenames across all the processes. The following example shows how the hpc4neuro.distribution module can help with that.
import os
from mpi4py import MPI
from hpc4neuro.distribution import DataDistributor

@DataDistributor(MPI.COMM_WORLD)
def get_filenames(path):
    return os.listdir(path)

# List of rank-local file names
filenames = get_filenames('./hpc4neuro')
DataDistributor decorates the get_filenames function such that calling the function returns only a subset of filenames that are to be processed by the local MPI rank. All the MPI communication required for distribution of filenames is hidden from the user.
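For illustration only, the following two lines (not part of the library) can be appended to the example above; when the script is launched on several MPI ranks, e.g., via mpirun -np 4 python example.py (the script name is a placeholder), each rank reports its own portion of the directory listing:

# Report the rank-local subset of filenames (illustrative output only)
rank = MPI.COMM_WORLD.Get_rank()
print(f'Rank {rank} received {len(filenames)} file(s): {filenames}')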
In certain scenarios it is not possible to statically decorate a function using the decorator syntax, e.g., when the MPI communicator object is not available at the time of function definition. The following example demonstrates the use of DataDistributor in such cases.
import os
from mpi4py import MPI
from hpc4neuro.distribution import DataDistributor

# Initialize the decorator
dist_decorator = DataDistributor(MPI.COMM_WORLD)

# Decorate the function that reads a list of filenames
get_rank_local_filenames = dist_decorator(os.listdir)

# Use the decorated function to get the rank-local list of filenames
filenames = get_rank_local_filenames('./hpc4neuro')
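The same pattern also covers the case where the communicator only becomes available as a function argument. The following sketch is purely illustrative (the wrapper function is not part of the library):

import os
from mpi4py import MPI
from hpc4neuro.distribution import DataDistributor

def get_rank_local_filenames(path, comm):
    # The communicator is only known here, so the decorator is applied dynamically
    return DataDistributor(comm)(os.listdir)(path)

filenames = get_rank_local_filenames('./hpc4neuro', MPI.COMM_WORLD)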
A function to be decorated by DataDistributor, such as os.listdir in the examples above, may raise an exception. Moreover, exceptions may be raised by DataDistributor itself due to other errors. In both cases, if an exception is raised by one MPI rank, the other MPI ranks may get stuck in a waiting state, unaware of the raised exception. To handle such a scenario and ensure graceful termination of the application, a flag can be set in the DataDistributor initializer to enable a graceful shutdown on error. The following code examples illustrate how to enable this feature with both the static and dynamic decoration syntax:
- Static: @DataDistributor(MPI.COMM_WORLD, shutdown_on_error=True)
- Dynamic: dist_decorator = DataDistributor(MPI.COMM_WORLD, shutdown_on_error=True)
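As a minimal sketch of the static variant (the non-existent path is chosen deliberately so that os.listdir raises an exception):

import os
from mpi4py import MPI
from hpc4neuro.distribution import DataDistributor

# Enable a graceful shutdown so that an exception does not leave other ranks
# blocked in MPI communication
@DataDistributor(MPI.COMM_WORLD, shutdown_on_error=True)
def get_filenames(path):
    return os.listdir(path)

# The invalid path makes os.listdir() raise; with shutdown_on_error=True all
# ranks terminate gracefully instead of potentially hanging
filenames = get_filenames('./no/such/directory')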
API documentation for hpc4neuro.distribution is available here.
- Clone this repository
- Change to the cloned repository directory
- Create and activate a virtual environment
- If you use poetry, run poetry install to install all the required dependencies
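For example, assuming a bash-compatible shell and the repository URL given above, these steps might look as follows (the directory and environment names are illustrative):

git clone https://github.com/multiscale-cosim/HPC4Neuro.git
cd HPC4Neuro
python -m venv venv
source venv/bin/activate
poetry install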
To generate the API documentation using sphinx, issue the following commands from the repository root:
sphinx-build -b html doc doc/html
sphinx-build -b text doc doc/text
pytest is required for running and working with the test code for this project.
Use the following command to run tests:
mpirun -np <n> python -m pytest
where <n> should be replaced with the number of MPI ranks to use for testing.