Home
At the top level, the structure of this codebase looks like:
├── bluefog
├── docs
├── examples
├── setup.py
├── test
└── ...
The main project code is under the bluefog folder, the examples folder contains the usage code for users (which also serves as end-to-end tests), and the test folder is for tests, of course.
Overall, our repository shares a very similar organization to the horovod repository. FYI, in the future, we will put all the documentation into a website.
The main project code is located under bluefog, which is composed of two folders (for now): common and torch. We plan to support a simple API for at least the pytorch and tensorflow frameworks, so the torch and tensorflow folders implement the adaptors for the tensor operations under PyTorch and TensorFlow respectively. The common folder includes the shared functionality, such as MPI communication, the background processing thread, etc., because those operations are the same for any framework.
The core code consists of three parts of logic:
- Framework (pytorch/tensorflow) API
- Public MPI ops (such as AllReduce, WinPut, NeighborAllGather, etc.)
- Background Processing Part (background thread, status maintenance, communication scheduling, etc.)
Framework APIs are located in the __init__.py file. Among them, the most important function is the DistributedOptimizer wrapper for the original optimizer. Under it, a hook function is registered for each layer. Depending on the style of distributed optimizer (ring-allreduce, diffusion, consensus, etc.), different asynchronous communication/callback/updating functions are implemented.
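The following is a minimal sketch, not BlueFog's actual implementation, of the hook pattern described above: every trainable parameter gets a gradient hook that launches a non-blocking communication op and records the returned handle, and the wrapped optimizer's step() waits on those handles before updating. The helper names allreduce_async and synchronize are assumptions for illustration.

```python
def register_comm_hooks(model, allreduce_async, handles):
    """Attach a gradient hook to every trainable parameter (PyTorch model).
    Each hook only enqueues a non-blocking allreduce of the gradient and
    stores the returned handle; it does not block the backward pass."""
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue

        def make_hook(p, p_name):
            def hook(grad):
                handles[p] = allreduce_async(grad, name=p_name)
                return grad
            return hook

        param.register_hook(make_hook(param, name))


def synchronize_grads(handles, synchronize):
    """Called from the wrapped optimizer's step(): wait for every pending op
    and replace each gradient with the averaged result before updating."""
    for param, handle in handles.items():
        param.grad = synchronize(handle)
    handles.clear()
```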
Public MPI ops can be categorized into four types (a usage sketch follows the list):
- Basic Ops -- (init, shutdown, size, local_size, rank, local_rank)
- Collective Ops -- (allreduce, allgather, broadcast) x (sync, async, in-place) + (poll, synchronize)
- Topology and Neighbor Ops -- (load_topology, set_topology, neighbor_allgather, neighbor_allreduce)
- One-Sided Communication Ops -- (win_create, win_free, win_get, win_put, win_accumulate, win_sync, win_fence)
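As a hedged usage sketch (assuming these ops are exposed through the PyTorch adaptor as bluefog.torch; the exact module path, op names, and signatures may differ from the real package), a flow touching all four categories could look like:

```python
import torch
import bluefog.torch as bf  # assumed import path

bf.init()                                        # basic op
print("rank %d of %d" % (bf.rank(), bf.size()))

x = torch.ones(4) * bf.rank()

# Collective op: synchronous allreduce returning the average over all ranks.
avg = bf.allreduce(x)

# Topology / neighbor op: average only over the neighbors defined by the
# virtual communication topology.
neighbor_avg = bf.neighbor_allreduce(x)

# One-sided communication ops: expose x through a window, push it to the
# neighbors, then update the local copy. (Argument names are assumptions.)
bf.win_create(x, name="x")
bf.win_put(x, name="x")
bf.win_sync(name="x")
bf.win_free(name="x")

bf.shutdown()
```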
Most communication ops take tensors and return the averaged tensors over all processes or over the neighbors. To keep these functions non-blocking, their implementation does not call the MPI functions directly but instead pushes a communication request onto a queue. The real communication is executed by the "Background Processing Part". An asynchronous callback pattern is used to let those communication ops get the results.
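A minimal sketch of this non-blocking pattern on the op side (hypothetical names such as _request_queue, _results, and allreduce_async; not the actual BlueFog code): the op only enqueues a request and returns a handle, and synchronize blocks until the callback has fired.

```python
import queue
import threading

_request_queue = queue.Queue()   # consumed by the background thread
_results = {}                    # handle -> finished tensor
_done = {}                       # handle -> threading.Event
_next_handle = 0

def allreduce_async(tensor, name=None):
    """Enqueue an allreduce request and return a handle immediately."""
    global _next_handle
    handle = _next_handle
    _next_handle += 1
    _done[handle] = threading.Event()

    def callback(result):
        # Invoked by the background thread once the communication is done.
        _results[handle] = result
        _done[handle].set()

    _request_queue.put(("allreduce", tensor, name, callback))
    return handle

def synchronize(handle):
    """Block until the background thread has completed the request."""
    _done[handle].wait()
    del _done[handle]
    return _results.pop(handle)
```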
Lastly, the Background Processing Part is the main course of this project. There are lots of niceties behind the implementation, but the main logic is simple and clear. First, it sets up some global state, which contains the threads used for communication. Similar to a thread-pool strategy, the background thread keeps pulling communication requests out of the queue (pushed there by the MPI ops) and executes the communications. (Currently, we only have one thread for that job. It is possible to create multiple threads, but that may not be necessary and may cause more trouble than benefit.) After the communication is done, it executes the callback function to inform the MPI ops.
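Continuing the hypothetical sketch above, the background loop itself is essentially a single-worker consumer: pull a request, perform the communication, fire the callback. The perform_mpi_op argument stands in for the real MPI routines.

```python
def _background_loop(perform_mpi_op):
    """Single background worker: drain the request queue, run the actual
    communication, and notify the caller through its callback."""
    while True:
        op, tensor, name, callback = _request_queue.get()
        if op == "shutdown":
            break
        result = perform_mpi_op(op, tensor, name)  # e.g. MPI allreduce underneath
        callback(result)

# Started once during init(), e.g.:
# threading.Thread(target=_background_loop, args=(run_mpi,), daemon=True).start()
```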
All the interface code will be Python. However, several functions, such as low-level hardware interaction, MPI control, etc., require a C++ interface. Therefore, part of our code base is C++. In order to communicate between C++ and Python, we need to build a .so file on Linux or a .dll file on Windows first. If everything works correctly on your system, the only thing you need to do is type
python setup.py build_ext
It will generate the object file under the build folder. Currently, we have a simple hello_world.cc and c_hello_world.py as an example. Alternatively, you can generate the shared library in-place by running
python setup.py build_ext -i
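For reference, here is a minimal sketch (not the actual BlueFog setup.py) of how a C++ source can be declared as an extension module so that the build_ext commands above produce the shared library. The module name and source path are assumptions, and the C++ file must provide the usual PyInit_<module> entry point (or be wrapped with a binding tool such as pybind11).

```python
from setuptools import setup, Extension

hello_world = Extension(
    "hello_world",                                # hypothetical module name
    sources=["bluefog/common/hello_world.cc"],    # hypothetical path
    extra_compile_args=["-std=c++11"],
    language="c++",
)

setup(name="bluefog", ext_modules=[hello_world])
```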