BlueFog is a high-performance distributed training framework built with decentralized optimization algorithms. The goal of Bluefog is to make decentralized algorithms easy to use, fault-tolerant, friendly to heterogeneous environment, and even faster than training frameworks built with parameter server, or ring-allreduce.
Below are the charts representing the performance of BlueFog that was done on ResNet50 benchmark. Each machine has 8 V100 GPUs (64GB memory) with NVLink-enabled and the inter-connected communication speed is 25Gbps. This is the same hardware setup you can get on AWS. We test the scaling efficiency with a batch size of 64 for a computationally intensive scenario, and a batch size of 32 for a communicationally intensive scenario.
In the figures, the black box represents the ideal linear scaling. It is observed that Bluefog can achieve over 95% scaling efficiency while Horovod reaches around 66% sacling efficiency with batch size 64 on 128 GPUs. For the communicationally intensive scenario with batch size 32, the scaling efficiency gap between Bluefog and Horovod becomes even larger. To understand more details about the BlueFog benchmark, checkout our performance page.
BlueFog is built with decentralized optimization algorithms. This is fundamentally different from other popular distributed training frameworks, such as DistributedDataParallel provided by PyTorch, Horovod, BytePS, etc.
In each communication stage, neither the typical star-shaped parameter-server toplogy, nor the pipelined ring-allreduce topology is used. Instead, BlueFog will exploit a virtual and probably dynamic network topology (that can be in any shape) to achieve most communication efficiency.
Main Idea: Replace expensive allreduce averaging over gradients by cheap neighbor averaging over parameters
For each training iteration, one process (or agent) will update its model with information received from its direct neighbors defined by the virtual topology. It is observed all communications only occur over the predefied virtual topolgy and no global communication is required. This is why the algorithms is named decentralized. Decentralized training algorithms are proved in literature that it can converge to the same solution as their standard centralized counterparts.
The topology decides the communication efficiency. BlueFog supports both static topology and dynamic topology usages. After tremendous trials, the dynamic Exponential-2 graph is observed to achieve the best performance if the number of agents is the power of 2, such as 4, 32, 128 agents. In Exponential-2 graph, each agent will communicates with the neighbors that are 2 0, 2 1, ..., 2 t hops away. Dynamic toplogy means all agents select one neighbor only in one iteration and select next neighbor in next iteration as illustrated in the following figure:
In this scenario, the communcation cost for each iteration is only one unit delay, one standard parameter size to transmit and no communication conflict happens, which is better than what parameter server or ring-allreduce promises. As for loss and accuracy guarantees, please check out our theoratical paper and our slides preseneted on MLA'20. [A full tutorial will be added in future].
First, make sure your environment is with python>=3.7
and openmpi >= 4.0
.
Then, install Bluefog with: pip install --no-cache-dir bluefog
or
BLUEFOG_WITH_NCCL=1 pip install bluefog
if NCCL is supported (NCCL>=2.7
). Check
the install_bluefog page if you need more information or other install options.
We provide high-level wrapper for torch optimizer. You just need to modify
the existing script to distributed implementation is wrapping the optimizer
with our DistributedNeighborAllreduceOptimizer
,
then run it through bfrun
. That is it!
# Execute Python functions in parallel through
# bfrun -np 4 python file.py
import torch
import bluefog.torch as bf
...
bf.init()
optimizer = optim.SGD(model.parameters(), lr=lr * bf.size())
optimizer = bf.DistributedNeighborAllreduceOptimizer(
optimizer, model=model
)
...
Previous example is for static topology usage. For dynamic topology case, you need a little bit more code:
from bluefog.common import topology_util
...
# Same setup code as previous snippets
dynamic_neighbors_gen = topology_util.GetInnerOuterExpo2DynamicSendRecvRanks(
bf.size(), local_size=bf.local_size(), self_rank=bf.rank())
def dynamic_topology_update(epoch, batch_idx):
send_neighbors, recv_neighbors = next(dynamic_neighbors_gen)
avg_weight = 1/(len(recv_neighbors) + 1)
optimizer.send_neighbors = to_neighbors
optimizer.neighbor_weights = {r: avg_weight for r in recv_neighbors}
optimizer.self_weight = avg_weight
# Torch training code
for epoch in range(epochs):
for batch_idx, (data, target) in enumerate(train_loader):
dynamic_topology_update(epoch, batch_idx)
...
loss.backward()
optimizer.step()
Check our BlueFog dynamic topology neighbor averaging page to see more on how to control and use topology. See BlueFog examples folder for full code.
We also provide lots of low-level functions, which you can use those as building blocks to construct your own distributed training algorithm. The following example illustrates how to run a simple consensus algorithm through bluefog.
import torch
import bluefog.torch as bf
bf.init()
x = torch.Tensor([bf.rank()])
for _ in range(100):
x = bf.neighbor_allreduce(x)
print(f"{bf.rank()}: Average value of all ranks is {x}")
Checkout our API explanation page to see all supported synchronous and asynchronous features.
The Bluefog source code was based off Horovod repository. Hence, BlueFog shared lots of common features from Horovod such as timeline, tensor-fusion, etc. Here, we want to express our gratitude to the Horovod team.
Faster Learning over Networks and BlueFog, BlueFog Team, invited talk at MLA, 2020 [slides]
BlueFog: Make Decentralized Algorithms Practical for Optimization and Deep Learning, BlueFog Team, To Appear in 2020