Authors
Youjie Li, Iou-Jen Liu, Yifan Yuan, Deming Chen, Alexander Schwing, Jian Huang, UIUC
Keywords
distributed machine learning, reinforcement learning, in-switch accelerator, in-network computing
Summary
Challenge
Distributed RL training generates orders of magnitude more iterations than conventional distributed training, with much smaller but more frequent gradient aggregations. Gradient aggregation is therefore the major bottleneck in distributed RL training, occupying up to 83.2% of the execution time of each training iteration.
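To make the bottleneck concrete, here is a minimal toy sketch of synchronous data-parallel training with a gradient aggregation step in every iteration; all names and sizes are illustrative assumptions, not the paper's code.

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_WORKERS, MODEL_SIZE = 4, 1000      # toy sizes for illustration

def local_gradient(policy):
    # Stand-in for "run a short rollout and compute a policy gradient";
    # real RL workers would interact with an environment here.
    return rng.normal(size=policy.shape)

policy = np.zeros(MODEL_SIZE)
for step in range(3):                  # a few toy iterations
    grads = [local_gradient(policy) for _ in range(NUM_WORKERS)]
    # Gradient aggregation: every worker blocks here, every iteration.
    # Because RL gradients are small but aggregations are frequent,
    # this step can dominate iteration time (up to 83.2% per the paper).
    avg_grad = np.mean(grads, axis=0)  # conceptually an all-reduce
    policy -= 0.01 * avg_grad          # synchronous model update
```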
Contribution
- Propose iSwitch, an in-switch acceleration solution that moves gradient aggregation from server nodes into the network switches.
- Rethink distributed RL training algorithms and propose a hierarchical aggregation mechanism to further increase the parallelism and scalability of distributed RL training at rack scale (see the hierarchical-aggregation sketch after this list).
- Extend the network protocol: to support in-switch computing for distributed RL training, they build their own protocol and packet format on top of regular network protocols (see the packet-format sketch after this list).
- Implement iSwitch on a real-world programmable switch (a NetFPGA board), extending the control and data planes of the switch to support iSwitch without affecting its regular network functions.
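The hierarchical aggregation mechanism can be sketched as a two-level sum, first within each rack and then across racks; the data layout and function name below are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def aggregate_hierarchically(racks):
    """racks: list of racks; each rack is a list of per-worker gradient arrays."""
    # Level 1: each ToR switch sums the gradients of its own workers,
    # so aggregation proceeds in parallel across racks.
    partial_sums = [np.sum(rack, axis=0) for rack in racks]
    # Level 2: the root switch sums the per-rack partial results;
    # only one partial gradient per rack crosses rack boundaries.
    return np.sum(partial_sums, axis=0)

# Toy usage: 2 racks x 3 workers, 5-element gradients.
racks = [[np.ones(5) for _ in range(3)] for _ in range(2)]
print(aggregate_hierarchically(racks))   # -> [6. 6. 6. 6. 6.]
```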
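The protocol extension can be pictured as a small aggregation header plus a fixed-point gradient payload carried on a standard transport; the field names and widths below are assumptions for illustration and do not reproduce the paper's actual packet format.

```python
import struct

# Hypothetical header: job_id (u32), seq_no (u16), num_workers (u16).
HEADER_FMT = "!IHH"

def pack_chunk(job_id, seq_no, num_workers, values):
    """values: 32-bit fixed-point gradient words for one packet-sized chunk."""
    return (struct.pack(HEADER_FMT, job_id, seq_no, num_workers)
            + struct.pack(f"!{len(values)}i", *values))

def unpack_chunk(packet):
    hdr = struct.calcsize(HEADER_FMT)
    job_id, seq_no, num_workers = struct.unpack(HEADER_FMT, packet[:hdr])
    values = struct.unpack(f"!{(len(packet) - hdr) // 4}i", packet[hdr:])
    return job_id, seq_no, num_workers, list(values)

pkt = pack_chunk(job_id=7, seq_no=0, num_workers=4, values=[1, -2, 3])
assert unpack_chunk(pkt) == (7, 0, 4, [1, -2, 3])
```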
Innovation Points
iSwitch conducts in-switch aggregation at the granularity of network packets rather than entire gradient vectors (each of which consists of numerous network packets), which allows the aggregation of one packet to overlap with the transmission of others.
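A rough sketch of what packet-granularity aggregation implies on the switch side, assuming per-packet accumulators keyed by sequence number; the data structures and names are illustrative, not the hardware design.

```python
from collections import defaultdict

class PacketAggregator:
    """Per-packet accumulator: forwards an aggregated chunk as soon as all
    workers' copies of that packet arrive, instead of buffering whole vectors."""
    def __init__(self, num_workers):
        self.num_workers = num_workers
        self.sums = {}                   # seq_no -> running element-wise sums
        self.seen = defaultdict(int)     # seq_no -> arrivals so far

    def on_packet(self, seq_no, values):
        if seq_no not in self.sums:
            self.sums[seq_no] = list(values)
        else:
            self.sums[seq_no] = [a + v for a, v in zip(self.sums[seq_no], values)]
        self.seen[seq_no] += 1
        if self.seen[seq_no] == self.num_workers:
            del self.seen[seq_no]
            return self.sums.pop(seq_no)   # ready: broadcast immediately,
        return None                        # overlapping with later packets

agg = PacketAggregator(num_workers=2)
assert agg.on_packet(0, [1, 2]) is None       # first worker's copy arrives
assert agg.on_packet(0, [3, 4]) == [4, 6]     # second copy completes packet 0
```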
Result
iSwitch supports both synchronous and asynchronous distributed RL training with improved parallelism. Experiments show that iSwitch offers a system-level speedup of up to 3.66× for synchronous distributed training and 3.71× for asynchronous distributed training, compared with state-of-the-art approaches.