AWS OFI NCCL v1.9.0

@rajachan rajachan released this 05 Apr 22:07
· 344 commits to master since this release

This release is intended only for use on AWS P* instances. A general release that supports other Libfabric networks will be made in the near future. This release requires Libfabric v1.18.0 or later and supports NCCL 2.21.5-1 while maintaining backward compatibility with older NCCL versions (NCCL v2.4.8 and later).
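The version requirements above can be checked with a small comparison helper. This is only a sketch: the function names `parse_version` and `meets_minimum` are hypothetical, and it assumes you already have the installed Libfabric and NCCL version strings from your own environment.

```python
# Minimums taken from the release notes above.
MIN_LIBFABRIC = (1, 18, 0)
MIN_NCCL = (2, 4, 8)


def parse_version(text):
    """Turn strings like 'v1.18.0' or '2.21.5-1' into a 3-part integer tuple."""
    core = text.strip().lstrip("v").split("-")[0]  # drop 'v' prefix and '-1' suffix
    parts = tuple(int(p) for p in core.split("."))
    # Pad short versions such as '1.18' so tuple comparison behaves as expected.
    return parts + (0,) * (3 - len(parts))


def meets_minimum(version, minimum):
    """Return True if the version string is at or above the minimum tuple."""
    return parse_version(version) >= minimum
```

For example, `meets_minimum("2.21.5-1", MIN_NCCL)` is true, while `meets_minimum("1.17.2", MIN_LIBFABRIC)` is false.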

New Features:

  • Support the v8 plugin interface introduced with NCCL 2.20. This enables the use of the user memory registration feature recently introduced in NCCL.
  • Update the tuner component to support the v2 ext-tuner interface introduced with NCCL 2.21.
  • Relax ordering constraints for control messages to reduce head-of-line blocking under congestion.

Bug Fixes:

  • Increase the number of communicators from 4K to 256K, supporting larger all-to-all groups.
  • Improve logging in some corner-case error conditions.

The plugin has been tested with the following Libfabric providers using the tests bundled in the source code and the nccl-tests suite:

  • efa

Checksum (sha512) for the release tarball:

7c86650f2f275b97bd08ff66b24ae8fef593269c068ec543259903d0eec80a0fe4153a3f171700e7e3dcb3b809a1d6aba82d5e7dc52ec138eacd7353629d1bc0  aws-ofi-nccl-1.9.0-aws.tar.gz
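One way to check a downloaded tarball against the checksum above is sketched here using Python's standard `hashlib`. The helper names `sha512_of_file` and `matches_expected` are hypothetical, and the sketch assumes the tarball has been saved locally under the file name shown above.

```python
import hashlib
import hmac

# Checksum published in these release notes for aws-ofi-nccl-1.9.0-aws.tar.gz.
EXPECTED_SHA512 = (
    "7c86650f2f275b97bd08ff66b24ae8fef593269c068ec543259903d0eec80a0f"
    "e4153a3f171700e7e3dcb3b809a1d6aba82d5e7dc52ec138eacd7353629d1bc0"
)


def sha512_of_file(path, chunk_size=1 << 20):
    """Compute the SHA-512 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha512()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected=EXPECTED_SHA512):
    """Return True if the file's SHA-512 digest equals the expected hex string."""
    return hmac.compare_digest(sha512_of_file(path), expected.strip().lower())
```

Calling `matches_expected("aws-ofi-nccl-1.9.0-aws.tar.gz")` should return `True` only for an intact download; a `False` result suggests the file is corrupted or is not the released tarball.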