Training fails on Vertex AI (GCP) due to NCCL error on A100 GPUs #998
I don't see any problems with updating NCCL if it helps. It seems we use vanilla NCCL (NVIDIA/nccl@master...marian-nmt:nccl:master), so updating shouldn't be problematic. Would you like to open a PR?
Sure, here it is @snukky: marian-nmt/nccl#1 (after that we need to update the submodule in marian-dev). Otherwise, I could just open a PR changing the submodule in marian-dev to point at the main NVIDIA repo instead of the fork.
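For reference, repointing the marian-dev submodule at upstream NVIDIA/nccl would look roughly like the sketch below. The submodule path `src/3rd_party/nccl` is an assumption (check `.gitmodules` for the real one), and a scratch repo stands in for an actual marian-dev checkout:

```shell
# Sketch: repoint the nccl submodule at upstream NVIDIA/nccl.
# A scratch repo stands in for a real marian-dev checkout; the submodule
# path src/3rd_party/nccl is an assumption.
cd "$(mktemp -d)" && git init -q demo && cd demo
git config -f .gitmodules submodule.src/3rd_party/nccl.path src/3rd_party/nccl
git config -f .gitmodules submodule.src/3rd_party/nccl.url https://github.com/NVIDIA/nccl.git
# In a real checkout, follow up with:
#   git submodule sync src/3rd_party/nccl
#   git submodule update --init --remote src/3rd_party/nccl
cat .gitmodules
```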
Bug description
Training does not start on Vertex AI when using more than one A100 GPU with NCCL, due to an unhandled system error. The problem currently occurs only on A100 GPUs, probably due to GPU partitioning in the K8s cluster. Bumping NCCL to the most recent master (https://github.com/NVIDIA/nccl) during compilation seems to fix the issue. Could we update the Marian NCCL fork (https://github.com/marian-nmt/nccl), or would that possibly break something else? @snukky what are your thoughts on that?
Sample log:
If NCCL debug logging is enabled, the following warning shows up; it does not appear when using e.g. V100 GPUs:
graph/xml.cc:332 NCCL WARN Could not find real path of /sys/class/pci_bus/fffffff/../../fffffff:ff:f
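To surface warnings like the one above, NCCL debug output can be enabled via environment variables before launching training; the marian invocation below is a hypothetical placeholder for the actual training command:

```shell
# Enable NCCL debug logging before launching training.
export NCCL_DEBUG=INFO               # print NCCL info/warnings to stderr
export NCCL_DEBUG_SUBSYS=INIT,GRAPH  # focus on init and topology-graph messages
# ./marian --devices 0 1 2 3 ...     # placeholder for the actual training command
```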