I trained the baseline on a single A100-40G using `./tools/dist_train.sh ./projects/configs/bevformer/bevformer_base_occ.py 1`.
After 24 epochs, I tried to run `./tools/dist_test.py ./projects/configs/bevformer/bevformer_base_occ.py work_dirs/bevformer_base_occ/epoch_24.pth 1`.
After the checkpoint loaded and evaluation had run over the 6019 tasks, I saw memory grow from 18G to 42G, and then the run suddenly crashed with `torch.distributed.elastic.multiprocessing.api:failed`.
How can I fix this?
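In case it helps narrow things down, here is a minimal probe I could drop into the evaluation loop to check whether the growth is host RAM (e.g. per-sample results accumulating in a Python list) rather than GPU memory. This is just a sketch: `psutil` is a third-party package, and where to call it in the test loop is my guess.

```python
# Minimal memory probe -- a sketch to confirm whether the 18G -> 42G growth
# reported above is process RSS (host RAM) rather than GPU memory.
# Assumes psutil is installed (pip install psutil); tag names are arbitrary.
import os
import psutil

_proc = psutil.Process(os.getpid())

def log_rss(tag: str) -> None:
    """Print the current process's resident set size in GiB."""
    rss_gib = _proc.memory_info().rss / 1024 ** 3
    print(f'[mem] {tag}: rss={rss_gib:.1f} GiB', flush=True)
```

Calling `log_rss` every few hundred samples during evaluation should show whether RSS climbs steadily until the OS kills the worker, which is often what `torch.distributed.elastic.multiprocessing.api:failed` looks like after an out-of-memory kill.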