Fixed distributed Optuna running on multiple GPUs #495

RandomDefaultUser · 2023-12-22T08:59:58Z

After recent performance-related changes to MALA's GPU usage, running the Optuna distributed framework with multiple GPUs on a single node would crash. The cause was the way MALA initializes its GPU assignments: the first torch.cuda.synchronize() call targeted the wrong device. Fixed by passing the correct device to torch.cuda.synchronize().
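For illustration, here is a minimal sketch of the kind of change described, not the actual MALA code. It assumes each process picks its GPU as rank modulo the number of GPUs on the node; `device_for_rank` and `synchronize_own_device` are hypothetical helper names. The point is that `torch.cuda.synchronize()` without an argument acts on the current device, which may not be the GPU assigned to this rank.

```python
def device_for_rank(rank: int, gpus_per_node: int) -> str:
    """Map a process rank to a CUDA device string (hypothetical helper).

    Assumes a simple round-robin assignment of ranks to GPUs on one node.
    """
    return f"cuda:{rank % gpus_per_node}"


def synchronize_own_device(device_name: str) -> None:
    """Wait for pending GPU work on the given device only (hypothetical helper)."""
    # Imported lazily so the rank mapping above can be used without torch.
    import torch

    if not torch.cuda.is_available():
        return  # Nothing to synchronize on a CPU-only machine.

    device = torch.device(device_name)
    # Without the device argument, synchronize() targets the *current*
    # device, which can still be cuda:0 for every rank.
    torch.cuda.synchronize(device=device)
    # Stream operations accept the device explicitly as well.
    torch.cuda.current_stream(device=device).synchronize()
```

A process with rank 3 on a node with 2 GPUs would call `synchronize_own_device(device_for_rank(3, 2))`, which synchronizes `cuda:1` and leaves the other GPU alone.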

RandomDefaultUser added 2 commits December 22, 2023 09:50

Targeted correct device for CUDA synchronize

cd1a696

Also included the device for stream operations, for good measure

45f0749

RandomDefaultUser merged commit 7254c5a into mala-project:develop Dec 22, 2023
5 checks passed

RandomDefaultUser deleted the fix_mpi_hyperopt branch December 22, 2023 09:28
