Enabling data-parallel multi-GPU training #1188
Merged
This PR enables multi-GPU training and adds auto-initialization of a Model. It also introduces `singlegpu` and `multigpu` pytest markers for splitting the GPU CI GitHub Actions workflow into two jobs: one for the 1-GPU runner and one for the 2-GPU runner.

Follow-up: the test in `tests/integration` is not complete, because Lightning launches separate processes under the hood with the correct environment variables such as `LOCAL_RANK`, while pytest stays in the main process and therefore only exercises the `LOCAL_RANK=0` case. A proper follow-up test should verify that the dataloader also works correctly for e.g. `global_rank > 0`.
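For context, custom pytest markers like the `singlegpu` and `multigpu` ones introduced here are typically registered in the project's pytest configuration so that `pytest -m` can select them without warnings. A sketch of what that registration might look like (the marker names come from this PR; the file layout and descriptions are assumptions):

```ini
# pytest.ini (hypothetical location) — register the custom GPU markers
[pytest]
markers =
    singlegpu: tests that require exactly one GPU
    multigpu: tests that require two or more GPUs
```

Each CI job would then select its subset with `pytest -m singlegpu` or `pytest -m multigpu`, and individual tests are tagged with decorators such as `@pytest.mark.multigpu`.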
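A minimal sketch of why only the rank-0 path is currently exercised: Lightning-style launchers export `LOCAL_RANK` into each spawned worker process, but code running directly in the main pytest process never sees it set, so any rank-dependent branch defaults to rank 0. The helper name below is hypothetical:

```python
import os

def local_rank() -> int:
    # Distributed launchers set LOCAL_RANK for each spawned worker.
    # The main pytest process has no launcher, so the variable is
    # absent there and this falls back to 0 — which is why only the
    # LOCAL_RANK=0 code path gets tested.
    return int(os.environ.get("LOCAL_RANK", "0"))
```

A follow-up test would need to simulate or actually spawn a worker with `LOCAL_RANK` set to a nonzero value to cover the other ranks.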
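The follow-up concern is that each rank's dataloader receives the correct shard of the dataset. The core sharding logic can be illustrated with a simplified, pure-Python stand-in for the round-robin index assignment that samplers like PyTorch's `DistributedSampler` perform (shuffling and padding omitted; the function name is hypothetical):

```python
def shard_indices(num_samples: int, rank: int, world_size: int) -> list:
    # Assign sample indices to ranks round-robin: rank r gets
    # indices r, r + world_size, r + 2 * world_size, ...
    return list(range(rank, num_samples, world_size))
```

A rank-aware test could call this (or the real sampler) with `rank > 0` and assert that the resulting shards are disjoint and together cover the whole dataset.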