Enabling data-parallel multi-GPU training #1188

Merged: 16 commits merged into main from torch/multi-gpu on Jul 10, 2023
Conversation

@marcromeyn commented Jul 6, 2023

This PR enables data-parallel multi-GPU training and adds auto-initialization of a Model.

It also introduces `singlegpu` and `multigpu` pytest markers, which split the GPU CI GitHub Actions workflow into two jobs: one for the 1-GPU runner and one for the multi-GPU (2-GPU) runner.
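As a rough illustration of the marker split, the sketch below shows how two such markers could be registered in a `conftest.py`. The marker descriptions and the `MARKERS` dictionary are assumptions made for this example; the registration in this PR may look different.

```python
# conftest.py (sketch): register the two markers so that `pytest -m singlegpu`
# and `pytest -m multigpu` can each select one half of the GPU test suite.
# The descriptions below are illustrative assumptions, not taken from the PR.
MARKERS = {
    "singlegpu": "test runs on the 1-GPU runner",
    "multigpu": "test needs at least 2 GPUs",
}


def pytest_configure(config):
    """pytest hook: declare the custom markers so pytest does not warn about unknown marks."""
    for name, description in MARKERS.items():
        config.addinivalue_line("markers", f"{name}: {description}")
```

With markers like these, a test is tagged with `@pytest.mark.singlegpu` or `@pytest.mark.multigpu`, and each CI job can select its share with `pytest -m singlegpu` or `pytest -m multigpu`.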

Follow-up: The test in tests/integration is incomplete. Lightning launches separate processes under the hood with the correct environment variables, such as LOCAL_RANK, but pytest stays in the main process and exercises only the LOCAL_RANK=0 case. A proper follow-up test should verify that the dataloader works correctly when, e.g., global_rank > 0.
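One way to cover non-zero ranks without spawning extra processes is to make the rank an explicit input and check every rank in a loop. The sketch below is not the dataloader from this PR: `read_local_rank`, `shard_for_rank`, and `shards_cover_everything` are hypothetical helpers, and the strided sharding scheme is an assumption chosen only to illustrate the idea.

```python
import os


def read_local_rank(environ=os.environ):
    """Return LOCAL_RANK from the environment, defaulting to 0 (the main process)."""
    return int(environ.get("LOCAL_RANK", "0"))


def shard_for_rank(items, rank, world_size):
    """Hypothetical strided sharding: rank r gets items r, r + world_size, ..."""
    if not 0 <= rank < world_size:
        raise ValueError(f"rank {rank} is outside world size {world_size}")
    return list(items[rank::world_size])


def shards_cover_everything(items, world_size):
    """Check that the shards of all ranks together contain every item exactly once."""
    shards = [shard_for_rank(items, r, world_size) for r in range(world_size)]
    seen = sorted(x for shard in shards for x in shard)
    return seen == sorted(items)
```

Because the rank is passed in rather than read implicitly, a single pytest process can assert on `rank=1` (or any other rank) directly, and `read_local_rank` can be exercised with a plain dictionary instead of the real environment.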

@marcromeyn self-assigned this Jul 6, 2023
@marcromeyn added the enhancement (New feature or request) and area/pytorch labels Jul 6, 2023
github-actions bot commented Jul 6, 2023

Documentation preview

https://nvidia-merlin.github.io/models/review/pr-1188

@edknv marked this pull request as ready for review July 10, 2023 12:37
@edknv merged commit 145e592 into main Jul 10, 2023
37 checks passed
@edknv deleted the torch/multi-gpu branch July 10, 2023 12:48