Support distributed training with torch.nn.DataParallel() #850
Comments
@lrzpellegrini Can you please tell me where exactly the changes (and tests) need to be made? This issue seems too broad to me. Thanks!
Hi @ashok-arjun, I'm already working on this. It's a very broad issue and it requires a lot of changes across different modules; I'm implementing it along with the checkpointing functionality.
I'm pasting some notes that I took during our last meeting. Keep in mind that I have very limited experience with distributed training.

Dataloading
Distributed training requires a DistributedSampler (i.e. samples are split among workers). Avalanche plugins may need to modify the data loading: how do we ensure they do not break with distributed training? Do we need to provide some guidelines? This is also a more general design question, since I think that online CL will require particular care with data loading (which we are ignoring right now at a great performance cost).

Add Lazy Metrics?
Ideally, we should not synchronize the data at every iteration. This is a problem because metrics are evaluated on the global values after each iteration. I think we can avoid this with some changes to the metrics.
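For reference, here is a minimal sketch of the standard PyTorch DistributedSampler pattern these notes refer to. It is generic PyTorch, not Avalanche-specific; the toy dataset, batch size, and gloo backend are placeholders.

```python
import torch
import torch.distributed as dist
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

# Toy stand-in for the real benchmark data.
train_dataset = TensorDataset(torch.randn(128, 10), torch.randint(0, 2, (128,)))

# Assumes the script is launched with torchrun, which sets RANK/WORLD_SIZE.
dist.init_process_group(backend="gloo")

# Each worker iterates only over its own shard of the dataset.
sampler = DistributedSampler(train_dataset, shuffle=True)
loader = DataLoader(train_dataset, batch_size=16, sampler=sampler)

for epoch in range(2):
    sampler.set_epoch(epoch)  # reshuffle the shards differently at each epoch
    for x, y in loader:
        pass  # forward/backward on this worker's shard only

dist.destroy_process_group()
```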
Tests
We should do a quick test to check that plugins have exactly the same behavior on some small benchmark (distributed vs. local), e.g. 2 epochs and 3 experiences on a toy benchmark, checking that the learning curves and the final parameters are exactly the same.

Plugins behavior
The big question is: what is the default behavior when a plugin tries to access an attribute that is distributed among workers? Do we synchronize and return the global version, or do we return the local one?
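A minimal sketch of what the lazy approach could look like with plain torch.distributed primitives; LazyMeanMetric is a made-up name for illustration, not part of the Avalanche API. The same pattern is one possible answer to the plugin question above: keep values local and pay the synchronization cost only when the global value is explicitly requested.

```python
import torch
import torch.distributed as dist


class LazyMeanMetric:
    """Illustrative only: accumulates values locally on each worker and
    synchronizes across workers only when the global mean is requested."""

    def __init__(self):
        self.local_sum = 0.0
        self.local_count = 0

    def update(self, value: float, n: int = 1):
        # No communication here: updates stay local to the worker.
        self.local_sum += value * n
        self.local_count += n

    def result(self) -> float:
        # Communication happens only at this point.
        totals = torch.tensor([self.local_sum, float(self.local_count)])
        if dist.is_available() and dist.is_initialized():
            dist.all_reduce(totals, op=dist.ReduceOp.SUM)
        return (totals[0] / totals[1]).item() if totals[1] > 0 else 0.0
```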
Is there any update on this functionality, and where does it sit on the timeline for the next Avalanche version?
We have an open PR #996, but we still need to test it more in depth. It should be ready for the next release, but it's a big and complex feature, so many things may go wrong. If you want to use Avalanche with distributed training, for the moment I would suggest defining your own distributed training loop (it doesn't have to be an Avalanche strategy) in which you call the benchmarks, models, and training plugins yourself.
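To make that suggestion concrete, below is a rough sketch of such a hand-rolled loop: plain torchrun plus DistributedDataParallel around an Avalanche benchmark. SplitMNIST and SimpleMLP are just stand-ins for your own benchmark and model, the hyperparameters are arbitrary, and this is not the API introduced by the linked PR.

```python
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler

from avalanche.benchmarks.classic import SplitMNIST  # stand-in benchmark
from avalanche.models import SimpleMLP               # stand-in model

# Launched with torchrun: one process per GPU, LOCAL_RANK set by the launcher.
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
device = torch.device(f"cuda:{local_rank}")

benchmark = SplitMNIST(n_experiences=5)
model = DDP(SimpleMLP(num_classes=10).to(device), device_ids=[local_rank])
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = torch.nn.CrossEntropyLoss()

for experience in benchmark.train_stream:
    # Each worker trains on its own shard of the current experience.
    sampler = DistributedSampler(experience.dataset, shuffle=True)
    loader = DataLoader(experience.dataset, batch_size=64, sampler=sampler)
    for epoch in range(2):
        sampler.set_epoch(epoch)
        for x, y, *_ in loader:  # Avalanche mini-batches may also carry task labels
            x, y = x.to(device), y.to(device)
            optimizer.zero_grad()
            loss = criterion(model(x), y)
            loss.backward()      # DDP averages gradients across workers here
            optimizer.step()

dist.destroy_process_group()
```

Evaluation, metric aggregation across workers, and any replay or regularization logic would still have to be handled manually (or by calling individual plugins yourself) in such a loop.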
🐝 Expected behavior
Support multiple GPU training.
How can I set multiple GPUs as the device and train the model across them?
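As a stopgap, and outside of any Avalanche-specific support, a plain PyTorch model can be spread over the visible GPUs with torch.nn.DataParallel. This is a generic sketch; the small model and batch size are arbitrary.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Linear(64, 2))

if torch.cuda.device_count() > 1:
    # Replicates the model on every visible GPU and splits each batch among them.
    model = nn.DataParallel(model)
model = model.to("cuda" if torch.cuda.is_available() else "cpu")

x = torch.randn(256, 10).to(next(model.parameters()).device)
out = model(x)  # batch is scattered across GPUs, outputs gathered on the first device
```

Note that DataParallel is single-process and single-node; for multi-node setups or better scaling, the PyTorch documentation recommends DistributedDataParallel instead.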