Support distributed training with torch.nn.DataParallel() #850

Open
lxqpku opened this issue Dec 13, 2021 · 5 comments
Labels: core (Core avl functionalities and assets)

Comments

@lxqpku

lxqpku commented Dec 13, 2021

🐝 Expected behavior

Support multi-GPU training.

How can I set multiple GPUs as the device and train the model on multiple GPUs?
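For reference, the plain PyTorch pattern for torch.nn.DataParallel (a minimal sketch independent of Avalanche; the model and data below are toy placeholders) looks like this. Note that PyTorch itself recommends DistributedDataParallel over DataParallel for multi-GPU training, which is where the discussion below ends up.

```python
import torch
import torch.nn as nn

# Toy model, purely for illustration.
model = nn.Linear(8, 2)
if torch.cuda.device_count() > 1:
    # DataParallel replicates the module on every visible GPU and splits
    # each input batch along dim 0 across the replicas.
    model = nn.DataParallel(model)
model = model.to("cuda")

# Toy batch; in practice this comes from your DataLoader.
x = torch.randn(64, 8, device="cuda")
y = torch.randint(0, 2, (64,), device="cuda")
loss = nn.functional.cross_entropy(model(x), y)
loss.backward()
```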

@lxqpku lxqpku added the bug Something isn't working label Dec 13, 2021
@ashok-arjun
Contributor

@lrzpellegrini Can you please tell me where exactly the changes (and tests) need to be made? This issue seems too broad to me.

Thanks!

@lrzpellegrini
Collaborator

Hi @ashok-arjun, I'm already working on this.

It's a very broad issue: it requires changes across several different modules, and I'm implementing it together with the checkpointing functionality.

@vlomonaco vlomonaco added core Core avl functionalities and assets and removed bug Something isn't working labels Jan 13, 2022
@AntonioCarta
Collaborator

I'm pasting some notes that I took during our last meeting. Keep in mind that I have very limited experience with distributed training.

Dataloading

Distributed training requires the use of a DistributedSampler (i.e., samples are split among workers). Avalanche plugins may need to modify the data loading: how do we ensure they do not break with distributed training? Do we need to provide some guidelines? This is also a more general design question, since I think that Online CL will require particular care in the data loading (which we are ignoring right now at a great performance cost).
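For context, this is the standard PyTorch pattern (a minimal sketch with a toy dataset, not the Avalanche API):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

# Toy dataset; in Avalanche this would be the dataset of a single experience.
dataset = TensorDataset(torch.randn(128, 8), torch.randint(0, 2, (128,)))

# Each process (rank) sees a disjoint shard of the indices.
sampler = DistributedSampler(dataset)  # requires torch.distributed to be initialized
loader = DataLoader(dataset, batch_size=16, sampler=sampler)

for epoch in range(2):
    sampler.set_epoch(epoch)  # keeps per-epoch shuffling consistent across processes
    for x, y in loader:
        pass  # forward/backward as usual
```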

Add Lazy Metrics?

Ideally, we should not synchronize the data at every iteration. This is a problem because metrics are currently evaluated on global values after each iteration. I think we can avoid this with some changes to the metrics:

  • By default, metrics are computed on global values, which means they are continuously synchronized (expensive).
  • We could add an additional operation to metrics (the metrics themselves, not the metric plugins), merge, which combines different metric instances. Not every metric needs this operation, but those that do implement it can be safely updated on the local state and then merged when we want to synchronize the data (see the sketch after this list).
    Ideally, the EvaluationPlugin should deal with the distributed code, while metrics could safely ignore it.
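A possible shape for such a metric (a hypothetical sketch; the merge method is not part of the current Avalanche metric API, and the inputs are assumed to be torch tensors):

```python
class MeanAccuracy:
    """Running accuracy kept on local values only; merged on demand."""

    def __init__(self):
        self.correct = 0
        self.total = 0

    def update(self, predictions, targets):
        # predictions/targets: torch tensors of equal shape, local to this worker.
        self.correct += (predictions == targets).sum().item()
        self.total += targets.numel()

    def merge(self, other):
        """Combine the state of two metric instances (e.g. from two workers)."""
        merged = MeanAccuracy()
        merged.correct = self.correct + other.correct
        merged.total = self.total + other.total
        return merged

    def result(self):
        return self.correct / self.total if self.total else 0.0
```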

Tests

We should write a quick test checking that plugins have exactly the same behavior on some small benchmark (distributed vs. local), e.g. 2 epochs, 3 experiences on a toy benchmark, verifying that the learning curves and the final parameters are exactly the same.
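The parameter check could be as simple as this helper (a sketch; the two models are assumed to be the locally trained and the distributed-trained copies):

```python
import torch

def assert_same_parameters(model_a, model_b, atol=1e-7):
    """Check that two trained models ended up with identical weights."""
    state_a, state_b = model_a.state_dict(), model_b.state_dict()
    assert state_a.keys() == state_b.keys()
    for name, tensor in state_a.items():
        assert torch.allclose(tensor, state_b[name].to(tensor.device), atol=atol), name
```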

Plugins behavior

The big question is: what is the default behavior when a plugin tries to access an attribute that is distributed among workers? Do we synchronize and return the global version, or do we return the local one?
Examples:

  • In EWC, if we compute the penalty on each worker, we effectively multiply the loss by the number of workers (and we don't want this).
  • In GEM, we project the (global) gradient. How does this work in a distributed setting?
  • In LwF, we want the distillation on the local inputs; otherwise we have a synchronization and we also sum the loss.

So, it seems to me that we have plugins that break in both cases (local vs. global default). Tests can help here to quickly identify which plugins are broken (a toy synchronization helper is sketched below).
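As a toy illustration of the local-vs-global question (a sketch, not the Avalanche plugin API), a plugin could synchronize a locally computed scalar explicitly and add the global average to the loss exactly once:

```python
import torch
import torch.distributed as dist

def global_mean(value: torch.Tensor) -> torch.Tensor:
    """Average a scalar tensor across all workers; no-op when not distributed."""
    if dist.is_available() and dist.is_initialized():
        value = value.clone()
        dist.all_reduce(value, op=dist.ReduceOp.SUM)
        value /= dist.get_world_size()
    return value

# Hypothetical use inside a regularization plugin:
#   penalty = compute_local_penalty(...)   # computed on the local shard only
#   loss = loss + global_mean(penalty)     # avoids scaling by the world size
```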

@AntonioCarta AntonioCarta changed the title How to support torch.nn.DataParallel() Support distributed training with torch.nn.DataParallel() Feb 11, 2022
@mattdl-meta

Is there any update on this functionality, and where does it sit on the timeline for the next Avalanche version?

@AntonioCarta
Collaborator

We have an open PR, #996, but we still need to test it more in depth. It should be ready for the next release, but it's a big and complex feature, so many things may go wrong. If you want to use Avalanche with distributed training right now, I would suggest defining your own distributed training loop (it doesn't have to be an Avalanche strategy) in which you call the benchmarks, models, and training plugins yourself.
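Such a hand-rolled loop could look roughly like this (an untested sketch, launched with torchrun; SplitMNIST and SimpleMLP are the standard Avalanche benchmark/model, everything else is plain PyTorch DDP):

```python
# torchrun --nproc_per_node=N this_script.py
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler
from avalanche.benchmarks.classic import SplitMNIST
from avalanche.models import SimpleMLP

dist.init_process_group("nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

benchmark = SplitMNIST(n_experiences=5)
model = DDP(SimpleMLP(num_classes=10).cuda(), device_ids=[local_rank])
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = torch.nn.CrossEntropyLoss()

for experience in benchmark.train_stream:
    sampler = DistributedSampler(experience.dataset)
    loader = DataLoader(experience.dataset, batch_size=128, sampler=sampler)
    for epoch in range(1):
        sampler.set_epoch(epoch)
        for x, y, *_ in loader:  # Avalanche datasets also yield a task label
            x, y = x.cuda(), y.cuda()
            optimizer.zero_grad()
            criterion(model(x), y).backward()
            optimizer.step()

dist.destroy_process_group()
```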
