Support distributed training with torch.nn.DataParallel() #850

Open
lxqpku opened this issue Dec 13, 2021 · 5 comments
Labels: core (Core avl functionalities and assets)

Comments

@lxqpku

lxqpku commented Dec 13, 2021

🐝 Expected behavior

Support multi-GPU training.

How can I set multiple GPUs as the device and train the model on multiple GPUs?
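For reference, the plain PyTorch pattern for torch.nn.DataParallel (a minimal sketch independent of Avalanche; the model and data below are toy placeholders) looks like this. Note that PyTorch itself recommends DistributedDataParallel over DataParallel for multi-GPU training, which is where the discussion below ends up.

```python
import torch
import torch.nn as nn

# Toy model, purely for illustration.
model = nn.Linear(8, 2)
if torch.cuda.device_count() > 1:
    # DataParallel replicates the module on every visible GPU and splits
    # each input batch along dim 0 across the replicas.
    model = nn.DataParallel(model)
model = model.to("cuda")

# Toy batch; in practice this comes from your DataLoader.
x = torch.randn(64, 8, device="cuda")
y = torch.randint(0, 2, (64,), device="cuda")
loss = nn.functional.cross_entropy(model(x), y)
loss.backward()
```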

@lxqpku lxqpku added the bug Something isn't working label Dec 13, 2021
@ashok-arjun
Contributor

@lrzpellegrini Can you please tell me where exactly the changes (and tests) need to be made? This issue seems too broad to me.

Thanks!

@lrzpellegrini
Collaborator

Hi @ashok-arjun, I'm already working on this.

It's a very broad issue: it requires changes across several different modules, and I'm implementing it together with the checkpointing functionality.

@vlomonaco vlomonaco added core Core avl functionalities and assets and removed bug Something isn't working labels Jan 13, 2022
@AntonioCarta
Collaborator

I'm pasting some notes that I took during our last meeting. Keep in mind that I have very limited experience with distributed training.

Dataloading

Distributed training requires the use of a DistributedSampler (i.e., samples are split among workers). Avalanche plugins may need to modify the data loading: how do we ensure they do not break with distributed training? Do we need to provide some guidelines? This is also a more general design question, since I think that Online CL will require particular care in the data loading (which we are ignoring right now at a great performance cost).
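For context, this is the standard PyTorch pattern (a minimal sketch with a toy dataset, not the Avalanche API):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

# Toy dataset; in Avalanche this would be the dataset of a single experience.
dataset = TensorDataset(torch.randn(128, 8), torch.randint(0, 2, (128,)))

# Each process (rank) sees a disjoint shard of the indices.
sampler = DistributedSampler(dataset)  # requires torch.distributed to be initialized
loader = DataLoader(dataset, batch_size=16, sampler=sampler)

for epoch in range(2):
    sampler.set_epoch(epoch)  # keeps per-epoch shuffling consistent across processes
    for x, y in loader:
        pass  # forward/backward as usual
```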

Add Lazy Metrics?

Ideally, we should not synchronize the data at every iteration. This is a problem because metrics are currently evaluated on global values after each iteration. I think we can avoid this with some changes to the metrics:

  • By default, metrics are computed on global values, which means they are continuously synchronized (expensive).
  • We could add an additional operation to metrics (the metrics themselves, not the metric plugins), merge, which combines different metric instances. Not every metric needs this operation, but those that do implement it can be safely updated on the local state and then merged when we want to synchronize the data (see the sketch after this list).
    Ideally, the EvaluationPlugin should deal with the distributed code, while metrics could safely ignore it.
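A possible shape for such a metric (a hypothetical sketch; the merge method is not part of the current Avalanche metric API, and the inputs are assumed to be torch tensors):

```python
class MeanAccuracy:
    """Running accuracy kept on local values only; merged on demand."""

    def __init__(self):
        self.correct = 0
        self.total = 0

    def update(self, predictions, targets):
        # predictions/targets: torch tensors of equal shape, local to this worker.
        self.correct += (predictions == targets).sum().item()
        self.total += targets.numel()

    def merge(self, other):
        """Combine the state of two metric instances (e.g. from two workers)."""
        merged = MeanAccuracy()
        merged.correct = self.correct + other.correct
        merged.total = self.total + other.total
        return merged

    def result(self):
        return self.correct / self.total if self.total else 0.0
```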

Tests

We should write a quick test checking that plugins have exactly the same behavior on some small benchmark (distributed vs. local), e.g. 2 epochs, 3 experiences on a toy benchmark, verifying that the learning curves and the final parameters are exactly the same.
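The parameter check could be as simple as this helper (a sketch; the two models are assumed to be the locally trained and the distributed-trained copies):

```python
import torch

def assert_same_parameters(model_a, model_b, atol=1e-7):
    """Check that two trained models ended up with identical weights."""
    state_a, state_b = model_a.state_dict(), model_b.state_dict()
    assert state_a.keys() == state_b.keys()
    for name, tensor in state_a.items():
        assert torch.allclose(tensor, state_b[name].to(tensor.device), atol=atol), name
```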

Plugins behavior

The big question is: what is the default behavior when a plugin tries to access an attribute that is distributed among workers? Do we synchronize and return the global version, or do we return the local one?
Examples:

  • In EWC, if we compute the penalty on each worker, we effectively multiply the loss by the number of workers (and we don't want this).
  • In GEM, we project the (global) gradient. How does this work in a distributed setting?
  • In LwF, we want the distillation on the local inputs; otherwise we have a synchronization and we also sum the loss.

So, it seems to me that we have plugins that break in both cases (local vs. global default). Tests can help here to quickly identify which plugins are broken (a toy synchronization helper is sketched below).
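As a toy illustration of the local-vs-global question (a sketch, not the Avalanche plugin API), a plugin could synchronize a locally computed scalar explicitly and add the global average to the loss exactly once:

```python
import torch
import torch.distributed as dist

def global_mean(value: torch.Tensor) -> torch.Tensor:
    """Average a scalar tensor across all workers; no-op when not distributed."""
    if dist.is_available() and dist.is_initialized():
        value = value.clone()
        dist.all_reduce(value, op=dist.ReduceOp.SUM)
        value /= dist.get_world_size()
    return value

# Hypothetical use inside a regularization plugin:
#   penalty = compute_local_penalty(...)   # computed on the local shard only
#   loss = loss + global_mean(penalty)     # avoids scaling by the world size
```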

@AntonioCarta AntonioCarta changed the title How to support torch.nn.DataParallel() Support distributed training with torch.nn.DataParallel() Feb 11, 2022
@mattdl-meta

Is there any update on this functionality, and where does it sit on the timeline for the next Avalanche version?

@AntonioCarta
Collaborator

We have an open PR, #996, but we still need to test it more in depth. It should be ready for the next release, but it's a big and complex feature, so many things may go wrong. If you want to use Avalanche with distributed training right now, I would suggest defining your own distributed training loop (it doesn't have to be an Avalanche strategy) in which you call the benchmarks, models, and training plugins yourself.
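Such a hand-rolled loop could look roughly like this (an untested sketch, launched with torchrun; SplitMNIST and SimpleMLP are the standard Avalanche benchmark/model, everything else is plain PyTorch DDP):

```python
# torchrun --nproc_per_node=N this_script.py
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler
from avalanche.benchmarks.classic import SplitMNIST
from avalanche.models import SimpleMLP

dist.init_process_group("nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

benchmark = SplitMNIST(n_experiences=5)
model = DDP(SimpleMLP(num_classes=10).cuda(), device_ids=[local_rank])
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = torch.nn.CrossEntropyLoss()

for experience in benchmark.train_stream:
    sampler = DistributedSampler(experience.dataset)
    loader = DataLoader(experience.dataset, batch_size=128, sampler=sampler)
    for epoch in range(1):
        sampler.set_epoch(epoch)
        for x, y, *_ in loader:  # Avalanche datasets also yield a task label
            x, y = x.cuda(), y.cuda()
            optimizer.zero_grad()
            criterion(model(x), y).backward()
            optimizer.step()

dist.destroy_process_group()
```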
