I had an example working before, and now it should also train. The interesting thing is that it spawns the processes, but all of them do exactly the same work 8 times. So from that base, what seems to be missing is a distributed sampler: if it is true that they are all doing the same thing, then each process already has the data and only needs to train on its own part of the whole job and return the results to the master to aggregate after each batch or after training. Anyway, I think it is good to know why it works this way without following the "other suggested ways".
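For example (a minimal sketch, not the actual code -- `train_ds` and `bs` stand in for whatever the example already builds), each process could get its own shard like this:

```python
import torch
import torch_xla.core.xla_model as xm

# Shard the dataset per process so each of the 8 spawned processes trains on
# its own slice of the indices instead of all doing identical work.
sampler = torch.utils.data.distributed.DistributedSampler(
    train_ds,                          # placeholder for the example's dataset
    num_replicas=xm.xrt_world_size(),  # 8 on an 8-core TPU
    rank=xm.get_ordinal(),             # this process's index, 0..7
    shuffle=True)
train_dl = torch.utils.data.DataLoader(train_ds, batch_size=bs, sampler=sampler)
```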
-
Having made it work on a single TPU core, the next step is to enable the library to run on multiple TPU cores.
@tyoc213 and @tmabraham have already started some work on this, and hopefully I (@butchland) will start
working on this soon as well.
My approach is that we should always start with a simple working example using raw pytorch xla code
and slowly migrate that working code to use fastai components.
Here, the most crucial components are:
- The distributed dataloaders -- I believe @tmabraham already has some code for these, derived from multi-GPU pytorch code using DDP.
- The pickling of the model state when it is spawned via the XLA call `xmp.spawn` for `num_cores=8` (my suspicion is that this is no longer an issue).
- The syncing of the gradients across these 8 processes, which is done in the `xm.optimizer_step` call and is already handled by the existing package.

So the remaining issue is how to fit the training loop into the `xmp.spawn` call, since it requires being passed a function and some flags -- see this code example. As per that example, all the objects (datasets, dataloaders) except the model (which has been wrapped by `xmp.MpModelWrapper`) are created within the training loop function -- the train loop function actually encompasses more than just the training loop, and includes instantiating the dataloaders, dataset setup, etc.
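Roughly, that pattern looks something like this (sketch only, under the same assumptions as the linked example -- `build_model`, `build_dataset`, and `FLAGS` are placeholders, not its actual code):

```python
import torch
import torch_xla.core.xla_model as xm
import torch_xla.distributed.parallel_loader as pl
import torch_xla.distributed.xla_multiprocessing as xmp

# Wrap the model once, outside the spawned function, so each process re-uses
# the same host copy instead of pickling/rebuilding its own.
WRAPPED_MODEL = xmp.MpModelWrapper(build_model())    # build_model() is a placeholder

def train_fn(rank, flags):
    device = xm.xla_device()
    model = WRAPPED_MODEL.to(device)

    # Datasets and dataloaders are created *inside* the spawned function,
    # sharded per process with a DistributedSampler.
    train_ds = build_dataset(flags)                   # placeholder
    sampler = torch.utils.data.distributed.DistributedSampler(
        train_ds, num_replicas=xm.xrt_world_size(), rank=xm.get_ordinal(), shuffle=True)
    train_dl = torch.utils.data.DataLoader(train_ds, batch_size=flags['bs'], sampler=sampler)

    opt = torch.optim.SGD(model.parameters(), lr=flags['lr'])
    loss_func = torch.nn.CrossEntropyLoss()

    for epoch in range(flags['epochs']):
        device_dl = pl.ParallelLoader(train_dl, [device]).per_device_loader(device)
        for xb, yb in device_dl:
            opt.zero_grad()
            loss = loss_func(model(xb), yb)
            loss.backward()
            xm.optimizer_step(opt)    # also all-reduces the gradients across the 8 cores

# xmp.spawn needs a function plus flags; nprocs=8 runs one process per TPU core.
# xmp.spawn(train_fn, args=(FLAGS,), nprocs=8, start_method='fork')
```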
In contrast, fastai has a more limited scope in its train loop and instantiates the datasets, dataloaders etc outside of the loop.
As an intermediate step, I will probably try to replace the pytorch distributed dataloader with a working fastai-based, XLA-aware distributed dataloader (not tied to a GPU).
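A speculative sketch of what that might look like -- this is not fastai's existing DistributedDL, and the class name is made up; it just shards a fastai DataLoader by the process's XLA ordinal:

```python
import torch_xla.core.xla_model as xm
from fastai.data.load import DataLoader

class XLAShardedDL(DataLoader):
    "Hypothetical: a fastai DataLoader whose batches come only from this process's shard."
    def get_idxs(self):
        idxs = list(super().get_idxs())
        # Stride the index list by world size, offset by this process's ordinal.
        # A real implementation would also need to synchronize the shuffle seed
        # across processes so the shards don't overlap.
        return idxs[xm.get_ordinal()::xm.xrt_world_size()]

# usage inside the spawned training function (hypothetical):
# train_dl = XLAShardedDL(train_ds, bs=64, shuffle=True)
```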
Then slowly replace parts of the training loop until most of it is subsumed by the fastai learner class.
This learner class will not be sharing a lot across the different spawned functions so it will have very little state.
One way to do this might be to create a "smaller" mini-learner class that is instantiated in the spawned training loop function, so that calling fit will actually launch the spawned processes, create the mini-learner class for each spawned process, configure the dataloaders, and re-use the wrapped model...
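Purely speculative sketch of that shape -- `XLALearner`, `XLAMiniLearner` and `make_dls` are made-up names, not an existing fastai API:

```python
import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_multiprocessing as xmp

class XLAMiniLearner:
    "Hypothetical: tiny per-process learner holding only the state one core needs."
    def __init__(self, dls, model, loss_func, opt):
        self.dls, self.model, self.loss_func, self.opt = dls, model, loss_func, opt

    def fit(self, n_epoch):
        for _ in range(n_epoch):
            for xb, yb in self.dls.train:
                self.opt.zero_grad()
                loss = self.loss_func(self.model(xb), yb)
                loss.backward()
                xm.optimizer_step(self.opt)    # gradient sync across the cores

class XLALearner:
    "Hypothetical: the outer object the user calls; holds only immutable/picklable config."
    def __init__(self, wrapped_model, flags):
        self.wrapped_model, self.flags = wrapped_model, flags   # wrapped_model is an xmp.MpModelWrapper

    def fit(self, n_epoch):
        # calling fit launches the spawned processes, one per TPU core
        xmp.spawn(self._train_fn, args=(n_epoch,), nprocs=8, start_method='fork')

    def _train_fn(self, rank, n_epoch):
        device = xm.xla_device()
        model = self.wrapped_model.to(device)    # re-use the wrapped model
        dls = make_dls(self.flags, device)       # placeholder: build this process's dataloaders
        opt = self.flags['opt_func'](model.parameters(), lr=self.flags['lr'])
        XLAMiniLearner(dls, model, self.flags['loss_func'], opt).fit(n_epoch)
```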
I will need to confirm that sharing immutable objects across the spawned processes should be ok and not cause problems...
Ok, this is enough speculative thinking for now.