I had an example working before, and now it should also train. The interesting thing is that it spawns the processes, but all of them do exactly the same work 8 times. So from that base, what seems to be missing is a distributed sampler: if it is true that they are all doing the same thing, then each process already has the data and only needs to train on its own part of the whole job and return the results to the master to aggregate after each batch or after training. Anyway, I think it is good to know why it works this way without following the "other suggested ways".
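For example (a minimal sketch, not the actual code -- `train_ds` and `bs` stand in for whatever the example already builds), each process could get its own shard like this:

```python
import torch
import torch_xla.core.xla_model as xm

# Shard the dataset per process so each of the 8 spawned processes trains on
# its own slice of the indices instead of all doing identical work.
sampler = torch.utils.data.distributed.DistributedSampler(
    train_ds,                          # placeholder for the example's dataset
    num_replicas=xm.xrt_world_size(),  # 8 on an 8-core TPU
    rank=xm.get_ordinal(),             # this process's index, 0..7
    shuffle=True)
train_dl = torch.utils.data.DataLoader(train_ds, batch_size=bs, sampler=sampler)
```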
-
Having made it work on a single TPU core, the next step is to enable the library to run on multiple TPU cores.
@tyoc213 and @tmabraham have already started some work on this, and hopefully I (@butchland) will start
working on this soon as well.
My approach is that we should always start with a simple working example using raw pytorch xla code
and slowly migrate that working code to use fastai components.
Here, the most crucial components are:
- The distributed dataloaders -- I believe @tmabraham already has some code for these, derived from multi-GPU pytorch code using DDP.
- The pickling of the model state when it is spawned via the XLA call `xmp.spawn` for `num_cores=8` (my suspicion is that this is no longer an issue).
- The syncing of the gradients across these 8 processes, which is done in the `xm.optimizer_step` call and is already handled by the existing package.

So the remaining issue is how to fit the training loop into the `xmp.spawn` call, since it requires being passed a function and some flags -- see this code example. As per that example, all the objects (datasets, dataloaders) except the model (which has been wrapped by `xmp.MpModelWrapper`) are created within the training loop function -- the train loop function actually encompasses more than just the training loop, and includes instantiating the dataloaders, dataset setup, etc.
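Roughly, that pattern looks something like this (sketch only, under the same assumptions as the linked example -- `build_model`, `build_dataset`, and `FLAGS` are placeholders, not its actual code):

```python
import torch
import torch_xla.core.xla_model as xm
import torch_xla.distributed.parallel_loader as pl
import torch_xla.distributed.xla_multiprocessing as xmp

# Wrap the model once, outside the spawned function, so each process re-uses
# the same host copy instead of pickling/rebuilding its own.
WRAPPED_MODEL = xmp.MpModelWrapper(build_model())    # build_model() is a placeholder

def train_fn(rank, flags):
    device = xm.xla_device()
    model = WRAPPED_MODEL.to(device)

    # Datasets and dataloaders are created *inside* the spawned function,
    # sharded per process with a DistributedSampler.
    train_ds = build_dataset(flags)                   # placeholder
    sampler = torch.utils.data.distributed.DistributedSampler(
        train_ds, num_replicas=xm.xrt_world_size(), rank=xm.get_ordinal(), shuffle=True)
    train_dl = torch.utils.data.DataLoader(train_ds, batch_size=flags['bs'], sampler=sampler)

    opt = torch.optim.SGD(model.parameters(), lr=flags['lr'])
    loss_func = torch.nn.CrossEntropyLoss()

    for epoch in range(flags['epochs']):
        device_dl = pl.ParallelLoader(train_dl, [device]).per_device_loader(device)
        for xb, yb in device_dl:
            opt.zero_grad()
            loss = loss_func(model(xb), yb)
            loss.backward()
            xm.optimizer_step(opt)    # also all-reduces the gradients across the 8 cores

# xmp.spawn needs a function plus flags; nprocs=8 runs one process per TPU core.
# xmp.spawn(train_fn, args=(FLAGS,), nprocs=8, start_method='fork')
```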
In contrast, fastai has a more limited scope in its train loop and instantiates the datasets, dataloaders etc outside of the loop.
As an intermediate step, I will probably try to replace the pytorch distributed dataloader with a working fastai-based, XLA-aware distributed dataloader (not tied to a GPU).
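A speculative sketch of what that might look like -- this is not fastai's existing DistributedDL, and the class name is made up; it just shards a fastai DataLoader by the process's XLA ordinal:

```python
import torch_xla.core.xla_model as xm
from fastai.data.load import DataLoader

class XLAShardedDL(DataLoader):
    "Hypothetical: a fastai DataLoader whose batches come only from this process's shard."
    def get_idxs(self):
        idxs = list(super().get_idxs())
        # Stride the index list by world size, offset by this process's ordinal.
        # A real implementation would also need to synchronize the shuffle seed
        # across processes so the shards don't overlap.
        return idxs[xm.get_ordinal()::xm.xrt_world_size()]

# usage inside the spawned training function (hypothetical):
# train_dl = XLAShardedDL(train_ds, bs=64, shuffle=True)
```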
Then slowly replace parts of the training loop until most of it is subsumed by the fastai learner class.
This learner class will not be sharing a lot across the different spawned functions so it will have very little state.
One way to do this might be to create a "smaller" mini-learner class that is instantiated in the spawned training loop function, so that calling fit will actually launch the spawned processes, create the mini-learner class for each spawned process, configure the dataloaders, and re-use the wrapped model...
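Purely speculative sketch of that shape -- `XLALearner`, `XLAMiniLearner` and `make_dls` are made-up names, not an existing fastai API:

```python
import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_multiprocessing as xmp

class XLAMiniLearner:
    "Hypothetical: tiny per-process learner holding only the state one core needs."
    def __init__(self, dls, model, loss_func, opt):
        self.dls, self.model, self.loss_func, self.opt = dls, model, loss_func, opt

    def fit(self, n_epoch):
        for _ in range(n_epoch):
            for xb, yb in self.dls.train:
                self.opt.zero_grad()
                loss = self.loss_func(self.model(xb), yb)
                loss.backward()
                xm.optimizer_step(self.opt)    # gradient sync across the cores

class XLALearner:
    "Hypothetical: the outer object the user calls; holds only immutable/picklable config."
    def __init__(self, wrapped_model, flags):
        self.wrapped_model, self.flags = wrapped_model, flags   # wrapped_model is an xmp.MpModelWrapper

    def fit(self, n_epoch):
        # calling fit launches the spawned processes, one per TPU core
        xmp.spawn(self._train_fn, args=(n_epoch,), nprocs=8, start_method='fork')

    def _train_fn(self, rank, n_epoch):
        device = xm.xla_device()
        model = self.wrapped_model.to(device)    # re-use the wrapped model
        dls = make_dls(self.flags, device)       # placeholder: build this process's dataloaders
        opt = self.flags['opt_func'](model.parameters(), lr=self.flags['lr'])
        XLAMiniLearner(dls, model, self.flags['loss_func'], opt).fit(n_epoch)
```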
I will need to confirm that sharing immutable objects across the spawned processes should be ok and not cause problems...
Ok, this is enough speculative thinking for now.