Multi-GPU training and expected epochs #9
@bieltura Hi! Thank you for your interest in the Grad-TTS work.
Hi @ivanvovk, thanks for answering the questions. Here's an update that may be helpful for future development: DataParallel cannot be implemented in the current setup. Apart from that, I have found that with multiple GPUs the code breaks when, for some item in a batch, the audio sample is shorter than the 2-second speech fragment. The solution is to force the shape to always be these 2 seconds (in frames).
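A minimal sketch of such a fix, assuming PyTorch mel tensors of shape `(n_mels, T)`; `fix_segment` and `out_size` are hypothetical names, not part of the Grad-TTS codebase:

```python
import torch
import torch.nn.functional as F

def fix_segment(mel: torch.Tensor, out_size: int) -> torch.Tensor:
    """Force a mel spectrogram of shape (n_mels, T) to exactly
    `out_size` frames (the 2-second fragment length in frames)."""
    n_frames = mel.shape[-1]
    if n_frames < out_size:
        # Zero-pad short utterances on the right of the time axis
        return F.pad(mel, (0, out_size - n_frames))
    # Otherwise cut a random window of `out_size` frames
    start = torch.randint(0, n_frames - out_size + 1, (1,)).item()
    return mel[:, start:start + out_size]
```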
I still find that 2300 epochs on a single GPU is a very large amount of training. Did you follow any procedure to check when the model converged to the best checkpoint? Thanks!
@bieltura it is usually preferred to use DistributedDataParallel rather than DataParallel. As for checking the convergence of the model, we just checked the quality at 10 iterations, and when it became good, we stopped training. Nothing special.
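For reference, a minimal DistributedDataParallel sketch launched with `torchrun`; the linear model and the single training step are stand-ins, not the Grad-TTS training loop:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun --nproc_per_node=<num_gpus> train_ddp.py sets LOCAL_RANK
    local_rank = int(os.environ["LOCAL_RANK"])
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(local_rank)

    # Stand-in model; in practice this would be the GradTTS module
    model = torch.nn.Linear(80, 80).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

    # One dummy training step, just to show the DDP flow
    x = torch.randn(16, 80).cuda(local_rank)
    loss = model(x).pow(2).mean()
    loss.backward()
    optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```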
Thanks! As a side note, we have been using an energy metric (predicted-target difference) to check whether samples are "good enough" for evaluation. As you mentioned in your paper, the diffusion loss is not informative about model convergence, since it is computed at timesteps sampled at random from 0 to T. Here are some plots that may be useful to you as well. Feel free to close the issue once you have read it :) And again, thanks for everything.
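The comment does not give the exact definition of the metric; one common choice, sketched here under that assumption, takes the per-frame energy as the L2 norm over mel bins and compares the predicted and target contours:

```python
import torch

def energy_difference(pred_mel: torch.Tensor, target_mel: torch.Tensor) -> torch.Tensor:
    """Mean absolute difference between predicted and target energy
    contours; both inputs have shape (n_mels, T) with matching T."""
    pred_energy = pred_mel.norm(dim=0)      # per-frame L2 energy, (T,)
    target_energy = target_mel.norm(dim=0)  # (T,)
    return (pred_energy - target_energy).abs().mean()
```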
In my case, I found Accelerate very useful: https://github.com/huggingface/accelerate. It only takes a few extra lines of code.
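A minimal sketch of what those lines look like with Accelerate (the model, optimizer, and data are placeholders); run it with `accelerate launch script.py`:

```python
import torch
from accelerate import Accelerator

accelerator = Accelerator()  # picks up the available devices/processes

# Placeholder model, optimizer, and data; substitute the real ones
model = torch.nn.Linear(80, 80)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loader = torch.utils.data.DataLoader(torch.randn(256, 80), batch_size=16)

# Accelerate wraps everything for the current (multi-)GPU setup
model, optimizer, loader = accelerator.prepare(model, optimizer, loader)

for batch in loader:
    optimizer.zero_grad()
    loss = model(batch).pow(2).mean()
    accelerator.backward(loss)  # replaces loss.backward()
    optimizer.step()
```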
Hi,
First of all, thanks for the nice paper and the released code. I am testing your model on a different dataset, and two questions came up:

1. Can the model be trained on multiple GPUs with the current code?
2. How many epochs should I expect until the model converges?

Thanks!