Feature Request Thread #467
Comments
@dathudeptrai What do you think of voice cloning?
I would like to see better componentization. There are similar blocks (groups of layers) implemented multiple times, like positional encoding, speaker encoding, or the postnet. Others rely on configuration specific to one particular network, like the self-attention block used in FastSpeech. With a little rework to make those blocks more generic, it would be easier to create new network types. Similarly with losses: for example, the HiFi-GAN training contains a lot of code duplicated from MB-MelGAN. Moreover, most of the training and inference scripts look quite similar, and I believe they too can be refactored to, once again, compose the final solution from more generic components. And by the way, I really appreciate your work and think you did a great job! :)
Hmm, in this case, users just need to read and understand HiFi-GAN without reading MB-MelGAN.
@unparalleled-ysj What do you mean by voice cloning? You mean zero-shot?
@unparalleled-ysj That's what I was thinking about. Relatedly, @dathudeptrai I saw https://github.com/dipjyoti92/SC-WaveRNN, could SC-MB-MelGAN be possible?
@unparalleled-ysj @ZDisket That is also what I'm doing. I'm trying to train a multi-speaker FastSpeech2 model, replacing the current hard-coded speaker ID with a bottleneck feature extracted by a voiceprint model. That continuous, soft-coded bottleneck feature represents a speaker-related space, so if an unknown voice is close to a voice in the training space, voice cloning may be realized. But judging from the results of current open-source projects, it is a difficult problem and certainly not as simple as I described. Do you have any good ideas?
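For concreteness, here is a minimal sketch of that kind of conditioning: instead of a speaker-ID embedding lookup, an externally extracted d-vector/bottleneck feature is projected and added to the encoder hidden states. The layer name, the 384-dim hidden size, and the 256-dim d-vector are illustrative assumptions, not the repo's actual FastSpeech2 code.

```python
import tensorflow as tf

class SpeakerConditioning(tf.keras.layers.Layer):
    """Project an externally extracted speaker embedding (e.g. a d-vector
    from a voiceprint model) and add it to the encoder hidden states,
    replacing the usual speaker-ID embedding lookup."""

    def __init__(self, hidden_size, **kwargs):
        super().__init__(**kwargs)
        self.projection = tf.keras.layers.Dense(hidden_size)

    def call(self, encoder_hidden_states, speaker_embedding):
        # encoder_hidden_states: [batch, time, hidden_size]
        # speaker_embedding:     [batch, d_vector_dim], continuous, not an ID
        spk = self.projection(speaker_embedding)   # [batch, hidden_size]
        spk = tf.expand_dims(spk, axis=1)          # broadcast over time
        return encoder_hidden_states + spk


# Usage sketch: hidden size and d-vector dimension are illustrative.
layer = SpeakerConditioning(hidden_size=384)
encoder_out = tf.random.normal([2, 100, 384])
d_vector = tf.random.normal([2, 256])
conditioned = layer(encoder_out, d_vector)         # [2, 100, 384]
```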
One possible option for better multi-speaker or multi-style support would be to add a Variational Auto-Encoder which automatically extracts this voice/style "fingerprint".
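As a reference point, a toy sketch of such a VAE-style reference encoder: a reference mel spectrogram is compressed to a latent "style/voice" vector through a reparameterised Gaussian, with a KL term to be added to the training loss. All layer sizes here are assumptions for illustration, not a proposal for the repo's actual architecture.

```python
import tensorflow as tf

class VAEReferenceEncoder(tf.keras.Model):
    """Toy VAE-style reference encoder: compresses a reference mel
    spectrogram into a latent 'style/voice' vector."""

    def __init__(self, latent_dim=16, **kwargs):
        super().__init__(**kwargs)
        self.conv_stack = tf.keras.Sequential([
            tf.keras.layers.Conv2D(32, 3, strides=2, activation="relu"),
            tf.keras.layers.Conv2D(64, 3, strides=2, activation="relu"),
            tf.keras.layers.GlobalAveragePooling2D(),
        ])
        self.mean = tf.keras.layers.Dense(latent_dim)
        self.logvar = tf.keras.layers.Dense(latent_dim)

    def call(self, mel, training=False):
        # mel: [batch, frames, n_mels] -> add a channel axis for Conv2D
        h = self.conv_stack(tf.expand_dims(mel, -1))
        mean, logvar = self.mean(h), self.logvar(h)
        eps = tf.random.normal(tf.shape(mean))
        z = mean + tf.exp(0.5 * logvar) * eps      # reparameterisation trick
        # KL divergence term to add to the training loss
        kl = -0.5 * tf.reduce_mean(1 + logvar - tf.square(mean) - tf.exp(logvar))
        return z, kl


# Usage sketch with random data.
encoder = VAEReferenceEncoder(latent_dim=16)
style, kl_loss = encoder(tf.random.normal([2, 120, 80]))
```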
LightSpeech https://arxiv.org/abs/2102.04040 |
@abylouw early version of LightSpeech here https://github.com/nmfisher/TensorFlowTTS/tree/lightspeech It's training pretty well on a Mandarin dataset so far (~30k steps), but I haven't validated it formally against LJSpeech (to be honest, I don't think I'll get the time, so I'd prefer someone else to help out). This is just the final architecture mentioned in the paper (so I haven't implemented any NAS). Also, the paper only mentions the final per-layer SeparableConvolution kernel sizes, not the number of attention heads, so I've emailed one of the authors to ask if he can provide that too. Some samples @ 170k (decoded with pre-trained MB-MelGAN): https://github.com/nmfisher/lightspeech_samples/tree/main/v1_170k Noticeably worse quality than FastSpeech2 at the same number of training steps, and it falls apart on longer sequences.
Great! :D How about the number of parameters in LightSpeech?
@dathudeptrai @nmfisher I also tried to reduce the model size of FastSpeech2 (not including the PostNet module), with parameter importance in the order: Encoder Dim > 1d_CNN > Attention = Stacks_Num. Reducing the encoder dim is the most effective way to shrink the model. For the fastspeech2.baker.v2.yaml config, the model size dropped from 64M to 28M, and the PostNet's share of the total model size rose from 27% to 62%. Interestingly, on the Baker dataset the quality does not get worse after deleting the PostNet during inference, so the final model size is only 10M. Based on these experiments, the model size may have the potential to be reduced even further.
Yeah, the PostNet is only there for faster convergence; we can drop it after the training process.
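A small sketch of what dropping the PostNet can look like at inference time with TensorFlowTTS, assuming the FastSpeech2 inference call returns the mel spectrogram both before and after the PostNet; the pretrained model name and the exact output order are assumptions, so check the inference signature for your version.

```python
import tensorflow as tf
from tensorflow_tts.inference import AutoProcessor, TFAutoModel

# Assumed pretrained checkpoint name; any FastSpeech2 model should behave similarly.
processor = AutoProcessor.from_pretrained("tensorspeech/tts-fastspeech2-ljspeech-en")
fastspeech2 = TFAutoModel.from_pretrained("tensorspeech/tts-fastspeech2-ljspeech-en")

input_ids = processor.text_to_sequence("The post net can be skipped at inference.")
outputs = fastspeech2.inference(
    input_ids=tf.expand_dims(tf.convert_to_tensor(input_ids, dtype=tf.int32), 0),
    speaker_ids=tf.convert_to_tensor([0], dtype=tf.int32),
    speed_ratios=tf.convert_to_tensor([1.0], dtype=tf.float32),
    f0_ratios=tf.convert_to_tensor([1.0], dtype=tf.float32),
    energy_ratios=tf.convert_to_tensor([1.0], dtype=tf.float32),
)
# Assumption: the first two outputs are mel-before-PostNet and mel-after-PostNet.
mel_before, mel_after = outputs[0], outputs[1]
# Feeding mel_before to the vocoder skips the PostNet refinement entirely.
```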
@nmfisher 6M params is small enough. Did you get a good result with LightSpeech? How fast is it?
I'm sorry, I haven't studied LightSpeech in detail, and I have a question: what are the differences, in detail, between a small-size FastSpeech and LightSpeech? @nmfisher
@luan78zaoha LightSpeech uses SeparableConvolution :D.
@dathudeptrai I used TFLite for inference on an x86 Linux platform. The result: the RTFs of the 45M and 10M models were 0.018 and 0.01, respectively.
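For anyone wanting to reproduce numbers like these, a rough sketch of measuring RTF with the TFLite interpreter. The file name, input shape, and the 22050 Hz sample rate / 256 hop size are assumptions, and a real FastSpeech2 TFLite export has additional inputs (speaker IDs, speed/f0/energy ratios) that would also need to be set.

```python
import time
import numpy as np
import tensorflow as tf

interpreter = tf.lite.Interpreter(model_path="fastspeech2.tflite")  # assumed file name
interpreter.allocate_tensors()
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Dummy token IDs; a real export also expects speaker/speed/f0/energy inputs.
input_ids = np.random.randint(1, 100, size=(1, 50), dtype=np.int32)
interpreter.resize_tensor_input(input_details[0]["index"], input_ids.shape)
interpreter.allocate_tensors()
interpreter.set_tensor(input_details[0]["index"], input_ids)

start = time.perf_counter()
interpreter.invoke()
elapsed = time.perf_counter() - start

mel = interpreter.get_tensor(output_details[0]["index"])  # [1, frames, n_mels]
audio_seconds = mel.shape[1] * 256 / 22050                # frames * hop_size / sample_rate
print(f"RTF = {elapsed / audio_seconds:.3f}")
```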
Let's wait for @luan78zaoha to report the LightSpeech RTF :D.
As @dathudeptrai mentioned, LightSpeech uses SeparableConvolution in place of regular Convolution, but then also passes various FastSpeech2 configurations through neural architecture search to determine the best configuration of kernel sizes/attention heads/attention dimensions. Basically they use NAS to find the smallest configuration that performs as well as FastSpeech2. |
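A rough illustration of the parameter saving from that swap; the channel count and kernel size are illustrative, not the paper's exact configuration.

```python
import tensorflow as tf

regular = tf.keras.layers.Conv1D(256, kernel_size=9, padding="same")
separable = tf.keras.layers.SeparableConv1D(256, kernel_size=9, padding="same")

x = tf.random.normal([1, 100, 256])   # [batch, time, channels]
_ = regular(x), separable(x)          # build the layers so weights exist

print("Conv1D params:         ", regular.count_params())    # ~ 9*256*256 + 256
print("SeparableConv1D params:", separable.count_params())  # ~ 9*256 + 256*256 + 256
```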
@dathudeptrai @xuefeng Can you help me implement HiFi-GAN with FastSpeech2 on Android? I have tried to implement it using the pretrained models from https://github.com/tulasiram58827/TTS_TFLite/tree/main/models and changing Line 73 in 9a107d9.
Not really a request, just wondering about the use of librosa. Is there any reason to prefer librosa over, say, tf.signal.stft and tf.signal.linear_to_mel_weight_matrix? The tf.signal ops seem extremely performant.
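For concreteness, a sketch of a mel spectrogram computed entirely with tf.signal, which can run inside a tf.data pipeline or on GPU; the parameter values are common TTS settings chosen for illustration. One caveat: tf.signal.linear_to_mel_weight_matrix uses the HTK mel formula, so its output is not numerically identical to librosa's default (Slaney) filterbank.

```python
import tensorflow as tf

def tf_melspectrogram(audio, sr=22050, n_fft=1024, hop=256, n_mels=80):
    # audio: 1-D float tensor of samples
    stft = tf.signal.stft(audio, frame_length=n_fft, frame_step=hop, fft_length=n_fft)
    magnitude = tf.abs(stft)                          # [frames, n_fft//2 + 1]
    mel_matrix = tf.signal.linear_to_mel_weight_matrix(
        num_mel_bins=n_mels,
        num_spectrogram_bins=n_fft // 2 + 1,
        sample_rate=sr,
        lower_edge_hertz=0.0,
        upper_edge_hertz=sr / 2,
    )
    mel = tf.matmul(magnitude, mel_matrix)            # [frames, n_mels]
    return tf.math.log(mel + 1e-6)

audio = tf.random.normal([22050])                     # 1 second of noise
print(tf_melspectrogram(audio).shape)
```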
I have no doubt this project would work wonders on Voice cloning. |
With the FastSpeech TFLite model, is it possible to convert it to run on an Edge TPU?
Will Tacotron2 support full integer quantization in TFLite?
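Full-integer quantization is also the prerequisite for the Edge TPU compilation asked about above. A sketch of the standard post-training recipe with TFLiteConverter; whether it actually works for Tacotron2 depends on which ops the exported graph uses (the dynamic decoder loop is the usual blocker), and the saved-model path and representative input shape are assumptions.

```python
import numpy as np
import tensorflow as tf

# "saved_model_dir" and the representative input (token IDs of length 50) are assumptions.
converter = tf.lite.TFLiteConverter.from_saved_model("saved_model_dir")
converter.optimizations = [tf.lite.Optimize.DEFAULT]

def representative_dataset():
    for _ in range(100):
        # Yield sample inputs matching the model's input signature.
        yield [np.random.randint(1, 100, size=(1, 50)).astype(np.int32)]

converter.representative_dataset = representative_dataset
# Restrict to int8 kernels; also required for Edge TPU compilation.
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]

tflite_model = converter.convert()
with open("model_int8.tflite", "wb") as f:
    f.write(tflite_model)
```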
@dathudeptrai Can you help with implementing the forced-alignment attention loss for Tacotron2, as in this paper? I've managed to turn MFA durations into alignment matrices and put them in the dataloader, but replacing the regular guided attention loss with them only makes attention learning worse, both when fine-tuning and when training from scratch, according to eval results after 1k steps, whereas in the paper the PAG variant should be winning.
@ZDisket let me read the paper first :D. |
@dathudeptrai Since that post I've discovered that an MAE loss between the generated and forced attention does work to guide it, but it's so strong that it ends up hurting performance. That could be fixed with a low enough multiplier, such as 0.01, although I haven't tested it extensively since I abandoned it in favor of training a universal vocoder with a trick.
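A minimal sketch of the loss described above, using the small 0.01 multiplier mentioned: mean absolute error between the decoder's attention weights and attention matrices built from forced (MFA) alignments. Tensor shapes and masking details are assumptions.

```python
import tensorflow as tf

def forced_alignment_mae_loss(attention_weights, forced_attention, mask, weight=0.01):
    # attention_weights, forced_attention: [batch, decoder_steps, encoder_steps]
    # mask: same shape, 1.0 where both the frame and token are valid, else 0.0
    diff = tf.abs(attention_weights - forced_attention) * mask
    # Mean over valid positions only, scaled down so it does not dominate other losses.
    return weight * tf.reduce_sum(diff) / tf.maximum(tf.reduce_sum(mask), 1.0)
```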
This looks really interesting: |
@tts-nlp That looks like an implementation of Algorithm 1. For the second and third, they mention a shift-time transform.
Hey, I've seen a project about voice cloning recently. |
Anybody working on VQTTS? |
I have tried FastSpeech2 voice cloning based on AISHELL-3 and other data, 200 speakers in total, but it didn't work well. Maybe I couldn't train a good speaker embedding model, so I then used a WeNet/WeSpeaker pretrained model (Chinese) to extract the speaker embedding vector, but that also worked badly. Has anyone tried it? In addition, the TensorFlowTTS project is not very active; it hasn't been updated for more than a year.
I've just been looking at WeNet but haven't really made an appraisal; so far it seems "very Kaldi" :)
Don't hesitate to tell me what features you want in this repo :)))