Adaptive text-to-speech (TTS) has many important applications on edge devices, such as synthesizing personalized voices for the speech impaired, producing customized speech in translation apps, etc. However, existing models require either too much memory to adapt on the edge or too much computation for real-time inference on the edge. On the one hand, some auto-regressive TTS models can run inference in real-time on the edge, but the limited memory available on edge devices precludes training these models through backpropagation to adapt to unseen speakers. On the other hand, flow-based models are fully invertible, allowing efficient backpropagation with limited memory; however, the invertibility requirement of flow-based models reduces their expressivity, leading to larger and more expensive models to produce audio of the same fidelity. In this paper, we propose a flow-based adaptive TTS system with an extremely low computational cost, achieved by manipulating the dimensions of the "information bottleneck" between a series of flows. The system, which requires only 7.2G MACs for inference (42x smaller than its flow-based baselines), can run inference in real-time on the edge. And because it is flow-based, the system also has the potential to perform adaptation with the limited amount of memory available on the edge. Despite its low cost, we show empirically that the audio generated by our system matches target speakers' voices with no significant reduction in fidelity or audio naturalness compared to baseline models.
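As a rough illustration of the invertibility argument above, the sketch below implements a generic affine coupling layer, the building block behind WaveGlow/Blow-style flows (an illustrative assumption, not our exact architecture). Because the inverse is exact, a memory-constrained trainer can recompute a layer's inputs from its outputs during the backward pass instead of caching activations:

```python
# Generic affine coupling layer -- the invertible building block of
# WaveGlow/Blow-style flows. Illustrative sketch only, not the paper's model.
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    def __init__(self, channels, hidden=64):
        super().__init__()
        # Predict a log-scale and shift for one half of x from the other half.
        self.net = nn.Sequential(
            nn.Linear(channels // 2, hidden),
            nn.ReLU(),
            nn.Linear(hidden, channels),  # -> [log_s, t], each channels // 2 wide
        )

    def forward(self, x):
        xa, xb = x.chunk(2, dim=-1)
        log_s, t = self.net(xa).chunk(2, dim=-1)
        yb = xb * torch.exp(log_s) + t
        log_det = log_s.sum(dim=-1)        # contribution to the flow's log-likelihood
        return torch.cat([xa, yb], dim=-1), log_det

    def inverse(self, y):
        ya, yb = y.chunk(2, dim=-1)
        log_s, t = self.net(ya).chunk(2, dim=-1)
        xb = (yb - t) * torch.exp(-log_s)  # exact inverse: no stored activations needed
        return torch.cat([ya, xb], dim=-1)

layer = AffineCoupling(channels=8)
x = torch.randn(4, 8)
y, log_det = layer(x)
assert torch.allclose(layer.inverse(y), x, atol=1e-5)  # round trip recovers the input
```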
Audio examples: https://low-cost-adaptive-tts.github.io/SqueezeFlow-Demo
Baseline code:
- Blow: https://github.com/joansj/blow
- WaveGlow: https://github.com/NVIDIA/waveglow
Our SqueezeFlow (SF) model achieves the audio naturalness (MOS) and speaker-similarity scores below. GT is ground-truth audio; Blow is our baseline, which uses ground-truth audio for voice conversion; SF is our model, which converts and generates audio from mel-spectrograms; SF+WG is an ablation model that uses SF for conversion and WaveGlow for audio generation. Seen and Unseen indicate whether the target speaker was seen during training. For details on our evaluation, please refer to our paper.
Table 1 (Similarity to Source - the lower the better; Similarity to Target - the higher the better):
Model | MOS | Similarity to Source | Similarity to Target |
---|---|---|---|
GT (target) | 4.28±0.05 | 13.98% | 82.86% |
Blow, Seen | 2.85±0.06 | 18.3% | 36.5% |
SF, Seen | 2.80±0.06 | 5.3% | 38.8% |
SF+WG, Seen | 2.96±0.06 | 17.5% | 35.1% |
SF, Unseen | 2.55±0.06 | 15.8% | 22.3% |
SF+WG, Unseen | 2.84±0.06 | 14.3% | 24.9% |
For an ablation study on the SqueezeFlow vocoder, we compare it against WaveGlow (see the table below). We also introduce 4 variants of the SqueezeFlow vocoder in our paper and present their results here. Details on the evaluation are in the paper.
Table 2 (vocoder comparison on LJ Speech):
Model | length | n_channels | MACs (G) | Reduction | MOS |
---|---|---|---|---|---|
WaveGlow | 2048 | 8 | 228.9 | 1x | 4.57±0.04 |
SqueezeFlow-V | 128 | 256 | 3.78 | 60x | 4.07±0.06 |
SqueezeFlow-V-64L | 64 | 256 | 2.16 | 106x | 3.77±0.05 |
SqueezeFlow-V-128S | 128 | 128 | 1.06 | 214x | 3.79±0.05 |
SqueezeFlow-V-64S | 64 | 128 | 0.68 | 332x | 2.74±0.04 |
In our code, we use codenames: the SqueezeFlow converter is named after Blow and called `blow-mel`, and the SqueezeFlow vocoder is called `SqueezeWave`. The corresponding code is in the respective folders.
Suggested steps are:
- Clone the repository.
- Create a conda environment (you can use the `environment.yml` file):
  ```
  conda env create -n test -f environment.yml
  conda activate test
  pip install torch==1.4.0
  pip install tensorflow
  pip install tensorboardX
  ```
- Install Apex. Note that we use version `5754fa7a961b4b6dd7651436bd29dd5712bc134f`:
  ```
  cd ../
  git clone https://www.github.com/nvidia/apex
  cd apex
  python setup.py install
  ```
- Download the VCTK dataset.
- `cd blow-mel/src`
- To preprocess the audio files for VCTK:
  ```
  python preprocess.py --path_in=../VCTK/wav48 --extension=.wav --path_out=../VCTK_22kHz --sr=22050
  ```
  - Our code expects audio filenames to be in the form `<speaker/class_id>_<utterance/track_id>_whatever.extension`, where the elements inside `<>` do not contain the character `_` and IDs need not be consecutive (example: `s001_u045_xxx.wav`). If your data is not in this format, run or adapt the script `misc/rename_dataset.py`.
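  For illustration only (this helper is hypothetical and not part of the repository), parsing the IDs out of a filename in the expected format looks like:

  ```python
  # Hypothetical helper: split <speaker_id>_<utterance_id>_whatever.extension into IDs.
  from pathlib import Path

  def parse_ids(filename):
      parts = Path(filename).stem.split("_")
      if len(parts) < 2:
          raise ValueError("unexpected filename: " + filename)
      return parts[0], parts[1]   # (speaker_id, utterance_id)

  print(parse_ids("s001_u045_xxx.wav"))   # -> ('s001', 'u045')
  ```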
- Prepare the VCTK dataset for seen/unseen speakers:
  ```
  mv VCTK_22kHz VCTK_22kHz_108
  mkdir VCTK_22kHz_10
  mkdir VCTK_22kHz_98
  ```
  - To use the same unseen speakers as us, copy the folders `p236 p245 p251 p259 p264 p283 p288 p293 p298 p360` to `VCTK_22kHz_10` and the remaining speakers to `VCTK_22kHz_98` (one possible way to do this is sketched below). Otherwise, randomly choose 10 speakers to exclude from the dataset.
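  A minimal Python illustration of that copy step (not part of the repository; it assumes the preprocessed speaker folders sit directly under `VCTK_22kHz_108`):

  ```python
  # Copy the 10 held-out speakers to VCTK_22kHz_10 and the rest to VCTK_22kHz_98.
  import shutil
  from pathlib import Path

  unseen = {"p236", "p245", "p251", "p259", "p264",
            "p283", "p288", "p293", "p298", "p360"}

  for spk_dir in Path("VCTK_22kHz_108").iterdir():
      if not spk_dir.is_dir():
          continue
      dst_root = "VCTK_22kHz_10" if spk_dir.name in unseen else "VCTK_22kHz_98"
      shutil.copytree(spk_dir, Path(dst_root) / spk_dir.name)
  ```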
- Download our pretrained models. The folder contains:
  - `SqueezeFlow-C-108.*`: converter trained on the full VCTK, used for inference for seen speakers
  - `SqueezeFlow-C-98.*`: converter trained on the 98-speaker VCTK, used for inference for unseen speakers
  - `SqueezeFlow-C-10.*`: `SqueezeFlow-C-98` adapted on the 10-unseen-speaker VCTK split, with embeddings for the 10 unseen speakers
  - `SqueezeFlow-V-VCTK`: vocoder trained on the full VCTK
- Inference for Seen speakers:
  ```
  python3 synthesize.py --path_data=../VCTK_22kHz_108 --base_fn_model=SqueezeFlow-C-108 --path_out=SqueezeFlow-C-108/out --sw_path=SqueezeFlow-V-VCTK --convert
  ```
  - Make sure your output path exists.
- Inference for Unseen speakers:
  ```
  python3 synthesize_unseen.py --path_data_root=[parent folder of VCTK_22kHz_10 and VCTK_22kHz_98] --adapted_base_fn_model=SqueezeFlow-C-10 --trained_base_fn_model=SqueezeFlow-C-98 --path_out=[your output path] --sw_path=SqueezeFlow-V-VCTK --convert
  ```
## Vocoder Training

```
cd SqueezeWave-adaptive
```
- Check `configs/config_a256_c256.json` and make sure all data paths are correct.
- Start training:
  ```
  python3 train.py -c configs/config_a256_c256.json
  ```
  - Substitute `train.py` with `distributed.py` if using multi-GPU.
- Generate results:
  ```
  python3 inference.py -c configs/config_a256_c256.json -w [path to your best checkpoint] -o [your output path]
  ```
  Checkpoints are saved every 2000 iterations by default.
## Converter: Seen

```
cd blow-mel/src
```
- Start training:
  ```
  python train.py --path_data=VCTK_22kHz_108 --base_fn_out=[your checkpoint path + experiment name] --model=blow --sw_path=[your best vocoder checkpoint] --multigpu
  ```
- Generate results using the SqueezeWave vocoder:
  ```
  python3 synthesize.py --path_data=../VCTK_22kHz_108 --base_fn_model=[your checkpoint path + experiment name] --path_out=[your output path] --sw_path=[your best vocoder checkpoint] --convert
  ```
  - The best converter checkpoint is automatically saved to `[your checkpoint path + experiment name]` during training.
  - This step saves both the converted mel-spectrogram (as `*.pt`) and that mel-spectrogram turned into speech (using SqueezeWave).
- To generate results using WaveGlow, clone the WaveGlow repository, go into its folder, and run:
  ```
  python3 inference.py -f <(ls [your SqueezeWave output path]/*.pt) -w waveglow_256channels_ljs_v3.pt -o [your WaveGlow output path] --is_fp16 -s 0.6
  ```
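If you want to sanity-check the converted mel-spectrograms before vocoding, a quick look could be as follows (this assumes the `*.pt` files are plain tensors saved with `torch.save`; adjust if the repository stores them differently):

```python
# Inspect a few converted mel-spectrograms produced by synthesize.py.
import glob
import torch

out_dir = "path/to/your/output"   # placeholder: the --path_out used above
for path in sorted(glob.glob(out_dir + "/*.pt"))[:3]:
    mel = torch.load(path, map_location="cpu")
    print(path, getattr(mel, "shape", type(mel)))
```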
## Converter: Unseen

```
cd blow-mel/src
```
- Training is similar to the "Converter: Seen" section:
  ```
  python train.py --path_data=VCTK_22kHz_98 --base_fn_out=[your checkpoint path + experiment name] --model=blow --sw_path=[your best vocoder checkpoint] --multigpu
  ```
- Adapt to unseen speakers:
  ```
  python adapt.py --path_data=VCTK_22kHz_10 --base_fn_model=[your checkpoint path + experiment name] --path_out=[path to save your adapted model] --sw_path=[your best vocoder checkpoint] --sbatch=256 --multigpu --lr=1e-2
  ```
- Generate results on Unseen speakers using SqueezeWave:
  ```
  python3 synthesize_unseen.py --path_data_root=[parent folder of VCTK_22kHz_10 and VCTK_22kHz_98] --adapted_base_fn_model=[best checkpoint to your adapted model on 10 speakers] --trained_base_fn_model=[best checkpoint to your trained model on 98 speakers] --path_out=[your output path] --sw_path=[your best vocoder checkpoint] --convert
  ```
  - As in the previous section, the converted mel-spectrograms and generated audio will be saved to `[your output path]`.
- Use the same steps as in the previous section to generate audio with the WaveGlow vocoder.
## Reproducing Table 2 (LJ Speech Data)

All the scripts below are run in the `SqueezeWave` folder, not in `SqueezeWave-adaptive`:

```
cd SqueezeWave
```
- Download our pretrained vocoders. We provide 4 pretrained models, as described in the paper.
- Download the mel-spectrograms.
- Generate audio. Replace `SqueezeWave.pt` with the name of the specific pretrained model:
  ```
  python3 inference.py -f <(ls mel_spectrograms/*.pt) -w SqueezeWave.pt -o . --is_fp16 -s 0.6
  ```
- Download LJ Speech Data. We assume all the waves are stored in the directory `data/`.
- Make a list of the file names to use for training/testing:
  ```
  ls data/*.wav | tail -n+10 > train_files.txt
  ls data/*.wav | head -n10 > test_files.txt
  ```
- We provide 4 model configurations, with the audio length and number of channels specified in Table 2 above. The configuration files are under the `configs` directory. To choose the model you want to train, select the corresponding configuration file.
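  To double-check the settings before training, you can load and print the chosen configuration (this just dumps whatever keys the file contains and assumes nothing about the schema):

  ```python
  # Pretty-print a SqueezeWave training configuration.
  import json

  with open("configs/config_a256_c128.json") as f:
      config = json.load(f)
  print(json.dumps(config, indent=2, sort_keys=True))
  ```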
- Train your SqueezeWave model:
  ```
  mkdir checkpoints
  python train.py -c configs/config_a256_c128.json
  ```
  - For multi-GPU training, replace `train.py` with `distributed.py`. Only tested with a single node and NCCL.
  - For mixed precision training, set `"fp16_run": true` in `config.json`.
- Make test set mel-spectrograms:
  ```
  mkdir -p eval/mels
  python3 mel2samp.py -f test_files.txt -o eval/mels -c configs/config_a128_c256.json
  ```
- Run inference on the test data:
  ```
  ls eval/mels > eval/mel_files.txt
  sed -i -e 's_.*_eval/mels/&_' eval/mel_files.txt
  mkdir -p eval/output
  python3 inference.py -f eval/mel_files.txt -w checkpoints/SqueezeWave_10000 -o eval/output --is_fp16 -s 0.6
  ```
  Replace `SqueezeWave_10000` with the checkpoint you want to test.