Adaptive text-to-speech (TTS) has many important applications on edge devices, such as synthesizing personalized voices for the speech impaired, producing customized speech in translation apps, etc. However, existing models require either too much memory to adapt on the edge or too much computation for real-time inference on the edge. On the one hand, some auto-regressive TTS models can run inference in real-time on the edge, but the limited memory available on edge devices precludes training these models through backpropagation to adapt to unseen speakers. On the other hand, flow-based models are fully invertible, allowing efficient backpropagation with limited memory; however, the invertibility requirement of flow-based models reduces their expressivity, leading to larger and more expensive models to produce audio of the same fidelity. In this paper, we propose a flow-based adaptive TTS system with an extremely low computational cost, achieved by manipulating the dimensions of the "information bottleneck" between a series of flows. The system, which requires only 7.2G MACs for inference (42x smaller than its flow-based baselines), can run inference in real-time on the edge. And because it is flow-based, the system also has the potential to perform adaptation with the limited amount of memory available on the edge. Despite its low cost, we show empirically that the audio generated by our system matches target speakers' voices with no significant reduction in fidelity or audio naturalness compared to baseline models.
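As a rough illustration of the invertibility argument above, the sketch below implements a generic affine coupling layer, the building block behind WaveGlow/Blow-style flows (an illustrative assumption, not our exact architecture). Because the inverse is exact, a memory-constrained trainer can recompute a layer's inputs from its outputs during the backward pass instead of caching activations:

```python
# Generic affine coupling layer -- the invertible building block of
# WaveGlow/Blow-style flows. Illustrative sketch only, not the paper's model.
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    def __init__(self, channels, hidden=64):
        super().__init__()
        # Predict a log-scale and shift for one half of x from the other half.
        self.net = nn.Sequential(
            nn.Linear(channels // 2, hidden),
            nn.ReLU(),
            nn.Linear(hidden, channels),  # -> [log_s, t], each channels // 2 wide
        )

    def forward(self, x):
        xa, xb = x.chunk(2, dim=-1)
        log_s, t = self.net(xa).chunk(2, dim=-1)
        yb = xb * torch.exp(log_s) + t
        log_det = log_s.sum(dim=-1)        # contribution to the flow's log-likelihood
        return torch.cat([xa, yb], dim=-1), log_det

    def inverse(self, y):
        ya, yb = y.chunk(2, dim=-1)
        log_s, t = self.net(ya).chunk(2, dim=-1)
        xb = (yb - t) * torch.exp(-log_s)  # exact inverse: no stored activations needed
        return torch.cat([ya, xb], dim=-1)

layer = AffineCoupling(channels=8)
x = torch.randn(4, 8)
y, log_det = layer(x)
assert torch.allclose(layer.inverse(y), x, atol=1e-5)  # round trip recovers the input
```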
Audio examples: https://low-cost-adaptive-tts.github.io/SqueezeFlow-Demo
Baseline code:
- Blow: https://github.com/joansj/blow
- WaveGlow: https://github.com/NVIDIA/waveglow
Our SqueezeFlow (SF) model achieves the audio naturalness (MOS) and speaker-similarity scores below. GT is ground-truth audio; Blow is our baseline, which uses ground-truth audio for voice conversion; SF is our model, which converts and generates audio from mel-spectrograms; SF+WG is an ablation model that uses SF for conversion and WaveGlow for audio generation. Seen and Unseen indicate whether the target speaker was seen during training. For details on our evaluation, please refer to our paper.
Table 1 (Similarity to Source - the lower the better; Similarity to Target - the higher the better):
Model | MOS | Similarity to Source | Similarity to Target |
---|---|---|---|
GT (target) | 4.28±0.05 | 13.98% | 82.86% |
Blow, Seen | 2.85±0.06 | 18.3% | 36.5% |
SF, Seen | 2.80±0.06 | 5.3% | 38.8% |
SF+WG, Seen | 2.96±0.06 | 17.5% | 35.1% |
SF, Unseen | 2.55±0.06 | 15.8% | 22.3% |
SF+WG, Unseen | 2.84±0.06 | 14.3% | 24.9% |
For an ablation study on the SqueezeFlow vocoder, we compare it against WaveGlow (see the table below). We also introduce 4 variants of the SqueezeFlow vocoder in our paper and present their results here. Details on the evaluation are in the paper.
Table 2 (vocoder comparison on LJ Speech):
Model | length | n_channels | MACs (G) | Reduction | MOS |
---|---|---|---|---|---|
WaveGlow | 2048 | 8 | 228.9 | 1x | 4.57±0.04 |
SqueezeFlow-V | 128 | 256 | 3.78 | 60x | 4.07±0.06 |
SqueezeFlow-V-64L | 64 | 256 | 2.16 | 106x | 3.77±0.05 |
SqueezeFlow-V-128S | 128 | 128 | 1.06 | 214x | 3.79±0.05 |
SqueezeFlow-V-64S | 64 | 128 | 0.68 | 332x | 2.74±0.04 |
In our code, we use codenames: the SqueezeFlow converter is named after Blow and called `blow-mel`, and the SqueezeFlow vocoder is called `SqueezeWave`. The corresponding code is in the respective folders.
Suggested steps are:
- Clone the repository.
- Create a conda environment (you can use the `environment.yml` file):
  ```
  conda env create -n test -f environment.yml
  conda activate test
  pip install torch==1.4.0
  pip install tensorflow
  pip install tensorboardX
  ```
- Install Apex. Note that we use version `5754fa7a961b4b6dd7651436bd29dd5712bc134f`:
  ```
  cd ../
  git clone https://www.github.com/nvidia/apex
  cd apex
  python setup.py install
  ```
- Download the VCTK dataset.
- `cd blow-mel/src`
- To preprocess the audio files for VCTK:
  ```
  python preprocess.py --path_in=../VCTK/wav48 --extension=.wav --path_out=../VCTK_22kHz --sr=22050
  ```
  - Our code expects audio filenames to be in the form `<speaker/class_id>_<utterance/track_id>_whatever.extension`, where the elements inside `<>` do not contain the character `_` and IDs need not be consecutive (example: `s001_u045_xxx.wav`). If your data is not in this format, run or adapt the script `misc/rename_dataset.py`.
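  For illustration only (this helper is hypothetical and not part of the repository), parsing the IDs out of a filename in the expected format looks like:

  ```python
  # Hypothetical helper: split <speaker_id>_<utterance_id>_whatever.extension into IDs.
  from pathlib import Path

  def parse_ids(filename):
      parts = Path(filename).stem.split("_")
      if len(parts) < 2:
          raise ValueError("unexpected filename: " + filename)
      return parts[0], parts[1]   # (speaker_id, utterance_id)

  print(parse_ids("s001_u045_xxx.wav"))   # -> ('s001', 'u045')
  ```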
- Prepare the VCTK dataset for seen/unseen speakers:
  ```
  mv VCTK_22kHz VCTK_22kHz_108
  mkdir VCTK_22kHz_10
  mkdir VCTK_22kHz_98
  ```
  - To use the same unseen speakers as us, copy the folders `p236 p245 p251 p259 p264 p283 p288 p293 p298 p360` to `VCTK_22kHz_10` and the remaining speakers to `VCTK_22kHz_98` (one possible way to do this is sketched below). Otherwise, randomly choose 10 speakers to exclude from the dataset.
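  A minimal Python illustration of that copy step (not part of the repository; it assumes the preprocessed speaker folders sit directly under `VCTK_22kHz_108`):

  ```python
  # Copy the 10 held-out speakers to VCTK_22kHz_10 and the rest to VCTK_22kHz_98.
  import shutil
  from pathlib import Path

  unseen = {"p236", "p245", "p251", "p259", "p264",
            "p283", "p288", "p293", "p298", "p360"}

  for spk_dir in Path("VCTK_22kHz_108").iterdir():
      if not spk_dir.is_dir():
          continue
      dst_root = "VCTK_22kHz_10" if spk_dir.name in unseen else "VCTK_22kHz_98"
      shutil.copytree(spk_dir, Path(dst_root) / spk_dir.name)
  ```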
- Download our pretrained models. The folder contains:
  - `SqueezeFlow-C-108.*`: converter trained on the full VCTK, used for inference for seen speakers
  - `SqueezeFlow-C-98.*`: converter trained on the 98-speaker VCTK, used for inference for unseen speakers
  - `SqueezeFlow-C-10.*`: `SqueezeFlow-C-98` adapted on the 10-unseen-speaker VCTK split, with embeddings for the 10 unseen speakers
  - `SqueezeFlow-V-VCTK`: vocoder trained on the full VCTK
- Inference for Seen speakers:
  ```
  python3 synthesize.py --path_data=../VCTK_22kHz_108 --base_fn_model=SqueezeFlow-C-108 --path_out=SqueezeFlow-C-108/out --sw_path=SqueezeFlow-V-VCTK --convert
  ```
  - Make sure your output path exists.
- Inference for Unseen speakers:
  ```
  python3 synthesize_unseen.py --path_data_root=[parent folder of VCTK_22kHz_10 and VCTK_22kHz_98] --adapted_base_fn_model=SqueezeFlow-C-10 --trained_base_fn_model=SqueezeFlow-C-98 --path_out=[your output path] --sw_path=SqueezeFlow-V-VCTK --convert
  ```
## Vocoder Training

```
cd SqueezeWave-adaptive
```
- Check `configs/config_a256_c256.json` and make sure all data paths are correct.
- Start training:
  ```
  python3 train.py -c configs/config_a256_c256.json
  ```
  - Substitute `train.py` with `distributed.py` if using multi-GPU.
- Generate results:
  ```
  python3 inference.py -c configs/config_a256_c256.json -w [path to your best checkpoint] -o [your output path]
  ```
  Checkpoints are saved every 2000 iterations by default.
## Converter: Seen

```
cd blow-mel/src
```
- Start training:
  ```
  python train.py --path_data=VCTK_22kHz_108 --base_fn_out=[your checkpoint path + experiment name] --model=blow --sw_path=[your best vocoder checkpoint] --multigpu
  ```
- Generate results using the SqueezeWave vocoder:
  ```
  python3 synthesize.py --path_data=../VCTK_22kHz_108 --base_fn_model=[your checkpoint path + experiment name] --path_out=[your output path] --sw_path=[your best vocoder checkpoint] --convert
  ```
  - The best converter checkpoint is automatically saved to `[your checkpoint path + experiment name]` during training.
  - This step saves both the converted mel-spectrogram (as `*.pt`) and that mel-spectrogram turned into speech (using SqueezeWave).
- To generate results using WaveGlow, clone the WaveGlow repository, go into its folder, and run:
  ```
  python3 inference.py -f <(ls [your SqueezeWave output path]/*.pt) -w waveglow_256channels_ljs_v3.pt -o [your WaveGlow output path] --is_fp16 -s 0.6
  ```
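If you want to sanity-check the converted mel-spectrograms before vocoding, a quick look could be as follows (this assumes the `*.pt` files are plain tensors saved with `torch.save`; adjust if the repository stores them differently):

```python
# Inspect a few converted mel-spectrograms produced by synthesize.py.
import glob
import torch

out_dir = "path/to/your/output"   # placeholder: the --path_out used above
for path in sorted(glob.glob(out_dir + "/*.pt"))[:3]:
    mel = torch.load(path, map_location="cpu")
    print(path, getattr(mel, "shape", type(mel)))
```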
## Converter: Unseen

```
cd blow-mel/src
```
- Training is similar to the "Converter: Seen" section:
  ```
  python train.py --path_data=VCTK_22kHz_98 --base_fn_out=[your checkpoint path + experiment name] --model=blow --sw_path=[your best vocoder checkpoint] --multigpu
  ```
- Adapt to unseen speakers:
  ```
  python adapt.py --path_data=VCTK_22kHz_10 --base_fn_model=[your checkpoint path + experiment name] --path_out=[path to save your adapted model] --sw_path=[your best vocoder checkpoint] --sbatch=256 --multigpu --lr=1e-2
  ```
- Generate results on Unseen speakers using SqueezeWave:
  ```
  python3 synthesize_unseen.py --path_data_root=[parent folder of VCTK_22kHz_10 and VCTK_22kHz_98] --adapted_base_fn_model=[best checkpoint to your adapted model on 10 speakers] --trained_base_fn_model=[best checkpoint to your trained model on 98 speakers] --path_out=[your output path] --sw_path=[your best vocoder checkpoint] --convert
  ```
  - As in the previous section, the converted mel-spectrograms and generated audio will be saved to `[your output path]`.
- Use the same steps as in the previous section to generate audio with the WaveGlow vocoder.
## Reproducing Table 2 (LJ Speech Data)

All the scripts below are run in the `SqueezeWave` folder, not in `SqueezeWave-adaptive`:

```
cd SqueezeWave
```
- Download our pretrained vocoders. We provide 4 pretrained models, as described in the paper.
- Download the mel-spectrograms.
- Generate audio. Replace `SqueezeWave.pt` with the name of the specific pretrained model:
  ```
  python3 inference.py -f <(ls mel_spectrograms/*.pt) -w SqueezeWave.pt -o . --is_fp16 -s 0.6
  ```
- Download LJ Speech Data. We assume all the waves are stored in the directory `data/`.
- Make a list of the file names to use for training/testing:
  ```
  ls data/*.wav | tail -n+10 > train_files.txt
  ls data/*.wav | head -n10 > test_files.txt
  ```
- We provide 4 model configurations, with the audio length and number of channels specified in Table 2 above. The configuration files are under the `configs` directory. To choose the model you want to train, select the corresponding configuration file.
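  To double-check the settings before training, you can load and print the chosen configuration (this just dumps whatever keys the file contains and assumes nothing about the schema):

  ```python
  # Pretty-print a SqueezeWave training configuration.
  import json

  with open("configs/config_a256_c128.json") as f:
      config = json.load(f)
  print(json.dumps(config, indent=2, sort_keys=True))
  ```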
- Train your SqueezeWave model:
  ```
  mkdir checkpoints
  python train.py -c configs/config_a256_c128.json
  ```
  - For multi-GPU training, replace `train.py` with `distributed.py`. Only tested with a single node and NCCL.
  - For mixed precision training, set `"fp16_run": true` in `config.json`.
- Make test set mel-spectrograms:
  ```
  mkdir -p eval/mels
  python3 mel2samp.py -f test_files.txt -o eval/mels -c configs/config_a128_c256.json
  ```
- Run inference on the test data:
  ```
  ls eval/mels > eval/mel_files.txt
  sed -i -e 's_.*_eval/mels/&_' eval/mel_files.txt
  mkdir -p eval/output
  python3 inference.py -f eval/mel_files.txt -w checkpoints/SqueezeWave_10000 -o eval/output --is_fp16 -s 0.6
  ```
  Replace `SqueezeWave_10000` with the checkpoint you want to test.