Commit
Add a TTS recipe VITS on LJSpeech dataset (#1372)
* first commit
* replace phonemizer with g2p
* use Conformer as text encoder
* modify training script, clean codes
* rename directory
* convert text to tokens in data preparation stage
* fix tts_datamodule.py
* support onnx export and testing the exported onnx model
* add doc
* add README.md
* fix style
1 parent ae67f75, commit 0622dea
Showing 34 changed files with 7,517 additions and 2 deletions.
@@ -0,0 +1,7 @@
TTS
======

.. toctree::
   :maxdepth: 2

   ljspeech/vits
@@ -0,0 +1,113 @@
VITS
===============

This tutorial shows you how to train a VITS model
with the `LJSpeech <https://keithito.com/LJ-Speech-Dataset/>`_ dataset.

.. note::

   The VITS paper: `Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech <https://arxiv.org/pdf/2106.06103.pdf>`_

Data preparation
----------------

.. code-block:: bash

  $ cd egs/ljspeech/TTS
  $ ./prepare.sh

To run stage 1 to stage 5, use

.. code-block:: bash

  $ ./prepare.sh --stage 1 --stop_stage 5

Build Monotonic Alignment Search
--------------------------------

.. code-block:: bash

  $ cd vits/monotonic_align
  $ python setup.py build_ext --inplace
  $ cd ../../

Training
--------

.. code-block:: bash

  $ export CUDA_VISIBLE_DEVICES="0,1,2,3"
  $ ./vits/train.py \
    --world-size 4 \
    --num-epochs 1000 \
    --start-epoch 1 \
    --use-fp16 1 \
    --exp-dir vits/exp \
    --tokens data/tokens.txt \
    --max-duration 500

.. note::

   You can adjust the hyper-parameters to control the size of the VITS model and
   the training configurations. For more details, please run ``./vits/train.py --help``.

.. note::

   The training can take a long time (usually a couple of days).

Training logs, checkpoints and TensorBoard logs are saved in ``vits/exp``.
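
To get a feel for what ``--max-duration 500`` means in practice, here is a rough back-of-the-envelope sketch. It assumes that ``--max-duration`` is a per-batch budget in seconds of audio for each GPU process (check ``./vits/train.py --help`` to confirm), and it borrows the mean utterance duration (6.6 s) and total duration (23:55:18) from the LJSpeech cut statistics. The helper names are hypothetical, and the real numbers depend on how the sampler buckets utterances and on which subset is used for training.

```python
import math

MAX_DURATION_S = 500.0               # value passed to --max-duration
MEAN_UTT_S = 6.6                     # assumed mean utterance duration (seconds)
TOTAL_S = 23 * 3600 + 55 * 60 + 18   # assumed total audio, 23:55:18 in seconds
WORLD_SIZE = 4                       # value passed to --world-size


def utterances_per_batch(max_duration_s: float, mean_utt_s: float) -> int:
    """Approximate number of average-length utterances that fit in one batch."""
    return int(max_duration_s // mean_utt_s)


def batches_per_epoch(total_s: float, max_duration_s: float, world_size: int) -> int:
    """Approximate number of batches each GPU process sees per epoch."""
    total_batches = math.ceil(total_s / max_duration_s)
    return math.ceil(total_batches / world_size)


if __name__ == "__main__":
    print("utterances per batch (approx.):", utterances_per_batch(MAX_DURATION_S, MEAN_UTT_S))
    print("batches per GPU per epoch (approx.):", batches_per_epoch(TOTAL_S, MAX_DURATION_S, WORLD_SIZE))
```

If training runs out of GPU memory, lowering the duration budget is the usual first knob to try under this assumption, at the cost of more batches per epoch.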


Inference
---------

The inference part uses checkpoints saved by the training part, so you have to run the
training part first. It will save the ground-truth and generated wavs to the directory
``vits/exp/infer/epoch-*/wav``, e.g., ``vits/exp/infer/epoch-1000/wav``.

.. code-block:: bash

  $ export CUDA_VISIBLE_DEVICES="0"
  $ ./vits/infer.py \
    --epoch 1000 \
    --exp-dir vits/exp \
    --tokens data/tokens.txt \
    --max-duration 500

.. note::

   For more details, please run ``./vits/infer.py --help``.
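
Once inference has finished, a quick way to confirm that output was written is to list the files in the ``wav`` directory for the chosen epoch. The sketch below is only an illustration: ``list_generated_wavs`` is a hypothetical helper, and it assumes the saved files use the ``.wav`` extension.

```python
from pathlib import Path
from typing import List


def list_generated_wavs(exp_dir: str, epoch: int) -> List[Path]:
    """Return the sorted .wav files under <exp_dir>/infer/epoch-<epoch>/wav."""
    wav_dir = Path(exp_dir) / "infer" / f"epoch-{epoch}" / "wav"
    # glob() on a missing directory simply yields nothing, so this is safe
    # to call before inference has been run.
    return sorted(wav_dir.glob("*.wav"))


if __name__ == "__main__":
    wavs = list_generated_wavs("vits/exp", 1000)
    print(f"Found {len(wavs)} wav files")
    for path in wavs[:5]:
        print(path)
```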


Export models
-------------

Currently, only ONNX export is supported. It generates two files in the given ``exp-dir``:
``vits-epoch-*.onnx`` and ``vits-epoch-*.int8.onnx``.

.. code-block:: bash

  $ ./vits/export-onnx.py \
    --epoch 1000 \
    --exp-dir vits/exp \
    --tokens data/tokens.txt

You can test the exported ONNX model with:

.. code-block:: bash

  $ ./vits/test_onnx.py \
    --model-filename vits/exp/vits-epoch-1000.onnx \
    --tokens data/tokens.txt
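
Before running the ONNX test, it can help to check that both exported files are actually present. The following sketch builds the two file names described above for a given epoch and reports any that are missing. The helper names are hypothetical and not part of the recipe.

```python
from pathlib import Path
from typing import List


def expected_onnx_files(exp_dir: str, epoch: int) -> List[Path]:
    """Build the two file names the export step is described as producing."""
    base = Path(exp_dir)
    return [
        base / f"vits-epoch-{epoch}.onnx",
        base / f"vits-epoch-{epoch}.int8.onnx",
    ]


def missing_onnx_files(exp_dir: str, epoch: int) -> List[Path]:
    """Return the expected files that do not exist on disk."""
    return [p for p in expected_onnx_files(exp_dir, epoch) if not p.is_file()]


if __name__ == "__main__":
    missing = missing_onnx_files("vits/exp", 1000)
    if missing:
        print("Missing exported files:", ", ".join(str(p) for p in missing))
    else:
        print("Both ONNX files are present.")
```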

Download pretrained models
--------------------------

If you don't want to train from scratch, you can download the pretrained models
by visiting the following link:

- `<https://huggingface.co/Zengwei/icefall-tts-ljspeech-vits-2023-11-29>`_
@@ -0,0 +1,106 @@
#!/usr/bin/env python3
# Copyright         2021-2023  Xiaomi Corp.        (authors: Fangjun Kuang,
#                                                            Zengwei Yao)
#
# See ../../../../LICENSE for clarification regarding multiple authors
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.


"""
This file computes spectrogram features of the LJSpeech dataset.
It looks for manifests in the directory data/manifests.
The generated spectrogram features are saved in data/spectrogram.
"""

import logging
import os
from pathlib import Path

import torch
from lhotse import (
    CutSet,
    LilcomChunkyWriter,
    Spectrogram,
    SpectrogramConfig,
    load_manifest,
)
from lhotse.audio import RecordingSet
from lhotse.supervision import SupervisionSet

from icefall.utils import get_executor

# Torch's multithreaded behavior needs to be disabled or
# it wastes a lot of CPU and slows things down.
# Do this outside of main() in case it needs to take effect
# even when we are not invoking the main (e.g. when spawning subprocesses).
torch.set_num_threads(1)
torch.set_num_interop_threads(1)


def compute_spectrogram_ljspeech():
    src_dir = Path("data/manifests")
    output_dir = Path("data/spectrogram")
    num_jobs = min(4, os.cpu_count())

    sampling_rate = 22050
    frame_length = 1024 / sampling_rate  # (in second)
    frame_shift = 256 / sampling_rate  # (in second)
    use_fft_mag = True

    prefix = "ljspeech"
    suffix = "jsonl.gz"
    partition = "all"

    recordings = load_manifest(
        src_dir / f"{prefix}_recordings_{partition}.{suffix}", RecordingSet
    )
    supervisions = load_manifest(
        src_dir / f"{prefix}_supervisions_{partition}.{suffix}", SupervisionSet
    )

    config = SpectrogramConfig(
        sampling_rate=sampling_rate,
        frame_length=frame_length,
        frame_shift=frame_shift,
        use_fft_mag=use_fft_mag,
    )
    extractor = Spectrogram(config)

    with get_executor() as ex:  # Initialize the executor only once.
        cuts_filename = f"{prefix}_cuts_{partition}.{suffix}"
        if (output_dir / cuts_filename).is_file():
            logging.info(f"{cuts_filename} already exists - skipping.")
            return
        logging.info(f"Processing {partition}")
        cut_set = CutSet.from_manifests(
            recordings=recordings, supervisions=supervisions
        )

        cut_set = cut_set.compute_and_store_features(
            extractor=extractor,
            storage_path=f"{output_dir}/{prefix}_feats_{partition}",
            # when an executor is specified, make more partitions
            num_jobs=num_jobs if ex is None else 80,
            executor=ex,
            storage_type=LilcomChunkyWriter,
        )
        cut_set.to_file(output_dir / cuts_filename)


if __name__ == "__main__":
    formatter = "%(asctime)s %(levelname)s [%(filename)s:%(lineno)d] %(message)s"

    logging.basicConfig(format=formatter, level=logging.INFO)
    compute_spectrogram_ljspeech()
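
The frame settings in the script above are expressed in seconds but derived from sample counts (a 1024-sample window and a 256-sample shift at 22050 Hz). The sketch below works through that arithmetic. It assumes the FFT size equals the window length, which gives 513 frequency bins, and it approximates the frame count as duration divided by frame shift; the exact count produced by the feature extractor may differ by a frame or so depending on its edge handling. The constant and function names are illustrative only.

```python
SAMPLING_RATE = 22050   # Hz, as in the script
WIN_SAMPLES = 1024      # window length in samples
HOP_SAMPLES = 256       # frame shift in samples

frame_length_s = WIN_SAMPLES / SAMPLING_RATE   # about 0.0464 s
frame_shift_s = HOP_SAMPLES / SAMPLING_RATE    # about 0.0116 s

# Assumption: the FFT size equals the window length, so a one-sided
# spectrum has WIN_SAMPLES // 2 + 1 frequency bins.
num_freq_bins = WIN_SAMPLES // 2 + 1


def approx_num_frames(duration_s: float) -> int:
    """Approximate frame count for an utterance of the given duration."""
    return int(duration_s * SAMPLING_RATE / HOP_SAMPLES)


if __name__ == "__main__":
    print(f"frame length: {frame_length_s * 1000:.2f} ms")
    print(f"frame shift:  {frame_shift_s * 1000:.2f} ms")
    print(f"frequency bins (assumed): {num_freq_bins}")
    print(f"frames for a 6.6 s utterance (approx.): {approx_num_frames(6.6)}")
```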
@@ -0,0 +1,73 @@
#!/usr/bin/env python3
# Copyright         2023  Xiaomi Corp.        (authors: Zengwei Yao)
#
# See ../../../../LICENSE for clarification regarding multiple authors
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

"""
This file displays duration statistics of utterances in a manifest.
You can use the displayed value to choose minimum/maximum duration
to remove short and long utterances during the training.
See the function `remove_short_and_long_utt()` in vits/train.py
for usage.
"""


from lhotse import load_manifest_lazy


def main():
    path = "./data/spectrogram/ljspeech_cuts_all.jsonl.gz"
    cuts = load_manifest_lazy(path)
    cuts.describe()


if __name__ == "__main__":
    main()

"""
Cut statistics:
╒═══════════════════════════╤══════════╕
│ Cuts count:               │ 13100    │
├───────────────────────────┼──────────┤
│ Total duration (hh:mm:ss) │ 23:55:18 │
├───────────────────────────┼──────────┤
│ mean                      │ 6.6      │
├───────────────────────────┼──────────┤
│ std                       │ 2.2      │
├───────────────────────────┼──────────┤
│ min                       │ 1.1      │
├───────────────────────────┼──────────┤
│ 25%                       │ 5.0      │
├───────────────────────────┼──────────┤
│ 50%                       │ 6.8      │
├───────────────────────────┼──────────┤
│ 75%                       │ 8.4      │
├───────────────────────────┼──────────┤
│ 99%                       │ 10.0     │
├───────────────────────────┼──────────┤
│ 99.5%                     │ 10.1     │
├───────────────────────────┼──────────┤
│ 99.9%                     │ 10.1     │
├───────────────────────────┼──────────┤
│ max                       │ 10.1     │
├───────────────────────────┼──────────┤
│ Recordings available:     │ 13100    │
├───────────────────────────┼──────────┤
│ Features available:       │ 13100    │
├───────────────────────────┼──────────┤
│ Supervisions available:   │ 13100    │
╘═══════════════════════════╧══════════╛
"""
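
The docstring above says these statistics are meant for choosing minimum and maximum durations when filtering utterances. As a minimal, library-free sketch of that idea, the code below keeps only utterances whose duration lies within a chosen range. The `Utterance` class, the `keep_by_duration` helper, and the 1.0 s / 11.0 s thresholds are all hypothetical; the thresholds actually used by `remove_short_and_long_utt()` in vits/train.py are not shown here and may differ.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Utterance:
    """Hypothetical stand-in for a cut: an id and a duration in seconds."""
    id: str
    duration: float


def keep_by_duration(
    utts: List[Utterance], min_s: float, max_s: float
) -> List[Utterance]:
    """Keep utterances whose duration lies in the closed range [min_s, max_s]."""
    return [u for u in utts if min_s <= u.duration <= max_s]


if __name__ == "__main__":
    utts = [
        Utterance("too-short", 0.4),
        Utterance("typical", 6.6),
        Utterance("longest-in-stats", 10.1),
        Utterance("too-long", 25.0),
    ]
    # Illustrative thresholds chosen to bracket the min (1.1) and max (10.1)
    # reported in the statistics above.
    kept = keep_by_duration(utts, min_s=1.0, max_s=11.0)
    print([u.id for u in kept])
```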