Add a TTS recipe VITS on LJSpeech dataset (#1372)
* first commit

* replace phonemizer with g2p

* use Conformer as text encoder

* modify training script, clean codes

* rename directory

* convert text to tokens in data preparation stage

* fix tts_datamodule.py

* support onnx export and testing the exported onnx model

* add doc

* add README.md

* fix style
yaozengwei authored Nov 29, 2023
1 parent ae67f75 commit 0622dea
Showing 34 changed files with 7,517 additions and 2 deletions.
2 changes: 1 addition & 1 deletion .flake8
@@ -15,7 +15,7 @@ per-file-ignores =
egs/librispeech/ASR/zipformer_mmi/*.py: E501, E203
egs/librispeech/ASR/zipformer/*.py: E501, E203
egs/librispeech/ASR/RESULTS.md: E999,

egs/ljspeech/TTS/vits/*.py: E501, E203
# invalid escape sequence (caused by TeX formula), W605
icefall/utils.py: E501, W605

7 changes: 7 additions & 0 deletions docs/source/recipes/TTS/index.rst
@@ -0,0 +1,7 @@
TTS
======

.. toctree::
   :maxdepth: 2

   ljspeech/vits
113 changes: 113 additions & 0 deletions docs/source/recipes/TTS/ljspeech/vits.rst
@@ -0,0 +1,113 @@
VITS
===============

This tutorial shows you how to train a VITS model
with the `LJSpeech <https://keithito.com/LJ-Speech-Dataset/>`_ dataset.

.. note::

   The VITS paper: `Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech <https://arxiv.org/pdf/2106.06103.pdf>`_


Data preparation
----------------

.. code-block:: bash

   $ cd egs/ljspeech/TTS
   $ ./prepare.sh

To run stage 1 to stage 5, use

.. code-block:: bash

   $ ./prepare.sh --stage 1 --stop_stage 5

Build Monotonic Alignment Search
--------------------------------

.. code-block:: bash

   $ cd vits/monotonic_align
   $ python setup.py build_ext --inplace
   $ cd ../../

Training
--------

.. code-block:: bash

   $ export CUDA_VISIBLE_DEVICES="0,1,2,3"
   $ ./vits/train.py \
       --world-size 4 \
       --num-epochs 1000 \
       --start-epoch 1 \
       --use-fp16 1 \
       --exp-dir vits/exp \
       --tokens data/tokens.txt \
       --max-duration 500

.. note::

   You can adjust the hyper-parameters to control the size of the VITS model and
   the training configurations. For more details, please run ``./vits/train.py --help``.

.. note::

   The training can take a long time (usually a couple of days).

Training logs, checkpoints and tensorboard logs are saved in ``vits/exp``.
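
As a rough sanity check on the training command, the number of batches per epoch can be estimated from the dataset size. The sketch below is only an approximation under stated assumptions: it assumes that ``--max-duration`` caps the total audio duration (in seconds) of one batch on one GPU, that the whole dataset is used for training, and it takes the total duration (23:55:18) from the output of ``local/display_manifest_statistics.py``. The helper name ``estimate_batches_per_epoch`` is hypothetical and not part of the recipe.

```python
import math


def estimate_batches_per_epoch(total_seconds, max_duration, world_size):
    """Rough estimate of batches per GPU per epoch.

    Assumes every batch is filled up to max_duration seconds on each of
    world_size GPUs; real batches are rarely filled exactly, so the true
    number is usually somewhat higher.
    """
    return math.ceil(total_seconds / (max_duration * world_size))


# Total duration of LJSpeech as reported by the statistics script: 23:55:18.
total_seconds = 23 * 3600 + 55 * 60 + 18

batches = estimate_batches_per_epoch(total_seconds, max_duration=500, world_size=4)
print(f"total audio: {total_seconds} s, estimated batches per epoch: {batches}")
```

Lowering ``--max-duration`` (for example, to fit a smaller GPU) increases this estimate proportionally.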


Inference
---------

The inference part uses checkpoints saved by the training part, so you have to run the
training part first. The inference script saves the ground-truth and generated wavs to the directory
``vits/exp/infer/epoch-*/wav``, e.g., ``vits/exp/infer/epoch-1000/wav``.

.. code-block:: bash

   $ export CUDA_VISIBLE_DEVICES="0"
   $ ./vits/infer.py \
       --epoch 1000 \
       --exp-dir vits/exp \
       --tokens data/tokens.txt \
       --max-duration 500

.. note::

   For more details, please run ``./vits/infer.py --help``.


Export models
-------------

Currently, only exporting to ONNX is supported. The export script generates two files in the given ``exp-dir``:
``vits-epoch-*.onnx`` and ``vits-epoch-*.int8.onnx``.

.. code-block:: bash

   $ ./vits/export-onnx.py \
       --epoch 1000 \
       --exp-dir vits/exp \
       --tokens data/tokens.txt

You can test the exported ONNX model with:

.. code-block:: bash

   $ ./vits/test_onnx.py \
       --model-filename vits/exp/vits-epoch-1000.onnx \
       --tokens data/tokens.txt

Download pretrained models
--------------------------

If you don't want to train from scratch, you can download the pretrained models
by visiting the following link:

- `<https://huggingface.co/Zengwei/icefall-tts-ljspeech-vits-2023-11-29>`_
3 changes: 2 additions & 1 deletion docs/source/recipes/index.rst
@@ -2,7 +2,7 @@ Recipes
=======

This page contains various recipes in ``icefall``.
Currently, only speech recognition recipes are provided.
Currently, we provide recipes for speech recognition, language model, and speech synthesis.

We may add recipes for other tasks as well in the future.

@@ -16,3 +16,4 @@ We may add recipes for other tasks as well in the future.
Non-streaming-ASR/index
Streaming-ASR/index
RNN-LM/index
TTS/index
106 changes: 106 additions & 0 deletions egs/ljspeech/TTS/local/compute_spectrogram_ljspeech.py
@@ -0,0 +1,106 @@
#!/usr/bin/env python3
# Copyright 2021-2023 Xiaomi Corp. (authors: Fangjun Kuang,
# Zengwei Yao)
#
# See ../../../../LICENSE for clarification regarding multiple authors
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.


"""
This file computes spectrogram features of the LJSpeech dataset.
It looks for manifests in the directory data/manifests.
The generated spectrogram features are saved in data/spectrogram.
"""

import logging
import os
from pathlib import Path

import torch
from lhotse import (
    CutSet,
    LilcomChunkyWriter,
    Spectrogram,
    SpectrogramConfig,
    load_manifest,
)
from lhotse.audio import RecordingSet
from lhotse.supervision import SupervisionSet

from icefall.utils import get_executor

# Torch's multithreaded behavior needs to be disabled or
# it wastes a lot of CPU and slows things down.
# Do this outside of main() in case it needs to take effect
# even when we are not invoking the main (e.g. when spawning subprocesses).
torch.set_num_threads(1)
torch.set_num_interop_threads(1)


def compute_spectrogram_ljspeech():
    src_dir = Path("data/manifests")
    output_dir = Path("data/spectrogram")
    num_jobs = min(4, os.cpu_count())

    sampling_rate = 22050
    frame_length = 1024 / sampling_rate  # (in second)
    frame_shift = 256 / sampling_rate  # (in second)
    use_fft_mag = True

    prefix = "ljspeech"
    suffix = "jsonl.gz"
    partition = "all"

    recordings = load_manifest(
        src_dir / f"{prefix}_recordings_{partition}.{suffix}", RecordingSet
    )
    supervisions = load_manifest(
        src_dir / f"{prefix}_supervisions_{partition}.{suffix}", SupervisionSet
    )

    config = SpectrogramConfig(
        sampling_rate=sampling_rate,
        frame_length=frame_length,
        frame_shift=frame_shift,
        use_fft_mag=use_fft_mag,
    )
    extractor = Spectrogram(config)

    with get_executor() as ex:  # Initialize the executor only once.
        cuts_filename = f"{prefix}_cuts_{partition}.{suffix}"
        if (output_dir / cuts_filename).is_file():
            logging.info(f"{cuts_filename} already exists - skipping.")
            return
        logging.info(f"Processing {partition}")
        cut_set = CutSet.from_manifests(
            recordings=recordings, supervisions=supervisions
        )

        cut_set = cut_set.compute_and_store_features(
            extractor=extractor,
            storage_path=f"{output_dir}/{prefix}_feats_{partition}",
            # when an executor is specified, make more partitions
            num_jobs=num_jobs if ex is None else 80,
            executor=ex,
            storage_type=LilcomChunkyWriter,
        )
        cut_set.to_file(output_dir / cuts_filename)


if __name__ == "__main__":
    formatter = "%(asctime)s %(levelname)s [%(filename)s:%(lineno)d] %(message)s"

    logging.basicConfig(format=formatter, level=logging.INFO)
    compute_spectrogram_ljspeech()
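
The script above expresses the analysis window (1024 samples) and hop (256 samples) in seconds, because the configuration takes ``frame_length`` and ``frame_shift`` as durations. The following sketch shows the resulting values and a rough frame count for one utterance. It assumes one frame per hop and ignores any padding or edge handling done by the feature extractor, so the exact count produced by lhotse may differ by a frame or so; ``approx_num_frames`` is a hypothetical helper, not part of the recipe.

```python
import math

SAMPLING_RATE = 22050  # Hz, as in the script above
WIN_SAMPLES = 1024     # analysis window length in samples
HOP_SAMPLES = 256      # hop (frame shift) in samples

# Convert sample counts to seconds, as the script does.
frame_length = WIN_SAMPLES / SAMPLING_RATE
frame_shift = HOP_SAMPLES / SAMPLING_RATE


def approx_num_frames(duration_seconds):
    """Approximate frame count, assuming one frame per hop (no edge handling)."""
    num_samples = int(round(duration_seconds * SAMPLING_RATE))
    return math.ceil(num_samples / HOP_SAMPLES)


print(f"frame_length = {frame_length * 1000:.1f} ms")
print(f"frame_shift  = {frame_shift * 1000:.1f} ms")
print(f"frames for a 6.6 s utterance: about {approx_num_frames(6.6)}")
```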
73 changes: 73 additions & 0 deletions egs/ljspeech/TTS/local/display_manifest_statistics.py
@@ -0,0 +1,73 @@
#!/usr/bin/env python3
# Copyright 2023 Xiaomi Corp. (authors: Zengwei Yao)
#
# See ../../../../LICENSE for clarification regarding multiple authors
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

"""
This file displays duration statistics of utterances in a manifest.
You can use the displayed value to choose minimum/maximum duration
to remove short and long utterances during the training.
See the function `remove_short_and_long_utt()` in vits/train.py
for usage.
"""


from lhotse import load_manifest_lazy


def main():
    path = "./data/spectrogram/ljspeech_cuts_all.jsonl.gz"
    cuts = load_manifest_lazy(path)
    cuts.describe()


if __name__ == "__main__":
    main()

"""
Cut statistics:
╒═══════════════════════════╤══════════╕
│ Cuts count: │ 13100 │
├───────────────────────────┼──────────┤
│ Total duration (hh:mm:ss) │ 23:55:18 │
├───────────────────────────┼──────────┤
│ mean │ 6.6 │
├───────────────────────────┼──────────┤
│ std │ 2.2 │
├───────────────────────────┼──────────┤
│ min │ 1.1 │
├───────────────────────────┼──────────┤
│ 25% │ 5.0 │
├───────────────────────────┼──────────┤
│ 50% │ 6.8 │
├───────────────────────────┼──────────┤
│ 75% │ 8.4 │
├───────────────────────────┼──────────┤
│ 99% │ 10.0 │
├───────────────────────────┼──────────┤
│ 99.5% │ 10.1 │
├───────────────────────────┼──────────┤
│ 99.9% │ 10.1 │
├───────────────────────────┼──────────┤
│ max │ 10.1 │
├───────────────────────────┼──────────┤
│ Recordings available: │ 13100 │
├───────────────────────────┼──────────┤
│ Features available: │ 13100 │
├───────────────────────────┼──────────┤
│ Supervisions available: │ 13100 │
╘═══════════════════════════╧══════════╛
"""
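
The duration statistics printed by ``display_manifest_statistics.py`` (min 1.1 s, max 10.1 s) are meant to guide the choice of minimum and maximum durations for filtering. The sketch below illustrates the idea with plain Python floats. It is not the actual ``remove_short_and_long_utt()`` from ``vits/train.py``: the function name ``keep_utterance`` and the thresholds of 1.0 and 11.0 seconds are assumptions chosen only to bracket the reported range.

```python
def keep_utterance(duration, min_duration=1.0, max_duration=11.0):
    """Return True if the utterance duration (seconds) is within the bounds.

    The default thresholds are illustrative assumptions, not values taken
    from the recipe.
    """
    return min_duration <= duration <= max_duration


# Hypothetical durations in seconds; 0.8 and 12.3 fall outside the bounds.
durations = [0.8, 1.1, 6.6, 10.1, 12.3]
kept = [d for d in durations if keep_utterance(d)]
print(kept)
```

With thresholds this wide, every utterance in the reported statistics would be kept; tightening them trades a little training data for more uniform batch lengths.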