This forked version introduces chunking of the input text, allowing for the generation of audio files of any length without limitations. Additionally, the VRAM usage remains under 8 GB.
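The chunking idea can be sketched as follows. This is an illustrative sketch only, not the fork's actual implementation; `split_into_chunks` and the 400-character budget are assumptions for demonstration.

```python
import re

# Illustrative sketch of input-text chunking (assumed character budget;
# not this fork's exact code): split long input into sentence-aligned
# chunks so each TTS call stays small, then the per-chunk audio clips
# can be concatenated into one output file.
def split_into_chunks(text, max_chars=400):
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + len(sentence) + 1 > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks

text = "First sentence. " * 50
chunks = split_into_chunks(text)
print(len(chunks), max(len(c) for c in chunks))
```

Because each chunk is bounded, peak VRAM stays roughly constant regardless of total input length.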
F5-TTS: Diffusion Transformer with ConvNeXt V2, with faster training and inference.
E2 TTS: Flat-UNet Transformer, closest reproduction.
Sway Sampling: Inference-time flow step sampling strategy that greatly improves performance.
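As described in the F5-TTS paper, sway sampling warps the uniform flow-matching timestep grid with t → t + s·(cos(πt/2) − 1 + t); a negative sway coefficient s concentrates steps at early (noisier) flow steps. A sketch under that formula (s = -1.0 is an illustrative choice, not a recommended setting):

```python
import math

# Sketch of sway sampling: warp uniform flow-matching timesteps via
# t -> t + s * (cos(pi * t / 2) - 1 + t).  With a negative sway
# coefficient s, the warped grid is denser near t = 0, i.e. more
# steps are spent on the early, noisier part of the flow.
def sway_timesteps(nfe_step, s=-1.0):
    ts = [i / nfe_step for i in range(nfe_step + 1)]  # uniform grid on [0, 1]
    return [t + s * (math.cos(math.pi * t / 2) - 1 + t) for t in ts]

ts = sway_timesteps(16)
print(round(ts[0], 6), round(ts[-1], 6))  # endpoints stay at 0 and 1
```

Note the endpoints are fixed (t' = 0 at t = 0 and t' = 1 at t = 1), so only the spacing of intermediate steps changes.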
- Clone this repository:

  git clone https://github.com/PasiKoodaa/F5-TTS
  cd F5-TTS

- Create a new conda environment:

  conda create -n F5-TTS python=3.10

- Activate the environment:

  conda activate F5-TTS

- Install the right torch build for your system (see https://pytorch.org/get-started/locally/). Tested with:

  pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124

- Install the required packages:

  pip install -r requirements.txt

- Run the Gradio app:

  python app_local.py
Example data processing scripts are provided for Emilia and WenetSpeech4TTS; you can also tailor your own script along with a Dataset class in model/dataset.py.
# prepare a custom dataset to your needs
# download the corresponding dataset first, and fill in the path in the scripts
# Prepare the Emilia dataset
python scripts/prepare_emilia.py
# Prepare the Wenetspeech4TTS dataset
python scripts/prepare_wenetspeech4tts.py
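Whatever preparation script you write, it ultimately needs to yield (audio, text) pairs that a Dataset class like the one in model/dataset.py can consume. A minimal plain-Python sketch of that pairing logic (the "path|text" metadata layout and class name here are assumptions, not this repo's actual schema; a real implementation would subclass torch.utils.data.Dataset and load the audio):

```python
import csv
import io

# Hypothetical metadata format: one row per utterance, "path|text".
# This sketch only demonstrates the pairing logic with stdlib code;
# it does not load or resample audio.
class PairedTextAudioDataset:
    def __init__(self, metadata_file):
        reader = csv.reader(metadata_file, delimiter="|")
        self.items = [(path, text) for path, text in reader]

    def __len__(self):
        return len(self.items)

    def __getitem__(self, idx):
        path, text = self.items[idx]
        return {"audio_path": path, "text": text}

metadata = io.StringIO("wavs/0001.wav|Hello there.\nwavs/0002.wav|Second utterance.\n")
ds = PairedTextAudioDataset(metadata)
print(len(ds), ds[0]["text"])
```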
Once your datasets are prepared, you can start the training process.
# setup accelerate config, e.g. use multi-gpu ddp, fp16
# config will be saved to: ~/.cache/huggingface/accelerate/default_config.yaml
accelerate config
accelerate launch test_train.py
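For reference, a single-node multi-GPU DDP setup with fp16 produces a default_config.yaml along these lines (values are illustrative; always generate your own with `accelerate config`):

```yaml
compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
mixed_precision: fp16
num_machines: 1
num_processes: 2   # one process per GPU
```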
To run inference with pretrained models, download the checkpoints from 🤗 Hugging Face.
You can test single inference using the following command. Before running it, modify the config to your needs.
# modify the config to your needs, e.g.
# fix_duration (total length of prompt + generated audio; currently supports up to 30s)
# nfe_step (a larger value takes more time but yields a more precise ODE solution)
# ode_method (switch to 'midpoint' for better compatibility with a small nfe_step;
# though 'midpoint' is a 2nd-order ODE solver, it is slower than the 1st-order 'euler')
python test_infer_single.py
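The nfe_step / ode_method trade-off can be seen on a toy ODE: midpoint evaluates the vector field twice per step (hence slower per step) but is 2nd-order accurate, so it tolerates fewer steps. A self-contained sketch of that comparison (not the repo's solver, which uses torchdiffeq):

```python
import math

# Toy comparison of 1st-order Euler vs 2nd-order midpoint on dy/dt = y,
# y(0) = 1, integrated to t = 1 (exact answer: e).  Midpoint costs two
# function evaluations per step but its error shrinks much faster.
def integrate(f, y0, steps, method):
    y, h = y0, 1.0 / steps
    for _ in range(steps):
        if method == "euler":
            y = y + h * f(y)
        else:  # midpoint: evaluate f at the half-step estimate
            y = y + h * f(y + 0.5 * h * f(y))
    return y

f = lambda y: y
euler_err = abs(integrate(f, 1.0, 10, "euler") - math.e)
midpoint_err = abs(integrate(f, 1.0, 10, "midpoint") - math.e)
print(euler_err, midpoint_err)
```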
To test speech editing capabilities, use the following command.
python test_infer_single_edit.py
- Seed-TTS test set: Download from seed-tts-eval.
- LibriSpeech test-clean: Download from OpenSLR.
- Unzip the downloaded datasets and place them in the data/ directory.
- Update the path for the test-clean data in test_infer_batch.py
- Our filtered LibriSpeech-PC 4-10s subset is already under data/ in this repo
To run batch inference for evaluations, execute the following commands:
# batch inference for evaluations
accelerate config # if not set before
bash test_infer_batch.sh
- Chinese ASR Model: Paraformer-zh
- English ASR Model: Faster-Whisper
- WavLM Model: Download from Google Drive.
Some Notes
For faster-whisper with CUDA 11:
pip install --force-reinstall ctranslate2==3.24.0
(Recommended) To avoid possible ASR failures, such as abnormal repetitions in output:
pip install faster-whisper==0.10.1
Update the path with your batch-inferenced results, and carry out WER / SIM evaluations:
# Evaluation for Seed-TTS test set
python scripts/eval_seedtts_testset.py
# Evaluation for LibriSpeech-PC test-clean (cross-sentence)
python scripts/eval_librispeech_test_clean.py
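WER itself is word-level edit distance (substitutions + insertions + deletions) divided by the number of reference words. A minimal stdlib sketch of the metric (the repo's evaluation scripts run full ASR pipelines; this helper is only illustrative):

```python
# Minimal word error rate: Levenshtein distance over word sequences,
# normalized by reference length.
def wer(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution / match
    return dp[-1][-1] / len(ref)

print(wer("the cat sat on the mat", "the cat sat on mat"))
```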
- E2-TTS brilliant work, simple and effective
- Emilia, WenetSpeech4TTS valuable datasets
- lucidrains initial CFM structure with also bfs18 for discussion
- SD3 & Huggingface diffusers DiT and MMDiT code structure
- torchdiffeq as ODE solver, Vocos as vocoder
- mrfakename huggingface space demo ~
- FunASR, faster-whisper & UniSpeech for evaluation tools
- ctc-forced-aligner for speech edit test
@article{chen-etal-2024-f5tts,
title={F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching},
author={Yushen Chen and Zhikang Niu and Ziyang Ma and Keqi Deng and Chunhui Wang and Jian Zhao and Kai Yu and Xie Chen},
journal={arXiv preprint arXiv:2410.06885},
year={2024},
}
Our code is released under MIT License.