# srllctts

A simple utility for synthesizing English speech from the command line. It uses NVIDIA's Tacotron2 and WaveGlow models to do the work, both of which were trained on the LJ Speech dataset. Just a quick little thing that we thought was neat.

Some of the code is taken directly from NVIDIA's TorchHub example (see links).
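The pipeline described above can be sketched roughly as follows, after NVIDIA's TorchHub example: Tacotron2 turns text into a mel spectrogram, and WaveGlow turns the spectrogram into a waveform. The entry-point names below follow that example and may have changed upstream; treat this as a sketch, not this repo's exact code.

```python
def load_and_synthesize(text, out_path="audio.wav"):
    """Rough sketch of the Tacotron2 -> WaveGlow pipeline from NVIDIA's
    TorchHub example; hub entry-point names are assumptions that may have
    changed upstream."""
    import torch
    from scipy.io.wavfile import write

    hub_repo = "NVIDIA/DeepLearningExamples:torchhub"
    tacotron2 = torch.hub.load(hub_repo, "nvidia_tacotron2").eval()
    waveglow = torch.hub.load(hub_repo, "nvidia_waveglow").eval()
    utils = torch.hub.load(hub_repo, "nvidia_tts_utils")

    # Text -> padded integer sequences the model expects
    sequences, lengths = utils.prepare_input_sequence([text])
    with torch.no_grad():
        mel, _, _ = tacotron2.infer(sequences, lengths)  # text -> mel spectrogram
        audio = waveglow.infer(mel)                      # mel -> raw waveform
    # The LJ Speech models produce 22,050 Hz audio
    write(out_path, 22050, audio[0].cpu().numpy())
    return out_path
```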

Please note that this is not a maintained project. It also no longer represents the state of the art; while I haven't taken the time to investigate the following, it may be of interest:

MelGAN Vocoder code, and paper.

## Links

### Samples of the output

- Decent: knuth.wav
- Really bad: shakespeare.wav

## Dependencies

You can `pip install -r DEPENDENCIES` to get these:

- torch
- matplotlib
- numpy
- inflect
- librosa
- scipy
- unidecode
- plac

## Execution time

With a GTX 1080 Ti video card and an Intel Core i7-7700K (4.2 GHz), synthesis takes roughly one second for every word or two.

## Licenses

The LJ Speech dataset is in the public domain, and NVIDIA's models are covered by a BSD 3-Clause license. Imagine the court battles Hollywood is going to go through when we really get these things right.