End-to-end Automatic Speech Recognition Systems - PyTorch Implementation

This is an open source project (formerly named Listen, Attend and Spell - PyTorch Implementation) for end-to-end ASR by Tzu-Wei Sung and me. The implementation was mostly done with PyTorch, the well-known deep learning toolkit.

The end-to-end ASR is based on Listen, Attend and Spell [1]. Several recently proposed techniques are also implemented as additional plug-ins for better performance. For the list of techniques implemented, please refer to the highlights, configuration, and references.

Feel free to use or modify the code; any bug report or improvement suggestion is appreciated. If you find this project helpful for your research, please consider citing our paper, thanks!

Highlights

  • Feature Extraction

    • On-the-fly feature extraction using torchaudio as backend (see the sketch after this list)
    • Character/subword [2]/word encoding of text
  • Training End-to-end ASR

    • Seq2seq ASR with different types of encoder/attention [3]
    • CTC-based ASR [4], which can also be combined with the former into a hybrid model [5] (a sketch of the combined loss and joint decoding score follows below)
    • yaml-styled model construction and hyperparameter setting
    • Training process visualization with TensorBoard, including attention alignment
  • Speech Recognition with End-to-end ASR (i.e. Decoding)

    • Beam search decoding
    • RNN language model training and joint decoding for ASR [6]
    • Joint CTC-attention based decoding [6]
    • Greedy decoding & CTC beam search contributed by Heng-Jui (Harry) Chang
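
As a rough illustration of what on-the-fly feature extraction with torchaudio looks like, here is a minimal sketch (independent of this repository's dataset code; the audio file name and feature dimensions are placeholders):

import torch
import torchaudio

# Load a waveform and compute log-Mel filterbank features on the fly.
waveform, sample_rate = torchaudio.load("example.wav")      # placeholder audio file
mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=sample_rate, n_fft=400, hop_length=160, n_mels=40)(waveform)
log_mel = torch.log(mel + 1e-6)                 # (channel, n_mels, time)
features = log_mel.squeeze(0).transpose(0, 1)   # (time, n_mels), ready for an encoder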

You may check out some example log files with TensorBoard; download links are coming soon.
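
The hybrid CTC-attention training in [5] and the joint decoding in [6] both come down to weighted combinations of CTC and attention terms. A minimal sketch of the idea (the weights and function names below are illustrative, not this repository's variable names):

# Training: multi-task objective sharing one encoder,
# loss = w * L_ctc + (1 - w) * L_attention  (Kim et al. [5])
def hybrid_loss(ctc_loss, att_loss, ctc_weight=0.5):
    return ctc_weight * ctc_loss + (1.0 - ctc_weight) * att_loss

# Decoding: each beam hypothesis combines attention, CTC prefix and RNN-LM
# log-probabilities (Hori et al. [6]).
def joint_score(att_logp, ctc_logp, lm_logp, ctc_weight=0.5, lm_weight=0.3):
    return (1.0 - ctc_weight) * att_logp + ctc_weight * ctc_logp + lm_weight * lm_logp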

Dependencies

  • Python 3
  • Sufficient computing power (a high-end GPU) and memory (both RAM and GPU memory) are essential if you'd like to train your own model.
  • Required packages and their uses are listed in requirements.txt.

Instructions

Step 0. Preprocessing - Generate Text Encoder

You may use the text encoders provided at tests/sample_data/ and skip this step.

The subword model is trained with sentencepiece. For the character/word model, you have to generate a vocabulary file that lists the vocabulary line by line. You may also use util/generate_vocab_file.py, so that you only need to prepare a single text file containing all the text you want to use for generating the vocabulary file or subword model. Please update the data.text.* fields in the config file if you want to change the mode or the vocabulary file. For the subword model, use the file ending with .model as vocab_file.

python3 util/generate_vocab_file.py --input_file TEXT_FILE \
                                    --output_file OUTPUT_FILE \
                                    --vocab_size VOCAB_SIZE \
                                    --mode MODE

For more details, please refer to python3 util/generate_vocab_file.py -h.
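
util/generate_vocab_file.py wraps this step; if you prefer, a subword model can also be trained with sentencepiece directly. A minimal sketch (corpus file name, model prefix, and vocabulary size are placeholders, not values required by this project):

import sentencepiece as spm

# Train a 5000-token BPE subword model from a plain-text corpus (one sentence
# per line); this produces corpus_bpe.model and corpus_bpe.vocab.
spm.SentencePieceTrainer.train(
    input="all_text.txt", model_prefix="corpus_bpe",
    vocab_size=5000, model_type="bpe")

# The resulting .model file is what vocab_file under data.text should point to.
sp = spm.SentencePieceProcessor(model_file="corpus_bpe.model")
print(sp.encode("hello world", out_type=str))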

Step 1. Configuring - Model Design & Hyperparameter Setup

All the parameters related to training/decoding are stored in a yaml file, so hyperparameter tuning and experiments can be managed easily. See the documentation and examples for the exact format. Note that the example configs provided were not fine-tuned; you may want to write your own config for best performance.
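
Since the config is plain yaml, it can also be inspected or modified programmatically as nested dictionaries. A minimal sketch (the data.text access assumes the field mentioned in Step 0; consult the example configs for the exact schema, and the variant file name is a placeholder):

import yaml

# Load an example config and look at the text-encoding settings from Step 0.
with open("config/libri/asr_example.yaml") as f:
    config = yaml.safe_load(f)
print(config["data"]["text"])

# Tweak a setting and write a variant config for another experiment.
with open("config/libri/asr_variant.yaml", "w") as f:
    yaml.safe_dump(config, f)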

Step 2. Training (End-to-end ASR or RNN-LM)

Once the config file is ready, run one of the following commands to train the end-to-end ASR (or the language model):

  • Single GPU
python3 main.py --config <path of config file> 
  • Extract the upstream representation and train a downstream ASR (a single linear layer) on a single GPU
python3 main.py --config config/libri/upstream_example.yaml --upstream wav2vec2
  • Multiple GPUs
n_gpu=4
python3 -m torch.distributed.launch --nproc_per_node $n_gpu main.py --config <path of config file> 
  • Finetune pretrained models with multiple GPUs (the config is still under development)
n_gpu=4
python3 -m torch.distributed.launch --nproc_per_node $n_gpu main.py --config config/libri/upstream_example.yaml --upstream wav2vec2 --upstream_trainable

For example, train an ASR on LibriSpeech and watch the logs with:

# Checkout options available
python3 main.py -h
# Start training with specific config
python3 main.py --config config/libri/asr_example.yaml
# Open TensorBoard to see log
tensorboard --logdir log/
# Train an external language model
python3 main.py --config config/libri/lm_example.yaml --lm
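
The curves and attention alignments shown in TensorBoard are written with the standard PyTorch SummaryWriter; the actual logging lives inside this repository's solver, but a minimal sketch of the mechanism (tag names and tensors below are illustrative) looks like this:

import torch
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter("log/my_experiment")        # matches tensorboard --logdir log/
step = 100                                         # illustrative training step
writer.add_scalar("dev/error_rate", 0.32, step)    # scalar curves
attention = torch.rand(120, 80)                    # (decoder steps, encoder steps), values in [0, 1]
writer.add_image("dev/attention_alignment", attention, step, dataformats="HW")
writer.close()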

All settings are parsed from the config file automatically to start training, and the logs can be accessed through TensorBoard. Please note that the error rate reported on TensorBoard is biased (see issue #10); you should run the testing phase to get the true performance of the model. Options available in this phase include the following:

Options    Description
config     Path to the config file.
seed       Random seed; note that this option affects the result.
name       Name of the experiment, used for logging and saving the model. Defaults to <name of config file>_<random seed>.
logdir     Directory to store training logs (TensorBoard log files); default log/.
ckpdir     Directory to store model checkpoints; default ckpt/.
njobs      Number of workers for the data loader; consider increasing it if data preprocessing takes most of your training time; default 6.
no-pin     Disable the pin-memory option of the PyTorch DataLoader.
cpu        CPU-only mode; not recommended, use it for debugging only.
no-msg     Hide all messages from stdout.
lm         Switch to RNN-LM training mode.
test       Switch to decoding mode (do not use during the training phase).
cudnn-ctc  Use CuDNN as the backend of PyTorch CTC. Unstable (see the related issue); it is unclear whether this is resolved in the latest PyTorch with cuDNN > 7.6.

Step 3. Speech Recognition & Performance Evaluation

To test a model, run the following command

python3 main.py --config <path of config file> --test --njobs <int>

If the checkpoint was trained with distributed data parallel (DDP), you need to explicitly specify --load_ddp_to_nonddp, since the model is not wrapped with DDP during testing:

python3 main.py --config <path of config file> --test --njobs <int> --load_ddp_to_nonddp

For example, if you are evaluating a fine-tuned checkpoint from a pretrained upstream model:

config_file=config/libri/ctc_decode_finetune_example.yaml
python3 main.py --config $config_file --test --njobs <int> --load_ddp_to_nonddp

Please note that decoding is performed without batch processing; use more workers to speed it up at the cost of more RAM. By default, recognition results are stored at result/<name>/ as two csv files, named automatically according to the decoding config file: output.csv stores the best hypothesis provided by the ASR, and beam.csv records the top hypotheses found during beam search. The result files may be evaluated with eval.py. For example, test the example ASR trained on LibriSpeech and check its performance with:

python3 main.py --config config/libri/decode_example.yaml --test --njobs 8
# Check WER/CER
python3 eval.py --file result/asr_example_sd0_dev_output.csv
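
eval.py reports WER/CER for you; for reference, the error rate is simply a length-normalized edit distance between hypothesis and ground truth. A minimal sketch of the computation (independent of eval.py's actual implementation and column names):

def edit_distance(ref, hyp):
    # Standard Levenshtein distance over token sequences, single rolling row.
    row = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, row[0] = row[0], i
        for j, h in enumerate(hyp, 1):
            prev, row[j] = row[j], min(row[j] + 1, row[j - 1] + 1, prev + (r != h))
    return row[len(hyp)]

def word_error_rate(ref_text, hyp_text):
    ref, hyp = ref_text.split(), hyp_text.split()
    return edit_distance(ref, hyp) / max(len(ref), 1)

print(word_error_rate("the cat sat", "the cat sit"))   # one substitution over three words -> 0.333...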

Most of the options work the same as in the training phase, except for the following:

Options    Description
test       Must be enabled.
config     Path to the decoding config file.
outdir     Path to store decoding results.
njobs      Number of workers used for decoding; very important for efficiency. A larger value means faster decoding at the cost of more RAM/GPU memory.

Troubleshooting

  • Loss becomes NaN right after training begins

    For CTC, len(pred) > len(label) is necessary. Also consider setting zero_infinity=True for torch.nn.CTCLoss, as in the sketch below.
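
A minimal sketch of that constraint and of the zero_infinity option (all shapes and lengths below are illustrative):

import torch

# T = encoder output length; CTC requires it to be longer than the label length.
T, batch, vocab = 50, 2, 30
log_probs = torch.randn(T, batch, vocab).log_softmax(dim=-1)
targets = torch.randint(1, vocab, (batch, 12))               # blank index 0 is excluded from labels
input_lengths = torch.full((batch,), T, dtype=torch.long)
target_lengths = torch.full((batch,), 12, dtype=torch.long)  # must not exceed input_lengths

# zero_infinity=True zeroes out infinite losses (from too-short inputs)
# instead of letting them turn the gradients into NaN.
ctc = torch.nn.CTCLoss(blank=0, zero_infinity=True)
loss = ctc(log_probs, targets, input_lengths, target_lengths)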

ToDo

  • Provide examples
  • Pure CTC training / CTC beam decode bug (out-of-candidate)
  • Greedy decoding
  • Customized dataset
  • Util. scripts
  • Finish CLM migration and reference
  • Store preprocessed dataset on RAM

Acknowledgements

  • Parts of the implementation refer to ESPnet, a great end-to-end speech processing toolkit by Watanabe et al.
  • Special thanks to William Chan, the first author of LAS, for answering my questions during implementation.
  • Thanks to xiaoming, Odie Ko, b-etienne, Jinserk Baik, and Zhong-Yi Li for identifying several issues in our implementation.

Reference

  1. Listen, Attend and Spell, W Chan et al.
  2. Neural Machine Translation of Rare Words with Subword Units, R Sennrich et al.
  3. Attention-Based Models for Speech Recognition, J Chorowski et al.
  4. Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks, A Graves et al.
  5. Joint CTC-Attention based End-to-End Speech Recognition using Multi-task Learning, S Kim et al.
  6. Advances in Joint CTC-Attention based End-to-End Speech Recognition with a Deep CNN Encoder and RNN-LM, T Hori et al.

Citation

@inproceedings{liu2019adversarial,
  title={Adversarial Training of End-to-end Speech Recognition Using a Criticizing Language Model},
  author={Liu, Alexander and Lee, Hung-yi and Lee, Lin-shan},
  booktitle={Acoustics, Speech and Signal Processing (ICASSP)},
  year={2019},
  organization={IEEE}
}

@misc{alex2019sequencetosequence,
    title={Sequence-to-sequence Automatic Speech Recognition with Word Embedding Regularization and Fused Decoding},
    author={Alexander H. Liu and Tzu-Wei Sung and Shun-Po Chuang and Hung-yi Lee and Lin-shan Lee},
    year={2019},
    eprint={1910.12740},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
