This repository contains implementations of three Automatic Speech Recognition (ASR) models based on Connectionist Temporal Classification (CTC):
- **CTC** - A minimal implementation of the basic CTC model.

  Paper Reference: Connectionist Temporal Classification

  Citation:

  ```bibtex
  @inproceedings{graves2006connectionist,
    title={Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks},
    author={Graves, Alex and Fern{\'a}ndez, Santiago and Gomez, Faustino and Schmidhuber, J{\"u}rgen},
    booktitle={Proceedings of the 23rd international conference on Machine learning},
    pages={369--376},
    year={2006}
  }
  ```
- **SC-CTC** - Self-Conditioned CTC, which introduces self-conditioning to the CTC framework.

  Paper Reference: Relaxing the Conditional Independence Assumption of CTC-Based ASR

  Citation:

  ```bibtex
  @article{nozaki2021relaxing,
    title={Relaxing the conditional independence assumption of CTC-based ASR by conditioning on intermediate predictions},
    author={Nozaki, Jumon and Komatsu, Tatsuya},
    journal={arXiv preprint arXiv:2104.02724},
    year={2021}
  }
  ```
- **HC-CTC** - Hierarchically Conditioned CTC, which applies a hierarchical conditioning mechanism for improved performance.

  Paper Reference: Hierarchical Conditional End-to-End ASR with CTC and Multi-Granular Subword Units

  Citation:

  ```bibtex
  @inproceedings{higuchi2022hierarchical,
    title={Hierarchical conditional end-to-end asr with ctc and multi-granular subword units},
    author={Higuchi, Yosuke and Karube, Keita and Ogawa, Tetsuji and Kobayashi, Tetsunori},
    booktitle={ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
    pages={7797--7801},
    year={2022},
    organization={IEEE}
  }
  ```
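All three variants train with the standard CTC objective. As a point of reference (this is plain PyTorch, not code from this repo), the loss over a batch of encoder outputs can be computed as:

```python
import torch
import torch.nn as nn

# Illustrative only: standard PyTorch CTC loss, not this repo's training code.
# log_probs has shape (T, N, C) = (input frames, batch, vocab size incl. blank).
ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)

T, N, C, S = 50, 4, 30, 12                                 # frames, batch, classes, target length
log_probs = torch.randn(T, N, C).log_softmax(-1)
targets = torch.randint(1, C, (N, S), dtype=torch.long)    # label ids exclude the blank id 0
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), S, dtype=torch.long)

loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
```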
The convolutional subsampling approach is taken from NVIDIA's Fast Conformer, which applies 8x downsampling to the audio features, making the Conformer very fast. A lower downsampling factor of 4x can be used by setting `cfg.features.downsample: 4` in the config, as in the snippet below.
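For example (a hypothetical excerpt; key names other than `features.downsample` may differ from the actual files in `config/`):

```yaml
features:
  downsample: 4   # 8 is the Fast Conformer setting; 4 keeps more temporal resolution
```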
`requirements.txt` contains the libraries I had installed (some of them might not be needed).
To train any of these models, use the following script:

```bash
bash run_training.sh
```
The configs for all models can be found in the `config/` directory.
The `run_decode.sh` script runs decoding over a corpus and reports the word error rate (WER).
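To sanity-check WER numbers independently, one quick option (using the third-party `jiwer` package, which this repo does not require) is:

```python
# Quick WER computation with the third-party jiwer package (illustrative only).
import jiwer

reference = "the quick brown fox"
hypothesis = "the quick brown box"
print(jiwer.wer(reference, hypothesis))  # 0.25: one substitution over four reference words
```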
To decode a specific audio file, load a YAML config `cfg` and run:

```python
# Restore the model from a saved checkpoint and transcribe a single file.
model = baseHCCTC.from_pretrained(cfg, cfg.paths.ckpt_path)  # cfg.paths.ckpt_path: path where the checkpoint is saved
transcription = model.transcribe("/path/to/audio_file")
print(transcription)
```
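How `cfg` gets loaded is up to you; a minimal sketch assuming the configs are plain OmegaConf-compatible YAML (an assumption, and the file name below is hypothetical):

```python
from omegaconf import OmegaConf

# Hypothetical config file name; substitute the actual file from config/.
cfg = OmegaConf.load("config/hcctc.yaml")
```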