This repository contains implementations of three Automatic Speech Recognition (ASR) models based on Connectionist Temporal Classification (CTC):
- **CTC** - A minimal implementation of the basic CTC model.

  Paper Reference: Connectionist Temporal Classification

  Citation:

  ```bibtex
  @inproceedings{graves2006connectionist,
    title={Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks},
    author={Graves, Alex and Fern{\'a}ndez, Santiago and Gomez, Faustino and Schmidhuber, J{\"u}rgen},
    booktitle={Proceedings of the 23rd international conference on Machine learning},
    pages={369--376},
    year={2006}
  }
  ```
- **SC-CTC** - Self-Conditioned CTC, which introduces self-conditioning to the CTC framework.

  Paper Reference: Relaxing the Conditional Independence Assumption of CTC-Based ASR

  Citation:

  ```bibtex
  @article{nozaki2021relaxing,
    title={Relaxing the conditional independence assumption of CTC-based ASR by conditioning on intermediate predictions},
    author={Nozaki, Jumon and Komatsu, Tatsuya},
    journal={arXiv preprint arXiv:2104.02724},
    year={2021}
  }
  ```
- **HC-CTC** - Hierarchically Conditioned CTC, which applies a hierarchical conditioning mechanism for improved performance.

  Paper Reference: Hierarchical Conditional End-to-End ASR with CTC and Multi-Granular Subword Units

  Citation:

  ```bibtex
  @inproceedings{higuchi2022hierarchical,
    title={Hierarchical conditional end-to-end asr with ctc and multi-granular subword units},
    author={Higuchi, Yosuke and Karube, Keita and Ogawa, Tetsuji and Kobayashi, Tetsunori},
    booktitle={ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
    pages={7797--7801},
    year={2022},
    organization={IEEE}
  }
  ```
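All three variants train with the standard CTC objective. As a point of reference (this is plain PyTorch, not code from this repo), the loss over a batch of encoder outputs can be computed as:

```python
import torch
import torch.nn as nn

# Illustrative only: standard PyTorch CTC loss, not this repo's training code.
# log_probs has shape (T, N, C) = (input frames, batch, vocab size incl. blank).
ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)

T, N, C, S = 50, 4, 30, 12                                 # frames, batch, classes, target length
log_probs = torch.randn(T, N, C).log_softmax(-1)
targets = torch.randint(1, C, (N, S), dtype=torch.long)    # label ids exclude the blank id 0
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), S, dtype=torch.long)

loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
```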
The convolutional subsampling approach is taken from NVIDIA's Fast Conformer, which applies 8x downsampling to the audio features, making the Conformer very fast. A lower downsampling factor of 4x can be used by setting `cfg.features.downsample: 4` in the config, as in the snippet below.
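For example (a hypothetical excerpt; key names other than `features.downsample` may differ from the actual files in `config/`):

```yaml
features:
  downsample: 4   # 8 is the Fast Conformer setting; 4 keeps more temporal resolution
```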
`requirements.txt` contains the libraries I had installed (some of them might not be needed).
To train any of these models, use the following script:

```bash
bash run_training.sh
```
The configs for all models can be found in the `config/` directory.
The `run_decode.sh` script runs decoding over a corpus and reports the word error rate (WER).
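To sanity-check WER numbers independently, one quick option (using the third-party `jiwer` package, which this repo does not require) is:

```python
# Quick WER computation with the third-party jiwer package (illustrative only).
import jiwer

reference = "the quick brown fox"
hypothesis = "the quick brown box"
print(jiwer.wer(reference, hypothesis))  # 0.25: one substitution over four reference words
```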
To decode a specific audio file, load a YAML config `cfg` and run:

```python
# Restore the model from a saved checkpoint and transcribe a single file.
model = baseHCCTC.from_pretrained(cfg, cfg.paths.ckpt_path)  # cfg.paths.ckpt_path: path where the checkpoint is saved
transcription = model.transcribe("/path/to/audio_file")
print(transcription)
```
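How `cfg` gets loaded is up to you; a minimal sketch assuming the configs are plain OmegaConf-compatible YAML (an assumption, and the file name below is hypothetical):

```python
from omegaconf import OmegaConf

# Hypothetical config file name; substitute the actual file from config/.
cfg = OmegaConf.load("config/hcctc.yaml")
```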