
This is the PyTorch-based model implementation of the paper "Listen, Attend and Spell", an attention-based encoder-decoder neural model for speech utterance transcription.

YukunJ/Listen-Attend-Spell-Implementation

Listen-Attend-Spell Implementation

This is the PyTorch-based model implementation of the paper "Listen, Attend and Spell" by Chan et al., 2015 from Google Brain and Carnegie Mellon University.

This is an attention-based encoder-decoder model that transcribes speech utterances to text character by character. It uses pyramidal RNN layers to shorten the input utterance and an attention mechanism to decode the information captured by the encoder.

The overall architecture is shown below (figures taken from the paper):

drawing

drawing


Highlights of the model implementation

  1. Pyramidal LSTM Layer -> An utterance is typically long, often exceeding 1000 time steps. This makes it hard for the attention mechanism to focus on the right part of the speech during decoding and slows convergence. To tackle this problem, each pyramidal LSTM layer halves the length of the input utterance by concatenating every two adjacent time steps. Essentially, after one pyramidal LSTM layer, a batch of size (batch_size, seq_len, feat_size) becomes (batch_size, seq_len / 2, feat_size * 2). If seq_len is odd, we simply drop the last time step.

  2. Locked Dropout -> We implement our own locked dropout layer and insert it between the pyramidal LSTM layers. Locked dropout applies the same dropout mask to every time step. This is an efficient way to improve the generalization of the encoder. The encoder's baseline architecture is therefore [lstm -> plstm -> locked-dropout -> plstm -> locked-dropout -> plstm]

  3. Attention Mechanism -> The model uses an attention mechanism to help the decoder focus on the right part of the speech utterance during decoding. There are many ways of implementing attention. In this implementation, we use linear transformations to produce attention_key and attention_value, which are combined with the query at each decoding time step.

  4. Teacher Forcing -> Learning is difficult in the early stage: if the prediction at the current time step t is wrong, the embedding of that wrong character is fed into the decoding at time step t+1, making it even harder to get the next character right. To tackle this problem, we use teacher forcing. Essentially, with a high probability (90% initially), the embedding of y_{t-1} fed into the decoding of y_t is the ground truth, regardless of what the model predicted at the previous time step. As training progresses, we can gradually decrease the teacher forcing rate and let the model rely more on its own predictions.

  5. Beam Search -> To explore the possible decoding paths more fully, we implement beam search. However, since it becomes quite slow as the beam width grows, we only use it during validation and inference; greedy search is applied in the training epochs.
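
Items 1 and 2 can be illustrated with a minimal PyTorch sketch. This is not the code in model.py: the names `pyramidal_reduce` and `LockedDropout`, and the dropout probability used here, are assumptions made for illustration. The sketch only shows the reshaping step of a pyramidal layer (the LSTM itself is omitted) and a dropout mask shared across time steps.

```python
import torch
import torch.nn as nn


def pyramidal_reduce(x: torch.Tensor) -> torch.Tensor:
    """Halve the time axis by concatenating each pair of adjacent time steps."""
    batch_size, seq_len, feat_size = x.shape
    # Drop the trailing time step when seq_len is odd.
    x = x[:, : seq_len - (seq_len % 2), :]
    # (batch_size, seq_len, feat_size) -> (batch_size, seq_len // 2, feat_size * 2)
    return x.reshape(batch_size, seq_len // 2, feat_size * 2)


class LockedDropout(nn.Module):
    """Dropout that samples one mask per sequence and reuses it at every time step."""

    def __init__(self, p: float = 0.3):  # p is an assumed value, not taken from the repo
        super().__init__()
        self.p = p

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch_size, seq_len, feat_size)
        if not self.training or self.p == 0.0:
            return x
        # The mask has a time dimension of 1, so broadcasting shares it across time.
        mask = x.new_empty(x.size(0), 1, x.size(2)).bernoulli_(1.0 - self.p)
        return x * mask / (1.0 - self.p)


x = torch.randn(4, 1001, 40)      # odd seq_len: the last time step is dropped
halved = pyramidal_reduce(x)      # shape (4, 500, 80)

drop = LockedDropout(p=0.5)
drop.train()
dropped = drop(torch.ones(2, 5, 8))
```

Applying `pyramidal_reduce` three times, as the three pyramidal layers do, shrinks a length of 1001 to 500, then 250, then 125.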

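The attention step in item 3 could look roughly like the following. It is a sketch under stated assumptions rather than the contents of attention.py: the class name, the query projection, the scaled dot-product scoring, the padding mask, and the encoder and decoder dimensions of 512 are all illustrative choices. Only the linear projections to a key/value dimension of 128 come from this README.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DotProductAttention(nn.Module):
    """Project encoder outputs to keys and values, then attend with a decoder query."""

    def __init__(self, encoder_dim: int = 512, decoder_dim: int = 512, key_value_dim: int = 128):
        super().__init__()
        self.key_proj = nn.Linear(encoder_dim, key_value_dim)
        self.value_proj = nn.Linear(encoder_dim, key_value_dim)
        self.query_proj = nn.Linear(decoder_dim, key_value_dim)

    def forward(self, encoder_out, decoder_state, lengths):
        # encoder_out: (B, T, encoder_dim); decoder_state: (B, decoder_dim); lengths: (B,)
        key = self.key_proj(encoder_out)          # (B, T, D)
        value = self.value_proj(encoder_out)      # (B, T, D)
        query = self.query_proj(decoder_state)    # (B, D)
        energy = torch.bmm(key, query.unsqueeze(2)).squeeze(2)   # (B, T)
        # Padded positions (index >= true length) must receive zero attention weight.
        positions = torch.arange(key.size(1), device=key.device).unsqueeze(0)
        energy = energy.masked_fill(positions >= lengths.unsqueeze(1), float("-inf"))
        weights = F.softmax(energy / key.size(-1) ** 0.5, dim=1)  # (B, T)
        context = torch.bmm(weights.unsqueeze(1), value).squeeze(1)  # (B, D)
        return context, weights


attention = DotProductAttention()
encoder_out = torch.randn(2, 10, 512)
decoder_state = torch.randn(2, 512)
lengths = torch.tensor([10, 6])   # the second utterance has 4 padded steps
context, weights = attention(encoder_out, decoder_state, lengths)
```

In practice the keys and values only need to be computed once per utterance, since they do not depend on the decoder state.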

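To show why beam search (item 5) can find better transcriptions than greedy search, here is a small self-contained sketch. It is not the code in search.py: `beam_search`, the `step_log_probs` callback, and the four-token toy vocabulary are hypothetical, and the sketch uses plain summed log-probabilities without length normalization.

```python
import math
from typing import Callable, Sequence, Tuple


def beam_search(
    step_log_probs: Callable[[Tuple[int, ...]], Sequence[float]],
    sos: int,
    eos: int,
    beam_width: int = 3,
    max_len: int = 20,
) -> Tuple[Tuple[int, ...], float]:
    """Return the highest-scoring (token sequence, log-probability) found."""
    beams = [((sos,), 0.0)]
    finished = []
    for _ in range(max_len):
        # Expand every live hypothesis by every possible next token.
        candidates = []
        for prefix, score in beams:
            for token, log_prob in enumerate(step_log_probs(prefix)):
                candidates.append((prefix + (token,), score + log_prob))
        candidates.sort(key=lambda c: c[1], reverse=True)
        # Keep the top beam_width; hypotheses ending in eos leave the beam.
        beams = []
        for prefix, score in candidates[:beam_width]:
            (finished if prefix[-1] == eos else beams).append((prefix, score))
        if not beams:
            break
    # Fall back to unfinished hypotheses if none ended with eos.
    return max(finished or beams, key=lambda c: c[1])


# Toy model: tokens are 0=<sos>, 1=<eos>, 2='a', 3='b'.
TOY_PROBS = {
    0: [0.0, 0.0, 0.6, 0.4],    # after <sos>
    2: [0.0, 0.4, 0.3, 0.3],    # after 'a'
    3: [0.0, 0.9, 0.05, 0.05],  # after 'b'
}


def toy_step(prefix: Tuple[int, ...]) -> Sequence[float]:
    return [math.log(p) if p > 0 else float("-inf") for p in TOY_PROBS[prefix[-1]]]


greedy_path, greedy_score = beam_search(toy_step, sos=0, eos=1, beam_width=1)
beam_path, beam_score = beam_search(toy_step, sos=0, eos=1, beam_width=2)
```

With `beam_width=1` the search is greedy and commits to 'a' because it is the most likely first character, ending with total probability 0.6 * 0.4 = 0.24. With `beam_width=2` the hypothesis starting with 'b' survives the first step and wins with 0.4 * 0.9 = 0.36.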
File Structure

src/
    attention.py (attention module)
    model.py (locked dropout, pyramidal lstm layer, encoder, decoder)
    trainer.py (train, valid, inference and attention/gradient plot helper)
    dataset.py (dataset, dataloader)
    search.py (greedy search, beam search)
    utils.py (letter list, index to letter trans dictionary)
    main.py (main driver, hyperparameter setting)
data/ (the train, valid, test dataset storage)
checkpoint/ (model checkpoint during training)
output/ (inference result)
pic/
Listen-Attend-Spell.pdf
requirements.txt
README.md

Where to get the data?

Since the data is too large to host on GitHub, we packaged it and uploaded it to Google Drive for download (link). Please unzip the archive and put the files in the data/ folder.


How to run this model?

  1. Check and install dependent packages.
    pip install -r requirements.txt
    
  2. cd to the src folder
  3. (Optionally) Change the hyperparameters as needed in the main.py script
  4. To train, run
    python3 main.py train
    
  5. To run inference, run
    python3 main.py infer
    

Hyperparameter Setting

We adopt the baseline architecture setting as described in the paper.

Encoder:

  • 1 layer of normal LSTM followed by 3 layers of pyramidal LSTM of hidden dim=256. This reduces the input utterance length by a factor of 8.

Attention:

  • linear transformation of key_value dim=128

Decoder:

  • 2 layers of LSTM cells of hidden dim=512
  • character embedding dim=128

Optimization:

  • batch size=64
  • Adam optimizer with lr=0.001
  • ReduceLROnPlateau scheduler with reduce factor 0.75 and patience=2, which starts stepping only after the first 10 epochs
  • train for 40 epochs
  • Teacher forcing rate stays at 95% for the first 10 epochs and then gradually decreases to 70% by linear interpolation.
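
The teacher forcing schedule in the last bullet can be written as a small function. This is a sketch, not the code in main.py or trainer.py: the function names are hypothetical, and it assumes the rate reaches 70% exactly at the final (40th) epoch, which the README does not state explicitly.

```python
import random


def teacher_forcing_rate(
    epoch: int,
    total_epochs: int = 40,
    hold_epochs: int = 10,
    start: float = 0.95,
    end: float = 0.70,
) -> float:
    """Hold `start` for the first `hold_epochs`, then interpolate linearly to `end`."""
    if epoch < hold_epochs:
        return start
    # progress runs from 0.0 at epoch `hold_epochs` to 1.0 at the last epoch (0-indexed).
    progress = (epoch - hold_epochs) / max(1, total_epochs - 1 - hold_epochs)
    return start + (end - start) * min(1.0, progress)


def use_ground_truth(epoch: int, rng: random.Random) -> bool:
    """Decide, for one decoding step, whether to feed the ground-truth character."""
    return rng.random() < teacher_forcing_rate(epoch)


rates = [teacher_forcing_rate(epoch) for epoch in range(40)]
feed_truth = use_ground_truth(epoch=0, rng=random.Random(0))
```

During training, `use_ground_truth` would be called once per decoding step to choose between the embedding of the ground-truth character and the embedding of the model's previous prediction.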

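The optimizer and scheduler settings above map onto PyTorch roughly as follows. The tiny `nn.Linear` model and the constant validation metric are placeholders so the snippet runs on its own, and gating `scheduler.step` on the epoch index is one assumed way to start stepping only after the first 10 epochs; the repository may do this differently.

```python
import torch

model = torch.nn.Linear(4, 2)  # placeholder standing in for the real model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.75, patience=2
)

HOLD_EPOCHS = 10  # the scheduler is not consulted before this epoch
lr_history = []
for epoch in range(40):
    # ... one epoch of training and validation would run here ...
    val_distance = 30.0  # placeholder metric that never improves
    if epoch >= HOLD_EPOCHS:
        scheduler.step(val_distance)
    lr_history.append(optimizer.param_groups[0]["lr"])
```

Because the placeholder metric never improves, the learning rate stays at its initial value for the first 10 epochs and is then multiplied by 0.75 each time the patience runs out.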
The model should reach an average Levenshtein distance below 30 on the validation set after 40 epochs of training.
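
For reference, the Levenshtein distance between a predicted and a reference transcript is the minimum number of character insertions, deletions, and substitutions needed to turn one into the other. The sketch below uses the standard dynamic-programming formulation; the function names and example strings are illustrative and not taken from the repository.

```python
from typing import Sequence


def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance between strings a and b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]
        for j, char_b in enumerate(b, start=1):
            substitution = prev[j - 1] + (0 if char_a == char_b else 1)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
        prev = curr
    return prev[-1]


def average_distance(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Mean edit distance over paired predictions and references."""
    pairs = list(zip(predictions, references))
    return sum(levenshtein(p, r) for p, r in pairs) / len(pairs)


example = levenshtein("kitten", "sitting")
mean_example = average_distance(["kitten", "speech"], ["sitting", "speech"])
```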


Feel free to email me at [email protected] with questions or to discuss this implementation.
