PromptASR for contextualized ASR with controllable style (#1250)
* Add PromptASR with BERT as text encoder

* Support using word-list based content prompts for context biasing

* Upload the pretrained models to huggingface

* Add usage example
marcoyang1998 authored Oct 11, 2023
1 parent cb874e9 commit 16a2748
Showing 29 changed files with 15,825 additions and 3 deletions.
205 changes: 205 additions & 0 deletions egs/libriheavy/ASR/RESULTS.md
@@ -0,0 +1,205 @@
## Results

### Zipformer PromptASR (zipformer + PromptASR + BERT text encoder)

#### [zipformer_prompt_asr](./zipformer_prompt_asr)

See <https://github.com/k2-fsa/icefall/pull/1250> for commit history and
our paper <https://arxiv.org/abs/2309.07414> for more details.



##### Training on the medium subset, with content & style prompt, **no** context list

You can find a pre-trained model, training logs, decoding logs, and decoding results at: <https://huggingface.co/marcoyang/icefall-promptasr-libriheavy-zipformer-BERT-2023-10-10>
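
To fetch the checkpoint locally, a minimal sketch (assuming `git-lfs` is installed, since Hugging Face repositories store the large model files with Git LFS):

```bash
# Sketch only: clone the pretrained model repository from Hugging Face.
# git-lfs is required so the checkpoint files are actually downloaded
# rather than left as LFS pointer files.
git lfs install
git clone https://huggingface.co/marcoyang/icefall-promptasr-libriheavy-zipformer-BERT-2023-10-10
```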

The training command is:

```bash
causal=0
subset=medium
memory_dropout_rate=0.05
text_encoder_type=BERT

python ./zipformer_prompt_asr/train_bert_encoder.py \
--world-size 4 \
--start-epoch 1 \
--num-epochs 60 \
--exp-dir ./zipformer_prompt_asr/exp \
--use-fp16 True \
--memory-dropout-rate $memory_dropout_rate \
--causal $causal \
--subset $subset \
--manifest-dir data/fbank \
--bpe-model data/lang_bpe_500_fallback_coverage_0.99/bpe.model \
--max-duration 1000 \
--text-encoder-type $text_encoder_type \
--text-encoder-dim 768 \
--use-context-list 0 \
--use-style-prompt 1
```
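
To monitor training, a sketch assuming `train_bert_encoder.py` follows the usual icefall convention of writing TensorBoard event files under the experiment directory (the exact path is an assumption, not confirmed by this commit):

```bash
# Assumption: logs are written to <exp-dir>/tensorboard, as in other
# icefall recipes; adjust the path if your run differs.
tensorboard --logdir ./zipformer_prompt_asr/exp/tensorboard --port 6006
```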

The decoding results (WER, %) using utterance-level context (epoch 60, averaging 10 checkpoints):

| decoding method | lh-test-clean | lh-test-other | comment |
|----------------------|---------------|---------------|---------------------|
| modified_beam_search | 3.13 | 6.78 | --use-pre-text False --use-style-prompt False |
| modified_beam_search | 2.86 | 5.93 | --pre-text-transform upper-no-punc --style-text-transform upper-no-punc |
| modified_beam_search | 2.60 | 5.50 | --pre-text-transform mixed-punc --style-text-transform mixed-punc |


The decoding command is:

```bash
for style in mixed-punc upper-no-punc; do
python ./zipformer_prompt_asr/decode_bert.py \
--epoch 60 \
--avg 10 \
--use-averaged-model True \
--post-normalization True \
--causal False \
--exp-dir ./zipformer_prompt_asr/exp \
--manifest-dir data/fbank \
--bpe-model data/lang_bpe_500_fallback_coverage_0.99/bpe.model \
--max-duration 1000 \
--decoding-method modified_beam_search \
--beam-size 4 \
--text-encoder-type BERT \
--text-encoder-dim 768 \
--memory-layer 0 \
--use-ls-test-set False \
--use-ls-context-list False \
--max-prompt-lens 1000 \
--use-pre-text True \
--use-style-prompt True \
--style-text-transform $style \
--pre-text-transform $style \
--compute-CER 0
done
```
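
The first row of the table (decoding without any prompt) is not produced by the loop above; a sketch of that baseline run, mirroring the flags above and omitting the prompt-specific transforms:

```bash
# Baseline from the first table row: no content prompt and no style prompt.
# All other flags mirror the loop above.
python ./zipformer_prompt_asr/decode_bert.py \
--epoch 60 \
--avg 10 \
--use-averaged-model True \
--post-normalization True \
--causal False \
--exp-dir ./zipformer_prompt_asr/exp \
--manifest-dir data/fbank \
--bpe-model data/lang_bpe_500_fallback_coverage_0.99/bpe.model \
--max-duration 1000 \
--decoding-method modified_beam_search \
--beam-size 4 \
--text-encoder-type BERT \
--text-encoder-dim 768 \
--memory-layer 0 \
--use-ls-test-set False \
--use-ls-context-list False \
--use-pre-text False \
--use-style-prompt False \
--compute-CER 0
```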

##### Training on the medium subset, with content & style prompt, **with** context list

You can find a pre-trained model, training logs, decoding logs, and decoding results at: <https://huggingface.co/marcoyang/icefall-promptasr-with-context-libriheavy-zipformer-BERT-2023-10-10>
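
This model can be fetched with the same `git lfs` sketch shown earlier, substituting this repository's URL.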

This model is trained with an extra type of content prompt (context words), so it performs better
at **word-level** context biasing. Note that before training this model, you should first run
`prepare_prompt_asr.sh` to prepare a manifest containing the context words.

The training command is:

```bash

causal=0
subset=medium
memory_dropout_rate=0.05
text_encoder_type=BERT
use_context_list=True

# prepare the required data for context biasing
./prepare_prompt_asr.sh --stage 0 --stop_stage 1

python ./zipformer_prompt_asr/train_bert_encoder.py \
--world-size 4 \
--start-epoch 1 \
--num-epochs 50 \
--exp-dir ./zipformer_prompt_asr/exp \
--use-fp16 True \
--memory-dropout-rate $memory_dropout_rate \
--causal $causal \
--subset $subset \
--manifest-dir data/fbank \
--bpe-model data/lang_bpe_500_fallback_coverage_0.99/bpe.model \
--max-duration 1000 \
--text-encoder-type $text_encoder_type \
--text-encoder-dim 768 \
--use-context-list $use_context_list \
--top-k 10000 \
--use-style-prompt 1
```

*Utterance-level biasing (WER, %):*

| decoding method | lh-test-clean | lh-test-other | comment |
|----------------------|---------------|---------------|---------------------|
| modified_beam_search | 3.17 | 6.72 | --use-pre-text 0 --use-style-prompt 0 |
| modified_beam_search | 2.91 | 6.24 | --pre-text-transform upper-no-punc --style-text-transform upper-no-punc |
| modified_beam_search | 2.72 | 5.72 | --pre-text-transform mixed-punc --style-text-transform mixed-punc |


The decoding command for the table above is:

```bash
for style in mixed-punc upper-no-punc; do
python ./zipformer_prompt_asr/decode_bert.py \
--epoch 50 \
--avg 10 \
--use-averaged-model True \
--post-normalization True \
--causal False \
--exp-dir ./zipformer_prompt_asr/exp \
--manifest-dir data/fbank \
--bpe-model data/lang_bpe_500_fallback_coverage_0.99/bpe.model \
--max-duration 1000 \
--decoding-method modified_beam_search \
--beam-size 4 \
--text-encoder-type BERT \
--text-encoder-dim 768 \
--memory-layer 0 \
--use-ls-test-set False \
--use-ls-context-list False \
--max-prompt-lens 1000 \
--use-pre-text True \
--use-style-prompt True \
--style-text-transform $style \
--pre-text-transform $style \
--compute-CER 0
done
```
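
The baseline row (`--use-pre-text 0 --use-style-prompt 0`) can be reproduced with the same no-prompt sketch shown in the previous subsection, with `--epoch 50`.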

*Word-level biasing (WER, %):*

The results are reported on the LibriSpeech test sets, using the biasing lists provided by <https://arxiv.org/abs/2104.02194>.
You need to set `--use-ls-test-set True` so that the LibriSpeech test sets are used; `--ls-distractors` sets the number of distractor words added to each utterance's biasing list.

| decoding method | ls-test-clean | ls-test-other | comment |
|----------------------|---------------|---------------|---------------------|
| modified_beam_search | 2.40 | 5.08 | --use-pre-text 0 --use-style-prompt 0 |
| modified_beam_search | 2.14 | 4.62 | --use-ls-context-list 1 --pre-text-transform mixed-punc --style-text-transform mixed-punc --ls-distractors 0 |
| modified_beam_search | 2.14 | 4.64 | --use-ls-context-list 1 --pre-text-transform mixed-punc --style-text-transform mixed-punc --ls-distractors 100 |

The decoding command for the table above is:

```bash
use_ls_test_set=1
use_ls_context_list=1

for ls_distractors in 0 100; do
python ./zipformer_prompt_asr/decode_bert.py \
--epoch 50 \
--avg 10 \
--use-averaged-model True \
--post-normalization True \
--causal False \
--exp-dir ./zipformer_prompt_asr/exp \
--manifest-dir data/fbank \
--bpe-model data/lang_bpe_500_fallback_coverage_0.99/bpe.model \
--max-duration 1000 \
--decoding-method modified_beam_search \
--beam-size 4 \
--text-encoder-type BERT \
--text-encoder-dim 768 \
--memory-layer 0 \
--use-ls-test-set $use_ls_test_set \
--use-ls-context-list $use_ls_context_list \
--ls-distractors $ls_distractors \
--max-prompt-lens 1000 \
--use-pre-text True \
--use-style-prompt True \
--style-text-transform mixed-punc \
--pre-text-transform mixed-punc \
--compute-CER 0
done

```
36 changes: 36 additions & 0 deletions egs/libriheavy/ASR/prepare_prompt_asr.sh
Original file line number Diff line number Diff line change
@@ -0,0 +1,36 @@
#!/usr/bin/env bash

set -eou pipefail

# This is the preparation recipe for PromptASR: https://arxiv.org/pdf/2309.07414

log() {
# This function is from espnet
local fname=${BASH_SOURCE[1]##*/}
echo -e "$(date '+%Y-%m-%d %H:%M:%S') (${fname}:${BASH_LINENO[0]}:${FUNCNAME[1]}) $*"
}

stage=-1
stop_stage=100
manifest_dir=data/fbank
subset=medium
topk=10000
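# shared/parse_options.sh (the standard Kaldi-style option parser shipped
# with icefall) lets the defaults above be overridden from the command line,
# e.g.: ./prepare_prompt_asr.sh --stage 0 --stop_stage 1 --topk 10000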

. shared/parse_options.sh || exit 1

if [ $stage -le 0 ] && [ $stop_stage -ge 0 ]; then
log "Stage 0: Download the meta biasing list for LibriSpeech"
mkdir -p data/context_biasing
cd data/context_biasing
git clone https://github.com/facebookresearch/fbai-speech.git
cd ../..
fi

if [ $stage -le 1 ] && [ $stop_stage -ge 1 ]; then
log "Stage 1: Add rare-words for context biasing to the manifest"
python zipformer_prompt_asr/utils.py \
--manifest-dir $manifest_dir \
--subset $subset \
--top-k $topk

fi
1 change: 1 addition & 0 deletions egs/libriheavy/ASR/shared
Empty file.