PromptASR for contextualized ASR with controllable style (#1250)
* Add PromptASR with BERT as text encoder

* Support using word-list based content prompts for context biasing

* Upload the pretrained models to huggingface

* Add usage example
marcoyang1998 authored Oct 11, 2023
1 parent cb874e9 commit 16a2748
Showing 29 changed files with 15,825 additions and 3 deletions.
205 changes: 205 additions & 0 deletions egs/libriheavy/ASR/RESULTS.md
@@ -0,0 +1,205 @@
## Results

### Zipformer PromptASR (zipformer + PromptASR + BERT text encoder)

#### [zipformer_prompt_asr](./zipformer_prompt_asr)

See <https://github.com/k2-fsa/icefall/pull/1250> for commit history and
our paper <https://arxiv.org/abs/2309.07414> for more details.



##### Training on the medium subset, with content & style prompt, **no** context list

You can find a pre-trained model, training logs, decoding logs, and decoding results at: <https://huggingface.co/marcoyang/icefall-promptasr-libriheavy-zipformer-BERT-2023-10-10>
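
To fetch the checkpoint locally, a minimal sketch (assuming `git-lfs` is installed, since Hugging Face repositories store the large model files with Git LFS):

```bash
# Sketch only: clone the pretrained model repository from Hugging Face.
# git-lfs is required so the checkpoint files are actually downloaded
# rather than left as LFS pointer files.
git lfs install
git clone https://huggingface.co/marcoyang/icefall-promptasr-libriheavy-zipformer-BERT-2023-10-10
```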

The training command is:

```bash
causal=0
subset=medium
memory_dropout_rate=0.05
text_encoder_type=BERT

python ./zipformer_prompt_asr/train_bert_encoder.py \
--world-size 4 \
--start-epoch 1 \
--num-epochs 60 \
--exp-dir ./zipformer_prompt_asr/exp \
--use-fp16 True \
--memory-dropout-rate $memory_dropout_rate \
--causal $causal \
--subset $subset \
--manifest-dir data/fbank \
--bpe-model data/lang_bpe_500_fallback_coverage_0.99/bpe.model \
--max-duration 1000 \
--text-encoder-type $text_encoder_type \
--text-encoder-dim 768 \
--use-context-list 0 \
--use-style-prompt 1
```
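
To monitor training, a sketch assuming `train_bert_encoder.py` follows the usual icefall convention of writing TensorBoard event files under the experiment directory (the exact path is an assumption, not confirmed by this commit):

```bash
# Assumption: logs are written to <exp-dir>/tensorboard, as in other
# icefall recipes; adjust the path if your run differs.
tensorboard --logdir ./zipformer_prompt_asr/exp/tensorboard --port 6006
```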

The decoding results (WER, %) using utterance-level context (epoch 60, averaging 10 checkpoints):

| decoding method | lh-test-clean | lh-test-other | comment |
|----------------------|---------------|---------------|---------------------|
| modified_beam_search | 3.13 | 6.78 | --use-pre-text False --use-style-prompt False |
| modified_beam_search | 2.86 | 5.93 | --pre-text-transform upper-no-punc --style-text-transform upper-no-punc |
| modified_beam_search | 2.60 | 5.50 | --pre-text-transform mixed-punc --style-text-transform mixed-punc |


The decoding command is:

```bash
for style in mixed-punc upper-no-punc; do
python ./zipformer_prompt_asr/decode_bert.py \
--epoch 60 \
--avg 10 \
--use-averaged-model True \
--post-normalization True \
--causal False \
--exp-dir ./zipformer_prompt_asr/exp \
--manifest-dir data/fbank \
--bpe-model data/lang_bpe_500_fallback_coverage_0.99/bpe.model \
--max-duration 1000 \
--decoding-method modified_beam_search \
--beam-size 4 \
--text-encoder-type BERT \
--text-encoder-dim 768 \
--memory-layer 0 \
--use-ls-test-set False \
--use-ls-context-list False \
--max-prompt-lens 1000 \
--use-pre-text True \
--use-style-prompt True \
--style-text-transform $style \
--pre-text-transform $style \
--compute-CER 0
done
```
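
The first row of the table (decoding without any prompt) is not produced by the loop above; a sketch of that baseline run, mirroring the flags above and omitting the prompt-specific transforms:

```bash
# Baseline from the first table row: no content prompt and no style prompt.
# All other flags mirror the loop above.
python ./zipformer_prompt_asr/decode_bert.py \
--epoch 60 \
--avg 10 \
--use-averaged-model True \
--post-normalization True \
--causal False \
--exp-dir ./zipformer_prompt_asr/exp \
--manifest-dir data/fbank \
--bpe-model data/lang_bpe_500_fallback_coverage_0.99/bpe.model \
--max-duration 1000 \
--decoding-method modified_beam_search \
--beam-size 4 \
--text-encoder-type BERT \
--text-encoder-dim 768 \
--memory-layer 0 \
--use-ls-test-set False \
--use-ls-context-list False \
--use-pre-text False \
--use-style-prompt False \
--compute-CER 0
```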

##### Training on the medium subset, with content & style prompt, **with** context list

You can find a pre-trained model, training logs, decoding logs, and decoding results at: <https://huggingface.co/marcoyang/icefall-promptasr-with-context-libriheavy-zipformer-BERT-2023-10-10>
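
This model can be fetched with the same `git lfs` sketch shown earlier, substituting this repository's URL.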

This model is trained with an extra type of content prompt (context words), so it performs better
at **word-level** context biasing. Note that before training this model, you should first run
`prepare_prompt_asr.sh` to prepare a manifest containing the context words.

The training command is:

```bash

causal=0
subset=medium
memory_dropout_rate=0.05
text_encoder_type=BERT
use_context_list=True

# prepare the required data for context biasing
./prepare_prompt_asr.sh --stage 0 --stop_stage 1

python ./zipformer_prompt_asr/train_bert_encoder.py \
--world-size 4 \
--start-epoch 1 \
--num-epochs 50 \
--exp-dir ./zipformer_prompt_asr/exp \
--use-fp16 True \
--memory-dropout-rate $memory_dropout_rate \
--causal $causal \
--subset $subset \
--manifest-dir data/fbank \
--bpe-model data/lang_bpe_500_fallback_coverage_0.99/bpe.model \
--max-duration 1000 \
--text-encoder-type $text_encoder_type \
--text-encoder-dim 768 \
--use-context-list $use_context_list \
--top-k 10000 \
--use-style-prompt 1
```

*Utterance-level biasing (WER, %):*

| decoding method | lh-test-clean | lh-test-other | comment |
|----------------------|---------------|---------------|---------------------|
| modified_beam_search | 3.17 | 6.72 | --use-pre-text 0 --use-style-prompt 0 |
| modified_beam_search | 2.91 | 6.24 | --pre-text-transform upper-no-punc --style-text-transform upper-no-punc |
| modified_beam_search | 2.72 | 5.72 | --pre-text-transform mixed-punc --style-text-transform mixed-punc |


The decoding command for the table above is:

```bash
for style in mixed-punc upper-no-punc; do
python ./zipformer_prompt_asr/decode_bert.py \
--epoch 50 \
--avg 10 \
--use-averaged-model True \
--post-normalization True \
--causal False \
--exp-dir ./zipformer_prompt_asr/exp \
--manifest-dir data/fbank \
--bpe-model data/lang_bpe_500_fallback_coverage_0.99/bpe.model \
--max-duration 1000 \
--decoding-method modified_beam_search \
--beam-size 4 \
--text-encoder-type BERT \
--text-encoder-dim 768 \
--memory-layer 0 \
--use-ls-test-set False \
--use-ls-context-list False \
--max-prompt-lens 1000 \
--use-pre-text True \
--use-style-prompt True \
--style-text-transform $style \
--pre-text-transform $style \
--compute-CER 0
done
```
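
The baseline row (`--use-pre-text 0 --use-style-prompt 0`) can be reproduced with the same no-prompt sketch shown in the previous subsection, with `--epoch 50`.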

*Word-level biasing (WER, %):*

The results are reported on the LibriSpeech test sets, using the biasing lists provided by <https://arxiv.org/abs/2104.02194>.
You need to set `--use-ls-test-set True` so that the LibriSpeech test sets are used; `--ls-distractors` sets the number of distractor words added to each utterance's biasing list.

| decoding method | ls-test-clean | ls-test-other | comment |
|----------------------|---------------|---------------|---------------------|
| modified_beam_search | 2.40 | 5.08 | --use-pre-text 0 --use-style-prompt 0 |
| modified_beam_search | 2.14 | 4.62 | --use-ls-context-list 1 --pre-text-transform mixed-punc --style-text-transform mixed-punc --ls-distractors 0 |
| modified_beam_search | 2.14 | 4.64 | --use-ls-context-list 1 --pre-text-transform mixed-punc --style-text-transform mixed-punc --ls-distractors 100 |

The decoding command for the table above is:

```bash
use_ls_test_set=1
use_ls_context_list=1

for ls_distractors in 0 100; do
python ./zipformer_prompt_asr/decode_bert.py \
--epoch 50 \
--avg 10 \
--use-averaged-model True \
--post-normalization True \
--causal False \
--exp-dir ./zipformer_prompt_asr/exp \
--manifest-dir data/fbank \
--bpe-model data/lang_bpe_500_fallback_coverage_0.99/bpe.model \
--max-duration 1000 \
--decoding-method modified_beam_search \
--beam-size 4 \
--text-encoder-type BERT \
--text-encoder-dim 768 \
--memory-layer 0 \
--use-ls-test-set $use_ls_test_set \
--use-ls-context-list $use_ls_context_list \
--ls-distractors $ls_distractors \
--max-prompt-lens 1000 \
--use-pre-text True \
--use-style-prompt True \
--style-text-transform mixed-punc \
--pre-text-transform mixed-punc \
--compute-CER 0
done

```
36 changes: 36 additions & 0 deletions egs/libriheavy/ASR/prepare_prompt_asr.sh
Original file line number Diff line number Diff line change
@@ -0,0 +1,36 @@
#!/usr/bin/env bash

set -eou pipefail

# This is the preparation recipe for PromptASR: https://arxiv.org/pdf/2309.07414

log() {
# This function is from espnet
local fname=${BASH_SOURCE[1]##*/}
echo -e "$(date '+%Y-%m-%d %H:%M:%S') (${fname}:${BASH_LINENO[0]}:${FUNCNAME[1]}) $*"
}

stage=-1
stop_stage=100
manifest_dir=data/fbank
subset=medium
topk=10000
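# shared/parse_options.sh (the standard Kaldi-style option parser shipped
# with icefall) lets the defaults above be overridden from the command line,
# e.g.: ./prepare_prompt_asr.sh --stage 0 --stop_stage 1 --topk 10000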

. shared/parse_options.sh || exit 1

if [ $stage -le 0 ] && [ $stop_stage -ge 0 ]; then
log "Stage 0: Download the meta biasing list for LibriSpeech"
mkdir -p data/context_biasing
cd data/context_biasing
git clone https://github.com/facebookresearch/fbai-speech.git
cd ../..
fi

if [ $stage -le 1 ] && [ $stop_stage -ge 1 ]; then
log "Stage 1: Add rare-words for context biasing to the manifest"
python zipformer_prompt_asr/utils.py \
--manifest-dir $manifest_dir \
--subset $subset \
--top-k $topk

fi
1 change: 1 addition & 0 deletions egs/libriheavy/ASR/shared
Empty file.