Refactor prepare.sh in librispeech
pkufool committed Feb 6, 2024
1 parent a813186 commit b3f1a9f
Showing 6 changed files with 376 additions and 274 deletions.
2 changes: 1 addition & 1 deletion egs/librispeech/ASR/RESULTS.md
@@ -1526,7 +1526,7 @@ done

You may also decode using LODR + LM shallow fusion. This decoding method is proposed in <https://arxiv.org/pdf/2203.16776.pdf>.
It subtracts the internal language model score during shallow fusion, which is approximated by a bi-gram model. The bi-gram can be
-generated by `generate-lm.sh`, or you may download it from <https://huggingface.co/marcoyang/librispeech_bigram>.
+generated by `prepare_lm.sh` at stage 4, or you may download it from <https://huggingface.co/marcoyang/librispeech_bigram>.

The decoding command is as follows:

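For intuition, the LODR score of a hypothesis is the ordinary shallow-fusion score with the bi-gram's estimate of the internal LM subtracted. A minimal sketch of that combination; the function name and the scale values are illustrative, not the recipe's defaults:

```python
def lodr_score(
    asr_logp: float,  # log-prob of the hypothesis under the ASR model
    lm_logp: float,  # log-prob under the external neural LM
    bigram_logp: float,  # log-prob under the bi-gram internal-LM estimate
    lm_scale: float = 0.4,  # illustrative weight for the external LM
    lodr_scale: float = 0.16,  # illustrative weight for the subtracted bi-gram
) -> float:
    """Combine scores the way LODR does: add the external LM with
    lm_scale and subtract the bi-gram approximation of the internal
    LM with lodr_scale."""
    return asr_logp + lm_scale * lm_logp - lodr_scale * bigram_logp
```

With positive `lodr_scale`, hypotheses that the internal LM already favors are penalized, so the external LM's evidence is not double-counted.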
20 changes: 0 additions & 20 deletions egs/librispeech/ASR/generate-lm.sh

This file was deleted.

14 changes: 14 additions & 0 deletions egs/librispeech/ASR/local/train_bpe_model.py
@@ -57,6 +57,18 @@ def get_args():
    return parser.parse_args()


def generate_tokens(lang_dir: Path):
    """
    Generate the tokens.txt from a bpe model.
    """
    sp = spm.SentencePieceProcessor()
    sp.load(str(lang_dir / "bpe.model"))
    token2id: Dict[str, int] = {sp.id_to_piece(i): i for i in range(sp.vocab_size())}
    with open(lang_dir / "tokens.txt", "w", encoding="utf-8") as f:
        for sym, i in token2id.items():
            f.write(f"{sym} {i}\n")


def main():
    args = get_args()
    vocab_size = args.vocab_size
@@ -95,6 +107,8 @@ def main():

    shutil.copyfile(model_file, f"{lang_dir}/bpe.model")

    generate_tokens(lang_dir)


if __name__ == "__main__":
    main()
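The tokens.txt written by generate_tokens is a plain text mapping, one `<symbol> <id>` pair per line. A small sketch of reading it back; the helper name is mine, not part of the recipe:

```python
from pathlib import Path
from typing import Dict


def read_tokens(tokens_file: Path) -> Dict[str, int]:
    """Parse a tokens.txt of "<symbol> <id>" lines back into a
    symbol -> id mapping. rsplit from the right is used because the
    id is always the last whitespace-separated field on the line."""
    token2id: Dict[str, int] = {}
    with open(tokens_file, encoding="utf-8") as f:
        for line in f:
            sym, idx = line.rstrip("\n").rsplit(maxsplit=1)
            token2id[sym] = int(idx)
    return token2id
```

Round-tripping through this reader is a quick sanity check that the file generate_tokens produced matches the SentencePiece vocabulary.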
