Zipformer recipe for ReazonSpeech (#1611)

* Add first cut at ReazonSpeech recipe

This recipe is mostly based on egs/csj, but tweaked to the point that it can be run with the ReazonSpeech corpus.

Signed-off-by: Fujimoto Seiji <[email protected]>

---------

Signed-off-by: Fujimoto Seiji <[email protected]>
Co-authored-by: Fujimoto Seiji <[email protected]>
Co-authored-by: Chen <[email protected]>
Co-authored-by: root <[email protected]>
k2-fsa · Jun 13, 2024 · 3b40d9b
1 parent d5be739
Showing 37 changed files with 5,488 additions and 0 deletions.
diff --git a/egs/reazonspeech/ASR/README.md b/egs/reazonspeech/ASR/README.md
@@ -0,0 +1,29 @@
+# Introduction
+
+
+
+**ReazonSpeech** is an open-source dataset that contains a diverse set of natural Japanese speech, collected from terrestrial television streams. It contains more than 35,000 hours of audio.
+
+
+
+The dataset is available on Hugging Face. For more details, please visit:
+
+- Dataset: https://huggingface.co/datasets/reazon-research/reazonspeech
+- Paper: https://research.reazon.jp/_static/reazonspeech_nlp2023.pdf
+
+
+
+[./RESULTS.md](./RESULTS.md) contains the latest results.
+
+# Transducers
+
+
+
+This folder currently contains a single transducer recipe, `zipformer`. The following table summarizes it.
+
+|                                          | Encoder              | Decoder            | Comment                                           |
+| ---------------------------------------- | -------------------- | ------------------ | ------------------------------------------------- |
+| `zipformer`                              | Upgraded Zipformer   | Embedding + Conv1d | The latest recipe                                 |
+
+The decoder in `zipformer` is modified from the paper [RNN-Transducer with Stateless Prediction Network](https://ieeexplore.ieee.org/document/9054419/). We place an additional Conv1d layer right after the input embedding layer.
+
diff --git a/egs/reazonspeech/ASR/RESULTS.md b/egs/reazonspeech/ASR/RESULTS.md
@@ -0,0 +1,49 @@
+## Results
+
+### Zipformer
+
+#### Non-streaming
+
+##### large-scale model, number of model parameters: 159337842, i.e., 159.34 M
+
+|   decoding method    | In-Distribution CER | JSUT | CommonVoice | TEDx  |      comment       |
+| :------------------: | :-----------------: | :--: | :---------: | :---: | :----------------: |
+|    greedy search     |         4.2         | 6.7  |    7.84     | 17.9  | --epoch 39 --avg 7 |
+| modified beam search |        4.13         | 6.77 |    7.69     | 17.82 | --epoch 39 --avg 7 |
+
+The training command is:
+
+```shell
+./zipformer/train.py \
+  --world-size 8 \
+  --num-epochs 40 \
+  --start-epoch 1 \
+  --use-fp16 1 \
+  --exp-dir zipformer/exp-large \
+  --causal 0 \
+  --num-encoder-layers 2,2,4,5,4,2 \
+  --feedforward-dim 512,768,1536,2048,1536,768 \
+  --encoder-dim 192,256,512,768,512,256 \
+  --encoder-unmasked-dim 192,192,256,320,256,192 \
+  --lang data/lang_char \
+  --max-duration 1600
+```
+
+The decoding command is:
+
+```shell
+./zipformer/decode.py \
+    --epoch 40 \
+    --avg 16 \
+    --exp-dir zipformer/exp-large \
+    --max-duration 600 \
+    --causal 0 \
+    --decoding-method greedy_search \
+    --num-encoder-layers 2,2,4,5,4,2 \
+    --feedforward-dim 512,768,1536,2048,1536,768 \
+    --encoder-dim 192,256,512,768,512,256 \
+    --encoder-unmasked-dim 192,192,256,320,256,192 \
+    --lang data/lang_char \
+    --blank-penalty 0
+```
+
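Reviewer note: the table above reports character error rate (CER). For readers unfamiliar with the metric, here is a minimal sketch of how a corpus-level CER can be computed from reference and hypothesis strings. It is not the scoring code used by this recipe; the function names `edit_distance` and `cer` are hypothetical, and any text normalization applied before scoring is not modeled here.

```python
from typing import Sequence


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted in characters."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(
                min(
                    prev[j] + 1,  # delete a reference character
                    curr[j - 1] + 1,  # insert a hypothesis character
                    prev[j - 1] + (r != h),  # substitute (cost 0 on a match)
                )
            )
        prev = curr
    return prev[-1]


def cer(refs: Sequence[str], hyps: Sequence[str]) -> float:
    """Corpus-level CER in percent: total edits / total reference characters."""
    errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    total = sum(len(r) for r in refs)
    return 100.0 * errors / total


if __name__ == "__main__":
    # Two of the five reference characters are substituted.
    print(cer(["こんにちは"], ["こんばんは"]))
```

Summing edits over the whole test set before dividing, as done here, weights long utterances more than averaging per-utterance rates would.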
diff --git a/egs/reazonspeech/ASR/local/compute_fbank_reazonspeech.py b/egs/reazonspeech/ASR/local/compute_fbank_reazonspeech.py
@@ -0,0 +1,146 @@
+#!/usr/bin/env python3
+# Copyright    2023  The University of Electro-Communications  (Author: Teo Wen Shen)  # noqa
+#
+# See ../../../../LICENSE for clarification regarding multiple authors
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+
+import argparse
+import logging
+import os
+from pathlib import Path
+from typing import List, Tuple
+
+import torch
+
+# fmt: off
+from lhotse import (  # See the following for why LilcomChunkyWriter is preferred; https://github.com/k2-fsa/icefall/pull/404; https://github.com/lhotse-speech/lhotse/pull/527
+    CutSet,
+    Fbank,
+    FbankConfig,
+    LilcomChunkyWriter,
+    RecordingSet,
+    SupervisionSet,
+)
+
+# fmt: on
+
+# Torch's multithreaded behavior needs to be disabled or
+# it wastes a lot of CPU and slows things down.
+# Do this outside of main() in case it needs to take effect
+# even when we are not invoking the main (e.g. when spawning subprocesses).
+torch.set_num_threads(1)
+torch.set_num_interop_threads(1)
+
+RNG_SEED = 42
+concat_params = {"gap": 1.0, "maxlen": 10.0}
+
+
+def make_cutset_blueprints(
+    manifest_dir: Path,
+) -> List[Tuple[str, CutSet]]:
+    cut_sets = []
+
+    # Create test dataset
+    logging.info("Creating test cuts.")
+    cut_sets.append(
+        (
+            "test",
+            CutSet.from_manifests(
+                recordings=RecordingSet.from_file(
+                    manifest_dir / "reazonspeech_recordings_test.jsonl.gz"
+                ),
+                supervisions=SupervisionSet.from_file(
+                    manifest_dir / "reazonspeech_supervisions_test.jsonl.gz"
+                ),
+            ),
+        )
+    )
+
+    # Create dev dataset
+    logging.info("Creating dev cuts.")
+    cut_sets.append(
+        (
+            "dev",
+            CutSet.from_manifests(
+                recordings=RecordingSet.from_file(
+                    manifest_dir / "reazonspeech_recordings_dev.jsonl.gz"
+                ),
+                supervisions=SupervisionSet.from_file(
+                    manifest_dir / "reazonspeech_supervisions_dev.jsonl.gz"
+                ),
+            ),
+        )
+    )
+
+    # Create train dataset
+    logging.info("Creating train cuts.")
+    cut_sets.append(
+        (
+            "train",
+            CutSet.from_manifests(
+                recordings=RecordingSet.from_file(
+                    manifest_dir / "reazonspeech_recordings_train.jsonl.gz"
+                ),
+                supervisions=SupervisionSet.from_file(
+                    manifest_dir / "reazonspeech_supervisions_train.jsonl.gz"
+                ),
+            ),
+        )
+    )
+    return cut_sets
+
+
+def get_args():
+    parser = argparse.ArgumentParser(
+        formatter_class=argparse.ArgumentDefaultsHelpFormatter,
+    )
+    parser.add_argument("-m", "--manifest-dir", type=Path, required=True)
+    return parser.parse_args()
+
+
+def main():
+    args = get_args()
+
+    extractor = Fbank(FbankConfig(num_mel_bins=80))
+    num_jobs = min(16, os.cpu_count() or 1)  # os.cpu_count() may return None
+
+    formatter = "%(asctime)s %(levelname)s [%(filename)s:%(lineno)d] %(message)s"
+
+    logging.basicConfig(format=formatter, level=logging.INFO)
+
+    if (args.manifest_dir / ".reazonspeech-fbank.done").exists():
+        logging.info(
+            "Previous fbank computed for ReazonSpeech found. "
+            f"Delete {args.manifest_dir / '.reazonspeech-fbank.done'} to allow recomputing fbank."
+        )
+        return
+    else:
+        cut_sets = make_cutset_blueprints(args.manifest_dir)
+        for part, cut_set in cut_sets:
+            logging.info(f"Processing {part}")
+            cut_set = cut_set.compute_and_store_features(
+                extractor=extractor,
+                num_jobs=num_jobs,
+                storage_path=(args.manifest_dir / f"feats_{part}").as_posix(),
+                storage_type=LilcomChunkyWriter,
+            )
+            cut_set.to_file(args.manifest_dir / f"reazonspeech_cuts_{part}.jsonl.gz")
+
+        logging.info("All fbank computed for ReazonSpeech.")
+        (args.manifest_dir / ".reazonspeech-fbank.done").touch()
+
+
+if __name__ == "__main__":
+    main()
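Reviewer note: `main()` above skips the work when `.reazonspeech-fbank.done` exists and touches that file only after all parts are processed. The same idempotent "done marker" pattern can be sketched in isolation as follows. The helper name `run_once` and the marker name `.demo.done` are hypothetical and exist only for illustration; the lambda stands in for the feature extraction step.

```python
import tempfile
from pathlib import Path
from typing import Callable


def run_once(work_dir: Path, marker_name: str, step: Callable[[], None]) -> bool:
    """Run `step` unless the marker file exists; return True if it ran."""
    marker = work_dir / marker_name
    if marker.exists():
        return False
    step()
    # Touch the marker only after the step finished without raising,
    # so an interrupted run is retried on the next invocation.
    marker.touch()
    return True


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        work_dir = Path(tmp)
        calls = []
        print(run_once(work_dir, ".demo.done", lambda: calls.append("fbank")))
        print(run_once(work_dir, ".demo.done", lambda: calls.append("fbank")))
        print(calls)
```

Because the marker is written last, deleting it (as the log message in the script suggests) is enough to force recomputation.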
diff --git a/egs/reazonspeech/ASR/local/display_manifest_statistics.py b/egs/reazonspeech/ASR/local/display_manifest_statistics.py
@@ -0,0 +1,58 @@
+#!/usr/bin/env python3
+# Copyright    2021  Xiaomi Corp.        (authors: Fangjun Kuang)
+#              2022  The University of Electro-Communications (author: Teo Wen Shen)  # noqa
+#
+# See ../../../../LICENSE for clarification regarding multiple authors
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import argparse
+from pathlib import Path
+
+from lhotse import CutSet, load_manifest
+
+ARGPARSE_DESCRIPTION = """
+This file displays duration statistics of utterances in a manifest.
+You can use the displayed value to choose minimum/maximum duration
+to remove short and long utterances during the training.
+
+See the function `remove_short_and_long_utt()` in
+pruned_transducer_stateless5/train.py for usage.
+"""
+
+
+def get_parser():
+    parser = argparse.ArgumentParser(
+        description=ARGPARSE_DESCRIPTION,
+        formatter_class=argparse.ArgumentDefaultsHelpFormatter,
+    )
+
+    parser.add_argument("--manifest-dir", type=Path, help="Path to cutset manifests")
+
+    return parser.parse_args()
+
+
+def main():
+    args = get_parser()
+
+    for part in ["train", "dev"]:
+        path = args.manifest_dir / f"reazonspeech_cuts_{part}.jsonl.gz"
+        cuts: CutSet = load_manifest(path)
+
+        print("\n---------------------------------\n")
+        print(path.name + ":")
+        cuts.describe()
+
+
+if __name__ == "__main__":
+    main()
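Reviewer note: the description string in this script says the printed statistics help choose minimum and maximum durations for dropping short and long utterances. A minimal sketch of that filtering step is shown below, using a plain dataclass instead of lhotse cuts so it runs without the dataset. The `Utt` type, the `keep_by_duration` helper, and the 1.0 s / 30.0 s thresholds are assumptions for illustration, not values taken from this recipe.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Utt:
    """Hypothetical stand-in for a cut: just an id and a duration in seconds."""

    id: str
    duration: float


def keep_by_duration(
    utts: Iterable[Utt], min_s: float = 1.0, max_s: float = 30.0
) -> List[Utt]:
    """Keep utterances whose duration lies in [min_s, max_s]."""
    return [u for u in utts if min_s <= u.duration <= max_s]


if __name__ == "__main__":
    utts = [Utt("a", 0.4), Utt("b", 5.2), Utt("c", 41.0)]
    print([u.id for u in keep_by_duration(utts)])
```

With lhotse, a predicate of the same shape can be passed to `CutSet.filter`; the thresholds themselves would be read off the duration percentiles that `cuts.describe()` prints.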
diff --git a/egs/reazonspeech/ASR/local/prepare_lang_char.py b/egs/reazonspeech/ASR/local/prepare_lang_char.py
@@ -0,0 +1,75 @@
+#!/usr/bin/env python3
+# Copyright    2022  The University of Electro-Communications  (Author: Teo Wen Shen)  # noqa
+#
+# See ../../../../LICENSE for clarification regarding multiple authors
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+
+import argparse
+import logging
+from pathlib import Path
+
+from lhotse import CutSet
+
+
+def get_args():
+    parser = argparse.ArgumentParser(
+        formatter_class=argparse.ArgumentDefaultsHelpFormatter,
+    )
+
+    parser.add_argument(
+        "train_cut", metavar="train-cut", type=Path, help="Path to the train cut"
+    )
+
+    parser.add_argument(
+        "--lang-dir",
+        type=Path,
+        default=Path("data/lang_char"),
+        help=(
+            "Name of lang dir. "
+            "If not set, this will default to data/lang_char"
+        ),
+    )
+
+    return parser.parse_args()
+
+
+def main():
+    args = get_args()
+    logging.basicConfig(
+        format=("%(asctime)s %(levelname)s [%(filename)s:%(lineno)d] %(message)s"),
+        level=logging.INFO,
+    )
+
+    sysdef_string = set(["<blk>", "<unk>", "<sos/eos>", " "])
+
+    token_set = set()
+    logging.info(f"Creating vocabulary from {args.train_cut}.")
+    train_cut: CutSet = CutSet.from_file(args.train_cut)
+    for cut in train_cut:
+        for sup in cut.supervisions:
+            token_set.update(sup.text)
+
+    token_set = ["<blk>"] + sorted(token_set - sysdef_string) + ["<unk>", "<sos/eos>"]
+    args.lang_dir.mkdir(parents=True, exist_ok=True)
+    (args.lang_dir / "tokens.txt").write_text(
+        "\n".join(f"{t}\t{i}" for i, t in enumerate(token_set)), encoding="utf-8"
+    )
+
+    (args.lang_dir / "lang_type").write_text("char")
+    logging.info("Done.")
+
+
+if __name__ == "__main__":
+    main()
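Reviewer note: the vocabulary logic in `main()` relies on `set.update(str)` adding each character of the transcript individually, then removes the reserved symbols (including the space character) before assigning ids. The sketch below reproduces that logic on in-memory strings so the resulting token order is easy to inspect. The helper name `build_char_vocab` and the sample transcripts are hypothetical.

```python
from typing import Iterable, List

# Reserved symbols, mirroring the set used in the script above.
RESERVED = {"<blk>", "<unk>", "<sos/eos>", " "}


def build_char_vocab(texts: Iterable[str]) -> List[str]:
    """Return tokens in id order: <blk>, sorted characters, <unk>, <sos/eos>."""
    chars = set()
    for text in texts:
        # Updating a set with a string adds every character separately.
        chars.update(text)
    return ["<blk>"] + sorted(chars - RESERVED) + ["<unk>", "<sos/eos>"]


if __name__ == "__main__":
    vocab = build_char_vocab(["あい う", "いえ"])
    for idx, token in enumerate(vocab):
        print(f"{token}\t{idx}")
```

Sorting the characters keeps the token ids stable across runs for the same training text, and placing `<blk>` first gives the blank symbol id 0.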