Zipformer recipe for ReazonSpeech #1611

Merged 51 commits on Jun 13, 2024.

Commits:
- a93aece: test icefall with yesno (Triplecq, Oct 3, 2023)
- 26ee4c3: Merge branch 'master' of github.com:Triplecq/icefall (Triplecq, Oct 3, 2023)
- 16c02cf: Merge latest commit 'b0f70c9' on k2-fsa/icefall (fujimotos, Dec 18, 2023)
- c1ce7ca: Add first cut at ReazonSpeech recipe (fujimotos, Dec 11, 2023)
- a82e001: Merge branch 'k2-fsa:master' into master (Triplecq, Dec 20, 2023)
- abbee87: Merge tag 'rs-experiment' of kdm00:/mnt/syno128/volume1/fujimotos/git… (Dec 20, 2023)
- 2436597: Zipformer recipe (Dec 27, 2023)
- af87726: init zipformer recipe (Triplecq, Jan 14, 2024)
- 8eae6ec: Add pruned_transducer_stateless2 from reazonspeech branch (Triplecq, Jan 14, 2024)
- 5e9a171: customize tranning script for rs (Triplecq, Jan 14, 2024)
- 1e6fe2e: restore (Triplecq, Jan 14, 2024)
- b1de6f2: customized recipes for reazonspeech (Triplecq, Jan 14, 2024)
- dc2d531: customized recipes for rs (Triplecq, Jan 14, 2024)
- 819db8f: Merge branch 'master' of github.com:Triplecq/icefall (Triplecq, Jan 14, 2024)
- ced8a53: Merge branch 'master' into rs (Triplecq, Jan 14, 2024)
- 42c152f: decrease learning-rate to solve the error: RuntimeError: grad_scale i… (Triplecq, Jan 14, 2024)
- 04fa9e3: traning script completed (Triplecq, Jan 14, 2024)
- 7b6a897: customize decoding script (Triplecq, Jan 14, 2024)
- 77178c6: comment out params related to the chunk size (Triplecq, Jan 14, 2024)
- a8e9dc2: all combinations of epochs and avgs (Triplecq, Jan 23, 2024)
- f35fa8a: add blank penalty in decoding script (Triplecq, Jan 23, 2024)
- d864da4: validation scripts (Triplecq, Jan 24, 2024)
- 5d94a19: prepare for 1000h dataset (Triplecq, Jan 24, 2024)
- 860a6b2: complete exp on zipformer-L (Mar 24, 2024)
- 03e8cfa: validation test (Mar 24, 2024)
- 456241b: update graph (Mar 24, 2024)
- 5e7db1a: complete validation (Mar 26, 2024)
- baf6ebb: delete graph (Mar 26, 2024)
- 1e25c96: update graph (Mar 26, 2024)
- 3b36a67: update graph (Mar 26, 2024)
- 9dc2a86: update graph (Mar 26, 2024)
- 7e0817e: update graph (Mar 27, 2024)
- 8229730: update graph (Mar 27, 2024)
- 72faff6: update graph (Mar 27, 2024)
- 92ab73e: update graph (Mar 27, 2024)
- b6216cd: calculate RTF (Triplecq, Mar 31, 2024)
- e5b3b63: export onnx model (May 1, 2024)
- 01325b5: remove unnecessary files (May 1, 2024)
- 3505a8e: Merge remote-tracking branch 'upstream/master' into reazonspeech-recipe (May 1, 2024)
- ea1d9b2: update README & RESULTS (Triplecq, May 1, 2024)
- 1050455: remove unnecessary files (May 1, 2024)
- 45a1225: remove outdated recipes (May 2, 2024)
- d61b739: Update README.md (Triplecq, May 2, 2024)
- 0925a0c: format files with isort to meet style guidelines (May 2, 2024)
- 97c9311: format files with isort to meet style guidelines (May 2, 2024)
- 193470c: remove unrelated changes (May 2, 2024)
- 8edd9bd: add back necessary docs (May 2, 2024)
- f8707d7: remove unrelated changes (May 2, 2024)
- 2507918: Add download method to prepare.sh (Triplecq, May 19, 2024)
- e39f56e: Fix cuts file path (Triplecq, May 20, 2024)
- 777f7a4: Change valid to dev for consistency (Triplecq, May 20, 2024)
29 changes: 29 additions & 0 deletions egs/reazonspeech/ASR/README.md
@@ -0,0 +1,29 @@
# Introduction



**ReazonSpeech** is an open-source dataset that contains a diverse set of natural Japanese speech, collected from terrestrial television streams. It contains more than 35,000 hours of audio.



The dataset is available on Hugging Face. For more details, please visit:

- Dataset: https://huggingface.co/datasets/reazon-research/reazonspeech
- Paper: https://research.reazon.jp/_static/reazonspeech_nlp2023.pdf



[./RESULTS.md](./RESULTS.md) contains the latest results.

# Transducers



There are various folders whose names contain `transducer` in this directory. The following table lists the differences among them.

| | Encoder | Decoder | Comment |
| ---------------------------------------- | -------------------- | ------------------ | ------------------------------------------------- |
| `zipformer` | Upgraded Zipformer | Embedding + Conv1d | The latest recipe |

The decoder, i.e., the prediction network, is stateless: it is modified from the paper [Rnn-Transducer with Stateless Prediction Network](https://ieeexplore.ieee.org/document/9054419/). We place an additional Conv1d layer right after the input embedding layer.
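As a rough illustration of the idea (a toy NumPy sketch, not the icefall implementation, with made-up sizes), the stateless prediction network is just an embedding lookup over a fixed left context of previous tokens, collapsed to a single vector by a Conv1d whose kernel spans that context:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes for illustration only; the recipe's real dimensions differ.
vocab_size, embed_dim, context_size = 6, 8, 2

embedding = rng.normal(size=(vocab_size, embed_dim))
# Conv1d weight: (out_channels, in_channels, kernel_size), kernel_size == context_size.
conv_kernel = rng.normal(size=(embed_dim, embed_dim, context_size))


def stateless_decoder(tokens: list) -> np.ndarray:
    """Return one decoder output vector computed only from the last
    `context_size` tokens -- no recurrent state is carried across steps."""
    ctx = tokens[-context_size:]
    x = embedding[ctx]  # (context_size, embed_dim)
    # A Conv1d whose kernel covers the whole context yields a single frame.
    return np.einsum("ce,oec->o", x, conv_kernel)


y = stateless_decoder([3, 1, 4])
print(y.shape)  # (8,)
```

Because the output depends only on the last two tokens, the predictor has no hidden state to maintain, which is what makes the transducer "stateless".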

49 changes: 49 additions & 0 deletions egs/reazonspeech/ASR/RESULTS.md
@@ -0,0 +1,49 @@
## Results

### Zipformer

#### Non-streaming

##### large-scale model, number of model parameters: 159,337,842, i.e., 159.34 M

| decoding method | In-Distribution CER | JSUT | CommonVoice | TEDx | comment |
| :------------------: | :-----------------: | :--: | :---------: | :---: | :----------------: |
| greedy search | 4.2 | 6.7 | 7.84 | 17.9 | --epoch 39 --avg 7 |
| modified beam search | 4.13 | 6.77 | 7.69 | 17.82 | --epoch 39 --avg 7 |

The training command is:

```shell
./zipformer/train.py \
--world-size 8 \
--num-epochs 40 \
--start-epoch 1 \
--use-fp16 1 \
--exp-dir zipformer/exp-large \
--causal 0 \
--num-encoder-layers 2,2,4,5,4,2 \
--feedforward-dim 512,768,1536,2048,1536,768 \
--encoder-dim 192,256,512,768,512,256 \
--encoder-unmasked-dim 192,192,256,320,256,192 \
--lang data/lang_char \
--max-duration 1600
```

The decoding command is:

```shell
./zipformer/decode.py \
--epoch 40 \
--avg 16 \
--exp-dir zipformer/exp-large \
--max-duration 600 \
--causal 0 \
--decoding-method greedy_search \
--num-encoder-layers 2,2,4,5,4,2 \
--feedforward-dim 512,768,1536,2048,1536,768 \
--encoder-dim 192,256,512,768,512,256 \
--encoder-unmasked-dim 192,192,256,320,256,192 \
--lang data/lang_char \
--blank-penalty 0
```
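The `--blank-penalty` flag above biases decoding away from the blank symbol, which trades insertions against deletions. A minimal sketch of the idea, assuming (as I understand icefall's implementation) that the penalty is simply subtracted from the blank logit before the argmax of a greedy-search step:

```python
import numpy as np


def greedy_step(logits: np.ndarray, blank_id: int = 0, blank_penalty: float = 0.0) -> int:
    """Pick the next symbol for one greedy-search step.

    A positive blank_penalty lowers the blank score, so the model emits
    non-blank symbols more eagerly (fewer deletions, possibly more insertions).
    """
    logits = logits.copy()
    logits[blank_id] -= blank_penalty
    return int(np.argmax(logits))


logits = np.array([2.0, 1.9, 0.5])  # blank (id 0) barely wins
print(greedy_step(logits))                     # 0 -> blank is emitted
print(greedy_step(logits, blank_penalty=0.5))  # 1 -> a non-blank symbol wins
```

With `--blank-penalty 0`, as in the command above, decoding is unmodified greedy search.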

146 changes: 146 additions & 0 deletions egs/reazonspeech/ASR/local/compute_fbank_reazonspeech.py
@@ -0,0 +1,146 @@
#!/usr/bin/env python3
# Copyright 2023 The University of Electro-Communications (Author: Teo Wen Shen) # noqa
#
# See ../../../../LICENSE for clarification regarding multiple authors
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.


import argparse
import logging
import os
from pathlib import Path
from typing import List, Tuple

import torch

# fmt: off
from lhotse import ( # See the following for why LilcomChunkyWriter is preferred; https://github.com/k2-fsa/icefall/pull/404; https://github.com/lhotse-speech/lhotse/pull/527
CutSet,
Fbank,
FbankConfig,
LilcomChunkyWriter,
RecordingSet,
SupervisionSet,
)

# fmt: on

# Torch's multithreaded behavior needs to be disabled or
# it wastes a lot of CPU and slows things down.
# Do this outside of main() in case it needs to take effect
# even when we are not invoking the main (e.g. when spawning subprocesses).
torch.set_num_threads(1)
torch.set_num_interop_threads(1)

RNG_SEED = 42
concat_params = {"gap": 1.0, "maxlen": 10.0}


def make_cutset_blueprints(
    manifest_dir: Path,
) -> List[Tuple[str, CutSet]]:
    cut_sets = []

    # Create test dataset
    logging.info("Creating test cuts.")
    cut_sets.append(
        (
            "test",
            CutSet.from_manifests(
                recordings=RecordingSet.from_file(
                    manifest_dir / "reazonspeech_recordings_test.jsonl.gz"
                ),
                supervisions=SupervisionSet.from_file(
                    manifest_dir / "reazonspeech_supervisions_test.jsonl.gz"
                ),
            ),
        )
    )

    # Create dev dataset
    logging.info("Creating dev cuts.")
    cut_sets.append(
        (
            "dev",
            CutSet.from_manifests(
                recordings=RecordingSet.from_file(
                    manifest_dir / "reazonspeech_recordings_dev.jsonl.gz"
                ),
                supervisions=SupervisionSet.from_file(
                    manifest_dir / "reazonspeech_supervisions_dev.jsonl.gz"
                ),
            ),
        )
    )

    # Create train dataset
    logging.info("Creating train cuts.")
    cut_sets.append(
        (
            "train",
            CutSet.from_manifests(
                recordings=RecordingSet.from_file(
                    manifest_dir / "reazonspeech_recordings_train.jsonl.gz"
                ),
                supervisions=SupervisionSet.from_file(
                    manifest_dir / "reazonspeech_supervisions_train.jsonl.gz"
                ),
            ),
        )
    )
    return cut_sets


def get_args():
    parser = argparse.ArgumentParser(
        formatter_class=argparse.ArgumentDefaultsHelpFormatter,
    )
    parser.add_argument("-m", "--manifest-dir", type=Path)
    return parser.parse_args()


def main():
    args = get_args()

    extractor = Fbank(FbankConfig(num_mel_bins=80))
    num_jobs = min(16, os.cpu_count())

    formatter = "%(asctime)s %(levelname)s [%(filename)s:%(lineno)d] %(message)s"

    logging.basicConfig(format=formatter, level=logging.INFO)

    if (args.manifest_dir / ".reazonspeech-fbank.done").exists():
        logging.info(
            "Previous fbank computed for ReazonSpeech found. "
            f"Delete {args.manifest_dir / '.reazonspeech-fbank.done'} to allow recomputing fbank."
        )
        return
    else:
        cut_sets = make_cutset_blueprints(args.manifest_dir)
        for part, cut_set in cut_sets:
            logging.info(f"Processing {part}")
            cut_set = cut_set.compute_and_store_features(
                extractor=extractor,
                num_jobs=num_jobs,
                storage_path=(args.manifest_dir / f"feats_{part}").as_posix(),
                storage_type=LilcomChunkyWriter,
            )
            cut_set.to_file(args.manifest_dir / f"reazonspeech_cuts_{part}.jsonl.gz")

        logging.info("All fbank computed for ReazonSpeech.")
        (args.manifest_dir / ".reazonspeech-fbank.done").touch()


if __name__ == "__main__":
    main()
58 changes: 58 additions & 0 deletions egs/reazonspeech/ASR/local/display_manifest_statistics.py
@@ -0,0 +1,58 @@
#!/usr/bin/env python3
# Copyright 2021 Xiaomi Corp. (authors: Fangjun Kuang)
# 2022 The University of Electro-Communications (author: Teo Wen Shen) # noqa
#
# See ../../../../LICENSE for clarification regarding multiple authors
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import argparse
from pathlib import Path

from lhotse import CutSet, load_manifest

ARGPARSE_DESCRIPTION = """
This file displays duration statistics of utterances in a manifest.
You can use the displayed value to choose minimum/maximum duration
to remove short and long utterances during the training.

See the function `remove_short_and_long_utt()` in
pruned_transducer_stateless5/train.py for usage.
"""


def get_parser():
    parser = argparse.ArgumentParser(
        description=ARGPARSE_DESCRIPTION,
        formatter_class=argparse.ArgumentDefaultsHelpFormatter,
    )

    parser.add_argument("--manifest-dir", type=Path, help="Path to cutset manifests")

    return parser.parse_args()


def main():
    args = get_parser()

    for part in ["train", "dev"]:
        path = args.manifest_dir / f"reazonspeech_cuts_{part}.jsonl.gz"
        cuts: CutSet = load_manifest(path)

        print("\n---------------------------------\n")
        print(path.name + ":")
        cuts.describe()


if __name__ == "__main__":
    main()
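The statistics this script prints are typically used to pick duration bounds for filtering, as the docstring's pointer to `remove_short_and_long_utt()` suggests. A simplified, hypothetical version of such a filter (the thresholds here are placeholders, not the recipe's actual values):

```python
def keep_utterance(duration: float, min_s: float = 1.0, max_s: float = 20.0) -> bool:
    """Keep an utterance only if its duration falls inside the chosen bounds.

    Very short cuts carry little supervision signal; very long cuts can
    exhaust GPU memory or destabilize training.
    """
    return min_s <= duration <= max_s


# Durations (in seconds) as might be reported by cuts.describe().
durations = [0.3, 2.5, 9.8, 31.0]
kept = [d for d in durations if keep_utterance(d)]
print(kept)  # [2.5, 9.8]
```

In the real training scripts the equivalent predicate is passed to `CutSet.filter()` before building the dataloader.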
75 changes: 75 additions & 0 deletions egs/reazonspeech/ASR/local/prepare_lang_char.py
@@ -0,0 +1,75 @@
#!/usr/bin/env python3
# Copyright 2022 The University of Electro-Communications (Author: Teo Wen Shen) # noqa
#
# See ../../../../LICENSE for clarification regarding multiple authors
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.


import argparse
import logging
from pathlib import Path

from lhotse import CutSet


def get_args():
    parser = argparse.ArgumentParser(
        formatter_class=argparse.ArgumentDefaultsHelpFormatter,
    )

    parser.add_argument(
        "train_cut", metavar="train-cut", type=Path, help="Path to the train cut"
    )

    parser.add_argument(
        "--lang-dir",
        type=Path,
        default=Path("data/lang_char"),
        help="Name of the lang dir where tokens.txt and lang_type are written.",
    )

    return parser.parse_args()


def main():
    args = get_args()
    logging.basicConfig(
        format=("%(asctime)s %(levelname)s [%(filename)s:%(lineno)d] %(message)s"),
        level=logging.INFO,
    )

    sysdef_string = set(["<blk>", "<unk>", "<sos/eos>", " "])

    token_set = set()
    logging.info(f"Creating vocabulary from {args.train_cut}.")
    train_cut: CutSet = CutSet.from_file(args.train_cut)
    for cut in train_cut:
        for sup in cut.supervisions:
            token_set.update(sup.text)

    token_set = ["<blk>"] + sorted(token_set - sysdef_string) + ["<unk>", "<sos/eos>"]
    args.lang_dir.mkdir(parents=True, exist_ok=True)
    (args.lang_dir / "tokens.txt").write_text(
        "\n".join(f"{t}\t{i}" for i, t in enumerate(token_set))
    )

    (args.lang_dir / "lang_type").write_text("char")
    logging.info("Done.")


if __name__ == "__main__":
    main()
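To make the `tokens.txt` format concrete, here is a toy run of the same vocabulary-building logic on two invented transcripts (the characters are illustrative, not drawn from the dataset): every distinct character becomes a token, system symbols are pinned to fixed positions, and each line is `token<TAB>id`.

```python
sysdef = {"<blk>", "<unk>", "<sos/eos>", " "}

# Collect every distinct character from the (fake) training transcripts.
token_set = set()
for text in ["こんにちは", "こんばんは"]:
    token_set.update(text)

# <blk> must be id 0; <unk> and <sos/eos> go last, as in the script above.
tokens = ["<blk>"] + sorted(token_set - sysdef) + ["<unk>", "<sos/eos>"]

for i, t in enumerate(tokens):
    print(f"{t}\t{i}")
```

The resulting file has one token per line; the training and decoding scripts read it back to map characters to integer ids.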