Multi_zh-Hans Recipe #1238
Conversation
script for fbank computation not done yet
@csukuangfj all requested changes have been applied, thank you!
Hi, I was exploring this recipe and noticed that the Chinese modeling unit has been adjusted from char to high-freq-char + byte-symbol. I'm curious about the rationale behind this change and would love to understand the thought process. Could you please shed some light on why this decision was made? Moreover, I'm wondering if this modification is expected to improve performance in any way? If so, could you elaborate on how and in which specific scenarios this might be beneficial? Thank you very much for your time and for sharing this valuable recipe! Looking forward to your insights.
Dear User,
Thanks for raising the question!
The motivation for this modeling-unit partitioning protocol is that when we looked into the histogram of characters exported from all the corpora involved, we found that the distribution of the characters was highly unbalanced, i.e. a small portion of the characters accounts for most of the appearances in the training data.
Of course you can use the vanilla char-based modeling solution, but the main concern is that you will eventually end up with an output layer with an enormous number of parameters (I can't recall the exact number), and most of them might not be properly trained.
Best Regards
Jin
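To make the idea concrete, here is a minimal sketch (not icefall's actual implementation; vocabulary size, coverage threshold, and byte-symbol spelling are assumptions) of a high-freq-char + byte-symbol scheme: characters covering most of the training data keep their own tokens, while rare characters fall back to their UTF-8 byte symbols, which bounds the output-layer size:

```python
# Illustrative sketch of a high-freq-char + byte-fallback tokenizer.
# This is NOT the recipe's real code; names and thresholds are made up.
from collections import Counter


def build_vocab(corpus, coverage=0.99):
    """Keep the most frequent characters until `coverage` of all
    character occurrences in `corpus` is reached."""
    counts = Counter(ch for line in corpus for ch in line)
    total = sum(counts.values())
    vocab, acc = set(), 0
    for ch, n in counts.most_common():
        vocab.add(ch)
        acc += n
        if acc / total >= coverage:
            break
    return vocab


def tokenize(text, vocab):
    """High-frequency characters stay as single tokens; any other
    character is expanded into symbols like <0xE4> for its UTF-8 bytes."""
    tokens = []
    for ch in text:
        if ch in vocab:
            tokens.append(ch)
        else:
            tokens.extend(f"<0x{b:02X}>" for b in ch.encode("utf-8"))
    return tokens
```

With this scheme the output layer only needs the high-frequency characters plus at most 256 byte symbols, instead of one unit per character ever seen in the corpora, so rare characters no longer each claim a poorly trained row of the softmax.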
This PR includes scripts for training a Zipformer model on multiple Chinese datasets.
Included Training Sets
Included Test Sets