
decoding error after successful aishell train #10

Open
MNCTTY opened this issue Sep 2, 2019 · 26 comments
@MNCTTY commented Sep 2, 2019

Hi! I managed to train LAS on aishell data without errors. This is the end of the log:

Epoch 20 | Iter 441 | Average Loss 0.406 | Current Loss 0.505424 | 64.8 ms/batch
Epoch 20 | Iter 451 | Average Loss 0.409 | Current Loss 0.383116 | 64.1 ms/batch
-------------------------------------------------------------------------------------
Valid Summary | End of Epoch 20 | Time 956.81s | Valid Loss 0.410
-------------------------------------------------------------------------------------
Learning rate adjusted to: 0.000000
Find better validated model, saving to exp/train_in240_hidden256_e3_lstm_drop0.2_dot_emb512_hidden512_d1_epoch20_norm5_bs32_mli800_mlo150_adam_lr1e-3_mmt0_l21e-5_delta/final.pth.tar
# Accounting: time=21312 threads=1
# Ended (code 0) at Fri Aug 30 17:15:39 MSK 2019, elapsed time 21312 seconds

but the decoding stage gave an error:

Stage 4: Decoding
run.pl: job failed, log is in exp/train_in240_hidden256_e3_lstm_drop0.2_dot_emb512_hidden512_d1_epoch20_norm5_bs32_mli800_mlo150_adam_lr1e-3_mmt0_l21e-5_delta/decode_test_beam30_nbest1_ml100/decode.log
2019-08-30 17:15:39,608 (json2trn:24) INFO: reading exp/train_in240_hidden256_e3_lstm_drop0.2_dot_emb512_hidden512_d1_epoch20_norm5_bs32_mli800_mlo150_adam_lr1e-3_mmt0_l21e-5_delta/decode_test_beam30_nbest1_ml100/data.json
Traceback (most recent call last):
 File "/home/karina/Listen-Attend-Spell/egs/aishell/../../src/utils/json2trn.py", line 25, in <module>
   with open(args.json, 'r') as f:
FileNotFoundError: [Errno 2] No such file or directory: 'exp/train_in240_hidden256_e3_lstm_drop0.2_dot_emb512_hidden512_d1_epoch20_norm5_bs32_mli800_mlo150_adam_lr1e-3_mmt0_l21e-5_delta/decode_test_beam30_nbest1_ml100/data.json'
write a CER (or TER) result in exp/train_in240_hidden256_e3_lstm_drop0.2_dot_emb512_hidden512_d1_epoch20_norm5_bs32_mli800_mlo150_adam_lr1e-3_mmt0_l21e-5_delta/decode_test_beam30_nbest1_ml100/result.txt
|      SPKR        |         # Snt                   # Wrd         |      Corr              Sub              Del              Ins              Err            S.Err      |
|      Sum/Avg     |             0                       0         |       0.0              0.0              0.0              0.0              0.0              0.0      |

I don't understand why that file is missing from the directory. I thought everything run.pl needs is generated there automatically.

@KnowBetterHelps

maybe you should check the file "exp/train_in240_hidden256_e3_lstm_drop0.2_dot_emb512_hidden512_d1_epoch20_norm5_bs32_mli800_mlo150_adam_lr1e-3_mmt0_l21e-5_delta/decode_test_beam30_nbest1_ml100/data.json"?

@MNCTTY commented Sep 2, 2019

yes, there is no such file, as I said
but as I understand it, it should be generated at some stage, like every other file in that directory
it isn't
and I want to find out why: the log shows no errors in any of the previous stages

@KnowBetterHelps

yeah, it should be generated when decoding starts
I am running the training process now; it will finish tomorrow, and I'll see whether I get the same problem.

@MNCTTY commented Sep 2, 2019

yep
thanks

@KnowBetterHelps

I ran into the same problem, but the root cause is not that the file doesn't exist.

in my case, I found an encoding error in "exp/train_in240_hidden256_e3_lstm_drop0.2_dot_emb512_hidden512_d1_epoch20_norm5_bs128_mli800_mlo150_adam_lr1e-3_mmt0_l21e-5_delta/decode_test_beam30_nbest1_ml100/decode.log"

I used "export PYTHONIOENCODING=UTF-8" to fix it

@MNCTTY commented Sep 3, 2019

yep, we found the encoding error earlier and solved it in a similar way
so, after fixing the encoding error, should all of run.sh run without any errors?

@KnowBetterHelps

yes, I am waiting for decoding to finish now. the recognition results seem okay

@MNCTTY commented Sep 3, 2019

how did you run recognition without the decoding stage from run.sh?

@MNCTTY commented Sep 3, 2019

ok, our news:

we SOMEHOW managed to run the decoding stage
to do this, we copied data.json from dump/test into the folder where run.sh could not find data.json,
plus added utf-8 encoding in several new places, plus changed rec_token_id to token_id, because we thought it was a typo.

and:
stage 4 finally ran successfully
and here is what it said:


karina@karina:~/Listen-Attend-Spell/egs/aishell$ ./run.sh 
dictionary: data/lang_1char/train_chars.txt
Stage 4: Decoding
run.pl: job failed, log is in exp/train_in240_hidden256_e3_lstm_drop0.2_dot_emb512_hidden512_d1_epoch1_norm5_bs32_mli800_mlo150_adam_lr1e-3_mmt0_l21e-5_delta/decode_test_beam30_nbest1_ml100/decode.log
2019-09-03 19:05:02,215 (json2trn:24) INFO: reading exp/train_in240_hidden256_e3_lstm_drop0.2_dot_emb512_hidden512_d1_epoch1_norm5_bs32_mli800_mlo150_adam_lr1e-3_mmt0_l21e-5_delta/decode_test_beam30_nbest1_ml100/data.json
2019-09-03 19:05:02,218 (json2trn:28) INFO: reading data/lang_1char/train_chars.txt
2019-09-03 19:05:02,218 (json2trn:37) INFO: writing hyp trn to exp/train_in240_hidden256_e3_lstm_drop0.2_dot_emb512_hidden512_d1_epoch1_norm5_bs32_mli800_mlo150_adam_lr1e-3_mmt0_l21e-5_delta/decode_test_beam30_nbest1_ml100/hyp.trn
2019-09-03 19:05:02,218 (json2trn:38) INFO: writing ref trn to exp/train_in240_hidden256_e3_lstm_drop0.2_dot_emb512_hidden512_d1_epoch1_norm5_bs32_mli800_mlo150_adam_lr1e-3_mmt0_l21e-5_delta/decode_test_beam30_nbest1_ml100/ref.trn
write a CER (or TER) result in exp/train_in240_hidden256_e3_lstm_drop0.2_dot_emb512_hidden512_d1_epoch1_norm5_bs32_mli800_mlo150_adam_lr1e-3_mmt0_l21e-5_delta/decode_test_beam30_nbest1_ml100/result.txt
|      SPKR                   |      # Snt            # Wrd       |      Corr              Sub              Del              Ins              Err            S.Err      |
|      Sum/Avg                |       419             26135       |     100.0              0.0              0.0              0.0              0.0              0.0      |

how should this be interpreted?

how do I run recognition on a random new wav, and where does it write the recognized text?

do you have ANY idea why data.json isn't being generated in the decoding folder by itself?

I would be very grateful if you could answer any of these questions

ps/ do you have any spaces in your language? :D

@KnowBetterHelps

I guess:
1、it is not correct to copy test/data.json to exp/{...}/data.json. if you do that, the scoring script compares test/data.json with exp/{...}/data.json, which are identical in your case, so the result comes out 100% correct
2、how to run recognition for a random new wav, and where does it write the recognized text?
you should prepare dump/test/deltatrue/data.json, which can be generated from your data dir. look into the data preparation script
3、do you have ANY idea why data.json isn't being generated in the decoding folder by itself?
maybe it is still the encoding problem
4、and BTW, what do you mean by "do you have any spaces in your language?" : )

@MNCTTY commented Sep 4, 2019

  1. hmm
    I noticed that the aishell annotations look like text without many spaces. I opened the Chinese wiki and saw that there are some spaces, but only after periods or commas,
    so I asked
    because in russian we have lots of spaces; they separate words from each other,
    and when the data prep script deletes all the spaces, the russian annotations look strange

@MNCTTY commented Sep 4, 2019

3, by the way, what do you mean by 'encoding'? which stage in run.sh represents encoding? I thought there was only decoding, from wav to text. no?

@KnowBetterHelps

an utterance for nnet training usually contains only one sentence, so there won't be any periods or commas in it; if there are, they should be deleted before training.

"encoding" means the character encoding, like utf-8. when you work with chinese, for example opening a file that contains chinese text, you should always be careful about the encoding.

@MNCTTY commented Sep 9, 2019

I managed to run decoding from start to end; the problem really was the encoding (it needed to be added to some other files as well)
but! for some reason the results are still 100% corr. I don't understand why that is

by the way, do you know how to load a pretrained model to continue training?

@KnowBetterHelps

for some reason results are still 100% corr
can you show me an example of your train/...../data.json?

how to load a pretrained model to continue training
I didn't find any train_stage option like in the kaldi recipes, so it might not support pre-training

@MNCTTY commented Sep 10, 2019

can you show me an example of your train/...../data.json?

you mean dump/train/deltatrue/data.json?
here it is

I'm also attaching dump/test/deltatrue/data.json
and the data.json from the decoding folder, generated during decoding
I renamed them train_data.json, test_data.json, and decode_data.json to distinguish them easily in the attachment

Archive.zip

@KnowBetterHelps

looks like you were using the script directly on your own data: рон не отрываясь смотрел на письмо которое уже начало с углов дымиться (roughly: "Ron stared at the letter, which had already begun to smolder at the corners")

in chinese, one syllable can be one word, like "我" (one token), which means "me"; but in your language, "рон" would be split into "р о н" (three tokens). maybe you should modify the script to better fit your data, for example one word per token ("рон" as one token).
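
a rough sketch of the difference (plain Python, not the repo's actual dictionary-building code):

# Character-level tokenization (what the aishell recipe does) vs.
# word-level tokenization, which may suit a space-delimited language
# like Russian better. The sentence is the example quoted above.
sent = u"рон не отрываясь смотрел на письмо"

char_tokens = list(sent.replace(" ", ""))  # 'р', 'о', 'н', ...
word_tokens = sent.split()                 # 'рон', 'не', ...

print(" ".join(char_tokens))
print(" ".join(word_tokens))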

@MNCTTY commented Sep 11, 2019

so, did I understand you correctly? you're saying it's better to build a vocab of many tokens that are meaningful pieces of the language,
and predict those pieces instead of letters.

maybe it makes sense, since I can take such a vocab from bert for russian

can you tell me, please, which files I should look at to change this? I mean, if I just put a new vocab in place of the old one, nothing would change, right?

@KnowBetterHelps

I am doing some similar work on code-switch recognition; for english I am going to use subword 'BPE' units, not letters. for example, catch --> ca tch, not 'c a t c h'
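
for instance, a sketch with the sentencepiece library ("corpus.txt", the vocab size, and the exact subword splits are assumptions; the real pieces depend on the training corpus):

import sentencepiece as spm

# Train a small BPE model on a hypothetical one-sentence-per-line text
# file, then segment a word into subword units.
spm.SentencePieceTrainer.train(
    input="corpus.txt", model_prefix="bpe", vocab_size=1000, model_type="bpe"
)
sp = spm.SentencePieceProcessor(model_file="bpe.model")
print(sp.encode("catch", out_type=str))  # e.g. ['▁ca', 'tch']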

@KnowBetterHelps

can you tell me, please, which files I should look at to change this
the scripts in data preparation, specifically the one that generates data.json

@MNCTTY commented Sep 11, 2019

I am doing some similar work on code-switch recognition; for english I am going to use subword 'BPE' units, not letters. for example, catch --> ca tch, not 'c a t c h'

yeah, the bert vocab uses bpe exactly for constructing the vocab
plus, there are huge complete vocabs for english; maybe you can use them, since google had much more data to construct them
for russian they are much smaller, but still complete enough

@MNCTTY commented Sep 13, 2019

ok, I found the real cause of the 100% correct results in result.txt

the problem was in json2trn.py:
the source code created two absolutely identical files, ref and hyp, from the decode data.json. but we know they must be different: hyp contains the model's predictions, ref contains the references from the test data.json
I fixed it in my local copy, and result.txt is now correct (no longer 100% correct)

maybe it should be fixed in the source code too.
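
a minimal sketch of what the fix amounts to (the key names "rec_text" and "text" are assumptions about the data.json layout, following ESPnet-style conventions; adjust them to the real fields):

import json

# Write hyp.trn from the model's predictions and ref.trn from the
# reference labels, instead of writing the same field to both files,
# which is what produced the 100% score.
with open("data.json", "r") as f:
    utts = json.load(f)["utts"]

with open("hyp.trn", "w") as hyp, open("ref.trn", "w") as ref:
    for utt_id, info in utts.items():
        out = info["output"][0]
        hyp.write("%s (%s)\n" % (out["rec_text"], utt_id))  # hypothesis
        ref.write("%s (%s)\n" % (out["text"], utt_id))      # reference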

@MNCTTY commented Sep 23, 2019

okay
I've done something wrong: now hyp.trn is being created empty. can somebody tell me which files besides json2trn.py are responsible for its creation? please
maybe I will figure this out tomorrow, but if someone already knows and answers before then, that would be cool

@KnowBetterHelps

it will create something like this, from exp/***/decode/data.json

hyp.trn
过 去 的 就 不 要 想 了 (T0055G2375-T0055G2375S0447)
天 气 下 降 注 意 身 体 (T0055G2286-T0055G2286S0457)
浦 中 市 剧 中 人 街 最 儿 我 独 醒 事 已 见 放 (T0055G0915-T0055G0915S0468)

ref.trn
过 去 的 就 不 要 想 了 (T0055G2375-T0055G2375S0447)
天 气 下 降 注 意 身 体 (T0055G2286-T0055G2286S0457)
补 充 诗 句 众 人 皆 醉 而 我 独 醒 是 以 见 放 (T0055G0915-T0055G0915S0468)

@MNCTTY commented Sep 25, 2019

it's strange: even though I have exp/***/decode/data.json, and it's not empty and looks pretty correct, I still get an empty hyp.trn
but ref.trn is not empty at all and looks correct too

@ben-8878

@MNCTTY did you manage to solve your problem? I'm hesitating over whether to use this tool
