decoding error after successful aishell train #10
Comments
Maybe you should check the file "exp/train_in240_hidden256_e3_lstm_drop0.2_dot_emb512_hidden512_d1_epoch20_norm5_bs32_mli800_mlo150_adam_lr1e-3_mmt0_l21e-5_delta/decode_test_beam30_nbest1_ml100/data.json"?
Yes, as I said, there is no such file.
Yeah, it should be generated when decoding starts.
Yep.
I found the same problem, but the main reason is not that the file doesn't exist. In my case, I found an encoding error in "exp/train_in240_hidden256_e3_lstm_drop0.2_dot_emb512_hidden512_d1_epoch20_norm5_bs128_mli800_mlo150_adam_lr1e-3_mmt0_l21e-5_delta/decode_test_beam30_nbest1_ml100/decode.log". I used `export PYTHONIOENCODING=UTF-8` to fix it.
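For context, a minimal Python sketch of why this fix works; this is standard Python behavior, not anything specific to this repo:

```python
import sys

# When run.pl redirects stdout into decode.log, Python can fall back to an
# ASCII locale encoding, so printing Chinese raises UnicodeEncodeError.
print(sys.stdout.encoding)  # e.g. 'ANSI_X3.4-1968' when the locale is ASCII

# In-code equivalent of `export PYTHONIOENCODING=UTF-8` (Python 3.7+):
sys.stdout.reconfigure(encoding="utf-8")
print("深度学习")  # now writes to the log without crashing
```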
Yep, we found the encoding error earlier and solved it a similar way.
Yes, I am waiting for decoding to finish now. The recognition results seem okay.
How did you run recognition without the decoding stage from run.sh?
OK, our news: we have somehow managed to run the decoding stage, and:
How should this be interpreted? How do I run recognition on a random new wav, and where does it write the recognized text? Do you have any idea why data.json isn't generated in the decode folder by itself? I would be very grateful if you could answer any of these questions. PS: do you have any spaces in your language? :D
I guess:
3. By the way, what do you mean by "encoding"? What stage in run.sh represents encoding? I thought there is only decoding, from wav to text. No?
The utterances for nnet training usually contain only one sentence each, so there won't be any periods or commas in them; if there are, they should be deleted before training. "Encoding" means the text encoding, like UTF-8, not a stage in run.sh. When you use Chinese, for example when opening a file that contains Chinese, you should always be careful with the encoding.
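A rough Python sketch of the punctuation cleanup described above; the exact punctuation set is an assumption and depends on your corpus:

```python
import re

# Strip common ASCII and CJK punctuation so each training utterance is a
# bare sentence; extend the character class for your own corpus.
PUNCT = re.compile(r"[，。、！？；：,.!?;:()（）]")

def clean_transcript(text: str) -> str:
    return PUNCT.sub("", text).strip()

print(clean_transcript("你好，世界。"))  # -> 你好世界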
I managed to run decoding from start to end; the problem really was the encoding (it needed to be set in some other files as well). By the way, do you know how to load a pretrained model for further training?
For some reason the results are still 100% Corr. And again: how do I load a pretrained model for further training?
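For reference, a generic PyTorch sketch of resuming training from a saved checkpoint; the checkpoint keys ("model", "optimizer", "epoch"), the file name, and the stand-in model are assumptions, not this repo's exact checkpoint format:

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 10)  # stand-in for the LAS encoder-decoder
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# What a training script would save at the end of a run:
torch.save({"model": model.state_dict(),
            "optimizer": optimizer.state_dict(),
            "epoch": 20}, "checkpoint.pth.tar")

# Later: load the checkpoint and continue training where it stopped.
ckpt = torch.load("checkpoint.pth.tar", map_location="cpu")
model.load_state_dict(ckpt["model"])
optimizer.load_state_dict(ckpt["optimizer"])
start_epoch = ckpt["epoch"] + 1
```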
You mean dump/train/deltatrue/data.json? I'm also attaching dump/test/deltatrue/data.json right away.
Looks like you were using the script directly on your own data: "рон не отрываясь смотрел на письмо которое уже начало с углов дымиться" (Russian: "Ron stared, without looking away, at the letter, which had already begun to smoke at the corners"). In Chinese, one syllable can be one word, like "我" (one token), meaning "me"; but in your language, "рон" would be split into "р о н" (three tokens). Maybe you should modify the script to better fit your data, for example one word per token ("рон" as one token).
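A small Python illustration of the two tokenizations being compared; the `<space>` symbol is an assumed placeholder for the recipe's word-boundary token:

```python
sentence = "рон не отрываясь смотрел на письмо"

# Character-level tokenization (what the recipe does by default):
char_tokens = [ch if ch != " " else "<space>" for ch in sentence]

# Word-level tokenization (one word = one token, as suggested above):
word_tokens = sentence.split()

print(char_tokens[:5])  # ['р', 'о', 'н', '<space>', 'н']
print(word_tokens)      # ['рон', 'не', 'отрываясь', 'смотрел', 'на', 'письмо']
```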
So, did I understand you correctly? You're saying it's better to construct a vocab with lots of tokens that are meaningful pieces of the language? Maybe that makes sense, since I can take such a vocab from BERT for Russian. Can you tell me, please, which files I should look at to change? I mean, if I just put the new vocab in place of the old one, nothing changes, right?
I am doing some similar work on code-switch recognition, where for English I am going to use subword "BPE" units, not letters. For example, catch --> ca tch, not "c a t c h".
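For context, a minimal sketch of training such a BPE model with the sentencepiece library; the input path and vocab size are illustrative placeholders, not paths from this repo:

```python
import sentencepiece as spm

# Train a 1000-unit BPE model on a one-transcript-per-line text file:
spm.SentencePieceTrainer.train(
    input="data/train/text",
    model_prefix="bpe1k",
    vocab_size=1000,
    model_type="bpe",
)

sp = spm.SentencePieceProcessor(model_file="bpe1k.model")
print(sp.encode("catch", out_type=str))  # e.g. ['▁ca', 'tch']
```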
Can you tell me, please, which files I should look at to change?
Yeah, the BERT vocab uses BPE to construct its vocab, exactly.
OK, I found the real cause of the 100% correct results in result.txt. The problem was in the json2trn.py file. Maybe it should also be fixed in the source code.
Okay.
It will create something like this, from exp/***/decode/data.json:
It's strange that although I have exp/***/decode/data.json, which is not empty and looks pretty correct, I still get an empty hyp.trn.
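For debugging, a hedged sketch of what a json2trn-style conversion does; the JSON keys ("utts", "output", "rec_text") follow the ESPnet-style data.json format this recipe appears to use, so verify them against your own file:

```python
import json

with open("exp/decode/data.json", encoding="utf-8") as f:
    j = json.load(f)

# One "<text> (<utt_id>)" line per utterance, the format sclite expects.
with open("hyp.trn", "w", encoding="utf-8") as hyp, \
     open("ref.trn", "w", encoding="utf-8") as ref:
    for utt_id, info in j["utts"].items():
        out = info["output"][0]
        hyp.write("{} ({})\n".format(out.get("rec_text", ""), utt_id))
        ref.write("{} ({})\n".format(out.get("text", ""), utt_id))
```

If hyp.trn comes out empty even though data.json looks correct, check that the key the script reads (here "rec_text") actually exists in your data.json; a silently missing key would explain both the empty file and the bogus 100% Corr.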
@MNCTTY did you solve your problem? I'm hesitant about whether to use this tool.
Hi! I managed to train LAS on aishell data without errors. This is the end of the log:
But the decoding stage gave an error:
I don't understand why some files are missing from that directory. I thought everything run.pl needs is generated there automatically.