[tf_clean] swb1/v1-tf WFST decoding - checking on assumptions #193
Great, yes, thanks @efosler - we never could get a full recipe for tf + WFST. I have some spare scripts but nothing clean and official... This is correct, yes:
I was wondering whether there is an improved (or simpler) way to do the data preparation for characters and phonemes separately. Do you have any thoughts? I can try to help. Otherwise we can reuse the preparation from the master branch. Also, another random thing that I observed with swbd: I tried to prepare the char setup substituting numbers with written words and removing noises, but in the end it did not work out... I am working on integrating the CharRNN decoding recipe that we have (it doesn't perform better than WFST, but it allows an open vocabulary): https://arxiv.org/abs/1708.04469. Please let me know if I can help you somehow; I will be very happy to! Thanks again!
Let me think about it as I play with the scripts. I just created a _tf version of swbd1_decode_graph.sh which gets rid of the -unk option, but it feels like there could be a better factorization.
So, an update: the good news is that I was able to get a decode to run all the way through. There does seem to be a bit of underperformance w.r.t. Yajie's runs on the non-tf version. Currently, I'm seeing 24.7% WER on eval2000 (vs. 21.0) using the SWB + Fisher LM. I think there are a few differences:
- 4-layer BiLSTM vs. 5-layer BiLSTM
- I'm not sure that the default tf recipe currently checked in has speaker adaptation
I'm sure that there is some other possible set of differences in parameters as well. Just to check: what I did was work with the output of ./steps/decode_ctc_am_tf.sh and feed the logprobs through latgen-faster. NB this just runs test.py in ctc-am rather than nnet.py or anything else (not sure if this is the right thing to do, but it's what's checked in). Any thoughts on diffs between the tf and old versions that might be causing the discrepancy?
Hi Eric,
Thank you very much for that. Can you let me know what token error rate you were getting with the acoustic model? I have some experiments that achieved results well below this WER. Did you use the prior probabilities of each character during WFST decoding? Can you share the complete log of the acoustic model training?
Thanks!
Best,
Ramon
Let me re-run it - I just started up the non-tf version and I think it blew away the tf version (hmpf). I am fairly sure that it isn't using the prior probabilities of each phone (not character), but I can't confirm - I don't see where that would have been injected into the system.
@efosler I would also like to integrate the tf acoustic model into a WFST for decoding. As I understand this thread, you have managed to do that. Is any of your code in the repo? I pulled tf_clean, and asr_egs/swbd/v1-tf/run_ctc_phn.sh only does acoustic decoding. It would be great if I could avoid starting from scratch :)
Sorry for the delay - I had a few other things pop up. The non-tf run didn't finish before we had a massive server shutdown because of a planned power outage (sigh). So @ericbolo, let me try to run the v1-tf branch again, and I can check in against my copy of the repository. I think that @ramonsanabria has had a better outcome than I have. Basically, the things I had to do were slight modifications to the building of the TLG graph, followed by calling latgen-faster and score_sclite.sh. I'm sure that the decoding parameters aren't right, and I have to investigate whether I have the priors involved before decoding.
@efosler, thank you!
Hi all,
Yesterday I was able to re-run one of our complete EESEN + WFST pipelines using BPE units (https://arxiv.org/abs/1708.04469) on SWBD. I hit 16.5 without fine-tuning. I feel I can maybe get some extra points just by playing a bit with the WFST parameters.
The BPE pipeline made me think that, in case you need a little more speed during decoding, you can also use BPE (bigger units mean fewer steps in the decoding process), and the accuracy is not that bad.
PS: We also have a recipe to train the acoustic model on the whole Fisher + SWBD corpus set, in case you need it for the real-time implementation that you have in mind.
Thanks!
Best,
Ramon
Sorry, wrong paper. Correction: the BPE paper is this one:
https://arxiv.org/pdf/1712.06855.pdf
@ramonsanabria thanks! We've been having some NFS issues here, so I haven't gotten my full pass to run. It would be great to have this recipe in the mix. Does this want to go into v1-tf, or should there be a v2-tf?
We finally got the NFS issues resolved, so I should have the training done by tomorrow-ish. @ramonsanabria, two questions -
Hi Eric, sorry for not responding to the last message - we were also a little busy at JSALT. Regarding tf-v2: yes, we can do that, cool idea. @fmetze, what do you think? I am still fine-tuning everything (WFST and AM) and I should include some code (mostly for the BPE generation), but it would be a good idea. We are preparing a second publication for SLT; after acceptance we can release the whole recipe. OK, let me send all the parameters that I am using. Can you share your TER results with your configuration? You might find some parameters that are currently not implemented in the master branch (dropout, etc.), but with the intersected parameters you should be fine. With this configuration on swbd, I remember that @fmetze achieved something close to 11% TER.
Once you have the log_probs, you can just apply the normal EESEN C++ recipe (i.e., apply the WFST to the log_probs). I am not sure why my character-based WFST is not working - I could make it work with bpe300 and other units, but not with characters. I will try to get back to you on this later. Thanks!
No worries about the lag - I think this is going to be an "over several weeks" thing, as it isn't first priority for any of us (although high priority overall). The TER I'm seeing is more around 15% (still training, but I don't see it getting much under 15%) - I will see if there are any diffs. Meanwhile, once I get the pipeline to finish, I'll check in a local copy for @ericbolo so that he can play around, since it is a working pipeline even if it isn't efficient or as accurate. Thanks!
Just for the record, here are the diffs on config:
```
nlayer: 4 (vs 5)
input_feats_dim: 120 (vs 129)
batch_size: 32 (vs 16)
lr_rate: 0.005 (vs 0.05)
nproj: 60 (vs 340)
online_augment_conf.roll = False (vs True)
l2: 0.0 (vs 0.001)
batch_norm: False (vs True)
```
So it's pretty clear that there are some significant differences, and I'd believe the sum total of them could result in a 4% difference in TER (particularly layers, l2, batch norm, lr_rate, and maybe nproj). The really interesting question is what the extra 9 features are - it looks like one additional base feature which has deltas/double-deltas and windowing applied.
```
{'continue_ckpt': '', 'diff_num_target_ckpt': False, 'force_lr_epoch': False,
 'random_seed': 15213, 'debug': False, 'store_model': True,
 'data_dir': '/scratch/tmp/fosler/tmp.xJUz4scH4T',
 'train_dir': 'exp/train_phn_l4_c320_mdeepbilstm_w3_ntrue_p60_ip80_fp80',
 'batch_size': 32, 'do_shuf': True, 'nepoch': 30, 'lr_rate': 0.005,
 'min_lr_rate': 0.0005, 'half_period': 4, 'half_rate': 0.5, 'half_after': 8,
 'drop_out': 0.0, 'clip_norm': False, 'kl_weight': 0.0, 'model': 'deepbilstm',
 'lstm_type': 'cudnn', 'nproj': 60, 'final_nproj': 100, 'init_nproj': 80,
 'l2': 0.0, 'nlayer': 4, 'nhidden': 320, 'clip': 0.1, 'batch_norm': False,
 'grad_opt': 'grad',
 'sat_conf': {'sat_type': 'non_adapted', 'sat_stage': 'fine_tune',
              'num_sat_layers': 2, 'continue_ckpt_sat': False},
 'online_augment_conf': {'window': 3, 'subsampling': 3, 'roll': False},
 'input_feats_dim': 120,
 'target_scheme': {'no_name_language': {'no_target_name': 43}},
 'model_dir': 'exp/train_phn_l4_c320_mdeepbilstm_w3_ntrue_p60_ip80_fp80/model'}
```
Perfect, yes - with these parameters you should see an improvement, according to my experience. For `input_feats_dim: 120 (vs 129)`, just extract fbank_pitch features (we should also change this in the recipe of v1-tf). I assume that `'subsampling': 3` is correct, right? (This is also important.)
Also, I think these ones are implemented: `'nproj': 60, 'final_nproj': 100, 'init_nproj': 80`. Otherwise I will push code to add them.
Thanks!
Hello all,
@efosler: yes, all I need is a running pipeline, not the best-performing one, so that I can look at all the pieces of an online decoding system with tensorflow + wfst.
@ericbolo I've uploaded my changes to efosler/eesen so you can grab the newest copy. This *should* work - there are a few diffs with the graph prep scripts. Here's a list of the files that I changed, so that you can just grab them if you want:
- asr_egs/swbd/v1-tf/local/swbd1_data_prep.sh
- asr_egs/swbd/v1-tf/local/swbd1_decode_graph_tf.sh
- asr_egs/swbd/v1-tf/run_ctc_phn.sh
- asr_egs/swbd/v1/local/swbd1_data_prep.sh [cosmetic changes only to these]
- asr_egs/wsj/steps/decode_ctc_lat_tf.sh
- asr_egs/wsj/steps/train_ctc_tf.sh [python 3 compatibility]
- asr_egs/wsj/utils/ctc_token_fst.py
- asr_egs/wsj/utils/model_topo.py

My next step will be to try to rework the recipe so that it matches the parameters sent by @ramonsanabria. Once I've got that done and confirmed, I'll send a pull request.
NB: the decode script is woefully non-parallel (this needs to be fixed), but for the online stuff this won't matter.
@efosler: wonderful, thanks!
I don't have the swbd db, but I can adapt it for, say, tedlium.
Hey @ramonsanabria, quick question: you said...
Looking through the code base, it seems like these are passed as parameters - will it not do the right thing if those parameters are set?
About to go offline for a bit, so I won't be able to report on the full run, but training with the parameters above (the same as @fmetze's run, but with nproj=60, final_nproj=100, init_nproj=80) does get down to 11.7% TER, so I will make those the defaults in the script going forward. Decoding hasn't happened yet.
@efosler, a quick update: I was able to run the full pipeline with tensorflow + language model decoding on a dummy dataset. Thanks again!
Next steps re: online decoding (#141): implementing a forward-only LSTM, and the loss function for student-teacher learning.
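For the student-teacher piece, a minimal numpy sketch of a frame-level distillation loss (an assumption about the intended formulation, not code from this repo): the bidirectional "teacher" provides softened per-frame targets, and the forward-only "student" is trained to match them.
```
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def student_teacher_loss(student_logits, teacher_logits, temperature=2.0):
    """Frame-level distillation: cross-entropy between the student's frame
    posteriors and the teacher's softened posteriors. Both inputs: (T, L)."""
    soft_targets = softmax(teacher_logits / temperature)   # teacher soft labels
    log_student = np.log(softmax(student_logits / temperature) + 1e-12)
    return -np.mean(np.sum(soft_targets * log_student, axis=-1))
```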
Re: priors. As @efosler noted, it seems the priors are not used in the current decoding code in tf_test.py:
model.priors doesn't seem to be generated anywhere, but we can use label.counts to generate it.
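As a sketch of that idea (hypothetical helpers, not EESEN code): normalize the counts into priors, then subtract scaled log-priors from the network's log-posteriors before handing them to the WFST.
```
import numpy as np

def load_priors(label_counts_file, floor=1e-10):
    """Normalize label.counts (one count per CTC label) into priors."""
    counts = np.loadtxt(label_counts_file)
    priors = counts / counts.sum()
    return np.maximum(priors, floor)   # floor zero counts so log() stays finite

def apply_priors(log_posteriors, priors, prior_scale=1.0):
    """Pseudo-likelihoods for WFST decoding, per frame:
    log p(x|y) ∝ log p(y|x) - prior_scale * log p(y)."""
    return log_posteriors - prior_scale * np.log(priors)
```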
Hi all,
Priors are generated by:
```
labels=$dir_am/label.counts
gunzip -c $dir_am/labels.tr.gz | \
  awk '{line=$0; gsub(" "," 0 ",line); print line " 0";}' | \
  /data/ASR5/fmetze/eesen-block-copy/src/decoderbin/analyze-counts \
    --verbose=1 --binary=false ark:- $labels
```
Then you can use nnet.py as:
```
$decode_cmd JOB=1:$nj $mdl/log/decode.JOB.log \
  cat $PWD/$mdl/split$nj/JOB/feats.scp \| sort -k 1 \| \
  python utils/nnet.py --label-counts $labels --temperature $temperature --blank-scale $bkscale \| \
  latgen-faster --max-active=$max_active --max-mem=$max_mem --beam=$beam --lattice-beam=$lattice_beam \
    --acoustic-scale=$acwt --allow-partial=true --word-symbol-table=$graphdir/words.txt \
    $graphdir/TLG.fst ark:- "ark:|gzip -c > $mdl/lat/lat.JOB.gz" || exit 1;
```
This nnet.py currently uses tensorflow; I have a version that doesn't rely on it. I will push it now. Will keep you posted.
Thanks!
Here is the commit of the new version. And here are the parts of the code that I posted in the previous message, properly formatted (email responses do not support code in Markdown):
```
labels=$dir_am/label.counts
gunzip -c $dir_am/labels.tr.gz | \
  awk '{line=$0; gsub(" "," 0 ",line); print line " 0";}' | \
  /data/ASR5/fmetze/eesen-block-copy/src/decoderbin/analyze-counts \
    --verbose=1 --binary=false ark:- $labels

$decode_cmd JOB=1:$nj $mdl/log/decode.JOB.log \
  cat $PWD/$mdl/split$nj/JOB/feats.scp \| sort -k 1 \| \
  python utils/nnet.py --label-counts $labels --temperature $temperature --blank-scale $bkscale \| \
  latgen-faster --max-active=$max_active --max-mem=$max_mem --beam=$beam --lattice-beam=$lattice_beam \
    --acoustic-scale=$acwt --allow-partial=true --word-symbol-table=$graphdir/words.txt \
    $graphdir/TLG.fst ark:- "ark:|gzip -c > $mdl/lat/lat.JOB.gz" || exit 1;
```
So, an update on my progress with SWB (now that I'm getting back to this). I haven't tried out @ramonsanabria's code above yet. I'm able to train a SWB system that gets 11.8% TER on the CV set (much better than before). However, decoding with this (again, without priors) gives me 40+% WER - much worse than the previous setup. I'm trying to debug this to understand where things are going wrong.
One thing I tried was turning on TER calculation during the forward pass. I had to make some modifications to steps/decode_ctc_am_tf.sh so that it passes the right flags to the test module. However, that seems to be a non-starter - the forward pass just hangs with no errors.
Seems like the next best step would be to switch to @ramonsanabria's decode strategy and abandon steps/decode_ctc_am_tf.sh?
@ramonsanabria what's a good (rough) value for blank_scale?
@ramonsanabria Now looking through nnet.py (and the non-tf version) - this actually takes the output of the net and applies the smoothing and priors as a filter, right? The code snippet you have above doesn't actually run the net forward, it seems to me, but would do something funky to the features in feats.scp.
Hi all,
How is it going? A good value for blank scale should be between 0.9 and 1.1, but it is something we should play with. Exactly: the nnet.py script only takes the posteriors from eesen, modifies them slightly (applies blank scaling, puts blank at index zero so the WFST can read it, applies temperature to the whole distribution, and applies the priors, which will certainly boost WER scores), and finally pipes them to the next script, which I believe is the WFST decoding.
Will you guys be in India for Interspeech? Would be great to meet :)
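A minimal numpy sketch of those post-processing steps as described (the function and the order of operations are a reading of the description above, not the actual nnet.py; blank is assumed to be the last label):
```
import numpy as np

def postprocess_posteriors(posts, priors, blank_scale=1.0, temperature=1.0,
                           prior_scale=1.0):
    """posts: (T, L) frame posteriors from the acoustic model, blank last.
    Returns (T, L) log pseudo-likelihoods with blank moved to index 0."""
    posts = posts.copy()
    posts[:, -1] *= blank_scale            # 1. scale the blank posterior (~0.9-1.1)
    posts = np.roll(posts, 1, axis=1)      # 2. blank last -> index 0 (token FST order)
    priors = np.roll(priors, 1)
    posts = posts ** (1.0 / temperature)   # 3. temperature-smooth the distribution
    posts /= posts.sum(axis=1, keepdims=True)
    return np.log(posts + 1e-12) - prior_scale * np.log(priors + 1e-12)  # 4. priors
```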
India, I wish! But no...
Alas, I won't be in India either. (Maybe I can stop by CMU sometime this semester.)
Update on progress: I wrote a small script to do greedy decoding on a logit/posterior stream and calculate the TER. (I will post this to my repo soon-ish and then send a pull request.) I found that on the SWB eval2000 test set I was getting 30% TER (this is after priors; without priors it is worse). I was slightly puzzled by that, so I decided to calculate the TER on the train_dev set for SWB - I'm getting roughly 21-22% TER. This is a system that was reporting 11.8% TER on the same set during training. So something is rather hinky. Still digging, but if anyone has ideas, let me know.
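For reference, a minimal sketch of greedy CTC decoding plus TER (an illustration assuming blank at index 0, not the actual script):
```
import numpy as np

def greedy_ctc_decode(log_probs, blank=0):
    """Best-path decoding: argmax per frame, collapse repeats, drop blanks."""
    best = log_probs.argmax(axis=1)
    out, prev = [], blank
    for label in best:
        if label != blank and label != prev:
            out.append(int(label))
        prev = label
    return out

def edit_distance(ref, hyp):
    """Levenshtein distance with a single rolling DP row."""
    d = np.arange(len(hyp) + 1)
    for i, r in enumerate(ref, 1):
        prev_diag, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            prev_diag, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1,
                                        prev_diag + (r != h))
    return int(d[-1])

def token_error_rate(refs, hyps):
    """TER = total edit distance / total reference tokens, over all utterances."""
    errs = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    return errs / sum(len(r) for r in refs)
```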
I think I've enabled tf_train to dump out the forward pass on the cv set to see what's going on - whether there is a difference in the output. Took me a good chunk of the evening. One thing I ran across is that the forward pass on subsampled data gets averaged in tf_test - it's not clear to me whether the TER reported in tf_train is over the averaged stream or (as I suspect) over all variants. I don't think this could account for a factor of two in TER, though. FWIW, I think the code would be cleaner if tf_train and tf_test were factorized a bit - I had to copy a lot of code over, and I worry about inconsistencies between them (although they are hooked together through the model class).
Update from yesterday (now that the swbd system has had some time to train): the dumped cv ark files do not show the same CTC error rate as the system claims. I suspect the averaging might be doing something weird. Writing down assumptions here so someone can pick this apart:
(Now this is making me wonder if the test set was augmented... hmmm...) Anyway, just to give a sample of the difference in TER:
Reported by tf during training:
Decoding on the averaged stream:
@ramonsanabria and @fmetze, can you confirm what the online feature augmentation is doing? I think I misunderstood it in my comments above (I had visions of other types of augmentation going on, but reading the code, I think it's simpler than I thought). It looks like when you have subsample and window set to 3, it stacks three frames on the input and makes the input sequence a third as long. Is it also creating three variants with different shifts? I'm trying to figure out where the averaging comes in later. (A sketch of my reading is below.)
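A minimal numpy sketch of that reading (window = subsample = 3; the helper and the padding choice are assumptions, not the repo's code):
```
import numpy as np

def stack_and_subsample(feats, window=3, subsample=3):
    """feats: (T, D). Stack `window` consecutive frames, then keep every
    `subsample`-th stacked frame, once per possible shift. Returns a list of
    `subsample` streams, each roughly (T/subsample, D*window)."""
    T, D = feats.shape
    padded = np.pad(feats, ((0, window), (0, 0)), mode='edge')
    stacked = np.concatenate([padded[i:i + T] for i in range(window)], axis=1)
    # Each shifted stream is a valid (shorter) view of the same utterance;
    # training sees all of them, which is where the "3 copies" come from.
    return [stacked[shift::subsample] for shift in range(subsample)]
```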
OK, I have figured out the discrepancy in output between the forward passes and what is recorded by the training regime. tl;dr - the augmentation and averaging code in tf_test.py is at fault and should not currently be trusted. I'm working on a fix.
When training is done with augmentation (in this example, with window 3), 3 different shifted copies are created for training with stacked features. The TER is calculated for each copy (stream) by taking a forward pass, greedily decoding over the logits, and then computing the edit distance to the labels. The reported TER is over all copies.
At test time, it is not really clear what to do with 3 copies of the same logit stream. The test code (which I replicated in the forward pass during training) assumes that the correct thing to do is to average the logit streams. This would be appropriate for a traditional frame-based NN system. However, in a CTC-based system there is no guarantee of synchronization of the outputs, so averaging the streams means that sometimes the blank label will dominate where it should not (for example: if one stream labels greedily "A blank blank", the second "blank A blank", and the third "blank blank A", then the averaged stream might label "blank blank blank" - causing a deletion).
I verified this by dumping out only the first stream rather than the average, and found that the CV TER was identical to that reported by the trainer. (That's not to say that the decoding was identical, but the end number was the same.)
Upshot: it's probably best to arbitrarily take one of the streams and use it at test time - although is there a more appropriate combination scheme?
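A tiny numeric illustration of that failure mode, with made-up posteriors over [blank, A]:
```
import numpy as np

# Three shifted streams, each confident about "A" at a different time step.
s1 = np.array([[0.1, 0.9], [0.8, 0.2], [0.8, 0.2]])  # greedy: A _ _
s2 = np.array([[0.8, 0.2], [0.1, 0.9], [0.8, 0.2]])  # greedy: _ A _
s3 = np.array([[0.8, 0.2], [0.8, 0.2], [0.1, 0.9]])  # greedy: _ _ A

avg = (s1 + s2 + s3) / 3
print(avg.argmax(axis=1))  # [0 0 0]: blank wins every frame, so "A" is deleted
```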
Created a new issue for this particular bug: #194
Latest update:
Successful full train and decode; I also tested a run with a slightly larger net (with a bit of improvement). Adding these baselines to the README file.
Awesome, thank you so much Eric! The numbers look great. Can you share the full training configuration? Thank you again!
Just submitted the pull request (#196).
Once we decide that #196 is all good, I think we can close this particular thread!!!
OK, closing this particular thread. Whew!
Decided a new thread would be good for this issue.
Right now the SWB tf code as checked in seems to have a discrepancy, and I'm writing down some of my assumptions as I work through cleaning up the WFST decode.
It looks to me like run_ctc_phn.sh creates a set of training/cv labels that ignores noises (and overwrites units.txt, removing spn and npn). However, utils/ctc_compile_dict_token.sh assumes that units.txt and lexicon.txt are synchronized, resulting in the lovely error:
The fix is pretty simple (synchronizing the lexicon), but I'm trying to figure out how much to modify the utils/ctc_compile_dict_token.sh script vs. correcting the prep script to make the appropriate correction. I'm thinking that I'll correct the prep script, but if anyone has thoughts on that, let me know. (A sketch of the lexicon-side fix is below.)
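For illustration, a minimal sketch of the lexicon-side synchronization (hypothetical filenames; the real fix would live in the prep scripts):
```
# Drop lexicon entries that use units no longer present in units.txt,
# so ctc_compile_dict_token.sh sees a consistent pair of files.
with open('units.txt') as f:
    units = {line.split()[0] for line in f if line.strip()}

with open('lexicon.txt') as fin, open('lexicon_filtered.txt', 'w') as fout:
    for line in fin:
        word, *pron = line.split()
        if all(u in units for u in pron):  # keep only fully-covered prons
            fout.write(line)
```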