Real-time decoding #141

Open
ericbolo opened this issue Jul 29, 2017 · 27 comments
@ericbolo

What would be the main steps for building a real-time decoder on top of EESEN?

I read in the EESEN paper that composing the tokens, lexicon and grammar speeds up decoding a great deal, and I'd like to leverage that in a real-time context: capture an audio stream and output the transcripts progressively.

Is that by any chance in the works?

If not, I could try and give it a shot.

Thank you for this great project

@fmetze
Contributor

fmetze commented Jul 31, 2017

Eric,

thanks for the flowers. The main problem is the use of the bi-directional LSTM as an acoustic model, which in theory requires you to have the whole segment available before evaluating the acoustic model, i.e. no run-on or streaming capability like you want.

There are several papers on how to get around this, because all of the really important information for ASR seems to be contained in a 200 ms window, so not all the right context is needed. Within our TensorFlow implementation, we want to try a CNN for that, rather than the BLSTM, so we could get around that limitation easily, but we have not gotten around to implementing that. The second challenge is of course to implement the code for streaming the audio and actually processing it. There is probably some Kaldi code that we could leverage.

If you're interested, this would be a great addition to Eesen. We're using the fast decoding and small memory footprint for processing large amounts of data, but you're right, it would go well with real-time recognition, too. Want to try?

Let me know what you think!

@ericbolo
Author

Thank you for your response!

With the current BiLSTM setup, what would be required is then a feature window that slides over the input and includes for each input all the left context and just enough of the right context. Is that correct?
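A minimal sketch of that windowing idea, assuming features arrive at a 10 ms frame shift so that a 200 ms look-ahead is 20 frames. `chunk_plan`, `chunk_size` and `right_context` are hypothetical names, not part of EESEN or Kaldi. The sketch only computes which frames must have arrived before each block of outputs can be emitted; running the network on those frames is left out.

```python
from typing import List, Tuple


def chunk_plan(num_frames: int, chunk_size: int, right_context: int) -> List[Tuple[int, int, int]]:
    """Plan streaming evaluation with full left context and limited right context.

    Returns (out_start, out_end, in_end) triples: outputs for frames
    [out_start, out_end) may be emitted once frames [0, in_end) are available.
    """
    plan = []
    for out_start in range(0, num_frames, chunk_size):
        out_end = min(out_start + chunk_size, num_frames)
        # Wait for at most `right_context` future frames, clipped at the end.
        in_end = min(out_end + right_context, num_frames)
        plan.append((out_start, out_end, in_end))
    return plan
```

With `chunk_plan(100, 40, 20)`, outputs for frames 0-39 can be emitted once frames 0-59 have arrived, so the extra latency is bounded by the right context rather than by the utterance length.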

Kaldi does have boilerplates and tools for online decoding: http://kaldi-asr.org/doc/online_decoding.html

In that web page they also mention per-speaker cepstral mean and variance normalization (CMVN), which I see is being used in the example EESEN scripts. What I have in mind is a system that takes any user's voice and outputs a transcript, without necessarily knowing who the speaker is or having access to past utterances. Any challenges to foresee?
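One way to normalize without past utterances is to accumulate statistics from time 0 up to the current frame. The sketch below is a generic cumulative mean and variance normalizer, not Kaldi's online CMVN implementation; the class name `RunningCMVN` and the `eps` floor are assumptions made for illustration.

```python
import math
from typing import List, Sequence


class RunningCMVN:
    """Normalize each frame with statistics of all frames seen so far."""

    def __init__(self, dim: int, eps: float = 1e-8) -> None:
        self.count = 0
        self.total = [0.0] * dim      # running sum per dimension
        self.total_sq = [0.0] * dim   # running sum of squares per dimension
        self.eps = eps                # avoids division by zero on early frames

    def normalize(self, frame: Sequence[float]) -> List[float]:
        self.count += 1
        out = []
        for i, x in enumerate(frame):
            self.total[i] += x
            self.total_sq[i] += x * x
            mean = self.total[i] / self.count
            var = max(self.total_sq[i] / self.count - mean * mean, 0.0)
            out.append((x - mean) / math.sqrt(var + self.eps))
        return out
```

The first frames are normalized with poor estimates, which is one of the challenges of working without speaker history.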

@ericbolo
Author

ericbolo commented Jul 31, 2017

For online decoding with neural nets, Kaldi recommends constructing an i-vector that summarizes speaker properties, and training the neural net with audio features + i-vector. In the absence of past utterances, the i-vector is built from the audio from time 0 to some time t. So there would be an additional delay from building the i-vector, if I understand correctly.

From http://kaldi-asr.org/doc/online_decoding.html:

"Our best online-decoding setup, which we recommend should be used, is the neural net based setup. The adaptation philosophy is to give the neural net un-adapted and non-mean-normalized features (MFCCs, in our example recipes), and also to give it an iVector. ... Our idea is that the iVector gives the neural net as much as it needs to know about the speaker properties. This has proved quite useful. The iVector is estimated in a left-to-right way, meaning that at a certain time t, it sees input from time zero to t. It also sees information from previous utterances of the current speaker, if available."

@fmetze
Contributor

fmetze commented Aug 2, 2017

Yes, that is by and large correct.

The big challenge is speaker diarization, unless you only have one speaker in your audio channel. Imagine you have two speakers, a loud male and a soft female, in the same channel of the recording that you want to recognize. Even if they don't overlap, you need to separate them somehow, either by estimating and updating the corresponding cepstral means, or the corresponding i-vectors.

I could imagine that a neural network learns the normalization properties from the i-vectors, but I am wondering if that doesn't introduce all sorts of other failure modes, like means not being 0, etc.

In sum, the BLSTM problem is just one problem that you need to solve for on-line recognition. Can you see a solution that would work for you for the above issues, so we can think about the BLSTM issue specifically?

@ericbolo
Author

ericbolo commented Aug 4, 2017

Ok, I now have a pretty good understanding of the diarization/speaker normalization issues, none of them insurmountable in my application.

For now, I can focus on the decoding of the BLSTM outputs. You mentioned that ASR probably only needs a 200 ms window. Are there any papers you could point me to?

@fmetze
Contributor

fmetze commented Aug 4, 2017 via email

@ericbolo
Author

ericbolo commented Aug 5, 2017

Thank you for the link.

I gave it a quick read. My only worry is that in the paper the data is pre-segmented with GMM/HMM, and the training does not use CTC loss.

Could the 0.5 s spectral window be too small for CTC loss training?

@ericbolo
Author

ericbolo commented Aug 5, 2017

Elaborating: from the paper I understand they sub-sample the data to avoid overfitting, so we don't have access to all the outputs of the utterance, possibly hampering CTC loss.

@fmetze
Contributor

fmetze commented Aug 5, 2017

Right, the paper does not use CTC loss, but I don't think this would matter much, certainly not for the LSTMs, which is where we have the recurrent connections. CTC affects the way the error signal is computed during training, but during inference, the computation is straightforward and independent of the window. Except of course that we do need a segmentation - which is the other problem that you mention. The exact value of the window, 2*0.2 or 0.5 seconds, would have to be determined through experiments, of course.
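To illustrate why inference is simple once per-frame posteriors exist, here is a sketch of greedy (best-path) CTC decoding. This is a swapped-in simplification: EESEN decodes with a WFST built from tokens, lexicon and grammar, not with this procedure. The function name and the choice of index 0 for the blank label are assumptions.

```python
from typing import List, Optional, Sequence


def ctc_greedy_decode(posteriors: Sequence[Sequence[float]], blank: int = 0) -> List[int]:
    """Pick the best label per frame, collapse repeats, then drop blanks."""
    labels: List[int] = []
    prev: Optional[int] = None
    for frame in posteriors:
        best = max(range(len(frame)), key=frame.__getitem__)
        # Emit only when the label changes and is not the blank symbol.
        if best != blank and best != prev:
            labels.append(best)
        prev = best
    return labels
```

Each step only needs the current frame and the previous best label, so the loop can run as frames arrive, whatever window the acoustic model used to produce them.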

@ericbolo
Author

I've stumbled on this paper, which proposes a kind of BLSTM that is compatible with online decoding: http://ieeexplore.ieee.org/document/7953176/

Thought it might be of interest

@fmetze
Contributor

fmetze commented Feb 7, 2018 via email

@ericbolo
Author

ericbolo commented Feb 7, 2018 via email

@fmetze
Contributor

fmetze commented Feb 7, 2018 via email

@efosler
Contributor

efosler commented Jun 20, 2018

What's the current status on this? I'm starting the (crazy) sabbatical project and it's pretty clear that some sort of online decoding mechanism is going to be necessary. I can probably pitch in to help but I'll be shaking the rust off of my coding skills. @ericbolo, @fmetze any interest in this?

@ericbolo
Author

ericbolo commented Jun 20, 2018 via email

@ericbolo
Author

ericbolo commented Jun 20, 2018 via email

@efosler
Contributor

efosler commented Jun 21, 2018

@fmetze's comment was that CTC seems to be relatively dependent on the LSTM recurrent connections (which makes sense when I think about it). A CNN with a wide enough window would probably do ok, though, although @fmetze might have some thoughts on that.

The idea of having a forward LSTM + DNN-based representation of the future (à la the paper @ericbolo pointed out) would probably not be difficult to implement.

The real question is what branch to target. I think, in talking with @fmetze, that the tf branch is the future for Eesen. That does make it easier to integrate different acoustic models. However, for the eventual application I'm looking at I'm not sure how I feel about "python in the loop".

@ericbolo
Author

ericbolo commented Jun 21, 2018 via email

@ramonsanabria

ramonsanabria commented Jun 21, 2018 via email

@ericbolo
Author

ericbolo commented Jun 21, 2018 via email

@efosler
Contributor

efosler commented Jun 22, 2018

This seems like a reasonable step (student-teacher) and simple to implement.

Re: python in the loop - I'm working on an eesen-in-the-browser project in order to enable some other stuff I want to work on (read: want my students to work on...). It's a bit of a crazy lark: compiling C++ into JavaScript (asm.js) via emscripten, for which Python would be problematic (although @ramonsanabria is right in that I could reimplement in C++). Could be a big fail, but could be interesting if it works (and for which I need the online decoding). I think I'm going to retract what I said about tf, though: in thinking it through, the Google folks have already made a JavaScript version available at js.tensorflow.org, which could probably run the net models more efficiently than any attempt I make.

@ericbolo
Author

ericbolo commented Jun 27, 2018 via email

@ericbolo
Author

#193, #155: before delving into online decoding, it might be a good idea to have a full working example of the TensorFlow acoustic model with WFST.

Do you agree? @efosler, I believe you have started working on that. How may I contribute?

@fmetze
Contributor

fmetze commented Jun 28, 2018 via email

@ericbolo
Author

update: currently training a forward LSTM with tensorflow, with different loss functions (CTC only, student-teacher loss, etc.).
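The thread does not spell out the exact student-teacher loss, so the following is only a sketch of one common form: the frame-level cross-entropy of the student's outputs (here, the forward LSTM) against the posteriors of a teacher (for instance, a trained BLSTM). The function names and the use of raw logits for the student are assumptions.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one frame of logits."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_norm for x in logits]


def student_teacher_loss(
    student_logits: Sequence[Sequence[float]],
    teacher_probs: Sequence[Sequence[float]],
) -> float:
    """Mean per-frame cross-entropy of the student against teacher posteriors."""
    total = 0.0
    for logits, probs in zip(student_logits, teacher_probs):
        log_q = log_softmax(logits)
        # Terms with zero teacher probability contribute nothing.
        total -= sum(p * lq for p, lq in zip(probs, log_q) if p > 0.0)
    return total / len(student_logits)
```

In practice such a term is often interpolated with the CTC loss; the weighting would have to be tuned experimentally.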

Looking ahead, I'm studying examples of online decoding. Kaldi has online feature extractors for MFCC and PLP, but not for filterbanks. In the EESEN paper as well as in the tedlium example, filterbanks rather than MFCC are used.

Any reason for choosing fbanks over MFCCs? Is it simply extraction speed, since MFCCs are just filterbanks with postprocessing? If I train with MFCC features, do you expect I'll get similar results?

If using MFCCs is ok, I'll train with that to avoid the work of implementing my own online extractor.
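For reference, the "postprocessing" that turns log mel filterbank energies into cepstral coefficients is essentially a discrete cosine transform. The sketch below applies an orthonormal DCT-II to one frame; it is a simplified illustration with a hypothetical function name, and real extractors typically add further steps (such as liftering) that are omitted here.

```python
import math
from typing import List, Sequence


def fbank_to_cepstra(log_fbank: Sequence[float], num_ceps: int) -> List[float]:
    """Orthonormal DCT-II of one frame of log filterbank energies."""
    n = len(log_fbank)
    ceps = []
    for k in range(num_ceps):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        acc = sum(
            value * math.cos(math.pi * k * (i + 0.5) / n)
            for i, value in enumerate(log_fbank)
        )
        ceps.append(scale * acc)
    return ceps
```

Keeping fewer coefficients than filterbank channels discards fine spectral detail, which is one reason the two feature types can behave differently with neural acoustic models.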

@efosler
Contributor

efosler commented Aug 10, 2018

Our experience is that log filterbanks do work somewhat better than MFCCs, but if you're mostly working on the pipeline at the moment, the hit you'll take on MFCCs will not be large (and it should be easy to sub in online log mel filter banks later).

FWIW, rolling your own log mel filterbank can also be a bit treacherous (although it shouldn't be). We found that using scikit to build features rather than Kaldi was giving us suboptimal performance. We ended up tracing it down to windowing differences, IIRC (we needed to have windowing be a multiple of frame shift for our application, which isn't the default in Kaldi).
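A small sketch of how windowing choices change the framing, assuming 16 kHz audio, a 25 ms window with a 10 ms shift as the baseline, and frames that must fit entirely inside the signal. The function name and all parameter values are assumptions for illustration; the exact mismatch described above is not specified in the thread.

```python
def num_frames(num_samples: int, sample_rate: int = 16000,
               window_ms: float = 25.0, shift_ms: float = 10.0) -> int:
    """Count frames when every window must lie fully inside the signal."""
    window = int(round(sample_rate * window_ms / 1000))
    shift = int(round(sample_rate * shift_ms / 1000))
    if num_samples < window:
        return 0
    return 1 + (num_samples - window) // shift
```

For one second of audio, a 25 ms window gives a different frame count than a 20 ms window (a multiple of the 10 ms shift), so features from two pipelines with different windowing cannot be assumed to line up frame for frame.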

@ericbolo
Author

great, MFCCs it is then (for now)

thank you

4 participants