Real-time decoding #141
Eric, thanks for the flowers. The main problem is the use of the bi-directional LSTM as an acoustic model, which in theory requires you to have the whole segment available before evaluating the acoustic model, i.e. no run-on or streaming capability like you want. There are several papers on how to get around this, because all of the really important information for ASR seems to be contained in a 200ms window, so not all of the right context is needed. Within our Tensorflow implementation, we want to try a CNN for that, rather than the BLSTM, so we could get around that limitation easily, but we have not gotten round to implementing that. The second challenge is of course to implement the code for streaming the audio and actually processing it. There is probably some Kaldi code that we could leverage. If you're interested, this would be a great addition to Eesen. We're using the fast decoding and small memory footprint for processing large amounts of data, but you're right, it would go well with real-time recognition, too. Want to try? Let me know what you think! |
Thank you for your response! With the current BiLSTM setup, what would be required is then a feature window that slides over the input and includes, for each input, all of the left context and just enough of the right context. Is that correct? Kaldi does have boilerplate and tools for online decoding: http://kaldi-asr.org/doc/online_decoding.html On that page they also mention per-speaker cepstral mean and variance normalization (CMVN), which I see is being used in the example EESEN scripts. What I have in mind is a system that takes any user's voice and outputs a transcript, without necessarily knowing who the speaker is or having access to past utterances. Any challenges to foresee? |
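As a rough illustration of that sliding-window idea, here is a minimal sketch (not EESEN code), assuming a hypothetical `acoustic_model` callable that maps a `[num_frames, feat_dim]` feature matrix to per-frame CTC label posteriors; the chunk and look-ahead sizes are illustrative only:

```python
# Minimal sketch: posteriors for a chunk of frames are only emitted once
# RIGHT_CONTEXT future frames have been buffered.
import numpy as np

RIGHT_CONTEXT = 20   # ~200 ms of look-ahead at a 10 ms frame shift
CHUNK = 50           # emit 500 ms of posteriors at a time

def stream_posteriors(frame_iterator, acoustic_model, feat_dim):
    buffered = np.zeros((0, feat_dim), dtype=np.float32)
    emitted = 0
    for frames in frame_iterator:                       # frames: [n, feat_dim]
        buffered = np.concatenate([buffered, frames], axis=0)
        # A frame is "ready" once its full right context has arrived.
        ready = buffered.shape[0] - RIGHT_CONTEXT
        while ready - emitted >= CHUNK:
            # Re-run over all of the left context plus the limited right
            # context; a real implementation would cache the forward LSTM
            # state instead of recomputing it.
            window = buffered[:emitted + CHUNK + RIGHT_CONTEXT]
            posteriors = acoustic_model(window)
            yield posteriors[emitted:emitted + CHUNK]
            emitted += CHUNK
```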
For online decoding with neural nets, Kaldi recommends constructing an i-vector that summarizes speaker properties, and training the neural net with audio features + i-vector. In the absence of past utterances, the i-vector is built from the audio from time 0 to some time t. So there would be an additional delay from building the i-vector, if I understand correctly. From http://kaldi-asr.org/doc/online_decoding.html |
Yes, that is by and large correct. The big challenge is speaker diarization, unless you only have one speaker in your audio channel. Imagine you have two speakers, a loud male and a soft female, in the same channel of the recording that you want to recognize. Even if they don't overlap, you need to separate them somehow, either by estimating and updating the corresponding cepstral means, or the corresponding i-vectors. I could imagine that a neural network learns the normalization properties from the i-vectors, but I am wondering if that doesn't introduce other types of failure modes, like means not being zero, etc. In sum, the BLSTM problem is just one problem that you need to solve for on-line recognition. Can you see a solution that would work for you for the above issues, so we can think about the BLSTM issue specifically? |
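As an illustration of the "estimate and update the corresponding cepstral means" option, here is a rough per-speaker running-CMVN sketch (illustrative only, not Kaldi's online-CMVN code); deciding which accumulator a frame belongs to is exactly the diarization problem raised above:

```python
import numpy as np

class RunningCMVN:
    """Incrementally updated cepstral mean/variance statistics for one speaker."""
    def __init__(self, feat_dim, floor=1e-8):
        self.n = 0
        self.sum = np.zeros(feat_dim)
        self.sumsq = np.zeros(feat_dim)
        self.floor = floor

    def update(self, frames):                  # frames: [num_frames, feat_dim]
        self.n += frames.shape[0]
        self.sum += frames.sum(axis=0)
        self.sumsq += (frames ** 2).sum(axis=0)

    def normalize(self, frames):
        mean = self.sum / max(self.n, 1)
        var = self.sumsq / max(self.n, 1) - mean ** 2
        return (frames - mean) / np.sqrt(np.maximum(var, self.floor))

# One accumulator per diarized speaker in the channel.
speakers = {"spk1": RunningCMVN(40), "spk2": RunningCMVN(40)}
```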
Ok, I now have a pretty good understanding of the diarization/speaker normalization issues, none of them insurmountable in my application. For now, I can focus on the decoding of the BLSTM outputs. You mentioned ASR probably only needs a 200ms window, any papers you could point me to? |
http://www.asru2015.org/Papers/ViewPapers.asp?PaperNum=1103
|
Thank you for the link. I gave it a quick read, my only worry is that in the paper the data is pre-segmented with GMM/HMM, and the training does not use CTC loss. Could the 0.5 s spectral window be too small for CTC loss training? |
Elaborating: from the paper I understand they sub-sample the data to avoid overfitting, so we don't have access to all the outputs of the utterance, possibly hampering CTC loss. |
Right, the paper does not use CTC loss, but I don't think this would matter much, certainly not for the LSTMs, which is where we have the recurrent connections. CTC affects the way the error signal is computed during training, but during inference the computation is straightforward and independent of the window. Except of course that we do need a segmentation, which is the other problem that you mention. The exact value of the window, 2×0.2 or 0.5 seconds, would have to be determined through experiments, of course. |
I've stumbled on this paper, which proposes a kind of BLSTM that is compatible with online decoding: http://ieeexplore.ieee.org/document/7953176/ Thought it might be of interest |
Yes, it is of interest - are you still trying to look into this and maybe implement this or something similar? I was hoping to do something here during the semester, but it seems we don't have enough hands as is ...
|
I'm still very much interested in online decoding, yes, but unfortunately my hands are full this month and the next. If anyone is interested in working on this with me starting early April, do let me know!
|
yep, help wanted! probably even in April!
|
Hi Eric,
I'm still interested in online decoding, but company priorities have caught up to me and I can't do it single-handedly. This said, if we can team up and brainstorm beforehand, I'd be more than happy to contribute!
…On Wed, Jun 20, 2018, 5:32 PM Eric Fosler-Lussier wrote:
What's the current status on this? I'm starting the (crazy) sabbatical project and it's pretty clear that some sort of online decoding mechanism is going to be necessary. I can probably pitch in to help but I'll be shaking the rust off of my coding skills. @ericbolo, @fmetze any interest in this?
|
From @fmetze's answer above, implementing online decoding requires (1) audio streaming capability: work required but not exploratory, with some open-source tools probably available; (2) either overhauling the model to be compatible with true real-time processing (CNN? WaveNet?), or keeping the current BLSTM architecture and adding a ~200 ms window to capture the right context. Generally, a CNN is much faster than an LSTM, so we should also take that into account.
Does anyone know of a good CNN architecture for acoustic modeling?
|
@fmetze 's comment was that CTC seems to be relatively dependent on the LSTM recurrent connections (which makes sense when I think about it). A CNN with a wide enough window would probably do OK, though - although @fmetze might have some thoughts on that. The idea of having a forward LSTM + DNN-based representation of the future (à la the paper @ericbolo pointed out) would probably not be difficult to implement. The real question is what branch to target. I think, in talking with @fmetze, that the tf branch is the future for Eesen. That does make it easier to integrate different acoustic models. However, for the eventual application I'm looking at, I'm not sure how I feel about "python in the loop". |
Regarding the branch, I for one am a lot more comfortable with tf.
And sorry, but what do you mean by "python in the loop"? And what would be the broad specs of your eventual application?
|
Hi,
In the speech course last year some students did a good analysis of CNN architectures for ASR. I can try to look for that. I have seen good results using architectures like VGG, although this could come at some computational expense.
One possibly interesting solution for getting a unidirectional LSTM is to train a BiLSTM and then minimize the KL divergence between a unidirectional (untrained) model and the fully trained bidirectional LSTM (https://arxiv.org/pdf/1711.02212.pdf).
Regarding having Python in the loop (if this is a concern), there is a TensorFlow C++ API that you could maybe use(?).
Thanks!
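In case it helps the discussion, below is a hedged sketch of what a small VGG-style convolutional acoustic model over log-mel features might look like in tf.keras, producing per-frame logits for a CTC output layer. The layer sizes and the 4× time subsampling are illustrative, not anything from EESEN or the course projects mentioned above.

```python
import tensorflow as tf

def vgg_style_acoustic_model(num_labels, num_mel=40):
    # Input: [time, num_mel, 1] log-mel features (time dimension is variable).
    inp = tf.keras.Input(shape=(None, num_mel, 1))
    x = inp
    for filters in (64, 128):                                  # two VGG-like blocks
        x = tf.keras.layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
        x = tf.keras.layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
        x = tf.keras.layers.MaxPooling2D(pool_size=(2, 2))(x)  # halves time and freq
    # Collapse the frequency and channel axes, keep the (subsampled) time axis.
    x = tf.keras.layers.Reshape((-1, (num_mel // 4) * 128))(x)
    x = tf.keras.layers.Dense(1024, activation="relu")(x)
    logits = tf.keras.layers.Dense(num_labels + 1)(x)          # +1 for the CTC blank
    return tf.keras.Model(inp, logits)
```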
|
I find the method you mention from https://arxiv.org/pdf/1711.02212.pdf appealing. For the benefit of others in this thread, I will outline the system:
1/ train a BLSTM with CTC loss, which EESEN already does. Call this the teacher model.
2/ next, train the student model: a unidirectional (forward-only) LSTM compatible with online decoding, with the loss being the KL divergence between the teacher and student networks' output distributions. In the paper they report that this method reduces the WER by a large margin relative to randomly initialized unidirectional LSTMs.
The system still benefits from CTC and we will be able to write the model code easily.
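To make the outline concrete, here is a minimal sketch of the frame-level teacher-student loss, assuming hypothetical `teacher` (trained BLSTM) and `student` (forward-only LSTM) Keras models that both map a batch of feature sequences to per-frame logits; this illustrates the idea in arXiv:1711.02212 and is not EESEN code.

```python
import tensorflow as tf

def frame_level_kl(teacher_logits, student_logits):
    """KL(teacher || student), averaged over frames."""
    p = tf.nn.softmax(teacher_logits)            # teacher posteriors (kept fixed)
    log_p = tf.nn.log_softmax(teacher_logits)
    log_q = tf.nn.log_softmax(student_logits)
    return tf.reduce_mean(tf.reduce_sum(p * (log_p - log_q), axis=-1))

@tf.function
def train_step(features, teacher, student, optimizer):
    teacher_logits = teacher(features, training=False)      # frozen BLSTM teacher
    with tf.GradientTape() as tape:
        student_logits = student(features, training=True)   # forward-only LSTM student
        loss = frame_level_kl(teacher_logits, student_logits)
    grads = tape.gradient(loss, student.trainable_variables)
    optimizer.apply_gradients(zip(grads, student.trainable_variables))
    return loss
```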
|
This seems like a reasonable step (student-teacher) and simple to implement. Re: python in the loop - I'm working on an eesen-in-the-browser project in order to enable some other stuff I want to work on (read: want my students to work on...). It's a bit of a crazy lark - compiling c++ into javascript (asm.js) via emscripten, which python would be problematic for (although @ramonsanabria is right in that I could reimplement in c++). Could be a big fail, but could be interesting if it works (and for which I need the online decoding). I think I'm going to retract what I said about tf, though - in thinking it through, the google folks have already made a javascript version available at js.tensorflow.org which could be used to run the net models probably more efficiently than any attempt I make. |
As I was looking into the tf_clean branch, these other design questions came to mind:
- do we start with the character-based or the acoustic model? I have experience with the acoustic model only. But using acoustic models excludes RNN-LM, and forces us to use a WFST as the final decoding graph, which motivates my next question:
- is it possible to generate lattices and build the final graph fast enough for an online implementation?
|
@efosler and @ramonsanabria have “full working examples” of WFST decoding for the Tensorflow code base that Ramon created. I think they are checked in, but maybe not in the same branch?
… On Jun 28, 2018, at 6:52 AM, ericbolo wrote:
#193 #155: before delving into online decoding, it might be a good idea to have a full working example of the tensorflow acoustic model with WFST.
Do you agree? @efosler I believe you have started working on that, how may I contribute?
|
update: currently training a forward LSTM with tensorflow, with different loss functions (CTC only, student-teacher loss, etc.). Looking ahead, I'm studying examples of online decoding. Kaldi has online feature extractors for MFCC and PLP, but not for filterbanks. In the EESEN paper as well as in the tedlium example, filterbanks rather than MFCCs are used. Any reason for choosing fbanks over MFCCs? Is it simply extraction speed, since MFCCs are just filterbanks with postprocessing? If I train with MFCC features, do you expect I'll get similar results? If using MFCCs is OK, I'll train with that to avoid the work of implementing my own online extractor. |
Our experience is that log filterbanks do work somewhat better than MFCCs, but if you're mostly working on the pipeline at the moment, the hit you'll take from MFCCs will not be large (and it should be easy to sub in online log mel filterbanks later). FWIW, rolling your own log mel filterbank can also be a bit treacherous (although it shouldn't be). We found that using scikit to build features rather than Kaldi was giving us suboptimal performance. We ended up tracing it down to windowing differences, IIRC (we needed the window length to be a multiple of the frame shift for our application, which isn't the default in Kaldi). |
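On the "MFCCs are just filterbanks with postprocessing" point, a tiny sketch of that relationship: given a `[num_frames, num_mel]` matrix of log mel filterbank features (the framing and mel filterbank construction, where the windowing caveat above applies, are omitted here), MFCCs are the first few DCT-II coefficients per frame.

```python
import numpy as np
from scipy.fftpack import dct

def mfcc_from_fbank(log_mel_fbank, num_ceps=13):
    # DCT-II over the mel-frequency axis, keeping the first num_ceps coefficients.
    return dct(log_mel_fbank, type=2, axis=-1, norm="ortho")[:, :num_ceps]
```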
great, MFCCs it is then (for now) thank you |
What would be the main steps for building a real-time decoder on top of EESEN?
I read in the EESEN paper that composing the tokens, lexicon and grammar speeds up decoding a great deal, and I'd like to leverage that in a real-time context: capture an audio stream and output the transcripts progressively.
Is that by any chance in the works?
If not, I could try and give it a shot.
Thank you for this great project