A possible bug with the start and end times of words #105
-
Hi, I have been working on a project to extract words from audio, and I am getting some varying results. The project is written in Java, and I have been through several variations of the code and keep getting the same problem, although not for every word. Many words are extracted correctly first time, based on the start and end times specified in the response from the listen endpoint, but several seem to be off.

At first I thought this could be down to an incorrect transcription, so I tested a few. I wrote some code to play the original wav file from the start point to the end point, then adjusted the start and end times manually to check whether the word was correctly transcribed. A good example is the word "branches". The word as I extracted it originally (using the start and end times supplied in the results) can be heard here. It was specified as being between 1584.293 and 1584.6128 seconds in the original wav file. After some experimentation, I extracted a much cleaner version, which can be heard here. That one was taken from between 1584.393 and 1584.7000 seconds. As you can see, there is not much in it, but the original version (using the numbers from the response) includes the end of the previous word and misses the end of "branches" (the "es"). The corrected version includes the whole word and is actually shorter.

I know this might sound ultra picky, and I am not trying to be, I promise. But when I saw that every word in the response was given a start and end time, I thought it might be fun to build something that would let me construct sentences from the words extracted at those precise points. Is this a known issue, or am I doing something wrong? As I said, I have tried a few different ways of doing this, and the same words always seem to come out wrong, while the words that work always seem to work. I am happy to answer any questions you may have.
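For anyone following along, here is a minimal sketch of the kind of clip extraction described above, cutting a word out of uncompressed PCM audio by the start/end seconds from the response. The class and method names are my own, not from any library or from the original code; padding the window by a few tens of milliseconds on each side (which is roughly what the manual correction above did) is an easy experiment from here.

```java
// Sketch only: cut a word clip out of raw PCM bytes using word timestamps.
// Assumes uncompressed PCM, where one frame = channels * (bits/8) bytes.
import javax.sound.sampled.AudioFormat;
import java.util.Arrays;

public class WordClipper {

    // Convert a time in seconds to a frame index for the given frame rate.
    static long secondsToFrame(double seconds, float frameRate) {
        return Math.round(seconds * frameRate);
    }

    // Copy the raw PCM bytes between startSec and endSec into a new array.
    static byte[] extractClip(byte[] pcm, AudioFormat fmt, double startSec, double endSec) {
        int frameSize = fmt.getFrameSize();
        int from = (int) (secondsToFrame(startSec, fmt.getFrameRate()) * frameSize);
        int to   = (int) Math.min(secondsToFrame(endSec, fmt.getFrameRate()) * frameSize, pcm.length);
        return Arrays.copyOfRange(pcm, Math.min(from, to), to);
    }
}
```

In a real program the bytes and `AudioFormat` would come from `AudioSystem.getAudioInputStream`, and the clip could be written back out with `AudioSystem.write`.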
-
Hey @rilhia, thanks for all the info, including the example audio files. I can hear exactly what you're describing. You're right that this is a known issue, and as far as I'm aware it affects all automatic speech recognition models, at Deepgram and elsewhere. Some words in the audio file get identified with timestamps that are slightly shortened or lengthened. The models will continue to improve, so hopefully this issue will resolve itself over time.

Until then, one approach you may have already considered is to analyze the audio track itself to differentiate between words. It might be possible to separate words/sounds on a sentence-by-sentence basis by taking into account the pauses (or lack of audio signal) in the file. I tried this myself a few months back and wasn't able to differentiate on a word-by-word basis, but based on the use case you describe, it could work for you. The precision you are looking for is probably best achieved by combining Deepgram's API responses with another tool/algorithm that operates directly on the audio signal. If I run across something that could help, I'll update this thread.
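To make the "analyze the audio track" idea concrete, here is a rough sketch of RMS-energy-based pause detection over 16-bit samples. The naming is mine, and the frame length and threshold are assumptions that would need tuning per recording; snapping a word's reported start/end times to the nearest quiet frame is one way to combine the API response with the signal itself.

```java
// Sketch only: naive silence/pause detection over 16-bit PCM samples.
// Runs of quiet frames are candidate pauses between words.
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PauseFinder {

    // Root-mean-square energy of one frame of 16-bit samples.
    static double rms(short[] frame) {
        double sum = 0;
        for (short s : frame) sum += (double) s * s;
        return Math.sqrt(sum / frame.length);
    }

    // Indices of fixed-length frames whose RMS energy falls below threshold.
    static List<Integer> quietFrames(short[] samples, int frameLen, double threshold) {
        List<Integer> quiet = new ArrayList<>();
        for (int i = 0; i + frameLen <= samples.length; i += frameLen) {
            if (rms(Arrays.copyOfRange(samples, i, i + frameLen)) < threshold) {
                quiet.add(i / frameLen);
            }
        }
        return quiet;
    }
}
```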