A possible bug with the start and end times of words #105
-
Hi, I have been working on a project to extract words from audio, and I am getting some varying results. The project is written in Java, and I have been through several variations of the code and keep getting the same problem, although not for every word. Many words are extracted correctly first time, based on the start and end times specified in the response from the listen endpoint, but several seem to be off.

At first I thought this could be down to an incorrect transcription, so I tested a few. I wrote some code to play the original wav file from the start point to the end point, then adjusted the start and end times manually to check whether the word was correctly transcribed. A good example is the word "branches". The word as I extracted it originally (using the start and end times supplied in the results) can be heard here. It was specified as being between 1584.293 and 1584.6128 seconds in the original wav file. After some experimentation, I extracted a much cleaner version, which can be heard here. That one was taken from between 1584.393 and 1584.7000 seconds. As you can see, there is not much in it, but the original version (using the numbers from the response) includes the end of the previous word and misses the end of "branches" (the "es"). The corrected version includes the whole word and is actually shorter.

I know this might sound ultra picky, and I am not trying to be, I promise. But when I saw that every word in the response was given a start and end time, I thought it might be fun to build something that would let me construct sentences from the words extracted at those precise points. Is this a known issue, or am I doing something wrong? As I said, I have tried a few different ways of doing this, and the same words always seem to come out wrong, while the words that work always seem to work. I am happy to answer any questions you may have.
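For anyone following along, here is a minimal sketch of the kind of clip extraction described above, cutting a word out of uncompressed PCM audio by the start/end seconds from the response. The class and method names are my own, not from any library or from the original code; padding the window by a few tens of milliseconds on each side (which is roughly what the manual correction above did) is an easy experiment from here.

```java
// Sketch only: cut a word clip out of raw PCM bytes using word timestamps.
// Assumes uncompressed PCM, where one frame = channels * (bits/8) bytes.
import javax.sound.sampled.AudioFormat;
import java.util.Arrays;

public class WordClipper {

    // Convert a time in seconds to a frame index for the given frame rate.
    static long secondsToFrame(double seconds, float frameRate) {
        return Math.round(seconds * frameRate);
    }

    // Copy the raw PCM bytes between startSec and endSec into a new array.
    static byte[] extractClip(byte[] pcm, AudioFormat fmt, double startSec, double endSec) {
        int frameSize = fmt.getFrameSize();
        int from = (int) (secondsToFrame(startSec, fmt.getFrameRate()) * frameSize);
        int to   = (int) Math.min(secondsToFrame(endSec, fmt.getFrameRate()) * frameSize, pcm.length);
        return Arrays.copyOfRange(pcm, Math.min(from, to), to);
    }
}
```

In a real program the bytes and `AudioFormat` would come from `AudioSystem.getAudioInputStream`, and the clip could be written back out with `AudioSystem.write`.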
-
Hey @rilhia, thanks for all the info, including the example audio files. I can hear exactly what you're describing. You're right that this is a known issue, and as far as I'm aware it affects all automatic speech recognition models, at Deepgram and elsewhere. Some words in the audio file get identified with timestamps that are slightly shortened or lengthened. The models will continue to improve, so hopefully this issue will resolve itself over time.

Until then, one approach you may have already considered is to analyze the audio track itself to differentiate between words. It might be possible to separate words/sounds on a sentence-by-sentence basis by taking into account the pauses (or lack of audio signal) in the file. I tried this myself a few months back and wasn't able to differentiate on a word-by-word basis, but based on the use case you describe, it could work for you. The precision you are looking for is probably best achieved by combining Deepgram's API responses with another tool/algorithm that operates directly on the audio signal. If I run across something that could help, I'll update this thread.
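To make the "analyze the audio track" idea concrete, here is a rough sketch of RMS-energy-based pause detection over 16-bit samples. The naming is mine, and the frame length and threshold are assumptions that would need tuning per recording; snapping a word's reported start/end times to the nearest quiet frame is one way to combine the API response with the signal itself.

```java
// Sketch only: naive silence/pause detection over 16-bit PCM samples.
// Runs of quiet frames are candidate pauses between words.
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PauseFinder {

    // Root-mean-square energy of one frame of 16-bit samples.
    static double rms(short[] frame) {
        double sum = 0;
        for (short s : frame) sum += (double) s * s;
        return Math.sqrt(sum / frame.length);
    }

    // Indices of fixed-length frames whose RMS energy falls below threshold.
    static List<Integer> quietFrames(short[] samples, int frameLen, double threshold) {
        List<Integer> quiet = new ArrayList<>();
        for (int i = 0; i + frameLen <= samples.length; i += frameLen) {
            if (rms(Arrays.copyOfRange(samples, i, i + frameLen)) < threshold) {
                quiet.add(i / frameLen);
            }
        }
        return quiet;
    }
}
```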