
Background

pluteski edited this page Apr 23, 2017 · 22 revisions

Introduction

I wrote this code to utilize the speech-to-text cloud APIs provided by IBM and Google for a personal project.

Background

I somehow accumulated numerous audio files containing spoken-word recordings, including dictation as well as continuous speech. This audio contains valuable data for me (if no one else outside my immediate family), including family journals, recordings of talks I've given, dictation of stories I've written, as well as health and fitness data. It turns out to be a very convenient way of collecting such data, especially if one is susceptible to repetitive motion injury from all that typing, much less the writer's cramp one gets before very long from actually applying pen to paper.

The story behind the data and how I came to accumulate it is too long to tell here; suffice it to say that there is enough of it that it became too unwieldy to use in its audio form. For this data to be accessible and more useful, it needs to be transcribed.

I tried various speech-to-text transcription software. I was able to successfully transcribe a small portion of my dataset using Dragon NaturallySpeaking, and another small portion using the speech-to-text capability built into Mac OS. However, neither was suited to my task: each either required too much manual intervention or consumed so many resources that my laptop was unusable for anything else while it was processing a file. Neither one provided a means of processing batches of audio files.

Then, in 2016, IBM and Google made their cloud APIs available at compelling price points with generous trial versions. Furthermore, each one provided a monthly allowance of free processing (IBM being much more generous in this regard). This prompted me to give them a try.

It turned out that using each one wasn't exactly trivial. While the cloud offerings provide the key functionality lacking in commercial off-the-shelf (COTS) software, utilizing them for my use case required a bit of coding effort. However, the price point and accuracy had reached the point where this coding effort became worthwhile.

What does it do?

This code locates the audio files contained in a folder, submits them to a cloud API, and collects and collates the output. The output consists of the transcribed text as well as associated log data.
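The overall flow can be sketched as follows. This is not the repository's actual code; the function names, the set of audio extensions, and the shape of the `transcribe` callable (a stand-in for a call to the IBM or Google API) are all illustrative assumptions.

```python
from pathlib import Path

# Assumed set of audio extensions to pick up; adjust to taste.
AUDIO_EXTENSIONS = {".wav", ".flac", ".mp3"}

def find_audio_files(folder):
    """Locate audio files (by extension) under the given folder."""
    return sorted(p for p in Path(folder).rglob("*")
                  if p.suffix.lower() in AUDIO_EXTENSIONS)

def transcribe_batch(folder, transcribe):
    """Submit each audio file to `transcribe` -- a callable wrapping a
    cloud speech-to-text API -- and collate the transcripts and a log."""
    results, log = [], []
    for path in find_audio_files(folder):
        try:
            results.append((path, transcribe(path)))
            log.append(f"OK {path}")
        except Exception as exc:          # record failures, keep going
            log.append(f"FAIL {path}: {exc}")
    return results, log
```

Keeping the API call behind a plain callable like this is one way to swap between providers (or a local stub) without touching the batch logic.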

Who might find this useful?

This is suitable for DIY users who have a moderate number of audio files that they want to process themselves. I have tested it on thousands of my own files.

The cloud APIs I used are not yet suitable for transcribing continuous speech collected using a mobile audio recorder, especially where there is substantial background noise. Even audio obtained using a good microphone in a quiet environment poses a difficult challenge. Transcribing meeting minutes would still require a substantial amount of additional processing. Furthermore, neither of these offerings can as yet handle dictation as well as COTS software. This means that obtaining an accurate transcript requires correcting not only the transcription errors but also punctuation errors, even where the punctuation is explicitly dictated. Finally, the ability to handle dates and entity names is not yet as good as COTS software.

However, while the current accuracy limits the utility, I find it to be adequate for indexing my audio files and rudimentary text analysis.

Why contribute this code?

These cloud APIs are evolving rapidly. I hope that they will improve to the point of handling low-bitrate recordings and background noise. I expect these cloud providers to adapt to meet the needs of their best users. The more users apply these cloud APIs to use cases similar to mine, the more likely the APIs will evolve to serve these use cases.

Besides, it just feels good to give back to a software community that has shared so much, so freely, with me and so many others over the years.

Can you imagine what the world would be like if all crafts and trades were so freely sharing of their tools as are software developers?

References

https://www.wired.com/2016/04/long-form-voice-transcription/

  • Claims a current error rate of 4% to 12%. The error rate is substantially higher on my audio, ranging from 15% to 100%.