Background

pluteski edited this page Apr 23, 2017 · 22 revisions

Introduction

I wrote this code to utilize the speech-to-text cloud APIs provided by IBM and Google for a personal project.

Background

I somehow accumulated numerous audio files containing spoken-word recordings, including dictation as well as continuous speech. This audio contains valuable data for me and my family, including family journals, recordings of talks I've given, dictation of stories I've written, and health and fitness data.

The story behind the data and how I came to accumulate it is too long to address here; suffice it to say that there is enough of it that it became too unwieldy to use in its audio form. To make this data more accessible I need to transcribe it.

I tried various speech-to-text transcription software. I was able to transcribe a small portion of my dataset using Dragon NaturallySpeaking, and another small portion using the speech-to-text capability built into Mac OS. However, neither was suited to my task: each either required too much manual intervention or consumed so many resources that my laptop was unusable for anything else while processing a file. Neither provided a means of processing batches of audio files.

Then, in 2016, IBM and Google made their cloud APIs available at compelling price points, with free trials and a monthly allowance of free processing (IBM being much more generous in this regard). This prompted me to give them a try.

It turned out that using each one wasn't exactly trivial. While the cloud offerings provide the key functionality lacking in commercial off-the-shelf (COTS) software, applying them to my use case required a bit of coding effort. However, the price and accuracy had reached the point where this coding effort became worthwhile.

What does it do?

This code locates the audio files contained in a folder, submits them to a cloud API, and collects and collates the output. The output consists of the transcribed text as well as associated log data.
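In outline, that workflow can be sketched as a small batch driver. This is an illustrative sketch, not the actual code: the `transcribe` callable stands in for a wrapper around the IBM or Google speech-to-text API, and the function and field names are assumptions.

```python
from pathlib import Path

# Extensions treated as audio; adjust to match your collection.
AUDIO_EXTENSIONS = {".wav", ".mp3", ".flac", ".ogg"}

def find_audio_files(folder):
    """Recursively collect audio files under a folder, in sorted order."""
    return sorted(p for p in Path(folder).rglob("*")
                  if p.suffix.lower() in AUDIO_EXTENSIONS)

def transcribe_batch(folder, transcribe):
    """Submit each audio file to `transcribe` (a callable wrapping a
    cloud speech-to-text API) and collate transcripts with log data."""
    results = []
    for path in find_audio_files(folder):
        try:
            text = transcribe(path)
            results.append({"file": str(path), "text": text, "status": "ok"})
        except Exception as exc:
            # Record the failure and continue, so one bad file
            # does not abort the whole batch.
            results.append({"file": str(path), "text": "",
                            "status": "error: %s" % exc})
    return results
```

The collated list can then be written out as the transcript plus a log of which files succeeded or failed.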

Who might find this useful?

This is suitable for DIY users who have up to thousands of audio files. I have tested it on thousands of my own files.

The cloud APIs I used are not yet suitable for transcribing continuous speech collected with a mobile audio recorder, especially where there is substantial background noise. Even audio obtained using a good microphone in a quiet environment poses a difficult challenge, and transcribing meeting minutes would still require a substantial amount of additional processing. Nor can either of these offerings yet handle dictation as well as COTS software: obtaining an accurate transcript requires correcting not only transcription errors but also punctuation errors, even where the punctuation was dictated explicitly. Their handling of dates and entity names is also not yet as good as that of COTS software.

However, while the current accuracy limits its utility, I find it adequate for indexing my audio files and for rudimentary text analysis.

Why contribute this code?

These cloud APIs are evolving rapidly. I hope that they will improve to handle low-bitrate recordings and background noise, and I expect these cloud providers to adapt to meet the needs of their best users. The more users that apply these cloud APIs to use cases similar to mine, the more likely the APIs will evolve to serve those use cases. And last but not least, it just feels good to give back to the software community that has so freely shared so much with me over the years.

References

https://www.wired.com/2016/04/long-form-voice-transcription/

  • Claims a current error rate of 4% to 12%. The error rate on my audio is substantially higher, ranging from 15% to 100%.