aliu390/idl-accented-asr

CMU 11-785 collaborative research project to fine-tune a SotA speech-to-text transformer model, with novel, non-native speaker sources


Credit to Sanchit Gandhi for the starter code, this README, and the Common Voice loading and fine-tuning pipeline.

POSSIBLE TODOS for team, after grading deadline:

  • add the AccentDB classifier notebook (Mohamed) and integrate it into a unified pipeline (same notebook as the fine-tuning model)
  • Speech Accent Archive loading and inference code (Laura)
  • tune hyperparameters
  • support work with more collaborator credits and references

Fine-Tune Whisper For Multilingual ASR with 🤗 Transformers

CMU 11-785 Intro to Deep Learning Course Project

Democratizing Speech: Automatic Speech Recognition for Second Language Teaching

Introduction

This notebook presents a detailed guide on how to fine-tune OpenAI's Whisper model for any multilingual Automatic Speech Recognition (ASR) dataset using Hugging Face's Transformers. This content serves as a hands-on companion to the accompanying blog post, offering a practical approach to the fine-tuning process.

Whisper Model Overview

Whisper, released by OpenAI in September 2022, is a groundbreaking pre-trained model for ASR. Developed by Alec Radford et al., it is unique in its training approach, utilizing a massive quantity of labelled audio-transcription data (680,000 hours). This is significantly more than the unlabelled data used in previous models like Wav2Vec 2.0. Notably, 117,000 hours of this pre-training data is multilingual, enabling the model to handle over 96 languages, including many low-resource ones.

The extensive training allows Whisper to generalize effectively across various datasets and domains. It achieves competitive results in benchmarks like LibriSpeech ASR and sets new standards in others like TED-LIUM. The pre-trained checkpoints can be further fine-tuned for specific datasets and languages, enhancing their performance for low-resource languages.
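As a point of reference, the pre-trained checkpoints can be tried out of the box with the 🤗 pipeline API before any fine-tuning. A minimal sketch (the checkpoint name and audio path below are placeholders, not the project's choices):

from transformers import pipeline

# Transcribe a single clip with an off-the-shelf checkpoint.
asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")
print(asr("sample.wav")["text"])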

How to run the Code

Dependencies and Installation

We'll employ several popular Python packages to fine-tune the Whisper model. We'll use datasets to download and prepare our training data and transformers to load and train our Whisper model. We'll also require the soundfile package to pre-process audio files, and the evaluate and jiwer packages to assess the performance of our model. Finally, we'll use gradio to build a flashy demo of our fine-tuned model.

!pip install "datasets>=2.6.1"
!pip install git+https://github.com/huggingface/transformers
!pip install kaleido
!pip install cohere
!pip install openai
!pip install tiktoken
!pip install soundfile
!pip install librosa
!pip install "evaluate>=0.30"
!pip install jiwer
!pip install gradio
!pip install accelerate -U

Hugging Face Hub

You can upload model checkpoints directly to the Hugging Face Hub whilst training.

Find your Hub authentication token at https://huggingface.co/settings/tokens.
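In a notebook, one way to authenticate is with notebook_login from the huggingface_hub package (a minimal sketch; it prompts for the token from the link above and caches it so checkpoints can be pushed during training):

from huggingface_hub import notebook_login

# Prompts for your Hub token and caches it for later pushes.
notebook_login()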

Load dataset and Preprocessing

Ensure you have accepted the terms of use on the Hugging Face Hub: mozilla-foundation/common_voice_11_0. Once you have accepted the terms, you will have full access to the dataset and be able to download the data locally. Then load the dataset with the load_dataset function, as in the sketch below.
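For example, once the terms are accepted, Common Voice 11 can be loaded as follows. This is a sketch only: the Hindi config "hi" and the split names are illustrative and should be swapped for the language actually used in the notebook.

from datasets import load_dataset, DatasetDict

common_voice = DatasetDict()

# "hi" (Hindi) is an illustrative language config, not necessarily the project's choice.
common_voice["train"] = load_dataset(
    "mozilla-foundation/common_voice_11_0", "hi", split="train+validation", use_auth_token=True
)
common_voice["test"] = load_dataset(
    "mozilla-foundation/common_voice_11_0", "hi", split="test", use_auth_token=True
)

print(common_voice)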

Data is loaded from AccentDB.

Follow the code cells in the notebook to preprocess the data and load the pre-trained models.
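The preprocessing broadly follows the original blog post; a condensed sketch is shown below. The checkpoint name, language, and the "sentence" text column are assumptions and may differ from the notebook.

from datasets import Audio
from transformers import WhisperProcessor

# Checkpoint, language, and task are illustrative.
processor = WhisperProcessor.from_pretrained(
    "openai/whisper-small", language="English", task="transcribe"
)

# Whisper expects 16 kHz audio, so resample on the fly.
common_voice = common_voice.cast_column("audio", Audio(sampling_rate=16000))

def prepare_dataset(batch):
    audio = batch["audio"]
    # Log-Mel input features computed from the raw waveform.
    batch["input_features"] = processor.feature_extractor(
        audio["array"], sampling_rate=audio["sampling_rate"]
    ).input_features[0]
    # Target transcription encoded to label ids ("sentence" is the Common Voice text column).
    batch["labels"] = processor.tokenizer(batch["sentence"]).input_ids
    return batch

common_voice = common_voice.map(
    prepare_dataset, remove_columns=common_voice.column_names["train"]
)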

Training and Evaluation

We used the 🤗 Hugging Face Trainer to fine-tune the model:

  • Evaluation metric: we evaluate the model with the word error rate (WER) metric, so we define a compute_metrics function that handles this computation.

  • Load a pre-trained checkpoint: we need to load a pre-trained checkpoint and configure it correctly for training.

  • Define the training configuration: this will be used by the 🤗 Trainer to define the training schedule. A condensed sketch covering these three steps follows this list.
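The sketch below follows the original blog post and assumes the processor and common_voice objects from the preprocessing step above; the checkpoint name, hyperparameters, and output directory are illustrative rather than the project's final values.

import torch
import evaluate
from dataclasses import dataclass
from typing import Any, Dict, List, Union
from transformers import (
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
    WhisperForConditionalGeneration,
)

# 1. Evaluation metric: word error rate, computed with the evaluate/jiwer packages.
metric = evaluate.load("wer")

def compute_metrics(pred):
    pred_ids = pred.predictions
    label_ids = pred.label_ids
    # Restore the pad token where -100 masked the loss, then decode to strings.
    label_ids[label_ids == -100] = processor.tokenizer.pad_token_id
    pred_str = processor.tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    label_str = processor.tokenizer.batch_decode(label_ids, skip_special_tokens=True)
    return {"wer": 100 * metric.compute(predictions=pred_str, references=label_str)}

# 2. Pre-trained checkpoint, configured for fine-tuning (checkpoint name is illustrative).
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")
model.config.forced_decoder_ids = None
model.config.suppress_tokens = []

# Collator that pads the audio features and the label ids separately.
@dataclass
class DataCollatorSpeechSeq2SeqWithPadding:
    processor: Any
    decoder_start_token_id: int

    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        input_features = [{"input_features": f["input_features"]} for f in features]
        batch = self.processor.feature_extractor.pad(input_features, return_tensors="pt")

        label_features = [{"input_ids": f["labels"]} for f in features]
        labels_batch = self.processor.tokenizer.pad(label_features, return_tensors="pt")

        # Mask padding positions so they are ignored by the loss.
        labels = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100)

        # Drop the start token if the tokenizer already prepended it; the model re-adds it.
        if (labels[:, 0] == self.decoder_start_token_id).all().cpu().item():
            labels = labels[:, 1:]

        batch["labels"] = labels
        return batch

data_collator = DataCollatorSpeechSeq2SeqWithPadding(
    processor=processor, decoder_start_token_id=model.config.decoder_start_token_id
)

# 3. Training configuration and the trainer itself (hyperparameters are illustrative).
training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-small-accented-asr",  # hypothetical output directory
    per_device_train_batch_size=16,
    per_device_eval_batch_size=8,
    learning_rate=1e-5,
    warmup_steps=500,
    max_steps=4000,
    fp16=True,
    predict_with_generate=True,
    generation_max_length=225,
    report_to=["tensorboard"],
)

trainer = Seq2SeqTrainer(
    args=training_args,
    model=model,
    train_dataset=common_voice["train"],
    eval_dataset=common_voice["test"],
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    tokenizer=processor.feature_extractor,
)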

To launch training, simply execute:

trainer.train()

Training will take approximately 5-10 hours depending on your GPU or the one allocated to this Google Colab. If using the Google Colab directly to fine-tune a Whisper model, you should make sure that training isn't interrupted due to inactivity. A simple workaround is to paste a small keep-alive snippet into the browser console of this tab (right mouse click -> Inspect -> Console tab); see the accompanying blog post for the snippet.

Contributing

Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.
