Migrated from whisper_and_distil_whisper and asr evaluation on 2024-05-30.
Simply put, Whisper is too large to deploy in many production environments; we can address this with model distillation.
The paper "Distil-Whisper: Robust Knowledge Distillation via Large-Scale Pseudo Labelling" proposed a robust and straightforward approach to shrinking the full-size Whisper. Unfortunately, the original implementation has the following problems, which make it painful to use in real work:
- Unnecessarily coupled with many HuggingFace libraries, most of which are not easy to use, such as datasets and huggingface_hub.
- A lot of rarely used logic, like uploading your data or your trained model... seriously, WHY WOULD I DO THAT?
- Tied to datasets hosted on the HuggingFace Datasets platform; I believe most MLEs/researchers who need to distill or fine-tune Whisper have their own internal datasets.
So the goal of this work is to fix the problems of the original implementation and make life easier for MLEs/researchers who need to distill Whisper with their own datasets.
Distilling Whisper basically involves three steps:
- Generate a pseudo-labelled dataset from the original training dataset.
- Initialize a distilled/pruned Whisper from the full-size model.
- Run the distillation itself.
Each of the above steps corresponds to a single program to run.
Compared with the original implementation, we make the parameters much clearer by putting all of them into a single JSON config, so:
- There is no long list of command-line parameters; one JSON config holds everything you need.
- Even if you don't know or care about some rarely used default parameters, you can still see that they exist, in case you need to change them in the future.
- After each task, the JSON config is copied into the output folder, so you can always reproduce the task later.
The demo configs can be found in the root directory of this project.
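Below is a minimal sketch of the config-driven pattern the scripts follow; the output_dir key and the function names are illustrative, not the actual schema (check the demo configs for the real field names):

```python
import json
import shutil
import sys
from pathlib import Path

def main(config_path: str) -> None:
    # Every parameter, including rarely touched defaults, lives in one JSON file.
    with open(config_path, "r", encoding="utf-8") as f:
        config = json.load(f)

    output_dir = Path(config["output_dir"])  # hypothetical key name
    output_dir.mkdir(parents=True, exist_ok=True)

    # ... run the actual task here ...

    # Copy the config into the output folder so the run can be reproduced later.
    shutil.copy(config_path, output_dir / Path(config_path).name)

if __name__ == "__main__":
    main(sys.argv[1])
```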
All audio datasets are simply JSON Lines files (a reading sketch follows the list below):
- Original training dataset: contains at least 2 fields, one for the text/transcript and one for the audio file path.
- Pseudo-labelled training dataset: contains at least 3 fields: the text/transcript generated by the original full-size Whisper (the pseudo label), the audio file path, and the CER/WER between the pseudo label and the original text.
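For illustration, here is how such a file might be read; the field names text, audio, and cer are assumptions, not a fixed schema:

```python
import json

# Hypothetical field names for illustration only.
with open("train.jsonl", encoding="utf-8") as f:
    for line in f:
        example = json.loads(line)     # one JSON object per line
        text = example["text"]         # transcript (or pseudo label)
        audio_path = example["audio"]  # path to the audio file
        cer = example.get("cer")       # only present in pseudo-labelled data
```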
To run the pipeline, you need:
- A directory containing a pre-trained or fine-tuned HuggingFace Whisper model.
- JSON Lines audio training, dev, and test datasets.
- Build the Python environment:
conda create -p ./_venv python=3.10
conda activate ./_venv
# conda deactivate
# or
python3 -m venv ./_venv --copies
source ./_venv/bin/activate
# deactivate
pip install -r ./requirements.txt
python ./run_pseudo_labelling.py ./run_pseudo_labelling.json
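Conceptually, this step transcribes the training audio with the full-size teacher and stores each pseudo label together with its CER against the human transcript, so low-quality pseudo labels can be filtered before distillation. A hedged sketch of the idea; the model name, file names, field names, and the jiwer dependency are all assumptions here, not the script's actual internals:

```python
import json
import jiwer    # assumed dependency, used only to compute CER in this sketch
import librosa
from transformers import WhisperProcessor, WhisperForConditionalGeneration

# Placeholder model; the real one comes from run_pseudo_labelling.json.
processor = WhisperProcessor.from_pretrained("openai/whisper-small")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

with open("train.jsonl", encoding="utf-8") as fin, \
     open("train_pseudo.jsonl", "w", encoding="utf-8") as fout:
    for line in fin:
        example = json.loads(line)
        audio, _ = librosa.load(example["audio"], sr=16000)
        inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
        ids = model.generate(inputs.input_features)
        pseudo = processor.batch_decode(ids, skip_special_tokens=True)[0]
        # CER between the pseudo label and the original human transcript.
        cer = jiwer.cer(example["text"], pseudo)
        fout.write(json.dumps(
            {"text": pseudo, "audio": example["audio"], "cer": cer},
            ensure_ascii=False) + "\n")
```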
python ./run_student_model_init.py ./run_student_model_init.json
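Per the Distil-Whisper paper, the student keeps the teacher's full encoder while the decoder is shrunk and initialised from the teacher's first and last decoder layers. A minimal sketch of that recipe; the model name, the 2-layer decoder, and the output path are assumptions (the actual script is driven by run_student_model_init.json):

```python
from transformers import WhisperConfig, WhisperForConditionalGeneration

teacher = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

# Same architecture as the teacher, but with a 2-layer decoder.
config = WhisperConfig.from_pretrained("openai/whisper-small")
config.decoder_layers = 2
student = WhisperForConditionalGeneration(config)

# Copy every weight whose name and shape match (encoder, embeddings, etc.).
student.load_state_dict(teacher.state_dict(), strict=False)
# Initialise the 2 decoder layers from the teacher's first and last layers.
student.model.decoder.layers[0].load_state_dict(
    teacher.model.decoder.layers[0].state_dict())
student.model.decoder.layers[1].load_state_dict(
    teacher.model.decoder.layers[-1].state_dict())

student.save_pretrained("./student_init")  # hypothetical output path
```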
python ./run_distillation.py ./run_distillation.json
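The training objective in the paper is a weighted sum of the cross-entropy against the pseudo labels and a KL term that pulls the student's token distribution toward the teacher's. A sketch of that loss; the weights and temperature are illustrative, and the real hyperparameters live in run_distillation.json:

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha_kl=0.8):
    # Logits are (batch, time, vocab); labels are (batch, time) with -100 padding.
    # Cross-entropy against the pseudo labels.
    ce = F.cross_entropy(student_logits.transpose(1, 2), labels,
                         ignore_index=-100)
    # KL divergence between teacher and student token distributions.
    kl = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    return alpha_kl * kl + (1.0 - alpha_kl) * ce
```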
python ./offline_inference.py ./offline_inference.json
python ./eval.py ./eval.json
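For reference, CER is the character-level Levenshtein distance divided by the reference length. A self-contained implementation of the metric (eval.py may well use a library such as jiwer instead):

```python
def cer(reference: str, hypothesis: str) -> float:
    # Character error rate: edit distance / number of reference characters.
    d = list(range(len(hypothesis) + 1))
    for i, r in enumerate(reference, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hypothesis, 1):
            prev, d[j] = d[j], min(d[j] + 1,         # deletion
                                   d[j - 1] + 1,     # insertion
                                   prev + (r != h))  # substitution
    return d[-1] / max(len(reference), 1)

print(cer("今天天气很好", "今天天气真好"))  # 1 substitution / 6 chars ≈ 0.1667
```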
I did this on internal datasets that contain only Mandarin and Hokkien. So far, the performance can be regarded as reproducing the original paper:
- Full-sized Whisper (Small) CER on Mixed Full Test Data: 0.3033
- Distil-Whisper (Small) CER on Mixed Full Test Data: 0.3039