Suppress tokens during training #1
During training with this method, the model is not trained to predict the given labels exactly, but rather to have the same hidden states as the original model. Meaning, if the original model thinks the word "comma" in "hello comma how are you" is unlikely, the finetuned model will also think it's unlikely. This is in contrast to regular finetuning, where the finetuned model gets nudged to believe the training data is probable. It's useful to think of this finetuning process more as a distillation process, where the training data just so happens to be a reference for how the models behave, rather than a source of truth that the model is trained to match.

If you want your model to have specific behavior/domain knowledge and be able to use `audio_ctx`, you'll have to finetune it normally first, and then use this method to make it more robust with `audio_ctx`. I'm not entirely sure about the `suppress_tokens` issue you mention, though.
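To make the idea concrete, a hidden-state distillation objective looks roughly like the minimal, hypothetical sketch below. This is not this repo's actual training loop; the checkpoint name, batch layout, and choice of which hidden states to match are all assumptions.

```python
import torch
import torch.nn.functional as F
from transformers import WhisperForConditionalGeneration

# Frozen "teacher" = the original model; "student" = the copy being finetuned.
teacher = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small").eval()
student = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")
for p in teacher.parameters():
    p.requires_grad_(False)

def distillation_loss(input_features, labels):
    """MSE between student and teacher decoder hidden states.

    Unlike regular finetuning there is no cross-entropy against `labels`;
    the labels only decide which decoder positions get compared.
    (In the real method the student is presumably run under the condition it
    should become robust to, e.g. a reduced audio context; here both models
    see identical inputs just to show the shape of the loss.)
    """
    with torch.no_grad():
        t_out = teacher(input_features=input_features, labels=labels,
                        output_hidden_states=True)
    s_out = student(input_features=input_features, labels=labels,
                    output_hidden_states=True)
    return F.mse_loss(s_out.decoder_hidden_states[-1],
                      t_out.decoder_hidden_states[-1])
```

With a loss like this, nothing pushes the student to assign the training transcripts higher probability than the teacher does, which is why the training data acts as a reference rather than a source of truth.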
Oh, this changes everything, thanks. I believe I misunderstood, then. I suppose that if I use my already fine-tuned model and "distil" it using your code, then things will work. I'll do this and post the results here for history.
Hm. I've not been so successful. I ran whispercpp's
WER doubles essentially, even if I use the default
Just to confirm, you made sure to replace both
Hello. First of all, thank you for your work.
I came upon this repo when trying to improve the transcription speed in whispercpp by using a lower `audio_ctx`. However, while fine-tuning with this code, it does not seem to be suppressing tokens. I adapted the code a bit to set `suppress_tokens` and to use my dataset in Spanish, but everything else of importance remained unchanged.

When fine-tuning normally, at least for `WhisperForConditionalGeneration`, setting `model.config.suppress_tokens = suppress` works, but I'm not sure it is working here. Furthermore, my training dataset (about 6100 audio files) does not use any punctuation marks; that is, "," is written out as the word "comma" rather than the actual character. So I would at least expect the model to learn not to use punctuation from the data itself, even if I do not suppress those tokens explicitly during inference. That is not what is happening here, even if I train for 15 epochs, the same number of epochs I use in my "normal" pipeline that does not use dynamic audio context.

Also, in the code below, suppressing tokens only works during inference if I set them in the `model.generate` function.
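For reference, that generation-time call looks roughly like this (a minimal sketch with placeholder names; `suppress` is assumed to be the list of punctuation token ids, and `model`, `processor`, and `input_features` are the usual transformers objects):

```python
# Suppression applied only at decode time: `suppress_tokens` is forwarded to
# the generation config, so the given ids get their logits set to -inf
# during decoding regardless of what the model has learned.
predicted_ids = model.generate(input_features, suppress_tokens=suppress)
text = processor.batch_decode(predicted_ids, skip_special_tokens=True)
```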
This made me suspect that tokens are not being suppressed during training, but I don't know how to verify this.
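One way to check this (a sketch with the same placeholder names, plus an assumed `labels` tensor for the forward pass) is to look at how much probability the finetuned model assigns to the suppressed ids when nothing is masked at decode time:

```python
import torch

# If suppression was really learned during training, the model should put
# near-zero probability on the punctuation ids even without decode-time
# masking. `model`, `input_features`, `labels`, and `suppress` are placeholders.
model.eval()
with torch.no_grad():
    out = model(input_features=input_features, labels=labels)
probs = out.logits.softmax(dim=-1)            # (batch, seq_len, vocab_size)
mass_on_suppressed = probs[..., suppress].sum(dim=-1).mean()
print(f"average probability on suppressed tokens: {mass_on_suppressed.item():.4f}")
```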
Full code (except loading the dataset, but nothing unusual there):
Finally, when running inference on a test set using whispercpp and the fine-tuned model, I get a WER of 39% instead of the usual 7% that I get when trained "normally".
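For completeness, a WER figure like the one above can be computed from parallel lists of reference and hypothesis transcripts with something like `jiwer` (hypothetical example strings; the real references would come from the test set and the hypotheses from whispercpp's output):

```python
import jiwer

# Hypothetical example; real use would read the transcripts from files.
references = ["hola coma como estas", "gracias por tu trabajo"]
hypotheses = ["hola , como estas", "gracias por tu trabajo"]
print("WER:", jiwer.wer(references, hypotheses))
```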