Hi,

I'm wondering if it would be easy to add support for knowledge distillation for the Transformers on GLUE tasks (i.e. the TransformerSequenceClassifier module).
I see that the distillation loss has already been implemented, and it's an option for the NeuralTagger, which uses the TransformerTokenClassifier. Would it be easy to add distillation support for the GLUE models as well?
Here's how I was envisioning implementing it, modeled on the distillation implementation for the tagger models (rough sketches below):
TransformerSequenceClassifier's train() calls the base transformer model's _train(). I would need to add a distiller argument to _train(), which would then handle distillation with the teacher and the relevant arguments, just like NeuralTagger does. A rough sketch of what I mean is below.
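To make that concrete, here's roughly what I had in mind for the _train() change. This is only an illustration: the training-loop internals, the _batch_mapper helper, and the TeacherStudentDistill methods I'm calling (get_teacher_logits, distill_loss) are my guesses at the interface based on how the tagger side seems to work, not the confirmed API.

```python
# Hypothetical modification to the base transformer model's _train() loop.
# self._batch_mapper, self.optimizer, and the batch/output layout are
# placeholders for whatever the real training loop uses.
def _train(self, train_data_loader, dev_data_loader=None, epochs=3, distiller=None):
    for epoch in range(epochs):
        for batch in train_data_loader:
            inputs = self._batch_mapper(batch)   # map the batch to model kwargs
            outputs = self.model(**inputs)
            loss, logits = outputs[0], outputs[1]

            if distiller is not None:
                # Run the (frozen) teacher on the same inputs and blend the
                # student's task loss with the soft-label distillation loss,
                # as the tagger's distillation path appears to do.
                teacher_logits = distiller.get_teacher_logits(inputs)
                loss = distiller.distill_loss(loss, logits, teacher_logits)

            loss.backward()
            self.optimizer.step()
            self.optimizer.zero_grad()
```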
In procedures/transformers/glue.py, I would need to add a do_kd_training function that adds the distillation args. This function would create a teacher model from those args (loading in weights from the passed-in path), create a TeacherStudentDistill instance, and pass that object as the distiller argument to the new _train(). A sketch of this procedure follows.
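Here's a sketch of what that procedure could look like. The import paths, the argument names (teacher_model_path, kd_temperature, etc.), and the TransformerSequenceClassifier / TeacherStudentDistill constructor and load_model signatures are all assumptions on my part for illustration; they'd need to be aligned with whatever the tagger procedure actually does.

```python
# Hypothetical do_kd_training for procedures/transformers/glue.py, modeled on
# the tagger distillation flow. All names and signatures here are assumptions.
from nlp_architect.models.transformers import TransformerSequenceClassifier
from nlp_architect.nn.torch.distillation import TeacherStudentDistill


def do_kd_training(args, train_data_loader, dev_data_loader):
    # Load the already fine-tuned teacher from the path given by the new KD args
    teacher = TransformerSequenceClassifier.load_model(
        model_path=args.teacher_model_path,
        model_type=args.teacher_model_type,
    )

    # Wrap it in the existing distillation helper
    distiller = TeacherStudentDistill(
        teacher,
        temperature=args.kd_temperature,
        dist_w=args.kd_dist_w,
        loss_w=args.kd_loss_w,
    )

    # Build the student and pass the distiller through train() -> _train(),
    # which would consume it as sketched in the previous step
    student = TransformerSequenceClassifier(
        model_type=args.model_type,
        model_name_or_path=args.model_name_or_path,
        num_labels=args.num_labels,
    )
    student.train(
        train_data_loader,
        dev_data_loader,
        epochs=args.epochs,
        distiller=distiller,
    )
```

The idea is to keep all the GLUE-specific distillation wiring in the procedure, so the sequence classifier itself only gains the single optional distiller argument, mirroring how the tagger models are set up.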
Does this seem about right? Are there any roadblocks you'd envision, and is there a reason distillation wasn't implemented for the sequence classifier models to begin with?