Here we provide an example of knowledge distillation for Chinese-CLIP fine-tuning, based on the ModelScope model library. Through knowledge distillation, smaller Chinese-CLIP models (which offer faster inference) can learn from larger models (larger Chinese-CLIP variants or other image embedding models on ModelScope) to further improve image-to-image retrieval. The Teacher models used here all come from ModelScope, and all Chinese-CLIP models are currently available on ModelScope.
- Nvidia GPUs with Turing, Ampere, Ada or Hopper architecture (such as H100, A100, RTX 3090, T4, and RTX 2080). Please refer to this document for the GPUs corresponding to each Nvidia architecture.
- CUDA 11.4 and above.
- PyTorch 1.12 and above.
- ModelScope: Install ModelScope by executing `pip install modelscope`.
- Other dependencies as required in requirements.txt.
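A minimal environment setup might look like the sketch below. The environment name and the version pins are only illustrative assumptions; choose a PyTorch build that matches your CUDA driver.

```bash
# Sketch of an environment setup; adjust versions to your GPU driver and platform.
conda create -n zh_clip_distill python=3.8 -y   # optional isolated environment
conda activate zh_clip_distill

pip install "torch>=1.12.0"        # PyTorch 1.12 or above
pip install modelscope             # ModelScope model library (hosts the Teacher models)
pip install -r requirements.txt    # remaining Chinese-CLIP dependencies
```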
Applying knowledge distillation to the image side of Chinese-CLIP finetuning is straightforward. First, add the `--distllation` configuration item to the finetune sh script. Then fill in the name of the Teacher model to be used in the configuration item `--teacher-model-name`. The four currently supported Teacher models, all available on ModelScope, are listed below.
| Teacher model | Model Info |
|:---|:---|
| damo/multi-modal_clip-vit-huge-patch14_zh | CLIP model, Chinese, general domain, huge |
| damo/multi-modal_clip-vit-large-patch14_zh | CLIP model, Chinese, general domain, large |
| damo/multi-modal_team-vit-large-patch14_multi-modal-similarity | TEAM image-text retrieval model, Chinese, large |
| damo/multi-modal_rleg-vit-large-patch14 | RLEG generative multimodal representation model, English, large |
Finally, fill in the weight of the distillation loss in the configuration item `--kd_loss_weight`; the default value is 0.5.
The configuration items are defined as follows:
- `distllation`: Whether to enable knowledge distillation to finetune the image side of the model.
- `teacher-model-name`: Specify the Teacher model to use. Currently the four Teacher models above are supported, e.g. fill in `damo/multi-modal_team-vit-large-patch14_multi-modal-similarity`.
- `kd_loss_weight` (optional): Distillation loss weight; the default value is 0.5.
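As a concrete illustration, the three distillation flags can be appended to the launch command in your finetune sh script roughly as follows. This is only a sketch: `${EXISTING_FINETUNE_ARGS}` and `${GPUS_PER_NODE}` are placeholders, and the `cn_clip/training/main.py` entry point with `torch.distributed.launch` is assumed from the layout of the other finetune scripts.

```bash
# Sketch only: keep the arguments already present in your finetune script and
# append the three distillation-related flags at the end of the command.
python -m torch.distributed.launch --nproc_per_node=${GPUS_PER_NODE} cn_clip/training/main.py \
    ${EXISTING_FINETUNE_ARGS} \
    --distllation \
    --teacher-model-name damo/multi-modal_team-vit-large-patch14_multi-modal-similarity \
    --kd_loss_weight 0.5
```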
We provide a sample script `run_scripts/muge_finetune_vit-b-16_rbt-base_distllation.sh`, which uses the TEAM image-text retrieval model (Chinese, large) as the Teacher model.
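The sample script can presumably be launched like the other finetune scripts in the repo, passing the path of your prepared dataset directory; the `${DATAPATH}` convention below is an assumption borrowed from the standard Chinese-CLIP finetune workflow.

```bash
# Launch the distillation finetune sample script; ${DATAPATH} points to your data root.
cd Chinese-CLIP/
bash run_scripts/muge_finetune_vit-b-16_rbt-base_distllation.sh ${DATAPATH}
```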
Image retrieval Top-10 results of our model (finetune + distillation) vs. the pre-trained model vs. the finetune-only model. The image in the upper left corner is the query, and the retrieval results on the right are ordered from Top-1 to Top-10. The gallery used in this experiment contains 100,000 e-commerce images (shoes, clothes, pants, etc.).
Advantages of our approach:
- It meets the basic requirements of the retrieval task: image similarity is captured well while category consistency is preserved.
- Good performance and fast speed: through distillation, the base model achieves retrieval quality close to that of the large model, and when deployed on CPU its retrieval inference time stays within 100 ms.
A distillation solution has been launched on Alibaba Cloud PAI-DSW Gallery. The corresponding Jupyter Notebook is provided in the PAI-DSW Gallery so that users can build their own customized retrieval models with their own data.