
Low-Rank Few-Shot Adaptation of Vision-Language Models [CVPRW 2024]

The official implementation of Low-Rank Few-Shot Adaptation of Vision-Language Models.

Authors: Maxime Zanella, Ismail Ben Ayed.

We present CLIP-LoRA, an easy-to-use few-shot method for Vision-Language Models with fixed hyperparameters for every task and every number of shots. This repository also aims to facilitate the use of Low-Rank Adapters (LoRA) in Vision-Language Models like CLIP.

Figure 1: Low-Rank Adaptation (LoRA) is easy to use and does not create any additional inference latency.

Here is how to run the experiments:

  1. Installation
  2. Usage

A quick guide on how LoRA is implemented in this repository:

  1. LoRA in MultiheadAttention

Please consider supporting our work:

  1. Citation

If you have any inquiries:

  1. Contact

Installation

Environment configuration

Our code requires an environment with PyTorch installed. If you don't have one, consider creating a Python environment with:

conda create -y --name CLIP-LoRA python=3.10.0
conda activate CLIP-LoRA

Then install PyTorch, for instance with:

pip3 install torch==2.0.1 torchaudio==2.0.2 torchvision==0.15.2
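
Optionally, you can check that PyTorch is installed correctly and whether a GPU is visible with:

python -c "import torch; print(torch.__version__, torch.cuda.is_available())"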

Datasets installation

Please follow DATASETS.md to install the datasets.

How to execute CLIP-LoRA

Execute CLIP-LoRA on the ImageNet dataset with a random seed of 1 by entering the following command:

python main.py --root_path /path/to/your/data --dataset imagenet --seed 1

You can also execute CLIP-LoRA on the 10 other datasets:

python main.py --root_path /path/to/your/data --dataset dataset_name --seed 1

You can optionally provide a save_path to save the LoRA modules, which can be reloaded easily with the --eval_only argument. The code will automatically check that your trained LoRA has the corresponding rank, alpha, encoder, params and position to ensure compatibility. The folder will be structured as follows:

/your/save/path
└── backbone
    └── dataset
        └── Xshots
            ├── seedY

Here is the corresponding evaluation command:

python main.py --root_path /path/to/your/data --dataset dataset_name --seed 1 --save_path /your/save/path --eval_only 
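
If you have not trained and saved the LoRA modules yet, a run without --eval_only (keeping the other arguments identical) should train them and save them under save_path first:

python main.py --root_path /path/to/your/data --dataset dataset_name --seed 1 --save_path /your/save/path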

LoRA in MultiheadAttention

The PlainMultiheadAttentionLoRA class in loralib/layers.py extends the standard PyTorch multi-head attention mechanism by incorporating Low-Rank Adaptation (LoRA). This class constructs explicit linear modules for each component of the attention mechanism—query (q), key (k), value (v), and output (o)—providing a structured and adaptable foundation for your experiments.

Class Overview

PlainMultiheadAttentionLoRA takes an existing nn.MultiheadAttention module, replicates its configuration, and integrates LoRA linear modules.

Key Features

  • Parameter Initialization: The initialization process involves copying weights and biases from a pre-existing multi-head attention model. Each LoRA module (q, k, v, o) is adapted based on the specified requirements in the enable_lora list.
  • LoRA Integration: The replacement of standard linear layers with LinearLoRA layers introduces low-rank matrices, which are parameterized by the rank of adaptation (r) and the scaling factor (lora_alpha); a minimal illustrative sketch of such a layer follows this list.
  • Forward Pass: The forward_module method manages the attention computation, incorporating optional dropout settings on the LoRA modules.
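
To make the roles of r and lora_alpha concrete, here is a minimal, self-contained sketch of a LoRA-augmented linear layer. It is only an illustration under simplified assumptions; the class name, initialization and details below are not those of the actual LinearLoRA in loralib/layers.py:

import torch
import torch.nn as nn

class LinearLoRASketch(nn.Module):
    # Illustrative toy version of a LoRA-augmented linear layer, showing only
    # the roles of the rank r and the scaling factor lora_alpha.
    def __init__(self, base_linear, r=4, lora_alpha=2, lora_dropout=0.0):
        super().__init__()
        self.base = base_linear
        # Freeze the pre-trained weights; only the low-rank factors are trained.
        for p in self.base.parameters():
            p.requires_grad = False
        # Low-rank factors: A (r x in_features) and B (out_features x r).
        self.lora_A = nn.Parameter(torch.randn(r, base_linear.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base_linear.out_features, r))
        self.scaling = lora_alpha / r
        self.dropout = nn.Dropout(lora_dropout)

    def forward(self, x):
        # Frozen path plus low-rank update: W x + (alpha / r) * B A x
        update = self.dropout(x) @ self.lora_A.t() @ self.lora_B.t()
        return self.base(x) + update * self.scaling

# Example: wrap a 512-dimensional projection with rank-2 adapters.
layer = LinearLoRASketch(nn.Linear(512, 512), r=2, lora_alpha=1)
y = layer(torch.randn(8, 512))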

Example Usage

The following snippet demonstrates how to initialize the PlainMultiheadAttentionLoRA with an existing multi-head attention module.

import torch.nn as nn

from loralib.layers import PlainMultiheadAttentionLoRA

# Initialize with an existing MultiheadAttention module
existing_mha = nn.MultiheadAttention(embed_dim=512, num_heads=8)
lora_mha = PlainMultiheadAttentionLoRA(existing_mha, enable_lora=['q', 'k', 'v', 'o'], r=4, lora_alpha=2)
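
After wrapping, you typically freeze everything except the low-rank factors before fine-tuning. The snippet below is only a sketch; it assumes, as in common LoRA implementations, that the adapter parameters carry "lora_" in their names:

# Keep only the LoRA factors trainable; the replicated attention weights stay frozen.
for name, param in lora_mha.named_parameters():
    param.requires_grad = 'lora_' in name

trainable = [name for name, param in lora_mha.named_parameters() if param.requires_grad]
print(trainable)  # should list only the low-rank adapter matrices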

Few-shot performance

Figure 2: Detailed few-shot learning results on the 10 fine-grained datasets and ImageNet with the ViT-B/16 visual backbone. Average performance for the ViT-B/16, ViT-B/32 and ViT-L/14 on the same 11 datasets is reported in the last three plots.

Citation

If you find this project useful, please cite it as follows:

@inproceedings{zanella2024low,
  title={Low-Rank Few-Shot Adaptation of Vision-Language Models},
  author={Zanella, Maxime and Ben Ayed, Ismail},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops},
  pages={1593--1603},
  year={2024}
}

Contact

For any inquiries, feel free to create an issue or contact us at [email protected].

Acknowledgement

We express our gratitude to the CoOp and Tip-Adapter authors for their open-source contribution.
