CAPIVARA: Cost-Efficient Approach for Improving Multilingual CLIP Performance on Low-Resource Languages


In this project, we propose CAPIVARA, a cost-efficient framework designed to enhance the performance of multilingual CLIP models in low-resource languages. Our framework is built upon pre-trained OpenCLIP and implements both conventional fine-tuning and an optimized fine-tuning (CAPIVARA + Opt.) that uses LoRA and gradient checkpointing to reduce the computational cost.

CAPIVARA holds the state of the art in many zero-shot tasks involving images and Portuguese texts. Moreover, our method has the potential to significantly improve model performance in other low-resource languages using a single Quadro RTX 8000 GPU for just 2 hours of training.
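Since the framework builds on pre-trained OpenCLIP, the multilingual baseline can be loaded directly with the open_clip library. The snippet below is only a minimal sketch: the pretrained tag is an assumption, and the CAPIVARA fine-tuned weights themselves come from the checkpoints we release (see the Inference section).

import open_clip
import torch

# Minimal sketch (not the training code): load the multilingual OpenCLIP baseline
# (ViT-B/32 + XLM-Roberta Base). The pretrained tag below is an assumption.
model, _, preprocess = open_clip.create_model_and_transforms(
    "xlm-roberta-base-ViT-B-32", pretrained="laion5b_s13b_b90k"
)
tokenizer = open_clip.get_tokenizer("xlm-roberta-base-ViT-B-32")

with torch.no_grad():
    text_features = model.encode_text(tokenizer(["uma foto de uma capivara"]))
print(text_features.shape)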

Models

Pipeline

In our pipeline, we employed the following models:

  • Translator: Google Translate
  • Image captioning: BLIP2

Results

Performance improvement with CAPIVARA + Opt. in Low-Resource Languages: Xhosa, Hindi, and Portuguese. The percentage point increase over the baseline (OpenCLIP ViT-B/32 XLM-Roberta Base) in terms of mean recall for text-to-image (txt2img) and image-to-text (img2txt) retrieval is highlighted above the respective bars.

Zero-shot Cross-Modal Retrieval

We conducted zero-shot cross-modal retrieval experiments on Flickr30k and MS COCO, with captions translated into Portuguese, and on PraCegoVer. We report the average and standard deviation over 3 runs.

| Models | Flickr30k text-to-image | Flickr30k image-to-text | MS COCO text-to-image | MS COCO image-to-text | PraCegoVer text-to-image | PraCegoVer image-to-text |
|---|---|---|---|---|---|---|
| OpenCLIP ViT-B/32 XLM-Roberta Base (Baseline) | 76.23 | 87.93 | 52.62 | 66.55 | 65.36 | 69.43 |
| CAPIVARA | 79.56 ± 0.01 | 89.95 ± 0.04 | 56.27 ± 0.01 | 71.24 ± 0.01 | 66.40 ± 0.01 | 64.75 ± 0.01 |
| CAPIVARA + Opt. | 79.39 ± 0.05 | 89.13 ± 0.08 | 55.49 ± 0.06 | 69.26 ± 0.05 | 66.89 ± 0.04 | 67.93 ± 0.01 |

Zero-shot image classification

| Dataset | OpenCLIP ViT-B/32 XLM-Roberta Base (Baseline) | CAPIVARA | CAPIVARA + Opt. |
|---|---|---|---|
| Caltech-101 | 84.53 ± 0.00 | 82.97 ± 0.03 | 83.68 ± 0.02 |
| CIFAR-10 | 93.99 ± 0.00 | 93.85 ± 0.00 | 93.93 ± 0.03 |
| CIFAR-100 | 68.44 ± 0.00 | 69.37 ± 0.01 | 68.87 ± 0.01 |
| Country-211 | 17.82 ± 0.00 | 17.61 ± 0.00 | 17.32 ± 0.02 |
| DTD | 41.17 ± 0.00 | 42.34 ± 0.04 | 41.79 ± 0.07 |
| EuroSAT | 47.16 ± 0.00 | 47.77 ± 0.02 | 48.85 ± 0.12 |
| FER-2013 | 48.65 ± 0.00 | 46.68 ± 0.05 | 46.85 ± 0.13 |
| FGVC-Aircraft | 26.30 ± 0.00 | 25.49 ± 0.01 | 25.54 ± 0.09 |
| Food-101 | 65.06 ± 0.00 | 64.58 ± 0.01 | 64.46 ± 0.00 |
| GTSRB | 43.27 ± 0.00 | 46.34 ± 0.01 | 44.66 ± 0.06 |
| Hateful-Memes | 56.50 ± 0.00 | 56.17 ± 0.00 | 56.81 ± 0.03 |
| KITTI-Distance | 28.41 ± 0.00 | 33.94 ± 0.13 | 28.27 ± 0.11 |
| MNIST | 54.99 ± 0.00 | 60.14 ± 0.04 | 55.00 ± 0.10 |
| Oxford Flowers-102 | 50.88 ± 0.00 | 49.93 ± 0.02 | 51.99 ± 0.12 |
| Oxford-IIIT Pets | 81.56 ± 0.00 | 79.37 ± 0.00 | 80.90 ± 0.09 |
| PatchCamelyon | 50.96 ± 0.00 | 51.71 ± 0.01 | 52.39 ± 0.07 |
| Rendered-SST2 | 54.20 ± 0.00 | 54.82 ± 0.03 | 52.94 ± 0.04 |
| RESISC-45 | 58.51 ± 0.00 | 59.71 ± 0.01 | 56.93 ± 0.01 |
| Stanford-Cars | 84.93 ± 0.00 | 85.10 ± 0.02 | 84.90 ± 0.06 |
| PASCAL VOC-2007 | 82.09 ± 0.00 | 82.29 ± 0.00 | 81.99 ± 0.02 |
| Average | 56.97 ± 0.00 | 57.51 ± 0.02 | 56.90 ± 0.06 |
| ImageNet-1k | 45.84 ± 0.00 | 46.06 ± 0.01 | 45.65 ± 0.02 |

Reproducibility

Installation

Run the following command to install required packages.

pip install -r requirements.txt

Code organization

├─ README.md
├─ assets
│  ├─ capivara.png
│  ├─ low-resource-lang.png
│  └─ pipeline.png
├─ clip_pt
│  ├─ experiment_setup					<--- training setup files in .yaml format
│  │  └─ capivara.yaml
│  ├─ requirements.txt
│  └─ src
│     ├─ evaluate
│     │  ├─ utils
│     │  │  ├─ metric.py				<--- metrics used in ELEVATER
│     │  │  ├─ voc2007.py				<--- metric used in PASCAL VOC-2007
│     │  │  └─ resources				<--- setup files used for inference
│     │  ├─ zero_shot_elevater.py			<--- script used for zero-shot image classification in ELEVATER
│     │  ├─ zero_shot_image_classification.py		<--- script used for zero-shot image classification in ImageNet, ObjectNet, and GroceryStore
│     │  ├─ zero_shot_imagenet_babel.py			<--- script used for zero-shot image classification in ImageNet Babel
│     │  └─ zero_shot_retrieval.py			<--- script used for zero-shot cross-modal retrieval
│     ├─ generating
│     │  ├─ blip2.py					<--- script used for generating synthetic captions
│     │  └─ generated_caps_sim_score.py			<--- script used for computing similarity score between captions and images
│     ├─ main_open_clip.py				<--- main script used for training
│     ├─ models
│     │  ├─ open_CLIP.py				<--- base CLIP class
│     │  ├─ open_CLIP_adapter.py			<--- CLIP + LoRA class
│     │  └─ open_clip_wrapper.py			<--- Wrapper that implements the training methods using PyTorch-lightning
│     ├─ recipes					<--- auxiliary executable files
│     └─ utils
│        ├─ carbon_tracker.py				<--- methods used to estimate the carbon footprint
│        ├─ loss.py					<--- loss function
│        ├─ open_clip_utils.py				<--- implements auxiliary methods
│        ├─ scheduler.py				<--- implements learning rate schedulers
│        └─ dataset
│           ├─ evaluation_dataset.py			<--- base evaluation class
│           ├─ grocery_store_dataset.py			<--- implements grocery store evaluation class 
│           ├─ imagenet_dataset.py			<--- implements ImageNet evaluation class 
│           ├─ object_net.py				<--- implements ObjectNet evaluation class 
│           └─ load_datasets_open_clip.py		<--- methods to load train/val datasets
└─ preprocessing					<--- auxiliary dataset preprocessing methods

Data preprocessing

Dataset translation

Since the texts we use are translated from English into the target languages, any new data added beyond what we provide must also be translated. We used Google Translate for this. First, we extracted all the captions from each dataset. Then we translated the captions with the translator. Finally, we added the translated captions back to their original datasets, tagged with the target language. All datasets are kept in their original format to make them easier to use for people who already work with them.
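The exact preprocessing depends on the dataset format, but the workflow above can be sketched roughly as follows; translate_to is a hypothetical placeholder for the translation backend (we used Google Translate), and the file and field names are illustrative:

import json

def translate_to(text, lang):
    # Hypothetical placeholder: call your translation backend here
    # (in this project the captions were translated with Google Translate).
    raise NotImplementedError

# Illustrative sketch: extract the original captions, translate them, and add
# them back to the annotations tagged with the target language.
with open("annotations.json") as f:            # illustrative file name
    samples = json.load(f)

for sample in samples:
    for lang in ("pt-br", "xh", "hi"):
        sample[f"captions-{lang}"] = [translate_to(c, lang) for c in sample["captions"]]

with open("annotations_translated.json", "w") as f:
    json.dump(samples, f, ensure_ascii=False)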

Generating synthetic captions

To generate the synthetic captions, you can run the following command:

python3 generating/blip2.py --dataset-path "your_webdataset/{00000..00999}.tar" --gpu 0 --batch 100

It uses BLIP2 to generate captions starting with the prefixes:

prefixes = ["the foreground features", "a photo of", "a picture of",
            "this is a scene depicting", "an image of", "portrait of a",
            "this image captures a moment of", "a painting of", "an art of",
            "the picture shows"]

A new dataset is then saved under the name "dataset_name_{postfix-path}", where --postfix-path is an optional argument.
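The script blip2.py handles this over WebDataset shards; as a rough illustration of the underlying idea, prefix-prompted captioning with the Hugging Face transformers BLIP-2 API looks roughly like the sketch below (the checkpoint and generation settings are assumptions, not necessarily those used in blip2.py):

import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b", torch_dtype=torch.float16
).to("cuda")

image = Image.open("example.jpg")  # placeholder image
prefix = "a photo of"              # one of the prefixes listed above

# Condition the generation on the prefix and decode the completion
inputs = processor(images=image, text=prefix, return_tensors="pt").to("cuda", torch.float16)
generated_ids = model.generate(**inputs, max_new_tokens=20)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip())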

Train

CAPIVARA is built using pytorch-lightning. The file example.yaml lists all the parameters that can be used by CAPIVARA.

For simple and straightforward training of the model, the following command can be used:

python3 main_open_clip.py --config_path=path/to/config_file

To use the adapter (LoRA) training settings, you must also pass the directory of the checkpoint to be used:

python3 main_open_clip.py \
		--config_path=path/to/config_file \
		--checkpoint-dir=path/to/checkpoint 

Other settings (all listed in the file example.yaml) are available to configure the training according to your needs.

Inference

In order to make it easier to replicate our experiments, we share the scripts we used for inference.

Zero-shot Text-to-Image Retrieval

The following method can be used to retrieve images:

import pandas as pd
import tqdm

# `text_tokenizer` and `device` are assumed to be defined beforehand
# (the OpenCLIP tokenizer and the torch device the model runs on).
def text_to_image_retrieval(text_required, model, image_features, text_features, all_images, all_texts):
    all_texts = sum(all_texts, [])  # flatten the list of caption lists
    caption = []
    for text in text_required:
        if type(text) != int:
            # free-form query: tokenize and encode it with the text encoder
            caption.append(text)
            text_features = text_tokenizer(text)
            text_features = model.encode_text(text_features.to(device))
        else:
            # index of a pre-computed caption: reuse the features passed in
            caption.append([text])
        similarities = []
        for i in tqdm.tqdm(range(len(image_features)), desc="t2i retrieval"):
            if type(text) == int:
                scores = text_features[text] @ image_features[i].t()
            else:
                scores = text_features @ image_features[i].t()
            item = {
                'score': scores.cpu(),
                'id': i,
                'image': all_images[i].cpu()
            }
            similarities.append(item)
        similarities_df = pd.DataFrame(similarities)
        sorted_similarities_df = similarities_df.sort_values(by='score', ascending=False)
    return sorted_similarities_df, caption

This returns a DataFrame with the similarity score between the input text and each image in the set, along with the image ids and the images themselves, sorted by score.
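A hypothetical call, assuming the features have already been pre-computed with the CAPIVARA encoders (as in the retrieval example notebook), could look like:

# Hypothetical usage; `model`, `image_features`, `text_features`,
# `all_images`, and `all_texts` are assumed to be pre-computed.
ranking, caption = text_to_image_retrieval(
    ["uma foto de um cachorro na praia"],  # free-form Portuguese query
    model, image_features, text_features, all_images, all_texts,
)
print(caption[0])
print(ranking[["id", "score"]].head(5))  # ids of the 5 most similar images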

Zero-shot Image-to-Text Retrieval

As a complement, the method below retrieves text from a target image.

def image_to_text_retrieval(image_required, image_features, text_features, all_images, all_texts):
    all_texts = sum(all_texts, [])
    images_selected = []
    for image in image_required:
        images_selected.append(all_images[image])
        similarities = []
        for i in tqdm.tqdm(range(len(text_features)), desc="i2t retrieval"):
            scores = text_features[i] @ image_features[image].t()  # similarity between caption i and the selected image
            item = {
                'score': scores.cpu(),
                'id': i,
                'text': all_texts[i]
            }
            similarities.append(item)
        similarities_df = pd.DataFrame(similarities)
        sorted_similarities_df = similarities_df.sort_values(by='score', ascending=False)
    return sorted_similarities_df, images_selected

This method returns a DataFrame with the similarity scores between the input image and each text in the set, along with the text ids and the texts themselves. The use of these and other auxiliary methods can also be seen in the retrieval example notebook, where it is possible to interactively retrieve images and texts.
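Similarly, a hypothetical call for image-to-text retrieval with the same pre-computed features could be:

# Hypothetical usage: retrieve the captions closest to the image with index 0.
ranking, images_selected = image_to_text_retrieval(
    [0], image_features, text_features, all_images, all_texts,
)
print(ranking[["id", "text", "score"]].head(5))  # 5 best-matching captions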

Retrieval Evaluation

To automatically evaluate image and text retrieval and generate the metrics reported in the paper, you can use the python script zero_shot_retrieval.py with the following parameters (see the example invocation after the list):

--model-path # directs to the path of the model checkpoint
--dataset-path # path to validation/test dataset
--translation # select which translation framework will be used "english", "marian", "google" (default)
--language # language used for captions: "en" (default), "xh", "hi"
--batch # batch size
--open_clip # indicates whether model is fine-tuned (True) or is the original OpenCLIP (False)
--gpu # select GPU
--adapter # load the adapter weights
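For instance, a hypothetical invocation (the paths are placeholders and flag values should be adapted to your setup) could look like:

python3 evaluate/zero_shot_retrieval.py \
		--model-path path/to/capivara_checkpoint.ckpt \
		--dataset-path "your_webdataset/{00000..00999}.tar" \
		--translation google \
		--batch 100 \
		--gpu 0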

Zero-shot image classification

To use the model as a classifier, the following code can be used:

import torch

# `model` wraps the fine-tuned CLIP model; `batch` is assumed to contain the
# images and the class-prompt texts, `targets` the ground-truth label indices,
# and `k` the number of top predictions to consider.
img_features, txt_features = model.model(batch)
logits, _ = model.model.compute_logits(
    img_features,
    txt_features,
    fixed_logit=False
)  # shape: [n_imgs, n_classes]

# Rank the classes by similarity and keep the k most likely ones per image
predictions = torch.argsort(logits, descending=True)
predicted_labels = predictions[:, :k]

# Check if the target label is in the top-k predictions for each sample
correct_predictions = (predicted_labels == targets.view(-1, 1)).any(dim=1)

The logits relate each image to every class prompt, so sorting them yields a ranking of classes per image; we then check whether the target label appears among the top-k predictions. A classification example notebook is also available.
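If an aggregate metric is needed, the boolean hits above can simply be averaged; this short continuation is a sketch rather than code taken from the evaluation scripts:

# Sketch: fraction of samples whose target label appears in the top-k predictions
top_k_accuracy = correct_predictions.float().mean().item()
print(f"top-{k} accuracy: {100 * top_k_accuracy:.2f}%")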

Acknowledgements

This project was supported by the Ministry of Science, Technology, and Innovation of Brazil, with resources granted by Federal Law 8.248 of October 23, 1991, under the PPI-Softex. The project was coordinated by Softex and published as Intelligent agents for mobile platforms based on Cognitive Architecture technology [01245.013778/2020-21]. D.A.B.M. is partially funded by FAPESP 2023/05939-5. A.I.F., T.S., and N.S. are partially funded by the Centro de Excelência em Inteligência Artificial (CEIA) at the Universidade Federal de Goiás (UFG). E.L.C. is partially funded by CNPq 315468/2021-1. H.P. is partially funded by CNPq 304836/2022-2. S.A. is partially funded by CNPq 315231/2020-3, FAPESP 2013/08293-7, 2020/09838-0, and a Google Award for Inclusion Research 2022.

Citation

@inproceedings{santos2023capivara,
  title={CAPIVARA: Cost-Efficient Approach for Improving Multilingual CLIP Performance on Low-Resource Languages},
  author={Santos, Gabriel O. dos and Moreira, Diego A. B. and Ferreira, Alef I. and Silva, Jhessica and Pereira, Luiz and Bueno, Pedro and Sousa, Thiago and Maia, Helena and da Silva, N{\'a}dia and Colombini, Esther and Pedrini, Helio and Avila, Sandra},
  booktitle = "Workshop on Multi-lingual Representation Learning (MRL), Conference on Empirical Methods in Natural Language Processing (EMNLP)",
  year = "2023"
}