LeGrad

An Explainability Method for Vision Transformers via Feature Formation Sensitivity

Walid Bousselham¹, Angie Boggust², Sofian Chaybouti¹, Hendrik Strobelt³,⁴ and Hilde Kuehne¹,³

¹ University of Bonn & Goethe University Frankfurt, ² MIT CSAIL, ³ MIT-IBM Watson AI Lab, ⁴ IBM Research.

Vision-Language foundation models have shown remarkable performance in various zero-shot settings such as image retrieval, classification, or captioning. We propose LeGrad, an explainability method specifically designed for ViTs. With LeGrad, we explore the decision-making process of such models by leveraging their feature formation process. A by-product of understanding the decision-making of VL models is the ability to produce a localised heatmap for any text prompt.

The following is the code for a wrapper around the OpenCLIP library to equip VL models with LeGrad.

🔨 Installation

The legrad library can be installed via pip:

$ pip install legrad_torch
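
Note that the pip package is named legrad_torch while the snippets in this README import the module as legrad. As a quick sanity check after installation, the minimal sketch below uses only the Python standard library to report whether the modules used later in this README can be located. The helper name is_importable is hypothetical and is not part of the legrad package.

```python
import importlib.util


def is_importable(module_name: str) -> bool:
    """Return True if a top-level module can be located, without importing it."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # Module names taken from the import statements used in this README.
    for name in ("legrad", "open_clip", "torch"):
        status = "found" if is_importable(name) else "missing"
        print(f"{name}: {status}")
```

If legrad is reported as missing, the install most likely went into a different Python environment than the one running the check.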

Demo

Try out our web demo on HuggingFace Spaces.
Run the demo on Google Colab.
Run playground.py for a usage example.

To run the gradio app locally, first install gradio and then run app.py:

$ pip install gradio
$ python app.py

Usage

To see which pretrained models are available, use the following code snippet:

import legrad
legrad.list_pretrained()

Single Image

To process an image and a text prompt, use the following code snippet:

Note: the wrapper does not affect the original model, hence all the functionalities of OpenCLIP models can be used seamlessly.

import requests
from PIL import Image
import open_clip
import torch

from legrad import LeWrapper, LePreprocess
from legrad.utils import visualize

# ------- model's parameters -------
model_name = 'ViT-B-16'
pretrained = 'laion2b_s34b_b88k'
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# ------- init model -------
model, _, preprocess = open_clip.create_model_and_transforms(
    model_name=model_name, pretrained=pretrained, device=device)
tokenizer = open_clip.get_tokenizer(model_name=model_name)
model.eval()
# ------- Equip the model with LeGrad -------
model = LeWrapper(model)
# ___ (Optional): Wrapper for Higher-Res input image ___
preprocess = LePreprocess(preprocess=preprocess, image_size=448)

# ------- init inputs: image + text -------
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = preprocess(Image.open(requests.get(url, stream=True).raw)).unsqueeze(0).to(device)
text = tokenizer(['a photo of a cat']).to(device)

# ------- compute the text embedding and the LeGrad heatmap -------
text_embedding = model.encode_text(text, normalize=True)
print(image.shape)
explainability_map = model.compute_legrad_clip(image=image, text_embedding=text_embedding)

# ___ (Optional): Visualize overlay of the image + heatmap ___
visualize(heatmaps=explainability_map, image=image)
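
Because a heatmap can be produced for any text prompt, it is often convenient to explain several prompts for the same image. The sketch below wraps the calls from the snippet above (tokenizer, encode_text, compute_legrad_clip) in a loop. The helper explain_prompts is hypothetical and not part of the legrad package; it handles one prompt at a time, so it makes no assumption about whether compute_legrad_clip accepts a batch of text embeddings.

```python
from typing import Any, Dict, Sequence


def explain_prompts(model: Any, tokenizer: Any, image: Any,
                    prompts: Sequence[str], device: Any) -> Dict[str, Any]:
    """Compute one explainability map per prompt for a single preprocessed image.

    `model` is expected to be a LeWrapper-equipped model and `tokenizer` the
    OpenCLIP tokenizer, exactly as created in the snippet above.
    """
    maps: Dict[str, Any] = {}
    for prompt in prompts:
        # Same three steps as in the single-prompt example, once per prompt.
        text = tokenizer([prompt]).to(device)
        text_embedding = model.encode_text(text, normalize=True)
        maps[prompt] = model.compute_legrad_clip(
            image=image, text_embedding=text_embedding)
    return maps
```

With the variables from the example above, explain_prompts(model, tokenizer, image, ['a photo of a cat', 'a photo of a remote control'], device) would return a dictionary keyed by prompt, and each value can be passed to visualize together with the image.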

⭐ Acknowledgement

This code is built as a wrapper around the OpenCLIP library from LAION; visit their repo for more vision-language models. This project also takes inspiration from Transformer-MM-Explainability and the timm library; please visit their repositories.

📚 Citation

If you find this repository useful, please consider citing our work 📝 and giving a star 🌟:

@article{bousselham2024legrad,
  author    = {Bousselham, Walid and Boggust, Angie and Chaybouti, Sofian and Strobelt, Hendrik and Kuehne, Hilde},
  title     = {LeGrad: An Explainability Method for Vision Transformers via Feature Formation Sensitivity},
  journal   = {arXiv preprint arXiv:2404.03214},
  year      = {2024},
}
