diff --git a/README.md b/README.md
new file mode 100644
index 0000000..6a357cd
--- /dev/null
+++ b/README.md
@@ -0,0 +1,38 @@
+## Align before Fuse: Vision and Language Representation Learning with Momentum Distillation (Salesforce Research)
+
+This is the official PyTorch implementation of the ALBEF paper [Blog].
+This repository supports finetuning ALBEF on VQA, SNLI-VE, NLVR2, image-text retrieval on MSCOCO and Flickr30k,
+and visual grounding on RefCOCO+. Pre-trained and fine-tuned checkpoints are released.
+
+### Requirements:
+* pytorch 1.8.0
+* transformers 4.8.1
+
+### Download:
+* Pre-trained checkpoint
+* Dataset json files
+* Finetuned checkpoint for retrieval on MSCOCO
+* Finetuned checkpoint for VQA
+* Finetuned checkpoint for visual grounding on RefCOCO+
+
+### Visualization:
+We provide code in visualize.ipynb to visualize the important areas in an image for each word in a text.
+Here is an example visualization using the visual grounding checkpoint.
+
+### Image-Text Retrieval:
+1. Download the MSCOCO or Flickr30k datasets from their original websites.
+2. Download and extract the provided dataset json files.
+3. In configs/Retrieval_coco.yaml or configs/Retrieval_flickr.yaml, set the paths for the json files and the image path.
+4. Finetune the pre-trained checkpoint using 8 A100 GPUs:
+<pre>python -m torch.distributed.launch --nproc_per_node=8 --use_env Retrieval.py \
+--config ./configs/Retrieval_flickr.yaml \
+--output_dir output/Retrieval_flickr \
+--checkpoint [Pretrained checkpoint]</pre>
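Note: as a minimal setup sketch for the pinned requirements in the README above (the plain-pip form is an assumption; the exact torch wheel depends on your CUDA version and platform, so consult the PyTorch install selector if this variant does not match your system):

```shell
# Install the dependency versions pinned in the Requirements section.
# NOTE: "torch==1.8.0" here is the generic PyPI wheel; a CUDA-specific
# build (e.g. via the official PyTorch index URL) may be needed instead.
pip install torch==1.8.0 transformers==4.8.1
```

The finetuning command in the README assumes these versions; `torch.distributed.launch` with `--use_env` is the launcher style used by PyTorch 1.8-era scripts.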