This is the official PyTorch implementation of the EMNLP 2022 paper "PEVL: Position-enhanced Pre-training and Prompt Tuning for Vision-language Models".
PEVL reformulates discretized object positions and language in a unified language modeling framework, which facilitates explicit vision-language alignment during pre-training and enables flexible prompt tuning for various downstream tasks. PEVL achieves impressive results among detector-free VLP models on position-sensitive tasks such as referring expression comprehension and phrase grounding, and also improves performance on position-insensitive tasks with grounded inputs such as visual commonsense reasoning, visual relation detection, and visual question answering (GQA). For more details, please see the PEVL paper.
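To make the position-as-language idea concrete, below is a minimal sketch of how a bounding box can be discretized into position tokens and spliced into a text prompt. The bin count, the `[pos_i]` token format, and the function name are illustrative assumptions, not the repository's actual implementation; see the code and configs for the real settings.

```python
# Illustrative sketch: discretize a bounding box into position tokens
# that can be appended to text for unified language modeling.
# NOTE: num_bins and the "[pos_i]" token format are assumptions for
# illustration only.

def box_to_position_tokens(box, image_w, image_h, num_bins=512):
    """Quantize an (x1, y1, x2, y2) box into discrete position tokens."""
    x1, y1, x2, y2 = box
    # Normalize coordinates to [0, 1], then map to integer bins.
    coords = [x1 / image_w, y1 / image_h, x2 / image_w, y2 / image_h]
    bins = [min(int(c * num_bins), num_bins - 1) for c in coords]
    return " ".join(f"[pos_{b}]" for b in bins)

# Example: attach a discretized box to a referring expression.
tokens = box_to_position_tokens((48, 32, 223, 310), image_w=640, image_h=480)
prompt = f"the woman holding an umbrella {tokens}"
print(prompt)  # e.g. "the woman holding an umbrella [pos_38] [pos_34] ..."
```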
## Installation
For installation instructions, please refer to INSTALL.
## Pre-training
Before pre-training, we initialize PEVL's weights with the parameters of the ALBEF 14M checkpoint.
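A minimal sketch of what this initialization step can look like in plain PyTorch is shown below. It is not the repository's loading code: the `'model'` checkpoint key, the function name, and the remark about position-token embeddings are assumptions for illustration.

```python
# Illustrative sketch of initializing from an ALBEF checkpoint.
# Assumption: the checkpoint stores weights under a 'model' key; parameters
# introduced by PEVL (e.g. embeddings for the added position tokens) have no
# counterpart in ALBEF and are simply left at their fresh initialization.
import torch

def load_albef_weights(model, ckpt_path, device="cpu"):
    checkpoint = torch.load(ckpt_path, map_location=device)
    state_dict = checkpoint.get("model", checkpoint)  # tolerate both layouts
    # strict=False: tolerate keys that exist only on one side.
    missing, unexpected = model.load_state_dict(state_dict, strict=False)
    print(f"missing keys: {len(missing)}, unexpected keys: {len(unexpected)}")
    return model
```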
Our raw pre-training corpus comes from Visual Commonsense Reasoning (VCR) and MDETR, which collects images from the Flickr30k Entities, COCO, and Visual Genome datasets.
- MDETR Data
- Download VCR data from the original VCR website.
You can download our first-stage pre-trained model from pre-trained pevl. We then conduct second-stage pre-training and fine-tuning for all downstream tasks.
## Referring Expression Comprehension
- Second-stage pre-trained checkpoint for position output tasks.
- Dataset json files for position output downstream tasks (the 'file_name' field in each json file needs to be changed to your own image directory; see the sketch at the end of this section).
- In configs/visual_grounding.yaml, set the paths for the json files.
- Fine-tune the model using 4 V100 GPUs:
### RefCOCO
#### Train
python -m torch.distributed.launch --nproc_per_node=4 --master_port=12451 --use_env run_grounding_train.py --train 1 --pretrain 0 --test_dataset refcoco --config ./configs/visual_grounding.yaml --output_dir ./output/visual_grounding/refcoco --checkpoint grounding.pth --eval_step 500
#### Evaluate
python -m torch.distributed.launch --nproc_per_node=1 --master_port=12451 --use_env run_grounding_train.py --train 0 --pretrain 0 --test_dataset refcoco --config ./configs/visual_grounding.yaml --output_dir ./output/visual_grounding/refcoco_test --checkpoint [Finetuned checkpoint]
### RefCOCOg
#### Train
python -m torch.distributed.launch --nproc_per_node=4 --master_port=12451 --use_env run_grounding_train.py --train 1 --pretrain 0 --test_dataset refcocog --config ./configs/visual_grounding.yaml --output_dir ./output/visual_grounding/refcocog --checkpoint grounding.pth --eval_step 500
#### Evaluate
python -m torch.distributed.launch --nproc_per_node=1 --master_port=12451 --use_env run_grounding_train.py --train 0 --pretrain 0 --test_dataset refcocog --config ./configs/visual_grounding.yaml --output_dir ./output/visual_grounding/refcocog_test --checkpoint [Finetuned checkpoint]
### RefCOCO+
#### Train
python -m torch.distributed.launch --nproc_per_node=4 --master_port=12451 --use_env run_grounding_train.py --train 1 --pretrain 0 --test_dataset refcocop --config ./configs/visual_grounding.yaml --output_dir ./output/visual_grounding/refcocop --checkpoint grounding.pth --eval_step 500
#### Evaluate
python -m torch.distributed.launch --nproc_per_node=1 --master_port=12451 --use_env run_grounding_train.py --train 0 --pretrain 0 --test_dataset refcocop --config ./configs/visual_grounding.yaml --output_dir ./output/visual_grounding/refcocop_test --checkpoint [Finetuned checkpoint]
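The 'file_name' note above applies to all of the downstream dataset json files. Below is a minimal sketch of that one-time rewrite, assuming each json file is a flat list of entries carrying a 'file_name' key; the file and directory names in the example are hypothetical and the script is not part of this repository.

```python
# Illustrative one-time rewrite of the 'file_name' fields in a dataset json.
# Assumption: each file is a json list of entries with a 'file_name' key;
# only the directory part is replaced, the image file name itself is kept.
import json
import os

def rewrite_file_names(json_path, image_root, out_path=None):
    with open(json_path) as f:
        entries = json.load(f)
    for entry in entries:
        entry["file_name"] = os.path.join(image_root, os.path.basename(entry["file_name"]))
    with open(out_path or json_path, "w") as f:
        json.dump(entries, f)

# Hypothetical example: point the RefCOCO annotations at a local COCO image directory.
# rewrite_file_names("refcoco_train.json", "/data/coco/train2014")
```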
## Phrase Grounding
- Second-stage pre-trained checkpoint for position output tasks.
- Dataset json files for position output downstream tasks.
- In configs/visual_grounding.yaml, set the paths for the json files.
- Fine-tune the model using 8 V100 GPUs:
### Flickr30k
#### Train
python -m torch.distributed.launch --nproc_per_node=8 --master_port=12451 --use_env run_grounding_train.py --train 1 --pretrain 0 --test_dataset flickr --config ./configs/visual_grounding.yaml --output_dir ./output/phrase_grounding --checkpoint grounding.pth --eval_step 500
#### Evaluate
python -m torch.distributed.launch --nproc_per_node=1 --master_port=12451 --use_env run_grounding_train.py --train 0 --pretrain 0 --test_dataset flickr --config ./configs/visual_grounding.yaml --output_dir ./output/phrase_grounding --checkpoint [Finetuned checkpoint]
## Visual Relation Detection
- Second-stage pre-trained checkpoint for visual relation detection.
- Download PEVL's VRD dataset json files for visual relation detection from pevl_vrd and the VRD images from Visual Genome.
- In configs/vrd.yaml, set the paths for the json files.
- Fine-tune the model using 8 V100 GPUs:
### Fine-tuning on Visual Genome
python -m torch.distributed.launch --nproc_per_node=8 --master_port=12451 --use_env run_vrd_train.py --train 1 --pretrain 0 --mode finetune --config ./configs/vrd.yaml --output_dir ./output/vrd --checkpoint vrd.pth
### Evaluation on Visual Genome
python -m torch.distributed.launch --nproc_per_node=1 --master_port=12451 --use_env run_vrd_train.py --train 0 --pretrain 0 --config ./configs/vrd.yaml --checkpoint [Finetuned checkpoint]
## Visual Commonsense Reasoning
- Second-stage pre-trained checkpoint for visual commonsense reasoning.
- Fine-tuned checkpoint for visual commonsense reasoning.
- Download PEVL's VCR dataset json files from vcr data and the VCR images from the original VCR website.
- In configs/vcr.yaml, set the paths for the json files and VCR images.
## Visual Question Answering (GQA)
- Download PEVL's GQA dataset json files from pevl_gqa and the GQA images from the original GQA website.
- In configs/gqa.yaml, set the paths for the json files and GQA images (a path sanity-check sketch is given below).
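Each downstream task reads its file paths from a yaml config (configs/visual_grounding.yaml, configs/vrd.yaml, configs/vcr.yaml, configs/gqa.yaml). The following optional script is purely illustrative and not part of this repository: it walks a config with PyYAML and flags path-like string values that do not exist on disk, which can save a failed multi-GPU launch. The heuristic for deciding what counts as a path is an assumption.

```python
# Illustrative helper (not part of this repository): after editing a task
# config such as configs/gqa.yaml, verify that every path-like value exists
# before launching a distributed run.
import os
import sys
import yaml  # PyYAML

def check_paths(node, prefix=""):
    """Recursively walk the parsed yaml and report path-like strings that do not exist."""
    if isinstance(node, dict):
        for key, value in node.items():
            check_paths(value, f"{prefix}{key}.")
    elif isinstance(node, list):
        for i, value in enumerate(node):
            check_paths(value, f"{prefix}{i}.")
    elif isinstance(node, str) and (node.endswith(".json") or node.startswith("/")):
        status = "ok" if os.path.exists(node) else "MISSING"
        print(f"{status:7s} {prefix[:-1]} -> {node}")

if __name__ == "__main__":
    # usage: python check_config.py configs/gqa.yaml
    with open(sys.argv[1]) as f:
        check_paths(yaml.safe_load(f))
```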
## Citation
If you find this project helpful for your research, please consider citing our paper in your publications.
@inproceedings{yao2022pevl,
title={PEVL: Position-enhanced Pre-training and Prompt Tuning for Vision-language Models},
author={Yao, Yuan and Chen, Qianyu and Zhang, Ao and Ji, Wei and Liu, Zhiyuan and Chua, Tat-Seng and Sun, Maosong},
booktitle={Proceedings of EMNLP},
year={2022}
}
## Acknowledgement
The implementation of PEVL relies on resources from ALBEF in particular, as well as Huggingface Transformers and timm. We thank the original authors for their open-sourcing and excellent work.