SymDPO: Boosting In-Context Learning of Large Multimodal Models with Symbol Demonstration Direct Preference Optimization [[Paper](https://arxiv.org/abs/2411.11909)]
As language models continue to scale, Large Language Models (LLMs) have exhibited remarkable capabilities in In-Context Learning (ICL), enabling them to solve language tasks by prefixing the input with a few in-context demonstrations (ICDs). Inspired by these advancements, researchers have extended these techniques to Large Multimodal Models (LMMs) with ICL capabilities.
However, existing LMMs face a critical issue: they often fail to effectively leverage the visual context in multimodal demonstrations and instead simply follow textual patterns. This highlights a lack of alignment between multimodal demonstrations and model outputs.
To address this issue, we propose Symbol Demonstration Direct Preference Optimization (SymDPO). SymDPO breaks the traditional paradigm of constructing multimodal demonstrations by replacing text answers in examples with random symbols, forcing the model to carefully understand the demonstration images and establish relationships between images and symbols to answer questions correctly.
We validate the effectiveness of SymDPO on multiple benchmarks, demonstrating its ability to enhance the multimodal context understanding of LMMs and improve their ability to answer questions accurately.
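For intuition, here is a minimal, purely illustrative sketch of the core idea behind a symbol demonstration: the textual answers inside the in-context demonstrations are swapped for random symbol strings, so the model can only answer the query correctly by grounding those symbols in the demonstration images. The data format and helper name below are assumptions for illustration, not the repository's actual data pipeline.

```python
import random
import string

def symbolize_demonstrations(demonstrations):
    """Replace each demonstration's textual answer with a random symbol string.

    `demonstrations` is assumed to be a list of dicts with keys
    "image", "question", and "answer" (an illustrative format, not the repo's).
    Returns the symbolized demonstrations and the answer -> symbol mapping,
    so the expected query answer can be expressed in the same symbol space.
    """
    answer_to_symbol = {}
    symbolized = []
    for demo in demonstrations:
        answer = demo["answer"]
        if answer not in answer_to_symbol:
            # e.g. "qzkfw": a meaningless token that cannot be guessed from text patterns alone
            answer_to_symbol[answer] = "".join(random.choices(string.ascii_lowercase, k=5))
        symbolized.append({**demo, "answer": answer_to_symbol[answer]})
    return symbolized, answer_to_symbol
```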
- Clone the repository:

  ```bash
  git clone https://github.com/APiaoG/SymDPO.git
  cd SymDPO
  ```

- Install dependencies:

  ```bash
  conda env create -f environment.yml
  ```
Here’s how to use the model for inference:

- Prepare your multimodal examples (including images and text answers).
- Download the model weights. (The model weights trained with SymDPO will be released soon.)
- Modify the file paths in `open_flamingo/scripts/inference.py` accordingly.
- Run `inference.py` to perform inference:

```bash
conda activate SymDPO
python open_flamingo/scripts/inference.py
```
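If you would rather call the model from your own script instead of editing `inference.py`, the sketch below follows the standard OpenFlamingo generation API (`create_model_and_transforms` plus `model.generate` with `vision_x`/`lang_x`). The checkpoint path, image files, and prompt text are placeholders, and the exact model configuration and prompt format used by this repository may differ:

```python
import torch
from PIL import Image
from open_flamingo import create_model_and_transforms

# Build the OpenFlamingo-3B-Instruct architecture (configuration assumed; adjust to your checkpoint).
model, image_processor, tokenizer = create_model_and_transforms(
    clip_vision_encoder_path="ViT-L-14",
    clip_vision_encoder_pretrained="openai",
    lang_encoder_path="anas-awadalla/mpt-1b-redpajama-200b-dolly",
    tokenizer_path="anas-awadalla/mpt-1b-redpajama-200b-dolly",
    cross_attn_every_n_layers=1,
)
# Placeholder path: the SymDPO-trained weights will be released by the authors.
model.load_state_dict(torch.load("path/to/symdpo_checkpoint.pt"), strict=False)

# Two in-context demonstrations plus one query image.
images = [Image.open(p) for p in ("demo1.jpg", "demo2.jpg", "query.jpg")]
# vision_x shape: (batch, num_media, num_frames, channels, height, width)
vision_x = torch.stack([image_processor(im) for im in images]).unsqueeze(1).unsqueeze(0)

tokenizer.padding_side = "left"
lang_x = tokenizer(
    ["<image>Question: What animal is this? Short answer: cat<|endofchunk|>"
     "<image>Question: What color is the bus? Short answer: red<|endofchunk|>"
     "<image>Question: What is the person holding? Short answer:"],
    return_tensors="pt",
)

generated = model.generate(
    vision_x=vision_x,
    lang_x=lang_x["input_ids"],
    attention_mask=lang_x["attention_mask"],
    max_new_tokens=20,
    num_beams=3,
)
print(tokenizer.decode(generated[0]))
```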
To use SymDPO to train the model built upon OpenFlamingo, follow these steps:
Modify the file paths in `open_flamingo/scripts/train_dpo.sh`, then run the following command:

```bash
sh open_flamingo/scripts/train_dpo.sh
```
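For context, `train_dpo.sh` optimizes a preference objective in the DPO family. Below is a minimal sketch of the standard DPO loss that SymDPO builds on, computed from per-sequence answer log-probabilities under the policy and a frozen reference model; it is a generic reference implementation, not the repository's training code. In SymDPO, the preferred response is the symbolized answer grounded in the demonstration images, while the rejected one merely follows the textual pattern.

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO objective: -log sigmoid(beta * (delta_policy - delta_ref)),
    where delta = log p(chosen) - log p(rejected), summed over answer tokens.
    All inputs are 1-D tensors of per-sequence log-probabilities.
    """
    policy_logratio = policy_chosen_logps - policy_rejected_logps
    ref_logratio = ref_chosen_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (policy_logratio - ref_logratio)).mean()
```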
We also provide code for SFT. Please follow these steps:
Modify the file paths in `open_flamingo/scripts/train_sft.sh`, then run the following command:

```bash
sh open_flamingo/scripts/train_sft.sh
```
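For reference, here is a minimal sketch of the standard SFT objective such a script typically optimizes: next-token cross-entropy on the answer tokens, with the in-context demonstration and prompt tokens masked out of the loss. The tensor layout is illustrative, not the repository's actual code.

```python
import torch.nn.functional as F

def sft_loss(logits, labels, ignore_index=-100):
    """Causal-LM SFT loss. `labels` should hold `ignore_index` at prompt /
    demonstration positions so that only answer tokens are supervised."""
    # Shift so that the logits at position t predict the token at position t+1.
    shift_logits = logits[:, :-1, :].contiguous()
    shift_labels = labels[:, 1:].contiguous()
    return F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
        ignore_index=ignore_index,
    )
```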
To evaluate the model, follow these steps:

- Download the files required for the following datasets:

  - COCO
  - Flickr-30K
  - VQAv2
  - OKVQA
  - TextVQA

- Modify the file paths in `open_flamingo/scripts/run_eval.sh`, then run the following command:

  ```bash
  sh open_flamingo/scripts/run_eval.sh
  ```
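For the VQA-style benchmarks (VQAv2, OKVQA, TextVQA), scoring is typically done with the standard VQA accuracy metric, a simplified form of which is sketched below; this is a generic reimplementation for reference, not the repository's evaluation code:

```python
def vqa_accuracy(prediction, ground_truth_answers):
    """Simplified VQA accuracy: an answer counts as fully correct if at least
    3 of the (typically 10) human annotators gave the same answer.
    Both the prediction and the references are assumed to be pre-normalized strings.
    """
    matches = sum(ans == prediction for ans in ground_truth_answers)
    return min(matches / 3.0, 1.0)
```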
Here is a comparison of results across multiple benchmarks between OpenFlamingo-3B-Instruct trained with SymDPO and OpenFlamingo-3B-Instruct trained without it. Detailed results and methodology can be found in the paper.
If you have any questions or suggestions, feel free to reach out:
- GitHub Issues: Create an issue
- Email: [email protected]
If you found this repository useful, please consider citing:
```bibtex
@misc{jia2024symdpoboostingincontextlearning,
      title={SymDPO: Boosting In-Context Learning of Large Multimodal Models with Symbol Demonstration Direct Preference Optimization},
      author={Hongrui Jia and Chaoya Jiang and Haiyang Xu and Wei Ye and Mengfan Dong and Ming Yan and Ji Zhang and Fei Huang and Shikun Zhang},
      year={2024},
      eprint={2411.11909},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2411.11909},
}

@article{awadalla2023openflamingo,
      title={OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models},
      author={Anas Awadalla and Irena Gao and Josh Gardner and Jack Hessel and Yusuf Hanafy and Wanrong Zhu and Kalyani Marathe and Yonatan Bitton and Samir Gadre and Shiori Sagawa and Jenia Jitsev and Simon Kornblith and Pang Wei Koh and Gabriel Ilharco and Mitchell Wortsman and Ludwig Schmidt},
      journal={arXiv preprint arXiv:2308.01390},
      year={2023}
}
```