
MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models

[📖 Project] [📄 Paper] [💻 Code] [📝 Dataset] [🤖 Evaluation Model] [🏆 Leaderboard]

License: MIT


🌟 Overview

We present MMIE, a Massive Multimodal Interleaved understanding Evaluation benchmark, designed for Large Vision-Language Models (LVLMs). MMIE provides a robust framework to assess the interleaved comprehension and generation capabilities of LVLMs across diverse domains, supported by reliable automated metrics.

📚 Setup

We host the MMIE dataset on HuggingFace, where you should request access on that page first; requests are approved automatically. Please download all the files in the repository and unzip images.tar.gz to obtain all the images. We also provide overview.json as an example of our dataset format.

📦 Model Evaluation

Setup

Dataset Preparation

The data you want to evaluate should follow this format:

[
    {
        "id": "",
        "question": [
            {
                "text": "...",
                "image": LOCAL_PATH_TO_THE_IMAGE or null
            },
            ...
        ],
        "answer": [
            {
                "text": "...",
                "image": LOCAL_PATH_TO_THE_IMAGE or null
            },
            ...
        ],
        "model": "gt",
        "gt_answer": [
            {
                "text": "...",
                "image": LOCAL_PATH_TO_THE_IMAGE or null
            },
            ...
        ]
    },
    ...
]

Currently, gt_answer is only used for Multi-step Reasoning tasks, but it is still required by the data format. For other tasks, you can set "gt_answer": [{"text": null, "image": null}].
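For reference, here is a minimal Python sketch of building a file in this format (the make_entry helper and the example texts and paths are hypothetical, not part of this repo):

import json

def make_entry(entry_id, question, answer, model_name, gt_answer=None):
    # Each content element is a dict with a "text" string and an
    # "image" local path (or None, which json.dump serializes as null).
    return {
        "id": entry_id,
        "question": question,
        "answer": answer,
        "model": model_name,
        # gt_answer is only consumed for Multi-step Reasoning tasks,
        # but the key must always be present.
        "gt_answer": gt_answer or [{"text": None, "image": None}],
    }

entries = [
    make_entry(
        "0",
        question=[{"text": "Describe the image.", "image": "images/0.png"}],
        answer=[{"text": "A cat sitting on a sofa.", "image": None}],
        model_name="my_model",
    )
]

with open("data.json", "w") as f:
    json.dump(entries, f, indent=2)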

Make sure the file structure is:

INPUT_DIR
    |INPUT_FILE(data.json)
    |images
        |0.png
        |1.png
        |...
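Before running the evaluator, a quick sanity check of this layout can save a failed run. A sketch, assuming only the layout above:

import json
import os

INPUT_DIR = "INPUT_DIR"    # replace with your actual input directory
INPUT_FILE = "data.json"

with open(os.path.join(INPUT_DIR, INPUT_FILE)) as f:
    data = json.load(f)

# os.path.join leaves absolute image paths untouched, so this check
# works whether paths are relative to INPUT_DIR or absolute.
missing = []
for entry in data:
    for field in ("question", "answer", "gt_answer"):
        for element in entry.get(field, []):
            image = element.get("image")
            if image and not os.path.exists(os.path.join(INPUT_DIR, image)):
                missing.append(image)

print(f"{len(data)} entries checked, {len(missing)} missing image files")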

Installation

  • Clone the code from this repo
git clone https://github.com/Lillianwei-h/MMIE
cd MMIE
  • Build the environment
conda create -n MMIE python=3.11
conda activate MMIE
pip install -r requirements.txt
pip install flash_attn

Model Preparation

You can request access to our MMIE-Score model on HuggingFace and refer to the InternVL 2.0 documentation for more details.
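If MMIE-Score follows the standard InternVL 2.0 loading convention (an assumption based on the pointer to the InternVL 2.0 docs; the MODEL_PATH below is a placeholder, so take the exact repo id from the model card), loading it would look roughly like:

import torch
from transformers import AutoModel, AutoTokenizer

# Placeholder repo id; replace with the identifier from the
# MMIE-Score model card once access is granted.
MODEL_PATH = "MMIE/MMIE-Score"

# InternVL 2.0 checkpoints ship custom modeling code, hence
# trust_remote_code=True.
model = AutoModel.from_pretrained(
    MODEL_PATH,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
).eval().cuda()

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)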

Run

python main.py --input_dir INPUT_DIR --input_file INPUT_FILE

The output file will be at ./eval_outputs/eval_result.json by default. You can also use the --output_dir and --output_file arguments to specify a different output location.
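The schema of eval_result.json is not documented here, so it is worth peeking at its structure before writing any post-processing; for instance:

import json

# Default output location per the command above.
with open("./eval_outputs/eval_result.json") as f:
    results = json.load(f)

print(f"{len(results)} result entries loaded")
if isinstance(results, list) and results:
    print("first entry keys:", list(results[0]))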

📝 Citation

If you find our benchmark useful in your research, please consider citing us:

@article{xia2024mmie,
  title={MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models},
  author={Xia, Peng and Han, Siwei and Qiu, Shi and Zhou, Yiyang and Wang, Zhaoyang and Zheng, Wenhao and Chen, Zhaorun and Cui, Chenhang and Ding, Mingyu and Li, Linjie and Wang, Lijuan and Yao, Huaxiu},
  journal={arXiv preprint arXiv:2410.10139},
  year={2024}
}
