
MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models

[📖 Project] [📄 Paper] [💻 Code] [📝 Dataset] [🤖 Evaluation Model] [🏆 Leaderboard]

License: MIT


🌟 Overview

We present MMIE, a Massive Multimodal Interleaved understanding Evaluation benchmark, designed for Large Vision-Language Models (LVLMs). MMIE provides a robust framework to assess the interleaved comprehension and generation capabilities of LVLMs across diverse domains, supported by reliable automated metrics.

📚 Setup

We host the MMIE dataset on HuggingFace, where you should request access on that page first; requests are approved automatically. Please download all the files in the repository and unzip images.tar.gz to obtain all the images. We also provide overview.json as an example of our dataset format.

📦 Model Evaluation

Setup

Dataset Preparation

The data you want to evaluate should follow this format:

[
    {
        "id": "",
        "question": [
            {
                "text": "...",
                "image": LOCAL_PATH_TO_THE_IMAGE or null
            },
            ...
        ],
        "answer": [
            {
                "text": "...",
                "image": LOCAL_PATH_TO_THE_IMAGE or null
            },
            ...
        ],
        "model": "gt",
        "gt_answer": [
            {
                "text": "...",
                "image": LOCAL_PATH_TO_THE_IMAGE or null
            },
            ...
        ]
    },
    ...
]

Currently, gt_answer is only used for Multi-step Reasoning tasks, but it is still required by the data format. For other tasks, you can set "gt_answer": [{"text": null, "image": null}].
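For reference, here is a minimal Python sketch of building a file in this format (the make_entry helper and the example texts and paths are hypothetical, not part of this repo):

import json

def make_entry(entry_id, question, answer, model_name, gt_answer=None):
    # Each content element is a dict with a "text" string and an
    # "image" local path (or None, which json.dump serializes as null).
    return {
        "id": entry_id,
        "question": question,
        "answer": answer,
        "model": model_name,
        # gt_answer is only consumed for Multi-step Reasoning tasks,
        # but the key must always be present.
        "gt_answer": gt_answer or [{"text": None, "image": None}],
    }

entries = [
    make_entry(
        "0",
        question=[{"text": "Describe the image.", "image": "images/0.png"}],
        answer=[{"text": "A cat sitting on a sofa.", "image": None}],
        model_name="my_model",
    )
]

with open("data.json", "w") as f:
    json.dump(entries, f, indent=2)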

Make sure the file structure is:

INPUT_DIR
    |INPUT_FILE(data.json)
    |images
        |0.png
        |1.png
        |...
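Before running the evaluator, a quick sanity check of this layout can save a failed run. A sketch, assuming only the layout above:

import json
import os

INPUT_DIR = "INPUT_DIR"    # replace with your actual input directory
INPUT_FILE = "data.json"

with open(os.path.join(INPUT_DIR, INPUT_FILE)) as f:
    data = json.load(f)

# os.path.join leaves absolute image paths untouched, so this check
# works whether paths are relative to INPUT_DIR or absolute.
missing = []
for entry in data:
    for field in ("question", "answer", "gt_answer"):
        for element in entry.get(field, []):
            image = element.get("image")
            if image and not os.path.exists(os.path.join(INPUT_DIR, image)):
                missing.append(image)

print(f"{len(data)} entries checked, {len(missing)} missing image files")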

Installation

  • Clone the code from this repo
git clone https://github.com/Lillianwei-h/MMIE
cd MMIE
  • Build the environment
conda create -n MMIE python=3.11
conda activate MMIE
pip install -r requirements.txt
pip install flash_attn

Model Preparation

You can request access to our MMIE-Score model on HuggingFace and refer to the InternVL 2.0 documentation for more details.
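If MMIE-Score follows the standard InternVL 2.0 loading convention (an assumption based on the pointer to the InternVL 2.0 docs; the MODEL_PATH below is a placeholder, so take the exact repo id from the model card), loading it would look roughly like:

import torch
from transformers import AutoModel, AutoTokenizer

# Placeholder repo id; replace with the identifier from the
# MMIE-Score model card once access is granted.
MODEL_PATH = "MMIE/MMIE-Score"

# InternVL 2.0 checkpoints ship custom modeling code, hence
# trust_remote_code=True.
model = AutoModel.from_pretrained(
    MODEL_PATH,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
).eval().cuda()

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)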

Run

python main.py --input_dir INPUT_DIR --input_file INPUT_FILE

The output file will be at ./eval_outputs/eval_result.json by default. You can also use the --output_dir and --output_file arguments to specify a different output location.
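The schema of eval_result.json is not documented here, so it is worth peeking at its structure before writing any post-processing; for instance:

import json

# Default output location per the command above.
with open("./eval_outputs/eval_result.json") as f:
    results = json.load(f)

print(f"{len(results)} result entries loaded")
if isinstance(results, list) and results:
    print("first entry keys:", list(results[0]))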

📝 Citation

If you find our benchmark useful in your research, please consider citing us:

@article{xia2024mmie,
  title={MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models},
  author={Xia, Peng and Han, Siwei and Qiu, Shi and Zhou, Yiyang and Wang, Zhaoyang and Zheng, Wenhao and Chen, Zhaorun and Cui, Chenhang and Ding, Mingyu and Li, Linjie and Wang, Lijuan and Yao, Huaxiu},
  journal={arXiv preprint arXiv:2410.10139},
  year={2024}
}
