2025.1.8
The official Arena is now open. You’re welcome to submit your model results!2024.12.18
The code for automated evaluation is available! Please refer to MLLM Evaluation.2024.12.17
Paper is available on Arxiv. Dataset is available on Hugging Face.
IDEA-Bench (Intelligent Design Evaluation and Assessment Benchmark) is a comprehensive and pioneering benchmark designed to advance the capabilities of image generation models toward professional-grade applications. It addresses the gap between current generative models and the demanding requirements of professional image design through robust evaluation across diverse tasks.
IDEA-Bench encompasses 100 professional image generation tasks and 275 specific cases, systematically categorized into five major types:
- Text to Image (T2I): Generate single images from text prompts.
- Image to Image (I2I): Transform or edit input images based on textual guidance.
- Multi-image to Image (Is2I): Create a single output image from multiple input images.
- Text to Multi-image (T2Is): Generate multiple images from a single text prompt.
- (Multi-)image to Multi-image (I(s)2Is): Create multiple output images from one or more input images.
- Binary Scoring Items: Incorporates 1,650 binary scoring items to ensure precise, objective evaluations of generated images.
- MLLM-Assisted Assessment: Includes a representative subset of 18 tasks with enhanced criteria for automated assessments, where MLLM are leveraged to transform evaluations into image understanding tasks, surpassing traditional metrics like FID and CLIPScore in capturing aesthetic quality and contextual relevance.
License: The images and datasets included in this repository are subject to the terms outlined in the LICENSE file. Please refer to the file for details on usage restrictions.
Set up the environment for running the evaluation scripts.
conda create -n ideabench python=3.10
conda activate ideabench
cd IDEA-Bench
pip install -r requirements.txt
Download the dataset and place it in the dataset/
folder under the root of your project.
Run your model to generate the results for all tasks. Save the output in the outputs/
folder, which should mirror the structure of the dataset.
Your directory structure should look like this:
IDEA-Bench/
├── assets/
├── dataset/
│ └── IDEA-Bench/
│ ├── 3d_effect_generation_single_reference_0001/
│ ├── 3d_effect_generation_single_reference_0002/
│ ├── 3d_effect_generation_three-view_reference_0001/
│ └── ...
├── eval_results/
├── outputs/
│ └── model_results/
│ ├── 3d_effect_generation_single_reference_0001/
│ │ └── 0001.jpg
│ ├── 3d_effect_generation_single_reference_0002/
│ │ └── 0001.jpg
│ ├── 3d_effect_generation_three-view_reference_0001/
│ └── ...
├── scripts/
├── requirements.txt
└── ...
Each case folder in the outputs/
folder should have images named 0001.jpg
, 0002.jpg
, etc., corresponding to the model's output for each task.
Use the script scripts/stitch_image.py
to stitch the generated images.
python scripts/stitch_image.py {path_to_dataset} {path_to_generation_result}
This will create a new folder under the model’s output directory with the suffix _stitched
, which will contain stitched images ready for evaluation and a summary.csv
file containing paths to the images and relevant evaluation questions.
important: Before running the script, make sure to configure your API key for the Gemini API. At the beginning of the scripts/gemini_eval.py
file, you will find the following line:
genai.configure(api_key="YOUR_API_KEY")
Replace "YOUR_API_KEY" with your actual Gemini API key.
Use the script scripts/gemini_eval.py
to run the evaluation using Gemini 1.5 pro. The first argument should be the path to the summary.csv
generated in the previous step. You can also use the optional argument --resume
to continue the evaluation from the last checkpoint.
python scripts/gemini_eval.py {path_to_summary}
The evaluation results will be saved in the eval_results/
folder as a CSV file.
Use the script scripts/cal_scores.py
to calculate the final evaluation scores. The input for this step will be the CSV file generated from the Gemini evaluation.
python scripts/cal_scores.py {path_to_evaluation_result}
Learderboard based on Arena is coming soon.
If you find our work helpful for your research, please consider citing our work.
@misc{liang2024ideabenchfargenerativemodels,
title={IDEA-Bench: How Far are Generative Models from Professional Designing?},
author={Chen Liang and Lianghua Huang and Jingwu Fang and Huanzhang Dou and Wei Wang and Zhi-Fan Wu and Yupeng Shi and Junge Zhang and Xin Zhao and Yu Liu},
year={2024},
eprint={2412.11767},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2412.11767},
}