We provide evaluation code for some common pure-text and multimodal benchmarks, covering both Chinese and English. You can also modify the code to add further evaluation benchmarks as needed; a rough sketch of what such an addition might look like is shown below.
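As a hypothetical sketch of what adding a benchmark could involve, the snippet below implements a generic multiple-choice evaluation loop. The JSON layout, field names, and the `generate_answer` callable are assumptions made for the example and are not the actual interface of `evaluate/run.py`.

```python
# Hypothetical sketch of a new benchmark evaluator; the real evaluate/run.py
# interface may differ. All field names below are assumptions.
import json
from pathlib import Path


def evaluate_multiple_choice(data_path: str, generate_answer) -> float:
    """Score a multiple-choice benchmark stored as a JSON list of records
    like {"question": ..., "choices": [...], "answer": "A"} (assumed format).

    `generate_answer` is any callable mapping a prompt string to an option
    letter, e.g. a thin wrapper around your model's generation call.
    """
    records = json.loads(Path(data_path).read_text(encoding="utf-8"))
    correct = 0
    for rec in records:
        options = "\n".join(
            f"{chr(ord('A') + i)}. {choice}" for i, choice in enumerate(rec["choices"])
        )
        prompt = f"{rec['question']}\n{options}\nAnswer with a single letter."
        prediction = generate_answer(prompt).strip().upper()[:1]
        correct += prediction == rec["answer"]
    return correct / max(len(records), 1)


if __name__ == "__main__":
    # Tiny self-contained demo with a dummy "model" that always answers "B".
    demo = [{"question": "2 + 2 = ?", "choices": ["3", "4", "5", "6"], "answer": "B"}]
    Path("demo_benchmark.json").write_text(json.dumps(demo), encoding="utf-8")
    print(evaluate_multiple_choice("demo_benchmark.json", lambda prompt: "B"))
```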
We currently support the following benchmarks.
- Pure Text
  - English
    - MMLU
      ```bash
      unzip mmlu.zip
      python evaluate/run.py --dataset_name mmlu --data_path ./evaluate/mmlu/mmlu/
      ```
    - BBH
      ```bash
      unzip BBH.zip
      python evaluate/run.py --dataset_name bbh --data_path ./evaluate/bbh/BBH/
      ```
  - Chinese
    - CMMLU
      ```bash
      unzip cmmlu.zip
      python evaluate/run.py --dataset_name cmmlu --data_path ./evaluate/cmmlu/cmmlu/
      ```
    - C-Eval
      ```bash
      unzip ceval.zip
      python evaluate/run.py --dataset_name ceval --data_path ./evaluate/ceval/ceval/formal_ceval/
      ```
      You need to submit your C-Eval results (i.e., result.json) to the online evaluation website; see the first sketch after this list for a quick pre-upload check.
- Multimodal
  - SEED-Bench 2
    You can download the dataset from here.
    ```bash
    python evaluate/run.py --dataset_name seed_bench --data_path ./evaluate/seed_bench2/seed_bench2/
    ```
  - MME
    You can download the dataset from here after sending an email to the data owner.
    ```bash
    python evaluate/run.py --dataset_name mme --data_path ./evaluate/mme/mme/
    ```
    You can retrieve the final score with calculation.py in ./evaluate/mme/mme/eval_tool; the second sketch after this list illustrates how that score is aggregated.
  - MM-Vet
    You can download the dataset from here.
    ```bash
    python evaluate/run.py --dataset_name mme --data_path ./evaluate/mme/mmvet/
    ```
    You need to submit your MM-Vet results (i.e., result.json) to the online evaluation website; the same pre-upload check as for C-Eval applies.
  - MMB
  - MMMU
  - CMMMU
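For the benchmarks that require uploading result.json (C-Eval and MM-Vet above), a quick local check can catch an empty or truncated file before submission. The sketch below is schema-agnostic: it only loads the JSON and reports its top-level structure; the exact format each leaderboard expects is defined by the respective evaluation website, not here.

```python
# Quick pre-upload sanity check for a result.json file (schema-agnostic).
import json
import sys
from pathlib import Path


def summarize_results(path: str = "result.json") -> None:
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    if isinstance(data, dict):
        print(f"{path}: dict with {len(data)} top-level keys")
        for key, value in list(data.items())[:5]:  # preview the first few entries
            size = len(value) if hasattr(value, "__len__") else value
            print(f"  {key}: {type(value).__name__} ({size})")
    elif isinstance(data, list):
        print(f"{path}: list with {len(data)} entries")
    else:
        print(f"{path}: unexpected top-level type {type(data).__name__}")


if __name__ == "__main__":
    summarize_results(sys.argv[1] if len(sys.argv) > 1 else "result.json")
```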
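The bundled calculation.py remains the authoritative way to obtain the MME score. The sketch below only illustrates the acc / acc+ aggregation that MME-style scoring is based on (question-level accuracy plus the fraction of images with both of their questions answered correctly); the input layout is made up for the example.

```python
# Illustration of MME-style scoring (acc + acc+); use the provided
# calculation.py for official numbers. Input layout here is assumed:
# each record is (image_id, is_correct) and every image has two questions.
from collections import defaultdict
from typing import Iterable, Tuple


def mme_task_score(results: Iterable[Tuple[str, bool]]) -> float:
    """Return acc * 100 + acc+ * 100 for one MME sub-task."""
    per_image = defaultdict(list)
    for image_id, is_correct in results:
        per_image[image_id].append(is_correct)

    total_questions = sum(len(answers) for answers in per_image.values())
    correct_questions = sum(sum(answers) for answers in per_image.values())
    acc = correct_questions / total_questions  # question-level accuracy
    acc_plus = sum(all(answers) for answers in per_image.values()) / len(per_image)
    return 100 * acc + 100 * acc_plus


if __name__ == "__main__":
    demo = [("img_0", True), ("img_0", True), ("img_1", True), ("img_1", False)]
    print(mme_task_score(demo))  # acc = 0.75, acc+ = 0.5 -> 125.0
```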