We provide evaluation code for some common pure-text and multimodal benchmarks, covering both Chinese and English. You can also modify the code to add further evaluation benchmarks as needed; a rough sketch of what such an addition might look like is shown below.
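As a hypothetical sketch of what adding a benchmark could involve, the snippet below implements a generic multiple-choice evaluation loop. The JSON layout, field names, and the `generate_answer` callable are assumptions made for the example and are not the actual interface of `evaluate/run.py`.

```python
# Hypothetical sketch of a new benchmark evaluator; the real evaluate/run.py
# interface may differ. All field names below are assumptions.
import json
from pathlib import Path


def evaluate_multiple_choice(data_path: str, generate_answer) -> float:
    """Score a multiple-choice benchmark stored as a JSON list of records
    like {"question": ..., "choices": [...], "answer": "A"} (assumed format).

    `generate_answer` is any callable mapping a prompt string to an option
    letter, e.g. a thin wrapper around your model's generation call.
    """
    records = json.loads(Path(data_path).read_text(encoding="utf-8"))
    correct = 0
    for rec in records:
        options = "\n".join(
            f"{chr(ord('A') + i)}. {choice}" for i, choice in enumerate(rec["choices"])
        )
        prompt = f"{rec['question']}\n{options}\nAnswer with a single letter."
        prediction = generate_answer(prompt).strip().upper()[:1]
        correct += prediction == rec["answer"]
    return correct / max(len(records), 1)


if __name__ == "__main__":
    # Tiny self-contained demo with a dummy "model" that always answers "B".
    demo = [{"question": "2 + 2 = ?", "choices": ["3", "4", "5", "6"], "answer": "B"}]
    Path("demo_benchmark.json").write_text(json.dumps(demo), encoding="utf-8")
    print(evaluate_multiple_choice("demo_benchmark.json", lambda prompt: "B"))
```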
We currently support the following benchmarks.
- Pure Text
  - English
    - MMLU
      ```bash
      unzip mmlu.zip
      python evaluate/run.py --dataset_name mmlu --data_path ./evaluate/mmlu/mmlu/
      ```
    - BBH
      ```bash
      unzip BBH.zip
      python evaluate/run.py --dataset_name bbh --data_path ./evaluate/bbh/BBH/
      ```
  - Chinese
    - CMMLU
      ```bash
      unzip cmmlu.zip
      python evaluate/run.py --dataset_name cmmlu --data_path ./evaluate/cmmlu/cmmlu/
      ```
    - C-Eval
      ```bash
      unzip ceval.zip
      python evaluate/run.py --dataset_name ceval --data_path ./evaluate/ceval/ceval/formal_ceval/
      ```
      You need to submit your C-Eval results (i.e., result.json) to the online evaluation website; see the first sketch after this list for a quick pre-upload check.
- Multimodal
  - SEED-Bench 2
    You can download the dataset from here.
    ```bash
    python evaluate/run.py --dataset_name seed_bench --data_path ./evaluate/seed_bench2/seed_bench2/
    ```
  - MME
    You can download the dataset from here after sending an email to the data owner.
    ```bash
    python evaluate/run.py --dataset_name mme --data_path ./evaluate/mme/mme/
    ```
    You can retrieve the final score with calculation.py in ./evaluate/mme/mme/eval_tool; the second sketch after this list illustrates how that score is aggregated.
  - MM-Vet
    You can download the dataset from here.
    ```bash
    python evaluate/run.py --dataset_name mme --data_path ./evaluate/mme/mmvet/
    ```
    You need to submit your MM-Vet results (i.e., result.json) to the online evaluation website; the same pre-upload check as for C-Eval applies.
  - MMB
  - MMMU
  - CMMMU
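For the benchmarks that require uploading result.json (C-Eval and MM-Vet above), a quick local check can catch an empty or truncated file before submission. The sketch below is schema-agnostic: it only loads the JSON and reports its top-level structure; the exact format each leaderboard expects is defined by the respective evaluation website, not here.

```python
# Quick pre-upload sanity check for a result.json file (schema-agnostic).
import json
import sys
from pathlib import Path


def summarize_results(path: str = "result.json") -> None:
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    if isinstance(data, dict):
        print(f"{path}: dict with {len(data)} top-level keys")
        for key, value in list(data.items())[:5]:  # preview the first few entries
            size = len(value) if hasattr(value, "__len__") else value
            print(f"  {key}: {type(value).__name__} ({size})")
    elif isinstance(data, list):
        print(f"{path}: list with {len(data)} entries")
    else:
        print(f"{path}: unexpected top-level type {type(data).__name__}")


if __name__ == "__main__":
    summarize_results(sys.argv[1] if len(sys.argv) > 1 else "result.json")
```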
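The bundled calculation.py remains the authoritative way to obtain the MME score. The sketch below only illustrates the acc / acc+ aggregation that MME-style scoring is based on (question-level accuracy plus the fraction of images with both of their questions answered correctly); the input layout is made up for the example.

```python
# Illustration of MME-style scoring (acc + acc+); use the provided
# calculation.py for official numbers. Input layout here is assumed:
# each record is (image_id, is_correct) and every image has two questions.
from collections import defaultdict
from typing import Iterable, Tuple


def mme_task_score(results: Iterable[Tuple[str, bool]]) -> float:
    """Return acc * 100 + acc+ * 100 for one MME sub-task."""
    per_image = defaultdict(list)
    for image_id, is_correct in results:
        per_image[image_id].append(is_correct)

    total_questions = sum(len(answers) for answers in per_image.values())
    correct_questions = sum(sum(answers) for answers in per_image.values())
    acc = correct_questions / total_questions  # question-level accuracy
    acc_plus = sum(all(answers) for answers in per_image.values()) / len(per_image)
    return 100 * acc + 100 * acc_plus


if __name__ == "__main__":
    demo = [("img_0", True), ("img_0", True), ("img_1", True), ("img_1", False)]
    print(mme_task_score(demo))  # acc = 0.75, acc+ = 0.5 -> 125.0
```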