# Evaluation

We provide evaluation code for several common text-only and multimodal benchmarks, covering both Chinese and English. You can also modify the code to add further evaluation benchmarks as needed; a minimal sketch of that pattern is given below.
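Adding a benchmark typically comes down to wiring a dataset loader and a scoring routine into the dispatch used by `evaluate/run.py`. The sketch below only illustrates that loader/evaluator pattern under assumed names (`load_my_bench`, `evaluate_my_bench`, `my_bench`); it does not mirror the actual contents of `evaluate/run.py`.

```python
# Minimal sketch (hypothetical names): how a new benchmark could be wired
# into a run.py-style dispatcher. Illustrative only, not the repo's script.
import argparse
import json
import os


def load_my_bench(data_path):
    """Load a hypothetical benchmark stored as one JSON list of
    {"question": ..., "choices": [...], "answer": "A"} records."""
    with open(os.path.join(data_path, "test.json"), encoding="utf-8") as f:
        return json.load(f)


def evaluate_my_bench(samples, answer_fn):
    """Score a multiple-choice benchmark; answer_fn maps a sample to a letter."""
    correct = sum(1 for s in samples if answer_fn(s) == s["answer"])
    return correct / max(len(samples), 1)


# Registry mapping --dataset_name values to (loader, evaluator) pairs.
DATASETS = {
    "my_bench": (load_my_bench, evaluate_my_bench),
}


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--dataset_name", required=True)
    parser.add_argument("--data_path", required=True)
    args = parser.parse_args()

    loader, evaluator = DATASETS[args.dataset_name]
    samples = loader(args.data_path)

    # Placeholder answering policy; replace with real model inference.
    accuracy = evaluator(samples, lambda s: "A")
    print(f"{args.dataset_name}: accuracy = {accuracy:.4f}")


if __name__ == "__main__":
    main()
```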

## Data Preparation

We currently support the following benchmarks.

- Text-only

  - English

    - MMLU

      ```bash
      unzip mmlu.zip
      python evaluate/run.py --dataset_name mmlu --data_path ./evaluate/mmlu/mmlu/
      ```

    - BBH

      ```bash
      unzip BBH.zip
      python evaluate/run.py --dataset_name bbh --data_path ./evaluate/bbh/BBH/
      ```

  - Chinese

    - CMMLU

      ```bash
      unzip cmmlu.zip
      python evaluate/run.py --dataset_name cmmlu --data_path ./evaluate/cmmlu/cmmlu/
      ```

    - C-Eval

      ```bash
      unzip ceval.zip
      python evaluate/run.py --dataset_name ceval --data_path ./evaluate/ceval/ceval/formal_ceval/
      ```

      You need to submit your C-Eval results (i.e., `result.json`) to the online evaluation website; a sketch of assembling such a submission file follows this list.
- Multimodal

  - SEED-Bench-2

    You can download the dataset from here.

    ```bash
    python evaluate/run.py --dataset_name seed_bench --data_path ./evaluate/seed_bench2/seed_bench2/
    ```

  - MME

    You can download the dataset from here after sending an email to the data owner.

    ```bash
    python evaluate/run.py --dataset_name mme --data_path ./evaluate/mme/mme/
    ```

    You can retrieve the final score with `calculation.py` in `./evaluate/mme/mme/eval_tool`; a sketch of the underlying acc/acc+ metric follows this list.

  - MM-Vet

    You can download the dataset from here.

    ```bash
    python evaluate/run.py --dataset_name mme --data_path ./evaluate/mme/mmvet/
    ```

    You need to submit your MM-Vet results (i.e., `result.json`) to the online evaluation website; see the submission-file sketch after this list.

  - MMB
  - MMMU
  - CMMMU
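
Most of the benchmarks above are multiple-choice, and C-Eval and MM-Vet additionally require uploading a `result.json` to their online evaluation sites. The sketch below shows one way to compute accuracy locally and dump per-question predictions to JSON; the record fields and output layout are assumptions for illustration, not the exact submission formats those websites require.

```python
# Minimal sketch (assumed record fields): compute multiple-choice accuracy
# and write per-question predictions to result.json for later submission.
import json


def score_and_dump(predictions, output_path="result.json"):
    """predictions: list of dicts like
    {"id": "q1", "prediction": "A", "answer": "B"}.
    The "answer" field may be absent for hidden-test benchmarks such as
    C-Eval or MM-Vet; in that case only the JSON file is produced."""
    labeled = [p for p in predictions if p.get("answer") is not None]
    if labeled:
        acc = sum(p["prediction"] == p["answer"] for p in labeled) / len(labeled)
        print(f"accuracy on {len(labeled)} labeled questions: {acc:.4f}")

    # Dump {question_id: predicted_answer}; adjust to the format the
    # target leaderboard actually expects before submitting.
    submission = {p["id"]: p["prediction"] for p in predictions}
    with open(output_path, "w", encoding="utf-8") as f:
        json.dump(submission, f, ensure_ascii=False, indent=2)


if __name__ == "__main__":
    demo = [
        {"id": "q1", "prediction": "A", "answer": "A"},
        {"id": "q2", "prediction": "C", "answer": "B"},
        {"id": "q3", "prediction": "D"},  # unlabeled, submission-only
    ]
    score_and_dump(demo)
```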
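
For MME, each image comes with two yes/no questions, and the official evaluation tool reports a question-level accuracy ("acc") plus a stricter image-level accuracy that requires both questions about an image to be correct ("acc+"); the per-subtask score is their sum scaled to a 0-200 range. The sketch below reimplements that idea on an assumed data layout; it is not the official `eval_tool/calculation.py`.

```python
# Minimal sketch of MME-style scoring (assumed data layout, not the official
# eval_tool/calculation.py): acc counts questions, acc+ counts images where
# both paired yes/no questions are answered correctly.
from collections import defaultdict


def mme_subtask_score(records):
    """records: list of dicts like
    {"image": "0001.jpg", "prediction": "yes", "answer": "no"}."""
    per_image = defaultdict(list)
    for r in records:
        per_image[r["image"]].append(r["prediction"].lower() == r["answer"].lower())

    flat = [ok for oks in per_image.values() for ok in oks]
    acc = sum(flat) / len(flat)                                              # question-level
    acc_plus = sum(all(oks) for oks in per_image.values()) / len(per_image)  # image-level

    return 100 * acc + 100 * acc_plus  # subtask score in [0, 200]


if __name__ == "__main__":
    demo = [
        {"image": "0001.jpg", "prediction": "yes", "answer": "yes"},
        {"image": "0001.jpg", "prediction": "no", "answer": "no"},
        {"image": "0002.jpg", "prediction": "yes", "answer": "no"},
        {"image": "0002.jpg", "prediction": "no", "answer": "no"},
    ]
    print(f"subtask score: {mme_subtask_score(demo):.1f}")  # 75.0 acc + 50.0 acc+ = 125.0
```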