Cetvel is an extension of the lm-eval-harness tool that adds tasks and datasets for benchmarking Turkish Large Language Models (LLMs). Our primary goal is to objectively evaluate the capabilities of LLMs in understanding and processing Turkish. The benchmark covers the following task types, curated to assess different aspects of model performance in Turkish:
- Extractive Question Answering
- Multiple Choice Question Answering
- Natural Language Inference
- Text Classification
- Machine Translation
- Summarization
- Grammatical Error Correction
Clone the repository using the following command to fetch the submodules:
```bash
git clone [email protected]:KUIS-AI/cetvel.git --recursive
```
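If the repository was cloned without the `--recursive` flag, the bundled lm-evaluation-harness submodule can still be fetched afterwards:

```bash
# Fetch submodules after a non-recursive clone
cd cetvel
git submodule update --init --recursive
```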
Create a virtual environment with any tool of your choice (e.g. conda, virtualenv) and install the core PyTorch dependencies.
```bash
conda create -n cetvel python=3.9
conda activate cetvel
pip install torch==2.3.1 torchvision==0.18.1 torchaudio==2.3.1 --index-url https://download.pytorch.org/whl/cu118
```
Note that we have only tested Cetvel with the specified PyTorch (==2.3.1) and CUDA (==11.8) versions.
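To quickly confirm that the installed build matches these versions, here is a small illustrative check (not part of the official setup):

```bash
# Should report PyTorch 2.3.1 (cu118 build), CUDA 11.8, and whether a GPU is visible
python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"
```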
Install the evaluation harness and other dependencies:
```bash
pip install toml
pip install -e ./lm-evaluation-harness
pip install -r requirements.txt
```
Cetvel uses the same command-line interface as lm-eval-harness. Here is an example command:
```bash
python -m lm_eval --model hf --include_path ./tasks/ \
    --model_args pretrained=openai-community/gpt2 \
    --tasks exams_tr,xquad_tr,tquad,turkish_plu \
    --device cuda:0 --batch_size 4 --write_out --log_samples --output_path outs
```
For more details on usage and other evaluation settings, refer to the lm-eval-harness repository. Check out the examples folder for more examples of running all the tasks with different models.
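If you are unsure which task identifiers are registered (including the Turkish tasks shipped under ./tasks/), the upstream harness can print them. A minimal sketch, assuming the standard lm-eval-harness CLI behavior:

```bash
# List all task names known to the harness, including the custom tasks in ./tasks/
python -m lm_eval --tasks list --include_path ./tasks/
```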
Task | Datasets | Metrics |
---|---|---|
Extractive Question Answering | xquad, tquad, MKQA-tr | Exact Match, F1 |
Multiple Choice Question Answering | EXAMS, Belebele, Turkish PLU, XCOPA | Accuracy |
Text Classification | IronyTR, TRClaim-19, news_cat, OffensEval-TR, STSb-TR, X-FACT | Accuracy |
Natural Language Inference | XNLI, SNLI-tr, MNLI-tr | Accuracy |
Machine Translation | wmt2016 | WER, BLEU |
Summarization | TurkishPLU, MLSum, XLSum, WikiLingua | ROUGE |
Grammatical Error Correction | gecturk | Exact Match |
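As an illustration, a single category from the table can be evaluated by passing only its task names. The sketch below reuses the extractive QA identifiers from the example command above (xquad_tr, tquad); identifiers for the other categories can be discovered with `--tasks list`, and the model and output path are placeholders:

```bash
# Evaluate only the extractive question answering tasks
# (model and output directory are illustrative placeholders)
python -m lm_eval --model hf --include_path ./tasks/ \
    --model_args pretrained=openai-community/gpt2 \
    --tasks xquad_tr,tquad \
    --device cuda:0 --batch_size 4 --output_path outs
```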
If you find Cetvel beneficial for your research, please cite it:
```bibtex
@misc{kuisai2024cetvel,
    title={Cetvel: A Unified Benchmark for Evaluating Turkish LLMs},
    author={Ilker Kesen and Mustafa Cemil Guney and Aykut Erdem and Gozde Gul Sahin},
    year={2024},
    url={https://github.com/KUIS-AI/cetvel}
}
```