Added TurkishMMLU to LM Evaluation Harness (#2283)
* Added TurkishMMLU to LM Evaluation Harness

* Fixed COT name

* Fixed COT name

* Updated Readme

* Fixed Test issues

* Completed  Scan for changed tasks

* Updated Readme

* Update README.md

* fixup task naming casing + ensure yaml template stubs aren't registered

---------

Co-authored-by: Hailey Schoelkopf <[email protected]>
Co-authored-by: haileyschoelkopf <[email protected]>
3 people authored Sep 26, 2024
1 parent 558d0d7 commit deb4328
Showing 22 changed files with 776 additions and 0 deletions.
1 change: 1 addition & 0 deletions lm_eval/tasks/README.md
@@ -107,6 +107,7 @@
| [translation](translation/README.md) | Tasks focused on evaluating the language translation capabilities of models. | Arabic, English, Spanish, Basque, Hindi, Indonesian, Burmese, Russian, Swahili, Telugu, Chinese |
| [triviaqa](triviaqa/README.md) | A large-scale dataset for trivia question answering to test general knowledge. | English |
| [truthfulqa](truthfulqa/README.md) | A QA task aimed at evaluating the truthfulness and factual accuracy of model responses. | English |
| [turkishmmlu](turkishmmlu/README.md) | A multiple-choice QA test modeled after MMLU, written in Turkish based on Turkish high-school level exams. | Turkish |
| [unitxt](unitxt/README.md) | A number of tasks implemented using the unitxt library for flexible, shareable, and reusable data preparation and evaluation for generative AI. | English |
| [unscramble](unscramble/README.md) | Tasks involving the rearrangement of scrambled sentences to test syntactic understanding. | English |
| [webqs](webqs/README.md) | Web-based question answering tasks designed to evaluate internet search and retrieval. | English |
94 changes: 94 additions & 0 deletions lm_eval/tasks/turkishmmlu/README.md
@@ -0,0 +1,94 @@
# TurkishMMLU

This directory contains the LM Evaluation Harness configuration files for the few-shot and chain-of-thought TurkishMMLU experiments. The results of the study were obtained by running LM Evaluation Harness with these configurations.

TurkishMMLU is a multiple-choice question-answering dataset created for the Turkish NLP community, based on the Turkish high-school curricula across nine subjects. The questions were written by curriculum experts, are suitable for the high-school curricula in Turkey, and cover subjects ranging from natural sciences and mathematics to more culturally grounded topics such as Turkish Literature and the history of the Turkish Republic, providing a comprehensive question-answering benchmark for the Turkish language.

To access the dataset, please send an email to:
[email protected] or [email protected].

## Abstract

Multiple choice question answering tasks evaluate the reasoning, comprehension, and mathematical abilities of Large Language Models (LLMs). While existing benchmarks employ automatic translation for multilingual evaluation, this approach is error-prone and potentially introduces culturally biased questions, especially in social sciences. We introduce the first multitask, multiple-choice Turkish QA benchmark, TurkishMMLU, to evaluate LLMs' understanding of the Turkish language. TurkishMMLU includes over 10,000 questions, covering 9 different subjects from Turkish high-school education curricula. These questions are written by curriculum experts, suitable for the high-school curricula in Turkey, covering subjects ranging from natural sciences and math questions to more culturally representative topics such as Turkish Literature and the history of the Turkish Republic. We evaluate over 20 LLMs, including multilingual open-source (e.g., Gemma, Llama, MT5), closed-source (GPT 4o, Claude, Gemini), and Turkish-adapted (e.g., Trendyol) models. We provide an extensive evaluation, including zero-shot and few-shot evaluation of LLMs, chain-of-thought reasoning, and question difficulty analysis along with model performance. We provide an in-depth analysis of the Turkish capabilities and limitations of current LLMs to provide insights for future LLMs for the Turkish language. We publicly release our code for the dataset and evaluation.

## Dataset

The dataset is divided into four categories (Natural Sciences, Mathematics, Language, and Social Sciences and Humanities) spanning a total of nine subjects from the Turkish high-school curriculum. It is provided in multiple-choice format for LLM evaluation. Each question also carries a difficulty indicator, referred to as the correctness ratio.

## Evaluation

The 5-shot evaluation results from the paper cover state-of-the-art open- and closed-source LLMs with a range of architectures, including multilingual and Turkish-adapted models. All results were obtained by running LM Evaluation Harness with the configurations provided here.

| Model | Source | Average | Natural Sciences | Math | Turkish L & L | Social Sciences and Humanities |
| ------------------- | ------ | ------- | ---------------- | ---- | ------------- | ------------------------------ |
| GPT 4o | Closed | 83.1 | 75.3 | 59.0 | 82.0 | 95.3 |
| Claude-3 Opus | Closed | 79.1 | 71.7 | 59.0 | 77.0 | 90.3 |
| GPT 4-turbo | Closed | 75.7 | 70.3 | 57.0 | 67.0 | 86.5 |
| Llama-3 70B-IT | Closed | 67.3 | 56.7 | 42.0 | 57.0 | 84.3 |
| Claude-3 Sonnet | Closed | 67.3 | 67.3 | 44.0 | 58.0 | 75.5 |
| Llama-3 70B | Open | 66.1 | 56.0 | 37.0 | 57.0 | 83.3 |
| Claude-3 Haiku | Closed | 65.4 | 57.0 | 40.0 | 61.0 | 79.3 |
| Gemini 1.0-pro | Closed | 63.2 | 52.7 | 29.0 | 63.0 | 79.8 |
| C4AI Command-r+ | Open | 60.6 | 50.0 | 26.0 | 57.0 | 78.0 |
| Aya-23 35B | Open | 55.6 | 43.3 | 31.0 | 49.0 | 72.5 |
| C4AI Command-r | Open | 54.9 | 44.7 | 29.0 | 49.0 | 70.5 |
| Mixtral 8x22B | Open | 54.8 | 45.3 | 27.0 | 49.0 | 70.3 |
| GPT 3.5-turbo | Closed | 51.0 | 42.7 | 39.0 | 35.0 | 61.8 |
| Llama-3 8B-IT | Open | 46.4 | 36.7 | 29.0 | 39.0 | 60.0 |
| Llama-3 8B | Open | 46.2 | 37.3 | 30.0 | 33.0 | 60.3 |
| Mixtral 8x7B-IT | Open | 45.2 | 41.3 | 28.0 | 39.0 | 54.0 |
| Aya-23 8B | Open | 45.0 | 39.0 | 23.0 | 31.0 | 58.5 |
| Gemma 7B | Open | 43.6 | 34.3 | 22.0 | 47.0 | 55.0 |
| Aya-101 | Open | 40.7 | 31.3 | 24.0 | 38.0 | 55.0 |
| Trendyol-LLM 7B-C-D | Open | 34.1 | 30.3 | 22.0 | 28.0 | 41.5 |
| mT0-xxl | Open | 33.9 | 29.3 | 28.0 | 21.0 | 42.0 |
| Mistral 7B-IT | Open | 32.0 | 34.3 | 26.0 | 38.0 | 30.3 |
| Llama-2 7B | Open | 22.3 | 25.3 | 20.0 | 20.0 | 19.8 |
| mT5-xxl | Open | 18.1 | 19.3 | 24.0 | 14.0 | 16.8 |

## Citation

```
@misc{yüksel2024turkishmmlumeasuringmassivemultitask,
title={TurkishMMLU: Measuring Massive Multitask Language Understanding in Turkish},
author={Arda Yüksel and Abdullatif Köksal and Lütfi Kerem Şenel and Anna Korhonen and Hinrich Schütze},
year={2024},
eprint={2407.12402},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2407.12402},
}
```

### Groups and Tasks

#### Groups

- `turkishmmlu`: all nine subjects of TurkishMMLU, namely Biology, Chemistry, Physics, Geography, Philosophy, History, Religion and Ethics, Turkish Language and Literature, and Mathematics

#### Tasks

The following tasks evaluate individual subjects of the TurkishMMLU dataset:

- `turkishmmlu_{subject}`

The following tasks evaluate the same subjects with chain-of-thought (CoT) prompting:

- `turkishmmlu_cot_{subject}`
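The naming convention can be sketched in Python (an illustration only; `task_name` is a hypothetical helper, not part of the harness — subjects are lowercased and spaces become underscores, matching the task names in the YAML configs):

```python
# Illustration of the task-naming convention used by this benchmark.
SUBJECTS = [
    "Biology", "Chemistry", "Physics", "Geography", "Philosophy",
    "History", "Religion and Ethics", "Turkish Language and Literature",
    "Mathematics",
]

def task_name(subject: str, cot: bool = False) -> str:
    """Return the harness task name for a subject, e.g. 'turkishmmlu_biology'."""
    slug = subject.lower().replace(" ", "_")
    prefix = "turkishmmlu_cot_" if cot else "turkishmmlu_"
    return prefix + slug

print(task_name("Religion and Ethics"))  # turkishmmlu_religion_and_ethics
print(task_name("Biology", cot=True))    # turkishmmlu_cot_biology
```

Individual task names like these, or the `turkishmmlu` group as a whole, can then be requested when running the harness.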

### Checklist

For adding novel benchmarks/datasets to the library:

- [x] Is the task an existing benchmark in the literature?
- [x] Have you referenced the original paper that introduced the task?
- [x] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?

If other tasks on this dataset are already supported:

- [ ] Is the "Main" variant of this task clearly denoted?
- [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
- [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
3 changes: 3 additions & 0 deletions lm_eval/tasks/turkishmmlu/config/Biology.yaml
@@ -0,0 +1,3 @@
include: _turkishmmlu_default_yaml
task: turkishmmlu_biology
dataset_name: Biology
3 changes: 3 additions & 0 deletions lm_eval/tasks/turkishmmlu/config/Chemistry.yaml
@@ -0,0 +1,3 @@
include: _turkishmmlu_default_yaml
task: turkishmmlu_chemistry
dataset_name: Chemistry
3 changes: 3 additions & 0 deletions lm_eval/tasks/turkishmmlu/config/Geography.yaml
@@ -0,0 +1,3 @@
include: _turkishmmlu_default_yaml
task: turkishmmlu_geography
dataset_name: Geography
3 changes: 3 additions & 0 deletions lm_eval/tasks/turkishmmlu/config/History.yaml
@@ -0,0 +1,3 @@
include: _turkishmmlu_default_yaml
task: turkishmmlu_history
dataset_name: History
3 changes: 3 additions & 0 deletions lm_eval/tasks/turkishmmlu/config/Mathematics.yaml
@@ -0,0 +1,3 @@
include: _turkishmmlu_default_yaml
task: turkishmmlu_mathematics
dataset_name: Mathematics
3 changes: 3 additions & 0 deletions lm_eval/tasks/turkishmmlu/config/Philosophy.yaml
@@ -0,0 +1,3 @@
include: _turkishmmlu_default_yaml
task: turkishmmlu_philosophy
dataset_name: Philosophy
3 changes: 3 additions & 0 deletions lm_eval/tasks/turkishmmlu/config/Physics.yaml
@@ -0,0 +1,3 @@
include: _turkishmmlu_default_yaml
task: turkishmmlu_physics
dataset_name: Physics
3 changes: 3 additions & 0 deletions lm_eval/tasks/turkishmmlu/config/Religion_and_Ethics.yaml
@@ -0,0 +1,3 @@
include: _turkishmmlu_default_yaml
task: turkishmmlu_religion_and_ethics
dataset_name: Religion_and_Ethics
3 changes: 3 additions & 0 deletions lm_eval/tasks/turkishmmlu/config/Turkish_Language_and_Literature.yaml
@@ -0,0 +1,3 @@
include: _turkishmmlu_default_yaml
task: turkishmmlu_turkish_language_and_literature
dataset_name: Turkish_Language_and_Literature
21 changes: 21 additions & 0 deletions lm_eval/tasks/turkishmmlu/config/_turkishmmlu_default_yaml
@@ -0,0 +1,21 @@
tag: turkishmmlu
task: null
dataset_path: AYueksel/TurkishMMLU
dataset_name: TurkishMMLU
test_split: test
fewshot_split: dev
fewshot_config:
sampler: first_n
output_type: multiple_choice
doc_to_text: "Soru: {{ question.strip() }}\nA. {{ choices[0] }}\nB. {{ choices[1] }}\nC. {{ choices[2] }}\nD. {{ choices[3] }}\nE. {{ choices[4] }}\nCevap:"
doc_to_choice: ["A", "B", "C", "D", "E"]
doc_to_target: "{{['A', 'B', 'C', 'D', 'E'].index(answer)}}"
metric_list:
- metric: acc
aggregation: mean
higher_is_better: true
- metric: acc_norm
aggregation: mean
higher_is_better: true
metadata:
version: 0.0
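To make the template semantics concrete, here is a minimal sketch of what `doc_to_text` and `doc_to_target` produce for one document. The harness renders these fields with Jinja2; plain Python string formatting is used here to mimic that, and the sample document is invented for illustration, not taken from the dataset:

```python
# Sketch of how the config's doc_to_text / doc_to_target fields render a
# document. The sample document below is hypothetical.
doc = {
    "question": "Güneş'e en yakın gezegen hangisidir?",
    "choices": ["Mars", "Venüs", "Merkür", "Dünya", "Jüpiter"],
    "answer": "C",
}

# doc_to_text: the question plus lettered choices, ending with "Cevap:"
# ("Answer:"), after which the model is scored on each choice letter.
prompt = (
    f"Soru: {doc['question'].strip()}\n"
    + "\n".join(f"{letter}. {choice}"
                for letter, choice in zip("ABCDE", doc["choices"]))
    + "\nCevap:"
)

# doc_to_target: the index of the gold letter within doc_to_choice.
target = ["A", "B", "C", "D", "E"].index(doc["answer"])

print(prompt)
print(target)  # 2, i.e. the index of answer "C"
```

Both `acc` (raw choice accuracy) and length-normalized `acc_norm` are then averaged over the test split.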
54 changes: 54 additions & 0 deletions lm_eval/tasks/turkishmmlu/config_cot/Biology.yaml
@@ -0,0 +1,54 @@
include: _turkishmmlu_cot_default_yaml
task: turkishmmlu_cot_biology
description:
"Soru: AaKkDdEeX$^{R}$X$^{r}$ genotipli bireyde AKD alelleri bağlı olup diğerleri bağımsızdır.\
\ Bu birey bu karakterler açısından kaç çeşit gamet oluşturabilir? (Krossing over gerçekleşmemiştir.)\nA) 2 \nB)\
\ 4 \nC) 8 \nD) 16 \nE) 32 \nÇözüm: Adım adım düşünelim.\
\ Bir bireyde A, K ve D genleri bağımlı olduğunda, bu üç gen birlikte hareket eder ve aynı gamet içinde bulunurlar.\
\ Diğer lokuslardaki alleller bağımsızdır.\
\ AKD lokusundaki allel kombinasyonları: AKD veya akd (2 seçenek)\
\ Diğer lokuslardaki allel kombinasyonları: Ee (2 seçenek), X$^{R}$X$^{r}$ (2 seçenek).\
\ Şimdi, bağımlı olan AKD lokusundaki kombinasyonu diğer bağımsız lokuslardaki kombinasyonlarla çarpacağız:\
\ 2 x 2 x 2 = 8\
\ Bu birey 8 farklı gamet oluşturabilir.\
\ Doğru cevap C şıkkıdır.\n\nSoru: Aşağıda verilen hormon çiftlerinden hangisi antagonist (zıt) çalışır?\nA) Oksitosin\
\ - Prolaktin\nB) Kalsitonin - Parathormon\nC) Adrenalin\
\ - Noradrenalin\nD) Östrojen - Progesteron\nE) FSH - LH\nÇözüm: Adım adım düşünelim.\
\ Bu soruyu cevaplayabilmek için hormonların görevlerini ve birbirleri ile olan ilişkilerini bilmek gerekir.\
\ A) Oksitosin ve Prolaktin: Oksitosin doğum sırasında uterus kasılmalarını uyarır ve süt salgılanmasını sağlar. Prolaktin ise süt üretimini uyarır. Bu iki hormon birbirini destekleyici görev yapar, zıt değildir.\
\ B) Kalsitonin ve Parathormon: Kalsitonin kanda kalsiyum seviyesini düşürür, parathormon ise kalsiyum seviyesini yükseltir. Bu iki hormon birbirine zıt etki gösterir, antagonisttir.\
\ C) Adrenalin ve Noradrenalin: Her ikisi de stres hormonudur ve benzer görevleri vardır. Zıt etki göstermezler.\
\ D) Östrojen ve Progesteron: Östrojen kadınlık hormonudur, progesteron ise gebelik sırasında üretilir. Birlikte çalışırlar, zıt etki göstermezler.\
\ E) FSH ve LH: FSH folikül gelişimini, LH ovulasyonu uyarır. Birlikte çalışırlar, zıt etki göstermezler.\
\ Dolayısıyla verilen seçenekler arasında antagonist (zıt) çalışan hormon çifti Kalsitonin ve Parathormon'dur.\
\ Doğru cevap B şıkkıdır.\n\nSoru: I. Besin azalması II. Avcıların artması III. Zehirli madde birikimin artması\
\ Yukarıdaki faktörlerden hangileri çevre direncini artırır?\nA) Yalnız I\nB) Yalnız II\nC)\
\ Yalnız III\nD) II ve III\nE) I, II ve III\nÇözüm: Adım adım düşünelim.\
\ Çevre direnci, bir ekosistemin dışarıdan gelen olumsuz etkilere karşı direncini ifade eder. Yüksek çevre direnci, ekosistemin bu olumsuz etkilere daha iyi direnebileceği anlamına gelir.\
\ I. Besin azalması, popülasyonların büyümesini ve sağlığını olumsuz etkiler, dolayısıyla çevre direncini artırır.\
\ II. Avcıların artması, popülasyonların dengesini bozar ve türlerin sayısını azaltır, bu da çevre direncini artırır.\
\ III. Zehirli madde birikiminin artması, canlıların sağlığını ve üremesini olumsuz etkiler, ekosistemin dengesini bozar, bu şekilde çevre direncini artırır.\
\ Sorudaki faktörlerin hepsi olumsuz faktörlerdir ve ekosistemin direncini zorlar. Doğru cevap E şıkkıdır.\n\nSoru:\
\ Gen klonlama çalışmaları sırasında; a. Vektör DNA ve istenen geni taşıyan DNA'nın kesilmesi, b. İstenen geni taşıyan DNA'nın,\
\ vektör DNA ile birleştirilmesi, c. Bakterinin çoğalmasıyla birlikte istenen genin kopyalanması, uygulamaları yapılmaktadır.\
\ Bu uygulamalarda; I. DNA polimeraz II. DNA ligaz III. Restriksiyon enzimi yapılarının kullanıldığı çalışma basamakları\
\ hangi seçenekte doğru olarak verilmiştir?\
\ I II III \nA) a, b b\
\ a, c\nB) b a, b c\nC)\
\ a c a, c\nD) c b, c a\nE)\
\ b, c a a, b\nÇözüm: Adım Adım düşünelim.\
\ I. DNA polimeraz: c (Bakterinin çoğalması ile birlikte istenen genin kopyalanması)\
\ II. DNA ligaz: b, c (İstenen geni taşıyan DNA'nın, vektör DNA ile birleştirilmesi ve sonrasında bakterinin çoğalması ile birlikte kopyalanması)\
\ III. Restriksiyon enzimi: a (Vektör DNA ve istenen geni taşıyan DNA'nın kesilmesi)\
\ Doğru cevap D şıkkıdır.\n\nSoru: İnsanlardaki lizozomlar, fagositoz yapmayan hücrelerde de aktif olabilir. Hücreler metabolik faaliyetlerinin sorunsuz geçebilmesi için bazı hücresel yapılarını yıkıp yeniden yapar.\
\ Hücresel yapıların kontrollü ve programlı şekilde yıkılması lizozomlar tarafından yapılır ve otofaji olarak bilinir.\
\ Otofaji ile ilgili ifadelerden; I. Otofaji sonucu hücresel yapılar yıkılamadığında lizozomal depolama hastalıkları ortaya çıkar\
\ II. Otofaji sırasında hidrolitik enzimler hücre dışında etkinlik gösterir\
\ III. Otofaji olayında hidrolitik enzimler lizozomlarda üretilip sitoplazmaya gönderilir hangileri doğrudur?\nA) Yalnız\
\ I\nB) I ve II\nC) I ve III\nD) II ve III\nE) I, II ve III\nÇözüm: Adım adım düşünelim.\
\ I. Otofaji sonucu hücresel yapılar yıkılamadığında lizozomal depolama hastalıkları ortaya çıkar: Doğru\
\ II. Otofaji sırasında hidrolitik enzimler hücre dışında etkinlik gösterir: Yanlış, hidrolitik enzimler lizozomlarda etkinlik gösterir.\
\ III. Otofaji olayında hidrolitik enzimler lizozomlarda üretilip sitoplazmaya gönderilir: Yanlış, hidrolitik enzimler lizozomlarda üretilir ve lizozom içinde etkinlik gösterir.\
\ Doğru cevap A şıkkıdır."
num_fewshot: 0
dataset_name: Biology