Add new benchmark: Catalan bench #2154

Open · wants to merge 8 commits into base: main
Changes from 6 commits
1 change: 1 addition & 0 deletions lm_eval/tasks/README.md
@@ -121,3 +121,4 @@
| [xnli_eu](xnli_eu/README.md) | Cross-lingual Natural Language Inference tasks in Basque. | Basque |
| [xstorycloze](xstorycloze/README.md) | Cross-lingual narrative understanding tasks to predict story endings in multiple languages. | Russian, Simplified Chinese, Spanish, Arabic, Hindi, Indonesian, Telugu, Swahili, Basque, Burmese |
| [xwinograd](xwinograd/README.md) | Cross-lingual Winograd schema tasks for coreference resolution in multiple languages. | English, French, Japanese, Portuguese, Russian, Chinese |
| [catalan_bench](catalan_bench/README.md) | Collection of tasks in Catalan encompassing various evaluation areas. | Catalan |
121 changes: 121 additions & 0 deletions lm_eval/tasks/catalan_bench/README.md
@@ -0,0 +1,121 @@
# CatalanBench

### Paper

CatalanBench is a benchmark for evaluating language models on Catalan tasks. That is, it evaluates the ability of a language model to understand and generate Catalan text. CatalanBench offers a combination of pre-existing, open datasets and datasets developed exclusively for this benchmark. All the details of CatalanBench will be published in a paper soon.

The new evaluation datasets included in CatalanBench are:
| Task | Category | Homepage |
|:-------------:|:-----:|:-----:|
| ARC_ca | Question Answering | https://huggingface.co/datasets/projecte-aina/arc_ca |
| MGSM_ca | Math | https://huggingface.co/datasets/projecte-aina/mgsm_ca |
| OpenBookQA_ca | Question Answering | https://huggingface.co/datasets/projecte-aina/openbookqa_ca |
| Parafraseja | Paraphrasing | https://huggingface.co/datasets/projecte-aina/Parafraseja |
| PIQA_ca | Question Answering | https://huggingface.co/datasets/projecte-aina/piqa_ca |
| SIQA_ca | Question Answering | https://huggingface.co/datasets/projecte-aina/siqa_ca |
| XStoryCloze_ca | Commonsense Reasoning | https://huggingface.co/datasets/projecte-aina/xstorycloze_ca |

The datasets included in CatalanBench that have been made public in previous publications are:

| Task | Category | Paper title | Homepage |
|:-------------:|:-----:|:-------------:|:-----:|
| Belebele_ca | Reading Comprehension | [The Belebele Benchmark: a Parallel Reading Comprehension Dataset in 122 Language Variants](https://arxiv.org/abs/2308.16884) | https://huggingface.co/datasets/facebook/belebele |
| caBREU | Summarization | [Building a Data Infrastructure for a Mid-Resource Language: The Case of Catalan](https://aclanthology.org/2024.lrec-main.231/) | https://huggingface.co/datasets/projecte-aina/caBreu |
| CatalanQA | Question Answering | [Building a Data Infrastructure for a Mid-Resource Language: The Case of Catalan](https://aclanthology.org/2024.lrec-main.231/) | https://huggingface.co/datasets/projecte-aina/catalanqa |
| CatCoLA | Linguistic Acceptability | CatCoLA: Catalan Corpus of Linguistic Acceptability | https://huggingface.co/datasets/nbel/CatCoLA |
| COPA-ca | Commonsense Reasoning | [Building a Data Infrastructure for a Mid-Resource Language: The Case of Catalan](https://aclanthology.org/2024.lrec-main.231/) | https://huggingface.co/datasets/projecte-aina/COPA-ca |
| CoQCat | Question Answering | [Building a Data Infrastructure for a Mid-Resource Language: The Case of Catalan](https://aclanthology.org/2024.lrec-main.231/) | https://huggingface.co/datasets/projecte-aina/CoQCat |
| FLORES_ca | Translation | [The FLORES-101 Evaluation Benchmark for Low-Resource and Multilingual Machine Translation](https://arxiv.org/abs/2106.03193) | https://huggingface.co/datasets/facebook/flores |
| PAWS-ca | Paraphrasing | [Building a Data Infrastructure for a Mid-Resource Language: The Case of Catalan](https://aclanthology.org/2024.lrec-main.231/) | https://huggingface.co/datasets/projecte-aina/PAWS-ca |
| TE-ca | Natural Language Inference | [Building a Data Infrastructure for a Mid-Resource Language: The Case of Catalan](https://aclanthology.org/2024.lrec-main.231/) | https://huggingface.co/datasets/projecte-aina/teca |
| VeritasQA_ca | Truthfulness | VeritasQA: A Truthfulness Benchmark Aimed at Multilingual Transferability | TBA |
| WNLI-ca | Natural Language Inference | [Building a Data Infrastructure for a Mid-Resource Language: The Case of Catalan](https://aclanthology.org/2024.lrec-main.231/) | https://huggingface.co/datasets/projecte-aina/wnli-ca |
| XNLI-ca | Natural Language Inference | [Building a Data Infrastructure for a Mid-Resource Language: The Case of Catalan](https://aclanthology.org/2024.lrec-main.231/) | https://huggingface.co/datasets/projecte-aina/xnli-ca |
| XQuAD-ca | Question Answering | [Building a Data Infrastructure for a Mid-Resource Language: The Case of Catalan](https://aclanthology.org/2024.lrec-main.231/) | https://huggingface.co/datasets/projecte-aina/xquad-ca |


### Citation
Paper for CatalanBench coming soon.

<!--```bibtex
@inproceedings{baucells-2024-iberobench,
title = "IberoBench: A Benchmark for LLM Evaluation in Iberian Languages",
author = "Baucells, Irene and
AUTHORS, ADD",
booktitle = "Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing",
year = "2024",
publisher = "Association for Computational Linguistics",
}
```
-->

### Groups and Tasks

#### Groups

- `catalan_bench`: All tasks included in CatalanBench.
- `flores_ca`: All FLORES translation tasks from or to Catalan.

#### Tags
- `cabreu`: Three caBREU tasks, one for each type of summary (extractive, abstractive, and extreme).
- `phrases_va`: Two Phrases_va tasks for language adaptation between Catalan and Valencian.

#### Tasks

The following tasks evaluate models on the CatalanBench datasets using various scoring methods.
- `arc_ca_challenge`
- `arc_ca_easy`
- `belebele_cat_Latn`
- `cabreu`
- `catalanqa`
- `catcola`
- `copa_ca`
- `coqcat`
- `flores_ca`
- `flores_ca-de`
- `flores_ca-en`
- `flores_ca-es`
- `flores_ca-eu`
- `flores_ca-fr`
- `flores_ca-gl`
- `flores_ca-it`
- `flores_ca-pt`
- `flores_de-ca`
- `flores_en-ca`
- `flores_es-ca`
- `flores_eu-ca`
- `flores_fr-ca`
- `flores_gl-ca`
- `flores_it-ca`
- `flores_pt-ca`
- `mgsm_direct_ca`
- `openbookqa_ca`
- `parafraseja`
- `paws_ca`
- `phrases_ca`
- `piqa_ca`
- `siqa_ca`
- `teca`
- `veritasqa_gen_ca`
- `veritasqa_mc1_ca`
- `veritasqa_mc2_ca`
- `wnli_ca`
- `xnli_ca`
- `xquad_ca`
- `xstorycloze_ca`

Some of these tasks are taken from benchmarks already available in LM Evaluation Harness. These are:
- `belebele_cat_Latn`: Belebele Catalan
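
Once merged, the whole suite or any subgroup should be runnable through the standard harness CLI. A hypothetical invocation (the model name is a placeholder, not part of this PR):

```shell
# Evaluate a Hugging Face model on the full Catalan suite (placeholder
# model name; any harness-supported backend works the same way).
lm_eval --model hf \
    --model_args pretrained=your-org/your-model \
    --tasks catalan_bench \
    --batch_size 8

# Or restrict the run to the FLORES translation subgroup:
lm_eval --model hf \
    --model_args pretrained=your-org/your-model \
    --tasks flores_ca
```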


### Checklist

* [x] Is the task an existing benchmark in the literature?
* [ ] Have you referenced the original paper that introduced the task?
* [ ] If yes, does the original paper provide a reference implementation?
* [ ] Yes, original implementation contributed by author of the benchmark

If other tasks on this dataset are already supported:
* [ ] Is the "Main" variant of this task clearly denoted?
* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
20 changes: 20 additions & 0 deletions lm_eval/tasks/catalan_bench/_arc_ca_common_yaml
@@ -0,0 +1,20 @@
tag: arc_ca
dataset_path: projecte-aina/arc_ca
output_type: multiple_choice
training_split: null
validation_split: validation
test_split: test
doc_to_text: "Pregunta: {{question}}\nResposta:"
doc_to_target: "{{choices.label.index(answerKey)}}"
doc_to_choice: "{{choices.text}}"
should_decontaminate: true
doc_to_decontamination_query: "Pregunta: {{question}}\nResposta:"
metric_list:
- metric: acc
aggregation: mean
higher_is_better: true
- metric: acc_norm
aggregation: mean
higher_is_better: true
metadata:
version: 1.0
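
For reference, the `doc_to_target` template above maps the gold `answerKey` onto the index of the matching entry in `choices.label`, and `doc_to_choice` supplies the candidate continuations. A minimal Python sketch of that lookup, on a made-up ARC-style document (field values are illustrative, only the field shape matches the dataset):

```python
# Illustrative ARC_ca-style document: parallel "text"/"label" lists
# plus a gold "answerKey" (values invented for demonstration).
doc = {
    "question": "Quina és la capital de Catalunya?",
    "choices": {
        "text": ["Girona", "Barcelona", "Tarragona", "Lleida"],
        "label": ["A", "B", "C", "D"],
    },
    "answerKey": "B",
}

# doc_to_text: "Pregunta: {{question}}\nResposta:"
prompt = f"Pregunta: {doc['question']}\nResposta:"

# doc_to_target: "{{choices.label.index(answerKey)}}" -> gold choice index
target_index = doc["choices"]["label"].index(doc["answerKey"])

# doc_to_choice: "{{choices.text}}" -> candidate continuations to score
choices = doc["choices"]["text"]

print(target_index, choices[target_index])  # → 1 Barcelona
```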
17 changes: 17 additions & 0 deletions lm_eval/tasks/catalan_bench/_cabreu_common_yaml
@@ -0,0 +1,17 @@
tag: cabreu
dataset_path: projecte-aina/caBreu
dataset_name: null
output_type: generate_until
test_split: test
training_split: train
validation_split: validation
process_docs: !function utils.process_doc_cabreu
metric_list:
- metric: bleu
aggregation: bleu
higher_is_better: true
- metric: !function utils.rouge1
aggregation: !function utils.rouge1_agg
higher_is_better: true
metadata:
version: 1.0
3 changes: 3 additions & 0 deletions lm_eval/tasks/catalan_bench/arc_ca_challenge.yaml
@@ -0,0 +1,3 @@
task: arc_ca_challenge
dataset_name: ARC-Challenge
include: _arc_ca_common_yaml
3 changes: 3 additions & 0 deletions lm_eval/tasks/catalan_bench/arc_ca_easy.yaml
@@ -0,0 +1,3 @@
task: arc_ca_easy
dataset_name: ARC-Easy
include: _arc_ca_common_yaml
8 changes: 8 additions & 0 deletions lm_eval/tasks/catalan_bench/cabreu_abstractive.yaml
@@ -0,0 +1,8 @@
include: _cabreu_common_yaml
task: cabreu_abstractive
description: "Examina el text següent i genera'n un resum abstractiu, expressant el significat del text original d'una manera més natural i concisa.\n"
doc_to_text: >-
Text: {{content}}

Resum:
doc_to_target: '{{summaries["abstractive"]["a1"]}}'
8 changes: 8 additions & 0 deletions lm_eval/tasks/catalan_bench/cabreu_extractive.yaml
@@ -0,0 +1,8 @@
include: _cabreu_common_yaml
task: cabreu_extractive
description: "Examina el text següent i genera'n un resum extractiu, utilitzant les frases o oracions més rellevants del text original.\n"
doc_to_text: >-
Text: {{content}}

Resum:
doc_to_target: '{{summaries["extractive"]["a1"]}}'
8 changes: 8 additions & 0 deletions lm_eval/tasks/catalan_bench/cabreu_extreme.yaml
@@ -0,0 +1,8 @@
include: _cabreu_common_yaml
task: cabreu_extreme
description: "Examina el text següent i genera'n un resum que sigui el més concís possible i que preservi el significat del text original.\n"
doc_to_text: >-
Text: {{content}}
Resum:
doc_to_target: '{{summaries["extreme"]["a1"]}}'
25 changes: 25 additions & 0 deletions lm_eval/tasks/catalan_bench/catalan_bench.yaml
@@ -0,0 +1,25 @@
group: catalan_bench
task:
- belebele_cat_Latn
- xnli_ca
- catcola
- copa_ca
- openbookqa_ca
- parafraseja
- paws_ca
- piqa_ca
- siqa_ca
- teca
- wnli_ca
- arc_ca_easy
- arc_ca_challenge
- xstorycloze_ca
- xquad_ca
- catalanqa
- coqcat
- flores_ca
- cabreu
- mgsm_direct_ca
- phrases_va
metadata:
version: 1.0
25 changes: 25 additions & 0 deletions lm_eval/tasks/catalan_bench/catalanqa.yaml
@@ -0,0 +1,25 @@
task: catalanqa
dataset_path: projecte-aina/catalanqa
dataset_name: null
output_type: generate_until
training_split: train
validation_split: validation
test_split: test
doc_to_text: "Context: {{context}}\n\nPregunta: {{question}}\n\nResposta:"
doc_to_target: '{{answers[0]["text"]}}'
target_delimiter: ' '
process_results: !function utils.process_results_qa
generation_kwargs:
until:
- "\n"
do_sample: false
temperature: 0.0
metric_list:
- metric: exact_match
aggregation: mean
higher_is_better: true
- metric: f1
aggregation: mean
higher_is_better: true
metadata:
version: 1.0
14 changes: 14 additions & 0 deletions lm_eval/tasks/catalan_bench/catcola.yaml
@@ -0,0 +1,14 @@
task: catcola
dataset_path: nbel/CatCoLA
output_type: multiple_choice
training_split: train
validation_split: validation
test_split: null
doc_to_text: "{{Sentence}}\nPregunta: Té sentit aquesta frase?\nResposta:"
doc_to_target: label
doc_to_choice: ["no", "sí"]
metric_list:
- metric: mcc
- metric: acc
metadata:
version: 1.0
17 changes: 17 additions & 0 deletions lm_eval/tasks/catalan_bench/copa_ca.yaml
@@ -0,0 +1,17 @@
task: copa_ca
dataset_path: projecte-aina/COPA-ca
dataset_name: null
output_type: multiple_choice
training_split: train
validation_split: validation
test_split: test
process_docs: !function utils.process_docs_copa_ca
doc_to_text: '{{premise[:-1].strip() + " " + {"cause": "perquè", "effect": "i per tant"}[question]}}'
doc_to_target: '{{choice1 if label == 0 else choice2}}'
doc_to_choice: '{{[choice1, choice2]}}'
metric_list:
- metric: acc
aggregation: mean
higher_is_better: true
metadata:
version: 1.0
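
The `doc_to_text` expression above builds a COPA-style cloze prompt: it drops the premise's final punctuation and appends a Catalan connective, "perquè" ("because") for cause questions or "i per tant" ("and therefore") for effect questions. A Python sketch of the same logic, on an invented example document (values are illustrative):

```python
# Illustrative COPA-ca document (values invented for demonstration;
# only the field shape matches the dataset).
doc = {
    "premise": "L'home va perdre el tren.",
    "question": "cause",
    "choice1": "Va arribar tard a l'estació.",
    "choice2": "Va comprar un bitllet.",
    "label": 0,
}

# doc_to_text: strip the trailing period, then append the connective
# that matches the question type.
connective = {"cause": "perquè", "effect": "i per tant"}[doc["question"]]
prompt = doc["premise"][:-1].strip() + " " + connective

# doc_to_choice / doc_to_target: the two alternatives, with the gold
# continuation selected by the integer label.
choices = [doc["choice1"], doc["choice2"]]
target = choices[doc["label"]]

print(prompt)  # → L'home va perdre el tren perquè
```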
23 changes: 23 additions & 0 deletions lm_eval/tasks/catalan_bench/coqcat.yaml
@@ -0,0 +1,23 @@
task: coqcat
dataset_path: projecte-aina/CoQCat
output_type: generate_until
training_split: train
validation_split: validation
test_split: test
doc_to_text: '{{story+"\n\n"}}{% for i in range(questions|length-1) %}{{"Q: "+questions[i]+"\n\n"+"A: "+answers["input_text"][i]+"\n\n"}}{% endfor %}{{"Q: "+questions[-1]+"\n\n"+"A:"}}'
doc_to_target: '{{ answers["input_text"][questions|length - 1] }}'
process_results: !function utils.process_results_coqcat
should_decontaminate: true
doc_to_decontamination_query: "{{story}} {{questions|join('\n')}}"
generation_kwargs:
until:
- "\nQ:"
metric_list:
- metric: "em"
aggregation: mean
higher_is_better: true
- metric: "f1"
aggregation: mean
higher_is_better: true
metadata:
version: 1.0
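
The `doc_to_text` template above replays the conversation up to the last turn: the story, then every earlier question/answer pair, then the final question followed by a bare "A:" for the model to complete; `doc_to_target` is the last gold answer. A Python sketch of that construction, on an invented two-turn document (values are illustrative):

```python
# Illustrative CoQCat-style document: a story plus parallel lists of
# questions and gold answers (values invented for demonstration).
doc = {
    "story": "La Maria viu a Girona. Té un gos que es diu Blat.",
    "questions": ["On viu la Maria?", "Com es diu el seu gos?"],
    "answers": {"input_text": ["A Girona", "Blat"]},
}

# Replay all turns except the last, then pose the final question.
parts = [doc["story"] + "\n\n"]
for i in range(len(doc["questions"]) - 1):
    parts.append("Q: " + doc["questions"][i] + "\n\n"
                 + "A: " + doc["answers"]["input_text"][i] + "\n\n")
parts.append("Q: " + doc["questions"][-1] + "\n\n" + "A:")
prompt = "".join(parts)

# doc_to_target: the gold answer to the final question.
target = doc["answers"]["input_text"][len(doc["questions"]) - 1]
```

Generation stops at the next `"\nQ:"`, so each model answer is cut off before it starts hallucinating further turns.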
25 changes: 25 additions & 0 deletions lm_eval/tasks/catalan_bench/flores_ca/_flores_common_yaml
@@ -0,0 +1,25 @@
dataset_path: facebook/flores
dataset_name: all
output_type: generate_until
training_split: dev
validation_split: dev
test_split: devtest
fewshot_split: dev
target_delimiter: ''
generation_kwargs:
until:
- "\n"
metric_list:
- metric: bleu
aggregation: bleu
higher_is_better: true
- metric: ter
aggregation: ter
higher_is_better: false
- metric: chrf
aggregation: chrf
higher_is_better: true
metadata:
version: 1.0
dataset_kwargs:
trust_remote_code: true
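
Each directional task (e.g. `flores_ca-en`) presumably includes this common file and adds only the pair-specific fields. A hypothetical sketch of what such a file would contain (the prompt wording and field names are illustrative, not copied from this PR; FLORES sentence columns follow the `sentence_<lang>_<script>` naming):

```yaml
# Hypothetical flores_ca-en.yaml: reuse the shared config and set the
# pair-specific task name, prompt, and reference (illustrative only).
include: _flores_common_yaml
task: flores_ca-en
doc_to_text: "Catalan: {{sentence_cat_Latn}}\nEnglish:"
doc_to_target: " {{sentence_eng_Latn}}"
```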