Add new benchmark: Catalan bench #2154

Open · wants to merge 8 commits into base: main
Changes from 6 commits
1 change: 1 addition & 0 deletions lm_eval/tasks/README.md
@@ -121,3 +121,4 @@
| [xnli_eu](xnli_eu/README.md) | Cross-lingual Natural Language Inference tasks in Basque. | Basque |
| [xstorycloze](xstorycloze/README.md) | Cross-lingual narrative understanding tasks to predict story endings in multiple languages. | Russian, Simplified Chinese, Spanish, Arabic, Hindi, Indonesian, Telugu, Swahili, Basque, Burmese |
| [xwinograd](xwinograd/README.md) | Cross-lingual Winograd schema tasks for coreference resolution in multiple languages. | English, French, Japanese, Portuguese, Russian, Chinese |
| [catalan_bench](catalan_bench/README.md) | Collection of tasks in Catalan encompassing various evaluation areas. | Catalan |
121 changes: 121 additions & 0 deletions lm_eval/tasks/catalan_bench/README.md
@@ -0,0 +1,121 @@
# CatalanBench

### Paper

CatalanBench is a benchmark for evaluating language models on Catalan tasks. That is, it evaluates the ability of a language model to understand and generate Catalan text. CatalanBench offers a combination of pre-existing, open datasets and datasets developed exclusively for this benchmark. All the details of CatalanBench will be published in a paper soon.

The new evaluation datasets included in CatalanBench are:
| Task | Category | Homepage |
|:-------------:|:-----:|:-----:|
| ARC_ca | Question Answering | https://huggingface.co/datasets/projecte-aina/arc_ca |
| MGSM_ca | Math | https://huggingface.co/datasets/projecte-aina/mgsm_ca |
| OpenBookQA_ca | Question Answering | https://huggingface.co/datasets/projecte-aina/openbookqa_ca |
| Parafraseja | Paraphrasing | https://huggingface.co/datasets/projecte-aina/Parafraseja |
| PIQA_ca | Question Answering | https://huggingface.co/datasets/projecte-aina/piqa_ca |
| SIQA_ca | Question Answering | https://huggingface.co/datasets/projecte-aina/siqa_ca |
| XStoryCloze_ca | Commonsense Reasoning | https://huggingface.co/datasets/projecte-aina/xstorycloze_ca |

The datasets included in CatalanBench that have been made public in previous publications are:

| Task | Category | Paper title | Homepage |
|:-------------:|:-----:|:-------------:|:-----:|
| Belebele_ca | Reading Comprehension | [The Belebele Benchmark: a Parallel Reading Comprehension Dataset in 122 Language Variants](https://arxiv.org/abs/2308.16884) | https://huggingface.co/datasets/facebook/belebele |
| caBREU | Summarization | [Building a Data Infrastructure for a Mid-Resource Language: The Case of Catalan](https://aclanthology.org/2024.lrec-main.231/) | https://huggingface.co/datasets/projecte-aina/caBreu |
| CatalanQA | Question Answering | [Building a Data Infrastructure for a Mid-Resource Language: The Case of Catalan](https://aclanthology.org/2024.lrec-main.231/) | https://huggingface.co/datasets/projecte-aina/catalanqa |
| CatCoLA | Linguistic Acceptability | CatCoLA: Catalan Corpus of Linguistic Acceptability | https://huggingface.co/datasets/nbel/CatCoLA |
| COPA-ca | Commonsense Reasoning | [Building a Data Infrastructure for a Mid-Resource Language: The Case of Catalan](https://aclanthology.org/2024.lrec-main.231/) | https://huggingface.co/datasets/projecte-aina/COPA-ca |
| CoQCat | Question Answering | [Building a Data Infrastructure for a Mid-Resource Language: The Case of Catalan](https://aclanthology.org/2024.lrec-main.231/) | https://huggingface.co/datasets/projecte-aina/CoQCat |
| FLORES_ca | Translation | [The FLORES-101 Evaluation Benchmark for Low-Resource and Multilingual Machine Translation](https://arxiv.org/abs/2106.03193) | https://huggingface.co/datasets/facebook/flores |
| PAWS-ca | Paraphrasing | [Building a Data Infrastructure for a Mid-Resource Language: The Case of Catalan](https://aclanthology.org/2024.lrec-main.231/) | https://huggingface.co/datasets/projecte-aina/PAWS-ca |
| TE-ca | Natural Language Inference | [Building a Data Infrastructure for a Mid-Resource Language: The Case of Catalan](https://aclanthology.org/2024.lrec-main.231/) | https://huggingface.co/datasets/projecte-aina/teca |
| VeritasQA_ca | Truthfulness | VeritasQA: A Truthfulness Benchmark Aimed at Multilingual Transferability | TBA |
| WNLI-ca | Natural Language Inference | [Building a Data Infrastructure for a Mid-Resource Language: The Case of Catalan](https://aclanthology.org/2024.lrec-main.231/) | https://huggingface.co/datasets/projecte-aina/wnli-ca |
| XNLI-ca | Natural Language Inference | [Building a Data Infrastructure for a Mid-Resource Language: The Case of Catalan](https://aclanthology.org/2024.lrec-main.231/) | https://huggingface.co/datasets/projecte-aina/xnli-ca |
| XQuAD-ca | Question Answering | [Building a Data Infrastructure for a Mid-Resource Language: The Case of Catalan](https://aclanthology.org/2024.lrec-main.231/) | https://huggingface.co/datasets/projecte-aina/xquad-ca |


### Citation
Paper for CatalanBench coming soon.

<!--```bibtex
@inproceedings{baucells-2024-iberobench,
title = "IberoBench: A Benchmark for LLM Evaluation in Iberian Languages",
author = "Baucells, Irene and
AUTHORS, ADD",
booktitle = "Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing",
year = "2024",
publisher = "Association for Computational Linguistics",
}
```
-->

### Groups and Tasks

#### Groups

- `catalan_bench`: All tasks included in CatalanBench.
- `flores_ca`: All FLORES translation tasks from or to Catalan.

#### Tags
- `cabreu`: Three caBREU tasks, one for each type of summary (extractive, abstractive, and extreme).
- `phrases_va`: Two Phrases_va tasks for language adaptation between Catalan and Valencian.

#### Tasks

The following tasks evaluate models on the CatalanBench datasets using various scoring methods.
- `arc_ca_challenge`
- `arc_ca_easy`
- `belebele_cat_Latn`
- `cabreu`
- `catalanqa`
- `catcola`
- `copa_ca`
- `coqcat`
- `flores_ca`
- `flores_ca-de`
- `flores_ca-en`
- `flores_ca-es`
- `flores_ca-eu`
- `flores_ca-fr`
- `flores_ca-gl`
- `flores_ca-it`
- `flores_ca-pt`
- `flores_de-ca`
- `flores_en-ca`
- `flores_es-ca`
- `flores_eu-ca`
- `flores_fr-ca`
- `flores_gl-ca`
- `flores_it-ca`
- `flores_pt-ca`
- `mgsm_direct_ca`
- `openbookqa_ca`
- `parafraseja`
- `paws_ca`
- `phrases_ca`
- `piqa_ca`
- `siqa_ca`
- `teca`
- `veritasqa_gen_ca`
- `veritasqa_mc1_ca`
- `veritasqa_mc2_ca`
- `wnli_ca`
- `xnli_ca`
- `xquad_ca`
- `xstorycloze_ca`

Some of these tasks are taken from benchmarks already available in LM Evaluation Harness. These are:
- `belebele_cat_Latn`: Belebele Catalan
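
Once merged, the whole suite or any subgroup should be runnable through the standard harness CLI. A hypothetical invocation (the model name is a placeholder, not part of this PR):

```shell
# Evaluate a Hugging Face model on the full Catalan suite (placeholder
# model name; any harness-supported backend works the same way).
lm_eval --model hf \
    --model_args pretrained=your-org/your-model \
    --tasks catalan_bench \
    --batch_size 8

# Or restrict the run to the FLORES translation subgroup:
lm_eval --model hf \
    --model_args pretrained=your-org/your-model \
    --tasks flores_ca
```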


### Checklist

* [x] Is the task an existing benchmark in the literature?
* [ ] Have you referenced the original paper that introduced the task?
* [ ] If yes, does the original paper provide a reference implementation?
* [ ] Yes, original implementation contributed by author of the benchmark

If other tasks on this dataset are already supported:
* [ ] Is the "Main" variant of this task clearly denoted?
* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
20 changes: 20 additions & 0 deletions lm_eval/tasks/catalan_bench/_arc_ca_common_yaml
@@ -0,0 +1,20 @@
tag: arc_ca
dataset_path: projecte-aina/arc_ca
output_type: multiple_choice
training_split: null
validation_split: validation
test_split: test
doc_to_text: "Pregunta: {{question}}\nResposta:"
doc_to_target: "{{choices.label.index(answerKey)}}"
doc_to_choice: "{{choices.text}}"
should_decontaminate: true
doc_to_decontamination_query: "Pregunta: {{question}}\nResposta:"
metric_list:
- metric: acc
aggregation: mean
higher_is_better: true
- metric: acc_norm
aggregation: mean
higher_is_better: true
metadata:
version: 1.0
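
For reference, the `doc_to_target` template above maps the gold `answerKey` onto the index of the matching entry in `choices.label`, and `doc_to_choice` supplies the candidate continuations. A minimal Python sketch of that lookup, on a made-up ARC-style document (field values are illustrative, only the field shape matches the dataset):

```python
# Illustrative ARC_ca-style document: parallel "text"/"label" lists
# plus a gold "answerKey" (values invented for demonstration).
doc = {
    "question": "Quina és la capital de Catalunya?",
    "choices": {
        "text": ["Girona", "Barcelona", "Tarragona", "Lleida"],
        "label": ["A", "B", "C", "D"],
    },
    "answerKey": "B",
}

# doc_to_text: "Pregunta: {{question}}\nResposta:"
prompt = f"Pregunta: {doc['question']}\nResposta:"

# doc_to_target: "{{choices.label.index(answerKey)}}" -> gold choice index
target_index = doc["choices"]["label"].index(doc["answerKey"])

# doc_to_choice: "{{choices.text}}" -> candidate continuations to score
choices = doc["choices"]["text"]

print(target_index, choices[target_index])  # → 1 Barcelona
```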
17 changes: 17 additions & 0 deletions lm_eval/tasks/catalan_bench/_cabreu_common_yaml
@@ -0,0 +1,17 @@
tag: cabreu
dataset_path: projecte-aina/caBreu
dataset_name: null
output_type: generate_until
test_split: test
training_split: train
validation_split: validation
process_docs: !function utils.process_doc_cabreu
metric_list:
- metric: bleu
aggregation: bleu
higher_is_better: true
- metric: !function utils.rouge1
aggregation: !function utils.rouge1_agg
higher_is_better: true
metadata:
version: 1.0
3 changes: 3 additions & 0 deletions lm_eval/tasks/catalan_bench/arc_ca_challenge.yaml
@@ -0,0 +1,3 @@
task: arc_ca_challenge
dataset_name: ARC-Challenge
include: _arc_ca_common_yaml
3 changes: 3 additions & 0 deletions lm_eval/tasks/catalan_bench/arc_ca_easy.yaml
@@ -0,0 +1,3 @@
task: arc_ca_easy
dataset_name: ARC-Easy
include: _arc_ca_common_yaml
8 changes: 8 additions & 0 deletions lm_eval/tasks/catalan_bench/cabreu_abstractive.yaml
@@ -0,0 +1,8 @@
include: _cabreu_common_yaml
task: cabreu_abstractive
description: "Examina el text següent i genera'n un resum abstractiu, expressant el significat del text original d'una manera més natural i concisa.\n"
doc_to_text: >-
Text: {{content}}

Resum:
doc_to_target: '{{summaries["abstractive"]["a1"]}}'
8 changes: 8 additions & 0 deletions lm_eval/tasks/catalan_bench/cabreu_extractive.yaml
@@ -0,0 +1,8 @@
include: _cabreu_common_yaml
task: cabreu_extractive
description: "Examina el text següent i genera'n un resum extractiu, utilitzant les frases o oracions més rellevants del text original.\n"
doc_to_text: >-
Text: {{content}}

Resum:
doc_to_target: '{{summaries["extractive"]["a1"]}}'
8 changes: 8 additions & 0 deletions lm_eval/tasks/catalan_bench/cabreu_extreme.yaml
@@ -0,0 +1,8 @@
include: _cabreu_common_yaml
task: cabreu_extreme
description: "Examina el text següent i genera'n un resum que sigui el més concís possible i que preservi el significat del text original.\n"
doc_to_text: >-
Text: {{content}}
Resum:
doc_to_target: '{{summaries["extreme"]["a1"]}}'
25 changes: 25 additions & 0 deletions lm_eval/tasks/catalan_bench/catalan_bench.yaml
@@ -0,0 +1,25 @@
group: catalan_bench
task:
- belebele_cat_Latn
- xnli_ca
- catcola
- copa_ca
- openbookqa_ca
- parafraseja
- paws_ca
- piqa_ca
- siqa_ca
- teca
- wnli_ca
- arc_ca_easy
- arc_ca_challenge
- xstorycloze_ca
- xquad_ca
- catalanqa
- coqcat
- flores_ca
- cabreu
- mgsm_direct_ca
- phrases_va
metadata:
version: 1.0
25 changes: 25 additions & 0 deletions lm_eval/tasks/catalan_bench/catalanqa.yaml
@@ -0,0 +1,25 @@
task: catalanqa
dataset_path: projecte-aina/catalanqa
dataset_name: null
output_type: generate_until
training_split: train
validation_split: validation
test_split: test
doc_to_text: "Context: {{context}}\n\nPregunta: {{question}}\n\nResposta:"
doc_to_target: '{{answers[0]["text"]}}'
target_delimiter: ' '
process_results: !function utils.process_results_qa
generation_kwargs:
until:
- "\n"
do_sample: false
temperature: 0.0
metric_list:
- metric: exact_match
aggregation: mean
higher_is_better: true
- metric: f1
aggregation: mean
higher_is_better: true
metadata:
version: 1.0
14 changes: 14 additions & 0 deletions lm_eval/tasks/catalan_bench/catcola.yaml
@@ -0,0 +1,14 @@
task: catcola
dataset_path: nbel/CatCoLA
output_type: multiple_choice
training_split: train
validation_split: validation
test_split: null
doc_to_text: "{{Sentence}}\nPregunta: Té sentit aquesta frase?\nResposta:"
doc_to_target: label
doc_to_choice: ["no", "sí"]
metric_list:
- metric: mcc
- metric: acc
metadata:
version: 1.0
17 changes: 17 additions & 0 deletions lm_eval/tasks/catalan_bench/copa_ca.yaml
@@ -0,0 +1,17 @@
task: copa_ca
dataset_path: projecte-aina/COPA-ca
dataset_name: null
output_type: multiple_choice
training_split: train
validation_split: validation
test_split: test
process_docs: !function utils.process_docs_copa_ca
doc_to_text: '{{premise[:-1].strip() + " " + {"cause": "perquè", "effect": "i per tant"}[question]}}'
doc_to_target: '{{choice1 if label == 0 else choice2}}'
doc_to_choice: '{{[choice1, choice2]}}'
metric_list:
- metric: acc
aggregation: mean
higher_is_better: true
metadata:
version: 1.0
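
The `doc_to_text` expression above builds a COPA-style cloze prompt: it drops the premise's final punctuation and appends a Catalan connective, "perquè" ("because") for cause questions or "i per tant" ("and therefore") for effect questions. A Python sketch of the same logic, on an invented example document (values are illustrative):

```python
# Illustrative COPA-ca document (values invented for demonstration;
# only the field shape matches the dataset).
doc = {
    "premise": "L'home va perdre el tren.",
    "question": "cause",
    "choice1": "Va arribar tard a l'estació.",
    "choice2": "Va comprar un bitllet.",
    "label": 0,
}

# doc_to_text: strip the trailing period, then append the connective
# that matches the question type.
connective = {"cause": "perquè", "effect": "i per tant"}[doc["question"]]
prompt = doc["premise"][:-1].strip() + " " + connective

# doc_to_choice / doc_to_target: the two alternatives, with the gold
# continuation selected by the integer label.
choices = [doc["choice1"], doc["choice2"]]
target = choices[doc["label"]]

print(prompt)  # → L'home va perdre el tren perquè
```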
23 changes: 23 additions & 0 deletions lm_eval/tasks/catalan_bench/coqcat.yaml
@@ -0,0 +1,23 @@
task: coqcat
dataset_path: projecte-aina/CoQCat
output_type: generate_until
training_split: train
validation_split: validation
test_split: test
doc_to_text: '{{story+"\n\n"}}{% for i in range(questions|length-1) %}{{"Q: "+questions[i]+"\n\n"+"A: "+answers["input_text"][i]+"\n\n"}}{% endfor %}{{"Q: "+questions[-1]+"\n\n"+"A:"}}'
doc_to_target: '{{ answers["input_text"][questions|length - 1] }}'
process_results: !function utils.process_results_coqcat
should_decontaminate: true
doc_to_decontamination_query: "{{story}} {{questions|join('\n')}}"
generation_kwargs:
until:
- "\nQ:"
metric_list:
- metric: "em"
aggregation: mean
higher_is_better: true
- metric: "f1"
aggregation: mean
higher_is_better: true
metadata:
version: 1.0
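
The `doc_to_text` template above replays the conversation up to the last turn: the story, then every earlier question/answer pair, then the final question followed by a bare "A:" for the model to complete; `doc_to_target` is the last gold answer. A Python sketch of that construction, on an invented two-turn document (values are illustrative):

```python
# Illustrative CoQCat-style document: a story plus parallel lists of
# questions and gold answers (values invented for demonstration).
doc = {
    "story": "La Maria viu a Girona. Té un gos que es diu Blat.",
    "questions": ["On viu la Maria?", "Com es diu el seu gos?"],
    "answers": {"input_text": ["A Girona", "Blat"]},
}

# Replay all turns except the last, then pose the final question.
parts = [doc["story"] + "\n\n"]
for i in range(len(doc["questions"]) - 1):
    parts.append("Q: " + doc["questions"][i] + "\n\n"
                 + "A: " + doc["answers"]["input_text"][i] + "\n\n")
parts.append("Q: " + doc["questions"][-1] + "\n\n" + "A:")
prompt = "".join(parts)

# doc_to_target: the gold answer to the final question.
target = doc["answers"]["input_text"][len(doc["questions"]) - 1]
```

Generation stops at the next `"\nQ:"`, so each model answer is cut off before it starts hallucinating further turns.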
25 changes: 25 additions & 0 deletions lm_eval/tasks/catalan_bench/flores_ca/_flores_common_yaml
@@ -0,0 +1,25 @@
dataset_path: facebook/flores
dataset_name: all
output_type: generate_until
training_split: dev
validation_split: dev
test_split: devtest
fewshot_split: dev
target_delimiter: ''
generation_kwargs:
until:
- "\n"
metric_list:
- metric: bleu
aggregation: bleu
higher_is_better: true
- metric: ter
aggregation: ter
higher_is_better: false
- metric: chrf
aggregation: chrf
higher_is_better: true
metadata:
version: 1.0
dataset_kwargs:
trust_remote_code: true
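
Each directional task (e.g. `flores_ca-en`) presumably includes this common file and adds only the pair-specific fields. A hypothetical sketch of what such a file would contain (the prompt wording and field names are illustrative, not copied from this PR; FLORES sentence columns follow the `sentence_<lang>_<script>` naming):

```yaml
# Hypothetical flores_ca-en.yaml: reuse the shared config and set the
# pair-specific task name, prompt, and reference (illustrative only).
include: _flores_common_yaml
task: flores_ca-en
doc_to_text: "Catalan: {{sentence_cat_Latn}}\nEnglish:"
doc_to_target: " {{sentence_eng_Latn}}"
```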