diff --git a/lm_eval/tasks/README.md b/lm_eval/tasks/README.md index 771f1adcab..e44bf3f95d 100644 --- a/lm_eval/tasks/README.md +++ b/lm_eval/tasks/README.md @@ -25,6 +25,7 @@ | [bertaqa](bertaqa/README.md) | Local Basque cultural trivia QA tests in English and Basque languages. | English, Basque, Basque (MT) | | [bigbench](bigbench/README.md) | Broad tasks from the BIG-bench benchmark designed to push the boundaries of large models. | Multiple | | [blimp](blimp/README.md) | Tasks testing grammatical phenomena to evaluate language model's linguistic capabilities. | English | +| [catalan_bench](catalan_bench/README.md) | Collection of tasks in Catalan encompassing various evaluation areas. | Catalan | | [ceval](ceval/README.md) | Tasks that evaluate language understanding and reasoning in an educational context. | Chinese | | [cmmlu](cmmlu/README.md) | Multi-subject multiple choice question tasks for comprehensive academic assessment. | Chinese | | code_x_glue | Tasks that involve understanding and generating code across multiple programming languages. | Go, Java, JS, PHP, Python, Ruby | @@ -86,6 +87,7 @@ | [pile_10k](pile_10k/README.md) | The first 10K elements of The Pile, useful for debugging models trained on it. | English | | [piqa](piqa/README.md) | Physical Interaction Question Answering tasks to test physical commonsense reasoning. | English | | [polemo2](polemo2/README.md) | Sentiment analysis and emotion detection tasks based on Polish language data. | Polish | +| [portuguese_bench](portuguese_bench/README.md) | Collection of tasks in European Portuguese encompassing various evaluation areas. | Portuguese | | [prost](prost/README.md) | Tasks requiring understanding of professional standards and ethics in various domains. | English | | [pubmedqa](pubmedqa/README.md) | Question answering tasks based on PubMed research articles for biomedical understanding. | English | | [qa4mre](qa4mre/README.md) | Question Answering for Machine Reading Evaluation, assessing comprehension and reasoning. | English | @@ -121,4 +123,3 @@ | [xnli_eu](xnli_eu/README.md) | Cross-lingual Natural Language Inference tasks in Basque. | Basque | | [xstorycloze](xstorycloze/README.md) | Cross-lingual narrative understanding tasks to predict story endings in multiple languages. | Russian, Simplified Chinese, Spanish, Arabic, Hindi, Indonesian, Telugu, Swahili, Basque, Burmese | | [xwinograd](xwinograd/README.md) | Cross-lingual Winograd schema tasks for coreference resolution in multiple languages. | English, French, Japanese, Portuguese, Russian, Chinese | -| [portuguese_bench](portuguese_bench/README.md) | Collection of tasks in European Portuguese encompassing various evaluation areas. | Portuguese | diff --git a/lm_eval/tasks/catalan_bench/README.md b/lm_eval/tasks/catalan_bench/README.md new file mode 100644 index 0000000000..73dec948fe --- /dev/null +++ b/lm_eval/tasks/catalan_bench/README.md @@ -0,0 +1,121 @@ +# CatalanBench + +### Paper + +CatalanBench is a benchmark for evaluating language models on Catalan tasks. That is, it evaluates the ability of a language model to understand and generate Catalan text. CatalanBench offers a combination of pre-existing, open datasets and datasets developed exclusively for this benchmark. All the details of CatalanBench will be published in a paper soon.
+ +The new evaluation datasets included in CatalanBench are: +| Task | Category | Homepage | +|:-------------:|:-----:|:-----:| +| ARC_ca | Question Answering | https://huggingface.co/datasets/projecte-aina/arc_ca | +| MGSM_ca | Math | https://huggingface.co/datasets/projecte-aina/mgsm_ca | +| OpenBookQA_ca | Question Answering | https://huggingface.co/datasets/projecte-aina/openbookqa_ca | +| Parafraseja | Paraphrasing | https://huggingface.co/datasets/projecte-aina/Parafraseja | +| PIQA_ca | Question Answering | https://huggingface.co/datasets/projecte-aina/piqa_ca | +| SIQA_ca | Question Answering | https://huggingface.co/datasets/projecte-aina/siqa_ca | +| XStoryCloze_ca | Commonsense Reasoning | https://huggingface.co/datasets/projecte-aina/xstorycloze_ca | + +The datasets included in CatalanBench that have been made public in previous publications are: + +| Task | Category | Paper title | Homepage | +|:-------------:|:-----:|:-------------:|:-----:| +| Belebele_ca | Reading Comprehension | [The Belebele Benchmark: a Parallel Reading Comprehension Dataset in 122 Language Variants](https://arxiv.org/abs/2308.16884) | https://huggingface.co/datasets/facebook/belebele | +| caBREU | Summarization | [Building a Data Infrastructure for a Mid-Resource Language: The Case of Catalan](https://aclanthology.org/2024.lrec-main.231/) | https://huggingface.co/datasets/projecte-aina/caBreu | +| CatalanQA | Question Answering | [Building a Data Infrastructure for a Mid-Resource Language: The Case of Catalan](https://aclanthology.org/2024.lrec-main.231/) | https://huggingface.co/datasets/projecte-aina/catalanqa | +| CatCoLA | Linguistic Acceptability | CatCoLA: Catalan Corpus of Linguistic Acceptability | https://huggingface.co/datasets/nbel/CatCoLA | +| COPA-ca | Commonsense Reasoning | [Building a Data Infrastructure for a Mid-Resource Language: The Case of Catalan](https://aclanthology.org/2024.lrec-main.231/) | https://huggingface.co/datasets/projecte-aina/COPA-ca | +| CoQCat | Question Answering | [Building a Data Infrastructure for a Mid-Resource Language: The Case of Catalan](https://aclanthology.org/2024.lrec-main.231/) | https://huggingface.co/datasets/projecte-aina/CoQCat | +| FLORES_ca | Translation | [The FLORES-101 Evaluation Benchmark for Low-Resource and Multilingual Machine Translation](https://arxiv.org/abs/2106.03193) | https://huggingface.co/datasets/facebook/flores | +| PAWS-ca | Paraphrasing | [Building a Data Infrastructure for a Mid-Resource Language: The Case of Catalan](https://aclanthology.org/2024.lrec-main.231/) | https://huggingface.co/datasets/projecte-aina/PAWS-ca | +| TE-ca | Natural Language Inference | [Building a Data Infrastructure for a Mid-Resource Language: The Case of Catalan](https://aclanthology.org/2024.lrec-main.231/) | https://huggingface.co/datasets/projecte-aina/teca | +| VeritasQA_ca | Truthfulness | VeritasQA: A Truthfulness Benchmark Aimed at Multilingual Transferability | TBA | +| WNLI-ca | Natural Language Inference | [Building a Data Infrastructure for a Mid-Resource Language: The Case of Catalan](https://aclanthology.org/2024.lrec-main.231/) | https://huggingface.co/datasets/projecte-aina/wnli-ca | +| XNLI-ca | Natural Language Inference | [Building a Data Infrastructure for a Mid-Resource Language: The Case of Catalan](https://aclanthology.org/2024.lrec-main.231/) | https://huggingface.co/datasets/projecte-aina/xnli-ca | +| XQuAD-ca | Question Answering | [Building a Data Infrastructure for a Mid-Resource Language: The Case of Catalan](https://aclanthology.org/2024.lrec-main.231/) | https://huggingface.co/datasets/projecte-aina/xquad-ca | + + +### Citation +Paper for CatalanBench coming soon. + + + +### Groups and Tasks + +#### Groups + +- `catalan_bench`: All tasks included in CatalanBench. +- `flores_ca`: All FLORES translation tasks from or to Catalan. + +#### Tags +- `cabreu`: Three CaBREU tasks, one for each type of summary (extractive, abstractive and extreme). +- `phrases_va`: Two Phrases_va tasks for language adaptation between Catalan and Valencian. + +#### Tasks + +The following tasks evaluate models on the CatalanBench datasets using various scoring methods: + - `arc_ca_challenge` + - `arc_ca_easy` + - `belebele_cat_Latn` + - `cabreu` + - `catalanqa` + - `catcola` + - `copa_ca` + - `coqcat` + - `flores_ca` + - `flores_ca-de` + - `flores_ca-en` + - `flores_ca-es` + - `flores_ca-eu` + - `flores_ca-fr` + - `flores_ca-gl` + - `flores_ca-it` + - `flores_ca-pt` + - `flores_de-ca` + - `flores_en-ca` + - `flores_es-ca` + - `flores_eu-ca` + - `flores_fr-ca` + - `flores_gl-ca` + - `flores_it-ca` + - `flores_pt-ca` + - `mgsm_direct_ca` + - `openbookqa_ca` + - `parafraseja` + - `paws_ca` + - `phrases_va` + - `piqa_ca` + - `siqa_ca` + - `teca` + - `veritasqa_gen_ca` + - `veritasqa_mc1_ca` + - `veritasqa_mc2_ca` + - `wnli_ca` + - `xnli_ca` + - `xquad_ca` + - `xstorycloze_ca` + +Some of these tasks are taken from benchmarks already available in the LM Evaluation Harness. These are: +- `belebele_cat_Latn`: Belebele Catalan + + +### Checklist + +* [x] Is the task an existing benchmark in the literature? + * [ ] Have you referenced the original paper that introduced the task? + * [ ] If yes, does the original paper provide a reference implementation? + * [ ] Yes, original implementation contributed by author of the benchmark + +If other tasks on this dataset are already supported: +* [ ] Is the "Main" variant of this task clearly denoted? +* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates? +* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
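A minimal sketch of running the whole benchmark through the harness's Python API (this assumes lm-eval's v0.4-style `lm_eval.simple_evaluate` entry point; the model identifier below is a placeholder):

```python
# Minimal sketch: evaluate a Hugging Face model on the full CatalanBench group.
# Assumes lm-eval's v0.4-style Python API; "my-org/my-model" is a placeholder.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                               # Hugging Face transformers backend
    model_args="pretrained=my-org/my-model",  # placeholder model identifier
    tasks=["catalan_bench"],                  # or a subset, e.g. ["flores_ca", "cabreu"]
    batch_size=8,
)
print(results["results"])  # per-task metrics such as acc, bleu or f1
```

The same call accepts any individual task or tag listed above (for example `arc_ca_challenge` or `phrases_va`) in place of the full group.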
diff --git a/lm_eval/tasks/catalan_bench/_arc_ca_common_yaml b/lm_eval/tasks/catalan_bench/_arc_ca_common_yaml new file mode 100644 index 0000000000..b89290ebaf --- /dev/null +++ b/lm_eval/tasks/catalan_bench/_arc_ca_common_yaml @@ -0,0 +1,20 @@ +tag: arc_ca +dataset_path: projecte-aina/arc_ca +output_type: multiple_choice +training_split: null +validation_split: validation +test_split: test +doc_to_text: "Pregunta: {{question}}\nResposta:" +doc_to_target: "{{choices.label.index(answerKey)}}" +doc_to_choice: "{{choices.text}}" +should_decontaminate: true +doc_to_decontamination_query: "Pregunta: {{question}}\nResposta:" +metric_list: + - metric: acc + aggregation: mean + higher_is_better: true + - metric: acc_norm + aggregation: mean + higher_is_better: true +metadata: + version: 1.0 diff --git a/lm_eval/tasks/catalan_bench/_cabreu_common_yaml b/lm_eval/tasks/catalan_bench/_cabreu_common_yaml new file mode 100644 index 0000000000..c66e8bc486 --- /dev/null +++ b/lm_eval/tasks/catalan_bench/_cabreu_common_yaml @@ -0,0 +1,17 @@ +tag: cabreu +dataset_path: projecte-aina/caBreu +dataset_name: null +output_type: generate_until +test_split: test +training_split: train +validation_split: validation +process_docs: !function utils.process_doc_cabreu +metric_list: + - metric: bleu + aggregation: bleu + higher_is_better: true + - metric: !function utils.rouge1 + aggregation: !function utils.rouge1_agg + higher_is_better: true +metadata: + version: 1.0 diff --git a/lm_eval/tasks/catalan_bench/arc_ca_challenge.yaml b/lm_eval/tasks/catalan_bench/arc_ca_challenge.yaml new file mode 100644 index 0000000000..9d7a9c8423 --- /dev/null +++ b/lm_eval/tasks/catalan_bench/arc_ca_challenge.yaml @@ -0,0 +1,3 @@ +task: arc_ca_challenge +dataset_name: ARC-Challenge +include: _arc_ca_common_yaml diff --git a/lm_eval/tasks/catalan_bench/arc_ca_easy.yaml b/lm_eval/tasks/catalan_bench/arc_ca_easy.yaml new file mode 100644 index 0000000000..67b28fd626 --- /dev/null +++ b/lm_eval/tasks/catalan_bench/arc_ca_easy.yaml @@ -0,0 +1,3 @@ +task: arc_ca_easy +dataset_name: ARC-Easy +include: _arc_ca_common_yaml diff --git a/lm_eval/tasks/catalan_bench/cabreu_abstractive.yaml b/lm_eval/tasks/catalan_bench/cabreu_abstractive.yaml new file mode 100644 index 0000000000..930ba28a52 --- /dev/null +++ b/lm_eval/tasks/catalan_bench/cabreu_abstractive.yaml @@ -0,0 +1,8 @@ +include: _cabreu_common_yaml +task: cabreu_abstractive +description: "Examina el text següent i genera'n un resum abstractiu, expressant el significat del text original d'una manera més natural i concisa.\n" +doc_to_text: >- + Text: {{content}} + + Resum: +doc_to_target: '{{summaries["abstractive"]["a1"]}}' diff --git a/lm_eval/tasks/catalan_bench/cabreu_extractive.yaml b/lm_eval/tasks/catalan_bench/cabreu_extractive.yaml new file mode 100644 index 0000000000..e5f3dd4dd0 --- /dev/null +++ b/lm_eval/tasks/catalan_bench/cabreu_extractive.yaml @@ -0,0 +1,8 @@ +include: _cabreu_common_yaml +task: cabreu_extractive +description: "Examina el text següent i genera'n un resum extractiu, utilitzant les frases o oracions més rellevants del text original.\n" +doc_to_text: >- + Text: {{content}} + + Resum: +doc_to_target: '{{summaries["extractive"]["a1"]}}' diff --git a/lm_eval/tasks/catalan_bench/cabreu_extreme.yaml b/lm_eval/tasks/catalan_bench/cabreu_extreme.yaml new file mode 100644 index 0000000000..98efbe9cd4 --- /dev/null +++ b/lm_eval/tasks/catalan_bench/cabreu_extreme.yaml @@ -0,0 +1,8 @@ +include: _cabreu_common_yaml +task: cabreu_extreme +description: "Examina el text 
següent i genera'n un resum que sigui el més concís possible i que preservi el significat del text original.\n" +doc_to_text: >- + Text: {{content}} + + Resum: +doc_to_target: '{{summaries["extreme"]["a1"]}}' diff --git a/lm_eval/tasks/catalan_bench/catalan_bench.yaml b/lm_eval/tasks/catalan_bench/catalan_bench.yaml new file mode 100644 index 0000000000..1f1f09ece2 --- /dev/null +++ b/lm_eval/tasks/catalan_bench/catalan_bench.yaml @@ -0,0 +1,25 @@ +group: catalan_bench +task: + - belebele_cat_Latn + - xnli_ca + - catcola + - copa_ca + - openbookqa_ca + - parafraseja + - paws_ca + - piqa_ca + - siqa_ca + - teca + - wnli_ca + - arc_ca_easy + - arc_ca_challenge + - xstorycloze_ca + - xquad_ca + - catalanqa + - coqcat + - flores_ca + - cabreu + - mgsm_direct_ca + - phrases_va +metadata: + version: 1.0 diff --git a/lm_eval/tasks/catalan_bench/catalanqa.yaml b/lm_eval/tasks/catalan_bench/catalanqa.yaml new file mode 100644 index 0000000000..926cdfa1be --- /dev/null +++ b/lm_eval/tasks/catalan_bench/catalanqa.yaml @@ -0,0 +1,25 @@ +task: catalanqa +dataset_path: projecte-aina/catalanqa +dataset_name: null +output_type: generate_until +training_split: train +validation_split: validation +test_split: test +doc_to_text: "Context: {{context}}\n\nPregunta: {{question}}\n\nResposta:" +doc_to_target: '{{answers[0]["text"]}}' +target_delimiter: ' ' +process_results: !function utils.process_results_qa +generation_kwargs: + until: + - "\n" + do_sample: false + temperature: 0.0 +metric_list: + - metric: exact_match + aggregation: mean + higher_is_better: true + - metric: f1 + aggregation: mean + higher_is_better: true +metadata: + version: 1.0 diff --git a/lm_eval/tasks/catalan_bench/catcola.yaml b/lm_eval/tasks/catalan_bench/catcola.yaml new file mode 100644 index 0000000000..121b5e7f48 --- /dev/null +++ b/lm_eval/tasks/catalan_bench/catcola.yaml @@ -0,0 +1,14 @@ +task: catcola +dataset_path: nbel/CatCoLA +output_type: multiple_choice +training_split: train +validation_split: validation +test_split: null +doc_to_text: "{{Sentence}}\nPregunta: Té sentit aquesta frase?\nResposta:" +doc_to_target: label +doc_to_choice: ["no", "sí"] +metric_list: + - metric: mcc + - metric: acc +metadata: + version: 1.0 diff --git a/lm_eval/tasks/catalan_bench/copa_ca.yaml b/lm_eval/tasks/catalan_bench/copa_ca.yaml new file mode 100644 index 0000000000..d376ad3aea --- /dev/null +++ b/lm_eval/tasks/catalan_bench/copa_ca.yaml @@ -0,0 +1,17 @@ +task: copa_ca +dataset_path: projecte-aina/COPA-ca +dataset_name: null +output_type: multiple_choice +training_split: train +validation_split: validation +test_split: test +process_docs: !function utils.process_docs_copa_ca +doc_to_text: '{{premise[:-1].strip() + " " + {"cause": "perquè", "effect": "i per tant"}[question]}}' +doc_to_target: '{{choice1 if label == 0 else choice2}}' +doc_to_choice: '{{[choice1, choice2]}}' +metric_list: + - metric: acc + aggregation: mean + higher_is_better: true +metadata: + version: 1.0 diff --git a/lm_eval/tasks/catalan_bench/coqcat.yaml b/lm_eval/tasks/catalan_bench/coqcat.yaml new file mode 100644 index 0000000000..95145a7492 --- /dev/null +++ b/lm_eval/tasks/catalan_bench/coqcat.yaml @@ -0,0 +1,23 @@ +task: coqcat +dataset_path: projecte-aina/CoQCat +output_type: generate_until +training_split: train +validation_split: validation +test_split: test +doc_to_text: '{{story+"\n\n"}}{% for i in range(questions|length-1) %}{{"Q: "+questions[i]+"\n\n"+"A: "+answers["input_text"][i]+"\n\n"}}{% endfor %}{{"Q: "+questions[-1]+"\n\n"+"A:"}}' +doc_to_target: '{{ 
answers["input_text"][questions|length - 1] }}' +process_results: !function utils.process_results_coqcat +should_decontaminate: true +doc_to_decontamination_query: "{{story}} {{question.input_text|join('\n')}}" +generation_kwargs: + until: + - "\nQ:" +metric_list: + - metric: "em" + aggregation: mean + higher_is_better: true + - metric: "f1" + aggregation: mean + higher_is_better: true +metadata: + version: 1.0 diff --git a/lm_eval/tasks/catalan_bench/flores_ca/_flores_common_yaml b/lm_eval/tasks/catalan_bench/flores_ca/_flores_common_yaml new file mode 100644 index 0000000000..59a9b14aaf --- /dev/null +++ b/lm_eval/tasks/catalan_bench/flores_ca/_flores_common_yaml @@ -0,0 +1,25 @@ +dataset_path: facebook/flores +dataset_name: all +output_type: generate_until +training_split: dev +validation_split: dev +test_split: devtest +fewshot_split: dev +target_delimiter: '' +generation_kwargs: + until: + - "\n" +metric_list: + - metric: bleu + aggregation: bleu + higher_is_better: true + - metric: ter + aggregation: ter + higher_is_better: false + - metric: chrf + aggregation: chrf + higher_is_better: true +metadata: + version: 1.0 +dataset_kwargs: + trust_remote_code: true diff --git a/lm_eval/tasks/catalan_bench/flores_ca/create_yamls_flores_ca.py b/lm_eval/tasks/catalan_bench/flores_ca/create_yamls_flores_ca.py new file mode 100644 index 0000000000..6125b97266 --- /dev/null +++ b/lm_eval/tasks/catalan_bench/flores_ca/create_yamls_flores_ca.py @@ -0,0 +1,334 @@ +""" +Script to generate task YAMLs for the FLORES-200 dataset. +Based on `tasks/translation/utils.py`. +""" + +import argparse + +import yaml +from langcodes import Language + + +# constants +_LANGUAGES = [ + "ace_Arab", + "bam_Latn", + "dzo_Tibt", + "hin_Deva", + "khm_Khmr", + "mag_Deva", + "pap_Latn", + "sot_Latn", + "tur_Latn", + "ace_Latn", + "ban_Latn", + "ell_Grek", + "hne_Deva", + "kik_Latn", + "mai_Deva", + "pbt_Arab", + "spa_Latn", + "twi_Latn", + "acm_Arab", + "bel_Cyrl", + "eng_Latn", + "hrv_Latn", + "kin_Latn", + "mal_Mlym", + "pes_Arab", + "srd_Latn", + "tzm_Tfng", + "acq_Arab", + "bem_Latn", + "epo_Latn", + "hun_Latn", + "kir_Cyrl", + "mar_Deva", + "plt_Latn", + "srp_Cyrl", + "uig_Arab", + "aeb_Arab", + "ben_Beng", + "est_Latn", + "hye_Armn", + "kmb_Latn", + "min_Arab", + "pol_Latn", + "ssw_Latn", + "ukr_Cyrl", + "afr_Latn", + "bho_Deva", + "eus_Latn", + "ibo_Latn", + "kmr_Latn", + "min_Latn", + "por_Latn", + "sun_Latn", + "umb_Latn", + "ajp_Arab", + "bjn_Arab", + "ewe_Latn", + "ilo_Latn", + "knc_Arab", + "mkd_Cyrl", + "prs_Arab", + "swe_Latn", + "urd_Arab", + "aka_Latn", + "bjn_Latn", + "fao_Latn", + "ind_Latn", + "knc_Latn", + "mlt_Latn", + "quy_Latn", + "swh_Latn", + "uzn_Latn", + "als_Latn", + "bod_Tibt", + "fij_Latn", + "isl_Latn", + "kon_Latn", + "mni_Beng", + "ron_Latn", + "szl_Latn", + "vec_Latn", + "amh_Ethi", + "bos_Latn", + "fin_Latn", + "ita_Latn", + "kor_Hang", + "mos_Latn", + "run_Latn", + "tam_Taml", + "vie_Latn", + "apc_Arab", + "bug_Latn", + "fon_Latn", + "jav_Latn", + "lao_Laoo", + "mri_Latn", + "rus_Cyrl", + "taq_Latn", + "war_Latn", + "arb_Arab", + "bul_Cyrl", + "fra_Latn", + "jpn_Jpan", + "lij_Latn", + "mya_Mymr", + "sag_Latn", + "taq_Tfng", + "wol_Latn", + "arb_Latn", + "cat_Latn", + "fur_Latn", + "kab_Latn", + "lim_Latn", + "nld_Latn", + "san_Deva", + "tat_Cyrl", + "xho_Latn", + "ars_Arab", + "ceb_Latn", + "fuv_Latn", + "kac_Latn", + "lin_Latn", + "nno_Latn", + "sat_Olck", + "tel_Telu", + "ydd_Hebr", + "ary_Arab", + "ces_Latn", + "gaz_Latn", + "kam_Latn", + "lit_Latn", + "nob_Latn", + "scn_Latn", + 
"tgk_Cyrl", + "yor_Latn", + "arz_Arab", + "cjk_Latn", + "gla_Latn", + "kan_Knda", + "lmo_Latn", + "npi_Deva", + "shn_Mymr", + "tgl_Latn", + "yue_Hant", + "asm_Beng", + "ckb_Arab", + "gle_Latn", + "kas_Arab", + "ltg_Latn", + "nso_Latn", + "sin_Sinh", + "tha_Thai", + "zho_Hans", + "ast_Latn", + "crh_Latn", + "glg_Latn", + "kas_Deva", + "ltz_Latn", + "nus_Latn", + "slk_Latn", + "tir_Ethi", + "zho_Hant", + "awa_Deva", + "cym_Latn", + "grn_Latn", + "kat_Geor", + "lua_Latn", + "nya_Latn", + "slv_Latn", + "tpi_Latn", + "zsm_Latn", + "ayr_Latn", + "dan_Latn", + "guj_Gujr", + "kaz_Cyrl", + "lug_Latn", + "oci_Latn", + "smo_Latn", + "tsn_Latn", + "zul_Latn", + "azb_Arab", + "deu_Latn", + "hat_Latn", + "kbp_Latn", + "luo_Latn", + "ory_Orya", + "sna_Latn", + "tso_Latn", + "azj_Latn", + "dik_Latn", + "hau_Latn", + "kea_Latn", + "lus_Latn", + "pag_Latn", + "snd_Arab", + "tuk_Latn", + "bak_Cyrl", + "dyu_Latn", + "heb_Hebr", + "khk_Cyrl", + "lvs_Latn", + "pan_Guru", + "som_Latn", + "tum_Latn", +] +LANGUAGE_PAIRS = [ + (a, b) for idx, a in enumerate(_LANGUAGES) for b in _LANGUAGES[idx + 1 :] +] + +LANGUAGES_OF_INTEREST = [ + "cat_Latn", + "spa_Latn", + "eng_Latn", + "glg_Latn", + "eus_Latn", + "ita_Latn", + "deu_Latn", + "por_Latn", + "fra_Latn", +] +MAIN_LANG = "cat_Latn" +LANGUAGE_PAIRS = [ + (a, b) + for (a, b) in LANGUAGE_PAIRS + if a in LANGUAGES_OF_INTEREST + and b in LANGUAGES_OF_INTEREST + and "cat_Latn" in (a, b) +] + +# auxiliary functions + + +def code_to_language_name(code): + return Language.make(language=Language.get(code)["language"]).display_name() + + +def code_to_short_name(code): + return Language.get(code)["language"] + + +def jinja_var(s): + return "{{" + s + "}}" + + +def doc_to_text(src: str, tgt: str) -> str: + src_name, tgt_name = map(code_to_language_name, [src, tgt]) + + return f"""\ +{src_name} sentence: {jinja_var('sentence_' + src)} +{tgt_name} sentence:""" + + +def doc_to_target(tgt: str) -> str: + return f"{jinja_var('sentence_' + tgt)}" + + +# main function + + +def gen_lang_yamls(output_dir: str, overwrite: bool) -> None: + """ + Generate a YAML file for each translation direction. + """ + + err = [] + for src, tgt in LANGUAGE_PAIRS: + # do both translation directions for each lang pair + for src, tgt in [(src, tgt), (tgt, src)]: + lang_pair_name = f"{code_to_short_name(src)}-{code_to_short_name(tgt)}" + yaml_file_name = f"flores_{lang_pair_name}.yaml" + + try: + with open( + f"{output_dir}/{yaml_file_name}", + "w" if overwrite else "x", + encoding="utf-8", + ) as outfile: + print(f"Creating {yaml_file_name}...") + outfile.write("# File generated by `create-yamls.py`\n") + yaml.dump( + { + # "group": [f"{BENCH_NAME}_bench", f"{BENCH_NAME}_bench_flores"], + # "group": "flores_ca", + "include": "_flores_common_yaml", + "task": f"flores_{lang_pair_name}", + "doc_to_text": doc_to_text(src, tgt), + "doc_to_target": doc_to_target(tgt), + }, + outfile, + sort_keys=False, + ) + + except FileExistsError: + err.append(yaml_file_name) + + if len(err) > 0: + raise FileExistsError( + "Files were not created because they already exist:" + f" {', '.join(err)}" + "\nUse flag --overwrite to overwrite them." 
+ ) + + +def main() -> None: + parser = argparse.ArgumentParser() + parser.add_argument( + "--overwrite", + default=False, + action="store_true", + help="Overwrite files if they already exist", + ) + parser.add_argument( + "--output-dir", default=".", help="Directory to write yaml files to" + ) + args = parser.parse_args() + + gen_lang_yamls(output_dir=args.output_dir, overwrite=args.overwrite) + + +if __name__ == "__main__": + main() diff --git a/lm_eval/tasks/catalan_bench/flores_ca/flores_ca-de.yaml b/lm_eval/tasks/catalan_bench/flores_ca/flores_ca-de.yaml new file mode 100644 index 0000000000..15eb02afb6 --- /dev/null +++ b/lm_eval/tasks/catalan_bench/flores_ca/flores_ca-de.yaml @@ -0,0 +1,7 @@ +# File generated by `create-yamls.py` +include: _flores_common_yaml +task: flores_ca-de +doc_to_text: 'Catalan sentence: {{sentence_cat_Latn}} + + German sentence:' +doc_to_target: '{{sentence_deu_Latn}}' diff --git a/lm_eval/tasks/catalan_bench/flores_ca/flores_ca-en.yaml b/lm_eval/tasks/catalan_bench/flores_ca/flores_ca-en.yaml new file mode 100644 index 0000000000..9a8f5ffeb8 --- /dev/null +++ b/lm_eval/tasks/catalan_bench/flores_ca/flores_ca-en.yaml @@ -0,0 +1,7 @@ +# File generated by `create-yamls.py` +include: _flores_common_yaml +task: flores_ca-en +doc_to_text: 'Catalan sentence: {{sentence_cat_Latn}} + + English sentence:' +doc_to_target: '{{sentence_eng_Latn}}' diff --git a/lm_eval/tasks/catalan_bench/flores_ca/flores_ca-es.yaml b/lm_eval/tasks/catalan_bench/flores_ca/flores_ca-es.yaml new file mode 100644 index 0000000000..9a6aa44240 --- /dev/null +++ b/lm_eval/tasks/catalan_bench/flores_ca/flores_ca-es.yaml @@ -0,0 +1,7 @@ +# File generated by `create-yamls.py` +include: _flores_common_yaml +task: flores_ca-es +doc_to_text: 'Catalan sentence: {{sentence_cat_Latn}} + + Spanish sentence:' +doc_to_target: '{{sentence_spa_Latn}}' diff --git a/lm_eval/tasks/catalan_bench/flores_ca/flores_ca-eu.yaml b/lm_eval/tasks/catalan_bench/flores_ca/flores_ca-eu.yaml new file mode 100644 index 0000000000..48ffe7bf5c --- /dev/null +++ b/lm_eval/tasks/catalan_bench/flores_ca/flores_ca-eu.yaml @@ -0,0 +1,7 @@ +# File generated by `create-yamls.py` +include: _flores_common_yaml +task: flores_ca-eu +doc_to_text: 'Catalan sentence: {{sentence_cat_Latn}} + + Basque sentence:' +doc_to_target: '{{sentence_eus_Latn}}' diff --git a/lm_eval/tasks/catalan_bench/flores_ca/flores_ca-fr.yaml b/lm_eval/tasks/catalan_bench/flores_ca/flores_ca-fr.yaml new file mode 100644 index 0000000000..99b40c1462 --- /dev/null +++ b/lm_eval/tasks/catalan_bench/flores_ca/flores_ca-fr.yaml @@ -0,0 +1,7 @@ +# File generated by `create-yamls.py` +include: _flores_common_yaml +task: flores_ca-fr +doc_to_text: 'Catalan sentence: {{sentence_cat_Latn}} + + French sentence:' +doc_to_target: '{{sentence_fra_Latn}}' diff --git a/lm_eval/tasks/catalan_bench/flores_ca/flores_ca-gl.yaml b/lm_eval/tasks/catalan_bench/flores_ca/flores_ca-gl.yaml new file mode 100644 index 0000000000..5da7ad5fe4 --- /dev/null +++ b/lm_eval/tasks/catalan_bench/flores_ca/flores_ca-gl.yaml @@ -0,0 +1,7 @@ +# File generated by `create-yamls.py` +include: _flores_common_yaml +task: flores_ca-gl +doc_to_text: 'Catalan sentence: {{sentence_cat_Latn}} + + Galician sentence:' +doc_to_target: '{{sentence_glg_Latn}}' diff --git a/lm_eval/tasks/catalan_bench/flores_ca/flores_ca-it.yaml b/lm_eval/tasks/catalan_bench/flores_ca/flores_ca-it.yaml new file mode 100644 index 0000000000..20f8d99f9f --- /dev/null +++ b/lm_eval/tasks/catalan_bench/flores_ca/flores_ca-it.yaml @@ -0,0 
+1,7 @@ +# File generated by `create-yamls.py` +include: _flores_common_yaml +task: flores_ca-it +doc_to_text: 'Catalan sentence: {{sentence_cat_Latn}} + + Italian sentence:' +doc_to_target: '{{sentence_ita_Latn}}' diff --git a/lm_eval/tasks/catalan_bench/flores_ca/flores_ca-pt.yaml b/lm_eval/tasks/catalan_bench/flores_ca/flores_ca-pt.yaml new file mode 100644 index 0000000000..565f6267c5 --- /dev/null +++ b/lm_eval/tasks/catalan_bench/flores_ca/flores_ca-pt.yaml @@ -0,0 +1,7 @@ +# File generated by `create-yamls.py` +include: _flores_common_yaml +task: flores_ca-pt +doc_to_text: 'Catalan sentence: {{sentence_cat_Latn}} + + Portuguese sentence:' +doc_to_target: '{{sentence_por_Latn}}' diff --git a/lm_eval/tasks/catalan_bench/flores_ca/flores_ca.yaml b/lm_eval/tasks/catalan_bench/flores_ca/flores_ca.yaml new file mode 100644 index 0000000000..4726daa83e --- /dev/null +++ b/lm_eval/tasks/catalan_bench/flores_ca/flores_ca.yaml @@ -0,0 +1,24 @@ +group: flores_ca +task: + - flores_es-ca + - flores_ca-es + - flores_en-ca + - flores_ca-en + - flores_eu-ca + - flores_ca-eu + - flores_pt-ca + - flores_ca-pt + - flores_it-ca + - flores_ca-it + - flores_fr-ca + - flores_ca-fr + - flores_ca-gl + - flores_gl-ca + - flores_ca-de + - flores_de-ca +aggregate_metric_list: + - metric: bleu + aggregation: mean + weight_by_size: false +metadata: + version: 1.0 diff --git a/lm_eval/tasks/catalan_bench/flores_ca/flores_de-ca.yaml b/lm_eval/tasks/catalan_bench/flores_ca/flores_de-ca.yaml new file mode 100644 index 0000000000..af3d0eb493 --- /dev/null +++ b/lm_eval/tasks/catalan_bench/flores_ca/flores_de-ca.yaml @@ -0,0 +1,7 @@ +# File generated by `create-yamls.py` +include: _flores_common_yaml +task: flores_de-ca +doc_to_text: 'German sentence: {{sentence_deu_Latn}} + + Catalan sentence:' +doc_to_target: '{{sentence_cat_Latn}}' diff --git a/lm_eval/tasks/catalan_bench/flores_ca/flores_en-ca.yaml b/lm_eval/tasks/catalan_bench/flores_ca/flores_en-ca.yaml new file mode 100644 index 0000000000..16132ff497 --- /dev/null +++ b/lm_eval/tasks/catalan_bench/flores_ca/flores_en-ca.yaml @@ -0,0 +1,7 @@ +# File generated by `create-yamls.py` +include: _flores_common_yaml +task: flores_en-ca +doc_to_text: 'English sentence: {{sentence_eng_Latn}} + + Catalan sentence:' +doc_to_target: '{{sentence_cat_Latn}}' diff --git a/lm_eval/tasks/catalan_bench/flores_ca/flores_es-ca.yaml b/lm_eval/tasks/catalan_bench/flores_ca/flores_es-ca.yaml new file mode 100644 index 0000000000..e35b715213 --- /dev/null +++ b/lm_eval/tasks/catalan_bench/flores_ca/flores_es-ca.yaml @@ -0,0 +1,7 @@ +# File generated by `create-yamls.py` +include: _flores_common_yaml +task: flores_es-ca +doc_to_text: 'Spanish sentence: {{sentence_spa_Latn}} + + Catalan sentence:' +doc_to_target: '{{sentence_cat_Latn}}' diff --git a/lm_eval/tasks/catalan_bench/flores_ca/flores_eu-ca.yaml b/lm_eval/tasks/catalan_bench/flores_ca/flores_eu-ca.yaml new file mode 100644 index 0000000000..c8be6ee93b --- /dev/null +++ b/lm_eval/tasks/catalan_bench/flores_ca/flores_eu-ca.yaml @@ -0,0 +1,7 @@ +# File generated by `create-yamls.py` +include: _flores_common_yaml +task: flores_eu-ca +doc_to_text: 'Basque sentence: {{sentence_eus_Latn}} + + Catalan sentence:' +doc_to_target: '{{sentence_cat_Latn}}' diff --git a/lm_eval/tasks/catalan_bench/flores_ca/flores_fr-ca.yaml b/lm_eval/tasks/catalan_bench/flores_ca/flores_fr-ca.yaml new file mode 100644 index 0000000000..0d2de77edf --- /dev/null +++ b/lm_eval/tasks/catalan_bench/flores_ca/flores_fr-ca.yaml @@ -0,0 +1,7 @@ +# File generated by 
`create-yamls.py` +include: _flores_common_yaml +task: flores_fr-ca +doc_to_text: 'French sentence: {{sentence_fra_Latn}} + + Catalan sentence:' +doc_to_target: '{{sentence_cat_Latn}}' diff --git a/lm_eval/tasks/catalan_bench/flores_ca/flores_gl-ca.yaml b/lm_eval/tasks/catalan_bench/flores_ca/flores_gl-ca.yaml new file mode 100644 index 0000000000..6ce3eaae5c --- /dev/null +++ b/lm_eval/tasks/catalan_bench/flores_ca/flores_gl-ca.yaml @@ -0,0 +1,7 @@ +# File generated by `create-yamls.py` +include: _flores_common_yaml +task: flores_gl-ca +doc_to_text: 'Galician sentence: {{sentence_glg_Latn}} + + Catalan sentence:' +doc_to_target: '{{sentence_cat_Latn}}' diff --git a/lm_eval/tasks/catalan_bench/flores_ca/flores_it-ca.yaml b/lm_eval/tasks/catalan_bench/flores_ca/flores_it-ca.yaml new file mode 100644 index 0000000000..db811154e5 --- /dev/null +++ b/lm_eval/tasks/catalan_bench/flores_ca/flores_it-ca.yaml @@ -0,0 +1,7 @@ +# File generated by `create-yamls.py` +include: _flores_common_yaml +task: flores_it-ca +doc_to_text: 'Italian sentence: {{sentence_ita_Latn}} + + Catalan sentence:' +doc_to_target: '{{sentence_cat_Latn}}' diff --git a/lm_eval/tasks/catalan_bench/flores_ca/flores_pt-ca.yaml b/lm_eval/tasks/catalan_bench/flores_ca/flores_pt-ca.yaml new file mode 100644 index 0000000000..196295c9e3 --- /dev/null +++ b/lm_eval/tasks/catalan_bench/flores_ca/flores_pt-ca.yaml @@ -0,0 +1,7 @@ +# File generated by `create-yamls.py` +include: _flores_common_yaml +task: flores_pt-ca +doc_to_text: 'Portuguese sentence: {{sentence_por_Latn}} + + Catalan sentence:' +doc_to_target: '{{sentence_cat_Latn}}' diff --git a/lm_eval/tasks/catalan_bench/mgsm_direct_ca.yaml b/lm_eval/tasks/catalan_bench/mgsm_direct_ca.yaml new file mode 100644 index 0000000000..066336a67f --- /dev/null +++ b/lm_eval/tasks/catalan_bench/mgsm_direct_ca.yaml @@ -0,0 +1,25 @@ +task: mgsm_direct_ca +dataset_path: projecte-aina/mgsm_ca +doc_to_target: '{{answer_number|string}}' +doc_to_text: '{% if answer != None %}{{question + "\nResposta: "}}{% else %}{{"Pregunta: " + question + "\nResposta: "}}{% endif %}' +output_type: generate_until +training_split: train +test_split: test +target_delimiter: "" +generation_kwargs: + until: + - "\n\n" + - "\n" +filter_list: + - name: remove_whitespace + filter: + - function: remove_whitespace + - function: take_first +metric_list: + - metric: exact_match + aggregation: mean + higher_is_better: true + ignore_case: true + ignore_punctuation: true +metadata: + version: 1.0 diff --git a/lm_eval/tasks/catalan_bench/openbookqa_ca.yaml b/lm_eval/tasks/catalan_bench/openbookqa_ca.yaml new file mode 100644 index 0000000000..868be75612 --- /dev/null +++ b/lm_eval/tasks/catalan_bench/openbookqa_ca.yaml @@ -0,0 +1,20 @@ +task: openbookqa_ca +dataset_path: projecte-aina/openbookqa_ca +output_type: multiple_choice +training_split: null +validation_split: validation +test_split: test +doc_to_text: question_stem +doc_to_target: "{{choices.label.index(answerKey.lstrip())}}" +doc_to_choice: "{{choices.text}}" +should_decontaminate: true +doc_to_decontamination_query: question_stem +metric_list: + - metric: acc + aggregation: mean + higher_is_better: true + - metric: acc_norm + aggregation: mean + higher_is_better: true +metadata: + version: 1.0 diff --git a/lm_eval/tasks/catalan_bench/parafraseja.yaml b/lm_eval/tasks/catalan_bench/parafraseja.yaml new file mode 100644 index 0000000000..060d488d18 --- /dev/null +++ b/lm_eval/tasks/catalan_bench/parafraseja.yaml @@ -0,0 +1,17 @@ +task: parafraseja +dataset_path: 
projecte-aina/Parafraseja +output_type: multiple_choice +dataset_name: null +test_split: test +training_split: train +validation_split: validation +doc_to_choice: '{{[sentence1+", veritat? No, "+sentence2, sentence1+", veritat? Sí, "+sentence2]}}' +process_docs: !function utils.process_docs_paraphrases +doc_to_text: '' +doc_to_target: label +metric_list: + - metric: acc + aggregation: mean + higher_is_better: true +metadata: + version: 1.0 diff --git a/lm_eval/tasks/catalan_bench/paws_ca.yaml b/lm_eval/tasks/catalan_bench/paws_ca.yaml new file mode 100644 index 0000000000..e736f5c746 --- /dev/null +++ b/lm_eval/tasks/catalan_bench/paws_ca.yaml @@ -0,0 +1,18 @@ +task: paws_ca +dataset_path: projecte-aina/PAWS-ca +dataset_name: null +output_type: multiple_choice +training_split: train +validation_split: validation +test_split: test +process_docs: !function utils.process_docs_paraphrases +doc_to_text: '' +doc_to_target: label +doc_to_choice: '{{[sentence1+", veritat? No, "+sentence2, sentence1+", veritat? Sí, "+sentence2]}}' +target_delimiter: '' +metric_list: + - metric: acc + aggregation: mean + higher_is_better: true +metadata: + version: 1.0 diff --git a/lm_eval/tasks/catalan_bench/phrases_va/_phrases_va_common.yaml b/lm_eval/tasks/catalan_bench/phrases_va/_phrases_va_common.yaml new file mode 100644 index 0000000000..f59a2098ca --- /dev/null +++ b/lm_eval/tasks/catalan_bench/phrases_va/_phrases_va_common.yaml @@ -0,0 +1,24 @@ +tag: phrases_va +dataset_path: gplsi/CA-VA_alignment_test +output_type: generate_until +training_split: null +validation_split: null +test_split: test +fewshot_split: test +num_fewshot: 5 +target_delimiter: ' ' +generation_kwargs: + until: + - "\n" +metric_list: + - metric: bleu + aggregation: bleu + higher_is_better: true + - metric: ter + aggregation: ter + higher_is_better: false + - metric: chrf + aggregation: chrf + higher_is_better: true +metadata: + version: 1.0 diff --git a/lm_eval/tasks/catalan_bench/phrases_va/phrases_ca-va.yaml b/lm_eval/tasks/catalan_bench/phrases_va/phrases_ca-va.yaml new file mode 100644 index 0000000000..fc0e08d5a2 --- /dev/null +++ b/lm_eval/tasks/catalan_bench/phrases_va/phrases_ca-va.yaml @@ -0,0 +1,7 @@ +# File generated by `create-yamls.py` +include: _phrases_va_common.yaml +task: phrases_ca-va +doc_to_text: 'Oració en català: {{ca}} + + Oració en valencià:' +doc_to_target: '{{va}}' diff --git a/lm_eval/tasks/catalan_bench/phrases_va/phrases_va-ca.yaml b/lm_eval/tasks/catalan_bench/phrases_va/phrases_va-ca.yaml new file mode 100644 index 0000000000..5b1a76780a --- /dev/null +++ b/lm_eval/tasks/catalan_bench/phrases_va/phrases_va-ca.yaml @@ -0,0 +1,7 @@ +# File generated by `create-yamls.py` +include: _phrases_va_common.yaml +task: phrases_va-ca +doc_to_text: 'Oració en valencià: {{va}} + + Oració en català:' +doc_to_target: '{{ca}}' diff --git a/lm_eval/tasks/catalan_bench/piqa_ca.yaml b/lm_eval/tasks/catalan_bench/piqa_ca.yaml new file mode 100644 index 0000000000..11e600a7f1 --- /dev/null +++ b/lm_eval/tasks/catalan_bench/piqa_ca.yaml @@ -0,0 +1,21 @@ +task: piqa_ca +dataset_path: projecte-aina/piqa_ca +dataset_name: null +output_type: multiple_choice +training_split: null +validation_split: validation +test_split: null +doc_to_text: "Pregunta: {{goal}}\nResposta:" +doc_to_target: label +doc_to_choice: "{{[sol1, sol2]}}" +should_decontaminate: true +doc_to_decontamination_query: goal +metric_list: + - metric: acc + aggregation: mean + higher_is_better: true + - metric: acc_norm + aggregation: mean + higher_is_better: true 
+metadata: + version: 1.0 diff --git a/lm_eval/tasks/catalan_bench/siqa_ca.yaml b/lm_eval/tasks/catalan_bench/siqa_ca.yaml new file mode 100644 index 0000000000..8a39a37f5c --- /dev/null +++ b/lm_eval/tasks/catalan_bench/siqa_ca.yaml @@ -0,0 +1,16 @@ +task: siqa_ca +dataset_path: projecte-aina/siqa_ca +output_type: multiple_choice +training_split: null +validation_split: validation +test_split: null +doc_to_text: "Pregunta: {{context}} {{question}}\nResposta:" +target_delimiter: " " +doc_to_choice: "{{[answerA, answerB, answerC]}}" +doc_to_target: "{{ (label|int) - 1 }}" +metric_list: + - metric: acc + aggregation: mean + higher_is_better: true +metadata: + version: 1.0 diff --git a/lm_eval/tasks/catalan_bench/teca.yaml b/lm_eval/tasks/catalan_bench/teca.yaml new file mode 100644 index 0000000000..8978c2c969 --- /dev/null +++ b/lm_eval/tasks/catalan_bench/teca.yaml @@ -0,0 +1,18 @@ +task: teca +dataset_path: projecte-aina/teca +dataset_name: null +training_split: train +validation_split: validation +test_split: test +output_type: multiple_choice +process_docs: !function utils.process_doc_nli +doc_to_text: "" +doc_to_target: label +target_delimiter: "" +doc_to_choice: '{{[premise + ", correcte? Sí, " + hypothesis, premise + ", correcte? A més, " + hypothesis, premise + ", correcte? No, " + hypothesis]}}' +metric_list: + - metric: acc + aggregation: mean + higher_is_better: true +metadata: + version: 1.0 diff --git a/lm_eval/tasks/catalan_bench/utils.py b/lm_eval/tasks/catalan_bench/utils.py new file mode 100644 index 0000000000..ced91772ca --- /dev/null +++ b/lm_eval/tasks/catalan_bench/utils.py @@ -0,0 +1,142 @@ +import re +from itertools import product + +import evaluate +import transformers.data.metrics.squad_metrics as squad_metrics + +from lm_eval.utils import general_detokenize + + +def lowercase_first_letter(text): + return text[0].lower() + text[1:] + + +def process_doc_nli(dataset): + def process_fn(doc): + # Detokenize(remove extra whitespaces) + doc["premise"] = general_detokenize(doc["premise"]).strip() + doc["hypothesis"] = general_detokenize(doc["hypothesis"]).strip() + # Remove last punctuation mark in the premise + doc["premise"] = ( + doc["premise"][:-1] + if doc["premise"].endswith((".", ",", "!", "?")) + else doc["premise"] + ) + # Lowercase the first letter in the hypothesis + doc["hypothesis"] = lowercase_first_letter(doc["hypothesis"]) + # Ensure that the hypothesis ends with a dot + doc["hypothesis"] = ( + (doc["hypothesis"] + ".") + if not doc["hypothesis"].endswith(".") + else doc["hypothesis"] + ) + return doc + + return dataset.map(process_fn) + + +def process_results_coqcat(doc, results): + # Get all possible answers and compute the scores + turn_id = len(doc["questions"]) + answers = [doc["answers"]["input_text"][turn_id - 1]] + additional_answers_list = doc.get("additional_answers") + if additional_answers_list: + for key, additional_answers in additional_answers_list.items(): + if additional_answers["input_text"][turn_id - 1].lower() not in map( + str.lower, answers + ): + answers.append(additional_answers["input_text"][turn_id - 1]) + + gold_list = answers + pred = results[0].strip().split("\n")[0] + # import code; code.interact(local=dict(globals(), **locals())) + + f1_sum = 0.0 + em_sum = 0.0 + if len(gold_list) > 1: + for i in range(len(gold_list)): + gold_answers = gold_list[0:i] + gold_list[i + 1 :] + # predictions compared against (n) golds and take maximum + em_sum += max(squad_metrics.compute_exact(a, pred) for a in gold_answers) + f1_sum += 
max(squad_metrics.compute_f1(a, pred) for a in gold_answers) + else: + em_sum += max(squad_metrics.compute_exact(a, pred) for a in gold_list) + f1_sum += max(squad_metrics.compute_f1(a, pred) for a in gold_list) + # import code; code.interact(local=dict(globals(), **locals())) + return { + "em": em_sum / max(1, len(gold_list)), + "f1": f1_sum / max(1, len(gold_list)), + } + + +def process_results_qa(doc, results): + preds = results[0] + reference = doc["answers"][0]["text"] + # import code; code.interact(local=dict(globals(), **locals())) + f1_sum = squad_metrics.compute_f1(reference, preds) + exact_match = squad_metrics.compute_exact(reference, preds) + return {"f1": f1_sum, "exact_match": exact_match} + + +def process_doc_cabreu(dataset): + def process_fn(doc): + # Remove duplicate spaces + doc["content"] = re.sub(r" +", " ", doc["content"]) + for summary_type, index in product( + ["abstractive", "extractive", "extreme"], ["a1", "a2", "a3"] + ): + doc["summaries"][summary_type][index] = re.sub( + r" +", " ", doc["summaries"][summary_type][index] + ) + return doc + + return dataset.map(process_fn) + + +def process_docs_paraphrases(dataset): + empty_docs = [] + + def _process_doc(doc): + if doc["sentence1"] not in [None, ""] and doc["sentence2"] not in [None, ""]: + doc["sentence1"] = general_detokenize(doc["sentence1"]).strip() + doc["sentence2"] = general_detokenize(doc["sentence2"]).strip() + # Remove final punctuation mark in the first sentence + if doc["sentence1"].endswith((".", ",", ";")): + doc["sentence1"] = doc["sentence1"][:-1] + # Start the second sentence in lowercase (to be used after "Yes, ...") + doc["sentence2"] = lowercase_first_letter(doc["sentence2"]) + return doc + else: + empty_docs.append(doc) + return doc + + return dataset.filter( + lambda doc: doc["sentence1"] not in [None, ""] + and doc["sentence2"] not in [None, ""] + ).map(_process_doc) + + +def process_docs_copa_ca(dataset): + def _process_doc(doc): + doc["choice1"] = lowercase_first_letter(doc["choice1"]) + doc["choice2"] = lowercase_first_letter(doc["choice2"]) + return doc + + return dataset.map(_process_doc) + + +def rouge1(items): + """ + # passthrough for efficiency + """ + return items + + +def rouge1_agg(items): + """ + Higher is better + """ + refs = list(zip(*items))[0] + preds = list(zip(*items))[1] + rouge_scorer = evaluate.load("rouge") + return rouge_scorer.compute(predictions=preds, references=refs)["rouge1"] diff --git a/lm_eval/tasks/catalan_bench/wnli_ca.yaml b/lm_eval/tasks/catalan_bench/wnli_ca.yaml new file mode 100644 index 0000000000..d4deec5c04 --- /dev/null +++ b/lm_eval/tasks/catalan_bench/wnli_ca.yaml @@ -0,0 +1,14 @@ +task: wnli_ca +dataset_path: projecte-aina/wnli-ca +dataset_name: null +output_type: multiple_choice +training_split: train +validation_split: validation +test_split: null +doc_to_text: "{{sentence1}}\nPregunta: {{sentence2}} Cert o Fals?\nResposta:" +doc_to_target: label +doc_to_choice: ["Fals", "Cert"] +metric_list: + - metric: acc +metadata: + version: 1.0 diff --git a/lm_eval/tasks/catalan_bench/xnli_ca.yaml b/lm_eval/tasks/catalan_bench/xnli_ca.yaml new file mode 100644 index 0000000000..44f0f44302 --- /dev/null +++ b/lm_eval/tasks/catalan_bench/xnli_ca.yaml @@ -0,0 +1,19 @@ +task: xnli_ca +dataset_path: projecte-aina/xnli-ca +dataset_name: null +include: ../xnli/xnli_common_yaml +output_type: multiple_choice +doc_to_choice: '{{[premise+", correcte? Sí, "+hypothesis,premise+", correcte? A més, + "+hypothesis,premise+", correcte? 
No, "+hypothesis]}}' +doc_to_text: '' +target_delimiter: '' +process_docs: !function utils.process_doc_nli +training_split: null +validation_split: validation +doc_to_target: label +metric_list: + - metric: acc + aggregation: mean + higher_is_better: true +metadata: + version: 1.0 diff --git a/lm_eval/tasks/catalan_bench/xquad_ca.yaml b/lm_eval/tasks/catalan_bench/xquad_ca.yaml new file mode 100644 index 0000000000..9b72c7da74 --- /dev/null +++ b/lm_eval/tasks/catalan_bench/xquad_ca.yaml @@ -0,0 +1,24 @@ +task: xquad_ca +dataset_path: projecte-aina/xquad-ca +dataset_name: null +output_type: generate_until +doc_to_text: "Context: {{context}}\n\nPregunta: {{question}}\n\nResposta:" +doc_to_target: '{{answers[0]["text"]}}' +validation_split: null +test_split: test +target_delimiter: ' ' +process_results: !function utils.process_results_qa +generation_kwargs: + until: + - "\n" + do_sample: false + temperature: 0.0 +metric_list: + - metric: exact_match + aggregation: mean + higher_is_better: true + - metric: f1 + aggregation: mean + higher_is_better: true +metadata: + version: 1.0 diff --git a/lm_eval/tasks/catalan_bench/xstorycloze_ca.yaml b/lm_eval/tasks/catalan_bench/xstorycloze_ca.yaml new file mode 100644 index 0000000000..61a7c2991f --- /dev/null +++ b/lm_eval/tasks/catalan_bench/xstorycloze_ca.yaml @@ -0,0 +1,17 @@ +task: xstorycloze_ca +dataset_path: projecte-aina/xstorycloze_ca +dataset_name: ca +output_type: multiple_choice +training_split: train +validation_split: eval +doc_to_text: "{{[input_sentence_1, input_sentence_2, input_sentence_3, input_sentence_4]|join(' ')}}" +doc_to_target: "{{answer_right_ending-1}}" +doc_to_choice: "{{[sentence_quiz1, sentence_quiz2]}}" +should_decontaminate: true +doc_to_decontamination_query: "{{[input_sentence_1, input_sentence_2, input_sentence_3, input_sentence_4]|join(' ')}}" +metric_list: + - metric: acc + aggregation: mean + higher_is_better: true +metadata: + version: 1.0