From fc7947eeec6c32644f2eaa4e57ae9a0ed592fa59 Mon Sep 17 00:00:00 2001
From: Barbara Plank
Date: Sun, 1 Sep 2024 15:23:06 +0200
Subject: [PATCH] Update projects.md

---
 _pages/projects.md | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/_pages/projects.md b/_pages/projects.md
index 9a1a7db..c2b267e 100644
--- a/_pages/projects.md
+++ b/_pages/projects.md
@@ -146,10 +146,11 @@ MSc/BSc thesis research vectors:
 * *Creating a dialectal dependency treebank or POS-tagged corpus.* Create a small Universal Dependencies (UD; [de Marneffe et al. 2021](https://direct.mit.edu/coli/article/47/2/255/98516/Universal-Dependencies)) treebank in a dialect, regional language or other low-resource language that you are familiar with. For a less time-intensive project, it is also possible to only annotate part-of-speech (POS) tags (otherwise: POS tags + dependencies) and complement the project with something else. This project requires a strong interest in linguistics and syntax. You will need to read up on UD's annotation guidelines and independently seek out relevant linguistic literature on your chosen language. You also evaluate parsers/POS taggers on your new dataset in a cross-lingual transfer set-up and, time permitting, you might also train your own parsers. Ideally, the project leads to contributing a new treebank to the UD project. Examples of similar corpora: [Siewert et al. 2021](https://aclanthology.org/2021.konvens-1.25/), [Hollenstein & Aepli 2014](https://aclanthology.org/W14-5310/), [Cassidy et al. 2022](https://aclanthology.org/2022.acl-long.473/), [Lusito & Maillard 2021](https://aclanthology.org/2021.udw-1.10/). [Tutorial for UD newcomers](https://unidive.lisn.upsaclay.fr/doku.php?id=other-events:webinar-1). Level: BSc or MSc.
- - *Transfer or translate: how to better work with dialectal data.* Demands for generalizing NLP pipelines to dialectal data are on the rise. Given current LLMs trained in hundreds of languages, there are two common approaches. The first approach is to translate (or normalize) dialectal data to its mainstream counterpart and apply pipelines to the translated mainstream counterpart. Such an approach can benefit from the bigger amount of unannotated and annotated data in the mainstream variant but suffers from error propagation in the pipeline. The second transfer approach is to annotate a small amount of dialectal data and few-shot transfer (finetune) models on the dialect. This involves more dialectal annotation as well as collected unannotated dialectal data. Reference: [Zampieri et al. 2020](https://helda.helsinki.fi/server/api/core/bitstreams/dd1636da-66ef-4e2d-bdb7-19c0b27080f3/content). For a BSc thesis, you would choose an NLP task (e.g., syntactic or semantic parsing, sentiment or stance detection, QA or summarization, etc.) and a specific dialect, compare performances of fewshot versus translation approaches quantitatively, and conduct a qualitative error analysis on the difficult cases. For MSc, the research needs to scale up either to multiple dialects (in the same or across different language families) or to multiple NLP tasks. Level: BSc or MSc.
+ * *Lexical Resources for Dialects.* NLP for dialect languages is an intriguing area of research due to the lack of resources (low-resources languages) and lack of standardization (high variance). In this project, the goal is build dialect dictionaries by annotating words and phrases with respect to different linguistic properties including parts-of-speech, cases and grammatical genders. For this, we have collected raw data in several German dialects. It is therefore important that the student is familiar with one or more of the following dialects: Alemannic (Alemannisch), Palatinate (Pfälzisch), Frisian (Friesisch), Saterland Frisian (Saterfriesisch), Bavarian (Bairisch), Low German (Niederdeutsch/Plattdeutsch), Colognian (Kölsch). The second part consists of using these dictionaries to either 1) systematically investigate the linguistic dialect competence of pre-trained language models or 2) develop IR-specific resources (dialect stopword lists, dialect lemmatizers, dialect stemmers). This project does not need access to GPUs and is suitable for BSc and MSc students.
- - *Lexical Resources for Dialects.* NLP for dialect languages is an intriguing area of research due to the lack of resources (low-resources languages) and lack of standardization (high variance). In this project, the goal is build dialect dictionaries by annotating words and phrases with respect to different linguistic properties including parts-of-speech, cases and grammatical genders. For this, we have collected raw data in several German dialects. It is therefore important that the student is familiar with one or more of the following dialects: Alemannic (Alemannisch), Palatinate (Pfälzisch), Frisian (Friesisch), Saterland Frisian (Saterfriesisch), Bavarian (Bairisch), Low German (Niederdeutsch/Plattdeutsch), Colognian (Kölsch). The second part consists of using these dictionaries to either 1) systematically investigate the linguistic dialect competence of pre-trained language models or 2) develop IR-specific resources (dialect stopword lists, dialect lemmatizers, dialect stemmers). This project does not need access to GPUs and is suitable for BSc and MSc students.
+ * *Transfer or translate: how to better work with dialectal data.* Demands for generalizing NLP pipelines to dialectal data are on the rise. Given current LLMs trained in hundreds of languages, there are two common approaches. The first approach is to translate (or normalize) dialectal data to its mainstream counterpart and apply pipelines to the translated mainstream counterpart. Such an approach can benefit from the bigger amount of unannotated and annotated data in the mainstream variant but suffers from error propagation in the pipeline. The second transfer approach is to annotate a small amount of dialectal data and few-shot transfer (finetune) models on the dialect. This involves more dialectal annotation as well as collected unannotated dialectal data. Reference: [Zampieri et al. 2020](https://helda.helsinki.fi/server/api/core/bitstreams/dd1636da-66ef-4e2d-bdb7-19c0b27080f3/content). For a BSc thesis, you would choose an NLP task (e.g., syntactic or semantic parsing, sentiment or stance detection, QA or summarization, etc.) and a specific dialect, compare performances of fewshot versus translation approaches quantitatively, and conduct a qualitative error analysis on the difficult cases. For MSc, the research needs to scale up either to multiple dialects (in the same or across different language families) or to multiple NLP tasks. Level: BSc or MSc.
+ ### V2: High-Quality Information Extraction and Retrieval, Data-centric NLP