From 7cc94210c90b5e73818459458d4aa91e53b64234 Mon Sep 17 00:00:00 2001
From: Barbara Plank
Date: Sun, 1 Sep 2024 15:37:09 +0200
Subject: [PATCH] Update projects.md

---
 _pages/projects.md | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/_pages/projects.md b/_pages/projects.md
index 3262597..6f2a450 100644
--- a/_pages/projects.md
+++ b/_pages/projects.md
@@ -150,6 +150,8 @@ MSc/BSc thesis research vectors:
 
 * *Transfer or translate: how to better work with dialectal data.* Demands for generalizing NLP pipelines to dialectal data are on the rise. Given current LLMs trained on hundreds of languages, there are two common approaches. The first is to translate (or normalize) dialectal data into its mainstream counterpart and apply existing pipelines to the translated text. This approach benefits from the larger amount of unannotated and annotated data in the mainstream variant but suffers from error propagation through the pipeline. The second, transfer-based approach is to annotate a small amount of dialectal data and few-shot transfer (finetune) models on the dialect. This requires more dialectal annotation as well as collecting unannotated dialectal data. Reference: [Zampieri et al. 2020](https://helda.helsinki.fi/server/api/core/bitstreams/dd1636da-66ef-4e2d-bdb7-19c0b27080f3/content). For a BSc thesis, you would choose an NLP task (e.g., syntactic or semantic parsing, sentiment or stance detection, QA or summarization, etc.) and a specific dialect, compare the performance of few-shot versus translation approaches quantitatively, and conduct a qualitative error analysis on the difficult cases. For an MSc thesis, the research needs to scale up either to multiple dialects (within the same or across different language families) or to multiple NLP tasks. Level: BSc or MSc.
 
+ * *To What Degree do LLMs Understand Bavarian Dialect Variants?* In this project, we aim to comprehensively evaluate existing multilingual and German LLMs on Bavarian dialect variants. Your task is to come up with a set of evaluation criteria to test existing LLMs in a zero-shot and few-shot manner, starting from existing benchmarks but aiming to go beyond them, systematically comparing LLMs with traditional fine-tuning approaches, and finding out when and why certain methods struggle. Techniques: In-depth Evaluation, LLMs, Fine-tuning, Behavioral Testing. Level: MSc.
+
 * *Language Modeling of Historical Non-Standard Language Documents.* Digitalisation can provide access to valuable historical information, especially for non-standard languages and dialects. In this project, you test and build a prototype for the digitalisation of historical documents using recent visual representation-based methods. The project includes: data gathering and annotation, model evaluation and improvement (e.g., via augmentation methods). References: [Salesky et al., 2021](https://aclanthology.org/2021.emnlp-main.576/), [Borenstein et al., 2023](https://aclanthology.org/2023.emnlp-main.7.pdf). Level: MSc.
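
A minimal sketch of the zero-shot versus few-shot probing idea behind the Bavarian-dialect project added above, assuming the Hugging Face `transformers` library; the model name, the Bavarian/Standard German sentences, and the in-context pairs are placeholders, not part of the project description:

```python
# Sketch: compare how well a causal LM models Bavarian vs. Standard German text,
# zero-shot and with a few in-context dialect/standard pairs, using perplexity.
import math

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # placeholder; swap in a multilingual or German LLM

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()


def perplexity(prefix: str, target: str) -> float:
    """Perplexity of `target` given `prefix` (prefix tokens are excluded from the loss).

    Approximation: assumes the prefix tokenisation is unchanged when the target is appended.
    """
    full = tokenizer(prefix + target, return_tensors="pt")
    labels = full["input_ids"].clone()
    if prefix:
        n_prefix = tokenizer(prefix, return_tensors="pt")["input_ids"].shape[1]
        labels[:, :n_prefix] = -100  # -100 = positions ignored by the LM loss
    with torch.no_grad():
        loss = model(**full, labels=labels).loss
    return math.exp(loss.item())


# Toy Bavarian / Standard German pair (placeholders).
bavarian = "I mog di."
german = "Ich mag dich."

# Dialect/standard pairs used as in-context examples in the few-shot condition.
few_shot_prefix = (
    "Boarisch: Servus, wia geht's da? Deutsch: Hallo, wie geht es dir?\n"
    "Boarisch: Des passt scho. Deutsch: Das ist in Ordnung.\n"
)

print("zero-shot Bavarian:", perplexity("", bavarian))
print("zero-shot German:  ", perplexity("", german))
print("few-shot  Bavarian:", perplexity(few_shot_prefix, bavarian))
```

Perplexity is only a crude probe of dialect coverage; a thesis along these lines would complement it with task-based evaluation on the existing benchmarks mentioned in the project description and with a fine-tuned baseline for comparison.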