diff --git a/_pages/projects.md b/_pages/projects.md
index 5960ea7..9c8f961 100644
--- a/_pages/projects.md
+++ b/_pages/projects.md
@@ -122,7 +122,7 @@ MSc/BSc thesis research vectors:
 
 * :hourglass_flowing_sand: *Slot and Intent Detection for Low-Resource Dialects.* Digital assistants are becoming wide-spread, yet current technology covers only a limited set of languages. How can we best do zero-shot transfer to low-resource language variants without standard orthography? [Reference: van der Goot et al., 2021](https://aclanthology.org/2021.naacl-main.197.pdf) and [VarDial 2023 SID4LR](https://sites.google.com/view/vardial-2023/shared-tasks). Create a new evaluation dataset of a low-resource language variant you speak, and investigate how to best transfer to it. Topics: Transfer Learning, Cross-linguality, Dataset annotation (Particularly suited for students interested in covering their own language or dialect not yet covered by existing systems including local dialects, e.g. Austrian, Low Saxon, Sardinian dialects or others). Level: MSc or BSc.
 
- * *Creating a dialectal dependency treebank or POS-tagged corpus.* Create a small Universal Dependencies (UD; [de Marneffe et al. 2021](https://direct.mit.edu/coli/article/47/2/255/98516/Universal-Dependencies)) treebank in a dialect, regional language or other low-resource language that you are familiar with. For a less time-intensive project, it is also possible to only annotate part-of-speech (POS) tags (otherwise: POS tags + dependencies) and complement the project with something else. This project requires a strong interest in linguistics and syntax. You will need to read up on UD's annotation guidelines and independently seek out relevant linguistic literature on your chosen language. You also evaluate parsers/POS taggers on your new dataset in a cross-lingual transfer set-up and, time permitting, you might also train your own parsers. Ideally, the project leads to contributing a new treebank to the UD project. Examples of similar corpora: [Siewert et al. 2021](https://aclanthology.org/2021.konvens-1.25/), [Hollenstein & Aepli 2014](https://aclanthology.org/W14-5310/), [Cassidy et al. 2022](https://aclanthology.org/2022.acl-long.473/), [Lusito & Maillard 2021](https://aclanthology.org/2021.udw-1.10/). [Tutorial for UD newcomers](https://unidive.lisn.upsaclay.fr/doku.php?id=other-events:webinar-1). Level: BSc or MSc.
+ * :hourglass_flowing_sand: *Creating a dialectal dependency treebank or POS-tagged corpus.* Create a small Universal Dependencies (UD; [de Marneffe et al. 2021](https://direct.mit.edu/coli/article/47/2/255/98516/Universal-Dependencies)) treebank in a dialect, regional language, or other low-resource language that you are familiar with. For a less time-intensive project, it is also possible to only annotate part-of-speech (POS) tags (otherwise: POS tags + dependencies) and complement the project with something else. This project requires a strong interest in linguistics and syntax. You will need to read up on UD's annotation guidelines and independently seek out relevant linguistic literature on your chosen language. You will also evaluate parsers/POS taggers on your new dataset in a cross-lingual transfer set-up and, time permitting, you might also train your own parsers. Ideally, the project leads to contributing a new treebank to the UD project. Examples of similar corpora: [Siewert et al. 2021](https://aclanthology.org/2021.konvens-1.25/), [Hollenstein & Aepli 2014](https://aclanthology.org/W14-5310/), [Cassidy et al. 2022](https://aclanthology.org/2022.acl-long.473/), [Lusito & Maillard 2021](https://aclanthology.org/2021.udw-1.10/). [Tutorial for UD newcomers](https://unidive.lisn.upsaclay.fr/doku.php?id=other-events:webinar-1). Level: BSc or MSc.
 
 - :hourglass_flowing_sand: *Transfer or translate: how to better work with dialectal data.* Demands for generalizing NLP pipelines to dialectal data are on the rise. Given current LLMs trained in hundreds of languages, there are two common approaches. The first approach is to translate (or normalize) dialectal data to its mainstream counterpart and apply pipelines to the translated mainstream counterpart. Such an approach can benefit from the bigger amount of unannotated and annotated data in the mainstream variant but suffers from error propagation in the pipeline. The second transfer approach is to annotate a small amount of dialectal data and few-shot transfer (finetune) models on the dialect. This involves more dialectal annotation as well as collected unannotated dialectal data. Reference: [Zampieri et al. 2020](https://helda.helsinki.fi/server/api/core/bitstreams/dd1636da-66ef-4e2d-bdb7-19c0b27080f3/content). For a BSc thesis, you would choose an NLP task (e.g., syntactic or semantic parsing, sentiment or stance detection, QA or summarization, etc.) and a specific dialect, compare performances of fewshot versus translation approaches quantitatively, and conduct a qualitative error analysis on the difficult cases. For MSc, the research needs to scale up either to multiple dialects (in the same or across different language families) or to multiple NLP tasks. Level: BSc or MSc.
 
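For the transfer set-ups named in the *Slot and Intent Detection* and *Transfer or translate* items above, a minimal sketch of the zero-shot baseline might look as follows; the checkpoint name and test file below are hypothetical placeholders, and any sequence-classification task would do:

```python
# Minimal sketch of a zero-shot transfer baseline: a classifier trained on the
# standard (mainstream) variety is applied directly to dialectal test data.
# The checkpoint name and the TSV file are hypothetical placeholders.
from transformers import pipeline

clf = pipeline("text-classification", model="some-standard-variety-classifier")

correct = total = 0
with open("dialect_test.tsv", encoding="utf-8") as f:  # lines: text<TAB>gold_label
    for line in f:
        text, gold = line.rstrip("\n").split("\t")
        pred = clf(text)[0]["label"]  # top predicted label
        correct += pred == gold
        total += 1

print(f"zero-shot accuracy on the dialect: {correct / total:.3f}")
# The translate-then-apply alternative would first normalize each dialectal
# sentence to the standard variety before calling clf; the few-shot variant
# would instead fine-tune clf on a small annotated dialect sample.
```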
@@ -137,9 +137,9 @@ MSc/BSc thesis research vectors:
 
 - :hourglass_flowing_sand: *Computational Job Market Analysis.* Job postings are a rich resource to understand the dynamics of the labor market including which skills are demanded, which is also important for an educational viewpoint. Recently, the emerging line of work on computational job market analysis or NLP for human resources has started to provide data resources and models for automatic job posting analysis, such as the identification and extraction of skills. See references of MultiSkill project. For students interested in real-world applications, this theme provides multiple thesis projects including but not limited to: an in-depth analysis of existing data set and models, researching implicit skill extraction or cross-domain transfer learning to adapt skill and knowledge extraction to data sources other than job postings like patents or scientific articles. [See references of MultiSkill project](#multiskill). See also [Bhola et al., 2020](https://aclanthology.org/2020.coling-main.513.pdf) and [Gnehm et al. 2021](https://www.zora.uzh.ch/id/eprint/230653/1/2022.nlpcss_1.2.pdf) and our [own ESCOXLM-R model](https://aclanthology.org/2023.acl-long.662.pdf). Level: BSc or MSc.
 
-- :hourglass_flowing_sand: *Climate Change Insights through NLP*. Climate change is a pressing issue internationally that is receiving more and more attention everyday. It is influencing regulations and decision-making in various parts of society such as politics, agriculture, business, and it is discussed extensively on social media. For students interested in real-world societal applications, this project aims to contribute insights on the discussion surrounding climate change on social media by examining discourse from a social media platform. The data will have to be collected (potentially from existing sources), cleaned, and analyzed using NLP techniques to examine various aspects or features of interest such as stance, sentiment, the extraction of key players, etc. References: [Luo et al., 2020](https://aclanthology.org/2020.findings-emnlp.296v2.pdf), [Stede & Patz, 2021](https://aclanthology.org/2021.nlp4posimpact-1.2.pdf), [Vaid et al., 2022](https://aclanthology.org/2022.acl-srw.35.pdf). Level: BSc or MSc.
+- *Climate Change Insights through NLP*. Climate change is a pressing international issue that is receiving more and more attention every day. It influences regulations and decision-making in many parts of society, such as politics, agriculture, and business, and it is discussed extensively on social media. For students interested in real-world societal applications, this project aims to contribute insights into the discussion surrounding climate change by examining discourse from a social media platform. The data will have to be collected (potentially from existing sources), cleaned, and analyzed using NLP techniques to examine various aspects of interest such as stance, sentiment, or the key players involved. References: [Luo et al., 2020](https://aclanthology.org/2020.findings-emnlp.296v2.pdf), [Stede & Patz, 2021](https://aclanthology.org/2021.nlp4posimpact-1.2.pdf), [Vaid et al., 2022](https://aclanthology.org/2022.acl-srw.35.pdf). Level: BSc or MSc.
 
-- *Better Benchmarks / Mining for Errors in Annotated Datasets.* Benchmark datasets are essential in empirical research, but even widely-used annotated datasets contain mistakes, as annotators inevitably make mistakes (e.g. annotation inconsistencies). There are several lines of work in this direction. On the one site, annotation error detection methods provide a suite of methods to detect errors in existing datasets (cf. [Klie et al. 2023](https://direct.mit.edu/coli/article/49/1/157/113280/Annotation-Error-Detection-Analyzing-the-Past-and), [Weber & Plank 2023](https://aclanthology.org/2023.findings-acl.562/)), including tools such as data maps ([Swayamdipta et al. 2020](https://aclanthology.org/people/s/swabha-swayamdipta/)). On the other side, there is work on inspecting existing datasets in revision efforts that exist for English NER in the past year (cf. [Reiss et al. 2020](https://aclanthology.org/2020.conll-1.16/), [Rücker & Akbik 2023](https://aclanthology.org/2023.emnlp-main.533.pdf)). The goal of projects on this theme can be:
+- :hourglass_flowing_sand: *Better Benchmarks / Mining for Errors in Annotated Datasets.* Benchmark datasets are essential in empirical research, but even widely used annotated datasets contain errors, since annotators inevitably make mistakes (e.g., annotation inconsistencies). There are several lines of work in this direction. On the one side, annotation error detection provides a suite of methods to detect errors in existing datasets (cf. [Klie et al. 2023](https://direct.mit.edu/coli/article/49/1/157/113280/Annotation-Error-Detection-Analyzing-the-Past-and), [Weber & Plank 2023](https://aclanthology.org/2023.findings-acl.562/)), including tools such as data maps ([Swayamdipta et al. 2020](https://aclanthology.org/2020.emnlp-main.746/)). On the other side, there are recent revision efforts that inspect and correct existing datasets, notably for English NER (cf. [Reiss et al. 2020](https://aclanthology.org/2020.conll-1.16/), [Rücker & Akbik 2023](https://aclanthology.org/2023.emnlp-main.533.pdf)). The goal of projects on this theme can be:
 a) (MSc level) to investigate error detection methods in novel scenarios (new benchmarks, new applications, and/or create a new error detection dataset),
 b) (MSc or BSc level) extend revision efforts on NER to other languages. For the latter, for a BSc thesis, your task includes improving a benchmark dataset with iterations of sanity checks and revisions and comparing NLP models on the original versus revised versions. For MSc, you could extend either by incorporating Annotation Error Detection methods (see previous part) or conducting additional evaluations on multiple downstream NLP tasks.
 c) (MSc level) Checking the annotation consistency of non-standardized language data. Automatic methods for finding potential inconsistencies in annotations typically rely on consistent orthographies (e.g., detecting sequences that occur multiple times in a corpus but have received different annotations; [Dickinson & Meurers 2003](https://aclanthology.org/E03-1068/)). When text is written in a language variety without a standardized orthography, such methods may no longer work well because of spelling differences between the repeated sequences. Your task is to extend such approaches to detect errors in existing datasets to be more robust to orthographic variation and/or to investigate how well annotation error detection methods that do not directly depend on orthographic matches work (cf. [Klie et al. 2023](https://direct.mit.edu/coli/article/49/1/157/113280/Annotation-Error-Detection-Analyzing-the-Past-and)). The target dataset would ideally be a dialectal dataset currently under development at the lab (this requires familiarity with German dialects and an interest in syntax).
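To make the consistency check in c) concrete, below is a minimal sketch of variation detection in the spirit of Dickinson & Meurers (2003) for a POS-tagged corpus; the toy sentences and the fixed n-gram length are illustrative, and the thesis itself would relax the exact string match so that orthographic variants of the same sequence are still compared:

```python
from collections import defaultdict

# Toy POS-tagged corpus: one list of (token, tag) pairs per sentence.
corpus = [
    [("I", "PRON"), ("saw", "VERB"), ("her", "PRON"), ("duck", "NOUN")],
    [("I", "PRON"), ("saw", "VERB"), ("her", "PRON"), ("duck", "VERB")],
]

def variation_ngrams(sentences, n=3):
    """Map each word n-gram to the set of tag sequences it received."""
    taggings = defaultdict(set)
    for sent in sentences:
        for i in range(len(sent) - n + 1):
            window = sent[i:i + n]
            words = tuple(word for word, _ in window)
            tags = tuple(tag for _, tag in window)
            taggings[words].add(tags)
    # n-grams annotated in more than one way are candidate errors.
    return {words: tags for words, tags in taggings.items() if len(tags) > 1}

for words, tags in variation_ngrams(corpus).items():
    print(" ".join(words), "->", sorted(tags))
```

On the toy corpus this flags "saw her duck", which was tagged once with a NOUN and once with a VERB reading; for non-standardized data, the `words` key would be replaced by a normalized or fuzzy-matched form.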
@@ -175,7 +175,7 @@ Start with the CNN/DM dataset. Project can be extended to other genres and langu
 
 - :hourglass_flowing_sand: *Understanding Political Party Manifestos*. Recent advancements in natural language processing have already changed adjacent fields like political science and communication research. A question that has always been relevant in these and other social sciences has been how to turn textual data into ecologically valid numerical representations. In the field of party communication, the question of how one can turn party manifestos into numerical vector representations has been studied by the Manifesto Project for decades. The project codes - with huge human work input - some dozens of political issue categories for thousands of party manifestos. This project aims to use recent advances in natural language inference and zero-shot classification to reproduce the human codings produced by the Manifesto Project. Level: MSc (could be adapted to BSc). References: Intro to the political science political theory behind the Manifesto Project (Chapters 1-3): [Lemmer 2023](https://doi.org/10.25593/978-3-96147-671-8); Paper on Natural Language Inference: [Laurer et al. 2024](https://www.cambridge.org/core/journals/political-analysis/article/less-annotating-more-classifying-addressing-the-data-scarcity-issue-of-supervised-machine-learning-with-deep-transfer-learning-and-bertnli/05BB05555241762889825B080E097C27); [Manifesto Project website](https://manifesto-project.wzb.eu/).
 
-- :hourglass_flowing_sand: *Characteristics of language between amateur and expert poetists.* Writing is an art - a beautiful and moving poem has various characteristics which readers relate to and draws their mind into an imaginative tale. This project aims to better understand and characterize writing styles of amaetur and expert poets. The first step would be constructing a corpus of poems or prose of experienced and amateur writings from online sources, checking carefully for copyright. The data would have to be clean and preprocessed. Afterwards, various NLP techniques such as sentiment analysis or analysis of metaphors will be used to better understand and characterize various writing styles. If time allows, the corpus could be expanded to across genres and time periods for a more comprehensive analysis of writing style. References: [Kao & Jurafsky 2015](https://aclanthology.org/2015.lilt-12.3.pdf), [Kao & Jurafsky 2012](https://aclanthology.org/W12-2502.pdf), [Gopidi & Alam 2019](https://aclanthology.org/W19-4702.pdf). Level: BSc or MSc
+- *Characteristics of language between amateur and expert poets.* Writing is an art: a beautiful and moving poem has qualities that readers relate to, drawing their minds into an imaginative tale. This project aims to better understand and characterize the writing styles of amateur and expert poets. The first step is constructing a corpus of poems or prose by experienced and amateur writers from online sources, checking carefully for copyright. The data then has to be cleaned and preprocessed. Afterwards, various NLP techniques, such as sentiment analysis or the analysis of metaphors, are used to characterize the different writing styles. If time allows, the corpus could be expanded across genres and time periods for a more comprehensive analysis of writing style. References: [Kao & Jurafsky 2015](https://aclanthology.org/2015.lilt-12.3.pdf), [Kao & Jurafsky 2012](https://aclanthology.org/W12-2502.pdf), [Gopidi & Alam 2019](https://aclanthology.org/W19-4702.pdf). Level: BSc or MSc.
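As a possible starting point for the quantitative side of this poetry project, here is a minimal sketch of surface style statistics over two hypothetical mini-corpora; the sentiment and metaphor analyses mentioned above would be layered on top of features like these:

```python
import re
from statistics import mean

def style_features(poems):
    """Crude surface statistics for a list of poem strings (illustrative only)."""
    tokens = [re.findall(r"[\w']+", poem.lower()) for poem in poems]
    all_tokens = [tok for toks in tokens for tok in toks]
    return {
        "avg_poem_length": mean(len(toks) for toks in tokens),
        "type_token_ratio": len(set(all_tokens)) / len(all_tokens),
        "avg_word_length": mean(len(tok) for tok in all_tokens),
    }

# Hypothetical mini-corpora; the real data would be collected from online
# sources, copyright permitting, and cleaned first.
amateur_poems = ["roses are red violets are blue", "the sun is bright the sky is blue"]
expert_poems = ["shall i compare thee to a summer's day", "do not go gentle into that good night"]

print("amateur:", style_features(amateur_poems))
print("expert:", style_features(expert_poems))
```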