Projects
Data4Transparency
Text2Tech
BIFOLD
Cora4NLP
BBDC2
DEEPLEE
PLASS
SIM3S
Publications
InterroLang: Exploring NLP Models and Datasets through Dialogue-based Explanations
Factuality Detection using Machine Translation - a Use Case for German Clinical Text
Inseq: An Interpretability Toolkit for Sequence Generation Models
Neural Machine Translation Methods for Translating Text to Sign Language Glosses
Saliency Map Verbalization: Comparing Feature Importance Representations from Model-free and Instruction-based Methods
Efficient Language Model Training through Cross-Lingual and Progressive Transfer Learning
MultiTACRED: A Multilingual Version of the TAC Relation Extraction Dataset
VendorLink: An NLP approach for Identifying & Linking Vendor Migrants & Potential Aliases on Darknet Markets
Findings of the WMT 2022 Biomedical Translation Shared Task: Monolingual Clinical Case Reports
Multilingual Relation Classification via Efficient and Effective Prompting
Full-Text Argumentation Mining on Scientific Publications
Neighborhood Contrastive Learning for Scientific Document Representations with Citation Embeddings
A Linguistically Motivated Test Suite to Semi-Automatically Evaluate German–English Machine Translation Output
An Annotated Corpus of Textual Explanations for Clinical Decision Support
Cross-lingual Approaches for the Detection of Adverse Drug Reactions in German from a Patient's Perspective
Generating Extended and Multilingual Summaries with Pre-trained Transformers
MobASA: Corpus for Aspect-based Sentiment Analysis and Social Inclusion in the Mobility Domain
Subjective Text Complexity Assessment for German
Claim Extraction and Law Matching for COVID-19-related Legislation
Specialized Document Embeddings for Aspect-based Similarity of Research Papers
HiStruct+: Improving Extractive Text Summarization with Hierarchical Structure Information
Perceptual Quality Dimensions of Machine-Generated Text with a Focus on Machine Translation
A Comparative Study of Pre-trained Encoders for Low-Resource Named Entity Recognition
Why only Micro-$F_1$? Class Weighting of Measures for Relation Classification
Detecting Covariate Drift with Explanations
MobIE: A German Dataset for Named Entity Recognition, Entity Linking and Relation Extraction in the Mobility Domain
Evaluating Document Representations for Content-based Legal Literature Recommendations
Aspect-based Document Similarity for Research Papers
Defx at SemEval-2020 Task 6: Joint Extraction of Concepts and Relations for Definition Extraction
Pairwise Multi-Class Document Classification for Semantic Relations between Wikipedia Articles
Pattern-Guided Integrated Gradients
Bootstrapping Named Entity Recognition in E-Commerce with Positive Unlabeled Learning
Considering Likelihood in NLP Classification Explanations with Occlusion and Language Modeling
Probing Linguistic Features of Sentence-Level Representations in Neural Relation Extraction
TACRED Revisited: A Thorough Evaluation of the TACRED Relation Extraction Task
Evaluating German Transformer Language Models with Syntactic Agreement Tests
Abstractive Text Summarization based on Language Model Conditioning and Locality Modeling
Cross-lingual Neural Vector Conceptualization
Layerwise Relevance Visualization in Convolutional Text Graph Classifiers
Fine-Tuning Pre-Trained Transformer Language Models to Distantly Supervised Relation Extraction
A Crowdsourcing Approach to Evaluate the Quality of Query-based Extractive Text Summaries
Enriching BERT with Knowledge Graph Embedding for Document Classification
Improving Relation Extraction by Pre-Trained Language Representations
Neural Vector Conceptualization for Word Vector Space Interpretation
Train, Sort, Explain: Learning to Diagnose Translation Models
Learning Explanations From Language Data
SapBERT-Based Medical Concept Normalization Using SNOMED CT
When performance is not enough—A multidisciplinary view on clinical decision support
Automatic Extraction of Medication Mentions from Tweets—Overview of the BioCreative VII Shared Task 3 Competition
''Nothing works without the doctor:'' Physicians' perception of clinical decision-making and artificial intelligence
Evaluation of a clinical decision support system for detection of patients at risk after kidney transplantation
Klinische Entscheidungsfindung mit Künstlicher Intelligenz: Ein interdisziplinärer Governance-Ansatz
European Language Equality, Report on Europe's Sign Languages
Bootstrapping Named Entity Recognition in E-Commerce with Positive Unlabeled Learning
Abstract
In this work, we introduce a bootstrapped, iterative NER model that integrates a PU learning algorithm for recognizing named entities in a low-resource setting. Our approach combines dictionary-based labeling with syntactically-informed label expansion to efficiently enrich the seed dictionaries. Experimental results on a dataset of manually annotated e-commerce product descriptions demonstrate the effectiveness of the proposed framework.
Considering Likelihood in NLP Classification Explanations with Occlusion and Language Modeling
Abstract
Recently, state-of-the-art NLP models gained an increasing syntactic and semantic understanding of language, and explanation methods are crucial to understand their decisions. Occlusion is a well established method that provides explanations on discrete language data, e.g. by removing a language unit from an input and measuring the impact on a model’s decision. We argue that current occlusion-based methods often produce invalid or syntactically incorrect language data, neglecting the improved abilities of recent NLP models. Furthermore, gradient-based explanation methods disregard the discrete distribution of data in NLP. Thus, we propose OLM: a novel explanation method that combines occlusion and language models to sample valid and syntactically correct replacements with high likelihood, given the context of the original input. We lay out a theoretical foundation that alleviates these weaknesses of other explanation methods in NLP and provide results that underline the importance of considering data likelihood in occlusion-based explanation.
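The core loop of this idea can be made concrete in a few lines. The sketch below is illustrative rather than the authors' implementation: it uses off-the-shelf Hugging Face models, approximates sampling from the language model by taking the top-k mask fillings, and assumes a binary classifier.

```python
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")
classify = pipeline("text-classification",
                    model="distilbert-base-uncased-finetuned-sst-2-english")

def prob(text, label):
    # Probability of `label` under a binary classifier.
    pred = classify(text)[0]
    return pred["score"] if pred["label"] == label else 1.0 - pred["score"]

def olm_importance(tokens, i, n_samples=5):
    original = " ".join(tokens)
    label = classify(original)[0]["label"]          # originally predicted class
    masked = " ".join(fill_mask.tokenizer.mask_token if j == i else t
                      for j, t in enumerate(tokens))
    # Approximate sampling from the LM with the top-k mask fillings.
    fillings = fill_mask(masked, top_k=n_samples)
    resampled = sum(prob(f["sequence"], label) for f in fillings) / n_samples
    return prob(original, label) - resampled        # drop when token is resampled

tokens = "the movie was surprisingly good".split()
print([(t, round(olm_importance(tokens, i), 3)) for i, t in enumerate(tokens)])
```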
Probing Linguistic Features of Sentence-Level Representations in Neural Relation Extraction
Abstract
Despite the recent progress, little is known about the features captured by state-of-the-art neural relation extraction (RE) models. Common methods encode the source sentence, conditioned on the entity mentions, before classifying the relation. However, the complexity of the task makes it difficult to understand how encoder architecture and supporting linguistic knowledge affect the features learned by the encoder. We introduce 14 probing tasks targeting linguistic properties relevant to RE, and we use them to study representations learned by more than 40 different encoder architecture and linguistic feature combinations trained on two datasets, TACRED and SemEval 2010 Task 8. We find that the bias induced by the architecture and the inclusion of linguistic features are clearly expressed in the probing task performance. For example, adding contextualized word representations greatly increases performance on probing tasks with a focus on named entity and part-of-speech information, and yields better results in RE. In contrast, entity masking improves RE, but considerably lowers performance on entity type related probing tasks.
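The probing methodology itself is simple to reproduce. Below is a generic probing-classifier sketch on synthetic data, not one of the paper's 14 tasks: representations are frozen, a linear probe is trained, and probe accuracy measures how recoverable the property is.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
reps = rng.normal(size=(500, 256))         # frozen encoder outputs (synthetic)
labels = (reps[:, 0] > 0).astype(int)      # stand-in linguistic property

probe = LogisticRegression(max_iter=1000)  # simple linear probe
print(cross_val_score(probe, reps, labels, cv=5).mean())  # high -> recoverable
```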
TACRED Revisited: A Thorough Evaluation of the TACRED Relation Extraction Task
Abstract
TACRED is one of the largest, most widely used crowdsourced datasets in Relation Extraction (RE). But, even with recent advances in unsupervised pre-training and knowledge enhanced neural RE, models still show a high error rate. In this paper, we investigate the questions: Have we reached a performance ceiling or is there still room for improvement? And how do crowd annotations, dataset, and models contribute to this error rate? To answer these questions, we first validate the most challenging 5K examples in the development and test sets using trained annotators. We find that label errors account for 8% absolute F1 test error, and that more than 50% of the examples need to be relabeled. On the relabeled test set the average F1 score of a large baseline model set improves from 62.1 to 70.1. After validation, we analyze misclassifications on the challenging instances, categorize them into linguistically motivated error groups, and verify the resulting error hypotheses on three state-of-the-art RE models. We show that two groups of ambiguous relations are responsible for most of the remaining errors and that models may adopt shallow heuristics on the dataset when entities are not masked.
Why only Micro-$F_1$? Class Weighting of Measures for Relation Classification
Abstract
Relation classification models are conventionally evaluated using only a single measure, e.g., micro-F1, macro-F1 or AUC. In this work, we analyze weighting schemes, such as micro and macro, for imbalanced datasets. We introduce a framework for weighting schemes, where existing schemes are extremes, and two new intermediate schemes. We show that reporting results of different weighting schemes better highlights strengths and weaknesses of a model.
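For reference, the two extremes between which the paper's weighting framework interpolates are the standard definitions (the new intermediate schemes are not reproduced here):

```latex
% Per-class counts TP_c, FP_c, FN_c over classes c = 1, ..., C.
% Macro-F1: unweighted mean of per-class F1 scores.
\mathrm{macro}\text{-}F_1 = \frac{1}{C}\sum_{c=1}^{C} F_{1,c}
% Micro-F1: F1 computed from the pooled counts.
\mathrm{micro}\text{-}F_1 =
  \frac{2\sum_{c} \mathrm{TP}_c}
       {2\sum_{c} \mathrm{TP}_c + \sum_{c} \mathrm{FP}_c + \sum_{c} \mathrm{FN}_c}
```

Macro-F1 weights all classes equally, while micro-F1 pools the per-class counts and is therefore dominated by frequent classes, which is why reporting only one of them can hide weaknesses on rare relations.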
A Comparative Study of Pre-trained Encoders for Low-Resource Named Entity Recognition
Abstract
Pre-trained language models (PLM) are effective components of few-shot named entity recognition (NER) approaches when augmented with continued pre-training on task-specific out-of-domain data or fine-tuning on in-domain data. However, their performance in low-resource scenarios, where such data is not available, remains an open question. We introduce an encoder evaluation framework, and use it to systematically compare the performance of state-of-the-art pre-trained representations on the task of low-resource NER. We analyze a wide range of encoders pre-trained with different strategies, model architectures, intermediate-task fine-tuning, and contrastive learning. Our experimental results across ten benchmark NER datasets in English and German show that encoder performance varies significantly, suggesting that the choice of encoder for a specific low-resource scenario needs to be carefully evaluated.
HiStruct+: Improving Extractive Text Summarization with Hierarchical Structure Information
Abstract
Transformer-based language models usually treat texts as linear sequences. However, most texts also have an inherent hierarchical structure, i.e., parts of a text can be identified using their position in this hierarchy. In addition, section titles usually indicate the common topic of their respective sentences. We propose a novel approach to formulate, extract, encode and inject hierarchical structure information explicitly into an extractive summarization model based on a pre-trained, encoder-only Transformer language model (HiStruct+ model), which improves SOTA ROUGEs for extractive summarization on PubMed and arXiv substantially. Using various experimental settings on three datasets (i.e., CNN/DailyMail, PubMed and arXiv), our HiStruct+ model outperforms a strong baseline collectively, which differs from our model only in that the hierarchical structure information is not injected. It is also observed that the more conspicuous hierarchical structure the dataset has, the larger improvements our method gains. The ablation study demonstrates that the hierarchical position information is the main contributor to our model’s SOTA performance.
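The injection step can be pictured as adding learned structure embeddings to the sentence vectors before extractive scoring. The following is a minimal sketch under assumed dimensions; the paper's full model also encodes section titles, which is omitted here.

```python
import torch
import torch.nn as nn

class HierarchicalInjection(nn.Module):
    def __init__(self, dim=768, max_sections=32, max_sent_per_section=128):
        super().__init__()
        self.section_emb = nn.Embedding(max_sections, dim)       # which section
        self.within_emb = nn.Embedding(max_sent_per_section, dim)  # position in it
        self.scorer = nn.Linear(dim, 1)

    def forward(self, sent_vecs, section_idx, within_idx):
        h = sent_vecs + self.section_emb(section_idx) + self.within_emb(within_idx)
        return self.scorer(h).squeeze(-1)   # extractive selection scores

model = HierarchicalInjection()
scores = model(torch.randn(10, 768), torch.arange(10) // 4, torch.arange(10) % 4)
print(scores.shape)  # torch.Size([10])
```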
Saliency Map Verbalization: Comparing Feature Importance Representations from Model-free and Instruction-based Methods
Abstract
Saliency maps can explain a neural model’s predictions by identifying important input features. They are difficult to interpret for laypeople, especially for instances with many features. In order to make them more accessible, we formalize the underexplored task of translating saliency maps into natural language and compare methods that address two key challenges of this approach – what and how to verbalize. In both automatic and human evaluation setups, using token-level attributions from text classification tasks, we compare two novel methods (search-based and instruction-based verbalizations) against conventional feature importance representations (heatmap visualizations and extractive rationales), measuring simulatability, faithfulness, helpfulness and ease of understanding. Instructing GPT-3.5 to generate saliency map verbalizations yields plausible explanations which include associations, abstractive summarization and commonsense reasoning, achieving by far the highest human ratings, but they are not faithfully capturing numeric information and are inconsistent in their interpretation of the task. In comparison, our search-based, model-free verbalization approach efficiently completes templated verbalizations, is faithful by design, but falls short in helpfulness and simulatability. Our results suggest that saliency map verbalization makes feature attribution explanations more comprehensible and less cognitively challenging to humans than conventional representations.
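A model-free, template-based verbalization can be as simple as the sketch below; the threshold and wording are illustrative stand-ins for the paper's search procedure.

```python
def verbalize(tokens, attributions, k=2):
    # Rank tokens by absolute attribution and fill a fixed sentence template.
    ranked = sorted(zip(tokens, attributions), key=lambda p: -abs(p[1]))
    top = [t for t, _ in ranked[:k]]
    share = sum(abs(a) for _, a in ranked[:k]) / sum(abs(a) for a in attributions)
    strength = "mainly" if share > 0.5 else "partly"
    return f"The prediction is {strength} driven by the words {', '.join(repr(t) for t in top)}."

print(verbalize(["the", "movie", "was", "awful"], [0.02, 0.10, 0.03, 0.85]))
# -> The prediction is mainly driven by the words 'awful', 'movie'.
```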
MultiTACRED: A Multilingual Version of the TAC Relation Extraction Dataset
Abstract
Relation extraction (RE) is a fundamental task in information extraction, whose extension to multilingual settings has been hindered by the lack of supervised resources comparable in size to large English datasets such as TACRED (Zhang et al., 2017). To address this gap, we introduce the MultiTACRED dataset, covering 12 typologically diverse languages from 9 language families, which is created by machine-translating TACRED instances and automatically projecting their entity annotations. We analyze translation and annotation projection quality, identify error categories, and experimentally evaluate fine-tuned pretrained mono- and multilingual language models in common transfer learning scenarios. Our analyses show that machine translation is a viable strategy to transfer RE instances, with native speakers judging more than 83% of the translated instances to be linguistically and semantically acceptable. We find monolingual RE model performance to be comparable to the English original for many of the target languages, and that multilingual models trained on a combination of English and target language data can outperform their monolingual counterparts. However, we also observe a variety of translation and annotation projection errors, both due to the MT systems and linguistic features of the target languages, such as pronoun-dropping, compounding and inflection, that degrade dataset quality and RE model performance.
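A common way to project entity annotations through machine translation, and a plausible reading of the procedure described here, is to wrap the entity spans in markers that the MT system tends to preserve. A sketch, with translation left as a placeholder for any MT system:

```python
import re

def mark(tokens, head, tail):
    # head/tail are (start, end) token offsets, end exclusive; assumes the
    # head span precedes the tail span, so later insertions do not shift it.
    out = list(tokens)
    out.insert(tail[1], "</e2>"); out.insert(tail[0], "<e2>")
    out.insert(head[1], "</e1>"); out.insert(head[0], "<e1>")
    return " ".join(out)

def project(translated):
    # Recover the projected spans from the markers in the MT output.
    head = re.search(r"<e1>\s*(.*?)\s*</e1>", translated)
    tail = re.search(r"<e2>\s*(.*?)\s*</e2>", translated)
    if head is None or tail is None:
        return None                      # marker lost in translation: discard
    return head.group(1), tail.group(1)

marked = mark("Steve Jobs founded Apple .".split(), (0, 2), (3, 4))
# An MT system might yield: "<e1> Steve Jobs </e1> gründete <e2> Apple </e2> ."
print(project("<e1> Steve Jobs </e1> gründete <e2> Apple </e2> ."))
```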
VendorLink: An NLP approach for Identifying & Linking Vendor Migrants & Potential Aliases on Darknet Markets
Abstract
The anonymity on the Darknet allows vendors to stay undetected by using multiple vendor aliases or frequently migrating between markets. Consequently, illegal markets and their connections are challenging to uncover on the Darknet. To identify relationships between illegal markets and their vendors, we propose VendorLink, an NLP-based approach that examines writing patterns to verify, identify, and link unique vendor accounts across text advertisements (ads) on seven public Darknet markets. In contrast to existing literature, VendorLink utilizes the strength of supervised pretraining to perform closed-set vendor verification, open-set vendor identification, and low-resource market adaption tasks. Through VendorLink, we uncover (i) 15 migrants and 71 potential aliases in the Alphabay-Dreams-Silk dataset, (ii) 17 migrants and 3 potential aliases in the Valhalla-Berlusconi dataset, and (iii) 75 migrants and 10 potential aliases in the Traderoute-Agora dataset. Altogether, our approach can help Law Enforcement Agencies (LEA) make more informed decisions by verifying and identifying migrating vendors and their potential aliases on existing and Low-Resource (LR) emerging Darknet markets.
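As a rough intuition for the closed-set verification task, authorship of ads can be treated as stylometric text classification. The sketch below uses a classic character n-gram baseline, not the paper's pretrained model; the data is invented.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

ads = ["top quality product, ships worldwide, pgp mandatory",
       "TOP QUALITY!! fast ship. no pgp no deal",
       "stealth shipping, escrow accepted, message for bulk",
       "bulk discounts!!! stealth guaranteed, escrow ok"]
vendors = ["vendor_a", "vendor_a", "vendor_b", "vendor_b"]

# Character n-grams capture writing style (punctuation habits, casing, etc.).
clf = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
    LogisticRegression(max_iter=1000),
)
clf.fit(ads, vendors)
# A held-out ad from a suspected alias account:
print(clf.predict_proba(["stealth bulk deals, escrow accepted"]))
```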
Inseq: An Interpretability Toolkit for Sequence Generation Models
Abstract
Past work in natural language processing interpretability focused mainly on popular classification tasks while largely overlooking generation settings, partly due to a lack of dedicated tools. In this work, we introduce Inseq, a Python library to democratize access to interpretability analyses of sequence generation models. Inseq enables intuitive and optimized extraction of models' internal information and feature importance scores for popular decoder-only and encoder-decoder Transformers architectures. We showcase its potential by adopting it to highlight gender biases in machine translation models and locate factual knowledge inside GPT-2. Thanks to its extensible interface supporting cutting-edge techniques such as contrastive feature attribution, Inseq can drive future advances in explainable natural language generation, centralizing good practices and enabling fair and reproducible model evaluations.
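Inseq is available on PyPI; its documented quickstart looks roughly like the following (model and method identifiers are illustrative, and exact signatures may vary between versions):

```python
import inseq

# Load an MT model together with an attribution method.
model = inseq.load_model("Helsinki-NLP/opus-mt-en-de", "integrated_gradients")
out = model.attribute("Hello everyone, hope you're enjoying the tutorial!")
out.show()  # token-level attribution heatmap
```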
Neural Machine Translation Methods for Translating Text to Sign Language Glosses
Abstract
State-of-the-art techniques common to low resource Machine Translation (MT) are applied to improve MT of spoken language text to Sign Language (SL) glosses. In our experiments, we improve the performance of the transformer-based models via (1) data augmentation, (2) semi-supervised Neural Machine Translation (NMT), (3) transfer learning and (4) multilingual NMT. The proposed methods are implemented progressively on two German SL corpora containing gloss annotations. Multilingual NMT combined with data augmentation appear to be the most successful setting, yielding statistically significant improvements as measured by three automatic metrics (up to over 6 points BLEU), and confirmed via human evaluation. Our best setting outperforms all previous work that report on the same test-set and is also confirmed on a corpus of the American Sign Language (ASL).
Cross-lingual Neural Vector Conceptualization
Abstract
Recently, Neural Vector Conceptualization (NVC) was proposed as a means to interpret samples from a word vector space. For NVC, a neural model activates higher order concepts it recognizes in a word vector instance. To this end, the model first needs to be trained with a sufficiently large instance-to-concept ground truth, which only exists for a few languages. In this work, we tackle this lack of resources with word vector space alignment techniques: We train the NVC model on a high resource language and test it with vectors from an aligned word vector space of another language, without retraining or fine-tuning. A quantitative and qualitative analysis shows that the NVC model indeed activates meaningful concepts for unseen vectors from the aligned vector space. NVC thus becomes available for low resource languages for which no appropriate concept ground truth exists.
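The space-alignment step typically relies on orthogonal Procrustes over a seed dictionary. A minimal sketch with synthetic data, assuming the two spaces are related by a rotation:

```python
import numpy as np

def procrustes(X_src, X_tgt):
    """Find the orthogonal W minimizing ||X_tgt @ W - X_src||_F."""
    U, _, Vt = np.linalg.svd(X_tgt.T @ X_src)
    return U @ Vt

rng = np.random.default_rng(0)
src = rng.normal(size=(1000, 300))            # source-language vectors (seed pairs)
true_rot, _ = np.linalg.qr(rng.normal(size=(300, 300)))
tgt = src @ true_rot.T                        # synthetic "target" space
W = procrustes(src, tgt)
print(np.allclose(tgt @ W, src))              # target vectors mapped into source space
```

Once aligned, target-language vectors can be fed to the concept model exactly as if they came from the source space, which is the point of the paper's zero-retraining setup.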
Automatic Extraction of Medication Mentions from Tweets—Overview of the BioCreative VII Shared Task 3 Competition
Abstract
This study presents the outcomes of the shared task competition BioCreative VII (Task 3) focusing on the extraction of medication names from a Twitter user’s publicly available tweets (the user’s ‘timeline’). In general, detecting health-related tweets is notoriously challenging for natural language processing tools. The main challenge, aside from the informality of the language used, is that people tweet about any and all topics, and most of their tweets are not related to health. Thus, finding those tweets in a user’s timeline that mention specific health-related concepts such as medications requires addressing extreme imbalance. Task 3 called for detecting tweets in a user’s timeline that mention a medication name and, for each detected mention, extracting its span. The organizers made available a corpus consisting of 182,049 tweets publicly posted by 212 Twitter users with all medication mentions manually annotated. The corpus exhibits the natural distribution of positive tweets, with only 442 tweets (0.2%) mentioning a medication. This task was an opportunity for participants to evaluate methods that are robust to class imbalance beyond the simple lexical match. A total of 65 teams registered, and 16 teams submitted a system run. This study summarizes the corpus created by the organizers and the approaches taken by the participating teams for this challenge. The corpus is freely available at https://biocreative.bioinformatics.udel.edu/tasks/biocreative-vii/track-3/. The methods and the results of the competing systems are analyzed with a focus on the approaches taken for learning from class-imbalanced data.
Train, Sort, Explain: Learning to Diagnose Translation Models
Abstract
Evaluating translation models is a trade-off between effort and detail. On the one end of the spectrum there are automatic count-based methods such as BLEU, on the other end linguistic evaluations by humans, which arguably are more informative but also require a disproportionately high effort. To narrow the spectrum, we propose a general approach on how to automatically expose systematic differences between human and machine translations to human experts. Inspired by adversarial settings, we train a neural text classifier to distinguish human from machine translations. A classifier that performs and generalizes well after training should recognize systematic differences between the two classes, which we uncover with neural explainability methods. Our proof-of-concept implementation, DiaMaT, is open source. Applied to a dataset translated by a state-of-the-art neural Transformer model, DiaMaT achieves a classification accuracy of 75% and exposes meaningful differences between humans and the Transformer, amidst the current discussion about human parity.
Multilingual Relation Classification via Efficient and Effective Prompting
Abstract
Prompting pre-trained language models has achieved impressive performance on various NLP tasks, especially in low data regimes. Despite the success of prompting in monolingual settings, applying prompt-based methods in multilingual scenarios has been limited to a narrow set of tasks, due to the high cost of handcrafting multilingual prompts. In this paper, we present the first work on prompt-based multilingual relation classification (RC), by introducing an efficient and effective method that constructs prompts from relation triples and involves only minimal translation for the class labels. We evaluate its performance in fully supervised, few-shot and zero-shot scenarios, and analyze its effectiveness across 14 languages, prompt variants, and English-task training in cross-lingual settings. We find that in both fully supervised and few-shot scenarios, our prompt method beats competitive baselines: fine-tuning XLM-R-EM and null prompts. It also outperforms the random baseline by a large margin in zero-shot experiments. Our method requires little in-language knowledge and can be used as a strong baseline for similar multilingual classification tasks.
Neighborhood Contrastive Learning for Scientific Document Representations with Citation Embeddings
Abstract
Learning scientific document representations can be substantially improved through contrastive learning objectives, where the challenge lies in creating positive and negative training samples that encode the desired similarity semantics. Prior work relies on discrete citation relations to generate contrast samples. However, discrete citations enforce a hard cut-off to similarity. This is counter-intuitive to similarity-based learning, and ignores that scientific papers can be very similar despite lacking a direct citation - a core problem of finding related research. Instead, we use controlled nearest neighbor sampling over citation graph embeddings for contrastive learning. This control allows us to learn continuous similarity, to sample hard-to-learn negatives and positives, and also to avoid collisions between negative and positive samples by controlling the sampling margin between them. The resulting method SciNCL outperforms the state-of-the-art on the SciDocs benchmark. Furthermore, we demonstrate that it can train (or tune) models sample-efficiently, and that it can be combined with recent training-efficient methods. Perhaps surprisingly, even training a general-domain language model this way outperforms baselines pretrained in-domain.
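The sampling strategy can be sketched as drawing positives and hard negatives from disjoint rank bands of a query's neighborhood, with a margin between the bands. Band boundaries below are illustrative, not the paper's tuned values.

```python
import numpy as np

def sample_contrast(emb, query, pos_band=(1, 5), neg_band=(20, 25), k=2, seed=0):
    rng = np.random.default_rng(seed)
    # Cosine similarity of every node to the query in the citation embedding space.
    sims = emb @ emb[query] / (np.linalg.norm(emb, axis=1) * np.linalg.norm(emb[query]))
    ranking = np.argsort(-sims)           # rank 0 is the query itself
    positives = rng.choice(ranking[pos_band[0]:pos_band[1]], size=k, replace=False)
    negatives = rng.choice(ranking[neg_band[0]:neg_band[1]], size=k, replace=False)
    return positives, negatives           # gap between bands = sampling margin

emb = np.random.default_rng(1).normal(size=(100, 64))  # stand-in graph embeddings
print(sample_contrast(emb, query=0))
```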
InterroLang: Exploring NLP Models and Datasets through Dialogue-based Explanations
Abstract
While recently developed NLP explainability methods let us open the black box in various ways (Madsen et al., 2022), a missing ingredient in this endeavor is an interactive tool offering a conversational interface. Such a dialogue system can help users explore datasets and models with explanations in a contextualized manner, e.g. via clarification or follow-up questions, and through a natural language interface. We adapt the conversational explanation framework TalkToModel (Slack et al., 2022) to the NLP domain, add new NLP-specific operations such as free-text rationalization, and illustrate its generalizability on three NLP tasks (dialogue act classification, question answering, hate speech detection). To recognize user queries for explanations, we evaluate fine-tuned and few-shot prompting models and implement a novel adapter-based approach. We then conduct two user studies on (1) the perceived correctness and helpfulness of the dialogues, and (2) the simulatability, i.e. how objectively helpful dialogical explanations are for humans in figuring out the model’s predicted label when it’s not shown. We found rationalization and feature attribution were helpful in explaining the model behavior. Moreover, users could more reliably predict the model outcome based on an explanation dialogue rather than one-off explanations.
Evaluation of a clinical decision support system for detection of patients at risk after kidney transplantation
Abstract
Patient care after kidney transplantation requires integration of complex information to make informed decisions on risk constellations. Many machine learning models have been developed for detecting patient outcomes in the past years. However, performance metrics alone do not determine practical utility. We present a newly developed clinical decision support system (CDSS) for detection of patients at risk for rejection and death-censored graft failure. The CDSS is based on clinical routine data including 1,516 kidney transplant recipients and more than 100,000 data points. In a reader study we compare the performance of physicians at a nephrology department with and without the CDSS. Internal validation shows AUC-ROC scores of 0.83 for rejection, and 0.95 for graft failure. The reader study shows that predictions by physicians converge toward the CDSS. However, performance does not improve (AUC–ROC; 0.6413 vs. 0.6314 for rejection; 0.8072 vs. 0.7778 for graft failure). Finally, the study shows that the CDSS detects partially different patients at risk compared to physicians. This indicates that the combination of both, medical professionals and a CDSS might help detect more patients at risk for graft failure. However, the question of how to integrate such a system efficiently into clinical practice remains open.
''Nothing works without the doctor:'' Physicians' perception of clinical decision-making and artificial intelligence
Abstract
Introduction: Artificial intelligence–driven decision support systems (AI–DSS) have the potential to help physicians analyze data and facilitate the search for a correct diagnosis or suitable intervention. The potential of such systems is often emphasized. However, implementation in clinical practice deserves continuous attention. This article aims to shed light on the needs and challenges arising from the use of AI-DSS from physicians' perspectives. Methods: The basis for this study is a qualitative content analysis of expert interviews with experienced nephrologists after testing an AI-DSS in a straightforward usage scenario. Results: The results provide insights on the basics of clinical decision-making, expected challenges when using AI-DSS as well as a reflection on the test run. Discussion: While we can confirm the somewhat expectable demand for better explainability and control, other insights highlight the need to uphold classical strengths of the medical profession when using AI-DSS as well as the importance of broadening the view of AI-related challenges to the clinical environment, especially during treatment. Our results stress the necessity for adjusting AI-DSS to shared decision-making. We conclude that explainability must be context-specific while fostering meaningful interaction with the systems available.
Specialized Document Embeddings for Aspect-based Similarity of Research Papers
Abstract
Document embeddings and similarity measures underpin content-based recommender systems, whereby a document is commonly represented as a single generic embedding. However, similarity computed on single vector representations provides only one perspective on document similarity that ignores which aspects make two documents alike. To address this limitation, aspect-based similarity measures have been developed using document segmentation or pairwise multi-class document classification. While segmentation harms the document coherence, the pairwise classification approach scales poorly to large scale corpora. In this paper, we treat aspect-based similarity as a classical vector similarity problem in aspect-specific embedding spaces. We represent a document not as a single generic embedding but as multiple specialized embeddings. Our approach avoids document segmentation and scales linearly w.r.t. the corpus size. In an empirical study, we use the Papers with Code corpus containing 157,606 research papers and consider the task, method, and dataset of the respective research papers as their aspects. We compare and analyze three generic document embeddings, six specialized document embeddings and a pairwise classification baseline in the context of research paper recommendations. As generic document embeddings, we consider FastText, SciBERT, and SPECTER. To compute the specialized document embeddings, we compare three alternative methods inspired by retrofitting, fine-tuning, and Siamese networks. In our experiments, Siamese SciBERT achieved the highest scores. Additional analyses indicate an implicit bias of the generic document embeddings towards the dataset aspect and against the method aspect of each research paper. Our approach of aspect-based document embeddings mitigates potential risks arising from implicit biases by making them explicit. This can, for example, be used for more diverse and explainable recommendations.
MobIE: A German Dataset for Named Entity Recognition, Entity Linking and Relation Extraction in the Mobility Domain
Abstract
We present MobIE, a German-language dataset, which is human-annotated with 20 coarse- and fine-grained entity types and entity linking information for geographically linkable entities. The dataset consists of 3,232 social media texts and traffic reports with 91K tokens, and contains 20.5K annotated entities, 13.1K of which are linked to a knowledge base. A subset of the dataset is human-annotated with seven mobility-related, n-ary relation types, while the remaining documents are annotated using a weakly-supervised labeling approach implemented with the Snorkel framework. To the best of our knowledge, this is the first German-language dataset that combines annotations for NER, EL and RE, and thus can be used for joint and multi-task learning of these fundamental information extraction tasks. We make MobIE public at https://github.com/dfki-nlp/mobie.
Factuality Detection using Machine Translation - a Use Case for German Clinical Text
Abstract
Factuality can play an important role when automatically processing clinical text, as it makes a difference if particular symptoms are explicitly not present, possibly present, not mentioned, or affirmed. In most cases, a sufficient number of examples is necessary to handle such phenomena in a supervised machine learning setting. However, as clinical text might contain sensitive information, data cannot be easily shared. In the context of factuality detection, this work presents a simple solution using machine translation to translate English data to German to train a transformer-based factuality detection model.
Layerwise Relevance Visualization in Convolutional Text Graph Classifiers
Abstract
Representations in the hidden layers of Deep Neural Networks (DNN) are often hard to interpret since it is difficult to project them into an interpretable domain. Graph Convolutional Networks (GCN) allow this projection, but existing explainability methods do not exploit this fact, i.e. do not focus their explanations on intermediate states. In this work, we present a novel method that traces and visualizes features that contribute to a classification decision in the visible and hidden layers of a GCN. Our method exposes hidden cross-layer dynamics in the input graph structure. We experimentally demonstrate that it yields meaningful layerwise explanations for a GCN sentence classifier.
Pattern-Guided Integrated Gradients
Abstract
PatternAttribution is a recent method, introduced in the vision domain, that explains classifications of deep neural networks. We demonstrate that it also generates meaningful interpretations in the language domain.
Abstractive Text Summarization based on Language Model Conditioning and Locality Modeling
Abstract
We explore to what extent knowledge about the pre-trained language model that is used is beneficial for the task of abstractive summarization. To this end, we experiment with conditioning the encoder and decoder of a Transformer-based neural model on the BERT language model. In addition, we propose a new method of BERT-windowing, which allows chunk-wise processing of texts longer than the BERT window size. We also explore how locality modeling, i.e., the explicit restriction of calculations to the local context, can affect the summarization ability of the Transformer. This is done by introducing 2-dimensional convolutional self-attention into the first layers of the encoder. The results of our models are compared to a baseline and the state-of-the-art models on the CNN/Daily Mail dataset. We additionally train our model on the SwissText dataset to demonstrate usability on German. Both models outperform the baseline in ROUGE scores on two datasets and show their superiority in a manual qualitative analysis.
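The BERT-windowing idea, encoding overlapping chunks and averaging token representations where the windows overlap, can be sketched as follows (window sizes are illustrative):

```python
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
enc = AutoModel.from_pretrained("bert-base-uncased")

@torch.no_grad()
def windowed_encode(text, window=128, stride=64):
    ids = tok(text, add_special_tokens=False)["input_ids"]
    hidden = torch.zeros(len(ids), enc.config.hidden_size)
    counts = torch.zeros(len(ids), 1)
    for start in range(0, len(ids), stride):
        chunk = ids[start:start + window]
        out = enc(torch.tensor([chunk])).last_hidden_state[0]
        hidden[start:start + len(chunk)] += out   # accumulate window outputs
        counts[start:start + len(chunk)] += 1
        if start + window >= len(ids):
            break
    return hidden / counts  # average where windows overlap

reps = windowed_encode("a long document " * 200)
print(reps.shape)  # one contextualized vector per token
```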
Generating Extended and Multilingual Summaries with Pre-trained Transformers
Abstract
Almost all summarisation methods and datasets focus on a single language and short summaries. We introduce a new dataset called WikinewsSum for English, German, French, Spanish, Portuguese, Polish, and Italian summarisation tailored for extended summaries of approx. 11 sentences. The dataset comprises 39,626 summaries which are news articles from Wikinews and their sources. We compare three multilingual transformer models on the extractive summarisation task and three training scenarios on which we fine-tune mT5 to perform abstractive summarisation. This results in strong baselines for both extractive and abstractive summarisation on WikinewsSum. We also show how the combination of an extractive model with an abstractive one can be used to create extended abstractive summaries from long input documents. Finally, our results show that fine-tuning mT5 on all the languages combined significantly improves the summarisation performance on low-resource languages.
Claim Extraction and Law Matching for COVID-19-related Legislation
Abstract
To cope with the COVID-19 pandemic, many jurisdictions have introduced new or altered existing legislation. Even though these new rules are often communicated to the public in news articles, it remains challenging for laypersons to learn about what is currently allowed or forbidden since news articles typically do not reference underlying laws. We investigate an automated approach to extract legal claims from news articles and to match the claims with their corresponding applicable laws. We examine the feasibility of the two tasks concerning claims about COVID-19-related laws from Berlin, Germany. For both tasks, we create and make publicly available the data sets and report the results of initial experiments. We obtain promising results with Transformer-based models that achieve 46.7 F1 for claim extraction and 91.4 F1 for law matching, albeit with some conceptual limitations. Furthermore, we discuss challenges of current machine learning approaches for legal language processing and their ability for complex legal reasoning tasks.
MobASA: Corpus for Aspect-based Sentiment Analysis and Social Inclusion in the Mobility Domain
Abstract
In this paper we show how aspect-based sentiment analysis might help public transport companies to improve their social responsibility for accessible travel. We present MobASA: a novel German-language corpus of tweets annotated with their relevance for public transportation, and with sentiment towards aspects related to barrier-free travel. We identified and labeled topics important for passengers limited in their mobility due to disability, age, or when travelling with young children. The data can be used to identify hurdles and improve travel planning for vulnerable passengers, as well as to monitor a perception of transportation businesses regarding the social inclusion of all passengers. The data is publicly available under: https://github.com/DFKI-NLP/sim3s-corpus
A Linguistically Motivated Test Suite to Semi-Automatically Evaluate German–English Machine Translation Output
Abstract
This paper presents a fine-grained test suite for the language pair German–English. The test suite is based on a number of linguistically motivated categories and phenomena and the semi-automatic evaluation is carried out with regular expressions. We describe the creation and implementation of the test suite in detail, providing a full list of all categories and phenomena. Furthermore, we present various exemplary applications of our test suite that have been implemented in the past years, like contributions to the Conference of Machine Translation, the usage of the test suite and MT outputs for quality estimation, and the expansion of the test suite to the language pair Portuguese–English. We describe how we tracked the development of the performance of various MT systems over the years with the help of the test suite and which categories and phenomena are prone to resulting in MT errors. For the first time, we also make a large part of our test suite publicly available to the research community.
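Semi-automatic evaluation with regular expressions reduces each test item to a source sentence plus a pattern that a correct translation must match. A minimal illustrative item (phenomenon, sentence, and pattern are invented):

```python
import re

test_item = {
    "phenomenon": "verb valency / reflexive",
    "source": "Er erinnert sich an den Sommer.",
    "pattern": r"\bremembers\b.*\bsummer\b",
}
hypothesis = "He remembers the summer."
print(bool(re.search(test_item["pattern"], hypothesis)))  # True -> test passed
```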
An Annotated Corpus of Textual Explanations for Clinical Decision Support
Abstract
In recent years, machine learning for clinical decision support has gained more and more attention. In order to introduce such applications into clinical practice, a good performance might be essential, however, the aspect of trust should not be underestimated. For the treating physician using such a system and being (legally) responsible for the decision made, it is particularly important to understand the system’s recommendation. To provide insights into a model’s decision, various techniques from the field of explainability (XAI) have been proposed whose output is often enough not targeted to the domain experts that want to use the model. To close this gap, in this work, we explore how explanations could possibly look like in future. To this end, this work presents a dataset of textual explanations in context of decision support. Within a reader study, human physicians estimated the likelihood of possible negative patient outcomes in the near future and justified each decision with a few sentences. Using those sentences, we created a novel corpus, annotated with different semantic layers. Moreover, we provide an analysis of how those explanations are constructed, and how they change depending on physician, on the estimated risk and also in comparison to an automatic clinical decision support system with feature importance.
Subjective Text Complexity Assessment for German
Abstract
For different reasons, text can be difficult to read and understand for many people, especially if the text’s language is too complex. In order to provide suitable text for the target audience, it is necessary to measure its complexity. In this paper we describe subjective experiments to assess the readability of German text. We compile a new corpus of sentences provided by a German IT service provider. The sentences are annotated with the subjective complexity ratings by two groups of participants, namely experts and non-experts for that text domain. We then extract an extensive set of linguistically motivated features that are supposedly interacting with complexity perception. We show that a linear regression model with a subset of these features can be a very good predictor of text complexity.
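The modeling setup is a straightforward feature-based regression. The sketch below uses crude stand-in features and invented ratings in place of the paper's extensive feature set and annotated corpus:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def features(sentence):
    words = sentence.split()
    return [len(words),                                  # sentence length
            np.mean([len(w) for w in words]),            # avg. word length
            sum(len(w) > 12 for w in words)]             # very long words

sentences = ["Das ist ein kurzer Satz.",
             "Die Inbetriebnahme erfordert eine Konfigurationsüberprüfung.",
             "Der Server startet neu.",
             "Zwischenspeicherinvalidierung verursacht Latenzschwankungen im Gesamtsystem."]
ratings = [1.0, 4.5, 1.5, 5.0]   # invented subjective complexity scores

model = LinearRegression().fit([features(s) for s in sentences], ratings)
print(model.predict([features("Ein sehr einfacher Beispielsatz.")]))
```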
Detecting Covariate Drift with Explanations
Abstract
Detecting when there is a domain drift between training and inference data is important for any model evaluated on data collected in real time. Many current data drift detection methods only utilize input features to detect domain drift. While effective, these methods disregard the model’s evaluation of the data, which may be a significant source of information about the data domain. We propose to use information from the model in the form of explanations, specifically gradient times input, in order to utilize this information. Following the framework of Rabanser et al. [11], we combine these explanations with two-sample tests in order to detect a shift in distribution between training and evaluation data. Promising initial experiments show that explanations provide useful information for detecting shift, which potentially improves upon the current state-of-the-art.
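A minimal sketch of the approach, assuming a differentiable model: compute gradient-times-input attributions on reference and inference batches and compare the two distributions with a two-sample test (here a per-feature Kolmogorov-Smirnov test; the cited framework adds dimensionality reduction and multiple-testing corrections):

```python
import torch
from scipy.stats import ks_2samp

def grad_times_input(model, x):
    x = x.clone().requires_grad_(True)
    model(x).sum().backward()
    return (x.grad * x).detach()          # gradient x input attributions

model = torch.nn.Sequential(torch.nn.Linear(10, 1))
train = torch.randn(200, 10)              # reference (training) data
drifted = torch.randn(200, 10) + 0.8      # shifted inference data

a_train = grad_times_input(model, train)
a_drift = grad_times_input(model, drifted)
pvals = [ks_2samp(a_train[:, j].numpy(), a_drift[:, j].numpy()).pvalue
         for j in range(10)]
print(min(pvals))  # small p-value -> evidence of a distribution shift
```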
Neural Vector Conceptualization for Word Vector Space Interpretation
Abstract
Distributed word vector spaces are considered hard to interpret which hinders the understanding of natural language processing (NLP) models. In this work, we introduce a new method to interpret arbitrary samples from a word vector space. To this end, we train a neural model to conceptualize word vectors, which means that it activates higher order concepts it recognizes in a given vector. Contrary to prior approaches, our model operates in the original vector space and is capable of learning non-linear relations between word vectors and concepts. Furthermore, we show that it produces considerably less entropic concept activation profiles than the popular cosine similarity.
Enriching BERT with Knowledge Graph Embedding for Document Classification
Abstract
In this paper, we focus on the classification of books using short descriptive texts (cover blurbs) and additional metadata. Building upon BERT, a deep neural language model, we demonstrate how to combine text representations with metadata and knowledge graph embeddings, which encode author information. Compared to the standard BERT approach we achieve considerably better results for the classification task. For a more coarse-grained classification using eight labels we achieve an F1-score of 87.20, while a detailed classification using 343 labels yields an F1-score of 64.70. We make the source code and trained models of our experiments publicly available.
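Architecturally, the combination amounts to concatenating the text, metadata, and knowledge-graph vectors before a classification head. A minimal sketch with assumed dimensions:

```python
import torch
import torch.nn as nn

class FusionClassifier(nn.Module):
    def __init__(self, text_dim=768, kg_dim=200, meta_dim=10, n_labels=8):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(text_dim + kg_dim + meta_dim, 256),
            nn.ReLU(),
            nn.Linear(256, n_labels),
        )

    def forward(self, cls_vec, author_kg_vec, metadata):
        # cls_vec: BERT [CLS] vector; author_kg_vec: KG embedding of the author.
        return self.head(torch.cat([cls_vec, author_kg_vec, metadata], dim=-1))

clf = FusionClassifier()
logits = clf(torch.randn(4, 768), torch.randn(4, 200), torch.randn(4, 10))
print(logits.shape)  # torch.Size([4, 8])
```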
Pairwise Multi-Class Document Classification for Semantic Relations between Wikipedia Articles
Abstract
Many digital libraries recommend literature to their users considering the similarity between a query document and their repository. However, they often fail to distinguish what is the relationship that makes two documents alike. In this paper, we model the problem of finding the relationship between two documents as a pairwise document classification task. To find the semantic relation between documents, we apply a series of techniques, such as GloVe, Paragraph-Vectors, BERT, and XLNet under different configurations (e.g., sequence length, vector concatenation scheme), including a Siamese architecture for the Transformer-based systems. We perform our experiments on a newly proposed dataset of 32,168 Wikipedia article pairs and Wikidata properties that define the semantic document relations. Our results show vanilla BERT as the best performing system with an F1-score of 0.93, which we manually examine to better understand its applicability to other domains. Our findings suggest that classifying semantic relations between documents is a solvable task and motivates the development of recommender systems based on the evaluated techniques. The discussions in this paper serve as first steps in the exploration of documents through SPARQL-like queries such that one could find documents that are similar in one aspect but dissimilar in another.
Aspect-based Document Similarity for Research Papers
Abstract
Traditional document similarity measures provide a coarse-grained distinction between similar and dissimilar documents. Typically, they do not consider in what aspects two documents are similar. This limits the granularity of applications like recommender systems that rely on document similarity. In this paper, we extend similarity with aspect information by performing a pairwise document classification task. We evaluate our aspect-based document similarity for research papers. Paper citations indicate the aspect-based similarity, i.e., the section title in which a citation occurs acts as a label for the pair of citing and cited paper. We apply a series of Transformer models such as RoBERTa, ELECTRA, XLNet, and BERT variations and compare them to an LSTM baseline. We perform our experiments on two newly constructed datasets of 172,073 research paper pairs from the ACL Anthology and CORD-19 corpus. Our results show SciBERT as the best performing system. A qualitative examination validates our quantitative results. Our findings motivate future research of aspect-based document similarity and the development of a recommender system based on the evaluated techniques. We make our datasets, code, and trained models publicly available.
Evaluating Document Representations for Content-based Legal Literature Recommendations
Abstract
Recommender systems assist legal professionals in finding relevant literature for supporting their case. Despite its importance for the profession, legal applications do not reflect the latest advances in recommender systems and representation learning research. Simultaneously, legal recommender systems are typically evaluated in small-scale user studies without any publicly available benchmark datasets. Thus, these studies have limited reproducibility. To address the gap between research and practice, we explore a set of state-of-the-art document representation methods for the task of retrieving semantically related US case law. We evaluate text-based (e.g., fastText, Transformers), citation-based (e.g., DeepWalk, Poincaré), and hybrid methods. We compare in total 27 methods using two silver standards with annotations for 2,964 documents. The silver standards are newly created from Open Case Book and Wikisource and can be reused under an open license facilitating reproducibility. Our experiments show that document representations from averaged fastText word vectors (trained on legal corpora) yield the best results, closely followed by Poincaré citation embeddings. Combining fastText and Poincaré in a hybrid manner further improves the overall result. Besides the overall performance, we analyze the methods depending on document length, citation count, and the coverage of their recommendations. We make our source code, models, and datasets publicly available at this https URL.
When performance is not enough—A multidisciplinary view on clinical decision support
Abstract
Scientific publications about the application of machine learning models in healthcare often focus on improving performance metrics. However, beyond often short-lived improvements, many additional aspects need to be taken into consideration to make sustainable progress. What does it take to implement a clinical decision support system, what makes it usable for the domain experts, and what brings it eventually into practical usage? So far, there has been little research to answer these questions. This work presents a multidisciplinary view of machine learning in medical decision support systems and covers information technology, medical, as well as ethical aspects. The target audience is computer scientists, who plan to do research in a clinical context. The paper starts from a relatively straightforward risk prediction system in the subspecialty nephrology that was evaluated on historic patient data both intrinsically and based on a reader study with medical doctors. Although the results were quite promising, the focus of this article is not on the model itself or potential performance improvements. Instead, we want to let other researchers participate in the lessons we have learned and the insights we have gained when implementing and evaluating our system in a clinical setting within a highly interdisciplinary pilot project in the cooperation of computer scientists, medical doctors, ethicists, and legal experts.
Efficient Language Model Training through Cross-Lingual and Progressive Transfer Learning
Abstract
Most Transformer language models are primarily pretrained on English text, limiting their use for other languages. As the model sizes grow, the performance gap between English and other languages with fewer compute and data resources increases even further. Consequently, more resource-efficient training methods are needed to bridge the gap for languages with fewer resources available. To address this problem, we introduce a cross-lingual and progressive transfer learning approach, called CLP-Transfer, that transfers models from a source language, for which pretrained models are publicly available, like English, to a new target language. As opposed to prior work, which focused on the cross-lingual transfer between two languages, we extend the transfer to the model size. Given a pretrained model in a source language, we aim for a same-sized model in a target language. Instead of training a model from scratch, we exploit a smaller model that is in the target language but requires much fewer resources. Both small and source models are then used to initialize the token embeddings of the larger model based on the overlapping vocabulary of the source and target language. All remaining weights are reused from the model in the source language. This approach outperforms the sole cross-lingual transfer and can save up to 80% of the training steps compared to the random initialization.
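A sketch of the initialization logic: overlapping tokens take their embeddings from the source model, and the remaining target-language tokens are mapped up from the small target-language model. The least-squares projection used below is an illustrative stand-in for the paper's exact combination scheme.

```python
import numpy as np

def init_target_embeddings(E_src, vocab_src, E_small, vocab_small):
    # Fit a linear map small-dim -> large-dim on the overlapping vocabulary.
    overlap = [t for t in vocab_small if t in vocab_src]
    A = np.stack([E_small[vocab_small[t]] for t in overlap])
    B = np.stack([E_src[vocab_src[t]] for t in overlap])
    W, *_ = np.linalg.lstsq(A, B, rcond=None)
    # Overlapping tokens: copy from source; new tokens: project up from small model.
    E_tgt = np.empty((len(vocab_small), E_src.shape[1]))
    for tok, idx in vocab_small.items():
        E_tgt[idx] = E_src[vocab_src[tok]] if tok in vocab_src else E_small[idx] @ W
    return E_tgt

# Toy demo with tiny vocabularies and dimensions.
vocab_src = {"a": 0, "b": 1}; E_src = np.array([[1.0, 0.0], [0.0, 1.0]])
vocab_small = {"a": 0, "b": 1, "ü": 2}; E_small = np.array([[1.0], [0.0], [0.5]])
print(init_target_embeddings(E_src, vocab_src, E_small, vocab_small))
```

All non-embedding weights are then simply copied from the source-language model, which is what makes the approach cheaper than training from scratch.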
Klinische Entscheidungsfindung mit Künstlicher Intelligenz: Ein interdisziplinärer Governance-Ansatz
Abstract
This open-access essential provides orientation for the use of artificial intelligence in everyday clinical practice. The challenges are illustrated by two examples from the field of nephrology, which are reflected upon from an ethical and legal perspective. A comprehensive set of recommendations concludes this thoroughly interdisciplinary volume.
Defx at SemEval-2020 Task 6: Joint Extraction of Concepts and Relations for Definition Extraction
Abstract
Definition Extraction systems are a valuable knowledge source for both humans and algorithms. In this paper we describe our submissions to the DeftEval shared task (SemEval-2020 Task 6), which is evaluated on an English textbook corpus. We provide a detailed explanation of our system for the joint extraction of definition concepts and the relations among them. Furthermore we provide an ablation study of our model variations and describe the results of an error analysis.
SapBERT-Based Medical Concept Normalization Using SNOMED CT
Abstract
Word vector representations, known as embeddings, are commonly used for natural language processing. Particularly, contextualized representations have been very successful recently. In this work, we analyze the impact of contextualized and non-contextualized embeddings for medical concept normalization, mapping clinical terms via a k-NN approach to SNOMED CT. The non-contextualized concept mapping resulted in a much better performance (F1-score = 0.853) than the contextualized representation (F1-score = 0.322).
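The non-contextualized mapping is essentially nearest-neighbor search in embedding space. A toy sketch with stand-in vectors and two real SNOMED CT codes:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

concept_ids = ["22298006", "38341003"]        # myocardial infarction, hypertension
concept_vecs = np.array([[0.9, 0.1, 0.0],
                         [0.1, 0.8, 0.2]])    # stand-in concept-name embeddings

nn = NearestNeighbors(n_neighbors=1, metric="cosine").fit(concept_vecs)
query = np.array([[0.85, 0.2, 0.05]])         # embedding of e.g. "heart attack"
dist, idx = nn.kneighbors(query)
print(concept_ids[idx[0][0]], 1 - dist[0][0]) # assigned concept, cosine similarity
```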
Evaluating German Transformer Language Models with Syntactic Agreement Tests
Abstract
Pre-trained transformer language models (TLMs) have recently refashioned natural language processing (NLP): Most state-of-the-art NLP models now operate on top of TLMs to benefit from contextualization and knowledge induction. To explain their success, the scientific community conducted numerous analyses. Besides other methods, syntactic agreement tests were utilized to analyse TLMs. Most of the studies were conducted for the English language, however. In this work, we analyse German TLMs. To this end, we design numerous agreement tasks, some of which consider peculiarities of the German language. Our experimental results show that state-of-the-art German TLMs generally perform well on agreement tasks, but we also identify and discuss syntactic structures that push them to their limits.
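An agreement test reduces to checking that the model prefers the verb form that agrees with the subject. A sketch using a masked-language-model pipeline (the example sentence and model choice are illustrative; target words must be in the model's vocabulary):

```python
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-german-cased")

def prefer(template, grammatical, ungrammatical):
    # Score both verb forms as fillings for the masked position.
    scores = {d["token_str"]: d["score"]
              for d in fill(template, targets=[grammatical, ungrammatical])}
    return scores[grammatical] > scores[ungrammatical], scores

# Plural subject "Die Kinder" requires the plural verb "spielen".
print(prefer("Die Kinder [MASK] im Garten.", "spielen", "spielt"))
```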
European Language Equality, Report on Europe's Sign Languages
Abstract
This report on Europe’s Sign Languages is part of a series of language deliverables developed within the framework of the European Language Equality (ELE) project. The series seeks to not only delineate the current state of affairs for each European language, but to additionally identify the gaps and factors that hinder further development in research and technology. The survey presented here focuses on the condition of Language Technology (LT) with regard to Europe’s Sign Languages, a set of languages often forgotten in the context of European Language Equality. With the rise of the deep learning paradigm in artificial intelligence, sign language technologies become technologically feasible, provided that enough data is available to feed this data-hungry paradigm. It is exactly the quality and quantity of data that is the main bottleneck in development of well performing and useful technologies. In the past, there have been several projects aimed at developing sign language technologies and methodologies that have been deemed of little value by the deaf communities. Co-creation and involvement of deaf communities throughout projects and development of technologies ensures that this does not happen again.
Full-Text Argumentation Mining on Scientific Publications
Abstract
Scholarly Argumentation Mining (SAM) has recently gained attention due to its potential to help scholars with the rapid growth of published scientific literature. It comprises two subtasks: argumentative discourse unit recognition (ADUR) and argumentative relation extraction (ARE), both of which are challenging since they require e.g. the integration of domain knowledge, the detection of implicit statements, and the disambiguation of argument structure. While previous work focused on dataset construction and baseline methods for specific document sections, such as abstract or results, full-text scholarly argumentation mining has seen little progress. In this work, we introduce a sequential pipeline model combining ADUR and ARE for full-text SAM, and provide a first analysis of the performance of pretrained language models (PLMs) on both subtasks. We establish a new SotA for ADUR on the Sci-Arg corpus, outperforming the previous best reported result by a large margin (+7% F1). We also present the first results for ARE, and thus for the full AM pipeline, on this benchmark dataset. Our detailed error analysis reveals that non-contiguous ADUs as well as the interpretation of discourse connectors pose major challenges and that data annotation needs to be more consistent.
Findings of the WMT 2022 Biomedical Translation Shared Task: Monolingual Clinical Case Reports
Abstract
In the seventh edition of the WMT Biomedical Task, we addressed a total of seven language pairs, namely English/German, English/French, English/Spanish, English/Portuguese, English/Chinese, English/Russian, and English/Italian. This year's test sets covered three types of biomedical text genre. In addition to scientific abstracts and terminology items used in previous editions, we released test sets of clinical cases. The evaluation of clinical case translations was given special attention by involving clinicians in the preparation of reference translations and manual evaluation. For the main MEDLINE test sets, we received a total of 609 submissions from 37 teams. For the ClinSpEn sub-task, we had the participation of five teams.
Tags
adverse drug reactions
Cross-lingual Approaches for the Detection of Adverse Drug Reactions in German from a Patient's Perspective
barrier-free travel
MobASA: Corpus for Aspect-based Sentiment Analysis and Social Inclusion in the Mobility Domain
Clinical Decision Support
Ex4CDS - Textual Explanations for Clinical Decision Support
Explainability
InterroLang: Exploring NLP Models and Datasets through Dialogue-based Explanations
Considering Likelihood in NLP Classification Explanations with Occlusion and Language Modeling
Explainable AI
Ex4CDS - Textual Explanations for Clinical Decision Support
Information Extraction
Data4Transparency
Text2Tech
BIFOLD
Cora4NLP
BBDC2
DEEPLEE
PLASS
SIM3S
German Adverse Drug Reaction (ADR) detection in patient-generated content
MobASA Corpus
MobIE Corpus
Product Corpus
SmartData Corpus
Interpretability
Inseq: An Interpretability Toolkit for Sequence Generation Models
Saliency Map Verbalization: Comparing Feature Importance Representations from Model-free and Instruction-based Methods
Language Understanding
Cora4NLP
DEEPLEE
Ex4CDS - Textual Explanations for Clinical Decision Support
German Adverse Drug Reaction (ADR) detection in patient-generated content
MobASA Corpus
MobIE Corpus
Product Corpus
SmartData Corpus
linguistic test suite
A Linguistically Motivated Test Suite to Semi-Automatically Evaluate German–English Machine Translation Output
Low-Resource Learning
Data4Transparency
Text2Tech
PLASS
machine-generated text
Perceptual Quality Dimensions of Machine-Generated Text with a Focus on Machine Translation
Machine Translation
European Language Equality, Report on Europe's Sign Languages
Neural Machine Translation Methods for Translating Text to Sign Language Glosses
A Linguistically Motivated Test Suite to Semi-Automatically Evaluate German–English Machine Translation Output
Mobility
MobASA Corpus
MobIE Corpus
German Adverse Drug Reaction (ADR) detection in patient-generated content
pharmacovigilance
Cross-lingual Approaches for the Detection of Adverse Drug Reactions in German from a Patient's Perspective
semantic differential
Perceptual Quality Dimensions of Machine-Generated Text with a Focus on Machine Translation
Sentiment Analysis
MobASA: Corpus for Aspect-based Sentiment Analysis and Social Inclusion in the Mobility Domain
MobASA Corpus
summarization
Generating Extended and Multilingual Summaries with Pre-trained Transformers
text classification
Cross-lingual Approaches for the Detection of Adverse Drug Reactions in German from a Patient's Perspective
text quality
Perceptual Quality Dimensions of Machine-Generated Text with a Focus on Machine Translation
wikinews
Generating Extended and Multilingual Summaries with Pre-trained Transformers
Legal Information
Responsible service provider
Responsible for the content of the domain dfki-nlp.github.io within the meaning of § 5 TMG:
Deutsches Forschungszentrum für Künstliche Intelligenz GmbH (DFKI)
Management:
Prof. Dr. Antonio Krüger
Helmut Ditzer
Trippstadter Str. 122
67663 Kaiserslautern
Germany
Phone: +49 631 20575 0
Fax: +49 631 20575 5030
Email: info@dfki.de
Register Court: Amtsgericht Kaiserslautern
Register Number: HRB 2313
ID-Number: DE 148 646 973
The person responsible for the editorial content of the domain cora4nlp.github.io of the German Research Center for Artificial Intelligence GmbH within the meaning of § 18 para. 2 MStV is:
Dr. Leonhard Hennig, Senior Researcher
DFKI Lab Berlin
Alt-Moabit 91c
D-10559 Berlin
Tel: +49 (0)30 / 238 95-0
Email: leonhard.hennig@dfki.de
Website URL: www.dfki.de
Legal notice concerning liability for proprietary content
As a content provider within the meaning of Section 7 (1) of the German Telemedia Act (Telemediengesetz, TMG), the Deutsches Forschungszentrum für Künstliche Intelligenz GmbH (DFKI) is responsible under general law for the content it makes available for use. The DFKI endeavors to ensure that the information provided on this website is accurate and current. Nevertheless, errors and uncertainties cannot be entirely ruled out. For this reason, the DFKI undertakes no liability for ensuring that the provided information is current, accurate or complete, and is not responsible for its quality. The DFKI is not liable for material or immaterial damages caused directly or indirectly by the use or non-use of the offered information, or by the use of erroneous or incomplete information, unless willful or grossly negligent fault can be demonstrated. This also applies with respect to software or data provided for download. The DFKI reserves the right to modify, expand or delete parts of the website or the entire website without separate announcement, or to cease publication temporarily or definitively.
Legal notices for third party content and references to external websites
As a service provider, we are responsible for our own content on these pages in accordance with general law (§ 7 (1) TMG). According to §§ 8 to 10 TMG, however, we are not obliged as a service provider to monitor third-party information that is transmitted or stored, or to investigate circumstances that indicate illegal activity. Obligations to remove or block the use of information under general law remain unaffected. Liability in this regard is only possible from the point in time at which we become aware of a specific legal violation. If we become aware of legal violations, we will remove the content in question immediately. Cross-references ("links") to content provided by others are to be distinguished from our own content. Our offering contains links to external third-party websites. The respective provider is always responsible for the content of the linked external pages, and we cannot accept any liability for it. The DFKI checked this external content for possible legal violations when the links were first set; at the time of the review, no legal violations were apparent. However, it cannot be ruled out that the content is changed afterwards by the respective providers. Permanent monitoring of the content of the linked pages is not reasonable without concrete evidence of a violation of the law, and the DFKI does not continuously check the content to which it refers in its offering for changes that could give rise to a new responsibility. If you are of the opinion that the linked external pages violate applicable law or have otherwise inappropriate content, please inform us directly: info@dfki.de. Should the DFKI discover or receive a notice that an external offering to which it has linked gives rise to civil or criminal liability, the DFKI will remove the link to this offering.
Legal notice concerning copyright
The layout of the homepage, the graphics used and other content on the DFKI website are protected by copyright. The reproduction, processing, distribution and any type of use outside the boundaries of copyright law require the prior written approval of the DFKI. Insofar as any content on this page was not prepared by the DFKI, the copyrights of third parties are observed. Should you nevertheless become aware of a copyright infringement, please inform us accordingly. Upon becoming aware of relevant legal breaches, the DFKI will remove such content immediately.