anlp-sciner-test-withdocstart.txt
-DOCSTART- MedNLI Is Not Immune : Natural Language Inference Artifacts in the Clinical Domain
Crowdworker - constructed natural language inference ( NLI ) datasets have been found to contain statistical artifacts associated with the annotation process that allow hypothesis - only classifiers to achieve better - than - random performance ( Poliak et al . , 2018;Gururangan et al . , 2018;Tsuchiya , 2018 ) . We investigate whether MedNLI , a physician - annotated dataset with premises extracted from clinical notes , contains such artifacts ( Romanov and Shivade , 2018 ) . We find that entailed hypotheses contain generic versions of specific concepts in the premise , as well as modifiers related to responsiveness , duration , and probability . Neutral hypotheses feature conditions and behaviors that co - occur with , or cause , the condition(s ) in the premise . Contradiction hypotheses feature explicit negation of the premise and implicit negation via assertion of good health . Adversarial filtering demonstrates that performance degrades when evaluated on the difficult subset . We provide partition information and recommendations for alternative dataset construction strategies for knowledge - intensive domains .
Introduction
In the clinical domain , the ability to conduct natural language inference ( NLI ) on unstructured , domain - specific texts such as patient notes , pathology reports , and scientific papers plays a critical role in the development of predictive models and clinical decision support ( CDS ) systems .
Considerable progress in domain - agnostic NLI has been facilitated by the development of large - scale , crowdworker - constructed datasets , including the Stanford Natural Language Inference corpus ( SNLI ) , and the Multi - Genre Natural Language Inference ( MultiNLI ) corpus ( Bowman et al . , 2015;Williams et al . , 2017 ) . MedNLI is a similarly motivated , healthcare - specific dataset created by a small team of physician - annotators in lieu of crowdworkers , due to the extensive domain expertise required ( Romanov and Shivade , 2018 ) . Poliak et al . ( 2018 ) , Gururangan et al . ( 2018 ) , Tsuchiya ( 2018 ) , and McCoy et al . ( 2019 ) empirically demonstrate that SNLI and MultiNLI contain lexical and syntactic annotation artifacts that are disproportionately associated with specific classes , allowing a hypothesis - only classifier to significantly outperform a majority - class baseline model . The presence of such artifacts is hypothesized to be partially attributable to the priming effect of the example hypotheses provided to crowdworkers at annotation - time . Romanov and Shivade ( 2018 ) note that a hypothesis - only baseline is able to outperform a majority class baseline in MedNLI , but they do not identify specific artifacts .
We confirm the presence of annotation artifacts in MedNLI and proceed to identify their lexical and semantic characteristics . We then conduct adversarial filtering to partition MedNLI into easy and difficult subsets ( Sakaguchi et al . , 2020 ) . We find that the performance of off - the - shelf fastText - based hypothesis - only and hypothesis - plus - premise classifiers is lower on the difficult subset than on the full and easy subsets ( Joulin et al . , 2016 ) . We provide partition information for downstream use , and conclude by advocating alternative dataset construction strategies for knowledge - intensive domains .
The MedNLI Dataset
MedNLI is a domain - specific evaluation dataset inspired by general - purpose NLI datasets , including SNLI and MultiNLI ( Romanov and Shivade , 2018;Bowman et al . , 2015;Williams et al . , 2017 ) . Much like its predecessors , MedNLI consists of premise - hypothesis pairs , in which the premises are drawn from the Past Medical History sections of a randomly selected subset of de - identified clinical notes contained in MIMIC - III ( Johnson et al . , 2016;Goldberger et al . , 2000 ) . MIMIC - III was created from the records of adult and neonatal intensive care unit ( ICU ) patients . As such , complex and clinically severe cases are disproportionately represented , relative to their frequency of occurrence in the general population .
Physician - annotators were asked to write a definitely true , maybe true , and definitely false set of hypotheses for each premise , corresponding to entailment , neutral and contradiction labels , respectively . The resulting dataset has cardinality : n_train = 11,232 ; n_dev = 1,395 ; n_test = 1,422 .
MedNLI Contains Artifacts
To determine whether MedNLI contains annotation artifacts that may artificially inflate the performance of models trained on this dataset , we train a simple , premise - unaware , fastText classifier to predict the label of each premise - hypothesis pair , and compare the performance of this classifier to a majority - class baseline , in which all training examples are mapped to the most commonly occurring class label ( Joulin et al . , 2016;Poliak et al . , 2018;Gururangan et al . , 2018 ) . Note that since annotators were asked to create an entailed , contradictory , and neutral hypothesis for each premise , MedNLI is class - balanced . Thus , in this setting , a majority class baseline is equivalent to choosing a label uniformly at random for each training example .
The micro F1 - score achieved by the fastText classifier significantly exceeds that of the majority class baseline , confirming the findings of Romanov and Shivade ( 2018 ) , who report a micro F1 - score of 61.9 but do not identify or analyze artifacts .
Characteristics of Clinical Artifacts
In this section , we conduct class - specific lexical analysis to identify the clinical and domain - agnostic characteristics of annotation artifacts associated with each set of hypotheses in MedNLI .
Preprocessing
We cast each hypothesis string in the MedNLI training dataset to lowercase . We then use a scispaCy model pre - trained on the en_core_sci_lg corpus for tokenization and clinical named entity recognition ( CNER ) ( Neumann et al . , 2019a ) . One challenge associated with clinical text , and scientific text more generally , is that semantically meaningful entities often consist of spans rather than single tokens . To mitigate this issue during lexical analysis , we map each multi - token entity to a single - token representation , where sub - tokens are separated by underscores .
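A minimal sketch of this preprocessing step , assuming scispaCy and its en_core_sci_lg model are installed ; the merge_entities helper name is ours , and the underscore - joining mirrors the description above rather than the authors ' released code .

```python
import spacy  # requires: pip install scispacy + the en_core_sci_lg model

def merge_entities(text: str, nlp) -> list:
    """Lowercase, tag clinical entities, and join multi-token entities with underscores."""
    doc = nlp(text.lower())
    # Index recognized entities by their starting token position.
    ent_starts = {ent.start: ent for ent in doc.ents}
    merged, i = [], 0
    while i < len(doc):
        if i in ent_starts:
            ent = ent_starts[i]
            merged.append("_".join(tok.text for tok in ent))
            i = ent.end
        else:
            merged.append(doc[i].text)
            i += 1
    return merged

if __name__ == "__main__":
    nlp = spacy.load("en_core_sci_lg")
    print(merge_entities("Patient has a traumatic brain injury.", nlp))
    # e.g. ['patient', 'has', 'a', 'traumatic_brain_injury', '.'] (exact spans depend on the model)
```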
Lexical Artifacts
Following Gururangan et al . ( 2018 ) , to identify tokens that occur disproportionately in hypotheses associated with a specific class , we compute token - class pointwise mutual information ( PMI ) with add-50 smoothing applied to raw counts , and a filter to exclude tokens appearing fewer than five times in the overall training dataset .
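A sketch of the smoothed PMI computation described above ; the add-50 constant and the minimum - count filter follow the text , while the data layout ( a list of ( tokens , label ) pairs ) is an assumption .

```python
import math
from collections import Counter

def pmi_by_class(examples, smoothing=50, min_count=5):
    """examples: list of (tokens, label) pairs. Returns {(token, label): smoothed PMI}."""
    token_counts = Counter()   # count(token) over all hypotheses
    joint_counts = Counter()   # count(token, label)
    label_counts = Counter()   # count(label), measured in tokens
    for tokens, label in examples:
        for tok in tokens:
            token_counts[tok] += 1
            joint_counts[(tok, label)] += 1
            label_counts[label] += 1

    total = sum(token_counts.values())
    pmi = {}
    for (tok, label), joint in joint_counts.items():
        if token_counts[tok] < min_count:
            continue  # drop rare tokens
        # Add the smoothing constant to the joint count to damp high-PMI, low-frequency pairs.
        p_joint = (joint + smoothing) / total
        p_tok = token_counts[tok] / total
        p_label = label_counts[label] / total
        pmi[(tok, label)] = math.log(p_joint / (p_tok * p_label))
    return pmi
```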
Physician - Annotator Heuristics
In this section , we re - introduce premises to our analysis to evaluate a set of hypotheses regarding latent , class - specific annotator heuristics . If annotators do employ class - specific heuristics , we should expect the semantic contents , ϕ , of a given hypothesis , h ∈ H , to be influenced not only by the semantic contents of its associated premise , p ∈ P , but also by the target class , c ∈ C.
To investigate , we identify a set of heuristics parameterized by ϕ(p ) and c , and characterized by the presence of a set of heuristic - specific Medical Subject Headings ( MeSH ) linked entities in the premise and hypothesis of each heuristic - satisfying example . These heuristics are described below ; specific MeSH features are detailed in the Appendix .
Hypernym Heuristic This heuristic applies when the premise contains clinical condition(s ) , medication(s ) , finding(s ) , procedure(s ) or event(s ) , the target class is entailment , and the generated hypothesis contains term(s ) that can be interpreted as super - types for a subset of elements in the premise ( e.g. , clindamycin < : antibiotic ) .
Probable Cause Heuristic This heuristic applies when the premise contains clinical condition(s ) , the target class is neutral , and the generated hypothesis provides a plausible , often subjective or behavioral , causal explanation for the condition , finding , or event described in the premise ( e.g. , associating altered mental status with drug overdose ) .
Everything Is Fine Heuristic This heuristic applies when the premise contains condition(s ) or finding(s ) , the target class is contradiction , and the generated hypothesis negates the premise or asserts unremarkable finding(s ) . This can take two forms : repetition of premise content plus negation , or inclusion of modifiers that convey good health .
Analysis We conduct a χ² test for each heuristic to determine whether we are able to reject the null hypothesis that pattern - satisfying premise - hypothesis pairs are uniformly distributed over classes . The results support our hypotheses regarding each of the three heuristics . Notably , the percentage of heuristic - satisfying pairs accounted for by the top class is lowest for the HYPERNYM heuristic , which we attribute to the high degree of semantic overlap between entailed and neutral hypotheses .
Adversarial Filtering
To mitigate the effect of clinical annotation artifacts , we employ AFLite , an adversarial filtering algorithm introduced by Sakaguchi et al . ( 2020 ) . AFLite requires distributed representations of the full dataset as input , and proceeds in an iterative fashion . At each iteration , an ensemble of n linear classifiers is trained and evaluated on different random subsets of the data . A score is then computed for each premise - hypothesis instance , reflecting the number of times the instance is correctly labeled by a classifier , divided by the number of times the instance appears in any classifier 's evaluation set . The top - k instances with scores above a threshold , τ , are filtered out and added to the easy partition ; the remaining instances are retained . This process continues until the size of the filtered subset is < k , or the number of retained instances is < m ; retained instances constitute the difficult partition .
To represent the full dataset , we use fastText MIMIC - III embeddings , which have been pre - trained on de - identified patient notes from MIMIC - III ( Romanov and Shivade , 2018;Johnson et al . , 2016 ) . We represent each example as the average of its component token vectors . We proportionally adjust a subset of the hyperparameters used by Sakaguchi et al . ( 2020 ) to account for the fact that MedNLI contains far fewer examples than WINOGRANDE : specifically , we set the training size for each ensemble , m , to 5620 , which represents ≈ 2/5 of the MedNLI combined dataset . The remaining hyperparameters are unchanged : the ensemble consists of n = 64 logistic regression models , the filtering cutoff k = 500 , and the filtering threshold τ = 0.75 .
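A simplified sketch of the AFLite loop as described above , using scikit - learn logistic regression ; the exact slicing and stopping details in Sakaguchi et al . ( 2020 ) differ slightly , and the default hyperparameter values shown simply mirror those stated in the text .

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def aflite(X, y, n=64, m=5620, k=500, tau=0.75, seed=0):
    """Return (easy_idx, difficult_idx) index arrays for embeddings X and labels y."""
    rng = np.random.default_rng(seed)
    retained = np.arange(len(X))
    easy = []
    while len(retained) > m:
        correct = np.zeros(len(retained))
        seen = np.zeros(len(retained))
        for _ in range(n):
            train = rng.choice(len(retained), size=m, replace=False)
            held_out = np.setdiff1d(np.arange(len(retained)), train)
            clf = LogisticRegression(max_iter=1000)
            clf.fit(X[retained[train]], y[retained[train]])
            preds = clf.predict(X[retained[held_out]])
            correct[held_out] += (preds == y[retained[held_out]])
            seen[held_out] += 1
        # Score = fraction of evaluations in which the instance was classified correctly.
        scores = np.divide(correct, seen, out=np.zeros_like(correct), where=seen > 0)
        ranked = np.argsort(-scores)  # most predictable ("easy") instances first
        filtered = [i for i in ranked if scores[i] > tau][:k]
        if not filtered:
            break
        easy.extend(retained[filtered])
        retained = np.delete(retained, filtered)
        if len(filtered) < k:  # filtered subset smaller than k: stop
            break
    return np.array(easy), retained
```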
We apply AFLite to two different versions of MedNLI : ( 1 ) X_h,m : hypothesis - only , multi - token entities merged , and ( 2 ) X_ph,m : premise and hypothesis concatenated , multi - token entities merged . AFLite maps each version to an easy and difficult partition , which can in turn be split into training , dev , and test subsets . We report results for the fastText classifier trained on the original , hypothesis - only ( hypothesis + premise ) MedNLI training set , and evaluated on the full , easy and difficult dev and test subsets of X_h,m ( X_ph,m ) , and observe that performance decreases on the difficult partition .
Discussion
MedNLI is Not Immune from Artifacts
In this paper , we demonstrate that MedNLI suffers from the same challenge associated with annotation artifacts that its domain - agnostic predecessors have encountered : namely , NLI models trained on { Med , S , Multi}NLI can perform well even without access to the training examples ' premises , indicating that they often exploit shallow heuristics , with negative implications for out - of - sample generalization . Interestingly , many of the high - level lexical characteristics identified in MedNLI can be considered domain - specific variants of the more generic , class - specific patterns identified in SNLI . This observation suggests that a set of abstract design patterns for inference example generation exists across domains , and may be reinforced by the prompts provided to annotators . Creative or randomized priming , such as Sakaguchi et al . ( 2020 ) 's use of anchor words from WikiHow articles , may help to decrease reliance on such design patterns , but it appears unlikely that they can be systematically sidestepped without introducing new , " corrective " artifacts .
A Prescription for Dataset Construction
To mitigate the risk of performance overestimation associated with annotation artifacts , Zellers et al . ( 2019 ) advocate adversarial dataset construction , such that benchmarks will co - evolve with language models . This may be difficult to scale in knowledge - intensive domains , as expert validation of adversarially generated benchmarks is typically required . Additionally , in high - stakes domains such as medicine , information - rich inferences should be preferred over correct but trivial inferences that time - constrained expert annotators may be rationally incentivized to produce , because entropy - reducing inferences are more useful for downstream tasks .
We advocate the adoption of a mechanism design perspective , so as to develop modified annotation tasks that reduce the cognitive load placed on expert annotators while incentivizing the production of domain - specific NLI datasets with high downstream utility ( Ho et al . , 2015;Liu and Chen , 2017 ) . An additional option is to narrow the generative scope by defining a set of inferences deemed to be useful for a specific task . Annotators can then map ( premise , relation ) tuples to relation - satisfying , potentially fuzzy subsets of this pool of useful inferences , or return partial functions when more information is needed .
Ethical Considerations
When working with clinical data , two key ethical objectives include : ( 1 ) the preservation of patient privacy , and ( 2 ) the development of language and predictive models that benefit patients and providers to the extent possible , without causing undue harm . With respect to the former , MedNLI 's premises are sampled from de - identified clinical notes contained in MIMIC - III ( Goldberger et al . , 2000;Johnson et al . , 2016 ) , and the hypotheses generated by annotators do not refer to specific patients , providers , or locations by name . MedNLI requires users to complete Health Insurance Portability and Accountability Act ( HIPAA ) training and sign a data use agreement prior to being granted access , which we have complied with .
Per MedNLI 's data use agreement requirements , we do not attempt to identify any patient , provider , or institution mentioned in the de - identified corpus . Additionally , while we provide AFLite easy and difficult partition information for community use in the form of split - example ids and a checksum , we do not share the premise or hypothesis text associated with any example . Interested readers are encouraged to complete the necessary training and obtain credentials so that they can access the complete dataset ( Romanov and Shivade , 2018;Goldberger et al . , 2000 ) .
With respect to benefiting patients , the discussion of natural language artifacts we have presented is intended to encourage clinical researchers who rely on ( or construct ) expert - annotated clinical corpora to train domain - specific language models , or consume such models to perform downstream tasks , to be aware of the presence of annotation artifacts , and adjust their assessments of model performance accordingly . It is our hope that these findings can be used to inform error analysis and improve predictive models that inform patient care .
A Appendix
A.1 Hypothesis - only Baseline Analysis
To conduct the analysis presented in Section 3 , we take the MedNLI training dataset as input , and exclude the premise text for each training example . We cast the text of each training hypothesis to lowercase , but do not perform any additional preprocessing . We use an off - the - shelf fastText classifier , with all model hyperparameters set to their default values with the exception of wordNgrams , which we set equal to 2 to allow the model to use bigrams in addition to unigrams ( Joulin et al . , 2016 ) . We evaluate the trained classifier on the hypotheses contained in the MedNLI dev and test datasets , and report results for each split .
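A sketch of the hypothesis - only baseline using the fasttext Python package ; the file paths are placeholders , and the only non - default hyperparameter is wordNgrams = 2 , as stated above . fastText 's supervised format expects one example per line with a __label__ prefix .

```python
import fasttext  # pip install fasttext

# Each line of the training file: "__label__entailment <lowercased hypothesis text>"
model = fasttext.train_supervised(
    input="mednli_train_hypotheses.txt",  # placeholder path
    wordNgrams=2,                         # unigrams + bigrams; all other settings left at defaults
)

# Evaluate on the dev and test hypotheses (same one-example-per-line format).
for split in ("mednli_dev_hypotheses.txt", "mednli_test_hypotheses.txt"):
    n_examples, precision, recall = model.test(split)
    print(split, "examples:", n_examples, "micro P:", precision, "micro R:", recall)
```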
A.2 Lexical Artifact Analysis
To perform the analysis presented in Section 4 , we cast each hypothesis string in the MedNLI training dataset to lowercase . We then use a scispaCy model pre - trained on the en_core_sci_lg corpus for tokenization and clinical named entity recognition ( CNER ) ( Neumann et al . , 2019a ) . Next , we merge multi - token entities , using underscores as delimiters - e.g. , " brain injury " → " brain_injury " .
When computing token - class pointwise mutual information ( PMI ) , we exclude tokens that appear less than five times in the overall training dataset 's hypotheses . Then , following Gururangan et al . ( 2018 ) , who apply add-100 smoothing to raw counts to highlight particularly discriminative token - class co - occurrence patterns , we apply add-50 smoothing to raw counts . Our approach is similarly motivated ; our choice of 50 reflects the smaller state space associated with a focus on the clinical domain .
A.3 Semantic Analysis of Heuristics
To perform the statistical analysis presented in Section 5 , we take the premise - hypothesis pairs from the MedNLI training , dev , and test splits , and combine them to produce a single corpus . We use a scispaCy model pre - trained on the en_core_sci_lg corpus for tokenization and entity linking ( Neumann et al . , 2019b ) , and link against the Medical Subject Headings ( MeSH ) knowledge base . We take the top - ranked knowledge base entry for each linked entity . Linking against MeSH provides a unique concept ID , canonical name , alias(es ) , a definition , and one or more MeSH tree numbers for each recovered entity . Tree numbers convey semantic type information by embedding each concept into the broader MeSH hierarchy . We operationalize each of our heuristics with a set of MeSH - informed semantic properties , which are defined as follows :
1 . Hypernym Heuristic : a premise - hypothesis pair satisfies this heuristic if specific clinical concept(s ) appearing in the premise appear in a more general form in the hypothesis . Formally : { ( p , h ) | ϕ(p ) <: ϕ(h ) } . MeSH tree numbers are organized hierarchically , and increase in length with specificity . Thus , when a premise entity and hypothesis entity are left - aligned , the hypothesis entity is a hypernym for the premise entity if the hypothesis entity 's tree number is a prefix of the premise entity 's tree number ( see the sketch after this list ) . To provide a concrete example : diabetes mellitus is an endocrine system disease ; the associated MeSH tree numbers are C19.246 and C19 , respectively .
2 . Probable Cause Heuristic : a premise - hypothesis pair satisfies this heuristic if : ( 1 ) the premise contains one or more MeSH entities belonging to high - level categories C ( diseases ) , D ( chemicals and drugs ) , E ( analytical , diagnostic and therapeutic techniques , and equipment ) or F ( psychiatry and psychology ) ; and ( 2 ) the hypothesis contains one or more MeSH entities that can be interpreted as providing a plausible causal or behavioral explanation for the condition , finding , or event described in the premise ( e.g. , smoking , substance - related disorders , mental disorders , alcoholism , homelessness , obesity ) .
3 . Everything Is Fine Heuristic : a premise - hypothesis pair satisfies this heuristic if the hypothesis contains one or more of the same MeSH entities as the premise ( excluding the patient entity , which appears in almost all notes ) and also contains : ( 1 ) a negation word or phrase ( e.g. , does not have , no finding , no , denies ) ; or ( 2 ) a word or phrase that affirms the patient 's health ( e.g. , normal , healthy , discharged ) .
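A minimal sketch of the tree - number test used by the Hypernym heuristic ; the C19 / C19.246 example comes from the text above , and the helper name is ours .

```python
def is_hypernym(hyp_tree: str, prem_tree: str) -> bool:
    """True if the hypothesis concept is an ancestor of the premise concept in MeSH."""
    # Tree numbers grow with specificity; left-aligned, an ancestor's number is a
    # strictly shorter prefix of its descendant's number at a dot boundary.
    return (prem_tree != hyp_tree
            and prem_tree.startswith(hyp_tree)
            and prem_tree[len(hyp_tree)] == ".")

# Example from the text: diabetes mellitus (C19.246) <: endocrine system disease (C19)
assert is_hypernym("C19", "C19.246")
assert not is_hypernym("C19.246", "C19")
```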
For each heuristic , we subset the complete dataset to find pattern - satisfying premise - hypothesis pairs . We use this subset when performing the χ² tests .
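A sketch of the per - heuristic test with SciPy ; the observed counts are illustrative , not the paper 's .

```python
from scipy.stats import chisquare

# Counts of heuristic-satisfying pairs per class (entailment, neutral, contradiction).
# These numbers are made up for illustration only.
observed = [412, 267, 58]

# Null hypothesis: satisfying pairs are uniformly distributed over the three classes.
stat, p_value = chisquare(observed)
print(f"chi2 = {stat:.1f}, p = {p_value:.3g}")
if p_value < 0.05:
    print("Reject the null of a uniform distribution over classes.")
```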
Acknowledgments
We thank the four anonymous reviewers whose feedback and suggestions helped improve this manuscript . The first author was supported by the National Institute of Standards and Technology 's ( NIST ) Professional Research Experience Program ( PREP ) . This research was also supported by the DARPA KAIROS program . The views and conclusions contained in this publication are those of the authors and should not be interpreted as representing official policies or endorsements of NIST , DARPA , or the U.S. Government .
A.4 Adversarial Filtering
When implementing AFLite , we follow Sakaguchi et al . ( 2020 ) . We use a smaller training set size of m = 5620 , but keep the remaining hyperparameters unchanged , such that the ensemble consists of n = 64 logistic regression models , the filtering cutoff , k = 500 , and the filtering threshold τ = 0.75 .
-DOCSTART- Question Generation for Adaptive Education
Intelligent and adaptive online education systems aim to make high - quality education available for a diverse range of students . However , existing systems usually depend on a pool of hand - made questions , limiting how fine - grained and open - ended they can be in adapting to individual students . We explore targeted question generation as a controllable sequence generation task . We first show how to fine - tune pre - trained language models for deep knowledge tracing ( LM - KT ) . This model accurately predicts the probability of a student answering a question correctly , and generalizes to questions not seen in training . We then use LM - KT to specify the objective and data for training a model to generate questions conditioned on the student and target difficulty . Our results show we succeed at generating novel , well - calibrated language translation questions for second language learners from a real online education platform .
Introduction
Online education platforms can increase the accessibility of educational resources around the world . However , achieving equitable outcomes across diverse learning needs benefits from systems that are adaptive and individualized to each student ( Doroudi and Brunskill , 2019 ) . Traditionally , adaptive education methods involve planning over a pool of pre - made questions ( Atkinson , 1972;Hunziker et al . , 2018 ) . These are naturally limited by the diversity and coverage of the pool , as well as the scaling capacity of curriculum planning algorithms . Recent approaches , such as procedural generation for personalized programming games ( Valls - Vargas et al . , 2017 ) , are limited to well - specified small domains . We address these limitations by leveraging recent success in deep generative models , in particular language models ( LMs ) .
Many educational activities involve sequential data , such as language translation , reading comprehension , algebra , and deductive logic . Meanwhile , pre - trained LMs can effectively handle sequences from a wide range of modalities ( Madani et al . , 2020;Polu and Sutskever , 2020 ) . In this work , we focus on natural language sequences , where recent progress in language modeling has shown great success at capturing abstract properties of language ( Hewitt and Manning , 2019;Liu et al . , 2019 ) . Specifically , we show how pre - trained LMs can be easily leveraged to adaptively generate questions for a given student and target difficulty in a reverse translation task , using difficulty at answering questions as a proxy for more complex future learning objectives .
Figure 1 : Example input and outputs for our LM - based knowledge tracing model ( middle ) and question generation model ( bottom ) for an online reverse language translation task ( top ) . A question in this task consists of a target phrase for the student , in this case a Spanish learner , to translate ( e.g. " the woman " ) .
We introduce an LM - based knowledge tracing model ( LM - KT ) to predict students ' difficulty on novel questions ( e.g. target phrases to translate ) . We show that LM - KT is well - calibrated , allowing us to pose the learning problem for the question generator : given a student state , generate a question that will achieve a target difficulty , according to LM - KT . We evaluate both LM - KT and question generation models on real users and responses from Duolingo , a popular online second - language learning platform .
Background & Related Works
There exists a rich body of work on precisely modeling student " ability " and learning . For example , Item Response Theory ( IRT ) seeks to model individual student ability based on their responses to different questions , creating a strong factorization between students and test items ( Lord , 1980;Hambelton and Jodoin , 2003 ) . Meanwhile , Computer Adaptive Testing ( CAT ) techniques are used to determine a fixed student ability as quickly as possible by selecting test items based on information utility ( Weiss and Kingsbury , 1984;Thissen and Mislevy , 2000;Settles et al . , 2020 ) . However , these methods , which have been used to develop efficient standardized tests , do not necessarily optimize a student 's learning experience ( Mu et al . , 2018 ) . We instead focus on tracking each student 's evolving knowledge , choosing questions to target difficulty .
Knowledge Tracing ( KT ) seeks to model a student 's knowledge state from their answer history in order to help individualize exercise sequences ( Corbett and Anderson , 1995 ) . This draws inspiration from traditional education curriculum practices , such as distributed spacing of vocabulary ( Bloom and Shuell , 1981 ) and mixed review in mathematics ( Rohrer , 2009 ) . To address simplifying assumptions in earlier KT approaches , such as discrete knowledge representations , Piech et al . ( 2015 ) introduced Deep Knowledge Tracing ( DKT ) , which uses RNNs to enable more complex knowledge representations for students . Recently , SAINT+ ( Shin et al . , 2020 ) showed state - of - the - art performance on the popular EdNet KT task using a Transformer model to capture temporal information across activities , motivating our use of Transformer LMs .
Controllable Text Generation aims to steer LMs towards desired attributes . Examples include using reinforcement learning to control quality metrics ( Ranzato et al . , 2016 ) , adjusting sampling weights to control for poetry style ( Ghazvininejad et al . , 2017 ) , and learning to condition on valence or domain - specific codes ( Keskar et al . , 2019;Peng et al . , 2018 ) . To the best of our knowledge , we are the first to apply controllable text generation to an adaptive education task .
Method
Given any autoregressive language model ( e.g. GPT-2 ( Radford et al . , 2019 ) ) , we can fine - tune an LM - KT model ( p_{θ_KT} ) to predict whether an individual student will correctly answer the next question . If this model has well - calibrated uncertainty , we can use its predicted probability of a correct answer as a proxy for the difficulty of a question to a student . We then train a question generation model ( p_{θ_QG} ) to generate a new question conditioned on a student and desired target difficulty .
Question Representation Unlike standard DKT , which treats questions as IDs or simple handcrafted features , we represent questions fully in text ( e.g. " she eats " in Figure 1 ) . This is a key contribution of our work , required by our eventual goal of generating questions in text , and allows the model to leverage similarity across linguistic features . We thus represent a question q as a sequence of words , with prefix and suffix tokens :
q_i = <Q> w^i_1 w^i_2 w^i_3 ... w^i_n <A>
Student State We represent a student as a temporally - evolving sequence of questions and their responses . As in much previous KT work , we represent the student response as simply correct / incorrect , with special tokens < Y > and < N > . A student 's current state is thus represented as a sequence of all past question and response pairs :
s_j = q^j_1 a^j_1 q^j_2 a^j_2 ... q^j_m a^j_m , where a_i ∈ { <Y> , <N> }
LM - KT Given the sequential nature of student learning over time , we can easily frame knowledge tracing as an autoregressive language modeling task . Given a dataset D of students s_1 , s_2 , ... , s_|D| , we employ the standard training objective of finding the parameters θ_KT that minimize
L_KT = − Σ_{i=1}^{|D|} Σ_{t=1}^{|x^{(i)}|} log p_{θ_KT} ( x^{(i)}_t | x^{(i)}_{<t} )   ( 1 )
where x^{(j)} = ( x^{(j)}_1 , ... , x^{(j)}_{|x|} ) is the entire sequence of tokens corresponding to student s_j , consisting of all their past questions and answers . Using the softmax output of the LM - KT model ( p_{θ_KT} ) , we estimate a student 's ( inverse ) difficulty in answering a specific question as d_{q,s} = p_{θ_KT} ( <Y> | s , q ) . We find that p_{θ_KT} is well - calibrated ( Section 4.2 ) , yielding a good proxy for the true question difficulty .
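A sketch of how the difficulty estimate d_{q,s} = p_{θ_KT} ( <Y> | s , q ) could be read off a fine - tuned GPT-2 with HuggingFace Transformers ; the special tokens and the flat string format are assumptions that mirror the description above , not the authors ' released code , and the base "gpt2" checkpoint stands in for the fine - tuned LM - KT model .

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
tokenizer.add_special_tokens({"additional_special_tokens": ["<Q>", "<A>", "<Y>", "<N>"]})
model = GPT2LMHeadModel.from_pretrained("gpt2")  # in practice: the fine-tuned LM-KT checkpoint
model.resize_token_embeddings(len(tokenizer))

def question_difficulty(student_state: str, question: str) -> float:
    """Return p(<Y> | s, q), the model's probability that the student answers correctly."""
    prompt = f"{student_state} <Q> {question} <A>"
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits[0, -1]  # next-token distribution after <A>
    probs = torch.softmax(logits, dim=-1)
    yes_id = tokenizer.convert_tokens_to_ids("<Y>")
    return probs[yes_id].item()

# Example with a hypothetical student state: two answered questions, then a new one.
state = "<Q> the boy <A> <Y> <Q> she eats <A> <N>"
print(question_difficulty(state, "the woman"))
```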
Question Generation We frame question generation as fine - tuning a new autoregressive LM . Given random samples of students and questions from a held - out set not used to train LM - KT , we can construct a new dataset D consisting of s_i d_i <G> q_i sequences , where <G> is a special generation token and d_i = p_{θ_KT} ( <Y> | s_i , q_i ) is the continuous difficulty value assigned by LM - KT . We learn a linear layer to map the continuous input difficulty into a difficulty control vector c_d of dimension matching the LM word - embeddings , which we append to the token embeddings . Unlike LM - KT , we train our question generation model p_{θ_QG} to minimize the loss only on the question text , which only appears after the <G> token . If t_g is the token index of <G> , then our modified loss is :
L_QG = − Σ_{i=1}^{|D|} Σ_{t=t_g+1}^{|x^{(i)}|} log p_{θ_QG} ( x^{(i)}_t | x^{(i)}_{<t} )   ( 2 )
where sequence x^{(j)} contains the full s_j d_j <G> q_j sequence . At test time , we generate tokens w_1 ... w_n conditioned on the s_j d_j <G> prefix .
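A sketch of the loss in Equation 2 : a linear layer maps the scalar difficulty to an embedding - sized control vector that is inserted between the student state and <G> , and cross - entropy is taken only on tokens after <G> . Tensor shapes and the exact position at which the control vector is appended are our own assumptions .

```python
import torch
import torch.nn as nn
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
tokenizer.add_special_tokens({"additional_special_tokens": ["<Q>", "<A>", "<Y>", "<N>", "<G>"]})
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.resize_token_embeddings(len(tokenizer))
difficulty_proj = nn.Linear(1, model.config.n_embd)  # scalar difficulty -> control vector c_d

def qg_loss(student_ids, difficulty, question_ids):
    """Cross-entropy over the question tokens only (everything after <G>)."""
    g_id = torch.tensor([[tokenizer.convert_tokens_to_ids("<G>")]])
    token_ids = torch.cat([student_ids, g_id, question_ids], dim=1)
    embeds = model.transformer.wte(token_ids)                      # (1, T, n_embd)
    c_d = difficulty_proj(torch.tensor([[[float(difficulty)]]]))   # (1, 1, n_embd)
    inputs_embeds = torch.cat([embeds[:, :student_ids.size(1)],    # student state s
                               c_d,                                # difficulty control vector
                               embeds[:, student_ids.size(1):]],   # <G> and question tokens
                              dim=1)
    logits = model(inputs_embeds=inputs_embeds).logits
    # Labels: ignore everything up to and including <G>; predict only question tokens.
    labels = torch.full(inputs_embeds.shape[:2], -100)
    t_g = student_ids.size(1) + 1                                  # position of <G> after inserting c_d
    labels[:, t_g + 1:] = question_ids
    shift_logits = logits[:, :-1].reshape(-1, logits.size(-1))
    shift_labels = labels[:, 1:].reshape(-1)
    return nn.functional.cross_entropy(shift_logits, shift_labels, ignore_index=-100)
```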
Experiments
Our method generalizes to any education activity that can be represented with text sequences . Due to the availability of real student learning data , we focus on a reverse language translation task , where a student translates phrases from their native language ( e.g. English , " she eats " ) to the second language they are learning ( e.g. Spanish , " ella come " ) .
Experimental Details
We use the 2018 Duolingo Shared Task on Second Language Acquisition Modeling ( Settles et al . , 2018 ) dataset , which contains questions and responses for Duolingo users over the first 30 days of learning a second language . While the original task 's goal was to identify token - level mistakes , we collapse these errors into binary ( correct / incorrect ) per - question labels . We use the provided train / dev / test splits for users learning Spanish and French . We create separate held - out sets from the test set to evaluate the LM - KT and question generation models . For both models , we fine - tune separate GPT-2 ( Radford et al . , 2019 ) models .
Results : Student Modeling
We evaluate LM - KT in two ways : first , its ability to predict whether an individual student will answer a novel question correctly on a held - out test set of real Duolingo student responses ; second , how well - calibrated these predictions are , which is crucial to our later use of LM - KT for question generation . Table 1 compares AUC - ROC on a held - out test set for our LM - KT model with standard DKT , which uses question IDs instead of text , and a baseline that ignores the student state , only using the question text representation . This question - only baseline would perform well if the Duolingo dataset largely consisted of universally " easy " and " difficult " questions , independent of the individual student . Our results show that incorporating the student state is crucial for accurately predicting Duolingo user responses , and including question text also leads to a significant improvement . LM - KT outperforms standard DKT especially on novel questions , a necessary generalization ability for generation .
Finally , we measure the calibration of our LM - KT models for both Spanish and French ( from English ) learners , which is the crucial property for our downstream generation task . We bin our test data by predicted question difficulty , and plot the fraction of true correct answers in each bin . Figure 2 shows that LM - KT is well - calibrated , for both Spanish and French , meaning the predicted difficulty matches the empirically observed proportion of correct answers .
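A sketch of the calibration check : bin held - out examples by predicted difficulty and compare the empirical accuracy in each bin ; the variable names are placeholders .

```python
import numpy as np

def calibration_bins(predicted_difficulty, was_correct, n_bins=10):
    """Return (bin centers, empirical fraction correct) for a reliability plot like Figure 2."""
    predicted_difficulty = np.asarray(predicted_difficulty, dtype=float)
    was_correct = np.asarray(was_correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    centers, observed = [], []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (predicted_difficulty >= lo) & (predicted_difficulty < hi)
        if mask.any():
            centers.append((lo + hi) / 2)
            observed.append(was_correct[mask].mean())
    return np.array(centers), np.array(observed)

# Perfect calibration means observed ≈ centers; plotting observed against centers reproduces a Figure 2-style curve.
```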
Results : Question Generation
We evaluate four different aspects of our question generation model : ( i ) successful control for difficulty , ( ii ) novelty , ( iii ) fluency , and ( iv ) latency .
Difficulty Control To explore whether our question generation model indeed depends on target difficulty and the individual student , we first measure the model 's perplexity on a held - out test set of Duolingo questions , compared to permutation baselines . Table 2 ( top ) shows that perplexity is lower for true student / target difficulty inputs than when either or both of these are permuted . The target difficulty values in this analysis were defined by the LM - KT model . We can remove this dependence by using the actual student responses from Duolingo : we set the target difficulty to 1 if the student was correct and 0 otherwise . Table 2 ( bottom ) shows our model prefers questions paired with these " true correctness " targets over questions paired with random ones .
To evaluate how well our generation model achieves target difficulties , we take 15 unseen students and generate 30 questions for each of 9 input difficulties ( 0.1 - 0.9 ) . We then use LM - KT ( a well - calibrated proxy for true difficulty ) to measure the difficulty of these generated questions for each student . Figure 3 shows that we are able to achieve fine - grained control over target difficulty for both Spanish and French students , with an average Root Mean Squared Error ( RMSE ) of 0.052 across all students and target difficulties . Adding a sampling penalty ( Keskar et al . , 2019 ) increases the variance in difficulty ( RMSE 0.062 ) in exchange for more novel and diverse questions , as discussed next .
Novelty and Fluency By leveraging a pre - trained language model 's ability to manipulate structure , we can generate novel questions not present in the entire Duolingo question set ( see Table 3 ) . Across 4,050 questions generated for Spanish learners , we found that with a repetition penalty ( Keskar et al . , 2019 ) , around 43 % of all questions , and 66 % of high difficulty ( d = 0.1 ) questions , were novel . Latency To compare with pool - based selection , we measure the time required to rank all questions in the pool , varying its size ( Figure 4 ) . On one NVIDIA Titan XP GPU , we find that , averaged across all target difficulties , our question generation model takes half the time to achieve the same quality as pool selection . The gap increases when trying to sample harder questions ( d < 0.5 ) : even a pool size of 1000 does not have sufficient difficult questions , likely due to a skew in the Duolingo question set . Additional controls , such as for style or topic , can easily be combined with our generation method , but would make pool selection exponentially more complex .
Figure 4 : Pool selection ( for one student ) suffers worse question quality vs. latency trade - off than question generation , especially for sampling difficult questions .
Conclusion
Our work is a first step toward showing that sequence - based models combined with domain knowledge , such as pre - trained LMs , can be leveraged for adaptive learning tasks . We show how to use modern LMs to generate novel reverse - translation questions that achieve a target difficulty , allowing adaptive education methods to expand beyond limited question pools . Limitations of our approach include the compute constraints of large LMs and training data availability . More detailed student data will be crucial to future model development . For instance , while most publicly available education datasets do not include the full student responses ( e.g. the full translation response in Duolingo ) , such information could significantly improve the performance of our LM - KT model . Other future directions include exploring non - language domains , such as math or logic exercises , and controlling for auxiliary objectives such as question topic .
Finally , designing appropriate user studies to evaluate our method is a complex yet critical next step to determine its suitability in a real - world education setting . Our technique allows control for individual student difficulty , but it leaves open the question of optimal curriculum design using difficulty - directed question generation .
Broader Impact
Online education platforms can increase the accessibility of high quality educational resources for students around the world . Adaptive techniques that allow for more individualized learning strategies can help such technologies be more inclusive for students who make less - common mistakes or have different prior backgrounds ( Lee and Brunskill , 2012 ) . However , our method is subject to biases found in the training data , and careful consideration of using safe and appropriate data is crucial in an education context . Moreover , our specific use of pre - trained LMs relies on the significant progress of NLP tools for the English language ; further research and development of these tools for other languages can help ensure our method benefits a larger population of students .
A APPENDIX
A.1 Dataset Details
The 2018 Duolingo Shared Task on Second Language Acquisition Modeling ( Settles et al . , 2018 ) dataset contains questions and responses for Duolingo users over the first 30 days of learning a second language . The dataset contains three different question types : reverse translate ( free response translation of a given prompt in the language they are learning ) , reverse tap ( a selection - based equivalent of reverse translate ) , and listen , where students listen to a vocal utterance . We focus on the reverse translate question type for English - speaking students learning French and Spanish . The dataset size for French learners ( 1.2k users ) is roughly half the size of that for Spanish learners ( 2.6k users ) .
Because the original dataset was intended for per - token error prediction , each question has per - token information that includes whether the student translated the token correctly , as well as Universal Dependencies tags such as part of speech and morphology labels . We use the full question text , rather than individual tokens , for our task , and combine the labels such that if a Duolingo user incorrectly translated one or more tokens in a question , the entire question is marked incorrect . We do not use any additional features .
We use the publicly provided train / dev / test splits from the Shared Task , which are temporally ordered in sequence . We therefore construct student states by tracking user IDs throughout the datasets and appending each new question and response to the current student state . When evaluating our LM - KT model , we use the true responses of preceding questions in the test set to form the student state for a given question . Overall , we find that the dataset is severely imbalanced ( as in the original task ) : about 30 % of questions are answered incorrectly across students studying both French and Spanish .
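A sketch of the label collapsing and student - state construction described above , assuming each record carries a user ID , the question text , and per - token correctness flags ; the field names are placeholders , not the Shared Task schema .

```python
from collections import defaultdict

def build_student_sequences(records):
    """records: iterable of dicts with 'user_id', 'question_text', 'token_correct' (list of bools).
    Returns {user_id: [(question_text, '<Y>' or '<N>'), ...]} in temporal order."""
    sequences = defaultdict(list)
    for rec in records:
        # A question is incorrect if the user mistranslated one or more of its tokens.
        answer = "<Y>" if all(rec["token_correct"]) else "<N>"
        sequences[rec["user_id"]].append((rec["question_text"], answer))
    return sequences

# Example with made-up records:
records = [
    {"user_id": "u1", "question_text": "she eats", "token_correct": [True, True]},
    {"user_id": "u1", "question_text": "the woman", "token_correct": [True, False]},
]
print(build_student_sequences(records)["u1"])
# [('she eats', '<Y>'), ('the woman', '<N>')]
```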
Finally , we create a held - out set of Duolingo questions for both French and Spanish learners to create the training data for our question generation model . From a set of random student states , we select questions from this set and use a trained LM - KT model to assign the difficulty score . In practice , this held - out set can come from any source , not just Duolingo data .
A.2 Model Training Details
To train both our LM - KT knowledge tracing model and our question generation model , we use the pre - trained OpenAI GPT-2 model from the HuggingFace Transformers library ( Wolf et al . , 2020 ) . For question generation , we modify the library to add a linear layer and the modified loss function for question generation from Section 3 .
We use 1 NVIDIA TitanXP GPU with 12 GB of memory available . Because the maximum input sequence length of the GPT-2 model we use is 1024 tokens , we resize all inputs to the last 1024 tokens before training . We report results for an LM - KT model trained for 13k steps with the default batch size of 2 and learning rate of 5e-5 , and a Question Generation model trained for 25k steps with the same batch size and learning rate . The total compute time to train both models was 2.5 hours for each language learning task .
A.3 Question Generation Details
For both French and Spanish question generation models , we select 15 students unseen during training and generate 30 questions across 9 difficulties from 0.1 to 0.9 , using nucleus sampling ( Holtzman et al . , 2020 ) ( p = 0.99 ) with a maximum output length of 20 tokens . We also vary a repetition penalty ( Keskar et al . , 2019 ) that penalizes previously seen tokens ( including those in the student state ) . Lastly , we resize all prompts ( student state and target difficulty ) to fit into the GPT-2 model by taking the most recent 1024 tokens , as in training . This is a limitation of our work , as the full student history cannot be considered for students who have answered a large set of questions .
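A sketch of this decoding configuration using HuggingFace 's generate ; top_p = 0.99 , the 20 - token cap , and the 1024 - token prompt truncation mirror the text , while the checkpoint name , the plain - text prompt ( in the actual method the difficulty enters as a control vector , not text ) , and the repetition penalty value are placeholders .

```python
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")  # placeholder for the fine-tuned QG checkpoint
model = GPT2LMHeadModel.from_pretrained("gpt2")

prompt = "<Q> she eats <A> <Y> <G>"                    # hypothetical student state followed by <G>
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
input_ids = input_ids[:, -1024:]                       # keep only the most recent 1024 tokens

output = model.generate(
    input_ids,
    do_sample=True,
    top_p=0.99,                 # nucleus sampling (Holtzman et al., 2020)
    max_new_tokens=20,          # maximum output length of 20 tokens
    repetition_penalty=1.3,     # illustrative value; penalises previously seen tokens
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(output[0, input_ids.size(1):], skip_special_tokens=True))
```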
A.4 Additional Question Generation Outputs
Our question generation model demonstrates the ability to generate novel questions that do not exist in the entire Duolingo question dataset , especially when a sampling penalty is applied to encourage more diverse outputs . However , this comes at a cost to fluency . Below we include a set of outputs generated by our model for 1 Spanish student and 1 French student from the Duolingo dataset , with a target difficulty of d = 0.1 , and both with and without a repetition penalty . We observe that while applying a penalty results in far more novel questions being generated , several of these are also non - fluent , as judged using a combination of manual judgement and the Python language - check package ( https://pypi.org/project/language-check/ ) .
-DOCSTART- An Exploratory Analysis of Multilingual Word - Level Quality Estimation with Cross - Lingual Transformers
Most studies on word - level Quality Estimation ( QE ) of machine translation focus on language - specific models . The obvious disadvantages of these approaches are the need for labelled data for each language pair and the high cost required to maintain several language - specific models . To overcome these problems , we explore different approaches to multilingual , word - level QE . We show that multilingual QE models perform on par with the current language - specific models . In the cases of zero - shot and few - shot QE , we demonstrate that it is possible to accurately predict word - level quality for any given new language pair from models trained on other language pairs . Our findings suggest that the word - level QE models based on powerful pre - trained transformers that we propose in this paper generalise well across languages , making them more useful in real - world scenarios .
Quality Estimation ( QE ) is the task of assessing the quality of a translation without having access to a reference translation ( Specia et al . , 2009 ) . Translation quality can be estimated at different levels of granularity : word , sentence and document level ( Ive et al . , 2018 ) . So far the most popular task has been sentence - level QE , in which QE models provide a score for each pair of source and target sentences . A more challenging task , which is currently receiving a lot of attention from the research community , is word - level quality estimation . This task provides more fine - grained information about the quality of a translation , indicating which words from the source have been incorrectly translated in the target , and whether the words inserted between these words are correct ( good vs bad gaps ) . This information can be useful for post - editors by indicating the parts of a sentence on which they have to focus more .
Word - level QE is generally framed as a supervised ML problem ( Kepler et al . , 2019;Lee , 2020 ) trained on data in which the correctness of translation is labelled at word - level ( i.e. good , bad , gap ) .
The training data publicly available to build word - level QE models is limited to very few language pairs , which makes it difficult to build QE models for many languages . From an application perspective , even for the languages with resources , it is difficult to maintain separate QE models for each language since the state - of - the - art neural QE models are large in size ( Ranasinghe et al . , 2020b ) .
In our paper , we address this problem by developing multilingual word - level QE models which perform competitively in different domains , MT types and language pairs . In addition , for the first time , we propose word - level QE as a zero - shot cross - lingual transfer task , enabling new avenues of research in which multilingual models can be trained once and then serve a multitude of languages and domains . The main contributions of this paper are the following :
i We introduce a simple architecture to perform word - level quality estimation that predicts the quality of the words in the source sentence , target sentence and the gaps in the target sentence .
ii We explore multilingual , word - level quality estimation with the proposed architecture . We show that multilingual models are competitive with bilingual models .
iii We inspect few - shot and zero - shot word - level quality estimation with the bilingual and multilingual models . We report how the source - target direction , domain and MT type affect the predictions for a new language pair .
iv We release the code and the pre - trained models as part of an open - source framework .
Earlier neural approaches to word - level QE include toolkits such as OpenKiwi ( Kepler et al . , 2019 ) . However , the current state of the art in word - level QE is based on transformers like BERT ( Devlin et al . , 2019 ) and XLM - R ( Conneau et al . , 2020 ) where a simple linear layer is added on top of the transformer model to obtain the predictions ( Lee , 2020 ) . All of these approaches consider quality estimation as a language - specific task and build a different model for each language pair . This approach has many drawbacks in real - world applications , some of which are discussed in Section 1 .
Multilinguality Multilinguality allows training a single model to perform a task from and/or to multiple languages . Even though this has been applied to many tasks ( Zampieri , 2020 , 2021 ) , including NMT ( Nguyen and Chiang , 2017;Aharoni et al . , 2019 ) , multilingual approaches have been rarely used in QE . Shah and Specia ( 2016 ) explore QE models for more than one language where they use multitask learning with annotators or languages as multiple tasks . They show that multilingual models led to marginal improvements over bilingual ones with a traditional black - box , feature - based approach . In a recent study , Ranasinghe et al . ( 2020b ) show that multilingual QE models based on transformers trained on high - resource languages can be used for zero - shot , sentence - level QE in low - resource languages .
Using a similar architecture , but with multi - task learning , others report that multilingual QE models outperform bilingual models , particularly in less balanced quality label distributions and low - resource settings . However , these two papers are focused on sentence - level QE and , to the best of our knowledge , no prior work has been done on multilingual , word - level QE models .
In our experiments , we observed that multilingual QE models deliver excellent results on the language pairs they were trained on . In addition , the multilingual QE models perform well in the majority of the zero - shot scenarios where the multilingual QE model is tested on an unseen language pair . Furthermore , multilingual models perform very well with few - shot learning on an unseen language pair when compared to training from scratch for that language pair , proving that multilingual QE models are effective even with a limited number of training instances . While we centered our analysis around the F1 - score of the target words , these findings are consistent with the F1 - score of the target gaps and the F1 - score of the source words too . This suggests that we can train a single multilingual QE model on as many languages as possible and apply it on other language pairs as well . These findings can be beneficial to perform QE in low - resource languages for which the training data is scarce and when maintaining several QE models for different language pairs is arduous .
In this paper , we explored multilingual , word - level QE with transformers . We introduced a new architecture based on transformers to perform word - level QE . The implementation of the architecture , which is based on Hugging Face ( Wolf et al . , 2020 ) , has been integrated into the TransQuest framework ( Ranasinghe et al . , 2020b ) , which won the WMT 2020 shared task on sentence - level direct assessment ( Ranasinghe et al . , 2020a ) .
We also evaluated how the QE models behave with a limited number of training instances . For each language pair , we initiated the weights of the bilingual model with those of the relevant All-1 QE model and trained it on 100 , 200 , 300 and up to 1000 training instances . We compared the results with those obtained having trained the QE model from scratch for that language pair . The results in Figure 2 show that the models initialised with the multilingual All-1 weights outperform those trained from scratch , even with a limited number of training instances .
Few - shot QE
One limitation of the zero - shot QE is its inability to perform when the language direction changes . In the scenario where we performed zero - shot learning from De - En to other language pairs , results degraded considerably from the bilingual result . Similarly , the performance is rather poor when we test on De - En for the multilingual zero - shot experiment as the direction of all the other pairs used for training is different . This is in line with results reported by Ranasinghe et al . ( 2020b ) for sentence level .
We also experimented with zero - shot QE with multilingual QE models . We trained the QE model on all the pairs except one and performed prediction on the test set of the language pair left out . In section II ( " All-1 " ) , we show its difference to the multilingual QE model . This also provides competitive results for the majority of the languages , proving it is possible to train a single multilingual QE model and extend it to a multitude of languages and domains . This approach provides better results than performing transfer learning from a bilingual model .
To test whether a QE model trained on a particular language pair can be generalised to other language pairs , different domains and MT types , we performed zero - shot quality estimation . We used the QE model trained on a particular language pair and evaluated it on the test sets of the other language pairs . Non - diagonal values of section I in Table 2 show how each QE model performed on other language pairs . For better visualisation , the non - diagonal values of section I of Table 2 show by how much the score changes when the zero - shot QE model is used instead of the bilingual QE model . As can be seen , the scores decrease , but this decrease is negligible and is to be expected . For most pairs , the QE model that did not see any training instances of that particular language pair outperforms the baselines that were trained extensively on that particular language pair . Further analysing the results , we can see that zero - shot QE performs better when the language pair shares some properties such as domain , MT type or language direction . For example , En - De SMT ⇒ En - Cs SMT is better than En - De NMT ⇒ En - Cs SMT and En - De SMT ⇒ En - De NMT is better than En - Cs SMT ⇒ En - De NMT .
Zero - shot QE
We combined instances from all the language pairs and built a single word - level QE model . Our results , displayed in section II ( " All " ) of Table 2 , show that multilingual models perform on par with bilingual models or even better for some language pairs . We also investigate whether combining language pairs that share either the same domain or MT type can be more beneficial , since it is possible that the learning process is better when language pairs share certain characteristics . However as shown in sections III and IV of Table 2 , for the majority of the language pairs , specialised multilingual models built on certain domains or MT types do not perform better than multilingual models which contain all the data . Section IV shows the results of the state - of - the - art methods and the best system submitted for the language pair in that competition . NR implies that a particular result was not reported by the organisers . Zero - shot results are coloured in grey and the value shows the difference between the best result in that section for that language pair and itself .
Multilingual QE
The values displayed diagonally across section I of Table 2 show the results for supervised , bilingual , word - level QE models where the model was trained on the training set of a particular language pair and tested on the test set of the same language pair . As can be seen in section V , the architecture outperforms the baselines in all the language pairs and also outperforms the majority of the best systems from previous competitions . In addition to the target word F1 - score , our architecture outperforms the baselines and best systems in target gaps F1 - score and source words F1 - score too as shown in Tables 5 and 6 . In the following sections we explore its behaviour in different multilingual settings .
For evaluation , we used the approach proposed in the WMT shared tasks in which the classification performance is calculated using the multiplication of F1 - scores for the ' OK ' and ' BAD ' classes against the true labels independently : words in the target ( ' OK ' for correct words , ' BAD ' for incorrect words ) , gaps in the target ( ' OK ' for genuine gaps , ' BAD ' for gaps indicating missing words ) and source words ( ' BAD ' for words that lead to errors in the target , ' OK ' for other words ) . In recent WMT shared tasks , the most popular category was predicting quality for words in the target . Therefore , in Section 5 we only report the F1 - score for words in the target . Other results are presented in the supplementary material . Prior to WMT 2019 , organisers provided separate scores for gaps and words in the target , while after WMT 2019 they produce a single result for target gaps and words . We follow this latter approach .
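As a rough illustration of this metric, the sketch below computes the product of the 'OK' and 'BAD' F1-scores with scikit-learn; the flat list format and the example tags are assumptions, not the official WMT evaluation script.

from sklearn.metrics import f1_score

def f1_multiplied(gold_tags, pred_tags):
    # gold_tags / pred_tags: flat lists of 'OK' / 'BAD' labels for target words (or gaps)
    f1_ok = f1_score(gold_tags, pred_tags, pos_label="OK")
    f1_bad = f1_score(gold_tags, pred_tags, pos_label="BAD")
    return f1_ok * f1_bad

# tiny invented example
gold = ["OK", "OK", "BAD", "OK", "BAD"]
pred = ["OK", "BAD", "BAD", "OK", "OK"]
print(f1_multiplied(gold, pred))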
Our architecture relies on the XLM - R transformer model ( Conneau et al . , 2020 ) to derive the representations of the input sentences . XLM - R has been trained on a large - scale multilingual dataset in 104 languages , totalling 2.5 TB , extracted from the CommonCrawl datasets . It is trained using only RoBERTa 's ( Liu et al . , 2019 ) masked language modelling ( MLM ) objective . XLM - R was used by the winning systems in the recent WMT 2020 shared task on sentence - level QE ( Ranasinghe et al . , 2020a ; Lee , 2020 ) . This motivated us to use a similar approach for word - level QE .
Our architecture adds a new token to the XLM - R tokeniser called < GAP > which is inserted between the words in the target . We then concatenate the source and the target with a [ SEP ] token and feed them into XLM - R . A simple linear layer is added on top of the word and < GAP > embeddings to predict whether each is " Good " or " Bad " , as shown in Figure 1 . The training configurations and the system specifications are presented in the supplementary material . We used several language pairs for which word - level QE annotations were available : English - Chinese ( En - Zh ) , English - Czech ( En - Cs ) , English - German ( En - De ) , English - Russian ( En - Ru ) , English - Latvian ( En - Lv ) and German - English ( De - En ) . The texts are from a variety of domains and the translations were produced using both neural and statistical machine translation systems . More details about these datasets can be found in Table 1 and in Fonseca et al . ( 2019 ) .
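A minimal sketch of this architecture is given below, using the Hugging Face Transformers API rather than the authors' released code; the checkpoint name, the example sentence pair and the per-sub-token classification are illustrative assumptions.

import torch
from torch import nn
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
tokenizer.add_special_tokens({"additional_special_tokens": ["<GAP>"]})  # new gap token

encoder = AutoModel.from_pretrained("xlm-roberta-base")
encoder.resize_token_embeddings(len(tokenizer))
classifier = nn.Linear(encoder.config.hidden_size, 2)  # 'OK' vs 'BAD' per position

source = "Das ist ein Test ."
target_words = ["This", "is", "a", "test", "."]
# insert <GAP> between (and around) target words so gaps can also be labelled
target = "<GAP> " + " <GAP> ".join(target_words) + " <GAP>"

inputs = tokenizer(source, target, return_tensors="pt")  # source and target joined by separator tokens
hidden = encoder(**inputs).last_hidden_state             # (1, seq_len, hidden_size)
logits = classifier(hidden)                              # one OK/BAD score per sub-token
print(logits.shape)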
-DOCSTART- Translating Headers of Tabular Data : A Pilot Study of Schema Translation
Schema translation is the task of automatically translating headers of tabular data from one language to another . High - quality schema translation plays an important role in crosslingual table searching , understanding and analysis . Despite its importance , schema translation is not well studied in the community , and state - of - the - art neural machine translation models can not work well on this task because of two intrinsic differences between plain text and tabular data : morphological difference and context difference . To facilitate the research study , we construct the first parallel dataset for schema translation , which consists of 3,158 tables with 11,979 headers written in 6 different languages , including English , Chinese , French , German , Spanish , and Japanese . Also , we propose the first schema translation model called CAST , which is a header - to - header neural machine translation model augmented with schema context . Specifically , we model a target header and its context as a directed graph to represent their entity types and relations . Then CAST encodes the graph with a relational - aware transformer and uses another transformer to decode the header in the target language . Experiments on our dataset demonstrate that CAST significantly outperforms state - of - the - art neural machine translation models . Our dataset will be released at https://github.com/microsoft/ContextualSP .
Introduction
As the saying goes , " a chart is worth a thousand words " . Nowadays , tremendous amounts of informative tabular data written in various languages are widely used in Wikipedia pages , research papers , finance reports , file systems , and databases . Schema translation is the task of automatically translating headers of tabular data from one language to another . High - quality schema translation plays an essential role in cross - lingual table
Figure 1 : An illustrative example of schema translation from English to Chinese , with column headers " No . " , " Match " , " Hosted_by " , " Loc . " and " Cost ( $ ) " . 1 - 4 denote headers with abbreviation , polysemy , verb - object phrase and special symbol , respectively .
searching , understanding , and analysis ( Zhang and Balog , 2018;Deng et al . , 2019;Sherborne et al . , 2020 ) . Note that in this work , we focus on translating the headers instead of the entire table content , since for each entity in table content , it is hard to decide if it needs to be translated or not . Over translation could even have negative effects in reality . Despite its importance , most research efforts are dedicated to plain text machine translation ( Sutskever et al . , 2014;Bahdanau et al . , 2015;Vaswani et al . , 2017;Yang et al . , 2020 ) , and schema translation is not well studied in the community , to the best of our knowledge . According to our preliminary study , state - of - the - art neural machine translation ( NMT ) systems can not work well on schema translation because of two intrinsic differences between plain text and tabular data : morphological difference and context difference .
Morphological Difference . The morphology of table headers differs from that of plain text in the following four aspects . First , headers are always phrases and they usually contain a lot of domain - specific abbreviations ( e.g. , as shown in Figure 1 , " No . " is the abbreviation of " Number " and " Loc . " is short for " Location " ) and special symbols ( e.g. , " $ " means " dollar " in Figure 1 ) . Second , verb - object phrases are frequently used as headers , which indicate a subject - object relationship between two columns . For example , " Hosted by " in Figure 1 indicates a host relationship between the second and the third columns . Third , special tokenizations like CamelCase and underscore are idiomatic usages in headers . Finally , capitalized words are particularly preferred in headers in order to capture more readers ' attention . These special word forms are commonly used in headers but rarely seen in plain text . Therefore , NMT models trained with a massive amount of plain text can not be directly applied to schema translation .
Context Difference . Compared with plain text , which is a sequence of words , tables have well - defined structures , and understanding a table 's structure is crucial for schema translation . Specifically , a table consists of an ordered arrangement of rows and columns . Each column header describes the concept of that column . The intersection of a row and a column is called a cell . Each cell contains entities of the column header it belongs to . This structure plays an important role in schema translation , especially for polysemy words and abbreviation words . For example , in Figure 1 , the header " Match " could be translated into Chinese as " matchstick " , " mapping " , or " competition " , but its sibling column header " Hosted_by " provides important clues that the table might belong to the domain of sport . Thus , translating " Match " as " competition " is more appropriate in the context . Moreover , a column header 's cell values could also provide hints to infer the meaning of the header . For example , successive numerical cell values indicate that " No . " might be an identity column in Figure 1 . NMT models trained with plain text have never seen the structure of tables , and consequently , they perform poorly in schema translation .
Although the context information of tables is important , how to effectively use it for schema translation is challenging . On the one hand , the NMT model needs to make use of the context information to perform word - sense disambiguation for polysemy headers and abbreviation headers . On the other hand , the context information should not bring additional noise when translating the target header .
To facilitate the research study , we construct the first parallel dataset for schema translation . It consists of 3,158 tables with 11,979 headers written in six different languages , including English , Chinese , French , German , Spanish , and Japanese .
Furthermore , to address the challenges in schema translation , we propose a Context Aware Schema Translation ( CAST ) model , which is a header - to - header neural machine translation model augmented with table context . Specifically , we model a target header and its context as a directed graph to represent their entity types and structural relations . Then CAST encodes the graph with a relation - aware transformer and uses another transformer to decode the header in the target language . The advantages of our approach are twofold : ( 1 ) the structural relationships make the transformer encoder capture the structural information and learn a contextualized representation for the target header ; ( 2 ) the entity types differentiate the target header from its context and thus help denoise the target header translation .
Experiments on our dataset demonstrate that CAST significantly outperforms state - of - the - art neural machine translation models . Our contributions are summarized as follows .
• We propose the task of schema translation , and discuss its differences with a plain text translation . To facilitate the research study , we construct the first parallel schema translation dataset .
• We propose a header - to - header context - aware schema translation model , called CAST , for the new schema translation task . Specifically , we use the transformer self - attention mechanism to encode the schema over predefined entity types and structural relationships , making it aware of the schema context .
• Experiments on our proposed dataset demonstrate that our approach significantly outperforms the state - of - the - art neural machine translation models in schema translation .
Schema Translation Dataset
To address the need for a dataset for the new schema translation task , we construct the first parallel schema translation dataset . It consists of 3,158 tables with 11,979 headers written in six different languages , including English , Chinese , French , German , Spanish , and Japanese . In this section , we will first introduce our construction methodology and then analyze the characteristics of our dataset .
Dataset Construction
We construct the dataset in two steps : collecting 3,158 English tables and then manually translating the schema of the English tables into the other languages . Firstly , we collect the tables from the WikiTableQuestions dataset ( Pasupat and Liang , 2015 ) , in which they randomly select 2,108 multidomain data tables in English from Wikipedia with at least eight rows and five columns . Secondly , we manually collect 176 English tables from the search engine covering multiple domains like retail , education , and government . At last , we select all the tables that appear in the training set and development set of the Spider dataset ( Yu et al . , 2018 ) , which contains 200 databases covering 138 different domains . Finally , we obtained 3,158 tables with 11,979 headers in total .
Context Aware Schema Annotation . To reduce the translation effort , we first use Google translator 1 to automatically translate the English headers to five target languages , header by header . Then based on the Google translations , we recruit three professional translators for each language to manually check and modify the translations if inappropriate .
In this process , we found that Google translator is not good enough in schema translation since industry jargon and abbreviations are commonly used in column headers . Table 1 shows some example headers and their paraphrases under different domains in our dataset . However , domain information is implicit , and the meaning of the header needs to be inferred carefully from the entire table context . To get more precise translations , we provide three kinds of additional information as a schema context : ( 1 ) a whole table with structural information , including its table name , column headers and cell values ; ( 2 ) an original web - page URL for the table from the Wikipedia website ; ( 3 ) some natural language question / answer pairs about the table 2 . Our translators are asked to first understand the context of the given schema before validating the translations . We find that the modification rate is 40 % , which indicates that the provided context is very useful . Finally , we further verify the annotated data by asking a different translator to check if the headers are correctly translated .
Data Statistics and Analysis
Translation is expensive , and we provide a parallel corpus in six languages , which limits the volume of translated headers . According to our statistics , the average validation speed is 100 headers / hour and we spend 159.34 × 5 hours in total . This is much slower than plain text translation , since our translators need to read large amounts of domain - specific context to help disambiguation . In total , we translate 11,979 headers , spending 6,625 USD . According to our translators ' feedback , the context is quite helpful in understanding the meaning of the headers . We will also release these contexts together with our schema translation dataset to facilitate further study .
Dataset Analysis . To have a more quantitative analysis of our dataset , we count the ratio of headers containing four lexical features , including abbreviation , symbol characters , verb - object phrase and capitalized character . As we can see in table 2 , these lexical features commonly occur in headers , making them quite different from plain text .
To help better understand the domains of the collected tables , we firstly use a 44 - category ontology presented in Wikipedia : WikiProject Council / Directory as our domain category . Then we randomly sample 500 tables in the training set and manually label the domains . According to our statistics , our dataset covers all 44 domains . In detail , the Sports , Countries , Economics , and Music topics together comprise 44.6 % of our dataset , but the other 55.4 % is composed of broader topics such as Business , Education , Science , and Government .
Methodology
In this section , we describe our schema translation approach in detail . We first introduce the requirement and our definition for the schema translation task and then introduce the model architecture .
Task Requirement
In schema translation , both the meaning of the headers and the structural information like order and numbers must be completely transferred to the target language . Obviously , this requirement can not be met by translating the schema as a whole with traditional sequence - to - sequence NMT models , because they can not achieve precise token - level alignment . For example , when concatenating all headers with a separator " | " , the separator can easily be lost during translation . To meet this requirement , we employ a header - to - header translation manner in this work , which translates one header at a time .
Task Definition
We define a column header as $H_i = \langle h_1 , \ldots , h_n \rangle$ , where $h_j$ is the $j$-th token of the header in the source language . Let $C_i = ( S_i , V_i )$ denote the context of $H_i$ . It is made up of a set of selected cell values $V_i = \{ v_1 , \ldots , v_t \}$ of $H_i$ and the rest of the headers $S_i = [ H_1 , \ldots , H_{i-1} , H_{i+1} , \ldots , H_m ]$ in the schema . The translation of $H_i$ is denoted as $Y_i = \langle y_1 , \ldots , y_m \rangle$ , where $y_j$ is the $j$-th token of the header in the target language . Taking a header $H$ and its corresponding context $C$ as input , the model outputs the header $Y$ in the target language .
Model
Basically , our model adopts a Transformer encoder - decoder architecture ( Vaswani et al . , 2017 ) , which takes the source language header with its corresponding context as input and generates the translation of the target language header as output . Specifically , we model the target header and its context as a directed graph and use transformer self - attention to encode them over two predefined structural relationships and three entity types . Figure 2 depicts the overall architecture of our model via an illustrative example .
Relation - Aware Self - Attention . First , we introduce self - attention and then its extension , relation - aware self - attention . Consider a sequence of inputs $X = \{ x_i \}_{i=1}^{n}$ , where $x_i \in \mathbb{R}^{d_x}$ . Self - attention , introduced by Vaswani et al . ( 2017 ) , transforms each $x_i$ into $z_i \in \mathbb{R}^{d_x}$ as follows :

$$e_{ij} = \frac{ x_i W^Q ( x_j W^K )^{\top} }{ \sqrt{d_z} } , \qquad \alpha_{ij} = \mathrm{softmax}_j \{ e_{ij} \} , \qquad z_i = \sum_{j=1}^{n} \alpha_{ij} ( x_j W^V ) \qquad ( 1 )$$

where $W^Q , W^K , W^V \in \mathbb{R}^{d_x \times d_z}$ . Shaw et al . ( 2018 ) propose an extension to self - attention that considers the pairwise relationships between input tokens by changing Equation ( 1 ) as follows :

$$e_{ij} = \frac{ x_i W^Q ( x_j W^K + r^K_{ij} )^{\top} }{ \sqrt{d_z} } , \qquad z_i = \sum_{j=1}^{n} \alpha_{ij} ( x_j W^V + r^V_{ij} ) \qquad ( 2 )$$
Here the $r_{ij}$ terms encode the known relationships between the two tokens $x_i$ and $x_j$ in the input sequence . In this way , this self - attention is biased toward some pre - defined relationships using the relation vector $r_{ij}$ in each layer when learning the contextualized embedding . Specifically , they use it to represent the relative position information between sequence elements . More details could be found in their work ( Shaw et al . , 2018 ) .
Figure 2 : An overview of CAST with an illustrative example of English - to - Chinese schema translation . Firstly , the target header " Chinese " and its context are modeled as a directed graph . Then a stack of relation - aware transformers encodes the input sequence $X$ to $X'$ with a relational matrix $R$ induced from the graph .
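A compact sketch of the relation-aware self-attention in Equations (1) and (2) is shown below; it is a single-head PyTorch illustration under assumed dimensions, not the CAST implementation.

import torch
from torch import nn

class RelationAwareSelfAttention(nn.Module):
    def __init__(self, d_x, d_z, num_relations):
        super().__init__()
        self.wq = nn.Linear(d_x, d_z, bias=False)
        self.wk = nn.Linear(d_x, d_z, bias=False)
        self.wv = nn.Linear(d_x, d_z, bias=False)
        # one learned vector per relation type, for keys and for values
        self.rel_k = nn.Embedding(num_relations, d_z)
        self.rel_v = nn.Embedding(num_relations, d_z)
        self.scale = d_z ** 0.5

    def forward(self, x, rel_ids):
        # x: (n, d_x) token embeddings; rel_ids: (n, n) matrix R of edge-type ids
        q, k, v = self.wq(x), self.wk(x), self.wv(x)
        rk, rv = self.rel_k(rel_ids), self.rel_v(rel_ids)           # (n, n, d_z)
        # e_ij = q_i . (k_j + r^K_ij) / sqrt(d_z)   -- Equation (2)
        e = (q.unsqueeze(1) * (k.unsqueeze(0) + rk)).sum(-1) / self.scale
        alpha = torch.softmax(e, dim=-1)                            # (n, n)
        # z_i = sum_j alpha_ij * (v_j + r^V_ij)
        z = (alpha.unsqueeze(-1) * (v.unsqueeze(0) + rv)).sum(dim=1)
        return z                                                    # (n, d_z)

attn = RelationAwareSelfAttention(d_x=16, d_z=16, num_relations=6)
x = torch.randn(5, 16)
rel = torch.zeros(5, 5, dtype=torch.long)
print(attn(x, rel).shape)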
Inspired by Shaw et al . ( 2018 ) , we model the target header and its context as a labeled directed graph and use the same formulation of relation - aware self - attention as Shaw et al . ( 2018 ) . Here $X = \{ x_i \}_{i=1}^{n}$ are the initial embeddings of our input sequence , and the relational matrix $R$ is induced from the input graph , where $r_{ij}$ is a learned embedding according to the type of edge that $x_i$ and $x_j$ hold in the directed input graph . The following section describes the set of relations our model uses to encode a target header concatenated with its context .
Input Graph . We model a target header and its context as a directed graph to represent their entity types and structural relations . Firstly , we induce two kinds of edges to denote the structural relationships between the target header and its context : sibling header ( i.e. , an edge pointing from tokens in S to tokens in the target header ) and belonging value ( i.e. , an edge pointing from tokens in V to tokens in the target header ) . In this way , the structural information is incorporated into the contextualized representation of the target header .
Then , we define three sorts of entity types to distinguish the target header from its context . Specifically , for a token in the target header , we assign a special edge Target pointing to itself , denoting its entity type . For tokens in S and V , we assign them different edges pointing to themselves , namely Header and Value respectively . Figure 2 illustrates an example graph ( with actual edges and labels ) and its induced relational matrix R .
Initial Token Embedding . We obtain the initial token embeddings with a pre - trained transformer encoder before feeding them to the relation - aware transformer . To obtain the input sequence , the elements in S and V are firstly concatenated with a vertical bar " | " . Then , the target header H , the rest of the headers S , and the selected cell values V are concatenated by a separator symbol " [ sep ] " . At last , an additional source language token " ⟨src⟩ " is added at the front to help the pre - trained model identify the source language . The encoder then transforms the final input sequence into a sequence of embeddings $X = [ x_1 , \ldots , x_l ]$ . Then we feed them to the relation - aware layers and get the final contextualized sequence of embeddings $X' = [ x'_1 , \ldots , x'_l ]$ .
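To make the induced relational matrix R concrete, the sketch below builds it from assumed token positions of the target header H, the remaining headers S and the cell values V; the relation names and integer ids are illustrative, not the released CAST code.

NO_RELATION, TARGET, HEADER, VALUE, SIBLING_HEADER, BELONGING_VALUE = range(6)

def build_relation_matrix(n_tokens, target_idx, header_idx, value_idx):
    """target_idx / header_idx / value_idx: token positions of the target header H,
    the remaining headers S and the selected cell values V in the input sequence."""
    R = [[NO_RELATION] * n_tokens for _ in range(n_tokens)]
    # entity-type self edges
    for i in target_idx:
        R[i][i] = TARGET
    for i in header_idx:
        R[i][i] = HEADER
    for i in value_idx:
        R[i][i] = VALUE
    # structural edges pointing from the context to the target header
    for i in header_idx:
        for j in target_idx:
            R[i][j] = SIBLING_HEADER
    for i in value_idx:
        for j in target_idx:
            R[i][j] = BELONGING_VALUE
    return R

R = build_relation_matrix(6, target_idx=[0], header_idx=[2, 3], value_idx=[5])
print(R[2][0], R[5][0], R[0][0])  # SIBLING_HEADER, BELONGING_VALUE, TARGET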
Decoder . The goal of the decoder is to autoregressively generate the translated column header $Y = \langle y_1 , \ldots , y_m \rangle$ . Specifically , taking $X'$ and the representation of previously output tokens as input , the decoder predicts the translation token by token until an ending signal ⟨end⟩ is generated . Similar to the encoder , a special token ⟨tgt⟩ which indicates the target language is added at the front to guide the prediction of the target language .
Experiments
In this section , we conduct experiments on our proposed schema translation dataset to evaluate the effectiveness of our approach . Furthermore , we ablate different ways of context modeling in our approach to understand their contributions . At last , we conduct a qualitative analysis and show example cases and their predicting results .
Experiment Setup
Baseline . We choose two state - of - the - art NMT models , M2M-100 and MBart-50-M2M ( Tang et al . , 2020 ) , as our baselines . Specifically , both baseline models employ the Transformer sequence - to - sequence architecture ( Vaswani et al . , 2017 ) to capture features from the source language input and generate the translation . M2M-100 is directly trained on large - scale translation data , while MBart-50-M2M is first pre - trained with a " Multilingual Denoising Pretraining " objective and then fine - tuned on the machine translation task . We evaluate the baseline models with the following settings :
• Base : The original NMT models without fine - tuning on the schema dataset . • H2H : The NMT models that are fine - tuned on our schema translation dataset in a header - to - header manner . • H2H+CXT : The NMT models are fine - tuned by concatenating a target header and its context as input and translating the target header . • H2H+CXT+ExtL : The NMT models with two extra Transformer layers at the end of the encoder , fine - tuned with the same setting as H2H+CXT .
Besides NMT models , we also trained a phrase - based statistical machine translation ( PB - SMT ) schema translation model with Moses ( Koehn et al . , 2007 ) , with the same data split .
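For the Base setting, translating a single header with M2M-100 through Hugging Face Transformers could look roughly like the sketch below; the checkpoint size is an assumption, since the paper does not state which M2M-100 variant was used.

from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

tokenizer = M2M100Tokenizer.from_pretrained("facebook/m2m100_418M")
model = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100_418M")

tokenizer.src_lang = "en"                                  # source language of the header
encoded = tokenizer("Hosted by", return_tensors="pt")      # one header at a time (header-to-header)
generated = model.generate(**encoded, forced_bos_token_id=tokenizer.get_lang_id("zh"))
print(tokenizer.batch_decode(generated, skip_special_tokens=True))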
Evaluation Metrics . We evaluate the performance of the different models with the 4 - gram BLEU ( Papineni et al . , 2002 ) score of the translations . Following the evaluation setup of M2M-100 , before computing BLEU we de - tokenize the data and apply standard tokenizers for each language : the SacreBLEU tokenizer for Chinese , Kytea for Japanese , and the Moses tokenizer for the rest of the languages . Besides BLEU , we also conduct a human evaluation for a more precise analysis .
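A minimal sketch of this BLEU computation with sacreBLEU is given below; the example strings are invented, and only the Chinese tokenizer option is shown (the paper uses Kytea for Japanese and the Moses tokenizer otherwise).

import sacrebleu

hypotheses = ["比赛"]      # system translations of the headers (illustrative)
references = [["比赛"]]    # one reference stream, parallel to the hypotheses

bleu = sacrebleu.corpus_bleu(hypotheses, references, tokenize="zh")
print(bleu.score)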
Hyperparameters . We fine - tune all of our NMT models for 4 epochs with a batch size of 4 and a warmup rate of 0.2 . To avoid over - fitting , we set the early stopping patience on the validation set to 2 . In the context construction , we randomly select 5 cell values for each target column . The Adam optimizer ( Kingma and Ba , 2015 ) with β1 = 0.9 , β2 = 0.99 and ε = 1e-8 is adopted . We set the number of relation - aware layers to 2 and the learning rate of the decoder and the relation - aware layers to 3e-5 , and we decrease the learning rate of the Transformer encoder by a factor of 4 for M2M-100 and 8 for MBart-50-M2M .
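The layer-wise learning rates could be realised with PyTorch parameter groups roughly as sketched below; the tiny stand-in modules are hypothetical and simply mark where the fine-tuned encoder, decoder and relation-aware layers would go.

import torch
from torch import nn

# hypothetical stand-ins for the pre-trained encoder, the decoder and the relation-aware layers
encoder = nn.Linear(8, 8)
decoder = nn.Linear(8, 8)
relation_layers = nn.Linear(8, 8)

base_lr = 3e-5
optimizer = torch.optim.Adam(
    [
        {"params": decoder.parameters(), "lr": base_lr},
        {"params": relation_layers.parameters(), "lr": base_lr},
        # the pre-trained encoder gets a 4x (M2M-100) or 8x (MBart-50-M2M) smaller learning rate
        {"params": encoder.parameters(), "lr": base_lr / 4},
    ],
    betas=(0.9, 0.99),
    eps=1e-8,
)
print(len(optimizer.param_groups))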
Experimental Results
We conduct experiments of translating schema from English ( En ) to five different languages , including Chinese ( Zh ) , French ( Fr ) , German ( De ) , Spanish ( Es ) , and Japanese ( Ja ) . The performances of different translation models are listed in Table 4 .
Overall Performance . The overall performances of two NMT models across five target languages show similar trends . Firstly , compared with Base , which is trained only on plain text , H2H gains significant improvement . For example , H2H based on M2M-100 outperforms Base by 17.7 , 24.7 , 26.7 , 15.5 , and 16.6 BLEU in translating schema from En to Zh , Es , Fr , De , and Ja , respectively . It demonstrates a big difference between plain text and tabular data , and fine - tuning on schema translation data could alleviate the difference to some extent .
Next , we find that , in most situations , the performance of H2H can be further boosted by concatenating the constructed context from the table . Taking H2H+CXT based on M2M-100 as an example , compared with H2H it obtains 2.1 , 0.6 , and 1.6 points of improvement in the En - Zh , En - De , and En - Ja settings , respectively . For H2H+CXT based on MBart-50-M2M , the concatenation of context also boosts the BLEU score for translating schema from En to Zh and Es by 1.5 and 1.2 points . These observations demonstrate the benefits of making good use of the constructed context .
However , we also notice that concatenating the context does not improve performance for H2H+CXT based on MBart-50-M2M in the En - De and En - Ja settings , nor for H2H+CXT based on M2M-100 in the En - Es and En - Fr settings . We hypothesize that the decrease in BLEU score comes from the noise brought by the context .
There are no significant differences between the performance of H2H+CXT and H2H+CXT+ExtL , which has two extra Transformer layers , since the pre - trained NMT models already have 12 Transformer layers .
For example , the H2H+CXT+ExtL model based on M2M-100 obtains 47.1 , 48.6 , 53.0 , 46.6 , and 40.4 BLEU points on En - Zh , En - Es , En - Fr , En - De , and En - Ja , respectively .
Finally , equipped with the relation - aware module , CAST can make the best use of the context and obtains significant improvements over H2H across all settings . For models based on M2M-100 , CAST outperforms H2H by 2.6 , 1.4 , 0.3 , 1.8 , and 1.9 BLEU in En - Zh , En - Es , En - Fr , En - De , and En - Ja , respectively . When it comes to models based on MBart-50-M2M , CAST obtains improvements of 1.6 , 2.7 , 1.9 , 0.9 , and 0.2 BLEU points over H2H in translating schema from En to the 5 target languages . It is also noticeable that CAST can help denoise the concatenated context compared with H2H+CXT . For instance , CAST based on M2M-100 achieves improvements of 1.5 and 1.2 BLEU points over H2H+CXT for schema translation from En to Es and Fr , respectively . This improvement shows that CAST can better model the target header and its context . We also run Wilcoxon signed - rank tests between CAST and H2H+CXT , and the results show that the improvements are significant with p < 0.05 in 3 out of 5 languages . For the remaining languages , CAST achieves comparable results .
Human Evaluation . Since automatic evaluation metrics can not guarantee whether a predicted result is correct or not , we conduct a human evaluation on the test set for a more precise evaluation . Specifically , we invite two experts to evaluate each language pair . For each case , they compare the machine translation and the human annotation . The label is set to 1 if they think the translation is equivalent to the annotation , otherwise 0 . We report the human evaluation results for Base , H2H , H2H+CXT , and CAST based on M2M-100 in the En - Zh setting in Table 5 . According to the human evaluation , H2H achieves a 14.84 % improvement over Base , and the performance is further boosted by 3.11 % when the context is added . Finally , enhanced by the relation - aware structure , CAST obtains a 2.3 % improvement over H2H+CXT , which demonstrates the effectiveness of our approach .
Ablation Study
We conduct ablation studies on CAST to analyze the contributions of our predefined entity types and structural relationships for context modeling . First , we evaluate the variant of CAST without entity types . Next , we evaluate the performance of CAST without structural relations . Finally , we erase all kinds of relations from CAST , which makes it identical to H2H+CXT . We report the performance of models based on M2M-100 in the En - De and En - Fr settings in Table 6 .
Firstly , it is clear that erasing entity types decreases the performance of the schema translation models . Comparing CAST ( w/o entity type ) with CAST , for instance , we can see a 0.5 and 0.5 decrease of BLEU for En - De and En - Fr respectively . Secondly , the comparison between CAST ( w/o structural relation ) and CAST shows that the structural relations also play an important role in improving the performance of context modeling . As seen in the En - Fr translation setting , CAST ( w/o structural relation ) obtains a 1.0 lower BLEU score than CAST . Finally , when both kinds of edges are erased , the models give the lowest performance .
Table 7 : Qualitative analysis of models ' performance in schema translation from En to Zh on three kinds of headers . For each predicted result , we add extra explanations of their meanings in brackets . Results with underline denote the correct translation for the header .
Qualitative Analysis
In this section , we conduct a qualitative analysis on the effectiveness of CAST based on M2M-100 for three types of headers : headers with special tokenization , abbreviation headers , and polysemy headers . We list some of the example translations in Table 7 .
By comparing the translations for headers with special tokenization , we can see that all fine - tuned models , including H2H , H2H+CXT , and CAST can accurately translate headers in CamelCase or underscore tokenizations , while Base fails to skip the underscore and can not translate " Debt " in the middle of " AccessedDebtService " .
For the abbreviation headers , when translating " OS " ( the abbreviation of operating system ) and " Jan " ( the abbreviation of January ) , both Base and H2H fail to get the correct result . However , being aware of the context of " Jan " ( e.g. , Feb , Mar and Apr , etc . ) and " OS " ( e.g. , Computer , System , and Core , etc . ) , H2H+CXT and CAST can better understand and translate the abbreviations .
When it comes to the polysemy headers , with the help of context like " Height " , " Width " and " Depth " , H2H+CXT and CAST can disambiguate the polysemy header " Area " from region or zone to acreage . For the header " Volume " , however , H2H+CXT copies the source language header , which is not a valid translation , because the model is disturbed by the context . On the other hand , with the help of the relation - aware transformer encoder , CAST generates a proper translation for " Volume " as the capacity of the engine . Affected by the context , H2H+CXT only translates part of the information from the headers ' Film.1 ' and ' Rank of the year ' , while M2M-100 , H2H , and CAST give appropriate translations .
Related Work
With the development of Neural Machine Translation ( NMT ) systems ( Sutskever et al . , 2014 ; Bahdanau et al . , 2015 ) , tremendous success has been achieved by existing studies on machine translation tasks . For instance , Vaswani et al . ( 2017 ) greatly improved bilingual machine translation systems with the Transformer architecture , Edunov et al . ( 2018 ) achieved state - of - the - art results on the WMT'14 English - German task with back - translation augmentation , and Weng et al . ( 2020 ) and Yang et al . ( 2020 ) explored ways to boost the performance of NMT systems with pre - trained language models . Recent works saw the potential of many - to - many settings and proposed models that can perform machine translation on various language pairs . While the above - mentioned studies focus on sentence - level translation of plain text , they are not suitable for schema translation .
A line of machine translation research closely related to our task is phrase - to - phrase translation , which considers phrases in multi - word expressions as the translation unit . Traditional phrase - based SMT models ( Koehn et al . , 2007 ; Haddow et al . , 2015 ) obtain phrase table translation probabilities by counting phrase occurrences and use local context through a smoothed n - gram language model . Recently , some works explore ways to adapt NMT models for phrase translation . For example , Wang et al . ( 2017 ) combined a phrase - based statistical machine translation ( SMT ) model with NMT and showed significant improvements on Chinese - to - English translation data , other work explored the use of phrase structures for NMT systems by modeling phrases in target language sequences , and Feng et al . ( 2018 ) used a phrase attention mechanism to enhance the decoder in relevant source segment recognition . The main differences between these studies and our work are : ( 1 ) we do not rely on external phrase dictionaries or phrase tables ; and ( 2 ) we study how to make use of the schema context for word - sense disambiguation in the schema translation scenario .
Context - aware schema encoding has received considerable attention in both the recent semantic parsing literature ( Hwang et al . , 2019 ; Gong et al . , 2019 ) and the table - to - text literature ( Gong et al . , 2019 ) . In general , there are two sorts of techniques : ( 1 ) adding entity type embeddings and special separator tokens to the input sequence to distinguish the table structure ( e.g. , TypeSQL and IRNet ) ; ( 2 ) encoding the schema as a directed graph . For example , Bogin et al . ( 2019 ) use a Graph Neural Network ( Scarselli et al . , 2008 ) , and Shaw et al . ( 2019 ) use a transformer self - attention mechanism to encode the schema over predefined schema relationships . Unlike these works , we explore the suitability of schema encoding techniques for the newly proposed schema translation task .
Conclusion
In this paper , we propose a new challenging translation task called schema translation , and construct the first parallel dataset for this task . To address the challenges for this new task , we propose CAST , which uses a relational - aware transformer to encode a header and its context over predefined relationships , making it aware of the table context .
Ethical Considerations
The schema translation dataset presented in this work is a free and open resource for the community to study the newly proposed translation task . The English tables collected are from three sources . First , we collect all tables from the WikiTableQuestions dataset ( Pasupat and Liang , 2015 ) , which is a free and open dataset for research on question answering over semi - structured HTML tables . Since all of the tables are collected from open - access Wikipedia pages , there is no privacy issue . Second , we collect 176 English tables from the search engines which are also publicly available and do not contain personal data . To further enlarge our dataset , we select all tables from the training set and development set of the Spider dataset ( Yu et al . , 2018 ) , which is also a free and open dataset for research use . Since the tables from the Spider dataset are mainly collected from open - access online CSV files , college database courses and SQL websites , there is no privacy issue either . For the translation step , we hire professional translators to translate the collected English tables to five target languages and the details can be found in Section 2 .
All the experiments with NMT models in this paper can be run on a single Tesla V100 GPU . On average , the training of models in different languages can be finished in four hours . We implement our model with the Transformer tools in PyTorch , and the data will be released with the paper .
-DOCSTART- Multimodal Quality Estimation for Machine Translation
We propose approaches to Quality Estimation ( QE ) for Machine Translation that explore both text and visual modalities for Multimodal QE . We compare various multimodality integration and fusion strategies . For both sentence - level and document - level predictions , we show that state - of - the - art neural and feature - based QE frameworks obtain better results when using the additional modality .
Quality Estimation ( QE ) for Machine Translation ( MT ) ( Blatz et al . , 2004;Specia et al . , 2009 ) aims to predict the quality of a machine - translated text without using reference translations . It estimates a label ( a category , such as ' good ' or ' bad ' , or a numerical score ) for a translation , given text in a source language and its machine translation in a target language ( Specia et al . , 2018b ) . QE can operate at different linguistic levels , including sentence and document levels . Sentence - level QE estimates the translation quality of a whole sentence , while document - level QE predicts the translation quality of an entire document , even though in practice in literature the documents have been limited to a small set of 3 - 5 sentences ( Specia et al . , 2018b ) .
Table 1 : Example of incorrectly machine - translated text : the word shorts is used to indicate short trousers , but gets translated in French as court , the adjective short . Here multimodality could help to detect the error ( extracted from the Amazon Reviews Dataset of McAuley et al . , 2015 ) .
Text is increasingly accompanied with visual elements such as images or videos , especially in social media but also in domains such as e - commerce . Multimodality has not yet been applied to QE . Table 1 shows an example from our e - commerce dataset in which multimodality could help to improve QE . Here , the English noun shorts is translated by the adjective court ( for the adjective short ) in French , which is a possible translation out of context . However , as the corresponding product image shows , this product is an item of clothing , and thus the machine translation is incorrect . External information can hence help identify mismatches between translations which are difficult to find within the text . Progress in QE is mostly benchmarked as part of the Conference on Machine Translation ( WMT ) Shared Task on QE . This paper is based on data from the WMT'18 edition 's Task 4 , document - level QE . This Task 4 aims to predict a translation quality score for short documents based on the number and the severity of translation errors at the word level ( Specia et al . , 2018a ) . This data was chosen as it is the only one for which meta information ( images in this case ) is available . We extend this dataset by computing scores for each sentence for a sentence - level prediction task . We consider both feature - based and neural state - of - the - art models for QE . Having these as our starting points , we propose different ways to integrate the visual modality .
The main contributions of this paper are as follows : ( i ) we introduce the task of Multimodal QE ( MQE ) for MT as an attempt to improve QE by using external sources of information , namely images ; ( ii ) we propose several ways of incorporating visual information in neural - based and featurebased QE architectures ; and ( iii ) we achieve the state - of - the - art performance for such architectures in document and sentence - level QE .
QE Frameworks and Models
We explore feature - based and neural - based models from two open - source frameworks .
QuEst++ : QuEst++ ( Specia et al . , 2015 ) is a feature - based QE framework composed of two modules : a feature extractor module , to extract the relevant QE features from both the source sentences and their translations , and a machine learning module . We only use this framework for our experiments on document - level QE , since it does not perform well enough for sentence - level prediction . We use the same model ( Support Vector Regression ) , hyperparameters and feature settings as the baseline model for the document - level QE task at WMT'18 .
deepQuest : deepQuest ( Ive et al . , 2018 ) is a neural - based framework that provides state - of - the - art models for multi - level QE . We use the BiRNN model , a light - weight architecture which can be trained at either sentence or document level .
The BiRNN model uses an encoder - decoder architecture : it takes as input both the source sentence and its translation , which are encoded separately by two independent bi - directional Recurrent Neural Networks ( RNNs ) . The two resulting sentence representations are then concatenated as a weighted sum of their word vectors , generated by an attention mechanism . For sentence - level predictions , the weighted representation of the two input sentences is passed through a dense layer with sigmoid activation to generate the quality estimates . For document - level predictions , the final representation of a document is generated by a second attention mechanism , as the weighted sum of the weighted sentence - level representations of all the sentences within the document . The resulting document - level representation is then passed through a dense layer with sigmoid activation to generate the quality estimates .
Additionally , we propose and experiment with BERT - BiRNN , a variant of the BiRNN model . Rather than training the token embeddings with the task at hand , we use large - scale pre - trained token - level representations from the multilingual cased base BERT model ( Devlin et al . , 2019 ) . During training , the BERT model is fine - tuned by unfreezing the weights of the last four hidden layers along with the token embedding layer . This performs comparably to the state - of - the - art predictor - estimator neural model in Kepler et al . ( 2019 ) .
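The partial unfreezing described above could be implemented roughly as in the sketch below, using the Hugging Face BERT model; this is an illustration of the scheme, not the deepQuest code.

from transformers import BertModel

bert = BertModel.from_pretrained("bert-base-multilingual-cased")

# freeze everything, then unfreeze the token embeddings and the last four encoder layers
for param in bert.parameters():
    param.requires_grad = False
for param in bert.embeddings.word_embeddings.parameters():
    param.requires_grad = True
for layer in bert.encoder.layer[-4:]:
    for param in layer.parameters():
        param.requires_grad = True

trainable = sum(p.numel() for p in bert.parameters() if p.requires_grad)
print(f"trainable parameters: {trainable}")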
WMT'18 QE Task 4 data : This dataset was created for the document - level track . It contains a sample of products from the Amazon Reviews Dataset ( McAuley et al . , 2015 ) taken from the Sports & Outdoors category . ' Documents ' consist of the English product title and its description , its French machine translation and a numerical score to predict , namely the MQM score ( Multidimensional Quality Metrics ) ( Lommel et al . , 2014 ) . This score is computed by annotating and weighting each word - level translation error according to its severity ( minor , major and critical ) :
$$\text{MQM Score} = 1 - \frac{ n_{\text{min}} + 5\, n_{\text{maj}} + 10\, n_{\text{cri}} }{ n }$$
For the sentence - level QE task , each document of the dataset was split into sentences ( lines ) , where every sentence has its corresponding MQM score computed in the same way as for the document . We note that this variant is different from the official sentence - level track at WMT since for that task visual information is not available .
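The MQM formula above translates directly into the small helper below; the example counts are invented.

def mqm_score(n_words, n_minor, n_major, n_critical):
    # n_words: number of words in the sentence or document; error counts by severity
    return 1.0 - (n_minor + 5 * n_major + 10 * n_critical) / n_words

# e.g. a 20-word segment with two minor errors and one major error
print(mqm_score(20, 2, 1, 0))  # 0.65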
Text features : For the feature - based approach , we extract the same 15 features as those for the baseline of WMT'18 at document level . For the neural - based approaches , text features are either the learned word embeddings ( BiRNN ) or pre - trained word embeddings ( BERT - BiRNN ) .
Multimodal QE
We propose different ways to integrate visual features in our two monomodal QE approaches ( Sections 3.1 and 3.2 ) . We compare each proposed model with its monomodal QE counterpart as baseline , both using the same hyperparameters .
Multimodal feature - based QE
The feature - based textual features contain 15 numerical scores , while the visual feature vector contains 4,096 dimensions . To avoid over - weighting the visual features , we reduce their dimensionality using Principal Component Analysis ( PCA ) . We consider up to 15 principal components in order to keep a balance between the visual features and the 15 text features from QuEst++ . We choose the final number of principal components to keep according to the explained variance with the PCA , so this number is treated as a hyperparameter . After analysing the explained variance for up to 15 kept principal components ( see Figure 4 in Appendix ) , we selected six numbers of principal components to train QE models with ( 1 , 2 , 3 , 5 , 10 , and 15 ) . As fusion strategy , we concatenate the two feature vectors .
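A sketch of this reduction and fusion step with scikit-learn is given below; random arrays stand in for the real 4,096-dimensional visual vectors and the 15 QuEst++ features.

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
visual = rng.normal(size=(200, 4096))   # one 4,096-dim image vector per document
textual = rng.normal(size=(200, 15))    # the 15 baseline QuEst++ features

pca = PCA(n_components=2)               # number of kept components is a hyperparameter
visual_reduced = pca.fit_transform(visual)
print(pca.explained_variance_ratio_.sum())

features = np.concatenate([textual, visual_reduced], axis=1)  # fusion by concatenation
print(features.shape)                   # (200, 17)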
Multimodal neural - based QE
Multimodality is achieved with two changes in our monomodal models : multimodality integration ( where to integrate the visual features in the architecture ) , and fusion strategy ( how to fuse the visual and textual features ) . We propose the following places to integrate the visual feature vector into the BiRNN architecture :
• embed - the visual feature vector is fused with the word embedding representations of the two input sentences ; • annot - the visual feature vector is used after the encoding of the two input sentences by the two bi - directional RNNs ; • last - the visual feature vector is used just before the last layer . A minimal sketch of the ' last ' integration point is given below .
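The sketch below illustrates the 'last' integration point with 'conc' fusion in PyTorch (with 'mult' indicated in a comment); the dimensions and the projection layer are assumptions, and the actual deepQuest models are implemented differently.

import torch
from torch import nn

class LastConcatFusion(nn.Module):
    def __init__(self, text_dim=100, visual_dim=4096, proj_dim=100):
        super().__init__()
        self.visual_proj = nn.Linear(visual_dim, proj_dim)
        self.output = nn.Sequential(nn.Linear(text_dim + proj_dim, 1), nn.Sigmoid())

    def forward(self, sentence_repr, visual_feats):
        v = self.visual_proj(visual_feats)
        fused = torch.cat([sentence_repr, v], dim=-1)   # 'conc' fusion
        # 'mult' fusion would instead be: fused = sentence_repr * v
        return self.output(fused)

model = LastConcatFusion()
score = model(torch.randn(4, 100), torch.randn(4, 4096))
print(score.shape)  # (4, 1)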
Figure 1 presents the high - level architecture of the document - level BiRNN model , with the various multimodality integration and fusion approaches .
We use the standard training , development and test datasets from the WMT'18 Task 4 track . For feature - based systems , we follow the built - in cross - validation in QuEst++ , and train a single model with the hyperparameters found by cross - validation . For neural - based models , we use early stopping with a patience of 10 to avoid over - fitting , and all reported figures are averaged over 5 runs corresponding to different seeds .
We follow the evaluation method of the WMT QE tasks : Pearson 's r correlation as the main metric ( Graham , 2015 ) , Mean - Absolute Error ( MAE ) and Root - Mean - Squared Error ( RMSE ) as secondary metrics . For statistical significance on Pearson 's r , we compute Williams test ( Williams , 1959 ) as suggested by Graham and Baldwin ( 2014 ) .
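The three metrics can be computed as in the short sketch below (the Williams test itself is omitted); the example scores are invented.

import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import mean_absolute_error, mean_squared_error

gold = np.array([0.167, 0.83, 0.55, 0.91])
pred = np.array([-0.002, 0.75, 0.60, 0.88])

r, p_value = pearsonr(gold, pred)            # main metric
mae = mean_absolute_error(gold, pred)        # secondary metric
rmse = np.sqrt(mean_squared_error(gold, pred))
print(r, mae, rmse)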
For all neural - based models , we experiment with all three integration strategies ( ' embed ' , ' annot ' and ' last ' ) and all three fusion strategies ( ' conc ' , ' mult ' and ' mult2 ' ) presented in Section 3.2 . This leads to 6 multimodal models for each of BiRNN and BERT - BiRNN . In Tables 2 and 4 , as well as in Figures 2 and 3 , we report the top three performing models . We refer the reader to the Appendix for the full set of results .
Sentence - level MQE
The first part of Table 2 presents the results for sentence - level multimodal QE with BiRNN . The best model is BiRNN+Vis - embed - mult2 , achieving a Pearson 's r of 0.535 , significantly outperforming the baseline ( p - value<0.01 ) . Visual features can , therefore , help to improve the performance of sentence - level neural - based QE systems significantly .
Figure 2 presents the result of Williams significance test for BiRNN model variants . It is a correlation matrix that can be read as follows : the value in cell ( i , j ) is the p - value of Williams test for the change in performance of the model at row i compared to the model at column j ( Graham , 2015 ) .
In Figure 2 , BERT , ann - mul and emb - mul2 denote the BERT - BiRNN , the BERT - BiRNN+Vis - annot - mult and the BiRNN+Vis - embed - mult2 models of Table 2 ; we refer the reader to the Appendix for the full set of results .
With the pre - trained token - level representations from BERT ( second half of Table 2 ) , the best model is BERT - BiRNN+Vis - annot - mult , achieving a Pearson 's r of 0.602 . This shows that even when using better word representations , the visual features help to get further ( albeit modest ) improvements . Table 3 shows an example of predicted scores at the sentence level for the baseline model ( BiRNN ) and for the best multimodal BiRNN model ( BiRNN+Vis - embed - mult2 ) . The multimodal model has predicted a score ( -0.002 ) closer to the gold MQM score ( 0.167 ) than the baseline model ( -0.248 ) . The French translation is poor ( cumulative - split is , for instance , not translated ) , as the low gold MQM score shows . However , the ( main ) word stopwatch is correctly translated as chronomètre in French . Since the associated picture indeed represents a stopwatch , one explanation for this improvement could be that the multimodal model may have rewarded this correct and important part of the translation .
Table 3 : Example of performance of sentence - level multimodal QE . MT output : " Le chronomètre A601X dispose calendrier cumulative - split . " ; gold MQM score : 0.167 ; BiRNN prediction : -0.248 ; BiRNN+Vis - embed - mult2 prediction : -0.002 . Compared to the baseline prediction ( BiRNN ) , the prediction from the best multimodal model ( BiRNN+Vis - embed - mult2 ) is closer to the gold MQM score . This could be because the word stopwatch is correctly translated as chronomètre in French , and the additional visual feature confirms it . This could lead to an increase in the predicted score to reward the correct part , despite the poor translation ( extracted from the Amazon Reviews Dataset of McAuley et al . , 2015 ) .
Document - level MQE
Table 4 presents the results for the document - level feature - based and BiRNN neural QE models . The first section shows the official models from the WMT'18 QE Task 4 report ( Specia et al . , 2018a ) . The neural - based approach SHEF - PT is the winning submission , outperforming another neural - based approach ( SHEF - mtl - bRNN ) . For our BiRNN models ( second section ) , BiRNN+Vis - embed - conc performs only slightly better than the monomodal baseline . For the feature - based models ( third section ) , on the other hand , the baseline monomodal QuEst++ is outperformed by various multimodal variants by a large margin , with the one with two principal components ( QuEst+Vis-2 ) performing the best . The more PCA components kept , the worse the results ( see the Appendix for the full set of results ) . Figure 3 shows the Williams significance test for document - level QuEst++ on the WMT'18 dataset .
As we can see , the QuEst+Vis-2 model outperforms the baseline with p - value = 0.002 . Thus , visual features significantly improve the performance of feature - based QE systems compared to their monomodal QE counterparts .
We introduced Multimodal Quality Estimation for Machine Translation , where an external modality ( visual information ) is incorporated into feature - based and neural - based QE approaches , at sentence and document levels . The use of visual features extracted from images has led to significant improvements in the results of state - of - the - art QE approaches , especially at sentence level .
The version of deepQuest for multimodal QE and the scripts to convert document - level into sentence - level data are available at https://github.com/sheffieldnlp/deepQuest .
A Appendix
PCA analysis . Figure 4 shows an almost linear relationship between the number of principal components and the explained variance of the PCA ( see Section 3.1 ) , i.e. the higher the number of principal components , the larger the explained variance . Therefore , we experimented with various numbers of components up to 15 ( 1 , 2 , 3 , 5 , 10 , and 15 ) on the development set to find the best settings for quality prediction .
Complete results . Tables 5 and 6 present the full set of results of our experiments on document - and sentence - level multimodal QE on our main test set , the WMT'18 test set . These are a super - set of the results presented in the main paper but include all combinations of multimodality integration and fusion strategies for sentence - level prediction , as well as different numbers of principal components kept for document - level QuEst prediction models .
Additional test set . Tables 7 and 8 present the full set of results of our experiments on the WMT'19 Task 2 test set on document - and sentence - level multimodal QE , respectively . This was the follow - up edition of the WMT'18 Task 4 , where the same training set is used , but a new test set is released .
For sentence - level QE , we observe on the one hand quite significant improvements , with a gain of almost 8 points in Pearson 's r over BiRNN , our monomodal baseline without pre - trained word embeddings ; the multimodal variants achieve better performance compared to the monomodal BiRNN baseline , with a peak when the visual features are fused with the word embedding representations by elementwise multiplication . On the other hand , we do not observe any gain in using visual features on the WMT'19 test set compared to our monomodal baseline with pre - trained word embeddings ( BERT - BiRNN ) . Here the BERT - BiRNN baseline model already performs very well . According to the task organisers , the mean MQM value on the WMT'19 test set is higher than on the WMT'18 test set , but actually closer to the training data ( Fonseca et al . , 2019 ) . We therefore hypothesise that the highly dimensional and contextualised word - level representations from BERT are already sufficient and do not benefit from the extra information provided by the visual features .
-DOCSTART- The SOFC - Exp Corpus and Neural Approaches to Information Extraction in the Materials Science Domain
This paper presents a new challenging information extraction task in the domain of materials science . We develop an annotation scheme for marking information on experiments related to solid oxide fuel cells in scientific publications , such as involved materials and measurement conditions . With this paper , we publish our annotation guidelines , as well as our SOFC - Exp corpus consisting of 45 open - access scholarly articles annotated by domain experts . A corpus and an inter - annotator agreement study demonstrate the complexity of the suggested named entity recognition and slot filling tasks as well as high annotation quality . We also present strong neural - network based models for a variety of tasks that can be addressed on the basis of our new data set . On all tasks , using BERT embeddings leads to large performance gains , but with increasing task complexity , adding a recurrent neural network on top seems beneficial . Our models will serve as competitive baselines in future work , and analysis of their performance highlights difficult cases when modeling the data and suggests promising research directions .
The design of new experiments in scientific domains heavily depends on domain knowledge as well as on previous studies and their findings . However , the amount of publications available is typically very large , making it hard or even impossible to keep track of all experiments conducted for a particular research question . Since scientific experiments are often time - consuming and expensive , effective knowledge base population methods for finding promising settings based on the published research would be of great value ( e.g. , Auer et al . , 2018 ; Manica et al . , 2019 ; Mrdjenovich et al . , 2020 ) . While such real - life information extraction tasks have received considerable attention in the biomedical domain ( e.g. , Cohen et al . , 2017 ; Demner - Fushman et al . , 2018 ) , there has been little work in other domains ( Nastase et al . , 2019 ) , including materials science ( with the notable exception of the work by Mysore et al . , 2017 , 2019 ) .
In this paper , we introduce a new information extraction use case from the materials science domain and propose a series of new challenging information extraction tasks . We target publications about solid oxide fuel cells ( SOFCs ) in which the interdependence between chosen materials , measurement conditions and performance is complex ( see Figure 1 ) . For making progress within natural language processing ( NLP ) , the genre - domain combination presents interesting challenges and characteristics , e.g. , domain - specific tokens such as material names and chemical formulas .
The task of finding experiment - specific information can be modeled as a retrieval task ( i.e. , finding relevant information in documents ) and at the same time as a semantic - role - labeling task ( i.e. , identifying the slot fillers ) . We identify three sub - tasks :
( 1 ) identifying sentences describing relevant experiments , ( 2 ) identifying mentions of materials , values , and devices , and ( 3 ) recognizing mentions of slots and their values related to these experiments . We propose and compare several machine learning methods for the different sub - tasks , including bidirectional long short - term memory ( BiLSTM ) networks and BERT - based models . In our results , BERT - based models show superior performance . However , with increasing complexity of the task , it is beneficial to combine the two approaches .
With the aim of fostering research on challenging information extraction tasks in the scientific domain , we target the domain of SOFC - related experiments as a starting point . Our findings based on this sample use case are transferable to similar experimental domains , which we illustrate by applying our best model configurations to a previously existing related corpus ( Mysore et al . , 2019 ) , achieving state - of - the - art results .
• We provide a new corpus of 45 materials - science publications in the research area of SOFCs , manually annotated by domain experts for information on experimental settings and results ( Section 4 ) . Our corpus is publicly available . Our inter - annotator agreement study provides evidence for high annotation quality ( Section 5 ) .
Information extraction for scientific publications . Recently , several studies addressed information extraction and knowledge base construction in the scientific domain ( Augenstein et al . , 2017;Luan et al . , 2018;Jiang et al . , 2019;Buscaldi et al . , 2019 ) . We also aim at knowledge base construction but target publications about materials science experiments , a domain understudied in NLP to date . Information extraction for materials science . The work closest to ours is that of Mysore et al . ( 2019 ) , who annotate synthesis procedures of materials with entity types . Related work also retrieves synthesis procedures and extracts recipes , though with a coarser - grained label set , focusing on different synthesis operation types . Other work creates a dataset for named entity recognition on abstracts of materials science publications ; in contrast to our work , that label set ( e.g. , Material , Application , Property ) is targeted to document indexing rather than information extraction . A notable difference to our work is that we perform full - text annotation while the aforementioned approaches annotate a pre - selected set of paragraphs . Mysore et al . ( 2017 ) apply the generative model of Kiddon et al . ( 2015 ) to induce action graphs for synthesis procedures of materials from text . In Section 7.1 , we implement a similar entity extraction system and also apply our algorithms to the dataset of Mysore et al . ( 2019 ) . Further work trains word2vec ( Mikolov et al . , 2013 ) embeddings on materials science publications and shows that they can be used for recommending materials for functional applications . Other works adapt the BERT model to clinical and biomedical domains ( Alsentzer et al . , 2019;Sun and Yang , 2019 ) , or generally to scientific text ( Beltagy et al . , 2019 ) .
Neural entity tagging and slot filling . The neural - network based models we use for entity tagging and slot filling bear similarity to state - of - the - art models for named entity recognition ( e.g. , Huang et al . , 2015;Lample et al . , 2016;Panchendrarajan and Amaresan , 2018;Lange et al . , 2019 ) . Other related work exists in the area of semantic role labeling ( e.g. , Roth and Lapata , 2015;Kshirsagar et al . , 2015;Hartmann et al . , 2017;Adel et al . , 2018;Swayamdipta et al . , 2018 ) .
In this section , we describe our annotation scheme and guidelines for marking information on SOFC - related experiments in scientific publications .
We treat the annotation task as identifying instances of a semantic frame ( Fillmore , 1976 ) that represents SOFC - related experiments . We include ( 1 ) cases that introduce novel content ; ( 2 ) descriptions of specific previous work ; ( 3 ) general knowledge that one could find in a textbook or survey ; and also ( 4 ) suggestions for future work .
The above two steps of recognizing relevant sentences and marking coarse - grained entity types are in general applicable to a wide range of experiment types within the materials science domain . We now define a set of slot types particular to experiments on SOFCs . During annotation , we mark these slot types as links between the experiment - evoking phrase and the respective slot filler ( entity mention ) , see Figure 1 . As a result , experiment frames are represented by graphs rooted in the node corresponding to the frame - evoking element .
Our annotation scheme comprises 16 slot types relevant for SOFC experiments . Here we explain a few of these types for illustration . A full list of these slot types can be found in Supplementary Material Table 11 ; detailed explanations are given in the annotation guidelines published along with our corpus . PowerDensity , Resistance , WorkingTemperature : These slots are generally filled by mentions of type VALUE , i.e. , a numerical value plus a unit . Our annotation guidelines give examples for relevant units and describe special cases . This enables any materials scientist , even if he / she is not an expert on SOFCs , to easily understand and apply our annotation guidelines .
SOFC - Exp Corpus . Our corpus consists of 45 open - access scientific publications about SOFCs and related research , annotated by domain experts .
Task definitions . Our rich graph - based annotation scheme allows for a number of information extraction tasks . In the scope of this paper , we address the following steps : ( 1 ) identifying sentences that describe SOFC - related experiments , ( 2 ) recognizing and typing relevant named entities , and ( 3 ) extracting slot fillers for the experiment frames .
We here present the results of our inter - annotator agreement study , which we perform in order to estimate the degree of reproducibility of our corpus and to put automatic modeling performance into perspective . Six documents ( 973 sentences ) have been annotated independently both by our primary annotator , a graduate student of materials science , and a second annotator , who holds a Ph.D. in physics and is active in the field of materials science . The label distribution in this subset is similar to the one of our overall corpus , with each annotator choosing EXPERIMENT about 11.8 % of the time . Identification of experiment - describing sentences . Agreement on our first task , judging whether a sentence contains relevant experimental information , is 0.75 in terms of Cohen 's κ ( Cohen , 1968 ) , indicating substantial agreement according to Landis and Koch ( 1977 ) . The observed agreement , corresponding to accuracy , is 94.9 % ; expected agreement amounts to 79.2 % . Table 2 shows precision , recall and F1 for the doubly - annotated subset , treating one annotator as the gold standard and the other one 's labels as predicted . Our primary annotator identifies 119 out of 973 sentences as experiment - describing , our secondary annotator 111 sentences , with an overlap of 90 sentences . These statistics are helpful to gain further intuition of how well a human can reproduce another annotator 's labels and can also be considered an upper bound for system performance .
Entity mention detection and type assignment .
As mentioned above , relevant entity mentions and their types are only annotated for sentences containing experiment information and neighboring sentences . Therefore , we here compute agreement on the detection of entity mentions and their type assignment on the subset of 90 sentences that both annotators considered as containing experimental information . We again look at precision and recall of the annotators versus each other , see Table 3 .
The high precision indicates that our secondary annotator marks essentially the same mentions as our primary annotator , but recall suggests a few missing cases . The difference in marking EXPERIMENT can be explained by the fact that the primary annotator sometimes marks several verbs per sentence as experiment - evoking elements , connecting them with same exp or exp variation , while the secondary annotator links the mentions of relevant slots to the first experiment - evoking element ( see also Supplementary Material Section B ) . Overall , the high agreement between domain expert annotators indicates high data quality . Identifying experiment slot fillers . We compute agreement on the task of identifying the slots of an experiment frame filled by the mentions in a sentence on the subset of sentences that both annotators marked as experiment - describing . Slot fillers are the dependents of the respective edges starting at the experiment - evoking element . Table 4 shows F1 scores for the most frequent ones among those categories . See Supplementary Material Section C for all slot types . Overall , our agreement study provides support for the high quality of our annotation scheme and validates the annotated dataset .
Experiment detection . The task of experiment detection can be modeled as a binary sentence classification problem . It can also be conceived as a retrieval task , selecting sentences as candidates for experiment frame extraction . We implement a bidirectional long short - term memory ( BiLSTM ) model with attention for the task of experiment sentence detection . Each input token is represented by a concatenation of several pretrained word embeddings , each of which is fine - tuned during training . We use the Google News word2vec embeddings ( Mikolov et al . , 2013 ) , domain - specific word2vec embeddings ( mat2vec ; see also Section 2 ) , subword embeddings based on byte - pair encoding ( bpe , Heinzerling and Strube , 2018 ) , BERT ( Devlin et al . , 2019 ) , and SciBERT ( Beltagy et al . , 2019 ) embeddings . For BERT and SciBERT , we take the embeddings of the first word piece as token representation . The embeddings are fed into a BiLSTM model followed by an attention layer that computes a vector for the whole sentence . Finally , a softmax layer decides whether the sentence contains an experiment .
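As an illustration , a minimal PyTorch - style sketch of such a BiLSTM - with - attention sentence classifier is given below ( layer names and sizes are illustrative assumptions , not the authors ' implementation ) :

import torch
import torch.nn as nn

class BiLSTMAttentionClassifier(nn.Module):
    """BiLSTM over stacked token embeddings, attention pooling, softmax output."""
    def __init__(self, emb_dim, hidden=500, attn_dim=100, num_classes=2):
        super().__init__()
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.attn = nn.Sequential(nn.Linear(2 * hidden, attn_dim), nn.Tanh(),
                                  nn.Linear(attn_dim, 1))
        self.out = nn.Linear(2 * hidden, num_classes)

    def forward(self, token_embs):            # (batch, seq_len, emb_dim), pre-concatenated embeddings
        states, _ = self.lstm(token_embs)     # (batch, seq_len, 2 * hidden)
        weights = torch.softmax(self.attn(states).squeeze(-1), dim=-1)   # (batch, seq_len)
        sent_vec = torch.bmm(weights.unsqueeze(1), states).squeeze(1)    # (batch, 2 * hidden)
        return self.out(sent_vec)             # logits: experiment-describing vs. not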
In addition , we fine - tune the original ( uncased ) BERT ( Devlin et al . , 2019 ) as well as SciBERT ( Beltagy et al . , 2019 ) models on our dataset . SciBERT was trained on a large corpus of scientific text . We use the implementation of the BERT sentence classifier by Wolf et al . ( 2019 ) that uses the CLS token of BERT as input to the classification layer ( https://github.com/huggingface/transformers ) . Finally , we compare the neural network models with traditional classification models , namely a support vector machine ( SVM ) and a logistic regression classifier , implemented with sklearn ( https://scikit-learn.org ) . For both models , we use the following set of input features : bag - of - words vectors indicating which 1 - to 4 - grams and part - of - speech tags occur in the sentence . Entity mention extraction . For entity and concept extraction , we use a sequence - tagging approach similar to Huang et al . ( 2015 ) and Lample et al . ( 2016 ) , namely a BiLSTM model . We use the same input representation ( stacked embeddings ) as above , which is fed into a BiLSTM . The subsequent conditional random field ( CRF , Lafferty et al . , 2001 ) output layer extracts the most probable label sequence . To cope with multi - token entities , we convert the labels into BIO format .
We also fine - tune the original BERT and SciBERT sequence tagging models on this task . Since we use BIO labels , we extend them with a CRF output layer , enabling them to correctly label multi - token mentions and to learn transition scores between labels . As a non - neural baseline , we train a CRF model using the token , its lemma , part - of - speech tag and mat2vec embedding as features .
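For illustration , the BIO conversion for multi - token entities can be sketched as follows ( the helper name and the example spans are hypothetical ) :

def spans_to_bio(tokens, spans):
    """Convert (start, end, label) token spans (end exclusive) into BIO tags."""
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = "B-" + label
        for i in range(start + 1, end):
            tags[i] = "I-" + label
    return tags

# Example: spans_to_bio(["at", "800", "°C"], [(1, 3, "VALUE")])
# returns ["O", "B-VALUE", "I-VALUE"]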
Slot filling . As described in Section 4 , we approach the slot filler extraction task as fine - grained entity - typing - in - context , assuming that each sentence represents a single experiment frame . We use the same sequence tagging architectures as above for tagging the tokens of each experiment - describing sentence with the set of slot types ( see Table 11 ) . Future work may contrast this sequence tagging baseline with graph - induction based frame extraction .
Hyperparameters and training . The BiLSTM models are trained with the Adam optimizer ( Kingma and Ba , 2015 ) with a learning rate of 1e-3 . For fine - tuning the original BERT models , we follow the configuration published by Wolf et al . ( 2019 ) and use AdamW ( Loshchilov and Hutter , 2019 ) as optimizer and a learning rate of 4e-7 for sentence classification and 1e-5 for sequence tagging . When adding BERT tokens to the BiLSTM , we also use the AdamW optimizer for the whole model and learning rates of 4e-7 or 1e-5 for the BERT part and 1e-3 for the remainder . For regularization , we employ early stopping on the development set . We use a stacked BiLSTM with two hidden layers and 500 hidden units for all tasks with the exception of the experiment sentence detection task , where we found one BiLSTM layer to work best . The attention layer of the sentence detection model has a hidden size of 100 .
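The two - learning - rate setup described above can be sketched with AdamW parameter groups as follows ( the module names are placeholders , not the authors ' code ) :

import torch.nn as nn
from torch.optim import AdamW

# stand-in modules: a "bert" part and the BiLSTM/CRF remainder
model = nn.ModuleDict({"bert": nn.Linear(768, 768), "tagger": nn.Linear(768, 2)})
bert_params = [p for n, p in model.named_parameters() if n.startswith("bert")]
other_params = [p for n, p in model.named_parameters() if not n.startswith("bert")]
optimizer = AdamW([
    {"params": bert_params, "lr": 1e-5},   # 4e-7 or 1e-5 for the BERT part
    {"params": other_params, "lr": 1e-3},  # 1e-3 for the remainder
])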
Experiment sentence detection . Table 5 shows our results on the detection of experiment - describing sentences . The neural models with byte - pair encoding embeddings or BERT clearly outperform the SVM and logistic regression models . Within the neural models , BERT and SciBERT add the most value , both when using their embeddings as another input to the BiLSTM and when fine - tuning the original BERT models . Note that even the general - domain BERT is strong enough to cope with non - standard domains . Nevertheless , models based on SciBERT outperform BERT - based models , indicating that in - domain information is indeed beneficial . For performance reasons , we use BERT - base in our experiments , but for the sake of completeness , we also run BERT - large for the task of detecting experiment sentences . Because it did not outperform BERT - base in our cross - validation based development setting , we did not further experiment with BERT - large . However , we found that it resulted in the best F1 - score achieved on our test set . In general , SciBERT - based models provide very good performance and seem most robust across dev and test sets . Overall , achieving F1 - scores around 67.0 - 68.6 , such a retrieval model may already be useful in production . However , there certainly is room for improvement . Entity mention extraction . Table 6 provides our results on entity mention detection and typing .
Models are trained and results are reported on the subset of sentences marked as experiment - describing in the gold standard , amounting to 4,590 entity mentions in total . The CRF baseline achieves comparable or better results than the BiLSTM with word2vec and/or mat2vec embeddings . However , adding subword - based embeddings ( bpe and/or BERT ) significantly increases the performance of the BiLSTM , indicating that there are many rare words . Again , the best results are obtained when using BERT or SciBERT embeddings or when using the original SciBERT model . It is relatively easy for all model variants to recognize VALUE as these mentions usually consist of a number and unit which the model can easily memorize . Recognizing the types MATERIAL and DEVICE , in contrast , is harder and may profit from using gazetteer - based extensions .
Experiment slot filling . Table 7 shows the macro - average F1 scores for our different models on the slot identification task . As for entity typing , we train and evaluate our model on the subset of sentences marked as experiment - describing , which contain 4,263 slot instances . Again , the CRF baseline outperforms the BiLSTM when using only mat2vec and/or word2vec embeddings . The addition of BERT or SciBERT embeddings improves performance . However , on this task , the BiLSTM model with ( Sci)BERT embeddings outperforms the fine - tuned original ( Sci)BERT model . Compared to the other two tasks , this task requires more complex reasoning and has a larger number of possible output classes . We assume that in such a setting , adding more abstraction power to the model ( in the form of a BiLSTM ) leads to better results . For a more detailed analysis , Table 8 shows the slot - wise results for the non - neural CRF baseline and the model that performs best on the development set : BiLSTM with SciBERT embeddings . As in the case of entity mention detection , the models do well for the categories that consist of numeric mentions plus particular units . In general , model performance is also tied to the frequency of the slot types in the dataset . Recognizing the role a material plays in an experiment ( e.g. , AnodeMaterial vs. CathodeMaterial ) remains challenging , possibly requiring background domain knowledge . This type of information is often not stated explicitly in the sentence , but introduced earlier in the discourse and would hence require document - level modeling .
Entity Extraction Evaluation on the Synthesis Procedures Dataset
As described in Section 2 , the data set curated by Mysore et al . ( 2019 ) contains 230 synthesis procedures annotated with entity type information . We apply our models to this entity extraction task in order to estimate the degree of transferability of our findings to similar data sets . To the best of our knowledge , there have not yet been any publications on the automatic modeling of this data set . We hence compare to the previous work of Mysore et al . ( 2017 ) , who perform action graph induction on a similar data set . Our implementation of BiLSTM - CRF mat2vec+word2vec roughly corresponds to their BiLSTM - CRF system .
Table 9 shows the performance of our models when trained and evaluated on the synthesis procedures dataset . Detailed scores by entity type can be found in the Supplementary Material . We chose to use the data split suggested by the authors for the NER task , using 200 documents for training , and 15 documents each for the dev and test sets . Among the non - BERT - based systems , the BiLSTM variant using both mat2vec and word2vec performs best , indicating that the two pre - trained embeddings contain complementary information with regard to this task . The best performance is reached by the BiLSTM model including word2vec , mat2vec , bpe and SciBERT embeddings , with 92.2 micro - average F1 providing a strong baseline for future work .
We have presented a new dataset for information extraction in the materials science domain consisting of 45 open - access scientific articles related to solid oxide fuel cells . Our detailed corpus and inter - annotator agreement studies highlight the complexity of the task and verify the high annotation quality . Based on the annotated structures , we suggest three information extraction tasks : the detection of experiment - describing sentences , entity mention recognition and typing , and experiment slot filling . We have presented various strong baselines for them , generally finding that BERT - based models outperform other model variants . While some categories remain challenging , overall , our models show solid performance and thus prove that this type of data modeling is feasible and can lead to systems that are applicable in production settings . Along with this paper , we make the annotation guidelines and the annotated data freely available .
Outlook . In Section 7.1 , we have shown that our findings generalize well by applying model architectures developed on our corpus to another dataset . A natural next step is to combine the datasets in a multi - task setting to investigate to what extent models can profit from combining the information annotated in the respective datasets . Further research will investigate the joint modeling of entity extraction , typing and experiment frame recognition . In addition , there are also further natural language processing tasks that can be researched using our dataset . They include the detection of events and sub - events when regarding the experiment descriptions as events , and a more linguistically motivated evaluation of the frame - semantic approach to experiment descriptions in text , e.g. , moving away from the one - experiment - per - sentence and one - sentence - per - experiment assumptions and modeling the graph - based structures as annotated .
Table 12 reports full statistics for the task of identifying experiment - describing sentences , including precision and recall in the dev setting .
-DOCSTART- Position encoding ( PE ) , an essential part of self - attention networks ( SANs ) , is used to preserve the word order information for natural language processing tasks , generating fixed position indices for input sequences . However , in cross - lingual scenarios , e.g. , machine translation , the PEs of source and target sentences are modeled independently . Due to word order divergences in different languages , modeling the cross - lingual positional relationships might help SANs tackle this problem . In this paper , we augment SANs with cross - lingual position representations to model the bilingually aware latent structure for the input sentence . Specifically , we utilize bracketing transduction grammar ( BTG)-based reordering information to encourage SANs to learn bilingual diagonal alignments . Experimental results on WMT'14 English⇒German , WAT'17 Japanese⇒English , and WMT'17 Chinese⇔English translation tasks demonstrate that our approach significantly and consistently improves translation quality over strong baselines . Extensive analyses confirm that the performance gains come from the cross - lingual information .
Although self - attention networks ( SANs ) ( Lin et al . , 2017 ) have achieved state - of - the - art performance on several natural language processing ( NLP ) tasks ( Vaswani et al . , 2017;Devlin et al . , 2019;Radford et al . , 2018 ) , they possess the innate disadvantage of sequential modeling due to the lack of positional information . Therefore , absolute position encoding ( APE ) ( Vaswani et al . , 2017 ) and relative position encoding ( RPE ) ( Shaw et al . , 2018 ) were introduced to better capture the sequential dependencies . However , either absolute or relative PE is language - independent and its embedding remains fixed . This inhibits the capacity of SANs when modelling multiple languages , which have diverse word orders and structures ( Gell - Mann and Ruhlen , 2011 ) . Recent work has shown that modeling cross - lingual information ( e.g. , alignment or reordering ) at the encoder or attention level improves translation performance for different language pairs ( Cohn et al . , 2016;Du and Way , 2017;Zhao et al . , 2018;Kawara et al . , 2018 ) . Inspired by their work , we propose to augment SANs with cross - lingual representations , by encoding reordering indices at the embedding level . Taking the English⇒Chinese translation task as an example , we first reorder the English sentence by deriving a latent bracketing transduction grammar ( BTG ) tree ( Wu , 1997 ) ( Fig . 1a ) . Similar to absolute position , the reordering information can be represented as cross - lingual position ( Fig . 1b ) . In addition , we propose two strategies to incorporate cross - lingual position encoding into SANs . We conducted experiments on three commonly - cited datasets of machine translation . Results show that exploiting cross - lingual PE consistently improves translation quality . Further analysis reveals that our method improves the alignment quality ( § Sec . 4.3 ) and helps the context - free Transformer ( Tang et al . , 2019 ) ( § Sec . 4.4 ) . Furthermore , contrastive evaluation demonstrates that NMT models benefit from the cross - lingual information rather than a denoising ability ( § Sec . 4.5 ) .
Position Encoding To tackle the position - unaware problem , absolute position information is injected into the SANs :
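Assuming the standard sinusoidal formulation of Vaswani et al . ( 2017 ) , which the paper later refers to as Eq . ( 1 ) , the injected absolute position encoding takes the form :

$\mathrm{PE}^{abs}_{(pos , 2i)} = \sin\left( pos / 10000^{2i/d} \right) , \qquad \mathrm{PE}^{abs}_{(pos , 2i+1)} = \cos\left( pos / 10000^{2i/d} \right) ,$

and the position - aware input is obtained by adding $\mathrm{PE}^{abs}$ to the token embeddings .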
Self - Attention The SANs compute the attention of each pair of elements in parallel . They first convert the input into three matrices Q , K , V , representing queries , keys , and values , respectively :
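Assuming the standard formulation , the projections and the scaled dot - product attention can be written as :

$Q = X W^{Q} , \quad K = X W^{K} , \quad V = X W^{V} , \qquad \mathrm{ATT}(Q , K , V) = \mathrm{softmax}\left( \frac{Q K^{\top}}{\sqrt{d_k}} \right) V .$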
SANs can be implemented with a multi - head attention mechanism , which requires extra splitting and concatenation operations . Specifically , W Q , W K , W V and Q , K , V in Eq . ( 3 ) are split into H sub - matrices , yielding H heads . For the h - th head , the output is computed by :
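Under the usual multi - head formulation , which we assume here , the per - head output and its combination are :

$O_h = \mathrm{ATT}(Q_h , K_h , V_h) , \qquad O = [ O_1 ; \dots ; O_H ]\, W^{O} ,$

where $[ \cdot ; \cdot ]$ denotes concatenation and $W^{O}$ is an output projection .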
First , we built a BTG - based reordering model ( Neubig et al . , 2012 ) to generate a reordered source sentence according to the word order of its corresponding target sentence . Second , we obtained the reordered word indices pos XL that correspond with the input sentence X. To output the cross - lingual position matrix PE XL , we inherit the sinusoidal function in Eq . ( 1 ) . Formally , the process is :
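With the notation assumed here , the process can be summarised as :

$pos^{XL} = \mathrm{BTG\text{-}Reorder}(X) , \qquad \mathrm{PE}^{XL} = \mathrm{Sinusoid}\left( pos^{XL} \right) ,$

where Sinusoid denotes the function of Eq . ( 1 ) applied to the reordered indices .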
As shown in Fig . 2 , we propose two strategies to integrate the cross - lingual position encoding ( XL PE ) into SANs : inputting - level XL ( InXL ) SANs and head - level XL ( HeadXL ) SANs .
Inputting - level XL SANs As illustrated in Fig . 2a , we employ a non - linear function TANH(• ) to fuse PE abs and PE XL :
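One simple fusion consistent with this description , assuming element - wise addition inside the non - linearity ( the exact combination is specified by Eq . ( 8 ) in the original paper ) , is :

$\mathrm{PE}^{In\text{-}XL} = \tanh\left( \mathrm{PE}^{abs} + \mathrm{PE}^{XL} \right) .$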
Similarly , we use Eq . ( 3 ) ∼ ( 5 ) to calculate the multiple heads of SANs .
Head - level XL SANs Instead of projecting XL PE to all attention heads , we feed it to only a subset of them , such that some heads contain XL PE and others contain APE , namely HeadXL . As shown in Fig . 2b , we first add APE and XL PE to X , respectively :
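In symbols , and assuming straightforward addition as described , this yields two position - aware views of the input :

$X^{abs} = X + \mathrm{PE}^{abs} , \qquad X^{XL} = X + \mathrm{PE}^{XL} .$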
We denote the number of XL PE - equipped heads as τ ∈ { 0 , . . . , H } . In the attention calculation , the τ XL PE - equipped heads attend over the XL PE - enhanced representation , while the remaining H − τ heads use the APE - enhanced one .
In particular , τ = 0 refers to the original Transformer ( Vaswani et al . , 2017 ) and τ = H means that XL PE will propagate over all attention heads .
We conduct experiments on word order - diverse language pairs : WMT'14 English⇒German ( En - De ) , WAT'17 Japanese⇒English ( Ja - En ) , and WMT'17 Chinese⇔English ( Zh - En & En - Zh ) .
For English⇒German , the training set consists of 4.5 million sentence pairs and newstest2013 & 2014 are used as the dev . and test sets , respectively . BPE with 32 K merge operations is used to handle low - frequency words . For Japanese⇒English , we follow Morishita et al . ( 2017 ) to use the first two sections as training data , which consist of 2.0 million sentence pairs . The dev . and test sets contain 1790 and 1812 sentences . For Chinese⇔English , we follow Hassan et al . ( 2018 ) in preparing the training sentence pairs . We develop on devtest2017 and test on newstest2017 . We use SacreBLEU ( Post , 2018 ) as the evaluation metric with a statistical significance test ( Collins et al . , 2005 ) . We evaluate the proposed XL PE strategies on the Transformer . The baseline systems include Relative PE ( Shaw et al . , 2018 ) and directional SAN ( DiSAN , Shen et al . 2018 ) . We implement them on top of OpenNMT ( Klein et al . , 2017 ) . In addition , we report the results of previous studies ( Hao et al . , 2019;Chen et al . , 2019b , a;Du and Way , 2017;Hassan et al . , 2018 ) .
The reordered source sentences are generated by a BTG - based preordering model ( Neubig et al . , 2012 ) trained with the above sub - word level parallel corpus . At the training phase , we first obtain word alignments from the parallel data using GIZA++ or FastAlign , and the training process then finds the optimal BTG tree for each source sentence that is consistent with the order of the target sentence , based on the word alignments and the parallel data . At the decoding phase , we only provide source sentences as input and the model outputs reordering indices , which are fed into the NMT model . Thus , bilingual alignment information is only used to preprocess the training data , and is not necessary at decoding time .
For fair comparison , we keep the Transformer decoder unchanged and validate different position representation strategies on the encoder . We conduct all experiments on the TRANSFORMER - BIG with four V100 GPUs .
Effect of τ in HeadXL SANs
Fig . 3 reports the results of different τ for HeadXL SANs . As the number of XL PE - informed heads increases , the best BLEU is achieved when # heads = 4 , which is therefore left as the default setting for HeadXL . Then , the BLEU score gradually decreases as the number of APE - informed heads decreases ( τ ↑ ) , indicating that sequential position embedding is still essential for SANs .
Tab . 1 shows the results on En - De : inputting - level cross - lingual PE ( + InXL PE ) and head - level cross - lingual PE ( + HeadXL PE ) outperform Transformer - BIG by 0.30 and 0.36 BLEU points , and combining these two strategies ( i.e. , replacing PE XL in Eq . ( 9 ) with PE In - XL in Eq . ( 8 ) ) achieves a 0.69 BLEU point increase . For Ja - En , Zh - En , and En - Zh ( Tab . 2 ) , we observe a similar phenomenon , demonstrating that XL PE on SANs does improve the translation performance for several language pairs . It is worth noting that our approach introduces nearly no additional parameters ( +0.01 M over 282.55 M ) .
Our proposed XL PE intuitively encourages SANs to learn bilingual diagonal alignments , and thus has the potential to induce better attention matrices . We explore this hypothesis on the widely used Gold Alignment dataset and follow Tang et al . ( 2019 ) to perform the alignment . The only difference is that we average the attention matrices across all heads from the penultimate layer ( Garg et al . , 2019 ) . The alignment error rate ( AER , Och and Ney 2003 ) , precision ( P ) and recall ( R ) are reported as the evaluation metrics . Tab . 3 summarizes the results . We can see : 1 ) XL PE allows SANs to learn better attention matrices , thereby improving alignment performance ( 27.4 / 26.9 vs. 29.7 ) ; and 2 ) combining the two strategies delivers consistent improvements ( 24.7 vs. 29.7 ) .
Augmenting SANs with position representation . SANs ignore the position of each token due to their position - unaware " bag - of - words " assumption . The most straightforward strategy is adding the position representations as part of the token representations ( Vaswani et al . , 2017;Shaw et al . , 2018 ) . Different from these approaches , we model cross - lingual position information between languages .
Modeling cross - lingual divergence . There have been many works modeling cross - lingual divergence ( e.g. , reordering ) in statistical machine translation ( Nagata et al . , 2006;Durrani et al . , 2011 , 2013 ) . However , it is difficult to migrate them to neural machine translation . Kawara et al . ( 2018 ) pre - reordered the source sentences with a recursive neural network model . Chen et al . ( 2019a ) learned a reordering embedding by considering the relationship between the position embedding of a word and the SAN - calculated sentence representation . Prior analyses also showed that SANs in machine translation could learn word order mainly due to the PE , indicating that modeling cross - lingual information at the position representation level may be informative . Thus , we propose a novel cross - lingual PE method to improve SANs .
In this paper , we presented a novel cross - lingual position encoding to augment SANs by considering cross - lingual information ( i.e. , reordering indices ) for the input sentence . We designed two strategies to integrate it into SANs . Experiments indicated that the proposed strategies consistently improve the translation performance . In the future , we plan to extend the cross - lingual position encoding to non - autoregressive MT ( Gu et al . , 2018 ) and unsupervised NMT ( Lample et al . , 2018 ) .
-DOCSTART- We harness neural language and commonsense models to study how cognitive processes of recollection and imagination are engaged in storytelling . We rely on two key aspects of stories : narrative flow ( how the story reads ) and semantic vs. episodic knowledge ( the types of events in the story ) . We propose as a measure of narrative flow the likelihood of sentences under generative language models conditioned on varying amounts of history . Then , we quantify semantic knowledge by measuring the frequency of commonsense events ( from the ATOMIC knowledge graph ; Sap et al . , 2019 ) , and episodic knowledge by counting realis events ( Sims et al . , 2019 ) , both shown in Figure 1 .
We introduce HIPPOCORPUS , a dataset of 6,854 diary - like short stories about salient life events , to examine the cognitive processes of remembering and imagining . Using a crowdsourcing pipeline , we collect pairs of recalled and imagined stories written about the same topic . By design , authors of recalled stories rely on their episodic memory to tell their story .
We demonstrate that our measures can uncover differences in imagined and recalled stories in HIPPOCORPUS . Imagined stories contain more commonsense events and elaborations , whereas recalled stories are more dense in concrete events . Additionally , imagined stories flow substantially more linearly than recalled stories . Our findings provide evidence that surface language reflects the differences in cognitive processes used in imagining and remembering .
We construct HIPPOCORPUS , containing 6,854 stories ( Table 1 ) , to enable the study of imagined and recalled stories , as most prior corpora are either limited in size or topic ( e.g. , Greenberg et al . , 1996;Ott et al . , 2011 ) . See Appendix A for additional details ( e.g. , worker demographics ; § A.2 ) .
Inspired by recent work on discourse modeling ( Kang et al . , 2019;Nadeem et al . , 2019 ) , we use language models to assess the narrative linearity of a story by measuring how sentences relate to their context in the story . We compare the likelihoods of sentences under two generative models ( Figure 2 ) . The bag model makes the assumption that every sentence is drawn independently from the main theme of the story ( represented by E ) . On the other hand , the chain model assumes that a story begins with a theme , and sentences linearly follow each other . ∆ l is computed as the difference in negative log - likelihoods between the bag and chain models :
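Concretely , writing $s_i$ for the $i$-th of $n$ sentences and $E$ for the story theme , the description corresponds to ( averaging over sentences is an assumption here ) :

$\Delta l = \frac{1}{n} \sum_{i=1}^{n} \Big( -\log p_{\mathrm{bag}}( s_i \mid E ) + \log p_{\mathrm{chain}}( s_i \mid E , s_{<i} ) \Big) .$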
We train a realis event tagger ( using BERT - base ; Devlin et al . , 2019 ) on the annotated literary events corpus by Sims et al . ( 2019 ) , which slightly outperforms the original authors ' models . We provide further training details in Appendix B.1 .
Given the social focus of our stories , we use the social commonsense knowledge graph ATOMIC ( Sap et al . , 2019 ) , which contains social and inferential knowledge about the causes ( e.g. , " X wants to start a family " ) and effects ( e.g. , " X throws a party " , " X feels loved " ) of everyday situations like " PersonX decides to get married " . For each story , we first match possible ATOMIC events to sentences by selecting events that share noun chunks and verb phrases with sentences ( e.g. , " getting married " matches " PersonX gets married " ; Figure 1 ) . We then search the matched sentences ' surrounding sentences for commonsense inferences ( e.g. , " be very happy " matches " happy " ; Figure 1 ) . We describe this algorithm in further detail in Appendix B.2 . In our analyses , the measure quantifies the number of story sentences with commonsense tuple matches in the two preceding and following sentences .
To supplement our analyses , we compute several coarse - grained lexical counts for each story in HIPPOCORPUS . Such approaches have been used in prior efforts to investigate author mental states , temporal orientation , or counterfactual thinking in language ( Tausczik and Pennebaker , 2010;Schwartz et al . , 2015;Son et al . , 2017 ) .
We count psychologically relevant word categories using the Linguistic Inquiry Word Count lexicon ( LIWC ; Pennebaker et al . , 2015 ) , focusing only on the cognitive processes , positive emotion , negative emotion , and I - word categories , as well as the ANALYTIC and TONE summary variables ( see liwc.wpengine.com/interpretingliwc-output/ for more information on LIWC variables ) . Additionally , we measure the average concreteness level of words in stories using the lexicon by Brysbaert et al . ( 2014 ) .
We summarize the differences between imagined and recalled stories in HIPPOCORPUS in Table 2 . For our narrative flow and lexicon - based analyses , we perform paired t - tests . For realis and commonsense event measures , we perform linear regressions controlling for story length . We Holm - correct for multiple comparisons in all our analyses ( Holm , 1979 ) .
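A sketch of this analysis pipeline , with toy stand - in data and hypothetical column names , could look as follows :

import pandas as pd
from scipy.stats import ttest_rel
import statsmodels.formula.api as smf
from statsmodels.stats.multitest import multipletests

# toy stand-ins for per-story measures (the real analyses use HIPPOCORPUS)
df_pairs = pd.DataFrame({"delta_l_imagined": [0.9, 1.1, 1.3, 0.8],
                         "delta_l_recalled": [0.5, 0.7, 0.9, 0.6]})
df_stories = pd.DataFrame({"realis_count": [3, 5, 2, 6, 4, 7],
                           "is_imagined": [1, 0, 1, 0, 1, 0],
                           "story_length": [120, 150, 90, 200, 110, 180]})

t, p_flow = ttest_rel(df_pairs["delta_l_imagined"], df_pairs["delta_l_recalled"])   # paired t-test
fit = smf.ols("realis_count ~ is_imagined + story_length", data=df_stories).fit()   # control for length
p_realis = fit.pvalues["is_imagined"]
reject, p_corrected, _, _ = multipletests([p_flow, p_realis], method="holm")        # Holm correction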
First , we compare the effects of recency of the event described ( TIMESINCEEVENT : a continuous variable representing the log time since the event ) . Then , we contrast recalled stories to their retold counterparts in pairwise comparisons . Finally , we measure the effect of how frequently the experienced event is thought or talked about ( FREQUENCYOFRECALL : a continuous variable ranging from very rarely to very frequently ) . As in § 4 , we Holm - correct for multiple comparisons .
Frequency of recall . We find that the more an event is thought or talked about ( i.e. , higher FRE - QUENCYOFRECALL ) , the more linearly its story flows ( ∆ l ; |β| = 0.07 , p < 0.001 ) , and the fewer realis events ( |β| = 0.09 , p < 0.001 ) it contains .
Furthermore , using lexicon - based measures , we find that stories with high FREQUENCYOFRECALL tend to contain more self references ( I - words ; Pearson 's |r| = 0.07 , p < 0.001 ) . Conversely , stories that are less frequently recalled are more logical or hierarchical ( LIWC 's ANALYTIC ; Pearson 's |r| = 0.09 , p < 0.001 ) and more concrete ( Pearson 's |r| = 0.05 , p = 0.03 ) .
To investigate the use of NLP tools for studying the cognitive traces of recollection versus imagination in stories , we collect and release HIPPOCORPUS , a dataset of imagined and recalled stories . We introduce measures to characterize narrative flow and the influence of semantic vs. episodic knowledge in stories . We show that imagined stories have a more linear flow and contain more commonsense knowledge , whereas recalled stories are less connected and contain more specific concrete events . Additionally , we show that our measures can uncover the effect in language of narrativization of memories over time . We hope these findings bring attention to the feasibility of employing statistical natural language processing machinery as tools for exploring human cognition .
Figure 4 : We extract phrases from the main themes of recalled ( left ) and imagined ( right ) stories , using RAKE ( Rose et al . , 2010 ) ; size of words corresponds to frequency in corpus , and color is only for readability .
To detect realis events in our stories , we train a tagger ( using BERT - base ; Devlin et al . , 2019 ) on the annotated corpus by Sims et al . ( 2019 ) . This corpus contains 8k realis events annotated by experts in sentences drawn from 100 English books . With development and test F1 scores of 83.7 % and 75.8 % , respectively , our event tagger slightly outperforms the best performing model in Sims et al . ( 2019 ) , which reached 73.9 % F1 . In our analyses , we use our tagger to detect the number of realis event mentions .
We design a commonsense extraction tool that aligns sentences in stories with commonsense tuples , using a heuristic matching algorithm . Given a story , we match possible ATOMIC events to sentences by selecting events that share noun chunks and verb phrases with sentences . For every sentence s i that matches an event E in ATOMIC , we check surrounding sentences for mentions of commonsense inferences ( using the same noun and verb phrase matching strategy ) ; specifically , we check the n c preceding sentences for matches of causes of E , and the n e following sentences for event E 's effects .
To measure the prevalence of semantic memory in a story , we count the number of sentences that matched ATOMIC knowledge tuples in their surrounding context . We use a context window of size n c = n e = 2 to match inferences , and use the spaCy pipeline ( Honnibal and Montani , 2017 ) to extract noun and verb phrases .
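A simplified sketch of this matching heuristic is given below ( the phrase extraction and the ATOMIC interface are assumptions , and causes vs. effects are not distinguished here ) :

import spacy

nlp = spacy.load("en_core_web_sm")

def phrases(text):
    """Lemmatized noun-chunk heads and verbs, used as a crude phrase signature."""
    doc = nlp(text)
    chunks = {c.root.lemma_.lower() for c in doc.noun_chunks}
    verbs = {t.lemma_.lower() for t in doc if t.pos_ == "VERB"}
    return chunks | verbs

def count_commonsense_sentences(sentences, atomic_tuples, window=2):
    """atomic_tuples: list of (event_text, [inference_texts]) pairs."""
    count = 0
    for i, sent in enumerate(sentences):
        sent_ph = phrases(sent)
        context = sentences[max(0, i - window):i] + sentences[i + 1:i + 1 + window]
        context_ph = set().union(*(phrases(c) for c in context)) if context else set()
        for event, inferences in atomic_tuples:
            if phrases(event) & sent_ph and any(phrases(inf) & context_ph for inf in inferences):
                count += 1
                break
    return count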
C.1 Linearity with Varying Context Size
As shown in Figure 5 , we compare the negative log - likelihood of sentences when conditioned on varying history sizes ( using the story summary as context E ) . As expected , conditioning on longer histories increases the predictability of a sentence . However , this effect is significantly larger for imagined stories , which suggests that imagined stories flow more linearly than recalled stories .
-DOCSTART- Multilingual BERT Post - Pretraining Alignment
We propose a simple method to align multilingual contextual embeddings as a post - pretraining step for improved cross - lingual transferability of the pretrained language models . Using parallel data , our method aligns embeddings on the word level through the recently proposed Translation Language Modeling objective as well as on the sentence level via contrastive learning and random input shuffling . We also perform sentence - level code - switching with English when finetuning on downstream tasks . On XNLI , our best model ( initialized from mBERT ) improves over mBERT by 4.7 % in the zero - shot setting and achieves a comparable result to XLM for translate - train while using less than 18 % of the same parallel data and 31 % fewer model parameters . On MLQA , our model outperforms XLM - R Base , which has 57 % more parameters than ours .
Introduction
Building on the success of monolingual pretrained language models ( LM ) such as BERT ( Devlin et al . , 2019 ) and RoBERTa ( Liu et al . , 2019 ) , their multilingual counterparts mBERT ( Devlin et al . , 2019 ) and XLM - R ( Conneau et al . , 2020 ) are trained using the same objectives - Masked Language Modeling ( MLM ) and in the case of mBERT , Next Sentence Prediction ( NSP ) . MLM is applied to monolingual text that covers over 100 languages . Despite the absence of parallel data and explicit alignment signals , these models transfer surprisingly well from high resource languages , such as English , to other languages . On the Natural Language Inference ( NLI ) task XNLI ( Conneau et al . , 2018 ) , a text classification model trained on English training data can be directly applied to the other 14 languages and achieve respectable performance . Having a single model that can serve over 100 languages also has important business applications .
Recent work improves upon these pretrained models by adding cross - lingual tasks leveraging parallel data that always involve English . Conneau and Lample ( 2019 ) pretrain a new Transformer - based ( Vaswani et al . , 2017 ) model from scratch with an MLM objective on monolingual data , and a Translation Language Modeling ( TLM ) objective on parallel data . Cao et al . ( 2020 ) align mBERT embeddings in a post - hoc manner : They first apply a statistical toolkit , FastAlign ( Dyer et al . , 2013 ) , to create word alignments on parallel sentences . Then , mBERT is tuned via minimizing the mean squared error between the embeddings of English words and those of the corresponding words in other languages . Such a post - hoc approach suffers from the limitations of word - alignment toolkits : ( 1 ) the noise from FastAlign can lead to error propagation to the rest of the pipeline ; ( 2 ) FastAlign mainly creates the alignments with word - level translation and usually overlooks the contextual semantic compositions . As a result , the tuned mBERT is biased to shallow cross - lingual correspondence . Importantly , both approaches only involve word - level alignment tasks .
Method
This section introduces our proposed Post - Pretraining Alignment ( PPA ) method . We first describe the MoCo contrastive learning framework and how we use it for sentence - level alignment . Next , we describe the finer - grained word - level alignment with TLM . Finally , when training data in the target language is available , we incorporate sentence - level code - switching as a form of both alignment and data augmentation to complement PPA . Figure 1 shows our overall model structure .
Background : Contrastive Learning Instance discrimination - based contrastive learning aims to bring two views of the same source image closer to each other in the representation space while encouraging views of different source images to be dissimilar through a contrastive loss . Recent advances in this area , such as SimCLR ( Chen et al . , 2020 ) and MoCo ( He et al . , 2020 ) have bridged the gap in performance between self - supervised representation learning and fully - supervised methods on the ImageNet ( Deng et al . , 2009 ) dataset . As a key feature for both methods , a large number of negative examples per instance are necessary for the models to learn such good representations . SimCLR uses in - batch negative example sampling , thus requiring a large batch size , whereas MoCo stores negative examples in a queue and casts the contrastive learning task as dictionary ( query - key ) lookup . In what follows , we first describe MoCo and then how we use it for sentence - level alignment .
MoCo maintains a query encoder f q and a momentum ( key ) encoder f k ; the key encoder 's parameters are updated as an exponential moving average of the query encoder 's :
$\theta_k \leftarrow m\,\theta_k + (1 - m)\,\theta_q$ ( 1 )
where $\theta_q$ and $\theta_k$ are the model parameters of f q and f k , respectively , and m is the momentum coefficient .
Sentence - Level Alignment Objective
Our sentence - level alignment falls under the general problem of bringing two views of inputs from the same source closer in the representation space while keeping those from different sources dissimilar through a contrastive loss . From a cross - lingual alignment perspective , we treat an English sequence $S^{en}_i$ and its translation $S^{tr}_i$ in another language $tr \in L$ as two manifestations of the same semantics . At the same time , sentences that are not translations of each other should be further apart in the representation space . Given parallel corpora consisting of $\{ (S^{en}_1 , S^{tr}_1) , \dots , (S^{en}_N , S^{tr}_N) \}$ , we align sentence representations in all the different languages together using MoCo .
We use the pretrained mBERT model to initialize both the query and momentum encoders . mBERT is made of 12 Transformer blocks , 12 attention heads , and hidden size $d_h = 768$ . For input , instead of feeding the query encoder with English examples and the momentum encoder with translation examples or vice versa , we propose a random input shuffling approach . Specifically , we randomly shuffle the order of $S^{en}_i$ and $S^{tr}_i$ when feeding the two encoders , so that the query encoder sees both English and translation examples . We observe that this is a crucial step towards learning good multilingual representations using our method . The final hidden state $h \in \mathbb{R}^{1 \times d_h}$ of the [ CLS ] token , normalized with the L2 norm , is treated as the sentence representation . Following Chen et al . ( 2020 ) , we add a non - linear projection layer on top of h :
$z = W_2\,\mathrm{ReLU}(W_1 h)$ ( 2 )
where $W_1 \in \mathbb{R}^{d_h \times d_h}$ , $W_2 \in \mathbb{R}^{d_k \times d_h}$ , and $d_k$ is set to 300 . The model is trained using the InfoNCE loss :
$\mathcal{L}_{\mathrm{MoCo}} = -\log \frac{\exp(z_q \cdot z_{k^+} / \tau)}{\sum_{k=1}^{K} \exp(z_q \cdot z_k / \tau)}$ ( 3 )
where τ is a temperature parameter . In our implementation , we use a relatively small batch size of 128 , resulting in more frequent parameter updates than if a large batch size were used . Items enqueued early on can thus become outdated with a large queue , so we scale down the queue size to K = 32,000 to prevent the queue from becoming stale .
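A minimal PyTorch sketch of this queue - based InfoNCE objective and the momentum update of Eq . ( 1 ) is given below ( the temperature value and function names are assumptions ) :

import torch
import torch.nn.functional as F

def info_nce(z_q, z_k, queue, tau=0.05):
    """z_q, z_k: (batch, d) L2-normalized projections from the query / momentum
    encoders for parallel sentence pairs; queue: (K, d) stored negative keys."""
    pos = torch.sum(z_q * z_k, dim=-1, keepdim=True)      # (batch, 1) positive logits
    neg = z_q @ queue.t()                                  # (batch, K) negative logits
    logits = torch.cat([pos, neg], dim=1) / tau
    labels = torch.zeros(z_q.size(0), dtype=torch.long)   # positive is always index 0
    return F.cross_entropy(logits, labels)

def momentum_update(f_q, f_k, m=0.999):
    """Update the key encoder as an exponential moving average of the query encoder."""
    for p_q, p_k in zip(f_q.parameters(), f_k.parameters()):
        p_k.data = m * p_k.data + (1.0 - m) * p_q.data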
Word - Level Alignment Objective
We use TLM for word - level alignment . TLM is an extension of MLM that operates on bilingual data : parallel sentences are concatenated and MLM is applied to the combined bilingual sequence . Different from Conneau and Lample ( 2019 ) , we do not reset positional embeddings when forming the bilingual sequence , and we also do not use language embeddings . In addition , the order of $S^{en}_i$ and $S^{tr}_i$ during concatenation is determined by the random input shuffling from the sentence - level alignment step and we add a [ SEP ] token between $S^{en}_i$ and $S^{tr}_i$ . We randomly mask 15 % of the WordPiece tokens in each combined sequence . Masking is done by using a special [ MASK ] token 80 % of the time , a random token from the vocabulary 10 % of the time , and the unchanged token for the remaining 10 % . TLM is performed using the query encoder of MoCo . Our final PPA model is trained in a multi - task manner with both the sentence - level objective and TLM :
$\mathcal{L} = \mathcal{L}_{\mathrm{MoCo}} + \mathcal{L}_{\mathrm{TLM}}$ ( 4 )
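The 80/10/10 masking applied to the concatenated bilingual sequence for TLM can be sketched as follows ( special - token handling is simplified ) :

import random

def mask_tokens(token_ids, mask_id, vocab_size, special_ids, mlm_prob=0.15):
    """Apply 80/10/10 masking; labels use -100 for positions ignored by the loss."""
    labels = [-100] * len(token_ids)
    masked = list(token_ids)
    for i, tok in enumerate(token_ids):
        if tok in special_ids or random.random() >= mlm_prob:
            continue
        labels[i] = tok
        r = random.random()
        if r < 0.8:
            masked[i] = mask_id                        # 80%: replace with [MASK]
        elif r < 0.9:
            masked[i] = random.randrange(vocab_size)   # 10%: random vocabulary token
        # remaining 10%: keep the original token
    return masked, labels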
Finetuning on Downstream Tasks
After an alignment model is trained with PPA , we extract the query encoder from MoCo and finetune it on downstream tasks for evaluation . We follow the standard way of finetuning BERT - like models for sequence classification and QA tasks :
( 1 ) on XNLI , we concatenate the premise with the hypothesis , and add a [ SEP ] token in between .
A softmax classifier is added on top of the final hidden state of the [ CLS ] token ; ( 2 ) on MLQA , we concatenate the question with the context , and add a [ SEP ] token in between . We add two linear layers on top of mBERT followed by a softmax over the context tokens to predict answer start and end positions , respectively . We conduct experiments in two settings : 1 . zero - shot cross - lingual transfer , where training data is available in English but not in the target languages ; 2 . translate - train , where the English training set is ( machine ) translated to all the target languages . For the latter setting , we perform data augmentation with code - switched inputs when training on languages other than English . For example , a Spanish question q es and context c es pair can be augmented to two question - context pairs ( q es , c en ) and ( q en , c es ) with code - switching , resulting in 2x training data . The same goes for XNLI with premises and hypotheses . The code - switching is always between English and a target language . During training , we ensure the two augmented pairs appear in the same batch .
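The sentence - level code - switching augmentation can be sketched as follows ( field names are illustrative ) :

def code_switch_qa(example):
    """example: dict with 'q_en', 'c_en', 'q_tr', 'c_tr' (English and target-language
    question / context). Returns the two mixed-language pairs kept in the same batch."""
    return [
        {"question": example["q_tr"], "context": example["c_en"]},  # e.g. (q_es, c_en)
        {"question": example["q_en"], "context": example["c_tr"]},  # e.g. (q_en, c_es)
    ]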
3 Experimental Settings
Parallel Data for Post - Pretraining
Parallel Data All parallel data we use involve English as the source language . Specifically , we collect en - fr , en - es , en - de parallel pairs from Europarl , en - ar , en - zh from MultiUN ( Ziemski et al . , 2016 ) , en - hi from IITB ( Kunchukuttan et al . , 2018 ) , and en - bg from both Europarl and EUbookshop . All datasets were downloaded from the OPUS website ( Tiedemann , 2012 ) . In our experiments , we vary the number of parallel sentence pairs for PPA . For each language , we take the first 250k , 600k , and 2 M English - translation parallel sentence pairs except for those too short ( where either sentence has less than 10 WordPiece tokens ) or too long ( where both sentences concatenated together have more than 128 WordPiece tokens ) . Table 1 shows the actual number of parallel pairs in each of our 250k , 600k , and 2 M settings .
Evaluation Benchmarks
XNLI is an evaluation dataset for cross - lingual NLI that covers 15 languages . The dataset is human - translated from the development and test sets of the English MultiNLI dataset . Given a sentence pair of premise and hypothesis , the task is to classify their relationship as entailment , contradiction , or neutral . For zero - shot cross - lingual transfer , we train on the English MultiNLI training set , and apply the model to the test sets of the other languages . For translate - train , we train on the translation data that come with the dataset .
MLQA is an evaluation dataset for QA that covers seven languages . The dataset is derived from a three - step process . We focus on XLT in this work . For zero - shot cross - lingual transfer , we train on the English SQuAD v1.1 ( Rajpurkar et al . , 2016 ) training set . For translate - train , we train on the translation data provided in Hu et al . ( 2020 ) .
Training Details
Results
We report results on the test sets of XNLI and MLQA and we do hyperparameter searching on the development sets . All the experiments for translate - train were done using the code - switching technique introduced in Section 2 .
XNLI Table 2 shows results on XNLI measured by accuracy . Devlin et al . ( 2019 ) only provide results on a few languages , so we use previously reported mBERT results as our baseline for zero - shot cross - lingual transfer , and Wu and Dredze ( 2019 ) for translate - train . Our best model , trained with 2 M parallel sentences per language , improves over the mBERT baseline by 4.7 % for zero - shot transfer , and 3.2 % for translate - train .
Compared to Cao et al . ( 2020 ) , which use 250k parallel sentences per language from the same sources as we do for post - pretraining alignment , our 250k model does better for all languages considered and we do not rely on the word - to - word pre - alignment step using FastAlign , which is prone to error propagation to the rest of the pipeline .
Compared to XLM , our 250k , 600k and 2 M settings represent 3.1 % , 7 % and 17.8 % of the parallel data used by XLM , respectively ( see Table 1 ) . The XLM model also has 45 % more parameters than ours , as Table 3 shows . Furthermore , XLM trained with MLM only is already significantly better than mBERT even though the source of its training data is the same as mBERT 's , from Wikipedia . One reason could be that XLM contains 45 % more model parameters than mBERT , as model depth and capacity are shown to be key to cross - lingual success ( K et al . , 2020 ) . Additionally , Wu and Dredze ( 2019 ) hypothesize that limiting pretraining to the languages used by downstream tasks may be beneficial since XLM models are pretrained on the 15 XNLI languages only . Our 2 M model bridges the gap between mBERT and XLM from 7.5 % to 2.8 % for zero - shot transfer . Note that , for bg , our total processed pool of en - bg data consists of 456k parallel sentences , so there is no difference in en - bg data between our 600k and 2 M settings . For translate - train , our model achieves comparable performance to XLM with the further help of code - switching during finetuning .
Our alignment - oriented method is , to a large degree , upper - bounded by the English performance , since all our parallel data involve English and all the other languages are implicitly aligned with English through our PPA objectives . Our 2 M model is able to improve the English performance to 82.4 from the mBERT baseline , but it is still lower than XLM ( MLM ) , and much lower than XLM ( MLM+TLM ) . We hypothesize that more high - quality monolingual data and model capacity are needed to further improve our English performance , thereby helping other languages better align with it .
MLQA Table 4 shows results on MLQA measured by F1 score . We notice that the mBERT baseline from the original MLQA paper is significantly lower than the one reported in later work , so we use the latter as our baseline . Our 2 M model outperforms the baseline by 2.3 % for zero - shot and is also 0.2 % better than XLM - R Base , which uses 57 % more model parameters than mBERT , as Table 3 shows . For translate - train , our 250k model is 1.3 % better than the baseline .
Table 3 ( caption , excerpt ) : L is the number of Transformer layers , H_m is the hidden size , H_ff is the dimension of the feed - forward layer , A is the number of attention heads , and V is the vocabulary size .
Comparing our model performance using varying amounts of parallel data , we observe that 600k per language is our sweet spot considering the trade - off between resource and performance . Going up to 2 M helps on XNLI , but less significantly compared to the gain going from 250k to 600k . On MLQA , surprisingly , 250k slightly outperforms the other two for translate - train .
Ablation Table 5 shows the contribution of each component of our method on XNLI . Removing TLM ( -TLM ) consistently leads to about 1 % accuracy drop across the board , showing positive effects of the word - alignment objective . To better understand TLM 's consistent improvement , we replace TLM with MLM ( repl TLM w/ MLM ) , where we treat S en i and S tr i from the parallel corpora as separate monolingual sequences and perform MLM on each of them . The masking scheme is the same as TLM described in Section 2 . We observe that MLM does not bring significant improvement . This confirms that the improvement of TLM is not from the encoders being trained with more data and iterations . Instead , the word - alignment nature of TLM does help the multilingual training .
Comparing our model without word - level alignment , i.e. , -TLM , to the baseline mBERT in Table 2 , we get 2 - 4 % improvement in the zero - shot setting and 1 - 2 % improvement in translate - train as the amount of parallel data is increased . These are relatively large improvements considering the fact that only sentence - level alignment is used . This also conforms to our intuition that sentence - level alignment is a good fit here since XNLI is a sentence - level task .
In the zero - shot setting , removing MoCo ( -MoCo ) performs similarly to -TLM , where we observe an accuracy drop of about 1 % compared to our full system . In translate - train , -MoCo outperforms -TLM and even matches the full system performance for 250k .
Finally, we show the ablation result for our code-switching in translate-train. On average, code-switching provides an additional gain of 1%.

Table 4 fragment: (Lewis et al., 2020) 74.9 54.8 62.2 68.0 48.8 61.1 61.6 | XLM-R Base (Conneau et al., 2020) 77.1 54.9 60.9 67.4 59.4 61.8 63.6 | Translate-train mBERT from (Lewis et al., 2020) 77.7 51.8 62 ...

Table 5: Ablation study on XNLI. 250k, 600k, and 2M refer to the maximum number of parallel sentence pairs per language used in PPA. MoCo refers to our sentence-level alignment task using contrastive learning. TLM refers to our word-level alignment task with translation language modeling. CS stands for code-switching. We conduct an additional study, repl TLM w/ MLM, in which we augment our sentence-level alignment with regular MLM on monolingual text instead of TLM. This ablation confirms that the TLM objective helps because of its word-alignment capability, not because we train the encoders with more data and iterations.

Training mBERT with Word Alignments Cao et al. (2020) post-align mBERT embeddings by first generating word alignments on parallel sentences that involve English. For each aligned word pair, the L2 distance between their embeddings is minimized to train the model. To maintain the original transferability to downstream tasks, a regularization term is added to prevent the target-language embeddings from deviating too much from their mBERT initialization. Our approach post-aligns mBERT with two self-supervised signals from parallel data, without using pre-alignment tools. Wang et al. (2019) also align mBERT embeddings using parallel data. They learn a linear transformation that maps a word embedding in a target language to the embedding of the aligned word in the source language. They show that their transformed embeddings are more effective on zero-shot cross-lingual dependency parsing.
Besides the aforementioned three major directions, Artetxe and Schwenk (2019) train a multilingual sentence encoder on 93 languages. Their stacked BiLSTM encoder is trained by first generating an embedding of a source sentence and then decoding that embedding into the target sentence in other languages.
Concurrent to our work, Chi et al. (2020) and Feng et al. (2020) also leverage variants of contrastive learning for cross-lingual alignment. We focus on a smaller model and improve it using as little parallel data as possible. We also explore code-switching during finetuning on downstream tasks to complement the post-pretraining alignment objectives.
Conclusion
Post - pretraining embedding alignment is an efficient means of improving cross - lingual transferability of pretrained multilingual LMs , especially when pretraining from scratch is not feasible . We showed that our self - supervised sentence - level and word - level alignment tasks can greatly improve mBERT 's performance on downstream tasks of NLI and QA , and the method can potentially be applied to improve other pretrained multilingual LMs .
In addition to zero - shot cross - lingual transfer , we also showed that code - switching with English during finetuning provides additional alignment signals , when training data is available for the target language .
-DOCSTART- Aspect - Controlled Neural Argument Generation
We rely on arguments in our daily lives to deliver our opinions and base them on evidence, making them more convincing in turn. However, finding and formulating arguments can be challenging. In this work, we present the Arg-CTRL, a language model for argument generation that can be controlled to generate sentence-level arguments for a given topic, stance, and aspect. We define argument aspect detection as a necessary method to allow this fine-granular control and crowdsource a dataset with 5,032 arguments annotated with aspects. Our evaluation shows that the Arg-CTRL is able to generate high-quality, aspect-specific arguments, applicable to automatic counter-argument generation. We publish the model weights and all datasets and code to train the Arg-CTRL.
Figure 1 (excerpt): "Nuclear reactors produce radioactive waste ..."
Introduction
Language models (Bengio et al., 2003) make it possible to generate text from learned distributions over a language and have been applied to a variety of areas like machine translation (Bahdanau et al., 2015), summarization (Paulus et al., 2018), or dialogue systems (Wen et al., 2017). A rather new field for these models is the task of producing text with argumentative content (Wang and Ling, 2016). We believe this technology can support humans in the challenging task of finding and formulating arguments. A politician might use it to prepare for a debate with a political opponent or for a press conference. It may be used to support students in writing argumentative essays or to enrich one-sided discussions with counter-arguments. In contrast to retrieval methods, generation allows combining and stylistically adapting text (e.g. arguments) based on a given input (usually the beginning of a sentence). Current argument generation models, however, produce lengthy texts and allow the user little control over the aspect the argument should address (Hua and Wang, 2018). We show that argument generation can be enhanced by allowing for fine-grained control and limiting the argument to a single but concise sentence.
Controllable language models like the CTRL (Keskar et al., 2019) can be conditioned at training time on certain control codes. At inference, these can be used to direct the model's output with regard to content or style. We build upon this architecture to control argument generation based solely on a given topic, stance, and argument aspect. For instance, to enforce focus on the aspect of cancer for the topic of nuclear energy, we input the control code "Nuclear Energy CON cancer", which creates a contra argument discussing this aspect, for instance: "Studies show that people living next to nuclear power plants have a higher risk of developing cancer.".
To obtain control codes from training data , we pre - define a set of topics to retrieve documents for and rely on an existing stance detection model to classify whether a sentence argues in favor ( pro ) or against ( con ) the given topic ( Stab et al . , 2018a ) . Regarding argument aspect detection , however , past work has two drawbacks : it either uses simple rule - based extraction of verb - and noun - phrases ( Fujii and Ishikawa , 2006 ) or the definition of aspects is based on target - concepts located within the same sentence ( Gemechu and Reed , 2019 ) . Aspects as we require and define them are not bound to any part - of - speech tag and ( 1 ) hold the core reason upon which the conclusion / evidence is built and ( 2 ) encode the stance towards a general but not necessarily explicitly mentioned topic the argument discusses . For instance :
Topic: Nuclear Energy
Argument: Running nuclear reactors is costly as it involves long-time disposal of radioactive waste.
The evidence of this argument is based upon the two underlined aspects . While these aspects encode a negative stance towards the topic of " Nuclear Energy " , the topic itself is not mentioned explicitly in the argument .
Our final controlled argument generation pipeline ( see Figure 1 ) works as follows : ( 1 ) We gather several million documents for eight different topics from two large data sources . All sentences are classified into pro- , con- , and non - arguments . We detect aspects of all arguments with a model trained on a novel dataset and concatenate arguments with the same topic , stance , and aspect into training documents . ( 2 ) We use the collected classified data to condition the Arg - CTRL on the topics , stances , and aspects of all gathered arguments .
( 3 ) At inference , passing the control code [ Topic ] [ Stance ] [ Aspect ] to the model will generate an argument that follows these commands .
Our evaluation shows that the Arg - CTRL is able to produce aspect - specific , high - quality arguments , applicable to automatic counter - argument generation . The contributions are as follows : ( i ) We adapt and fine - tune the CTRL for aspect - controlled neural argument generation . ( ii ) We show that detecting argument aspects and conditioning the generation model on them are necessary steps to control the model 's training process and its perspective while generating . ( iii ) We propose several methods to analyze and evaluate the quality of ( controllable ) argument generation models . ( iv ) We develop a new scheme to annotate argument aspects and release a dataset with 5,032 samples .
Related Work
Argument Aspect Detection Early work by Fujii and Ishikawa ( 2006 ) focuses mainly on Japanese and restricts aspects to noun - and verb - phrases , extracted via hand - crafted rules . Boltužić and Šnajder ( 2017 ) extract noun - phrases and aggregate them into concepts to analyze the microstructure of claims . Misra et al . ( 2015 ) introduce facets as low level issues , used to support or attack an argumentation . In that , facets are conceptually similar to aspects , but not explicitly phrased and instead seen as abstract concepts that define clusters of semantically similar text - spans of summaries . Bilu et al . ( 2019 ) define commonplace arguments that are valid in several situations for specified actions ( e.g. " ban " ) and topics ( e.g. " smoking " ) . These actions are similar to aspects , but limited in number and manually defined . Gemechu and Reed ( 2019 ) detect , amongst others , concepts and aspects in arguments with models trained on expert annotations . However , in their definition , aspects have to point to a target concept mentioned in the argument . In our definition , aspects refer to a general topic which is not necessarily part of the sentence and our annotation scheme is applicable by non - experts .
The concept of framing dimensions (Boydstun et al., 2014) is close to argument aspects. In the field of argument mining, Ajjour et al. (2019) recently applied frames to label argument clusters. Yet, their method does not allow detecting frames. Other works present methods to automatically label sentences of news articles and online discussions with frames (Hartmann et al., 2019; Naderi and Hirst, 2017). These methods are, however, limited to a small set of predefined frames that represent high-level concepts. In contrast, we operate on a fine-grained span level to detect aspects that are explicitly mentioned in arguments.
Argument Generation Early approaches rely on rules from argumentation theory and user preference models ( Carenini and Moore , 2006;Zukerman et al . , 1998 ) . In a more recent work , Sato et al . ( 2015 ) construct rules to find arguments in a large data source , which are then filtered and ordered with a neural network based ranker . Baff et al . ( 2019 ) use a clustering and regression approach to assemble discourse units ( major claims , pro and con statements ) to argumentative texts . However , most of these approaches rely on hand - crafted features and do not generalize well . Moreover , they all require permanent access to large data sources and are not able to generate new arguments .
Recently, research on generating arguments with language models has gained more attention. One line of work uses a sequence-to-sequence model (Sutskever et al., 2014) to generate argumentative text by attending to the input and to keyphrases automatically extracted for the input from, for example, Wikipedia. Other work focuses on generating argumentative dialogue (Le et al., 2018) and counter-arguments (Hidey and McKeown, 2019) based on a given input sentence, or on generating summaries from a set of arguments (Wang and Ling, 2016). In contrast, we train a language model that does not require a sentence-level input for generation and allows for direct control over the topic, stance, and aspect of the produced argument. Xing et al. (2017) design a language model that attends to topic information to generate responses for chatbots. Dathathri et al. (2019) train two models that control the sentiment and topic of the output of pre-trained language models at inference. Gretz et al. (2020a) fine-tune GPT-2 on existing, labeled datasets to generate claims for given topics. However, the latter works do not explore generation with such fine-grained and explicit control as proposed in this work. We show that argument generation requires the concept of argument aspects to shape the produced argument's perspective and to allow for diverse arguments for a topic of interest.
Argument Aspect Detection
Argument aspect detection is necessary for our argument generation pipeline, as it allows for fine-grained control over the generation process. We create a new dataset, as existing approaches either rely on coarse-grained frames or cannot be applied by non-expert annotators in a scalable manner.
Dataset Creation
We base our new aspect detection dataset on the UKP Sentential Argument Mining Corpus ( UKP - Corpus ) by Stab et al . ( 2018b ) , as it already contains sentence - level arguments and two of the control codes we aim to use : topics and stance labels . More precisely , it contains 25,474 manually labelled sentences for eight controversial topics in English . Each sample consists of a topic and a sentence , labelled as either being supporting , attacking , or no argument towards the given topic . As we are only interested in arguments , we do not consider the non - argumentative sentences .
Step 1: Preliminary annotations To ensure the feasibility of creating a dataset for this task, two experts (a post-doctoral researcher and an undergraduate student with NLP background) independently annotate 800 random samples (from four topics, 200 per topic) taken from the UKP-Corpus. The annotations are binary and on token level, where multiple spans of tokens could be selected as aspects. The resulting inter-annotator agreement of this study is Krippendorff's α_u = .38. While this shows that the task is generally feasible, the agreement on exact token spans is rather low. Hence, in the following steps, we reduce the complexity of the annotation task.
Step 2 : Annotation scheme Instead of free spanlevel annotations , we present annotators with a ranked list of aspect recommendations . To generate meaningful recommendations , we train a ranking model using the preliminary annotations ( Step 1 ) .
Step 2a: Data preparation for ranking To create training data for the ranker, we use a simple heuristic to calculate scores between 0 and 1 for all N-grams of a sentence by dividing the number of aspect tokens within an N-gram by its length N: score = (# aspect tokens) / N ∈ [0, 1]. Our analysis reveals that 96% (783 of 814) of all aspects in the preliminary annotation dataset only contain one to four tokens. We thus decide to ignore all candidates with more than four tokens. No other limitations or filtering mechanisms are applied.
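A minimal sketch of this scoring heuristic, assuming whitespace tokenization and gold aspect annotations given as token indices (the function and variable names are ours, not the paper's):

```python
def ngram_aspect_scores(tokens, aspect_token_idx, max_n=4):
    """Score every 1- to 4-gram by the fraction of its tokens that were
    annotated as aspect tokens: score = (# aspect tokens) / N."""
    aspect_token_idx = set(aspect_token_idx)
    scores = {}
    for n in range(1, max_n + 1):
        for start in range(len(tokens) - n + 1):
            ngram = " ".join(tokens[start:start + n])
            hits = sum(1 for i in range(start, start + n) if i in aspect_token_idx)
            scores[(start, ngram)] = hits / n
    return scores

# Aspect tokens: "costly" (index 4) and "radioactive waste" (indices 11, 12).
tokens = ("Running nuclear reactors is costly as it involves "
          "long-time disposal of radioactive waste").split()
scores = ngram_aspect_scores(tokens, aspect_token_idx=[4, 11, 12])
print(scores[(11, "radioactive waste")])   # -> 1.0
```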
Step 2b: Training the ranker We use BERT (Devlin et al., 2019) and MT-DNN (base and large) to train a ranker. For training, we create five splits: (1) one in-topic split using a random subset from all four topics and (2) four cross-topic splits using a leave-one-topic-out strategy. The cross-topic setup allows us to estimate the ranker's performance on unseen topics of the UKP-Corpus.

Table 1: Five most frequent aspects (frequency) per topic.
Gun control: right (30), protect (18), background checks (17), gun violence (14), criminal (13)
Death penalty: cost (16), innocent (12), retribution (10), murder rate (9), deterrent (8)
Abortion: right (21), pain (10), choice (10), right to life (9), risk (9)
Marijuana legalization: dangerous (16), cost (13), risk (12), harm (10), black market (9)
General aspects: dangerous (in 8 of 8 topics), cost / life / risk / safety (in 7 of 8 topics)
A single data sample is represented by an argument and a 1- to 4-gram of this argument, separated by the BERT architecture's [SEP] token. This technique expands the 800 original samples of the dataset to around 80,336. The model is trained for 5 epochs, with a learning rate of 5 × 10^-5 and a batch size of 8. We use the mean squared error as loss and use recall@k to compare the models. The in- and cross-topic results of the best-performing model (MT-DNN_BASE) are reported in Table 2. All results are averaged over runs with five different seeds (and over all four splits for the cross-topic experiments).
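The sketch below shows how such a fine-tuned ranker could be applied at inference with the Hugging Face transformers regression head, mirroring the "argument [SEP] n-gram" input format; the checkpoint path "models/aspect-ranker" is a hypothetical placeholder, not the authors' released model.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Hypothetical path to a regression checkpoint fine-tuned as described above;
# num_labels=1 yields a single regression score per input pair.
tok = AutoTokenizer.from_pretrained("models/aspect-ranker")
model = AutoModelForSequenceClassification.from_pretrained("models/aspect-ranker", num_labels=1)
model.eval()

def rank_candidates(argument, candidates, top_k=10):
    # Each input mirrors the training format: "argument [SEP] candidate n-gram".
    enc = tok([argument] * len(candidates), candidates,
              padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        scores = model(**enc).logits.squeeze(-1)
    order = scores.argsort(descending=True).tolist()
    return [candidates[i] for i in order[:top_k]]
```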
Step 2c : Creating the annotation data For each of the four topics that are part of the preliminary annotation dataset , we use the in - topic model to predict aspects of 629 randomly chosen , unseen arguments from the UKP - Corpus . For the other four topics of the UKP - Corpus , we choose the best cross - topic model to predict aspects for the same amount of samples . To keep a recall of at least 80 % , we choose the ten and fifteen highest - ranked aspect candidates for samples as predicted by the in - topic and cross - topic model , respectively . We remove aspect candidates that include punctuation , begin or end with stopwords , or contain digits .
Step 3: Annotation study We use Amazon Mechanical Turk to annotate each sample by eight different workers located in the US, paying $7.6 per hour (minimum wage is $7.25 per hour). Based on a subset of 232 samples, we compute an α_u of .67 between crowdworkers and experts (three doctoral researchers). Compared to the initial study, the new approach increases the inter-annotator agreement between experts by approx. 11 points (see App. A for further details on the annotation study). Based on this promising result, we create a dataset of 5,032 high-quality samples that are labelled with aspects, as well as with their original stance labels from the UKP-Corpus. We show the most frequent (lemmatized) aspects that appear in some topics in Table 1.
Evaluation
We create a cross-topic split with the data of two topics as test set (gun control, school uniforms), one topic as dev set (death penalty), and the remaining topics as train set, and evaluate two models with it. First, we use the ranking approach described in Steps 2a-2b to fine-tune MT-DNN_BASE on the newly generated data ("Ranker"). At inference, we choose the top T aspects for each argument as candidates. We tune T on the dev set and find T = 2 to be the best choice. Second, we use BERT for sequence tagging (Wolf et al., 2020) and label all tokens of the samples with BIO tags. As previously done with the ranker, we experiment with BERT and MT-DNN weights and find BERT_LARGE to be the best choice (trained for 5 epochs, with a learning rate of 1 × 10^-5 and a batch size of 32). We flatten the predictions for all test samples and calculate macro F1, precision, and recall scores. All models are trained over five seeds and the averaged results are reported in Table 3.
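A minimal sketch of the flattened token-level evaluation described above, assuming the gold and predicted BIO tag sequences are token-aligned; scikit-learn is used here for convenience, which is our choice rather than the paper's stated tooling.

```python
from sklearn.metrics import precision_recall_fscore_support

def bio_macro_scores(gold_tag_seqs, pred_tag_seqs):
    """Flatten token-level BIO predictions over all test samples and compute
    macro-averaged precision, recall, and F1 over the B, I, and O classes."""
    gold = [tag for seq in gold_tag_seqs for tag in seq]
    pred = [tag for seq in pred_tag_seqs for tag in seq]
    p, r, f1, _ = precision_recall_fscore_support(
        gold, pred, labels=["B", "I", "O"], average="macro", zero_division=0)
    return p, r, f1

# p, r, f1 = bio_macro_scores([["B", "I", "O"]], [["B", "O", "O"]])
```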
BERT_LARGE predicts classes B and I with an F1 of .65 and .53, respectively; hence, aspects with more than one token are identified less well. A difference is to be expected, as the class balance of B's to I's is 2,768 to 2,103. While the ranker performs worse on the reported metrics, it has a slightly higher recall for class I. We assume this is because it generally ranks aspects with more than one token on top, i.e. there will often be at least one or more I's in the prediction. In contrast, BERT_LARGE focuses more on shorter aspects, which is also in accordance with the average aspect length of 1.8 tokens per aspect in the dataset. In total, BERT_LARGE outperforms the ranker by almost 6 percentage points in macro F1.
Data Collection Pipeline
This section describes the data collection and preprocessing for the argument generation pipeline.
We aim to train a model that is able to transfer argumentative information concisely within a single sentence. We define such an argument as the combination of a topic and a sentence holding evidence with a specific stance towards this topic (Stab et al., 2018b). Consequently, the following preprocessing steps ultimately target the retrieval and classification of sentences. To evaluate different data sources, we use a dump from Common-Crawl as well as one from Reddit-Comments. We notice that many sentences are not relevant with regard to the document's topic. To enforce topic relevance, we decide to filter out all sentences that do not contain at least one token of the respective topic or its defined synonyms (see App. B). We use the ArgumenText API's argument and stance classification models (Stab et al., 2018a) to classify all sentences into argument or non-argument (F1 macro = .7384), and the remaining arguments into pro or con with regard to the topic (F1 macro = .7661).
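A rough sketch of the topic-relevance filter described above; the synonym lists shown here are illustrative placeholders (the paper's actual lists are in its App. B), and simple substring matching is an assumption.

```python
TOPIC_SYNONYMS = {
    # Illustrative entries only; the full synonym lists are given in App. B.
    "nuclear energy": ["nuclear", "fission", "reactor"],
    "school uniforms": ["uniform", "dress code"],
}

def is_topic_relevant(sentence, topic):
    """Keep a sentence only if it mentions a token of the topic or a synonym."""
    text = sentence.lower()
    keywords = topic.lower().split() + [s.lower() for s in TOPIC_SYNONYMS.get(topic, [])]
    return any(kw in text for kw in keywords)

# is_topic_relevant("Reactors produce radioactive waste.", "nuclear energy") -> True
```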
Aspect Detection We detect aspects on all remaining arguments. To speed up the detection on millions of sentences, we use BERT_BASE instead of BERT_LARGE (see Table 3).
Training Document Generation We create the final training documents for the argument generation model by concatenating all arguments that have the same topic , stance , and aspect ( i.e. the same control code ) . Further , we aggregate all arguments that include an aspect with the same stem into the same document ( e.g. arguments with cost and costs as aspect ) . To cope with limited hardware resources , we restrict the total number of arguments for each topic and stance to 100,000 ( i.e. 1.6 M over all eight topics ) . Also , as some aspects dominate by means of quantity of related arguments and others appear only rarely , we empirically determine an upper and lower bound of 1,500 and 15 arguments for each document , which still allows us to retrieve the above defined amount of training arguments .
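The grouping step could look roughly like the sketch below. It assumes single-token aspects and uses NLTK's Porter stemmer as a stand-in for whichever stemmer the authors used; the bounds are the ones stated above.

```python
from collections import defaultdict
from nltk.stem import PorterStemmer   # stemmer choice is an assumption

MIN_ARGS, MAX_ARGS = 15, 1500         # per-document bounds from the paper

def build_training_documents(arguments):
    """Group arguments by (topic, stance, stemmed aspect); each group that is
    large enough becomes one training document for the Arg-CTRL."""
    stemmer = PorterStemmer()
    groups = defaultdict(list)
    for arg in arguments:             # arg: dict with "topic", "stance", "aspect", "text"
        key = (arg["topic"], arg["stance"], stemmer.stem(arg["aspect"]))
        groups[key].append(arg["text"])
    return {key: " ".join(texts[:MAX_ARGS])
            for key, texts in groups.items() if len(texts) >= MIN_ARGS}
```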
Model Training and Analysis
In the following , we describe the architecture and the training process of the Arg - CTRL and analyze its performance .
Model and Training
Model The goal of a statistical language model is to learn the conditional probability of the next word given all (or a subset of) the previous ones (Bengio et al., 2003). That is, for a sequence of tokens x = (x_1, ..., x_n), the model learns p(x_i | x_{<i}), where x_i is the i-th word of sequence x. For this work, we use the 1.63 billion-parameter Conditional Transformer Language Model (CTRL) by Keskar et al. (2019), which is built on a transformer-based sequence-to-sequence architecture (Vaswani et al., 2017). The CTRL has been shown to produce high-quality text, is general enough to be adapted for conditioning on the control codes we aim to use, and does not require pre-training the weights from scratch. Formally, the CTRL adds an extra condition to each sequence by prepending a control code c, hence learning p(x_i | x_{<i}, c). The control code is represented by a single token and can then be used to direct the model output at inference. We extend the model from its previous limit of a single-token control code to accept multiple tokens. The respective control code is prepended to each sequence of 256 subwords of a document.
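A minimal sketch of how training sequences with a multi-token control code could be assembled from a tokenized training document; the block size of 256 subwords is taken from the description above, while the function name and interface are ours.

```python
def make_training_sequences(document_tokens, control_code_tokens, block_size=256):
    """Split a training document into blocks of `block_size` subwords and
    prepend the (possibly multi-token) control code to every block, so the
    model learns p(x_i | x_<i, c)."""
    sequences = []
    for start in range(0, len(document_tokens), block_size):
        block = document_tokens[start:start + block_size]
        sequences.append(control_code_tokens + block)
    return sequences

# e.g. make_training_sequences(bpe_tokens, ["nuclear", "energy", "CON", "waste"])
```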
Analysis
Generation At inference, we gather multiple generated arguments from a control-code input by splitting the generated output text into sentences with NLTK (Bird et al., 2009). We observe that for the first generated argument, the Arg-CTRL mostly outputs very short phrases, as it tries to incorporate the control code into a meaningful start of an argument. We prevent this by adding a punctuation mark after each control code (e.g. a period or colon), signaling the model to start a new sentence. In this fashion, we generate pro- and con-arguments up to the pre-defined training split size (not counting non-arguments from the splits) for each topic of the UKP-Corpus, resulting in 7,991 newly generated arguments. We do this with both models and use the generated arguments as a basis for the following analysis and evaluation methods. Examples of generated arguments can be found in Tables 4, 6, and 7 (as part of the evaluation, see Section 7).
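The following is a rough sketch of this inference procedure using the Hugging Face transformers API; the checkpoint path "models/arg-ctrl-cc" and the sampling settings are assumptions, not the authors' released configuration.

```python
import nltk
from transformers import AutoTokenizer, AutoModelForCausalLM

nltk.download("punkt", quiet=True)                 # sentence splitter used below

# Hypothetical local path to a fine-tuned Arg-CTRL checkpoint (an assumption).
tok = AutoTokenizer.from_pretrained("models/arg-ctrl-cc")
model = AutoModelForCausalLM.from_pretrained("models/arg-ctrl-cc")

def generate_arguments(topic, stance, aspect, max_new_tokens=200):
    # The trailing period signals the model to start a fresh sentence after
    # the control code, as described above.
    prompt = f"{topic} {stance} {aspect}."
    input_ids = tok(prompt, return_tensors="pt").input_ids
    output = model.generate(input_ids, max_new_tokens=max_new_tokens,
                            do_sample=True, top_p=0.95)
    text = tok.decode(output[0], skip_special_tokens=True)
    continuation = text[len(prompt):] if text.startswith(prompt) else text
    # One generated argument per sentence.
    return nltk.sent_tokenize(continuation.strip())

# generate_arguments("nuclear energy", "CON", "cancer")
```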
Results With no other previous work on explicit control of argument generation (to the best of our knowledge), we decide to prove our concept of aspect-controlled neural argument generation by
comparing both generation models to a retrieval approach as a strong upper bound . The retrieval approach returns all arguments from the classified training data ( see Section 4 ) that match a given topic , stance , and aspect . Both the retrieval and generation approaches are evaluated against reference data from debate portals and compared via METEOR ( Lavie and Agarwal , 2007 ) and ROUGE - L ( Lin , 2004 ) metrics . The retrieval approach has an advantage in this setup , as the arguments are also of human origin and aspects are always explicitly stated within a belonging argument .
The reference data was crawled from two debate portals and consists of pro- and con-paragraphs discussing the eight topics of the UKP-Corpus. As the paragraphs may include non-arguments, we filter these out by classifying all sentences with the ArgumenText API into arguments and non-arguments. This leaves us with 349 pro- and 355 con-arguments over all topics (see App. D for the topic-wise distribution). Next, we detect all aspects in these arguments. Arguments with the same topic, stance, and aspect are then grouped and used as reference for (a) generated arguments and (b) retrieval-approach arguments that hold the same topic, stance, and aspect. The results reveal that both the average METEOR and ROUGE-L scores are only marginally lower than the retrieval scores (METEOR is 0.5/1.1 points lower for the Arg-CTRL_REDDIT/Arg-CTRL_CC, see Table 5). This not only shows the strength of the architecture, but also the success in generating sound aspect-specific arguments with our approach.

Overlap with Training Data We find arguments generated by the models to be genuine, i.e. demonstrating substantial differences to the training data. For each of the 7,991 generated arguments, we find the most similar argument in the training data based on the cosine similarity of their BERT embeddings. Examples of generated sentences and their most similar training sentences:

Generated sentence: We don't need more gun control laws when we already have enough restrictions on who can buy guns in this country.
Training sentence: We have some of the strongest gun laws in the country, but guns don't respect boundaries any more than criminals do.
Cosine similarity / edit distance / rel. overlap: 95.59 / 88 / 8%

Generated sentence: The radioactivity of the spent fuel is a concern, as it can be used to make weapons and has been linked to cancer in humans.
Training sentence: However, it does produce radioactive waste, which must be disposed of carefully as it can cause health problems and can be used to make nuclear weapons.
Cosine similarity / edit distance / rel. overlap: 92.40 / 99 / 17%

We compare all models by verifying whether or not the aspect used for generation (including synonyms and their stems and lemmas) can be found in the generated arguments. For the original models conditioned on aspects, this is true in 79% of the cases for Arg-CTRL_REDDIT and in 74% of the cases for Arg-CTRL_CC. For the model that was not conditioned on aspects, however, it is only true in 8% of the cases. This clearly shows the necessity of conditioning the model on aspects explicitly, implying the need for argument aspect detection, as the model is otherwise unable to learn to generate aspect-related arguments. Moreover, without prior detection of aspects, we have no means for proper aggregation over aspects. We notice that for the model without prior knowledge of aspects, 79% of all aspects in the training data appear in only one argument. For these aspects, the model will likely not pick up a strong enough signal to learn them.
Evaluation
We evaluate the quality ( intrinsic evaluation ) of the Arg - CTRL and its performance on an exemplary task ( extrinsic evaluation ) . As a basis , we use the 7,991 arguments generated in Section 5 .
Intrinsic Evaluation
Human Evaluation We conduct an expert evaluation on a subset of generated arguments with two researchers (field of expertise: natural language processing) not involved in this paper. Two aspects are evaluated: fluency and persuasiveness. We consider a sentence as fluent if it is grammatically correct, i.e. contains neither semantic nor syntactic errors, and arrange this as a binary task. To reduce subjectivity in the persuasiveness evaluation, the experts do not annotate single arguments but instead compare pairs (Habernal and Gurevych, 2016) of generated and reference-data arguments (see Section 5.2). The experts could either choose one argument as being more persuasive or both as being equally persuasive. In total, the experts compared 100 (randomly sorted and ordered) argument pairs for persuasiveness and fluency (50 each from the Arg-CTRL_REDDIT and the Arg-CTRL_CC). A pair of arguments always had the same topic and stance. For fluency, only the annotations made for generated arguments were extracted and taken into account. Averaged results of both experts show that in 33% of the cases, the generated argument is either more convincing (29%) or as convincing (4%) as the reference argument. Moreover, 83% of generated arguments are fluent. The inter-annotator agreement (Cohen, 1960) between the two experts is Cohen's κ = .30 (percentage agreement: .62) for persuasiveness and κ = .43 (percentage agreement: .72) for fluency, which can be interpreted as "fair" and "moderate" agreement, respectively (Landis and Koch, 1977). As we compare to high-quality, curated data, the perceived persuasiveness of the generated arguments shows the potential of the work, which is further strengthened in the remainder of this section.
Argument Quality We introduce a novel method to evaluate generated arguments based on the argument quality detection approach proposed by Gretz et al. (2020b). They create an argument quality dataset that contains around 30,000 arguments over 71 topics. For each argument, annotators were asked whether or not they would recommend a friend to use the displayed argument in a speech. The quality scores for each argument result from a weighted average (WA) or MACE probability function of all annotations and range between 0 (lowest quality) and 1.0 (highest quality). We use the WA score as label, the same model (BERT_BASE) and hyperparameters as given in the original paper, and reproduce the reported correlations of .52 (Pearson) and .48 (Spearman) on the test dataset (averaged over five different seeds). The model predicts an average argument quality of .71 for the Arg-CTRL_REDDIT arguments, .75 for the training arguments of the UKP-Corpus, and even .76 for the Arg-CTRL_CC arguments. This shows that our model is able to produce arguments that are generally on a similar quality level as arguments made by humans. The difference of 5 points between the Arg-CTRL_CC and the Arg-CTRL_REDDIT, however, shows that training on REDDIT data leads to a lower quality of generated arguments. Exemplarily, for three topics, we show the generated arguments with the highest and lowest argument quality in Table 4 (see App. E for the full table).

Example input arguments and aspect-specific generations by the Arg-CTRL_CC:
Human: Nuclear energy produces waste that stays radioactive for thousands of years and pollutes the environment.
Arg-CTRL_CC (nuclear energy PRO waste): The only thing that's going to be left over is the spent fuel rods, which are not very radioactive and can be reprocessed [...]
Arg-CTRL_CC (nuclear energy PRO radioactive): The radioactivity of the waste is not a problem, because it will decay away in about 300 years or so [...]
Arg-CTRL_CC (nuclear energy PRO environment): The fact that the radioactive waste is not only stored in a safe manner, but also protected from any potential threat to human health or the environment for thousands of years [...]
Human: School uniforms are expensive and affect the pupil's individuality.
Arg-CTRL_CC (school uniforms PRO expensive): The cost of the uniform is not that much more than a regular shirt and shorts [...]
Arg-CTRL_CC (school uniforms PRO individuality): The uniform is a symbol of unity and identity, which helps to foster an environment where students can feel comfortable about expressing their own individual style without being judged [...]
Extrinsic Evaluation: Counter-Arguments
Drafting counter-arguments is an important skill for debating, for providing constructive feedback, and for fostering critical thinking. We lean on the work of Wachsmuth et al. (2018), who describe a counter-argument as discussing the same aspect as an initial argument, but with a switched stance. Hence, given our defined control codes, our model is especially fit for counter-argument generation. Unlike current models for this task, we do not require a specific dataset with argument and counter-argument pairs (Hidey and McKeown, 2019). Also, in contrast to models that implicitly integrate input-related keyphrases into the process of counter-argument generation, our model is able to concentrate on every aspect of the input explicitly and with a separate argument, allowing for more transparency and interpretability over the process of counter-argument generation. We show by example how the combination of aspect detection and controlled argument generation can be successfully leveraged to tackle this task. For that, we manually compose initial arguments for the topics nuclear energy and school uniforms. Then, we automatically detect their aspects and generate a counter-argument for each aspect by passing the topic, the opposite stance of the original argument, and one of the aspects into the Arg-CTRL_CC. For both topics, the Arg-CTRL_CC produces meaningful counter-arguments based on the detected aspects (see Table 7).
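The procedure just described can be sketched as a small piece of glue code; `detect_aspects` and `generate` below are stand-ins for the aspect detection model and the Arg-CTRL generation routine, and the function interface is ours, not the authors'.

```python
def generate_counter_arguments(argument, topic, stance, detect_aspects, generate):
    """Detect the aspects of an input argument, flip its stance, and generate
    one aspect-specific counter-argument per detected aspect."""
    counter_stance = "CON" if stance == "PRO" else "PRO"
    return {aspect: generate(topic, counter_stance, aspect)
            for aspect in detect_aspects(argument)}

# generate_counter_arguments(
#     "Nuclear energy produces waste that pollutes the environment.",
#     topic="nuclear energy", stance="CON",
#     detect_aspects=my_aspect_tagger,        # hypothetical aspect detection callable
#     generate=generate_arguments)            # e.g. the generation sketch above
```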
Conclusion
We apply the concept of controlled neural text generation to the domain of argument generation . Our Arg - CTRL is conditioned on topics , stances , and aspects and can reliably create arguments using these control codes . We show that arguments generated with our approach are genuine and of high argumentative and grammatical quality in general . Moreover , we show that our approach can be used to generate counter - arguments in a transparent and interpretable way . We fine - tune the Arg - CTRL on two different data sources and find that using mixed data from Common - Crawl results in a higher quality of generated arguments than using user discussions from Reddit - Comments . Further , we define argument aspect detection for controlled argument generation and introduce a novel annotation scheme to crowdsource argument aspect annotations , resulting in a high - quality dataset . We publish the model weights , data , and all code necessary to train the Arg - CTRL .
Ethics Statement
Models for argument and claim generation have been discussed in our related work and are widely available . Gretz et al . ( 2020a ) suggest that , in order to allow for a fine - grained control over claim / argument generation , aspect selection needs to be handled carefully , which is what we have focused on in this work . The dangers of misuse of language models like the CTRL have been extensively discussed by its authors ( Keskar et al . , 2019 ) . The ethical impact of these works has been weighed and deemed justifiable . Argument generation - and natural language generation as a whole - is subject to dual use . The technology can be used to create arguments that can not be distinguished from human - made arguments . While our intentions are to support society , to foster diversity in debates , and to encourage research on this important topic , we are aware of the possibility of harmful applications this model can be used for . For instance , the model could be used to generate only opposing ( or supporting ) arguments on one of the pretrained topics and aspects and , as such , bias a debate into a certain direction . Also , bots could use the generated arguments to spread them via social media . The same is true , however , for argument search engines , which can be used by malicious parties to retrieve ( and then spread ) potentially harmful information .
However , controllable argument generation can also be used to support finding and formulating ( counter-)arguments for debates , for writing essays , to enrich one - sided discussions , and thus , to make discourse more diverse overall . For instance , anticipating opposing arguments is crucial for critical thinking , which is the foundation for any democratic society . The skill is extensively taught in school and university education . However , confirmation bias ( or myside bias ) ( Stanovich et al . , 2013 ) , i.e. the tendency to ignore opposing arguments , is an ever - present issue . Technologies like ours could be used to mitigate this issue by , for instance , automatically providing topic - and aspectspecific counter - arguments for all arguments of a given text ( this has been shown for single arguments in Section 7.2 ) . We believe that working on and providing access to such models is of major importance and , overall , a benefit to society .
Open - sourcing such language models also encourages the work on counter - measures to detect malicious use : While many works have been published on the topic of automatic fake news detection in texts ( Kaliyar et al . , 2020;Reis et al . , 2019;Hanselowski et al . , 2018;Pérez - Rosas et al . , 2018 ) , the recent emergence of large - scale language models has also encouraged research to focus on detecting the creator of these texts ( Varshney et al . , 2020;Zellers et al . , 2019 ) . The former approaches are aimed at detecting fake news in general , i.e. independent of who ( or what ) composed a text , whereas the latter approaches are designed to recognize if a text was written by a human or generated by a language model . We encourage the work on both types of methods . Ideally , social networks and news platforms would indicate if a statement was automatically generated in addition to its factual correctness .
Further, we point out some limitations of the Arg-CTRL that mitigate the risks discussed before. One of these limitations is that it cannot be used to generate arguments for unseen topics, which makes a widespread application (e.g. to produce fake news) rather unlikely (using an unseen topic as control code results in nonsensical repetitions of the input). The analysis in Section 6 of the paper shows that the model fails to produce aspect-specific sentences in 92% of the cases if it was not explicitly conditioned on them at training time. Even in case of success, the aspect has to exist in the training data. Also, the model is trained with balanced classes, i.e. both supporting and opposing arguments for each topic are seen with equal frequency to prevent a possible bias in one or the other direction.
To further restrict malicious use, we release the training data for the Arg-CTRLs with an additional clause that forbids use for any other than research purposes. Also, all the training datasets for the Arg-CTRLs will be accessible only via access control (e-mail, name, and purpose of use). Lastly, this work has been reviewed by the ethics committee of the Technical University of Darmstadt, which issued a positive vote.
A Argument Aspect Annotation Study
For the final crowdsourcing study, we use Amazon Mechanical Turk. Workers had to take a qualification test, have an acceptance rate of at least 95%, and be located within the US. We paid $7.6 per hour (minimum wage is $7.25 per hour). Each data sample is annotated by eight crowdworkers. In case the ranker cut off the real aspect(s) from the list of candidates, crowdworkers could select any sequence of up to four tokens from a second list.
Figure 2 shows the annotation guidelines for the Amazon Mechanical Turk study. Figure 3 shows one example of a HIT with two aspects selected. Selected aspects are highlighted in the sentence. We did not allow choosing overlapping aspects. If the aspect was not found in the first list provided by the learned ranker, crowdworkers could choose from a second list with the remaining 1- to 4-grams of the sentence (aspect candidates starting or ending with stopwords, as well as candidates containing punctuation or numbers, were removed from the list). Additional checkboxes were provided for the cases that the sentence contained no aspect or the aspect was not explicitly mentioned. Figure 4 shows a ranked list of aspect candidates for an example.
The structure of the final dataset is described in Section F. For reproducibility of results, we create fixed splits for in- and cross-topic experiments.
B Search Query and Topic Relevance Synonyms
In Table 9, we show the synonyms used for filtering prior to the argument and stance classification step. We filtered out all sentences that did not contain tokens from the topic they belong to or any synonyms defined for this topic.
C Model Parameters and Details
All arguments of the training documents are tokenized with a BPE model (Sennrich et al., 2016) trained by the authors of the CTRL (Keskar et al., 2019). Both the Arg-CTRL_CC and the Arg-CTRL_REDDIT are fine-tuned on a Tesla V100 with 32 GB of memory. We mainly keep the default hyperparameters but reduce the batch size to 4 and train both models for 1 epoch. Each model takes around five days to train on the 1.6M training sentences.
D Reference Data Statistics
Table 10 shows the sources and number of arguments for all topics of the reference dataset . The dataset is used to compare the argument generation models to a retrieval approach .
E Examples of Generated Arguments
For all eight topics, we show the generated argument with the highest and lowest argument quality score in Tables 11 (Arg-CTRL_CC) and 12 (Arg-CTRL_REDDIT). Text in bold shows the given control code, the text afterwards represents the generated argument. Numbers in brackets after the text show the quality score as predicted by the argument quality model.
F Argument Aspect Detection Dataset
The argument aspect detection dataset contains a total of 5,032 samples in JSONL - format , i.e. each dataset sample has a separate line and can be parsed as JSON . A sample contains the keys :
• hash : Unique identifier .
• aspect_pos : List of string tuples " ( begin , length ) " , marking the character position and length of each aspect within the argument .
• aspect_pos_string : The aspects as a list of strings .
• stance : Original stance label of the argument towards the topic , taken from the UKP - Corpus ( Stab et al . , 2018b ) . Either " Argument_for " or " Argument_against " .
• topic : The topic of the argument .
• sentence : The argument .
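A minimal sketch of reading this JSONL format and recovering the aspect spans from the character offsets; the exact string serialization of the (begin, length) tuples is an assumption.

```python
import json

def load_aspect_dataset(path):
    """Read the aspect detection dataset (JSONL) and recover the aspect spans
    from the (begin, length) character offsets stored in `aspect_pos`."""
    samples = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            s = json.loads(line)
            spans = []
            for pos in s["aspect_pos"]:               # e.g. "(34, 6)" (assumed format)
                begin, length = (int(x) for x in pos.strip("() ").split(","))
                spans.append(s["sentence"][begin:begin + length])
            samples.append({
                "topic": s["topic"],
                "stance": s["stance"],                # "Argument_for" / "Argument_against"
                "sentence": s["sentence"],
                "aspects_from_offsets": spans,
                "aspects": s["aspect_pos_string"],
            })
    return samples
```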
For reproducibility, we define a fixed cross-topic split with the data of two topics as test set (gun control, school uniforms), the data of one topic as development set (death penalty), and the data of the remaining five topics as train set. We also create a fixed in-topic split by randomly taking 3,532 samples of all topics for training, 500 for development, and 1,000 for testing.

Example search query: nuclear AND (energy OR fission OR power OR plant)

Examples of generated arguments (control code, generated text, predicted quality score):
• [...] In addition, we must continue developing safer technologies like small modular reactors which will help us meet our nation's need for reliable, emission-free sources of low-emission energy while also creating jobs and providing solutions to some of the world's most pressing problems: climate change, food security and sustainable development. (0.96)
• nuclear energy CON leak: "We are concerned about the possibility of further releases of radioactivity due to possible melting or cracking of fuel rods at the No. (0.47)
• death penalty CON inhuman: Amnesty International opposes the death penalty in all cases as the ultimate form of cruel, inhuman or degrading punishment and a violation of fundamental rights - the right to life and the prohibition of torture. (1.00)
• death penalty CON god: And yet, while the Church exhorts civil authorities. (0.41)
• school uniforms CON resources: The lack of a uniform system for the collection and use of data, as well as insufficient funding to collect it, are major barriers that limit access to information on student achievement in schools with high concentrations of students from low-income families. (0.92)
• school uniforms PRO fun: The kids are having a lot of fun. (0.38)
• gun control PRO homicides: In addition to being an effective crime deterrent and reducing suicides, research has shown that defensive firearm use prevents more than 80 percent of all gun-related violent crimes including homicides, robberies, assaults, rapes, and aggravated assaults [1]. (0.96)
• gun control PRO prevent: You liberals need to wake up and smell the coffee. (0.43)
• marijuana legalization PRO safer: Legalizing cannabis will help reduce crime rates (especially violent crimes) and make society safer overall. (0.96)
• marijuana legalization PRO benefits: Decrease amount of police officers needed 6. (0.37)
• minimum wage PRO poor: Raising the minimum wage will not only benefit those working full time but also reduce government expenditures on social services such as food stamps and Medicaid which disproportionately affect the poor. (0.97)
• minimum wage CON cost: If you raise the price of a Big Mac to $10 and then pay an extra dime or two per burger so that it's still only $9 ... well, maybe your business isn't worth saving at all because there just won't be anyone willing to eat out after they've had their fill (unless perhaps they're on welfare) and now have no job and can't afford health care either - but I doubt very seriously that most people making those kinds of wages/salaries would suddenly decide not to work if forced to shoulder the added cost. (0.44)

Table 12: Generated arguments with the Arg-CTRL_REDDIT. Text in bold shows the given control code, text afterwards represents the generated argument. Numbers in brackets after the text show the quality score as predicted by the argument quality model.
Acknowledgements
We thank Tilman Beck and Nandan Thakur for their support in the human evaluation ( Section 7.1 ) . This work has been supported by the German Research Foundation within the project " Open Argument Mining " ( GU 798/25 - 1 ) , associated with the Priority Program " Robust Argumentation Machines ( RATIO ) " ( SPP-1999 ) .
-DOCSTART- Linking Entities to Unseen Knowledge Bases with Arbitrary Schemas
In entity linking, mentions of named entities in raw text are disambiguated against a knowledge base (KB). This work focuses on linking to unseen KBs that do not have training data and whose schema is unknown during training. Our approach relies on methods to flexibly convert entities with several attribute-value pairs from arbitrary KBs into flat strings, which we use in conjunction with state-of-the-art models for zero-shot linking. We further improve the generalization of our model using two regularization schemes based on shuffling of entity attributes and handling of unseen attributes. Experiments on English datasets, where models are trained on the CoNLL dataset and tested on the TAC-KBP 2010 dataset, show that our models are 12% (absolute) more accurate than baseline models that simply flatten entities from the target KB. Unlike prior work, our approach also allows for seamlessly combining multiple training datasets. We test this ability by adding both a completely different dataset (Wikia) and increasing amounts of training data from the TAC-KBP 2010 training set. Our models are more accurate across the board compared to the baselines.
Introduction
Entity linking consists of linking mentions of entities found in text against canonical entities found in a target knowledge base ( KB ) . Early work in this area was motivated by the availability of large KBs with millions of entities ( Bunescu and Paşca , 2006 ) . Most subsequent work has followed this tradition of linking to a handful of large , publicly available KBs such as Wikipedia , DBPedia ( Auer et al . , 2007 ) or the KBs used in the now decade - old TAC - KBP challenges ( McNamee and Dang , 2009;Ji et al . , 2010 ) . As a result , previous work always assumes complete knowledge of the schema of the target KB that entity linking models are trained for , i.e. how many and which attributes are used to represent entities in the KB . This allows training supervised machine learning models that exploit the schema along with labeled data that link mentions to this a priori known KB . However , this strong assumption breaks down in scenarios which require linking to KBs that are not known at training time . For example , a company might want to automatically link mentions of its products to an internal KB of products that has a rich schema with several attributes such as product category , description , dimensions , etc . It is very unlikely that the company will have training data of this nature , i.e. mentions of products linked to its database .
Our focus is on linking entities to unseen KBs with arbitrary schemas . One solution is to annotate data that can be used to train specialized models for each target KB of interest , but this is not scalable . A more generic solution is to build entity linking models that work with arbitrary KBs . We follow this latter approach and build entity linking models that link to target KBs that have not been observed during training . 1 Our solution builds on recent models for zero - shot entity linking ( Wu et al . , 2020;Logeswaran et al . , 2019 ) . However , these models assume the same , simple KB schema during training and inference . We generalize these models to handle different KBs during training and inference , containing entities represented with an arbitrary set of attribute - value pairs . This generalization relies on two key ideas . First , we convert KB entities into strings that are consumed by the models for zero - shot linking . Central to the string representation are special tokens called attribute separators , which represent frequently occurring attributes in the training KB(s ) , and carry over their knowledge to unseen KBs during inference ( Section 4.1 ) . Second , we generate more flexible string representations by shuffling entity attributes before converting them to strings ,
and by stochastically removing attribute separators to generalize to unseen attributes (Section 4.2).

Table 1: Comparison of entity linking settings. Columns: Generic EL; Zero-shot EL (Logeswaran et al., 2019); Linking to any DB (Sil et al., 2012); This work. Rows: test entities not seen during training; test KB schema unknown; out-of-domain test data; unrestricted candidate set.
Our primary experiments are cross - KB and focus on English datasets . We train models to link to one KB during training ( viz . Wikidata ) , and evaluate them for their ability to link to an unseen KB ( viz . the TAC - KBP Knowledge Base ) . These experiments reveal that our model with attributeseparators and the two generalization schemes are 12 - 14 % more accurate than the baseline zero - shot models . Ablation studies reveal that all components individually contribute to this improvement , but combining all of them yields the most accurate models .
Unlike previous work , our models also allow seamless mixing of multiple training datasets which link to different KBs with different schemas . We investigate the impact of training on multiple datasets in two sets of experiments involving additional training data that links to ( a ) a third KB that is different from our original training and testing KBs , and ( b ) the same KB as the test data . These experiments reveal that our models perform favorably under all conditions compared to baselines .
Background
Conventional entity linking models are trained and evaluated on the same KB, which is typically Wikipedia or derived from Wikipedia (Bunescu and Paşca, 2006). This limited scope allows models to use other sources of information to improve linking, including alias tables, frequency statistics, and rich metadata.
Beyond Conventional Entity Linking There have been several attempts to go beyond such conventional settings, e.g. by linking to KBs from diverse domains such as the biomedical sciences (Zheng et al., 2014; D'Souza and Ng, 2015) and music (Oramas et al., 2016), or even being completely domain- and language-independent (Onoe and Durrett, 2020). Lin et al. (2017) discuss approaches to link entities to a KB that simply contains a list of names without any other information. Sil et al. (2012) use database-agnostic features to link against arbitrary databases. However, their approach still requires training data from the target KB. In contrast, this work aims to train entity linking models that do not rely on training data from the target KB, can be trained on arbitrary KBs, and can be applied to a different set of KBs. Pan et al. (2015) also do unsupervised entity linking by generating rich context representations for mentions using Abstract Meaning Representations (Banarescu et al., 2013), followed by unsupervised graph inference to compare contexts. They assume a rich target KB that can be converted to a connected graph. This works for Wikipedia and adjacent resources, but not for arbitrary KBs. Logeswaran et al. (2019) introduce a novel zero-shot framework to "develop entity linking systems that can generalize to unseen specialized entities". Table 1 summarizes differences between our framework and those from prior work.
Contextualized Representations for Entity Linking Models in this work are based on BERT. While many studies have tried to explain the effectiveness of BERT for NLP tasks (Rogers et al., 2020), the work by Tenney et al. (2019) is most relevant, as they use probing tasks to show that BERT encodes knowledge of entities. This has also been shown empirically by many works that use BERT and other contextualized models for entity linking and disambiguation (Broscheit, 2019; Shahbazi et al., 2019; Yamada et al., 2020; Févry et al., 2020; Poerner et al., 2020).
Preliminaries
Entity Linking Setup
Entity linking consists of disambiguating entity mentions M from one or more documents to a target knowledge base, KB, containing unique entities. We assume that each entity e ∈ KB is represented using a set of attribute-value pairs {(k_i, v_i)}_{i=1}^{n}. The attributes k_i collectively form the schema of KB. The disambiguation of each m ∈ M is aided by the context c in which m appears.
Models for entity linking typically consist of two stages that balance recall and precision.
1. Candidate generation: The objective of this stage is to select K candidate entities E ⊂ KB for each mention m ∈ M, where K is a hyperparameter and K << |KB|. Typically, models for candidate generation are less complex (and hence, less precise) than those used in the following (re-ranking) stage, since they handle all entities in KB. Instead, the goal of these models is to produce a small but high-recall candidate list E. Ergo, the success of this stage is measured using a metric such as recall@K, i.e. whether the candidate list contains the correct entity.
2. Candidate re-ranking: This stage ranks the candidates in E by how likely they are to be the correct entity. Unlike candidate generation, models for re-ranking are typically more complex and oriented towards generating a high-precision ranked list, since the objective of this stage is to identify the most likely entity for each mention. This stage is evaluated using precision@1 (or accuracy), i.e. whether the highest ranked entity is the correct entity.
In traditional entity linking , the training mentions M_train and test mentions M_test both link to the same KB . Even in the zero - shot settings of Logeswaran et al . ( 2019 ) , while the training and target domains and KBs are mutually exclusive , the schema of the KB is constant and known . On the contrary , our goal is to link test mentions M_test to a knowledge base KB_test which is not known during training . The objective is to train models on mentions M_train that link to KB_train and directly use these models to link M_test to KB_test .
Zero - shot Entity Linking
The starting point ( and baselines ) for our work are the state - of - the - art models for zero - shot entity linking , which we briefly describe here ( Wu et al . , 2020 ; Logeswaran et al . , 2019 ) . Candidate Generation Our baseline candidate generation approach relies on similarities between mentions and candidates in a vector space to identify the candidates for each mention ( Wu et al . , 2020 ) , using two BERT models . The first BERT model encodes a mention m along with its context c into a vector representation v_m . v_m is obtained from the pooled representation captured by the [ CLS ] token used in BERT models to indicate the start of a sequence . In this encoder , a binary ( 0/1 ) indicator vector is used to identify the mention span . The embeddings for this indicator vector ( indicator embeddings ) are added to the token embeddings of the mention , as in Logeswaran et al . ( 2019 ) .
The second unmodified BERT model ( i.e. not containing the indicator embeddings as in the mention encoder ) independently encodes each e ∈ KB into vectors . The candidates E for a mention are the K entities whose representations are most similar to v_m . Both BERT models are fine - tuned jointly using a cross - entropy loss to maximize the similarity between a mention and its corresponding correct entity , when compared to other random entities .
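A minimal sketch of this bi - encoder , assuming the Hugging Face transformers library ; the mention - span indicator embeddings described above are omitted for brevity , and all names are ours rather than the authors ' .

import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
mention_encoder = BertModel.from_pretrained("bert-base-uncased")   # encodes m with its context c
entity_encoder = BertModel.from_pretrained("bert-base-uncased")    # encodes entity strings

def cls_embedding(encoder, texts):
    batch = tokenizer(texts, padding=True, truncation=True, max_length=128, return_tensors="pt")
    with torch.no_grad():
        # pooled representation taken from the [CLS] token
        return encoder(**batch).last_hidden_state[:, 0]

def top_k_candidates(mention_in_context, entity_strings, k=32):
    v_m = cls_embedding(mention_encoder, [mention_in_context])   # (1, H)
    v_e = cls_embedding(entity_encoder, entity_strings)          # (|KB|, H)
    scores = v_m @ v_e.T                                         # dot-product similarity
    k = min(k, len(entity_strings))
    return scores.topk(k).indices[0].tolist()                    # indices of the candidate list E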
Candidate Re - ranking The candidate re - ranking approach uses a BERT - based cross - attention encoder to jointly encode a mention and its context along with each candidate from E ( Logeswaran et al . , 2019 ) . Specifically , the mention m is concatenated with its context on the left ( c_l ) , its context on the right ( c_r ) , and a single candidate entity e ∈ E. An [ SEP ] token , which is used in BERT to separate inputs from different segments , is used here to separate the mention in context from the candidate . This concatenated string is encoded using BERT to obtain h_{m,e} , a representation for this mention / candidate pair ( from the [ CLS ] token ) . Given a candidate list E of size K generated in the previous stage , K such representations are generated for each mention , which are subsequently scored using a dot - product with a learned weight vector w . Thus ,
h_{m,e} = BERT( [ CLS ] c_l m c_r [ SEP ] e [ SEP ] ) , score_{m,e} = w^T h_{m,e} .
The candidate with the highest score is chosen as the correct entity , i.e. ê = argmax_{e ∈ E} score_{m,e} .
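A minimal sketch of the cross - attention re - ranker , again assuming the Hugging Face transformers library ; class and variable names are ours , and the exact input packing ( e.g. truncation lengths ) is an assumption .

import torch
from torch import nn
from transformers import BertModel, BertTokenizer

class CrossEncoderReranker(nn.Module):
    def __init__(self, model_name="bert-base-uncased"):
        super().__init__()
        self.bert = BertModel.from_pretrained(model_name)
        # learned weight vector w applied to the [CLS] representation h_{m,e}
        self.w = nn.Linear(self.bert.config.hidden_size, 1, bias=False)

    def forward(self, input_ids, attention_mask, token_type_ids):
        h = self.bert(input_ids=input_ids, attention_mask=attention_mask,
                      token_type_ids=token_type_ids).last_hidden_state[:, 0]
        return self.w(h).squeeze(-1)                  # score_{m,e} = w^T h_{m,e}

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
reranker = CrossEncoderReranker()

def rerank(mention_in_context, candidate_strings):
    # one (mention-in-context, candidate) pair per row, separated by [SEP]
    batch = tokenizer([mention_in_context] * len(candidate_strings), candidate_strings,
                      padding=True, truncation=True, max_length=256, return_tensors="pt")
    with torch.no_grad():
        scores = reranker(**batch)
    return int(scores.argmax())                       # index of the predicted entity in E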
Linking to Unseen Knowledge Bases
The models in Section 3 were designed to operate in settings where the entities in the target KB were only represented using a textual description . For example , the entity Douglas Adams would be represented in such a database using a description as follows : " Douglas Adams was an English author , screenwriter , essayist , humorist , satirist and dramatist . He was the author of The Hitchhiker 's Guide to the Galaxy . "
However , linking to unseen KBs requires handling entities with an arbitrary number and type of attributes . The same entity ( Douglas Adams ) can be represented in a different KB using attributes such as " name " , " place of birth " , etc . ( top of Figure 1 ) . This raises the question of whether such models , that harness the power of pre - trained language models , generalize to linking mentions to unseen KBs , including those without such textual descriptions . This section presents multiple ideas to this end .
Representing Arbitrary Entities using Attribute Separators
One way of using these models for linking against arbitrary KBs is by defining an attribute - to - text function f , that maps arbitrary entities with any set of attributes { ( k_i , v_i ) }_{i=1}^{n} to a string representation e that can be consumed by BERT , i.e.
e = f ( { ( k_i , v_i ) }_{i=1}^{n} ) .
If all entities in the KB are represented using such string representations , then the models described in Section 3 can directly be used for arbitrary schemas . This leads to the question : how can we generate string representations for entities from arbitrary KBs such that they can be used for BERT - based models ? Alternatively , what form can f take ?
A simple answer to this question is concatenation of the values v_i , given by
f ( { ( k_i , v_i ) }_{i=1}^{n} ) = v_1 v_2 ... v_n .
We can improve on this by adding some structure to this representation , teaching our model that the v_i belong to different segments . As in the baseline candidate re - ranking model , we do this by separating them with [ SEP ] tokens . We call this [ SEP]-separation . This approach is also used by Logeswaran et al . ( 2019 ) and Mulang ' et al . ( 2020 ) to separate the entity attributes in their respective KBs .
f ( { ( k_i , v_i ) }_{i=1}^{n} ) = [ SEP ] v_1 [ SEP ] v_2 ... [ SEP ] v_n .
The above two definitions of f use the values v_i , but not the attributes k_i , which also contain meaningful information . For example , if an entity seen during inference has a capital attribute with the value " New Delhi " , seeing the capital attribute allows us to infer that the target entity is likely to be a place , rather than a person , especially if we have seen the capital attribute during training . We capture this information using attribute separators , which are reserved tokens ( in the vein of [ SEP ] tokens ) corresponding to attributes . In this case ,
f ( { ( k_i , v_i ) }_{i=1}^{n} ) = [ K_1 ] v_1 [ K_2 ] v_2 ... [ K_n ] v_n .
Figure 1 illustrates the three instantiations of f . In all cases , attribute - value pairs are ordered in descending order of the frequency with which they appear in the training KB . Finally , since both the candidate generation and candidate re - ranking models we build on use BERT , the techniques discussed here can be applied to both stages , but we only focus on re - ranking .
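The sketch below ( helper names are ours ) spells out the three instantiations of f on attribute - value pairs that are already sorted by training - KB frequency .

def f_concat(attrs):
    # plain concatenation of the values v_i
    return " ".join(v for _, v in attrs)

def f_sep(attrs):
    # [SEP]-separation: prefix every value with the [SEP] token
    return " ".join(f"[SEP] {v}" for _, v in attrs)

def f_attr_sep(attrs, known_attrs):
    # attribute separators: one reserved token [K_i] per frequent training-KB
    # attribute; attributes outside that set fall back to the plain [SEP] token
    parts = []
    for k, v in attrs:
        tok = f"[K_{known_attrs[k]}]" if k in known_attrs else "[SEP]"
        parts.append(f"{tok} {v}")
    return " ".join(parts)

attrs = [("name", "Douglas Adams"), ("occupation", "novelist")]
known_attrs = {"name": 1, "occupation": 2}
print(f_attr_sep(attrs, known_attrs))   # [K_1] Douglas Adams [K_2] novelist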
Regularization Schemes for Improving Generalization
Building models for entity linking against unseen KBs requires that such models do not overfit to the training data by memorizing characteristics of the training KB . This is done by using two regularization schemes that we apply on top of the candidate string generation techniques discussed in the previous section .
The first scheme , which we call attribute - OOV , prevents models from overly relying on individual [ K_i ] tokens and helps them generalize to attributes that are not seen during training . Analogous to how out - of - vocabulary tokens are commonly handled ( Dyer et al . , 2015 , inter alia ) , every [ K_i ] token is stochastically replaced with the [ SEP ] token during training with probability p_drop . This encourages the model to encode semantics of the attributes not only in the [ K_i ] tokens , but also in the [ SEP ] token , which is used when unseen attributes are encountered during inference .
The second regularization scheme discourages the model from memorizing the order in which particular attributes occur . Under attribute - shuffle , every time an entity is encountered during training , its attribute - value pairs are randomly shuffled before the entity is converted to a string representation using the techniques from Section 4.1 .
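A minimal sketch of the two schemes ( names are ours ) , assuming attrs is a list of ( attribute , value ) pairs ; p_drop = 0.3 matches the value reported in the hyperparameter section .

import random

P_DROP = 0.3

def attribute_oov(attrs, known_attrs, p_drop=P_DROP):
    # stochastically replace each [K_i] token with [SEP] during training
    parts = []
    for k, v in attrs:
        use_sep = (k not in known_attrs) or (random.random() < p_drop)
        tok = "[SEP]" if use_sep else f"[K_{known_attrs[k]}]"
        parts.append(f"{tok} {v}")
    return " ".join(parts)

def attribute_shuffle(attrs):
    # randomly permute the attribute-value pairs each time an entity is encountered
    attrs = list(attrs)
    random.shuffle(attrs)
    return attrs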
Experiments and Discussion
Data
Our held - out test bed is the TAC - KBP 2010 data ( LDC2018T16 ) , which consists of documents from English newswire , discussion forum and web data ( Ji et al . , 2010 ) . The target KB ( KB_test ) is the TAC - KBP Reference KB , built from English Wikipedia articles and their associated infoboxes ( LDC2014T16 ) . Our primary training and validation data is the CoNLL - YAGO dataset ( Hoffart et al . , 2011 ) , linked against Wikidata ( referred to as CoNLL - Wikidata below ) . Table 2 describes the sizes of these various datasets along with the number of entities in their respective KBs .
While covering similar domains , Wikidata and the TAC - KBP Reference KB have different schemas . Wikidata is more structured and entities are associated with statements represented using attribute - value pairs , which are short snippets rather than full sentences . The TAC - KBP Reference KB contains both short snippets like these , along with the text of the Wikipedia article of the entity . The two KBs also differ in size , with Wikidata containing almost seven times the number of entities in TAC KBP .
Both during training and inference , we only retain the 100 most frequent attributes in the respective KBs . The attribute - separators ( Section 4.1 ) are created corresponding to the 100 most frequent attributes in the training KB . Candidates and mentions ( with context ) are represented using strings of 128 sub - word tokens each , across all models .
Hyperparameters
All BERT models are uncased BERT - base models with 12 layers , 768 hidden units , and 12 heads with default parameters , trained on English Wikipedia and the BookCorpus . The probability p_drop for attribute - OOV is set to 0.3 . Both candidate generation and re - ranking models are trained using the Adam optimizer used in BERT ( Kingma and Ba , 2015 ) , with a linear warmup over the first 10 % of the first epoch to a peak learning rate of 2 × 10^-5 and a linear decay from there until the learning rate approaches zero . Candidate generation models are trained for 200 epochs with a batch size of 256 . Re - ranking models are trained for 4 epochs with a batch size of 2 , and operate on the top 32 candidates returned by the generation model . Hyperparameters are chosen such that models can be run on a single NVIDIA V100 Tensor Core GPU with 32 GB RAM , and are not extensively tuned . All models have the same number of parameters except the ones with attribute - separators , which have 100 extra token embeddings ( of size 768 each ) .
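For concreteness , here is a sketch of the linear warmup / linear decay schedule described above , written with PyTorch 's LambdaLR ; it warms up over a fixed number of steps ( the paper warms up over 10 % of the first epoch ) , so how warmup_steps is computed is left to the caller .

from torch.optim.lr_scheduler import LambdaLR

def linear_warmup_then_decay(optimizer, total_steps, warmup_steps):
    # multiplicative factor relative to the peak learning rate set in the optimizer
    def lr_lambda(step):
        if step < warmup_steps:
            return step / max(1, warmup_steps)
        return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))
    return LambdaLR(optimizer, lr_lambda)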
Candidate generation Since the focus of our experiments is on re - ranking , we use a fixed candidate generation model for all experiments that combines the architecture of Wu et al . ( 2020 ) ( Section 3 ) with [ SEP]-separation to generate candidate strings . This model also has no knowledge of the test KB and is trained only once on the CoNLL - Wikidata dataset . It achieves a recall@32 of 91.25 when evaluated on the unseen TAC - KBP 2010 data .
Research Questions
We evaluate the re - ranking model ( Section 3 ) in several settings to answer a set of research questions about linking to unseen KBs . For all experiments , we report the mean and standard deviation of the accuracy across five runs with different random seeds .
Main results
Our primary experiments focus on the first two research questions and study the accuracy of the model that uses the re - ranking architecture from Section 3 with the three core components introduced in Section 4 , viz . attribute - separators to generate string representations of candidates , along with attribute - OOV and attribute - shuffle for regularization . We compare this against two baselines without these components that use the same architecture and use concatenation and [ SEP]-separation instead of attribute - separators . As a reminder , all models are trained as well as validated on CoNLL - Wikidata and evaluated on the completely unseen TAC - KBP 2010 test set .
Results confirm that adding structure to the candidate string representations via [ SEP ] tokens leads to more accurate models compared to generating strings by concatenation ( Table 3 ) . Using attribute - separators instead of [ SEP ] tokens leads to an absolute gain of over 5 % , and handling unseen attributes via attribute - OOV further increases the accuracy to 56.2 % , a 7.1 % increase over the [ SEP ] baseline . These results show that the attribute - separators capture meaningful information about attributes , even when only a small number of attributes from the training data ( 15 ) are observed during inference . Shuffling attribute - value pairs before converting them to a string representation using attribute - separators also independently provides an absolute gain of 3.5 % over the model which uses attribute - separators without shuffling . Overall , models that combine attribute - shuffle and attribute - OOV are the most accurate with an accuracy of 61.6 % , which represents a 12 % absolute gain over the best baseline model .
Prior work ( Raiman and Raiman , 2018;Cao et al . , 2018;Wu et al . , 2020;Févry et al . , 2020 ) reports higher accuracies on the TAC data but they are fundamentally incomparable with our numbers due to the simple fact that we are solving a different task with three key differences : ( 1 ) Models in prior work are trained and evaluated using mentions that link to the same KB . On the contrary , we show how far we can go without such in - KB training mentions .
( 2 ) The test KB used by these works is different from our test KB . Each entry in the KB used by prior work simply consists of the name of the entity with a textual description , while each entity in our KB is represented via multiple attribute - value pairs . ( 3 ) These models exploit the homogeneous nature of the KBs and usually pre - train models on millions of mentions from Wikipedia . This is beneficial when the training and test KBs are Wikipedia or similar , but is beyond the scope of this work , as we build models applicable to arbitrary databases .
Training on multiple unrelated datasets
An additional benefit of being able to link to multiple KBs is the ability to train on more than one dataset , each of which links to a different KB with a different schema . While prior work has been unable to do so due to its reliance on knowledge of KB_test , this ability is more crucial in the settings we investigate , as it allows us to stack independent datasets for training . This allows us to answer our third research question . Specifically , we compare the [ SEP]-separation baseline with our full model that uses attribute - separators , attribute - shuffle , and attribute - OOV , and ask whether training on an additional dataset improves accuracy ( Table 4 ) . The full model improves , but by a smaller margin ; in contrast , the baseline model observes a bigger increase in accuracy , from 49.1 % to 62.6 % . While the difference between the two models reduces , the full model remains more accurate . These results also show that the seamless stacking of multiple datasets allowed by our models is effective empirically .
Impact of schema - aware training data
Finally , we investigate to what extent the components introduced by us help in linking when there is training data available that links to the inference KB , KB_test . We hypothesize that while attribute - separators will still be useful , attribute - OOV and attribute - shuffle will be less useful , as there is a smaller gap between training and test scenarios , reducing the need for regularization .
For these experiments , models from Section 5.4 are further trained with increasing amounts of data from the TAC - KBP 2010 training set . A sample of 200 documents is held out from the training data as a validation set . The models are trained with the exact same configuration as the base models , except with a smaller constant learning rate of 2 × 10^-6 to avoid overfitting on the small amounts of data . Unsurprisingly , the accuracy of all models increases as the amount of TAC training data increases . Crucially , the model with only attribute - separators is the most accurate model across the spectrum . Moreover , the difference between this model and the baseline model sharply increases as the amount of schema - aware data decreases ( e.g. when using 13 annotated documents , i.e. 1 % of the training data , we get a 9 % boost in accuracy over the model that does not see any schema - aware data ) . These trends show that our models are not only useful in settings without any data from the target KB , but also in settings where limited data is available .
Qualitative Analysis
Beyond the quantitative evaluations above , we further qualitatively analyze the predictions of the best model from Table 3 to provide insights into our modeling decisions and suggest avenues for improvements .
Improvements over baseline
First , we categorize all newly correct mentions , i.e. mentions that are correctly linked by the top model but incorrectly linked by the [ SEP]-separation baseline , by the entity type of the gold entity . This type is one of person ( PER ) , organization ( ORG ) , geo - political entity ( GPE ) , and a catch - all unknown category ( UKN ) . This categorization reveals that the newly correct mentions represent about 15 % of the total mentions of the ORG , GPE , and UKN categories and as much as 25 % of the total mentions of the PER category . This distributed improvement highlights that the relatively higher accuracy of our model is due to a holistic improvement in modeling unseen KBs across all entity types .
Why does PER benefit more than other entity types ? To answer this , we count the fraction of mentions of each entity type that have at least one column represented using attribute separators . This counting reveals that approximately 56 - 58 % of mentions of type ORG , GPE , and UKN have at least one such column . On the other hand , this number is 71 % for PER mentions . This suggests that the difference is directly attributable to more PER entities having a column that has been modeled using attribute separators , further highlighting the benefits of this modeling decision .
Error Analysis
To identify the shortcomings of our best model , we categorize 100 random mentions that are incorrectly linked by this model into six categories ( demonstrated with examples in Table 6 ) , inspired by a taxonomy from prior work .
Under this taxonomy , a common error ( 33 % ) is predicting a more specific entity than that indicated by the mention ( the city of Hartford , Connecticut , rather than the state ) . The reverse is also observed ( i.e. the model predicts a more general entity ) , but far less frequently ( 6 % ) . Another major error category ( 33 % ) is when the model fails to pick up the correct signals from the context and assigns a similarly named entity of a similar type ( e.g. the river Mobile , instead of the city Mobile , both of which are locations ) . 21 % of the errors are cases where the model predicts an entity that is related to the gold entity , but is neither more specific , nor more generic , but rather of a different type ( Santos Football Club instead of the city of Santos ) .
Errors in the last category occur when the model predicts an entity whose name has no string overlap with that of the gold entity or the mention . This likely happens when the signals from the context override the signals from the mention itself .
Conclusion
The primary contribution of this work is a novel framework for entity linking against unseen target KBs with unknown schemas . To this end , we introduce methods to generalize existing models for zero - shot entity linking to link to unseen KBs . These methods rely on converting arbitrary entities represented using a set of attribute - value pairs into a string representation that can be then consumed by models from prior work .
There is still a significant gap between the models used in this work and schema - aware models that are trained on the same KB as the inference KB . One way to close this gap is by using automatic table - to - text generation techniques to convert arbitrary entities into fluent and adequate text ( Kukich , 1983 ; McKeown , 1985 ; Reiter and Dale , 1997 ; Wiseman et al . , 2017 ; Chisholm et al . , 2017 ) . Another promising direction is to move beyond BERT to other pre - trained representations that are known to better encode entity information ( Zhang et al . , 2019 ; Guu et al . , 2020 ; Poerner et al . , 2020 ) .
Finally , while the focus of this work is only on English entity linking , the challenges associated with this work naturally occur in multilingual settings as well . Just as we cannot expect labeled data for every target KB of interest , we also cannot expect labeled data for different KBs in different languages . In future work , we aim to investigate how we can port the solutions introduced here to multilingual settings as well as develop novel solutions for scenarios where the documents and the KB are in languages other than English ( Sil et al . , 2018 ; Upadhyay et al . , 2018 ; Botha et al . , 2020 ) .
Acknowledgements
The authors would like to thank colleagues from Amazon AI for many helpful discussions that shaped this work , and for reading and providing feedback on earlier drafts of the paper . They also thank all the anonymous reviewers for their helpful feedback .
Datasets . We conduct experiments on English - Macedonian ( En - Mk ) and English - Albanian ( En - Sq ) , as Mk , Sq are low - resource languages , where lexical - level alignment can be most beneficial . We use 3 K randomly sampled sentences of SETIMES ( Tiedemann , 2012 ) as validation / test sets . We also use 68 M En sentences from NewsCrawl . For Sq and Mk we use all the CommonCrawl corpora from Ortiz Suárez et al . ( 2019 ) , which are 4 M Sq and 2.4 M Mk sentences .
Baseline . We use a method that relies on cross - lingual language model pretraining , namely XLM ( Lample and Conneau , 2019 ) . This approach trains a bilingual MLM separately for En - Mk and En - Sq , which is used to initialize the encoder - decoder of the corresponding NMT system . Each system is then trained in an unsupervised way .
The scores presented are significantly different ( p < 0.05 ) from the respective baseline . CHRF1 refers to character n - gram F1 score ( Popović , 2015 ) . The models in italics are ours .
Table 1 shows the results of our approach compared to the two pretraining approaches . In the case of XLM , the effect of cross - lingual lexical alignment is more evident for En - Mk , as Mk is less similar to En , compared to Sq . This is mainly because the two languages use a different alphabet ( Latin for En and Cyrillic for Mk ) . This is also true for RE - LM when translating out of En , showing that enhancing the fine - tuning step of the MLM with pretrained embeddings is helpful and improves the final UNMT performance .
In Table 2 , we observe that lexical alignment is more beneficial for En - Mk . This can be explained by the limited vocabulary overlap of the two languages , which does not provide sufficient cross - lingual signal for the training of the MLM . By contrast , initializing an MLM with pretrained embeddings largely improves performance , even for a higher - performing model , such as RE - LM . In En - Sq , the effect of our approach is smaller yet consistent . This can be attributed to the fact that the two languages use the same script .
Overall , our method enhances the lexical - level information captured by pretrained MLMs , as shown empirically . This is consistent with our intuition that cross - lingual embeddings capture a bilingual signal that can benefit MLM representations . 1 - gram precision scores . To examine whether the improved translation performance is a result of the lexical - level information provided by static embeddings , we present 1 - gram precision scores in Table 3 , as they can be directly attributed to lexical alignment . The biggest performance gains ( up to +10.4 ) are obtained when the proposed approach is applied to XLM . This correlates with the BLEU scores of Table 1 . Moreover , the En - Mk language pair benefits more than En - Sq from the lexical - level alignment , both in terms of 1 - gram precision and BLEU . These results show that the improved BLEU scores can be attributed to the enhanced lexical representations . How should static embeddings be integrated in MLM training ? We explore different ways of incorporating the lexical knowledge of pretrained cross - lingual embeddings into the second , masked language modeling stage of our approach ( § 2.2 ) . Specifically , we keep the aligned embeddings fixed ( frozen ) during XLM training and compare the performance of the final UNMT model to the proposed ( fine - tuned ) method . We point out that , after we transfer the trained MLM to an encoder - decoder model , all layers are trained for UNMT .
We tie the embedding and output ( projection ) layers of both LM and NMT models ( Press and Wolf , 2017 ) . We use a dropout rate of 0.1 and GELU activations ( Hendrycks and Gimpel , 2017 ) . We use the default parameters of Lample and Conneau ( 2019 ) in order to train our models .
In this work , we focus on self - supervised , alignment - oriented training tasks using minimal parallel data to improve mBERT 's cross - lingual transferability . We propose a Post - Pretraining Alignment ( PPA ) method consisting of both word - level and sentence - level alignment , as well as a finetuning technique for downstream tasks that take pairs of text as input , such as NLI and Question Answering ( QA ) . Specifically , we use a slightly different version of TLM as our word - level alignment task and contrastive learning ( Hadsell et al . , 2006 ) on mBERT 's [ CLS ] tokens to align sentence - level representations . Both tasks are self - supervised and do not require pre - alignment tools such as FastAlign . Our sentence - level alignment is implemented using MoCo ( He et al . , 2020 ) , an instance discrimination - based method of contrastive learning that was recently proposed for self - supervised representation learning in computer vision . Lastly , when finetuning on NLI and QA tasks for non - English languages , we perform sentence - level code - switching with English as a form of both alignment and data augmentation . We conduct controlled experiments on XNLI and MLQA ( Lewis et al . , 2020 ) , leveraging varying amounts of parallel data during alignment . We then conduct an ablation study that shows the effectiveness of our method . On XNLI , our aligned mBERT improves over the original mBERT by 4.7 % for zero - shot transfer , and outperforms Cao et al . ( 2020 ) while using the same amount of parallel data from the same source . For translate - train , where a translation of the English training data is available in the target language , our model achieves comparable performance to XLM while using far fewer resources . On MLQA , we get a 2.3 % improvement over mBERT and outperform XLM - R Base for zero - shot transfer .
Concretely , MoCo employs a dual - encoder architecture . Given two views v_1 and v_2 of the same image , v_1 is encoded by the query encoder f_q and v_2 by the momentum encoder f_k . v_1 and v_2 form a positive pair . Negative examples are views of different source images , and are stored in a queue of size K , which is randomly initialized . K is usually a large number ( e.g. , K = 65,536 for ImageNet ) . Negative pairs are formed by comparing v_1 with each item in the queue . Similarity between pairs is measured by dot product . MoCo uses the InfoNCE loss ( van den Oord et al . , 2019 ) to bring positive pairs closer to each other and push negative pairs apart . After a batch of view pairs is processed , those encoded by the momentum encoder are added to the queue as negative examples for future queries . During training , the query encoder is updated by the optimizer , while the momentum encoder is updated as the exponential moving average of the query encoder 's parameters to maintain queue consistency : θ_k ← m θ_k + ( 1 − m ) θ_q , where m is the momentum coefficient .
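A minimal sketch of this MoCo - style step applied to sentence embeddings ( variable names are ours ) : dot - product similarities against a queue of negatives , the InfoNCE loss , and the exponential - moving - average update of the momentum encoder ; the temperature ( 0.05 ) and momentum ( 0.999 ) values are those reported in the implementation details below .

import torch
import torch.nn.functional as F

def info_nce_loss(q, k_pos, queue, temperature=0.05):
    # q, k_pos: (B, H) [CLS] embeddings of the two views of a positive pair
    # queue:    (K, H) embeddings of previously encoded (negative) examples
    l_pos = (q * k_pos).sum(dim=1, keepdim=True)        # (B, 1)
    l_neg = q @ queue.t()                               # (B, K)
    logits = torch.cat([l_pos, l_neg], dim=1) / temperature
    labels = torch.zeros(q.size(0), dtype=torch.long)   # the positive is always index 0
    return F.cross_entropy(logits, labels)

@torch.no_grad()
def momentum_update(query_encoder, momentum_encoder, m=0.999):
    # theta_k <- m * theta_k + (1 - m) * theta_q
    for p_q, p_k in zip(query_encoder.parameters(), momentum_encoder.parameters()):
        p_k.data.mul_(m).add_(p_q.data, alpha=1.0 - m)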
-DOCSTART- For both PPA and finetuning on downstream tasks , we use the AdamW optimizer with 0.01 weight decay and a linear learning rate scheduler . For PPA , we use a batch size of 128 , mBERT max sequence length 128 and learning rate warmup for the first 10 % of the total iterations , peaking at 0.00003 . The MoCo momentum is set to 0.999 , queue size 32000 and temperature 0.05 . Our PPA models are trained for 10 epochs , except for the 2 M setting where 5 epochs are trained . On XNLI , we use a batch size of 32 , mBERT max sequence length 128 and finetune the PPA model for 2 epochs . Learning rate peaks at 0.00005 and warmup is done to the first 1000 iterations . On MLQA , mBERT max sequence length is set to 386 and peak learning rate 0.00003 . The other parameters are the same as XNLI . Our experiments are run on a single 32 GB V100 GPU , except for PPA training that involves either MLM or TLM , where two such GPUs are used . We also use mixed - precision training to save on GPU memory and speed up experiments .
-DOCSTART- Unsupervised multiple - choice question generation for out - of - domain Q&A fine - tuning
Pre - trained models have shown very good performance on a number of question answering benchmarks , especially when fine - tuned on multiple question answering datasets at once . In this work , we propose an approach for generating a fine - tuning dataset using a rule - based algorithm that generates questions and answers from unannotated sentences . We show that the state - of - the - art model UnifiedQA can greatly benefit from such a system on a multiple - choice benchmark about physics , biology and chemistry that it has never been trained on . We further show that improved performance may be obtained by selecting the most challenging distractors ( wrong answers ) with a dedicated ranker based on a pretrained RoBERTa model .
Introduction
In the past years , deep learning models have greatly improved their performance on a large range of question answering tasks , especially using pretrained models such as BERT ( Devlin et al . , 2019 ) , RoBERTa ( Liu et al . , 2019 ) and T5 ( Raffel et al . , 2020 ) . More recently , these models have shown even better performance when fine - tuned on multiple question answering datasets at once . Such a model is UnifiedQA ( Khashabi et al . , 2020 ) , which , starting from a T5 model , is trained on a large number of question answering datasets including multiple - choice , yes / no , extractive and abstractive question answering . UnifiedQA is , at the time of writing , state - of - the - art on a large number of question answering datasets including multiple - choice datasets like OpenBookQA ( Mihaylov et al . , 2018 ) or ARC . However , even if UnifiedQA achieves good results on previously unseen datasets , it often fails to achieve optimal performance on these datasets until it is further finetuned on dedicated human - annotated data . This tendency is stronger when the target dataset deals with questions about a very specific domain .
One solution to this problem would be to finetune or retrain these models with additional human - annotated data . However , this is expensive both in time and resources . Instead , a lot of work has been done lately on automatically generating training data for fine - tuning , or even on training completely unsupervised models for question answering . One commonly used dataset for unsupervised question answering is the extractive dataset SQUAD ( Rajpurkar et al . , 2016 ) . Prior work proposed a question generation method for SQUAD using an unsupervised neural translation method . Fabbri et al . ( 2020 ) and subsequent work further improved unsupervised performance on SQUAD and showed that simple rule - based question generation could be as effective as the previously mentioned neural method . These approaches are rarely applied to multiple - choice question answering , in part due to the difficulty of selecting distractors . A few research papers however proposed distractor selection methods for multiple - choice questions using either supervised approaches ( Sakaguchi et al . , 2013 ; Liang et al . , 2018 ) or general purpose knowledge bases ( Ren and Q. Zhu , 2021 ) .
In this paper , we propose an unsupervised process to generate questions , answers and associated distractors in order to fine - tune and improve the performance of the state - of - the - art model UnifiedQA on unseen domains . This method , being unsupervised , needs no additional annotated domain - specific data , requiring only a set of unannotated sentences from the domain of interest from which the questions are created . Contrary to most of the aforementioned works , our aim is not to train a new completely unsupervised model but rather to incorporate new information into an existing state - of - the - art model and thus take advantage of the question - answering knowledge already learned .
We conduct our experiments on the SciQ dataset ( Welbl et al . , 2017 ) , which contains multiple - choice questions ( 4 choices ) featuring subjects centered around physics , biology and chemistry . An example of a question can be found in Figure 1 . We focus on the SciQ dataset because it has not yet been used for training UnifiedQA and it requires precise scientific knowledge . Furthermore , our experiments reveal that the direct application of UnifiedQA on the SciQ benchmark leads to a much lower performance than when fine - tuning it on the SciQ training set ( see Section 4 ) . Our objective in this work is to close this gap between UnifiedQA and UnifiedQA fine - tuned on supervised data with the unsupervised question generation approach described in Section 2 . We additionally test our method on two commonly used multiple - choice question answering datasets : CommonsenseQA ( Talmor et al . , 2019 ) and QASC . These datasets contain questions with similar domains to SciQ even though the questions are slightly less specific . Furthermore , neither of them has been used during the initial training of UnifiedQA .
Question Generation Method
We propose a method for generating multiple - choice questions in order to fine - tune and improve UnifiedQA . This process is based on 3 steps . First , a set of sentences is selected ( Section 2.1 ) , to which a generic question generation system is applied ( Section 2.2 ) . Then a number of distractors are added to each question ( Section 2.3 ) .
Sentence Selection
Our question generation method uses a set of unannotated sentences from which the questions will be generated . We compare three selection methods . First , we consider a scenario where the application developer does not manually collect any sentences , but simply gives the name ( or topic ) of the target domain . In our case , the topics are " Physics " , " Biology " and " Chemistry " since these are the main domains in SciQ . A simple information retrieval strategy is then applied to automatically mine sentences from Wikipedia . We first compute a list of Wikipedia categories by recursively visiting all subcategories starting from the target topic names . The maximum recursion depth is limited to 4 . We then extract the summary ( the head paragraph ) of each Wikipedia article matching the previously extracted categories and subcategories . We only keep articles with more than 800 average visitors per day over the last ten days ( as of April 27 , 2021 ) , resulting in 12,656 pages .
The two other selection methods extract sentences from SciQ itself and therefore are not entirely unsupervised but rather simulate a situation where we have access to unannotated texts that precisely describe the domains of interest such as a school book for example . The SciQ dataset includes a support paragraph for each question ( see Figure 1 ) . Pooled together , these support paragraphs provide us with a large dataset of texts about the domains of interest . We gather the paragraphs corresponding to all questions and split them into sentences to produce a large set of sentences that are no longer associated with any particular question but cover all the topics found in the questions . We compare two different setups . In the first one , we include all the sentences extracted from the train , validation and test sets thus simulating a perfect selection of sentences that cover all the knowledge expressed in the questions . Still , we only use the support paragraphs and not the annotated questions themselves . As compared to the classical supervised paradigm , this setting removes all annotation costs for the application developer , but it still requires to gather sentences that are deemed useful for the test set of interest . We then compare this setup with another one , where only the sentences from the train set are included . This scenario arguably meets more practical needs since it would suffice to gather sentences close to the domain of interest . The number of sentences for each dataset is presented in Table 1 .
Questions Generation
The generation of questions from a sentence relies on the jsRealB text realizer ( Lapalme , 2021 ) which generates an affirmative sentence from a constituent structure . It can also be parameterized to generate variations of the original sentence such as its negation , its passive form and different types of questions such as who , what , when , etc . The constituency structure of a sentence is most often created by a user or by a program from data . In this work , it is instead built from a Universal Dependency ( UD ) structure using a technique developed for SR'19 ( Lapalme , 2019 ) . The UD structure of a sentence is the result of a dependency parse with Stanza ( Qi et al . , 2020 ) . We thus have a pipeline composed of a neural dependency parser , followed by a program to create a constituency structure used as input for a text realizer , both in JavaScript . Used without modification , this would create a complex echo program for the original affirmative sentence , but by changing parameters , its output can vary .
In order to create questions from a single constituency structure , jsRealB uses the classical grammar transformations : for a who question , it removes the subject ( i.e. the first noun phrase before the verb phrase ) , for a what question , it removes the subject or the direct object ( i.e. the first noun phrase within the verb phrase ) ; for other types of questions ( when , where ) it removes the first prepositional phrase within the verb phrase . Depending on the preposition , the question will be a when or a where . Note that the removed part becomes the answer to the question .
In order to determine which questions are appropriate for a given sentence , we examine the dependency structure of the original sentence and check if it contains the required part to be removed before parameterizing the realization . The generated questions are then filtered to remove any question for which the answer is composed of a single stopword . Table 1 shows the number of questions generated for each dataset . An example of a synthetic question is shown in Figure 3 .
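As a rough sketch of this applicability check ( the actual transformation and realization are done in JavaScript by jsRealB ; the mapping below is a simplification of ours ) , one can parse a sentence with Stanza and use its dependency relations to decide which question types can be generated .

import stanza

# stanza.download("en") must have been run once beforehand
nlp = stanza.Pipeline("en", processors="tokenize,pos,lemma,depparse")

def applicable_question_types(sentence):
    words = nlp(sentence).sentences[0].words
    deprels = {w.deprel for w in words}
    types = []
    if "nsubj" in deprels:
        types.append("who/what (subject)")
    if "obj" in deprels:
        types.append("what (direct object)")
    if "obl" in deprels:
        types.append("when/where (prepositional phrase)")
    return types

print(applicable_question_types("Mitochondria produce energy in the cell."))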
Distractors Selection
Since SciQ is a multiple - choice dataset , we must add distractors to each question we generate , to match the format of SciQ . A simple solution to this problem is to select random distractors among answers to other similar questions generated from the dataset of sentences we gathered . Obviously , selecting random distractors may lead to a fine - tuning dataset that is too easy to solve . Therefore , we propose another strategy that selects hard distractors for each question . To do so , starting from our synthetic dataset with random distractors , we finetune RoBERTa ( Liu et al . , 2019 ) using the standard method of training for multiple - choice question answering . Each question / choice pair is fed to RoBERTa and the embedding corresponding to the first token ( " [ CLS ] " ) is given to a linear layer to produce a single scalar score for each choice . The scores corresponding to every choice for a given question are then compared to each other by a softmax and a cross - entropy loss . With this method , RoBERTa is trained to score a possible answer for a given question , based on whether or not it is a credible answer to that question . For each question , we then randomly select a number of candidate distractors from the answers to other questions and use our trained RoBERTa to score each of these candidates . The 3 candidates with the highest scores ( and thus the most credible answers ) are selected . The idea is that during this first training , RoBERTa will learn a large amount of simplistic logic . For example , because of the initial random selection of distractors , it is highly unlikely that even one of the distractors will be close enough to the question 's semantic field . Furthermore , a lot of distractors are grammatically incorrect ( e.g. , a distractor might be plural when the question expects a singular ) . Therefore , in this initial training , RoBERTa might learn to isolate the answer with a matching semantic field or the one with correct grammar . The re - selection then minimizes the number of trivial distractors , and models trained on this new refined dataset will have to focus on deeper and more meaningful relations between the questions and the answers . The process is illustrated in Figure 4 , and an example of refined distractors can be found in Figure 3 .
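The following is a sketch of that re - selection step , assuming the Hugging Face transformers library ; the checkpoint path and all names are placeholders , and the scorer is assumed to have been finetuned on the random - distractor dataset as described above .

import torch
from transformers import RobertaForMultipleChoice, RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("roberta-large")
scorer = RobertaForMultipleChoice.from_pretrained("path/to/finetuned-roberta")  # placeholder path

def refine_distractors(question, candidate_pool, n_keep=3):
    # score every candidate answer in the pool against the question
    enc = tokenizer([question] * len(candidate_pool), candidate_pool,
                    padding=True, truncation=True, return_tensors="pt")
    # shape the pool as a single multiple-choice instance: (1, num_choices, seq_len)
    batch = {k: v.unsqueeze(0) for k, v in enc.items()}
    with torch.no_grad():
        logits = scorer(**batch).logits.squeeze(0)    # one score per candidate
    top = torch.topk(logits, n_keep).indices.tolist()
    return [candidate_pool[i] for i in top]           # the most credible distractors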
The number of scored candidate distractors is a hyper - parameter . A small number of candidates may result in a situation where none of the candidates are credible enough , while a large number requires more computation time , since the score of each candidate for every question needs to be computed , and carries a higher risk of proposing multiple valid answers . In our experiments , we use 64 candidates in order to limit computation time .
Training and Implementation Details
To refine distractors , we use the " Large " version of RoBERTa , and all models are trained for 4 epochs with a learning rate of 1 × 10^-5 . These hyperparameters are chosen based on previous experiments with RoBERTa on other multiple - choice datasets . The final UnifiedQA fine - tuning is done using the same multiple - choice question answering setup as the one used in the original UnifiedQA paper ( Khashabi et al . , 2020 ) . We use the " Large " version of UnifiedQA , and all the models are trained for 4 epochs using Adafactor and a learning rate of 1 × 10^-5 . The learning rate is loosely tuned to get the best performance on the validation set during the supervised training of UnifiedQA . We use the Hugging Face pytorch - transformers ( Wolf et al . , 2020 ) library for model implementation . Experiments presented in this paper were carried out using the Grid'5000 testbed ( Balouek et al . , 2013 ) , supported by a scientific interest group hosted by Inria and including CNRS , RENATER and several Universities as well as other organizations ( see https://www.grid5000.fr ) .
Results
Accuracy results in Table 2 have a 95 % Wald confidence interval of ±2.8 % . The first row of Table 2 presents the accuracy results of a vanilla UnifiedQA large model on SciQ . The second line shows the accuracy when UnifiedQA is fine - tuned over the full training corpus . Our objective is thus to get as close as possible to this accuracy score using only unsupervised methods . The results using Wikipedia are the only ones that are unsupervised and therefore are the ones directly comparable to UnifiedQA with no fine - tuning or other unsupervised methods . Table 2 : Accuracy on SciQ by UnifiedQA fine - tuned on our synthetic datasets . " SciQ data " refers to the questions generated using the support paragraphs in SciQ while " Wikipedia data " refers to questions generated using sentences harvested from Wikipedia . All scores are averaged over 3 independent runs ( including the complete question generation process and the final UnifiedQA fine - tuning ) .
Fine - tuning UnifiedQA on synthetic questions with random distractors improves the results as compared to the baseline and , as expected , the closer the unlabeled sentences are to the topics of the questions , the better the accuracy . Hence , generating questions from only the train set of SciQ gives performance that is comparable to , but slightly lower than , that obtained from the combined train , dev and test sets of SciQ . Finally , questions selected from Wikipedia also improve the results , despite being loosely related to the target test corpus . Our distractor selection method further boosts the accuracy results in all setups . This suggests that a careful selection of distractors is important , and that the hard selection criterion used here seems adequate in our context .
The results for CommonsenseQA and QASC using the same selection of sentences from Wikipedia are reported in Table 3 . Overall , we obtain results similar to SciQ , with a large improvement in performance when generating questions and a further boost with refined distractors . However , compared to SciQ , the improvement brought by the distractor refining process is less significant . This could be partly explained by the fact that the distractors in the original QASC and CommonsenseQA datasets are overall easier , and therefore it is less advantageous for a model to be trained on harder questions .
Conclusion
In this work , we proposed a multiple - choice question generation method that can be used to fine - tune the state - of - the - art UnifiedQA model and improve its performance on an unseen and out of domain dataset . Our contributions are :
• We have shown that simple unsupervised methods could be used to finetune existing multipurpose question answering models ( in our case UnifiedQA ) to new datasets or domains .
• We propose a novel distractor refining method able to select harder distractors for a given generated question and show its superiority compared to a random selection .
Future work includes comparing our method to other question generation methods ( including supervised methods such as Puri et al . ( 2020 ) ) in order to assess the effect of both the generation method and the question quality on the final performance of our models . We will also compare different variations of our question generation and distractor refining methods in order to more thoroughly understand the effect of hyper - parameters such as the number of candidate distractors .
-DOCSTART- Global Entity Disambiguation with BERT
We propose a global entity disambiguation ( ED ) model based on BERT ( Devlin et al . , 2019 ) . To capture global contextual information for ED , our model treats not only words but also entities as input tokens , and solves the task by sequentially resolving mentions to their referent entities and using resolved entities as inputs at each step . We train the model using a large entity - annotated corpus obtained from Wikipedia . We achieve new state - of - the - art results on five standard ED datasets : AIDA - CoNLL , MSNBC , AQUAINT , ACE2004 , and WNED - WIKI . The source code and model checkpoint are available at https://github.com/studio-ousia/luke .
Introduction
Entity disambiguation ( ED ) refers to the task of assigning mentions in a document to corresponding entities in a knowledge base ( KB ) . This task is challenging because of the ambiguity between mentions ( e.g. , " World Cup " ) and the entities they refer to ( e.g. , FIFA World Cup or Rugby World Cup ) . ED models typically rely on local contextual information based on words that co - occur with the mention and global contextual information based on the entity - based coherence of the disambiguation decisions . A key to improve the performance of ED is to effectively combine both local and global contextual information ( Ganea and Hofmann , 2017;Le and Titov , 2018 ) .
In this study , we propose a global ED model based on BERT ( Devlin et al . , 2019 ) . Our model treats words and entities in the document as input tokens , and is trained by predicting randomly masked entities in a large entity - annotated corpus obtained from Wikipedia . This training enables the model to learn how to disambiguate masked entities based on words and non - masked entities . At inference time , our model disambiguates mentions sequentially using words and already resolved entities ( see Figure 1 ) . This sequential inference effectively accumulates the global contextual information and enhances the coherence of disambiguation decisions . We conducted extensive experiments using six standard ED datasets , i.e. , AIDA - CoNLL , MSNBC , AQUAINT , ACE2004 , WNED - WIKI , and WNED - CWEB . As a result , the global contextual information consistently improved the performance . Furthermore , we achieved new state of the art on all datasets except WNED - CWEB . The source code and model checkpoint are available at https://github.com/studio-ousia/luke .
Related Work
Transformer - based ED . Several recent studies have proposed ED models based on Transformer ( Vaswani et al . , 2017 ) trained with a large entity - annotated corpus obtained from Wikipedia ( Broscheit , 2019 ; Ling et al . , 2020 ; Cao et al . , 2021 ; Barba et al . , 2022 ) . Broscheit ( 2019 ) trained an ED model based on BERT by classifying each word in the document to the corresponding entity . Similarly , related work addressed ED using BERT by classifying mention spans to the corresponding entities . Other work used BART ( Lewis et al . , 2020 ) to generate referent entity titles of target mentions in an autoregressive manner . Barba et al . ( 2022 ) formulated ED as a text extraction problem ; they fed the document and candidate entity titles to BART and Longformer ( Beltagy et al . , 2020 ) and disambiguated a mention in the document by extracting the referent entity title of the mention . However , unlike our model , these models addressed the task based only on local contextual information .
Treating entities as inputs of Transformer . Recent studies ( Yamada et al . , 2020 ; Sun et al . , 2020 ) have proposed Transformer - based models that treat entities as input tokens to enrich their expressiveness using additional information contained in the entity embeddings . However , these models were designed to solve general NLP tasks and were not tested on ED . We treat entities as input tokens to capture the global context that is shown to be highly effective for ED .
ED as sequential decision task . Past studies ( Fang et al . , 2019 ) have solved ED by casting it as a sequential decision task to capture global contextual information . We adopt a similar method with an enhanced Transformer architecture , a training task , and an inference method to implement the global ED model based on BERT .
Model
Given a document with N mentions , each of which has K entity candidates , our model solves ED by selecting a correct referent entity from the entity candidates for each mention .
Model Architecture
Our model is based on BERT and takes words and entities ( Wikipedia entities or the [ MASK ] entity ) as input tokens .
The input representation of a word or an entity is constructed by summing the token , token type , and position embeddings ( see Figure 2 ):
Token embedding is the embedding of the corresponding token . The matrices of the word and entity token embeddings are represented as A ∈ R^{V_w × H} and B ∈ R^{V_e × H} , respectively , where H is the size of the hidden states of BERT , and V_w and V_e are the numbers of items in the word vocabulary and the entity vocabulary , respectively .
Token type embedding represents the type of the token , namely word ( C_word ) or entity ( C_entity ) .
Position embedding represents the position of the token in the word sequence . A word and an entity appearing at the i - th position in the sequence are represented as D_i and E_i , respectively . If an entity mention contains multiple words , its position embedding is computed by averaging the embeddings of the corresponding positions ( see Figure 2 ) . Following Devlin et al . ( 2019 ) , we tokenize the document text using BERT 's WordPiece tokenizer , and insert [ CLS ] and [ SEP ] tokens as the first and last words , respectively .
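A minimal sketch of this input construction ( sizes and names are illustrative ; the paper uses separate position embedding matrices D and E for words and entities , which are collapsed into one table here for brevity ) .

import torch
from torch import nn

H, V_w, V_e, MAX_POS = 768, 30000, 128040, 512

word_emb = nn.Embedding(V_w, H)     # A: word token embeddings
entity_emb = nn.Embedding(V_e, H)   # B: entity token embeddings
type_emb = nn.Embedding(2, H)       # C_word (0) and C_entity (1)
pos_emb = nn.Embedding(MAX_POS, H)  # position embeddings (shared here for brevity)

def entity_input(entity_id, positions):
    tok = entity_emb(torch.tensor(entity_id))
    typ = type_emb(torch.tensor(1))                     # entity token type
    pos = pos_emb(torch.tensor(positions)).mean(dim=0)  # average over the mention's positions
    return tok + typ + pos                              # summed input representation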
Training Task
Similar to the masked language model ( MLM ) objective adopted in BERT , our model is trained by predicting randomly masked entities . Specifically , we randomly replace some percentage of the entities with special [ MASK ] entity tokens and then train the model to predict the masked entities .
We adopt a model equivalent to the one used to predict words in MLM . Formally , we predict the original entity corresponding to a masked entity by applying a softmax over all entities in the entity vocabulary .
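The equation itself is not reproduced above , so the following is a hedged sketch of such a prediction head : as in BERT 's MLM head , the hidden state of a [ MASK ] entity is transformed and scored against the entity embedding matrix B plus a bias b_o , followed by a softmax over the entity vocabulary ; the exact transform is our assumption .

import torch
from torch import nn

class EntityPredictionHead(nn.Module):
    def __init__(self, hidden_size, entity_emb):
        super().__init__()
        # MLM-style transform of the hidden state (assumed, mirroring BERT's head)
        self.transform = nn.Sequential(
            nn.Linear(hidden_size, hidden_size), nn.GELU(), nn.LayerNorm(hidden_size)
        )
        self.decoder = nn.Linear(hidden_size, entity_emb.num_embeddings)
        self.decoder.weight = entity_emb.weight   # tie with the entity embedding matrix B
        # self.decoder.bias plays the role of b_o

    def forward(self, hidden_states):             # (num_masked, H)
        return torch.softmax(self.decoder(self.transform(hidden_states)), dim=-1)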
ED Model
Local ED Model . Our local ED model takes as input the words and N [ MASK ] tokens corresponding to the mentions in the document . The model then computes the embedding m'_e ∈ R^H for each [ MASK ] token using Eq . ( 2 ) and predicts the entity using a softmax over the K entity candidates :
ŷ_ED = softmax( B* m'_e + b*_o ) ,
where B* ∈ R^{K × H} and b*_o ∈ R^K consist of the entity token embeddings and the biases corresponding to the entity candidates , respectively . Note that B* and b*_o are subsets of B and b_o , respectively . Global ED Model . Our global ED model resolves mentions sequentially over N steps ( see Algorithm 1 ) . First , the model initializes the entity of each mention using the [ MASK ] token . Then , at each step , it predicts an entity for each remaining [ MASK ] token , selects the prediction with the highest probability produced by the softmax function in Eq . ( 3 ) , and resolves the corresponding mention by assigning the predicted entity to it . This model is denoted as confidence - order . We also test a model that resolves mentions in their order of appearance in the document , denoted natural - order .
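A sketch of the confidence - order inference loop ( Algorithm 1 ) ; model is assumed to return , for every still - unresolved mention , a probability distribution over its entity candidates given the words and the entities resolved so far . All names are ours .

import torch

def confidence_order_inference(model, document, mentions, candidates):
    resolved = {}                                    # mention index -> predicted entity
    unresolved = set(range(len(mentions)))
    while unresolved:
        # re-score all unresolved mentions, conditioning on already resolved entities
        probs = model(document, mentions, resolved)  # {i: tensor of shape (K_i,)}
        best_i, best_j, best_p = None, None, -1.0
        for i in unresolved:
            p, j = torch.max(probs[i], dim=0)
            if p.item() > best_p:
                best_i, best_j, best_p = i, j.item(), p.item()
        resolved[best_i] = candidates[best_i][best_j]  # commit the most confident prediction
        unresolved.remove(best_i)
    return resolved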
Modeling Details
Our model is based on BERT LARGE ( Devlin et al . , 2019 ) . The parameters shared with BERT are initialized using BERT , and the other parameters are initialized randomly . We treat the hyperlinks in Wikipedia as entity annotations and randomly mask 30 % of all entities . We train the model by maximizing the log likelihood of entity predictions . Further details are described in Appendix A.
Experiments
Our experimental setup follows Le and Titov ( 2018 ) . In particular , we test the proposed ED models using six standard datasets : AIDA - CoNLL ( CoNLL ) ( Hoffart et al . , 2011 ) , MSNBC , AQUAINT , ACE2004 , WNED - CWEB ( CWEB ) , and WNED - WIKI ( WIKI ) ( Guo and Barbosa , 2018 ) . We consider only the mentions that refer to valid entities in Wikipedia . For all datasets , we use the KB+YAGO entity candidates and their associated p̂(e|m) ( Ganea and Hofmann , 2017 ) , and use the top 30 candidates based on p̂(e|m) .
For the CoNLL dataset , we also test the performance using PPRforNED entity candidates ( Pershina et al . , 2015 ) . We report the in - KB accuracy for the CoNLL dataset and the micro F1 score ( averaged per mention ) for the other datasets . Further details of the datasets are provided in Appendix C. Furthermore , we optionally fine - tune the model by maximizing the log likelihood of the ED predictions ( ŷ_ED ) using the training set of the CoNLL dataset with the KB+YAGO candidates . We mask 90 % of the mentions and fix the entity token embeddings ( B and B* ) and the biases ( b_o and b*_o ) . The model is trained for two epochs using AdamW. Additional details are provided in Appendix B. Our global models consistently perform better than the local model , demonstrating the effectiveness of using global contextual information even when local contextual information is captured using an expressive BERT model . Moreover , the confidence - order model performs better than the natural - order model on most datasets . An analysis investigating why the confidence - order model outperforms the natural - order model is provided in the next section .
Results
The fine - tuning on the CoNLL dataset significantly improves the performance on this dataset ( Table 1 ) . However , it generally degrades the performance on the other datasets ( Table 2 ) . This suggests that Wikipedia entity annotations are more suitable than the CoNLL dataset for training general - purpose ED models .
Additionally , our models perform worse than Yang et al . ( 2018 ) on the CWEB dataset . This is because documents in this dataset are significantly longer than in the other datasets , i.e. , approximately 1,700 words per document on average , which is more than three times the 512 - word limit that can be handled by BERT - based models , including ours . Yang et al . ( 2018 ) achieved excellent performance on this dataset because their model uses various hand - engineered features capturing document - level contextual information .
Analysis
To investigate how global contextual information helps our model improve performance , we manually analyze the differences between the predictions of the local , natural - order , and confidence - order models . We use the model fine - tuned on the CoNLL dataset with the YAGO+KB candidates . Although all models perform well on most mentions , the local model often fails to resolve mentions of common names referring to specific entities ( e.g. , " New York " referring to New York Knicks ) . The global models generally resolve such difficult cases better because of the presence of strong global contextual information ( e.g. , mentions referring to basketball teams ) .
Furthermore , we find that the confidence - order model works especially well for mentions that require a highly detailed context to resolve . For example , a mention of " Matthew Burke " can refer to two different former Australian rugby players . Although the local and natural - order models incorrectly resolve this mention to the player who has the larger number of occurrences in our Wikipedia - based corpus , the confidence - order model successfully resolves this by disambiguating its contextual mentions , including his teammates , in advance . We provide the detailed inference sequence of the corresponding document in Appendix D.
Performance for Rare Entities
We examine whether our model learns effective embeddings for rare entities using the CoNLL dataset . Following Ganea and Hofmann ( 2017 ) , we use the mentions whose entity candidates contain their gold entities , and measure the performance by dividing the mentions based on the frequency of their entities in the Wikipedia annotations used to train the embeddings .
As presented in Table 3 , our models achieve enhanced performance for rare entities . Furthermore , the global models consistently outperform the local model both for rare and frequent entities .
Conclusion and Future Work
We propose a new global ED model based on BERT .
Our extensive experiments on a wide range of ED datasets demonstrate its effectiveness .
One limitation of our model is that , similar to existing ED models , it cannot handle entities that are not included in the vocabulary . In future work , we will investigate methods to compute the embeddings of such entities using post - hoc training with an extended vocabulary ( Tai et al . , 2020 ) .
Appendix for " Global Entity Disambiguation with BERT " A Details of Proposed Model
As the input corpus for training our model , we use the December 2018 version of Wikipedia , comprising approximately 3.5 billion words and 11 million entity annotations . We generate input sequences by splitting the content of each page into sequences comprising ≤ 512 words and their entity annotations ( i.e. , hyperlinks ) . The input text is tokenized using BERT 's tokenizer with its vocabulary consisting of V_w = 30,000 words . Similar to Ganea and Hofmann ( 2017 ) , we create an entity vocabulary consisting of V_e = 128,040 entities , which are contained in the entity candidates in the datasets used in our experiments .
Our model consists of approximately 440 million parameters . To reduce the training time , the parameters that are shared with BERT are initialized using BERT . The other parameters are initialized randomly . The model is trained via iterations over Wikipedia pages in a random order for seven epochs . To stabilize the training , we update only those parameters that are randomly initialized ( i.e. , we fix the parameters initialized using BERT ) in the first epoch , and update all parameters in the remaining six epochs . We implement the model using PyTorch ( Paszke et al . , 2019 ) and Hugging Face Transformers ( Wolf et al . , 2020 ) , and the training takes approximately ten days using eight Tesla V100 GPUs . We optimize the model using AdamW . The hyper - parameters used in the training are detailed in Table 4 .
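The two-stage schedule described above (BERT-initialized parameters frozen during the first epoch, all parameters trained afterwards with AdamW) can be sketched as follows; model, bert_param_names, and the inner loop are hypothetical stand-ins, not the actual training code:

# A sketch of the freeze-then-unfreeze schedule with AdamW. Parameters whose
# names appear in `bert_param_names` are assumed to be the BERT-initialized ones.
import torch

def set_bert_params_trainable(model, bert_param_names, trainable: bool):
    for name, param in model.named_parameters():
        if name in bert_param_names:
            param.requires_grad = trainable

def train(model, bert_param_names, data_loader, num_epochs=7, lr=1e-4):
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    for epoch in range(num_epochs):
        # Epoch 0: only the randomly initialized parameters are updated.
        set_bert_params_trainable(model, bert_param_names, trainable=(epoch > 0))
        for batch in data_loader:
            loss = model(**batch)      # assumed to return the training loss
            loss.backward()
            optimizer.step()           # frozen parameters have no gradients and are skipped
            optimizer.zero_grad()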
B Details of Fine - tuning on CoNLL Dataset
The hyper - parameters used in the fine - tuning on the CoNLL dataset are detailed in Table 5 . We select these hyper - parameters from the search space described in Devlin et al . ( 2019 ) based on the accuracy on the development set of the CoNLL dataset .
A document is split if it is longer than 512 words , which is the maximum word length of the BERT model .
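A minimal sketch of this splitting step, assuming the document is already a list of words; in the real preprocessing the entity annotations would also have to be re-indexed per chunk:

# Split a document into chunks of at most 512 words, the maximum sequence
# length handled by the BERT model.
def split_document(words, max_len=512):
    return [words[i:i + max_len] for i in range(0, len(words), max_len)]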
C Details of ED Datasets
The statistics of the ED datasets used in our experiments are provided in Table 6 .
D Example of Inference by Confidence - order Model
Figure 3 shows an example of the inference performed by our confidence - order model fine - tuned on the CoNLL dataset . The document is obtained from the test set of the CoNLL dataset . As shown in the figure , the model starts with unambiguous player names to recognize the topic of the document , and subsequently resolves the mentions that are challenging to resolve . Notably , the model correctly resolves the mention " Nigel Walker " to the corresponding former rugby player instead of a football player , and the mention " Matthew Burke " to the correct former Australian rugby player born in 1973 instead of the former Australian rugby player born in 1964 . This is accomplished by resolving other contextual mentions , including their colleague players , in advance . These two mentions are denoted in red in the figure . Note that our local model fails to resolve both mentions , and our natural - order model fails to resolve " Matthew Burke . "
-DOCSTART- A Matter of Framing : The Impact of Linguistic Formalism on Probing Results
Deep pre - trained contextualized encoders like BERT ( Devlin et al . , 2019 ) demonstrate remarkable performance on a range of downstream tasks . A recent line of research in probing investigates the linguistic knowledge implicitly learned by these models during pretraining . While most work in probing operates on the task level , linguistic tasks are rarely uniform and can be represented in a variety of formalisms . Any linguistics - based probing study thereby inevitably commits to the formalism used to annotate the underlying data . Can the choice of formalism affect probing results ? To investigate , we conduct an in - depth cross - formalism layer probing study in role semantics . We find linguistically meaningful differences in the encoding of semantic role - and proto - role information by BERT depending on the formalism and demonstrate that layer probing can detect subtle differences between the implementations of the same linguistic formalism . Our results suggest that linguistic formalism is an important dimension in probing studies and should be investigated along with the commonly used cross - task and cross - lingual experimental settings .
Introduction
The emergence of deep pre - trained contextualized encoders has had a major impact on the field of natural language processing . Boosted by the availability of general - purpose frameworks like AllenNLP and Transformers ( Wolf et al . , 2019 ) , pre - trained models like ELMO and BERT ( Devlin et al . , 2019 ) have caused a shift towards simple architectures where a strong pre - trained encoder is paired with a shallow downstream model , often outperforming the intricate task - specific architectures of the past .
The versatility of pre - trained representations implies that they encode some aspects of general
Figure 1 : Intra - sentence similarity by layer L of the multilingual BERT - base , shown for L = 0 , L = 8 , and L = 11 . Functional tokens are similar at L = 0 ; syntactic groups emerge at higher layers .
linguistic knowledge ( Reif et al . , 2019 ) . Indeed , even an informal inspection of layer - wise intra - sentence similarities ( Fig . 1 ) suggests that these models capture elements of linguistic structure , and that those differ depending on the layer of the model . A grounded investigation of these regularities allows us to interpret the model 's behaviour , design better pre - trained encoders and inform the downstream model development . Such investigation is the main subject of probing , and recent studies confirm that BERT implicitly captures many aspects of language use , lexical semantics and grammar ( Rogers et al . , 2020 ) .
Most probing studies use linguistics as a theoretical scaffolding and operate on a task level . However , there often exist multiple ways to represent the same linguistic phenomenon : for example , English dependency syntax can be encoded using a variety of formalisms , incl . Universal ( Schuster and Manning , 2016 ) , Stanford ( de Marneffe and Manning , 2008 ) and CoNLL-2009 dependencies ( Hajič et al . , 2009 ) , all using different label sets and syntactic head attachment rules . Any probing study inevitably commits to the specific theoretical framework used to produce the underlying data . The differences between linguistic formalisms , however , can be substantial .
Can these differences affect the probing results ? This question is intriguing for several reasons . Linguistic formalisms are well - documented , and if the choice of formalism indeed has an effect on probing , cross - formalism comparison will yield new insights into the linguistic knowledge obtained by contextualized encoders during pre - training . If , alternatively , the probing results remain stable despite substantial differences between formalisms , this prompts a further scrutiny of what the pre - trained encoders in fact encode . Finally , on the reverse side , cross - formalism probing might be used as a tool to empirically compare the formalisms and their language - specific implementations . To the best of our knowledge we are the first to explicitly address the influence of formalism on probing .
Ideally , the task chosen for a cross - formalism study should be encoded in multiple formalisms using the same textual data to rule out the influence of the domain and text type . While many linguistic corpora contain several layers of linguistic information , having the same textual data annotated with multiple formalisms for the same task is rare . We focus on role semantics , a family of shallow semantic formalisms at the interface between syntax and propositional semantics that assign roles to the participants of natural language utterances , determining who did what to whom , where , when etc . Decades of research in theoretical linguistics have produced a range of role - semantic frameworks that have been operationalized in NLP : syntax - driven PropBank ( Palmer et al . , 2005 ) , coarse - grained VerbNet ( Kipper - Schuler , 2005 ) , fine - grained FrameNet ( Baker et al . , 1998 ) , and , recently , decompositional Semantic Proto - Roles ( SPR ) ( Reisinger et al . , 2015;White et al . , 2016 ) . The SemLink project ( Bonial et al . , 2013 ) offers parallel annotation for PropBank , VerbNet and FrameNet for English . This allows us to isolate the object of our study : apart from the role - semantic labels , the underlying data and conditions for the three formalisms are identical . SR3DE ( Mújdricza - Maydt et al . , 2016 ) provides compatible annotation in three formalisms for German , enabling cross - lingual validation of our results . Combined , these factors make role semantics an ideal target for our cross - formalism probing study .
A solid body of evidence suggests that encoders like BERT capture syntactic and lexical - semantic properties , but only few studies have considered probing for predicate - level semantics ( Tenney et al . , 2019b;Kovaleva et al . , 2019 ) . To the best of our knowledge we are the first to conduct a cross - formalism probing study on role semantics , thereby contributing to the line of research on how and whether pre - trained BERT encodes higher - level semantic phenomena .
Contributions . This work studies the effect of the linguistic formalism on probing results . We conduct cross - formalism experiments on PropBank , VerbNet and FrameNet role prediction in English and German , and show that the formalism can affect probing results in a linguistically meaningful way ; in addition , we demonstrate that layer probing can detect subtle differences between implementations of the same formalism in different languages . On the technical side , we advance the recently introduced edge and layer probing framework ( Tenney et al . , 2019b ) ; in particular , we introduce anchor tasks , an analytical tool inspired by feature - based systems that allows deeper qualitative insights into the pre - trained models ' behaviour . Finally , advancing the current knowledge about the encoding of predicate semantics in BERT , we perform a fine - grained semantic proto - role probing study and demonstrate that semantic proto - role properties can be extracted from pre - trained BERT , contrary to the existing reports . Our results suggest that along with task and language , linguistic formalism is an important dimension to be accounted for in probing research .
Related Work
BERT as Encoder
BERT is a Transformer ( Vaswani et al . , 2017 ) encoder pre - trained by jointly optimizing two unsupervised objectives : masked language model and next sentence prediction . It uses WordPiece ( WP , Wu et al . ( 2016 ) ) subword tokens along with positional embeddings as input , and gradually constructs sentence representations by applying token - level self - attention pooling over a stack of layers L. The result of BERT encoding is a layer - wise representation of the input wordpiece tokens with higher layers representing higher - level abstractions over the input sequence . Thanks to the joint pre - training objective , BERT can encode words and sentences in a unified fashion : the encoding of a sentence or a sentence pair is stored in a special token [ CLS ] .
To facilitate multilingual experiments , we use the multilingual BERT - base ( mBERT ) published by Devlin et al . ( 2019 ) . Although several recent encoders have outperformed BERT on benchmarks ( Lan et al . , 2019;Raffel et al . , 2019 ) , we use the original BERT architecture , since it allows us to inherit the probing methodology and to build upon the related findings .
Probing
Due to space limitations we omit high - level discussions on benchmarking ( Wang et al . , 2018 ) and sentence - level probing ( Conneau et al . , 2018a ) , and focus on the recent findings related to the representation of linguistic structure in BERT . Surface - level information generally tends to be represented in the lower layers of deep encoders , while higher layers store hierarchical and semantic information ( Belinkov et al . , 2017;Lin et al . , 2019 ) . Tenney et al . ( 2019a ) show that the abstraction strategy applied by the English pre - trained BERT encoder follows the order of the classical NLP pipeline . Strengthening the claim about linguistic capabilities of BERT , Hewitt and Manning ( 2019 ) demonstrate that BERT implicitly learns syntax , and Reif et al . ( 2019 ) show that it encodes fine - grained lexical - semantic distinctions . Rogers et al . ( 2020 ) provide a comprehensive overview of BERT 's properties discovered to date .
While recent results indicate that BERT successfully represents lexical - semantic and grammatical information , the evidence of its high - level semantic capabilities is inconclusive . Tenney et al . ( 2019a ) show that the English PropBank semantics can be extracted from the encoder and follows syntax in the layer structure . However , out of all formalisms PropBank is most closely tied to syntax , and the results on proto - role and relation probing do not follow the same pattern . Kovaleva et al . ( 2019 ) identify two attention heads in BERT responsible for FrameNet relations . However , they find that disabling them in a fine - tuning evaluation on the GLUE ( Wang et al . , 2018 ) benchmark does not result in decreased performance .
Although we are not aware of any systematic studies dedicated to the effect of formalism on probing results , the evidence of such effects is scattered across the related work : for example , the aforementioned results in Tenney et al . ( 2019a ) show a difference in layer utilization between constituents - and dependency - based syntactic probes and semantic role and proto - role probes . It is not clear whether this effect is due to the differences in the underlying datasets and task architecture or the formalism per se .
Our probing methodology builds upon the edge and layer probing framework . The encoding produced by a frozen BERT model can be seen as a layer - wise snapshot that reflects how the model has constructed the high - level abstractions . Tenney et al . ( 2019b ) introduce the edge probing task design : a simple classifier is tasked with predicting a linguistic property given a pair of spans encoded using a frozen pre - trained model . Tenney et al . ( 2019a ) use edge probing to analyse the layer utilization of a pre - trained BERT model via scalar mixing weights learned during training . We revisit this framework in Section 3 .
Role Semantics
We now turn to the object of our investigation : role semantics . For further discussion , consider the following synthetic example :
Despite surface - level differences , the sentences express the same meaning , suggesting an underlying semantic representation in which these sentences are equivalent . One such representation is offered by role semantics , a shallow predicate - semantic formalism closely related to syntax . In terms of role semantics , Mary , book and John are semantic arguments of the predicate give , and are assigned roles from a pre - defined inventory , for example , Agent , Recipient and Theme .
Semantic roles and their properties have received extensive attention in linguistics ( Fillmore , 1968;Levin and Rappaport Hovav , 2005;Dowty , 1991 ) and are considered a universal feature of human language . The size and organization of the role and predicate inventory are subject to debate , giving rise to a variety of role - semantic formalisms .
PropBank assumes a predicate - independent labeling scheme where predicates are distinguished by their sense ( get.01 ) , and semantic arguments are labeled with generic numbered core ( Arg0 - 5 ) and modifier ( e.g. AM - TMP ) roles . Core roles are not tied to specific definitions , but the effort has been made to keep the role assignments consistent for similar verbs ; Arg0 and Arg1 correspond to the Proto - Agent and Proto - Patient roles as per Dowty ( 1991 ) . The semantic interpretation of core roles depends on the predicate sense .
VerbNet follows a different categorization scheme . Motivated by the regularities in verb behavior , Levin ( 1993 ) has introduced the grouping of verbs into intersective classes ( ILC ) . This methodology has been adopted by VerbNet : for example , the VerbNet class get-13.5.1 would include verbs earn , fetch , gain etc . A verb in VerbNet can belong to several classes corresponding to different senses ; each class is associated with a set of roles and licensed syntactic transformations . Unlike PropBank , VerbNet uses a set of approx . 30 thematic roles that have universal definitions and are shared among predicates , e.g. Agent , Beneficiary , Instrument .
FrameNet takes a meaning - driven stance on the role encoding by modeling it in terms of frame semantics : predicates are grouped into frames ( e.g. Commerce buy ) , which specify role - like slots to be filled . FrameNet offers fine - grained frame distinctions , and roles in FrameNet are frame - specific , e.g. Buyer , Seller and Money . The resource accompanies each frame with a description of the situation and its core and peripheral participants .
SPR follows the work of Dowty ( 1991 ) and discards the notion of categorical semantic roles in favor of feature bundles .
Instead of a fixed role label , each argument is assessed via an 11 - dimensional cardinal feature set including Proto - Agent and Proto - Patient properties like volitional , sentient , destroyed , etc . The feature - based approach eliminates some of the theoretical issues associated with categorical role inventories and allows for more flexible modeling of role semantics .
Each of the role labeling formalisms offers certain advantages and disadvantages ( Giuglea and Moschitti , 2006;Mújdricza - Maydt et al . , 2016 ) . While being close to syntax and thereby easier to predict , PropBank does n't contribute much semantics to the representation . On the opposite side of the spectrum , FrameNet offers rich predicatesemantic representations for verbs and nouns , but suffers from high granularity and coverage gaps ( Hartmann et al . , 2017 ) . VerbNet takes a middle ground by following grammatical criteria while still encoding coarse - grained semantics , but only focuses on verbs and core ( not modifier ) roles . SPR avoids the granularity - generalization trade - off of the categorical inventories , but is yet to find its way into practical NLP applications .
Probing Methodology
We take the edge probing setup by Tenney et al . ( 2019b ) as our starting point . Edge probing aims to predict a label given a pair of contextualized span or word encodings . More formally , we encode a WP - tokenized sentence [ wp_1 , wp_2 , ... , wp_k ] with a frozen pre - trained model , producing contextual embeddings [ e_1 , e_2 , ... , e_k ] , each of which is a layered representation over L = { l_0 , l_1 , ... , l_m } layers , with the encoding at layer l_n for the wordpiece wp_i further denoted as e^n_i . A trainable scalar mix is applied to the layered representation to produce the final encoding , given the per - layer mixing weights { a_0 , a_1 , ... , a_m } and a scaling parameter γ :
e_i = \gamma \sum_{l=0}^{m} \mathrm{softmax}(a)_l \, e^l_i
Given the source src and target tgt wordpieces encoded as e_src and e_tgt , our goal is to predict the label y .
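A minimal PyTorch sketch of this setup, combining the scalar mix above with the flat linear classifier and the separate source/target mixes introduced below; layer counts, hidden sizes and module names are illustrative assumptions, not the exact configuration used in our experiments:

# Scalar mix over the frozen encoder's layers plus a single linear projection
# from the concatenated source and target wordpiece encodings to the label space.
import torch
import torch.nn as nn

class ScalarMix(nn.Module):
    def __init__(self, num_layers: int):
        super().__init__()
        self.weights = nn.Parameter(torch.zeros(num_layers))  # a_0 .. a_m
        self.gamma = nn.Parameter(torch.ones(1))               # scaling parameter

    def forward(self, layers: torch.Tensor) -> torch.Tensor:
        # layers: (num_layers, batch, hidden) per-layer encodings of one wordpiece
        probs = torch.softmax(self.weights, dim=0).view(-1, 1, 1)
        return self.gamma * (probs * layers).sum(dim=0)

class EdgeProbe(nn.Module):
    def __init__(self, num_layers: int, hidden: int, num_labels: int):
        super().__init__()
        self.src_mix = ScalarMix(num_layers)   # separate mixes for src and tgt
        self.tgt_mix = ScalarMix(num_layers)
        self.proj = nn.Linear(2 * hidden, num_labels)  # flat model: no MLP

    def forward(self, src_layers, tgt_layers):
        src, tgt = self.src_mix(src_layers), self.tgt_mix(tgt_layers)
        return self.proj(torch.cat([src, tgt], dim=-1))  # logits (or a regression value)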
Due to its task - agnostic architecture , edge probing can be applied to a wide variety of unary ( by omitting tgt ) and binary labeling tasks in a unified manner , facilitating the cross - task comparison . The original setup has several limitations that we address in our implementation .
Regression tasks . The original edge probing setup only considers classification tasks . Many language phenomena , including positional information and semantic proto - roles , are naturally modeled as regression . We extend the architecture by Tenney et al . ( 2019b ) to support both classification and regression : the former is achieved via a softmax , the latter via direct linear regression to the target value .
Flat model . To decrease the models ' own expressive power ( Hewitt and Liang , 2019 ) , we keep the number of parameters in our probing model as low as possible . While Tenney et al . ( 2019b ) utilize pooled self - attentional span representations and a projection layer to enable cross - model comparison , we directly feed the wordpiece encoding into the classifier , using the first wordpiece of a word . To further increase the selectivity of the model , we directly project the source and target wordpiece representations into the label space , as opposed to the two - layer MLP classifier used in the original setup .
Separate scalar mixes . To enable fine - grained analysis of probing results , we train and analyze separate scalar mixes for source and target wordpieces , motivated by the fact that the classifier might utilize different aspects of their representation for prediction . Indeed , we find that the mixing weights learned for source and target wordpieces might show substantial , and linguistically meaningful , variation . Combined with the regression - based objective , separating the scalar mixes allows us to scrutinize layer utilization patterns for semantic proto - roles .
Sentence - level probes . Utilizing the BERT - specific sentence representation [ CLS ] allows us to incorporate the sentence - level natural language inference ( NLI ) probe into our kit .
Anchor tasks . We employ two analytical tools from the original layer probing setup . Mixing weight plotting compares layer utilization among tasks by visually aligning the respective learned weight distributions transformed via a softmax function . Layer center - of - gravity is used as a summary statistic for a task 's layer utilization . While the distribution of mixing weights along the layers allows us to estimate the order in which information is processed during encoding , it does n't allow us to directly assess the similarity between the layer utilization of the probing tasks . Tenney et al . ( 2019a ) have demonstrated that the order in which linguistic information is stored in BERT mirrors the traditional NLP pipeline . A prominent property of the NLP pipelines is their use of low - level features to predict downstream phenomena . In the context of layer probing , probing tasks can be seen as end - to - end feature extractors . Following this intuition , we define two groups of probing tasks : target tasks , the main tasks under investigation , and anchor tasks , a set of related tasks that serve as a basis for qualitative comparison between the targets . The softmax transformation of the scalar mixing weights allows us to treat them as probability distributions : the higher the mixing weight of a layer , the more likely the probe is to utilize information from this layer during prediction . We use Kullback - Leibler divergence to compare target tasks ( e.g. role labeling in different formalisms ) in terms of their similarity to lower - level anchor tasks ( e.g. dependency relation and lemma ) . Note that the notion of anchor task is contextual : the same task can serve as a target and as an anchor , depending on the focus of the study .
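A small sketch of this comparison, assuming the per-layer mixing weights of a trained target probe and a trained anchor probe have been extracted (variable names are hypothetical):

# KL divergence between the softmax-normalized scalar mixing weights of a
# target probe and an anchor probe.
import torch
import torch.nn.functional as F

def mix_kl(target_weights: torch.Tensor, anchor_weights: torch.Tensor) -> float:
    """KL(target || anchor) between softmaxed mixing weight vectors."""
    p = F.softmax(target_weights, dim=0)
    log_q = F.log_softmax(anchor_weights, dim=0)
    return F.kl_div(log_q, p, reduction="sum").item()

# e.g. compare a role-labeling probe's mix to a dependency-relation anchor:
# mix_kl(role_vn_src_weights, deprel_weights)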
Probing tasks
Our probing kit spans a wide range of probing tasks , from primitive surface - level tasks , mostly utilized as anchors later , to high - level semantic tasks that aim to provide a representational upper bound to predicate semantics . We follow the training , test and development splits from the original SR3de , CoNLL-2009 and SPR data . The XNLI task is sourced from the development set and only used for scalar mix analysis . To reduce the number of labels in some of the probing tasks , we collect frequency statistics over the corresponding training sets and only consider up to the 250 most frequent labels . Below we define the tasks in order of their complexity ; Table 2 provides the probing task statistics , Table 3 compares the categorical role labeling formalisms in terms of granularity , and Table 4 provides examples . We evaluate the classification performance using Accuracy , while regression tasks are scored via R^2 .
Token type ( ttype ) predicts the type of a word . This requires contextual processing since a word might consist of several wordpieces ; Token position ( token.ix ) predicts the linear position of a word , cast as a regression task over the first 20 words in the sentence . Again , the task is non - trivial since it requires the words to be assembled from the wordpieces . Part - of - speech ( pos ) predicts the language - specific part - of - speech tag for the given token . Lexical unit ( lex.unit ) predicts the lemma and POS of the given word , a common input representation for the entries in lexical resources . We extract coarse POS tags by using the first character of the language - specific POS tag .
Dependency relation ( deprel ) predicts the dependency relation between the parent src and dependent tgt tokens ; Semantic role ( role.[frm ] ) predicts the semantic role given a predicate src and an argument tgt token in one of the three role labeling formalisms : PropBank pb , VerbNet vn and FrameNet fn . Note that we only probe for the role label , and the model has no access to the verb sense information from the data . Semantic proto - role ( spr . [ prop ] ) is a set of eleven regression tasks predicting the values of the proto - role properties as defined in ( Reisinger et al . , 2015 ) , given a predicate src and an argument tgt . XNLI is a sentence - level NLI task directly sourced from the corresponding dataset . Given two sentences , the goal is to determine whether an entailment or a contradiction relationship holds between them . We use NLI to investigate the layer utilization of mBERT for high - level semantic tasks . We extract the sentence pair representation via the [ CLS ] token and treat it as a unary probing task .
Results
Our probing framework is implemented using AllenNLP . We train the probes for 20 epochs using the Adam optimizer with default parameters and a batch size of 32 . Due to the frozen encoder and flat model architecture , the total runtime of the main experiments is under 8 hours on a single Tesla V100 GPU . In addition to pre - trained mBERT , we report baseline performance using a frozen untrained mBERT model obtained by randomizing the encoder weights post - initialization as in Jawahar et al . ( 2019 ) .
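A sketch of how such an untrained baseline can be constructed with Hugging Face Transformers; building a model from the configuration alone yields freshly initialized weights, approximating the post-initialization randomization described above:

# Frozen, randomly initialized mBERT baseline: same architecture, no pre-trained weights.
from transformers import BertConfig, BertModel

def randomized_baseline(model_name: str = "bert-base-multilingual-cased") -> BertModel:
    config = BertConfig.from_pretrained(model_name)
    model = BertModel(config)              # weights are randomly initialized
    for param in model.parameters():
        param.requires_grad = False        # keep the randomized encoder frozen
    return model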
General Trends
While absolute performance is secondary to our analysis , we report the probing task scores on the respective development sets in Table 5 . We observe that grammatical tasks score high , while core role labeling lags behind , in line with the findings of Tenney et al . ( 2019a ) . We observe lower scores for German role labeling , which we attribute to the lack of training data . Surprisingly , as we show below , this does n't prevent the edge probe from learning to locate relevant role - semantic information in mBERT 's layers .
The untrained mBERT baseline expectedly underperforms ; however , we note good baseline results on surface - level tasks for English , which we attribute to memorizing token identity and position : although the weights are set randomly , the frozen encoder still associates each wordpiece input with a fixed random vector . We have confirmed this assumption by scalar mix analysis of the untrained mBERT baseline : in our experiments the baseline probes for both English and German attended almost exclusively to the first few layers of the encoder , independent of the task . For brevity , here and further we do not examine baseline mixing weights and only report the scores .
Our main probing results mirror the findings of Tenney et al . ( 2019a ) about the sequential processing order in BERT . We observe that the layer utilization among tasks ( Fig . 2 ) generally aligns for English and German , echoing the recent findings on mBERT 's multilingual capacity ( Pires et al . , 2019;Kondratyuk and Straka , 2019 ) , although we note that in terms of center - of - gravity mBERT tends to utilize deeper layers for German probes . Basic word - level tasks are indeed processed early by the model , and XNLI probes focus on deeper levels , suggesting that the representation of higher - level semantic phenomena follows the encoding of syntax and predicate semantics .
The Effect of Formalism
Using separate scalar mixes for source and target tokens allows us to explore the cross - formalism encoding of role semantics by mBERT in detail . For both English and German role labeling , the probe 's layer utilization drastically differs for predicate and argument tokens . While the argument representation role * tgt mostly focuses on the same layers as the dependency parsing probe , the layer utilization of the predicates role * src is affected by the chosen formalism . In English , PropBank predicate token mixing weights emphasize the same layers as dependency parsing , in line with the previously published results . However , the probes for VerbNet and FrameNet predicates ( role.vn src and role.fn src ) utilize the layers associated with ttype and lex.unit that contain lexical information . Coupled with the fact that both VerbNet and FrameNet assign semantic roles based on lexical - semantic predicate groupings ( frames in FrameNet and verb classes in VerbNet ) , this suggests that the lower layers of mBERT implicitly encode predicate sense information ; moreover , sense encoding for VerbNet utilizes deeper layers of the model associated with syntax , in line with VerbNet 's predicate classification strategy . This finding confirms that the formalism can indeed have linguistically meaningful effects on probing results .
Anchor Tasks in the Pipeline
We now use the scalar mixes of the role labeling probes as target tasks , and lower - level probes as anchor tasks to qualitatively explore the differences between how our role probes learn to represent predicates and semantic arguments ( Fig . 3 ) . The results reveal a distinctive pattern that confirms our previous observations : while VerbNet and FrameNet predicate layer utilization src is similar to the scalar mixes learned for ttype and lex.unit , the learned argument representations tgt and the PropBank predicate attend to the layers associated with dependency relation and POS probes . Aside from the PropBank predicate encoding , which we address below , the pattern reproduces for English and German . This aligns with the traditional separation of the semantic role labeling task into predicate disambiguation followed by semantic argument identification and labeling , along with the feature sets employed for these tasks ( Björkelund et al . , 2009 ) . Note that the observation about the pipeline - like task processing within the BERT encoders thereby holds , albeit on a sub - task level .
Formalism Implementations
Both layer and anchor task analysis reveal a prominent discrepancy between English and German role probing results : while the PropBank predicate layer utilization for English mostly relies on syntactic information , German PropBank predicates behave similarly to VerbNet and FrameNet . The lack of systematic cross - lingual differences between layer utilization for the other probing tasks allows us to rule out the effect of purely typological features such as word order and case marking as a likely cause . The difference in the number of role labels for English and German PropBank , however , points at possible qualitative differences in the labeling schemes ( Table 3 ) . The data for English stems from the token - level alignment in SemLink that maps the original PropBank roles to VerbNet and FrameNet . Role annotations for German have a different lineage : they originate from the FrameNet - annotated SALSA corpus ( Burchardt et al . , 2006 ) semi - automatically converted to PropBank style for the CoNLL-2009 shared task ( Hajič et al . , 2009 ) , and enriched with VerbNet labels in SR3de ( Mújdricza - Maydt et al . , 2016 ) . As a result , while English PropBank labels are assigned in a predicate - independent manner , German PropBank , following the same numbered labeling scheme , keeps this scheme consistent within the frame . We assume that this incentivizes the probe to learn semantic verb groupings and is reflected in our probing results . The ability of the probe to detect subtle differences between formalism implementations constitutes a new use case for probing , and a promising direction for future studies .
Encoding of Proto - Roles
We now turn to the probing results for decompositional semantic proto - role labeling tasks . Unlike Tenney et al . ( 2019b ) , who used a multi - label classification probe , we treat SPR properties as separate regression tasks . The results in Table 6 show that the performance varies by property , with some of the properties attaining reasonably high R^2 scores despite the simplicity of the probe architecture and the small dataset size . We observe that properties associated with Proto - Agent tend to perform better . The untrained mBERT baseline performs poorly , which we attribute to the lack of data and the fine - grained semantic nature of the task . Our fine - grained , property - level task design allows for more detailed insights into the layer utilization by the SPR probes ( Fig . 4 ) . The results indicate that while the layer utilization on the predicate side ( src ) shows no clear preference for particular layers ( similar to the results obtained by Tenney et al . ( 2019a ) ) , some of the proto - role features follow the pattern seen in the categorical role labeling and dependency parsing tasks for the argument tokens tgt . With few exceptions , we observe that the properties displaying that behavior are Proto - Agent properties ; moreover , a close examination of the results on syntactic preference by Reisinger et al . ( 2015 , p. 483 ) reveals that these properties are also the ones with a strong preference for the subject position , including the outlier case of stationary , which in their data behaves like a Proto - Agent property . The correspondence is not strict , and we leave an in - depth investigation of the reasons behind these discrepancies for follow - up work .
Conclusion
We have demonstrated that the choice of linguistic formalism can have substantial , linguistically meaningful effects on role - semantic probing results . We have shown how probing classifiers can be used to detect discrepancies between formalism implementations , and presented evidence of semantic proto - role encoding in the pre - trained mBERT model . Our refined implementation of the edge probing framework coupled with the anchor task methodology enabled new insights into the processing of predicate - semantic information within mBERT . Our findings suggest that linguistic formalism is an important factor to be accounted for in probing studies . This prompts several recommendations for the follow - up probing studies . First , the formalism and implementation used to prepare the linguistic material underlying a probing study should be always explicitly specified . Second , if possible , results on multiple formalisations of the same task should be reported and validated for several languages . Finally , assembling corpora with parallel cross - formalism annotations would facilitate further research on the effect of formalism in probing .
While our work illustrates the impact of formalism using a single task and a single probing framework , the influence of linguistic formalism per se is likely to be present for any probing setup that builds upon linguistic material . An investigation of how , whether , and why formalisms and their implementations affect probing results for tasks beyond role labeling and for frameworks beyond edge probing constitutes an exciting avenue for future research .
Acknowledgments
This work has been funded by the LOEWE initiative ( Hesse , Germany ) within the emergenCITY center .
-DOCSTART- Pre - Training Transformers as Energy - Based Cloze Models
We introduce Electric , an energy - based cloze model for representation learning over text . Like BERT , it is a conditional generative model of tokens given their contexts . However , Electric does not use masking or output a full distribution over tokens that could occur in a context . Instead , it assigns a scalar energy score to each input token indicating how likely it is given its context . We train Electric using an algorithm based on noise - contrastive estimation and elucidate how this learning objective is closely related to the recently proposed ELECTRA pre - training method . Electric performs well when transferred to downstream tasks and is particularly effective at producing likelihood scores for text : it reranks speech recognition n - best lists better than language models and much faster than masked language models . Furthermore , it offers a clearer and more principled view of what ELECTRA learns during pre - training .
Introduction
The cloze task ( Taylor , 1953 ) of predicting the identity of a token given its surrounding context has proven highly effective for representation learning over text . BERT ( Devlin et al . , 2019 ) implements the cloze task by replacing input tokens with [ MASK ] , but this approach incurs drawbacks in efficiency ( only 15 % of tokens are masked out at a time ) and introduces a pre - train / fine - tune mismatch where BERT sees [ MASK ] tokens in training but not in fine - tuning . ELECTRA ( Clark et al . , 2020 ) uses a different pre - training task that alleviates these disadvantages . Instead of masking tokens , ELECTRA replaces some input tokens with fakes sampled from a small generator network . The pre - training task is then to distinguish the original vs. replaced tokens . While on the surface it appears quite different from BERT , in this paper we elucidate a close connection between ELECTRA and cloze modeling . In particular , we develop a new way of implementing the cloze task using an energy - based model ( EBM ) . Then we show that the resulting model , which we call Electric , is closely related to ELECTRA , as well as being useful in its own right for some applications . EBMs learn an energy function that assigns low energy values to inputs in the data distribution and high energy values to other inputs . They are flexible because they do not have to compute normalized probabilities . For example , Electric does not use masking or an output softmax , instead producing a scalar energy score for each token where a low energy indicates the token is likely given its context . Unlike with BERT , these likelihood scores can be computed simultaneously for all input tokens rather than only for a small masked - out subset . We propose a training algorithm for Electric that efficiently approximates a loss based on noise - contrastive estimation ( Gutmann and Hyvärinen , 2010 ) . Then we show that this training algorithm is closely related to ELECTRA ; in fact , ELECTRA can be viewed as a variant of Electric using negative sampling instead of noise - contrastive estimation .
We evaluate Electric on GLUE and SQuAD ( Rajpurkar et al . , 2016 ) , where Electric substantially outperforms BERT but slightly under - performs ELECTRA . However , Electric is particularly useful in its ability to efficiently produce pseudo - likelihood scores ( Salazar et al . , 2020 ) for text : Electric is better at re - ranking the outputs of a speech recognition system than GPT-2 ( Radford et al . , 2019 ) and is much faster at re - ranking than BERT because it scores all input tokens simultaneously rather than having to be run multiple times with different tokens masked out . In total , investigating Electric leads to a more principled understanding of ELECTRA and our results suggest that EBMs are a promising alternative to the standard generative models currently used for language representation learning .
Method
BERT and related pre - training methods ( Baevski et al . , 2019;Lan et al . , 2020 ) train a large neural network to perform the cloze task . These models learn the probability p data ( x t |x \t ) of a token x t occurring in the surrounding context
x_\t = [ x_1 , ... , x_{t-1} , x_{t+1} , ... , x_n ] .
Typically the context is represented as the input sequence with x_t replaced by a special [ MASK ] placeholder token . This masked sequence is encoded into vector representations by a transformer network ( Vaswani et al . , 2017 ) . Then the representation at position t is passed into a softmax layer to produce a distribution over tokens p_θ ( x_t | x_\t ) for the position .
The Electric Model
Electric also models p_data ( x_t | x_\t ) , but does not use masking or a softmax layer . Electric first maps the unmasked input x = [ x_1 , ... , x_n ] into contextualized vector representations h(x) = [ h_1 , ... , h_n ] using a transformer network . The model assigns a given position t an energy score E(x)_t = w^T h(x)_t using a learned weight vector w . The energy function defines a distribution over the possible tokens at position t as
p_\theta(x_t | x_{\setminus t}) = \exp(-E(x)_t) / Z(x_{\setminus t}) = \frac{\exp(-E(x)_t)}{\sum_{x' \in \mathcal{V}} \exp(-E(\mathrm{REPLACE}(x , t , x'))_t)}
where REPLACE(x , t , x') denotes replacing the token at position t with x' , and V is the vocabulary , in practice usually word pieces ( Sennrich et al . , 2016 ) . Unlike with BERT , which produces the probabilities for all possible tokens x' using a softmax layer , a candidate x' is passed in as input to the transformer . As a result , computing p_θ is prohibitively expensive because the partition function Z_θ ( x_\t ) requires running the transformer |V| times ; unlike most EBMs , the intractability of Z_θ ( x_\t ) is due to the expensive scoring function rather than having a large sample space .
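A minimal sketch of Electric's energy scoring in PyTorch; the encoder is a stand-in for the pre-trained transformer and the module names are assumptions for illustration:

# E(x)_t = w^T h(x)_t for every position t, computed in a single forward pass.
import torch
import torch.nn as nn

class EnergyScorer(nn.Module):
    def __init__(self, encoder: nn.Module, hidden_size: int):
        super().__init__()
        self.encoder = encoder                           # maps ids -> (batch, n, hidden)
        self.w = nn.Linear(hidden_size, 1, bias=False)   # learned weight vector w

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        h = self.encoder(input_ids)                      # h(x) = [h_1, ..., h_n]
        return self.w(h).squeeze(-1)                     # energies, shape (batch, n)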
As computing the exact likelihood is intractable , training energy - based models such as Electric with standard maximum - likelihood estimation is not possible . Instead , we use ( conditional ) Noise - Contrastive Estimation ( NCE ) ( Gutmann and Hyvärinen , 2010;Ma and Collins , 2018 ) , which provides a way of efficiently training an unnormalized model that does not compute Z_θ ( x_\t ) . NCE learns the parameters of a model by defining a binary classification task where samples from the data distribution have to be distinguished from samples generated by a noise distribution q ( x_t | x_\t ) . First , we define the un - normalized output p̂_θ ( x_t | x_\t ) = exp ( −E(x)_t ) .
Operationally , NCE can be viewed as follows :
• A positive data point is a text sequence x from the data and a position t in the sequence .
• A negative data point is the same except that x_t , the token at position t , is replaced with a noise token x̂_t sampled from q .
• Define a binary classifier D that estimates the probability of a data point being positive as
\frac{n \cdot \hat{p}_\theta(x_t | x_{\setminus t})}{n \cdot \hat{p}_\theta(x_t | x_{\setminus t}) + k \cdot q(x_t | x_{\setminus t})}
• The binary classifier is trained to distinguish positive vs negative data points , with k negatives sampled for every n positive data points .
Formally , the NCE loss L(θ) is
\mathcal{L}(\theta) = n \cdot \mathbb{E}_{x , t} \left[ -\log \frac{n \cdot \hat{p}_\theta(x_t | x_{\setminus t})}{n \cdot \hat{p}_\theta(x_t | x_{\setminus t}) + k \cdot q(x_t | x_{\setminus t})} \right] + k \cdot \mathbb{E}_{x , t , \hat{x}_t \sim q} \left[ -\log \frac{k \cdot q(\hat{x}_t | x_{\setminus t})}{n \cdot \hat{p}_\theta(\hat{x}_t | x_{\setminus t}) + k \cdot q(\hat{x}_t | x_{\setminus t})} \right]
This loss is minimized when p̂_θ matches the data distribution p_data ( Gutmann and Hyvärinen , 2010 ) . A consequence of this property is that the model learns to be self - normalized such that Z_θ ( x_\t ) = 1 .
Training Algorithm
To minimize the loss , the expectations could be approximated by sampling , as shown in Algorithm 1 .
Algorithm 1 Naive NCE loss estimation . Given : input sequence x , number of negative samples k , noise distribution q , model p̂_θ .
1 . Initialize the loss as \sum_{t=1}^{n} -\log \frac{n \cdot \hat{p}_\theta(x_t | x_{\setminus t})}{n \cdot \hat{p}_\theta(x_t | x_{\setminus t}) + k \cdot q(x_t | x_{\setminus t})} .
2 . Sample k negative samples according to t \sim \mathrm{unif}\{1 , n\} , \hat{x}_t \sim q(\hat{x}_t | x_{\setminus t}) .
3 . For each negative sample , add to the loss -\log \frac{k \cdot q(\hat{x}_t | x_{\setminus t})}{n \cdot \hat{p}_\theta(\hat{x}_t | x_{\setminus t}) + k \cdot q(\hat{x}_t | x_{\setminus t})} .
Taking the gradient of this estimated loss produces an unbiased estimate of \nabla_\theta \mathcal{L}(\theta) . However , this algorithm is computationally expensive to run , since it requires k + 1 forward passes through the transformer to compute the p̂_θ 's ( once for the positive samples and once for each negative sample ) . We propose a much more efficient approach , shown in Algorithm 2 , that replaces k input tokens with noise samples simultaneously .
Algorithm 2 Efficient NCE loss estimation . Given : input sequence x , number of negative samples k , noise distribution q , model p̂_θ .
1 . Pick k unique random positions R = { r_1 , ... , r_k } where each r_i satisfies 1 ≤ r_i ≤ n .
2 . Replace the k random positions with negative samples : \hat{x}_i \sim q(\hat{x}_i | x_{\setminus i}) for i ∈ R , x^{noised} = REPLACE(x , R , \hat{X}) .
3 . For each position t = 1 to n , add to the loss -\log \frac{k \cdot q(x_t | x_{\setminus t})}{(n - k) \cdot \hat{p}_\theta(x_t | x^{\mathrm{noised}}_{\setminus t}) + k \cdot q(x_t | x_{\setminus t})} if t ∈ R , and -\log \frac{(n - k) \cdot \hat{p}_\theta(x_t | x^{\mathrm{noised}}_{\setminus t})}{(n - k) \cdot \hat{p}_\theta(x_t | x^{\mathrm{noised}}_{\setminus t}) + k \cdot q(x_t | x_{\setminus t})} otherwise .
It requires just one pass through the transformer for k noise samples and n − k data samples . However , this procedure only truly minimizes L if p̂_θ ( x_t | x_\t ) = p̂_θ ( x_t | x^noised_\t ) . To apply this efficiency trick , we are making the assumption that they are approximately equal , which we argue is reasonable because ( 1 ) we choose a small k of 0.15n and ( 2 ) q is trained to be close to the data distribution ( see below ) . This efficiency trick is analogous to BERT masking out multiple tokens per input sequence .
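A sketch of the loss of Algorithm 2 in PyTorch, assuming the noise model has already provided q(x_t | x_\t) for the token actually present at every position of the noised sequence and that the energy model scores all positions in one pass; shapes and names are illustrative rather than the actual implementation:

# Efficient NCE loss: k positions carry noise tokens, the remaining n - k carry data tokens.
import torch

def electric_nce_loss(energy_model, noised_ids, q_probs, is_noised, k, n):
    """
    noised_ids: (batch, n) token ids after k positions were replaced by noise samples.
    q_probs:    (batch, n) noise probability q of the token at each position given its context.
    is_noised:  (batch, n) boolean mask of the k replaced positions R.
    """
    energies = energy_model(noised_ids)       # (batch, n), one energy per position
    p_hat = torch.exp(-energies)              # un-normalized model output
    data_term = (n - k) * p_hat
    noise_term = k * q_probs
    denom = data_term + noise_term
    loss_data = -torch.log(data_term / denom)    # term for t not in R
    loss_noise = -torch.log(noise_term / denom)  # term for t in R
    loss = torch.where(is_noised, loss_noise, loss_data)
    return loss.sum(dim=-1).mean()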
Noise Distribution
The noise distribution q comes from a neural network trained to match p_data . NCE commonly employs this idea to ensure the classification task is sufficiently challenging for the model ( Gutmann and Hyvärinen , 2010;Wang and Ou , 2018 ) . In particular , we use a two - tower cloze model as proposed by Baevski et al . ( 2019 ) , which is more accurate than a language model because it uses context on both sides of each token . The model runs two transformers T_LTR and T_RTL over the input sequence . These transformers apply causal masking so that one processes the sequence left - to - right and the other operates right - to - left . The model 's predictions come from a softmax layer applied to the concatenated states of the two transformers :
\overrightarrow{h} = T_{\mathrm{LTR}}(x) , \quad \overleftarrow{h} = T_{\mathrm{RTL}}(x) , \qquad q(x_t | x_{\setminus t}) = \mathrm{softmax}(W [ \overrightarrow{h}_{t-1} , \overleftarrow{h}_{t+1} ])_{x_t}
The noise distribution is trained simultaneously with Electric using standard maximum likelihood estimation over the data . The model producing the noise distribution is much smaller than Electric to reduce the computational overhead .
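A minimal sketch of such a two-tower noise distribution; the two causally masked transformers are stand-ins for the smaller networks used in practice, and boundary handling is simplified:

# Two-tower cloze model: the prediction for position t combines the left-to-right
# state at t-1 and the right-to-left state at t+1.
import torch
import torch.nn as nn

class TwoTowerCloze(nn.Module):
    def __init__(self, t_ltr: nn.Module, t_rtl: nn.Module, hidden: int, vocab_size: int):
        super().__init__()
        self.t_ltr, self.t_rtl = t_ltr, t_rtl          # causally masked transformers
        self.out = nn.Linear(2 * hidden, vocab_size)   # softmax layer W

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        h_fwd = self.t_ltr(input_ids)                  # (batch, n, hidden)
        h_bwd = self.t_rtl(input_ids)                  # (batch, n, hidden)
        # Shift so that position t sees h_fwd[t-1] and h_bwd[t+1] only
        # (torch.roll wraps around; real code would pad the sequence ends instead).
        h_fwd = torch.roll(h_fwd, shifts=1, dims=1)
        h_bwd = torch.roll(h_bwd, shifts=-1, dims=1)
        logits = self.out(torch.cat([h_fwd, h_bwd], dim=-1))
        return torch.log_softmax(logits, dim=-1)       # per-position log noise probabilities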
Connection to ELECTRA
Electric is closely related to the ELECTRA pre - training method . ELECTRA also trains a binary classifier ( the " discriminator " ) to distinguish data tokens from noise tokens produced by a " generator " network . However , ELECTRA 's classifier is simply a sigmoid layer on top of the transformer : it models the probability of a token being negative ( i.e. , replaced by a noise sample ) as σ ( E(x)_t ) , where σ denotes the sigmoid function . Electric on the other hand models this probability as
\frac{k \cdot q(x_t | x_{\setminus t})}{n \cdot \exp(-E(x)_t) + k \cdot q(x_t | x_{\setminus t})} = \sigma\left( E(x)_t + \log \frac{k \cdot q(x_t | x_{\setminus t})}{n} \right)
While ELECTRA learns whether a token is more likely to come from the data distribution p_data or the noise distribution q , Electric only learns p_data because q is passed into the model directly . This difference is analogous to using negative sampling ( Mikolov et al . , 2013 ) vs. noise - contrastive estimation ( Mnih and Kavukcuoglu , 2013 ) for learning word embeddings . A disadvantage of Electric compared to ELECTRA is that it is less flexible in the choice of noise distribution . Since ELECTRA 's binary classifier does not need to access q , its q only needs to be defined for negative sample positions in the input sequence . Therefore ELECTRA can use a masked language model rather than a two - tower cloze model for q . An advantage of Electric is that it directly provides ( un - normalized ) probabilities p̂_θ for tokens , making it useful for applications such as re - ranking the outputs of text generation systems . The differences between ELECTRA and Electric are summarized below :
Model | Noise Dist. | Binary Classifier
Electric | Two - Tower Cloze Model | \sigma( E(x)_t + \log \frac{k \cdot q(x_t | x_{\setminus t})}{n} )
ELECTRA | Masked LM | \sigma( E(x)_t )
Experiments
We train two Electric models the same size as BERT - Base ( 110 M parameters ) : one on Wikipedia and BooksCorpus ( Zhu et al . , 2015 ) for comparison with BERT , and one on OpenWebTextCorpus ( Gokaslan and Cohen , 2019 ) for comparison with GPT-2 . The noise distribution transformers T_LTR and T_RTL are 1/4 the hidden size of Electric . We do no hyperparameter tuning , using the same hyperparameter values as ELECTRA . Further details on training are in the appendix .
Transfer to Downstream Tasks
We evaluate fine - tuning the Electric model on the GLUE natural language understanding benchmark and the SQuAD 2.0 question answering dataset ( Rajpurkar et al . , 2018 ) . We report exact - match for SQuAD , the average score over the GLUE tasks , and accuracy on the multi - genre natural language inference GLUE task . Reported scores are medians over 10 fine - tuning runs with different random seeds . We use the same fine - tuning setup and hyperparameters as ELECTRA .
Results are shown in Table 1 . Electric scores better than BERT , showing the energy - based formulation improves cloze model pre - training . However , Electric scores slightly lower than ELECTRA . One possible explanation is that Electric 's noise distribution is worse because a two - tower cloze model is less expressive than a masked LM . We tested this hypothesis by training ELECTRA with the same two - tower noise model as Electric . Performance did indeed go down , but it only explained about half the gap . The surprising drop in performance suggests that learning the difference between the data and generations from a low - capacity model leads to better representations than only learning the data distribution , but we believe further research is needed to fully understand the discrepancy .
Fast Pseudo - Log - Likelihood Scoring
An advantage of Electric over BERT is that it can efficiently produce pseudo - log - likelihood ( PLL ) scores for text ( Wang and Cho , 2019 ) . PLLs for Electric are
\mathrm{PLL}(x) = \sum_{t=1}^{n} \log \hat{p}_\theta(x_t | x_{\setminus t}) = \sum_{t=1}^{n} -E(x)_t
PLLs can be used to re - rank the outputs of an NMT or ASR system . While historically log - likelihoods from language models have been used for such reranking , recent work has demonstrated that PLLs from masked language models perform better ( Shin et al . , 2019 ) . However , computing PLLs from a masked language model requires n passes of the transformer : once with each token masked out . Salazar et al . ( 2020 ) suggest distilling BERT into a model that uses no masking to avoid this cost , but this model considerably under - performed regular LMs in their experiments .
Electric can produce PLLs for all input tokens in a single pass like a LM while being bidirectional like a masked LM . We use the PLLs from Electric for re - ranking the 100 - best hypotheses of a 5 - layer BLSTMP model from ESPnet ( Watanabe et al . , 2018 ) on the 960 - hour LibriSpeech corpus ( Panayotov et al . , 2015 ) , following the same experimental setup and using the same n - best lists as Salazar et al . ( 2020 ) . Given speech features s and a speech recognition model f , the re - ranked output is
\arg\max_{x \in \text{n-best}(f , s)} f(x | s) + \lambda \cdot \mathrm{PLL}(x)
where n - best(f , s ) consists of the top n ( we use n = 100 ) predictions from the speech recognition model found with beam search , and f ( x|s ) is the score the speech model assigns to the candidate output sequence x . We select the best λ on the dev set out of [ 0.05 , 0.1 , ... , 0.95 , 1.0 ] , with different λs selected for the " clean " and " other " portions of the data .
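A sketch of this re-ranking procedure, assuming a hypothetical energy_model that returns E(x)_t for every position and an n-best list of (token_ids, asr_score) pairs; the actual ESPnet interface and tokenization details differ:

# PLL(x) = -sum_t E(x)_t in a single pass, then rescoring of each hypothesis.
import torch

def pll(energy_model, token_ids: torch.Tensor) -> float:
    """token_ids: (1, n). Returns the pseudo-log-likelihood of the sequence."""
    with torch.no_grad():
        energies = energy_model(token_ids)   # (1, n)
    return -energies.sum().item()

def rerank(nbest, energy_model, lam: float):
    """nbest: list of (token_ids, asr_score) pairs from the speech recognizer."""
    scored = [(asr_score + lam * pll(energy_model, ids), ids) for ids, asr_score in nbest]
    return max(scored, key=lambda pair: pair[0])[1]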
We compare Electric against GPT-2 ( Radford et al . , 2019 ) , BERT ( Devlin et al . , 2019 ) , and two baseline systems that are bidirectional while only requiring a single transformer pass like Electric . TwoTower is a two - tower cloze model similar to Electric 's noise distribution , but of the same size as Electric . ELECTRA - TT is identical to ELECTRA except it uses a two - tower noise distribution rather than a masked language model . The noise distribution probabilities and binary classifier scores of ELECTRA can be combined to assign probabilities for tokens , as shown in Appendix G of the ELECTRA paper .
Results are shown in Table 2 . Electric scores better than GPT-2 when trained on comparable data . While scoring worse than BERT , Electric is much faster to run . It also slightly outperforms ELECTRA - TT , which is consistent with the finding from Labeau and Allauzen ( 2018 ) that NCE outperforms negative sampling for training language models . Furthermore , Electric is simpler and faster than ELECTRA - TT in that it does not require running the generator to produce PLL scores . TwoTower scores lower than Electric , presumably because it is not a " deeply " bidirectional model and instead just concatenates forward and backward hidden states .
Related Work
Language modeling ( Dai and Le , 2015;Radford et al . , 2018;Peters et al . , 2018 ) and cloze modeling ( Devlin et al . , 2019;Baevski et al . , 2019 ) have proven to be effective pre - training tasks for NLP . Unlike Electric , these methods follow the standard recipe of estimating token probabilities with an output softmax and using maximum - likelihood training .
Energy - based models have been widely explored in machine learning ( Dayan et al . , 1995 , among others ) . While many training methods involve sampling from the EBM using gradient - based MCMC ( Du and Mordatch , 2019 ) or Gibbs sampling ( Hinton , 2002 ) , we considered these approaches too slow for pre - training because they require multiple passes through the model per sample . We instead use noise - contrastive estimation ( Gutmann and Hyvärinen , 2010 ) , which has widely been used in NLP for learning word vectors ( Mnih and Kavukcuoglu , 2013 ) and text generation models ( Jean et al . , 2014;Józefowicz et al . , 2016 ) . While EBMs have previously been applied to left - to - right ( Wang et al . , 2015 ) or globally normalized ( Rosenfeld et al . , 2001;Deng et al . , 2020 ) text generation , they have not previously been applied to cloze models or for pre - training NLP models . Several papers have pointed out the connection between EBMs and GANs ( Zhao et al . , 2016;Finn et al . , 2016 ) , which is similar to the Electric / ELECTRA connection .
Conclusion
We have developed an energy - based cloze model we call Electric and designed an efficient training algorithm for Electric based on noise - contrastive estimation . Although Electric can be derived solely from the cloze task , the resulting pre - training method is closely related to ELECTRA 's GAN - like pre - training algorithm . While slightly under - performing ELECTRA on downstream tasks , Electric is useful for its ability to quickly produce pseudo - log - likelihood scores for text . Furthermore , it offers a clearer and more principled view of the ELECTRA objective as a " negative sampling " version of cloze pre - training .
Our pre - training setup is essentially the same as ELECTRA 's ( Clark et al . , 2020 ) , which adds some additional ideas on top of the BERT codebase , such as dynamic masking and removing the next - sentence prediction task . We use the weight sharing trick from ELECTRA , where the transformers producing the proposal distribution and the main transformer share token embeddings . We do not use whole - word or n - gram masking , although we believe it would improve results too . We did no hyperparameter tuning , directly using the hyperparameters from ELECTRA - Base for Electric and our baselines . These hyperparameters are slightly modified from the ones used in BERT ; for completeness , we show these values in Table 3 . The hidden sizes , feed - forward hidden sizes , and number of attention heads of the two transformers T LTR and T RTL used to produce the proposal distribution are 1/4 the size of Electric . We chose this value because it keeps the compute comparable to ELECTRA ; running two 1/4 - sized transformers takes roughly the same compute as running one 1/3 - sized transformer , which is the size of ELECTRA 's generator . To make the compute exactly equal , we train Electric for slightly fewer steps than ELECTRA . This same generator architecture was used for ELECTRA - TT . The TwoTower baseline consists of two transformers 2/3 the size of BERT 's , which takes approximately the same compute to run . The Electric models , ELECTRA - Base , and BERT - Base all use the same amount of pre - training compute ( e.g. , Electric is trained for fewer steps than BERT due to the extra compute from the proposal distribution ) , which equates to approximately three days of training on 16 TPUv2s .
B Fine - Tuning Details
We use ELECTRA 's top - level classifiers and hyperparameter values for fine - tuning as well . For GLUE tasks , a simple linear classifier is added on top of the pre - trained transformer . For SQuAD , a question answering module similar to XLNet 's ( Yang et al . , 2019 ) is added on top of the transformer ; it is slightly more sophisticated than BERT 's in that it jointly rather than independently predicts the start and end positions and has an " answerability " classifier added for SQuAD 2.0 . ELECTRA 's hyperparameters are similar to BERT 's , with the main difference being the addition of layer - wise learning rate decay , where layers of the network closer to the output have a higher learning rate .
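As an informal sketch ( not the actual ELECTRA code ) , layer - wise learning rate decay can be pictured as follows , where num_layers , base_lr , and decay are assumed hyperparameters .

def layerwise_learning_rates(num_layers, base_lr, decay):
    # Layer 0 is the embedding / lowest layer; the top layer gets the full base_lr,
    # and each layer below gets the rate of the layer above multiplied by decay (< 1).
    return [base_lr * decay ** (num_layers - 1 - i) for i in range(num_layers)]

# Example: layerwise_learning_rates(4, 1e-4, 0.8) -> [5.12e-05, 6.4e-05, 8e-05, 1e-4]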
Following BERT , we submit the best of 10 models fine - tuned with different random seeds to the GLUE leaderboard for test set results .
C Dataset Details
We provide details on the fine - tuning datasets below . All datasets are in English . GLUE data can be downloaded at https://gluebenchmark.com/ and SQuAD data can be downloaded at https://rajpurkar.github.io/SQuAD-explorer/ .
• CoLA : Corpus of Linguistic Acceptability ( Warstadt et al . , 2018 ) . The task is to determine whether a given sentence is grammatical or not . The dataset contains 8.5k train examples from books and journal articles on linguistic theory .
• SST : Stanford Sentiment Treebank ( Socher et al . , 2013 ) . The task is to determine if the sentence is positive or negative in sentiment .
The dataset contains 67k train examples from movie reviews .
• MRPC : Microsoft Research Paraphrase Corpus ( Dolan and Brockett , 2005 ) . The task is to predict whether two sentences are semantically equivalent or not . The dataset contains 3.7k train examples from online news sources .
• STS : Semantic Textual Similarity ( Cer et al . , 2017 ) . The task is to predict how semantically similar two sentences are on a 1 - 5 scale . The dataset contains 5.8k train examples drawn from news headlines , video and image captions , and natural language inference data .
• QQP : Quora Question Pairs ( Iyer et al . , 2017 ) . The task is to determine whether a pair of questions are semantically equivalent . The dataset contains 364k train examples from the community question - answering website Quora .
• MNLI : Multi - genre Natural Language Inference ( Williams et al . , 2018 ) . Given a premise sentence and a hypothesis sentence , the task is to predict whether the premise entails the hypothesis , contradicts the hypothesis , or neither . The dataset contains 393k train examples drawn from ten different sources .
While using dev - set model selection to choose the test set submission may alleviate the high variance of fine - tuning to some extent , such model selection is still not sufficient for reliable comparisons between methods ( Reimers and Gurevych , 2018 ) .
Acknowledgements
We thank John Hewitt , Yuhao Zhang , Ashwin Paranjape , Sergey Levine , and the anonymous reviewers for their thoughtful comments and suggestions . Kevin is supported by a Google PhD Fellowship .
A Pre - Training Details
The neural architectures of our models are identical to BERT - Base ( Devlin et al . , 2019 ) , although we believe incorporating additions such as relative position encodings ( Shaw et al . , 2018 ) would improve results .
• QNLI : Question Natural Language Inference ; constructed from SQuAD ( Rajpurkar et al . , 2016 ) . The task is to predict whether a context sentence contains the answer to a question sentence . The dataset contains 108k train examples from Wikipedia .
• RTE : Recognizing Textual Entailment ( Giampiccolo et al . , 2007 ) . Given a premise sentence and a hypothesis sentence , the task is to predict whether the premise entails the hypothesis or not . The dataset contains 2.5k train examples from a series of annual textual entailment challenges .
• SQuAD 1.1 : Stanford Question Answering Dataset ( Rajpurkar et al . , 2016 ) . Given a context paragraph and a question , the task is to select the span of text in the paragraph answering the question . The dataset contains 88k train examples from Wikipedia .
• SQuAD 2.0 : Stanford Question Answering Dataset version 2.0 ( Rajpurkar et al . , 2018 ) . This task adds additional questions to SQuAD whose answers do not exist in the context ; models have to recognize when these questions occur and not return an answer for them .
The dataset contains 130k train examples .
We report Spearman correlation for STS , Matthews correlation coefficient ( MCC ) for CoLA , exact match for SQuAD , and accuracy for the other tasks . We use the provided evaluation script for SQuAD 6 , scipy to compute Spearman scores 7 , and sklearn to compute MCC 8 . We use the standard train / dev / test splits .
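For illustration only , the metric computation described above might look like the following Python sketch using scipy and sklearn ; the gold / pred arrays are placeholders , and exact match for SQuAD is computed by the official script as stated .

from scipy.stats import spearmanr
from sklearn.metrics import matthews_corrcoef, accuracy_score

def sts_score(gold, pred):
    return spearmanr(gold, pred).correlation   # Spearman correlation for STS

def cola_score(gold, pred):
    return matthews_corrcoef(gold, pred)        # MCC for CoLA

def default_score(gold, pred):
    return accuracy_score(gold, pred)           # accuracy for the other GLUE tasks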
D Detailed Results
We show detailed results on GLUE and SQuAD in Table 4 and detailed results on LibriSpeech reranking in Table 5 . Following BERT , we do not show results on the WNLI GLUE task , as it is difficult to beat even the majority classifier using a standard fine - tuning - as - classifier approach . We show dev rather than test results on GLUE in the main paper because they are more reliable ; the performance of fine - tuned models varies substantially based on the random seed ( Phang et al . , 2018 ; Clark et al . , 2019 ; Dodge et al . , 2020 ) , but GLUE only supports submitting a single model rather than getting a median score of multiple models .
6 https://worksheets.codalab.org/rest/bundles/0x6b567e1cf2e041ec80d7098f031c5c9e/contents/blob/
7 https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.spearmanr.html
-DOCSTART- FIND : Human - in - the - Loop Debugging Deep Text Classifiers
Since obtaining a perfect training dataset ( i.e. , a dataset which is considerably large , unbiased , and well - representative of unseen cases ) is hardly possible , many real - world text classifiers are trained on the available , yet imperfect , datasets . These classifiers are thus likely to have undesirable properties . For instance , they may have biases against some sub - populations or may not work effectively in the wild due to overfitting . In this paper , we propose FIND , a framework which enables humans to debug deep learning text classifiers by disabling irrelevant hidden features . Experiments show that by using FIND , humans can improve CNN text classifiers which were trained under different types of imperfect datasets ( including datasets with biases and datasets with dissimilar train - test distributions ) .
Introduction
Deep learning has become the dominant approach to address most Natural Language Processing ( NLP ) tasks , including text classification . With sufficient and high - quality training data , deep learning models can perform incredibly well . However , in real - world cases , such ideal datasets are scarce . Oftentimes , the available datasets are small , full of regular but irrelevant words , and contain unintended biases ( Wiegand et al . , 2019;Gururangan et al . , 2018 ) . These can lead to suboptimal models with undesirable properties . For example , the models may have biases against some sub - populations or may not work effectively in the wild as they overfit the imperfect training data .
To improve the models , previous work has looked into different techniques beyond standard model fitting . If the weaknesses of the training datasets or the models are anticipated , strategies can be tailored to mitigate such weaknesses . For example , augmenting the training data with gender - swapped input texts helps reduce gender bias in the models ( Park et al . , 2018 ; Zhao et al . , 2018 ) . Adversarial training can prevent the models from exploiting irrelevant and/or protected features ( Jaiswal et al . , 2019 ; Zhang et al . , 2018 ) . With a limited number of training examples , using human rationales or prior knowledge together with training labels can help the models perform better ( Zaidan et al . , 2007 ; Bao et al . , 2018 ; Liu and Avci , 2019 ) .
Nonetheless , there are side - effects of suboptimal datasets that cannot be predicted and are only found after training thanks to post - hoc error analysis . To rectify such problems , there have been attempts to enable humans to fix the trained models ( i.e. , to perform model debugging ) ( Stumpf et al . , 2009 ; Teso and Kersting , 2019 ) . Since the models are usually too complex to understand , manually modifying the model parameters is not possible . Existing techniques , therefore , allow humans to provide feedback on individual predictions instead . Then , additional training examples are created based on the feedback to retrain the models . However , such local improvements for individual predictions could add up to inferior overall performance ( Wu et al . , 2019 ) . Furthermore , these existing techniques allow us to rectify only errors related to examples at hand but provide no way to fix problems kept hidden in the model parameters .
In this paper , we propose a framework which allows humans to debug and improve deep text classifiers by disabling hidden features which are irrelevant to the classification task . We name this framework FIND ( Feature Investigation aNd Disabling ) . FIND exploits an explanation method , namely layer - wise relevance propagation ( LRP ) ( Arras et al . , 2016 ) , to understand the behavior of a classifier when it predicts each training instance .
Then it aggregates all the information using word clouds to create a global visual picture of the model . This enables humans to comprehend the features automatically learned by the deep classifier and then decide to disable some features that could undermine the prediction accuracy during testing . The main differences between our work and existing work are : ( i ) first , FIND leverages human feedback on the model components , not the individual predictions , to perform debugging ; ( ii ) second , FIND targets deep text classifiers which are more convoluted than traditional classifiers used in existing work ( such as Naive Bayes classifiers and Support Vector Machines ) .
We conducted three human experiments ( one feasibility study and two debugging experiments ) to demonstrate the usefulness of FIND . For all the experiments , we used as classifiers convolutional neural networks ( CNNs ) ( Kim , 2014 ) , which are a popular , well - performing architecture for many text classification tasks including the tasks we experimented with ( Gambäck and Sikdar , 2017;Johnson and Zhang , 2015;Zhang et al . , 2019 ) . The overall results show that FIND with human - in - the - loop can improve the text classifiers and mitigate the said problems in the datasets . After the experiments , we discuss the generalization of the proposed framework to other tasks and models . Overall , the main contributions of this paper are :
• We propose using word clouds as visual explanations of the features learned .
• We propose a technique to disable the learned features which are irrelevant or harmful to the classification task so as to improve the classifier . This technique and the word clouds form the human - debugging framework , FIND .
• We conduct three human experiments that demonstrate the effectiveness of FIND in different scenarios . The results not only highlight the usefulness of our approach but also reveal interesting behaviors of CNNs for text classification .
The rest of this paper is organized as follows . Section 2 explains related work about analyzing , explaining , and human - debugging text classifiers . Section 3 proposes FIND , our debugging framework . Section 4 explains the experimental setup , followed by the three human experiments in Sections 5 to 7 . Finally , Section 8 discusses the generalization of the framework and concludes the paper . Code and datasets of this paper are available at https://github.com/plkumjorn/FIND .
Related Work
Analyzing deep NLP models - There has been substantial work in gaining better understanding of complex , deep neural NLP models . By visualizing dense hidden vectors , Li et al . ( 2016 ) found that some dimensions of the final representation learned by recurrent neural networks capture the effect of intensification and negation in the input text . Karpathy et al . ( 2015 ) revealed the existence of interpretable cells in a character - level LSTM model for language modelling . For example , they found a cell acting as a line length counter and cells checking if the current letter is inside a parenthesis or a quote . Jacovi et al . ( 2018 ) presented interesting findings about CNNs for text classification , including the fact that one convolutional filter may detect more than one n - gram pattern and may also suppress negative n - grams . Many recent papers studied several types of knowledge in BERT ( Devlin et al . , 2019 ) , a deep transformer - based model for language understanding , and found that syntactic information is mostly captured in the middle BERT layers while the final BERT layers are the most task - specific ( Rogers et al . , 2020 ) . Inspired by these findings , we make the assumption that each dimension of the final representation ( i.e. , the vector before the output layer ) captures patterns or qualities in the input which are useful for classification . Therefore , understanding the roles of these dimensions ( we refer to them as features ) is a prerequisite for effective human - in - the - loop model debugging , and we exploit an explanation method to gain such an understanding .
Explaining predictions from text classifiers - Several methods have been devised to generate explanations supporting classifications in many forms , such as natural language texts , rules ( Ribeiro et al . , 2018 ) , extracted rationales ( Lei et al . , 2016 ) , and attribution scores ( Lertvittayakumjorn and Toni , 2019 ) . Some explanation methods , such as LIME ( Ribeiro et al . , 2016 ) and SHAP ( Lundberg and Lee , 2017 ) , are model - agnostic and do not require access to model parameters . Other methods access the model architectures and parameters to generate the explanations , such as DeepLIFT ( Shrikumar et al . , 2017 ) and LRP ( layer - wise relevance propagation ) ( Bach et al . , 2015 ; Arras et al . , 2016 ) . In this work , we use LRP to explain not the predictions but the learned features , so as to expose the model behavior to humans and enable informed model debugging .
Debugging text classifiers using human feedback -Early work in this area comes from the human - computer interaction community . Stumpf et al . ( 2009 ) studied the types of feedback humans usually give in response to machine - generated predictions and explanations . Also , some of the feedback collected ( i.e. , important words of each category ) was used to improve the classifier via a user co - training approach . Kulesza et al . ( 2015 ) presented an explanatory debugging approach in which the system explains to users how it made each prediction , and the users then rectify the model by adding / removing words from the explanation and adjusting important weights . Even without explanations shown , an active learning framework proposed by Settles ( 2011 ) asks humans to iteratively label some chosen features ( i.e. , words ) and adjusts the model parameters that correspond to the features . However , these early works target simpler machine learning classifiers ( e.g. , Naive Bayes classifiers with bag - of - words ) and it is not clear how to apply the proposed approaches to deep text classifiers .
Recently , there have been new attempts to use explanations and human feedback to debug classifiers in general . Some of them were tested on traditional text classifiers . For instance , Ribeiro et al . ( 2016 ) showed a set of LIME explanations for individual SVM predictions to humans and asked them to remove irrelevant words from the training data in subsequent training . The process was run for three rounds to iteratively improve the classifiers . Teso and Kersting ( 2019 ) proposed CAIPI , which is an explanatory interactive learning framework . At each iteration , it selects an unlabelled example to predict and explain to users using LIME , and the users respond by removing irrelevant features from the explanation . CAIPI then uses this feedback to generate augmented data and retrain the model . While these recent works use feedback on low - level features ( input words ) and individual predictions , our framework ( FIND ) uses feedback on the learned features with respect to the big picture of the model . This helps us avoid local decision pitfalls which usually occur in interactive machine learning ( Wu et al . , 2019 ) . Overall , what makes our contribution different from existing work is that ( i ) we collect the feedback on the model , not the individual predictions , and ( ii ) we target deep text classifiers which are more complex than the models used in previous work .
FIND : Debugging Text Classifiers
Motivation
Generally , deep text classifiers can be divided into two parts . The first part performs feature extraction , transforming an input text into a dense vector ( i.e. , a feature vector ) which represents the input . There are several alternatives to implement this part , such as using convolutional layers , recurrent layers , and transformer layers . The second part performs classification , passing the feature vector through a dense layer with softmax activation to get the predicted probabilities of the classes . These deep classifiers are not transparent , as humans cannot interpret the meaning of either the intermediate vectors or the model parameters used for feature extraction . This prevents humans from applying their knowledge to modify or debug the classifiers .
In contrast , if we understand which patterns or qualities of the input are captured in each feature , we can comprehend the overall reasoning mechanism of the model as the dense layer in the classification part then becomes interpretable . In this paper , we make this possible using LRP . By understanding the model , humans can check whether the input patterns detected by each feature are relevant for classification . Also , the features should be used by the subsequent dense layer to support the right classes . If these are not the case , debugging can be done by disabling the features which may be harmful if they exist in the model . Figure 1 shows the overview of our debugging framework , FIND .
Notation
Let us consider a text classification task with |C| classes where C is the set of all classes and let V be a set of unique words in the corpus ( the vocabulary ) .
A training dataset D = { ( x 1 , y 1 ) , . . . , ( x N , y N ) } is given , where x i is the i - th document containing a sequence of L words , [ x i1 , x i2 , ... , x iL ] , and y i ∈ C is the class label of x i . A trained model M consists of a feature extraction part M f , which maps an input document x to a feature vector f = M f ( x ) ∈ R d , and a classification part M c , which produces the predicted probabilities p = M c ( f ) = softmax ( Wf + b ) , where W ∈ R |C|×d and b ∈ R |C| are the weight matrix and the bias of the final dense layer .
Understanding the Model
To understand how the model M works , we analyze the patterns or characteristics of the input that activate each feature f i . Specifically , using LRP 1 , for each f i of an example x j in the training dataset , we calculate a relevance vector r ij ∈ R L showing the relevance scores ( the contributions ) of each word in x j to the value of f i . After doing this for all d features of all training examples , we can produce word clouds to help the users better understand the model M .
Word clouds - For each feature f i , we create ( one or more ) word clouds to visualize the patterns in the input texts which highly activate f i . This can be done by analyzing r ij for all x j in the training data and displaying , in the word clouds , words or n - grams which get high relevance scores . Note that different model architectures may have different ways to generate the word clouds so as to effectively reveal the behavior of the features .
For CNNs , the classifiers we experiment with in this paper , each feature has one word cloud containing the n - grams , from the training examples , which were selected by the max - pooling of the CNNs . For instance , Figure 2 , corresponding to a feature of filter size 2 , shows bi - grams ( e.g. , " love love " , " love my " , " loves his " , etc . ) whose font size corresponds to the feature values of the bi - grams . This is similar to how previous works analyze CNN features ( Jacovi et al . , 2018;Lertvittayakumjorn and Toni , 2019 ) , and it is equivalent to back - propagating the feature values to the input using LRP and cropping the consecutive input words with non - zero LRP scores to show in the word clouds .
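As an illustrative sketch ( not the authors ' code ) , the n - grams for one convolutional filter 's word cloud could be collected as follows , assuming access to that filter 's activations on the training documents .

import numpy as np

def ngrams_for_filter(conv_activations, tokens_batch, k):
    # conv_activations: array [num_docs, L - k + 1] with one filter's activations;
    # tokens_batch: list of token lists (each of length L); k: the filter width.
    selected = {}
    for acts, tokens in zip(conv_activations, tokens_batch):
        pos = int(np.argmax(acts))                 # position chosen by max-pooling
        ngram = " ".join(tokens[pos:pos + k])      # the n-gram that fired the filter
        selected[ngram] = max(selected.get(ngram, 0.0), float(acts[pos]))
    # ngram -> activation value, usable as font sizes in a word cloud
    return selected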
Disabling Features
As explained earlier , we want to know whether the learned features are valid and relevant to the classification task and whether or not they get appropriate weights from the next layer . This is possible by letting humans consider the word cloud(s) of each feature and tell us which class the feature is relevant to . A word cloud receiving human answers that are different from the class it should support ( as indicated by W ) exhibits a flaw in the model . For example , if the word cloud in Figure 2 represents the feature f i in a sentiment analysis task but the i th column of W implies that f i supports the negative sentiment class , we know the model is not correct here . If this word cloud appears in a product categorization task , this is also problematic because the phrases in the word cloud are not discriminative of any product category . Hence , we provide options for the users to disable the features which correspond to any problematic word clouds so that the features do not play a role in the classification . To enable this to happen , we modify M c to be M c ′ where p = M c ′ ( f ) = softmax ( ( W ⊙ Q ) f + b ) and Q ∈ R |C|×d is a masking matrix , with ⊙ being an element - wise multiplication operator . Initially , all elements in Q are ones , which enables all the connections between the features and the output . To disable feature f i , we set the i th column of Q to be a zero vector . After disabling features , we then freeze the parameters of M f and fine - tune the parameters of M c ( except the masking matrix Q ) with the original training dataset D in the final step .
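A minimal sketch of the masked classification head described above , written with numpy for illustration ( the paper 's models are implemented in Keras ) :

import numpy as np

def softmax(z):
    z = z - z.max()              # numerical stability
    e = np.exp(z)
    return e / e.sum()

def masked_head(f, W, b, Q):
    # p = softmax((W ⊙ Q) f + b); W is |C| x d, b is |C|, Q is the |C| x d mask.
    return softmax((W * Q) @ f + b)

def disable_features(Q, feature_ids):
    Q = Q.copy()
    Q[:, feature_ids] = 0.0      # zeroing a column cuts that feature off from every class
    return Q

Zeroing whole columns of Q ( rather than single entries ) reflects the choice above : a disabled feature is removed from the classification entirely , and afterwards only W and b are fine - tuned while M f and Q stay fixed .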
Experimental Setup
All datasets and their splits used in the experiments are listed in Table 1 . We will explain each of them in the following sections . For each classification task , we ran and improved three models , using different random seeds , independently of one another , and the reported results are the average of the three runs . Regarding the models , we used 1D CNNs with the same structures for all the tasks and datasets . The convolution layer had three filter sizes [ 2 , 3 , 4 ] with 10 filters for each size ( i.e. , d = 10 × 3 = 30 ) . All the activation functions were ReLU except the softmax at the output layer . The input documents were padded or trimmed to have 150 words ( L = 150 ) . We used pre - trained 300 - dim GloVe vectors ( Pennington et al . , 2014 ) as non - trainable weights in the embedding layers . All the models were implemented using Keras and trained with Adam optimizer . We used iNNvestigate ( Alber et al . , 2018 ) to run LRP on CNN features . In particular , we used the LRP - ε propagation rule to stabilize the relevance scores ( ε = 10 −7 ) . Finally , we used Amazon Mechanical Turk ( MTurk ) to collect crowdsourced responses for selecting features to disable . Each question was answered by ten workers and the answers were aggregated using majority votes or average scores depending on the question type ( as explained next ) .
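For concreteness , a Keras sketch consistent with this setup might look as follows ( an approximation , not the exact training code ; embedding_matrix and num_classes are assumed to be prepared elsewhere ) .

from tensorflow.keras import layers, models

def build_cnn(embedding_matrix, num_classes, L=150):
    inp = layers.Input(shape=(L,), dtype="int32")
    emb = layers.Embedding(embedding_matrix.shape[0], 300,
                           weights=[embedding_matrix], trainable=False)(inp)
    pooled = []
    for k in [2, 3, 4]:                        # filter sizes, 10 filters each -> d = 30
        conv = layers.Conv1D(10, k, activation="relu")(emb)
        pooled.append(layers.GlobalMaxPooling1D()(conv))
    features = layers.Concatenate()(pooled)    # the feature vector f analyzed by FIND
    out = layers.Dense(num_classes, activation="softmax")(features)
    model = models.Model(inp, out)
    model.compile(optimizer="adam", loss="categorical_crossentropy",
                  metrics=["accuracy"])        # assumes one-hot labels
    return model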
Exp 1 : Feasibility Study
In this feasibility study , we assessed the effectiveness of word clouds as visual explanations to reveal the behavior of CNN features . We trained CNN models using small training datasets and evaluated the quality of CNN features based on responses from MTurk workers to the feature word clouds .
Then we disabled features based on their average quality scores . The assumption was : if the scores of the disabled features correlated with the drop in the model predictive performance , it meant that humans could understand and accurately assess CNN features using word clouds . We used small training datasets so that the trained CNNs had features with different levels of quality . Some features detected useful patterns , while others overfitted the training data .
Datasets
We used subsets of two datasets : ( 1 ) Yelp -predicting sentiments of restaurant reviews ( positive or negative ) and ( 2 ) Amazon Products -classifying product reviews into one of four categories ( Clothing Shoes and Jewelry , Digital Music , Office Products , or Toys and Games ) ( He and McAuley , 2016 ) . We sampled 500 and 100 examples to be the training data for Yelp and Amazon Products , respectively .
Human Feedback Collection and Usage
We used human responses on MTurk to assign ranks to features . As each classifier had 30 original features ( d = 30 ) , we divided them into three ranks ( A , B , and C ) , each with 10 features . We expected that features in rank A are most relevant and useful for the prediction task , and features in rank C least relevant , potentially undermining the performance of the model . To make the annotation more accessible to lay users , we designed the questions to ask whether a given word cloud is ( mostly or partially ) relevant to one of the classes or not , as shown in Figure 3 . If the answer matches how the model really uses this feature ( as indicated by W ) , the feature gets a positive score from this human response . For example , if the CNN feature of the word cloud in Figure 3 is used by the model for the negative sentiment class , the scores of the five options in the figure are -2 , -1 , 0 , 1 , 2 , respectively . We collected ten responses for each question and used the average score to sort the features in descending order . After sorting , the 1 st -10 th features , 11 th -20 th features , and 21 st -30 th features are considered as rank A , B , and C , respectively . 3 To show the effects of feature disabling , we compared the original model M with the modified model M ′ with the features in rank X disabled , where X ∈ { A , B , C , A and B , A and C , B and C } .
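As a small illustration ( not the exact analysis code ) , turning the collected per - feature scores into ranks could look like this :

def rank_features(scores, group_size=10):
    # scores: dict feature_id -> list of per-worker scores in {-2, -1, 0, 1, 2};
    # assumes 30 features, so ranks are A (best 10), B (next 10), C (last 10).
    avg = {i: sum(s) / len(s) for i, s in scores.items()}
    ordered = sorted(avg, key=avg.get, reverse=True)   # highest average score first
    return {feat: "ABC"[idx // group_size] for idx, feat in enumerate(ordered)}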
Results and Discussions
Figure 4 shows the distribution of average feature scores from one of the three CNN instances for the Yelp dataset . Examples of the word clouds from each rank are displayed in Figure 5 . We can clearly see dissimilar qualities of the three features . Some participants answered that the rank B feature in Figure 5 was relevant to the positive class ( probably due to the word ' delicious ' ) , and the weights of this feature in W agreed ( Positive : Negative = 0.137:-0.135 ) . Interestingly , the rank C feature in Figure 5 got a negative score because some participants believed that this word cloud was relevant to the positive class , but actually the model used this feature as evidence for the negative class ( Positive : Negative = 0.209:0.385 ) .
Considering all three runs , Figure 6 ( top ) shows the average macro F1 score of the original model ( the blue line ) and of each modified model . The order of the performance drops is AB > A > AC > BC > B > Original > C. This makes sense because disabling important features ( rank A and/or B ) caused larger performance drops , and the overall results are consistent with the average feature scores given by the participants ( as in Figure 4 ) . It confirms that using word clouds is an effective way to assess CNN features . Also , it is worth noting that the macro F1 of the model slightly increased when we disabled the low - quality features ( rank C ) . This shows that humans can improve the model by disabling irrelevant features .
The CNNs for the Amazon Products dataset also behaved in a similar way ( Figure 6 -bottom ) , except that disabling rank C features slightly undermined , not increased , performance . This implies that even the rank C features contain a certain amount of useful knowledge for this classifier . 4
Exp 2 : Training Data with Biases
Given a biased training dataset , a text classifier may absorb the biases and produce biased predictions against some sub - populations . We hypothesize that if the biases are captured by some of the learned features , we can apply FIND to disable such features and reduce the model biases .
Datasets and Metrics
We focus on reducing gender bias of CNN models trained on two datasets -Biosbias ( De - Arteaga et al . , 2019 ) and Waseem ( Waseem and Hovy , 2016 ) . For Biosbias , the task is predicting the occupation of a given bio paragraph , i.e. , whether the person is ' a surgeon ' ( class 0 ) or ' a nurse ' ( class 1 ) . Due to the gender imbalance in each occupation , a classifier usually exploits gender information when making predictions . As a result , bios of female surgeons and male nurses are often misclassified . For Waseem , the task is abusive language detection -assessing if a given text is abusive ( class 1 ) or not abusive ( class 0 ) . Previous work found that this dataset contains a strong negative bias against females ( Park et al . , 2018 ) . In other words , texts related to females are usually classified as abusive although the texts themselves are not abusive at all . Also , we tested the models , trained on the Waseem dataset , using another abusive language detection dataset , Wikitoxic ( Thain et al . , 2017 ) , to assess generalizability of the models . To quantify gender biases , we adopted two metrics -false positive equality difference ( FPED ) and false negative equality difference ( FNED ) ( Dixon et al . , 2018 ) .
The lower these metrics are , the less biased the model is .
4 We also conducted the same experiments here with bidirectional LSTM networks ( BiLSTMs ) , which required a different way to generate the word clouds ( see Appendix C ) . The results on BiLSTMs , however , are not as promising as on CNNs . This might be because the way we created word clouds for each BiLSTM feature was not an accurate way to reveal its behavior . Unlike for CNNs , understanding recurrent neural network features for text classification is still an open problem .
Human Feedback Collection and Usage
Unlike the interface in Figure 3 , for each word cloud , we asked the participants to select the relevant class from three options ( Biosbias : surgeon , nurse , it could be either / Waseem : abusive , non - abusive , it could be either ) . The feature will be disabled if the majority vote does not select the class suggested by the weight matrix W. To ensure that the participants do not use their biases while answering our questions , we explicitly stated in the instructions that gender - related terms should not be used as an indicator for one or the other class .
Results and Discussions
The results of this experiment are displayed in Figure 7 . For Biosbias , on average , the participants ' responses led us to disable 11.33 out of 30 CNN features . By doing so , the FPED of the models decreased from 0.250 to 0.163 , and the FNED decreased from 0.338 to 0.149 . After investigating the word clouds of the CNN features , we found that some of them detected patterns containing both gender - related terms and occupation - related terms such as " his surgical expertise " and " she supervises nursing students " . Most of the MTurk participants answered that these word clouds were relevant to the occupations , and thus the corresponding features were not disabled . However , we believe that these features might contain gender biases . So , we asked one annotator to consider all the word clouds again and disable every feature for which the prominent n - gram patterns contained any gender - related terms , no matter whether the patterns also detected occupation - related terms . With this new disabling policy , 12 out of 30 features were disabled on average , and the model biases further decreased , as shown in Figure 7 ( Debugged ( One ) ) . The side - effect of disabling 33 % of all the features here was only a slight drop in the macro F1 from 0.950 to 0.933 . Hence , our framework was successful in reducing gender biases without severe negative effects on classification performance .
Concerning the abusive language detection task , on average , the MTurk participants ' responses led us to disable 12 out of 30 CNN features . Unlike Biosbias , disabling features based on MTurk responses unexpectedly increased the gender bias for both the Waseem and Wikitoxic datasets . However , we made a finding similar to Biosbias : many of the CNN features captured n - grams which were both abusive and related to a gender , such as ' these girls are terrible ' and ' of raping slave girls ' , and these features were not yet disabled . So , we asked one annotator to disable the features using the new " brutal " policy - disabling all features which involved gender words even though some of them also detected abusive words . By disabling 18 out of 30 features on average , the gender biases were reduced for both datasets ( except FPED on Wikitoxic , which stayed close to the original value ) . Another consequence was that we sacrificed 4 % and 1 % macro F1 on the Waseem and Wikitoxic datasets , respectively . This finding is consistent with Park et al . ( 2018 ) in that reducing the bias and maintaining the classification performance at the same time is very challenging .
Exp 3 : Dataset Shift
Dataset shift is a problem where the joint distribution of inputs and outputs differs between training and test stage ( Quionero - Candela et al . , 2009 ) . Many classifiers perform poorly under dataset shift because some of the learned features are inapplicable ( or sometimes even harmful ) to classify test documents . We hypothesize that FIND is useful for investigating the learned features and disabling the overfitting ones to increase the generalizability of the model .
Datasets
We considered two tasks in this experiment . The first task aims to classify " Christianity " vs " Atheism " documents from the 20 Newsgroups dataset 5 . This dataset is special because it contains a lot of artifacts -tokens ( e.g. , person names , punctuation marks ) which are not relevant , but strongly co - occur with one of the classes . For evaluation , we used the Religion dataset by Ribeiro et al . ( 2016 ) , containing " Christianity " and " Atheism " web pages , as a target dataset . The second task is sentiment analysis . We used , as a training dataset , Amazon Clothes , with reviews of clothing , shoes , and jewelry products ( He and McAuley , 2016 ) , and as test sets three out - of - distribution datasets -Amazon Music ( He and McAuley , 2016 ) , Amazon Mixed , and the Yelp dataset ( which was used in Experiment 1 ) . Amazon Music contains only reviews from the " Digital Music " product category which was found to have an extreme distribution shift from the clothes category ( Hendrycks et al . , 2020 ) . Amazon Mixed compiles the reviews from various kinds of products , while Yelp focuses on restaurant reviews .
Human Feedback Collection and Usage
We collected responses from MTurk workers using the same user interfaces as in Experiment 2 . Simply put , we asked the workers to select a class which was relevant to a given word cloud and checked if the majority vote agreed with the weights in W.
Results and Discussions
For the first task , on average , 14.33 out of 30 features were disabled , and the macro F1 scores of 20Newsgroups before and after debugging are 0.853 and 0.828 , respectively . The same metrics on the Religion dataset are 0.731 and 0.799 . This shows that disabling irrelevant features mildly undermined the predictive performance on the in - distribution dataset , but clearly enhanced the performance on the out - of - distribution dataset ( see Figure 8 , left ) . This is especially evident for the Atheism class , for which the F1 score increased by around 15 % absolute . We noticed from the word clouds that many prominent words for the Atheism class learned by the models are person names ( e.g. , Keith , Gregg , Schneider ) and these are not applicable to the Religion dataset . Forcing the models to use only relevant features ( detecting terms like ' atheists ' and ' science ' ) , therefore , increased the macro F1 on the Religion dataset .
Unlike 20Newsgroups , Amazon Clothes does not seem to have obvious artifacts . Still , the responses from crowd workers suggested that we disable 6 features . The disabled features were correlated with , but not the reason for , the associated class . For instance , one of the disabled features was highly activated by the pattern " my .... year old " which often appeared in positive reviews such as " my 3 year old son loves this . " . However , these correlated features are not very useful for the three out - of - distribution datasets ( Music , Mixed , and Yelp ) . Disabling them made the model focus more on the right evidence and increased the average macro F1 for the three datasets , as shown in Figure 8 ( right ) . Nonetheless , the performance improvement here was not as apparent as in the previous task because , even without feature disabling , the majority of the features are relevant to the task and can lead the model to the correct predictions in most cases . 6
Discussion and Conclusions
We proposed FIND , a framework which enables humans to debug deep text classifiers by disabling irrelevant or harmful features . Using the proposed framework on CNN text classifiers , we found that ( i ) word clouds generated by running LRP on the training data accurately revealed the behaviors of CNN features , ( ii ) some of the learned features might be more useful to the task than the others and ( iii ) disabling the irrelevant or harmful features could improve the model predictive performance and reduce unintended biases in the model .
Generalization to Other Models
In order to generalize the framework beyond CNNs , there are two questions to consider . First , what is an effective way to understand each feature ? We exemplified this with two word clouds representing each BiLSTM feature in Appendix C , and we plan to experiment with advanced visualizations such as LSTMVis ( Strobelt et al . , 2018 ) in the future . Second , can we make the model features more interpretable ? For example , using ReLU as activation functions in LSTM cells ( instead of tanh ) renders the features non - negative . So , they can be summarized using one word cloud which is more practical for debugging .
In general , the principle of FIND is understanding the features and then disabling the irrelevant ones . The process makes visualizations and interpretability more actionable . Over the past few years , we have seen rapid growth of scientific research in both topics ( visualizations and interpretability ) aiming to understand many emerging advanced models including the popular transformer - based models ( Jo and Myaeng , 2020;Voita et al . , 2019;Hoover et al . , 2020 ) . We believe that our work will inspire other researchers to foster advances in both topics towards the more tangible goal of model debugging .
Generalization to Other Tasks
FIND is suitable for any text classification tasks where a model might learn irrelevant or harmful features during training . It is also convenient to use since only the trained model and the training data are required as input . Moreover , it can address many problems simultaneously such as removing religious and racial bias together with gender bias even if we might not be aware of such problems before using FIND . In general cases , FIND is at least useful for model verification .
For future work , it would be interesting to extend FIND to other NLP tasks , e.g. , question answering and natural language inference . This will require some modifications to understand how the features capture relationships between two input texts .
Limitations
Nevertheless , FIND has some limitations . First , the word clouds may reveal sensitive contents in the training data to human debuggers . Second , the more hidden features the model has , the more human effort FIND needs for debugging . For instance , BERT - base ( Devlin et al . , 2019 ) has 768 features ( before the final dense layer ) , which would require a lot of human effort to investigate . In this case , it would be more efficient to use FIND to disable attention heads rather than individual features ( Voita et al . , 2019 ) . Third , it is possible that one feature detects several patterns ( Jacovi et al . , 2018 ) and it will be difficult to disable the feature if some of the detected patterns are useful while the others are harmful . Hence , FIND would be more effective when used together with disentangled text representations ( Cheng et al . , 2020 ) .
A Layer - wise Relevance Propagation
Layer - wise Relevance Propagation ( LRP ) is a technique for explaining predictions of neural networks in terms of importance scores of input features ( Bach et al . , 2015 ) . Originally , it was devised to explain predictions of image classifiers by creating a heatmap on the input image highlighting pixels that are important for the classification . Then Arras et al . ( 2016 ) and Arras et al . ( 2017 ) extended LRP to work on CNNs and RNNs for text classification , respectively .
Consider a neuron k whose value is computed using n neurons in the previous layer ,
x_k = g \left( \sum_{j=1}^{n} x_j w_{jk} + b_k \right)
where x k is the value of the neuron k , g is a nonlinear activation function , w jk and b k are weights and bias in the network , respectively . We can see that the contribution of a single node j to the value of the node k is
z_{jk} = x_j w_{jk} + \frac{b_k}{n}
assuming that the bias term b k is distributed equally to the n neurons . LRP works by propagating the activation of a neuron of interest back through the previous layers in the network proportionally . We call the value each neuron receives a relevance score ( R ) of the neuron . To back propagate , if the relevance score of the neuron k is R k , the relevance score that the neuron j receives from the neuron
k is
R_{j \leftarrow k} = \frac{z_{jk}}{\sum_{j'=1}^{n} z_{j'k}} R_k
To make the relevance propagation more stable , we add a small positive number ε ( as a stabilizer ) to the denominator of the propagation rule :
R_{j \leftarrow k} = \frac{z_{jk}}{\epsilon + \sum_{j'=1}^{n} z_{j'k}} R_k
We used this propagation rule , so - called LRP - ε , in the experiments of this paper . For more details about LRP propagation rules , please see Montavon et al . ( 2019 ) .
To explain a prediction of a CNN text classifier , we propagate an activation value of the output node back to the word embedding matrix . After that , the relevance score of an input word equals the sum of relevance scores each dimension of its word vector receives . However , in this paper , we want to analyze the hidden features rather than the output , so we start back propagating from the hidden features instead to capture patterns of input words which highly activate the features .
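For illustration , the ε - rule above for a single dense layer can be written in numpy as follows ( a sketch , not the iNNvestigate implementation actually used in the experiments ) :

import numpy as np

def lrp_epsilon_dense(x, W, b, R_out, eps=1e-7):
    # x: [n] inputs, W: [n, m] weights, b: [m] biases, R_out: [m] output relevances.
    n = x.shape[0]
    Z = x[:, None] * W + b[None, :] / n    # z_jk for every input j and output k
    denom = Z.sum(axis=0) + eps            # eps-stabilized normalizer per output neuron
    return (Z / denom) @ R_out             # relevance received by each input neuron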
B Multiclass Classification
As shown in Figure 9 , we used a slightly different user interface in Experiment 1 for the Amazon Products dataset , which is a multiclass classification task . In this setting , we did not provide the options for mostly and partly relevant ; otherwise , there would have been nine options per question , which would have been too many for the participants to answer accurately . With the user interface in Figure 9 , we gave a score to the feature f i based on the participant 's answer . Specifically , we re - scaled the values in the i th column of W to be in the range [ 0,1 ] using min - max normalization and gave the normalized value of the chosen class as a score to the feature f i . If the participant selected None , the feature got a zero score . The distribution of the average feature scores for this task ( one CNN ) is displayed in Figure 10 .
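As a small illustration ( not the exact scoring code ) , the min - max scoring above could be written as :

import numpy as np

def multiclass_feature_score(W, i, chosen_class):
    # W: |C| x d weight matrix; i: feature index; chosen_class: class id or None.
    if chosen_class is None:                   # the "None" option scores zero
        return 0.0
    col = W[:, i]
    rng = col.max() - col.min()
    normalized = (col - col.min()) / rng if rng > 0 else np.zeros_like(col)
    return float(normalized[chosen_class])     # normalized weight of the chosen class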
C Bidirectional LSTM networks
To understand BiLSTM features , we created two word clouds for each feature . The first word cloud contains top three words which gain the highest positive relevance scores from each training example , while the second word cloud does the same but for the top three words which gain the lowest negative relevance scores ( see Figure 11 ) .
Furthermore , we also conducted Experiment 1 for BiLSTMs . Each direction of the recurrent layer had 15 hidden units and the feature vector was obtained by taking the element - wise max of all the hidden states ( i.e. , d = 15 × 2 = 30 ) . We adapted the code of Arras et al . ( 2017 ) to run LRP on BiLSTMs . Regarding human feedback collection , we collected feedback from Amazon Mechanical Turk workers by splitting the pair of word clouds and asking the question about the relevant class for each word cloud independently . The answer for the positive relevance word cloud should be consistent with the weight matrix W , while the answer for the negative relevance word cloud should be the opposite of the weight matrix W. The score of a BiLSTM feature is the sum of its scores from the positive word cloud and the negative word cloud .
The results of the extra BiLSTM experiments are shown in Tables 4 and 5 . Table 4 shows unexpected results after disabling features . For instance , disabling rank B features caused a larger performance drop than removing rank A features . This suggests that how we created word clouds for each BiLSTM feature ( i.e. , displaying the top three words with the highest positive and lowest negative relevance ) might not be an accurate way to explain the feature . Nevertheless , another observation from Table 4 is that even when we disabled two - thirds of the BiLSTM features , the maximum macro F1 drop was less than 5 % . This suggests that there is a lot of redundant information in the features of the BiLSTMs .
D Metrics for Biases
In this paper , we used two metrics to quantify biases in the models -False positive equality difference ( FPED ) and False negative equality difference ( FNED ) -with the following definitions ( Dixon et al . , 2018 ) .
FPED = \sum_{t \in T} | FPR - FPR_t |
FNED = \sum_{t \in T} | FNR - FNR_t |
where T is a set of all sub - populations we consider ( i.e. , T = { male , female } ) . FPR and FNR stand for false positive rate and false negative rate , respectively . The subscript t means that we calculate the metrics using data examples mentioning the sub - population t only . We used the following keywords to identify examples which are related to or mentioning the sub - populations . Male gender terms :
" male " , " males " , " boy " , " boys " , " man " , " men " , " gentleman " , " gentlemen " , " he " , " him " , " his " , " himself " , " brother " , " son " , " husband " , " boyfriend " , " father " , " uncle " , " dad " Female gender terms :
" female " , " females " , " girl " , " girls " , " woman " , " women " , " lady " , " ladies " , " she " , " her " , " herself " , " sister " , " daughter " , " wife " , " girlfriend " , " mother " , " aunt " , " mom " . All the bios are from Common Crawl August 2018 Index .
• Biosbias : All the bios are from Common Crawl August 2018 Index .
• Waseem : The authors of Waseem and Hovy ( 2016 ) kindly provided the dataset to us by email . We considered " racism " and " sexism " examples as " Abusive " and " neither " examples as " Non - abusive " .
• Wikitoxic : The dataset can be downloaded here 10 . We used only examples which were given the same label by all the annotators .
• 20Newsgroups : We downloaded the standard splits of the dataset using scikit - learn 11 . The header and the footer of each text were removed .
F Full Experimental Results
Model : CNNs
Test dataset : Yelp
                      Negative F1     Positive F1     Accuracy        Macro F1
Original              0.767 ± 0.02    0.800 ± 0.00    0.785 ± 0.01    0.789 ± 0.01
Disabling ( MTurk )   0.786 ± 0.00    0.804 ± 0.00    0.795 ± 0.00    0.796 ± 0.00
Acknowledgments
We would like to thank Nontawat Charoenphakdee and anonymous reviewers for helpful comments . Also , the first author wishes to thank the support from Anandamahidol Foundation , Thailand .