We released semi-oracle evidence passages for researchers interested in multi-answer extraction and disambiguation rather than retrieval. This document describes how they were obtained, their statistics, and the performance upper bound when using these evidence passages.
The number of Wikipedia articles per question is 3.0 on average.
The json file is a list whose i-th item is a dictionary containing `id`, `question`, and `annotations` (as in the original AmbigQA data), as well as `articles_plain_text` and `articles_html_text`. `articles_plain_text` is a list of articles in plain text (Markdown), such as:
[
"# Dexter (season 1)\n\nThe first season of Dexter is an adaptation of Jeff Lindsay's first novel in a series of the same name, Darkly Dreaming Dexter. ...",
"# Chrisstian Camargo\n\nChristian Camargo is an American actor, producer, writer and director. ... ## Early years\n\nCamargo was born ...",
"# List of Dexter characters\n\nThis is a list of characters ... * Michael C. Hall\n* Maxwell Huckabee (age 3) * Nicholas Vigneau (young Dexter, season 7) ..."
]
`articles_html_text` is a list of articles in HTML format, such as:
[
"<h1>Dexter (season 1)\n\nThe first season of Dexter is an adaptation of Jeff Lindsay's first novel in a series of the same name, Darkly Dreaming Dexter. ...",
"<h1>Chrisstian Camargo</h1>\n\nChristian Camargo is an American actor, producer, writer and director. ... <h2>Early years</h2>\n\nCamargo was born ...",
"<h1>List of Dexter characters</h1>\n\nThis is a list of characters ... <ul><li>Michael C. Hall</li><li>Maxwell Huckabee (age 3)</li><li>Nicholas Vigneau (young Dexter, season 7)</li> ..."
]
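For reference, the file can be loaded with the standard `json` module. The snippet below is a minimal sketch; the file name is a placeholder for whichever split you downloaded.

```python
import json

# Placeholder path; point this at the released semi-oracle evidence file for your split.
with open("semi_oracle_evidence_dev.json") as f:
    data = json.load(f)

example = data[0]
print(example["id"], "|", example["question"])
print(len(example["annotations"]), "annotation(s)")

# Each question comes with the same articles in plain-text and HTML form.
for plain in example["articles_plain_text"]:
    title = plain.splitlines()[0].lstrip("# ")      # the first line holds the article title
    print(title, "-", len(plain.split()), "whitespace tokens")
```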
We recommend using this data if you want to focus on multi-answer extraction and disambiguation given evidence text. An end-to-end QA model is expected to retrieve the evidence text itself, but evidence retrieval is a very difficult problem, and current retrieval models are not good at retrieving high-coverage evidence text (reference: this paper). While we encourage progress on the retrieval side, we are releasing this semi-oracle evidence data so that progress on the subsequent steps is not blocked by progress in retrieval.
While the size of the evidence text can vary in an end-to-end QA model, we set the size of the semi-oracle evidence to approximately 10,000 words, following much recent work in QA that uses 100 passages of 100 words each.
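If your reader expects fixed-size passages, one simple option is to re-split the plain-text articles into 100-word chunks. This is only a sketch of one possible preprocessing choice, not the procedure used in the paper; it reuses the `example` item loaded above.

```python
def split_into_passages(articles_plain_text, words_per_passage=100):
    """Concatenate the evidence articles and split them into fixed-size word chunks (a sketch)."""
    words = " ".join(articles_plain_text).split()
    return [
        " ".join(words[i:i + words_per_passage])
        for i in range(0, len(words), words_per_passage)
    ]

# With ~10,000 words of evidence, this yields roughly 100 passages of 100 words each.
passages = split_into_passages(example["articles_plain_text"])
print(len(passages), "passages")
```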
The distribution (%) of the number of articles per question is as follows.

| Split | 1 | 2 | 3 | 4+ |
|---|---|---|---|---|
| Train | 0.1 | 0.1 | 99.4 | 0.3 |
| Dev | 0.0 | 0.0 | 99.5 | 0.4 |
| Test | 0.0 | 0.0 | 99.5 | 0.5 |
The distribution (%) of the total number of words in the evidence per question is as follows (based on the plain text, whitespace tokenization).

| Split | 0--5000 | 5000--10000 | 10000--15000 | 15000--20000 | 20000-- |
|---|---|---|---|---|---|
| Train | 30.9 | 33.2 | 19.1 | 9.2 | 7.7 |
| Dev | 29.8 | 33.9 | 19.4 | 8.8 | 8.0 |
| Test | 29.0 | 34.8 | 18.0 | 9.8 | 8.3 |
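The word counts above can be approximated with simple whitespace tokenization over the plain-text articles. The sketch below reuses the `data` list loaded earlier and mirrors the 5,000-word bins of the table.

```python
from collections import Counter

def total_words(item):
    """Total number of whitespace tokens across all evidence articles for one question."""
    return sum(len(article.split()) for article in item["articles_plain_text"])

buckets = Counter(min(total_words(item) // 5000, 4) for item in data)
for b in sorted(buckets):
    label = "20000--" if b == 4 else f"{b * 5000}--{(b + 1) * 5000}"
    print(label, round(100.0 * buckets[b] / len(data), 1))
```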
(The performance upper bound is the same for both answer F1 and QG F1.)

| Split | Macro-avg coverage | Perf. upper bound (all) | Perf. upper bound (multi-only) |
|---|---|---|---|
| Train | 78.2 | 80.1 | 77.1 |
| Dev | 84.4 | 86.6 | 82.2 |
| Test | 83.0 | 85.6 | 81.3 |
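As a rough reference for how answer coverage can be measured, the sketch below checks whether each gold answer (any of its acceptable surface forms) appears verbatim in the evidence text and macro-averages over questions. It assumes the original AmbigQA annotation format (`singleAnswer` / `multipleQAs`) and will not exactly reproduce the normalization used for the table.

```python
def answer_coverage(item):
    """Fraction of gold answers whose surface form appears verbatim in the evidence (a sketch)."""
    evidence = " ".join(item["articles_plain_text"]).lower()
    gold = []  # each entry is a list of acceptable surface forms for one answer
    for ann in item["annotations"]:
        if ann["type"] == "singleAnswer":
            gold.append(ann["answer"])
        else:  # "multipleQAs"
            gold.extend(qa["answer"] for qa in ann["qaPairs"])
    covered = sum(any(form.lower() in evidence for form in forms) for forms in gold)
    return covered / len(gold) if gold else 0.0

macro_avg_coverage = 100.0 * sum(answer_coverage(item) for item in data) / len(data)
print(round(macro_avg_coverage, 1))
```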
| Split | 0 | 1 | 2 | 3 | 4+ |
|---|---|---|---|---|---|
| Train | 10.1 | 62.8 | 33.6 | 23.8 | 10.1 |
| Dev | 15.7 | 58.5 | 42.4 | 30.4 | 15.7 |
| Test | 18.8 | 56.1 | 45.6 | 36.0 | 18.8 |
We use the Wikipedia dump of 02/01/2020, the same one used in the AmbigQA paper. We preprocess the dump so that each article includes headers, plain text, and lists (tables and infoboxes are excluded). We exclude disambiguation pages, following prior work (DrQA, DPR, and others).
We look up the annotators' interaction logs and identify positive and negative articles as follows (a sketch of the answer-matching check is given after the list).
- Positive articles: we examine the articles that annotators clicked (or, if they clicked a disambiguation page, the articles linked from that page), and include those that contain any valid answer as positive articles.
- Negative articles: we consider all articles that annotators have seen (even if only the title), including articles returned by the search engine and all articles linked from a disambiguation page. Among those, articles that do not contain any valid answer are considered negative articles.
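A minimal sketch of the answer-matching step described above, assuming `clicked_articles` and `seen_articles` are dictionaries mapping article titles to article text and `valid_answers` is a list of acceptable answer strings (all three are hypothetical names, since the interaction logs themselves are not part of this release).

```python
def label_articles(clicked_articles, seen_articles, valid_answers):
    """Split logged articles into positives and negatives by simple answer matching (a sketch)."""
    def contains_any_answer(text):
        lowered = text.lower()
        return any(answer.lower() in lowered for answer in valid_answers)

    # Positive: clicked (or linked from a clicked disambiguation page) and contains a valid answer.
    positives = {title: text for title, text in clicked_articles.items()
                 if contains_any_answer(text)}
    # Negative: seen by the annotator (even just the title) and contains no valid answer.
    negatives = {title: text for title, text in seen_articles.items()
                 if title not in positives and not contains_any_answer(text)}
    return positives, negatives
```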
Once we obtain positive and negative articles, we create the final set of articles by (1) including all positive articles, and (2) if the number of positive articles is less than 3, sampling negative articles as follows (a sketch of this sampling procedure is given after the list).
- Create a BM25 index using all positive and negative articles.
- Compute a BM25 score for each article, using the question as the query.
- Compute sampling probabilities by taking a softmax over the BM25 scores.
- Sample articles according to these probabilities until the number of unique articles reaches 3.
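A sketch of this sampling procedure, using the `rank_bm25` package for BM25 scoring; the exact BM25 implementation, tokenization, and random seed used to build the released data may differ. `positives` and `negatives` are the dictionaries produced by the labeling sketch above.

```python
import numpy as np
from rank_bm25 import BM25Okapi  # one possible BM25 implementation; the original may differ

def sample_evidence(question, positives, negatives, target=3, seed=0):
    """Keep all positive articles, then sample with softmax(BM25) weights until `target` unique articles."""
    selected = dict(positives)                       # (1) always include every positive article
    if len(selected) >= target or not negatives:
        return selected

    pool = {**positives, **negatives}                # BM25 index over all positive and negative articles
    titles = list(pool)
    bm25 = BM25Okapi([pool[t].lower().split() for t in titles])
    scores = np.array(bm25.get_scores(question.lower().split()))
    probs = np.exp(scores - scores.max())            # softmax over BM25 scores
    probs /= probs.sum()

    rng = np.random.default_rng(seed)
    # (2) sample until we reach `target` unique articles (or run out of candidates)
    while len(selected) < min(target, len(pool)):
        title = titles[rng.choice(len(titles), p=probs)]
        selected.setdefault(title, pool[title])      # re-drawing an already selected article is a no-op
    return selected
```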