This is the repository documenting the paper AmbigQA: Answering Ambiguous Open-domain Questions (EMNLP 2020) by Sewon Min, Julian Michael, Hannaneh Hajishirzi, and Luke Zettlemoyer.
- Website
- Read the paper
- Download the dataset: AmbigNQ light ver. / AmbigNQ full ver. / AmbigNQ evidence ver. (new!) / NQ-open
- Update (07/2020): Try running the baseline code
- Update (11/2021): We released semi-oracle evidence passages for researchers interested in multi-answer extraction and disambiguation rather than retrieval. Please read evidence.md for details.
If you find the AmbigQA task or AmbigNQ dataset useful, please cite our paper:
@inproceedings{ min2020ambigqa,
title={ {A}mbig{QA}: Answering Ambiguous Open-domain Questions },
author={ Min, Sewon and Michael, Julian and Hajishirzi, Hannaneh and Zettlemoyer, Luke },
booktitle={ EMNLP },
year={2020}
}
Please also make sure to credit and cite the creators of Natural Questions, the dataset on which ours is built:
@article{ kwiatkowski2019natural,
title={ Natural questions: a benchmark for question answering research},
author={ Kwiatkowski, Tom and Palomaki, Jennimaria and Redfield, Olivia and Collins, Michael and Parikh, Ankur and Alberti, Chris and Epstein, Danielle and Polosukhin, Illia and Devlin, Jacob and Lee, Kenton and others },
journal={ Transactions of the Association for Computational Linguistics },
year={ 2019 }
}
Click here to download the light version of the data (1.1M).
Click here to download the full version of the data (18M).
We provide two distributions of our new dataset AmbigNQ: a full version with all annotation metadata and a light version with only inputs and outputs.
The full version contains
- train.json (47M)
- dev.json (17M)
The light version contains
- train_light.json (3.3M)
- dev_light.json (977K)
train.json and dev.json each contain a list of dictionaries, where each dictionary represents a single datapoint with the following keys:
- id (string): an identifier for the question, consistent with the original NQ dataset.
- question (string): a question. This is identical to the question in the original NQ, except that we postprocess the string to start with an uppercase letter and end with a question mark.
- annotations (a list of dictionaries): a list of all acceptable outputs, where each output is a dictionary that represents either a single answer or multiple question-answer pairs.
  - type: singleAnswer or multipleQAs
  - (If type is singleAnswer) answer: a list of strings that are all acceptable answer texts
  - (If type is multipleQAs) qaPairs: a list of dictionaries with question and answer. question is a string, and answer is a list of strings that are all acceptable answer texts
- viewed_doc_titles (a list of strings): a list of titles of Wikipedia pages viewed by crowdworkers during annotation. This is an underestimate, since Wikipedia pages reached through hyperlinks are not included. Note that this should not be the input to a system. It is fine to use it as extra supervision, but please keep in mind that it is an underestimate.
- used_queries (a list of dictionaries): a list of dictionaries containing the search queries and results that were used by crowdworkers during annotation. Each dictionary contains query (a string) and results (a list of dictionaries containing title and snippet). Search results are obtained through the Google Search API restricted to Wikipedia (details in the paper). Note that this should not be the input to a system. It is fine to use it as extra supervision.
- nq_answer (a list of strings): the list of annotated answers in the original NQ.
- nq_doc_title (string): an associated Wikipedia page title in the original NQ.
{train|dev}_light.json are formatted the same way, but only contain id, question and annotations.
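As a minimal sketch of how the annotations structure described above might be consumed, the following Python snippet flattens one datapoint into (question, answers) pairs. The helper name iter_reference_answers and the sample datapoint are illustrative assumptions, not part of the released code or data.

```python
import json


def iter_reference_answers(datapoint):
    """Yield (question, answers) pairs for every annotation of one datapoint.

    singleAnswer annotations are paired with the original question;
    multipleQAs annotations yield one pair per disambiguated question.
    """
    for annotation in datapoint["annotations"]:
        if annotation["type"] == "singleAnswer":
            yield datapoint["question"], annotation["answer"]
        elif annotation["type"] == "multipleQAs":
            for qa in annotation["qaPairs"]:
                yield qa["question"], qa["answer"]


# Made-up datapoint in the light format (NOT taken from the dataset).
# With the real files you would instead do something like:
#     with open("dev_light.json", encoding="utf-8") as f:
#         data = json.load(f)
sample = json.loads("""
[{"id": "example-1",
  "question": "When was the example city founded?",
  "annotations": [
    {"type": "multipleQAs",
     "qaPairs": [
       {"question": "When was the example city first settled?", "answer": ["1624"]},
       {"question": "When was the example city renamed?", "answer": ["1664"]}]},
    {"type": "singleAnswer", "answer": ["1624", "in 1624"]}]}]
""")

pairs = list(iter_reference_answers(sample[0]))
for question, answers in pairs:
    print(question, answers)
```

Because a datapoint can carry several annotations of different types, iterating over all of them (rather than only the first) keeps every acceptable output available.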
Click here to download the data (575M).
Please read evidence.md for details.
The evidence version contains
- train_with_evidence_articles.json (1.2G)
- dev_with_evidence_articles.json (241M)
- test_with_evidence_articles_without_answers.json (245M)
They contain a list of dictionaries, each representing a single datapoint, as above. In addition to id, question and annotations (omitted in the test data), each dictionary contains
- articles_plain_text: a list of articles in plain text.
- articles_html_text: a list of articles in HTML.
To evaluate your model on the test data, follow the Leaderboard submission guide to submit your model predictions on the test questions.
Click here to download the data (3.9M).
We release our split of NQ-open, for comparison and use as weak supervision:
- nqopen-train.json (9.7M)
- nqopen-dev.json (1.1M)
- nqopen-test.json (489K)
Each file contains a list of dictionaries, each representing a single datapoint, with the following keys:
- id (string): an identifier that is consistent with the original NQ.
- question (string): a question.
- answer (a list of strings): a list of acceptable answer texts.
- docs.db: sqlite db that is consistent with DrQA, containing plain text only, no disambiguation pages
- docs-html.db: sqlite db that is consistent with DrQA, containing html, no disambiguation pages
- Top 100 Wikipedia passages retrieved from Dense Passage Retrieval
The evaluation script is here. It has been tested on Python 3.5 and 3.6.
Step 1. Follow the instructions in coco-caption for setup. If you only want to compute answer F1, you can skip this step.
Step 2. Run the evaluation script via
python ambigqa_evaluation_script.py --reference_path {reference data file} --prediction_path {prediction file}
The prediction file should be a JSON file containing a dictionary that maps each id to a prediction object. A prediction object should be in one of the following formats:
- a list of strings (answers), if you only want to compute answer F1.
- a list of dictionaries with "question" and "answer" as keys, if you want to compute full metrics.
Example:
To only compute answer F1:
{
"-6631842452804060768": ["1624", "1664"],
...
}
To compute full metrics:
{
"-6631842452804060768": [
{"question": "When was city of new york city founded with dutch protection?", "answer": "1624"},
{"question": "When was city of new york city founded and renamed with english name?", "answer": "1664"}
],
...
}
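The two formats above could be produced from model outputs roughly as follows. This is a sketch only: build_predictions, write_predictions and the layout of model_outputs (id mapped to a list of (question, answer) tuples) are hypothetical names chosen for illustration, and the output file name is a placeholder.

```python
import json


def build_predictions(model_outputs, full_metrics=False):
    """Convert {id: [(question, answer), ...]} into a prediction dictionary.

    With full_metrics=False, each value is a list of answer strings
    (enough for answer F1). With full_metrics=True, each value is a list
    of {"question": ..., "answer": ...} dictionaries.
    """
    predictions = {}
    for example_id, qa_pairs in model_outputs.items():
        if full_metrics:
            predictions[example_id] = [
                {"question": q, "answer": a} for q, a in qa_pairs
            ]
        else:
            predictions[example_id] = [a for _, a in qa_pairs]
    return predictions


def write_predictions(predictions, path):
    """Write the prediction dictionary as a single JSON object."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(predictions, f, ensure_ascii=False)


# Same id and strings as the example above.
model_outputs = {
    "-6631842452804060768": [
        ("When was city of new york city founded with dutch protection?", "1624"),
        ("When was city of new york city founded and renamed with english name?", "1664"),
    ]
}
answers_only = build_predictions(model_outputs)
full = build_predictions(model_outputs, full_metrics=True)
# write_predictions(full, "predictions.json")  # placeholder file name
```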
Try running the baseline code (instructions in its README), which includes DPR retrieval, the DPR reader, and SpanSeqGen, with code and scripts for both NQ-open and AmbigNQ.
Create a prediction file using the questions on NQ-open test data, and email it to Sewon Min.
Please make sure you include the following in the email:
- test prediction file. Make sure that the format is in line with the official evaluation script. As you are not supposed to know which subset of the NQ-open test set is AmbigNQ, your file should contain predictions for all NQ-open test examples.
- whether the prediction is in the standard setting or zero-shot setting, i.e. whether the model was trained on AmbigNQ train data or not.
- the name of the model
- [optional] dev prediction file and expected dev results. This is to double-check there is no unexpected problem.
- [optional] the institution, and link to the paper/code/demo. They can be updated later.
If you use semi-oracle articles described here, please make sure to mention it.
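Before sending a submission, it may help to confirm that the prediction file covers every NQ-open test question. The following is a small sanity-check sketch, not part of the official tooling; check_coverage is a hypothetical helper and the ids in the demo are placeholders.

```python
def check_coverage(predictions, test_questions):
    """Compare prediction ids against the ids in the NQ-open test list.

    Returns (missing, extra): ids that have no prediction, and predicted
    ids that do not occur in the test questions, both sorted.
    """
    test_ids = {datapoint["id"] for datapoint in test_questions}
    predicted_ids = set(predictions)
    missing = sorted(test_ids - predicted_ids)
    extra = sorted(predicted_ids - test_ids)
    return missing, extra


# Placeholder data. With the real files you would json.load
# nqopen-test.json and your own prediction file instead.
test_questions = [
    {"id": "q1", "question": "Example question one?"},
    {"id": "q2", "question": "Example question two?"},
]
predictions = {"q1": ["answer a"], "q3": ["answer b"]}

missing, extra = check_coverage(predictions, test_questions)
print("missing:", missing)
print("extra:", extra)
```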
Notes
- Models will be sorted by F1 answer (all) + F1 edit-f1 (standard) or F1 answer (all) (zero-shot).
- Please allow up to one week to receive the test numbers and/or for your numbers to appear on the leaderboard.
- We limit the number of submissions to 20 per year and 5 per month.