Understanding of natural language + knowledge about the world
Challenging task of reading comprehension
Large datasets drove progress in other fields: ImageNet (Deng et al.), Penn Treebank for syntactic parsing (Marcus et al.)
Shortcomings of existing reading comprehension datasets:
- high in quality, but too small for training
- large, but semi-synthetic
The answer to each question is a text segment (span) from the passage.
Distances in dependency trees are used to quantify the diversity of question and answer types.
Implemented a logistic regression model.
Hirschman et al. (1999) - curated a dataset of 600 3rd-6th grade reading comprehension questions.
Syntactic divergence
Candidate answers were generated by Stanford CoreNLP
Sliding window approach + distance-based extension by Richardson et al. (2013)
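A rough sketch of the sliding window idea, to make the note above concrete. It assumes the usual formulation: take the set of words in the question plus a candidate answer, slide a window of that size over the passage, and keep the best sum of inverse-count weights for matching words. The function names and the toy passage are made up for illustration, and the distance-based extension is not included.

```python
import math
from collections import Counter


def inverse_count_weights(passage_tokens):
    """Weight each passage word by log(1 + 1/count), so rare words count more."""
    counts = Counter(passage_tokens)
    return {word: math.log(1.0 + 1.0 / n) for word, n in counts.items()}


def sliding_window_score(passage_tokens, question_tokens, candidate_tokens):
    """Best summed weight of target words inside any window of the passage."""
    targets = set(question_tokens) | set(candidate_tokens)
    weights = inverse_count_weights(passage_tokens)
    window = len(targets)  # window size = number of distinct target words
    best = 0.0
    for start in range(max(1, len(passage_tokens) - window + 1)):
        span = passage_tokens[start:start + window]
        score = sum(weights[w] for w in span if w in targets)
        best = max(best, score)
    return best


# Toy data (hypothetical, not from the dataset).
passage = "the cat sat on the mat while the dog slept near the door".split()
question = "where did the cat sit".split()
candidates = ["on the mat".split(), "near the door".split()]

scores = [sliding_window_score(passage, question, c) for c in candidates]
best_candidate = candidates[scores.index(max(scores))]
```

Here the first candidate should win, because its words sit in the same window as "cat" from the question.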
Logistic Regression
Features (bold are most important):
- matching word frequencies (sum of the tf-idf)
- matching bigram frequencies (generalization of the tf-idf described in Shirakawa et al. (2015))
- root match (dependency parse tree roots)
- lengths
- span word frequencies
- constituent label
- span POS tags
- lexicalized features
- dependency tree paths
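A minimal sketch of how two of the features above (matching word frequencies and lengths) could feed a logistic-regression-style scorer over candidates. This is an illustration under assumptions, not the paper's implementation: the helper names are invented, each candidate is represented only by its sentence, and the weights are hand-set rather than learned.

```python
import math
from collections import Counter


def idf_table(documents):
    """Inverse document frequency for every word in a list of token lists."""
    doc_freq = Counter(word for doc in documents for word in set(doc))
    return {w: math.log(len(documents) / df) for w, df in doc_freq.items()}


def matching_word_feature(question, sentence, idf):
    """Sum of tf-idf over words shared by the question and the sentence."""
    tf = Counter(sentence)
    return sum(tf[w] * idf.get(w, 0.0) for w in set(question) if w in tf)


def softmax(scores):
    """Turn per-candidate scores into a probability distribution."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Toy data (hypothetical): one sentence per candidate answer.
sentences = [
    "the cat sat on the mat".split(),
    "the dog slept near the door".split(),
]
question = "where did the cat sit".split()
idf = idf_table(sentences)

# Assumed, hand-set weights for [matching tf-idf, sentence length];
# a real model would learn these from training data.
weights = [1.0, -0.01]
features = [[matching_word_feature(question, s, idf), len(s)] for s in sentences]
scores = [sum(w * x for w, x in zip(weights, f)) for f in features]
probs = softmax(scores)
```

Note that "the" appears in both sentences, so its idf is zero and only "cat" contributes to the matching feature, which is why the first candidate gets the higher probability.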