squad.md


Understanding of natural language + knowledge about the world

Challenging task of reading comprehension

Large, high-quality datasets drive progress: ImageNet (Deng et al.), Penn Treebank for syntactic parsing (Marcus et al.)

Shortcomings of existing reading comprehension datasets:

  1. high-quality datasets are too small for training
  2. large datasets are semi-synthetic

The answer to each question is a text segment (a span) from the passage.
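A minimal sketch of what "answer as a span" means in practice: the answer is identified by character offsets into the passage rather than by free text. The `Span` class, the `find_span` helper, and the example passage are hypothetical names and data chosen for illustration, not part of the dataset's tooling.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Span:
    """Hypothetical container: a half-open character range [start, end)."""
    start: int
    end: int

    def text(self, passage: str) -> str:
        # Recover the answer string by slicing the passage.
        return passage[self.start:self.end]


def find_span(passage: str, answer: str) -> Optional[Span]:
    """Locate the first occurrence of `answer` in `passage`, if any."""
    start = passage.find(answer)
    if start == -1:
        return None
    return Span(start, start + len(answer))


# Illustrative passage, written for this example.
passage = "Smaller droplets merge through collision with other drops."
span = find_span(passage, "collision")
answer_text = span.text(passage) if span else None
```

Storing offsets instead of strings keeps the answer tied to one location in the passage, which matters when the same words occur more than once.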

Distances in dependency trees are used to quantify diversity (of question and answer types).

Implemented a logistic regression.
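One way such a model could pick an answer is sketched below: each candidate span gets a feature vector, a linear score is computed, and a softmax turns the scores into probabilities over candidates. This is an assumption about the setup, not a reproduction of the paper's model; the weights and feature values are made up for illustration.

```python
import math
from typing import List


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def candidate_probabilities(features: List[List[float]],
                            weights: List[float]) -> List[float]:
    """Score each candidate with a dot product, then normalize."""
    scores = [sum(w * x for w, x in zip(weights, feats)) for feats in features]
    return softmax(scores)


# Hypothetical weights and per-candidate features (not learned values).
weights = [2.0, 0.5, -0.1]
features = [
    [0.1, 0.0, 5.0],  # candidate 0
    [0.9, 1.0, 2.0],  # candidate 1
    [0.4, 0.0, 3.0],  # candidate 2
]
probs = candidate_probabilities(features, weights)
best = max(range(len(probs)), key=probs.__getitem__)
```

Training would adjust `weights` to raise the probability of the correct span; that step is omitted here.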

Hirschman et al. (1999) - curated a dataset of 600 3rd-6th grade reading comprehension questions.

Syntactic divergence

Candidate answers: constituents from the parses generated by Stanford CoreNLP

Sliding window approach + distance-based extension by Richardson et al. (2013)
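A simplified sketch of a sliding-window scorer in the spirit of that baseline: each candidate is scored by the best passage window's overlap with the question and candidate words, with rarer passage words weighted more heavily. The whitespace tokenizer, the function names, and the example texts are assumptions for illustration, and the distance-based extension is not implemented.

```python
import math
from collections import Counter
from typing import List, Set


def tokenize(text: str) -> List[str]:
    # Naive whitespace tokenizer; a real system would use a proper one.
    return text.lower().split()


def sliding_window_score(passage_tokens: List[str], targets: Set[str]) -> float:
    """Best overlap score of any window of len(targets) tokens."""
    counts = Counter(passage_tokens)
    size = len(targets)
    best = 0.0
    for start in range(len(passage_tokens)):
        window = passage_tokens[start:start + size]
        # Inverse-count weight: words that are rare in the passage count more.
        score = sum(math.log(1 + 1 / counts[t]) for t in window if t in targets)
        best = max(best, score)
    return best


def pick_answer(passage: str, question: str, candidates: List[str]) -> str:
    """Return the candidate whose words best co-occur with the question."""
    passage_tokens = tokenize(passage)
    question_words = set(tokenize(question))
    return max(
        candidates,
        key=lambda c: sliding_window_score(
            passage_tokens, question_words | set(tokenize(c))
        ),
    )


passage = "the cat sat on the mat this morning while the dog slept in the yard"
question = "where has the dog slept"
chosen = pick_answer(passage, question, ["on the mat", "in the yard"])
```

The second candidate wins here because its words sit in the same window as "dog" and "slept", while "mat" is too far from them to share a window with both.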

Logistic Regression

Features (bold are most important):

  • matching word frequencies (sum of the tf-idf)
  • matching bigram frequencies (generalization of the tf-idf described in Shirakawa et al. (2015))
  • root match (dependency parse tree roots)
  • lengths
  • span word frequencies
  • constituent label
  • span POS tags
  • lexicalized
  • dependency tree paths
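As an illustration of the first feature in the list, the sketch below sums TF-IDF weights over the words shared by the question and a sentence. The IDF formula (`log(N / df)`), the whitespace tokenizer, and the toy sentences are assumptions made for this example; the original feature may be computed differently.

```python
import math
from collections import Counter
from typing import Dict, List


def tokenize(text: str) -> List[str]:
    # Naive whitespace tokenizer, used only for this sketch.
    return text.lower().split()


def idf_table(documents: List[str]) -> Dict[str, float]:
    """Inverse document frequency, log(N / df), for every word seen."""
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(tokenize(doc)))
    n_docs = len(documents)
    return {word: math.log(n_docs / df) for word, df in doc_freq.items()}


def matching_tfidf(question: str, sentence: str, idf: Dict[str, float]) -> float:
    """Sum of TF-IDF over words occurring in both question and sentence."""
    question_words = set(tokenize(question))
    term_freq = Counter(tokenize(sentence))
    return sum(
        term_freq[w] * idf.get(w, 0.0) for w in question_words if w in term_freq
    )


# Toy sentences written for this example.
sentences = [
    "rain forms when droplets collide",
    "snow forms in cold clouds",
    "the sun heats the ground",
]
idf = idf_table(sentences)
question = "how does rain form when droplets collide"
scores = [matching_tfidf(question, s, idf) for s in sentences]
```

Only exact token matches count in this sketch, so "form" does not match "forms"; stemming or lemmatization would change the score.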