Questions about FusionNet on SQuAD #1
Comments
I'm new to all this, so take my ideas with a serious grain of salt. My guess, based on my understanding of the paper and your comment, is that the POS and NER features may be the problem here. I'm guessing the POS and NER tags themselves are not trainable and are obtained through other means, since they represent parts of speech and named entities. I'm not sure how the encodings are produced, but my guess is that they are fixed (the same way GloVe and CoVe are fixed) and not trainable. Hope this helps. Cheers.
In fact, if you look at the code, there is a preprocessing step that shows how the entities are set:
Hi @ioana-blue, The code you mentioned (https://github.com/momohuang/FusionNet-NLI/blob/master/prepro.py#L105) is only used to get the set of POS and NER tags and create a dictionary that maps them to indices during pre-processing, which then serves as the input to the embedding layers.
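For illustration, a minimal sketch of that flow (the variable names are hypothetical, not taken from the repo): pre-processing builds a tag-to-index dictionary, and the indices later look up a trainable embedding layer.

```python
import torch
import torch.nn as nn

# Pre-processing: collect the set of tags and map each to an index (hypothetical names).
pos_tags = sorted({'DT', 'JJ', 'NN', 'VBZ'})           # in practice, gathered from the corpus
pos2idx = {tag: i for i, tag in enumerate(pos_tags)}

# Model side: the indices feed a small trainable embedding layer.
pos_emb = nn.Embedding(len(pos2idx), 12)               # 12-d POS embedding

# A tagged sentence becomes a LongTensor of indices, then dense vectors.
idx = torch.tensor([pos2idx[t] for t in ['DT', 'JJ', 'NN']])
vectors = pos_emb(idx)                                 # shape: (3, 12)
```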
Thank you for clarifying. I'm wondering why the POS and NER embeddings get trained as well, when they could potentially be initialized from "traditional" NLP tools. Do you understand why training them is better?
Hi @ioana-blue, @momohuang: Could you answer my questions? Thanks. |
Hi @felixgwu and @ioana-blue, Thank you both for the active discussion! I will first give my thoughts on the POS and NER embeddings: the original DrQA paper (https://arxiv.org/pdf/1704.00051.pdf) did not elaborate on how the POS and NER embeddings are incorporated. Around last summer, the DrQA implementation that @felixgwu used (see this older version) transformed the POS and NER tags into low-dimensional embeddings (similar to word embeddings) rather than the current one-hot encoding. I simply followed this design choice and it remains this way. I am not sure why they changed it afterward, but I think it might not be critical to the performance; both designs make some sense to me.
For the original problem, I think there may be several subtle reasons for the 1% difference that @felixgwu observed. When I directly ran this DrQA implementation last summer, I also observed performance ~1% lower than reported. This paper by Tao Lei also found a similar problem. My guess is that when people tune the performance on a certain GPU, it can be a bit lower when running on other GPU architectures. However, I managed to reach and surpass DrQA's reported performance by making some other changes. Below is a list of things that you can try.
Hi @momohuang, Thanks for your helpful explanations! I'll definitely try them out. |
Hi @felixgwu, I always initialize them with a uniform distribution between -1 and 1.
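In PyTorch terms, that initialization might look roughly like the following (the vocabulary sizes are placeholders, and the 12-d/8-d sizes come from the issue description):

```python
import torch.nn as nn

num_pos_tags, num_ner_tags = 50, 20      # placeholder vocabulary sizes

pos_emb = nn.Embedding(num_pos_tags, 12)
ner_emb = nn.Embedding(num_ner_tags, 8)
nn.init.uniform_(pos_emb.weight, -1.0, 1.0)
nn.init.uniform_(ner_emb.weight, -1.0, 1.0)
```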
Hello @felixgwu, Do you have the code somewhere for FusionNet on the SQuAD dataset, modifying DrQA? Thanks in advance.
Hi @rsanjaykamath, Unfortunately, I haven't made it public yet. |
OK, please let us know if you make it public. I'm trying to use the fully-aware attention in my model. As a start, I'd use your inputs above and the code from this repo. Thanks.
Same question. Thanks. |
Dear authors,
Thank you for sharing this awesome work. I really like ideas of using all history of words to produce the attention as well as the newly proposed symmetric form for attentions.
Recently, I have been trying to apply the layers in this repo to the DrQA code to reproduce your FusionNet results on the SQuAD dataset.
However, I can only get dev EM: 73.88 / F1: 82.23, which is much worse than the numbers (EM: 75.3 / F1: 83.6) reported in the paper.
I also observe that the scores (dev EM: 59.35 and F1: 59.35) after epoch 1 are worse than the ones shown in Figures 5 & 6 in the paper.
I wonder if there is something I misunderstand about the paper, or if I apply the dropouts in the wrong places.
Here is the model I created:
I apply sequential dropout (also called variational dropout) with drop rate 0.4 to all embeddings, additional features, inputs of all LSTMs, inputs of GRUCell, inputs of FullAttentions, and inputs of BilinearSeqAttn.
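For reference, a minimal sketch of sequential (variational) dropout in this sense, assuming inputs of shape (batch, seq_len, dim); this is one reading of the technique, not code from either repo:

```python
import torch

def sequential_dropout(x, p=0.4, training=True):
    """Drop the same feature channels at every time step: one Bernoulli mask
    per (example, feature), shared across the sequence dimension."""
    if not training or p == 0:
        return x
    mask = x.new_empty(x.size(0), 1, x.size(2)).bernoulli_(1 - p) / (1 - p)
    return x * mask
```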
Here are the explanations of the modules:
The qemb_match module does the Fully-Aware Multi-level Fusion: Word-level in the paper.
The doc_rnn and question_rnn modules produce the low-level and high-level concepts.
The question_urnn generates the understanding vectors of the question.
The multi_level_fusion module corresponds to the Fully-Aware Multi-level Fusion: Higher-level.
The doc_urnn extracts the understanding vectors of the context using the concatenation of the outputs of doc_rnn and multi_level_fusion as its input.
The self_boost_fusion module is for Fully-Aware Self-Boosted Fusion.
The self_attn is used to provide a weighted sum of the understanding vectors of the question.
The start_attn predicts the start of the span.
The end_gru generates the v^Q in the paper which is used to predict the end of the span.
The end_attn predicts the end of the span.
The FullAttention and MTLSTM are from this repo, and the BilinearSeqAttn, LinearSeqAttn, and StackedBRNN are borrowed from the DrQA code.
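For readers trying to follow along, here is a simplified, untested sketch of the fully-aware attention scoring described in the paper, score(x_i, y_j) = ReLU(U x_i)^T D ReLU(U y_j) with D diagonal; it is not the FullAttention code from this repo, and the class and argument names are illustrative only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FullyAwareAttention(nn.Module):
    """Symmetric attention form: score(x_i, y_j) = ReLU(U x_i)^T * D * ReLU(U y_j)."""
    def __init__(self, full_dim, attn_dim):
        super().__init__()
        self.U = nn.Linear(full_dim, attn_dim, bias=False)
        self.d = nn.Parameter(torch.ones(attn_dim))    # diagonal of D

    def forward(self, how_x, how_y, value_y, y_mask=None):
        # how_x, how_y: history-of-word vectors, (batch, len_x, full_dim) / (batch, len_y, full_dim)
        # value_y: the vectors to fuse from the y side, (batch, len_y, value_dim)
        x_proj = F.relu(self.U(how_x))
        y_proj = F.relu(self.U(how_y))
        scores = torch.bmm(x_proj * self.d, y_proj.transpose(1, 2))   # (batch, len_x, len_y)
        if y_mask is not None:                                        # True where y is padding
            scores = scores.masked_fill(y_mask.unsqueeze(1), float('-inf'))
        alpha = F.softmax(scores, dim=-1)
        return torch.bmm(alpha, value_y)                              # (batch, len_x, value_dim)
```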
For embeddings, I use the same pretrained 300d GloVe and 600d CoVe as this repo and initialize the 12d POS and 8d NER embeddings randomly. Also, I use a 1-dim term-frequency feature and a 3-dim binary exact-match feature indicating whether a word's original, lowercase, or lemma form appears in the question.
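As a rough sketch, concatenating the features listed above per context token would look like this (the helper name is hypothetical, and the exact composition of each layer's input is of course the model's own choice):

```python
import torch

def build_context_features(glove, cove, pos_vecs, ner_vecs, tf, exact_match):
    """Concatenate per-token features for the context side:
    300-d GloVe + 600-d CoVe + 12-d POS + 8-d NER + 1-d term frequency
    + 3-d binary exact match = 924-d input per token."""
    return torch.cat([glove, cove, pos_vecs, ner_vecs, tf, exact_match], dim=-1)
```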
I can also share the code with you if needed.
Thanks in advance.