Questions about FusionNet on SQuAD #1

Open
felixgwu opened this issue Feb 22, 2018 · 12 comments

Comments

@felixgwu

Dear authors,

Thank you for sharing this awesome work. I really like the idea of using the full history of each word to produce the attention, as well as the newly proposed symmetric form of attention.
I have recently been trying to apply the layers in this repo to the DrQA code to reproduce your FusionNet results on the SQuAD dataset.
However, I can only get dev EM: 73.88 F1: 82.23, which is much worse than the numbers (EM: 75.3 F1: 83.6) reported in the paper.
I also observe that the scores after epoch 1 (dev EM: 59.35 and F1: 59.35) are worse than the ones shown in Figures 5 & 6 in the paper.

I wonder whether I have misunderstood something in the paper or applied dropout in the wrong places.
Here is the model I created:

FusionNet(
  (embedding): Embedding(91187, 300, padding_idx=0)
  (CoVe): MTLSTM(
    (embedding): Embedding(91187, 300, padding_idx=0)
    (rnn1): LSTM(300, 300, bidirectional=True)
    (rnn2): LSTM(600, 300, bidirectional=True)
  )
  (pos_embedding): Embedding(51, 12)
  (ner_embedding): Embedding(20, 8)
  (qemb_match): FullAttention: (atten. 300 -> 300, take 300) x 1
  (doc_rnn): StackedBRNN(
    (rnns): ModuleList(
      (0): LSTM(1224, 125, bidirectional=True)
      (1): LSTM(250, 125, bidirectional=True)
    )
  )
  (question_rnn): StackedBRNN(
    (rnns): ModuleList(
      (0): LSTM(900, 125, bidirectional=True)
      (1): LSTM(250, 125, bidirectional=True)
    )
  )
  (question_urnn): StackedBRNN(
    (rnns): ModuleList(
      (0): LSTM(500, 125, bidirectional=True)
    )
  )
  (multi_level_fusion): FullAttention: (atten. 1400 -> 250, take 250) x 3
  (doc_urnn): StackedBRNN(
    (rnns): ModuleList(
      (0): LSTM(1250, 125, bidirectional=True)
    )
  )
  (self_boost_fusion): FullAttention: (atten. 2400 -> 250, take 250) x 1
  (doc_final_rnn): StackedBRNN(
    (rnns): ModuleList(
      (0): LSTM(500, 125, bidirectional=True)
    )
  )
  (self_attn): LinearSeqAttn(
    (linear): Linear(in_features=250, out_features=1)
  )
  (start_attn): BilinearSeqAttn(
    (linear): Linear(in_features=250, out_features=250)
  )
  (end_gru): GRUCell(250, 250)
  (end_attn): BilinearSeqAttn(
    (linear): Linear(in_features=250, out_features=250)
  )
)

I apply sequential dropout (also called variational dropout) with a drop rate of 0.4 to all embeddings, the additional features, the inputs of all LSTMs, the inputs of the GRUCell, the inputs of the FullAttention modules, and the inputs of the BilinearSeqAttn modules.
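
For readers unfamiliar with the term, here is a minimal sketch of sequential (variational) dropout: one mask is sampled per sequence and reused at every time step. It uses plain Python lists rather than tensors, and the function name `sequential_dropout` is hypothetical; it is not the code used in this thread or in the repo.

```python
import random


def sequential_dropout(seq, rate, rng):
    """Zero the same feature positions at every time step of one sequence.

    seq  : list of time steps, each a list of floats of equal length
    rate : probability of dropping a feature, must satisfy 0 <= rate < 1
    rng  : random.Random instance, passed in so results are reproducible
    """
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    if not seq or rate == 0.0:
        return [list(step) for step in seq]
    keep = 1.0 - rate
    dim = len(seq[0])
    # Sample ONE mask for the whole sequence; kept features are rescaled
    # by 1/keep (inverted dropout) so no rescaling is needed at test time.
    mask = [(1.0 / keep) if rng.random() < keep else 0.0 for _ in range(dim)]
    return [[x * m for x, m in zip(step, mask)] for step in seq]


rng = random.Random(0)
seq = [[1.0] * 8 for _ in range(5)]  # 5 time steps, 8 features each
dropped = sequential_dropout(seq, rate=0.4, rng=rng)
```

Because the mask is shared, every time step in `dropped` has the same zeroed positions, unlike ordinary dropout, which would sample a fresh mask per step.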

Here are the explanations of the modules:
The qemb_match module does the Fully-Aware Multi-level Fusion: Word-level in the paper.
The doc_rnn and question_rnn modules produce the low-level and high-level concepts.
The question_urnn generates the understanding vectors of the question.
The multi_level_fusion module corresponds to the Fully-Aware Multi-level Fusion: Higher-level.
The doc_urnn extracts the understanding vectors of the context using the concatenation of the outputs of doc_rnn and multi_level_fusion as its input.
The self_boost_fusion module is for Fully-Aware Self-Boosted Fusion.
The self_attn is used to provide a weighted sum of the understanding vectors of the question.
The start_attn predicts the start of the span.
The end_gru generates the v^Q in the paper which is used to predict the end of the span.
The end_attn predicts the end of the span.
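
As a sanity check on the printout above, the arithmetic below shows one decomposition of the layer input sizes that reproduces the printed numbers. The breakdown of what is concatenated at each layer is a reading of the sizes and of the paper's description, so treat it as an assumption rather than a confirmed account of the code; all variable names are made up for this sketch.

```python
# Sizes stated in the thread.
GLOVE, COVE = 300, 600
POS, NER = 12, 8
TF, EXACT_MATCH = 1, 3
BI = 2 * 125  # output size of each bidirectional LSTM with 125 hidden units

word_fusion = 300  # qemb_match: "take 300"
fused = 3 * BI     # multi_level_fusion: "take 250" x 3

# Assumed "history of word": GloVe, CoVe, low-level and high-level concepts.
history = GLOVE + COVE + 2 * BI

sizes = {
    "doc_rnn": GLOVE + COVE + POS + NER + TF + EXACT_MATCH + word_fusion,
    "question_rnn": GLOVE + COVE,
    "question_urnn": 2 * BI,                # low-level + high-level concepts
    "multi_level_fusion": history,
    "doc_urnn": 2 * BI + fused,             # doc_rnn outputs + fusion outputs
    "self_boost_fusion": history + fused + BI,
    "doc_final_rnn": BI + BI,               # understanding + self-boost output
}

for name, size in sizes.items():
    print(f"{name}: {size}")
```

Under these assumptions the sums come out to 1224, 900, 500, 1400, 1250, 2400 and 500, matching the input sizes in the printed model.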

The FullAttention and MTLSTM are from this repo and the BilinearSeqAttn, LinearSeqAttn and StackedBRNN are borrowed from the DrQA code.

For embeddings, I use the same pretrained 300d GloVe and 600d CoVe as this repo and initialize the 12d POS and 8d NER embeddings randomly. I also use a 1-dim term frequency feature and a 3-dim one-hot exact match feature indicating whether a word's original, short, and lemma forms appear in the question.
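
To make the four extra features concrete, here is a rough sketch of how they could be computed. It assumes the "short" form means the lowercased token (as in DrQA-style features) and that lemmas are supplied by an external tagger; the function name `match_features` is hypothetical and this is not the code used for the results above.

```python
from collections import Counter


def match_features(doc_tokens, doc_lemmas, q_tokens, q_lemmas):
    """Return [exact, lowercase, lemma, term_frequency] per document token."""
    q_orig = set(q_tokens)
    q_lower = {t.lower() for t in q_tokens}  # assumed meaning of "short" form
    q_lemma = set(q_lemmas)
    counts = Counter(t.lower() for t in doc_tokens)
    total = len(doc_tokens)
    feats = []
    for tok, lem in zip(doc_tokens, doc_lemmas):
        feats.append([
            float(tok in q_orig),          # original form appears in question
            float(tok.lower() in q_lower), # lowercased form appears
            float(lem in q_lemma),         # lemma appears
            counts[tok.lower()] / total,   # normalized term frequency
        ])
    return feats


doc_tokens = ["The", "cats", "sat", "the"]
doc_lemmas = ["the", "cat", "sit", "the"]
q_tokens = ["Where", "did", "the", "cat", "sit"]
q_lemmas = ["where", "do", "the", "cat", "sit"]
feats = match_features(doc_tokens, doc_lemmas, q_tokens, q_lemmas)
```

Each row of `feats` has 4 values, which is where the 1 + 3 feature dimensions come from.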

I can also share the code with you if needed.
Thanks in advance.

@ioana-blue

I'm new to all this, so take my ideas with a serious grain of salt. Based on my understanding of the paper and your comment, my guess is that the POS and NER embeddings may be the problem here. I suspect the POS and NER representations are not actually trainable and are obtained through other means, since they represent parts of speech and entities. I'm not sure how the encodings are produced, but my guess is that they are fixed (the same way GloVe and CoVe are fixed) and not trainable. Hope this helps. Cheers.

@ioana-blue

In fact, if you look at the code, there is a preprocessing step that shows how the entities are set:
https://github.com/momohuang/FusionNet-NLI/blob/master/prepro.py#L105
This makes me more confident that I'm right. Hope it helps.

@felixgwu
Author

Hi @ioana-blue,
Thank you for your comment.
Based on the paper and the code, they fine-tune the top 1000 most frequent words in GloVe embeddings, keep other words fixed, and fix the CoVe embeddings as well.
However, both POS and NER embeddings are randomly initialized and trained.
You may take a look at the code here:
https://github.com/momohuang/FusionNet-NLI/blob/master/FusionModel/FusionNet.py#L21-L44
This is how the embeddings are initialized.

The code you mentioned (https://github.com/momohuang/FusionNet-NLI/blob/master/prepro.py#L105) is only used during pre-processing to collect the set of POS and NER tags and build a dictionary that maps them to indices, which then serve as the input to the embedding layers.
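
To illustrate the "fine-tune only the most frequent words" idea, here is a toy sketch of a gradient step that updates only the first rows of an embedding table. It assumes the vocabulary is sorted by frequency so the tuned words occupy the lowest indices; the function `sgd_step_partial` is hypothetical and plain SGD is used only to keep the example short, so this is not how the linked code does it.

```python
def sgd_step_partial(embedding, grads, lr, num_tuned):
    """Apply one SGD step in place to the first `num_tuned` rows only.

    Rows at index >= num_tuned are left untouched, i.e. they stay fixed
    at their pretrained values.
    """
    for idx in range(min(num_tuned, len(embedding))):
        row, grad = embedding[idx], grads[idx]
        for j in range(len(row)):
            row[j] -= lr * grad[j]


# Toy table: 4 words, 2 dimensions; only the 2 most frequent words are tuned.
embedding = [[1.0, 1.0] for _ in range(4)]
grads = [[1.0, 1.0] for _ in range(4)]
sgd_step_partial(embedding, grads, lr=0.1, num_tuned=2)
```

In the setting described above, `num_tuned` would be 1000 and the remaining GloVe rows, like the CoVe weights, would receive no updates.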

@ioana-blue

Thank you for clarifying. I wonder why the POS and NER embeddings are trained as well, when they could potentially be initialized from "traditional" NLP tools. Do you understand why training them is better?

@felixgwu
Author

Hi @ioana-blue,
I'm not sure why they use randomly initialized POS and NER embeddings. I don't know if there are pre-trained POS and NER embeddings available somewhere. Previous reading comprehension papers such as DrQA use one-hot encodings as features. I assume that learning embeddings gives them better performance; however, this wasn't demonstrated in the ablation study in the paper.

@momohuang: Could you answer my questions? Thanks.

@hsinyuan-huang
Owner

hsinyuan-huang commented Feb 28, 2018

Hi @felixgwu and @ioana-blue ,

Thank you both for the active discussion!

I will first share my thoughts on the POS and NER embeddings:

The original DrQA paper (https://arxiv.org/pdf/1704.00051.pdf) did not elaborate on how to incorporate the POS and NER features. Around last summer, the DrQA implementation that @felixgwu used (see this older version) transformed the POS and NER tags into low-dimensional embeddings (similar to word embeddings) rather than the current one-hot encoding. I simply followed this design choice and it has remained this way. I am not sure why they changed it afterward, but I think it might not be critical to the performance. Both designs make some sense to me.

For the original problem, I think there may be several subtle reasons for the 1% difference that @felixgwu observed. When I directly ran this DrQA implementation last summer, I also observed the performance to be ~1% lower than reported. This paper by Tao Lei also found a similar problem. My guess is that when people tune the performance on a certain GPU, the performance can be a bit lower when running on other GPU architectures. However, I managed to reach and surpass DrQA's reported performance by making some other changes. Below is a list of things that you can try.

  1. During preprocessing, I split the tokens more often. Sometimes strings like "1998-2000" may be taken as a single token, but the answer is "1998". So the network will definitely fail to answer this question. You can check out the implementation here: https://github.com/momohuang/FusionNet-NLI/blob/master/general_utils.py#L36-L41

  2. Following the DrQA implementation, dropout is applied after the word embeddings and before all linear layers. So the word embeddings (GloVe and CoVe) actually have dropout applied twice. According to this paper by Tao Lei, they tuned the dropout rate of the word embeddings and linear layers, as well as the learning rate, on their own machine to recover the performance. Note that we only apply dropout to the additional features once (i.e., only before the LSTM input). From your description, it seems these additional features also have dropout applied twice.

  3. Before the submission to SQuAD, I have tried many network initializations with different random seeds. The reported number is the maximum of what we obtained. This can also boost the number a little. Furthermore, after training a model, I would try several random seeds for initializing unknown word embeddings. I also found doing this to be slightly helpful.

  4. To further improve the performance, you can try adding a skip connection in the input LSTMs. I found this to be slightly useful in natural language inference. The implementation is pretty simple; you can check out page 15 of our paper or see the implementation in this repo. But you may have to tune other hyperparameters too.
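
As an illustration of item 1 in the list above, here is a small sketch of splitting tokens more aggressively so that a span like "1998" inside "1998-2000" can be selected as an answer. The set of separator characters and the function names are assumptions made for this example; the actual rule in general_utils.py may differ.

```python
import re

# Split on hyphens, en dashes and slashes, keeping the separator as a token.
_SEPARATORS = re.compile(r"([-\u2013/])")


def split_token(token):
    """Split one whitespace-delimited token around separator characters."""
    return [part for part in _SEPARATORS.split(token) if part]


def tokenize(text):
    """Whitespace tokenization followed by the extra separator split."""
    tokens = []
    for tok in text.split():
        tokens.extend(split_token(tok))
    return tokens


tokens = tokenize("He played from 1998-2000 there")
# tokens == ["He", "played", "from", "1998", "-", "2000", "there"]
```

With plain whitespace tokenization "1998-2000" would be a single token, so no token span could match the answer "1998" exactly.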

@felixgwu
Author

Hi @momohuang,

Thanks for your helpful explanations! I'll definitely try them out.
Just one more question:
I saw that this DrQA implementation initializes all unknown words as zeros. However, in this repo, you initialize them from a uniform distribution between -1 and 1.
I wonder whether you also tuned the range. If so, do you remember what range you chose for SQuAD in the end?

@hsinyuan-huang
Owner

Hi @felixgwu,

I always initialize them from a uniform distribution between -1 and 1.
If I remember correctly, I think this is how GloVe is initialized.
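
For anyone comparing the two schemes, here is a minimal sketch of building an embedding table in which words missing from the pretrained vectors are drawn uniformly from [-1, 1] instead of being set to zero. The function name `init_embeddings` and the toy data are hypothetical; this is not the repo's code.

```python
import random


def init_embeddings(vocab, pretrained, dim, rng, low=-1.0, high=1.0):
    """Build one row per word; unknown words get uniform random vectors.

    vocab      : list of words, row i of the result belongs to vocab[i]
    pretrained : dict mapping word -> list of `dim` floats
    rng        : random.Random instance; changing its seed changes only
                 the rows of words that are missing from `pretrained`
    """
    table = []
    for word in vocab:
        if word in pretrained:
            table.append(list(pretrained[word]))
        else:
            table.append([rng.uniform(low, high) for _ in range(dim)])
    return table


pretrained = {"the": [0.1, 0.2, 0.3]}
vocab = ["the", "fusionnet"]
table = init_embeddings(vocab, pretrained, dim=3, rng=random.Random(0))
```

Passing a differently seeded `random.Random` is one way to try several random seeds for the unknown word embeddings, as suggested earlier in the thread.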

@rsanjaykamath

Hello @felixgwu

Do you have the code for FusionNet on the SQuAD dataset (the DrQA modification) available somewhere?

Thanks in advance.

@felixgwu
Author

felixgwu commented Mar 7, 2018

Hi @rsanjaykamath,

Unfortunately, I haven't made it public yet.
I am still working on matching the performance reported in the paper.
I modified my code based on suggestions 1, 2, and 3 from momohuang above and tried some other random seeds; however, the gap hasn't been closed yet.

@rsanjaykamath

Ok, please let us know if you make it public.

I'm trying to use the fully-aware attention in my model. As a start, I'd use your inputs above and the code from the repo.

Thanks.

@guotong1988

Same question. Thanks.


5 participants