
Regarding the BiLSTM baseline model stated in the PAWS paper #3

Open
AladarMiao opened this issue Jun 27, 2019 · 4 comments

Comments

@AladarMiao

If I read the PAWS paper correctly, it states that BiLSTM + cosine similarity is one of the baseline models used to evaluate the PAWS dataset. I tried to reproduce the experiment with a BiLSTM + cosine similarity model I designed, but the accuracy is still quite far from the one reported in the paper. Is there somewhere I can see how you defined the BiLSTM + cosine similarity model? It would be really helpful for my current study on paraphrase identification. Thanks in advance!

@yuanzh
Collaborator

yuanzh commented Jul 1, 2019

Hi, sorry for the delay. Could you please specify which number in the paper you would like to compare to, and whether you got a lower or a higher accuracy number?

Regarding our model architecture, it's a standard BiLSTM with dropout = 0.2, hidden size = 256, ReLU activation, and GloVe embeddings, using the last/first state vectors of the forward/backward LSTM. What's your model configuration?
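
For reference, here is a minimal PyTorch sketch of an encoder along these lines. It is not the authors' code: the embedding dimension, where the ReLU is applied, and the padding handling are assumptions.

```python
import torch
import torch.nn as nn

class BiLSTMEncoder(nn.Module):
    """Sketch of the described encoder: BiLSTM -> first/last states -> dense (256)."""

    def __init__(self, vocab_size, embed_dim=300, hidden_size=256, out_dim=256):
        super().__init__()
        # GloVe vectors would be loaded into this embedding table (300-d assumed).
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.dropout = nn.Dropout(0.2)
        self.bilstm = nn.LSTM(embed_dim, hidden_size,
                              batch_first=True, bidirectional=True)
        # Dense projection of the concatenated forward/backward states.
        self.proj = nn.Linear(2 * hidden_size, out_dim)

    def forward(self, token_ids):
        x = self.dropout(self.embedding(token_ids))    # (B, T, E)
        outputs, _ = self.bilstm(x)                    # (B, T, 2H)
        h = self.bilstm.hidden_size
        fwd_last = outputs[:, -1, :h]   # forward LSTM state at the last token
        bwd_first = outputs[:, 0, h:]   # backward LSTM state at the first token
        v = torch.cat([fwd_last, bwd_first], dim=-1)   # (B, 2H)
        return torch.relu(self.proj(v))                # sentence vector, (B, 256)
```

Padding is omitted for brevity; with padded batches you would index the last non-pad token instead of position -1.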

@AladarMiao
Author

I am currently using a self-trained embedding, a BiLSTM, the last state vector, concatenation, and a dense layer as the last layer. If what you stated is the case, where does cosine similarity come in? I am comparing my model with what's stated on page 8 of the paper, where the BiLSTM achieved 86.3 accuracy and 91.6 AUC.

@yuanzh
Collaborator

yuanzh commented Jul 3, 2019

  1. Each input is first mapped to a vector by the BiLSTM. Let v_l and v_r be the vectors of the left/right inputs.
  2. The final score is sigmoid(a(cosine_similarity(v_l, v_r) + b)), where a and b are learned variables; see the sketch below. I'm not sure if the affine transformation makes a big difference.

Just to be more precise, we take the state at the last token for the forward LSTM and the state at the first token for the backward LSTM, concatenate the two states, and add a dense layer to project them to the required dimension (256).
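
A minimal sketch of that scoring head (PyTorch again, reusing the BiLSTMEncoder sketch above; treating a and b as scalar parameters and training with binary cross-entropy are assumptions on my part, not the authors' stated setup):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CosineScorer(nn.Module):
    """Sketch of sigmoid(a * (cosine_similarity(v_l, v_r) + b)) with learned a, b."""

    def __init__(self):
        super().__init__()
        self.a = nn.Parameter(torch.tensor(1.0))  # learned scale
        self.b = nn.Parameter(torch.tensor(0.0))  # learned shift

    def forward(self, v_left, v_right):
        cos = F.cosine_similarity(v_left, v_right, dim=-1)  # (B,)
        return torch.sigmoid(self.a * (cos + self.b))       # paraphrase probability

# Hypothetical usage with one encoder shared by both sentences:
#   encoder = BiLSTMEncoder(vocab_size=30000)
#   scorer = CosineScorer()
#   p = scorer(encoder(left_ids), encoder(right_ids))
#   loss = F.binary_cross_entropy(p, labels.float())
```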

@AladarMiao
Author

Thanks!
