Reproducibility of edge probing tasks #1305
Comments
Hi @sagnik, I think you're more or less correct, and the issue may be coming from the For the probing tasks, though, they generally do not use just the top-layer representations, but also information from the middle layers.
The implementation you described takes only the final layer activations, which does not match either of the above setups. I will be busy over the next few days, but I will try to get some results that compare all three.
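For readers following along, here is a minimal, framework-free sketch of the three ways of combining layer activations that come up in this thread: top layer only, cat, and scalar mix. It is not jiant's or AllenNLP's code. It assumes that cat means concatenating the embedding layer with the top layer, and that scalar mix is a softmax-weighted sum of all layers scaled by a scalar `gamma`. The function names and the toy numbers are made up for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_layer(layers):
    """Use only the final layer's activations for a token."""
    return list(layers[-1])


def cat_layers(layers):
    """Assumed 'cat' setup: concatenate the first (embedding) layer
    with the top layer, doubling the feature dimension."""
    return list(layers[0]) + list(layers[-1])


def scalar_mix(layers, scalars, gamma=1.0):
    """Assumed 'mix' setup: gamma * sum_k softmax(scalars)[k] * layers[k].

    In a real probe, `scalars` and `gamma` would be learned parameters.
    """
    if len(layers) != len(scalars):
        raise ValueError("need exactly one scalar per layer")
    weights = softmax(scalars)
    dim = len(layers[0])
    return [
        gamma * sum(w * layer[i] for w, layer in zip(weights, layers))
        for i in range(dim)
    ]


# Toy activations for a single token: 3 layers, hidden size 2.
layers = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
top = top_layer(layers)
cat = cat_layers(layers)
mix = scalar_mix(layers, scalars=[0.0, 0.0, 0.0])  # equal scalars give a plain average
```

With equal scalars the mix reduces to an unweighted average of the layers, which is a handy sanity check when comparing a custom scalar mix against a reference implementation.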
Thanks for the reply, @zphang. Yes, I agree the original paper uses If you can run the experiments on your end, that would be great. If the scalar mix/cat is difficult to put in, just the results from the top layer would be fine: I just want to know if there's something I am doing massively wrong. At least two papers have reported results with the top layer and my results are almost 20% off. If by any chance you can get the layerwise results, that would be awesome! Also, in the Tenney papers branch, In the configs, Thanks for getting back!
I have a small update on this. I have updated the code to include scalar mixing and cat. Here's the changeset for the files:
scalarmix is pretty much the same as AllenNLP, with small changes:
Does the code look right? If yes, then I still can't reproduce the results. The cat setting gives me an f1_score of
Describe the bug
I am having a bit of trouble reproducing the results for edge probing tasks. Different people seem to have reported different results. For simplicity, I will use the example of Coref on OntoNotes data, and the results that use the top layer of BERT.
cat
part. Where coref_bert_run_config is given by
However, I am also wondering where the encoder is frozen in the existing implementation, which I think is necessary, right? As Tenney et al. 2019 ("BERT Rediscovers the Classical NLP Pipeline") write,
If I look back at the original implementation branch for the Tenney papers, there is an option to do this, which I cannot find in the existing code. If I freeze the encoder weights myself, the result for Coref drops to 77.7 for bert-base-uncased, which is really low but corresponds well to the Liu 2019 paper, appendix D6 (the data is the same AFAICT, but there is no self-attention pooling layer over the spans). Given this, my question is: where is the encoder frozen for EP tasks? Or am I understanding the task design completely wrong?
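To make the question concrete, this is roughly what freezing an encoder looks like in plain PyTorch. It is only a sketch under stated assumptions, not jiant's implementation: `encoder` and `probe` are tiny hypothetical stand-ins for BERT and the span classifier, and `freeze_module` is a helper name invented here.

```python
import torch
from torch import nn


def freeze_module(module: nn.Module) -> None:
    """Disable gradients for every parameter of `module` (hypothetical helper)."""
    for param in module.parameters():
        param.requires_grad = False
    # eval() turns off dropout in the frozen part; calling .train() on a
    # parent module later would switch it back on.
    module.eval()


torch.manual_seed(0)

# Stand-ins: a small "encoder" instead of BERT, a linear "probe" as classifier.
encoder = nn.Sequential(nn.Linear(8, 8), nn.Tanh())
probe = nn.Linear(8, 2)

freeze_module(encoder)

# Only the probe's parameters are handed to the optimizer.
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)

# Snapshot the encoder weights so we can check that they do not move.
encoder_before = [p.detach().clone() for p in encoder.parameters()]

inputs = torch.randn(4, 8)
labels = torch.randint(0, 2, (4,))

with torch.no_grad():  # no graph is built for the frozen encoder
    features = encoder(inputs)

loss = nn.functional.cross_entropy(probe(features), labels)
loss.backward()
optimizer.step()
```

After the step, the encoder weights should be bit-for-bit unchanged while the probe has received gradients. Running the same kind of check against the real model is one way to confirm whether the encoder is actually frozen during an edge probing run.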
Here's the original code for multi label span classification in the existing implementation:
I added this before creating taskmodel:
To Reproduce
jiant version you're using: the latest one, with some modifications for reading yml files.
Environment, e.g., "2 P40 GPUs": single NVIDIA GPU.
Config (defaults.conf): see above.
Expected behavior
Screenshots
Additional context