Needed help with accessing multi-batch trf_data #6
Comments
Hi! Thanks for the kind words; I'm really happy that you've found the materials useful. I will look into this and get back to you!
Hi @Jupiter79, can you provide me with an example of the error message raised when dealing with Docs in multiple batches? Thanks!
Hi @thiippal, Here is the code I used:
The output is:
Thanks a lot for the help and for the quick response! Best,
Hi @Jupiter79! I've now updated the materials: I ran some experiments using a longer text and updated the code to deal with batched outputs from the Transformer. Essentially, I used the same solution as you did, reshaping the tensor. I also changed the output of Token, Span and Doc tensors to averages, not sums. I don't really know why I summed them in the first place; averaging seems like a more reasonable solution. Thanks again for your comments and for contributing to the development of the learning materials!
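For reference, here is a minimal sketch of that reshaping (the span_tensor method name follows the materials; it assumes the spacy-transformers layout in which trf_data.tensors[0] holds the per-wordpiece hidden states with shape (batches, wordpieces, width) and trf_data.align maps spaCy tokens to wordpiece indices):

```python
def span_tensor(span):
    # Indices of the wordpieces aligned with the tokens in this Span
    tensor_ix = span.doc._.trf_data.align[span.start:span.end].data.flatten()

    # Width of the hidden states produced by the Transformer
    out_dim = span.doc._.trf_data.tensors[0].shape[-1]

    # Collapse the batch dimension so the indices can address every
    # wordpiece, then average the wordpiece vectors instead of summing them
    tensor = span.doc._.trf_data.tensors[0].reshape(-1, out_dim)[tensor_ix]
    return tensor.mean(axis=0)
```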
Hi @thiippal, Firstly, I have to thank you once again; I highly appreciate both your help and the excellent materials you made! I tried the code, and it seems to work flawlessly. I just have a couple of issues, but I don't think they're caused by the code. What I'm trying to do is identify unique named entities in the text, so my idea was to use similarity to decide which entities are the same (for example, multiple mentions of the company Facebook, or mentions of Trump and Donald Trump). However, I get rather different (even low) similarity scores for different occurrences of the same word (for example, Facebook). I suppose it's because of the nature of contextual embeddings. Thanks again! Best,
Hey @Jupiter79, good to hear it works! In your case, I would perhaps go for "traditional" word embeddings, since they seek to learn representations for particular words, such as the proper nouns "Donald" and "Trump", regardless of their context of occurrence. Transformers, in turn, attempt to model the different meanings that "Donald" and "Trump" may receive in a given context. This has also been explained in the spaCy docs.
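As a quick illustration (a sketch assuming en_core_web_lg is installed; the sentences are invented), static vectors depend only on the word form, so repeated mentions of the same name receive identical representations:

```python
import spacy

# A pipeline with static word vectors rather than contextual ones
nlp = spacy.load('en_core_web_lg')

doc1 = nlp('Facebook announced a new product yesterday.')
doc2 = nlp('The advertisement was placed on Facebook.')

# The same word form always maps to the same static vector,
# so the two mentions of 'Facebook' are maximally similar
print(doc1[0].similarity(doc2[5]))  # 1.0
```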
Hi @thiippal, When I compare similarity between 2 spans, it works. Any advice about it? Some additional notes: I tried to change the doc_tensor method with the following line to take the mean of the token array into account, but in this case the similarity score is too high (around 0.98) even for unrelated sentences.
Hi @mehmetilker! Okay, a couple of questions:
@thiippal thank you for your reply. For the first question: I am comparing two documents in a simple way, but it will be a large batch of Doc objects.
I could calculate similarity in the following way, but I am experimenting to see whether I can improve the similarity score for more similar documents.
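For readers following along, here is a minimal sketch of scoring two Docs directly from the Transformer output, by flattening the batched wordpiece tensor, averaging it into one vector per Doc and taking the cosine (an illustration only, not necessarily the exact code referred to above; en_core_web_trf and the sentences are just examples):

```python
import numpy as np
import spacy

nlp = spacy.load('en_core_web_trf')

doc1 = nlp('The company reported strong quarterly earnings.')
doc2 = nlp('Quarterly profits at the firm rose sharply.')

def doc_vector(doc):
    # tensors[0] has shape (batches, wordpieces, width): flatten the
    # batch dimension and average all wordpiece vectors into one vector
    hidden = doc._.trf_data.tensors[0]
    return hidden.reshape(-1, hidden.shape[-1]).mean(axis=0)

def cosine(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

print(cosine(doc_vector(doc1), doc_vector(doc2)))
```

Note that this averages over any padding wordpieces as well, which may be one reason such document-level scores end up looking uniformly high.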
For the second question: I tried with my custom-trained model and got that exception. The model was trained with "dbmdz/electra-small-turkish-cased-discriminator".
Hi @mehmetilker! What is the error message for the example above?
Hi @thiippal,
I guess the main problem is what the _.trf_data.tensors attribute carries for electra and roberta models.
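One quick way to check what the loaded Transformer actually stores on the Doc (a sketch; replace the model name with your own pipeline):

```python
import spacy

nlp = spacy.load('en_core_web_trf')  # or a custom electra/roberta pipeline
doc = nlp('A short test sentence.')

# trf_data.tensors is a list of arrays: the per-wordpiece hidden states
# with shape (batches, wordpieces, width), plus a pooled per-batch output
# only if the underlying architecture provides one
for i, tensor in enumerate(doc._.trf_data.tensors):
    print(i, tensor.shape)
```

If the list contains only the wordpiece states, there is no ready-made document-level tensor to fall back on, which would explain the behaviour described above.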
Right, if there is no Transformer output for the entire Doc, then obviously the code will not work. Does this manual solution that you presented above work?
If you get funny results for cosine similarity, don't sum up the vectors, but average them instead!
Yes, it works; at least I can get a result. But the problem is that the score does not change enough to reflect changes in the text: I either observe 0.92 or 0.95. There is the same problem with your original code for doc similarity as well (for the en_core_web_trf model), unlike the span/token similarity scores, which work as in your example. I see almost the same result when I change the doc_tensor method. np.dot score: 0.9605925135658091. Also, changing np.sum to np.mean gave me the same score here:
Hi,
First of all, thank you very much for the excellent course on NLP; it is very clearly written!
I got stuck on the Contextual embeddings from Transformers part of the Natural Language Processing for Linguists course. I tried to use your custom class tensor2attr on longer texts, but it starts reporting errors, since the class is designed to access only the first batch. Could you please give some advice on how to apply the tensor2attr class to Docs with multiple batches?
I have tried to change your code in the part commented # Get Token tensors under tensors[0]; the second [0] accesses batch, by replacing the original line:
with this one:
But I am not sure whether that's the right way to do it, since I get some strange similarity scores. I would highly appreciate any advice or help you could share.
Thank you very much for your efforts!
Best,
Jovan