Needed help with accessing multi-batch trf_data #6

JovanNj opened this issue Apr 1, 2021 · 13 comments

JovanNj commented Apr 1, 2021

Hi,

First of all, thank you very much for the excellent course on NLP; it is very clearly written!

I got stuck on the Contextual embeddings from Transformers part of the Natural Language Processing for Linguists course. I tried to use your custom tensor2attr class on longer texts, but it raises errors, since the class only accesses the first batch. Could you please advise on how to adapt the tensor2attr class for Docs that span multiple batches?

I tried to change your code below the comment # Get Token tensors under tensors[0]; the second [0] accesses batch, replacing this:


def span_tensor(self, span):
        
        # Get alignment information for Span. This is achieved by using
        # the 'doc' attribute of Span that refers to the Doc that contains
        # this Span. We then use the 'start' and 'end' attributes of a Span
        # to retrieve the alignment information. Finally, we flatten the
        # resulting array to use it for indexing.
        tensor_ix = span.doc._.trf_data.align[span.start: span.end].data.flatten()
        
        # Get Token tensors under tensors[0]; the second [0] accesses batch
        tensor = span.doc._.trf_data.tensors[0][0][tensor_ix]
        
        # Sum vectors along axis 0 (columns). This yields a 768-dimensional
        # vector for each spaCy Token.
        return tensor.sum(axis=0)

with this one:


def span_tensor(self, span):
        
        # Get alignment information for Span. This is achieved by using
        # the 'doc' attribute of Span that refers to the Doc that contains
        # this Span. We then use the 'start' and 'end' attributes of a Span
        # to retrieve the alignment information. Finally, we flatten the
        # resulting array to use it for indexing.
        tensor_ix = span.doc._.trf_data.align[span.start: span.end].data.flatten()
        
        # Get Token tensors under tensors[0]; the second [0] accesses batch
        tensor = span.doc._.trf_data.tensors[0].reshape(-1, 768)[tensor_ix]
        
        # Sum vectors along axis 0 (columns). This yields a 768-dimensional
        # vector for each spaCy Token.
        return tensor.sum(axis=0)

But I am not sure whether that's the right way to do it, since I get some strange similarity scores. I would highly appreciate any advice or help you could share.

Thank you very much for your efforts!

Best,
Jovan

thiippal self-assigned this Apr 1, 2021

thiippal commented Apr 1, 2021

Hi!

Thanks for the kind words; I'm really happy that you've found the materials useful.

I will look into this and get back to you!


thiippal commented Apr 1, 2021

Hi @Jupiter79, can you provide me with an example of the error message raised when dealing with Docs in multiple batches? Thanks!


JovanNj commented Apr 1, 2021

Hi @thiippal,

Here is the code I used:

text = '''
EU urges US to draft joint rule book to rein in tech giants
The European Union is urging U.S. President Joe Biden to help draw up a common rule book to rein in the power of big tech companies like Facebook and Twitter and combat the spread of fake news
ByThe Associated Press
26 January 2021, 21:25
• 4 min read

 
On Location: March 5, 2021
Catch up on the developing stories making headlines.
BRUSSELS -- The European Union called Tuesday on U.S. President Joe Biden to help draw up a common rule book to rein in the power of big tech companies like Facebook and Twitter and combat the spread of fake news that is eating away at Western democracies.
In a speech to the Davos World Economic Forum, European Commission President Ursula von der Leyen urged the Biden administration to join forces against “the darker sides of the digital world,” which she said was partly behind the “shock” storming of Capitol Hill on Jan. 6.
“The business model of online platforms has an impact and not only on free and fair competition, but also on our democracies, our security and on the quality of our information,” von der Leyen said. “That is why we need to contain this immense power of the big digital companies.”
She urged the White House to join the 27-nation bloc’s efforts, saying that “together, we could create a digital economy rule book that is valid worldwide,” and would encompass data protection, privacy rules and the security of critical infrastructure.
Von der Leyen said the EU wants the onus put on the tech giants, with “it clearly laid down that internet companies take responsibility for the manner in which they disseminate, promote and remove content.”
In December, the European Commission proposed two new pieces of EU legislation to better protect consumers and their rights online, make tech platforms more accountable, and improve digital competition, building on the bloc’s data protection rules, which are among the most stringent in the world.
“We want the platforms to be transparent about how their algorithms work,” von der Leyen said. “Because we cannot accept that decisions that have a far-reaching impact on our democracy are taken by computer programs alone.”
Von der Leyen also referred to the decision earlier this month by Facebook and Twitter to cut off President Donald Trump from their platforms for allegedly inciting the assault on the U.S. Capitol, an unprecedented step that underscored the immense power of tech giants to regulate speech.
“No matter how tempting it may have been for Twitter to switch off President Trump’s account, such serious interference with freedom of expression should not be based on company rules alone,” she said. “There needs to be a framework of laws for such far-reaching decisions.”
Trump’s permanent suspension from Twitter and Facebook is prompting EU member Hungary to push its own measures to regulate social media companies.
Hungary’s justice minister said Tuesday that large tech companies might face Hungarian government regulation over what she called “deliberate, ideological” censorship on social media.
In a Facebook post, Justice Minister Judit Varga wrote that the government would move to place restrictions on tech giants that she said arbitrarily silence users of online platforms, including the accounts of government state leaders - a reference to decisions by Twitter and Facebook to permanently suspend former U.S. president Donald Trump after his supporters mounted an assault on the U.S. capitol on Jan. 6.
Varga called for the “transparent and controllable operation” of tech companies, and said she would submit a bill on the matter to Hungary’s parliament in the spring to counter what she called their “systematic abuse of free speech.”
Hungary’s next parliamentary election is scheduled for 2022. Recent polls showed a tight race between the ruling Fidesz party and a six-party opposition coalition.
Hungarian Prime Minister Viktor Orban, a Trump ally, has been accused of overseeing the consolidation of the country’s media into the hands of business interests with ties to his party.
Opposition parties have used social media to reach potential voters amid a lack of coverage in Hungary’s public outlets. A 2018 report by the Organization for Security and Co-operation in Europe found that national elections that year “were characterized by a pervasive overlap between state and ruling party resources” and media bias.
Last week, Varga claimed that tech companies “limit the visibility of Christian, conservative, right-wing opinions,” and that “power groups behind global tech giants” were capable of deciding elections. She alleged that she had personally been “shadow-banned” by Facebook, a term referring to social media platforms restricting the visibility of users’ profiles or posts without their knowledge.
A representative for Facebook told local media that the company had not interfered with Varga’s account. Facebook did not immediately respond to a request for comment.

'''

# Import the Language object under the 'language' module in spaCy,
# and NumPy for calculating cosine similarity.
from spacy.language import Language
import numpy as np

# We use the @ character to register the following Class definition
# with spaCy under the name 'tensor2attr'.
@Language.factory('tensor2attr')

# We begin by declaring the class name: Tensor2Attr. The name is 
# declared using 'class', followed by the name and a colon.
class Tensor2Attr:
    
    # We continue by defining the first method of the class, 
    # __init__(), which is called when this class is used for 
    # creating a Python object. Custom components in spaCy 
    # require passing two variables to the __init__() method:
    # 'name' and 'nlp'. The variable 'self' refers to any
    # object created using this class!
    def __init__(self, name, nlp):
        
        # We do not really do anything with this class, so we
        # simply move on using 'pass' when the object is created.
        pass

    # The __call__() method is called whenever some other object
    # is passed to an object representing this class. Since we know
    # that the class is a part of the spaCy pipeline, we already know
    # that it will receive Doc objects from the preceding layers.
    # We use the variable 'doc' to refer to any object received.
    def __call__(self, doc):
        
        # When an object is received, the class will instantly pass
        # the object forward to the 'add_attributes' method. The
        # reference to self informs Python that the method belongs
        # to this class.
        self.add_attributes(doc)
        
        # After the 'add_attributes' method finishes, the __call__
        # method returns the object.
        return doc
    
    # Next, we define the 'add_attributes' method that will modify
    # the incoming Doc object by calling a series of methods.
    def add_attributes(self, doc):
        
        # spaCy Doc objects have an attribute named 'user_hooks',
        # which allows customising the default attributes of a 
        # Doc object, such as 'vector'. We use the 'user_hooks'
        # attribute to replace the attribute 'vector' with the 
        # Transformer output, which is retrieved using the 
        # 'doc_tensor' method defined below.
        doc.user_hooks['vector'] = self.doc_tensor
        
        # We then perform the same for both Spans and Tokens that
        # are contained within the Doc object.
        doc.user_span_hooks['vector'] = self.span_tensor
        doc.user_token_hooks['vector'] = self.token_tensor
        
        # We also replace the 'similarity' method, because the 
        # default 'similarity' method looks at the default 'vector'
        # attribute, which is empty! We must first replace the
        # vectors using the 'user_hooks' attribute.
        doc.user_hooks['similarity'] = self.get_similarity
        doc.user_span_hooks['similarity'] = self.get_similarity
        doc.user_token_hooks['similarity'] = self.get_similarity
    
    # Define a method that takes a Doc object as input and returns 
    # Transformer output for the entire Doc.
    def doc_tensor(self, doc):
        
        # Return Transformer output for the entire Doc. As noted
        # above, this is the last item under the attribute 'tensor'.
        # Use [0] to access the batch.
        return doc._.trf_data.tensors[-1][0]
    
    # Define a method that takes a Span as input and returns the Transformer 
    # output.
    def span_tensor(self, span):
        
        # Get alignment information for Span. This is achieved by using
        # the 'doc' attribute of Span that refers to the Doc that contains
        # this Span. We then use the 'start' and 'end' attributes of a Span
        # to retrieve the alignment information. Finally, we flatten the
        # resulting array to use it for indexing.
        tensor_ix = span.doc._.trf_data.align[span.start: span.end].data.flatten()
        
        # Get Token tensors under tensors[0]; the second [0] accesses batch
        tensor = span.doc._.trf_data.tensors[0][0][tensor_ix]
        
        # Sum vectors along axis 0 (columns). This yields a 768-dimensional
        # vector for each spaCy Token.
        return tensor.sum(axis=0)
    
    # Define a function that takes a Token as input and returns the Transformer
    # output.
    def token_tensor(self, token):
        
        # Get alignment information for Token; flatten array for indexing.
        # Again, we use the 'doc' attribute of a Token to get the parent Doc,
        # which contains the Transformer output.
        tensor_ix = token.doc._.trf_data.align[token.i].data.flatten()
        
        # Get Token tensors under tensors[0]; the second [0] accesses batch
        tensor = token.doc._.trf_data.tensors[0][0][tensor_ix]

        # Sum vectors along axis 0 (columns). This yields a 768-dimensional
        # vector for each spaCy Token.
        return tensor.sum(axis=0)
    
    # Define a function for calculating cosine similarity between vectors
    def get_similarity(self, doc1, doc2):
        
        # Calculate and return cosine similarity
        return np.dot(doc1.vector, doc2.vector) / (doc1.vector_norm * doc2.vector_norm)

import spacy
nlp = spacy.load('en_core_web_trf')

# Add the component named 'tensor2attr', which we registered using the
# @Language decorator and its 'factory' method to the pipeline.
nlp.add_pipe('tensor2attr')

# Call the 'pipeline' attribute to examine the pipeline
nlp.pipeline

doc = nlp(text)

# Retrieve vectors for two Tokens in the Doc; assign them to the
# variables 'span1' and 'span2'.
span1 = doc[3]
span2 = doc[118]

# Compare the similarity of the two Tokens
print(span1.text,'\n', span2.text)
print(doc._.trf_data.tensors[0].shape)
span1.similarity(span2)

The output is:

US 
 Twitter
(10, 153, 768)

IndexError                                Traceback (most recent call last)
<ipython-input-1-2cbd3fa7e71c> in <module>
    166 print(span1.text,'\n', span2.text)
    167 print(doc._.trf_data.tensors[0].shape)
--> 168 span1.similarity(span2)

~/anaconda3/envs/conda_python38/lib/python3.8/site-packages/spacy/tokens/token.pyx in spacy.tokens.token.Token.similarity()

<ipython-input-1-2cbd3fa7e71c> in get_similarity(self, doc1, doc2)
    144 
    145         # Calculate and return cosine similarity
--> 146         return np.dot(doc1.vector, doc2.vector) / (doc1.vector_norm * doc2.vector_norm)
    147 
    148 import spacy

~/anaconda3/envs/conda_python38/lib/python3.8/site-packages/spacy/tokens/token.pyx in spacy.tokens.token.Token.vector.__get__()

<ipython-input-1-2cbd3fa7e71c> in token_tensor(self, token)
    134 
    135         # Get Token tensors under tensors[0]; the second [0] accesses batch
--> 136         tensor = token.doc._.trf_data.tensors[0][0][tensor_ix]
    137 
    138         # Sum vectors along axis 0 (columns). This yields a 768-dimensional

IndexError: index 176 is out of bounds for axis 0 with size 153

Thanks a lot for the help and for the quick response!

Best,
Jovan


thiippal commented Apr 2, 2021

Hi @Jupiter79!

I've now updated the materials – I ran some experiments using a longer text and revised the code to deal with batched outputs from the Transformer.

Essentially, I used the same solution as you did: reshaping the tensor using reshape(-1, 768) to allow the token/tensor alignment to work.

I also changed the output of Token, Span and Doc tensors to averages, not sums. I don't really know why I summed them in the first place; averaging seems like a more reasonable solution.
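
To make this concrete: with a batched output of shape (10, 153, 768), indexing tensors[0][0] only exposes the first batch of 153 wordpiece rows, which is why alignment index 176 went out of bounds in your traceback. The revised span_tensor looks roughly like this – a sketch, where 768 assumes the RoBERTa-based en_core_web_trf model:

def span_tensor(self, span):

    # Get alignment information for the Span and flatten the
    # resulting array for indexing.
    tensor_ix = span.doc._.trf_data.align[span.start: span.end].data.flatten()

    # Collapse the batch dimension so the alignment indices address
    # the full sequence of wordpiece vectors across all batches.
    tensor = span.doc._.trf_data.tensors[0].reshape(-1, 768)[tensor_ix]

    # Average the vectors along axis 0 instead of summing them.
    return tensor.mean(axis=0)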

Thanks again for your comments and contributing to the development of the learning materials!


JovanNj commented Apr 3, 2021

Hi @thiippal,

Firstly, I have to thank you once again; I highly appreciate both your help and the excellent materials you made!

I tried the code and it seems to work flawlessly. I just have a couple of issues, but I don't think they are caused by the code: what I'm trying to do is to identify unique named entities in a text, so my idea was to use similarity to decide which entities are the same (for example, multiple mentions of the company Facebook, or mentions of Trump and Donald Trump), but I get rather different (even low) similarity scores for different occurrences of the same word (for example, Facebook). I suppose that is due to the nature of contextual embeddings.
My plan now is to define a rule that makes decisions based on both contextual embeddings and tok2vec word embeddings, as sketched below. The idea is to use the rule to help with NER labelling, in order to link entities and create a knowledge base. If you would like to share your opinion or some advice on the issue, you are more than welcome.
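
For illustration, something along these lines – just a rough sketch, where the 0.85 threshold is an arbitrary placeholder I would still need to tune:

def group_mentions(doc, threshold=0.85):

    # Greedily group entity mentions whose vectors exceed the
    # similarity threshold; each group stands for one candidate entity.
    groups = []
    for ent in doc.ents:
        for group in groups:
            if ent.similarity(group[0]) >= threshold:
                group.append(ent)
                break
        else:
            groups.append([ent])
    return groups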

Thanks again!

Best,
Jovan


thiippal commented Apr 3, 2021

Hey @Jupiter79,

good to hear it works!

In your case, I would perhaps go for "traditional" word embeddings, since they seek to learn representations for particular words, such as the proper nouns "Donald" and "Trump", regardless of their context of occurrence.

Transformers, in turn, attempt to model the different meanings that "Donald" and "Trump" may receive in a given context.

This has also been explained in the spaCy docs.
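
To illustrate the difference – a quick sketch using en_core_web_lg, one of the pipelines that ships static word vectors:

import spacy

# Load a pipeline that includes static word vectors.
nlp = spacy.load('en_core_web_lg')

doc = nlp("Donald Trump spoke. Later, Trump tweeted.")

# With static vectors, every occurrence of "Trump" receives the same
# vector, so two mentions of the word are maximally similar.
print(doc[1].similarity(doc[6]))  # "Trump" vs. "Trump" -> 1.0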


mehmetilker commented Apr 25, 2021

Hi @thiippal,
I am experiencing a different problem, but along the same lines, so I preferred not to open a separate issue.

When I compare the similarity between two Spans, it works.
But if I run similarity for two Docs, it raises an error: ValueError: shapes (50,256) and (46,256) not aligned: 256 (dim 1) != 46 (dim 0)

Any advice about it?

Some additional notes:
Regarding "The second item under index 1 holds the output for the entire Doc": checking example_doc._.trf_data.tensors[1].shape throws an exception, because the tensors list contains only one item.

I tried changing the doc_tensor method to the line below, to take the mean of the token array instead, but then the similarity score is too high (around 0.98) even for unrelated sentences.

    def doc_tensor(self, doc):        
        #return doc._.trf_data.tensors[-1].mean(axis=0)
        return doc._.trf_data.tensors[-1][0].mean(axis=0)
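
Maybe a guard like this would be safer – just a sketch, reading the width from the tensor instead of hard-coding it:

def doc_tensor(self, doc):
    tensors = doc._.trf_data.tensors
    width = tensors[0].shape[-1]
    if len(tensors) > 1:
        # Models like RoBERTa expose a pooled per-batch output as the
        # last tensor; average it over the batches.
        return tensors[-1].mean(axis=0)
    # Otherwise only token-level output is available (as with this
    # ELECTRA model): flatten the batch dimension and mean-pool.
    return tensors[0].reshape(-1, width).mean(axis=0)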

@thiippal

Hi @mehmetilker!

Okay, a couple of questions:

  1. Are you trying to compare the cosine similarity of a large batch of Doc objects?
  2. Which model / architecture are you using? A 256-dimensional output suggests it's something other than RoBERTa (the default Transformer-based model for English in spaCy).


mehmetilker commented Apr 26, 2021

@thiippal, thank you for your reply.

For the first question: right now I am comparing two documents in a simple way, but eventually it will be a large batch of Doc objects.

text1 = "sentence 1"
doc1 = nlp(text1)
text2 = "sentence 2"
doc2 = nlp(text2)

doc1.similarity(doc2)  # throws an exception

I can calculate similarity in the following way, but I am experimenting to see if I can improve the similarity score for more similar documents.

import numpy as np
from scipy.spatial import distance

ten1 = np.sum(np.array(doc1._.trf_data.tensors[0][0]), axis=0)
ten2 = np.sum(np.array(doc2._.trf_data.tensors[0][0]), axis=0)
score = 1 - distance.cosine(ten1, ten2)

For the second question: I tried with my custom-trained model and got that exception. The model was trained with "dbmdz/electra-small-turkish-cased-discriminator".
I also tried with spaCy's en_core_web_trf model and there was no problem, so it must be my choice of Transformer model or training pipeline.
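
A quick way to see what each pipeline actually puts into trf_data – a sketch I used for inspection:

doc = nlp("A short test sentence.")

# Print the shape of every tensor the Transformer component stores.
for i, tensor in enumerate(doc._.trf_data.tensors):
    print(i, tensor.shape)

# en_core_web_trf returns two tensors, e.g. (1, n, 768) and (1, 768);
# my ELECTRA model returns a single (1, n, 256) token-level tensor.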

@thiippal

Hi @mehmetilker!

What is the error message for the example above for doc1.similarity(doc2)?

@mehmetilker

Hi @thiippal,

Traceback (most recent call last):
  File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/projects/x/cli/parseTest.py", line 197, in <module>
    x = doc2.similarity(doc)
  File "spacy/tokens/doc.pyx", line 568, in spacy.tokens.doc.Doc.similarity
  File "/home/projects/x/cli/parseTest.py", line 147, in get_similarity
    d = np.dot(doc1.vector, doc2.vector)
  File "<__array_function__ internals>", line 5, in dot
ValueError: shapes (46,256) and (50,256) not aligned: 256 (dim 1) != 50 (dim 0)

I guess the main problem is what the _.trf_data.tensors attribute carries for ELECTRA and RoBERTa models.
There is no "output for the entire Doc" tensor for the ELECTRA model.

@thiippal

Right, if there is no Transformer output for the entire Doc, then obviously the code will not work.

Does this manual solution that you presented above work?

ten1 = np.sum(np.array(doc1._.trf_data.tensors[0][0]), axis=0)
ten2 = np.sum(np.array(doc2._.trf_data.tensors[0][0]), axis=0)
score = 1 - distance.cosine(ten1, ten2)

If you get funny results for cosine similarity, don't sum up the vectors, but average them instead!


mehmetilker commented Apr 29, 2021

Yes, it works – at least I can get a result. But the problem is that the score does not change enough to reflect changes in the text: I observe either 0.92 or 0.95.

And the same problem exists for Doc similarity in your original code as well (with the en_core_web_trf model), unlike the Span/Token similarity scores, which work as in your example.

I see almost the same result when I change the doc_tensor method to return doc._.trf_data.tensors[-1][0].mean(axis=0):

np.dot score : 0.9605925135658091
cosine score: 0.9605925679206848

Also, changing np.sum to np.mean gives me the same score here:

ten1 = np.mean(np.array(doc1._.trf_data.tensors[0][0]), axis=0)
ten2 = np.mean(np.array(doc2._.trf_data.tensors[0][0]), axis=0)
score = 1 - distance.cosine(ten1, ten2)
