Multiple Semantic Textual Similarity Problem #71

Open
bendarodes opened this issue Mar 1, 2024 · 3 comments

Comments

@bendarodes

bendarodes commented Mar 1, 2024

Hello,
I have a problem: I have thousands of sentences, and I want to determine which of them a newly entered sentence is closest to. However, I don't want to re-analyze all 1000 sentences every time, so I need the encoded (embedding) values of the sentences, but I couldn't get them. I would be very pleased if you could help me.

@bendarodes
Author

```python
import torch
import numpy as np
from transformers import AutoTokenizer, AutoModel
from sklearn.metrics.pairwise import cosine_similarity

# Load T5 model and tokenizer
model_name = "boun-tabi-LMG/TURNA"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# Sample sentences
sentences = ["elma güzel meyvedir", "armut lezzetlidir", "kitap", "defter", "çelik vida", "banyo dolabı"]

# Encode a sentence using mean pooling over the model outputs
def mean_pooling(sentence):
    input_ids = tokenizer(sentence, return_tensors="pt").input_ids  # batch size 1
    decoder_input_ids = input_ids  # feed the same ids to the decoder
    with torch.no_grad():
        output = model(input_ids=input_ids, decoder_input_ids=decoder_input_ids)
    # Average pooling over all token representations
    return output[0].mean(dim=1)[0]

# Embed every sentence
embeddings = np.stack([mean_pooling(s).numpy() for s in sentences])

# Compute cosine similarity for the upper triangular part
upper_triangular = np.triu(cosine_similarity(embeddings), k=1)

# Fill the lower triangular part
similarity_matrix = upper_triangular + upper_triangular.T

# Print the similarity matrix
print("Cosine similarity matrix:")
print(similarity_matrix)
```

I wrote this code, but all of the similarity values come out very close to each other, and the scores don't look accurate. I would be very grateful if you could help me understand where I made a mistake.

@gokceuludogan
Contributor

We haven't yet tested TURNA's performance for generating sentence embeddings. Your proposed approach seems logical. However, it's notable that the paper "Sentence-T5: Scalable Sentence Encoders from Pre-trained Text-to-Text Models" explores various methods for encoding text/sentences using pre-trained T5 models. They found that utilizing the encoder and averaging its token representations performs better than using both encoder and decoder.
An alternative to the suggested method involves using only the encoder by loading it with the T5EncoderModel class instead of AutoModel. Here's an example of how to obtain embeddings using this method:

def mean_pooling(sentence):
    input_ids = tokenizer(sentence, return_tensors="pt").input_ids # Batch size 1
    with torch.no_grad():
        output = model(input_ids=input_ids)
    return output.last_hidden_state.mean(dim=1)[0]
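
If it helps with the caching concern from your original question, here is a minimal, untested sketch of that encoder-only approach end to end: the corpus is embedded once with `T5EncoderModel`, and each new sentence is then embedded on its own and compared only against the cached matrix. The example sentences, the query, and the variable names are just for illustration.

```python
import numpy as np
import torch
from sklearn.metrics.pairwise import cosine_similarity
from transformers import AutoTokenizer, T5EncoderModel

# Load only the encoder; the decoder weights are not needed for embeddings
model_name = "boun-tabi-LMG/TURNA"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = T5EncoderModel.from_pretrained(model_name)
model.eval()

def embed(sentence):
    input_ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        output = model(input_ids=input_ids)
    # Mean-pool the encoder token representations into a single vector
    return output.last_hidden_state.mean(dim=1)[0].numpy()

# Embed the corpus once and cache the result (e.g. with np.save / np.load)
corpus = ["elma güzel meyvedir", "armut lezzetlidir", "kitap", "defter"]
corpus_embeddings = np.stack([embed(s) for s in corpus])

# For a newly entered sentence, embed only that sentence and compare
query = "armut tatlı bir meyvedir"
scores = cosine_similarity(embed(query).reshape(1, -1), corpus_embeddings)[0]
print(corpus[int(np.argmax(scores))], scores)
```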

Additionally, consider exploring finetuned NLI and STS models for extracting embeddings:

https://huggingface.co/boun-tabi-LMG/turna_nli_nli_tr
https://huggingface.co/boun-tabi-LMG/turna_semantic_similarity_stsb_tr
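
If you try those checkpoints, note that I haven't verified their expected input format here; they are text-to-text models, so something along the lines of the sketch below should apply, but the prompt layout is an assumption and the model cards are the authoritative reference.

```python
from transformers import pipeline

# Assumption: the fine-tuned STS checkpoint generates a similarity score as text
# when given both sentences in a single prompt; the exact prompt format below is
# a guess, so check the model card before relying on it.
sts = pipeline("text2text-generation", model="boun-tabi-LMG/turna_semantic_similarity_stsb_tr")
print(sts("ilk cümle: elma güzel meyvedir ikinci cümle: armut lezzetlidir"))
```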

Please share your findings with us. I'm eager to learn about the results.

@onurgu

onurgu commented Apr 15, 2024

Hi @bendarodes, any news?
