Multiple Semantic Textual Similarity Problem #71

Open
bendarodes opened this issue Mar 1, 2024 · 3 comments

Comments

@bendarodes

bendarodes commented Mar 1, 2024

Hello,
I have a problem: I have thousands of sentences, and I want to determine which of them a newly entered sentence is closest to. However, I don't want to re-analyze all 1000 sentences every time, so I need the encoded (embedding) values of the sentences, but I couldn't get them. I would be very pleased if you could help me.

@bendarodes
Author

```python
import torch
import numpy as np
from transformers import AutoTokenizer, AutoModel
from sklearn.metrics.pairwise import cosine_similarity

# Load T5 model and tokenizer
model_name = "boun-tabi-LMG/TURNA"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# Sample sentences
sentences = ["elma güzel meyvedir", "armut lezzetlidir", "kitap", "defter", "çelik vida", "banyo dolabı"]

# Encode a sentence using mean pooling over the model outputs
def mean_pooling(sentence):
    input_ids = tokenizer(sentence, return_tensors="pt").input_ids  # batch size 1
    decoder_input_ids = input_ids  # feed the same ids to the decoder
    with torch.no_grad():
        output = model(input_ids=input_ids, decoder_input_ids=decoder_input_ids)
    # Average pooling over all token representations
    return output[0].mean(dim=1)[0]

# Embed every sentence
embeddings = np.stack([mean_pooling(s).numpy() for s in sentences])

# Compute cosine similarity for the upper triangular part
upper_triangular = np.triu(cosine_similarity(embeddings), k=1)

# Fill the lower triangular part
similarity_matrix = upper_triangular + upper_triangular.T

# Print the similarity matrix
print("Cosine similarity matrix:")
print(similarity_matrix)
```

I wrote this code, but all of the similarity values come out very close to each other, and the scores don't look accurate. I would be very grateful if you could help me understand where I made a mistake.

@gokceuludogan
Contributor

We haven't yet tested TURNA's performance for generating sentence embeddings. Your proposed approach seems logical. However, it's notable that the paper "Sentence-T5: Scalable Sentence Encoders from Pre-trained Text-to-Text Models" explores various methods for encoding text/sentences using pre-trained T5 models. They found that utilizing the encoder and averaging its token representations performs better than using both encoder and decoder.
An alternative to the suggested method involves using only the encoder by loading it with the T5EncoderModel class instead of AutoModel. Here's an example of how to obtain embeddings using this method:

def mean_pooling(sentence):
    input_ids = tokenizer(sentence, return_tensors="pt").input_ids # Batch size 1
    with torch.no_grad():
        output = model(input_ids=input_ids)
    return output.last_hidden_state.mean(dim=1)[0]
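
If it helps with the caching concern from your original question, here is a minimal, untested sketch of that encoder-only approach end to end: the corpus is embedded once with `T5EncoderModel`, and each new sentence is then embedded on its own and compared only against the cached matrix. The example sentences, the query, and the variable names are just for illustration.

```python
import numpy as np
import torch
from sklearn.metrics.pairwise import cosine_similarity
from transformers import AutoTokenizer, T5EncoderModel

# Load only the encoder; the decoder weights are not needed for embeddings
model_name = "boun-tabi-LMG/TURNA"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = T5EncoderModel.from_pretrained(model_name)
model.eval()

def embed(sentence):
    input_ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        output = model(input_ids=input_ids)
    # Mean-pool the encoder token representations into a single vector
    return output.last_hidden_state.mean(dim=1)[0].numpy()

# Embed the corpus once and cache the result (e.g. with np.save / np.load)
corpus = ["elma güzel meyvedir", "armut lezzetlidir", "kitap", "defter"]
corpus_embeddings = np.stack([embed(s) for s in corpus])

# For a newly entered sentence, embed only that sentence and compare
query = "armut tatlı bir meyvedir"
scores = cosine_similarity(embed(query).reshape(1, -1), corpus_embeddings)[0]
print(corpus[int(np.argmax(scores))], scores)
```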

Additionally, consider exploring finetuned NLI and STS models for extracting embeddings:

https://huggingface.co/boun-tabi-LMG/turna_nli_nli_tr
https://huggingface.co/boun-tabi-LMG/turna_semantic_similarity_stsb_tr
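
If you try those checkpoints, note that I haven't verified their expected input format here; they are text-to-text models, so something along the lines of the sketch below should apply, but the prompt layout is an assumption and the model cards are the authoritative reference.

```python
from transformers import pipeline

# Assumption: the fine-tuned STS checkpoint generates a similarity score as text
# when given both sentences in a single prompt; the exact prompt format below is
# a guess, so check the model card before relying on it.
sts = pipeline("text2text-generation", model="boun-tabi-LMG/turna_semantic_similarity_stsb_tr")
print(sts("ilk cümle: elma güzel meyvedir ikinci cümle: armut lezzetlidir"))
```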

Please share your findings with us. I'm eager to learn about the results.

@onurgu

onurgu commented Apr 15, 2024

Hi @bendarodes, any news?
