Multiple Semantic Textual Similarity Problem #71
```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Load T5 model and tokenizer
model_name = "boun-tabi-LMG/TURNA"
# ...

# sample sentences
# ...

# Encode sentences using mean pooling
outputs = []
# ...

# Assuming you have 'embeddings' as your data
n = len(embeddings)

# Compute cosine similarity for the upper triangular part
upper_triangular = np.triu(cosine_similarity(embeddings), k=1)

# Fill the lower triangular part
similarity_matrix = upper_triangular + upper_triangular.T

# Print the similarity matrix
print("Cosine similarity matrix:")
```

I wrote this code, but all of the similarity values come out very close to each other, so the similarity scores are not accurate. I would be very grateful if you could help me understand where I made a mistake.
We haven't yet tested TURNA's performance for generating sentence embeddings. Your proposed approach seems logical. However, it's notable that the paper "Sentence-T5: Scalable Sentence Encoders from Pre-trained Text-to-Text Models" explores various methods for encoding text/sentences using pre-trained T5 models. They found that utilizing the encoder and averaging its token representations performs better than using both the encoder and decoder.

```python
def mean_pooling(sentence):
    input_ids = tokenizer(sentence, return_tensors="pt").input_ids  # Batch size 1
    with torch.no_grad():
        output = model(input_ids=input_ids)
    return output.last_hidden_state.mean(dim=1)[0]
```

Additionally, consider exploring finetuned NLI and STS models for extracting embeddings: https://huggingface.co/boun-tabi-LMG/turna_nli_nli_tr

Please share your findings with us. I'm eager to learn about the results.
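For reference, a minimal batch version of that encoder-only approach might look like the sketch below. It is untested with TURNA and assumes the standard `transformers` API (`AutoTokenizer`, `T5EncoderModel`); the attention-mask weighting is an addition so that padding tokens don't skew the average.

```python
import torch
from transformers import AutoTokenizer, T5EncoderModel

# Encoder-only setup, following the Sentence-T5 finding quoted above.
model_name = "boun-tabi-LMG/TURNA"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = T5EncoderModel.from_pretrained(model_name)
model.eval()

def embed(sentences):
    # Tokenize a batch of sentences, padding them to a common length.
    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state  # (batch, seq_len, dim)
    # Zero out padding positions before averaging over the sequence axis.
    mask = batch["attention_mask"].unsqueeze(-1)   # (batch, seq_len, 1)
    summed = (hidden * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1)
    return summed / counts                         # (batch, dim) embeddings
```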
Hi @bendarodes, any news?
Hello,
I have a problem: I have thousands of sentences, and I want to determine which of these ~1000 sentences a sentence newly entered into the system is closest to. However, I don't want to re-analyze all 1000 sentences on every query. For this I need the precomputed embedding (encode) values of the sentences, but I haven't been able to obtain them. I would be very pleased if you could help me.
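One common pattern for this, sketched under the assumption that you have a batch `embed` helper like the one above: encode the corpus once, cache the embeddings to disk, and at query time encode only the new sentence. `corpus_sentences` and `new_sentence` here are hypothetical placeholders for your data.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# One-time offline step: encode all ~1000 sentences and cache the result.
corpus_embeddings = embed(corpus_sentences).numpy()  # corpus_sentences: hypothetical list[str]
np.save("corpus_embeddings.npy", corpus_embeddings)

# At query time: load the cache and encode only the incoming sentence.
corpus_embeddings = np.load("corpus_embeddings.npy")
query_embedding = embed([new_sentence]).numpy()      # new_sentence: hypothetical str

# A single vectorized similarity computation against the whole cache.
scores = cosine_similarity(query_embedding, corpus_embeddings)[0]
best = int(np.argmax(scores))
print(f"Closest sentence index: {best} (score {scores[best]:.3f})")
```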