hackerllama/blog/posts/sentence_embeddings/ #4

Open
utterances-bot opened this issue Jan 17, 2024 · 7 comments

Comments

@utterances-bot
hackerllama - Sentence Embeddings

Everything you wanted to know about sentence embeddings (and maybe a bit more)

https://osanseviero.github.io/hackerllama/blog/posts/sentence_embeddings/

@songxujay

Hello, author, many thanks for your explanation. You said, "We'll start using all-MiniLM-L6-v2. It's not the best open-source embedding model." I'd like to know which model is the best and how to find a list of the best models. I'm new to this, sorry for the basic question, thank you very much!

@0ENZO

0ENZO commented Jan 17, 2024


Hi @songxujay, the author covers this in the "Selecting and evaluating models" section. Have a look at it. One of the main sources is still the MTEB Leaderboard - https://huggingface.co/spaces/mteb/leaderboard
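
For a quick start, here is a minimal sketch of swapping in a model from that leaderboard with sentence-transformers. The model id used below (BAAI/bge-small-en-v1.5) is only an illustrative pick, not a claim about which model is currently best:

```python
# Minimal sketch: any model from the MTEB leaderboard can be swapped in by its
# Hugging Face model id. "BAAI/bge-small-en-v1.5" is just an example here.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-small-en-v1.5")
embeddings = model.encode([
    "Sentence embeddings map text to fixed-size vectors.",
    "The MTEB leaderboard ranks embedding models across many tasks.",
])
print(embeddings.shape)  # (2, embedding_dim)
```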

@arnoldlayne0

Hi! Thank you for the great article. To better understand the differences between word2vec- and Transformer-based embeddings, could you elaborate on how the masked language modelling objective of BERT differs from the CBOW objective in word2vec (which, as I understand it, is also about "filling in a blank")? Is it that the objectives are similar but the neural net architectures differ in these two approaches, allowing BERT to add contextual info?

@osanseviero
Owner

Hey @arnoldlayne0! Overall you're right, the BERT and CBOW objectives have some similarities. Here are some differences (a small code sketch follows the list):

  • CBOW context window is fixed, so it doesn't capture the broader context outside the window
  • CBOW treats all context words in the same way, while BERT uses attention mechanisms to weigh each token embedding differently
  • Because of this, CBOW also has no sense of directionality or word order
  • BERT can actually mask multiple tokens at the same time
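
To make the contextual point concrete, here is a rough sketch (not code from the article, assuming a standard transformers + torch install) showing that BERT gives the same word different vectors in different sentences, whereas a static word2vec/CBOW embedding would be identical in both contexts:

```python
# Rough sketch: the same surface word gets different BERT vectors depending on
# its sentence, while a static word2vec/CBOW embedding would be identical.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def embed_word(sentence, word):
    # Assumes `word` appears as a single token after tokenization.
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    return hidden[tokens.index(word)]

river_bank = embed_word("i sat by the bank of the river.", "bank")
money_bank = embed_word("i deposited money at the bank.", "bank")
print(torch.cosine_similarity(river_bank, money_bank, dim=0))  # noticeably below 1.0
```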

@9opsec

9opsec commented Apr 21, 2024

I think something has changed about the quora dataset used in the Colab example. I'm getting this error:

from datasets import load_dataset
dataset = load_dataset("quora")["train"]

TypeError: http_get() got an unexpected keyword argument 'displayed_filename'
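
From the error message this looks less like a change in the quora dataset and more like a version mismatch between datasets and huggingface_hub (one library calls http_get() with a keyword the other doesn't know yet). A hedged sketch of the usual workaround, not a confirmed fix:

```python
# Speculative workaround: align the two libraries, then re-run the original
# snippet unchanged.
#     pip install -U datasets huggingface_hub
# Recent datasets releases may also ask for trust_remote_code=True when a
# dataset still uses a loading script.
from datasets import load_dataset

dataset = load_dataset("quora")["train"]
print(dataset[0])  # inspect the first example to confirm the load worked
```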


Just what I needed when entering the world of LLMs, thanks a lot!

@qiulang

qiulang commented Sep 27, 2024

Hi, I would like to know: other than Sentence Transformers (SBERT), what other open-source sentence embedding methods can I choose?

I found two other options, InferSent and Google's USE, but InferSent seems dead now and USE isn't widely used either. In 2024 I don't think I should use Doc2Vec or Word2Vec, right?

So why has SBERT taken over sentence embedding methods?
