hackerllama/blog/posts/sentence_embeddings/ #4

Open
utterances-bot opened this issue Jan 17, 2024 · 7 comments

Comments

@utterances-bot
hackerllama - Sentence Embeddings

Everything you wanted to know about sentence embeddings (and maybe a bit more)

https://osanseviero.github.io/hackerllama/blog/posts/sentence_embeddings/

@songxujay

Hello, author, many thanks for your explanation. You said, "We'll start using all-MiniLM-L6-v2. It's not the best open-source embedding model." I'd like to know which model is the best and how to find a list of the best models. I'm new to this, sorry for the basic question, thank you very much!

@0ENZO

0ENZO commented Jan 17, 2024


Hi @songxujay, the author covers this in the "Selecting and evaluating models" section. Have a look at it. One of the main sources is still the MTEB Leaderboard - https://huggingface.co/spaces/mteb/leaderboard
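
For a quick start, here is a minimal sketch of swapping in a model from that leaderboard with sentence-transformers. The model id used below (BAAI/bge-small-en-v1.5) is only an illustrative pick, not a claim about which model is currently best:

```python
# Minimal sketch: any model from the MTEB leaderboard can be swapped in by its
# Hugging Face model id. "BAAI/bge-small-en-v1.5" is just an example here.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-small-en-v1.5")
embeddings = model.encode([
    "Sentence embeddings map text to fixed-size vectors.",
    "The MTEB leaderboard ranks embedding models across many tasks.",
])
print(embeddings.shape)  # (2, embedding_dim)
```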

@arnoldlayne0

Hi! Thank you for the great article. To better understand the differences between word2vec- and Transformer-based embeddings, could you elaborate on how the masked language modelling objective of BERT differs from the CBOW objective in word2vec (which, as I understand it, is also about "filling in a blank")? Is it that the objectives are similar but the neural net architectures differ in these two approaches, allowing BERT to add contextual info?

@osanseviero
Owner

Hey @arnoldlayne0! Overall you're right, the BERT and CBOW objectives have some similarities. Here are some differences (a small code sketch follows the list):

  • CBOW context window is fixed, so it doesn't capture the broader context outside the window
  • CBOW treats all context words in the same way, while BERT uses attention mechanisms to weigh each token embedding differently
  • Because of this, CBOW also has no sense of directionality or word order
  • BERT can actually mask multiple tokens at the same time
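
To make the contextual point concrete, here is a rough sketch (not code from the article, assuming a standard transformers + torch install) showing that BERT gives the same word different vectors in different sentences, whereas a static word2vec/CBOW embedding would be identical in both contexts:

```python
# Rough sketch: the same surface word gets different BERT vectors depending on
# its sentence, while a static word2vec/CBOW embedding would be identical.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def embed_word(sentence, word):
    # Assumes `word` appears as a single token after tokenization.
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    return hidden[tokens.index(word)]

river_bank = embed_word("i sat by the bank of the river.", "bank")
money_bank = embed_word("i deposited money at the bank.", "bank")
print(torch.cosine_similarity(river_bank, money_bank, dim=0))  # noticeably below 1.0
```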

@9opsec

9opsec commented Apr 21, 2024

I think something has changed about the quora dataset used in the Colab example. I'm getting this error:

from datasets import load_dataset
dataset = load_dataset("quora")["train"]

TypeError: http_get() got an unexpected keyword argument 'displayed_filename'
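
From the error message this looks less like a change in the quora dataset and more like a version mismatch between datasets and huggingface_hub (one library calls http_get() with a keyword the other doesn't know yet). A hedged sketch of the usual workaround, not a confirmed fix:

```python
# Speculative workaround: align the two libraries, then re-run the original
# snippet unchanged.
#     pip install -U datasets huggingface_hub
# Recent datasets releases may also ask for trust_remote_code=True when a
# dataset still uses a loading script.
from datasets import load_dataset

dataset = load_dataset("quora")["train"]
print(dataset[0])  # inspect the first example to confirm the load worked
```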


Just what I needed when entering the world of LLMs, thanks a lot!

@qiulang

qiulang commented Sep 27, 2024

Hi, I would like to know: other than Sentence Transformers (SBERT), what other open-source sentence embedding methods can I choose?

I found two other options, InferSent and Google's USE, but InferSent seems dead now and USE isn't widely used either. In 2024 I don't think I should use Doc2Vec or Word2Vec, right?

So why has SBERT taken over sentence embedding methods?
