For future reference: What embeddings are good and bad at, why hybrid representation is key in this project #10

patham9 commented May 4, 2023

Sentence-embedding similarity is sufficient for querying relevant information, in the sense that the same information will likely appear among the top-k retrieved items, but it is not sufficient to check whether a particular item really is the same information. As Hugo pointed out, the issue is especially pronounced when the sentence contains asymmetric relations, which current sentence embeddings do not properly capture.
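
For reference, here is a minimal self-contained sketch of the two helpers used in the snippets below, plus a top-k retrieval query illustrating what embeddings are good at. It assumes the OpenAI embeddings API as it existed around the time of this issue; the openai 0.x Python client and the text-embedding-ada-002 model are assumptions for illustration, not a claim about this project's code:

import numpy as np
import openai  # assumes the openai 0.x Python client (circa May 2023)

def get_embedding(text):
    # Embed one sentence; the model name is an assumption for illustration.
    resp = openai.Embedding.create(input=[text], model="text-embedding-ada-002")
    return np.array(resp["data"][0]["embedding"])

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Retrieval works well: the relevant item reliably lands in the top-k ...
memory = ["the cat hates the dog",
          "the duck is floating in the pond",
          "the person is drinking the beer"]
query = get_embedding("does the cat hate the dog?")
top_k = sorted(memory, key=lambda s: -cosine_similarity(query, get_embedding(s)))[:2]
print(top_k)  # ... but a high score alone does not certify identical meaning.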

Essentially, even a cosine similarity close to 1 is not a reliable indicator that two sentences carry the same information: there will be other pairs which score higher and yet are completely different. Here are examples to illustrate:

Things which convey different information yet receive a high score:

emb1 = get_embedding("the cat hates the dog")
emb2 = get_embedding("the dog hates the cat")
print(cosine_similarity(emb1, emb2))
0.98
emb1 = get_embedding("the cat is lying on the table")
emb2 = get_embedding("the cat is lying under the table")
print(cosine_similarity(emb1, emb2))
0.97

Things which convey the same information, yet score lower than the pairs above:

emb1 = get_embedding("the duck is floating in the pond")
emb2 = get_embedding("the duck is swimming in the pond")
print(cosine_similarity(emb1, emb2))
0.96
emb1 = get_embedding("the person is drinking the beer")
emb2 = get_embedding("the person is taking a sip from the beer")
0.96

For NARS it is only safe to merge evidence when it concerns the same relation, which is addressed by triggering revision only when the symbolic encodings match exactly. In this project this works even with natural-sentence input, since GPT is prompted to extract atomic pieces of information (simple sentences involving a subject, relation, and predicate) that NARS can effectively work with and reason on; these pieces, together with the pieces NARS derives, then become part of GPT's prompt for question answering.
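
As a concrete sketch of this match-gated merging: the revision formulas below are the standard NAL ones, while the triple-keyed memory is an assumption for illustration, not this project's actual data structure:

# Minimal sketch: revision fires only on an exact symbolic match of the
# (subject, relation, predicate) triple, never on embedding similarity.
def revise(truth1, truth2):
    # Standard NAL revision: pool the evidence of two independent sources.
    (f1, c1), (f2, c2) = truth1, truth2
    w1, w2 = c1 * (1 - c2), c2 * (1 - c1)
    f = (f1 * w1 + f2 * w2) / (w1 + w2)
    c = (w1 + w2) / (w1 + w2 + (1 - c1) * (1 - c2))
    return (f, c)

def add_evidence(memory, triple, truth):
    # memory maps triple -> (frequency, confidence)
    if triple in memory:
        memory[triple] = revise(memory[triple], truth)
    else:
        memory[triple] = truth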

But in case better embeddings become available which can reliably capture relational semantics, the following can be tried in addition (a sketch of the second variant follows after this list):

  • gptONA: when Relation/Property claims are made, check existing embeddings and reuse the existing relational representation if the similarity is above a threshold, potentially penalizing confidence by the difference (simply by multiplying with the embedding similarity).
  • NarsGPT: leave the relational encoding as-is, but apply revision & choice to items within a threshold of sentence-embedding similarity, again penalizing the evidence based on the difference (by multiplication with the similarity score).
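
A minimal sketch of the NarsGPT variant, reusing revise, get_embedding, and cosine_similarity from above; the threshold value and the multiply-by-similarity penalty are taken directly from the bullet point and are hypothetical, not an implemented feature:

SIM_THRESHOLD = 0.99  # hypothetical; would need tuning per embedding model

def add_evidence_soft(memory, sentence, truth):
    # memory maps sentence -> (embedding, (frequency, confidence))
    emb = get_embedding(sentence)
    for other, (other_emb, other_truth) in memory.items():
        sim = cosine_similarity(emb, other_emb)
        if sim >= SIM_THRESHOLD:
            # Penalize the incoming evidence by the residual dissimilarity,
            # then revise with the near-duplicate item.
            f, c = truth
            memory[other] = (other_emb, revise(other_truth, (f, c * sim)))
            return
    memory[sentence] = (emb, truth)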
patham9 changed the title from "For future reference: Embedding similarity cannot be used for revision (and isn't)" to "For future reference: What embeddings are good and bad at, why hybrid representation is key in this project" on May 4, 2023.