Implement fast similarity search #200

Open
c913168497 opened this issue Jun 5, 2024 · 7 comments
@c913168497
Contributor

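  1. I'd like a feature that vectorizes text data and stores it in a vector database to enable fast similarity search. Text fragments relevant to an input query could then be retrieved and fed in as input, ultimately forming the prompt I need.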
@phodal
Member

phodal commented Jun 5, 2024

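We implemented a similar feature in the VSCode version, but it made the plugin too large, and we don't currently have the bandwidth to port it to IDEA. For details, see: https://github.com/unit-mesh/auto-dev-vscode

The more ideal approach would be to use a separate embedding package and a vector database.

PRs are welcome.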

@phodal phodal added the help wanted Extra attention is needed label Jun 5, 2024
@c913168497
Contributor Author

Are there any embedding packages or vector databases you would recommend for this?

@phodal
Member

phodal commented Jun 5, 2024

You can refer to the VSCode version.

@phodal
Member

phodal commented Jun 23, 2024

@c913168497

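Option 1: Use the TF-IDF algorithm. It is what Copilot mainly uses, and compared with embeddings and the like it is still quite reliable.
Option 2: Within Unit Mesh, you can build this with our LLM SDK: https://github.com/unit-mesh/chocolate-factory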
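
For readers unfamiliar with Option 1, here is a minimal sketch of TF-IDF ranking with cosine similarity. It is an illustration only: `tokenize` and `TfIdfIndex` are hypothetical names, the smoothed IDF formula is one common variant, and nothing here is taken from Copilot, AutoDev, or chocolate-factory.

```kotlin
import kotlin.math.ln
import kotlin.math.sqrt

// Hypothetical helper: split on anything that is not a letter, digit, or underscore.
fun tokenize(text: String): List<String> =
    text.lowercase().split(Regex("[^\\p{L}\\p{N}_]+")).filter { it.isNotEmpty() }

// Hypothetical class: a tiny in-memory TF-IDF index over a fixed list of documents.
class TfIdfIndex(private val docs: List<String>) {
    private val tokenized = docs.map { tokenize(it) }

    // Document frequency: number of documents that contain each term.
    private val df: Map<String, Int> =
        tokenized.flatMap { it.toSet() }.groupingBy { it }.eachCount()

    // Sparse TF-IDF vectors, one per document.
    private val docVectors = tokenized.map { vectorize(it) }

    // Smoothed IDF, so terms unseen in the corpus do not cause a division by zero.
    private fun idf(term: String): Double =
        ln((1.0 + docs.size) / (1.0 + (df[term] ?: 0))) + 1.0

    // Term frequency (relative count) multiplied by IDF.
    private fun vectorize(tokens: List<String>): Map<String, Double> =
        tokens.groupingBy { it }.eachCount()
            .mapValues { (term, count) -> count.toDouble() / tokens.size * idf(term) }

    private fun cosine(a: Map<String, Double>, b: Map<String, Double>): Double {
        val dot = a.entries.sumOf { (k, v) -> v * (b[k] ?: 0.0) }
        val normA = sqrt(a.values.sumOf { it * it })
        val normB = sqrt(b.values.sumOf { it * it })
        return if (normA == 0.0 || normB == 0.0) 0.0 else dot / (normA * normB)
    }

    // Return the topK documents with their similarity to the query, best first.
    fun search(query: String, topK: Int = 3): List<Pair<String, Double>> {
        val q = vectorize(tokenize(query))
        return docs.indices
            .map { docs[it] to cosine(q, docVectors[it]) }
            .sortedByDescending { it.second }
            .take(topK)
    }
}
```

The retrieved snippets could then be concatenated into the prompt. Note that this tokenizer does not segment Chinese text, so a real setup would likely need a language-aware tokenizer.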

phodal added a commit that referenced this issue Aug 7, 2024
Deleted the `StandardTextChunk.kt` file from the `src/main/kotlin/cc/unitmesh/devti/agent/model` directory as it was not in use.
phodal added a commit that referenced this issue Aug 7, 2024
This commit introduces the LocalEmbedding class, which provides functionality to generate text embeddings using an ONNX model and HuggingFace tokenizer. The class includes a suspendable embed function and supports parallel processing for embedding generation. It also features a companion object to create instances of the LocalEmbedding class with a default model.
@phodal phodal self-assigned this Aug 7, 2024
phodal added a commit that referenced this issue Aug 7, 2024
…search indices #200

This commit introduces two classes, `InMemoryEmbeddingSearchIndex` and `DiskSynchronizedEmbeddingSearchIndex`, which implement the `EmbeddingSearchIndex` interface. These classes provide methods for adding, updating, and deleting embedding entries, as well as searching for the closest embeddings to a given query embedding. The `InMemoryEmbeddingSearchIndex` stores all embeddings in memory, while the `DiskSynchronizedEmbeddingSearchIndex` synchronizes index changes with disk storage. Additionally, the commit includes a `LockedSequenceWrapper` class to safely iterate over embeddings under a lock, as well as utility functions for calculating embedding similarity and normalization. Overall, these classes provide efficient and thread-safe ways to manage and search embedding indices.

feat(embedding): add in-memory and disk-synced embedding search index

This commit introduces a new in-memory and disk-synchronized embedding search index. The in-memory index stores embeddings in memory and supports concurrent read operations, while the disk-synchronized index maintains index synchronization with disk storage. Both indices implement the EmbeddingSearchIndex interface, providing methods for adding entries, saving/loading from disk, finding closest embeddings, and more. Additionally, the LockedSequenceWrapper ensures thread-safe iteration over the index.
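
As a rough illustration of the idea described in this commit message (an in-memory index guarded by a lock, with normalization and similarity helpers), here is a minimal sketch. It is not the project's code: `SimpleInMemoryIndex`, `normalized`, and `dot` are hypothetical names, and the real `EmbeddingSearchIndex` interface may look quite different.

```kotlin
import java.util.concurrent.locks.ReentrantReadWriteLock
import kotlin.concurrent.read
import kotlin.concurrent.write
import kotlin.math.sqrt

// Scale a vector to unit length so that a dot product equals cosine similarity.
fun normalized(v: FloatArray): FloatArray {
    val norm = sqrt(v.fold(0.0) { acc, x -> acc + x * x }).toFloat()
    return if (norm == 0f) v.copyOf() else FloatArray(v.size) { v[it] / norm }
}

fun dot(a: FloatArray, b: FloatArray): Float {
    require(a.size == b.size) { "dimension mismatch" }
    var sum = 0f
    for (i in a.indices) sum += a[i] * b[i]
    return sum
}

// Hypothetical class: many concurrent readers, exclusive writers.
class SimpleInMemoryIndex {
    private val lock = ReentrantReadWriteLock()
    private val entries = LinkedHashMap<String, FloatArray>()

    // Add or overwrite an entry; embeddings are stored normalized.
    fun addEntry(id: String, embedding: FloatArray) {
        lock.write { entries[id] = normalized(embedding) }
    }

    fun deleteEntry(id: String) {
        lock.write { entries.remove(id) }
    }

    // Brute-force scan: score every entry against the query, keep the topK best.
    fun findClosest(query: FloatArray, topK: Int): List<Pair<String, Float>> {
        val q = normalized(query)
        return lock.read {
            entries.map { (id, e) -> id to dot(q, e) }
                .sortedByDescending { it.second }
                .take(topK)
        }
    }
}
```

A linear scan like this is usually adequate for a few thousand snippets. Larger collections would typically call for an approximate nearest-neighbor structure, which this sketch does not attempt.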
phodal added a commit that referenced this issue Aug 8, 2024
…urrentHashMap.newKeySet #200

This change replaces the usage of `ConcurrentCollectionFactory` with `ConcurrentHashMap.newKeySet` for creating a concurrent set of unchecked IDs, simplifying the dependency and utilizing a more direct approach for concurrent set creation in `DiskSynchronizedEmbeddingSearchIndex`.
phodal added a commit that referenced this issue Aug 8, 2024
…th ConcurrentHashMap.newKeySet #200

This change improves the performance and memory usage by utilizing the built-in `ConcurrentHashMap.newKeySet()` for the `uncheckedIds` set, which provides a more efficient concurrent implementation.
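
For context, `ConcurrentHashMap.newKeySet()` is a standard JDK factory that returns a thread-safe set backed by a `ConcurrentHashMap`. A small sketch of how such a set behaves under concurrent writes follows. The function name `collectIdsConcurrently` and the ID format are made up for illustration; only the `uncheckedIds` name comes from the commit message.

```kotlin
import java.util.concurrent.ConcurrentHashMap
import kotlin.concurrent.thread

// Hypothetical demo: several threads add distinct IDs to one shared concurrent set.
fun collectIdsConcurrently(workers: Int, perWorker: Int): Set<String> {
    val uncheckedIds: MutableSet<String> = ConcurrentHashMap.newKeySet<String>()
    val threads = (0 until workers).map { w ->
        thread { repeat(perWorker) { i -> uncheckedIds.add("id-$w-$i") } }
    }
    // Wait for all writers to finish before reading the result.
    threads.forEach { it.join() }
    return uncheckedIds
}
```

Because every ID is distinct and the set is safe for concurrent use, no additions are lost, which would not be guaranteed with a plain `HashSet`.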
@c913168497
Contributor Author

Awesome, it's really being implemented. I'll give it a try.

@c913168497
Contributor Author

I took a look at the code; it's still a work in progress. Keep it up!

@phodal
Member

phodal commented Aug 16, 2024

It's only supported at the interface level; the functionality itself hasn't been implemented yet.

2 participants