[RFC] Multimodal semantic search design #362
Comments
Shouldn't we have embeddings for each field instead of a single embedding of all fields combined? Otherwise, won't query recall degrade a lot when data is ingested with both text and image but the search uses only text or only image?
That depends on the model; in any case, the search happens inside the model. Just to clarify: when ingesting, we combine both fields and create a single embedding. For search, we combine both fields into one input DTO for the ML client and send it in one call, receive a single embedding for the query, and do k-NN using that single vector.
Are there user requests for this feature? As a user, I would generally search using either text or image, but not both. Wouldn't having a single embedding hurt recall compared to having an embedding for each field?
There is a feature request: #318. The actual use case is to search using both image and text; more types like video or audio can be added later. You can check similar features implemented or planned by other platforms: https://cloud.google.com/blog/products/ai-machine-learning/multimodal-generative-ai-search, https://www.pinecone.io/learn/clip-image-search/.
Thanks for sharing the links. It seems this feature will be needed to integrate with Vision Language Models (VLMs).
I don't know if I like hardcoding the name as "query_image". Like you mentioned, it is not extensible. Are there any other options available? Additionally, for testing, could you elaborate on how we will benchmark the system? What datasets are available, and what will be run?
Sure, I'll add other options, but anything extensible would put us outside the 2.11 release timeline, so simplicity of implementation outweighs all other considerations here. Other options include some generic format and/or type information. As for datasets, I think we can take something like Flickr, where mixed media is present, and use recall and/or NDCG to measure changes in score accuracy. I'll search for exact datasets and update this RFC with the findings.
@martin-gaievski can we resolve this issue, since this feature has been released?
I added a few alternative options for future reference; closing this RFC.
Introduction
This issue describes the design details of multimodal semantic search support, including high- and low-level approaches and the pros and cons of each.
Background
This project aims to enhance semantic search use cases by enabling multimodal support in the Neural Search plugin. Multimodal search improves the relevancy of results by combining text with other forms of input such as image, audio, and video.
As of today, OpenSearch supports semantic search use cases purely based on text embedding models. While this works for the majority of use cases, it does not scale well for applications that need to embed other forms of input. For example, consider the query “Give me bright colored blue shoes”. This can return more relevant results if image properties such as color and intensity are captured along with the text. Multimodal search solves this problem.
Requirements
This is what is required for Phase 1, targeting the 2.11 release.
Functional Requirements
Non Functional Requirements
Scope
In this document we propose a solution for the questions below:
How do we ingest data and create embeddings as part of that process?
How do we run neural search queries using the embeddings created above and models with multimodal support?
Out of Document Scope
Changes in ml-commons related to model connector(s) are not covered; the assumption is that we have a working interface as part of the ml-commons client.
Current State
Neural search and ml-commons currently support only text embeddings.
existing format for field mapping
existing format for data ingestion pipeline definition
existing format for search request
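For reference, the existing text-only formats look roughly like the following (index, field, and model names are illustrative):

```json
PUT /my-nlp-index
{
  "settings": { "index.knn": true },
  "mappings": {
    "properties": {
      "passage_text": { "type": "text" },
      "passage_embedding": { "type": "knn_vector", "dimension": 768 }
    }
  }
}

PUT /_ingest/pipeline/nlp-ingest-pipeline
{
  "description": "Existing text embedding pipeline",
  "processors": [
    {
      "text_embedding": {
        "model_id": "<model_id>",
        "field_map": { "passage_text": "passage_embedding" }
      }
    }
  ]
}

GET /my-nlp-index/_search
{
  "query": {
    "neural": {
      "passage_embedding": {
        "query_text": "bright blue shoes",
        "model_id": "<model_id>",
        "k": 10
      }
    }
  }
}
```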
Solution Overview
Neural Search needs multimodal support in both the ingestion path and the search path.
The following are the assumptions we’re making for this change:
Data Ingestion
Search
Solution HLD
Proposed
Data Ingestion
High level sequence diagram for data ingestion workflow
For data ingestion there are two main components: field mapping and the ingestion pipeline. Ideally, our goal is to avoid any type information and instead focus on giving the user the ability to define model input and output.
Option 1: Model input/output mapping is part of the pipeline, field names are fixed
Pros:
similar to the existing text_embedding ingestion processor
Cons:
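As an illustration of Option 1, the pipeline definition could look like the sketch below: the keys text and image are the fixed, well-known input names, and the values map them to the user's document fields (the processor name follows the proposal in this RFC; pipeline and field names are hypothetical):

```json
PUT /_ingest/pipeline/nlp-multimodal-pipeline
{
  "description": "Multimodal (text + image) embedding pipeline",
  "processors": [
    {
      "text-image-embedding": {
        "model_id": "<model_id>",
        "embedding": "vector_embedding",
        "field_map": {
          "text": "image_description",
          "image": "image_binary"
        }
      }
    }
  ]
}
```

The single embedding produced from both inputs would be written to the vector_embedding field.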
Search
High level sequence diagram for search workflow
We will be modifying the existing neural query clause to support multiple fields in a single query.
Option 1: Model params are fixed
Pros:
Cons:
Solution Details
Data ingestion
We will need to create a processor with a generic name, text-image-embedding, that is independent of the option we choose. At the code level, the new processor and the existing TextEmbeddingProcessor can share the same code, or even be related via a class hierarchy.
For Option 1 we need to add a map of field types, which can be passed to the processor from a Processor.Factory. The default type can be text. We need to validate the case where a field has no type information and no field mapping info.
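For example, with such a pipeline attached, ingesting a document that carries both modalities could look like this (field names and the base64 image value are illustrative):

```json
PUT /my-nlp-index/_doc/1?pipeline=nlp-multimodal-pipeline
{
  "image_description": "Bright blue shoes on an orange table",
  "image_binary": "iVBORw0KGgoAAAANSUhEUg..."
}
```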
Search
The main change is in query parsing: we need to move from simple parsing of a text field to more sophisticated logic for different field types.
We must keep the current format, as it is already in use by customers, and handle today’s query_text format the same way as the new format shown below:
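As a sketch of the new format (the query_image parameter name follows the proposal discussed above; field and model names are illustrative):

```json
GET /my-nlp-index/_search
{
  "query": {
    "neural": {
      "vector_embedding": {
        "query_text": "bright blue shoes",
        "query_image": "iVBORw0KGgoAAAANSUhEUg...",
        "model_id": "<model_id>",
        "k": 10
      }
    }
  }
}
```

A query that sets only query_text keeps working and takes the same parsing path.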
Use of that field must be marked as deprecated
Both data ingestion and search need changes in MLCommonsClientAccessor.inferenceSentence, as both essentially use embeddings produced by the predict API in ml-commons.
We need to change the way we call the predict API using MLClient. In particular, we need to use RemoteInferenceInputDataSet instead of TextDocsInputDataSet. There is no need to differentiate between text and image because RemoteInferenceInputDataSet accepts only String values.
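A rough sketch of the resulting predict call for a remote model: the parameters map is what RemoteInferenceInputDataSet carries, and the keys used here (inputText, inputImage) are placeholders whose real names are defined by the model connector:

```json
POST /_plugins/_ml/models/<model_id>/_predict
{
  "parameters": {
    "inputText": "bright blue shoes",
    "inputImage": "iVBORw0KGgoAAAANSUhEUg..."
  }
}
```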
Alternatives Considered
Data Ingestion
Option 2: Model input/output mapping is part of the pipeline
Pros
Cons
Option 3: Type is part of the ingestion pipeline, no model input/output mapping
Type information can be passed as part of the ingestion pipeline.
Pros
Cons
Option 4: Field type is part of mapping
We need some extra field to describe the type. We’ll be adding the type of the field under the meta collection, using a new field, mime_type.
Pros
Cons
Search
We will be modifying the existing neural query clause to support multiple fields in a single query.
Option 2: Model params are part of search request, field names are fixed
Pros:
Cons
Option 3: Type specific attributes
Pros:
Cons:
Option 4: Generic attributes list
Pros:
Cons:
Solution Comparison
Most alternatives were rejected due to the short timeline of the initial phase, but they can be considered for future development, as most of them focus on an extensible approach.
We are inclined to avoid options that bundle the OpenSearch type and model parameter mapping, as those are unrelated categories; it's better to keep the mapping closer to the pipeline definition or, ideally, read it at runtime from the model connector definition. That, however, has its own problems, as such a read will affect performance, which is especially bad for the search flow.
Another aspect is the embedding processors: should we build something generic and configurable, or should it be a set of specialized processors?
Testability
General tests for the neural-search repo will be part of the development.
Limited integration tests are possible, but the connector and model will be mocked.
Manual testing of the real end-to-end scenario will be conducted with model connectors and remotely hosted models.
Reference Links