[RFC] Multimodal semantic search design #362
Comments
Shouldn't we have embeddings for each field instead of a single embedding of all fields combined? Otherwise, won't query recall degrade a lot when data is ingested with both text and image but the search uses only text or only image?
That depends on the model; in any case, the search happens inside the model. Just to clarify: when ingesting, we combine both fields and create a single embedding. For search, we combine both fields into one input DTO for the ML client and send it in one call, receive a single embedding for the query, and do k-NN using that single vector.
Are there user requests for this feature? As a user, I would generally search using either text or image, but not both. Wouldn't having a single embedding hurt recall compared to having an embedding for each field?
There is a feature request: #318. The actual use case is to search using both image and text; more types like video or audio can be added later. You can check similar features implemented or planned by other platforms: https://cloud.google.com/blog/products/ai-machine-learning/multimodal-generative-ai-search, https://www.pinecone.io/learn/clip-image-search/.
Thanks for sharing the links. It seems this feature will be needed to integrate with Vision Language Models (VLMs).
I don't know if I like hardcoding the name as "query_image". Like you mentioned, it is not extensible. Are there any other options available? Additionally, for testing, could you elaborate on how we will benchmark the system? What datasets are available, and what will be run?
Sure, I'll add other options, but anything extensible would put us outside the 2.11 release timeline, so simplicity of implementation outweighs all other considerations here. Other options include some generic format and/or type information. As for datasets, I think we can take something like Flickr, where mixed media is present, and use recall and/or NDCG to measure changes in score accuracy. I'll search for exact datasets and update this RFC with the findings.
@martin-gaievski can we resolve this issue, since this feature has been released?
I added a few alternative options for future reference; closing this RFC.
Introduction
This issue describes the design details of multimodal semantic search support, including high- and low-level approaches and the pros and cons of each.
Background
This project aims to enhance semantic search use cases by enabling multimodal support in the Neural Search plugin. Multimodal search improves the relevancy of results by combining text with other forms of input such as image, audio, and video.
As of today, OpenSearch supports semantic search use cases purely based on text embedding models. While this works for the majority of use cases, it does not scale well for applications that need to embed other forms of input. For example, consider the query “Give me bright colored blue shoes”. This can return more relevant results if image properties such as color and intensity are captured along with the text. Multimodal search solves this problem.
Requirements
This is what is required for Phase 1, targeting the 2.11 release.
Functional Requirements
Non Functional Requirements
Scope
In this document we propose a solution for the questions below:
How do we ingest data and create embeddings as part of that process?
How do we run neural search queries using the embeddings created above and models with multimodal support?
Out of Document Scope
Changes in ml-commons related to model connector(s) are not covered; the assumption is that we have a working interface as part of the ml-commons client.
Current State
Neural search and ml-commons currently support only text embeddings.
existing format for field mapping
existing format for data ingestion pipeline definition
existing format for search request
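For reference, the existing text-only formats look roughly like the following (index, field, and model names are illustrative):

```json
PUT /my-nlp-index
{
  "settings": { "index.knn": true },
  "mappings": {
    "properties": {
      "passage_text": { "type": "text" },
      "passage_embedding": { "type": "knn_vector", "dimension": 768 }
    }
  }
}

PUT /_ingest/pipeline/nlp-ingest-pipeline
{
  "description": "Existing text embedding pipeline",
  "processors": [
    {
      "text_embedding": {
        "model_id": "<model_id>",
        "field_map": { "passage_text": "passage_embedding" }
      }
    }
  ]
}

GET /my-nlp-index/_search
{
  "query": {
    "neural": {
      "passage_embedding": {
        "query_text": "bright blue shoes",
        "model_id": "<model_id>",
        "k": 10
      }
    }
  }
}
```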
Solution Overview
Neural Search needs multimodal support in both the ingestion path and the search path.
The following are the assumptions we’re making for this change:
Data Ingestion
Search
Solution HLD
Proposed
Data Ingestion
High level sequence diagram for data ingestion workflow
For data ingestion there are two main components: field mapping and the ingestion pipeline. Ideally, our goal is to avoid any type information and instead focus on giving the user the ability to define model input and output.
Option 1: Model input/output mapping is part of the pipeline, field names are fixed
Pros:
similar to the existing text_embedding ingestion processor
Cons:
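As an illustration of Option 1, the pipeline definition could look like the sketch below: the keys text and image are the fixed, well-known input names, and the values map them to the user's document fields (the processor name follows the proposal in this RFC; pipeline and field names are hypothetical):

```json
PUT /_ingest/pipeline/nlp-multimodal-pipeline
{
  "description": "Multimodal (text + image) embedding pipeline",
  "processors": [
    {
      "text-image-embedding": {
        "model_id": "<model_id>",
        "embedding": "vector_embedding",
        "field_map": {
          "text": "image_description",
          "image": "image_binary"
        }
      }
    }
  ]
}
```

The single embedding produced from both inputs would be written to the vector_embedding field.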
Search
High level sequence diagram for search workflow
We will be modifying the existing neural query clause to support multiple fields in a single query.
Option 1: Model params are fixed
Pros:
Cons:
Solution Details
Data ingestion
We will need to create a processor with a generic name, text-image-embedding, that is independent of the option we choose. At the code level, the new processor and the existing TextEmbeddingProcessor can share the same code, or even be related via a class hierarchy.
For Option 1 we need to add a map of field types, which can be passed to the processor from a Processor.Factory. The default type can be text. We need to validate the case where a field has no type information and no field mapping info.
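For example, with such a pipeline attached, ingesting a document that carries both modalities could look like this (field names and the base64 image value are illustrative):

```json
PUT /my-nlp-index/_doc/1?pipeline=nlp-multimodal-pipeline
{
  "image_description": "Bright blue shoes on an orange table",
  "image_binary": "iVBORw0KGgoAAAANSUhEUg..."
}
```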
Search
The main change is in query parsing: we need to move from simple parsing of a text field to more sophisticated logic for different field types.
We must keep the current format, as it is already in use by customers, and handle today’s query_text format the same way as the new format shown below:
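As a sketch of the new format (the query_image parameter name follows the proposal discussed above; field and model names are illustrative):

```json
GET /my-nlp-index/_search
{
  "query": {
    "neural": {
      "vector_embedding": {
        "query_text": "bright blue shoes",
        "query_image": "iVBORw0KGgoAAAANSUhEUg...",
        "model_id": "<model_id>",
        "k": 10
      }
    }
  }
}
```

A query that sets only query_text keeps working and takes the same parsing path.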
Use of that field must be marked as deprecated
Both data ingestion and search need changes in MLCommonsClientAccessor.inferenceSentence, as both essentially use embeddings produced by the predict API in ml-commons.
We need to change the way we call the predict API using MLClient. In particular, we need to use RemoteInferenceInputDataSet instead of TextDocsInputDataSet. There is no need to differentiate between text and image because RemoteInferenceInputDataSet accepts only String values.
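A rough sketch of the resulting predict call for a remote model: the parameters map is what RemoteInferenceInputDataSet carries, and the keys used here (inputText, inputImage) are placeholders whose real names are defined by the model connector:

```json
POST /_plugins/_ml/models/<model_id>/_predict
{
  "parameters": {
    "inputText": "bright blue shoes",
    "inputImage": "iVBORw0KGgoAAAANSUhEUg..."
  }
}
```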
Alternatives Considered
Data Ingestion
Option 2: Model input/output mapping is part of the pipeline
Pros
Cons
Option 3: Type is part of the ingestion pipeline, no model input/output mapping
Type information can be passed as part of the ingestion pipeline.
Pros
Cons
Option 4: Field type is part of mapping
We need some extra field to describe the type. We’ll be adding the type of the field under the meta collection, using a new field, mime_type.
Pros
Cons
Search
We will be modifying the existing neural query clause to support multiple fields in a single query.
Option 2: Model params are part of search request, field names are fixed
Pros:
Cons
Option 3: Type specific attributes
Pros:
Cons:
Option 4: Generic attributes list
Pros:
Cons:
Solution Comparison
Most alternatives were rejected due to the short timeline of the initial phase, but they can be considered for future development, as most of them focus on an extensible approach.
We are inclined to avoid options that bundle the OpenSearch type and model parameter mapping, as those are unrelated categories; it's better to keep the mapping closer to the pipeline definition or, ideally, read it at runtime from the model connector definition. That, however, has its own problems, as such a read will affect performance, which is especially bad for the search flow.
Another aspect is the embedding processors: should we build something generic and configurable, or should it be a set of specialized processors?
Testability
General tests for the neural-search repo will be part of the development.
Limited integration tests are possible, but the connector and model will be mocked.
Manual testing of the real end-to-end scenario will be conducted with model connectors and remotely hosted models.
Reference Links