
[RFC] Multimodal semantic search design #362

Closed
martin-gaievski opened this issue Sep 29, 2023 · 9 comments
Labels: Enhancements, Features, untriaged

martin-gaievski (Member) commented Sep 29, 2023

Introduction

This issue describes the design for multimodal semantic search support, including high- and low-level approaches and the pros and cons of each.

Background

This project aims to enhance semantic search use cases by enabling multimodal support in the Neural Search plugin. Multimodal support improves the relevancy of results by combining text with other forms of input such as image, audio, and video.
As of today, OpenSearch supports semantic search use cases purely based on text embedding models. While this works for the majority of use cases, it does not scale well for applications that need to embed other forms of input. For example, consider the query “Give me bright colored blue shoes”. This query can return more relevant results if image properties such as color and intensity are captured along with the text. Multimodal search solves this problem.

Requirements

The following is required for Phase 1, targeting the 2.11 release.

Functional Requirements

  • Multimodal connector should be supported (text and image)
  • Interface changes should be extensible to accommodate different input formats (image, audio, video) in the future
  • The ingest processor should be able to accept a single field (text or image) or multiple fields (text and image), pass them to the multimodal model, and index the output embedding into the document
  • Search should be able to accept a single field (text or image) or multiple fields (text and image), pass them to the multimodal model, and use the output embedding to find the k nearest neighbors

Non Functional Requirements

  • minimize breaking changes; ideally the solution should be backward compatible with existing code in neural-search (two-way door decisions)

Scope

In this document we propose a solution for the questions below:

  1. How do we ingest data and create embeddings as part of that process?

  2. How do we run neural search queries using the embeddings created above and models with multimodal support?

Out of Document Scope

Changes in ml-commons related to model connector(s) are not covered; the assumption is that we have a working interface available as part of the ml-commons client.

Current State

Neural search and ml-commons currently support only text embeddings.

existing format for field mapping

PUT /my-nlp-index-1 //index definition
{
    "settings": {
        "index.knn": true,
        "default_pipeline": "nlp-pipeline"
    },
    "mappings": {
        "properties": {
            "passage_embedding": {
                "type": "knn_vector",
                "dimension": int,
                "method": {
                    "name": "string",
                    "space_type": "string",
                    "engine": "string"
                }
            },
            "passage_text": { 
                "type": "text"
            },
            "passage_image": {
                "type": "binary"
            }
        }
    }
}

existing format for data ingestion pipeline definition

PUT _ingest/pipeline/nlp-pipeline
{
  "description": "An example neural search pipeline",
  "processors" : [
    {
      "text_embedding": {
        "model_id": "bxoDJ7IHGM14UqatWc_2j",
        "field_map": {
           "passage_text": "passage_embedding"
        }
      }
    }
  ]
}

existing format for search request

GET my_index/_search
{
  "query": {
        "neural": {
              "passage_vector": {
                  "query_text": "Hello world",
                  "model_id": "xzy76xswsd",
                  "k": 100
                  }
             }
       }
  }

Solution Overview

Neural-search needs multimodal support in both the ingestion path and the search path.

The following assumptions are made for this change:

  • fields of different types (image, text, etc.) are supported by a single model
  • there will always be a single embedding vector, regardless of the number of fields
  • values for media types like image are passed with the request; references such as a URL for an image can be added later
Data Ingestion
  1. Additional fields need to be added to the definition of the ingest processor that produces embeddings
  2. Based on the assumption of a single embedding vector, we will rename the existing text_embedding processor to a more generic inference_processor.
Search
  1. An existing or new query type should support fields of different types (text, image, etc.) as part of the query clause

Solution HLD

Proposed

Data Ingestion

High level sequence diagram for data ingestion workflow

(Sequence diagram: Multimodal_ingest_sequence)

For data ingestion there are two main components: the field mapping and the ingestion pipeline. Ideally our goal is to avoid any type information and instead focus on giving the user the ability to define model input and output.

Option 1: Model input/output mapping is part of the pipeline, field names are fixed
{
  "description": "An example neural search pipeline",
  "processors" : [
    {
      "text-image-embedding": {
        "model_id": "model1234567890",
        "embedding": "multimodal_embedding" // opensearch field for embeddings 
        "field_map": {
            "text": "my_text_field", // opensearch data source field for text
            "image": "my_image_field" // opensearch data source field for image
        }
      }
    }
  ]
} 

Pros:

  • simple; the processor will be very similar to the existing text_embedding ingestion processor
  • the processor can be used in the future as a specialized processor for text and image pairs

Cons:

  • not extensible
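
For illustration only, here is a sketch of how a document could be ingested through a pipeline configured as in Option 1, assuming the pipeline is registered as nlp-pipeline and set as the index default pipeline (as in the current-state example). The field names follow the field_map above, and the image value is a truncated base64 placeholder:

PUT /my-nlp-index-1/_doc/1
{
  "my_text_field": "Bright colored blue shoes",
  "my_image_field": "iVBORw0KGgoAAAANSUhEUgAA..." // base64-encoded image, truncated placeholder
}

The processor would send both fields to the model in a single call and write the resulting vector into the multimodal_embedding field of the indexed document.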

Search

High level sequence diagram for search workflow

(Sequence diagram: Multimodal_search_sequence)

We will be modifying the existing neural query clause to support multiple fields in a single query.

Option 1: Model params are fixed
{
  "query": {
        "neural": {
              "passage_vector": { // embedding field name
                  "query_text": "Hello world", // existing format just for reference
                  "query_image": "base64forimage_1234567890",
                  "model_id": "xzy76xswsd",
                  "k": 100
                }
           }
        }
    }

Pros:

  • simple
  • consistent with query format of other queries in OpenSearch

Cons:

  • not extensible
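
Per the functional requirements, a single query field should also be accepted. Under the proposed format, an image-only query would simply omit query_text; this is a sketch and the values are placeholders:

GET my_index/_search
{
  "query": {
        "neural": {
              "passage_vector": {
                  "query_image": "base64forimage_1234567890",
                  "model_id": "xzy76xswsd",
                  "k": 100
              }
        }
  }
}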

Solution Details

Data ingestion

We will need to create a processor with the generic name text-image-embedding, regardless of which option we choose. At the code level the new processor and the existing TextEmbeddingProcessor can share code, or even be related via a class hierarchy.

For Option 1 we need to add a map of field types. It can be passed to the processor from a Processor.Factory, with text as the default type. We also need to validate the case where a field has no type information or no field mapping information.

(Class diagram: Multimodal-LLD)

Search

The main change is in query parsing: we need to replace the simple parsing of a text field with more sophisticated logic that handles different field types.

We must keep the current format, as it is already in use by customers. We must handle today's query_text format in the same way as the following new format:

"query": {
    "text": "Hello world"
}

Use of the old field must be marked as deprecated.
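
To illustrate the intended equivalence (a sketch, not final syntax), the legacy and new forms of the same text-only query would be handled identically:

// legacy form (to be deprecated)
"neural": {
    "passage_vector": {
        "query_text": "Hello world",
        "model_id": "xzy76xswsd",
        "k": 100
    }
}

// new form
"neural": {
    "passage_vector": {
        "query": {
            "text": "Hello world"
        },
        "model_id": "xzy76xswsd",
        "k": 100
    }
}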

Both data ingestion and search need changes in MLCommonsClientAccessor.inferenceSentence, as both essentially consume embeddings produced by the predict API in ml-commons.

We need to change the way we call the predict API via MLClient. In particular, we need to use RemoteInferenceInputDataSet instead of TextDocsInputDataSet. There is no need to differentiate between text and image because RemoteInferenceInputDataSet accepts only String values.
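
Since RemoteInferenceInputDataSet carries only String parameters, the remote predict call conceptually reduces to a flat map of strings. The sketch below uses the ml-commons predict REST API; the parameter names (inputText, inputImage) are illustrative and would in practice be defined by the model connector:

POST /_plugins/_ml/models/<model_id>/_predict
{
  "parameters": {
    "inputText": "Bright colored blue shoes",
    "inputImage": "iVBORw0KGgoAAAANSUhEUgAA..." // base64-encoded image, truncated placeholder
  }
}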

Alternatives Considered

Data Ingestion

Option 2: Model input/output mapping is part of the pipeline
{
  "description": "An example neural search pipeline",
  "processors" : [
    {
      "text-image-embedding": {
        "model_id": "some_remote_model",

        "field_map": {
            "my_text": {                             // opensearch data source field
                "model_input": "TextInput1",         // input field for model
                "model_output": "TextEmbedding1",    // output from model
                "embedding": "multimodal_embedding"  // opensearch field for embeddings
            },
            "my_image": {
                "model_input": "ImageInput2",
                "model_output": "ImageEmbedding2",
                "embedding": "multimodal_embedding"
            }
        }
      }
    }
  ]
}

Pros

  • user has control over mapping of model’s input and output
  • model params and embeddings have no link to types in OpenSearch
  • extensible
  • can be same interface as in search query

Cons

  • more development work due to extensive parsing logic and a new processor
  • defaults are not clear
Option 3: Type is part of the ingestion pipeline, no model input/output mapping

Type information can be passed as part of the ingestion pipeline.

PUT /my-nlp-index-1 //index definition
{
    "settings": {
        "index.knn": true,
        "default_pipeline": "nlp-pipeline"
    },
    "mappings": {
        "properties": {
            "passage_embedding": {
                "type": "knn_vector",
                "dimension": int,
                "method": {...}
            },
            "passage_text": { 
                "type": "text"
            },
            "passage_image": {
                "type": "binary"
            }
        }
    }
}

PUT _ingest/pipeline/nlp-pipeline //ingestion pipeline
{
  "description": "An example neural search pipeline",
  "processors" : [
    {
      "inference_processor": {
        "model_id": "bxoDJ7IHGM14UqatWc_2j",
        "field_map": {
           "passage_text": "multimodal_embedding",
           "passage_image": "multimodal_embedding"
        },
        "type_map": {
           "passage_text": "text",
           "passage_image": "image"
        }
      }
    }
  ]
}

Pros

  • no need to know the field type at index creation
  • easy to implement; changes are only in the neural-search plugin

Cons

  • no user-provided mapping for model input/output
  • more error prone; with multiple pipelines it takes more effort to maintain type information (although that may also be a flexibility)
  • more complex parsing logic; we need defaults and must handle cases where the number of types and fields differ
Option 4: Field type is part of mapping

We need some extra field to describe the type. We will add the field type under the meta collection, using a new field mime_type.

PUT /my-nlp-index //index definition
{
    "settings": {
        "index.knn": true,
        "default_pipeline": "nlp-pipeline"
    },
    "mappings": {
        "properties": {
            "multimodal_embedding": {
                "type": "knn_vector",
                "dimension": int,
                "method": {  }
            },
            "passage_text": { 
                "type": "text",
                "meta": {
                    "mime_type": "text"            
                }
            },
            "passage_image": {
                "type": "binary",
                "meta": {
                    "mime_type": "image"  
                }
            },
            "passage_video": {
                "type": "binary"
            }
        }
    }
}

PUT _ingest/pipeline/nlp-pipeline //ingestion pipeline
{
  "description": "An example neural search pipeline",
  "processors" : [
    {
      "inference_processor": {
        "model_id": "bxoDJ7IHGM14UqatWc_2j",
        "field_map": {
           "passage_text": "multimodal_embedding",
           "passage_image": "multimodal_embedding"
        }
      }
    }
  ]
}

Pros

  • keeps structure in the data definition; the field mapping is a single place to store all field details

Cons

  • need to know the field type at index creation
  • feasibility of implementation is questionable; meta info is not available to the ingest processor due to difficulties obtaining the MapperService class, and this may require a change in core

Search

We will be modifying the existing neural query clause to support multiple fields in a single query.

Option 2: Model params are part of search request, field names are fixed
GET my_index/_search
{
  "query": {
        "neural": {
              "multimodal_embedding": { // embedding field name
                  "queries": [
                      {
                          "model_input"  : "TextInput1",
                          "model_output" : "TextEmbedding1",
                          "query": "my text for query"
                      },
                      {
                          "model_input"  : "ImageInput2",
                          "model_output" : "ImageEmbedding1",
                          "query": "base64_query_image_123123123213"
                      }
                  ],
                  "model_id": "xzy76xswsd",
                  "k": 100
              }
        }
  }
}

Pros:

  • flexibility to pass model params with query
  • extensible format
  • backward compatible: for the existing query_text we just use the existing text embedding processing; the new format will be ignored

Cons

  • verbose
  • more work on parsing side
Option 3: Type specific attributes
GET my_index/_search
{
  "query": {
        "neural": {
              "passage_vector": {
                  "query": {
                        "text": "Hello world",
                        "image": {
                            "value": "bGlkaHQtd29rfx4",
                            "type": "base64"
                        },
                        "image": {
                            "value": "http://myserver/image1.jpg",
                            "type": "url"
                        },
                        "audio": {
                            "value": "bGlkaHQtd29rfx4",
                            "type": "base64"
                        },
                        "video": {
                        ...
                        }
                  },
                  "model_id": "xzy76xswsd",
                  "k": 100
              }
        }
  }

Pros:

  • extensible format; new types can be added with low effort
  • backward compatible with existing format
  • validation is possible

Cons:

  • no control over model input/output mapping
  • type is part of the search request
  • each type can have a different set of attributes
Option 4: Generic attributes list
GET my_index/_search
{
  "query": {
        "neural": {
              "passage_vector": {
                  "inputList": [
                    {
                        "data": "Hello world",
                        "type": "text"        
                    },
                    {
                        "data": "asdfasdfasdfadf",
                        "type": "image",
                        "format": "base64"   
                    },
                    {
                        "data": "http://myserver/image_1.jpg",
                        "type": "image",
                        "format": "url"    
                    }
                  ],
                  "model_id": "xzy76xswsd",
                  "k": 100
                  }
             }
       }
  }

Pros:

  • extensible format; new types can be added with low effort
  • invalid input is easy to validate and reject

Cons:

  • no control over model input/output mapping
  • type is part of the search request
  • not backward compatible with the existing format; more complex logic if we go with deprecation of today's format
  • harder to extend, as each attribute would be accessible to all types and extra validation would be needed

Solution Comparison

Most alternatives were rejected due to the short timeline of the initial phase, but they can be considered for future development, as most of them are focused on an extensible approach.

We are inclined to avoid options that bundle the OpenSearch type and the model parameter mapping together, as those are unrelated categories; it is better to keep the mapping closer to the pipeline definition or, ideally, read it at runtime from the model connector definition. That, however, has its own problems, as such a read would affect performance, which is especially bad for the search flow.

Another aspect is the embedding processors: should we build something generic and configurable, or should it be a set of specialized processors?

Testability

General tests for the neural-search repo will be part of the development.

Limited integration tests are possible, but the connector and model will be mocked.

Manual testing of the real end-to-end scenario will be conducted for model connectors and remotely hosted models.

Reference Links

  1. Feature branch https://github.com/martin-gaievski/neural-search/tree/feature/multimodal_semantic_search
  2. [FEATURE] Support multi-modal semantic search  #318
martin-gaievski added the Enhancements, Features, and v2.11.0 labels on Sep 29, 2023
martin-gaievski self-assigned this on Sep 29, 2023
heemin32 (Collaborator) commented:

Shouldn't we have embeddings for each field instead of single embedding of all fields combined? Otherwise, query recall will degrade a lot when data is ingested with both text and image and search is happening using either text or image only?

there will always be a single embedding vector, regardless of the number of fields

martin-gaievski (Member, Author) commented:

Shouldn't we have embeddings for each field instead of single embedding of all fields combined? Otherwise, query recall will degrade a lot when data is ingested with both text and image and search is happening using either text or image only?

there will always be a single embedding vector, regardless of the number of fields

That depends on the model; in any case, the search happens inside the model. Just to clarify: when ingesting we combine both fields and create a single embedding; for search we combine both fields into one input DTO for the ML client, send it in one call, receive a single embedding for the query, and do kNN using that single vector.

heemin32 (Collaborator) commented:

Are there user requests for this feature? As a user, I would generally search using either text or image, but not both. By having a single embedding, wouldn't it harm recall compared to having an embedding for each field?

martin-gaievski (Member, Author) commented:

Are there user requests for this feature? As a user, I would generally search using either text or image, but not both. By having a single embedding, wouldn't it harm recall compared to having an embedding for each field?

There is a feature request, #318. The actual use case is to search using both image and text, and more types like video or audio can be added later. You can check similar features implemented or planned by other platforms: https://cloud.google.com/blog/products/ai-machine-learning/multimodal-generative-ai-search, https://www.pinecone.io/learn/clip-image-search/.
For information retrieval, recall is usually not the main concern; NDCG is.

heemin32 (Collaborator) commented Sep 29, 2023:

Thanks for sharing the links. It seems this feature will be needed to integrate with Vision Language Models (VLMs).

jmazanec15 (Member) commented:

I don't know if I like hardcoding the name as "query_image". Like you mentioned, it is not extensible. Are there any other options available?

Additionally, for testing, could you elaborate on how we will benchmark the system? What datasets are available and what will be run?

martin-gaievski (Member, Author) commented:

Sure, I'll add other options, but everything that is extensible would put us outside the 2.11 release timeline, so simplicity of implementation outweighs all other considerations here. Other options include some generic format and/or type information.

As for the datasets, I think we can take something like Flickr, where mixed media is present, and use recall and/or NDCG to measure changes in scoring accuracy. I'll search for exact datasets and update this RFC with the findings.

martin-gaievski removed the v2.11.0 label on Oct 4, 2023
navneet1v (Collaborator) commented:

@martin-gaievski can we resolve this issue, as this feature has been released?

martin-gaievski (Member, Author) commented:

@martin-gaievski can we resolve this issue, as this feature has been released?

I added a few alternative options for future reference; closing this RFC.
