[FEATURE] Hybrid request does not return inner_hits for nested objects. #718

Open
Kovsonq opened this issue Apr 30, 2024 · 17 comments

@Kovsonq

Kovsonq commented Apr 30, 2024

Is your feature request related to a problem?

Yes, I'm experiencing a problem when I use the hybrid search plugin in OpenSearch v2.11.0. Specifically, when I include the "inner_hits" parameter in my query for nested objects, I do not receive any inner hits in the response. This is causing frustration as my system requires this level of detail for optimal operation.

What solution would you like?

I would like the hybrid search plugin to be updated to include the functionality to correctly return inner hits from nested queries. Ideally, this would function seamlessly as it does in standard OpenSearch queries. This improvement would allow me and other users to fully utilize the power of the hybrid search plugin.

@martin-gaievski
Member

Can you please share more details for us to understand your request better: index mapping, query example, expected response?

@Kovsonq
Author

Kovsonq commented May 1, 2024

I removed the vector values; do you need them as well?

Index mapping:

{
  "mappings": {
    "properties": {
      "chunks": {
        "type": "nested",
        "properties": {
          "embedding": {
            "type": "knn_vector",
            "dimension": 1536,
            "method": {
              "name": "hnsw",
              "space_type": "cosinesimil",
              "engine": "nmslib",
              "parameters": {
                "ef_construction": 128,
                "m": 24
              }
            }
          },
          "payload": {
            "index": "true",
            "norms": "false",
            "store": "true",
            "type": "text"
          },
          "length": {
            "type": "integer"
          },
          "id": {
            "type": "text"
          }
        }
      }
    }
  },
  "settings": {
    "index": {
      "knn": true,
      "number_of_shards": 5,
      "number_of_replicas": 1
    }
  }
}

Document example:

{
    "chunks": [
        {
            "id": 1,
            "length": 173,
            "payload": "Text 1 example",
            "tokens": 256,
            "embedding": [...]
        },
        {
            "id": 2,
            "length": 173,
            "payload": "Text 2 example",
            "tokens": 256,
            "embedding": [...]
        },
        {
            "id": 3,
            "length": 173,
            "payload": "Text 3 example",
            "tokens": 256,
            "embedding": [...]
        }
    ]
}

request:

{
    "_source": false,
    "query": {
        "hybrid": {
            "queries": [
                {
                    "nested": {
                        "path": "chunks",
                        "query": {
                            "knn": {
                                "chunks.embedding": {
                                    "vector": [...],
                                    "k": 10
                                }
                            }
                        },
                        "inner_hits": {
                            "size": 10,
                            "_source": {
                                "includes": [
                                    "chunks.payload",
                                    "chunks.id"
                                ]
                            }
                        }
                    }
                },
                {
                    "bool": {
                        "must": [
                            {
                                "nested": {
                                    "path": "chunks",
                                    "query": {
                                        "simple_query_string": {
                                            "query": "*",
                                            "fields": [
                                                "chunks.payload"
                                            ],
                                            "default_operator": "and"
                                        }
                                    }
                                }
                            }
                        ]
                    }
                }
            ]
        }
    }
}

response:

{
    "took": 18,
    "timed_out": false,
    "_shards": {
        "total": 5,
        "successful": 5,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 1,
            "relation": "eq"
        },
        "max_score": 1.0,
        "hits": [
            {
                "_index": "index_name",
                "_id": "doc_id_1",
                "_score": 1.0
            }
        ]
    }
}

expected response:

{
    "took": 17,
    "timed_out": false,
    "_shards": {
        "total": 5,
        "successful": 5,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 1,
            "relation": "eq"
        },
        "max_score": 1,
        "hits": [
            {
                "_index": "index_name",
                "_id": "doc_id_1",
                "_score": 1,
                "inner_hits": {
                    "chunks": {
                        "hits": {
                            "total": {
                                "value": 3,
                                "relation": "eq"
                            },
                            "max_score": 0.7954481,
                            "hits": [
                                {
                                    "_index": "index_name",
                                    "_id": "doc_id_1",
                                    "_nested": {
                                        "field": "chunks",
                                        "offset": 0
                                    },
                                    "_score": 0.7954481,
                                    "_source": {
                                        "payload": "Text 1 example",
                                        "id": 1
                                    }
                                },
                                {
                                    "_index": "index_name",
                                    "_id": "doc_id_1",
                                    "_nested": {
                                        "field": "chunks",
                                        "offset": 1
                                    },
                                    "_score": 0.7949572,
                                    "_source": {
                                        "payload": "Text 2 example",
                                        "id": 2
                                    }
                                },
                                {
                                    "_index": "index_name",
                                    "_id": "doc_id_1",
                                    "_nested": {
                                        "field": "chunks",
                                        "offset": 2
                                    },
                                    "_score": 0.75225127,
                                    "_source": {
                                        "payload": "Text 3 example",
                                        "id": 3
                                    }
                                }
                            ]
                        }
                    }
                }
            }
        ]
    }
}

@dswitzer

dswitzer commented May 6, 2024

This issue is also biting me.

We have a nested property that stores attachments on a document. Today we use inner_hits to show when the query matched one of the attachments. However, in trying to implement a hybrid search that combines a simple_query_string with a neural_sparse search, we lose the inner_hits, which means we cannot tell when a match came from our nested search.

@navneet1v
Collaborator

@dswitzer can we try two text queries with hybrid search and see whether inner hits are returned? The reason I'm asking is that for vector search there are improvements being made in versions 2.12 and 2.13 related to nested fields with vectors.
Ref: opensearch-project/k-NN#1447
Ref: opensearch-project/k-NN#1065

@heemin32
Collaborator

@navneet1v The issue persists even if the hybrid query contains only queries on non-vector fields.
The problem with hybrid search and inner_hits is that the inner hits result does not get generated at all.

@navneet1v
Collaborator

@heemin32 thanks for confirming. Can you please share an example in this issue showing what you tested and how?

@heemin32
Collaborator

heemin32 commented May 20, 2024

Create Index

PUT /my-hybrid
{
  "mappings": {
    "properties": {
      "chunks": {
        "type": "nested",
        "properties": {
          "embedding": {
            "type": "knn_vector",
            "dimension": 3,
            "method": {
              "name": "hnsw",
              "space_type": "cosinesimil",
              "engine": "nmslib",
              "parameters": {
                "ef_construction": 128,
                "m": 24
              }
            }
          },
          "payload": {
            "index": "true",
            "norms": "false",
            "store": "true",
            "type": "text"
          },
          "length": {
            "type": "integer"
          },
          "id": {
            "type": "text"
          }
        }
      }
    }
  },
  "settings": {
    "index": {
      "knn": true
    }
  }
}

Add doc

PUT /_bulk?refresh=true
{ "index": { "_index": "my-hybrid", "_id": "1" } }
{ "chunks": [{"id": 1, "length": 173, "payload": "Text 1 example", "tokens": 256, "embedding": [1, 1, 1]}, {"id": 2, "length": 173, "payload": "Text 2 example", "tokens": 256, "embedding": [2, 2, 2]},{"id": 3,"length": 173,"payload": "Text 3 example","tokens": 256,"embedding": [3, 3, 3]}]}

Query

GET /my-hybrid/_search
{
  "_source": false,
  "query": {
    "hybrid": {
      "queries": [
        {
          "nested": {
            "path": "chunks",
            "query": {
              "simple_query_string": {
                "query": "*",
                "fields": [
                  "chunks.payload"
                ],
                "default_operator": "and"
              }
            },
            "inner_hits": {
              "size": 10,
              "_source": {
                "includes": [
                  "chunks.payload",
                  "chunks.id"
                ]
              }
            }
          }
        }
      ]
    }
  }
}

Response

I expect an inner_hits field to be included in the result, but no inner hits appear:

{
  "took": 7,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 1,
    "hits": [
      {
        "_index": "my-hybrid",
        "_id": "1",
        "_score": -9549512000
      },
      {
        "_index": "my-hybrid",
        "_id": "1",
        "_score": -4422440400
      },
      {
        "_index": "my-hybrid",
        "_id": "1",
        "_score": 1
      },
      {
        "_index": "my-hybrid",
        "_id": "1",
        "_score": -9549512000
      }
    ]
  }
}

@martin-gaievski
Member

@Kovsonq @dswitzer what is the main use case for those inner hits returned in the result? How critical is the score information for that use case?

I spent some time checking what can be done for inner hits and our limitations. We can include an inner hits section in the response, similar to what's done for other queries in OpenSearch. The only limitation I'm seeing is with the scores. Inner hits have their own logic for retrieving scores; at a high level, they run a light version of the search again during the Fetch phase. At this point, the score normalization process for the hybrid query has been completed, and scores are updated in the query result section of the response. Scores added for inner hits will not be normalized but will be in raw form and scale. This means that, depending on the query, scores can be unbounded and will not correlate with the main hits in the query results (as those are normalized).
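To make the scale mismatch concrete, here is a minimal sketch in Python. It assumes min-max normalization for the top-level hits, and all score values are invented for illustration; it is not the plugin's implementation.

```python
def min_max_normalize(scores):
    """Rescale scores to [0, 1]; if all scores are equal, map them to 1.0."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [1.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


# Hypothetical raw scores of three parent documents in one sub-query.
raw_parent_scores = [12.4, 7.9, 3.1]

# What the main hits section would carry after normalization.
normalized_parent_scores = min_max_normalize(raw_parent_scores)

# Hypothetical inner-hit scores for the top parent. They are computed
# during the fetch phase, so they stay in the raw, unbounded scale.
raw_inner_hit_scores = [12.4, 9.8, 4.2]

print("main hits :", normalized_parent_scores)
print("inner hits:", raw_inner_hit_scores)
```

The main hits end up bounded between 0 and 1, while the inner-hit scores keep whatever scale the underlying query produces, so the two sets of numbers cannot be compared directly.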

@dswitzer

@martin-gaievski,

My primary use case is to just be able to highlight the matching terms. The score of the inner hits does not matter much to me, because I'm just using it to highlight keyword matches.

@Kovsonq
Author

Kovsonq commented May 31, 2024

@martin-gaievski,

The primary use case for inner_hits in OpenSearch is to retrieve detailed matching information from nested objects within documents. This is particularly useful in scenarios where documents have complex structures with nested fields, and there is a need to understand which specific parts of these documents match the query criteria.

In the context of nested objects, score information for inner hits is important because it allows users to identify the most relevant chunks or sub-documents within a larger document. When a hybrid search is performed, having access to the scores of inner hits enables users to rank and prioritize these nested sections effectively.

Scenario: we need to return the top 20 most relevant nested documents (not parent documents) for the query.
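A rough client-side sketch of that scenario, assuming inner hits were returned in the shape of the expected response earlier in this thread. The helper name `top_nested_hits` and the output field names are hypothetical, and the sample data is trimmed from that expected response.

```python
def top_nested_hits(response, inner_hits_name, limit=20):
    """Collect inner hits from all parent hits and rank them by score."""
    nested = []
    for parent in response.get("hits", {}).get("hits", []):
        inner = parent.get("inner_hits", {}).get(inner_hits_name, {})
        for hit in inner.get("hits", {}).get("hits", []):
            nested.append({
                "parent_id": parent["_id"],
                "offset": hit["_nested"]["offset"],
                "score": hit["_score"],
                "source": hit.get("_source", {}),
            })
    # Highest-scoring nested documents first, regardless of parent.
    nested.sort(key=lambda h: h["score"], reverse=True)
    return nested[:limit]


# Trimmed sample in the shape of the expected response.
sample_response = {
    "hits": {"hits": [{
        "_id": "doc_id_1",
        "_score": 1.0,
        "inner_hits": {"chunks": {"hits": {"hits": [
            {"_nested": {"field": "chunks", "offset": 2}, "_score": 0.75225127,
             "_source": {"payload": "Text 3 example", "id": 3}},
            {"_nested": {"field": "chunks", "offset": 0}, "_score": 0.7954481,
             "_source": {"payload": "Text 1 example", "id": 1}},
            {"_nested": {"field": "chunks", "offset": 1}, "_score": 0.7949572,
             "_source": {"payload": "Text 2 example", "id": 2}},
        ]}}},
    }]}
}

ranked = top_nested_hits(sample_response, "chunks", limit=20)
for hit in ranked:
    print(hit["parent_id"], hit["offset"], hit["score"])
```

A ranking like this is only meaningful if the inner-hit scores of different parents are on a comparable scale, which is why the score information matters for this use case.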

@martin-gaievski
Member

@Kovsonq
I still don't fully understand why you need normalized scores in the final list of results. If we enable inner_hits without normalized scores, the relative order of child documents will still be present in the final result list. Because inner_hits is specified at the sub-query level, the hits for child documents will be local to that sub-query anyway, not global to the whole hybrid query.
If you need to retrieve information about child documents with normalized scores, then I feel those child documents should be modeled as top-level (parent) documents.
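As an illustration of that modeling option, the sketch below splits one parent document into one top-level document per chunk before indexing. The helper name `chunks_to_documents` and the added fields `parent_id` and `chunk_offset` are hypothetical; the sample document reuses the values from the reproduction earlier in this thread.

```python
def chunks_to_documents(parent_id, parent_doc):
    """Turn each nested chunk into its own top-level document.

    Returns a list of (document_id, document_body) pairs. Each body keeps
    the chunk fields and records which parent and position it came from.
    """
    docs = []
    for offset, chunk in enumerate(parent_doc.get("chunks", [])):
        body = dict(chunk)  # shallow copy so the parent document is not mutated
        body["parent_id"] = parent_id
        body["chunk_offset"] = offset
        docs.append((f"{parent_id}_{offset}", body))
    return docs


parent = {
    "chunks": [
        {"id": 1, "length": 173, "payload": "Text 1 example", "embedding": [1, 1, 1]},
        {"id": 2, "length": 173, "payload": "Text 2 example", "embedding": [2, 2, 2]},
        {"id": 3, "length": 173, "payload": "Text 3 example", "embedding": [3, 3, 3]},
    ]
}

flat_docs = chunks_to_documents("1", parent)
for doc_id, body in flat_docs:
    print(doc_id, body["payload"], body["parent_id"])
```

With this layout every chunk is scored and normalized as a main hit, at the cost of indexing one document per chunk and grouping results by `parent_id` on the client side.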

@martin-gaievski
Member

martin-gaievski commented Jun 28, 2024

After a deep dive into this request, I can conclude that we need more time and some additional mechanisms (most likely including changes in core OpenSearch) to implement this feature correctly.
The simplistic approach, where inner hits are returned per sub-query, doesn't work and may produce false positives. Example scenario:

  • The hybrid query has two sub-queries: the first is a text match query, the second is a neural query. The user specifies inner hits for the match query.
  • One document has a low score in the match query, say position 12. The same document has a much better score in the neural query, something like 0.95, position 2.
  • After normalization, the final position of that document is 3.
  • The inner hits for the document will contain information collected for the match query only.

As a result, the user may get the false impression that the document's high final position is due to hits in the match query, when in reality the neural query contributed the most.
In other words, we need an inner hits concept at the level of the hybrid query as a whole, not at the level of a sub-query.
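The scenario can be reproduced with a small simulation. This is only a sketch: it assumes min-max normalization and an arithmetic-mean combination, and the document names and raw scores are made up.

```python
def min_max_normalize(scores):
    """Rescale a {doc_id: score} mapping to the [0, 1] range."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def combine(match_norm, neural_norm):
    """Arithmetic mean of the two normalized scores per document."""
    docs = set(match_norm) | set(neural_norm)
    return {d: (match_norm.get(d, 0.0) + neural_norm.get(d, 0.0)) / 2 for d in docs}


# Hypothetical raw scores per sub-query.
match_raw = {"doc_a": 9.0, "doc_b": 8.0, "doc_x": 3.0, "doc_c": 1.0}
neural_raw = {"doc_a": 0.40, "doc_b": 0.35, "doc_x": 0.95, "doc_c": 0.30}

match_norm = min_max_normalize(match_raw)
neural_norm = min_max_normalize(neural_raw)
final = combine(match_norm, neural_norm)

# Final ranking of the hybrid query and the ranking of the match sub-query alone.
final_ranking = sorted(final, key=final.get, reverse=True)
match_ranking = sorted(match_raw, key=match_raw.get, reverse=True)

print("final ranking:", final_ranking)
print("match ranking:", match_ranking)
print("doc_x match share :", match_norm["doc_x"])
print("doc_x neural share:", neural_norm["doc_x"])
```

Here `doc_x` ends up first overall even though it is third in the match sub-query. Inner hits attached to the match sub-query would still be shown for it, although the neural sub-query is what moved it to the top.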

I've created an issue in core OpenSearch for possible extension mechanisms: opensearch-project/OpenSearch#14546

@yuhongsun96

I'm trying to do the same. It also seems that normalization isn't being applied correctly for hybrid search on nested fields. I've checked the results against normalizing over all values of the nested field, over the highest value of the nested field for each doc, and over the sum of the values of the nested field. The normalization just doesn't come out correctly in any of these cases.

For context, my use case is to run hybrid search on chunks of documents, and ideally I wouldn't need to create a new document in OpenSearch for every chunk that I want to index.

I believe this is a common use case, it would be super AMAZING if we could get this support!

@yuye-aws
Member

yuye-aws commented Sep 9, 2024

Is there any blocking issue to support this feature? cc: @martin-gaievski @vibrantvarun

@martin-gaievski
Member

@yuye-aws yes, there are fundamental blockers for inner hits. The process is split into two parts. The first runs at the shard level and doesn't have access to normalized scores or the combined order of documents. The second runs in the fetch phase, which is also at the shard level, and has the additional problem that the query and fetch phases don't communicate with each other directly.

@yuye-aws
Member

@martin-gaievski Thanks for your prompt reply. Although I do not have much context on inner hits and the hybrid query, this really seems to be a tricky problem to resolve. Is there any existing material I could read to learn more (like PR #776)?

@yuhongsun96

Still really excited to have this support! We're waiting for this before switching over to OpenSearch. It has everything else we need, but hacking around this with our own implementation using just the top-level docs is too messy.
