[FEATURE] text_embedding ingest processor: Allow missing or empty fields in field_map #2319

Closed
reuschling opened this issue Apr 12, 2024 · 7 comments
Assignees
Labels
enhancement New feature or request

Comments

@reuschling

As in my FR #2277, most documents in my index have the field 'body', and sometimes also 'title' and 'description'. Because the data is crawled, we cannot guarantee that every document has valid data for each field. Nevertheless, it would be nice if, for example, 'description' were considered when generating an answer (e.g. for hybrid search) whenever it is present.

Currently, the existence of a field specified in "field_map" of the text_embedding processor is mandatory. During indexing, I get the error:
{"create":{"_index":"testindex","_id":"sdfhgsd","status":400,"error":{"type":"illegal_argument_exception","reason":"field [description] has empty string value, cannot process it"}}}

Even if I configure "ignore_failure": true for the processor, the document is not processed at all, i.e. embeddings for an existing 'body' field are also missing if there is no 'description' or 'title' field. There are also documents with an empty body and only a title, which is a real blocker for configuring embeddings for body alone. Also, specifying several text_embedding processors, one for each field, is not allowed and fails with the error "type": "json_parse_exception", "reason": "Duplicate field 'text_embedding'..."
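A note on the "Duplicate field" error: it is what a JSON parser reports when the same key appears twice inside one object, so it presumably came from putting several "text_embedding" keys into a single processor object. Ingest pipelines take "processors" as an array, and each array entry can hold its own processor. The sketch below builds such a pipeline body with one entry per field. The helper name `build_pipeline` is hypothetical, and whether per-field processors with "ignore_failure" actually work around the empty-string error is an untested assumption.

```python
def build_pipeline(model_id, field_map):
    """Build a pipeline body with one text_embedding processor per source field.

    Hypothetical helper: each processor is its own entry in the "processors"
    array, so no JSON object contains the key "text_embedding" twice.
    """
    processors = []
    for source_field, embedding_field in field_map.items():
        processors.append({
            "text_embedding": {
                "model_id": model_id,
                "field_map": {source_field: embedding_field},
                # Assumption: a failure in one processor should not stop the others.
                "ignore_failure": True,
            }
        })
    return {
        "description": "An NLP ingest pipeline for creating sentence embeddings",
        "processors": processors,
    }


pipeline = build_pipeline(
    "A5Xnx44B89YUJ7QK7T3K",
    {
        "title": "embedding_tns_title",
        "body": "embedding_tns_body",
        "description": "embedding_tns_description",
    },
)
```

The resulting dictionary can be serialized with `json.dumps` and sent as the pipeline definition.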

I tried adding empty strings as field values, but sadly it makes no difference: the processor still recognizes them as empty and rejects the document.

One of the key concepts in OpenSearch/Lucene is that not all documents must follow the same 'data schema'. This is also valid for search, where only documents with matching fields will be returned.

So, in terms of consistency and robustness please allow fields inside "field_map" that don't have to appear in all documents.

{
  "description": "An NLP ingest pipeline for creating sentence embeddings",
  "processors": [
    {
      "text_embedding": {
        "model_id": "A5Xnx44B89YUJ7QK7T3K",
        "field_map": {
          "title": "embedding_tns_title",
          "body": "embedding_tns_body",
          "description": "embedding_tns_description"
        },
        "ignore_failure": true
      }
    }
  ]
}
@reuschling reuschling added enhancement New feature or request untriaged labels Apr 12, 2024
@zane-neo zane-neo self-assigned this Apr 22, 2024
@zane-neo
Collaborator

@reuschling The current implementation doesn't allow empty strings: an empty string can be embedded successfully, but the embedding only consumes more disk space and doesn't improve search relevance. Null values, by contrast, usually aren't indexed in OpenSearch, so we allow null here. So you could preprocess your data to replace all empty strings with null.
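A minimal sketch of that preprocessing step, assuming documents are plain Python dictionaries on the client side before they are sent to OpenSearch. The function name `empty_strings_to_null` is hypothetical and not part of any OpenSearch client.

```python
def empty_strings_to_null(doc, fields):
    """Return a copy of doc where empty-string values of the given fields become None.

    None is serialized as JSON null. Fields that are missing from the
    document are left missing; they are not added.
    """
    cleaned = dict(doc)
    for field in fields:
        if cleaned.get(field) == "":
            cleaned[field] = None
    return cleaned


doc = {"body": "Some crawled text", "description": ""}
cleaned = empty_strings_to_null(doc, ["title", "body", "description"])
```

Here `cleaned` keeps 'body' unchanged, turns 'description' into null, and does not add a 'title' key.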

We also support partial presence of the fields: even if you configured both title and body, and title is not present in the document, the body can still be embedded successfully.

Quoting @reuschling: "I tried adding empty strings as field values, but sadly it makes no difference: the processor still recognizes them as empty and rejects the document."

Empty string is not allowed, but null is allowed.
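The rules described in this comment can be modeled roughly as follows. This is an illustrative sketch only, not the processor's actual code: the function name `select_texts_to_embed` is hypothetical, and it assumes that only an exactly empty string is rejected (how whitespace-only values are treated is not stated in this thread).

```python
def select_texts_to_embed(doc, field_map):
    """Model of the described rules: missing and null fields are skipped,
    empty strings are rejected, everything else is selected for embedding."""
    selected = {}
    for source_field, embedding_field in field_map.items():
        value = doc.get(source_field)  # a missing field is treated like null
        if value is None:
            continue
        if value == "":
            raise ValueError(
                f"field [{source_field}] has empty string value, cannot process it"
            )
        selected[embedding_field] = value
    return selected


field_map = {"title": "embedding_tns_title", "body": "embedding_tns_body"}
result = select_texts_to_embed({"title": None, "body": "Some text"}, field_map)
```

With a null 'title', only the 'body' text is selected; an empty 'title' would raise the error instead.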

@dhrubo-os
Collaborator

@zane-neo are you planning to release this in 2.15? What's the plan?

@zane-neo
Collaborator

zane-neo commented May 8, 2024

This doesn't look like a bug; we need @reuschling's confirmation that the above response solved the issue.

@reuschling
Author

I don't think this is a good solution at the moment, and it is not documented either. Preprocessing documents to add fields that don't exist but must appear with null values can be a huge effort. Whenever your mapping changes, you have to write code for every existing document supplier. Sometimes you no longer even have access to the document supplier's code.

Why not change the default behavior of the text_embedding ingest processor to interpret a non-existing field as a field with a null value, i.e. simply do nothing for it? Then no existing applications would have to change in order to make embeddings in OpenSearch work.

@ylwu-amzn
Collaborator

Should move to neural-search repo

@reuschling
Author

I have now created this issue in the neural-search repo as well, thanks for the hint: opensearch-project/neural-search#774

@rbhavna
Collaborator

rbhavna commented Jun 18, 2024

This issue is more related to neural search. Closing this one in ml-commons.


5 participants