[FEATURE] text_embedding ingest processor: Allow missing or empty fields in field_map #2319
Comments
@reuschling The current implementation doesn't allow empty strings: an empty string can produce an embedding successfully, but it only consumes more disk space and doesn't provide any search relevance improvement. Instead Also we support partial presence of the fields, e.g. even if you configured both
Empty string is not allowed, but
@zane-neo are you planning to release this in 2.15? What's the plan?
This doesn't look like a bug; we need @reuschling to confirm whether the above response solved the issue.
I don't think this is a good solution at the moment, and it is also not documented. Preprocessing documents to add fields that do not exist but have to appear with null values can be a huge effort. You have to write code for all existing document suppliers whenever your mapping changes. Sometimes you no longer even have access to the document supplier code. Why not change the default behavior of the text_embedding ingest processor to interpret a non-existing field as a field with a null value, i.e. simply do nothing for it? Then no existing applications would have to be changed in order to make embeddings in OpenSearch work.
Should move to neural-search repo
I have now also created this issue in the neural-search repo, thanks for the hint: opensearch-project/neural-search#774
This issue is more related to neural search. Closing this one in ml-commons
As in my FR #2277, most documents in my index have the field 'body', and sometimes also 'title' and 'description'. Because the data is crawled, we cannot guarantee that every document has valid data in each field. Nevertheless, it would be nice if a field such as 'description' were taken into account, e.g. for hybrid search, whenever a document has one.
Currently, the existence of a field specified in "field_map" of the text_embedding processor is mandatory. During indexing, I get the error:
{"create":{"_index":"testindex","_id":"sdfhgsd","status":400,"error":{"type":"illegal_argument_exception","reason":"field [description] has empty string value, cannot process it"}}}
Even if I configure "ignore_failure": true for the processor, the document is not processed at all, i.e. embeddings for an existing 'body' field are also missing if there is no 'description' or 'title' field. There are also documents with an empty body and only a title, which rules out configuring embeddings for body alone. Also, specifying several text_embedding processors, one for each field, is not allowed and fails with the error "type": "json_parse_exception", "reason": "Duplicate field 'text_embedding'...
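The "Duplicate field" message is a JSON parsing error, which suggests the two processors were written as keys of the same JSON object rather than as separate elements of the "processors" array. A minimal sketch of the per-field layout, assuming that reading is correct: the helper name `build_pipeline`, the `<model_id>` placeholder, and the `*_embedding` target field names are made up for illustration, and the thread does not confirm whether this layout avoids the failure described above.

```python
import json


def build_pipeline(model_id, field_map):
    """Build an ingest pipeline body with one text_embedding processor per field.

    Each processor is its own object in the "processors" list, so no JSON
    object ever contains the key "text_embedding" twice.
    """
    processors = [
        {
            "text_embedding": {
                "model_id": model_id,
                "field_map": {source: target},
                # Assumption: a failure should only skip this one field.
                "ignore_failure": True,
            }
        }
        for source, target in field_map.items()
    ]
    return {
        "description": "One text_embedding processor per source field",
        "processors": processors,
    }


# Hypothetical model id and target field names.
pipeline = build_pipeline(
    "<model_id>",
    {
        "body": "body_embedding",
        "title": "title_embedding",
        "description": "description_embedding",
    },
)
print(json.dumps(pipeline, indent=2))
```

The printed body is what would be sent when creating the pipeline; only the array layout is the point here, not the specific option values.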
I tried adding empty strings as field values, but sadly it makes no difference; the processor detects them.
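Until the processor tolerates such fields, one client-side workaround is to drop empty values before indexing instead of sending them. The sketch below assumes, based on the maintainer's remark in the comments that partial presence of fields is supported, that a document which simply omits a mapped field is accepted; this is not verified here, and `drop_empty_text_fields` is a hypothetical helper, not part of any OpenSearch client.

```python
def drop_empty_text_fields(doc, fields):
    """Return a copy of doc without mapped fields that hold no usable text.

    A field is removed when its value is None, an empty string, or a
    whitespace-only string. Other fields are left untouched.
    """
    cleaned = dict(doc)  # shallow copy, the input document is not modified
    for name in fields:
        value = cleaned.get(name)
        if value is None or (isinstance(value, str) and not value.strip()):
            cleaned.pop(name, None)
    return cleaned


# Example with a crawled document that has a body but no usable title/description.
crawled = {"url": "https://example.org", "body": "some text", "title": "", "description": "   "}
cleaned = drop_empty_text_fields(crawled, ["body", "title", "description"])
print(cleaned)
```

This still requires touching every document supplier, which is exactly the effort the feature request wants to avoid.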
One of the key concepts in OpenSearch/Lucene is that not all documents must follow the same 'data schema'. This is also valid for search, where only documents with matching fields will be returned.
So, in terms of consistency and robustness please allow fields inside "field_map" that don't have to appear in all documents.