[FEATURE] text_embedding ingest processor: Allow missing or empty fields in field_map #2319

reuschling · 2024-04-12T12:40:04Z

As in my FR #2277, most documents in my index have the field 'body', and sometimes also 'title' and 'description'. Because the data is crawled, we cannot guarantee that every document has valid data for each field. Nevertheless, it would be nice if e.g. 'description' were considered when generating an answer, e.g. for hybrid search, whenever it is present.

Currently, the existence of a field specified in "field_map" of the text_embedding processor is mandatory. During indexing, I get the error:
{"create":{"_index":"testindex","_id":"sdfhgsd","status":400,"error":{"type":"illegal_argument_exception","reason":"field [description] has empty string value, cannot process it"}}}

Even if I configure "ignore_failure": true for the processor, the document is not processed at all, i.e. embeddings for an existing 'body' field are also missing if there is no 'description' or 'title' field. There are also documents with an empty body and only a title, so configuring embeddings for the body alone is not an option. Also, specifying several text_embedding processors, one for each field, is not allowed and fails with the error "type": "json_parse_exception", "reason": "Duplicate field 'text_embedding'...

I tried adding empty strings as fields, but sadly it makes no difference; the processor recognizes them.

One of the key concepts in OpenSearch/Lucene is that not all documents must follow the same 'data schema'. This is also valid for search, where only documents with matching fields will be returned.

So, in terms of consistency and robustness please allow fields inside "field_map" that don't have to appear in all documents.

{
  "description": "An NLP ingest pipeline for creating sentence embeddings",
  "processors": [
    {
      "text_embedding": {
        "model_id": "A5Xnx44B89YUJ7QK7T3K",
        "field_map": {
          "title": "embedding_tns_title",
          "body": "embedding_tns_body",
          "description": "embedding_tns_description"
        },
        "ignore_failure": true
      }
    }
  ]
}

zane-neo · 2024-04-22T11:47:58Z

@reuschling The current implementation doesn't allow empty strings: an empty string can produce an embedding successfully, but it only consumes more disk space and doesn't improve search relevance. Null values, by contrast, usually aren't indexed in OpenSearch, so we allow null here. So you could pre-process your data to replace all empty strings with null.

We also support partial presence of the fields, e.g. even if you configured both title and body but title is not present in the document, the body can still be embedded successfully.

> I tried adding empty strings as fields, but sadly it makes no difference; the processor recognizes them.

Empty string is not allowed, but null is allowed.
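
The pre-processing suggested above could look roughly like the following. This is only a minimal client-side sketch, not part of ml-commons or neural-search: the helper name `nullify_empty_fields` is hypothetical, and treating whitespace-only strings as empty is an assumption on top of what the thread states.

```python
from typing import Any, Dict, Iterable


def nullify_empty_fields(doc: Dict[str, Any], fields: Iterable[str]) -> Dict[str, Any]:
    """Return a copy of `doc` where empty string values of `fields` become None.

    Hypothetical helper: run it on each document before sending it to the
    ingest pipeline, so that the text_embedding processor sees null instead
    of an empty string.
    """
    cleaned = dict(doc)  # shallow copy; the caller's document is left untouched
    for name in fields:
        value = cleaned.get(name)
        # Assumption: whitespace-only strings are treated like empty strings.
        if isinstance(value, str) and value.strip() == "":
            cleaned[name] = None
    return cleaned


# Field names taken from the field_map in the issue description.
doc = {"title": "", "body": "Some crawled text", "description": "   "}
cleaned = nullify_empty_fields(doc, ["title", "body", "description"])
```

Fields that are absent from the document are left absent here, since partial presence of the mapped fields is described above as already supported.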

dhrubo-os · 2024-05-07T18:22:09Z

@zane-neo are you planning to release this in 2.15? What's the plan?

zane-neo · 2024-05-08T14:38:37Z

This doesn't look like a bug; we need @reuschling's confirmation on whether the above response solved the issue.

reuschling · 2024-05-15T13:45:08Z

I think this is not a good solution currently, and it is also not documented. Preprocessing documents to add fields that do not exist but have to appear with null values can be a huge effort. You have to write code whenever your mapping changes, for all existing document suppliers. Sometimes you no longer even have access to the document supplier code.

Why not change the default behavior of the text_embedding ingest processor to interpret a non-existing field as a field with a null value, i.e. so that simply nothing is done for it? Then no existing applications would have to be changed in order to make embeddings in OpenSearch work.
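
To make the requested semantics concrete, here is a rough sketch of the selection logic being asked for. It is an illustration under assumptions, not the processor's actual code: the function name `select_embedding_inputs` is hypothetical, and skipping empty strings as well follows the issue title rather than anything the maintainers confirmed.

```python
from typing import Any, Dict


def select_embedding_inputs(doc: Dict[str, Any], field_map: Dict[str, str]) -> Dict[str, Any]:
    """Map target embedding fields to source text, skipping unusable fields.

    Illustrates the requested behavior only: a field that is missing, null,
    or an empty string is silently skipped instead of failing the document.
    """
    inputs: Dict[str, Any] = {}
    for source_field, target_field in field_map.items():
        value = doc.get(source_field)  # a missing field behaves like null
        if value is None:
            continue
        if isinstance(value, str) and value == "":
            continue  # assumption: empty strings are skipped as well
        inputs[target_field] = value
    return inputs


# field_map as configured in the pipeline from the issue description.
field_map = {
    "title": "embedding_tns_title",
    "body": "embedding_tns_body",
    "description": "embedding_tns_description",
}
doc = {"body": "Only a body here", "description": ""}
inputs = select_embedding_inputs(doc, field_map)
```

With this logic, a document that has only a 'body' would still get an embedding for the body, and no error would be raised for the absent 'title' or the empty 'description'.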

ylwu-amzn · 2024-06-04T17:53:38Z

Should move to neural-search repo

reuschling · 2024-06-05T09:26:01Z

I created this issue also in the neural-search repo now, thanks for the hint: opensearch-project/neural-search#774

rbhavna · 2024-06-18T17:41:39Z

This issue is more related to neural search. Closing this one in ml-commons.

reuschling added enhancement New feature or request untriaged labels Apr 12, 2024

zane-neo self-assigned this Apr 22, 2024

dhrubo-os added this to ml-commons projects May 7, 2024

dhrubo-os removed the untriaged label May 7, 2024

reuschling mentioned this issue Jun 5, 2024

[FEATURE] text_embedding ingest processor: Allow missing or empty fields in field_map opensearch-project/neural-search#774

Closed

rbhavna closed this as completed Jun 18, 2024

github-project-automation bot moved this to Done in ml-commons projects Jun 18, 2024
