Support empty string for fields in text embedding processor #1041

Merged

Conversation

yizheliu-amazon
Contributor

@yizheliu-amazon yizheliu-amazon commented Dec 24, 2024

Description

Allow an empty string as the value of a field listed in the field map.

The basic idea is to treat a field whose value is an empty string as null, so that the processor skips it instead of rejecting the document.
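The idea can be illustrated with a minimal Python sketch. It is not the plugin's Java implementation; the helper names `blank_to_none` and `normalize_mapped_fields` are hypothetical, and the sketch assumes whitespace-only strings count as empty, as the `" "` value in the example below suggests.

```python
from typing import Any, Dict, Optional


def blank_to_none(value: Any) -> Optional[Any]:
    """Return None for empty or whitespace-only strings, else the value unchanged."""
    if isinstance(value, str) and not value.strip():
        return None
    return value


def normalize_mapped_fields(source: Dict[str, Any], field_map: Dict[str, str]) -> Dict[str, Any]:
    """Return the mapped source fields with blank strings replaced by None.

    Fields missing from the document also come back as None, so later steps
    can skip both cases with the same null check.
    """
    return {field: blank_to_none(source.get(field)) for field in field_map}


field_map = {
    "title": "embedding_tns_title",
    "description": "embedding_tns_description",
    "body": "embedding_tns_body",
}
source = {"title": "this is title", "body": "this is body", "description": " "}
normalized = normalize_mapped_fields(source, field_map)
```

Here `normalized["description"]` ends up as `None`, while the other two fields keep their text.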

Related Issues

Resolves #774

What is the current experience

The current text_embedding processor rejects any document in which a field listed in the field map has an empty string value, because of the validation here.

Given the following field map:

{
  "description": "text embedding pipeline for hybrid",
  "processors": [
    {
      "text_embedding": {
        "model_id": "L34TCpQBkpdyEl29cLe8",
        "field_map": {
          "title": "embedding_tns_title",
          "description": "embedding_tns_description",
          "body": "embedding_tns_body"
        }
      }
    }
  ]
}

Ingesting the document below fails, because its description field contains only whitespace:

POST http://localhost:9200/_ingest/pipeline/nlp-ingest-pipeline-nested-allow-empty/_simulate

{
	"docs": [
		{
			"_index": "neural-search-index-v2",
			"_id": "1",
			"_source": {
				"title": "this is title",
				"body": "this is body",
				"description": " "
			}
		}
	]
}
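For readers who want to reproduce the call from a script, the following is a small sketch that builds the same simulate request with Python's standard library. The function name `build_simulate_request` is hypothetical; the URL and pipeline name are taken from the example above, and actually sending the request assumes a cluster is running at that address.

```python
import json
import urllib.request


def build_simulate_request(base_url: str, pipeline_id: str, docs: list) -> urllib.request.Request:
    """Build (but do not send) a POST request to the pipeline simulate endpoint."""
    url = f"{base_url}/_ingest/pipeline/{pipeline_id}/_simulate"
    body = json.dumps({"docs": docs}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_simulate_request(
    "http://localhost:9200",
    "nlp-ingest-pipeline-nested-allow-empty",
    [
        {
            "_index": "neural-search-index-v2",
            "_id": "1",
            "_source": {
                "title": "this is title",
                "body": "this is body",
                "description": " ",
            },
        }
    ],
)

# To send it against a running cluster (not executed here):
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```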

Result:

{
	"docs": [
		{
			"error": {
				"root_cause": [
					{
						"type": "illegal_argument_exception",
						"reason": "map type field [description] has empty string value, cannot process it"
					}
				],
				"type": "illegal_argument_exception",
				"reason": "map type field [description] has empty string value, cannot process it"
			}
		}
	]
}
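The failure above can be pictured with a short Python sketch of a strict validation step. This is an illustration only, not the plugin's Java code: the function name `validate_no_blank_strings` is hypothetical, it handles only flat string fields, and the message text is copied from the error shown above.

```python
from typing import Any, Dict


def validate_no_blank_strings(source: Dict[str, Any], field_map: Dict[str, str]) -> None:
    """Raise ValueError when a mapped field holds an empty or whitespace-only string."""
    for source_field in field_map:
        value = source.get(source_field)
        if isinstance(value, str) and not value.strip():
            raise ValueError(
                f"map type field [{source_field}] has empty string value, cannot process it"
            )


field_map = {
    "title": "embedding_tns_title",
    "description": "embedding_tns_description",
    "body": "embedding_tns_body",
}
source = {"title": "this is title", "body": "this is body", "description": " "}

try:
    validate_no_blank_strings(source, field_map)
    error_message = None
except ValueError as exc:
    error_message = str(exc)
```

With this strict check, one blank field is enough to reject the whole document, which is the behavior this PR relaxes.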

What the experience looks like after this PR

Given the same field map and request as above, with this PR the result looks like this:

{
	"docs": [
		{
			"doc": {
				"_index": "neural-search-index-v2",
				"_id": "1",
				"_source": {
					"title": "this is title",
					"embedding_tns_title": [~768~],
					"body": "this is body",
					"embedding_tns_body": [~768~],
					"description": " "
				},
				"_ingest": {
					"timestamp": "2024-12-27T21:15:37.530145Z"
				}
			}
		}
	]
}
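The result above can be mimicked with a small Python sketch of the relaxed behavior. It is a simplified illustration under stated assumptions, not the processor's actual logic: `apply_text_embedding` and `stub_embed` are hypothetical names, the stub returns a fixed 768-element vector in place of a real model call, and only flat string fields are handled.

```python
from typing import Any, Callable, Dict, List


def stub_embed(text: str) -> List[float]:
    """Stand-in for a model call; returns a fixed-size placeholder vector."""
    return [0.0] * 768


def apply_text_embedding(
    source: Dict[str, Any],
    field_map: Dict[str, str],
    embed: Callable[[str], List[float]],
) -> Dict[str, Any]:
    """Return a copy of source with embeddings added for non-blank mapped fields."""
    result = dict(source)
    for source_field, target_field in field_map.items():
        value = source.get(source_field)
        # Missing, null, and blank values are skipped rather than rejected.
        if value is None or (isinstance(value, str) and not value.strip()):
            continue
        result[target_field] = embed(value)
    return result


field_map = {
    "title": "embedding_tns_title",
    "description": "embedding_tns_description",
    "body": "embedding_tns_body",
}
source = {"title": "this is title", "body": "this is body", "description": " "}
processed = apply_text_embedding(source, field_map, stub_embed)
```

As in the simulate output, `processed` keeps the original `description` value untouched and gains embedding fields only for `title` and `body`.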

What are the use cases of this fix

The use case we want to support: document ingestion should still succeed when some fields in the field map have empty or null values.

As mentioned in #774, not all documents have valid values for every field in the field map. The issue also notes:

One of the key concepts in OpenSearch/Lucene is that not all documents must follow the same 'data schema'. This is also valid for search, where only documents with matching fields will be returned.

So, in terms of consistency and robustness please allow fields inside "field_map" that don't have to appear in all documents.

Check List

  • New functionality includes testing.
  • New functionality has been documented.
  • API changes companion pull request created.
  • Commits are signed per the DCO using --signoff.
  • Public documentation issue/PR created.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Member

@junqiu-lei junqiu-lei left a comment

LGTM, thanks!

Member

@vibrantvarun vibrantvarun left a comment

LGTM thanks. Congrats @yizheliu-amazon on the first PR in neural search repo.

@vibrantvarun vibrantvarun added the backport 2.x label Dec 27, 2024
CHANGELOG.md (outdated, resolved)
@vibrantvarun
Member

Do not merge until changelog comment is addressed.

@vibrantvarun vibrantvarun changed the title from "Allow empty string for field in field map" to "Support empty string for fields in text embedding processor" Dec 27, 2024
@vibrantvarun vibrantvarun merged commit ee24b1c into opensearch-project:main Dec 27, 2024
40 checks passed
@opensearch-trigger-bot
Contributor

The backport to 2.x failed:

The process '/usr/bin/git' failed with exit code 1

To backport manually, run these commands in your terminal:

# Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add .worktrees/backport-2.x 2.x
# Navigate to the new working tree
cd .worktrees/backport-2.x
# Create a new branch
git switch --create backport/backport-1041-to-2.x
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x --mainline 1 ee24b1c92b41e9f9f1625e1036f790555d7fba07
# Push it to GitHub
git push --set-upstream origin backport/backport-1041-to-2.x
# Go back to the original working tree
cd ../..
# Delete the working tree
git worktree remove .worktrees/backport-2.x

Then, create a pull request where the base branch is 2.x and the compare/head branch is backport/backport-1041-to-2.x.

yizheliu-amazon added a commit to yizheliu-amazon/neural-search that referenced this pull request Dec 27, 2024
…ch-project#1041)

* Allow empty string for field in field map

Signed-off-by: Yizhe Liu <[email protected]>

* Allow empty string when validation

Signed-off-by: Yizhe Liu <[email protected]>

* Add to change log

Signed-off-by: Yizhe Liu <[email protected]>

* Update CHANGELOG to: Support empty string for fields in text embedding processor

Signed-off-by: Yizhe Liu <[email protected]>

---------

Signed-off-by: Yizhe Liu <[email protected]>
heemin32 pushed a commit that referenced this pull request Dec 30, 2024
…1046)

* Allow empty string for field in field map

* Allow empty string when validation

* Add to change log

* Update CHANGELOG to: Support empty string for fields in text embedding processor

---------

Signed-off-by: Yizhe Liu <[email protected]>
Labels
backport 2.x, enhancement
Development

Successfully merging this pull request may close these issues.

[FEATURE] text_embedding ingest processor: Allow missing or empty fields in field_map
3 participants