Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] Cannot iterate over nested list of objects to apply text embeddings processor #469

Closed
krishy91 opened this issue Oct 19, 2023 · 12 comments
Assignees
Labels
bug Something isn't working

Comments

@krishy91
Copy link
Contributor

What is the bug?

When a nested field contains a list of objects, and a foreach processor is used to generate the text embeddings for a text in each nested object, it does not work.

How can one reproduce the bug?

  1. Create a pipeline with foreach processor that iterates over a nested list of object & applies text embedding processor to each nested object. NOTE: The text embedding model also has to be present & loaded

This is way of using the foreach processor (to access internal nested objects with _ingest._value). Since text embedding processor does not current support . notation for fields, I had to write in the below way.

{
    "processors": [
        {
            "foreach": {
                "field": "nested-field",
                "processor": {
                    "text_embedding": {
                        "model_id": "7l8rQ4sB2f-9nv8R9Veh",
                        "field_map": {
                            "_ingest": {
                                "_value" :{
                                    "text-field": "vector-field"
                                }
                            }
                        }
                    }
                }
            }
        }
    ]
}
  1. Test the pipeline with the following nested document.
{
    "docs": [
        {
            "_index": "test-index",
            "_id": "1",
            "_source": {
                "nested-field": [
                    {
                        "text-field": "This is a test"
                    },
                    {
                        "text-field": "This is another test"
                    }
                ]
            }
        }
    ]
}

The above does not work. But the thing is, neural search supports query nested fields (as nested query) and there is the possibility to only get the actual matches using inner_hits.

What is the expected behavior?

The expected behavior is that for the above example, text embedding should be generated (vector-field) for both the nested objects in the above example.

I understand that multi-value fields aren't currently supported for generating text embeddings, but since we are dealing here with nested objects & they are independant objects in lucene, this should work.

What is your host/environment?

I tested with the official docker image of Opensearch 2.9 running on Windows.

@krishy91 krishy91 added bug Something isn't working untriaged labels Oct 19, 2023
@navneet1v
Copy link
Collaborator

@zane-neo can you take a look into this.

@krishy91
Copy link
Contributor Author

krishy91 commented Oct 27, 2023

So, on looking at the neural-search plugin code, this case (list of maps inside a nested map) wasn't supported. I could add this case & now the text embeddings processor works with a a nested field which has a list of objects.

Processor configuration

curl --location --request PUT 'http://localhost:9200/_ingest/pipeline/embeddings-nested' \
--header 'Content-Type: application/json' \
--data '{
    "processors": [
        {
            "text_embedding": {
                "model_id": "o-bPa4sBu3zYeKjn7baf",
                "field_map": {
                    "nested-field": {
                        "text-field": "vector-field"
                    }
                }
            }
        }
    ]
}'

Request with simulate endpoint

curl --location 'http://localhost:9200/_ingest/pipeline/embeddings-nested/_simulate?verbose=true' \
--header 'Content-Type: application/json' \
--data '{
    "docs": [
        {
            "_id": "1",
            "_source": {
                "nested-field": [
                    {
                        "text-field": "This is another test"
                    },
                    {
                        "text-field": "This is a third field"
                    },
                    {
                        "text-field": "This is a fourth field"
                    }
                ]
            }
        }
    ]
}'

The corresponding output after my fix:
NOTE: I have removed the vectors to make the output shorter to post here

{
    "docs": [
        {
            "processor_results": [
                {
                    "processor_type": "text_embedding",
                    "status": "success",
                    "doc": {
                        "_index": "_index",
                        "_id": "1",
                        "_source": {
                            "nested-field": [
                                {
                                    "text-field": "This is another test",
                                    "vector-field": [
                                        0.029963804,
                                        ....
                                    ]
                                },
                                {
                                    "text-field": "This is a third field",
                                    "vector-field": [
                                        ...
                                    ]
                                },
                                {
                                    "text-field": "This is a fourth field",
                                    "vector-field": [
                                        ...
                                    ]
                                }
                            ]
                        },
                        "_ingest": {
                            "pipeline": "embeddings-nested",
                            "timestamp": "2023-10-26T12:24:00.218116911Z"
                        }
                    }
                }
            ]
        }
    ]
}

I would like to contribute this "fix" (or feature depending on how you would like to classify it). I can fork the project & make the commit there.

Can somehow look at my changes? @zane-neo @navneet1v

(I have made the changes & if its okay to commit, I will add tests etc.)

@zane-neo
Copy link
Collaborator

zane-neo commented Oct 27, 2023

@krishy91 have you checked this issue? #245

@krishy91
Copy link
Contributor Author

@zane-neo I looked into it & it seems like a issue with indexing mutli-valued fields. I think it makes sense since as per my last understanding, Lucene still did not support mutli valued vector fields (apache/lucene#12313)

The issue here is with a nested field which can be a list of objects. Since the nested objects are independant documents in Lucene, having a vector field in each nested object should not be an issue (already verfied). Therefore, in my opinion this issue is not very closely related to the issue with multi valued fields.

@zane-neo
Copy link
Collaborator

@krishy91 Ok, do you have a PR regarding the fix? I haven't got time to reproduce this issue but would like to check your PR first.

@krishy91
Copy link
Contributor Author

@zane-neo I have a rough sketch of the solution which works. I have created a draft pull request here: #477

If you guys feel the solution is reasonable (I think so as it is adding just another case which needs to be handled), I will go ahead & fix the validateEmbeddingFieldsValue method (which checks if supplied input is valid) & add some tests for it. (before the pull request can be deemed fit for merging)

@zane-neo
Copy link
Collaborator

@krishy91 When we implemented this feature, we considered mostly the split documents case(which is a list of string), so not supporting the list of object case. But I do think your case is valid, I have no concern to provide this capability to user. cc: @navneet1v @model-collapse.

@manzke
Copy link

manzke commented Feb 6, 2024

@zane-neo @navneet1v @model-collapse PR is open and already in use. We would love to see it merged and available in 2.12. (or 2.11.2)
WDYT?

#477

@zane-neo
Copy link
Collaborator

@krishy91 Recently did a test on the nested map type configuration, found some issues:
pipeline configuration:

{
  "description": "An example neural search pipeline",
  "processors": [
    {
      "text_embedding": {
        "model_id": "qhO5xY4BYwgbtrHt7KDf",
        "field_map": {
          "category": {
            "name": {
              "en": "category_name_vector"
            }
          }
        }
      }
    }
  ]
}

And simulate the pipeline processor:

{
    "docs": [
        {
            "_index": "neural-search-index-v2",
            "_id": "1",
            "_source": {
                "category": [
                    {
                        "name": {
                            "en": "this is a name"
                        }
                    },
                    {
                        "name": {
                            "en": "hello world"
                        }
                    }
                ]
            }
        }
    ]
}

Result:

{
    "docs": [
        {
            "doc": {
                "_index": "neural-search-index-v2",
                "_id": "1",
                "_source": {
                    "category": [
                        {
                            "name": [
                                -0.10758455,
                                0.07971476,
                                -0.04948872,
                                ...
                            ]
                        },
                        {
                            "name": [
                                -0.034477253,
                                0.031023245,
                                0.006734962,
                                ...
                            ]
                        }
                    ]
                },
                "_ingest": {
                    "timestamp": "2024-04-10T03:51:53.496385Z"
                }
            }
        }
    ]
}

The embedding results override the original values in nested maps which seems a bug needs to be fixed.

@zane-neo
Copy link
Collaborator

Created this issue to track: #686

@jmazanec15
Copy link
Member

@zane-neo can this be closed?

@zane-neo
Copy link
Collaborator

zane-neo commented Oct 9, 2024

Closing this as it's been fixed.

@zane-neo zane-neo closed this as completed Oct 9, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

5 participants