
[BUG] Update_by_query call updates document even if ingest pipeline processor has failed with exception #14337

Open
martin-gaievski opened this issue Jun 14, 2024 · 1 comment
Labels: bug, Indexing, ingest-pipeline


martin-gaievski commented Jun 14, 2024

Describe the bug

Document values get updated by an update_by_query call even when an ingest pipeline is configured and one of the processors in that pipeline has failed.

Related component

Indexing

To Reproduce

  1. Set up a cluster with the OpenSearch 2.11 distribution and the following plugins: ml-commons, k-NN, neural-search. Create an index with settings similar to the following:
PUT /index-test
{
    "settings": {
        "index.knn": true,
        "default_pipeline": "pipeline-test"
    },
    "mappings": {
        "_source": {
            "excludes": [
                "passage_embedding"
            ]
        },
        "properties": {
            "passage_embedding": {
                "type": "knn_vector",
                "dimension": 1536,
                "method": {
                    "name": "hnsw",
                    "space_type": "cosinesimil",
                    "engine": "nmslib",
                    "parameters": {
                        "ef_construction": 512,
                        "m": 8
                    }
                }
            },
            "name": {
                "type": "text"
            },
            "passage_text": {
                "type": "text"
            }
        }
    }
}
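The default_pipeline setting is what routes every write to this index, including writes issued by update_by_query, through the ingest pipeline. It can be verified with the standard settings API:
GET /index-test/_settings
The response should list "default_pipeline": "pipeline-test" under the index settings.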
  2. Set up a model using an ml-commons remote connector (https://opensearch.org/docs/latest/ml-commons-plugin/remote-models/connectors/) and configure it so that requests are throttled. In our test we used an OpenAI model configured to accept 6 requests per minute. Note the model ID of that model; a sketch of the registration flow is shown below.
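For reference, a minimal sketch of that registration flow, assuming the OpenAI embeddings endpoint (names and payload values here are illustrative placeholders; follow the linked connector docs for an exact blueprint):
POST /_plugins/_ml/connectors/_create
{
    "name": "openai-embedding-connector",
    "description": "Connector to the OpenAI embeddings API",
    "version": 1,
    "protocol": "http",
    "parameters": {
        "model": "text-embedding-ada-002"
    },
    "credential": {
        "openAI_key": "<api_key>"
    },
    "actions": [
        {
            "action_type": "predict",
            "method": "POST",
            "url": "https://api.openai.com/v1/embeddings",
            "headers": {
                "Authorization": "Bearer ${credential.openAI_key}"
            },
            "request_body": "{ \"input\": ${parameters.input}, \"model\": \"${parameters.model}\" }",
            "pre_process_function": "connector.pre_process.openai.embedding",
            "post_process_function": "connector.post_process.openai.embedding"
        }
    ]
}
POST /_plugins/_ml/models/_register
{
    "name": "openai-embedding-model",
    "function_name": "remote",
    "connector_id": "<connector_id>"
}
POST /_plugins/_ml/models/<model_id>/_deploy
The throttling itself comes from the provider's rate limit (6 requests per minute in our test), not from the connector configuration.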
  3. Create an ingest pipeline with at least one processor that has the "ignore_failure" flag set to false:
PUT /_ingest/pipeline/pipeline-test
{
    "description": "An NLP ingest pipeline",
    "processors": [
        {
            "text_embedding": {
                "model_id": "<model_id>",
                "field_map": {
                    "name": "passage_embedding"
                },
                "ignore_failure": false
            }
        }
    ]
}
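Before ingesting, the pipeline can be sanity-checked with the standard _simulate API (the sample document below is arbitrary):
POST /_ingest/pipeline/pipeline-test/_simulate
{
    "docs": [
        {
            "_source": {
                "name": "hello world"
            }
        }
    ]
}
A successful run returns the document with a populated passage_embedding field; once the model starts throttling, the same text_embedding processor failure described in this issue shows up here as well.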
  4. Ingest several documents:
POST /_bulk
{ "index": { "_index": "index-test" } }
{ "name": "permission", "test": "Writing a list of random sentences is harder than I initially thought it would be.", "doc_keyword": "workable", "doc_index": 4976 }
{ "index": { "_index": "index-test" } }
{ "name": "sister", "test": "The fifty mannequin heads floating in the pool kind of freaked them out", "doc_keyword": "angry"}
{ "index": { "_index": "index-test" } }
{ "name": "hair", "test": "Too many prisons have become early coffins", "doc_keyword": "likeable", "doc_index": 2351  }
{ "index": { "_index": "index-test" } }
{ "name": "editor", "test": "Greetings from the real universe", "doc_index": 9871 }
{ "index": { "_index": "index-test" } }
{ "name": "statement", "test": "People keep telling me orange but I still prefer pink", "doc_keyword": "entire", "doc_index": 8242  } 
  5. Check that there are no documents missing the passage_embedding field:
GET /index-test/_search
{
    "query": {
        "bool": {
            "must_not": [
                {
                    "exists": {
                        "field": "passage_embedding"
                    }
                }
            ]
        }
    }
}
  6. Execute the update_by_query request multiple times until you get a throttling error from the model:
POST /index-test/_update_by_query
{
  "query": {
    "range": {
      "doc_index": {
        "gte": 4000,
        "lte": 5000
      }
    }
  },
  "script" : {
    "source": "ctx._source.doc_index++; ctx._source.doc_keyword=\"key1\";ctx._source.test=\"Text random 1\"",
    "lang": "painless"
  }
}
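Since the throttled model can make this call slow, the same request can also be run asynchronously so that its progress and per-document failures can be inspected through the tasks API (standard update_by_query behavior; the task ID comes from the first response):
POST /index-test/_update_by_query?wait_for_completion=false
{ ... same query and script as above ... }
GET /_tasks/<task_id>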
  7. Re-run the check for documents missing passage_embedding. If the search returns anything (>= 1 hits), there are documents without embeddings. This is not the right behavior: all documents were ingested with embeddings, and the only operation that could have caused the embeddings to disappear was the update:
GET /index-test/_search
{
    "query": {
        "bool": {
            "must_not": [
                {
                    "exists": {
                        "field": "passage_embedding"
                    }
                }
            ]
        }
    }
}
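To tie the missing embeddings directly to the update, the same check can be combined with the marker value written by the script (the match on doc_keyword assumes the default dynamic mapping for that field):
GET /index-test/_search
{
    "query": {
        "bool": {
            "must": [
                {
                    "match": {
                        "doc_keyword": "key1"
                    }
                }
            ],
            "must_not": [
                {
                    "exists": {
                        "field": "passage_embedding"
                    }
                }
            ]
        }
    }
}
Any hit here is a document whose script changes were persisted even though its text_embedding processor failed.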

Expected behavior

Because the processor is configured with "ignore_failure": false, we expect the update call to fail and no changes to be stored.

Additional Details

Plugins
ml-commons, k-NN, neural-search

Host/Environment (please complete the following information):

  • Version 2.11

Additional context
I've tried the same scenario without the exclude setting for the "passage_embedding" field, and it works as expected.

        "_source": {
            "excludes": [
                "passage_embedding"
            ]
        },

I assume that behind the scenes the document is still updated, but because all fields are "included", the passage_embedding field value is copied over from the original document.
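A quick way to see that effect: with the excludes setting in place, even a document that does have an embedding comes back with a _source that lacks passage_embedding (the field is indexed but not stored in _source), so when the processor fails there is nothing for update_by_query to carry over:
GET /index-test/_search
{
    "size": 1,
    "query": {
        "exists": {
            "field": "passage_embedding"
        }
    }
}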

martin-gaievski added the bug and untriaged labels on Jun 14, 2024
github-actions bot added the Indexing label on Jun 14, 2024
peternied commented

[Triage - attendees 1 2 3 4 5]
@martin-gaievski Thanks for creating this issue; could you create a pull request to fix it?
