
[BUG] Update_by_query call updates document even if ingest pipeline processor has failed with exception #14337

Open
martin-gaievski opened this issue Jun 14, 2024 · 1 comment
Labels: bug, Indexing, ingest-pipeline


martin-gaievski commented Jun 14, 2024

Describe the bug

Document values get updated by an update_by_query call even when an ingest pipeline is configured and one of the processors in that pipeline has failed.

Related component

Indexing

To Reproduce

  1. Set up a cluster with the OpenSearch 2.11 distribution and the following plugins: ml-commons, k-NN, neural-search. Create an index with settings similar to the following:
PUT /index-test
{
    "settings": {
        "index.knn": true,
        "default_pipeline": "pipeline-test"
    },
    "mappings": {
        "_source": {
            "excludes": [
                "passage_embedding"
            ]
        },
        "properties": {
            "passage_embedding": {
                "type": "knn_vector",
                "dimension": 1536,
                "method": {
                    "name": "hnsw",
                    "space_type": "cosinesimil",
                    "engine": "nmslib",
                    "parameters": {
                        "ef_construction": 512,
                        "m": 8
                    }
                }
            },
            "name": {
                "type": "text"
            },
            "passage_text": {
                "type": "text"
            }
        }
    }
}
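The default_pipeline setting is what routes every write to this index, including writes issued by update_by_query, through the ingest pipeline. It can be verified with the standard settings API:
GET /index-test/_settings
The response should list "default_pipeline": "pipeline-test" under the index settings.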
  2. Set up a model using an ml-commons remote connector (https://opensearch.org/docs/latest/ml-commons-plugin/remote-models/connectors/) and configure it so that requests are throttled. In our test we used an OpenAI model configured to accept 6 requests per minute. Note the model ID of that model; a sketch of the registration flow is shown below.
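For reference, a minimal sketch of that registration flow, assuming the OpenAI embeddings endpoint (names and payload values here are illustrative placeholders; follow the linked connector docs for an exact blueprint):
POST /_plugins/_ml/connectors/_create
{
    "name": "openai-embedding-connector",
    "description": "Connector to the OpenAI embeddings API",
    "version": 1,
    "protocol": "http",
    "parameters": {
        "model": "text-embedding-ada-002"
    },
    "credential": {
        "openAI_key": "<api_key>"
    },
    "actions": [
        {
            "action_type": "predict",
            "method": "POST",
            "url": "https://api.openai.com/v1/embeddings",
            "headers": {
                "Authorization": "Bearer ${credential.openAI_key}"
            },
            "request_body": "{ \"input\": ${parameters.input}, \"model\": \"${parameters.model}\" }",
            "pre_process_function": "connector.pre_process.openai.embedding",
            "post_process_function": "connector.post_process.openai.embedding"
        }
    ]
}
POST /_plugins/_ml/models/_register
{
    "name": "openai-embedding-model",
    "function_name": "remote",
    "connector_id": "<connector_id>"
}
POST /_plugins/_ml/models/<model_id>/_deploy
The throttling itself comes from the provider's rate limit (6 requests per minute in our test), not from the connector configuration.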
  3. Create an ingest pipeline with at least one processor that has the "ignore_failure" flag set to false:
PUT /_ingest/pipeline/pipeline-test
{
    "description": "An NLP ingest pipeline",
    "processors": [
        {
            "text_embedding": {
                "model_id": "<model_id>",
                "field_map": {
                    "name": "passage_embedding"
                },
                "ignore_failure": false
            }
        }
    ]
}
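Before ingesting, the pipeline can be sanity-checked with the standard _simulate API (the sample document below is arbitrary):
POST /_ingest/pipeline/pipeline-test/_simulate
{
    "docs": [
        {
            "_source": {
                "name": "hello world"
            }
        }
    ]
}
A successful run returns the document with a populated passage_embedding field; once the model starts throttling, the same text_embedding processor failure described in this issue shows up here as well.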
  4. Ingest several documents:
POST /_bulk
{ "index": { "_index": "index-test" } }
{ "name": "permission", "test": "Writing a list of random sentences is harder than I initially thought it would be.", "doc_keyword": "workable", "doc_index": 4976 }
{ "index": { "_index": "index-test" } }
{ "name": "sister", "test": "The fifty mannequin heads floating in the pool kind of freaked them out", "doc_keyword": "angry"}
{ "index": { "_index": "index-test" } }
{ "name": "hair", "test": "Too many prisons have become early coffins", "doc_keyword": "likeable", "doc_index": 2351  }
{ "index": { "_index": "index-test" } }
{ "name": "editor", "test": "Greetings from the real universe", "doc_index": 9871 }
{ "index": { "_index": "index-test" } }
{ "name": "statement", "test": "People keep telling me orange but I still prefer pink", "doc_keyword": "entire", "doc_index": 8242  } 
  5. Check that there are no documents missing the passage_embedding field:
GET /index-test/_search
{
    "query": {
        "bool": {
            "must_not": [
                {
                    "exists": {
                        "field": "passage_embedding"
                    }
                }
            ]
        }
    }
}
  6. Execute the update_by_query request multiple times until you get a throttling error from the model:
POST /index-test/_update_by_query
{
  "query": {
    "range": {
      "doc_index": {
        "gte": 4000,
        "lte": 5000
      }
    }
  },
  "script" : {
    "source": "ctx._source.doc_index++; ctx._source.doc_keyword=\"key1\";ctx._source.test=\"Text random 1\"",
    "lang": "painless"
  }
}
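Since the throttled model can make this call slow, the same request can also be run asynchronously so that its progress and per-document failures can be inspected through the tasks API (standard update_by_query behavior; the task ID comes from the first response):
POST /index-test/_update_by_query?wait_for_completion=false
{ ... same query and script as above ... }
GET /_tasks/<task_id>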
  7. Re-run the check for documents missing passage_embedding. If the search returns anything (>= 1 hits), there are documents without embeddings. This is not the right behavior: all documents were ingested with embeddings, and the only operation that could have caused the embeddings to disappear was the update:
GET /index-test/_search
{
    "query": {
        "bool": {
            "must_not": [
                {
                    "exists": {
                        "field": "passage_embedding"
                    }
                }
            ]
        }
    }
}
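To tie the missing embeddings directly to the update, the same check can be combined with the marker value written by the script (the match on doc_keyword assumes the default dynamic mapping for that field):
GET /index-test/_search
{
    "query": {
        "bool": {
            "must": [
                {
                    "match": {
                        "doc_keyword": "key1"
                    }
                }
            ],
            "must_not": [
                {
                    "exists": {
                        "field": "passage_embedding"
                    }
                }
            ]
        }
    }
}
Any hit here is a document whose script changes were persisted even though its text_embedding processor failed.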

Expected behavior

Because the processor is configured with "ignore_failure": false, we expect the update call to fail and no changes to be stored.

Additional Details

Plugins
ml-commons, k-NN, neural-search

Host/Environment (please complete the following information):

  • Version 2.11

Additional context
I've tried the same scenario without the exclude setting for the "passage_embedding" field, and it works as expected.

        "_source": {
            "excludes": [
                "passage_embedding"
            ]
        },

I assume that behind the scenes the document is still updated, but because all fields are "included", the passage_embedding field value is copied over from the original document.
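A quick way to see that effect: with the excludes setting in place, even a document that does have an embedding comes back with a _source that lacks passage_embedding (the field is indexed but not stored in _source), so when the processor fails there is nothing for update_by_query to carry over:
GET /index-test/_search
{
    "size": 1,
    "query": {
        "exists": {
            "field": "passage_embedding"
        }
    }
}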

martin-gaievski added the bug and untriaged labels on Jun 14, 2024
github-actions bot added the Indexing label on Jun 14, 2024
peternied commented

[Triage - attendees 1 2 3 4 5]
@martin-gaievski Thanks for creating this issue; could you create a pull request to fix it?
