[BUG] mapper_parsing_exception when ingesting a nested knn vector from a remote model #2995

Closed
IanMenendez opened this issue Sep 27, 2024 · 3 comments
Labels
bug Something isn't working

Comments

@IanMenendez

What is the bug?
I am trying to ingest nested KNN vectors into my index.

To do this, I use an ml_inference processor with my own API connected as a remote ML model.

But the workflow fails with a parsing exception when indexing the document.

How can one reproduce the bug?

  1. First, you will need a way to mock the API. Here is a simple API made with FastAPI:
from fastapi import FastAPI

app = FastAPI()

# Mimics a remote embedding model: returns two nested objects, each with a "knn" vector
@app.post("/test/embedding")
def embedding_endpoint():
    return {"embedding": [{"knn": [1, 2, 3], "object": "dog"}, {"knn": [4, 5, 6], "object": "person"}], "time": 0}

  2. Register and deploy the remote ML model:

POST /_plugins/_ml/models/_register
{
  "name": "test_model",
  "description": "test",
  "function_name": "remote",
  "connector": {
    "name": "connector",
    "description": "",
    "version": "1",
    "protocol": "http",
    "parameters": {
      "endpoint": "fastapi-app:8000/test/embedding"
    },
    "credential": {},
    "actions": [
      {
        "action_type": "predict",
        "method": "POST",
        "url": "http://${parameters.endpoint}",
        "request_body": """{"url": "${parameters.url}"}"""
      }
    ]
  }
}



POST _plugins/_ml/models/<MODEL_ID>/_deploy
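
Before wiring the model into a pipeline, you can sanity-check the connector with a standalone predict call (a sketch; <MODEL_ID> is the ID returned by the register call, and the mock's payload should come back under dataAsMap):

POST /_plugins/_ml/models/<MODEL_ID>/_predict
{
  "parameters": {
    "url": "test.com"
  }
}
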
  3. Create an ingest pipeline with the ml_inference processor for the model:
PUT /_ingest/pipeline/ml_inference_test
{
  "processors": [
    {
      "ml_inference": {
        "full_response_path": true,
        "model_id": "_UjlMZIBXGHX_049wT9J",
        "input_map": [
          {
            "url": "url"
          }
        ],
        "output_map": [
          {
            "embedding.knn": "$.inference_results.*.output.*.dataAsMap.embedding.*.knn"
          }
        ]
      }
    }
  ]
}
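
To see exactly what the processor writes into the document, you can dry-run the pipeline with the simulate API before creating the index (the document body here is the one from step 5):

POST /_ingest/pipeline/ml_inference_test/_simulate
{
  "docs": [
    {
      "_source": {
        "url": "test.com"
      }
    }
  ]
}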

  4. Create an index with a nested mapping:
PUT test-index
{
  "settings": {
    "index": {
      "default_pipeline": "ml_inference_pipeline",
      "knn": true
    }
  },
  "mappings": {
    "properties": {
      "url": {
        "type":"text"
      },
      "embedding": {
        "type": "nested",
        "properties": {
          "knn": {
            "type": "knn_vector",
            "dimension": "3"
          }
        }
      }
    }
  }
}
  5. Ingest a document into the index:
POST test-index/_doc
{
  "url": "test.com"
}
  6. Indexing fails with a mapper_parsing_exception:
{
  "error": {
    "root_cause": [
      {
        "type": "mapper_parsing_exception",
        "reason": "failed to parse field [embedding.knn] of type [knn_vector] in document with id 'AUjrMZIBXGHX_049x0Bj'. Preview of field's value: '[1.0, 2.0, 3.0]'"
      }
    ],
    "type": "mapper_parsing_exception",
    "reason": "failed to parse field [embedding.knn] of type [knn_vector] in document with id 'AUjrMZIBXGHX_049x0Bj'. Preview of field's value: '[1.0, 2.0, 3.0]'",
    "caused_by": {
      "type": "i_o_exception",
      "reason": "Current token (START_ARRAY) not numeric, can not use numeric value accessors\n at [Source: REDACTED (`StreamReadFeature.INCLUDE_SOURCE_IN_LOCATION` disabled); line: 1, column: 23]"
    }
  },
  "status": 400
}

Additional information

I even tried using a post_process_function in the connector, but it failed with the same exception.

The post_process_function I tried:

"post_process_function": """
    List jsonList = new ArrayList();
    
    def name = "sentence_embedding";
    def dataType = "FLOAT32";
    
    for (def entry : params['embedding']) {
        def knnShape = [entry.knn.length];
        def knnJson = "{" +
                      "\"name\":\"" + name + "\"," +
                      "\"data_type\":\"" + dataType + "\"," +
                      "\"shape\":" + knnShape + "," +
                      "\"data\":" + entry.knn +
                      "}";
        
        jsonList.add(knnJson);
    }

    return jsonList.toString();
    """

What is the expected behavior?
The document should be ingested without failure.

What is your host/environment?

  • OpenSearch version: 2.16.0
  • Operating System: Ubuntu 22.04 jammy
@ylwu-amzn
Collaborator

ylwu-amzn commented Sep 27, 2024

Tested; this should work. You don't need to configure a post_process_function in the connector.

PUT /_ingest/pipeline/ml_inference_test
{
  "processors": [
    {
      "ml_inference": {
        "full_response_path": true,
        "model_id": "q1yxNJIBFuZi0K4LDQZ0",
        "input_map": [
          {
            "url": "url"
          }
        ],
        "output_map": [
          {
            "embedding": "$.inference_results.*.output.*.dataAsMap.embedding.*"
          }
        ]
      }
    },
    {
      "script": {
        "lang": "painless",
        "source": """
          if (ctx.embedding != null) {
            for (int i = 0; i < ctx.embedding.size(); i++) {
              ctx.embedding[i].remove('object');
            }
          }
        """
      }
    }
  ]
}

The model output is:

{
  "inference_results": [
    {
      "output": [
        {
          "name": "response",
          "dataAsMap": {
            "embedding": [
              {
                "knn": [
                  1,
                  2,
                  3
                ],
                "object": "dog"
              },
              {
                "knn": [
                  4,
                  5,
                  6
                ],
                "object": "person"
              }
            ],
            "time": 0
          }
        }
      ],
      "status_code": 200
    }
  ]
}

If you use "embedding.knn": "$.inference_results.*.output.*.dataAsMap.embedding.*.knn" in the output mapping, embedding.knn will be

[
  [1, 2, 3],
  [4, 5, 6]
]

That's not the expected input for the embedding.knn field.

So we should use "embedding": "$.inference_results.*.output.*.dataAsMap.embedding.*" to get this output:

[
    {
        "knn": [
            1,
            2,
            3
        ],
        "object": "dog"
    },
    {
        "knn": [
            4,
            5,
            6
        ],
        "object": "person"
    }
]

Then remove the object field from the result with:

    {
      "script": {
        "lang": "painless",
        "source": """
          if (ctx.embedding != null) {
            for (int i = 0; i < ctx.embedding.size(); i++) {
              ctx.embedding[i].remove('object');
            }
          }
        """
      }
    }

You don't need to configure this Painless processor if you want to keep the object field.
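
For completeness, once the document indexes cleanly you can search the nested vector field; a minimal sketch, assuming your OpenSearch version supports knn queries on nested knn_vector fields:

GET test-index/_search
{
  "query": {
    "knn": {
      "embedding.knn": {
        "vector": [1, 2, 3],
        "k": 1
      }
    }
  }
}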

@IanMenendez
Author

@ylwu-amzn Thanks, this worked!

I will update the ml_inference OS docs, since it's a bit confusing what the ml_inference processor expects as input.

@ylwu-amzn
Collaborator

Thanks @IanMenendez, can you share the OS doc change issue and PR link here? In case someone else has a similar issue, they can refer to your doc issue and PR.
