[BUG] mapper_parsing_exception when ingesting a nested knn vector from a remote model #2995

Closed
IanMenendez opened this issue Sep 27, 2024 · 3 comments
Labels
bug Something isn't working

Comments

@IanMenendez

What is the bug?
I am trying to ingest nested KNN vectors into my index.

To do this, I use an ml_inference processor with my own API connected as a remote ML model.

But the workflow fails with a parsing exception when indexing the document.

How can one reproduce the bug?

  1. First, you will need a way to mock the API. Here is a simple API made with FastAPI:
from fastapi import FastAPI

app = FastAPI()

# Mimics a remote embedding model: returns two nested objects, each with a "knn" vector
@app.post("/test/embedding")
def embedding_endpoint():
    return {"embedding": [{"knn": [1, 2, 3], "object": "dog"}, {"knn": [4, 5, 6], "object": "person"}], "time": 0}

  2. Register and deploy the remote ML model:

POST /_plugins/_ml/models/_register
{
  "name": "test_model",
  "description": "test",
  "function_name": "remote",
  "connector": {
    "name": "connector",
    "description": "",
    "version": "1",
    "protocol": "http",
    "parameters": {
      "endpoint": "fastapi-app:8000/test/embedding"
    },
    "credential": {},
    "actions": [
      {
        "action_type": "predict",
        "method": "POST",
        "url": "http://${parameters.endpoint}",
        "request_body": """{"url": "${parameters.url}"}"""
      }
    ]
  }
}



POST _plugins/_ml/models/<MODEL_ID>/_deploy
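
Before wiring the model into a pipeline, you can sanity-check the connector with a standalone predict call (a sketch; <MODEL_ID> is the ID returned by the register call, and the mock's payload should come back under dataAsMap):

POST /_plugins/_ml/models/<MODEL_ID>/_predict
{
  "parameters": {
    "url": "test.com"
  }
}
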
  3. Create an ingest pipeline with the ml_inference processor for the model:
PUT /_ingest/pipeline/ml_inference_test
{
  "processors": [
    {
      "ml_inference": {
        "full_response_path": true,
        "model_id": "_UjlMZIBXGHX_049wT9J",
        "input_map": [
          {
            "url": "url"
          }
        ],
        "output_map": [
          {
            "embedding.knn": "$.inference_results.*.output.*.dataAsMap.embedding.*.knn"
          }
        ]
      }
    }
  ]
}
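
To see exactly what the processor writes into the document, you can dry-run the pipeline with the simulate API before creating the index (the document body here is the one from step 5):

POST /_ingest/pipeline/ml_inference_test/_simulate
{
  "docs": [
    {
      "_source": {
        "url": "test.com"
      }
    }
  ]
}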

  4. Create an index with a nested mapping:
PUT test-index
{
  "settings": {
    "index": {
      "default_pipeline": "ml_inference_pipeline",
      "knn": true
    }
  },
  "mappings": {
    "properties": {
      "url": {
        "type":"text"
      },
      "embedding": {
        "type": "nested",
        "properties": {
          "knn": {
            "type": "knn_vector",
            "dimension": "3"
          }
        }
      }
    }
  }
}
  5. Ingest a document into the index:
POST test-index/_doc
{
  "url": "test.com"
}
  6. Indexing fails with a mapper_parsing_exception:
{
  "error": {
    "root_cause": [
      {
        "type": "mapper_parsing_exception",
        "reason": "failed to parse field [embedding.knn] of type [knn_vector] in document with id 'AUjrMZIBXGHX_049x0Bj'. Preview of field's value: '[1.0, 2.0, 3.0]'"
      }
    ],
    "type": "mapper_parsing_exception",
    "reason": "failed to parse field [embedding.knn] of type [knn_vector] in document with id 'AUjrMZIBXGHX_049x0Bj'. Preview of field's value: '[1.0, 2.0, 3.0]'",
    "caused_by": {
      "type": "i_o_exception",
      "reason": "Current token (START_ARRAY) not numeric, can not use numeric value accessors\n at [Source: REDACTED (`StreamReadFeature.INCLUDE_SOURCE_IN_LOCATION` disabled); line: 1, column: 23]"
    }
  },
  "status": 400
}

Additional information

I even tried using a post_process_function in the connector, but it failed with the same exception.

The post_process_function I tried:

"post_process_function": """
    List jsonList = new ArrayList();
    
    def name = "sentence_embedding";
    def dataType = "FLOAT32";
    
    for (def entry : params['embedding']) {
        def knnShape = [entry.knn.length];
        def knnJson = "{" +
                      "\"name\":\"" + name + "\"," +
                      "\"data_type\":\"" + dataType + "\"," +
                      "\"shape\":" + knnShape + "," +
                      "\"data\":" + entry.knn +
                      "}";
        
        jsonList.add(knnJson);
    }

    return jsonList.toString();
    """

What is the expected behavior?
The document should be ingested without failure.

What is your host/environment?

  • OpenSearch version: 2.16.0
  • Operating System: Ubuntu 22.04 jammy
@ylwu-amzn
Collaborator

ylwu-amzn commented Sep 27, 2024

Tested; this should work. You don't need to configure a post_process_function in the connector.

PUT /_ingest/pipeline/ml_inference_test
{
  "processors": [
    {
      "ml_inference": {
        "full_response_path": true,
        "model_id": "q1yxNJIBFuZi0K4LDQZ0",
        "input_map": [
          {
            "url": "url"
          }
        ],
        "output_map": [
          {
            "embedding": "$.inference_results.*.output.*.dataAsMap.embedding.*"
          }
        ]
      }
    },
    {
      "script": {
        "lang": "painless",
        "source": """
          if (ctx.embedding != null) {
            for (int i = 0; i < ctx.embedding.size(); i++) {
              ctx.embedding[i].remove('object');
            }
          }
        """
      }
    }
  ]
}

The model output is:

{
  "inference_results": [
    {
      "output": [
        {
          "name": "response",
          "dataAsMap": {
            "embedding": [
              {
                "knn": [
                  1,
                  2,
                  3
                ],
                "object": "dog"
              },
              {
                "knn": [
                  4,
                  5,
                  6
                ],
                "object": "person"
              }
            ],
            "time": 0
          }
        }
      ],
      "status_code": 200
    }
  ]
}

If you use "embedding.knn": "$.inference_results.*.output.*.dataAsMap.embedding.*.knn" in the output mapping, embedding.knn will be

[
  [1, 2, 3],
  [4, 5, 6]
]

That's not the expected input for the embedding.knn field.

So we should use "embedding": "$.inference_results.*.output.*.dataAsMap.embedding.*" to get this output:

[
    {
        "knn": [
            1,
            2,
            3
        ],
        "object": "dog"
    },
    {
        "knn": [
            4,
            5,
            6
        ],
        "object": "person"
    }
]

Then remove the object field from the result with:

    {
      "script": {
        "lang": "painless",
        "source": """
          if (ctx.embedding != null) {
            for (int i = 0; i < ctx.embedding.size(); i++) {
              ctx.embedding[i].remove('object');
            }
          }
        """
      }
    }

You don't need to configure this Painless processor if you want to keep the object field.
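
For completeness, once the document indexes cleanly you can search the nested vector field; a minimal sketch, assuming your OpenSearch version supports knn queries on nested knn_vector fields:

GET test-index/_search
{
  "query": {
    "knn": {
      "embedding.knn": {
        "vector": [1, 2, 3],
        "k": 1
      }
    }
  }
}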

@IanMenendez
Author

@ylwu-amzn Thanks, this worked!

I will update the ml_inference OS docs, since it's a bit confusing what the ml_inference processor expects as input.

@ylwu-amzn
Collaborator

Thanks @IanMenendez, can you share the OS doc change issue and PR link here? In case someone else has a similar issue, they can refer to your doc issue and PR.
