[RFC] JSON-to-JSON Transformer #12795

Closed
jackiehanyang opened this issue Mar 20, 2024 · 2 comments
Labels: enhancement (Enhancement or improvement to existing feature or request), Other, RFC (Issues requesting major changes)

Comments

@jackiehanyang

Is your feature request related to a problem? Please describe

The JSON-to-JSON transformer is a standalone utility within the Core package. It lets users configure transformations from one or more JSON documents to another format, such as converting input JSON objects (e.g., search results from a previous flow step) into a different JSON format like a prompt template. It offers three approaches to data transformation: Painless script (P0 item), the string manipulation function JSONPath (P0 item), and automated transformation based on specified inputs and outputs (P1 item). The utility should be standalone so that it can be integrated into any processor as a data transformation step, either before or after the processor's execution flow.

[Image: j-j-1 drawio]

Describe the solution you'd like

Provide a public utility method in the core package that any processor can use. Depending on future requirements, we can expose this utility method through a REST API, or even as a processor.

public static JsonNode JsonDataTransformation(List<JsonNode> dataset,
                                              DataTransformApproach approach,
                                              List<String> source) {
   ...
}
  • List<JsonNode>, the dataset to transform. Usually a list of SearchHits objects.
  • DataTransformApproach approach, an enum value (PAINLESS or JSONPATH) selecting the approach the customer would like to use to transform the dataset.
  • List<String> source, the Painless script source or the JSONPath field-mapping instructions.
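
A minimal sketch of how such an entry point could dispatch on the approach enum. Only `DataTransformApproach`, `PAINLESS`, and `JSONPATH` come from the proposal; `TransformDispatchSketch`, `Transformer`, and `register` are hypothetical names, and plain `Map` objects stand in for `JsonNode` so the example has no dependencies.

```java
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

// The two approaches named in the proposal.
enum DataTransformApproach { PAINLESS, JSONPATH }

// Hypothetical callback: turns a dataset plus script/mapping source into one output object.
interface Transformer {
    Map<String, Object> apply(List<Map<String, Object>> dataset, List<String> source);
}

// Hypothetical dispatcher; Map<String, Object> stands in for JsonNode.
final class TransformDispatchSketch {
    private final Map<DataTransformApproach, Transformer> handlers =
            new EnumMap<>(DataTransformApproach.class);

    // Associates one approach with the code that implements it.
    void register(DataTransformApproach approach, Transformer handler) {
        handlers.put(approach, handler);
    }

    // Mirrors the proposed signature: dataset, approach, source.
    Map<String, Object> transform(List<Map<String, Object>> dataset,
                                  DataTransformApproach approach,
                                  List<String> source) {
        Transformer handler = handlers.get(approach);
        if (handler == null) {
            throw new IllegalArgumentException("No handler registered for " + approach);
        }
        return handler.apply(dataset, source);
    }
}
```

Keeping the enum as the only switch point means a processor calling `transform` does not need to know whether a Painless script or a JSONPath mapping does the work.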

Supported Transform Approach 1. Painless Script

Painless is a performant, secure scripting language that provides numerous capabilities. Writing Painless scripts can be challenging for customers, and we aim to eliminate that difficulty. However, we still want to keep this method as the default approach, so customers can achieve their objectives when the JSONPath string manipulation function is not enough.

Supported Transform Approach 2. String Manipulation (JSONPath)

JSONPath is a query language designed for navigating and extracting parts of a JSON document. With JSONPath, you can specify and navigate to different parts of a JSON structure, making it easier to retrieve specific data elements without needing to process the entire structure manually in code.
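
To illustrate the kind of navigation JSONPath performs, here is a small sketch of an evaluator for a tiny subset of the syntax: the root `$`, child access `.key`, and the array wildcard `[*]`. It is a hand-written stand-in, not a real JSONPath library, and `TinyPathSketch` and `select` are hypothetical names. Parsed JSON is assumed to be represented as nested `Map` and `List` values.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical evaluator for a small JSONPath subset: "$", ".key" and "[*]" only.
final class TinyPathSketch {
    static List<Object> select(Object root, String path) {
        if (!path.startsWith("$")) {
            throw new IllegalArgumentException("Path must start with $: " + path);
        }
        List<Object> current = new ArrayList<>();
        current.add(root);
        int i = 1;
        while (i < path.length()) {
            List<Object> next = new ArrayList<>();
            if (path.startsWith("[*]", i)) {
                // Wildcard: replace every list node with its elements.
                for (Object node : current) {
                    if (node instanceof List) {
                        next.addAll((List<?>) node);
                    }
                }
                i += 3;
            } else if (path.charAt(i) == '.') {
                // Child access: read the key name up to the next '.' or '['.
                int end = i + 1;
                while (end < path.length() && path.charAt(end) != '.' && path.charAt(end) != '[') {
                    end++;
                }
                String key = path.substring(i + 1, end);
                for (Object node : current) {
                    if (node instanceof Map && ((Map<?, ?>) node).containsKey(key)) {
                        next.add(((Map<?, ?>) node).get(key));
                    }
                }
                i = end;
            } else {
                throw new IllegalArgumentException("Unsupported token at index " + i + ": " + path);
            }
            current = next;
        }
        return current;
    }
}
```

Nodes that lack a requested key are skipped rather than reported as errors, so a path only returns the values that actually exist in the documents.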

AppSec has cleared the use of JSONPath in ml-commons since 2.12. We will initiate another AppSec review for this use case.

2.1. N-1 Transform: Merge multiple JSONs into one JSON or another data format
In some cases, the transform has to be applied in a “many-to-one” mode, transforming multiple objects, such as search results, into a single JSON output. For instance, a re-ranker may require the incoming search results (hits.fields) to be collapsed into a single array of strings as its input (e.g., Cohere ReRank).

For example, suppose the customer has the following input:

[
    {
        "hits": [
            {
                "_index": "media_library",
                "_id": "63MhYY0BFJSF4M0W0eUG",
                "_score": 1,
                "_source": {
                    "books": {
                        "name": "To Kill a Mockingbird",
                        "author": "Harper Lee",
                        "genres": "fiction",
                        "price": 15.99
                    },
                    "songs": {
                        "title": "Pocketful of Sunshine"
                    }
                }
            }
        ]
    },
    {
        "hits": [
            {
                "_index": "books_songs",
                "_id": "5nMhYY0BFJSF4M0W0eUG",
                "_score": 1,
                "_source": {
                    "books": {
                        "name": "Where the Crawdads Sing",
                        "author": "Delia Owens",
                        "genres": "fiction",
                        "cost": 12.99,
                        "year": 2018
                    },
                    "songs": {
                        "title": "If"
                    }
                }
            }
        ]
    }
]

The customer then provides the following JSONPath transform instruction:

{
    "book_name": "$[*].hits[*]._source.books.name",
    "song_name": "$[*].hits[*]._source.songs.title"
}

The output would be

{
    "book_name_63MhYY0BFJSF4M0W0eUG": "To Kill a Mockingbird",
    "song_name_63MhYY0BFJSF4M0W0eUG": "Pocketful of Sunshine",
    "book_name_5nMhYY0BFJSF4M0W0eUG": "Where the Crawdads Sing",
    "song_name_5nMhYY0BFJSF4M0W0eUG": "If"
}
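
The many-to-one flattening above can be sketched as follows. This is an illustration under stated assumptions, not the proposed implementation: it assumes each hit stores the book under `books.name` and the song under `songs.title`, that every mapping path starts with `$[*].hits[*].`, and that output keys are formed as `<mapping key>_<_id>`. `ManyToOneSketch`, `flatten`, and `resolve` are hypothetical names, and nested `Map`/`List` values stand in for `JsonNode`.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical N-1 flattening: many search responses in, one flat object out.
final class ManyToOneSketch {
    // Assumption: only paths of this shape are supported by the sketch.
    private static final String HIT_PREFIX = "$[*].hits[*].";

    static Map<String, Object> flatten(List<Map<String, Object>> dataset,
                                       Map<String, String> mapping) {
        Map<String, Object> out = new LinkedHashMap<>();
        for (Map<String, Object> doc : dataset) {
            Object hits = doc.get("hits");
            if (!(hits instanceof List)) {
                continue;
            }
            for (Object hit : (List<?>) hits) {
                if (!(hit instanceof Map)) {
                    continue;
                }
                Object id = ((Map<?, ?>) hit).get("_id");
                for (Map.Entry<String, String> rule : mapping.entrySet()) {
                    String path = rule.getValue();
                    if (!path.startsWith(HIT_PREFIX)) {
                        throw new IllegalArgumentException("Unsupported path: " + path);
                    }
                    // Resolve the rest of the path relative to the current hit.
                    Object value = resolve(hit, path.substring(HIT_PREFIX.length()));
                    if (value != null) {
                        out.put(rule.getKey() + "_" + id, value);
                    }
                }
            }
        }
        return out;
    }

    // Walks dotted keys such as "_source.books.name"; returns null when a key is missing.
    static Object resolve(Object node, String dottedPath) {
        Object current = node;
        for (String key : dottedPath.split("\\.")) {
            if (!(current instanceof Map)) {
                return null;
            }
            current = ((Map<?, ?>) current).get(key);
        }
        return current;
    }
}
```

Suffixing each key with the hit's `_id` keeps values from different hits from overwriting one another in the single output object. Hits that lack a mapped field simply contribute no key for it.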

2.2. 1-1 Transform: Map a specific field in one JSON to another JSON
A 1-1 Transform is essentially an N-1 Transform in which N equals 1, so we don't need a separate DataTransformApproach enum value to distinguish the two. However, for some N-1 Transform scenarios, customers may need to use a Painless script, as JSONPath may not be sufficient for such transformations.

Related component

Other

Describe alternatives you've considered

No response

Additional context

No response

@jackiehanyang jackiehanyang added enhancement Enhancement or improvement to existing feature or request untriaged labels Mar 20, 2024
@github-actions github-actions bot added the Other label Mar 20, 2024
@andrross
Member

@jackiehanyang This issue talks a lot about a solution, and gives an example usage. However, I'd recommend starting with a very detailed description of the problem this is trying to solve. It's not really possible to evaluate the merits of a solution without understanding in detail the problem attempting to be solved. Can you update this issue to start with a clear description of the problem statement?

@peternied peternied added RFC Issues requesting major changes and removed untriaged labels Mar 27, 2024
@peternied
Member

[Triage - attendees 1 2 3 4 5 6 7]
@jackiehanyang Thanks for creating this issue; however, it isn't being accepted because it is unclear how it relates to OpenSearch (@andrross said this very well in his comment). Please feel free to open a new issue after addressing this.
