Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEATURE]New expand PPL Command #657

Closed
YANG-DB opened this issue Sep 13, 2024 · 2 comments
Closed

[FEATURE]New expand PPL Command #657

YANG-DB opened this issue Sep 13, 2024 · 2 comments
Assignees
Labels
0.6 enhancement New feature or request Lang:PPL Pipe Processing Language support

Comments

@YANG-DB
Copy link
Member

YANG-DB commented Sep 13, 2024

Is your feature request related to a problem?
Adding a PPL new expand command which adds array and nested object expansion functionality to PPL

Is your feature request related to a problem? Please describe.
OpenSearch's Piped Processing Language (PPL) currently lacks an efficient way to expand arrays and nested objects into separate events, similar to SQL's UNNEST or JSON expansion functions. This limitation hinders the analysis of complex data structures, particularly when working with JSON logs or documents containing arrays or nested objects.

Describe the solution you'd like
We propose adding a new command to PPL that would allow users to expand arrays and nested objects into separate events, similar to SQL's UNNEST function, but with additional flexibility.

The functionality should:

  1. Expand array fields or nested objects into separate events (similar to SQL's UNNEST)
  2. Retain all other fields from the original event in each new event (addressing a limitation of SQL UNNEST)
  3. Support nested fields and complex JSON structures (going beyond basic SQL capabilities)
  4. Allow for subsequent processing of each expanded value in the PPL pipeline
  5. Work seamlessly with unstructured or semi-structured data (unlike SQL, which typically requires predefined schemas)

SQL-like example and comparison:
Consider this SQL-like syntax:

SELECT * FROM my_index
CROSS JOIN UNNEST(items) AS expanded_item
WHERE expanded_item.status = 'active'

The proposed OpenSearch PPL equivalent might look like:

source =  my_index 
| parse my_nested_field `?<items>[*]` as items
| expand items
| where items.status = "active"

Key differences and advantages:

  1. No need for explicit JOIN syntax, making it more intuitive for log analysis
  2. Automatic handling of nested structures without need for complex JSON parsing functions
  3. Ability to work with dynamic schemas and unstructured data

Describe alternatives you've considered
Current alternatives include:

  1. Using complex JSON path queries, which can be cumbersome
  2. Processing the data outside of OpenSearch, reducing real-time analysis capabilities

Additional context
This feature would bridge the gap between SQL's structured data handling and the need for flexible, real-time analysis of semi-structured log data. It combines the power of SQL's UNNEST with the flexibility required for log and event processing.

Potential Impact

  • Simplified queries for complex data structures in logs and events
  • Enhanced real-time analytics capabilities for nested JSON data
  • Improved performance compared to client-side processing of nested structures
  • Better alignment with SQL-like functionality while maintaining PPL's simplicity

Proposed Implementation
The new command (e.g., expand) could be implemented as a new command in the PPL engine, combining the concepts of SQL's UNNEST with the flexibility needed for unstructured log data.

@YANG-DB YANG-DB added enhancement New feature or request untriaged labels Sep 13, 2024
@YANG-DB YANG-DB moved this to Todo in PPL Commands Sep 13, 2024
@YANG-DB YANG-DB self-assigned this Sep 13, 2024
@YANG-DB YANG-DB added Lang:PPL Pipe Processing Language support and removed untriaged labels Sep 14, 2024
@YANG-DB YANG-DB removed their assignment Sep 14, 2024
@salyh
Copy link
Contributor

salyh commented Oct 6, 2024

@YANG-DB @vamsi-amazon It would be great if we can have a few examples (in- and output in tabular form) and if can clarify the relationship to the flatten (#669) command because it seems that its partially overlapping

@salyh salyh moved this from Todo to Design in PPL Commands Oct 6, 2024
@YANG-DB YANG-DB added the 0.6 label Oct 9, 2024
@YANG-DB YANG-DB moved this from Design to In Progress in PPL Commands Oct 17, 2024
@salyh salyh moved this from In Progress to Design in PPL Commands Oct 25, 2024
@YANG-DB
Copy link
Member Author

YANG-DB commented Oct 30, 2024

The main objective of expand is to mimic the following behaviour :

Given the next ip->product Id purchase table,

IP Address Total Purchases Total Products Product IDs
107.3.146.207 72 3 DB-SG-G01, FS-SG-G03, WC-SH-G04
128.241.220.82 95 2 DB-SG-G01, DC-SG-G02
194.215.205.19 60 4 DB-SG-G01, DC-SG-G02, FS-SG-G03, WC-SH-G04
211.166.11.101 91 2 DB-SG-G01, WC-SH-G04
87.194.216.51 134 3 DC-SG-G02, FS-SG-G03, WC-SH-G04

we would like to expand each individual product, or rows, for each value in a multi-value field.
So now a new row is created for each product ID. The multivalued fields are expanded into individual search results. The other fields are unchanged.

ipaddress total_purchases total_products productId
107.3.146.207 72 3 DB-SG-G01
107.3.146.207 72 3 FS-SG-G03
107.3.146.207 72 3 WC-SH-G04
128.241.220.82 95 2 DB-SG-G01
128.241.220.82 95 2 DC-SG-G02
194.215.205.19 60 4 DB-SG-G01
194.215.205.19 60 4 DC-SG-G02
194.215.205.19 60 4 FS-SG-G03
194.215.205.19 60 4 WC-SH-G04
........

The following spark sql command should allow our expanding of the multi-valued field

SELECT ipaddress, total_purchases, total_products, exploded_productId
FROM your_table
LATERAL VIEW explode(productId) AS exploded_productId

Notes
Ensure that the Product IDs in the original data are stored as an array.

This functionality will enhance our ability to analyze product purchases by providing more granular data for each product associated with an IP address.

Example ppl expand queries:

 - source = table | expand productId as product_Id

@YANG-DB YANG-DB self-assigned this Oct 30, 2024
@ykmr1224 ykmr1224 mentioned this issue Oct 31, 2024
5 tasks
@YANG-DB YANG-DB changed the title [FEATURE]New expand_field PPL Command [FEATURE]New expand PPL Command Oct 31, 2024
@YANG-DB YANG-DB moved this from Design to In Progress in PPL Commands Oct 31, 2024
@YANG-DB YANG-DB mentioned this issue Nov 5, 2024
5 tasks
@YANG-DB YANG-DB moved this from In Progress to InReview in PPL Commands Nov 5, 2024
@github-project-automation github-project-automation bot moved this from InReview to Done in PPL Commands Nov 7, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
0.6 enhancement New feature or request Lang:PPL Pipe Processing Language support
Projects
Status: Done
Development

No branches or pull requests

3 participants