[FEATURE] Add flatten Command to PPL #669

Comments
Shouldn't it be bridges.name in the flattened object?
As mentioned above: "Consider supporting multi-level flattening for more deeply nested fields (e.g., flatten details.address)." I read it as: yes, we support it. The question is what the default is and whether it should be configurable. When dealing with nested fields, see #565.
@YANG-DB @vamsi-amazon Before opening a PR, a few design questions and requirement refinements need to be discussed.
source=employees
| flatten contact
| fields name, age, contact.phone as phone, contact.address.city as city, contact.address.zipcode as zipcode

Results using alias:

source=employees
| flatten contact
| fields name, age, contact.phone, contact.address.city, contact.address.zipcode

Results:
I have a question related to the implementation details of the flatten command.
Furthermore, the following document illustrates the nested structure I am working with:

{
"_time": "2024-09-13T12:00:00",
"bridges": [
{"name": "Rialto Bridge", "length": 48},
{"name": "Bridge of Sighs", "length": 11}
],
"city": "Venice",
"country": "Italy",
"coor": {
"lat": 45.4408,
"long": 12.3155,
"alt": 2
}
}

On the other hand, flattening arrays requires the use of generator functions, that is, functions that create new rows. Unfortunately, the use of generator functions comes with quite severe restrictions.
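Roughly, this is what a generator function does (a minimal PySpark sketch against the sample data; the path is illustrative). As far as I can tell, a generator can only sit at the top level of the select list (or in a LATERAL VIEW), not nested inside another expression, and it changes the row count:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
# sample documents shown above; the path is illustrative
df = spark.read.option("multiline", "true").json("bridges_coordinates.json")

# explode_outer() creates one output row per element of the bridges array
# and keeps the Warsaw document, whose bridges field is NULL
df.select("city", F.explode_outer("bridges").alias("bridge")) \
    .select("city", "bridge.name", "bridge.length") \
    .show(truncate=False)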
Therefore, it is impossible to implement a condition that, depending on the field type, uses a generator function to flatten arrays or struct expansion to flatten structs. Even avoiding the direct combination of condition functions with generator functions, as in the example below (the conditions are pushed into a subquery), results in type-related validation errors:

select inline_outer(array), struct.*
from (
    select
        if(startswith(typeof(bridges), 'array'), bridges, null) as array,
        if(startswith(typeof(bridges), 'struct'), bridges, null) as struct
    from table1 c);

The following errors are reported:

line 1:7 no viable alternative at input 'select inline_outer'
Can only star expand struct data types. Attribute: `ArrayBuffer(struct)`.; line 1 pos 28

The approach also has another disadvantage: it is impossible to freely manipulate the names of the flattened columns. To solve the problems described above, we can use two approaches.
... flatten array my_field_with_array
| flatten struct my_field_with_struct

However, this solution still has drawbacks when it comes to freely manipulating the flattened column names.
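For illustration only, a rough sketch of what the two commands could map to on the Spark side (this is an assumption about the implementation, and the nested field names are just the placeholders from above):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
# toy frame standing in for a real table; the nested field names are placeholders
df = spark.createDataFrame(
    [(1, [(1,), (2,)], ("x",))],
    "id INT, my_field_with_array ARRAY<STRUCT<a: INT>>, my_field_with_struct STRUCT<b: STRING>",
)

# flatten array my_field_with_array: a generator, one output row per array element
df.select("*", F.expr("inline_outer(my_field_with_array)")).show()

# flatten struct my_field_with_struct: plain struct expansion, row count unchanged
df.select("*", "my_field_with_struct.*").drop("my_field_with_struct").show()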
@YANG-DB Could you please tell me what your preference is and how we should implement the flatten command?
Currently, I am using the below JSON file to test my implementation:

[
{"_time":"2024-09-13T12:00:00","bridges":[{"name":"Tower Bridge","length":801},{"name":"London Bridge","length":928}],"city":"London","country":"England","coor":{"lat":51.5074,"long":-0.1278,"alt":35}},
{"_time":"2024-09-13T12:00:00","bridges":[{"name":"Pont Neuf","length":232},{"name":"Pont Alexandre III","length":160}],"city":"Paris","country":"France","coor":{"lat":48.8566,"long":2.3522,"alt":35}},
{"_time":"2024-09-13T12:00:00","bridges":[{"name":"Rialto Bridge","length":48},{"name":"Bridge of Sighs","length":11}],"city":"Venice","country":"Italy","coor":{"lat":45.4408,"long":12.3155,"alt":2}},
{"_time":"2024-09-13T12:00:00","bridges":[{"name":"Charles Bridge","length":516},{"name":"Legion Bridge","length":343}],"city":"Prague","country":"Czech Republic","coor":{"lat":50.0755,"long":14.4378,"alt":200}},
{"_time":"2024-09-13T12:00:00","bridges":[{"name":"Chain Bridge","length":375},{"name":"Liberty Bridge","length":333}],"city":"Budapest","country":"Hungary","coor":{"lat":47.4979,"long":19.0402,"alt":96}},
{"_time":"1990-09-13T12:00:00","bridges":null,"city":"Warsaw","country":"Poland","coor":null}
]

The file is loaded using PySpark and the following commands:

df_coor = spark.read.option("multiline","true").json('/home/lukasz/projects/aws/bridges_coordinates.json')
df_coor.write.saveAsTable("bridges_coor")

I can use the following SQL statement to load the data in integration tests:

CREATE TEMPORARY VIEW bridges_coor
USING org.apache.spark.sql.json
OPTIONS (
  path "/home/lukasz/projects/aws/bridges_coordinates.json",
  multiLine true
);

@YANG-DB Is this testing method appropriate from your point of view, or do you have other recommendations (S3 tables)?
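For reference, after the view is created I check it roughly like this (illustrative only, using the spark session from the loading snippet above, not the final test code):

spark.sql("SELECT city, country FROM bridges_coor").show()
spark.sql("SELECT count(*) FROM bridges_coor WHERE bridges IS NOT NULL").show()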
@lukasz-soszynski-eliatra Thanks for the analysis - IMO separating
@LantaoJin @penghuo WDYT?
@salyh @lukasz-soszynski-eliatra please review #780 to get a few ideas
@YANG-DB
The implementation in its current shape does not support arrays of simple types (for example, arrays of strings or numbers). Here are some examples of the current behaviour. The test table contains the following data:
+-------------------+--------------------+--------+--------------------+--------------+
| _time| bridges| city| coor| country|
+-------------------+--------------------+--------+--------------------+--------------+
|2024-09-13T12:00:00|[{801, Tower Brid...| London|{35, 51.5074, -0....| England|
|2024-09-13T12:00:00|[{232, Pont Neuf}...| Paris|{35, 48.8566, 2.3...| France|
|2024-09-13T12:00:00|[{48, Rialto Brid...| Venice|{2, 45.4408, 12.3...| Italy|
|2024-09-13T12:00:00|[{516, Charles Br...| Prague|{200, 50.0755, 14...|Czech Republic|
|2024-09-13T12:00:00|[{375, Chain Brid...|Budapest|{96, 47.4979, 19....| Hungary|
|1990-09-13T12:00:00| NULL| Warsaw| NULL| Poland|
+-------------------+--------------------+--------+--------------------+--------------+
The columns bridges and coor contain nested values, which I flatten with the following command:

source=coor | flatten bridges | flatten coor;

The result of the above command is: (output not reproduced here)
Please remember that the above result contains additional rows created when the bridges array was flattened.
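To show where those extra rows come from, a small sketch against the df_coor frame loaded earlier (counts only, not the captured result):

from pyspark.sql import functions as F

before = df_coor.count()                                   # one row per source document
after = df_coor.select(F.expr("inline_outer(bridges)")).count()
# 'after' is larger because each element of bridges becomes its own row;
# the OUTER variant keeps the Warsaw document, whose bridges field is NULL
print(before, after)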
@lukasz-soszynski-eliatra ok - let's complete this PR and figure out the simple type array support afterwards - maybe in a different syntax?
Sounds good. This means that only the tests and documentation still need to be included.
Supporting arrays of simple types should be trivial if a dedicated syntax is used. It might still be possible to support such arrays with the current syntax, but that approach is probably a bit more challenging. Anyway, I will not include support for such arrays in the current PR.
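Just to sketch why simple-type arrays behave differently (a toy example, not the PR code): with an array of strings there is no struct to expand into named columns, so the output column has to be named explicitly:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, ["a", "b"])], ["id", "tags"])
df.select("id", F.explode("tags").alias("tag")).show()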
Unfortunately, after testing, it turns out that maps are also unsupported.
Is your feature request related to a problem?
OpenSearch Piped Processing Language (PPL) currently lacks a native command to flatten nested objects or arrays in documents. Many datasets, especially those containing JSON objects, have deeply nested fields that are difficult to work with in their raw form. The flatten command will simplify these structures and make it easier to analyze and extract data.

What solution would you like?
Introduce a flatten command in PPL that can handle arrays or nested fields, producing a flattened result that contains all the nested elements at the top level.

Syntax:
The flatten command takes a nested array or object field and returns each element as part of a flat structure.
Example Use Cases

This query flattens the bridges array field.

Example Input:
Expected Output:
This query flattens the details object field.

Example Input:
Expected Output:
Additional Considerations