-
Notifications
You must be signed in to change notification settings - Fork 33
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support flatten with alias #927
Conversation
Signed-off-by: Heng Qian <[email protected]>
: qualifiedName (COMMA qualifiedName)* # identsAsQualifiedNameSeq | ||
| LT_PRTHS qualifiedName (COMMA qualifiedName)* RT_PRTHS # identsAsQualifiedNameSeq |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This grammar allows four patterns:
flatten struct_col2 as field1
flatten struct_col2 as (field1)
.flatten struct_col2 as field1, field2
flatten struct_col2 as (field1, field2)
.
Is (
)
necessary or restrict to two?
flatten struct_col2 as field1
flatten struct_col2 as (field1, field2)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not restrict to two, as you said, both 4 patterns are all allowed. For most of the case, both is ok with or without (), but having () is safer to avoid ambiguous.
For example, in SQL, when having sql like select flatten(a) as b, c
, field c
will be parsed as a child of select
instead of flatten alias, so put alias inner brace will avoid this. I'm not sure if PPL will have case like that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For example, in SQL, when having sql like select flatten(a) as b, c, field c will be parsed as a child of select instead of flatten alias, so put alias inner brace will avoid this. I'm not sure if PPL will have case like that.
Yes. That's my point. How about only remains
- flatten struct_col2 as field1
- flatten struct_col2 as (field1)
- flatten struct_col2 as (field1, field2)
or
- flatten struct_col2 as (field1)
- flatten struct_col2 as (field1, field2)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@YANG-DB any thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@LantaoJin I personally dont mind the extra verbose - so Im good with @qianheng-aws suggestion...
ppl-spark-integration/src/main/java/org/opensearch/sql/ast/tree/Flatten.java
Outdated
Show resolved
Hide resolved
@@ -347,4 +347,30 @@ class FlintSparkPPLFlattenITSuite | |||
val expectedPlan = Project(Seq(UnresolvedStar(None)), flattenMultiValue) | |||
comparePlans(logicalPlan, expectedPlan, checkAnalysis = false) | |||
} | |||
|
|||
test("flatten struct nested table using alias") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you add a test for duplicated aliases:
| flatten struct_col as (field1, field2_2)
| flatten struct_col2 as (field1, field2_2)
And
| source = $structNestedTable
| flatten struct_col
| flatten field1 as int_col
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Case 1 will be similar to this test except that it aliased field2
to field2_2
.
Line 298 in b050da3
frame.columns.sameElements(Array("int_col", "field2", "subfield", "field2", "subfield"))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Case 1 will
I mean that will the duplicated field2_2
in query fail?
source = ... | flatten struct_col as (field1, field2_2) | flatten struct_col2 as (field1, field2_2)
High level question: why the solution changed to add aliases instead of distinguishing by |
|
@qianheng-aws Please keep this PR DRAFT before PPL design review is passed. |
We cannot do that in parsing phase. That's say, we don't know the actual fields inner the struct field. So I give up that solution and choose to support alias syntax instead of giving alias automatically inner our parser, then users can still do that to avoid duplicate column name in their PPL. And as said in the issue #911 (comment), it's actually a common issue for asyn-query not just for flatten. I think maybe support alias is a more appropriate way to address such issue. |
Then we need to figure out another way to pass that fields mapping to Spark operator |
Ok, I see now. So for the column |
docs/ppl-lang/ppl-flatten-command.md
Outdated
@@ -7,9 +7,10 @@ Using `flatten` command to flatten a field of type: | |||
|
|||
|
|||
### Syntax | |||
`flatten <field>` | |||
`flatten <field> [As alias]` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[as aliasSequence]
@@ -347,4 +347,30 @@ class FlintSparkPPLFlattenITSuite | |||
val expectedPlan = Project(Seq(UnresolvedStar(None)), flattenMultiValue) | |||
comparePlans(logicalPlan, expectedPlan, checkAnalysis = false) | |||
} | |||
|
|||
test("flatten struct nested table using alias") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Case 1 will
I mean that will the duplicated field2_2
in query fail?
source = ... | flatten struct_col as (field1, field2_2) | flatten struct_col2 as (field1, field2_2)
| source = $structNestedTable | ||
| | flatten struct_col | ||
| | flatten field1 as subfield_1 | ||
| | flatten struct_col2 as (field1, field2_2) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add a test case: it should fail
source =... | flatten struct_col as (field1_1)
The basic approach works for me. Only some minor comments. @YANG-DB please take a look. |
Signed-off-by: Heng Qian <[email protected]>
Signed-off-by: Heng Qian <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@qianheng-aws plz add additional flatten query sample here
@@ -153,4 +153,45 @@ class PPLLogicalPlanFlattenCommandTranslatorTestSuite | |||
comparePlans(expectedPlan, logPlan, checkAnalysis = false) | |||
} | |||
|
|||
test("test flatten with one alias") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@qianheng-aws plz add additional flatten with aliases queries unit tests
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@YANG-DB we already have but named "test flatten with alias list"
Signed-off-by: Heng Qian <[email protected]>
|
||
// Throw AnalysisException if The number of aliases supplied in the AS clause does not match the | ||
// number of columns output | ||
assertThrows[AnalysisException] { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add error message assertion in case we could be informed any behaviour changes in Spark upgrading.
// duplicate alias names | ||
val frame2 = sql(s""" | ||
| source = $structNestedTable | ||
| | flatten struct_col as (field1, field2_2) | ||
| | flatten field1 as subfield_1 | ||
| | flatten struct_col2 as (field1, field2_2) | ||
| | flatten field1 as subfield_2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
duplicated alias case I mentioned is
| flatten struct_col as (field1, field2_2)
| flatten struct_col2 as (field1, field2_2)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any concern on your case? I think the purpose is to test on duplicate alias names between 2 flatten command. I only change the field name in the original test cases to satisfy our purpose, so we can reuse the constructed expected results.
: qualifiedName (COMMA qualifiedName)* # identsAsQualifiedNameSeq | ||
| LT_PRTHS qualifiedName (COMMA qualifiedName)* RT_PRTHS # identsAsQualifiedNameSeq |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@YANG-DB any thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM - thanks @qianheng-aws
Signed-off-by: Heng Qian <[email protected]>
// Compare the results | ||
implicit val rowOrdering: Ordering[Row] = Ordering.by[Row, Int](_.getAs[Int](0)) | ||
assert(results.sorted.sameElements(expectedResults.sorted)) | ||
assert(results.sorted.sameElements(expectedResults)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
try using assertSameRows
in future which won't consider any ordering.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, seems that function will do sorting internally
* Support flatten with alias Signed-off-by: Heng Qian <[email protected]> * Address comments Signed-off-by: Heng Qian <[email protected]> * Address comments Signed-off-by: Heng Qian <[email protected]> * Add flatten with alias example Signed-off-by: Heng Qian <[email protected]> * Add IT for ambiguous exception Signed-off-by: Heng Qian <[email protected]> --------- Signed-off-by: Heng Qian <[email protected]> (cherry picked from commit dc690b4) Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
* Support flatten with alias * Address comments * Address comments * Add flatten with alias example * Add IT for ambiguous exception --------- (cherry picked from commit dc690b4) Signed-off-by: Heng Qian <[email protected]> Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com> Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
* Support flatten with alias Signed-off-by: Heng Qian <[email protected]> * Address comments Signed-off-by: Heng Qian <[email protected]> * Address comments Signed-off-by: Heng Qian <[email protected]> * Add flatten with alias example Signed-off-by: Heng Qian <[email protected]> * Add IT for ambiguous exception Signed-off-by: Heng Qian <[email protected]> --------- Signed-off-by: Heng Qian <[email protected]>
Description
Support flatten with alias. e.g.
Related Issues
Resolve #911
We can use alias to avoid duplicate column name in the final result.
Check List
--signoff
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.