Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support flatten with alias #927

Merged
merged 5 commits into from
Nov 20, 2024

Conversation

qianheng-aws
Copy link
Contributor

Description

Support flatten with alias. e.g.

source=table | flatten coor as (altitude, latitude, longitude)

source=table | flatten struct as subfield 

Related Issues

Resolve #911

We can use alias to avoid duplicate column name in the final result.

Check List

  • Updated documentation (docs/ppl-lang/README.md)
  • Implemented unit tests
  • Implemented tests for combination with other commands
  • New added source code should include a copyright header
  • Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Comment on lines +1036 to +1037
: qualifiedName (COMMA qualifiedName)* # identsAsQualifiedNameSeq
| LT_PRTHS qualifiedName (COMMA qualifiedName)* RT_PRTHS # identsAsQualifiedNameSeq
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This grammar allows four patterns:

  • flatten struct_col2 as field1
  • flatten struct_col2 as (field1).
  • flatten struct_col2 as field1, field2
  • flatten struct_col2 as (field1, field2).

Is ( ) necessary or restrict to two?

  • flatten struct_col2 as field1
  • flatten struct_col2 as (field1, field2)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not restrict to two, as you said, both 4 patterns are all allowed. For most of the case, both is ok with or without (), but having () is safer to avoid ambiguous.

For example, in SQL, when having sql like select flatten(a) as b, c, field c will be parsed as a child of select instead of flatten alias, so put alias inner brace will avoid this. I'm not sure if PPL will have case like that.

Copy link
Member

@LantaoJin LantaoJin Nov 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For example, in SQL, when having sql like select flatten(a) as b, c, field c will be parsed as a child of select instead of flatten alias, so put alias inner brace will avoid this. I'm not sure if PPL will have case like that.

Yes. That's my point. How about only remains

  • flatten struct_col2 as field1
  • flatten struct_col2 as (field1)
  • flatten struct_col2 as (field1, field2)

or

  • flatten struct_col2 as (field1)
  • flatten struct_col2 as (field1, field2)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@YANG-DB any thoughts?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@LantaoJin I personally dont mind the extra verbose - so Im good with @qianheng-aws suggestion...

@@ -347,4 +347,30 @@ class FlintSparkPPLFlattenITSuite
val expectedPlan = Project(Seq(UnresolvedStar(None)), flattenMultiValue)
comparePlans(logicalPlan, expectedPlan, checkAnalysis = false)
}

test("flatten struct nested table using alias") {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add a test for duplicated aliases:

| flatten struct_col as (field1, field2_2)
| flatten struct_col2 as (field1, field2_2)

And

| source = $structNestedTable
| flatten struct_col
| flatten field1 as int_col

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Case 1 will be similar to this test except that it aliased field2 to field2_2.

frame.columns.sameElements(Array("int_col", "field2", "subfield", "field2", "subfield")))
.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Case 1 will

I mean that will the duplicated field2_2 in query fail?

source = ... | flatten struct_col as (field1, field2_2) | flatten struct_col2 as (field1, field2_2)

@LantaoJin
Copy link
Member

LantaoJin commented Nov 18, 2024

High level question: why the solution changed to add aliases instead of distinguishing by struct_col.field1 and struct_col2.field1? The second solution should be more graceful.

@LantaoJin
Copy link
Member

source=table | flatten struct as subfield seems quite confused.
I'd like to change the PPL design to
source=table | flatten struct rename origin_subfield1 as new_subfield1, origin_subfield1 as new_subfield1,... even in the alias solution.

@LantaoJin
Copy link
Member

LantaoJin commented Nov 18, 2024

@qianheng-aws Please keep this PR DRAFT before PPL design review is passed.

@qianheng-aws
Copy link
Contributor Author

qianheng-aws commented Nov 18, 2024

High level question: why the solution changed to add aliases instead of distinguishing by struct_col.field1 and struct_col2.field1? The second solution should be more graceful.

We cannot do that in parsing phase. That's say, we don't know the actual fields inner the struct field. So I give up that solution and choose to support alias syntax instead of giving alias automatically inner our parser, then users can still do that to avoid duplicate column name in their PPL.

And as said in the issue #911 (comment), it's actually a common issue for asyn-query not just for flatten. I think maybe support alias is a more appropriate way to address such issue.

@qianheng-aws
Copy link
Contributor Author

source=table | flatten struct as subfield seems quite confused. I'd like to change the PPL design to source=table | flatten struct rename origin_subfield1 as new_subfield1, origin_subfield1 as new_subfield1,... even in the alias solution.

Then we need to figure out another way to pass that fields mapping to Spark operator Generate. Unfortunately, Spark doesn't have such mechanism, it only support make alias to the original operator output in sequence.

@qianheng-aws qianheng-aws marked this pull request as draft November 18, 2024 09:02
@LantaoJin
Copy link
Member

source=table | flatten struct as subfield seems quite confused. I'd like to change the PPL design to source=table | flatten struct rename origin_subfield1 as new_subfield1, origin_subfield1 as new_subfield1,... even in the alias solution.

Then we need to figure out another way to pass that fields mapping to Spark operator Generate. Unfortunately, Spark doesn't have such mechanism, it only support make alias to the original operator output in sequence.

Ok, I see now. So for the column struct_col STRUCT<field1: STRUCT<subfield:STRING>, field2: INT> case, we must add aliases sequence such as | flatten struct_col as (field1_1, field2_2).

@@ -7,9 +7,10 @@ Using `flatten` command to flatten a field of type:


### Syntax
`flatten <field>`
`flatten <field> [As alias]`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[as aliasSequence]

docs/ppl-lang/ppl-flatten-command.md Outdated Show resolved Hide resolved
docs/ppl-lang/ppl-flatten-command.md Outdated Show resolved Hide resolved
@@ -347,4 +347,30 @@ class FlintSparkPPLFlattenITSuite
val expectedPlan = Project(Seq(UnresolvedStar(None)), flattenMultiValue)
comparePlans(logicalPlan, expectedPlan, checkAnalysis = false)
}

test("flatten struct nested table using alias") {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Case 1 will

I mean that will the duplicated field2_2 in query fail?

source = ... | flatten struct_col as (field1, field2_2) | flatten struct_col2 as (field1, field2_2)

| source = $structNestedTable
| | flatten struct_col
| | flatten field1 as subfield_1
| | flatten struct_col2 as (field1, field2_2)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add a test case: it should fail

source =... | flatten struct_col as (field1_1)

@LantaoJin
Copy link
Member

The basic approach works for me. Only some minor comments. @YANG-DB please take a look.

Signed-off-by: Heng Qian <[email protected]>
@qianheng-aws qianheng-aws marked this pull request as ready for review November 18, 2024 09:51
Signed-off-by: Heng Qian <[email protected]>
@YANG-DB YANG-DB added Lang:PPL Pipe Processing Language support 0.6 sanity-0.6 backport 0.6 labels Nov 18, 2024
Copy link
Member

@YANG-DB YANG-DB left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@qianheng-aws plz add additional flatten query sample here

@@ -153,4 +153,45 @@ class PPLLogicalPlanFlattenCommandTranslatorTestSuite
comparePlans(expectedPlan, logPlan, checkAnalysis = false)
}

test("test flatten with one alias") {
Copy link
Member

@YANG-DB YANG-DB Nov 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@qianheng-aws plz add additional flatten with aliases queries unit tests

Copy link
Contributor Author

@qianheng-aws qianheng-aws Nov 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@YANG-DB we already have but named "test flatten with alias list"


// Throw AnalysisException if The number of aliases supplied in the AS clause does not match the
// number of columns output
assertThrows[AnalysisException] {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add error message assertion in case we could be informed any behaviour changes in Spark upgrading.

Comment on lines +375 to +381
// duplicate alias names
val frame2 = sql(s"""
| source = $structNestedTable
| | flatten struct_col as (field1, field2_2)
| | flatten field1 as subfield_1
| | flatten struct_col2 as (field1, field2_2)
| | flatten field1 as subfield_2
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

duplicated alias case I mentioned is

| flatten struct_col as (field1, field2_2)
| flatten struct_col2 as (field1, field2_2)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any concern on your case? I think the purpose is to test on duplicate alias names between 2 flatten command. I only change the field name in the original test cases to satisfy our purpose, so we can reuse the constructed expected results.

Comment on lines +1036 to +1037
: qualifiedName (COMMA qualifiedName)* # identsAsQualifiedNameSeq
| LT_PRTHS qualifiedName (COMMA qualifiedName)* RT_PRTHS # identsAsQualifiedNameSeq
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@YANG-DB any thoughts?

Copy link
Member

@YANG-DB YANG-DB left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM - thanks @qianheng-aws

// Compare the results
implicit val rowOrdering: Ordering[Row] = Ordering.by[Row, Int](_.getAs[Int](0))
assert(results.sorted.sameElements(expectedResults.sorted))
assert(results.sorted.sameElements(expectedResults))
Copy link
Member

@LantaoJin LantaoJin Nov 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

try using assertSameRows in future which won't consider any ordering.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, seems that function will do sorting internally

@LantaoJin LantaoJin merged commit dc690b4 into opensearch-project:main Nov 20, 2024
4 checks passed
opensearch-trigger-bot bot pushed a commit that referenced this pull request Nov 20, 2024
* Support flatten with alias

Signed-off-by: Heng Qian <[email protected]>

* Address comments

Signed-off-by: Heng Qian <[email protected]>

* Address comments

Signed-off-by: Heng Qian <[email protected]>

* Add flatten with alias example

Signed-off-by: Heng Qian <[email protected]>

* Add IT for ambiguous exception

Signed-off-by: Heng Qian <[email protected]>

---------

Signed-off-by: Heng Qian <[email protected]>
(cherry picked from commit dc690b4)
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
YANG-DB pushed a commit that referenced this pull request Nov 20, 2024
* Support flatten with alias



* Address comments



* Address comments



* Add flatten with alias example



* Add IT for ambiguous exception



---------


(cherry picked from commit dc690b4)

Signed-off-by: Heng Qian <[email protected]>
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
kenrickyap pushed a commit to Bit-Quill/opensearch-spark that referenced this pull request Dec 11, 2024
* Support flatten with alias

Signed-off-by: Heng Qian <[email protected]>

* Address comments

Signed-off-by: Heng Qian <[email protected]>

* Address comments

Signed-off-by: Heng Qian <[email protected]>

* Add flatten with alias example

Signed-off-by: Heng Qian <[email protected]>

* Add IT for ambiguous exception

Signed-off-by: Heng Qian <[email protected]>

---------

Signed-off-by: Heng Qian <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
0.6 backport 0.6 Lang:PPL Pipe Processing Language support sanity-0.6
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] Flatten query get error due to duplicate fields
3 participants