Translate PPL dedup Command Part 1: allowedDuplication=1 #521

Merged
merged 4 commits into opensearch-project:main, Aug 8, 2024

Conversation

LantaoJin
Member

@LantaoJin LantaoJin commented Aug 5, 2024

Description

Syntax

dedup [int] <field-list> [keepempty=<bool>] [consecutive=<bool>]

This PR translates the following three PPL dedup commands into different Spark logical plans; a DataFrame-level sketch of the first two follows the list.
Note that the allowed duplication in this PR is 1. Assume the <field-list> is a, b.

  1. keepempty=false: | dedup [1] a, b [keepempty=false]
Deduplicate ['a, 'b]
+- Filter (isnotnull('a) AND isnotnull('b))
     +- Project
          +- UnresolvedRelation
  2. keepempty=true : | dedup [1] a, b keepempty=true
Union
:- Deduplicate ['a, 'b]
:   +- Filter (isnotnull('a) AND isnotnull('b))
:        +- Project
:             +- UnresolvedRelation
+- Filter (isnull('a) OR isnull('b))
     +- Project
          +- UnresolvedRelation
  3. consecutive=true: | dedup [1] a, b [keepempty=true] consecutive=true
UnsupportedOperationException("Consecutive deduplication is not supported")
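For readers who think at the DataFrame level rather than in Catalyst trees, the first two translations correspond roughly to the sketch below. This is illustrative only: the input Dataset logs and the class/method names are hypothetical, and the PR itself builds the logical plans directly in the Catalyst visitor.

import static org.apache.spark.sql.functions.col;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

public final class DedupSketch {

  // keepempty=false: drop rows where any dedup key is null, then deduplicate on (a, b)
  static Dataset<Row> keepEmptyFalse(Dataset<Row> logs) {
    return logs
        .filter(col("a").isNotNull().and(col("b").isNotNull())) // Filter (isnotnull('a) AND isnotnull('b))
        .dropDuplicates("a", "b");                              // Deduplicate ['a, 'b]
  }

  // keepempty=true: union the deduplicated non-null rows with every row that has a null key
  static Dataset<Row> keepEmptyTrue(Dataset<Row> logs) {
    return keepEmptyFalse(logs)                                       // left branch of the Union
        .union(logs.filter(col("a").isNull().or(col("b").isNull()))); // right branch of the Union
  }
}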

Issues Resolved

Resolves #523 (Subtask of #421)

Check List

  • Updated documentation (ppl-spark-integration/README.md)
  • Implemented unit tests
  • Implemented tests for combination with other commands
  • Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@LantaoJin
Member Author

LantaoJin commented Aug 5, 2024

Part 2

To translate the dedup command with allowedDuplication > 1, such as | dedup 2 a,b, to a Spark plan, the solution I have in mind is to translate it into a plan with a Window function (e.g. row_number) and a Filter on a new row_number_col column; a DataFrame-level sketch follows the plans below.

  • For | dedup 2 a, b keepempty=false
Project ['a, 'b] // ensures row_number_col is removed; should contain all fields except row_number_col, not only a and b
+- Filter ('row_number_col <= 2) // allowed duplication = 2
   +- Window [row_number() windowspecdefinition('a, 'b, 'order_key ASC NULLS FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS 'row_number_col], ['a, 'b], ['order_key ASC NULLS FIRST]
       +- Filter (isnotnull('a) AND isnotnull('b)) // keepempty=false
          +- Project
             +- UnresolvedRelation
  • For | dedup 2 a, b keepempty=true
Union
:- Project ['a, 'b]
:  +- Filter ('row_number_col <= 2)
:     +- Window [row_number() windowspecdefinition('a, 'b, 'order_key ASC NULLS FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS 'row_number_col], ['a, 'b], ['order_key ASC NULLS FIRST]
:        +- Filter (isnotnull('a) AND isnotnull('b))
:           +- Project
:              +- UnresolvedRelation
+- Filter (isnull('a) OR isnull('b))
   +- Project
      +- UnresolvedRelation
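
At the DataFrame level, the keepempty=false variant of this Part 2 plan would look roughly like the following (continuing the hypothetical DedupSketch class above; order_key is the still-open question discussed next):

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.row_number;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.expressions.Window;
import org.apache.spark.sql.expressions.WindowSpec;

static Dataset<Row> dedup2KeepEmptyFalse(Dataset<Row> logs) {
  // Window partitioned by the dedup keys and ordered by some order_key
  WindowSpec w = Window.partitionBy(col("a"), col("b")).orderBy(col("order_key"));
  return logs
      .filter(col("a").isNotNull().and(col("b").isNotNull())) // keepempty=false
      .withColumn("row_number_col", row_number().over(w))     // Window [row_number() ...]
      .filter(col("row_number_col").leq(2))                   // allowed duplication = 2
      .drop("row_number_col");                                // Project: remove the helper column
}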

One unclarified point: what should 'order_key be in the Window node?

Option 1

AFAIK, there should always be a time-series field in log analysis, such as _time or @timestamp. How do we get it in the AST?

Option 2

If we cannot identify an existing time-series field to use as the order_key, one option is to change the syntax of the dedup command as follows (making sure the sql repo changes first):

| dedup 2 a, b order by c
| dedup 2 a, b orderkey=c

Option 3

partitioning and ordering by the same columns:

Window.partitionBy(dedupColumns).orderBy(dedupColumns)

In that case, the results can be non-deterministic for large datasets. The non-determinism arises because the sort has no secondary tie-breaking rule to ensure a consistent order of rows within each partition: if multiple rows share the same values for the partition and order columns, their relative order can vary across runs.
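
To make the failure mode concrete, here is a sketch (same hypothetical setup and imports as above): with partition and order columns identical, every row in a partition ties on the order key, so row_number() assignment is effectively arbitrary.

// Option 3: every row in a partition has identical (a, b), so the
// ORDER BY breaks no ties; which rows receive row_number() <= 2 and
// survive the filter can differ between runs on large datasets.
WindowSpec w = Window.partitionBy(col("a"), col("b")).orderBy(col("a"), col("b"));
Dataset<Row> kept = logs
    .withColumn("row_number_col", row_number().over(w))
    .filter(col("row_number_col").leq(2))
    .drop("row_number_col");

Arguably this is tolerable for dedup itself, since the command only promises to keep at most N rows per key rather than any particular N rows.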

A tricky case for option 3 arises when a sorter precedes the dedup command:

| sort c | dedup 2 a,b

Any thoughts? @YANG-DB @penghuo @dai-chen

FYI: in Spark, a window function requires the window to be ordered; the query fails if the ORDER BY clause is not specified. For example: SELECT <wf>(expr) OVER ([PARTITION BY window_partition] ORDER BY window_ordering) FROM table.
But in MySQL and opensearch-sql, the ORDER BY clause can be omitted.

@LantaoJin LantaoJin marked this pull request as ready for review August 5, 2024 09:45
@LantaoJin
Member Author

The second unclarified point concerns the nodes Deduplicate and DeduplicateWithinWatermark.
https://issues.apache.org/jira/browse/SPARK-42931 introduced a new LogicalPlan, DeduplicateWithinWatermark, which deduplicates events within the watermark and is available in Spark 3.5.0+.
I am not clear on the architecture of Flint, but it seems to run on Spark Streaming. Deduplication on a streaming dataset has different semantics from deduplication on a batch dataset:

We document the behavior clearly that the event time column should be a part of the subset columns for deduplication to clean up the state, but it cannot be applied to the customers as timestamps are not exactly the same for duplicated events in their use cases.

I need more guidance on the purpose of this repository and how it should be used in production going forward. For the time being, I chose Deduplicate in this PR (DeduplicateWithinWatermark requires 3.5.0+).
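
For reference, the Spark 3.5.0+ API behind DeduplicateWithinWatermark is Dataset.dropDuplicatesWithinWatermark. A minimal sketch of the semantic difference (the Dataset events and its column names are hypothetical):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Plain dedup: exact match on the subset columns; on a stream, state
// grows without bound unless the event-time column is in the subset.
Dataset<Row> exact = events.dropDuplicates("guid");

// Spark 3.5.0+: events with the same guid arriving within the watermark
// delay are deduplicated even if their timestamps differ slightly, and
// the state is cleaned up once the watermark advances past them.
Dataset<Row> withinWatermark = events
    .withWatermark("event_time", "10 minutes")
    .dropDuplicatesWithinWatermark("guid");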

@penghuo
Collaborator

penghuo commented Aug 6, 2024

[Quotes the "Part 2" proposal above: the Window/row_number translation plans, the open question of what 'order_key should be, and the FYI that Spark requires the window to be ordered.]

Agree on the Window function approach.

Questions

  • What is the difference between Deduplicate and Window? Will Deduplicate be optimized into a Window function?
  • Can window_partition and window_ordering be the same field?

@LantaoJin
Member Author

LantaoJin commented Aug 7, 2024

Questions

What is the difference between Deduplicate and Window? Will Deduplicate be optimized into a Window function?
Can window_partition and window_ordering be the same field?

  1. They are totally different. Deduplicate will actually be optimized into an Aggregate; using Deduplicate instead of Aggregate makes the code more readable. Ref https://github.com/apache/spark/blob/b3877894acb512477d41038269f20158d7bca8f3/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala#L2261
  2. Yes, we can, but it brings non-deterministic results: the order can vary across runs.
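
For context, the linked ReplaceDeduplicateWithAggregate rule rewrites Deduplicate into an Aggregate that groups on the dedup keys and keeps the first value of every other column — roughly the DataFrame equivalent of the following (hypothetical column c standing in for the non-key columns; Dataset/Row imports as in the sketches above):

import static org.apache.spark.sql.functions.first;

// Deduplicate ['a, 'b] on a batch plan becomes, after optimization,
// an Aggregate grouping on (a, b) with first(...) for each other column:
Dataset<Row> deduped = logs.groupBy("a", "b")
    .agg(first("c").alias("c")); // one first(col) per non-key column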

I updated the comment #521 (comment)

@LantaoJin
Member Author

LantaoJin commented Aug 7, 2024

Summary

Discussed with @penghuo offline. Here are the conclusions:

  1. For the nodes Deduplicate and DeduplicateWithinWatermark

Deduplicate will be optimized into an Aggregate, but DeduplicateWithinWatermark will be transformed into a new physical plan node, StreamingDeduplicateWithinWatermarkExec. Considering current usage, #521 will keep using Deduplicate.

  2. For the 'order_key in Window

We will choose option 3 (partitioning and ordering by the same columns). Ref #521 (comment)

CC @YANG-DB and @dai-chen

Again, none of the above should block code review for this PR.

@@ -271,7 +273,105 @@ public LogicalPlan visitWindowFunction(WindowFunction node, CatalystPlanContext

 @Override
 public LogicalPlan visitDedupe(Dedupe node, CatalystPlanContext context) {
-    throw new IllegalStateException("Not Supported operation : dedupe ");
+    node.getChild().get(0).accept(this, context);
Member
Not mandatory, but for the future: let's extract the concrete logic into a dedicated strategy class and inject it into the QueryPlanner.
This would reduce complexity and simplify testing and separation of concerns.
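
One possible shape for that refactor, purely as a hypothetical sketch (the DedupStrategy name and interface are not part of this PR; Dedupe, CatalystPlanContext, and LogicalPlan are the types already used in the visitor):

// Hypothetical: translation logic for dedup lives in its own class
// that the planner/visitor receives by injection.
interface DedupStrategy {
  LogicalPlan apply(Dedupe node, CatalystPlanContext context);
}

// visitDedupe would then only visit the child and delegate:
@Override
public LogicalPlan visitDedupe(Dedupe node, CatalystPlanContext context) {
  node.getChild().get(0).accept(this, context);
  return dedupStrategy.apply(node, context); // injected DedupStrategy
}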

@LantaoJin
Member Author

@penghuo @dai-chen @YANG-DB Any other concerns about it? Can this PR be merged now?

@YANG-DB YANG-DB merged commit 7c4244f into opensearch-project:main Aug 8, 2024
4 checks passed
@YANG-DB YANG-DB added Lang:PPL Pipe Processing Language support backport 0.5 labels Aug 14, 2024
opensearch-trigger-bot bot pushed a commit that referenced this pull request Aug 14, 2024
* Translate PPL Dedup Command: only one duplication allowed

Signed-off-by: Lantao Jin <[email protected]>

* add document

Signed-off-by: Lantao Jin <[email protected]>

---------

Signed-off-by: Lantao Jin <[email protected]>
Signed-off-by: YANGDB <[email protected]>
Co-authored-by: YANGDB <[email protected]>
(cherry picked from commit 7c4244f)
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
YANG-DB added a commit that referenced this pull request Aug 14, 2024
* Translate PPL Dedup Command: only one duplication allowed

* add document

---------

(cherry picked from commit 7c4244f)

Signed-off-by: Lantao Jin <[email protected]>
Signed-off-by: YANGDB <[email protected]>
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: YANGDB <[email protected]>