Add `sample` parameter to `top` & `rare` command #879

YANG-DB · 2024-11-07T19:35:17Z

Description

Add a new `sample` option to the `top` and `rare` commands to reduce the number of scanned data points. This allows a `top` or `rare` statement to be approximated when faster, sample-based results are preferred over exact but long-running results.

source = testTable  | rare address sample(50 percent)
source = testTable  | top 5 address by country sample(25 percent)

Issues Resolved

[BUG][SanityTest] stats by a high cardinality field will cause writing job fail with "size exceed limitation" #740

Check List

Updated documentation (docs/ppl-lang/README.md)
Implemented unit tests
Implemented tests for combination with other commands
New added source code should include a copyright header
Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.


LantaoJin · 2024-11-08T03:25:22Z

One high-level question:
How do we determine the relationship between percentage and precision? Or how much precision does it lose when sampling is decreased from 100% to 80% or from 80% to 50%?

I'm wondering what kind of scenario needs to run top on the sample data.
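The percentage-versus-precision question can be explored empirically with a small simulation. The sketch below is not part of this PR and does not use Spark: the class `SampledTopSketch`, its methods, the synthetic `addr` data, and the fixed seed are all hypothetical, and plain Bernoulli row sampling is assumed as a stand-in for whatever the `sample(N percent)` option does internally. It compares the exact top 5 values against the top 5 computed on a sample, for several sampling fractions.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.Set;

public class SampledTopSketch {

    /** Keeps each row independently with probability `fraction` (Bernoulli sampling). */
    public static <T> List<T> bernoulliSample(List<T> rows, double fraction, long seed) {
        Random rng = new Random(seed);
        List<T> kept = new ArrayList<>();
        for (T row : rows) {
            if (rng.nextDouble() < fraction) {
                kept.add(row);
            }
        }
        return kept;
    }

    /** Returns the k most frequent values; ties are broken by string order. */
    public static List<String> topK(List<String> values, int k) {
        Map<String, Long> counts = new HashMap<>();
        for (String v : values) {
            counts.merge(v, 1L, Long::sum);
        }
        List<Map.Entry<String, Long>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Long.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        List<String> top = new ArrayList<>();
        for (int i = 0; i < Math.min(k, entries.size()); i++) {
            top.add(entries.get(i).getKey());
        }
        return top;
    }

    /** Number of values that appear in both top-k lists. */
    public static int overlap(List<String> exact, List<String> approx) {
        Set<String> shared = new HashSet<>(exact);
        shared.retainAll(approx);
        return shared.size();
    }

    public static void main(String[] args) {
        // Synthetic, mildly skewed data (an assumption for illustration):
        // "addr0" appears 200 times, "addr1" 199 times, and so on.
        List<String> rows = new ArrayList<>();
        for (int i = 0; i < 200; i++) {
            for (int n = 0; n < 200 - i; n++) {
                rows.add("addr" + i);
            }
        }
        List<String> exact = topK(rows, 5);
        for (double fraction : new double[] {1.0, 0.8, 0.5, 0.25}) {
            List<String> approx = topK(bernoulliSample(rows, fraction, 42L), 5);
            System.out.printf("fraction=%.2f overlap with exact top 5: %d/5%n",
                    fraction, overlap(exact, approx));
        }
    }
}
```

How much overlap is lost depends heavily on how far apart the true frequencies are: when the leading values have nearly equal counts, as in this synthetic data, even a mild reduction in the sampling fraction can reorder them, whereas a strongly skewed distribution tolerates much smaller samples.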

LantaoJin · 2024-11-08T03:31:55Z

ppl-spark-integration/src/main/antlr4/OpenSearchPPLLexer.g4

@@ -79,6 +80,7 @@ DESC:                               'DESC';
 DATASOURCES:                        'DATASOURCES';
 USING:                              'USING';
 WITH:                               'WITH';
+PERCENT:                            'PERCENT';


SAMPLE and PERCENT should be added to keywordsCanBeId

LantaoJin · 2024-11-08T03:45:06Z

ppl-spark-integration/src/main/java/org/opensearch/sql/ppl/CatalystPlanContext.java

+    public LogicalPlan withSampleRelation(Sample sampleRelation) {
+        this.relations.add(sampleRelation.child());
+        return with(sampleRelation);
+    }


I really don't like this implementation. As I mentioned in another PR, sample IMO should be a common plan node instead of being bound to a relation. I am not sure how the current behaviour works when a plan contains joins or correlated subqueries. For example, in the query

source=tableA | join ON tableA.id = tableB.id tableB | top 1 tableA.id sample(50 per)

does it equal

source=tableA sample(50 per) | join ON tableA.id = tableB.id [ source=tableB sample(50 per) ] | top 1 tableA.id

or

source=tableA | join ON tableA.id = tableB.id tableB | sample 50 per | top 1 tableA.id
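The two readings are not interchangeable, which a small stand-alone sketch can illustrate. Nothing here comes from the PR or from Spark: `SamplePlacementSketch`, its methods, the in-memory tables, and the seed are hypothetical, and independent Bernoulli sampling is assumed. With a one-to-one join key, sampling both inputs at fraction p keeps a join row only if both sides survive, so roughly p × p of the join rows remain, while sampling the join output keeps roughly p of them.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

public class SamplePlacementSketch {

    /** Keeps each row independently with probability `fraction`. */
    public static List<Integer> sample(List<Integer> rows, double fraction, Random rng) {
        List<Integer> kept = new ArrayList<>();
        for (Integer row : rows) {
            if (rng.nextDouble() < fraction) {
                kept.add(row);
            }
        }
        return kept;
    }

    /** Inner equi-join on id; emits the id once per matching (left, right) pair. */
    public static List<Integer> join(List<Integer> left, List<Integer> right) {
        Map<Integer, Integer> rightCounts = new HashMap<>();
        for (Integer id : right) {
            rightCounts.merge(id, 1, Integer::sum);
        }
        List<Integer> out = new ArrayList<>();
        for (Integer id : left) {
            int matches = rightCounts.getOrDefault(id, 0);
            for (int i = 0; i < matches; i++) {
                out.add(id);
            }
        }
        return out;
    }

    /** First reading: sample each input relation, then join. */
    public static int sampleInputsThenJoin(List<Integer> a, List<Integer> b,
                                           double fraction, long seed) {
        Random rng = new Random(seed);
        return join(sample(a, fraction, rng), sample(b, fraction, rng)).size();
    }

    /** Second reading: join the full relations, then sample the join output. */
    public static int joinThenSample(List<Integer> a, List<Integer> b,
                                     double fraction, long seed) {
        Random rng = new Random(seed);
        return sample(join(a, b), fraction, rng).size();
    }

    public static void main(String[] args) {
        // Two hypothetical tables with ids 0..9999, one row per id.
        List<Integer> tableA = new ArrayList<>();
        List<Integer> tableB = new ArrayList<>();
        for (int id = 0; id < 10_000; id++) {
            tableA.add(id);
            tableB.add(id);
        }
        System.out.println("full join rows:         " + join(tableA, tableB).size());
        System.out.println("sample inputs, then join: " + sampleInputsThenJoin(tableA, tableB, 0.5, 7L));
        System.out.println("join, then sample:        " + joinThenSample(tableA, tableB, 0.5, 7L));
    }
}
```

Because the surviving row counts differ, so can the frequencies that `top` or `rare` sees, which is why the placement of the sample node in the plan needs to be defined explicitly.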

LantaoJin · 2024-11-08T03:47:09Z

...st/src/integration/scala/org/opensearch/flint/spark/ppl/FlintSparkPPLTopAndRareITSuite.scala

@@ -84,6 +84,52 @@ class FlintSparkPPLTopAndRareITSuite
    comparePlans(expectedPlan, logicalPlan, checkAnalysis = false)
  }

+  test("create ppl rare address field query test sample 75 %") {


I strongly suggest adding some IT cases for top/rare with sample in complex join and subquery queries.

YANG-DB · 2024-11-12T03:32:22Z

Closing since this has not yet been shown to have a significant use case.

YANG-DB added 6 commits November 6, 2024 16:39

add sample option flag to sample a percentage of the table data inste…

8589188

…ad of the entire table to reduce time Signed-off-by: YANGDB <[email protected]>

add support for agg sample context

6e9d485

Signed-off-by: YANGDB <[email protected]>

add sample class

793c1ad

Signed-off-by: YANGDB <[email protected]>

add sample class

a9adaa7

Signed-off-by: YANGDB <[email protected]>

add sample tests

dce23d9

Signed-off-by: YANGDB <[email protected]>

update visitor child method on the catalyst plan visitor

1a34544

Signed-off-by: YANGDB <[email protected]>

YANG-DB requested review from dai-chen, mengweieric, vmmusings, penghuo, seankao-az, anirudha, kaituo, noCharger, LantaoJin and ykmr1224 as code owners November 7, 2024 19:35

YANG-DB changed the title ~~Add sample top rare command~~ Add sample parameter to top & rare command Nov 7, 2024

YANG-DB marked this pull request as draft November 7, 2024 19:35

YANG-DB added Lang:PPL Pipe Processing Language support 0.6 labels Nov 7, 2024

YANG-DB added 2 commits November 7, 2024 12:22

add documentation and fix IT tests

a4992e8

Signed-off-by: YANGDB <[email protected]>

update scala fmt

870d434

Signed-off-by: YANGDB <[email protected]>

YANG-DB marked this pull request as ready for review November 7, 2024 20:24

YANG-DB added 4 commits November 7, 2024 16:17

fix explain visitChild error

2197eab

Signed-off-by: YANGDB <[email protected]>

Merge branch 'main' into add-sample-top-rare-command

1f2ae52

remove non relevant tests from this PR

7c63446

Signed-off-by: YANGDB <[email protected]>

update scala fmt issue

09b6bda

Signed-off-by: YANGDB <[email protected]>

LantaoJin reviewed Nov 8, 2024

YANG-DB mentioned this pull request Nov 8, 2024

PPL tablesample command for faster approximation statements #796

Closed

YANG-DB closed this Nov 12, 2024
