PPL `tablesample` command for faster approximation statements #796

YANG-DB · 2024-10-21T03:38:00Z

Description

Add a new sample command (tablesample) that reduces the number of scanned data points and allows a statement to be approximated when faster, sample-based results are preferred over exact, long-running results.

source = testTable TABLESAMPLE(50 percent) | rare address

Issues Resolved

[BUG][SanityTest] stats by a high cardinality field will cause writing job fail with "size exceed limitation" #740

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.


YANG-DB · 2024-10-21T03:40:46Z

@LantaoJin @penghuo I'd like your comments on this direction for approximation-based queries.

YANG-DB · 2024-10-21T22:12:31Z

The following example demonstrates how to sample 75% of the data from a table and then perform an aggregation (finding the top countries by occupation).

PPL query:

os> source = account TABLESAMPLE(75 percent) | top 3 country by occupation

This query samples 75% of the records from the account table, then retrieves the top 3 countries grouped by occupation.

SQL Query Equivalent:

SELECT *
FROM (
    SELECT country, occupation, COUNT(country) AS count_country
    FROM account TABLESAMPLE(75 PERCENT)
    GROUP BY country, occupation
    ORDER BY COUNT(country) DESC NULLS LAST
    LIMIT 3
) AS subquery
LIMIT 3;

Logical Plan Equivalent:

'Project [*]
+- 'GlobalLimit 3
   +- 'LocalLimit 3
      +- 'Sort ['COUNT('country) AS count_country#68 DESC NULLS LAST], true
         +- 'Aggregate ['country, 'occupation AS occupation#67], ['COUNT('country) AS count_country#66, 'country, 'occupation AS occupation#67]
            +- 'Sample 0.0, 0.75, false, 0
               +- 'UnresolvedRelation [account], [], false

By introducing the TABLESAMPLE instruction into the source command, you can now sample data as part of a query and reduce the amount of data scanned, thereby trading precision for performance.

The percent parameter controls how closely the result approximates the true value, and so sets the trade-off between accuracy and performance.
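To make the trade-off concrete, here is a minimal Python sketch. It is not the PR's implementation: it assumes Bernoulli-style sampling (each row is kept independently with probability percent/100), uses made-up data, and the helper names `tablesample` and `estimate_count` are hypothetical. The scale-up step is also an illustration only; the `top` command in the example above reports counts over the sample as-is.

```python
import random


def tablesample(rows, percent, seed=0):
    """Keep each row independently with probability percent/100 (Bernoulli sampling)."""
    rng = random.Random(seed)
    fraction = percent / 100.0
    return [row for row in rows if rng.random() < fraction]


def estimate_count(sampled_count, percent):
    """Scale a count observed on the sample up to an estimate for the full table."""
    return sampled_count * 100.0 / percent


# Hypothetical data: 10,000 accounts, 3,000 of them in "US".
rows = ["US"] * 3000 + ["CA"] * 7000

sample = tablesample(rows, percent=75)
us_in_sample = sum(1 for country in sample if country == "US")

print("rows kept by the sample:", len(sample), "of", len(rows))
print("estimated US count:", round(estimate_count(us_in_sample, 75)), "(true value: 3000)")
```

A lower percent keeps fewer rows, so the later stages have less to process, but the estimate varies more from run to run; a higher percent does the opposite.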

LantaoJin · 2024-10-21T23:32:07Z

High-level question: will the query below work as expected?

os> source = account1 TABLESAMPLE(75 percent), account2 TABLESAMPLE(10 percent) | top 3 country by occupation

LantaoJin · 2024-10-21T23:42:45Z

Thinking long term, how about just renaming TABLESAMPLE to SAMPLE? It seems sampling could be applied not only to a table but to any plan node.
That also makes me wonder: why not make it a new command? For example, source = t status=200 | sample 50 percent | ...
has very different semantics from source = t sample(50 percent) status=200 | ...
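One observable difference between the two placements is how much data the filter has to look at. The Python sketch below is an illustration under assumptions, not the behavior of this PR: it models sampling as independent per-row coin flips, uses made-up rows, and counts how many rows the predicate is evaluated on. The names `sample` and `counting_filter` are hypothetical.

```python
import random


def sample(rows, percent, rng):
    """Keep each row independently with probability percent/100."""
    fraction = percent / 100.0
    return [row for row in rows if rng.random() < fraction]


def counting_filter(rows, predicate, counter):
    """Apply the predicate and record how many rows it was evaluated on."""
    kept = []
    for row in rows:
        counter["evaluated"] += 1
        if predicate(row):
            kept.append(row)
    return kept


# Hypothetical table: 10,000 rows, one in ten has status 200.
rows = [{"status": 200 if i % 10 == 0 else 404} for i in range(10000)]


def is_ok(row):
    return row["status"] == 200


# source = t sample(50 percent) status=200: sample the table, then filter.
sample_first = {"evaluated": 0}
a = counting_filter(sample(rows, 50, random.Random(0)), is_ok, sample_first)

# source = t status=200 | sample 50 percent: filter the table, then sample.
filter_first = {"evaluated": 0}
b = sample(counting_filter(rows, is_ok, filter_first), 50, random.Random(0))

print("sample first, rows filtered:", sample_first["evaluated"])
print("filter first, rows filtered:", filter_first["evaluated"])
```

In the sample-first form the predicate only sees roughly half of the table, while in the filter-first form it sees every row and the sample is taken from the matching rows.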

LantaoJin · 2024-10-21T23:36:25Z

ppl-spark-integration/src/main/java/org/opensearch/sql/ppl/CatalystQueryPlanVisitor.java

@@ -160,11 +160,18 @@ public LogicalPlan visitRelation(Relation node, CatalystPlanContext context) {
                            true,
                            DescribeRelation$.MODULE$.getOutputAttrs()));
        }
+        //populate table sampling
+        context.withSampling(node.getTablesampleContext());


Should this line move after L169?

YANG-DB · 2024-10-22

Good point, I'll refactor this part.

LantaoJin · 2024-10-22T00:10:51Z

Is Sample a non-deterministic operator? If so, it may prevent DSL pushdown in SQL. I'm not sure how it interacts with Spark pushdown optimizations such as FilterPushdown or AggregatePushdown, so it might be a performance barrier that users do not expect. We need to confirm this and, if it is the case, highlight it in the docs.
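As a rough illustration of why reordering around a sampler is delicate, the Python sketch below models a seeded sampler as a fixed sequence of draws consumed one per input row. This is an assumption made for illustration, not a description of how Spark or this PR implements sampling, and the draw values are made up. Under that model, moving a filter below the sample changes which rows line up with which draws, so the result changes.

```python
def sample(rows, draws, fraction):
    """Keep the i-th row when the i-th draw falls below `fraction`.

    `draws` stands in for the output of a seeded random generator: a fixed
    sequence that is consumed one value per input row, in order.
    """
    return [row for row, draw in zip(rows, draws) if draw < fraction]


def is_even(n):
    return n % 2 == 0


rows = [1, 2, 3, 4, 5, 6, 7, 8]
draws = [0.1, 0.9, 0.1, 0.9, 0.1, 0.9, 0.1, 0.9]  # made-up draws, not real RNG output

# Filter applied above the sample (the order written in the query).
sample_then_filter = [r for r in sample(rows, draws, 0.5) if is_even(r)]

# Filter pushed below the sample (what a pushdown rule would do).
filter_then_sample = sample([r for r in rows if is_even(r)], draws, 0.5)

print(sample_then_filter)  # []
print(filter_then_sample)  # [2, 6]
```

Because the two orders are not interchangeable in this model, an optimizer that wants to preserve results would have to leave the filter above the sample, which is the kind of barrier the question above is asking about.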

YANG-DB · 2024-10-23T00:21:01Z

Thinking long term, how about just renaming TABLESAMPLE to SAMPLE? It seems sampling could be applied not only to a table but to any plan node. That also makes me wonder: why not make it a new command? For example, source = t status=200 | sample 50 percent | ... has very different semantics from source = t sample(50 percent) status=200 | ...

I agree about the renaming, but I think it should stay attached to the index, since that makes it clear that sampling is the first action and has no prior operation.
In general, a pipe operation implies that there is no specific restriction on where it appears, which is not the case for the sample operation. It needs to be very clear (IMO) to anyone reading the query that the sampling operates directly on the index/table.

YANG-DB · 2024-11-08T17:22:37Z

Closing this, as this PR gives a better, more precise answer to the issue at hand.

YANG-DB added 2 commits October 18, 2024 09:56

add tablesample antlr command

b554949

Signed-off-by: YANGDB <[email protected]>

add sample to reduce amount of scanned data points and allow approxim…

64f9379

…ation of a statement when faster sample based results if favour of exact long running results Signed-off-by: YANGDB <[email protected]>

YANG-DB requested review from dai-chen, rupal-bq, vamsimanohar, penghuo, seankao-az, anirudha, kaituo and LantaoJin as code owners October 21, 2024 03:38

YANG-DB marked this pull request as draft October 21, 2024 03:38

update scala fmt

2e58102

Signed-off-by: YANGDB <[email protected]>

YANG-DB added 3 commits October 21, 2024 09:36

update tests for new use cases

a6b2e74

Signed-off-by: YANGDB <[email protected]>

Merge branch 'main' into ppl-tablesample-feature

9cf7fee

update documentation with tablesample(50 percent) option

076ae34

Signed-off-by: YANGDB <[email protected]>

YANG-DB added Lang:PPL Pipe Processing Language support 0.6 labels Oct 21, 2024

YANG-DB self-assigned this Oct 21, 2024

YANG-DB added 3 commits October 21, 2024 12:16

Merge branch 'main' into ppl-tablesample-feature

9e36fcb

update scala fmt

1326858

Signed-off-by: YANGDB <[email protected]>

add tests with inner table tablesample(? percent)

26a6599

Signed-off-by: YANGDB <[email protected]>

YANG-DB marked this pull request as ready for review October 21, 2024 22:11

head Vs TABLESAMPLE documentation

0b45e43

Signed-off-by: YANGDB <[email protected]>

LantaoJin reviewed Oct 21, 2024

fix a test

d2e3bd6

Signed-off-by: YANGDB <[email protected]>

Merge branch 'main' into ppl-tablesample-feature

a0e4b5b

# Conflicts: # docs/ppl-lang/README.md

YANG-DB added 4 commits October 22, 2024 20:13

update with comments feedback

28b3273

Signed-off-by: YANGDB <[email protected]>

update scala fmt format

5a1f357

Signed-off-by: YANGDB <[email protected]>

Merge branch 'main' into ppl-tablesample-feature

29cfba5

Merge branch 'main' into ppl-tablesample-feature

91057b3

YANG-DB marked this pull request as draft October 24, 2024 04:16

YANG-DB added 5 commits October 28, 2024 12:38

Merge branch 'main' into ppl-tablesample-feature

84e610c

Merge remote-tracking branch 'origin/ppl-tablesample-feature' into pp…

aae972e

…l-tablesample-feature

update sample command

95d96f0

Signed-off-by: YANGDB <[email protected]>

Merge branch 'main' into ppl-tablesample-feature

6ebefdc

# Conflicts: # ppl-spark-integration/src/test/scala/org/opensearch/flint/spark/ppl/PPLLogicalPlanAggregationQueriesTranslatorTestSuite.scala

Merge branch 'main' into ppl-tablesample-feature

6c3cb11

# Conflicts: # docs/ppl-lang/PPL-Example-Commands.md # docs/ppl-lang/ppl-between.md # docs/ppl-lang/ppl-fillnull-command.md

YANG-DB closed this Nov 8, 2024
