PPL tablesample command for faster approximation statements #796

Closed

YANG-DB
Member

@YANG-DB YANG-DB commented Oct 21, 2024

Description

Add a new sample command (tablesample) that reduces the number of scanned data points and allows approximating a statement when faster, sample-based results are preferred over exact, long-running results

source = testTable TABLESAMPLE(50 percent) | rare address

Issues Resolved

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

…ation of a statement when faster sample based results if favour of exact long running results

Signed-off-by: YANGDB <[email protected]>
Signed-off-by: YANGDB <[email protected]>
@YANG-DB
Member Author

YANG-DB commented Oct 21, 2024

@LantaoJin @penghuo I'd like your comments on this direction for approximation-based queries...

@YANG-DB YANG-DB added Lang:PPL Pipe Processing Language support 0.6 labels Oct 21, 2024
@YANG-DB YANG-DB self-assigned this Oct 21, 2024
@YANG-DB YANG-DB marked this pull request as ready for review October 21, 2024 22:11
@YANG-DB
Member Author

YANG-DB commented Oct 21, 2024

The following example demonstrates how to sample 75% of the data from the table and then perform an aggregation (finding the top 3 countries by occupation).

PPL query:

os> source = account TABLESAMPLE(75 percent) | top 3 country by occupation

This query samples 75% of the records from the account table, then retrieves the top 3 countries grouped by occupation. The equivalent SQL is:

SELECT *
FROM (
    SELECT country, occupation, COUNT(country) AS count_country
    FROM account TABLESAMPLE(75 PERCENT)
    GROUP BY country, occupation
    ORDER BY COUNT(country) DESC NULLS LAST
    LIMIT 3
) AS subquery
LIMIT 3;

Logical Plan Equivalent:

'Project [*]
+- 'GlobalLimit 3
   +- 'LocalLimit 3
      +- 'Sort ['COUNT('country) AS count_country#68 DESC NULLS LAST], true
         +- 'Aggregate ['country, 'occupation AS occupation#67], ['COUNT('country) AS count_country#66, 'country, 'occupation AS occupation#67]
            +- 'Sample 0.0, 0.75, false, 0
               +- 'UnresolvedRelation [account], [], false

By introducing the TABLESAMPLE instruction into the source command, you can now sample data as part of your queries, reducing the amount of data scanned and thereby trading precision for performance.

The percent parameter controls that trade-off between accuracy and performance: the result is an approximation of the true value, computed from the sampled fraction of the data.
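
To make the accuracy/performance trade-off concrete, here is a minimal sketch in plain Python (not PPL or Spark). It assumes row-level Bernoulli sampling and scales the sampled counts by 1/fraction to estimate the true counts. The data, the `bernoulli_sample` and `approx_top` helpers, and the scaling step are illustrative assumptions, not part of this PR, and the sketch ignores the `by occupation` grouping for brevity.

```python
import random
from collections import Counter

def bernoulli_sample(rows, fraction, seed=0):
    """Keep each row independently with probability `fraction` (hypothetical helper)."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < fraction]

def approx_top(rows, key, k, fraction, seed=0):
    """Estimate the top-k values of `key` from a sample, scaling counts by 1/fraction."""
    sample = bernoulli_sample(rows, fraction, seed)
    counts = Counter(row[key] for row in sample)
    return [(value, round(count / fraction)) for value, count in counts.most_common(k)]

# Hypothetical data: 10,000 accounts with a skewed country distribution.
data_rng = random.Random(42)
countries = ["US", "DE", "IN", "BR"]
accounts = [
    {"country": data_rng.choices(countries, weights=[50, 30, 15, 5])[0]}
    for _ in range(10_000)
]

exact = Counter(row["country"] for row in accounts).most_common(3)
approx = approx_top(accounts, "country", k=3, fraction=0.75)
print("exact :", exact)
print("approx:", approx)
```

With a 75% sample, only about three quarters of the rows feed the aggregation, and the scaled counts should land close to the exact ones. A smaller fraction scans less data but widens the error.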

@LantaoJin
Member

High-level question: will the query below work as expected?

os> source = account1 TABLESAMPLE(75 percent), account2 TABLESAMPLE(10 percent) | top 3 country by occupation

@LantaoJin
Member

Thinking long term, how about just renaming TABLESAMPLE to SAMPLE? It seems sample could be applied not only to a table but also to any plan node.
That also makes me ask: why not make it a new command? For example, source = t status=200 | sample 50 percent | ...
has very different semantics from source = t sample(50 percent) status=200 | ...
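
The difference between the two orderings can be sketched with a toy model in plain Python. This is only an illustration under assumed Bernoulli sampling; the `run` and `bernoulli` helpers and the table contents are hypothetical and say nothing about how the PPL translator or Spark would actually execute either query.

```python
import random

def bernoulli(rows, fraction, rng):
    """Keep each row independently with probability `fraction`."""
    return [row for row in rows if rng.random() < fraction]

def run(rows, fraction, sample_first, seed=0):
    """Run one ordering and report (result rows, number of predicate evaluations)."""
    rng = random.Random(seed)
    evaluations = 0

    def status_is_200(row):
        nonlocal evaluations
        evaluations += 1
        return row["status"] == 200

    if sample_first:
        # Models: source = t sample(50 percent) status=200
        result = [row for row in bernoulli(rows, fraction, rng) if status_is_200(row)]
    else:
        # Models: source = t status=200 | sample 50 percent
        result = bernoulli([row for row in rows if status_is_200(row)], fraction, rng)
    return result, evaluations

# Hypothetical table: 10,000 rows, a quarter of them with status 200.
table = [{"id": i, "status": 200 if i % 4 == 0 else 500} for i in range(10_000)]

sampled_first, evals_sample_first = run(table, 0.5, sample_first=True)
filtered_first, evals_filter_first = run(table, 0.5, sample_first=False)
print(len(sampled_first), evals_sample_first)
print(len(filtered_first), evals_filter_first)
```

In this toy model both orderings return a result of similar size, but sampling first evaluates the predicate on only about half of the rows, while filtering first evaluates it on every row before any data is discarded.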

@@ -160,11 +160,18 @@ public LogicalPlan visitRelation(Relation node, CatalystPlanContext context) {
true,
DescribeRelation$.MODULE$.getOutputAttrs()));
}
//populate table sampling
context.withSampling(node.getTablesampleContext());
Member

should this line move after L169?

Member Author

good point - I'll refactor this part

@LantaoJin
Member

Is Sample a non-deterministic operator? It may prevent DSL pushdown in SQL. I'm not sure how it works with Spark pushdown optimizations such as FilterPushdown or AggregatePushdown. So it might be a performance barrier that users don't expect. We need to confirm this and, if so, highlight it in the docs.
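
As a rough illustration of why reordering around a sampling step is delicate, the following plain-Python toy model uses a seeded sequential random generator. It is an assumption-laden sketch, not a description of Spark's optimizer rules: `seeded_sample` is a hypothetical helper, and whether Spark treats `Sample` as a pushdown barrier still needs to be confirmed separately.

```python
import random

def seeded_sample(rows, fraction, seed):
    """Keep each row with probability `fraction`, drawing from a seeded generator."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < fraction]

rows = list(range(1000))

def is_even(row):
    return row % 2 == 0

# Same input and same seed: the sample is reproducible.
same_a = seeded_sample(rows, 0.5, seed=0)
same_b = seeded_sample(rows, 0.5, seed=0)

# Same seed, but the filter is moved below the sample: the random draws now
# line up with different rows, so a different set of rows is selected.
sample_then_filter = [row for row in seeded_sample(rows, 0.5, seed=0) if is_even(row)]
filter_then_sample = seeded_sample([row for row in rows if is_even(row)], 0.5, seed=0)

print(same_a == same_b)
print(sample_then_filter == filter_then_sample)
```

In this model, a fixed seed makes a single plan repeatable, yet moving a filter across the sample changes which rows come out. A rewrite that is harmless for deterministic operators is therefore not result-preserving here.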

Signed-off-by: YANGDB <[email protected]>
# Conflicts:
#	docs/ppl-lang/README.md
@YANG-DB
Member Author

YANG-DB commented Oct 23, 2024

Thinking long term, how about just renaming TABLESAMPLE to SAMPLE? It seems sample could be applied not only to a table but also to any plan node. That also makes me ask: why not make it a new command? For example, source = t status=200 | sample 50 percent | ... has very different semantics from source = t sample(50 percent) status=200 | ...

I agree about the renaming, but I think it should be attached to the index, since it's clear that sampling is the first action and has no prior operation.
In general, a pipe operation implies no specific restriction, which is not the case for the sample operation. It needs to be very clear (IMO) to anyone reading the query that the sampling operates directly on the index/table.

@YANG-DB YANG-DB marked this pull request as draft October 24, 2024 04:16
# Conflicts:
#	ppl-spark-integration/src/test/scala/org/opensearch/flint/spark/ppl/PPLLogicalPlanAggregationQueriesTranslatorTestSuite.scala
# Conflicts:
#	docs/ppl-lang/PPL-Example-Commands.md
#	docs/ppl-lang/ppl-between.md
#	docs/ppl-lang/ppl-fillnull-command.md
@YANG-DB
Member Author

YANG-DB commented Nov 8, 2024

Closing this, as this PR gives a better, more precise answer to the issue at hand.

@YANG-DB YANG-DB closed this Nov 8, 2024