PPL tablesample command for faster approximation statements #796
Signed-off-by: YANGDB <[email protected]>
…ation of a statement when faster sample based results if favour of exact long running results Signed-off-by: YANGDB <[email protected]>
@LantaoJin @penghuo I'd like your comments on this direction for approximation-based queries...
The following example demonstrates how to sample 50% of the data from the table and then perform aggregation (finding rare occurrences of address). PPL query:
This query samples 75% of the records from the account table, then retrieves the top 3 countries grouped by occupation:

    SELECT *
    FROM (
        SELECT country, occupation, COUNT(country) AS count_country
        FROM account
        TABLESAMPLE(75 PERCENT)
        GROUP BY country, occupation
        ORDER BY COUNT(country) DESC NULLS LAST
        LIMIT 3
    ) AS subquery
    LIMIT 3;

Logical Plan Equivalent:

    'Project [*]
    +- 'GlobalLimit 3
       +- 'LocalLimit 3
          +- 'Sort ['COUNT('country) AS count_country#68 DESC NULLS LAST], true
             +- 'Aggregate ['country, 'occupation AS occupation#67], ['COUNT('country) AS count_country#66, 'country, 'occupation AS occupation#67]
                +- 'Sample 0.0, 0.75, false, 0
                   +- 'UnresolvedRelation [account], [], false
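To make the `Sample 0.0, 0.75, false, 0` node in the plan above more concrete, here is a minimal sketch of what sampling without replacement over a bound range amounts to: each row is kept independently when a uniform draw falls inside [lower bound, upper bound). This is an illustration in plain Java, not the PR's or Spark's implementation; the class and method names (`TableSampleSketch`, `bernoulliSample`, `countByKey`, `estimateFullCount`) are hypothetical. The scaling helper is also an assumption added for illustration: the query above returns counts over the sampled rows only and does not scale them back up.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Illustrative only: mimics the effect of a Sample node followed by a
// grouped COUNT. None of these names come from the PR.
final class TableSampleSketch {

    // Keep each row independently when a uniform draw in [0, 1) lands in
    // [lowerBound, upperBound); no row can be picked twice, so this is
    // sampling without replacement.
    static <T> List<T> bernoulliSample(List<T> rows, double lowerBound,
                                       double upperBound, long seed) {
        Random rng = new Random(seed);
        List<T> kept = new ArrayList<>();
        for (T row : rows) {
            double draw = rng.nextDouble();
            if (draw >= lowerBound && draw < upperBound) {
                kept.add(row);
            }
        }
        return kept;
    }

    // Count occurrences per key over whatever rows survived sampling.
    static Map<String, Long> countByKey(List<String> keys) {
        Map<String, Long> counts = new HashMap<>();
        for (String key : keys) {
            counts.merge(key, 1L, Long::sum);
        }
        return counts;
    }

    // Assumption for illustration: scale a sampled count back up to an
    // estimate for the full table by dividing by the sampling fraction.
    static double estimateFullCount(long sampledCount, double fraction) {
        return sampledCount / fraction;
    }
}
```

With a fixed seed the same rows are kept on every run, which is convenient for tests; the size of the sample still varies around 75% of the input rather than being exactly 75%, so any count derived from it is an approximation.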
By introducing the The |
high level question, will below query work as expected?
For long-term thinking, how above just rename |
@@ -160,11 +160,18 @@ public LogicalPlan visitRelation(Relation node, CatalystPlanContext context) {
                    true,
                    DescribeRelation$.MODULE$.getOutputAttrs()));
        }
        //populate table sampling
        context.withSampling(node.getTablesampleContext());
should this line move after L169?
good point - I'll refactor this part
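For readers following the thread, the pattern under discussion (the relation visitor records the sampling request on the context, and the relation is then wrapped in a sample node) can be sketched roughly as below. Only the names `visitRelation`, `withSampling` and `getTablesampleContext` appear in the hunk; everything else here (`SamplingContextSketch`, `Plan`, `RelationPlan`, `SamplePlan`, `Context`, `applySampling`) is a hypothetical stand-in, and how the real `CatalystPlanContext` consumes the sampling value is an assumption, not something the PR text confirms.

```java
import java.util.OptionalDouble;

// Hypothetical stand-ins for the PR's types; this is a sketch of the idea,
// not the actual CatalystPlanContext or visitor code.
final class SamplingContextSketch {

    // Minimal plan nodes, loosely mirroring the logical plan shown earlier.
    interface Plan {
        String describe();
    }

    static final class RelationPlan implements Plan {
        private final String table;

        RelationPlan(String table) {
            this.table = table;
        }

        public String describe() {
            return "Relation[" + table + "]";
        }
    }

    static final class SamplePlan implements Plan {
        private final double fraction;
        private final Plan child;

        SamplePlan(double fraction, Plan child) {
            this.fraction = fraction;
            this.child = child;
        }

        public String describe() {
            return "Sample(" + fraction + ") <- " + child.describe();
        }
    }

    static final class Context {
        private OptionalDouble samplePercent = OptionalDouble.empty();

        // Record the requested sampling percentage, if any.
        void withSampling(OptionalDouble percent) {
            this.samplePercent = percent;
        }

        // Wrap the relation in a Sample node only when sampling was requested.
        Plan applySampling(Plan relation) {
            if (samplePercent.isPresent()) {
                return new SamplePlan(samplePercent.getAsDouble() / 100.0, relation);
            }
            return relation;
        }
    }

    // The sampling request must be on the context before the relation is wrapped.
    static Plan visitRelation(String table, OptionalDouble tablesamplePercent, Context context) {
        context.withSampling(tablesamplePercent);
        return context.applySampling(new RelationPlan(table));
    }
}
```

In this sketch the ordering matters only in that `withSampling` runs before `applySampling`; where exactly the call belongs in the real visitor is what the review comment above is asking about.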
Is |
# Conflicts: # docs/ppl-lang/README.md
I agree about the renaming, but I think it should be attached to the index, since it's clear that sampling is the first action and has no prior operation.
…l-tablesample-feature
# Conflicts: # ppl-spark-integration/src/test/scala/org/opensearch/flint/spark/ppl/PPLLogicalPlanAggregationQueriesTranslatorTestSuite.scala
# Conflicts: # docs/ppl-lang/PPL-Example-Commands.md # docs/ppl-lang/ppl-between.md # docs/ppl-lang/ppl-fillnull-command.md
Closing this, as this PR gives a better, more precise answer to the issue at hand.
Description
Add a new sample command (tablesample) to reduce the amount of scanned data points and allow approximation of a statement when faster, sample-based results are preferred over exact, long-running results.
Issues Resolved
by a high cardinality field will cause writing job fail with "size exceed limitation" #740
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.