Add sample parameter to top & rare command #879

Closed

@YANG-DB YANG-DB commented Nov 7, 2024

Description

Add a new sample parameter (sample) to the top and rare commands. It reduces the number of scanned data points and lets a top or rare statement return an approximate result, for cases where faster sample-based results are preferred over exact but long-running ones.

source = testTable  | rare address sample(50 percent)
source = testTable  | top 5 address by country sample(25 percent)

Issues Resolved

Check List

  • Updated documentation (docs/ppl-lang/README.md)
  • Implemented unit tests
  • Implemented tests for combination with other commands
  • New added source code should include a copyright header
  • Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@YANG-DB YANG-DB changed the title from "Add sample top rare command" to "Add sample parameter to top & rare command" Nov 7, 2024
@YANG-DB YANG-DB marked this pull request as draft November 7, 2024 19:35
@YANG-DB YANG-DB added the Lang:PPL (Pipe Processing Language support) and 0.6 labels Nov 7, 2024
@YANG-DB YANG-DB marked this pull request as ready for review November 7, 2024 20:24

LantaoJin commented Nov 8, 2024

One high-level question:
How do we determine the relationship between the sampling percentage and precision? In other words, how much precision is lost when the sampling rate drops from 100% to 80%, or from 80% to 50%?

I'm also wondering what kind of scenario needs to run top on sampled data.
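
One way to explore this question empirically is to compare an exact top-k against a top-k computed on a random sample. The following is a minimal standalone sketch, not code from this PR: the class name, the synthetic data, and the use of independent per-row (Bernoulli) sampling are all assumptions made for illustration, and the PR's actual sampling strategy may differ.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Hypothetical illustration; none of these names come from the PR.
class SampleTopSketch {

    // Bernoulli sampling: keep each row independently with probability percent/100.
    static List<String> sample(List<String> rows, double percent, long seed) {
        Random rng = new Random(seed);
        List<String> kept = new ArrayList<>();
        for (String row : rows) {
            if (rng.nextDouble() * 100.0 < percent) {
                kept.add(row);
            }
        }
        return kept;
    }

    // Return the k most frequent values; ties are broken alphabetically.
    static List<String> topK(List<String> rows, int k) {
        Map<String, Long> counts = new HashMap<>();
        for (String row : rows) {
            counts.merge(row, 1L, Long::sum);
        }
        List<Map.Entry<String, Long>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Long.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        List<String> top = new ArrayList<>();
        for (int i = 0; i < Math.min(k, entries.size()); i++) {
            top.add(entries.get(i).getKey());
        }
        return top;
    }

    public static void main(String[] args) {
        // Synthetic, skewed data: addr0 is the most frequent, addr19 the least.
        List<String> rows = new ArrayList<>();
        for (int i = 0; i < 20; i++) {
            for (int n = 0; n < (20 - i) * 50; n++) {
                rows.add("addr" + i);
            }
        }
        System.out.println("exact top 5:      " + topK(rows, 5));
        System.out.println("top 5 at 50% rate: " + topK(sample(rows, 50.0, 42L), 5));
        System.out.println("top 5 at 5% rate:  " + topK(sample(rows, 5.0, 42L), 5));
    }
}
```

Running this over several seeds and sampling rates gives a rough picture of how often the sampled ranking agrees with the exact one for a given data distribution; it does not yield a general precision guarantee.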

@@ -79,6 +80,7 @@ DESC: 'DESC';
DATASOURCES: 'DATASOURCES';
USING: 'USING';
WITH: 'WITH';
PERCENT: 'PERCENT';

SAMPLE and PERCENT should be added to keywordsCanBeId

Comment on lines +146 to +149
public LogicalPlan withSampleRelation(Sample sampleRelation) {
    this.relations.add(sampleRelation.child());
    return with(sampleRelation);
}

@LantaoJin LantaoJin Nov 8, 2024

I really don't like this implementation. As I mentioned in another PR, sample should IMO be a common plan node rather than being bound to a relation. I am not sure how the current behaviour works when a plan contains joins or correlated subqueries. For example, given the query

source=tableA | join ON tableA.id = tableB.id tableB | top 1 tableA.id sample(50 per)

is it equivalent to

source=tableA sample(50 per) | join ON tableA.id = tableB.id [ source=tableB sample(50 per) ] | top 1 tableA.id

or

source=tableA | join ON tableA.id = tableB.id tableB | sample 50 per | top 1 tableA.id
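
The two readings are not interchangeable. Assuming independent per-row sampling at rate p on each side of a one-to-one join, a matched pair survives the first reading with probability p squared, whereas a joined row survives the second reading with probability p. At 50 percent that is roughly 25% versus 50% of the joined rows. The sketch below illustrates this with plain Java collections; the class, the seeds, and the sampling model are assumptions for illustration and say nothing about what the PR's plan actually does.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Random;
import java.util.Set;

// Hypothetical illustration; none of these names come from the PR.
class SamplePlacementSketch {

    // Bernoulli sampling: keep each id independently with probability percent/100.
    static List<Integer> sample(List<Integer> ids, double percent, long seed) {
        Random rng = new Random(seed);
        List<Integer> kept = new ArrayList<>();
        for (Integer id : ids) {
            if (rng.nextDouble() * 100.0 < percent) {
                kept.add(id);
            }
        }
        return kept;
    }

    // Inner join on id; assumes ids are unique on both sides.
    static List<Integer> join(List<Integer> left, List<Integer> right) {
        Set<Integer> rightIds = new HashSet<>(right);
        List<Integer> joined = new ArrayList<>();
        for (Integer id : left) {
            if (rightIds.contains(id)) {
                joined.add(id);
            }
        }
        return joined;
    }

    // Reading 1: sample each relation independently, then join.
    static int sampleThenJoin(List<Integer> a, List<Integer> b, double percent) {
        return join(sample(a, percent, 1L), sample(b, percent, 2L)).size();
    }

    // Reading 2: join first, then sample the joined rows.
    static int joinThenSample(List<Integer> a, List<Integer> b, double percent) {
        return sample(join(a, b), percent, 3L).size();
    }

    public static void main(String[] args) {
        List<Integer> ids = new ArrayList<>();
        for (int i = 0; i < 100_000; i++) {
            ids.add(i);
        }
        System.out.println("sample then join: " + sampleThenJoin(ids, ids, 50.0));
        System.out.println("join then sample: " + joinThenSample(ids, ids, 50.0));
    }
}
```

Because the surviving row sets differ in size and composition, a top or rare computed on them can differ as well, which is why the placement of the sample node in the plan matters.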

@@ -84,6 +84,52 @@ class FlintSparkPPLTopAndRareITSuite
comparePlans(expectedPlan, logicalPlan, checkAnalysis = false)
}

test("create ppl rare address field query test sample 75 %") {

I strongly suggest adding some IT cases for top/rare with sample in complex join and subquery queries.

YANG-DB commented Nov 12, 2024

Closing, since this has not yet been shown to have a significant use case.

@YANG-DB YANG-DB closed this Nov 12, 2024