Add sample parameter to top & rare command #879
…ad of the entire table to reduce time Signed-off-by: YANGDB <[email protected]>
Signed-off-by: YANGDB <[email protected]>
sample parameter to top & rare command
Signed-off-by: YANGDB <[email protected]>
One high level question: I'm wondering what kind of scenario needs to run
@@ -79,6 +80,7 @@ DESC: 'DESC';
DATASOURCES: 'DATASOURCES';
USING: 'USING';
WITH: 'WITH';
PERCENT: 'PERCENT';
SAMPLE and PERCENT should be added to keywordsCanBeId
public LogicalPlan withSampleRelation(Sample sampleRelation) {
    this.relations.add(sampleRelation.child());
    return with(sampleRelation);
}
I really don't like this implementation. As I mentioned in another PR, sample IMO should be a common plan node instead of being bound to relation. I am not sure how the current behaviour works when a plan contains joins or correlated subqueries. For example, is the query
source=tableA | join ON tableA.id = tableB.id tableB | top 1 tableA.id sample(50 per)
equivalent to
source=tableA sample(50 per) | join ON tableA.id = tableB.id [ source=tableB sample(50 per) ] | top 1 tableA.id
or to
source=tableA | join ON tableA.id = tableB.id tableB | sample 50 per | top 1 tableA.id
?
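To make the difference between those two readings concrete, here is a minimal Java sketch (not code from this PR). It compares the expected number of join rows that survive under each placement of the sample. It assumes independent Bernoulli sampling with the same fraction on each side and a one-to-one join; the class and method names are hypothetical and chosen only for illustration.

```java
public class SamplePlacement {

    // Reading 1: each relation is sampled independently before the join.
    // A matched pair survives only if both of its rows are kept,
    // so the expected survivors scale with fraction * fraction.
    public static double sampleBothSidesThenJoin(long matchedPairs, double fraction) {
        return matchedPairs * fraction * fraction;
    }

    // Reading 2: the join output is sampled once.
    // Each joined row is kept with probability `fraction`.
    public static double joinThenSample(long matchedPairs, double fraction) {
        return matchedPairs * fraction;
    }

    public static void main(String[] args) {
        long matchedPairs = 1000;   // hypothetical number of rows produced by the unsampled join
        double fraction = 0.5;      // corresponds to sample(50 per)

        System.out.println(sampleBothSidesThenJoin(matchedPairs, fraction)); // prints 250.0
        System.out.println(joinThenSample(matchedPairs, fraction));          // prints 500.0
    }
}
```

Under these assumptions the two plans feed top with differently sized (and differently distributed) inputs, so they are not interchangeable and the intended semantics should be stated explicitly.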
@@ -84,6 +84,52 @@ class FlintSparkPPLTopAndRareITSuite
    comparePlans(expectedPlan, logicalPlan, checkAnalysis = false)
  }

  test("create ppl rare address field query test sample 75 %") {
I strongly suggest adding some IT cases for top/rare with sample in complex join and subquery queries.
Closing since this has not yet been shown to have a significant use case.
Description
Add a new sample command (sample) to reduce the amount of scanned data points and allow approximation of a top or rare statement when faster, sample-based results are favoured over exact, long-running results.
Issues Resolved
by a high cardinality field will cause writing job fail with "size exceed limitation" #740
Check List
--signoff
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.