-
Notifications
You must be signed in to change notification settings - Fork 33
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Ppl count approximate support #884
Ppl count approximate support #884
Conversation
- distinct count - top - rare Signed-off-by: YANGDB <[email protected]>
Signed-off-by: YANGDB <[email protected]>
Signed-off-by: YANGDB <[email protected]>
…plify node inner child access visibility Signed-off-by: YANGDB <[email protected]>
Signed-off-by: YANGDB <[email protected]>
docs/ppl-lang/ppl-rare-command.md
Outdated
* field-list: mandatory. comma-delimited list of field names. | ||
* by-clause: optional. one or more fields to group the results by. | ||
* top_approx: approximate the count by using estimated [cardinality by HyperLogLog++ algorithm](https://spark.apache.org/docs/3.5.2/sql-ref-functions-builtin.html). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should be rare_approx
here?
docs/ppl-lang/ppl-top-command.md
Outdated
@@ -19,6 +20,7 @@ The example finds most common gender of all the accounts. | |||
PPL query: | |||
|
|||
os> source=accounts | top gender; | |||
os> source=accounts_approx | top gender; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do we add this line?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typing error - thanks !
docs/ppl-lang/ppl-top-command.md
Outdated
@@ -33,7 +35,7 @@ The example finds most common gender of all the accounts. | |||
|
|||
PPL query: | |||
|
|||
os> source=accounts | top 1 gender; | |||
os> source=accounts_approx | top 1 gender; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto
TOP: 'TOP'; | ||
RARE_APPROX: 'RARE_APPROX'; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor: Indent problem
@@ -400,7 +402,7 @@ statsAggTerm | |||
statsFunction | |||
: statsFunctionName LT_PRTHS valueExpression RT_PRTHS # statsFunctionCall | |||
| COUNT LT_PRTHS RT_PRTHS # countAllFunctionCall | |||
| (DISTINCT_COUNT | DC) LT_PRTHS valueExpression RT_PRTHS # distinctCountFunctionCall | |||
| (DISTINCT_COUNT | DC | DISTINCT_COUNT_APPROX) LT_PRTHS valueExpression RT_PRTHS # distinctCountFunctionCall |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
DISTINCT_COUNT_APPROX
should be added to keywordsCanBeId
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes - we should add this to the github issue check - list ;-)
@@ -185,6 +185,7 @@ public enum BuiltinFunctionName { | |||
NESTED(FunctionName.of("nested")), | |||
PERCENTILE(FunctionName.of("percentile")), | |||
PERCENTILE_APPROX(FunctionName.of("percentile_approx")), | |||
APPROX_COUNT_DISTINCT(FunctionName.of("approx_count_distinct")), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have used DISTINCT_COUNT_APPROX
in lexer, why rename it to APPROX_COUNT_DISTINCT
here? I think we should keep the name DISTINCT_COUNT_APPROX
in code too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Keep DISTINCT_COUNT_APPROX
here and add a mapping item in BuiltinFunctionTransformer.SPARK_BUILTIN_FUNCTION_NAME_MAPPING
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, I see https://github.com/opensearch-project/opensearch-spark/pull/884/files#diff-66576caee2b95944ec89dac941862c61a3e415f999a44ddcf03684f3d7c61115R60, might not need to add mapping item above, but no harmful.
/** | ||
* update context using the given action and node | ||
*/ | ||
public CatalystPlanContext update(UnaryOperator<CatalystPlanContext> action) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As a fundamental API, can you give us more examples/explanations when to use this method?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking of generalize any access to the context under a functional call - look like its too much probably...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All my comments are minor change requests beside about keywordsCanBeId
. LGTM basically.
- DISTINCT_COUNT_APPROX should be added to keywordsCanBeId Signed-off-by: YANGDB <[email protected]>
* add functional approximation support for: - distinct count - top - rare Signed-off-by: YANGDB <[email protected]> * update license and scalafmt Signed-off-by: YANGDB <[email protected]> * update additional tests using APPROX_COUNT_DISTINCT Signed-off-by: YANGDB <[email protected]> * add visitFirstChild(node, context) method for the PlanVisitor for simplify node inner child access visibility Signed-off-by: YANGDB <[email protected]> * update inline documentation Signed-off-by: YANGDB <[email protected]> * update according to PR comments - DISTINCT_COUNT_APPROX should be added to keywordsCanBeId Signed-off-by: YANGDB <[email protected]> --------- Signed-off-by: YANGDB <[email protected]>
Description
support approximation operations for
Related Issues
#882
related context
Check List
--signoff
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.