Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ppl count approximate support #884

Merged

Conversation

YANG-DB
Copy link
Member

@YANG-DB YANG-DB commented Nov 9, 2024

Description

support approximation operations for

  • count distinct
  • top
  • rare

Related Issues

#882

related context

Check List

  • Updated documentation (docs/ppl-lang/README.md)
  • Implemented unit tests
  • Implemented tests for combination with other commands
  • New added source code should include a copyright header
  • Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

- distinct count
- top
- rare

Signed-off-by: YANGDB <[email protected]>
@YANG-DB YANG-DB added Lang:PPL Pipe Processing Language support 0.7 labels Nov 9, 2024
@YANG-DB YANG-DB marked this pull request as draft November 9, 2024 03:54
@YANG-DB YANG-DB marked this pull request as ready for review November 10, 2024 00:31
* field-list: mandatory. comma-delimited list of field names.
* by-clause: optional. one or more fields to group the results by.
* top_approx: approximate the count by using estimated [cardinality by HyperLogLog++ algorithm](https://spark.apache.org/docs/3.5.2/sql-ref-functions-builtin.html).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should be rare_approx here?

@@ -19,6 +20,7 @@ The example finds most common gender of all the accounts.
PPL query:

os> source=accounts | top gender;
os> source=accounts_approx | top gender;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we add this line?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typing error - thanks !

@@ -33,7 +35,7 @@ The example finds most common gender of all the accounts.

PPL query:

os> source=accounts | top 1 gender;
os> source=accounts_approx | top 1 gender;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

TOP: 'TOP';
RARE_APPROX: 'RARE_APPROX';
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor: Indent problem

@@ -400,7 +402,7 @@ statsAggTerm
statsFunction
: statsFunctionName LT_PRTHS valueExpression RT_PRTHS # statsFunctionCall
| COUNT LT_PRTHS RT_PRTHS # countAllFunctionCall
| (DISTINCT_COUNT | DC) LT_PRTHS valueExpression RT_PRTHS # distinctCountFunctionCall
| (DISTINCT_COUNT | DC | DISTINCT_COUNT_APPROX) LT_PRTHS valueExpression RT_PRTHS # distinctCountFunctionCall
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DISTINCT_COUNT_APPROX should be added to keywordsCanBeId

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes - we should add this to the github issue check - list ;-)

@@ -185,6 +185,7 @@ public enum BuiltinFunctionName {
NESTED(FunctionName.of("nested")),
PERCENTILE(FunctionName.of("percentile")),
PERCENTILE_APPROX(FunctionName.of("percentile_approx")),
APPROX_COUNT_DISTINCT(FunctionName.of("approx_count_distinct")),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have used DISTINCT_COUNT_APPROX in lexer, why rename it to APPROX_COUNT_DISTINCT here? I think we should keep the name DISTINCT_COUNT_APPROX in code too.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Keep DISTINCT_COUNT_APPROX here and add a mapping item in BuiltinFunctionTransformer.SPARK_BUILTIN_FUNCTION_NAME_MAPPING

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/**
* update context using the given action and node
*/
public CatalystPlanContext update(UnaryOperator<CatalystPlanContext> action) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As a fundamental API, can you give us more examples/explanations when to use this method?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking of generalize any access to the context under a functional call - look like its too much probably...

Copy link
Member

@LantaoJin LantaoJin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All my comments are minor change requests beside about keywordsCanBeId. LGTM basically.

- DISTINCT_COUNT_APPROX should be added to keywordsCanBeId

Signed-off-by: YANGDB <[email protected]>
@YANG-DB YANG-DB merged commit b53a699 into opensearch-project:main Nov 11, 2024
4 checks passed
kenrickyap pushed a commit to Bit-Quill/opensearch-spark that referenced this pull request Dec 11, 2024
* add functional approximation support for:
- distinct count
- top
- rare

Signed-off-by: YANGDB <[email protected]>

* update license and scalafmt

Signed-off-by: YANGDB <[email protected]>

* update additional tests using APPROX_COUNT_DISTINCT

Signed-off-by: YANGDB <[email protected]>

* add visitFirstChild(node, context) method for the PlanVisitor for simplify node inner child access visibility

Signed-off-by: YANGDB <[email protected]>

* update inline documentation

Signed-off-by: YANGDB <[email protected]>

* update according to PR comments
- DISTINCT_COUNT_APPROX should be added to keywordsCanBeId

Signed-off-by: YANGDB <[email protected]>

---------

Signed-off-by: YANGDB <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
0.7 Lang:PPL Pipe Processing Language support
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants