Ppl count approximate support #884

YANG-DB · 2024-11-09T03:54:35Z

Description

Support approximation operations for:

- count distinct
- top
- rare

Related Issues

#882

Related context

https://spark.apache.org/docs/3.5.2/sql-ref-functions-builtin.html

Check List

Updated documentation (docs/ppl-lang/README.md)
Implemented unit tests
Implemented tests for combination with other commands
New added source code should include a copyright header
Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

LantaoJin · 2024-11-11T02:26:13Z

docs/ppl-lang/ppl-rare-command.md

 * field-list: mandatory. comma-delimited list of field names.
 * by-clause: optional. one or more fields to group the results by.
+* top_approx: approximate the count by using estimated [cardinality by HyperLogLog++ algorithm](https://spark.apache.org/docs/3.5.2/sql-ref-functions-builtin.html).


Should this be rare_approx here?

LantaoJin · 2024-11-11T02:27:18Z

docs/ppl-lang/ppl-top-command.md

@@ -19,6 +20,7 @@ The example finds most common gender of all the accounts.
 PPL query:

    os> source=accounts | top gender;
+    os> source=accounts_approx | top gender;


Why do we add this line?

Typing error, thanks!

LantaoJin · 2024-11-11T02:27:35Z

docs/ppl-lang/ppl-top-command.md

@@ -33,7 +35,7 @@ The example finds most common gender of all the accounts.

 PPL query:

-    os> source=accounts | top 1 gender;
+    os> source=accounts_approx | top 1 gender;


LantaoJin · 2024-11-11T02:30:16Z

ppl-spark-integration/src/main/antlr4/OpenSearchPPLLexer.g4

 TOP:                                'TOP';
+RARE_APPROX:                         'RARE_APPROX';


Minor: indentation problem.

LantaoJin · 2024-11-11T02:32:04Z

ppl-spark-integration/src/main/antlr4/OpenSearchPPLParser.g4

@@ -400,7 +402,7 @@ statsAggTerm
 statsFunction
   : statsFunctionName LT_PRTHS valueExpression RT_PRTHS                                                                            # statsFunctionCall
   | COUNT LT_PRTHS RT_PRTHS                                                                                                        # countAllFunctionCall
-   | (DISTINCT_COUNT | DC) LT_PRTHS valueExpression RT_PRTHS                                                                        # distinctCountFunctionCall
+   | (DISTINCT_COUNT | DC | DISTINCT_COUNT_APPROX) LT_PRTHS valueExpression RT_PRTHS                                                                        # distinctCountFunctionCall


DISTINCT_COUNT_APPROX should be added to keywordsCanBeId.

Yes, we should add this to the GitHub issue checklist ;-)

LantaoJin · 2024-11-11T02:39:45Z

...rk-integration/src/main/java/org/opensearch/sql/expression/function/BuiltinFunctionName.java

@@ -185,6 +185,7 @@ public enum BuiltinFunctionName {
  NESTED(FunctionName.of("nested")),
  PERCENTILE(FunctionName.of("percentile")),
  PERCENTILE_APPROX(FunctionName.of("percentile_approx")),
+  APPROX_COUNT_DISTINCT(FunctionName.of("approx_count_distinct")),


We have used DISTINCT_COUNT_APPROX in the lexer, so why rename it to APPROX_COUNT_DISTINCT here? I think we should keep the name DISTINCT_COUNT_APPROX in the code too.

Keep DISTINCT_COUNT_APPROX here and add a mapping entry in BuiltinFunctionTransformer.SPARK_BUILTIN_FUNCTION_NAME_MAPPING.

Oh, I see https://github.com/opensearch-project/opensearch-spark/pull/884/files#diff-66576caee2b95944ec89dac941862c61a3e415f999a44ddcf03684f3d7c61115R60. The mapping entry above might not be needed, but adding it would do no harm.
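
The naming suggestion in this thread can be illustrated with a minimal Java sketch: keep the PPL-facing name in the enum and translate it to the Spark built-in name in one lookup table. The type names PplFunctionName and SparkNameMapping are hypothetical stand-ins, not the project's actual BuiltinFunctionName or BuiltinFunctionTransformer; the sketch relies only on Spark's built-ins being named approx_count_distinct and percentile_approx.

```java
import java.util.Map;

// Hypothetical, simplified stand-in for the PPL function-name enum.
enum PplFunctionName {
    PERCENTILE_APPROX("percentile_approx"),
    DISTINCT_COUNT_APPROX("distinct_count_approx");

    private final String pplName;

    PplFunctionName(String pplName) {
        this.pplName = pplName;
    }

    String pplName() {
        return pplName;
    }
}

// Hypothetical mapping holder: only functions whose Spark name differs
// from the PPL name need an entry.
final class SparkNameMapping {
    static final Map<PplFunctionName, String> PPL_TO_SPARK = Map.of(
        PplFunctionName.DISTINCT_COUNT_APPROX, "approx_count_distinct");

    // Falls back to the PPL name when no explicit mapping exists.
    static String toSparkName(PplFunctionName fn) {
        return PPL_TO_SPARK.getOrDefault(fn, fn.pplName());
    }

    public static void main(String[] args) {
        System.out.println(toSparkName(PplFunctionName.DISTINCT_COUNT_APPROX));
        System.out.println(toSparkName(PplFunctionName.PERCENTILE_APPROX));
    }
}
```

With this shape the lexer token, the enum constant, and the documentation all use the same PPL name, and the Spark-specific spelling appears in exactly one place.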

LantaoJin · 2024-11-11T02:49:33Z

ppl-spark-integration/src/main/java/org/opensearch/sql/ppl/CatalystPlanContext.java

+    /**
+     * update context using the given action and node 
+     */
+    public CatalystPlanContext update(UnaryOperator<CatalystPlanContext> action) {


As a fundamental API, can you give us more examples/explanations of when to use this method?

I was thinking of generalizing all access to the context behind a functional call. It looks like that is probably too much...
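
As one possible answer to the request for examples, here is a minimal sketch of how a method with this signature could be used. PlanContextSketch is a hypothetical, heavily simplified stand-in for CatalystPlanContext; its fields and helper methods are invented for illustration, and the body of update is an assumption, since the diff above shows only the signature.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

// Hypothetical stand-in for CatalystPlanContext; not the project's class.
final class PlanContextSketch {
    private final List<String> fields = new ArrayList<>();
    private boolean approximate;

    PlanContextSketch addField(String name) {
        fields.add(name);
        return this;
    }

    PlanContextSketch approximate(boolean value) {
        this.approximate = value;
        return this;
    }

    List<String> fields() {
        return List.copyOf(fields);
    }

    boolean isApproximate() {
        return approximate;
    }

    // Same shape as the signature in the diff: apply the caller's action
    // to this context and return whatever context the action yields.
    PlanContextSketch update(UnaryOperator<PlanContextSketch> action) {
        return action.apply(this);
    }

    public static void main(String[] args) {
        PlanContextSketch ctx = new PlanContextSketch()
            .update(c -> c.addField("gender"))
            .update(c -> c.approximate(true));
        System.out.println(ctx.fields() + " approximate=" + ctx.isApproximate());
    }
}
```

The appeal of such a method is that visitor code can chain context changes as lambdas; the cost, as noted in the reply above, is an extra layer of indirection over plain method calls.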

LantaoJin

All my comments are minor change requests, apart from the one about keywordsCanBeId. Basically LGTM.

* add functional approximation support for:
  - distinct count
  - top
  - rare
  Signed-off-by: YANGDB <[email protected]>
* update license and scalafmt
  Signed-off-by: YANGDB <[email protected]>
* update additional tests using APPROX_COUNT_DISTINCT
  Signed-off-by: YANGDB <[email protected]>
* add visitFirstChild(node, context) method for the PlanVisitor for simplify node inner child access visibility
  Signed-off-by: YANGDB <[email protected]>
* update inline documentation
  Signed-off-by: YANGDB <[email protected]>
* update according to PR comments
  - DISTINCT_COUNT_APPROX should be added to keywordsCanBeId
  Signed-off-by: YANGDB <[email protected]>

---------

Signed-off-by: YANGDB <[email protected]>

YANG-DB added 2 commits November 8, 2024 19:51

add functional approximation support for:

fd45d52

- distinct count - top - rare Signed-off-by: YANGDB <[email protected]>

update license and scalafmt

61c6bb8

Signed-off-by: YANGDB <[email protected]>

YANG-DB requested review from dai-chen, mengweieric, vmmusings, penghuo, seankao-az, anirudha, kaituo, noCharger, LantaoJin and ykmr1224 as code owners November 9, 2024 03:54

YANG-DB added Lang:PPL Pipe Processing Language support 0.7 labels Nov 9, 2024

YANG-DB marked this pull request as draft November 9, 2024 03:54

YANG-DB added 3 commits November 9, 2024 14:51

update additional tests using APPROX_COUNT_DISTINCT

724cbe9

Signed-off-by: YANGDB <[email protected]>

add visitFirstChild(node, context) method for the PlanVisitor for sim…

424fad4

…plify node inner child access visibility Signed-off-by: YANGDB <[email protected]>

update inline documentation

8dac8fe

Signed-off-by: YANGDB <[email protected]>

YANG-DB marked this pull request as ready for review November 10, 2024 00:31

LantaoJin reviewed Nov 11, 2024

LantaoJin approved these changes Nov 11, 2024

YANG-DB added 2 commits November 11, 2024 12:58

Merge branch 'main' into ppl-count-approximate-support

b7f0855

update according to PR comments

0ae73e4

- DISTINCT_COUNT_APPROX should be added to keywordsCanBeId Signed-off-by: YANGDB <[email protected]>

YANG-DB merged commit b53a699 into opensearch-project:main Nov 11, 2024
4 checks passed
