Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PPL fieldsummary command #766

Merged
merged 18 commits into from
Oct 25, 2024

Conversation

YANG-DB
Copy link
Member

@YANG-DB YANG-DB commented Oct 11, 2024

Description

This PR implements the fieldsummary PPL command.

Issues Resolved

#662

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

 - antlr syntax
 - ast expression builder
 - ast node builder
 - catalyst ast builder

Signed-off-by: YANGDB <[email protected]>
 - antlr syntax
 - ast expression builder
 - ast node builder
 - catalyst ast builder

Signed-off-by: YANGDB <[email protected]>
fix scala style format

Signed-off-by: YANGDB <[email protected]>
# Conflicts:
#	ppl-spark-integration/src/main/java/org/opensearch/sql/ast/AbstractNodeVisitor.java
@YANG-DB YANG-DB marked this pull request as draft October 11, 2024 00:37
@YANG-DB YANG-DB added Lang:PPL Pipe Processing Language support 0.6 labels Oct 11, 2024
@YANG-DB YANG-DB marked this pull request as ready for review October 17, 2024 22:15
Signed-off-by: YANGDB <[email protected]>
@LantaoJin
Copy link
Member

Would you mind change the PR status to DRAFT if you are going to refactor?

@YANG-DB
Copy link
Member Author

YANG-DB commented Oct 18, 2024

@LantaoJin plz review
I've removed the topValues in favour of performance - and added the next issue for enhancement of the top, rare commands

@ToString
@RequiredArgsConstructor
@EqualsAndHashCode(callSuper = false)
public class NamedExpression extends UnresolvedExpression {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you remove this definition or find an alternative, for example reusing Argument? Because

  1. NamedExpression should have a name
  2. NamedExpression generally should be an abstract class and the parent of Attribute or Alias. We should refactor many codes if we really need it.
  3. It's confused me with Spark NamedExpression, it is not worth to introduce a new expression for fieldsummary command IMO. At least, not NamedExpression

@@ -39,7 +42,8 @@
*/
public interface DataTypeTransformer {
static <T> Seq<T> seq(T... elements) {
return seq(List.of(elements));
return seq(Arrays.stream(elements).filter(Objects::nonNull)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you comment for this changes? What case did you see?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes its not relevant any more - thanks for pointing it out

SUM(FunctionName.of("sum")),
COUNT(FunctionName.of("count")),
COUNT_DISTINCT(FunctionName.of("count_distinct")),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We do have a DISTINCT_COUNT. (did we miss it here?) And seems you are not use it as a built-in function name in following codes, instead, count_distinct is an alias name.

Comment on lines +91 to +93
FIELDSUMMARY: 'FIELDSUMMARY';
INCLUDEFIELDS: 'INCLUDEFIELDS';
NULLS: 'NULLS';
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should add these keywords to keywordsCanBeId too.


os> source = t | fieldsummary includefields= id, status_code, request_path nulls=true
+------------------+-------------+------------+------------+------------+------------+------------+------------+----------------|
| Fiels | COUNT | COUNT_DISTINCT | MIN | MAX | AVG | MEAN | STDDEV | NUlls | TYPEOF |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto


os> source = t | where status_code != 200 | fieldsummary includefields= status_code nulls=true
+------------------+-------------+------------+------------+------------+------------+------------+------------+----------------|
| Fiels | COUNT | COUNT_DISTINCT | MIN | MAX | AVG | MEAN | STDDEV | NUlls | TYPEOF |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fiels -> Fields?

@YANG-DB YANG-DB requested a review from LantaoJin October 24, 2024 00:58
@YANG-DB YANG-DB enabled auto-merge (squash) October 24, 2024 05:52
@seankao-az seankao-az disabled auto-merge October 25, 2024 17:25
@seankao-az seankao-az merged commit 6513ead into opensearch-project:main Oct 25, 2024
4 checks passed
kenrickyap pushed a commit to Bit-Quill/opensearch-spark that referenced this pull request Dec 11, 2024
* add support for FieldSummary
 - antlr syntax
 - ast expression builder
 - ast node builder
 - catalyst ast builder

Signed-off-by: YANGDB <[email protected]>

* add support for FieldSummary
 - antlr syntax
 - ast expression builder
 - ast node builder
 - catalyst ast builder

Signed-off-by: YANGDB <[email protected]>

* update sample query
fix scala style format

Signed-off-by: YANGDB <[email protected]>

* support spark prior to 3.5 with its extended table identifier (existing table identifier only has 2 parts)

Signed-off-by: YANGDB <[email protected]>

* update union queries based summary

Signed-off-by: YANGDB <[email protected]>

* update scala fmt style

Signed-off-by: YANGDB <[email protected]>

* update scala fmt style

Signed-off-by: YANGDB <[email protected]>

* update query with where clause predicate

Signed-off-by: YANGDB <[email protected]>

* update command and remove the topvalues

Signed-off-by: YANGDB <[email protected]>

* update command docs

Signed-off-by: YANGDB <[email protected]>

* update with comments feedback

Signed-off-by: YANGDB <[email protected]>

* update `FIELD SUMMARY` symbols to the keywordsCanBeId bag of words

Signed-off-by: YANGDB <[email protected]>

---------

Signed-off-by: YANGDB <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
0.6 Lang:PPL Pipe Processing Language support
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants