Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add FILLNULL command in PPL (#3032) #3075

Merged
merged 11 commits into from
Dec 5, 2024

Conversation

normanj-bitquill
Copy link
Contributor

@normanj-bitquill normanj-bitquill commented Oct 16, 2024

Description

Adds the FILLNULL command for PPL. FILLNULL will replace NULL values in specified fields.

Related Issues

Resolves #3032
Based on this PR for Spark: opensearch-project/opensearch-spark#723

Check List

  • [Y] New functionality includes testing.
  • [Y] New functionality has been documented.
  • [Y] New functionality has javadoc added.
  • [Y] New functionality has a user manual doc added.
  • [N/A] API changes companion pull request created.
  • [Y] Commits are signed per the DCO using --signoff.
  • [Y] Public documentation issue/PR created.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

An example query using fillnull.

    os> source=accounts | fields email, employer | fillnull with '<not found>' in email ;
    fetched rows / total rows = 4/4
    +-----------------------+----------+
    | email                 | employer |
    |-----------------------+----------|
    | [email protected]  | Pyrami   |
    | [email protected] | Netagy   |
    | <not found>           | Quility  |
    | [email protected]   | null     |
    +-----------------------+----------+

Copy link
Collaborator

@MaxKsyunz MaxKsyunz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the PR! Could you add some integration tests? They live here.

YANG-DB
YANG-DB previously approved these changes Oct 18, 2024
Copy link
Contributor

@jduo jduo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should add an ExplainTest to validate that you get an Eval physical plan when using FILLNULL.

@normanj-bitquill
Copy link
Contributor Author

@MaxKsyunz I have added some integration tests.

@normanj-bitquill
Copy link
Contributor Author

@jduo I have added an explain test.

@normanj-bitquill normanj-bitquill force-pushed the opensearch-3032 branch 2 times, most recently from 240d0d6 to 7d17615 Compare October 30, 2024 22:52
@normanj-bitquill
Copy link
Contributor Author

@normanj-bitquill please lets align the docs with the existing ppl fillnull doc in spark

@YANG-DB The docs have been updated to match the docs for Spark. The sample queries are still different, since they are run in this repo.

YANG-DB
YANG-DB previously approved these changes Nov 28, 2024
Copy link
Member

@YANG-DB YANG-DB left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@normanj-bitquill plz fix the keywordsCanBeId list and we can merge

@normanj-bitquill
Copy link
Contributor Author

@normanj-bitquill plz fix the keywordsCanBeId list and we can merge

@YANG-DB I have added it.

YANG-DB
YANG-DB previously approved these changes Dec 3, 2024
@acarbonetto
Copy link
Collaborator

acarbonetto commented Dec 3, 2024

Can we do this?

  • Public documentation issue/PR created.

: USING nullableField EQUAL nullReplacement (COMMA nullableField EQUAL nullReplacement)*
;

nullableField
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure the purpose for nullableField and nullReplacement, except for explicit trees. But 🤷 .

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

then we could use things like fieldList instead of nullableField (COMMA nullableField)*
maybe something like:

fillNullWithTheSameValue
   : WITH nullReplacement = valueExpression IN nullableFieldList = fieldList
   ;

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and

fillNullWithFieldVariousValues
   : USING nullReplacementExpression (COMMA nullReplacementExpression)*
   ;

nullReplacementExpression
   : nullableField = fieldExpression EQUAL nullReplacement = valueExpression

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason we're using valueExpression instead of expression?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We want an expression that produces a single value. expression could produce a column such as x + 5.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@acarbonetto For you first comment here about using fieldList, it looks like it makes sense. I took the syntax directly from the parser file in opensearch-spark. If I make the change you are suggesting here, then the two files will get further out of sync.

There are a few ways forward:

  • Take this syntax as is and clean it up when the parser files are unified
  • Fix the syntax here and deal with the mismatch later on when unifying the parser files
  • Fix the syntax here and create a PR for opensearch-spark with a similar change

Do you have any preference?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True. @YANG-DB do you have a preference? Should we raise an issue and tag the issue to be fixed in opensearch-project/piped-processing-language#23?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@normanj-bitquill @acarbonetto
I would prefer going with the 3rd option (Fix the syntax here and create a PR for opensearch-spark with a similar change)
thanks

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated here and created a PR for the Spark project.

opensearch-project/opensearch-spark#968

@Test
public void testFillNullVariousValues() {
assertEquals(
"source=t | fillnull using f1 = 0, f2 = -1",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sounds this anonymize the expression?
e.g.

        "source=t | fillnull using f1 = ***, f2 = ***",

reference: testAndExpression

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure what you mean here. Using a value of *** would likely change the data type in this case since it looks like f1 and f2 are int values.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The anonymizer should be removing all user-defined values and anonymizing the logger output.
You can look at other examples in this test file for anonymized output.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is getting way off track. Only null values are replaced. This feature is not at all suited for anonymizing data (all non null values would be unchanged). If you want to anonymize data, you could use eval to create a new field and then fields to include the new field and ignore the old field.

YANG-DB
YANG-DB previously approved these changes Dec 3, 2024
normanj-bitquill and others added 11 commits December 3, 2024 15:52
* Add FILLNULL command in PPL

Signed-off-by: Norman Jordan <[email protected]>
Signed-off-by: Norman Jordan <[email protected]>
…nymizer.java


Formatting fix

Co-authored-by: Andrew Carbonetto <[email protected]>
Signed-off-by: normanj-bitquill <[email protected]>
…nymizerTest.java


Fix formatting

Co-authored-by: Andrew Carbonetto <[email protected]>
Signed-off-by: normanj-bitquill <[email protected]>
….java


Fix formatting

Co-authored-by: Andrew Carbonetto <[email protected]>
Signed-off-by: normanj-bitquill <[email protected]>
…Test.java


Fix formatting

Co-authored-by: Andrew Carbonetto <[email protected]>
Signed-off-by: normanj-bitquill <[email protected]>
@YANG-DB
Copy link
Member

YANG-DB commented Dec 4, 2024

@normanj-bitquill LGTM

@acarbonetto acarbonetto merged commit b6846ce into opensearch-project:main Dec 5, 2024
13 of 15 checks passed
@acarbonetto acarbonetto deleted the opensearch-3032 branch December 5, 2024 18:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEATURE]Add fillnull command to PPL
7 participants