PPL command implementation for `appendCol` #990

andy-k-improving · 2024-12-11T02:25:28Z

Description

Introduce the new PPL command appendCol, which aims to aggregate results from multiple searches into a single comprehensive table for the user to view.

This is accomplished by reading both the main search and the sub-search as plan nodes, then transforming them into the equivalent of the following SQL: a _row_number_ column is added to each side to capture the dataset's natural order, and the main search and sub-search are then joined on that _row_number_ column.

SELECT t1.*, t2.*
FROM (
    SELECT *, row_number() OVER (ORDER BY '1') AS row_org
    FROM employees) AS t1
LEFT JOIN (
    SELECT *, row_number() OVER (ORDER BY '1') AS row_app
    FROM employees) AS t2
ON t1.row_org = t2.row_app;
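
The row-number join above can be illustrated on plain in-memory rows. The following is only a minimal sketch, not the code in this PR: the class `RowNumberJoinSketch` and its helpers `row` and `appendCol` are hypothetical names, and the list index stands in for the generated row number. Name clashes between main and sub-search columns are out of scope here.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class RowNumberJoinSketch {

    /** Hypothetical helper: builds an ordered row from alternating key/value arguments. */
    static Map<String, Object> row(Object... keyValues) {
        Map<String, Object> r = new LinkedHashMap<>();
        for (int i = 0; i < keyValues.length; i += 2) {
            r.put((String) keyValues[i], keyValues[i + 1]);
        }
        return r;
    }

    /**
     * Left-joins two ordered row lists on their position, which plays the
     * role of "ON t1.row_org = t2.row_app" in the SQL above.
     */
    static List<Map<String, Object>> appendCol(List<Map<String, Object>> main,
                                               List<Map<String, Object>> sub,
                                               List<String> subColumns) {
        List<Map<String, Object>> result = new ArrayList<>();
        for (int i = 0; i < main.size(); i++) {
            // Copy the main row first so its columns come first in the output.
            Map<String, Object> joined = new LinkedHashMap<>(main.get(i));
            for (String col : subColumns) {
                // Positions past the end of the sub-search are padded with null,
                // as a LEFT JOIN would do for unmatched rows.
                joined.put(col, i < sub.size() ? sub.get(i).get(col) : null);
            }
            result.add(joined);
        }
        return result;
    }

    public static void main(String[] args) {
        List<Map<String, Object>> main = List.of(
                row("name", "Lisa"), row("name", "Evan"), row("name", "Fred"));
        // A global aggregation yields a single row.
        List<Map<String, Object>> sub = List.of(row("event_count", 3));
        for (Map<String, Object> r : appendCol(main, sub, List.of("event_count"))) {
            System.out.println(r);
        }
    }
}
```

Because the sub-search here produces one row, only the first main row receives a value and the rest are padded with null, which matches the shape of the test-plan output in this description.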

Related Issues

Resolves: #956

Check List

Updated documentation (docs/ppl-lang/README.md)
Implemented unit tests
Implemented tests for combination with other commands
New added source code should include a copyright header
Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Test plan:

# Produce the artifact
sbt clean sparkPPLCosmetic/publishM2

# Start Spark with the plugin
bin/spark-sql --jars "/ABSOLUTE_PATH_TO_ARTIFACT/opensearch-spark-ppl_2.12-0.6.0-SNAPSHOT.jar" \
--conf "spark.sql.extensions=org.opensearch.flint.spark.FlintPPLSparkExtensions"  \
--conf "spark.sql.catalog.dev=org.apache.spark.opensearch.catalog.OpenSearchCatalog" \
--conf "spark.hadoop.hive.cli.print.header=true"

# Insert test table and data
CREATE TABLE employees (name STRING, dept STRING, salary INT, age INT, con STRING);

INSERT INTO employees VALUES ("Lisa", "Sales------", 10000, 35, 'test');
INSERT INTO employees VALUES ("Evan", "Sales------", 32000, 38, 'test');
INSERT INTO employees VALUES ("Fred", "Engineering", 21000, 28, 'test');
INSERT INTO employees VALUES ("Alex", "Sales", 30000, 33, 'test');
INSERT INTO employees VALUES ("Tom", "Engineering", 23000, 33, 'test');
INSERT INTO employees VALUES ("Jane", "Marketing", 29000, 28, 'test');
INSERT INTO employees VALUES ("Jeff", "Marketing", 35000, 38, 'test');
INSERT INTO employees VALUES ("Paul", "Engineering", 29000, 23, 'test');
INSERT INTO employees VALUES ("Chloe", "Engineering", 23000, 25, 'test');

# Append one sub-search:

source=employees | FIELDS name, dept, salary | APPENDCOL  [ stats count() as event_count];

name	dept	salary	event_count
Lisa	Sales------	10000	9
Fred	Engineering	21000	NULL
Paul	Engineering	29000	NULL
Evan	Sales------	32000	NULL
Chloe	Engineering	23000	NULL
Tom	Engineering	23000	NULL
Alex	Sales	30000	NULL
Jane	Marketing	29000	NULL
Jeff	Marketing	35000	NULL


# Append multiple sub-searches:

source=employees | FIELDS name, dept, salary | APPENDCOL  [ stats count() as event_count] | APPENDCOL [stats avg(age) as avg_age];

name	dept	salary	event_count	avg_age
Lisa	Sales------	10000	9	31.22222222222222
Fred	Engineering	21000	NULL	NULL
Paul	Engineering	29000	NULL	NULL
Evan	Sales------	32000	NULL	NULL
Chloe	Engineering	23000	NULL	NULL
Tom	Engineering	23000	NULL	NULL
Alex	Sales	30000	NULL	NULL
Jane	Marketing	29000	NULL	NULL
Jeff	Marketing	35000	NULL	NULL



# With override option (the `salary` column from the main search is dropped and replaced by the `salary` column from the sub-search)

source=employees | FIELDS name, dept, salary | APPENDCOL OVERRIDE=true [stats avg(salary) as salary];

name	dept	salary
Lisa	Sales------	25777.777777777777
Fred	Engineering	NULL
Paul	Engineering	NULL
Evan	Sales------	NULL
Chloe	Engineering	NULL
Tom	Engineering	NULL
Alex	Sales	NULL
Jane	Marketing	NULL
Jeff	Marketing	NULL

Signed-off-by: Andy Kwok <[email protected]>

docs/ppl-lang/ppl-appendcol-command.md

ppl-spark-integration/src/main/java/org/opensearch/sql/ppl/utils/AppendColCatalystUtils.java

ppl-spark-integration/src/main/java/org/opensearch/sql/ppl/CatalystQueryPlanVisitor.java

Signed-off-by: Andy Kwok <[email protected]>

LantaoJin · 2024-12-18T02:55:39Z

Two high level questions:

appendCol command syntax is
APPENDCOL <override=?> [sub-search]...

And the sub-search syntax is

opensearch-spark/ppl-spark-integration/src/main/antlr4/OpenSearchPPLParser.g4

Lines 27 to 29 in 957de4e

    
subSearch
    : searchCommand (PIPE commands)*
    ;

Seems this PR doesn't follow the sub-search syntax.
I prefer to follow the current sub-search syntax, so that we could combine columns from different tables. For example:
source=employees | FIELDS name, dept, salary | APPENDCOL [ search source = company | stats count() as event_count ]
But if this is intentional (appendcol only works for one table), I am okay with the current syntax.

Why does the query source=employees | FIELDS name, dept, salary | APPENDCOL [ stats count() as event_count]
return

name	dept	salary	event_count
Lisa	Sales------	10000	9
Fred	Engineering	21000	NULL
Paul	Engineering	29000	NULL
Evan	Sales------	32000	NULL
Chloe	Engineering	23000	NULL
Tom	Engineering	23000	NULL
Alex	Sales	30000	NULL
Jane	Marketing	29000	NULL
Jeff	Marketing	35000	NULL

instead of

name	dept	salary	event_count
Lisa	Sales------	10000	9
Fred	Engineering	21000	9
Paul	Engineering	29000	9
Evan	Sales------	32000	9
Chloe	Engineering	23000	9
Tom	Engineering	23000	9
Alex	Sales	30000	9
Jane	Marketing	29000	9
Jeff	Marketing	35000	9

PS, what is the expected result of query source=employees | stats sum(salary) as total_salary by dept | appendcol [ stats avg(age) as avg_age by dept ]?

LantaoJin · 2024-12-18T03:26:22Z

ppl-spark-integration/src/main/java/org/opensearch/sql/ppl/CatalystQueryPlanVisitor.java

+            LogicalPlan joinedQuery = join(
+                    mainSearchWithRowNumber, subSearchWithRowNumber,
+                    Join.JoinType.LEFT,
+                    Optional.of(new EqualTo(t1Attr, t2Attr)),
+                    new Join.JoinHint());


Now I know why you got NULL in the example.
I don't think this is what we expect. How about using inner and cross joins together?
For example:

With the same group-by key, convert it to the join key of an inner join:

source=employees | stats sum(salary) as total_salary by dept | appendcol [ stats avg(age) as avg_age by dept ]

Without a group-by key, use a cross join:

source=employees | stats sum(salary) as total_salary by dept | appendcol [ stats avg(age) as avg_age ]

The implementation is equivalent to

def appendCols(mainSearchDF: DataFrame, subSearchDF: DataFrame, joinKey: Option[String] = None): DataFrame = {
  joinKey match {
    case Some(key) =>
      // If a join key is provided, perform a join
      mainSearchDF.join(subSearchDF, Seq(key), "inner")
    case None =>
      // If no join key is provided, assume a global aggregation and use a cross join
      mainSearchDF.crossJoin(subSearchDF)
  }
}
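
To make the difference between the two suggested strategies concrete, here is a rough in-memory sketch in plain Java. It is only an illustration: `JoinStrategySketch`, `row`, and `appendCols` are hypothetical names, rows are modeled as maps rather than Spark DataFrames, and null keys are assumed never to match, as in SQL.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

public class JoinStrategySketch {

    /** Hypothetical helper: builds an ordered row from alternating key/value arguments. */
    static Map<String, Object> row(Object... keyValues) {
        Map<String, Object> r = new LinkedHashMap<>();
        for (int i = 0; i < keyValues.length; i += 2) {
            r.put((String) keyValues[i], keyValues[i + 1]);
        }
        return r;
    }

    /**
     * With a join key: inner join on that key.
     * Without a join key: cross join (every main row paired with every sub row).
     */
    static List<Map<String, Object>> appendCols(List<Map<String, Object>> main,
                                                List<Map<String, Object>> sub,
                                                Optional<String> joinKey) {
        List<Map<String, Object>> out = new ArrayList<>();
        for (Map<String, Object> m : main) {
            for (Map<String, Object> s : sub) {
                // No key means every pair matches; with a key, both sides must
                // hold the same non-null value.
                boolean matches = joinKey
                        .map(k -> m.get(k) != null && m.get(k).equals(s.get(k)))
                        .orElse(true);
                if (matches) {
                    Map<String, Object> joined = new LinkedHashMap<>(m);
                    joined.putAll(s);
                    out.add(joined);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Map<String, Object>> main = List.of(
                row("dept", "Sales", "total_salary", 72000),
                row("dept", "Engineering", "total_salary", 96000));

        // Global aggregation: one row, repeated on every main row by the cross join.
        List<Map<String, Object>> global = List.of(row("cnt", 9));
        System.out.println(appendCols(main, global, Optional.empty()));

        // Grouped aggregation: matched to main rows through the shared group-by key.
        List<Map<String, Object>> grouped = List.of(
                row("dept", "Engineering", "cnt", 4),
                row("dept", "Sales", "cnt", 3));
        System.out.println(appendCols(main, grouped, Optional.of("dept")));
    }
}
```

Note that with an inner join a main row whose key has no counterpart in the sub-search is dropped from the output, whereas the left join on row numbers keeps it and pads with null.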

@LantaoJin thanks for the review and feedback!
AFAIK appendcol is intended for a single index.

@YANG-DB, @LantaoJin the comment seems to contradict what we discussed in #956 and the Splunk doc, so would it be possible to clarify?

Join condition

Referring to the prior discussion on the GitHub issue and the Splunk doc, I don't think that is the case: the user is not asked to enter either the join column or the join condition, so it is up to the code logic to generate the natural order for the DataFrame and use it for the join. Can I confirm this?

I'm asking because the above comment seems to indicate that the user will provide the name of a column for the join (either inner or cross), but that is not the case for the method signature or grammar of appendCol().

Expected result

According to the Splunk doc:

Appends the fields of the [subsearch](https://docs.splunk.com/Splexicon:Subsearch) results with the input search results. All fields of the subsearch are combined into the current results, with the exception of [internal fields](https://docs.splunk.com/Splexicon:Internalfield). For example, the first subsearch result is merged with the first main result, the second subsearch result is merged with the second main result, and so on.

The main and sub-search results are simply joined row by row from top to bottom, 1:1, without an explicit join condition. When the row counts of the two DataFrames differ, null is used to fill the gap. This seems to align with what is described in the GitHub issue.

From the GitHub issue:

Behavior The new column(s) would be aligned with the rows of the original dataset based on their order of appearance. Each appended column must produce the same number of rows as the base dataset to ensure proper alignment. Any discrepancies in row counts could result in null values for mismatched rows.

Hence I don't think either a cross join or an inner join will do the same, as rows will be truncated in both scenarios.
And I reckon this is what makes appendCol a standalone implementation; otherwise, couldn't the user simply achieve this with the existing subquery or lookup command?

Let me clarify it.

I'm asking because the above comment seems to indicate that the user will provide the name of a column for the join (either inner or cross), but that is not the case for the method signature or grammar of appendCol().

No, my suggestion doesn't ask the user to provide a column name for the join. My point is that the result of the current implementation seems incorrect IMO.
Query

source=employees | FIELDS name, dept, salary | APPENDCOL [ stats count() as event_count]

should return

name	dept	salary	event_count
Lisa	Sales------	10000	9
Fred	Engineering	21000	9
Paul	Engineering	29000	9
Evan	Sales------	32000	9
Chloe	Engineering	23000	9
Tom	Engineering	23000	9
Alex	Sales	30000	9
Jane	Marketing	29000	9
Jeff	Marketing	35000	9

rather than

name	dept	salary	event_count
Lisa	Sales------	10000	9
Fred	Engineering	21000	NULL
Paul	Engineering	29000	NULL
Evan	Sales------	32000	NULL
Chloe	Engineering	23000	NULL
Tom	Engineering	23000	NULL
Alex	Sales	30000	NULL
Jane	Marketing	29000	NULL
Jeff	Marketing	35000	NULL

Similar, here are two examples:
Q1: output rows are unmatched:

source=employees | stats sum(salary) as total_salary by dept | appendcol [ stats count() as cnt ]

Will return

dept	total_salary	cnt
Sales	72000	9
Engineering	96000	9
Marketing	64000	9

Rather than

dept	total_salary	cnt
Sales	72000	9
Engineering	96000	NULL
Marketing	64000	NULL

Q2: output rows are matched:

source=employees | stats sum(salary) as total_salary by dept | appendcol [ stats count() as cnt by cnt_dept ]

Will return

dept	total_salary	cnt_dept
Sales	72000	3
Engineering	96000	4
Marketing	64000	2

What are the outputs of the above queries in your implementation?

For Q1, my suggestion is to implement it with a crossJoin (no join key required).
For Q2, my suggestion is to implement it with an innerJoin, using the group-by keys as join keys.

@YANG-DB @LantaoJin Thanks for the clarification, that makes much more sense now!
If I follow your reasoning above, the code logic needs to somehow know how many rows will be returned from both the main search and the sub-search.

It needs that number to determine whether a cross join or an inner join should be applied. However, I don't think such information is available during the visitor steps, when composing Spark's logical plan.

Is there any way I could predict that number, or is there some reasonable assumption that can be made?

E.g., if both the main search and the sub-search have identical group-by fields, then the output rows should match.
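
One way such an assumption could be expressed without knowing row counts is a purely syntactic rule over the group-by fields. The sketch below is hypothetical and not part of this PR: the class `JoinStrategyHeuristic`, the `Strategy` enum, and the fallback to the row-number left join are all assumptions made for illustration.

```java
import java.util.List;

public class JoinStrategyHeuristic {

    /** Hypothetical strategies, named after the options discussed in this thread. */
    enum Strategy { INNER_JOIN_ON_GROUP_KEYS, CROSS_JOIN, ROW_NUMBER_LEFT_JOIN }

    /**
     * Picks a join strategy from information available at plan-building time.
     *
     * @param mainGroupBy group-by fields of the main search (empty if none)
     * @param subGroupBy  group-by fields of the sub-search (empty if none)
     * @param subHasStats whether the sub-search ends in a stats aggregation
     */
    static Strategy choose(List<String> mainGroupBy, List<String> subGroupBy, boolean subHasStats) {
        if (subHasStats && subGroupBy.isEmpty()) {
            // A global aggregation yields a single row, so repeat it on every main row.
            return Strategy.CROSS_JOIN;
        }
        if (!subGroupBy.isEmpty() && subGroupBy.equals(mainGroupBy)) {
            // Identical group-by fields: assume the rows line up by group key.
            return Strategy.INNER_JOIN_ON_GROUP_KEYS;
        }
        // Assumption: otherwise fall back to aligning rows by their natural order.
        return Strategy.ROW_NUMBER_LEFT_JOIN;
    }

    public static void main(String[] args) {
        // stats ... by dept | appendcol [ stats count() as cnt ]
        System.out.println(choose(List.of("dept"), List.of(), true));
        // stats ... by dept | appendcol [ stats avg(age) as avg_age by dept ]
        System.out.println(choose(List.of("dept"), List.of("dept"), true));
        // fields ... | appendcol [ fields age ]
        System.out.println(choose(List.of(), List.of(), false));
    }
}
```

The rule only inspects the query shape, so it can run inside a visitor, but it is a guess about row counts: filters after the aggregation on either side could still leave the two sides with different groups.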

@LantaoJin Would you mind having a look at the above regarding the join condition and advising?

Thanks :)

andy-k-improving · 2024-12-19T00:09:53Z

Two high level questions:

appendCol command syntax is
APPENDCOL <override=?> [sub-search]...

And the sub-search syntax is

opensearch-spark/ppl-spark-integration/src/main/antlr4/OpenSearchPPLParser.g4

Lines 27 to 29 in 957de4e

subSearch

: searchCommand (PIPE commands)*

;

Seems this PR doesn't follow the sub-search syntax.
I prefer to follow the current sub-search syntax, so that we could combine columns from different tables. For example:
source=employees | FIELDS name, dept, salary | APPENDCOL [ search source = company | stats count() as event_count ]
But if this is intentional (appendcol only works for one table), I am okay with the current syntax.
2. why the result of query source=employees | FIELDS name, dept, salary | APPENDCOL [ stats count() as event_count]
is
name	dept	salary	event_count
Lisa	Sales------	10000	9
Fred	Engineering	21000	NULL
Paul	Engineering	29000	NULL
Evan	Sales------	32000	NULL
Chloe	Engineering	23000	NULL
Tom	Engineering	23000	NULL
Alex	Sales	30000	NULL
Jane	Marketing	29000	NULL
Jeff	Marketing	35000	NULL 
instead of
name	dept	salary	event_count
Lisa	Sales------	10000	9
Fred	Engineering	21000	9
Paul	Engineering	29000	9
Evan	Sales------	32000	9
Chloe	Engineering	23000	9
Tom	Engineering	23000	9
Alex	Sales	30000	9
Jane	Marketing	29000	9
Jeff	Marketing	35000	9
PS, what is the expected result of query source=employees | stats sum(salary) as total_salary by dept | appendcol [ stats avg(age) as avg_age by dept ]?

Yep, the sub-search under appendcol is restricted to the same data source as the main PPL command. As for the expected result, we can probably discuss it in the other thread to keep the conversation in one place.

Signed-off-by: Andy Kwok <[email protected]>

docs/ppl-lang/README.md

docs/ppl-lang/ppl-appendcol-command.md

ppl-spark-integration/src/main/antlr4/OpenSearchPPLParser.g4

ppl-spark-integration/src/main/java/org/opensearch/sql/ast/tree/AppendCol.java

ppl-spark-integration/src/main/java/org/opensearch/sql/ppl/CatalystQueryPlanVisitor.java

Co-authored-by: Taylor Curran <[email protected]> Signed-off-by: Andy Kwok <[email protected]>

Signed-off-by: Andy Kwok <[email protected]>

andy-k-improving added 7 commits December 6, 2024 13:40

Update grammar def

6fe52e8

Signed-off-by: Andy Kwok <[email protected]>

Skeleton for Append fields

0b3b50c

Signed-off-by: Andy Kwok <[email protected]>

Visitor skeleton

50f4bd5

Signed-off-by: Andy Kwok <[email protected]>

Update import

c74ac1a

Signed-off-by: Andy Kwok <[email protected]>

Update import

f23c2db

Signed-off-by: Andy Kwok <[email protected]>

Update osrt

893146a

Signed-off-by: Andy Kwok <[email protected]>

Changes

a9f10f0

Signed-off-by: Andy Kwok <[email protected]>

andy-k-improving requested review from dai-chen, mengweieric, penghuo, seankao-az, anirudha, kaituo, YANG-DB, noCharger, LantaoJin and ykmr1224 as code owners December 11, 2024 02:25

andy-k-improving marked this pull request as draft December 11, 2024 02:26

andy-k-improving added 12 commits December 11, 2024 11:49

Consolidate String constant

ed71d58

Signed-off-by: Andy Kwok <[email protected]>

Update projection clause

3d7b8b1

Signed-off-by: Andy Kwok <[email protected]>

Remove dep on parent method

77fddf1

Signed-off-by: Andy Kwok <[email protected]>

Consolidate relation inject logic

0e7e65a

Signed-off-by: Andy Kwok <[email protected]>

Move constant

d2aa146

Signed-off-by: Andy Kwok <[email protected]>

Move out constant from lambda

477c4fc

Signed-off-by: Andy Kwok <[email protected]>

Consolidate method

662a57c

Signed-off-by: Andy Kwok <[email protected]>

Update logic

365cc12

Signed-off-by: Andy Kwok <[email protected]>

Test 1 2

351ea88

Signed-off-by: Andy Kwok <[email protected]>

Test-cases 3 and 4

16406a0

Signed-off-by: Andy Kwok <[email protected]>

Update code format

13f4cb9

Signed-off-by: Andy Kwok <[email protected]>

Update code style

d34abf1

Signed-off-by: Andy Kwok <[email protected]>

andy-k-improving changed the title ~~DRAFT: PPL command appendCol implementaion~~ PPL command implementation for appendCol Dec 16, 2024

andy-k-improving marked this pull request as ready for review December 16, 2024 22:45

Code style

2847e5a

Signed-off-by: Andy Kwok <[email protected]>