[FEATURE] Support aggregate functions in Eval expressions #755

LantaoJin · 2024-10-08T23:49:51Z

What is the bug?
Aggregate functions could work in select clause even there is no group by as long as all items in select are aggregate functions.
Here are some examples:

select max(a) from t
select max(a), min(a), count(a) from t

But the following queries should throw exceptions

select a, max(a) from t
select a, max(a), min(a), count(a) from t
select *, max(a), min(a), count(a) from t

They could work with a group by

select a, max(a) from t group by b
select a, max(a), min(a), count(a) from t group by b
select *, max(a), min(a), count(a) from t group by b

Similar, aggregate functions in PPL could work in eval command, because an eval expression equals to add a projection to existing project list.
Here are some examples:

source=t | eval m = max(a) | fields m
source=t | eval m = max(a), n = count(a) | fields m, n
source=t | eval m = max(a) | n = count(a) | fields m, n

But the following PPL queries should throw exceptions

source=t | eval m = max(a) -- equals to all fields plus m
source=t | eval m = max(a), n = count(a) | fields a, m, n
source=t | eval m = max(a) | n = count(a) | fields a, m, n

How can one reproduce the bug?
The query failed with syntax error:

source=t | eval m = max(a) | fields m

The text was updated successfully, but these errors were encountered:

LantaoJin · 2024-10-10T09:11:04Z

I found the current draft PR still has some problem. And more further, it is sorts of tricky to provide this enhancement.
As we know, eval command will add new columns to a plan by adding a Spark Project node.
More specific, one comma separated eval expression will add Project(*, 'a, 'b, 'c, child), multiple eval expressions will add

Project(*, 'a, child)
+- Project(*, 'b, child)
     +- Project(*, 'c, child)

This plan will be optimized to Project(*, 'a, 'b, 'c, child) in Spark optimizer.
Note the * we added in the first project list is to make sure the attribute reference could be resolved since the plan we build is unresolved logical plan. It will be analyzed later. For example, the attribute reference x cannot be resolved if we remove the * in the Project nodes added by eval commands.

Project('x, child)
+- Project('a, child)
    +- Project('b, child)
         +- Project('c, child)

Now, this feature is arm to support aggregate functions in eval expression. Give an example, the command:

| eval max_a = max(a) | eval min_a = min(a) | eval count_a = count(a) | ...

Will build a plan like this:

Project(*, 'max(a), child)
+- Project(*, 'min(a), child)
     +- Project(*, 'count(a), child)

Unfortunately, Spark analyzer will resolve this Project node as translating to Aggregate:

Aggregate([], [*, 'max(a)], child)
+- Aggregate([], [*, 'min(a)], child)
     +- Aggregate([], [*, 'count(a)], child)

Then the resolution will fail because aggregate expression [*, 'max(a)] contains a non-aggregate function and grouping expression [] is empty.

So the only successful case is putting one comma separated eval expression in the second to last position:

| eval max_a = max(a), min_a = min(a), count_a = count(a) | fields max_a, min_a, count_a

This design is tricky and anti-robust, and even it could be rewrite to

| stats max(a) as max_a, min(a) as min_a, count(a) as count_a

So I think we should not continue this feature since it should work by stats.
@YANG-DB @penghuo how do you think?

YANG-DB · 2024-10-11T01:08:38Z

@LantaoJin thanks for this deep analysis - I agree that if we can support the same functionality in stats we can ignore this issue for now - just please document this so that we have it in a common place and readers wont need to search in our issue to understand why ...

LantaoJin · 2024-10-11T01:46:41Z

@LantaoJin thanks for this deep analysis - I agree that if we can support the same functionality in stats we can ignore this issue for now - just please document this so that we have it in a common place and readers wont need to search in our issue to understand why ...

Sure, I will give a doc PR later.

LantaoJin added bug Something isn't working untriaged labels Oct 8, 2024

LantaoJin mentioned this issue Oct 9, 2024

Support aggregate functions in Eval expressions #757

Closed

3 tasks

YANG-DB added 0.5 Lang:PPL Pipe Processing Language support 0.6 and removed untriaged 0.5 labels Oct 9, 2024

LantaoJin changed the title ~~[BUG] Aggregate functions cannot be recognized in eval expression~~ [Feature] Aggregate functions cannot be recognized in eval expression Oct 10, 2024

LantaoJin changed the title ~~[Feature] Aggregate functions cannot be recognized in eval expression~~ [Feature] Support aggregate functions in Eval expressions Oct 10, 2024

LantaoJin changed the title ~~[Feature] Support aggregate functions in Eval expressions~~ [FEATURE] Support aggregate functions in Eval expressions Oct 10, 2024

YANG-DB closed this as completed Nov 30, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[FEATURE] Support aggregate functions in Eval expressions #755

[FEATURE] Support aggregate functions in Eval expressions #755

LantaoJin commented Oct 8, 2024 •

edited

Loading

LantaoJin commented Oct 10, 2024

YANG-DB commented Oct 11, 2024

LantaoJin commented Oct 11, 2024

[FEATURE] Support aggregate functions in Eval expressions #755

[FEATURE] Support aggregate functions in Eval expressions #755

Comments

LantaoJin commented Oct 8, 2024 • edited Loading

LantaoJin commented Oct 10, 2024

YANG-DB commented Oct 11, 2024

LantaoJin commented Oct 11, 2024

LantaoJin commented Oct 8, 2024 •

edited

Loading