Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEATURE] Support aggregate functions in Eval expressions #755

Open
LantaoJin opened this issue Oct 8, 2024 · 3 comments
Open

[FEATURE] Support aggregate functions in Eval expressions #755

LantaoJin opened this issue Oct 8, 2024 · 3 comments
Labels
0.6 bug Something isn't working Lang:PPL Pipe Processing Language support

Comments

@LantaoJin
Copy link
Member

LantaoJin commented Oct 8, 2024

What is the bug?
Aggregate functions could work in select clause even there is no group by as long as all items in select are aggregate functions.
Here are some examples:

select max(a) from t
select max(a), min(a), count(a) from t

But the following queries should throw exceptions

select a, max(a) from t
select a, max(a), min(a), count(a) from t
select *, max(a), min(a), count(a) from t

They could work with a group by

select a, max(a) from t group by b
select a, max(a), min(a), count(a) from t group by b
select *, max(a), min(a), count(a) from t group by b

Similar, aggregate functions in PPL could work in eval command, because an eval expression equals to add a projection to existing project list.
Here are some examples:

source=t | eval m = max(a) | fields m
source=t | eval m = max(a), n = count(a) | fields m, n
source=t | eval m = max(a) | n = count(a) | fields m, n

But the following PPL queries should throw exceptions

source=t | eval m = max(a) -- equals to all fields plus m
source=t | eval m = max(a), n = count(a) | fields a, m, n
source=t | eval m = max(a) | n = count(a) | fields a, m, n

How can one reproduce the bug?
The query failed with syntax error:

source=t | eval m = max(a) | fields m
@LantaoJin LantaoJin added bug Something isn't working untriaged labels Oct 8, 2024
@YANG-DB YANG-DB added 0.5 Lang:PPL Pipe Processing Language support 0.6 and removed untriaged 0.5 labels Oct 9, 2024
@LantaoJin LantaoJin changed the title [BUG] Aggregate functions cannot be recognized in eval expression [Feature] Aggregate functions cannot be recognized in eval expression Oct 10, 2024
@LantaoJin LantaoJin changed the title [Feature] Aggregate functions cannot be recognized in eval expression [Feature] Support aggregate functions in Eval expressions Oct 10, 2024
@LantaoJin LantaoJin changed the title [Feature] Support aggregate functions in Eval expressions [FEATURE] Support aggregate functions in Eval expressions Oct 10, 2024
@LantaoJin
Copy link
Member Author

I found the current draft PR still has some problem. And more further, it is sorts of tricky to provide this enhancement.
As we know, eval command will add new columns to a plan by adding a Spark Project node.
More specific, one comma separated eval expression will add Project(*, 'a, 'b, 'c, child), multiple eval expressions will add

Project(*, 'a, child)
+- Project(*, 'b, child)
     +- Project(*, 'c, child)

This plan will be optimized to Project(*, 'a, 'b, 'c, child) in Spark optimizer.
Note the * we added in the first project list is to make sure the attribute reference could be resolved since the plan we build is unresolved logical plan. It will be analyzed later. For example, the attribute reference x cannot be resolved if we remove the * in the Project nodes added by eval commands.

Project('x, child)
+- Project('a, child)
    +- Project('b, child)
         +- Project('c, child)

Now, this feature is arm to support aggregate functions in eval expression. Give an example, the command:

| eval max_a = max(a) | eval min_a = min(a) | eval count_a = count(a) | ...

Will build a plan like this:

Project(*, 'max(a), child)
+- Project(*, 'min(a), child)
     +- Project(*, 'count(a), child)

Unfortunately, Spark analyzer will resolve this Project node as translating to Aggregate:

Aggregate([], [*, 'max(a)], child)
+- Aggregate([], [*, 'min(a)], child)
     +- Aggregate([], [*, 'count(a)], child)

Then the resolution will fail because aggregate expression [*, 'max(a)] contains a non-aggregate function and grouping expression [] is empty.

So the only successful case is putting one comma separated eval expression in the second to last position:

| eval max_a = max(a), min_a = min(a), count_a = count(a) | fields max_a, min_a, count_a

This design is tricky and anti-robust, and even it could be rewrite to

| stats max(a) as max_a, min(a) as min_a, count(a) as count_a

So I think we should not continue this feature since it should work by stats.
@YANG-DB @penghuo how do you think?

@YANG-DB
Copy link
Member

YANG-DB commented Oct 11, 2024

@LantaoJin thanks for this deep analysis - I agree that if we can support the same functionality in stats we can ignore this issue for now - just please document this so that we have it in a common place and readers wont need to search in our issue to understand why ...

@LantaoJin
Copy link
Member Author

@LantaoJin thanks for this deep analysis - I agree that if we can support the same functionality in stats we can ignore this issue for now - just please document this so that we have it in a common place and readers wont need to search in our issue to understand why ...

Sure, I will give a doc PR later.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
0.6 bug Something isn't working Lang:PPL Pipe Processing Language support
Projects
None yet
Development

No branches or pull requests

2 participants