Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

UDAF refactor: Add PhysicalExpr trait dependency on datafusion-expr and remove logical expressions requirement for creating physical aggregate expression #11845

Merged
merged 30 commits into from
Aug 9, 2024

Conversation

jayzhan211
Copy link
Contributor

@jayzhan211 jayzhan211 commented Aug 6, 2024

Which issue does this PR close?

Closes #11359
Closes #11761

Rationale for this change

Logical expressions are no longer required for creating physical aggregate expression

What changes are included in this PR?

Lots of crate created

  • expr-common
  • functions-aggregate-common
  • physical-expr-common
  • physical-expr-functions-aggregate
  • expr has dependency on functions-aggregate-common + physical-expr-common
  • Column, Cast, Literal moved to physical-expr

Other cleanup

  • create_aggregate_expr_with_dfschema removed
  • create_aggregate_expr removed
  • remove input_types in AccumulatorArgs
  • remove dfschema in AccumulatorArgs
  • rename data_type to return_type in AccumulatorArgs

Are these changes tested?

Are there any user-facing changes?

Signed-off-by: jayzhan211 <[email protected]>
Signed-off-by: jayzhan211 <[email protected]>
@github-actions github-actions bot added logical-expr Logical plan and expressions physical-expr Physical Expressions core Core DataFusion crate labels Aug 6, 2024
@@ -44,6 +44,8 @@ arrow-schema = { workspace = true }
datafusion-common = { workspace = true }
datafusion-execution = { workspace = true }
datafusion-expr = { workspace = true }
datafusion-functions-aggregate-common = { workspace = true }
datafusion-physical-expr = { workspace = true }
Copy link
Contributor Author

@jayzhan211 jayzhan211 Aug 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess this is no longer an issue given the datafusion-physical-expr is somewhat thin now. We could move out expressions (Column, Literal, ...) for minimum dependencies

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you mean by "thin"? I checked on this branch and here is what datafusion/physical-expr still has

(we could potentially move the groups accumulators into datafusion-functions-aggegate-common as well, as a follow on PR)

$ find datafusion/physical-expr/src
datafusion/physical-expr/src
datafusion/physical-expr/src/physical_expr.rs
datafusion/physical-expr/src/partitioning.rs
datafusion/physical-expr/src/analysis.rs
datafusion/physical-expr/src/aggregate
datafusion/physical-expr/src/aggregate/mod.rs
datafusion/physical-expr/src/aggregate/stats.rs
datafusion/physical-expr/src/aggregate/groups_accumulator
datafusion/physical-expr/src/aggregate/groups_accumulator/adapter.rs
datafusion/physical-expr/src/aggregate/groups_accumulator/mod.rs
datafusion/physical-expr/src/lib.rs
datafusion/physical-expr/src/functions.rs
datafusion/physical-expr/src/utils
datafusion/physical-expr/src/utils/guarantee.rs
datafusion/physical-expr/src/utils/mod.rs
datafusion/physical-expr/src/planner.rs
datafusion/physical-expr/src/intervals
datafusion/physical-expr/src/intervals/test_utils.rs
datafusion/physical-expr/src/intervals/mod.rs
datafusion/physical-expr/src/intervals/cp_solver.rs
datafusion/physical-expr/src/intervals/utils.rs
datafusion/physical-expr/src/equivalence
datafusion/physical-expr/src/equivalence/properties.rs
datafusion/physical-expr/src/equivalence/mod.rs
datafusion/physical-expr/src/equivalence/projection.rs
datafusion/physical-expr/src/equivalence/class.rs
datafusion/physical-expr/src/equivalence/ordering.rs
datafusion/physical-expr/src/window
datafusion/physical-expr/src/window/row_number.rs
datafusion/physical-expr/src/window/ntile.rs
datafusion/physical-expr/src/window/window_expr.rs
datafusion/physical-expr/src/window/lead_lag.rs
datafusion/physical-expr/src/window/cume_dist.rs
datafusion/physical-expr/src/window/nth_value.rs
datafusion/physical-expr/src/window/built_in.rs
datafusion/physical-expr/src/window/built_in_window_function_expr.rs
datafusion/physical-expr/src/window/aggregate.rs
datafusion/physical-expr/src/window/rank.rs
datafusion/physical-expr/src/window/mod.rs
datafusion/physical-expr/src/window/sliding_aggregate.rs
datafusion/physical-expr/src/expressions
datafusion/physical-expr/src/expressions/like.rs
datafusion/physical-expr/src/expressions/negative.rs
datafusion/physical-expr/src/expressions/literal.rs
datafusion/physical-expr/src/expressions/cast.rs
datafusion/physical-expr/src/expressions/binary.rs
datafusion/physical-expr/src/expressions/is_not_null.rs
datafusion/physical-expr/src/expressions/unknown_column.rs
datafusion/physical-expr/src/expressions/mod.rs
datafusion/physical-expr/src/expressions/is_null.rs
datafusion/physical-expr/src/expressions/try_cast.rs
datafusion/physical-expr/src/expressions/column.rs
datafusion/physical-expr/src/expressions/no_op.rs
datafusion/physical-expr/src/expressions/binary
datafusion/physical-expr/src/expressions/binary/kernels.rs
datafusion/physical-expr/src/expressions/case.rs
datafusion/physical-expr/src/expressions/in_list.rs
datafusion/physical-expr/src/expressions/not.rs
datafusion/physical-expr/src/scalar_function.rs
datafusion/physical-expr/src/math_expressions.rs

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think half of them are window which will be moved to functions-window-common in the future. Another half are equivalence. However, I agree they are still unnecessary dependency for functions-aggregate. 😞 I think we could iteratively move window and maybe equivalence 🤔 out from physical-expr and keep most of expressions under datafusion/physical-expr/src/expressions/ at the end if we expect minimum dependency for 3rd party.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can see if window and/or equivalence staying here causes any problems in practice. If so, I agree that they can be moved in the future. However, it is also possible that they do not pose any challenges in practice and can stay. We'll see soon enough 🚀

Copy link
Contributor

@alamb alamb Aug 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can see if window and/or equivalence staying here causes any problems in practice. If so, I agree that they can be moved in the future. However, it is also possible that they do not pose any challenges in practice and can stay. We'll see soon enough 🚀

💯 agree

FWIW one project that would likely result in moving window functions would be #8709

I suspect that is a relatively smaller project than moving scalar or aggregates as there are so many fewer window functions

@ozankabak
Copy link
Contributor

ozankabak commented Aug 6, 2024

I know this is still a draft, but can you remind us which issue/discussion this refactor is related to? Is it about removing logical stuff from aggregates? We can help more effectively with more context.

Refactors are great when done judiciously but we should make sure we solve pressing pain points when we make major API changes or touch so many files at the same time. If removing logicals from aggregates is the intent here, that would be great to achieve and IMO would be worth to do a big-ish refactor for.

Signed-off-by: jayzhan211 <[email protected]>
@alamb
Copy link
Contributor

alamb commented Aug 6, 2024

I know this is still a draft, but can you remind us which issue/discussion this refactor is related to? Is it about removing logical stuff from aggregates? We can help more effectively with more context.

I agree it would be great to ensure this PR has the background / rationale for the changes

Signed-off-by: jayzhan211 <[email protected]>
@jayzhan211
Copy link
Contributor Author

@ozankabak @alamb This PR is from #11359, the main goal is to eliminate logical expressions for creating physical expressions.

@@ -20,10 +20,13 @@
#[macro_use]
mod binary;
mod case;
mod cast;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Move cast, literal, column back to physical-expr, since physical-expr-common is now a common-level crate

@@ -108,35 +106,6 @@ pub fn reverse_order_bys(order_bys: &[PhysicalSortExpr]) -> Vec<PhysicalSortExpr
.collect()
}

/// Converts `datafusion_expr::Expr` into corresponding `Arc<dyn PhysicalExpr>`.
/// If conversion is not supported yet, returns Error.
pub fn limited_convert_logical_expr_to_physical_expr_with_dfschema(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

workaround cleanup

/// an error if the given schema has fewer columns than the original schema.
/// Note that the resulting expression may not be valid if data types in the
/// new schema is incompatible with expression nodes.
pub fn with_new_schema(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

move to physical_expr::expressions::column

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this takes in a general PhysicalExpr and rewrites it, it is more natural for it to be a top-level utility. Do we have to nest it inside expressions::column due to some dependency issues? Avoiding that nesting would be good due to the signature/function of this routine.

Copy link
Contributor Author

@jayzhan211 jayzhan211 Aug 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The best place we could move to is probably expressions::mod, it should be in the same crate with Column, but I don't think there is much different, so I keep it besides column

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

expressions::mod sounds good to me 👍

Signed-off-by: jayzhan211 <[email protected]>
Signed-off-by: jayzhan211 <[email protected]>
Signed-off-by: jayzhan211 <[email protected]>
Signed-off-by: jayzhan211 <[email protected]>
Comment on lines 88 to 113
let n = match acc_args.physical_exprs[1]
.as_any()
.downcast_ref::<Literal>()
{
Some(lit) => match lit.value() {
ScalarValue::Int64(Some(value)) => {
if acc_args.is_reversed {
-*value
} else {
*value
}
}
_ => {
return not_impl_err!(
"{} not supported for n: {}",
self.name(),
&acc_args.physical_exprs[1]
);
}
},
None => {
return not_impl_err!(
"{} not supported for n: {}",
self.name(),
&acc_args.physical_exprs[1]
);
Copy link
Contributor

@jcsherin jcsherin Aug 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The existing code is readable in the current form.

But I wanted to explore ways for removing the repetition in the error handling block. Finally, ended up having to split it into two separate bindings.

I don't have any strong opinions about this. So please feel free to use it or not 😃

let scalar = acc_args.physical_exprs[1]
	.as_any()
	.downcast_ref::<Literal>()
	.and_then(|lit| {
		if let ScalarValue::Int64(inner) = lit.value() {
			Some(*inner)
		} else {
			None
		}
	})
	.flatten();

let n = if let Some(value) = scalar {
            if acc_args.is_reversed {
                -value
            } else {
                value
            }
} else {
	return not_impl_err!(
		"{} not supported for n: {}",
		self.name(),
		&acc_args.physical_exprs[1]
	);
};

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I come out another way that is slightly more readable

Signed-off-by: jayzhan211 <[email protected]>
Signed-off-by: jayzhan211 <[email protected]>
Signed-off-by: jayzhan211 <[email protected]>
use datafusion_expr::function::StateFieldsArgs;
use datafusion_expr::type_coercion::aggregates::check_arg_count;
use datafusion_expr::utils::AggregateOrderSensitivity;
use datafusion_expr::AggregateUDF;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The blocker for us to move this crate into common-level crate functions-aggregate-common is AggregateUDF. It has fn call() -> Expr that tightly coupled with Expr.

Copy link
Contributor Author

@jayzhan211 jayzhan211 Aug 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should remove call(), but create another function like
aggregate_expr(sum_udaf(), vec![args]) -> Expr
And, aggregate_physical_expr(sum_udaf(), vec![physical_args]) -> Arc<dyn AggregateExpr>> similarly. 🤔

Or we force to use ExprFuncBuilder for whatever

@jayzhan211 jayzhan211 marked this pull request as ready for review August 7, 2024 03:34
@andygrove andygrove added the api change Changes the API exposed to users of the crate label Aug 7, 2024
Copy link
Contributor

@ozankabak ozankabak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for driving this forward.

I gave this a careful read and left some comments/suggestions. Most of them are stylistic, but there are also some with questions and possible discussion points. I will be happy to iterate with you to move this forward so we succeed in removing logical expressions from accumulators/aggregates.

One general point in addition to inline comments: Given that the structs will only contain physical expressions after this, I'm not sure if we need to be overly explicit and call the expression fields physical_exprs. The prefix physical would make sense in the old design where both logical and physical expressions were there, but now we only have one type of expression. So calling it exprs seem to make more sense to me (and the linter will show the type anyway).

Comment on lines 260 to 261
use datafusion_physical_expr::expressions::col;
use datafusion_physical_expr::expressions::lit;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
use datafusion_physical_expr::expressions::col;
use datafusion_physical_expr::expressions::lit;
use datafusion_physical_expr::expressions::{col, lit};

use crate::type_coercion::binary::get_result_type;
use crate::Operator;
use arrow_buffer::{IntervalDayTime, IntervalMonthDayNano};
use arrow::datatypes::{IntervalDayTime, IntervalMonthDayNano};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can move this down next to other arrow::datatypes imports and have one import statement instead of three for arrow datatypes

Comment on lines 23 to 24
use datafusion_physical_expr::expressions::Column;
use datafusion_physical_expr::expressions::{IsNotNullExpr, IsNullExpr};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
use datafusion_physical_expr::expressions::Column;
use datafusion_physical_expr::expressions::{IsNotNullExpr, IsNullExpr};
use datafusion_physical_expr::expressions::{Column, IsNotNullExpr, IsNullExpr};

Comment on lines 684 to 686
use crate::expressions::{col, lit, try_cast, Literal};

use crate::expressions::Column;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
use crate::expressions::{col, lit, try_cast, Literal};
use crate::expressions::Column;
use crate::expressions::{col, lit, try_cast, Column, Literal};

Comment on lines 35 to 36
use super::Column;
use super::Literal;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
use super::Column;
use super::Literal;
use super::{Column, Literal};

datafusion/physical-expr/src/expressions/case.rs Outdated Show resolved Hide resolved
@@ -395,6 +379,8 @@ impl BuiltInWindowFunctionExpr for WindowUDFExpr {
}
}

// TODO: Find a way to make clippy happy
#[allow(clippy::needless_borrow)]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder why we get this warning now. Is it because we used to clone with expr and now do it with &expr? We should understand if this is a spurious warning or indicative of a code smell.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

error: this expression creates a reference which is immediately dereferenced by the compiler
   --> datafusion/physical-plan/src/windows/mod.rs:401:28
    |
401 |                 Arc::clone(&expr),
    |                            ^^^^^ help: change this to: `expr`
    |
    = help: for further information visit https://rust-lang.github.io/rust-clippy/master/index.html#needless_borrow
    = note: `-D clippy::needless-borrow` implied by `-D warnings`
    = help: to override `-D warnings` add `#[allow(clippy::needless_borrow)]`

This is the clippy error

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And I presume if you just change to expr, like how it was before, we get some other error?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting, actually it has no error. The error I got is from rust-analyzer.

expected &Arc<dyn PhysicalExpr, Global>, found Arc<dyn PhysicalExpr, Global>

😕

@jayzhan211 jayzhan211 marked this pull request as draft August 8, 2024 00:33
Signed-off-by: jayzhan211 <[email protected]>
Signed-off-by: jayzhan211 <[email protected]>
Signed-off-by: jayzhan211 <[email protected]>
@jayzhan211
Copy link
Contributor Author

jayzhan211 commented Aug 8, 2024

I cleanup dfschema and input_types together in this PR because they turned out to be more straightforward than I initially thought

@jayzhan211 jayzhan211 changed the title UDAF refactor UDAF refactor; Add PhysicalExpr trait dependency on datafusion-expr and remove logical expressions for creating physical aggregate expression Aug 8, 2024
@jayzhan211 jayzhan211 changed the title UDAF refactor; Add PhysicalExpr trait dependency on datafusion-expr and remove logical expressions for creating physical aggregate expression UDAF refactor: Add PhysicalExpr trait dependency on datafusion-expr and remove logical expressions for creating physical aggregate expression Aug 8, 2024
@jayzhan211 jayzhan211 changed the title UDAF refactor: Add PhysicalExpr trait dependency on datafusion-expr and remove logical expressions for creating physical aggregate expression UDAF refactor: Add PhysicalExpr trait dependency on datafusion-expr and remove logical expressions requirement for creating physical aggregate expression Aug 8, 2024
@jayzhan211 jayzhan211 marked this pull request as ready for review August 8, 2024 06:24
Copy link
Contributor

@ozankabak ozankabak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for working on this. I looked at this in detail and it looks good to me.

Hopefully we will not need to do such big refactors in the future as we stabilize the architecture.

Given that it is a big PR with many changes (both structurally and API), it would be good to get multiple eyes on this. @alamb, can you please take a look? Also @andygrove, you were interested in removing logical expressions from aggregates, it'd be good if you can take a look as well. I also announced on Discord and Slack to attract more eyes.

@jayzhan211
Copy link
Contributor Author

Thanks @ozankabak for your review!

pub is_distinct: bool,

/// The physical expression of arguments the aggregate function takes.
pub exprs: &'a [Arc<dyn PhysicalExpr>],
Copy link
Contributor Author

@jayzhan211 jayzhan211 Aug 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Introducing PhysicalExpr and PhysicalSortExpr to AccumulatorArgs which used in AggregateUDFImpl is the main reason for this large crate refactor

@ozankabak
Copy link
Contributor

ozankabak commented Aug 8, 2024

It would be great to visualize the dependency graph after this refactor. We can probably use diagrams like that in our documentation too.

@timsaucer
Copy link
Contributor

timsaucer commented Aug 8, 2024

reduced datafusion dependencies.pdf

reduced_datafusion_dependencies

I'm trying to wrap my head around the current dependency graph and what the purpose of each of the crates is.

I generated this diagram using cargo depgraph --workspace-only --exclude datafusion-sqllogictest,datafusion-wasmtest,datafusion-docs-tests,gen,gen-common,datafusion-examples,test-utils,datafusion-benchmarks > depgraph.dot and then I further removed datafusion (core) because it depends on nearly everything here and clutters the diagram, datafusion-common because nearly everything here depends on it, datafusion-proto / datafusion-proto-common because they don't really add a lot to the graph and just clutter it, and datafusion-substrait because it only depends on datafusion (core). I then colored the nodes going into datafusion-expr to make it easier for me to follow.

Hopefully this helps.

@alamb
Copy link
Contributor

alamb commented Aug 8, 2024

FYI @andygrove and @viirya as I suspect this may impact comet (though I haven't looked at it thoroughly yet)

Given the size of this change and that we are getting close (days if not sooner) to releasing DataFusion 41 -- #11476 -- it might be a good idea to wait to merge this PR until right after the RC is created

That would give these changes the maximal "bake time" on main and give early adopters who integrate from main rather than releases on crates.io time to test and make adjustments

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @jayzhan211 and @ozankabak -- I think this is an epic refactor and very nicely / cleanly separates the logical from physical APIs ❤️

I am sure we can always improve things going forward too, but I think this is a great step in the right direction

I left some small documentation/comment suggestions but I think they could be done as a follow on PR as well.

I do think we should consider waiting to merge until we have made a 41.0.0 release candidate so we have the maximum time to work out any kinks this might cause


[package]
name = "datafusion-expr-common"
description = "Logical plan and expression representation for DataFusion query engine"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we could adjust this to say it had commonly shared types

Suggested change
description = "Logical plan and expression representation for DataFusion query engine"
description = "Traits and types for logical plans and expressions for DataFusion query engine"

@@ -17,7 +17,7 @@

//! Vectorized [`GroupsAccumulator`]

use arrow_array::{ArrayRef, BooleanArray};
use arrow::array::{ArrayRef, BooleanArray};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❤️

Comment on lines 18 to 20
//! Logical Expr Common packages for [DataFusion]
//!
//! [DataFusion]: <https://crates.io/crates/datafusion>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think some additional rationale here might help

Suggested change
//! Logical Expr Common packages for [DataFusion]
//!
//! [DataFusion]: <https://crates.io/crates/datafusion>
//! Logical Expr types and traits for [DataFusion]
//!
//! This crate contains types and traits that are used by both Logical and Physical expressions.
//! They are kept in their own crate to avoid physical expressions depending on logical expressions.
//!
//!
//! [DataFusion]: <https://crates.io/crates/datafusion>

@@ -287,202 +280,3 @@ impl fmt::Display for Operator {
write!(f, "{display}")
}
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❤️


[package]
name = "datafusion-functions-aggregate-common"
description = "Common aggregate function packages for the DataFusion query engine"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
description = "Common aggregate function packages for the DataFusion query engine"
description = "Utility functions for implementing aggregate functions for the DataFusion query engine"

Comment on lines 18 to 19
//! Contains the trait `AggregateExpr` which defines the interface all aggregate expressions
//! (built-in and custom) need to satisfy.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As this first sentence is rendered as the summary on the docs page, I think it is nice to make a link to the relevant struct/trait if possible

for example https://docs.rs/datafusion/latest/datafusion/index.html

Screenshot 2024-08-08 at 8 27 01 AM

Suggested change
//! Contains the trait `AggregateExpr` which defines the interface all aggregate expressions
//! (built-in and custom) need to satisfy.
//! [`AggregateExpr`] which defines the interface all aggregate expressions
//! (built-in and custom) need to satisfy.

Comment on lines 18 to 21
//! Aggregate Function Common packages for [DataFusion]
//! This package could be used to build for 3rd party aggregate function
//!
//! [DataFusion]: <https://crates.io/crates/datafusion>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
//! Aggregate Function Common packages for [DataFusion]
//! This package could be used to build for 3rd party aggregate function
//!
//! [DataFusion]: <https://crates.io/crates/datafusion>
//! Common Aggregate functionality for [DataFusion]
//!
//! This crate contains traits and utilities commonly used to implement aggregate functions
//! They are kept in their own crate to avoid physical expressions depending on logical expressions.
//!
//! [DataFusion]: <https://crates.io/crates/datafusion>

@@ -44,6 +44,8 @@ arrow-schema = { workspace = true }
datafusion-common = { workspace = true }
datafusion-execution = { workspace = true }
datafusion-expr = { workspace = true }
datafusion-functions-aggregate-common = { workspace = true }
datafusion-physical-expr = { workspace = true }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you mean by "thin"? I checked on this branch and here is what datafusion/physical-expr still has

(we could potentially move the groups accumulators into datafusion-functions-aggegate-common as well, as a follow on PR)

$ find datafusion/physical-expr/src
datafusion/physical-expr/src
datafusion/physical-expr/src/physical_expr.rs
datafusion/physical-expr/src/partitioning.rs
datafusion/physical-expr/src/analysis.rs
datafusion/physical-expr/src/aggregate
datafusion/physical-expr/src/aggregate/mod.rs
datafusion/physical-expr/src/aggregate/stats.rs
datafusion/physical-expr/src/aggregate/groups_accumulator
datafusion/physical-expr/src/aggregate/groups_accumulator/adapter.rs
datafusion/physical-expr/src/aggregate/groups_accumulator/mod.rs
datafusion/physical-expr/src/lib.rs
datafusion/physical-expr/src/functions.rs
datafusion/physical-expr/src/utils
datafusion/physical-expr/src/utils/guarantee.rs
datafusion/physical-expr/src/utils/mod.rs
datafusion/physical-expr/src/planner.rs
datafusion/physical-expr/src/intervals
datafusion/physical-expr/src/intervals/test_utils.rs
datafusion/physical-expr/src/intervals/mod.rs
datafusion/physical-expr/src/intervals/cp_solver.rs
datafusion/physical-expr/src/intervals/utils.rs
datafusion/physical-expr/src/equivalence
datafusion/physical-expr/src/equivalence/properties.rs
datafusion/physical-expr/src/equivalence/mod.rs
datafusion/physical-expr/src/equivalence/projection.rs
datafusion/physical-expr/src/equivalence/class.rs
datafusion/physical-expr/src/equivalence/ordering.rs
datafusion/physical-expr/src/window
datafusion/physical-expr/src/window/row_number.rs
datafusion/physical-expr/src/window/ntile.rs
datafusion/physical-expr/src/window/window_expr.rs
datafusion/physical-expr/src/window/lead_lag.rs
datafusion/physical-expr/src/window/cume_dist.rs
datafusion/physical-expr/src/window/nth_value.rs
datafusion/physical-expr/src/window/built_in.rs
datafusion/physical-expr/src/window/built_in_window_function_expr.rs
datafusion/physical-expr/src/window/aggregate.rs
datafusion/physical-expr/src/window/rank.rs
datafusion/physical-expr/src/window/mod.rs
datafusion/physical-expr/src/window/sliding_aggregate.rs
datafusion/physical-expr/src/expressions
datafusion/physical-expr/src/expressions/like.rs
datafusion/physical-expr/src/expressions/negative.rs
datafusion/physical-expr/src/expressions/literal.rs
datafusion/physical-expr/src/expressions/cast.rs
datafusion/physical-expr/src/expressions/binary.rs
datafusion/physical-expr/src/expressions/is_not_null.rs
datafusion/physical-expr/src/expressions/unknown_column.rs
datafusion/physical-expr/src/expressions/mod.rs
datafusion/physical-expr/src/expressions/is_null.rs
datafusion/physical-expr/src/expressions/try_cast.rs
datafusion/physical-expr/src/expressions/column.rs
datafusion/physical-expr/src/expressions/no_op.rs
datafusion/physical-expr/src/expressions/binary
datafusion/physical-expr/src/expressions/binary/kernels.rs
datafusion/physical-expr/src/expressions/case.rs
datafusion/physical-expr/src/expressions/in_list.rs
datafusion/physical-expr/src/expressions/not.rs
datafusion/physical-expr/src/scalar_function.rs
datafusion/physical-expr/src/math_expressions.rs

@@ -136,6 +138,35 @@ pub fn col(name: &str, schema: &Schema) -> Result<Arc<dyn PhysicalExpr>> {
Ok(Arc::new(Column::new_with_schema(name, schema)?))
}

// TODO: Move expressions out of physical-expr?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to where would it go?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

leftover, it should be kept with expressions/

)
let args = [col("b", schema)?];

AggregateExprBuilder::new(first_value_udaf(), args.to_vec())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

😍

@ozankabak
Copy link
Contributor

I do think we should consider waiting to merge until we have made a 41.0.0 release candidate so we have the maximum time to work out any kinks this might cause

Reasonable, agreed

@andygrove
Copy link
Member

I do think we should consider waiting to merge until we have made a 41.0.0 release candidate so we have the maximum time to work out any kinks this might cause

Reasonable, agreed

I plan on creating 41.0.0-rc1 today

@alamb
Copy link
Contributor

alamb commented Aug 9, 2024

RC1 is out https://lists.apache.org/thread/gz15fc6qbn4q99rp0of6xjsngvf8jbw0 so let's get this PR in!

Thanks again @jayzhan211 and @ozankabak 👏

@alamb alamb merged commit e088945 into apache:main Aug 9, 2024
26 checks passed
@jayzhan211
Copy link
Contributor Author

Thanks everyone 🚀

@jayzhan211 jayzhan211 deleted the udaf-refactor branch August 9, 2024 14:44
@ozankabak
Copy link
Contributor

🚀

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api change Changes the API exposed to users of the crate core Core DataFusion crate logical-expr Logical plan and expressions physical-expr Physical Expressions
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Review use of logical expressions in physical AggregateFunctionExpr
6 participants