Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Convert ntile builtIn function to UDWF #13040

Merged
merged 18 commits into from
Oct 25, 2024

Conversation

jatin510
Copy link
Contributor

@jatin510 jatin510 commented Oct 21, 2024

Which issue does this PR close?

Closes #12694.

Rationale for this change

Same as described in #8709.

What changes are included in this PR?

Converts BuiltinWindowFunction::Ntile to user-defined window function.
Export a fluent-style API for creating ntile expressions.

Are these changes tested?

Yes

Are there any user-facing changes?

@github-actions github-actions bot added documentation Improvements or additions to documentation logical-expr Logical plan and expressions physical-expr Physical Expressions proto Related to proto crate labels Oct 21, 2024
/// Create a new `ntile` function
pub fn new() -> Self {
Self {
signature: Signature::any(0, Volatility::Immutable),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ntile(expression) window function takes a single argument of the integer type.

Suggested change
signature: Signature::any(0, Volatility::Immutable),
signature: Signature::uniform(
1,
vec![
DataType::UInt64,
DataType::UInt32,
DataType::UInt16,
DataType::UInt8,
DataType::Int64,
DataType::Int32,
DataType::Int16,
DataType::Int8,
],
Volatility::Immutable,
)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just noticed this is a draft PR 😅. Let me know once it's ready for review.

@jcsherin
Copy link
Contributor

Closes #12649.

The issue number for BuiltInWindowFunction::Ntile is #12694.

let nullable = false;
let return_type: &DataType = field_args.input_types().first().unwrap();

Ok(Field::new(self.name(), return_type.clone(), nullable))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When constructing the result field (column) we need to pass the fully qualified name in the schema which is available in WindowUDFFieldArgs.

/// Call `field_args.name()` to get the fully qualified name for defining
/// the [`Field`]. For a complete example see the implementation in the
/// [Basic Example](WindowUDFImpl#basic-example) section.
fn field(&self, field_args: WindowUDFFieldArgs) -> Result<Field>;

It is easy to get confused with the name of the udwf 😅.

External error: query failed: DataFusion error: Internal error: Input field name ntile does not match with the projection expression ntile(Int64(8)) ORDER BY [aggregate_test_100.c4 ASC NULLS LAST] RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW.
This was likely caused by a bug in DataFusion's code and we would welcome that you file an bug report in our issue tracker
[SQL] SELECT
NTILE(8) OVER (ORDER BY C4) as ntile1,
NTILE(12) OVER (ORDER BY C12 DESC) as ntile2
FROM aggregate_test_100
ORDER BY c7
LIMIT 5
at test_files/window.slt:743
Suggested change
Ok(Field::new(self.name(), return_type.clone(), nullable))
Ok(Field::new(field_args.name(), return_type.clone(), nullable))

@jatin510
Copy link
Contributor Author

jatin510 commented Oct 24, 2024

@jcsherin

should i remove this duplicate code from : datafusion/physical-plan/src/windows/mod.rs:
because now we have made copy of these in : datafusion/functions-window/src/utils.rs

warning: function `get_scalar_value_from_args` is never used

   --> datafusion/physical-plan/src/windows/mod.rs:168:4
    |
168 | fn get_scalar_value_from_args(
    |    ^^^^^^^^^^^^^^^^^^^^^^^^^^
    |
    = note: `#[warn(dead_code)]` on by default

warning: function `get_unsigned_integer` is never used
   --> datafusion/physical-plan/src/windows/mod.rs:199:4
    |
199 | fn get_unsigned_integer(value: ScalarValue) -> Result<u64> {
    |    ^^^^^^^^^^^^^^^^^^^^


@jcsherin
Copy link
Contributor

@jcsherin

should i remove this duplicate code from : datafusion/physical-plan/src/windows/mod.rs: because now we have made copy of these in : datafusion/functions-window/src/utils.rs

warning: function `get_scalar_value_from_args` is never used

   --> datafusion/physical-plan/src/windows/mod.rs:168:4
    |
168 | fn get_scalar_value_from_args(
    |    ^^^^^^^^^^^^^^^^^^^^^^^^^^
    |
    = note: `#[warn(dead_code)]` on by default

warning: function `get_unsigned_integer` is never used
   --> datafusion/physical-plan/src/windows/mod.rs:199:4
    |
199 | fn get_unsigned_integer(value: ScalarValue) -> Result<u64> {
    |    ^^^^^^^^^^^^^^^^^^^^

Yes, please.

@jatin510 jatin510 marked this pull request as ready for review October 24, 2024 05:49
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a inner doc comment:

Suggested change
//! `ntile` window function implementation

Comment on lines 42 to 44
pub fn ntile(arg: i64) -> datafusion_expr::Expr {
ntile_udwf().call(vec![arg.lit()])
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The expression API changed here. The arg type is Expr in the existing API.

https://docs.rs/datafusion/latest/datafusion/logical_expr/window_function/fn.ntile.html

Copy link
Contributor

@jcsherin jcsherin Oct 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • Revert to existing API contract
  • Add a logical plan roundtrip test

Marked as completed.

Comment on lines 108 to 112
fn expressions(&self, expr_args: ExpressionArgs) -> Vec<Arc<dyn PhysicalExpr>> {
parse_expr(expr_args.input_exprs(), expr_args.input_types())
.into_iter()
.collect::<Vec<_>>()
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this necessary?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

@jcsherin jcsherin Oct 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can remove this completely and rely on the default implementation for ntile. From what I see ntile doesn't depend on the input expression in partition evaluator.

This implementation is required in lead/lag because it is possible that it can be called with NULL (input expression), and we try to deduce the type for that NULL if a default value argument exists. For lead/lag the types of the input expression and the default value needs to be the same in partition evaluator.

Comment on lines 130 to 135
let n = get_unsigned_integer(scalar_n)?;
if n == 0 {
return exec_err!("NTILE requires a positive integer");
}

Ok(Box::new(NtileEvaluator { n }))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's keep the original code from create_built_in_window_expr for parsing n unmodified?

Suggested change
let n = get_unsigned_integer(scalar_n)?;
if n == 0 {
return exec_err!("NTILE requires a positive integer");
}
Ok(Box::new(NtileEvaluator { n }))
if scalar_n.is_unsigned() {
let n = get_unsigned_integer(scalar_n)?;
Ok(Box::new(NtileEvaluator { n }))
} else {
let n: i64 = get_signed_integer(scalar_n)?;
if n <= 0 {
return exec_err!("NTILE requires a positive integer");
}
Ok(Box::new(NtileEvaluator { n: n as u64 }))
}

Comment on lines +143 to +145
fn documentation(&self) -> Option<&Documentation> {
Some(get_ntile_doc())
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🫶

@@ -51,3 +51,18 @@ pub(crate) fn get_scalar_value_from_args(
None
})
}

pub(crate) fn get_unsigned_integer(value: ScalarValue) -> Result<u64> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe it is better to extract this function inside ntile.

@@ -165,25 +163,6 @@ fn window_expr_from_aggregate_expr(
}
}

fn get_scalar_value_from_args(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Comment on lines -149 to -182
## Ranking functions

- [rank](#rank)
- [dense_rank](#dense_rank)
- [ntile](#ntile)

### `rank`

Rank of the current row with gaps; same as row_number of its first peer.

```sql
rank()
```

### `dense_rank`

Rank of the current row without gaps; this function counts peer groups.

```sql
dense_rank()
```

### `ntile`

Integer ranging from 1 to the argument value, dividing the partition as equally as possible.

```sql
ntile(expression)
```

#### Arguments

- **expression**: An integer describing the number groups the partition should be split into

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These docs should not be removed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i have moved them to docs/source/user-guide/sql/window_functions_new.md
Should I still keep them here

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes sense to me now. You are removing the {rank, dense_rank} functions here because they were already migrated to new documentation.

Let this be 👍

@@ -179,6 +180,18 @@ Returns the rank of the current row without gaps. This function ranks rows in a
dense_rank()
```

### `ntile`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚀

Copy link
Contributor

@jayzhan211 jayzhan211 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@jcsherin
Copy link
Contributor

@jatin510 Can you please mark this PR as draft so that it is not accidentally merged into main while you are working on changes. You can change it back when you are ready, so the committers will know it is ready to merge.

@@ -940,6 +940,7 @@ async fn roundtrip_expr_api() -> Result<()> {
vec![lit(1), lit(2), lit(3)],
vec![lit(10), lit(20), lit(30)],
),
cume_dist(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Comment on lines +46 to +48
pub fn ntile(arg: Expr) -> Expr {
ntile_udwf().call(vec![arg])
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@jatin510 jatin510 marked this pull request as draft October 24, 2024 15:27
@jcsherin
Copy link
Contributor

@jatin510 Thanks 🙌.

The udwf epic is almost complete!

@jatin510
Copy link
Contributor Author

@jcsherin Oops !
I accidentally worked on ntile udwf, I was supposed to work on nth udwf.
Also the branch name is wrong : jatin510:feature/12649-udwf-nth-value

Should I delete this remote branch and update the branch name and create a PR again !

@jcsherin
Copy link
Contributor

jcsherin commented Oct 24, 2024

@jcsherin Oops ! I accidentally worked on ntile udwf, I was supposed to work on nth udwf. Also the branch name is wrong : jatin510:feature/12649-udwf-nth-value

Should I delete this remote branch and update the branch name and create a PR again !

It is alright, that is not a problem at all 😂.

You just need to edit the Closes # in the PR description to refer instead to the ntile issue.

The nth_value conversion is also a good first issue which I hope somebody who wishes to gain experience with DataFusion will pick up soon.

@jatin510 jatin510 marked this pull request as ready for review October 24, 2024 16:47
@jatin510
Copy link
Contributor Author

jatin510 commented Oct 24, 2024

Thank you so much ! @jcsherin

@alamb
Copy link
Contributor

alamb commented Oct 24, 2024

Is this PR ready to go? It looks so to me. Thank you @jatin510 and @jcsherin and @jayzhan211

@jcsherin
Copy link
Contributor

@alamb This PR is ready.

@alamb
Copy link
Contributor

alamb commented Oct 25, 2024

Thanks again @jatin510 for this PR, @jcsherin for the guidance and @jayzhan211 for the review 🚀

@alamb alamb merged commit 02b9693 into apache:main Oct 25, 2024
25 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
documentation Improvements or additions to documentation logical-expr Logical plan and expressions physical-expr Physical Expressions proto Related to proto crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Convert BuiltInWindowFunction::Ntile to a user defined window function
4 participants