Infer data type from schema for `Values` and add struct coercion to `coalesce` #12864

jayzhan211 · 2024-10-11T10:11:23Z

Which issue does this PR close?

Closes #5046.

Follow up from #12839

Infer the types of Values from the schema, if one exists.
Array and Coalesce share similar logic, so the struct coercion logic is applied to coalesce as well. Values is quite different, so its logic is rewritten separately.

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

jayzhan211 · 2024-10-11T13:38:53Z

datafusion/sqllogictest/test_files/subquery.slt

+13)------CoalesceBatchesExec: target_batch_size=2
+14)--------RepartitionExec: partitioning=Hash([t1_id@0], 4), input_partitions=4
+15)----------RepartitionExec: partitioning=RoundRobinBatch(4), input_partitions=1
+16)------------MemoryExec: partitions=1, partition_sizes=[1]


Not quite sure about this kind of change 🤔

On the main branch, when you create a table with VALUES, RoundRobinBatch is applied because the projection contains a cast expression. The values are i64 by default, so when the table has an int column, we need to cast them to i32.

fn benefits_from_input_partitioning(&self) -> Vec<bool> {
    let all_simple_exprs = self
        .expr
        .iter()
        .all(|(e, _)| e.as_any().is::<Column>() || e.as_any().is::<Literal>());
    // If expressions are all either column_expr or Literal, then all computations in this projection are reorder or rename,
    // and projection would not benefit from the repartition, benefits_from_input_partitioning will return false.
    vec![!all_simple_exprs]
}

With this change, however, the cast already happens in Values, so there is no cast expression in the Projection, and the plan shows MemoryExec: partitions=1, partition_sizes=[1] instead of MemoryExec: partitions=4, partition_sizes=[1, 0, 0, 0].

Therefore, the plan makes sense to me.
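
To make the reasoning above concrete, here is a minimal, self-contained sketch of the rule. It does not use DataFusion's types: the `Expr` enum, the free function, and the column name `column1` are hypothetical stand-ins for the physical expressions and for the projection's `benefits_from_input_partitioning`, so treat it as an illustration rather than the actual implementation.

```rust
/// Hypothetical stand-in for a physical expression inside a projection.
#[allow(dead_code)]
#[derive(Debug)]
enum Expr {
    Column(&'static str),
    Literal(i64),
    /// A cast of an inner expression; unlike a column or literal, this is real computation.
    Cast(Box<Expr>),
}

/// Mirrors the rule quoted above: a projection made only of columns and
/// literals merely reorders or renames, so repartitioning its input does not help.
fn benefits_from_input_partitioning(exprs: &[Expr]) -> Vec<bool> {
    let all_simple_exprs = exprs
        .iter()
        .all(|e| matches!(e, Expr::Column(_) | Expr::Literal(_)));
    vec![!all_simple_exprs]
}

/// Projection shape on main: the i64 values column is cast to i32.
fn projection_on_main() -> Vec<Expr> {
    vec![Expr::Cast(Box::new(Expr::Column("column1")))]
}

/// Projection shape with this PR: the cast already happened inside Values.
fn projection_with_pr() -> Vec<Expr> {
    vec![Expr::Column("column1")]
}
```

With the cast present the function reports that repartitioning helps, which is where the RoundRobinBatch came from; without it the projection is "simple" and the single-partition MemoryExec is left alone.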

jayzhan211 · 2024-10-12T03:10:50Z

datafusion/sqllogictest/test_files/group_by.slt

@@ -3579,8 +3580,7 @@ physical_plan
 08)--------------RepartitionExec: partitioning=RoundRobinBatch(8), input_partitions=1
 09)----------------ProjectionExec: expr=[zip_code@0 as zip_code, country@1 as country, sn@2 as sn, ts@3 as ts, currency@4 as currency, amount@5 as amount, sum(l.amount) ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING@6 as sum_amount]
 10)------------------BoundedWindowAggExec: wdw=[sum(l.amount) ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING: Ok(Field { name: "sum(l.amount) ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING", data_type: Float64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }), frame: WindowFrame { units: Rows, start_bound: Preceding(UInt64(1)), end_bound: Following(UInt64(1)), is_causal: false }], mode=[Sorted]
-11)--------------------CoalescePartitionsExec
-12)----------------------MemoryExec: partitions=8, partition_sizes=[1, 0, 0, 0, 0, 0, 0, 0]
+11)--------------------MemoryExec: partitions=1, partition_sizes=[1]


This looks like an improvement

jayzhan211 · 2024-10-12T03:11:00Z

datafusion/sqllogictest/test_files/group_by.slt

@@ -3360,7 +3360,8 @@ physical_plan
 05)--------CoalesceBatchesExec: target_batch_size=4
 06)----------RepartitionExec: partitioning=Hash([sn@0, amount@1], 8), input_partitions=8
 07)------------AggregateExec: mode=Partial, gby=[sn@0 as sn, amount@1 as amount], aggr=[]
-08)--------------MemoryExec: partitions=8, partition_sizes=[1, 0, 0, 0, 0, 0, 0, 0]
+08)--------------RepartitionExec: partitioning=RoundRobinBatch(8), input_partitions=1
+09)----------------MemoryExec: partitions=1, partition_sizes=[1]


This seems equivalent

jayzhan211 · 2024-10-18T06:24:40Z

If this change is too large, I can try to split it into several smaller ones.

berkaysynnada · 2024-10-21T07:22:58Z

Hi @jayzhan211, what is the current status of this PR? I have time to review it if it is ready

jayzhan211 · 2024-10-21T08:22:10Z

Hi @jayzhan211, what is the current status of this PR? I have time to review it if it is ready

This is ready for review; I forgot to mark it as such.

jayzhan211 · 2024-10-21T08:25:28Z

Conflict again

alamb

Thank you @jayzhan211 and @findepi

It seems to me that this PR is an improvement over what is on main, and thus makes sense to merge in.

I had some API / comment suggestions, but I also think they are not required.

alamb · 2024-10-22T18:26:01Z

datafusion/expr/src/logical_plan/builder.rs

@@ -177,7 +179,7 @@ impl LogicalPlanBuilder {
    /// so it's usually better to override the default names with a table alias list.
    ///
    /// If the values include params/binders such as $1, $2, $3, etc, then the `param_data_types` should be provided.
-    pub fn values(mut values: Vec<Vec<Expr>>) -> Result<Self> {
+    pub fn values(values: Vec<Vec<Expr>>, schema: Option<&DFSchemaRef>) -> Result<Self> {


👍 It makes a lot of sense to me to pass in the schema too, if known.

I think it might be nicer on users if we didn't make an API change and instead added a new API like

pub fn values_with_schema(values: Vec<Vec<Expr>>, schema: Option<&DFSchemaRef>) -> Result<Self> { ... }

Even if we also deprecated values, it would help users prepare for the upgrade.
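
As a rough illustration of the non-breaking shape suggested above, the sketch below keeps the old entry point as a thin wrapper over the new one. Everything in it (`PlanBuilder`, `Schema`, `ValuesPlan`, the two-variant `DataType`, rows of plain `i64`) is a hypothetical stand-in, not DataFusion's `LogicalPlanBuilder` or `DFSchema`.

```rust
/// Hypothetical, simplified stand-ins for DataFusion's types.
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Int32,
    Int64,
}

#[derive(Debug, Clone, PartialEq)]
struct Schema {
    column_types: Vec<DataType>,
}

#[allow(dead_code)]
#[derive(Debug, PartialEq)]
struct ValuesPlan {
    rows: Vec<Vec<i64>>,
    column_types: Vec<DataType>,
}

struct PlanBuilder;

impl PlanBuilder {
    /// Existing entry point, kept so that current callers keep compiling.
    fn values(rows: Vec<Vec<i64>>) -> Result<ValuesPlan, String> {
        Self::values_with_schema(rows, None)
    }

    /// New entry point: use the declared column types when a schema is
    /// given, otherwise fall back to the default integer type (Int64 here).
    fn values_with_schema(
        rows: Vec<Vec<i64>>,
        schema: Option<&Schema>,
    ) -> Result<ValuesPlan, String> {
        let width = rows.first().map_or(0, |row| row.len());
        if rows.iter().any(|row| row.len() != width) {
            return Err("all rows must have the same number of columns".to_string());
        }
        let column_types = match schema {
            Some(s) if s.column_types.len() != width => {
                return Err(format!(
                    "schema has {} columns but rows have {}",
                    s.column_types.len(),
                    width
                ));
            }
            Some(s) => s.column_types.clone(),
            None => vec![DataType::Int64; width],
        };
        Ok(ValuesPlan { rows, column_types })
    }
}
```

Callers that have no schema keep calling `values(rows)`, while the CREATE TABLE path can pass `Some(&schema)` to the new function.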

alamb · 2024-10-22T18:28:44Z

datafusion/expr/src/logical_plan/builder.rs

+        }
+    }
+
+    fn infer_from_schema(values: Vec<Vec<Expr>>, schema: &DFSchema) -> Result<Self> {


Maybe we can give this a name to make it clear it is related to VALUES processing

Suggested change

-    fn infer_from_schema(values: Vec<Vec<Expr>>, schema: &DFSchema) -> Result<Self> {
+    fn infer_values_from_schema(values: Vec<Vec<Expr>>, schema: &DFSchema) -> Result<Self> {

alamb · 2024-10-22T18:34:56Z

datafusion/sql/src/planner.rs

@@ -154,6 +156,7 @@ impl PlannerContext {
            ctes: HashMap::new(),
            outer_query_schema: None,
            outer_from_schema: None,
+            table_schema: None,


Maybe we can name it create_table_schema to reflect that it is used for CREATE TABLE processing?

alamb · 2024-10-22T18:35:46Z

datafusion/sqllogictest/test_files/group_by.slt

@@ -3360,7 +3360,8 @@ physical_plan
 05)--------CoalesceBatchesExec: target_batch_size=4
 06)----------RepartitionExec: partitioning=Hash([sn@0, amount@1], 8), input_partitions=8
 07)------------AggregateExec: mode=Partial, gby=[sn@0 as sn, amount@1 as amount], aggr=[]
-08)--------------MemoryExec: partitions=8, partition_sizes=[1, 0, 0, 0, 0, 0, 0, 0]
+08)--------------RepartitionExec: partitioning=RoundRobinBatch(8), input_partitions=1
+09)----------------MemoryExec: partitions=1, partition_sizes=[1]


(for other readers the explanation of why this changed is below at https://github.com/apache/datafusion/pull/12864/files#r1797538174)

alamb · 2024-10-22T18:37:09Z

datafusion/sqllogictest/test_files/struct.slt

@@ -392,12 +392,12 @@ create table t(a struct<r varchar, c int>, b struct<r varchar, c float>) as valu
 query T
 select arrow_typeof([a, b]) from t;
 ----
-List(Field { name: "item", data_type: Struct([Field { name: "r", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "c", data_type: Int32, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }]), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} })
+List(Field { name: "item", data_type: Struct([Field { name: "r", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "c", data_type: Float32, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }]), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} })


this seems much more correct to me -- "c" is float not Int32 👍

alamb · 2024-10-22T18:37:19Z

datafusion/sqllogictest/test_files/struct.slt

+select * from t;
+----
+{c0: a, c1: 1.0}
+{c0: b, c1: 2.3}


alamb · 2024-10-22T18:42:15Z

I merged up from main and resolved a clippy failure

findepi · 2024-10-22T20:48:26Z

datafusion/sql/src/planner.rs

@@ -154,6 +156,7 @@ impl PlannerContext {
            ctes: HashMap::new(),
            outer_query_schema: None,
            outer_from_schema: None,
+            table_schema: None,


Maybe we can name it create_table_schema to reflect that it is used for CREATE TABLE processing?

that would work too

findepi · 2024-10-22T20:49:38Z

datafusion/sql/src/values.rs

+        let schema = planner_context
+            .table_schema()
+            .unwrap_or(Arc::new(DFSchema::empty()));


That's a good point. I think sql_to_expr must still get an empty schema: the VALUES being constructed cannot refer to columns of the table being created.

findepi · 2024-10-22T20:52:23Z

datafusion/sql/src/values.rs

+        if schema.fields().is_empty() {
+            LogicalPlanBuilder::values(values, None)?.build()
+        } else {
+            LogicalPlanBuilder::values(values, Some(&schema))?.build()


How do we know that the VALUES being processed here should actually obey the schema of the "main table" of the query?

My point is that what we're doing here is legal only for certain query shapes involving CREATE TABLE + VALUES (e.g. the example above), but is not applicable to other query shapes involving CREATE TABLE and VALUES (e.g. CREATE TABLE + SELECT .... FROM VALUES).

Where is the logic guaranteeing that table-inferred schema gets applied in the former case, but not in the latter?

findepi · 2024-10-23T06:53:47Z

datafusion/optimizer/src/simplify_expressions/simplify_exprs.rs

@@ -417,7 +417,7 @@ mod tests {
            Box::new(lit(1)),
        ));
        let values = vec![vec![expr1, expr2]];
-        let plan = LogicalPlanBuilder::values(values, None)?.build()?;
+        let plan = LogicalPlanBuilder::values_with_schema(values, None)?.build()?;


just values(values) since schema is None?

findepi · 2024-10-23T06:53:50Z

datafusion/proto/src/logical_plan/mod.rs

@@ -282,7 +282,7 @@ impl AsLogicalPlan for LogicalPlanNode {
                        .map_err(|e| e.into())
                }?;

-                LogicalPlanBuilder::values(values, None)?.build()
+                LogicalPlanBuilder::values_with_schema(values, None)?.build()


just values(values) since schema is None?

findepi · 2024-10-23T06:57:49Z

This should fail to build the initial plan. Instead it fails at some later stages:

> EXPLAIN CREATE TABLE t(a int) AS VALUES (a + a);
+-----------+------+
| plan_type | plan |
+-----------+------+
+-----------+------+

It seems it's a regression

findepi · 2024-10-23T07:01:49Z

This is an invalid query, but it succeeds

> CREATE TABLE t(a int) AS SELECT x FROM (VALUES (a)) t(x) WHERE false;
0 row(s) fetched.
Elapsed 0.014 seconds.

It seems it's a regression

alamb · 2024-10-23T10:26:34Z

@findepi what is your suggested path forward?

From my perspective this PR improves several cases (queries that should run, but currently do not on main) and thus while perhaps not perfect it seems like an improvement even though there are additional areas potentially to improve.

So my personal suggestion is:

1. Address comments as best as possible
2. File ticket(s) to track additional issues that were identified during review
3. Merge this PR
4. Work on the additional issues as follow-on PRs

findepi · 2024-10-23T10:47:26Z

datafusion/sql/src/values.rs

+        if schema.fields().is_empty() {
+            LogicalPlanBuilder::values_with_schema(values, None)?.build()
+        } else {
+            LogicalPlanBuilder::values_with_schema(values, Some(&schema))?.build()


This remains correct only for VALUES that are direct input to CREATE TABLE.

findepi · 2024-10-23T10:52:11Z

I am not convinced that the code we should eventually have will be a natural evolution of the code here, so it's hard for me to judge whether this is a step in the right direction.
The problem is with how table-being-created's schema is passed via a "global variable" to some VALUES being planned, without checking whether those values are direct input to the table being created. They may or may not be.

How should this be solved? I think this is a question for you, @alamb, more than for me. But I can try to guess. For example, when planning CREATE TABLE we could check whether its input is VALUES and, in that case, bypass the generic sql_values_to_plan and call a new function with this new functionality.

What are your thoughts?
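
A minimal sketch of the guess above, under loud assumptions: `Query`, `ValuesTyping`, and `typing_for_create_table_input` are hypothetical names invented for illustration and do not exist in DataFusion. The only point is that the table schema is consulted solely when VALUES is the direct input of CREATE TABLE.

```rust
/// Hypothetical, heavily simplified query shapes.
#[allow(dead_code)]
#[derive(Debug)]
enum Query {
    /// A bare VALUES list, e.g. `VALUES ('a', 1)`.
    Values(Vec<Vec<String>>),
    /// `SELECT ... FROM <from>`; the input may itself contain VALUES.
    Select { from: Box<Query> },
}

/// How the literal types of a VALUES list should be determined.
#[derive(Debug, PartialEq)]
enum ValuesTyping {
    /// Coerce the literals to the declared column types of the new table.
    FromTableSchema,
    /// Plan VALUES on its own; the declared types apply to the query output only.
    FromLiterals,
}

/// Decide the typing for the input of `CREATE TABLE t(...) AS <input>`.
/// Only a VALUES list that is the direct input takes the schema-aware path.
fn typing_for_create_table_input(input: &Query) -> ValuesTyping {
    match input {
        Query::Values(_) => ValuesTyping::FromTableSchema,
        Query::Select { .. } => ValuesTyping::FromLiterals,
    }
}
```

Under this shape, `CREATE TABLE t(a int) AS VALUES (1)` takes the schema-aware path, while `CREATE TABLE t(a int) AS SELECT ... FROM (VALUES ...)` plans its inner VALUES without ever seeing the table schema.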

alamb · 2024-10-23T11:07:15Z

I am not convinced that the code we should eventually have will be a natural evolution of the code here, so it's hard for me to judge whether this is a step in the right direction.

In my mind the additional test cases that pass with this code but not on main represent the step forward

The internal implementation (aka the code in this PR) may well change over time / need a different structure than what is in this PR, but the end interface (aka what SQL can run / what the plans are) should be the same.

jayzhan211 · 2024-10-23T13:27:08Z

I think as long as there is no regression we should move on and file the ticket for the remaining issue. I guess what @findepi mentioned is something like insert into values() that is not direct input to the table 🤔 But that is not the problem I intend to solve, so I don't think there is any reason to block this change on another issue.

This remains correct only for VALUES that are direct input to CREATE TABLE.

The problem is with how table-being-created's schema is passed via a "global variable" to some VALUES being planned, without checking whether those values are direct input to the table being created. They may or may not be.

It would be nice if there were an example of this. I think it is something like insert into values()?

jayzhan211 · 2024-10-24T00:21:58Z

Thanks @alamb and @findepi

findepi · 2024-10-25T07:48:23Z

I think as long as there is no regression we should move on and file the ticket for the remaining issue.

Sorry for not following earlier, was on a full-day event yesterday.

There are regressions.
I kind of felt it's obvious from the way it works, sorry for not providing good examples earlier.

Before the change

> CREATE OR REPLACE TABLE t(a int) AS SELECT length(a) FROM (VALUES ('+123')) t(a); SELECT * FROM t;
0 row(s) fetched.
Elapsed 0.005 seconds.

+---+
| a |
+---+
| 4 |
+---+

> CREATE OR REPLACE TABLE t(a int) AS SELECT length(a) FROM (VALUES ('abcd')) t(a); SELECT * FROM t;
0 row(s) fetched.
Elapsed 0.078 seconds.

+---+
| a |
+---+
| 4 |
+---+

on current `main`

Wrong result:

> CREATE OR REPLACE TABLE t(a int) AS SELECT length(a) FROM (VALUES ('+123')) t(a); SELECT * FROM t;
0 row(s) fetched.
Elapsed 0.071 seconds.

+---+
| a |
+---+
| 3 |
+---+

Failure:

> CREATE OR REPLACE TABLE t(a int) AS SELECT length(a) FROM (VALUES ('abcd')) t(a); SELECT * FROM t;
Arrow error: Cast error: Cannot cast string 'abcd' to value of Int32 type

jayzhan211 · 2024-10-25T08:24:20Z

It seems the query was accidentally correct before this change, because we don't know the result of the function when we build up the Values plan. After this change, the value ('abcd') is incorrectly cast to the type of column a, instead of the result of the function being cast.

The ideal solution is to find the result type of the function and check whether it matches the column type.

By the way, I wonder whether this query is valid in Postgres or elsewhere.

Valid query in postgres
CREATE TABLE t AS SELECT length(a)::int AS a FROM (VALUES ('+123')) t(a);

Invalid query in postgres
CREATE TABLE t (a int) AS SELECT length(a)::int AS a FROM (VALUES ('+123')) t(a);

ERROR:  syntax error at or near "AS"
LINE 1: CREATE TABLE t(a int) AS SELECT length(a)::int AS a FROM (VA...

It seems there is no way to insert values while creating the table if the column types are declared 🤔
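
The check described above (compare the result type of the projected expression, not the type of the VALUES literal, with the declared column type) could look roughly like this. The `Expr` and `DataType` enums are hypothetical stand-ins, and the assumption that `length` of a Utf8 argument yields Int32 is stated here only for the sake of the sketch.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum DataType {
    Int32,
    Utf8,
}

/// Hypothetical, simplified expression tree.
#[allow(dead_code)]
#[derive(Debug)]
enum Expr {
    /// A string literal coming from VALUES, e.g. 'abcd'.
    StringLiteral(&'static str),
    /// length(<expr>); assumed here to return Int32 for a Utf8 argument.
    Length(Box<Expr>),
}

/// Result type of an expression in this toy model.
fn result_type(expr: &Expr) -> DataType {
    match expr {
        Expr::StringLiteral(_) => DataType::Utf8,
        Expr::Length(_) => DataType::Int32,
    }
}

/// Compare the type produced by the SELECT output expression with the
/// declared column type, leaving the inner VALUES literal untouched.
fn check_select_output(expr: &Expr, declared: DataType) -> Result<(), String> {
    let actual = result_type(expr);
    if actual == declared {
        Ok(())
    } else {
        Err(format!("expected {:?}, found {:?}", declared, actual))
    }
}
```

In this model `length('abcd')` matches a declared `int` column without the literal ever being cast, whereas checking the raw literal against `int` is what fails.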

findepi · 2024-10-25T08:45:57Z

It seems the query was accidentally correct before this change, because we don't know the result of the function when we build up the Values plan.

The Values plan is built for the VALUES ('abcd') part of the query. The type is known to be Utf8.
Then the surrounding query is planned, and the length(Utf8) function is known to return the length as a number.

What was accidental about this?

jayzhan211 · 2024-10-25T10:59:33Z

It seems the query was accidentally correct before this change, because we don't know the result of the function when we build up the Values plan.

The Values plan is built for the VALUES ('abcd') part of the query. The type is known to be Utf8.
Then the surrounding query is planned, and the length(Utf8) function is known to return the length as a number.

What was accidental about this?

I think mistakenly output the values on the wrong branch.

Let's find such a valid query in Postgres to confirm that we need to support this kind of query in DataFusion, and add a test to ensure coverage.

alamb · 2024-10-26T12:07:26Z

I have filed #13124 to track the issues raised above explicitly

findepi · 2024-10-26T17:20:59Z

thank you @alamb

jayzhan211 added 2 commits October 11, 2024 17:53

first draft

4fb0814

Signed-off-by: jayzhan211 <[email protected]>

cleanup

062f118

Signed-off-by: jayzhan211 <[email protected]>

github-actions bot added the sql (SQL Planner), logical-expr (Logical plan and expressions), optimizer (Optimizer rules), sqllogictest (SQL Logic Tests (.slt)), common (Related to common crate), proto (Related to proto crate), and functions labels Oct 11, 2024

jayzhan211 added 2 commits October 11, 2024 19:54

add values table without schema

08e6bf3

Signed-off-by: jayzhan211 <[email protected]>

cleanup

ee7880e

Signed-off-by: jayzhan211 <[email protected]>

jayzhan211 marked this pull request as ready for review October 11, 2024 13:36

Merge branch 'main' of https://github.com/apache/datafusion into valu…

9fc3415

…es-schema

jayzhan211 requested a review from alamb October 18, 2024 06:24

jayzhan211 requested a review from avantgardnerio October 18, 2024 06:25

fmt

66d29e1

Signed-off-by: jayzhan211 <[email protected]>

jayzhan211 marked this pull request as draft October 19, 2024 00:59

rm unused import

17ffd2b

Signed-off-by: jayzhan211 <[email protected]>

jayzhan211 marked this pull request as ready for review October 21, 2024 08:21

jayzhan211 marked this pull request as draft October 21, 2024 08:25

jayzhan211 added 2 commits October 21, 2024 16:29

Merge branch 'main' of https://github.com/apache/datafusion into valu…

8207282

…es-schema

fmt

afee3b7

Signed-off-by: jayzhan211 <[email protected]>

jayzhan211 marked this pull request as ready for review October 21, 2024 09:39

alamb approved these changes Oct 22, 2024

Fix clippy

47fa782

findepi suggested changes Oct 22, 2024

add values back and rename

a0eb9ca

Signed-off-by: jayzhan211 <[email protected]>

findepi reviewed Oct 23, 2024

invalid query

3310cbf

Signed-off-by: jayzhan211 <[email protected]>

findepi reviewed Oct 23, 2024

use values if no schema

b9b8524

Signed-off-by: jayzhan211 <[email protected]>

github-actions bot removed the optimizer (Optimizer rules) label Oct 23, 2024

add doc

4e0056c

Signed-off-by: jayzhan211 <[email protected]>

jayzhan211 merged commit 18b2aaa into apache:main Oct 24, 2024
26 checks passed

jayzhan211 deleted the values-schema branch October 24, 2024 00:21

findepi mentioned this pull request Oct 26, 2024

Release DataFusion 43.0.0 #12470

Open

4 tasks

alamb mentioned this pull request Oct 26, 2024

Error with type coercion with CREATE TABLE AS SELECT ... inserting VALUES #13124

Open
