Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support unparsing plans after applying optimize_projections rule #13267

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

sgrebnov
Copy link
Member

@sgrebnov sgrebnov commented Nov 5, 2024

Which issue does this PR close?

The optimize_projections optimization is very useful as it pushes down projections to the TableScan and ensures only required columns are fetched. This is useful when used alongside unparsing scenarios for plans involving multiple data sources, as the plan must be optimized to push down projections and fetch only the required columns. The downside of this process is that the rule modifies the original plan in a way that makes it difficult to unparse, and the resultant plan is not always optimal or efficient for unparsing use cases, for example

https://gist.github.com/sgrebnov/5071d2834e812b62bfdf434cf7e7e54c

Original query (TPC-DS Q72)

select  i_item_desc
      ,w_warehouse_name
      ,d1.d_week_seq
      ,sum(case when p_promo_sk is null then 1 else 0 end) no_promo
      ,sum(case when p_promo_sk is not null then 1 else 0 end) promo
      ,count(*) total_cnt
from catalog_sales
join inventory on (cs_item_sk = inv_item_sk)
join warehouse on (w_warehouse_sk=inv_warehouse_sk)
join item on (i_item_sk = cs_item_sk)
join customer_demographics on (cs_bill_cdemo_sk = cd_demo_sk)
join household_demographics on (cs_bill_hdemo_sk = hd_demo_sk)
join date_dim d1 on (cs_sold_date_sk = d1.d_date_sk)
join date_dim d2 on (inv_date_sk = d2.d_date_sk)
join date_dim d3 on (cs_ship_date_sk = d3.d_date_sk)
left outer join promotion on (cs_promo_sk=p_promo_sk)
left outer join catalog_returns on (cr_item_sk = cs_item_sk and cr_order_number = cs_order_number)
where d1.d_week_seq = d2.d_week_seq
  and inv_quantity_on_hand < cs_quantity
  and d3.d_date > d1.d_date + INTERVAL '5 days'
  and hd_buy_potential = '501-1000'
  and d1.d_year = 1999
  and cd_marital_status = 'S'
group by i_item_desc,w_warehouse_name,d1.d_week_seq
order by total_cnt desc, i_item_desc, w_warehouse_name, d_week_seq
 LIMIT 100;

Plan and query after applying optimize_projections rule. Notice the additional projections added after joins.

image
select
	"i_item_desc",
	"w_warehouse_name",
	"d_week_seq",
	sum(case when "p_promo_sk" is null then 1 else 0 end) as "no_promo",
	sum(case when "p_promo_sk" is not null then 1 else 0 end) as "promo",
	count(1) as "total_cnt"
from
	(
	select
		"w_warehouse_name",
		"i_item_desc",
		"d_week_seq",
		"p_promo_sk"
	from
		(
		select
			"cs_item_sk",
			"cs_order_number",
			"w_warehouse_name",
			"i_item_desc",
			"d_week_seq",
			"promotion"."p_promo_sk"
		from
			(
			select
				"cs_item_sk",
				"cs_promo_sk",
				"cs_order_number",
				"w_warehouse_name",
				"i_item_desc",
				"d_week_seq"
			from
				(
				select
					"cs_ship_date_sk",
					"cs_item_sk",
...

Rationale for this change

To support unparsing plans after optimize_projections is applied, it is proposed to add the optimize_projections_preserve_existing_projections configuration option to prevent the optimization logic from creating or removing projections and to preserve the original structure. It ensures the query layout remains simple and readable, relying on the underlying SQL engine to apply its own optimizations during execution.

Are these changes tested?

Added test for optimize_projections_preserve_existing_projections configuration option. Unparsing have been tested by running all TPC-H and TPC-DS queries with optimization_projections enabled.

Are there any user-facing changes?

Yes, a new optimize_projections_preserve_existing_projections configuration option has been introduced, which can be specified via SessionConfig or at a lower level using OptimizerContext::new_with_options.

SessionStateBuilder::new()
  .with_config(
      SessionConfig::new().with_optimize_projections_preserve_existing_projections(true),
  )
  .build();

Or

let mut config = ConfigOptions::new();
config
    .optimizer
    .optimize_projections_preserve_existing_projections =
    preserve_projections;
let optimizer_context = OptimizerContext::new_with_options(config);

There are no changes in default behavior.

@github-actions github-actions bot added sql SQL Planner optimizer Optimizer rules common Related to common crate execution Related to the execution crate labels Nov 5, 2024

/// When set to true, the `optimize_projections` rule will not attempt to move, add, or remove existing projections.
/// This flag helps maintain the original structure of the `LogicalPlan` when converting it back into SQL via the `unparser` module. It ensures the query layout remains simple and readable, relying on the underlying SQL engine to apply its own optimizations during execution.
pub optimize_projections_preserve_existing_projections: bool, default = false
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might it be better to make this flag more generic, for example, just preserve_existing_projections or prefer_existing_plan_nodes, so it can be reused in the future in similar cases

// Avoid creating a duplicate Projection node, which would result in an additional subquery if a projection already exists.
// For example, if the `optimize_projection` rule is applied, there will be a Projection node, and duplicate projection
// information included in the TableScan node.
if !already_projected {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This prevents from generating queries like SELECT a, b from (SELECT a, b from my_table).

@@ -882,6 +882,7 @@ fn test_table_scan_pushdown() -> Result<()> {
let query_from_table_scan_with_projection = LogicalPlanBuilder::from(
table_scan(Some("t1"), &schema, Some(vec![0, 1]))?.build()?,
)
.project(vec![col("id"), col("age")])?
.project(vec![wildcard()])?
Copy link
Member Author

@sgrebnov sgrebnov Nov 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this actually be a real plan with a wildcard projection and a TableScan that includes only two columns? I would expect them to match. If this is a real use case I will improve logic above (check for parent projection is a wildcard or does not match). Running all TPC-H and TPC-DS queries I've not found query where it was the case.
https://github.com/apache/datafusion/pull/13267/files#r1830100022

@github-actions github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Nov 6, 2024
@github-actions github-actions bot added the documentation Improvements or additions to documentation label Nov 6, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
common Related to common crate documentation Improvements or additions to documentation execution Related to the execution crate optimizer Optimizer rules sql SQL Planner sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant