
[SPARK-45760][SQL] Add With expression to avoid duplicating expressions #43623

Closed
wants to merge 5 commits

Conversation

cloud-fan (Contributor)

What changes were proposed in this pull request?

Sometimes we need to duplicate expressions when rewriting the plan. This is OK for small queries, as codegen has common subexpression elimination (CSE) to avoid evaluating the same expression repeatedly. However, when the query is big, duplicating expressions can lead to a very large expression tree, make Catalyst rules very slow, or even cause OOM when updating a leaf node (which requires copying all tree nodes).

This PR introduces a new expression that does expression-level CTE: a Project is added to pre-evaluate the common expressions, so that they appear only once in the query plan tree and are evaluated only once. NullIf now uses this new expression to avoid duplicating its left child expression.
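
To make the mechanism concrete, here is a minimal, self-contained Scala sketch of the idea; the mini IR, `CommonExprRef`, and the `With.apply` helper are illustrative stand-ins, not Catalyst's actual classes or signatures. The common sub-expression is defined once, the main expression refers to it by ID, and a rewrite can later turn each definition into a pre-evaluated Project column, replacing every reference with the corresponding attribute.

```scala
// Illustrative mini expression IR -- not Spark's Catalyst classes.
object WithSketch {
  sealed trait Expr
  case class Lit(v: Any) extends Expr
  case class Attr(name: String) extends Expr
  case class EqualTo(l: Expr, r: Expr) extends Expr
  case class If(cond: Expr, ifTrue: Expr, ifFalse: Expr) extends Expr

  // A reference to a common expression, identified by ID instead of a copied subtree.
  case class CommonExprRef(id: Long) extends Expr
  // `With` carries each definition once; the child only holds lightweight references.
  case class With(defs: Map[Long, Expr], child: Expr) extends Expr

  object With {
    private var counter = 0L
    // Hypothetical helper: bind one common expression and build the main
    // expression from a reference to it, so the definition is never duplicated.
    def apply(common: Expr)(f: Expr => Expr): With = {
      counter += 1
      With(Map(counter -> common), f(CommonExprRef(counter)))
    }
  }

  // nullif(a, b) = if (a = b) null else a, written without duplicating `a`.
  def nullIf(a: Expr, b: Expr): Expr =
    With(a) { ref => If(EqualTo(ref, b), Lit(null), ref) }

  def main(args: Array[String]): Unit =
    println(nullIf(Attr("col1"), Lit(0)))
}
```

Per the description above, the actual rule in this PR (RewriteWithExpression) performs the analogous step on real Catalyst plans: it pulls the common-expression definitions into an added Project and replaces the references with the projected attributes.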

Why are the changes needed?

Make Catalyst more efficient.

Does this PR introduce any user-facing change?

No

How was this patch tested?

new test suite

Was this patch authored or co-authored using generative AI tooling?

No

github-actions bot added the SQL label Nov 1, 2023
@cloud-fan (Contributor Author)

cc @viirya @wangyum

}

var exprsToAdd = commonExprs.toSeq
val newChildren = p.children.map { child =>
Member

Shouldn't it be?

Suggested change
val newChildren = p.children.map { child =>
val newChildren = newPlan.children.map { child =>

cloud-fan (Contributor Author)

It's the same: the new plan was produced by transformExpressionsDown, so the children won't change.

}

if (exprsToAdd.nonEmpty) {
// If we cannot rewrite the common expressions, force to inline them so that the query
Member

When can we not rewrite them?

@viirya (Member) left a comment

This looks good. It is like an explicit/manual CSE? I wonder if we want to expose this as an expression function?

@cloud-fan (Contributor Author)

@viirya Yes, we can add SQL syntax in the future, following https://github.com/google/zetasql/blob/a745bef47b315bb11fecab4eeefa2bcc41be5951/docs/operators.md?plain=1#L2865

@beliefer (Contributor) commented Nov 2, 2023

Why not add an optimizer rule to find the common expressions and insert a Project to pre-evaluate?
It seems this PR requires us to update each expression to use the With expression one by one.

@cloud-fan (Contributor Author)

Why not add an optimizer rule to find the common expressions and insert a Project to pre-evaluate?

We need this With expression anyway to support the WITH syntax in Google BigQuery, and it's safer to adopt it incrementally to avoid expression duplication. It's also easier to build it incrementally, as agg/window function support can be added later.

@cloud-fan (Contributor Author)

BTW, I feel it's useful to have a way to do explicit/manual CSE, instead of relying on optimizer features or codegen features.

object Helper extends RuleExecutor[LogicalPlan] {
  val batches =
    Batch("Finish Analysis", Once, ReplaceExpressions) ::
    Batch("Rewrite With expression", FixedPoint(10), RewriteWithExpression) :: Nil
Member

Please fix the Scala style.

@viirya (Member) left a comment

Looks good. Seems there are still test failures?

@peter-toth (Contributor)

Looks good to me, but I think we can make the new rule idempotent with a small refactor: cloud-fan#19

@Kimahriman (Contributor)

What's the longer-term goal with this, especially in the context of all the attempts to add CSE to conditional expressions? Is the idea to make a with Column expression that can be used manually? Honestly, when I started with Spark I assumed this was the default behavior if you re-used a Column object multiple times in an expression.

Just out of curiosity, since this is basically manual CSE at the optimizer stage, would it make sense to do the existing fully recursive, automatic CSE at the optimizer stage instead of the physical stage, to achieve a similar effect?

@cloud-fan (Contributor Author)

What's the longer term goal with this, especially in the context of all the attempts to add CSE to conditional expressions?

I think the final state should be implementing CSE at the logical plan level, so that it works for both the codegen backend and native (vectorized) backends. But we still have gaps now: the With expression only works for operators that can add extra Projects, so it doesn't work for join conditions, while codegen is more flexible. Also, codegen CSE is adaptive and can be used inside conditional expressions, but With is static. Until those gaps are filled, With can't replace the current codegen CSE.
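
As a plain-Scala illustration of that static-vs-adaptive gap (nothing here is Spark code; `expensive` is a hypothetical stand-in for a costly expression): pre-evaluating a common expression up front always pays its cost, while inlining, or codegen's adaptive CSE, only pays it when the conditional branch is actually taken.

```scala
object StaticVsAdaptiveCse {
  // Hypothetical stand-in for a costly expression.
  def expensive(x: Int): Int = { Thread.sleep(1); x * x }

  // Static CSE: the common expression is pulled into an up-front binding
  // (what a Project-based rewrite does), so its cost is paid even when the
  // conditional branch that uses it is never taken.
  def preEvaluated(x: Int, rare: Boolean): Int = {
    val common = expensive(x) // always evaluated
    if (rare) common + common else 0
  }

  // Inlined: the cost is only paid on the rare branch (twice there, unless a
  // runtime/codegen CSE deduplicates it adaptively).
  def inlined(x: Int, rare: Boolean): Int =
    if (rare) expensive(x) + expensive(x) else 0
}
```

This is the same trade-off addressed by the follow-up commit referenced later in this thread, which inlines With inside conditional branches instead of pulling the definitions into a Project.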

* make RewriteWithExpression idempotent

* restore def index usage in alias, minor change to shorten code
@cloud-fan (Contributor Author)

The doc generation issue is unrelated to my PR

ImportError: Warning: Latest version of pandas (2.1.2) is required to generate the documentation; however, your version was 2.1.1

I think we need to upgrade pandas version on GA machines. cc @HyukjinKwon @LuciferYang

@cloud-fan (Contributor Author)

I'm merging it to master, thanks for the reviews!

cloud-fan closed this in 01c294b Nov 7, 2023
@LuciferYang (Contributor)

The doc generation issue is unrelated to my PR

ImportError: Warning: Latest version of pandas (2.1.2) is required to generate the documentation; however, your version was 2.1.1

I think we need to upgrade pandas version on GA machines. cc @HyukjinKwon @LuciferYang

Already upgraded: #43689 :)

cloud-fan added a commit that referenced this pull request Nov 28, 2023
### What changes were proposed in this pull request?

This is a followup of #43623 to fix a regression. For `With` inside conditional branches, they may not be evaluated at all and we should not pull out the common expressions into a `Project`, but just inline.

### Why are the changes needed?

avoid perf regression

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

new test

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #43978 from cloud-fan/with.

Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>