-
Notifications
You must be signed in to change notification settings - Fork 28.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-49679][SQL] validateSchemaOutput should refer to case sensitivity flag #48127
Conversation
…the result in validateSchemaOutput
cc: @xil-db for review. And do you have any suggestions on where to put the testcases for this patch? Thx for any help~ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can add a test to sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/OptimizerSuite.scala
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, cc @cloud-fan
cc: @cloud-fan for merge, thx~ |
I don't quite agree with this. The underlying catalog/data source may be case-sensitive and It's important to keep the query schema(column names) unchanged |
Yeah, I think that's a fair point. Don't have the full context here, I assume the intention of this PR is to avoid certain "false-alarms" raised by this validation, but if Maybe I'm missing something, maybe it'd be better to just fix |
Sorry I should state more clearly about the context. So here's the context, users have this query running under
Though they have stated ppmonth in two different ways, ppmonth and Ppmonth, as they are running under case insensitive mode, they expect this query to run successfully. However, they got error message saying that: And this error message is because AggregatePushDownThroughJoins are trying to push down aggregation through join operator and change the plan from:
To
And there's nothing wrong with The error appear because we're checking the schema in case sensitive mode even the caseSensitivity flag is set to false. And DataType actually provide caseSensitiveCheck and caseInsensitiveCheck, we just need to use corresponding checking method according to the config in validateSchemaOutput. I understand there might be risk introducing rules which actually change the case of schema, but as it can only pass the check under case insensitive mode, it should not raise much concerns as under case insensitive mode, it is acceptable IIUC. Please correct me if my understanding is wrong. For example, there might be high risks introduced by this new schema check. |
no matter Spark is case-sensitive or not, it must be case-preserving. We need to fix the optimizer rule that breaks case-preserving. |
I see. But for this case, |
Is it really false alarm? Seems the rule changed |
icic. I can rewrite AggregatePushThroughJoins to separate the usage of group by expressions and projected lists making generated Projection using projected expression and push down aggregatePart using group by expressions. |
@cloud-fan it turns out Analyzer are using different validatePlanChanges implementation only call validateExprIdUniqueness. And this validateSchemaOutput is only used by Optimizer. I think this pr align with our discussion. We'll slack the constraints for optimizer and maintain Analyzer's check. In this case, can we merge the pr? |
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala
Outdated
Show resolved
Hide resolved
closed as we wanna keep the alarm. |
What changes were proposed in this pull request?
validateSchemaOutput check case sensitivity flag and pick comparison methods according to the result.
Why are the changes needed?
If we're using
spark.sql.caseSensitive
set to false, we should accept queries like this:However, validateSchemaOutput in optimizer's checks about plan schema changes does not use this flag, which leads to a situation that some queries will fail this check even if the optimization is correct. Take this query as an example:
After AggregatePushdownThroughJoins, the schema Ppmonth does not match with schema ppmonth in validateSchemaOutput and it then fails this check.
We need to use this flag in validateSchemaOutput.
Does this PR introduce any user-facing change?
yes, the above query is accepted and run successfully.
How was this patch tested?
WIP
Was this patch authored or co-authored using generative AI tooling?
No