
[BUG] udfCompiler produced a wrong analyzed logical plan in a UDF case on Spark 3.4.0+ #10381

Open
GaryShen2008 opened this issue Feb 6, 2024 · 2 comments
Assignees
Labels
bug Something isn't working

Comments

@GaryShen2008
Collaborator

Describe the bug
When spark.rapids.sql.udfCompiler.enabled=true is set on Spark 3.4.0+, one unit-test case fails with a different result.

Steps/Code to reproduce bug
Start a Spark 3.4.0+ spark-shell with plugin 23.12.1+ and enable the UDF compiler:
spark-shell --conf spark.plugins=com.nvidia.spark.SQLPlugin --conf spark.rapids.sql.udfCompiler.enabled=true

Paste the following code:

import org.apache.spark.sql.{Dataset, Row, SparkSession}
import org.apache.spark.sql.functions._
import org.apache.spark.sql.functions.{udf => makeUdf}

val myudf: (String, String) => String = (a,b) => {
   if (null==a) {
      a
   } else {
      b
   }
}
val u = makeUdf(myudf)
val dataset = List(("","z")).toDF("x","y")
val result = dataset.withColumn("new", u(col("x"),col("y")))
val ref = dataset.withColumn("new", lit("z"))

result.show()
ref.show()

result.explain(true)

Expected behavior
The test should pass: since x is the empty string (non-null), the UDF should return y, so both result and ref should show "z" in the new column.

Environment details (please complete the following information)

  • Environment location: spark local
  • Spark configuration settings related to the issue: spark.rapids.sql.udfCompiler.enabled=true

Additional context
The issue doesn't happen on Spark 3.3.2 but has been observed since Spark 3.4.0.
The analyzed logical plan becomes wrong on Spark 3.4.0: the UDF is replaced by a bare reference to x#10 instead of the if/else expression.

== Parsed Logical Plan ==
'Project [x#10, y#11, UDF('x, 'y) AS new#14]
+- Project [_1#5 AS x#10, _2#6 AS y#11]
   +- LocalRelation [_1#5, _2#6]

== Analyzed Logical Plan ==
x: string, y: string, new: string
Project [x#10, y#11, x#10 AS new#14]
+- Project [_1#5 AS x#10, _2#6 AS y#11]
   +- LocalRelation [_1#5, _2#6]

== Optimized Logical Plan ==
LocalRelation [x#10, y#11, new#14]

== Physical Plan ==
LocalTableScan [x#10, y#11, new#14]

The output on Spark 3.3.2.

== Parsed Logical Plan ==
'Project [x#10, y#11, UDF('x, 'y) AS new#14]
+- Project [_1#5 AS x#10, _2#6 AS y#11]
   +- LocalRelation [_1#5, _2#6]

== Analyzed Logical Plan ==
x: string, y: string, new: string
Project [x#10, y#11, if (NOT isnotnull(x#10)) x#10 else y#11 AS new#14]
+- Project [_1#5 AS x#10, _2#6 AS y#11]
   +- LocalRelation [_1#5, _2#6]

== Optimized Logical Plan ==
LocalRelation [x#10, y#11, new#14]

== Physical Plan ==
LocalTableScan [x#10, y#11, new#14]
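For comparison, here is a sketch (illustration only, not the plugin's actual output) of the non-UDF column expression that the correct Spark 3.3.2 analyzed plan corresponds to. The names `expected` and `broken` are hypothetical; the expressions are built with the standard Spark Column API:

```scala
import org.apache.spark.sql.functions.{col, when}

// Equivalent of the correct 3.3.2 translation:
// if (NOT isnotnull(x#10)) x#10 else y#11
val expected = when(col("x").isNull, col("x")).otherwise(col("y"))

// Equivalent of the wrong 3.4.0 translation, which drops the
// conditional entirely and just projects the first argument:
val broken = col("x")
```

For the repro row ("", "z"), "x" is non-null, so `expected` evaluates to "z" while `broken` evaluates to "", which matches the differing results described above.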
@GaryShen2008 added the "bug" and "? - Needs Triage" labels on Feb 6, 2024
@mattahrens removed the "? - Needs Triage" label on Feb 6, 2024
@mattahrens added the "? - Needs Triage" label on Nov 1, 2024
@abellina
Collaborator

abellina commented Nov 1, 2024

This issue can cause silent data corruption, according to the information pasted above. If the only way to detect it is to notice, after the fact, that a query's explain output shows a plan the user didn't intend, that's really bad. We should discourage use of the UDF compiler, at least on Spark 3.4.0+.
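A minimal sketch of the suggested mitigation, assuming the config key from the repro above: keep the UDF compiler turned off on affected Spark versions.

```shell
# Workaround sketch: run with the UDF compiler explicitly disabled
# on Spark 3.4.0+ so the UDF executes as-is instead of being compiled
# into a (possibly wrong) Catalyst expression.
spark-shell --conf spark.plugins=com.nvidia.spark.SQLPlugin \
  --conf spark.rapids.sql.udfCompiler.enabled=false
```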

@mattahrens removed the "? - Needs Triage" label on Nov 5, 2024
@zpuller
Collaborator

zpuller commented Nov 8, 2024

I'm unable to reproduce this. Using the latest version of the plugin (24.12) on Spark 3.3, 3.4, and 3.5, I always get the following plan:

== Parsed Logical Plan ==
'Project [x#10, y#11, UDF('x, 'y) AS new#14]
+- Project [_1#5 AS x#10, _2#6 AS y#11]
   +- LocalRelation [_1#5, _2#6]

== Analyzed Logical Plan ==
x: string, y: string, new: string
Project [x#10, y#11, UDF(x#10, y#11) AS new#14]
+- Project [_1#5 AS x#10, _2#6 AS y#11]
   +- LocalRelation [_1#5, _2#6]

== Optimized Logical Plan ==
LocalRelation [x#10, y#11, new#14]

== Physical Plan ==
LocalTableScan [x#10, y#11, new#14]

I verified this config as well:

scala> spark.conf.get("spark.rapids.sql.udfCompiler.enabled")
res4: String = true
