[BUG] JSON reads should check for partial results config #11254

jlowe · 2024-07-25T15:43:09Z

Since Spark 3.4.0, Spark can return partial results when parsing structs, maps, or arrays in JSON where one or more fields do not match the schema; see spark.sql.json.enablePartialResults. We do not check this config during tagging/converting, so we probably support only one of the two modes (enabled or disabled). We should check this setting and either support both modes or disable GPU JSON reads when it is set to the unsupported mode.

ttnghia · 2024-10-30T21:32:53Z

I did a quick test:

scala> val df = Seq("{\"a\": {\"x\": 1, \"y\": true}, \"b\": {\"x\": 1}}", "{\"a\": {\"x\": 2}, \"b\": {\"x\": 2}}").toDF
df: org.apache.spark.sql.DataFrame = [value: string]

scala> df.show(false)
24/10/30 21:29:36 WARN GpuOverrides: 
! <LocalTableScanExec> cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.execution.LocalTableScanExec
  @Expression <AttributeReference> value#6 could run on GPU

+-----------------------------------------+
|value                                    |
+-----------------------------------------+
|{"a": {"x": 1, "y": true}, "b": {"x": 1}}|
|{"a": {"x": 2}, "b": {"x": 2}}           |
+-----------------------------------------+

df.repartition(1).selectExpr("from_json(value, 'a struct<x: int, y: struct<x: int>>, b struct<x: int>')").show(false)
....
*Expression <JsonToStructs> from_json(...) will run on GPU
...
+----------------+
|from_json(value)|
+----------------+
|{{1, null}, {1}}|
|{{2, null}, {2}}|
+----------------+

Ref:

So it seems that we always behave as if enablePartialResults is enabled. This is cudf behavior; we need support from them to make disabling it an option.

revans2 · 2024-10-31T14:22:01Z

In the short term we are just going to fall back to the CPU for this.

https://github.com/apache/spark/blob/60d68034e6a7eb4ebade4217b398c95bd8b06eb2/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L4442-L4449

spark.sql.json.enablePartialResults is a Spark internal config that defaults to true. So very few people should be setting it, which means we have very little reason to put in the work to support it being disabled.
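
For illustration only, the fallback could look roughly like the sketch below. Apart from the config key itself, every name here (`JsonPartialResultsCheck`, `fallbackReason`, the plain `Map` standing in for the session conf) is hypothetical and not taken from the plugin code; it only shows the shape of the check: treat a missing key as the default (true), and report a reason for CPU fallback when it is explicitly false.

```scala
// Hypothetical sketch: these names are NOT from the spark-rapids code base.
object JsonPartialResultsCheck {
  // Real Spark config key (internal; defaults to true since Spark 3.4.0).
  val PartialResultsKey = "spark.sql.json.enablePartialResults"

  /**
   * Returns Some(reason) when the JSON read should fall back to the CPU,
   * or None when the GPU path matches the configured behavior.
   * A plain Map stands in for the session configuration here.
   */
  def fallbackReason(conf: Map[String, String]): Option[String] = {
    // A missing key means the default applies, which is "true".
    val enabled = conf.get(PartialResultsKey).forall(_.trim.toBoolean)
    if (enabled) None
    else Some(
      s"GPU JSON parsing always returns partial results, but $PartialResultsKey is false")
  }
}
```

In the plugin, the returned reason would presumably be handed to whatever mechanism marks the expression as not GPU-capable during tagging, so the fallback shows up in the explain output.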

jlowe added the bug and ? - Needs Triage labels on Jul 25, 2024

mattahrens added the Spark 3.4+ label and removed the ? - Needs Triage label on Jul 30, 2024

mattahrens assigned jihoonson Aug 12, 2024

sameerz assigned revans2 and unassigned jihoonson Oct 30, 2024
