[BUG] JSON reads should check for partial results config #11254

Open
jlowe opened this issue Jul 25, 2024 · 2 comments
Labels: bug, Spark 3.4+

Comments

@jlowe
Member

jlowe commented Jul 25, 2024

Since Spark 3.4.0, Spark supports partial parsing of structs, maps, or arrays in JSON when one or more fields do not match the schema; see spark.sql.json.enablePartialResults. We do not check this config during tagging/converting, so we probably support only one of the two modes (enabled or disabled). We should check this setting and either support both modes or disable GPU JSON reads when it is configured in an unsupported mode.

jlowe added the "bug" and "? - Needs Triage" labels on Jul 25, 2024
mattahrens added the "Spark 3.4+" label and removed the "? - Needs Triage" label on Jul 30, 2024
sameerz assigned revans2 and unassigned jihoonson on Oct 30, 2024
@ttnghia
Collaborator

ttnghia commented Oct 30, 2024

I did a quick test:

scala> val df = Seq("{\"a\": {\"x\": 1, \"y\": true}, \"b\": {\"x\": 1}}", "{\"a\": {\"x\": 2}, \"b\": {\"x\": 2}}").toDF
df: org.apache.spark.sql.DataFrame = [value: string]

scala> df.show(false)
24/10/30 21:29:36 WARN GpuOverrides: 
! <LocalTableScanExec> cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.execution.LocalTableScanExec
  @Expression <AttributeReference> value#6 could run on GPU

+-----------------------------------------+
|value                                    |
+-----------------------------------------+
|{"a": {"x": 1, "y": true}, "b": {"x": 1}}|
|{"a": {"x": 2}, "b": {"x": 2}}           |
+-----------------------------------------+

df.repartition(1).selectExpr("from_json(value, 'a struct<x: int, y: struct<x: int>>, b struct<x: int>')").show(false)
....
*Expression <JsonToStructs> from_json(...) will run on GPU
...
+----------------+
|from_json(value)|
+----------------+
|{{1, null}, {1}}|
|{{2, null}, {2}}|
+----------------+

Ref:

So it seems that we always behave as if enablePartialResults is enabled. This is cudf behavior; we need support from cudf to make disabling it an option.
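To see what CPU Spark returns for the same input with partial results disabled, something like the following could be run in spark-shell. This is only a sketch: it assumes Spark 3.4+, that `spark` and its implicits are predefined by the shell, and that `spark.rapids.sql.enabled` is the plugin switch used to force the query onto the CPU. No output is shown because it was not captured in this thread.

```scala
// Sketch for spark-shell on Spark 3.4+; `spark` and its implicits are predefined there.
val df = Seq(
  """{"a": {"x": 1, "y": true}, "b": {"x": 1}}""",
  """{"a": {"x": 2}, "b": {"x": 2}}"""
).toDF

// Same schema as the test above: "y" is declared as a struct but row 1 holds a boolean.
val schema = "a struct<x: int, y: struct<x: int>>, b struct<x: int>"

// Assumption: the RAPIDS plugin is loaded; disable it so the query runs on the CPU.
spark.conf.set("spark.rapids.sql.enabled", "false")

// Turn partial results off, then parse with CPU Spark for comparison with the GPU output.
spark.conf.set("spark.sql.json.enablePartialResults", "false")
df.repartition(1).selectExpr(s"from_json(value, '$schema')").show(false)
```

Comparing this output with the GPU result above would show whether the two modes actually differ for this input.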

@revans2
Collaborator

revans2 commented Oct 31, 2024

In the short term we are just going to fall back to the CPU for this.

https://github.com/apache/spark/blob/60d68034e6a7eb4ebade4217b398c95bd8b06eb2/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L4442-L4449

spark.sql.json.enablePartialResults is a Spark internal config that defaults to true. Very few people should be setting it, which means we have very little reason to put in the work to support it when disabled.
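The fallback decision could look roughly like the sketch below. It is not the plugin's actual tagging code: `JsonPartialResultsCheck` and `fallbackReason` are hypothetical names, and the config is modeled as a plain `Map` so the logic can be read without a Spark dependency. Only the config key and its default of true come from this thread.

```scala
// Hypothetical sketch of the fallback decision; not the spark-rapids tagging API.
object JsonPartialResultsCheck {
  // Real Spark config key, as named in this issue.
  val PartialResultsKey = "spark.sql.json.enablePartialResults"

  // Returns None when the GPU JSON read can proceed,
  // or Some(reason) when the read should fall back to the CPU.
  def fallbackReason(conf: Map[String, String]): Option[String] = {
    // An unset key means the default (true). Any value other than "true"
    // is treated conservatively as "disabled" and triggers a fallback.
    val enabled = conf.get(PartialResultsKey).forall(_.trim.equalsIgnoreCase("true"))
    if (enabled) None
    else Some(s"$PartialResultsKey is disabled, which the GPU JSON reader does not support")
  }
}
```

Returning a reason string rather than a boolean keeps the message available for the "cannot run on GPU because ..." style of explanation shown in the log output earlier in the thread.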
