Rewrite Java API Table.readJSON to return the output from libcudf read_json directly
#17180
I love this change. My biggest problem is that it is a breaking change.
I put a number of changes into the Spark Plugin to provide the emptyRowCount, and fully removing those changes is not simple. So please either make sure this provides some backwards compatibility, so we can make the change in a few steps, or have the other PR ready to go so that there is very little downtime between the two.
This reverts commit a82fdb699a13008b878deaab18ae85a440cf05af.
Signed-off-by: Nghia Truong <[email protected]>
Changed to non-breaking, as the old Java methods are not removed in this PR. We can remove them later, once all the plugin code has completed its adaptation.
After this (with #17029), the overhead of reordering columns is significantly reduced (above is before this, and below is with this):
Signed-off-by: Nghia Truong <[email protected]>
/merge
With this PR, Table.readJSON will return the output from libcudf read_json directly, without reordering the columns to match the input schema and without generating all-null columns for the schema columns that do not exist in the JSON data. libcudf read_json already does both, so we no longer have to do it on the Java side.

Depends on:
Partially contributes to NVIDIA/spark-rapids#11560.
Closes #17002
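
To picture the post-processing described above (reordering columns to the input schema and filling in all-null columns for names missing from the JSON data), here is a minimal sketch. It uses plain Java collections in place of GPU columns, and the class name SchemaAlignSketch and method alignToSchema are hypothetical; this is not the cudf Java API or the code touched by this PR, only an illustration of the kind of work that no longer needs to happen on the Java side.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration: lists of objects stand in for table columns.
final class SchemaAlignSketch {

    // Returns the columns in schema order. A schema name that is absent
    // from `parsed` becomes a column holding `rowCount` nulls.
    static Map<String, List<Object>> alignToSchema(
            List<String> schemaNames,
            Map<String, List<Object>> parsed,
            int rowCount) {
        Map<String, List<Object>> aligned = new LinkedHashMap<>();
        for (String name : schemaNames) {
            List<Object> column = parsed.get(name);
            if (column == null) {
                // Column requested by the schema but not present in the data.
                column = new ArrayList<>(Collections.<Object>nCopies(rowCount, null));
            }
            aligned.put(name, column);
        }
        return aligned;
    }

    public static void main(String[] args) {
        // Parsed data arrives in a different order than the schema and lacks "c".
        Map<String, List<Object>> parsed = new LinkedHashMap<>();
        parsed.put("b", Arrays.<Object>asList(10, 20));
        parsed.put("a", Arrays.<Object>asList("x", "y"));

        Map<String, List<Object>> aligned =
                alignToSchema(Arrays.asList("a", "b", "c"), parsed, 2);
        System.out.println(aligned);
    }
}
```

When the reader already returns columns in schema order with missing ones filled in, a pass like alignToSchema over every table becomes redundant, which is the overhead this PR removes.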