-
Notifications
You must be signed in to change notification settings - Fork 902
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add optional column_order in JSON reader #17029
Add optional column_order in JSON reader #17029
Conversation
@ttnghia This PR is ready for testing. This will enforce the column order and also insert empty all-null columns if not present in the json data. |
cpp/src/io/json/json_column.cu
Outdated
child_columns.emplace_back(std::move(all_null_column)); | ||
continue; | ||
} | ||
column_names.emplace_back(found_col->first); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Above we have if (prune_columns and found_col == std::end
thus here if !prune_columns
then we still have found_col == std::end
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added prune_columns
condition to col_order
decision. This case won't happen.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
partial review, nothing blocking
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
did not expected such a large and robust feature, great stuff!
Got a several comments, mostly nits
changed logic of has_column_order used std::invalid_argument update auto to references at few places
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thank you for addressing all suggestions!
/merge |
…read_json` directly (#17180) With this PR, `Table.readJSON` will return the output from libcudf `read_json` directly without the need of reordering the columns to match with the input schema, as well as generating all-nulls columns for the ones in the input schema that do not exist in the JSON data. This is because libcudf `read_json` already does these thus we no longer have to do it. Depends on: * #17029 Partially contributes to NVIDIA/spark-rapids#11560. Closes #17002 Authors: - Nghia Truong (https://github.com/ttnghia) Approvers: - Robert (Bobby) Evans (https://github.com/revans2) URL: #17180
Description
This PR adds optional column order to enforce column order in the output. This feature is required by spark from_json.
Optional
column_order
is added toschema_element
, and it is validated during reader_option creation. The column order can be specified at root level and for any struct in any level.• For root level, the dtypes should be schema_element with type STRUCT. (schema_element is added to variant dtypes)
• For nested level, column_order can be specified for any STRUCT type. (could be a map of schema_element , or schema_element)
If the column order is not specified, the order of columns is same as the order of columns that appear in json file.
Closes #17240 (metadata updated)
Closes #17091 (will return all nulls column if not present in input json)
Closes #17090 (fixed with new schema_element as dtype)
Closes #16799 (output columns are created from column_order if present)
Checklist