Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add optional column_order in JSON reader #17029

Merged
merged 27 commits into from
Nov 8, 2024

Conversation

karthikeyann
Copy link
Contributor

@karthikeyann karthikeyann commented Oct 9, 2024

Description

This PR adds optional column order to enforce column order in the output. This feature is required by spark from_json.

Optional column_order is added to schema_element, and it is validated during reader_option creation. The column order can be specified at root level and for any struct in any level.
• For root level, the dtypes should be schema_element with type STRUCT. (schema_element is added to variant dtypes)
• For nested level, column_order can be specified for any STRUCT type. (could be a map of schema_element , or schema_element)
If the column order is not specified, the order of columns is same as the order of columns that appear in json file.

Closes #17240 (metadata updated)
Closes #17091 (will return all nulls column if not present in input json)
Closes #17090 (fixed with new schema_element as dtype)
Closes #16799 (output columns are created from column_order if present)

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@karthikeyann karthikeyann self-assigned this Oct 9, 2024
@github-actions github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label Oct 9, 2024
cpp/include/cudf/io/json.hpp Outdated Show resolved Hide resolved
cpp/include/cudf/io/json.hpp Outdated Show resolved Hide resolved
cpp/include/cudf/io/json.hpp Outdated Show resolved Hide resolved
@karthikeyann
Copy link
Contributor Author

@ttnghia This PR is ready for testing. This will enforce the column order and also insert empty all-null columns if not present in the json data.

child_columns.emplace_back(std::move(all_null_column));
continue;
}
column_names.emplace_back(found_col->first);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Above we have if (prune_columns and found_col == std::end thus here if !prune_columns then we still have found_col == std::end.

Copy link
Contributor Author

@karthikeyann karthikeyann Oct 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added prune_columns condition to col_order decision. This case won't happen.

@karthikeyann karthikeyann added improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Oct 24, 2024
@ttnghia ttnghia removed their assignment Nov 5, 2024
@karthikeyann karthikeyann changed the title add optional column_order to schema_element Add optional column_order to schema_element in JSON reader Nov 6, 2024
@karthikeyann karthikeyann changed the title Add optional column_order to schema_element in JSON reader Add optional column_order in JSON reader Nov 6, 2024
@karthikeyann karthikeyann added the 5 - Ready to Merge Testing and reviews complete, ready to merge label Nov 6, 2024
@vuule vuule self-requested a review November 6, 2024 17:13
Copy link
Contributor

@vuule vuule left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

partial review, nothing blocking

cpp/src/io/json/nested_json.hpp Show resolved Hide resolved
cpp/tests/io/json/json_test.cpp Show resolved Hide resolved
Copy link
Contributor

@vuule vuule left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

did not expected such a large and robust feature, great stuff!
Got a several comments, mostly nits

cpp/src/io/json/host_tree_algorithms.cu Outdated Show resolved Hide resolved
cpp/src/io/json/host_tree_algorithms.cu Show resolved Hide resolved
cpp/src/io/json/host_tree_algorithms.cu Outdated Show resolved Hide resolved
cpp/src/io/json/json_column.cu Outdated Show resolved Hide resolved
cpp/src/io/json/json_column.cu Outdated Show resolved Hide resolved
cpp/tests/io/json/json_test.cpp Show resolved Hide resolved
cpp/tests/io/json/json_test.cpp Outdated Show resolved Hide resolved
changed logic of has_column_order
used std::invalid_argument
update auto to references at few places
Copy link
Contributor

@vuule vuule left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thank you for addressing all suggestions!

@karthikeyann
Copy link
Contributor Author

/merge

@karthikeyann karthikeyann added Spark Functionality that helps Spark RAPIDS feature request New feature or request and removed improvement Improvement / enhancement to an existing function labels Nov 7, 2024
@rapids-bot rapids-bot bot merged commit b3b5ce9 into rapidsai:branch-24.12 Nov 8, 2024
102 checks passed
rapids-bot bot pushed a commit that referenced this pull request Nov 8, 2024
…read_json` directly (#17180)

With this PR, `Table.readJSON` will return the output from libcudf `read_json` directly without the need of reordering the columns to match with the input schema, as well as generating all-nulls columns for the ones in the input schema that do not exist in the JSON data. This is because libcudf `read_json` already does these thus we no longer have to do it.

Depends on:
 * #17029

Partially contributes to NVIDIA/spark-rapids#11560.
Closes #17002

Authors:
  - Nghia Truong (https://github.com/ttnghia)

Approvers:
  - Robert (Bobby) Evans (https://github.com/revans2)

URL: #17180
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
5 - Ready to Merge Testing and reviews complete, ready to merge feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change Spark Functionality that helps Spark RAPIDS
Projects
Status: Done
Status: Burndown
5 participants