Add optional column_order in JSON reader #17029

karthikeyann · 2024-10-09T18:03:33Z

Description

This PR adds an optional column_order to schema_element to enforce the column order in the output. This feature is required by Spark's from_json.

An optional column_order is added to schema_element, and it is validated during reader_option creation. The column order can be specified at the root level and for any struct at any level.
• For the root level, dtypes should be a schema_element of type STRUCT. (schema_element is added to the dtypes variant.)
• For nested levels, column_order can be specified for any STRUCT type. (It could be a map of schema_element, or a schema_element.)
If the column order is not specified, the columns are returned in the order in which they appear in the JSON file.
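
A minimal sketch of the behaviour described above, written against a toy type rather than libcudf itself. `toy_schema_element`, `validate`, `output_order`, and `example_root_order` are hypothetical names, and the validation rules shown (column_order only on STRUCT elements, every listed name must be a declared child) are assumptions for illustration, not a statement of what the reader actually checks.

```cpp
#include <cassert>
#include <map>
#include <optional>
#include <stdexcept>
#include <string>
#include <vector>

// Toy stand-in for a schema element (hypothetical, not the libcudf type).
struct toy_schema_element {
  bool is_struct = false;                                  // true models a STRUCT type
  std::map<std::string, toy_schema_element> child_types;   // declared children
  std::optional<std::vector<std::string>> column_order;    // optional output order
};

// Recursively validate a schema (assumed rules, see the lead-in):
// column_order may only appear on STRUCT elements and may only name declared children.
void validate(toy_schema_element const& schema)
{
  if (schema.column_order.has_value()) {
    if (!schema.is_struct) {
      throw std::invalid_argument("column_order requires a STRUCT element");
    }
    for (auto const& name : *schema.column_order) {
      if (schema.child_types.count(name) == 0) {
        throw std::invalid_argument("column_order names an undeclared child: " + name);
      }
    }
  }
  for (auto const& entry : schema.child_types) { validate(entry.second); }
}

// Output order for one struct level: the requested order if present,
// otherwise the order in which the columns appeared in the JSON input.
std::vector<std::string> output_order(toy_schema_element const& schema,
                                      std::vector<std::string> const& json_order)
{
  return schema.column_order.value_or(json_order);
}

// Usage: a root STRUCT with children "a" and "b", emitted as "b", "a".
void example_root_order()
{
  toy_schema_element root;
  root.is_struct = true;
  root.child_types["a"] = toy_schema_element{};
  root.child_types["b"] = toy_schema_element{};
  root.column_order = std::vector<std::string>{"b", "a"};
  validate(root);
  assert((output_order(root, {"a", "b"}) == std::vector<std::string>{"b", "a"}));
}
```

Because every child is itself a `toy_schema_element`, the same optional order can be attached to a struct at any depth, which mirrors the "any struct at any level" wording above.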

Closes #17240 (metadata updated)
Closes #17091 (will return all nulls column if not present in input json)
Closes #17090 (fixed with new schema_element as dtype)
Closes #16799 (output columns are created from column_order if present)

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.

cpp/include/cudf/io/json.hpp

karthikeyann · 2024-10-24T00:50:17Z

@ttnghia This PR is ready for testing. It enforces the column order and also inserts all-null columns for any requested columns that are not present in the JSON data.
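
To make the "all-null columns" behaviour concrete, here is a small sketch on toy data structures. `toy_column`, `toy_table`, and `apply_column_order` are hypothetical names, and an integer column with `std::nullopt` for nulls is only a stand-in for a real device column.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <optional>
#include <string>
#include <utility>
#include <vector>

using toy_column = std::vector<std::optional<int>>;                  // nullopt models a null row
using toy_table  = std::vector<std::pair<std::string, toy_column>>;  // ordered (name, column) pairs

// Emit columns in the requested order. A requested name that was not found
// in the parsed input becomes a column of num_rows nulls.
toy_table apply_column_order(std::map<std::string, toy_column> const& parsed,
                             std::vector<std::string> const& column_order,
                             std::size_t num_rows)
{
  toy_table out;
  out.reserve(column_order.size());
  for (auto const& name : column_order) {
    auto const found = parsed.find(name);
    if (found == parsed.end()) {
      // Not present in the input: insert an all-null column under the requested name.
      out.emplace_back(name, toy_column(num_rows, std::nullopt));
      continue;
    }
    out.emplace_back(found->first, found->second);
  }
  // Every requested name yields exactly one output column.
  assert(out.size() == column_order.size());
  return out;
}
```

With parsed columns "a" and "b" and a requested order of "b", "c", "a", the result has three columns, and "c" holds only nulls.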

cpp/include/cudf/io/json.hpp

cpp/src/io/json/host_tree_algorithms.cu

cpp/src/io/json/json_column.cu

ttnghia · 2024-10-24T02:23:09Z

cpp/src/io/json/json_column.cu

+          child_columns.emplace_back(std::move(all_null_column));
+          continue;
+        }
+        column_names.emplace_back(found_col->first);


Above we have `if (prune_columns and found_col == std::end`, so here, if `!prune_columns`, we can still have `found_col == std::end`.

I added the prune_columns condition to the col_order decision, so this case won't happen.
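
One plausible reading of this exchange, sketched on toy types: the schema's order is used only when pruning is enabled, so in the non-pruning path the names come from the parsed data itself and a lookup cannot miss. `names_to_emit` and `example_names_to_emit` are hypothetical names, and treating an empty order as "not specified" is an assumption.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <vector>

// Decide which names drive output construction.
// The requested order is honoured only when prune_columns is set (assumed rule);
// otherwise the names found in the parsed input are used as-is.
std::vector<std::string> names_to_emit(
  bool prune_columns,
  std::optional<std::vector<std::string>> const& column_order,
  std::vector<std::string> const& parsed_names)
{
  bool const has_column_order =
    prune_columns && column_order.has_value() && !column_order->empty();
  return has_column_order ? *column_order : parsed_names;
}

// Usage: "x" is requested but absent from the input.
void example_names_to_emit()
{
  std::vector<std::string> const parsed{"a", "b"};
  std::optional<std::vector<std::string>> const order = std::vector<std::string>{"b", "x"};
  // Pruning on: the requested order wins, so "x" may need an all-null column.
  assert((names_to_emit(true, order, parsed) == std::vector<std::string>{"b", "x"}));
  // Pruning off: only parsed names are emitted, so every lookup succeeds.
  assert(names_to_emit(false, order, parsed) == parsed);
}
```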

cpp/src/io/json/json_column.cu

…into fea-json_column_order

vuule

partial review, nothing blocking

cpp/src/io/json/nested_json.hpp

cpp/tests/io/json/json_test.cpp

vuule

did not expect such a large and robust feature, great stuff!
Got several comments, mostly nits

cpp/src/io/json/host_tree_algorithms.cu

cpp/src/io/json/json_column.cu

cpp/tests/io/json/json_test.cpp

changed logic of has_column_order; used std::invalid_argument; updated auto to references in a few places

vuule

thank you for addressing all suggestions!

karthikeyann · 2024-11-07T23:23:02Z

/merge

…read_json` directly (#17180)

With this PR, `Table.readJSON` will return the output from libcudf `read_json` directly without the need of reordering the columns to match with the input schema, as well as generating all-nulls columns for the ones in the input schema that do not exist in the JSON data. This is because libcudf `read_json` already does these thus we no longer have to do it.

Depends on:
* #17029

Partially contributes to NVIDIA/spark-rapids#11560.

Closes #17002

Authors:
- Nghia Truong (https://github.com/ttnghia)

Approvers:
- Robert (Bobby) Evans (https://github.com/revans2)

URL: #17180

add optional column_order to schema_element

1b6ca58

karthikeyann self-assigned this Oct 9, 2024

github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label Oct 9, 2024

karthikeyann and others added 3 commits October 21, 2024 12:51

Merge branch 'branch-24.12' into fea-json_column_order

e0c373a

doc fixes

732f234

fix ambiguous std::map call

ffdd817

karthikeyann mentioned this pull request Oct 21, 2024

JSON spark reader plan for 24.12 #17138

Open

simplify schema_element interface

02e8ab3

lamarrr reviewed Oct 22, 2024

cpp/include/cudf/io/json.hpp

karthikeyann added 2 commits October 23, 2024 18:22

create all null columns

ac05ae9

metadata for all null non-present columns

f10d9c2

karthikeyann requested a review from ttnghia October 24, 2024 00:49