Refactor LogicalType for Parquet #14264

etseidl · 2023-10-10T00:27:29Z

Description

A continuation of #14097, this PR refactors the LogicalType struct to follow the new way of handling unions defined in the Parquet Thrift specification (more enum-like than struct-like).

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.

copy-pr-bot · 2023-10-10T00:27:31Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

etseidl · 2023-10-10T00:29:17Z

The ultimate goal is to allow greater use of LogicalType in the parquet reader and writer, so we can rely less on special cases that test the input or output length.

cpp/src/io/parquet/compact_protocol_writer.cpp

ttnghia · 2023-10-10T21:29:15Z

/ok to test

vuule

couple of small questions/suggestions

vuule · 2023-10-11T20:55:28Z

cpp/src/io/parquet/compact_protocol_writer.cpp

-  //    isset.TIME or isset.TIMESTAMP or isset.INTEGER or isset.UNKNOWN or isset.JSON or isset.BSON)
-  //    {
-  if (isset.TIMESTAMP or isset.TIME) { c.field_struct(10, s.logical_type); }
+  if (s.field_id.has_value()) { c.field_int(9, s.field_id.value()); }


Note: we could probably get rid of this "if has_value, then write the value" pattern with SFINAE in ProtobufWriter, assuming that the field_xyz names map to the actual type of the parameter.
If all types were supported through an overload set, optional support could be part of that set and simply delegate to the implementation for the contained type (T::value_type) when has_value is true.

That's a lot of template foo to avoid a few invocations of has_value. We can leave this for another day :D

I think it would have a larger impact than that, but maybe I'm mixing it up with the reader side. Either way it's not a suggestion for this PR.

(Not sure if I'm on the same page)
How about using std::visit to deal with this template issue and with optional?

That can be in the next refactor of CompactProtocolReader/Writer 🤣 (or should I say compact_protocol_reader/writer?)

I don't think std::visit is applicable here. Outside of reflection that would allow us to iterate over data members (which AFAIK does not exist), I think an overload set is as good as this gets. I'd be happy to learn about other solutions.
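A minimal sketch of the overload-set idea discussed above, not the actual cuDF writer. The class toy_field_writer, its field overloads, and write_example are hypothetical names made up for illustration, and the field numbers are only placeholders. The point is that an overload taking std::optional<T> can delegate to the overload for T, so call sites no longer need their own has_value check.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the real writer: it only records what would be emitted.
class toy_field_writer {
 public:
  // Overloads for concrete field types.
  void field(int id, int32_t value) { log_.emplace_back(id, std::to_string(value)); }
  void field(int id, std::string const& value) { log_.emplace_back(id, value); }

  // Optional overload: writes nothing when empty, otherwise delegates to the
  // overload for the contained type.
  template <typename T>
  void field(int id, std::optional<T> const& value)
  {
    if (value.has_value()) { field(id, *value); }
  }

  std::vector<std::pair<int, std::string>> const& log() const { return log_; }

 private:
  std::vector<std::pair<int, std::string>> log_;
};

// Writes a pretend schema element; the field numbers are illustrative only.
inline toy_field_writer write_example(std::optional<int32_t> field_id)
{
  toy_field_writer w;
  w.field(4, std::string{"col"});  // always written
  w.field(9, field_id);            // written only when field_id holds a value
  assert(w.log().size() == (field_id.has_value() ? 2u : 1u));
  return w;
}
```

With this shape, write_example(std::nullopt) records one field and write_example(7) records two; the call site itself contains no has_value check.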

vuule · 2023-10-12T00:03:30Z

/ok to test

ttnghia · 2023-10-20T05:55:20Z

cpp/src/io/parquet/parquet.hpp

-  NullType UNKNOWN;
-  JsonType JSON;
-  BsonType BSON;
+  enum Type {


enum class?

The Type enums are already buried inside the struct, so we're already getting the benefits of a scoped enum. And keeping it non-scoped allows me to use the enum values as the positional argument to the field_struct calls in the writer without having to do a cast.
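A small sketch of the trade-off described in the reply above. The struct layouts, the enumerator values, and the free function field_struct are simplified assumptions for illustration, not the definitions from parquet.hpp. It shows that an unscoped enum nested in a struct is still qualified by the struct name, yet converts implicitly to int, while a scoped enum needs a cast at the call site.

```cpp
#include <cassert>
#include <vector>

// Simplified stand-in for the refactored struct; enumerator values are illustrative.
struct LogicalType {
  enum Type { UNDEFINED = 0, STRING = 1, TIMESTAMP = 8 };
  Type type = UNDEFINED;
};

// The same idea with a scoped enum, for comparison.
struct ScopedLogicalType {
  enum class Type { UNDEFINED = 0, STRING = 1, TIMESTAMP = 8 };
  Type type = Type::UNDEFINED;
};

// Hypothetical writer call that takes the field number as a plain int.
inline void field_struct(std::vector<int>& out, int field_number)
{
  out.push_back(field_number);
}

inline std::vector<int> usage_example()
{
  std::vector<int> out;

  LogicalType lt;
  lt.type = LogicalType::TIMESTAMP;  // name is still qualified by the struct
  field_struct(out, lt.type);        // unscoped: implicit conversion to int

  ScopedLogicalType slt;
  slt.type = ScopedLogicalType::Type::TIMESTAMP;
  field_struct(out, static_cast<int>(slt.type));  // scoped: explicit cast required

  assert(out.size() == 2);
  return out;
}
```

Both calls record the same number; the only difference is the cast the scoped version needs.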

vuule · 2023-10-20T16:28:58Z

/ok to test

vuule · 2023-10-20T21:05:50Z

/merge

rjzamora · 2023-10-24T21:41:20Z

@galipremsagar - As far as I can tell the test_create_metadata_file_inconsistent_schema failures must be coming from a change in this PR. Test passes for the commit just before this was merged.

vuule · 2023-10-24T21:43:46Z

Hi @rjzamora, we identified other issues with the changes to the timestamp logical type and opened #14322 to fix them. Do you think that change would fix your failing test?

rjzamora · 2023-10-24T21:47:35Z

Not sure, but I don't immediately think so. The test failure we are investigating suggests that the behavior of cudf.io.merge_parquet_filemetadata has changed. We have a test where we aggregate footer metadata into a global _metadata file. One of those files contains all null values in one of the columns, while the other file contains int. We expect the aggregated metadata to "promote" the null type to align with the int type.

vuule · 2023-10-24T23:00:13Z

@rjzamora can you please point us to the failing tests? Is there an issue open?

rjzamora · 2023-10-24T23:13:27Z

Failing test is here:

cudf/python/dask_cudf/dask_cudf/io/tests/test_parquet.py

Line 447 in bc4d38d

def test_create_metadata_file_inconsistent_schema(tmpdir):

- I don't believe it is showing up in CI anywhere because all dask-cudf tests are being skipped for some reason.

I'll try to boil this down to a simpler reproducer (if possible).

vuule · 2023-10-24T23:27:44Z

are being skipped for some reason.

oof

galipremsagar · 2023-10-25T01:04:19Z

are being skipped for some reason.

🤯 🤯 🤯

etseidl · 2023-10-25T01:06:12Z

@rjzamora I think I'm running this to ground...it seems that the blob passed in to merge_row_group_metadata has the logical type annotation UNKNOWN, which is used for all null columns where the schema type can't be discovered. The change in this PR is causing that UNKNOWN to be written back out in the merged metadata, even though the physical type is written as INT32. I suspect that having the logical type present is throwing off whatever dask is doing to infer the type when it uses the _metadata file. I can think of two ways to fix this...either change the thrift protocol writer to not write the logical type if the type is 'UNKNOWN', or override an UNKNOWN logical type in merge_row_group_metadata. I'll have to discuss with @vuule offline to see which approach to take.

rjzamora · 2023-10-25T01:27:50Z

Thanks for looking into this @etseidl !

I suspect that having the logical type present is throwing off whatever dask is doing to infer the type when it uses the _metadata file.

Dask is using whatever type pyarrow.dataset happens to infer. I wouldn't be surprised if it was just using the logical type.

I can think of two ways to fix this...either change the thrift protocol writer to not write the logical type if the type is 'UNKNOWN', or override an UNKNOWN logical type in merge_row_group_metadata. I'll have to discuss with @vuule offline to see which approach to take.

Okay, either approach sounds reasonable to me. To be completely honest, I'm somewhat doubtful that there is anyone actually depending on the behavior covered in this test. In fact, we are already expecting the "wrong" result for dask.dataframe. Therefore, if this proves tricky to fix, I think it should be fine to modify or xfail the test for now.
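The two options mentioned above could look roughly like the sketch below. This is not the fix that was eventually applied. SchemaElement and LogicalType here are pared-down stand-ins, and should_write_logical_type, drop_unknown_logical_types, and make_element are hypothetical helpers. The sketch also assumes that "override" on the merge side means clearing the UNKNOWN annotation, which is only one possible reading.

```cpp
#include <cassert>
#include <optional>
#include <vector>

// Pared-down stand-ins for the real structs; only what the sketch needs.
struct LogicalType {
  enum Type { UNDEFINED, STRING, UNKNOWN };
  Type type = UNDEFINED;
};

struct SchemaElement {
  int physical_type = 0;
  std::optional<LogicalType> logical_type;
};

// Option 1 (writer side): skip the logical type field when it is UNKNOWN.
inline bool should_write_logical_type(SchemaElement const& s)
{
  return s.logical_type.has_value() && s.logical_type->type != LogicalType::UNKNOWN;
}

// Option 2 (merge side): clear UNKNOWN annotations before the merged footer is written.
inline void drop_unknown_logical_types(std::vector<SchemaElement>& schema)
{
  for (auto& s : schema) {
    if (s.logical_type.has_value() && s.logical_type->type == LogicalType::UNKNOWN) {
      s.logical_type.reset();
    }
  }
}

// Helper that builds an element carrying the given logical type.
inline SchemaElement make_element(LogicalType::Type t)
{
  SchemaElement s;
  LogicalType lt;
  lt.type = t;
  s.logical_type = lt;
  return s;
}

inline std::vector<SchemaElement> usage_example()
{
  std::vector<SchemaElement> schema;
  schema.push_back(make_element(LogicalType::UNKNOWN));
  schema.push_back(make_element(LogicalType::STRING));
  drop_unknown_logical_types(schema);
  assert(schema.size() == 2);
  return schema;
}
```

Either way, the physical type is left untouched and only the UNKNOWN annotation is kept out of the written metadata.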

refactor LogicalType

1a0fd1f

etseidl requested a review from a team as a code owner October 10, 2023 00:27

etseidl requested review from harrism and karthikeyann October 10, 2023 00:27

github-actions bot added the libcudf label Oct 10, 2023

etseidl added 2 commits October 10, 2023 09:19

pass optional to compare_binary

a0ed18d

use enum for field number

35c9b96

vuule self-requested a review October 10, 2023 17:21

vuule added the code quality, non-breaking, cuIO, and improvement labels Oct 10, 2023

etseidl added 5 commits October 10, 2023 10:38

add helpers for nanoseconds

09c0709

Merge remote-tracking branch 'origin/branch-23.12' into refactor_logical_type

d1c9899

finish merge

a854121

add more helper functions

cbdc193

Merge remote-tracking branch 'origin/branch-23.12' into refactor_logical_type

c4e4a9d

etseidl commented Oct 10, 2023

cpp/src/io/parquet/compact_protocol_writer.cpp

etseidl and others added 4 commits October 10, 2023 18:09

fix for unknown converted type

a018c60

Merge branch 'branch-23.12' into refactor_logical_type

9b0b36d

lost a line

4513b94

Merge branch 'branch-23.12' into refactor_logical_type

f5a7cb7

vuule reviewed Oct 11, 2023

etseidl added 4 commits October 11, 2023 15:47

address some review comments

ebacc12

fail when writing invalid LogicalType

517460a

get rid of some superfluous braces

d9139ff

make writers for ColumnOrder and TimeUnit match behavior of LogicalType

c2f57fe

etseidl requested a review from vuule October 11, 2023 23:57

vuule approved these changes Oct 12, 2023

ttnghia reviewed Oct 20, 2023

Merge branch 'branch-23.12' into refactor_logical_type

c56d399

vuule requested a review from ttnghia October 20, 2023 18:50

ttnghia approved these changes Oct 20, 2023

rapids-bot bot merged commit 253f6a6 into rapidsai:branch-23.12 Oct 20, 2023
57 checks passed

etseidl deleted the refactor_logical_type branch October 20, 2023 21:20

This was referenced Oct 23, 2023

[BUG] Spark 3.2+/ParquetFilterSuite/Parquet filter pushdown - timestamp/ FAILED NVIDIA/spark-rapids#9507

Closed

[BUG] Parquet writer encodes timestamp statistics incorrectly #14315

Closed

rjzamora mentioned this pull request Oct 25, 2023

[BUG] Null types are not promoted in cudf.io.merge_parquet_filemetadata #14326

Closed
