-
Notifications
You must be signed in to change notification settings - Fork 902
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactor parquet thrift reader #14097
Refactor parquet thrift reader #14097
Conversation
This reverts commit e84baf6.
as a demonstration of `ParquetFieldOptional`
I've also moved the definitions of the field functors to Also added an implementation PoC for unions that uses optional fields rather than an |
and missed an 'if'
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
quick first pass
thanks for clearing out the header!
cpp/src/io/parquet/parquet.hpp
Outdated
/** Empty struct to signal the order defined by the physical or logical type */ | ||
struct TypeDefinedOrder {}; | ||
|
||
/** | ||
* Union to specify the order used for the min_value and max_value fields for a column. | ||
*/ | ||
struct ColumnOrder { | ||
std::optional<TypeDefinedOrder> TYPE_ORDER; | ||
}; | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this equivalent to bool is_type_defined_order
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The parquet spec says unions are like enhanced enums. Many of the fields are just placeholders to signify which field is valid, but then the fields can have state, as with LogicalType
. Right now, LogicalType
is a struct of structs, with a struct field that's a struct of bools to indicate which field is valid. I'm proposing replacing that setup with a struct of optional structs, where only one field has a value.
ColumnOrder
is a strange case because there's only one defined value at present, but that may change in a bit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thinking more about this, since this is supposed to be thought of as an enum, we could do something like:
struct TypeDefinedOrder {};
struct SomethingElse {
int some_state;
};
struct ColumnOrder {
enum class Type {
UNDEFINED = 0,
TYPE_ORDER = 1,
SOMETHING_ELSE = 2
};
std::optional<TypeDefinedOrder> type_order;
std::optional<SomethingElse> something_else;
Type operator() const {
if (type_order.has_value()) { return Type::TYPE_ORDER; }
if (something_else.has_value()) { return Type::SOMETHING_ELSE; }
return Type::UNDEFINED;
}
// or
Type type; // assigned during read
Type operator() const { return type; }
};
Then the ColumnOrder could be used in a switch statement, and the case for SOMETHING_ELSE
could access column_order.something_else.value().some_state
. Empty structs could be replaced with a bool
I suppose, or not even be defined if we save the enumerator value (the logic for empty struct is already separate from the non-empty case).
I would probably do it this way if I was writing a generic thrift parser, but for just parsing the parquet types, I think this may be overkill 😉
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wound up making the above change (with mods to make it compile). I think treating these unions as enums is the way to go. And having a bunch of empty optional structs is unnecessary and takes up space.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great.
Depending on how @vuule sees it, shall we take this chance to rename involved classes or functions with snake_case to align with libcudf guidance?
I like to keep such changes separate. |
For the most part, yes. So from the parquet thrift IDL: /** Empty struct to signal the order defined by the physical or logical type */
struct TypeDefinedOrder {}
union ColumnOrder {
1: TypeDefinedOrder TYPE_ORDER;
} and I'm adding struct ColumnOrder {
enum Type { UNDEFINED, TYPE_ORDER };
Type type;
operator Type() const { return type; }
}; There are a few places where the names don't quite match, so But for function names we're certainly free to switch to snake_case as @PointKernel suggests. |
So, assuming we're keeping the names from parquet as they are, shall I rename |
wind up on device. also rename the field functors to use snake_case
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
few nitpicks, looks good otherwise
/ok to test |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great! Thanks for the refactoring.
shall I rename CompactProtocolReader, its functions, and all the field functors in compact_protocol_reader.cpp?
We could do it in a separate PR and clean up both thrift reader and writer.
/merge |
Continuation of #14097, this PR refactors the LogicalType struct to use the new way of treating unions defined in the parquet thrift (more enum like than struct like). Authors: - Ed Seidl (https://github.com/etseidl) - Vukasin Milovanovic (https://github.com/vuule) Approvers: - Vukasin Milovanovic (https://github.com/vuule) - Nghia Truong (https://github.com/ttnghia) URL: #14264
Description
Refactors the current
CompactProtocolReader
used to parse parquet file metadata. The main goal of the refactor is to allow easier use ofstd::optional
fields in the thrift structs to prevent situations as in #14024 where an optional field is an empty string. The writer cannot distinguish between present-but-empty and not-present, so chooses the latter when writing the field. This PR adds aParquetFieldOptional
functor that can wrap the other field functors, obviating the need to write a new optional functor for each type.Checklist