-
Notifications
You must be signed in to change notification settings - Fork 903
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PARQUET-2261 Size Statistics #14000
PARQUET-2261 Size Statistics #14000
Conversation
f02a652
to
d581e51
Compare
d581e51
to
9589fc3
Compare
@@ -289,7 +289,7 @@ class parquet_field_union_struct : public parquet_field { | |||
inline bool operator()(CompactProtocolReader* cpr, int field_type) | |||
{ | |||
T v; | |||
bool const res = parquet_field_struct<T>(field(), v).operator()(cpr, field_type); | |||
bool const res = parquet_field_struct<T>{field(), v}(cpr, field_type); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is great 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thank @vuule :D
cpp/src/io/parquet/page_enc.cu
Outdated
|
||
__syncthreads(); | ||
// page fragment size must fit in a 32-bit signed integer | ||
if (s->frag.fragment_data_size > std::numeric_limits<int32_t>::max()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wait, compare signed vs unsigned.
if (s->frag.fragment_data_size > std::numeric_limits<int32_t>::max()) { | |
if (s->frag.fragment_data_size > static_cast<uint32_t>(std::numeric_limits<int32_t>::max())) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TIL std::numeric_limits<T>::max()
returns a T
😅
But I don't think the cast is necessary since the result of max()
will be implicitly converted to unsigned int
per the "usual arithmetic conversions" anyway.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I doubt that the compiler can issue a warning which turn into error for this issue. Currently, it doesn't but that can change in the future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
* | ||
* This value should not be written if max_repetition_level is 0. | ||
*/ | ||
thrust::optional<std::vector<int64_t>> repetition_level_histogram; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason to mix thrust::
with std::
? Why don't just use std::optional
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The compact protocol reader/writer uses thrust::optional
because some optional fields need to be usable on the device.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But how thrust::optional<std::vector>>
can be used on device? Or this is just for consistency with other thrust::optional<plain_type>
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, consistency. LogicalType, for instance, is optional as well, and is referenced in device code. Since that one had to be thrust::optional
, I figured it was easier for them all to be.
/ok to test |
Removing |
/merge |
While investigating #14597 it was found that there was a race introduced by #14000. Adding a sync between invocations of a block reduce resolves the error. Authors: - Ed Seidl (https://github.com/etseidl) Approvers: - Yunsong Wang (https://github.com/PointKernel) - Vukasin Milovanovic (https://github.com/vuule) - Nghia Truong (https://github.com/ttnghia) URL: #14598
Adds Parquet size statistics introduced in apache/parquet-format#197. Authors: - Ed Seidl (https://github.com/etseidl) - Nghia Truong (https://github.com/ttnghia) Approvers: - Vukasin Milovanovic (https://github.com/vuule) - Nghia Truong (https://github.com/ttnghia) - Yunsong Wang (https://github.com/PointKernel) URL: rapidsai#14000
While investigating rapidsai#14597 it was found that there was a race introduced by rapidsai#14000. Adding a sync between invocations of a block reduce resolves the error. Authors: - Ed Seidl (https://github.com/etseidl) Approvers: - Yunsong Wang (https://github.com/PointKernel) - Vukasin Milovanovic (https://github.com/vuule) - Nghia Truong (https://github.com/ttnghia) URL: rapidsai#14598
#14000 added the ability to write new page statistics to the Parquet writer. This PR uses these new statistics to avoid some string size computations. Benchmarks show an improvement in read times of up to 20%. Authors: - Ed Seidl (https://github.com/etseidl) - Vukasin Milovanovic (https://github.com/vuule) - Yunsong Wang (https://github.com/PointKernel) Approvers: - Vukasin Milovanovic (https://github.com/vuule) - Yunsong Wang (https://github.com/PointKernel) URL: #14973
Description
Adds Parquet size statistics introduced in apache/parquet-format#197.
Checklist