Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PARQUET-2261 Size Statistics #14000

Merged
merged 173 commits into from
Dec 6, 2023
Merged

Conversation

etseidl
Copy link
Contributor

@etseidl etseidl commented Aug 29, 2023

Description

Adds Parquet size statistics introduced in apache/parquet-format#197.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@copy-pr-bot
Copy link

copy-pr-bot bot commented Aug 29, 2023

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label Aug 29, 2023
@PointKernel PointKernel self-requested a review December 5, 2023 18:24
@ttnghia ttnghia self-requested a review December 5, 2023 23:05
@@ -289,7 +289,7 @@ class parquet_field_union_struct : public parquet_field {
inline bool operator()(CompactProtocolReader* cpr, int field_type)
{
T v;
bool const res = parquet_field_struct<T>(field(), v).operator()(cpr, field_type);
bool const res = parquet_field_struct<T>{field(), v}(cpr, field_type);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great 👍

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thank @vuule :D


__syncthreads();
// page fragment size must fit in a 32-bit signed integer
if (s->frag.fragment_data_size > std::numeric_limits<int32_t>::max()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wait, compare signed vs unsigned.

Suggested change
if (s->frag.fragment_data_size > std::numeric_limits<int32_t>::max()) {
if (s->frag.fragment_data_size > static_cast<uint32_t>(std::numeric_limits<int32_t>::max())) {

Copy link
Contributor Author

@etseidl etseidl Dec 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TIL std::numeric_limits<T>::max() returns a T 😅

But I don't think the cast is necessary since the result of max() will be implicitly converted to unsigned int per the "usual arithmetic conversions" anyway.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I doubt that the compiler can issue a warning which turn into error for this issue. Currently, it doesn't but that can change in the future.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

*
* This value should not be written if max_repetition_level is 0.
*/
thrust::optional<std::vector<int64_t>> repetition_level_histogram;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason to mix thrust:: with std::? Why don't just use std::optional?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The compact protocol reader/writer uses thrust::optional because some optional fields need to be usable on the device.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But how thrust::optional<std::vector>> can be used on device? Or this is just for consistency with other thrust::optional<plain_type>?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, consistency. LogicalType, for instance, is optional as well, and is referenced in device code. Since that one had to be thrust::optional, I figured it was easier for them all to be.

@etseidl etseidl requested a review from ttnghia December 6, 2023 15:50
@ttnghia
Copy link
Contributor

ttnghia commented Dec 6, 2023

/ok to test

@vuule vuule added the 5 - Ready to Merge Testing and reviews complete, ready to merge label Dec 6, 2023
@raydouglass raydouglass removed the request for review from a team December 6, 2023 20:31
@raydouglass
Copy link
Member

Removing ops-codeowners from the required reviews since it doesn't seem there are any file changes that we're responsible for. Feel free to add us back if necessary.

@vuule vuule removed the 5 - Ready to Merge Testing and reviews complete, ready to merge label Dec 6, 2023
@vuule
Copy link
Contributor

vuule commented Dec 6, 2023

/merge

@rapids-bot rapids-bot bot merged commit b136d8b into rapidsai:branch-24.02 Dec 6, 2023
68 checks passed
@etseidl etseidl deleted the size_statistics branch December 6, 2023 23:52
rapids-bot bot pushed a commit that referenced this pull request Dec 9, 2023
While investigating #14597 it was found that there was a race introduced by #14000. Adding a sync between invocations of a block reduce resolves the error.

Authors:
  - Ed Seidl (https://github.com/etseidl)

Approvers:
  - Yunsong Wang (https://github.com/PointKernel)
  - Vukasin Milovanovic (https://github.com/vuule)
  - Nghia Truong (https://github.com/ttnghia)

URL: #14598
karthikeyann pushed a commit to karthikeyann/cudf that referenced this pull request Dec 12, 2023
Adds Parquet size statistics introduced in apache/parquet-format#197.

Authors:
  - Ed Seidl (https://github.com/etseidl)
  - Nghia Truong (https://github.com/ttnghia)

Approvers:
  - Vukasin Milovanovic (https://github.com/vuule)
  - Nghia Truong (https://github.com/ttnghia)
  - Yunsong Wang (https://github.com/PointKernel)

URL: rapidsai#14000
karthikeyann pushed a commit to karthikeyann/cudf that referenced this pull request Dec 12, 2023
While investigating rapidsai#14597 it was found that there was a race introduced by rapidsai#14000. Adding a sync between invocations of a block reduce resolves the error.

Authors:
  - Ed Seidl (https://github.com/etseidl)

Approvers:
  - Yunsong Wang (https://github.com/PointKernel)
  - Vukasin Milovanovic (https://github.com/vuule)
  - Nghia Truong (https://github.com/ttnghia)

URL: rapidsai#14598
rapids-bot bot pushed a commit that referenced this pull request Mar 7, 2024
#14000 added the ability to write new page statistics to the Parquet writer. This PR uses these new statistics to avoid some string size computations. Benchmarks show an improvement in read times of up to 20%.

Authors:
  - Ed Seidl (https://github.com/etseidl)
  - Vukasin Milovanovic (https://github.com/vuule)
  - Yunsong Wang (https://github.com/PointKernel)

Approvers:
  - Vukasin Milovanovic (https://github.com/vuule)
  - Yunsong Wang (https://github.com/PointKernel)

URL: #14973
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

6 participants