Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement HISTOGRAM and MERGE_HISTOGRAM aggregations #14045

Merged
merged 114 commits into from
Sep 27, 2023

Conversation

ttnghia
Copy link
Contributor

@ttnghia ttnghia commented Sep 6, 2023

This adds two more aggregations for groupby and reduction:

  • HISTOGRAM: Count the number of occurrences (aka frequency) for each element, and
  • MERGE_HISTOGRAM: Merge different outputs generated by HISTOGRAM aggregations

This is the prerequisite for implementing the exact distributed percentile aggregation (#13885). However, these two new aggregations may be useful in other use-cases that need to do frequency counting.

Closes #13885.

Merging checklist:

  • Working prototypes.
  • Cleanup and docs.
  • Unit test.
  • Test with spark-rapids integration tests.

@ttnghia ttnghia added feature request New feature or request 2 - In Progress Currently a work in progress libcudf Affects libcudf (C++/CUDA) code. Spark Functionality that helps Spark RAPIDS non-breaking Non-breaking change labels Sep 6, 2023
@ttnghia ttnghia self-assigned this Sep 6, 2023
@ttnghia ttnghia changed the title Implement COUNT_FREQUENCY and MERGE_FREQUENCY aggregations Implement HISTOGRAM and MERGE_HISTOGRAM aggregations Sep 7, 2023
@github-actions github-actions bot added the CMake CMake build issue label Sep 11, 2023
Copy link
Contributor

@vuule vuule left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some small suggestions
surprisingly manageable PR for its size, good job!

cpp/include/cudf/aggregation.hpp Outdated Show resolved Hide resolved
cpp/include/cudf/detail/histogram_helpers.hpp Outdated Show resolved Hide resolved
cpp/src/groupby/sort/group_histogram.cu Outdated Show resolved Hide resolved
cpp/src/groupby/sort/group_histogram.cu Outdated Show resolved Hide resolved
cpp/src/groupby/sort/group_histogram.cu Outdated Show resolved Hide resolved
cpp/src/reductions/histogram.cu Outdated Show resolved Hide resolved
cpp/src/reductions/histogram.cu Outdated Show resolved Hide resolved
cpp/src/reductions/histogram.cu Outdated Show resolved Hide resolved
Copy link
Contributor

@robertmaynard robertmaynard left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving CMake changes

Copy link
Member

@PointKernel PointKernel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure what is the best way to handle the hash-based histogram reduction.

Ideally, we shouldn't add any new code using the legacy map but refactoring with the new map will require more time for implementing, benchmarking, and reviewing thus this PR will probably slip.

cpp/include/cudf/aggregation.hpp Outdated Show resolved Hide resolved
cpp/include/cudf/detail/histogram_helpers.hpp Outdated Show resolved Hide resolved
cpp/include/cudf/detail/histogram_helpers.hpp Outdated Show resolved Hide resolved
cpp/src/reductions/histogram.cu Outdated Show resolved Hide resolved
@PointKernel
Copy link
Member

One option would be splitting the current work into two PRs. the first for groupby and the second for reduction. This will simplify the review process so the groupby can still get into 23.10. and allow us to perform a thorough performance investigation on reduction with the new map.

cpp/src/reductions/histogram.cu Outdated Show resolved Hide resolved
cpp/src/reductions/histogram.cu Outdated Show resolved Hide resolved
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have a include/cudf/reduction/detail/ folder. Probably move this file to that place first then consider further cleanups.

Copy link
Contributor Author

@ttnghia ttnghia Sep 26, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

correct me if I'm wrong, it seems that the histogram needs payloads for count accumulation.

In both histogram and distinct, the payload is in an external buffer. Only the key indices are needed, so static_set will be enough.

Copy link
Contributor

@vuule vuule left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM with one question

cpp/src/reductions/histogram.cu Show resolved Hide resolved
Copy link
Member

@PointKernel PointKernel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@vuule vuule added 5 - Ready to Merge Testing and reviews complete, ready to merge and removed 3 - Ready for Review Ready for review by team labels Sep 27, 2023
@ttnghia ttnghia added the 5 - DO NOT MERGE Hold off on merging; see PR for details label Sep 27, 2023
@ttnghia
Copy link
Contributor Author

ttnghia commented Sep 27, 2023

Thanks all!

Just add DO NOT MERGE label to have some more time to test, just in case there is some hidden bug. I expect to not have anything new added.

@ttnghia ttnghia removed the 5 - DO NOT MERGE Hold off on merging; see PR for details label Sep 27, 2023
@ttnghia
Copy link
Contributor Author

ttnghia commented Sep 27, 2023

/merge

@rapids-bot rapids-bot bot merged commit ce24796 into rapidsai:branch-23.10 Sep 27, 2023
54 checks passed
rapids-bot bot pushed a commit that referenced this pull request Sep 27, 2023
This implements JNI for  `HISTOGRAM` and `MERGE_HISTOGRAM` aggregations in both groupby and reduction.

Depends on:
 * #14045

Contributes to:
 * #13885.

Authors:
  - Nghia Truong (https://github.com/ttnghia)

Approvers:
  - Jason Lowe (https://github.com/jlowe)

URL: #14154
@ttnghia ttnghia deleted the percentile branch October 16, 2023 17:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
5 - Ready to Merge Testing and reviews complete, ready to merge CMake CMake build issue feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change Spark Functionality that helps Spark RAPIDS
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

[FEA] distributed quantile group by aggregations and reductions
4 participants