Implement `HISTOGRAM` and `MERGE_HISTOGRAM` aggregations #14045
Title changed: `COUNT_FREQUENCY` and `MERGE_FREQUENCY` aggregations → `HISTOGRAM` and `MERGE_HISTOGRAM` aggregations
# Conflicts:
#   cpp/src/reductions/hash_reduce_by_row.cuh
#   cpp/src/stream_compaction/distinct_reduce.cu
some small suggestions
surprisingly manageable PR for its size, good job!
Approving CMake changes
Not sure what the best way to handle the hash-based histogram reduction is.
Ideally, we shouldn't add any new code that uses the legacy map, but refactoring to the new map will require more time for implementation, benchmarking, and review, so this PR will probably slip.
One option would be splitting the current work into two PRs. the first for |
We have an `include/cudf/reduction/detail/` folder. Probably move this file there first, then consider further cleanups.
Correct me if I'm wrong, but it seems that the histogram needs payloads for count accumulation.
In both `histogram` and `distinct`, the payload is in an external buffer. Only the key indices are needed, so `static_set` will be enough.
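To make the point above concrete, here is a minimal host-side sketch of the idea: the set stores only row indices, and the counts live in a separate buffer indexed by the first row that holds each distinct value. It uses `std::unordered_set` purely as an analogy; the function name `count_by_first_index` is hypothetical, and this is not the libcudf or `static_set` implementation.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <unordered_set>
#include <vector>

// Hypothetical host-side sketch; NOT the libcudf or cuco implementation.
// The set holds only row indices. The payload (counts) is kept in an
// external buffer: counts[i] is nonzero only when i is the first row
// index at which a distinct value appears.
std::vector<std::int64_t> count_by_first_index(const std::vector<int>& values) {
  // Hash and compare row indices through the values they refer to.
  auto hash  = [&values](std::size_t i) { return std::hash<int>{}(values[i]); };
  auto equal = [&values](std::size_t a, std::size_t b) { return values[a] == values[b]; };
  std::unordered_set<std::size_t, decltype(hash), decltype(equal)> keys(16, hash, equal);

  std::vector<std::int64_t> counts(values.size(), 0);
  for (std::size_t i = 0; i < values.size(); ++i) {
    // insert() returns the already-stored index when an equal key exists.
    auto it = keys.insert(i).first;
    ++counts[*it];  // accumulate in the external buffer, not in the set
  }
  return counts;
}
```

Because the count is addressed by the stored index, the container itself never needs a mapped value, which is why a set is sufficient in this sketch.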
LGTM with one question
LGTM
Thanks all! Just add |
/merge |
This implements JNI for `HISTOGRAM` and `MERGE_HISTOGRAM` aggregations in both groupby and reduction.

Depends on:
* #14045

Contributes to:
* #13885

Authors:
- Nghia Truong (https://github.com/ttnghia)

Approvers:
- Jason Lowe (https://github.com/jlowe)

URL: #14154
This adds two more aggregations for groupby and reduction:
* `HISTOGRAM`: Count the number of occurrences (aka frequency) for each element, and
* `MERGE_HISTOGRAM`: Merge different outputs generated by `HISTOGRAM` aggregations

This is the prerequisite for implementing the exact distributed percentile aggregation (#13885). However, these two new aggregations may be useful in other use-cases that need to do frequency counting.
Closes #13885.
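As a rough illustration of the semantics described above, the following plain C++ sketch counts occurrences per element and then merges partial results by summing counts. The names `Histogram`, `histogram`, and `merge_histograms` are hypothetical host-side helpers, not libcudf APIs, and the sketch ignores nulls, column types, and GPU execution.

```cpp
#include <cstdint>
#include <map>
#include <vector>

// Hypothetical host-side helpers for illustration; NOT libcudf APIs.
using Histogram = std::map<int, std::int64_t>;

// HISTOGRAM semantics: count how many times each distinct element occurs.
Histogram histogram(const std::vector<int>& values) {
  Histogram out;
  for (int v : values) { ++out[v]; }
  return out;
}

// MERGE_HISTOGRAM semantics: combine partial histograms by summing the
// counts recorded for each element.
Histogram merge_histograms(const std::vector<Histogram>& parts) {
  Histogram out;
  for (const auto& part : parts) {
    for (const auto& entry : part) { out[entry.first] += entry.second; }
  }
  return out;
}
```

In this sketch, merging the histograms of two partitions gives the same result as computing one histogram over the concatenated data, which is the property a distributed percentile computation would rely on.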