Implement `HISTOGRAM` and `MERGE_HISTOGRAM` aggregations #14045

ttnghia · 2023-09-06T21:28:12Z

This adds two new aggregations for groupby and reduction:

`HISTOGRAM`: Count the number of occurrences (i.e., the frequency) of each element.
`MERGE_HISTOGRAM`: Merge the outputs generated by multiple `HISTOGRAM` aggregations.
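
To make the semantics of the two aggregations concrete, here is a minimal host-side sketch in plain C++. It is not the libcudf implementation or API: the names `histogram_t`, `make_histogram`, and `merge_histograms` are hypothetical, and `std::map` stands in for whatever column layout the real aggregations produce.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

// Hypothetical alias for illustration: maps each distinct value to its count.
using histogram_t = std::map<int, std::int64_t>;

// HISTOGRAM semantics: count how many times each element occurs.
histogram_t make_histogram(std::vector<int> const& values)
{
  histogram_t hist;
  for (int v : values) { ++hist[v]; }
  return hist;
}

// MERGE_HISTOGRAM semantics: sum the counts of equal elements across
// several partial histograms (e.g., one per batch or per worker).
histogram_t merge_histograms(std::vector<histogram_t> const& partials)
{
  histogram_t merged;
  for (auto const& part : partials) {
    for (auto const& kv : part) {
      assert(kv.second > 0);  // a histogram only lists elements that occurred
      merged[kv.first] += kv.second;
    }
  }
  return merged;
}
```

For example, merging the histograms of `{1, 2, 2, 3}` and `{2, 3, 3, 4}` gives the same counts as building one histogram over all eight values.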

This is a prerequisite for implementing the exact distributed percentile aggregation (#13885). These two new aggregations may also be useful in other use cases that need frequency counting.
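
As a rough illustration of why a merged histogram is enough for an exact percentile, the sketch below computes a percentile from value/count pairs without materializing the full sorted data. It assumes one common definition (linear interpolation between the two closest ranks); whether this matches the definition targeted by #13885 is an assumption, and `percentile_from_histogram` is a hypothetical name, not a libcudf function.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <map>

// Computes the percentile `p` (in [0, 1]) of the data described by `hist`,
// where `hist` maps each distinct value to its count. std::map keeps the
// values sorted, which is what the rank lookup below relies on.
double percentile_from_histogram(std::map<int, std::int64_t> const& hist, double p)
{
  assert(!hist.empty());
  assert(p >= 0.0 && p <= 1.0);

  std::int64_t total = 0;
  for (auto const& kv : hist) { total += kv.second; }

  // Fractional 0-based position of the percentile in the sorted data.
  double const pos  = p * static_cast<double>(total - 1);
  auto const lower  = static_cast<std::int64_t>(std::floor(pos));
  auto const upper  = static_cast<std::int64_t>(std::ceil(pos));

  // Finds the value at a 0-based rank by walking the cumulative counts.
  auto value_at = [&hist](std::int64_t rank) -> int {
    std::int64_t seen = 0;
    for (auto const& kv : hist) {
      seen += kv.second;
      if (rank < seen) { return kv.first; }
    }
    return hist.rbegin()->first;
  };

  double const lo = value_at(lower);
  double const hi = value_at(upper);
  return lo + (hi - lo) * (pos - static_cast<double>(lower));
}
```

Because the result depends only on the distinct values and their total counts, each worker can send its partial histogram and the final percentile can be computed after merging.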

Closes #13885.

Merging checklist:

Working prototypes.
Cleanup and docs.
Unit test.
Test with spark-rapids integration tests.

vuule

some small suggestions
surprisingly manageable PR for its size, good job!

cpp/include/cudf/aggregation.hpp

cpp/include/cudf/detail/histogram_helpers.hpp

cpp/src/groupby/sort/group_histogram.cu

cpp/src/reductions/histogram.cu

robertmaynard

Approving CMake changes

PointKernel

I'm not sure what the best way to handle the hash-based histogram reduction is.

Ideally, we shouldn't add any new code that uses the legacy map, but refactoring to the new map will take more time to implement, benchmark, and review, so this PR would probably slip.

cpp/include/cudf/aggregation.hpp

cpp/include/cudf/detail/histogram_helpers.hpp

cpp/src/reductions/histogram.cu

PointKernel · 2023-09-26T20:43:57Z

One option would be to split the current work into two PRs: the first for groupby and the second for reduction. This would simplify the review process so that the groupby part can still get into 23.10, and it would allow us to perform a thorough performance investigation of the reduction with the new map.

cpp/src/reductions/histogram.cu

PointKernel · 2023-09-26T21:24:52Z

cpp/include/cudf/detail/reduction/histogram.hpp

We have an include/cudf/reduction/detail/ folder. Probably move this file there first, then consider further cleanups.

Correct me if I'm wrong, but it seems that the histogram needs payloads for count accumulation.

In both histogram and distinct, the payload is in an external buffer. Only the key indices are needed, so static_set will be enough.
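
The idea of keeping only key indices in the set while the payload lives in an external buffer can be sketched on the host as follows. This is a sequential illustration under stated assumptions, not the code in this PR: `std::unordered_set` stands in for a GPU hash set such as the `static_set` mentioned above, and `row_hasher`, `row_equal`, and `count_by_row` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <unordered_set>
#include <vector>

// Hash and equality functors that look through a row index to the row value,
// so the set itself only ever stores indices.
struct row_hasher {
  std::vector<int> const* rows;
  std::size_t operator()(std::size_t idx) const { return std::hash<int>{}((*rows)[idx]); }
};

struct row_equal {
  std::vector<int> const* rows;
  bool operator()(std::size_t lhs, std::size_t rhs) const
  {
    return (*rows)[lhs] == (*rows)[rhs];
  }
};

// Returns a buffer with one slot per input row. The slot of the first row seen
// for each distinct value holds that value's count; all other slots stay 0.
std::vector<std::int64_t> count_by_row(std::vector<int> const& rows)
{
  std::unordered_set<std::size_t, row_hasher, row_equal> index_set(
    rows.size(), row_hasher{&rows}, row_equal{&rows});

  // The payload (counts) lives outside the set, indexed by row.
  std::vector<std::int64_t> counts(rows.size(), 0);

  for (std::size_t i = 0; i < rows.size(); ++i) {
    // insert() returns an iterator to the already-present index if an equal
    // row was inserted before, otherwise to the newly inserted index `i`.
    std::size_t const representative = *index_set.insert(i).first;
    ++counts[representative];
  }
  return counts;
}
```

For the input `{5, 7, 5, 5, 7, 9}`, the counts accumulate at indices 0, 1, and 5, the first occurrence of each distinct value. A GPU version would need atomic updates on the external buffer, which this sequential sketch does not show.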

vuule

LGTM with one question

cpp/src/reductions/histogram.cu

PointKernel

LGTM

ttnghia · 2023-09-27T01:13:51Z

Thanks all!

I just added the DO NOT MERGE label to allow some more time for testing, in case there is a hidden bug. I don't expect to add anything new.

ttnghia · 2023-09-27T17:10:25Z

/merge

This implements JNI for `HISTOGRAM` and `MERGE_HISTOGRAM` aggregations in both groupby and reduction.

Depends on:
* #14045

Contributes to:
* #13885

Authors:
- Nghia Truong (https://github.com/ttnghia)

Approvers:
- Jason Lowe (https://github.com/jlowe)

URL: #14154

Add COUNT_FREQUENCY and MERGE_FREQUENCY aggregations

e385fda

ttnghia added the feature request (New feature or request), 2 - In Progress (Currently a work in progress), libcudf (Affects libcudf (C++/CUDA) code), Spark (Functionality that helps Spark RAPIDS), and non-breaking (Non-breaking change) labels Sep 6, 2023

ttnghia self-assigned this Sep 6, 2023

Change the new aggregations to HISTOGRAM and MERGE_HISTOGRAM

e3df8d4

ttnghia changed the title ~~Implement COUNT_FREQUENCY and MERGE_FREQUENCY aggregations~~ Implement HISTOGRAM and MERGE_HISTOGRAM aggregations Sep 7, 2023

ttnghia added 4 commits September 6, 2023 21:03

Update copyright year

7bc7f91

Implement interface for the new aggregations

0fd2000

Add new files

1b04436

Add skeleton APIs

1977d69

github-actions bot added the CMake (CMake build issue) label Sep 11, 2023

ttnghia added 16 commits September 11, 2023 12:00

Extract hash_reduce_by_row

6fa93fc

Adopt hash_reduce_by_row in distinct_reduce

d11dd7f

Merge branch 'branch-23.10' into percentile

b632570

Rename struct and simplify code

e58f3e3

Refactor hash_reduce_by_row

3cf1948

Rewrite hash_reduce_by_row.cuh

8488646

Rename and rewrite distinct_reduce.hpp

1994684

Rewrite distinct.cu

5dcbac9

Rewrite distinct_reduce.cu

6236fcc

Rewrite hash_reduce_by_row.cuh

42a778f

Minor changes

584ff8d

Fix style

4a3d60d

Merge branch 'refactor_hash_reduce' into percentile

4e74119

# Conflicts: # cpp/src/reductions/hash_reduce_by_row.cuh # cpp/src/stream_compaction/distinct_reduce.cu

Fix comment

34cb488

Merge branch 'refactor_hash_reduce' into percentile

c863b53

Move file

e73c07f

ttnghia added 2 commits September 21, 2023 16:30

Move header

b06ed2a

Fix docs

b6b720a

vuule reviewed Sep 22, 2023

robertmaynard approved these changes Sep 25, 2023

PointKernel requested changes Sep 26, 2023

ttnghia requested review from vuule and PointKernel September 26, 2023 21:02

ttnghia force-pushed the percentile branch from d1bdf60 to acc3ae5 September 26, 2023 21:04

Move detail header file, use std::invalid_argument, and some comments

b30f70c

ttnghia force-pushed the percentile branch from acc3ae5 to b30f70c September 26, 2023 21:05

PointKernel reviewed Sep 26, 2023

ttnghia added 3 commits September 26, 2023 14:31

Move header file and fix comment

89f3628

Append enum

c3ad104

Redeclare hash_table_allocator_type

f504c86

vuule reviewed Sep 26, 2023

cpp/src/reductions/histogram.cu

Temporarily fail on nested input

e9d723e

PointKernel approved these changes Sep 26, 2023

vuule approved these changes Sep 27, 2023

vuule added the 5 - Ready to Merge (Testing and reviews complete, ready to merge) label and removed the 3 - Ready for Review (Ready for review by team) label Sep 27, 2023

ttnghia added the 5 - DO NOT MERGE (Hold off on merging; see PR for details) label Sep 27, 2023

ttnghia and others added 2 commits September 26, 2023 18:14

Merge branch 'branch-23.10' into percentile

d1980e0

Fix compile issue due to merge conflict

83b8a37

ttnghia removed the 5 - DO NOT MERGE (Hold off on merging; see PR for details) label Sep 27, 2023

rapids-bot bot merged commit ce24796 into rapidsai:branch-23.10 Sep 27, 2023
54 checks passed

ttnghia deleted the percentile branch October 16, 2023 17:54
