[FEA] distributed quantile group by aggregations and reductions #13885

revans2 · 2023-08-15T18:41:28Z

Is your feature request related to a problem? Please describe.

To be clear I am not sure that this is something that CUDF wants or not. If not that is okay we can probably implement something ourselves, but it will take a non-trivial amount of work to do it for group-by.

#4706 appears to be very similar so there could be a lot of overlap between the two, except the dask documentation says that it is "row wise approximate", and has tdigest as a method in addition to dask. So I think they really are doing percentile approx and CUDF already has some support for tdigest. So maybe it is not the same.

Describe the solution you'd like
I would like two new aggregations and a post processing kernel. This is very similar to how TDIGEST works, except instead of computing approximate percentiles it would calculate a 100% accurate percentile.

We would have a COLLECT_FOR_QUANTILE aggregation that behave a lot like the TDIGEST aggregation does. It would collect the information into some intermediate data structure.

Then we want a MERGE_FOR_QUANTILE aggregation that would behave a lot like the MERGE_TDIGEST aggregation does. It would take as input the output of COLLECT_FOR_QUANTILE and merge 1 or more of them together.

Finally we would want a quantile API similar to percentile_approx that would take the output of the aggregation along with a set of percentiles, and return the desired values. Ideally we would have LINEAR interpolation, at least by default because that is what Spark does.

To describe what Spark does for reference. They will create a map from the value to a count, filtering out nulls before inserting them into the map. That is the internal data structure used by Spark. It would be nice to try and do something similar if possible.

We could, in theory, implement this in terms of COLLECT_LIST, MERGE_LIST and then a new quantile/percentile API that would return the result based off of the desired percentiles. But if there are a lot of duplicate values, which we suspect to be common, then the amount of data that we would end up shuffling would be much larger than Spark does, and would slow us down. I don't mind if we want to play games with how the data is encoded, like if all of the counts are 1, we drop the count from the data. But to make that happen the intermediate data returned by MERGE_FOR_QUANTILE and COLLECT_FOR_QUANTILE would have to be generic enough to support both (possibly a LIST or something like that).

The text was updated successfully, but these errors were encountered:

GregoryKimball · 2023-08-23T05:18:01Z

Thank you @revans2 for detailing this request. We will need to revisit this topic and coordinate between RAPIDS and Dask. This comment suggests that Dask may support percentile aggregations, and additional work in libcudf would be for performance benefits rather than adding functionality.

@nvdbaranec Do you think the proposed COLLECT_FOR_QUANTILE and MERGE_FOR_QUANTILE aggregations would be based on the tdigest implementation, or would we need to implement something quite different?

This adds two more aggregations for groupby and reduction: * `HISTOGRAM`: Count the number of occurrences (aka frequency) for each element, and * `MERGE_HISTOGRAM`: Merge different outputs generated by `HISTOGRAM` aggregations This is the prerequisite for implementing the exact distributed percentile aggregation (#13885). However, these two new aggregations may be useful in other use-cases that need to do frequency counting. Closes #13885. Merging checklist: * [X] Working prototypes. * [X] Cleanup and docs. * [X] Unit test. * [ ] Test with spark-rapids integration tests. Authors: - Nghia Truong (https://github.com/ttnghia) Approvers: - Robert Maynard (https://github.com/robertmaynard) - Yunsong Wang (https://github.com/PointKernel) - Vukasin Milovanovic (https://github.com/vuule) URL: #14045

This implements JNI for `HISTOGRAM` and `MERGE_HISTOGRAM` aggregations in both groupby and reduction. Depends on: * #14045 Contributes to: * #13885. Authors: - Nghia Truong (https://github.com/ttnghia) Approvers: - Jason Lowe (https://github.com/jlowe) URL: #14154

revans2 added feature request New feature or request Needs Triage Need team to review and classify Spark Functionality that helps Spark RAPIDS labels Aug 15, 2023

revans2 mentioned this issue Aug 15, 2023

[FEA] Support Percentile NVIDIA/spark-rapids#6266

Closed

GregoryKimball added this to the Aggregations continuous improvement milestone Aug 23, 2023

GregoryKimball added 0 - Backlog In queue waiting for assignment libcudf Affects libcudf (C++/CUDA) code. and removed Needs Triage Need team to review and classify labels Aug 23, 2023

ttnghia mentioned this issue Sep 6, 2023

Implement HISTOGRAM and MERGE_HISTOGRAM aggregations #14045

Merged

4 tasks

ttnghia mentioned this issue Sep 21, 2023

JNI for HISTOGRAM and MERGE_HISTOGRAM aggregations #14154

Merged

rapids-bot bot closed this as completed in #14045 Sep 27, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[FEA] distributed quantile group by aggregations and reductions #13885

[FEA] distributed quantile group by aggregations and reductions #13885

revans2 commented Aug 15, 2023 •

edited

Loading

GregoryKimball commented Aug 23, 2023

[FEA] distributed quantile group by aggregations and reductions #13885

[FEA] distributed quantile group by aggregations and reductions #13885

Comments

revans2 commented Aug 15, 2023 • edited Loading

GregoryKimball commented Aug 23, 2023

revans2 commented Aug 15, 2023 •

edited

Loading