[FEA] distributed quantile group by aggregations and reductions #13885

Closed
revans2 opened this issue Aug 15, 2023 · 1 comment · Fixed by #14045
Labels: 0 - Backlog (In queue waiting for assignment), feature request (New feature or request), libcudf (Affects libcudf (C++/CUDA) code), Spark (Functionality that helps Spark RAPIDS)

Comments

revans2 (Contributor) commented Aug 15, 2023

Is your feature request related to a problem? Please describe.

To be clear, I am not sure whether this is something that CUDF wants. If not, that is okay; we can probably implement something ourselves, but it would take a non-trivial amount of work to do it for group-by.

#4706 appears to be very similar, so there could be a lot of overlap between the two, except that the Dask documentation says it is "row wise approximate" and offers tdigest as a method in addition to dask. So I think they really are doing approximate percentiles, and CUDF already has some support for tdigest. So maybe it is not the same.

Describe the solution you'd like
I would like two new aggregations and a post-processing kernel. This is very similar to how TDIGEST works, except that instead of computing approximate percentiles it would calculate 100% accurate percentiles.

We would have a COLLECT_FOR_QUANTILE aggregation that behaves a lot like the TDIGEST aggregation does. It would collect the information into some intermediate data structure.

Then we want a MERGE_FOR_QUANTILE aggregation that would behave a lot like the MERGE_TDIGEST aggregation does. It would take as input the output of COLLECT_FOR_QUANTILE and merge 1 or more of them together.

Finally, we would want a quantile API similar to percentile_approx that would take the output of the aggregation along with a set of percentiles and return the desired values. Ideally it would use LINEAR interpolation, at least by default, because that is what Spark does.
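To make the proposed flow concrete, here is a purely hypothetical sketch in libcudf's groupby style. The factory functions `make_collect_for_quantile_aggregation` and `make_merge_for_quantile_aggregation`, and the `cudf::quantile_exact` post-processing call, are placeholder names for the aggregations proposed above; none of them exist in libcudf today.

```cpp
// Hypothetical sketch only: the COLLECT_FOR_QUANTILE / MERGE_FOR_QUANTILE factories and
// cudf::quantile_exact below are placeholder names for the proposed APIs, not real libcudf calls.
#include <cudf/aggregation.hpp>
#include <cudf/column/column.hpp>
#include <cudf/groupby.hpp>
#include <cudf/table/table_view.hpp>

#include <memory>
#include <vector>

// Stage 1, run on each task: build the per-group "value -> count" intermediate state.
auto collect_local_state(cudf::column_view keys, cudf::column_view values)
{
  cudf::groupby::groupby gb{cudf::table_view{{keys}}};
  std::vector<cudf::groupby::aggregation_request> requests(1);
  requests[0].values = values;
  requests[0].aggregations.push_back(
    cudf::make_collect_for_quantile_aggregation<cudf::groupby_aggregation>());  // proposed
  return gb.aggregate(requests);
}

// Stage 2, run after shuffling the intermediate state by key: merge the per-task states.
auto merge_states(cudf::column_view shuffled_keys, cudf::column_view shuffled_state)
{
  cudf::groupby::groupby gb{cudf::table_view{{shuffled_keys}}};
  std::vector<cudf::groupby::aggregation_request> requests(1);
  requests[0].values = shuffled_state;
  requests[0].aggregations.push_back(
    cudf::make_merge_for_quantile_aggregation<cudf::groupby_aggregation>());  // proposed
  return gb.aggregate(requests);
}

// Stage 3: post-process the merged state into exact percentiles, analogous to
// cudf::percentile_approx for tdigest, with LINEAR interpolation to match Spark.
std::unique_ptr<cudf::column> finalize(cudf::column_view merged_state,
                                       cudf::column_view percentiles)
{
  return cudf::quantile_exact(merged_state, percentiles, cudf::interpolation::LINEAR);  // proposed
}
```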

For reference, here is what Spark does: it builds a map from value to count, filtering out nulls before inserting them into the map. That is the internal data structure Spark uses. It would be nice to do something similar if possible.
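As a host-side illustration only (plain C++, not libcudf or Spark code), here is how an exact percentile with LINEAR interpolation can be computed from such a value-to-count structure once the counts for a group have been merged and the distinct values sorted:

```cpp
#include <cmath>
#include <cstdint>
#include <vector>

// Exact percentile with LINEAR interpolation over a sorted (value, count) histogram.
// Illustration of the logical behavior: `values` holds the sorted distinct values and
// counts[i] > 0 is how many times values[i] occurred (nulls already filtered out).
double exact_percentile(std::vector<double> const& values,
                        std::vector<int64_t> const& counts,
                        double percentile)  // in [0, 1]
{
  int64_t total = 0;
  for (auto c : counts) { total += c; }

  // Position in the logically "expanded" sorted sequence of length `total`.
  double const pos  = percentile * static_cast<double>(total - 1);
  auto const lo_idx = static_cast<int64_t>(std::floor(pos));
  double const frac = pos - static_cast<double>(lo_idx);

  // Find the value at a given expanded position by walking cumulative counts.
  auto value_at = [&](int64_t idx) {
    int64_t cumulative = 0;
    for (std::size_t i = 0; i < values.size(); ++i) {
      cumulative += counts[i];
      if (idx < cumulative) { return values[i]; }
    }
    return values.back();
  };

  double const lo = value_at(lo_idx);
  double const hi = value_at(lo_idx + 1);
  return lo + frac * (hi - lo);
}
```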

We could, in theory, implement this in terms of COLLECT_LIST and MERGE_LIST plus a new quantile/percentile API that would return the result based on the desired percentiles. But if there are a lot of duplicate values, which we suspect to be common, the amount of data we would end up shuffling would be much larger than what Spark shuffles, and that would slow us down. I don't mind playing games with how the data is encoded, for example dropping the count from the data when all of the counts are 1. But to make that happen, the intermediate data returned by COLLECT_FOR_QUANTILE and MERGE_FOR_QUANTILE would have to be generic enough to support both (possibly a LIST or something like that).

@revans2 revans2 added feature request New feature or request Needs Triage Need team to review and classify Spark Functionality that helps Spark RAPIDS labels Aug 15, 2023
GregoryKimball (Contributor) commented

Thank you @revans2 for detailing this request. We will need to revisit this topic and coordinate between RAPIDS and Dask. This comment suggests that Dask may support percentile aggregations, and that additional work in libcudf would be for performance benefits rather than new functionality.

@nvdbaranec Do you think the proposed COLLECT_FOR_QUANTILE and MERGE_FOR_QUANTILE aggregations would be based on the tdigest implementation, or would we need to implement something quite different?

@GregoryKimball GregoryKimball added 0 - Backlog In queue waiting for assignment libcudf Affects libcudf (C++/CUDA) code. and removed Needs Triage Need team to review and classify labels Aug 23, 2023
rapids-bot bot pushed a commit that referenced this issue Sep 27, 2023
This adds two more aggregations for groupby and reduction:
 * `HISTOGRAM`: Count the number of occurrences (aka frequency) for each element, and
 * `MERGE_HISTOGRAM`: Merge different outputs generated by `HISTOGRAM` aggregations

This is the prerequisite for implementing the exact distributed percentile aggregation (#13885). However, these two new aggregations may be useful in other use-cases that need to do frequency counting.

Closes #13885.

Merging checklist:
 * [X] Working prototypes.
 * [X] Cleanup and docs.
 * [X]  Unit test.
 * [ ] Test with spark-rapids integration tests.

Authors:
  - Nghia Truong (https://github.com/ttnghia)

Approvers:
  - Robert Maynard (https://github.com/robertmaynard)
  - Yunsong Wang (https://github.com/PointKernel)
  - Vukasin Milovanovic (https://github.com/vuule)

URL: #14045
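For reference, here is a minimal sketch of driving the new aggregations from libcudf's groupby interface. It assumes the factories added by #14045 follow libcudf's usual `make_*_aggregation` naming (`make_histogram_aggregation` / `make_merge_histogram_aggregation`); check `cudf/aggregation.hpp` for the exact signatures.

```cpp
// Minimal sketch, assuming the #14045 factories are named make_histogram_aggregation
// and make_merge_histogram_aggregation (libcudf's usual pattern).
#include <cudf/aggregation.hpp>
#include <cudf/groupby.hpp>
#include <cudf/table/table_view.hpp>

#include <vector>

// HISTOGRAM builds one list of (value, count) pairs per group from raw values;
// MERGE_HISTOGRAM combines partial histograms produced by HISTOGRAM after a shuffle.
auto group_histograms(cudf::column_view keys, cudf::column_view values, bool merge_partials)
{
  cudf::groupby::groupby gb{cudf::table_view{{keys}}};
  std::vector<cudf::groupby::aggregation_request> requests(1);
  requests[0].values = values;  // raw values, or partial histograms when merging
  requests[0].aggregations.push_back(
    merge_partials ? cudf::make_merge_histogram_aggregation<cudf::groupby_aggregation>()
                   : cudf::make_histogram_aggregation<cudf::groupby_aggregation>());
  return gb.aggregate(requests);
}
```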
rapids-bot bot pushed a commit that referenced this issue Sep 27, 2023
This implements JNI for  `HISTOGRAM` and `MERGE_HISTOGRAM` aggregations in both groupby and reduction.

Depends on:
 * #14045

Contributes to:
 * #13885.

Authors:
  - Nghia Truong (https://github.com/ttnghia)

Approvers:
  - Jason Lowe (https://github.com/jlowe)

URL: #14154