Support `corr` in `GroupBy.apply` through the jit engine #13767

shwina · 2023-07-26T14:30:01Z

Description

This PR enables computing the Pearson correlation between two columns of a group within a UDF. Concretely, syntax such as the following will be allowed and will produce the same result as pandas:

ans = df.groupby('key').apply(lambda group_df: group_df['x'].corr(group_df['y']))

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.

bdice · 2023-07-26T16:45:12Z

python/cudf/udf_cpp/shim.cu

+    double delta_l = lhs_ptr[idx] - lhs_mean;
+    double delta_r = rhs_ptr[idx] - rhs_mean;
+
+    numerators[idx] = delta_l * delta_r;


We should reimplement BlockVar to be a special case of a new function for computing Covariance (where the variance is the covariance of the data with itself). Then I think you can use the covariance function to compute the numerators and the variance function for the terms in the denominator.
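
The decomposition suggested here can be sketched in plain Python. The functions below are hypothetical stand-ins for the device functions in `shim.cu`, not the actual CUDA code, and the `n - 1` normalization is an assumption (it cancels in the correlation either way).

```python
import math


def covariance(xs, ys):
    """Sample covariance of two equal-length sequences (assumes ddof=1)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


def variance(xs):
    # Variance is the covariance of the data with itself.
    return covariance(xs, xs)


def corr(xs, ys):
    # Numerator from the covariance, denominator from the two variances.
    return covariance(xs, ys) / math.sqrt(variance(xs) * variance(ys))
```

Note that this sketch raises `ZeroDivisionError` when either input is constant; how such degenerate inputs should behave is a separate question.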

brandon-b-miller · 2023-07-27T19:48:25Z

This should now be ready for review.

bdice · 2023-07-27T22:53:35Z

@brandon-b-miller Before requesting review, can we get a title and description for the PR?

brandon-b-miller · 2023-07-28T13:04:38Z

> @brandon-b-miller Before requesting review, can we get a title and description for the PR?

Updated! :)

brandon-b-miller · 2023-07-28T13:08:48Z

While working on this PR, I wondered to what extent corr should be supported as a one-off versus writing general machinery for any similar function that maps two vectors to one scalar (the dot product comes to mind as something we already support, and I'm sure there are others). I'd be interested in others' opinions on this.
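
One way to picture such general machinery is a small registry of "two vectors in, one scalar out" reductions behind a shared entry point. The sketch below is purely illustrative plain Python: `BINARY_REDUCTIONS`, `register`, and `reduce_pair` are hypothetical names and do not reflect how cuDF's typing and lowering are organized.

```python
import math
from typing import Callable, Dict, Sequence

BinaryReduction = Callable[[Sequence[float], Sequence[float]], float]

# Hypothetical registry: name -> function mapping two vectors to one scalar.
BINARY_REDUCTIONS: Dict[str, BinaryReduction] = {}


def register(name: str):
    def decorator(func: BinaryReduction) -> BinaryReduction:
        BINARY_REDUCTIONS[name] = func
        return func
    return decorator


@register("dot")
def dot(xs, ys):
    return sum(x * y for x, y in zip(xs, ys))


@register("corr")
def corr(xs, ys):
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    # Pearson's r written entirely in terms of the dot product.
    return dot(dx, dy) / math.sqrt(dot(dx, dx) * dot(dy, dy))


def reduce_pair(name, xs, ys):
    """Shared entry point: validate once, then dispatch by name."""
    if len(xs) != len(ys):
        raise ValueError("inputs must have the same length")
    return BINARY_REDUCTIONS[name](xs, ys)
```

With this shape, adding another reduction only requires registering one more function; the validation and dispatch are shared.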

bdice

Just needs stronger test cases and a couple of minor tweaks.

python/cudf/udf_cpp/shim.cu

bdice · 2023-07-31T21:19:52Z

python/cudf/cudf/tests/test_groupby.py

@@ -433,6 +435,20 @@ def func(df):
    run_groupby_apply_jit_test(groupby_jit_data, func, ["key1"])


+@pytest.mark.parametrize("dtype", SUPPORTED_GROUPBY_NUMPY_TYPES)
+def test_groupby_apply_jit_correlation(groupby_jit_data, dtype):


Do we need to test data with NaNs? Infinity? Empty groups? Negative numbers? etc.

I'd like to see stronger test coverage for much more of our JIT code paths, not just corr...

brandon-b-miller · 2023-08-01

Closing the loop on this conversation: after some discussion offline, we found that robustly supporting special values for this reduction needs significant changes, which we'll tackle in a separate pull request.

bdice · 2023-08-01

Please file an issue for this. We also need to test the behavior of existing functions like variance and standard deviation for NaN support (do other functions ignore the NaN values like corr?).
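
To make the special-value questions concrete, here is a minimal plain-Python sketch of one possible set of semantics: skip pairs containing NaN and return NaN for degenerate groups. The NaN-skipping rule follows pandas' documented behavior of excluding missing values in `Series.corr`; whether the JIT path should match it is the open question, so treat `corr_skipna` as a hypothetical reference and not as the behavior of this PR.

```python
import math


def corr_skipna(xs, ys):
    """Pearson r over pairs where both values are non-NaN.

    Returns NaN when fewer than two complete pairs remain or when either
    column is constant (zero variance), instead of raising.
    """
    pairs = [(x, y) for x, y in zip(xs, ys)
             if not (math.isnan(x) or math.isnan(y))]
    n = len(pairs)
    if n < 2:
        return math.nan
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    syy = sum((y - mean_y) ** 2 for _, y in pairs)
    if sxx == 0.0 or syy == 0.0:
        return math.nan
    return sxy / math.sqrt(sxx * syy)
```

Test cases along the lines requested above would then cover a NaN in one column, an empty group, a constant column, negative values, and an infinity (which propagates to NaN in this sketch).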

This PR removes some extra stores and loads in our groupby apply lowering that don't appear to be necessary and are possibly slowing things down. This came up during #13767.

Authors:
- https://github.com/brandon-b-miller

Approvers:
- Bradley Dice (https://github.com/bdice)

URL: #13792

bdice

Some follow-up work but this looks good to me.

brandon-b-miller · 2023-08-02T02:11:38Z

/merge

Initial commit

1742685

github-actions bot added the Python label (Affects Python cuDF API) Jul 26, 2023

cleanup

0bfc9a8

bdice reviewed Jul 26, 2023

brandon-b-miller and others added 4 commits July 26, 2023 11:59

reimplement var in terms of covar, corr in terms of covar and std

3273c4f

Merge branch 'branch-23.10' of github.com:rapidsai/cudf into add-corr-udf

cc64061

Merge branch 'add-corr-udf' of github.com:shwina/cudf into add-corr-udf

4d7ec6a

generalize dtypes, pass tests

011fce7

brandon-b-miller added the feature request (New feature or request) and non-breaking (Non-breaking change) labels Jul 27, 2023

brandon-b-miller changed the title ~~Initial commit~~ Support corr in GroupBy.apply through the jit engine Jul 28, 2023

shwina marked this pull request as ready for review July 28, 2023 15:36

shwina requested a review from a team as a code owner July 28, 2023 15:36

shwina requested review from bdice and charlesbluca July 28, 2023 15:36

remove unnecessary pointer indirection

5730faa

bdice reviewed Jul 31, 2023

python/cudf/cudf/core/udf/groupby_typing.py (outdated, resolved)

python/cudf/udf_cpp/shim.cu (outdated, resolved)

python/cudf/udf_cpp/shim.cu (outdated, resolved)

brandon-b-miller and others added 2 commits July 31, 2023 13:53

Apply suggestions from code review

e02cc02

Co-authored-by: Bradley Dice

style

dc35515

bdice reviewed Jul 31, 2023

brandon-b-miller mentioned this pull request Aug 1, 2023

Remove unnecessary pointer copying in JIT GroupBy Apply #13792

Merged

brandon-b-miller and others added 2 commits August 1, 2023 08:31

drop float for now

e1cdad1

Apply suggestions from code review

14bd1be

Co-authored-by: Bradley Dice

brandon-b-miller and others added 2 commits August 1, 2023 11:13

style fixes

ff618ba

Empty commit

3f37b61

bdice approved these changes Aug 1, 2023

rapids-bot bot merged commit fe307c1 into rapidsai:branch-23.10 Aug 2, 2023
54 checks passed
