Support corr in GroupBy.apply through the jit engine #13767

Merged
merged 13 commits into rapidsai:branch-23.10 on Aug 2, 2023

@shwina shwina commented Jul 26, 2023

Description

This PR enables computing the Pearson correlation between two columns of a group within a UDF. Concretely, syntax such as the following will be allowed and will produce the same result as pandas:

ans = df.groupby('key').apply(lambda group_df: group_df['x'].corr(group_df['y']))
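For reference, here is a minimal sketch of the pandas behavior that the line above is meant to match. The toy data and the helper name `group_corr` are made up for illustration, and the example runs on pandas only, so it does not exercise cuDF or the JIT engine.

```python
import pandas as pd

# Hypothetical toy data: x and y are perfectly correlated in group "a"
# and perfectly anti-correlated in group "b".
df = pd.DataFrame(
    {
        "key": ["a", "a", "a", "b", "b", "b"],
        "x": [1.0, 2.0, 3.0, 1.0, 2.0, 3.0],
        "y": [2.0, 4.0, 6.0, 3.0, 2.0, 1.0],
    }
)


def group_corr(group_df):
    # Pearson correlation between two columns of a single group.
    return group_df["x"].corr(group_df["y"])


# Selecting the value columns keeps the grouping column out of the UDF input.
ans = df.groupby("key")[["x", "y"]].apply(group_corr)

# Assumption: on a GPU machine, the cuDF version of this call would
# additionally request the JIT engine, e.g. apply(group_corr, engine="jit").
print(ans)
```

The result is a Series indexed by `key`, with one correlation value per group.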

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@github-actions github-actions bot added the "Python" (Affects Python cuDF API) label Jul 26, 2023
double delta_l = lhs_ptr[idx] - lhs_mean;
double delta_r = rhs_ptr[idx] - rhs_mean;

numerators[idx] = delta_l * delta_r;

We should reimplement BlockVar to be a special case of a new function for computing Covariance (where the variance is the covariance of the data with itself). Then I think you can use the covariance function to compute the numerators and the variance function for the terms in the denominator.
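The relationship this comment relies on can be sketched in plain Python. This is not the CUDA `BlockVar` code from `shim.cu`, only an illustration of the math. The function names are hypothetical, and the sample (ddof=1) normalization is an assumption chosen to match the pandas default; the normalization cancels in the correlation either way.

```python
import math


def covariance(xs, ys):
    """Sample covariance (ddof=1) of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations: the "numerators" in the snippet above.
    total = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return total / (n - 1)


def variance(xs):
    # Variance as a special case: the covariance of the data with itself.
    return covariance(xs, xs)


def pearson(xs, ys):
    # Covariance in the numerator, variances in the denominator.
    return covariance(xs, ys) / math.sqrt(variance(xs) * variance(ys))


xs = [1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 2.0, 5.0]
r = pearson(xs, ys)
```

With this structure, only `covariance` contains reduction logic; `variance` and `pearson` are thin wrappers around it.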

@brandon-b-miller brandon-b-miller added the "feature request" (New feature or request) and "non-breaking" (Non-breaking change) labels Jul 27, 2023
@brandon-b-miller

This should now be ready for review.

@bdice

bdice commented Jul 27, 2023

@brandon-b-miller Before requesting review, can we get a title and description for the PR?

@brandon-b-miller brandon-b-miller changed the title from "Initial commit" to "Support corr in GroupBy.apply through the jit engine" Jul 28, 2023
@brandon-b-miller

> @brandon-b-miller Before requesting review, can we get a title and description for the PR?

Updated! :)

@brandon-b-miller

While working on this PR, I wondered to what extent corr should be supported as a one-off versus writing general machinery for any similar function that maps two vectors to one scalar (the dot product comes to mind as something we already support, and I'm sure there are others). I'd be interested in others' opinions on this.
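One way to picture the "general machinery" option is a small registry of two-vector-to-scalar functions behind a single entry point. The sketch below is plain Python and purely illustrative: `BINARY_REDUCTIONS` and `binary_reduce` are hypothetical names, not part of cuDF, and nothing here reflects how the JIT engine actually registers or lowers functions.

```python
import math


def dot(xs, ys):
    # Dot product of two equal-length sequences.
    return sum(x * y for x, y in zip(xs, ys))


def corr(xs, ys):
    # Pearson correlation expressed in terms of dot products of deviations.
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    return dot(dx, dy) / math.sqrt(dot(dx, dx) * dot(dy, dy))


# Hypothetical registry: every entry maps two vectors to one scalar.
BINARY_REDUCTIONS = {"dot": dot, "corr": corr}


def binary_reduce(name, xs, ys):
    # Shared validation lives in one place instead of in each reduction.
    if len(xs) != len(ys):
        raise ValueError("inputs must have the same length")
    return BINARY_REDUCTIONS[name](xs, ys)


result_dot = binary_reduce("dot", [1, 2, 3], [4, 5, 6])
result_corr = binary_reduce("corr", [1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
```

Under this design, adding another function of the same shape would mean adding one registry entry rather than new one-off plumbing.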

@shwina shwina marked this pull request as ready for review July 28, 2023 15:36
@shwina shwina requested a review from a team as a code owner July 28, 2023 15:36
python/cudf/cudf/core/udf/groupby_typing.py (outdated, resolved)
python/cudf/udf_cpp/shim.cu (outdated, resolved)
python/cudf/udf_cpp/shim.cu (outdated, resolved)

@bdice bdice left a comment

Just needs stronger test cases, and a couple minor tweaks.

python/cudf/udf_cpp/shim.cu (resolved)
python/cudf/udf_cpp/shim.cu (outdated, resolved)
@@ -433,6 +435,20 @@ def func(df):
run_groupby_apply_jit_test(groupby_jit_data, func, ["key1"])


@pytest.mark.parametrize("dtype", SUPPORTED_GROUPBY_NUMPY_TYPES)
def test_groupby_apply_jit_correlation(groupby_jit_data, dtype):

Do we need to test data with NaNs? Infinity? Empty groups? Negative numbers? etc.

I'd like to see stronger test coverage for much more of our JIT code paths, not just corr...

Closing the loop on this conversation: after some offline discussion, we found that robustly supporting special values for this reduction requires significant changes, which we'll tackle in a separate pull request.

@bdice bdice Aug 1, 2023

Please file an issue for this. We also need to test how existing functions like variance and standard deviation handle NaN values (do other functions ignore NaN values the way corr does?).
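As a point of comparison for that follow-up, the sketch below shows how pandas itself treats NaN in `corr` and `var`. It uses made-up data and pandas only; it says nothing about what the cuDF JIT code paths currently do, which is exactly what the requested issue would need to establish.

```python
import numpy as np
import pandas as pd

# Hypothetical data: y == 2 * x wherever both values are present.
x = pd.Series([1.0, 2.0, np.nan, 4.0, 5.0])
y = pd.Series([2.0, np.nan, 6.0, 8.0, 10.0])

# Series.corr drops any pair in which either value is missing.
r_default = x.corr(y)

# Doing the pairwise filtering by hand gives the same answer.
mask = x.notna() & y.notna()
r_manual = x[mask].corr(y[mask])

# Series.var skips NaN by default (skipna=True) ...
var_skip = x.var()
# ... and propagates NaN when asked not to skip.
var_keep = x.var(skipna=False)
```

A test that compares a JIT result against pandas on data like this would surface any difference in NaN handling between the two.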

rapids-bot bot pushed a commit that referenced this pull request Aug 1, 2023
This PR removes some extra stores and loads in our groupby apply lowering that don't appear to be necessary and are possibly slowing things down. This came up during #13767.

Authors:
  - https://github.com/brandon-b-miller

Approvers:
  - Bradley Dice (https://github.com/bdice)

URL: #13792

@bdice bdice left a comment

Some follow-up work remains, but this looks good to me.

@brandon-b-miller

/merge

@rapids-bot rapids-bot bot merged commit fe307c1 into rapidsai:branch-23.10 Aug 2, 2023
54 checks passed
Labels
feature request New feature or request non-breaking Non-breaking change Python Affects Python cuDF API.

3 participants