Support corr in GroupBy.apply through the jit engine #13767
Conversation
python/cudf/udf_cpp/shim.cu (Outdated)

double delta_l = lhs_ptr[idx] - lhs_mean;
double delta_r = rhs_ptr[idx] - rhs_mean;
numerators[idx] = delta_l * delta_r;
We should reimplement BlockVar to be a special case of a new function for computing Covariance (where the variance is the covariance of the data with itself). Then I think you can use the covariance function to compute the numerators and the variance function for the terms in the denominator.
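The suggested refactor can be illustrated outside of CUDA. Below is a minimal Python sketch (not the actual shim.cu implementation; function names are illustrative) of variance as a special case of covariance, and of how the covariance and variance then combine into the Pearson correlation:

```python
def block_cov(lhs, rhs):
    """Sample covariance (ddof=1) of two equal-length sequences."""
    n = len(lhs)
    lhs_mean = sum(lhs) / n
    rhs_mean = sum(rhs) / n
    total = 0.0
    for l, r in zip(lhs, rhs):
        total += (l - lhs_mean) * (r - rhs_mean)
    return total / (n - 1)

def block_var(data):
    # Variance is the covariance of the data with itself.
    return block_cov(data, data)

def block_corr(lhs, rhs):
    # Pearson correlation: covariance divided by the product of
    # the standard deviations (square roots of the variances).
    return block_cov(lhs, rhs) / (block_var(lhs) ** 0.5 * block_var(rhs) ** 0.5)
```

With this layout the numerator of corr reuses the covariance function and each denominator term reuses the variance function, matching the structure proposed in the comment.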
This should now be ready for review.
@brandon-b-miller Before requesting review, can we get a title and description for the PR?
Updated! :)
I wondered while working on this PR to what extent
Co-authored-by: Bradley Dice <[email protected]>
Just needs stronger test cases, and a couple minor tweaks.
@@ -433,6 +435,20 @@ def func(df):
    run_groupby_apply_jit_test(groupby_jit_data, func, ["key1"])


@pytest.mark.parametrize("dtype", SUPPORTED_GROUPBY_NUMPY_TYPES)
def test_groupby_apply_jit_correlation(groupby_jit_data, dtype):
Do we need to test data with NaNs? Infinity? Empty groups? Negative numbers? etc.
I'd like to see stronger test coverage for much more of our JIT code paths, not just corr
...
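The kinds of inputs the reviewer is asking about are worth making concrete. A small sketch (the cases and names here are illustrative, not from the PR's test suite) of why such values need explicit handling in a reduction: a naive mean propagates NaN and infinity, and fails outright on an empty group.

```python
import math

# Hypothetical special-value inputs of the kind a stronger test suite
# would parametrize over; these are not the PR's actual test data.
special_cases = {
    "with_nan": [1.0, math.nan, 3.0],
    "with_inf": [1.0, math.inf, 3.0],
    "negative": [-1.0, -2.0, -3.0],
    "empty": [],
}

def naive_mean(values):
    # A naive mean: NaN/inf propagate into the result, and an empty
    # group raises ZeroDivisionError instead of returning NaN.
    return sum(values) / len(values)

for name, data in special_cases.items():
    try:
        print(name, naive_mean(data))
    except ZeroDivisionError:
        print(name, "undefined (empty group)")
```

A reduction like corr builds on means and variances, so each of these cases needs a defined behavior (skip, propagate, or return NaN) before the tests can pin it down.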
Closing the loop on this conversation: after some discussion offline, it was found that significant changes are needed to robustly support special values for this reduction, which we'll tackle in a separate pull request.
Please file an issue for this -- and we also need to test the behavior of existing functions like variance and standard deviation for NaN support (do other functions ignore the NaN values like corr?).
This PR removes some extra stores and loads in our groupby apply lowering that don't appear to be necessary and may have been slowing things down. This came up during #13767. Authors: - https://github.com/brandon-b-miller Approvers: - Bradley Dice (https://github.com/bdice) URL: #13792
Some follow-up work but this looks good to me.
/merge
Description
This PR enables computing the Pearson correlation between two columns of a group within a UDF. Concretely, syntax such as the following will be allowed and will produce the same result as pandas:
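The code example from the original description was not preserved here. A hedged sketch of the pattern this PR presumably enables, shown with pandas (whose result cudf aims to match); the key and column names are illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    "key": [0, 0, 0, 1, 1, 1],
    "a": [1.0, 2.0, 3.0, 1.0, 2.0, 4.0],
    "b": [2.0, 4.0, 6.0, 1.0, 1.0, 2.0],
})

# A UDF that computes the Pearson correlation between two columns
# of each group; with cudf, the same function would run through
# GroupBy.apply under the JIT engine.
def func(group):
    return group["a"].corr(group["b"])

result = df.groupby("key").apply(func)
print(result)
```

Group 0 is perfectly linearly related (b = 2a), so its correlation is 1.0; group 1 is positively but imperfectly correlated.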
Checklist