Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP,API: Enable full Numpy2 compat #15897

Closed
wants to merge 11 commits into from

Conversation

seberg
Copy link
Contributor

@seberg seberg commented May 31, 2024

This PR has a few more changes with which tests seem to pass (except the new comparison tests) on NumPy 2 (there is a chance of issues on NumPy 1 for now).
I won't have much time next week to update/work on this but could continue after that, please don't hesitate to push/take over.
It's not a big PR huge now, but it probably makes sense to split out some more changes that seem obvious OK.

I'll add some inline comments after opening the PR. It may make sense to split out more of the changes.

Most important checks:

  • This does include the API break for e.g. Scalar(-1, dtype=np.uint8) to fail, rather than wrap around. I suspect that is fine, but it is a breaking change. (Necessary/helpful because it matches pandas behavior e.g. for uint8_series + 1000 with NumPy 2.
  • The comparison paths need fixing. I have ommitted them, because I wasn't sure how to best deal with them:
    • Getting them perfectly right, requires new comparison loops in libcudf to compare uint64 and int64.
    • Getting them perfectly right for scalar integers could be done via logic like if other > np.iinfo(self.dtype).max: return full_series_like(self, dtype=bool, value=False).
    • I guess cudf currently used the NumPy promotion, so one could use custom value-based logic. That logic may exist elsewhere or be stolen from pandas. Such logic would be useful to exactly match e.g. pandas.Series.where().

Ping @vyasr, this is to get the ball rolling and simplify discussion.

Tested via (hack but sharing) creating a custom env, since we need a patched cupy version (that artifact is from the conda-forge PR):
set -eu

echo "fetching artifact"
cd conda
curl https://artprodeus21.artifacts.visualstudio.com/A910fa339-c7c2-46e8-a579-7ea247548706/84710dde-1620-425b-80d0-4cf5baca359d/_apis/artifact/cGlwZWxpbmVhcnRpZmFjdDovL2NvbmRhLWZvcmdlL3Byb2plY3RJZC84NDcxMGRkZS0xNjIwLTQyNWItODBkMC00Y2Y1YmFjYTM1OWQvYnVpbGRJZC85NDE1NDgvYXJ0aWZhY3ROYW1lL2NvbmRhX2FydGlmYWN0c18yMDI0MDUyMy4xMC4xX2xpbnV4XzY0X2NfY29tcGlsZXJfdmVyc2lvbjEyY3VkYV9jX2gxMzE0MjdkMmRh0/content?format=zip --output artifact.zip
unzip artifact.zip
rm artifact.zip
mv conda_artifacts_* conda_artifact
cd conda_artifact
unzip *.zip
rm *.zip
cd ../..

ARTIFACT_PATH = `pwd`/conda/conda_artifact/build_artifacts
echo "Modifying channels"
sed -i "s$  - conda-forge$  - file://$ARTIFACT_PATH\n  - conda-forge$" dependencies.yaml
rapids-dependency-file-generator

echo "To finalize, please create a new dev environment, for example with:"
echo ""
echo "conda env create --name cudf_dev --file conda/environments/all_cuda-122_arch-x86_64.yaml"
echo "pip install --force-reinstall --pre pyarrow numpy numba"

xref: rapidsai/build-planning#38

seberg added 10 commits May 31, 2024 14:52
This aligns with NumPy, which deprecated this since a while and
raises an error now, for example for `Scalar(-1, dtype=np.uint8)`.

Necessary to ensure we give the right errors when promotion doesn't
up-cast based on values.
Note that pandas `where` seems to promote the Series based on the
value even with NumPy 2.  This was never copied by cudf (i.e. an
outstanding issue)
Pandas keeps using weak promotion even for strongly typed
"scalars" (i.e. 0-d objects).
This tries to (mostly) match that, but there may be better ways to
do it.  I am having difficulty to think of the best way though.
Copy link

copy-pr-bot bot commented May 31, 2024

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions github-actions bot added the Python Affects Python cuDF API. label May 31, 2024
raise TypeError(
f"Cannot safely cast non-equivalent "
f"{type(other).__name__} to {source_dtype.name}"
)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

type(other) can fail now e.g. for int8(200), so need to avoid it.

if (
_is_non_decimal_numeric_dtype(source_dtype)
and not other_is_scalar # can-cast fails for Python scalars
and _can_cast(other, source_dtype)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seemed safe/correct to me to always take the other path for scalars?

# Go via NumPy to get the value
other = np.array(other)
if other.dtype.kind in "ifc":
other = other.item()
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure what to do best here. Pandas has a lot of isinstance(x, np.integer) checks, I think. But to match pandas we need to accept 0-D arrays and probably even 0-D tensors.
However, it seemed like this path is only taken for arrays?

if (self.min() >= min_) and (self.max() <= max_):
# Use Python floats, which have precise comparison for float64.
# NOTE(seberg): it would make sense to limit to the mantissa range.
if (float(self.min()) >= min_) and (float(self.max()) <= max_):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to explain, NumPy 2 will compare this as float32, but the min_/max_ values can overflow the float32 range.

with pytest.raises(OverflowError):
func(random_series)

return
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pandas raises now on NumPy 2 for these.

@@ -1280,6 +1280,9 @@ def test_concat_join_empty_dataframes(
@pytest.mark.parametrize("sort", [True, False])
@pytest.mark.parametrize("join", ["inner", "outer"])
@pytest.mark.parametrize("axis", [1])
@pytest.mark.filterwarnings(
"ignore:No codes are cached because compilation:UserWarning"
)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure about these, I see them locally with the new cupy, but maybe they are not due to the new cupy version?

@@ -341,7 +341,6 @@ def test_dtype(in_dtype, expect):
np.complex128,
complex,
"S",
"a",
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Avoids NumPy deprecation warning.

# TODO: Remove again when NumPy is fixed:
np_ver = np.lib.NumpyVersion(np.__version__)
if dtype == "uint64" and (np_ver == "2.0.0rc1" or np_ver == "2.0.0rc2"):
pytest.skip("test fails due to a NumPy bug on 2.0 rc versions")
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NumPy in1d is used, but was broken for some uint64/int64 mixes. I have already fixed it in NumPy.

Comment on lines +84 to +85
# The scalar may be out of bounds, so go via array force-cast
# NOTE: This is a change in behavior
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# The scalar may be out of bounds, so go via array force-cast
# NOTE: This is a change in behavior

Maybe the comment isn't needed anymore (or at least not the note).

else:
val = _maybe_convert_to_default_type(
cudf.api.types.pandas_dtype(type(val))
).type(val)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This re-arranging enforces that Scalar(-1, dtype=np.uint8) and others out-of-bound versions raise (at least on NumPy 2 or greater).

@mroeschke mroeschke added improvement Improvement / enhancement to an existing function non-breaking Non-breaking change breaking Breaking change and removed non-breaking Non-breaking change labels Jun 3, 2024
rapids-bot bot pushed a commit that referenced this pull request Jul 15, 2024
Splitting out the non API changes from gh-15897, the Scalar API change is required for the tests to pass with NumPy 2, but almost all changes should be relatively straight forward here on their own.

(I will add inline comments.)

---

This PR does not fix integer comparisons, there are currently no tests that run into these.

xref: rapidsai/build-planning#38

Authors:
  - Sebastian Berg (https://github.com/seberg)

Approvers:
  - Matthew Roeschke (https://github.com/mroeschke)

URL: #16141
rapids-bot bot pushed a commit that referenced this pull request Jul 15, 2024
…#16140)

This aligns with NumPy, which deprecated this since a while and raises an error now on NumPy 2, for example for `Scalar(-1, dtype=np.uint8)`.

Since it aligns with NumPy, the DeprecationWarning of earlier NumPy versions is inherited for those.

This (or similar handling) is required to be compatible with NumPy 2/pandas, since the default needs to be to reject operation when values are out of bounds for e.g. `uint8_series + 1000`, the 1000 should not be silently cast to a `uint8`.

---

Split from gh-15897

xref: rapidsai/build-planning#38

Authors:
  - Sebastian Berg (https://github.com/seberg)

Approvers:
  - Matthew Roeschke (https://github.com/mroeschke)

URL: #16140
@seberg
Copy link
Contributor Author

seberg commented Jul 15, 2024

Closing, opened gh-16282 to track the last fix still needed.

@seberg seberg closed this Jul 15, 2024
@seberg seberg deleted the numpy2-compat branch July 15, 2024 19:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
breaking Breaking change improvement Improvement / enhancement to an existing function Python Affects Python cuDF API.
Projects
Status: Done
Status: Done
Development

Successfully merging this pull request may close these issues.

2 participants