Simplify implementation of interval_range() and fix behaviour for floating `freq` #13844

shwina · 2023-08-09T21:05:30Z

This PR simplifies the implementation of interval_range() and fixes a few different bugs in the process. It also moves all tests for interval indexes to tests/indexes/test_interval.py. Finally, while working on this PR, I ran into #13847; a fix for that is also included in this PR.

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.

shwina · 2023-08-10T13:04:50Z

python/cudf/cudf/tests/indexes/test_interval.py

+        (1.0, None, 2.5, 2),
+    ],
+)
+def test_interval_range_floating(start, stop, freq, periods):


These are the only new tests introduced in this PR. The other tests have been moved to this file from elsewhere.

shwina · 2023-08-10T13:05:25Z

python/cudf/cudf/core/index.py

+    if start is None:
+        start = end - freq * periods
+    elif freq is None:
+        quotient, remainder = divmod((end - start).value, periods.value)


divmod seems not to work with cudf.Scalar

Can you do all this manipulation on the host before scalarising at the end? That would need far fewer syncs.

We have tests written already that pass Scalar objects as input :(

The scalar caching should save us here, no? There shouldn't be a sync until device_value is called later on; .value should just return the cached host value if the scalar was constructed from a host scalar.
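To make the caching argument concrete, here is a toy model of a scalar that keeps its host value and only copies to the device on first use. It is not cudf.Scalar: the class name LazyScalar, the transfers counter, and the fake device handle are all hypothetical, and the sketch only illustrates why reading .value on a host-constructed scalar would not need a sync.

```python
class LazyScalar:
    """Toy stand-in for a device scalar with a cached host value.

    Hypothetical illustration only; this is not how cudf.Scalar is written.
    """

    def __init__(self, host_value):
        self._host_value = host_value  # cached at construction time
        self._device_value = None      # created lazily on first access
        self.transfers = 0             # counts simulated host-to-device copies

    @property
    def value(self):
        # Constructed from a host scalar, so the host value is already cached:
        # no device access (and therefore no sync) is needed here.
        return self._host_value

    @property
    def device_value(self):
        # The (simulated) copy happens only the first time it is requested.
        if self._device_value is None:
            self.transfers += 1
            self._device_value = ("device", self._host_value)
        return self._device_value


end, start, periods = LazyScalar(10), LazyScalar(0), LazyScalar(5)

# Host-side arithmetic on the cached values: no transfers so far.
quotient, remainder = divmod(end.value - start.value, periods.value)
```

In this model, divmod runs entirely on cached host values, and a transfer is only counted once something asks for device_value.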

galipremsagar

LGTM, minor copyright comment.

python/cudf/cudf/tests/test_interval.py

shwina · 2023-08-10T13:09:09Z

python/cudf/cudf/tests/indexes/test_interval.py

+    [
+        (0.0, None, 0.2, 5),
+        (0.0, 1.0, None, 5),
+        # (0.0, 1.0, 0.2, None), # Pandas returns only 4 intervals here


@galipremsagar what's the right way to handle tests that will eventually pass with Pandas 2.x?

pandas-dev/pandas#54477

Suggested a change below: https://github.com/rapidsai/cudf/pull/13844/files#r1290106880

galipremsagar · 2023-08-10T13:12:11Z

python/cudf/cudf/tests/indexes/test_interval.py

+    [
+        (0.0, None, 0.2, 5),
+        (0.0, 1.0, None, 5),
+        # (0.0, 1.0, 0.2, None), # Pandas returns only 4 intervals here


Suggested change

-        # (0.0, 1.0, 0.2, None), # Pandas returns only 4 intervals here
+        pytest.param(
+            (0.0, 1.0, 0.2, None),
+            marks=pytest.mark.xfail(
+                condition=not PANDAS_GE_210,
+                reason="https://github.com/pandas-dev/pandas/pull/54477",
+            ),
+        )

We can use this.

We might want to introduce a PANDAS_GE_210 variable in cudf.core._compat.py too.
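A minimal sketch of how the conditional xfail could be wired up, under some assumptions. The helper names release_tuple and installed_at_least are hypothetical, and the version check uses only the standard library as a stand-in for whatever cudf.core._compat actually does. Note also that, as far as I can tell, pytest.param takes the parameter values positionally when there are four argnames, so this sketch unpacks the tuple rather than passing it as a single value.

```python
from importlib.metadata import PackageNotFoundError, version
from itertools import takewhile

import pytest


def release_tuple(raw, width=3):
    """Parse the leading numeric release components, e.g. '2.1.0rc0' -> (2, 1, 0)."""
    parts = []
    for piece in raw.split(".")[:width]:
        digits = "".join(takewhile(str.isdigit, piece))
        parts.append(int(digits) if digits else 0)
    parts += [0] * (width - len(parts))  # pad short versions such as '2.1'
    return tuple(parts)


def installed_at_least(package, minimum):
    """Return True if `package` is installed at `minimum` or newer.

    Simplification: pre-release suffixes are ignored, so '2.1.0rc0' counts as 2.1.0.
    """
    try:
        return release_tuple(version(package), len(minimum)) >= tuple(minimum)
    except PackageNotFoundError:
        return False


# Hypothetical stand-in for the flag proposed in the review.
PANDAS_GE_210 = installed_at_least("pandas", (2, 1, 0))

# Intended for pytest.mark.parametrize("start,stop,freq,periods", ...).
INTERVAL_RANGE_CASES = [
    (0.0, None, 0.2, 5),
    (0.0, 1.0, None, 5),
    pytest.param(
        0.0,
        1.0,
        0.2,
        None,
        marks=pytest.mark.xfail(
            condition=not PANDAS_GE_210,
            reason="https://github.com/pandas-dev/pandas/pull/54477",
        ),
    ),
]
```

The xfail only applies while the installed pandas is older than 2.1.0, so the case should start counting as a normal test once the upstream fix is available.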

wence-

Does it make sense to copy all the scalars to device, only to manipulate them and copy back? Would it be better to do all the preprocessing on the host and then move stuff once?
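For illustration, the host-first approach could look roughly like the sketch below: fill in whichever argument is missing using plain Python numbers, and only afterwards build device scalars. The function name resolve_interval_args is hypothetical, the "exactly three arguments" check is an assumption, and so is the handling of the divmod remainder (the diff hunk only shows the divmod call itself).

```python
def resolve_interval_args(start=None, end=None, freq=None, periods=None):
    """Fill in the one missing argument using host arithmetic only.

    Hypothetical sketch; mirrors the branches visible in the diff hunks.
    """
    # Assumption: exactly three of the four arguments are supplied.
    given = sum(arg is not None for arg in (start, end, freq, periods))
    if given != 3:
        raise ValueError(
            "exactly three of start, end, freq, periods must be specified"
        )

    if start is None:
        start = end - freq * periods
    elif freq is None:
        quotient, remainder = divmod(end - start, periods)
        # Assumption: keep an exact step when one exists,
        # otherwise fall back to true division.
        freq = quotient if remainder == 0 else (end - start) / periods
    elif periods is None:
        periods = int((end - start) / freq)
    elif end is None:
        end = start + periods * freq

    # Only at this point would the values be wrapped as device scalars.
    return start, end, freq, periods
```

With this shape, every value returned is non-None, and any conversion to a device scalar happens once per argument at the end.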

wence- · 2023-08-10T15:15:21Z

python/cudf/cudf/core/index.py

+    if start is None:
+        start = end - freq * periods
+    elif freq is None:
+        quotient, remainder = divmod((end - start).value, periods.value)


Can you do all this manipulation on the host before scalarising at the end? That would need far fewer syncs.

wence- · 2023-08-10T15:16:09Z

python/cudf/cudf/core/index.py

+        periods = cudf.Scalar(int((end - start) / freq))
+    elif end is None:
+        end = start + periods * freq
+
    if any(
        not _is_non_decimal_numeric_dtype(x.dtype) if x is not None else False


At this point all args must be non-None, so the `x is not None` guard here is unnecessary.

wence- · 2023-08-10T15:18:42Z

python/cudf/cudf/core/index.py

+        step=freq.device_value,
+    )
+    left_col = bin_edges.slice(0, len(bin_edges) - 1)
+    right_col = bin_edges.slice(1, len(bin_edges))


You're doing linspace here with arange-like calls. Are we sure the edge cases are handled correctly (specifically if periods has not been provided)?

Hmm, what edge cases would you be looking for?

To clarify the behaviour, when periods= hasn't been specified, Pandas does something like arange(start, end, freq):

In [2]: pd.interval_range(start=1, end=3.0, freq=0.7)
Out[2]: IntervalIndex([(1.0, 1.7], (1.7, 2.4]], dtype='interval[float64, right]')

Ah ok, they are also wrong then:

In [10]: pd.interval_range(start=1, end=3.0, freq=0.5)
Out[10]: IntervalIndex([(1.0, 1.5], (1.5, 2.0], (2.0, 2.5], (2.5, 3.0]], dtype='interval[float64, right]')

In [11]: pd.interval_range(start=1, end=3.0, freq=0.1)
Out[11]:
IntervalIndex([(1.0, 1.1055555555555556],
               (1.1055555555555556, 1.211111111111111],
               (1.211111111111111, 1.3166666666666667],
               (1.3166666666666667, 1.4222222222222223],
               (1.4222222222222223, 1.5277777777777777]
               ...
               (2.3722222222222222, 2.477777777777778],
               (2.477777777777778, 2.583333333333333],
               (2.583333333333333, 2.688888888888889],
               (2.688888888888889, 2.7944444444444443],
               (2.7944444444444443, 2.9]],
              dtype='interval[float64, right]')

Note how I asked for a frequency of 0.1, so the intervals should be (1, 1.1], (1.1, 1.2], etc.

Ah, because the implementation does loads of stuff with floats that suffers from rounding errors.

Yup, this should hopefully be fixed in pandas 2.1, which will just do an arange in these cases: pandas-dev/pandas#54477
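The arange-style construction discussed in this thread can be sketched on the host as follows. This is a plain-Python stand-in rather than the cudf code: periods_from_end and interval_bounds are hypothetical names, and the list slicing plays the role of the two bin_edges.slice(...) calls in the diff.

```python
import math


def periods_from_end(start, end, freq):
    """Number of whole steps of size `freq` that fit between start and end."""
    return int((end - start) / freq)


def interval_bounds(start, freq, periods):
    """Build periods + 1 bin edges, then split them into left and right bounds."""
    edges = [start + i * freq for i in range(periods + 1)]
    left = edges[:-1]   # every edge except the last
    right = edges[1:]   # every edge except the first
    return left, right


# Same inputs as the first pandas example in this thread.
periods = periods_from_end(1, 3.0, 0.7)
left, right = interval_bounds(1, 0.7, periods)
```

Because each edge is computed as start + i * freq, the step stays at the requested freq and the last partial step before end is dropped, which matches the two intervals pandas returns for freq=0.7.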

python/cudf/cudf/core/scalar.py

shwina · 2023-08-11T17:55:20Z

/merge

Simplify implementation of interval_range()

40e0d3d

github-actions bot added the Python (Affects Python cuDF API.) label Aug 9, 2023

Add one more test

e6d8cb2

shwina added the bug (Something isn't working) label Aug 9, 2023

shwina self-assigned this Aug 9, 2023

Merge branch 'branch-23.10' into fix-interval-index-construction

0868909

shwina added the non-breaking (Non-breaking change) label Aug 9, 2023

shwina added 4 commits August 10, 2023 07:06

Move interval index tests to indexes/test_interval.py

975dc8e

Make reflected scalar binops work correctly

b6c3158

Fixes for interval_range

67dadd5

Merge branch 'fix-interval-index-construction' of github.com:shwina/c…

a8e921d

…udf into fix-interval-index-construction

shwina marked this pull request as ready for review August 10, 2023 13:03

shwina requested a review from a team as a code owner August 10, 2023 13:03

shwina requested review from galipremsagar and brandon-b-miller August 10, 2023 13:03

galipremsagar approved these changes Aug 10, 2023

galipremsagar reviewed Aug 10, 2023

shwina and others added 2 commits August 10, 2023 09:12

Update python/cudf/cudf/tests/test_interval.py

636b7cd

Co-authored-by: GALI PREM SAGAR <[email protected]>

xfail test

4c6f358

galipremsagar approved these changes Aug 10, 2023

wence- reviewed Aug 10, 2023

shwina and others added 2 commits August 10, 2023 13:28

Unnecessary None check

ffe778b

Merge branch 'branch-23.10' into fix-interval-index-construction

04ed25f

rapids-bot bot merged commit 6fea2df into rapidsai:branch-23.10 Aug 11, 2023
54 checks passed
