Simplify implementation of interval_range() and fix behaviour for floating freq #13844

Merged

Conversation

shwina
Contributor

@shwina shwina commented Aug 9, 2023

Closes #13843
Closes #13847

This PR simplifies the implementation of interval_range() and fixes a few different bugs in the process. It also moves all tests for interval indexes to tests/indexes/test_interval.py. Finally, while working on this PR, I ran into #13847; a fix for that is also included in this PR.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@github-actions github-actions bot added the Python Affects Python cuDF API. label Aug 9, 2023
@shwina shwina added the bug Something isn't working label Aug 9, 2023
@shwina shwina self-assigned this Aug 9, 2023
@shwina shwina added the non-breaking Non-breaking change label Aug 9, 2023
@shwina shwina marked this pull request as ready for review August 10, 2023 13:03
@shwina shwina requested a review from a team as a code owner August 10, 2023 13:03
        (1.0, None, 2.5, 2),
    ],
)
def test_interval_range_floating(start, stop, freq, periods):
Contributor Author

These are the only new tests introduced in this PR. The other tests have been moved to this file from elsewhere.

    if start is None:
        start = end - freq * periods
    elif freq is None:
        quotient, remainder = divmod((end - start).value, periods.value)
Contributor Author

divmod does not seem to work with cudf.Scalar, hence the .value calls.

Contributor

Can you do all this manipulation on the host before scalarising at the end? That would need far fewer syncs.

Contributor Author

We have tests written already that pass Scalar objects as input :(

Contributor

The scalar caching should save us here, no? There shouldn't be a sync until device_value is called later on. Accessing .value should just return the cached host value if the scalar was constructed from a host scalar.
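
The host-first approach suggested in this thread can be sketched in plain Python. This is a minimal illustration, not the code merged in this PR: the helper name `resolve_interval_args` and the exactly-three-arguments check are assumptions, and the handling of a non-zero remainder is one plausible choice, since the diff excerpt above does not show it.

```python
def resolve_interval_args(start=None, end=None, freq=None, periods=None):
    """Fill in the one missing argument using ordinary host-side numbers.

    Hypothetical helper: assumes exactly three of the four arguments are given.
    """
    given = sum(x is not None for x in (start, end, freq, periods))
    if given != 3:
        raise ValueError("exactly three of start, end, freq, periods are required")

    if start is None:
        start = end - freq * periods
    elif freq is None:
        # Keep an exact quotient when the span divides evenly,
        # otherwise fall back to true division (an assumption).
        quotient, remainder = divmod(end - start, periods)
        freq = quotient if remainder == 0 else (end - start) / periods
    elif periods is None:
        periods = int((end - start) / freq)
    elif end is None:
        end = start + periods * freq
    return start, end, freq, periods


# Example: the (1.0, None, 2.5, 2) test case from this PR.
print(resolve_interval_args(start=1.0, freq=2.5, periods=2))
```

In this arrangement the four values would be wrapped in device scalars only once, after the missing argument has been computed.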

Contributor

@galipremsagar galipremsagar left a comment

LGTM, minor copyright comment.

python/cudf/cudf/tests/test_interval.py (outdated)
    [
        (0.0, None, 0.2, 5),
        (0.0, 1.0, None, 5),
        # (0.0, 1.0, 0.2, None), # Pandas returns only 4 intervals here
Contributor Author

@galipremsagar what's the right way to handle tests that will eventually pass with Pandas 2.x?

pandas-dev/pandas#54477

Contributor

Suggested change
        # (0.0, 1.0, 0.2, None), # Pandas returns only 4 intervals here
        pytest.param(
            0.0,
            1.0,
            0.2,
            None,
            marks=pytest.mark.xfail(
                condition=not PANDAS_GE_210,
                reason="https://github.com/pandas-dev/pandas/pull/54477",
            ),
        ),

We can use this

Contributor

We might want to introduce a PANDAS_GE_210 variable in cudf/core/_compat.py too.
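
A compatibility flag of this kind might look like the following sketch. The helper `version_at_least` is hypothetical and uses only the standard library; a real project would more likely compare parsed versions with the `packaging` library, which also orders pre-releases correctly. The version string is hard-coded here in place of `pandas.__version__` so that the example runs without pandas installed.

```python
def version_at_least(version, minimum):
    """Return True if the leading numeric components of `version` are >= `minimum`.

    Hypothetical helper: pre-release suffixes such as "rc0" are ignored,
    so "2.1.0rc0" is treated like "2.1.0".
    """
    parts = []
    for piece in version.split(".")[: len(minimum)]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits) if digits else 0)
    return tuple(parts) >= tuple(minimum)


# Stand-in for pandas.__version__ (an assumption for this example).
pandas_version = "2.0.3"
PANDAS_GE_210 = version_at_least(pandas_version, (2, 1))
print(PANDAS_GE_210)
```

With a flag like this, the xfail condition in the suggested change only applies when the installed pandas predates 2.1.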

Contributor

@wence- wence- left a comment

Does it make sense to copy all the scalars to the device, only to manipulate them and copy them back? Would it be better to do all the preprocessing on the host and then move everything once?

        periods = cudf.Scalar(int((end - start) / freq))
    elif end is None:
        end = start + periods * freq

    if any(
        not _is_non_decimal_numeric_dtype(x.dtype) if x is not None else False
Contributor

At this point, none of the args can be None, so the "is not None" guard is unnecessary.

Contributor Author

Fix'd

        step=freq.device_value,
    )
    left_col = bin_edges.slice(0, len(bin_edges) - 1)
    right_col = bin_edges.slice(1, len(bin_edges))
Contributor

You're doing a linspace here with arange-like calls. Are we sure the edge cases are handled correctly (specifically when periods has not been provided)?

Contributor Author

Hmm, what edge cases would you be looking for?

To clarify the behaviour, when periods= hasn't been specified, Pandas does something like arange(start, end, freq):

In [2]: pd.interval_range(start=1, end=3.0, freq=0.7)
Out[2]: IntervalIndex([(1.0, 1.7], (1.7, 2.4]], dtype='interval[float64, right]')
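
The arange-like behaviour described here, combined with the left/right slicing from the diff, can be sketched on the host as follows. This is an illustration under assumptions, not the cuDF or pandas implementation: `arange_like_intervals` is a hypothetical name, and plain lists stand in for device columns.

```python
def arange_like_intervals(start, end, freq):
    """Build (left, right] pairs by stepping from start in increments of freq.

    Hypothetical helper: the last interval never extends past `end`.
    """
    n_intervals = int((end - start) / freq)
    # n_intervals + 1 bin edges, each computed from start to avoid
    # accumulating error through repeated addition.
    bin_edges = [start + i * freq for i in range(n_intervals + 1)]
    left = bin_edges[:-1]   # every edge except the last
    right = bin_edges[1:]   # every edge except the first
    return list(zip(left, right))


for left, right in arange_like_intervals(1, 3.0, 0.7):
    print(f"({left:.1f}, {right:.1f}]")
```

For start=1, end=3.0 and freq=0.7 this yields two intervals, matching the shape of the pandas output above.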

Contributor

Ah ok, they are also wrong then:

In [10]: pd.interval_range(start=1, end=3.0, freq=0.5)
Out[10]: IntervalIndex([(1.0, 1.5], (1.5, 2.0], (2.0, 2.5], (2.5, 3.0]], dtype='interval[float64, right]')

In [11]: pd.interval_range(start=1, end=3.0, freq=0.1)
Out[11]: IntervalIndex([(1.0, 1.1055555555555556], (1.1055555555555556, 1.211111111111111], (1.211111111111111, 1.3166666666666667], (1.3166666666666667, 1.4222222222222223], (1.4222222222222223, 1.5277777777777777] ... (2.3722222222222222, 2.477777777777778], (2.477777777777778, 2.583333333333333], (2.583333333333333, 2.688888888888889], (2.688888888888889, 2.7944444444444443], (2.7944444444444443, 2.9]], dtype='interval[float64, right]')

Contributor

Note how I asked for a frequency of 0.1 (so the intervals should be (1, 1.1] etc...)

Contributor

Ah, because the implementation does a lot of float arithmetic that suffers from rounding errors.
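
One way floating-point rounding can change an interval count is shown below. This is a generic illustration, not a diagnosis of the pandas implementation: both helper names are hypothetical, and the tolerance is an arbitrary choice.

```python
import math


def periods_by_floor_division(start, end, freq):
    """Count whole steps with floor division; sensitive to float rounding."""
    return int((end - start) // freq)


def periods_with_tolerance(start, end, freq, rel_tol=1e-9):
    """Count whole steps, snapping to the nearest integer when the ratio is
    within a small tolerance of it (hypothetical mitigation)."""
    ratio = (end - start) / freq
    nearest = round(ratio)
    if math.isclose(ratio, nearest, rel_tol=rel_tol):
        return int(nearest)
    return math.floor(ratio)


# 0.1 is stored as a value slightly above one tenth, so twenty steps
# of it overshoot 2.0 and floor division reports only nineteen.
print(periods_by_floor_division(1, 3.0, 0.1))  # 19
print(periods_with_tolerance(1, 3.0, 0.1))     # 20
```

The tolerance-based variant still floors ratios that are genuinely fractional, such as the freq=0.7 case, which gives 2.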

Contributor

Yup, this should hopefully be fixed in pandas 2.1, which will just do an arange in these cases: pandas-dev/pandas#54477

python/cudf/cudf/core/scalar.py (resolved)
@shwina
Contributor Author

shwina commented Aug 11, 2023

/merge

@rapids-bot rapids-bot bot merged commit 6fea2df into rapidsai:branch-23.10 Aug 11, 2023
54 checks passed
Labels
bug Something isn't working non-breaking Non-breaking change Python Affects Python cuDF API.
5 participants