Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor Dask cuDF legacy code #17205

Merged
merged 22 commits into from
Nov 4, 2024

Conversation

rjzamora
Copy link
Member

@rjzamora rjzamora commented Oct 29, 2024

Description

The "legacy" DataFrame API is now deprecated (dask/dask#11437). The main purpose of this PR is to start isolating legacy code in Dask cuDF.

Old layout:

dask_cudf/
├── expr/
│   ├── _collection.py
│   ├── _expr.py
│   ├── _groupby.py
├── io/
│   ├── tests/
│   ├── ...
│   ├── parquet.py
│   ├── ...
├── tests/
├── accessors.py
├── backends.py
├── core.py
├── groupby.py
├── sorting.py

New layout:

dask_cudf/
├── _expr/
│   ├── accessors.py
│   ├── collection.py
│   ├── expr.py
│   ├── groupby.py
├── _legacy/
│   ├── io/
│   ├── core.py
│   ├── groupby.py
│   ├── sorting.py
├── io/
│   ├── tests/
│   ├── ...
│   ├── parquet.py
│   ├── ...
├── tests/
├── backends.py
├── core.py

Notes

  • This PR adds some backward compatibility to the expr-based API that was previously missing: The user can now import collection classes from dask_cudf.core (previously led to a "silent" bug when query-planning was enabled).
  • The user can also import various IO functions from dask_cudf.io (and sub-modules like dask_cudf.io.parquet), but they will typically get a deprecation warning.
  • This PR is still technically "breaking" in the sense that the user can no longer import some functions/classes from dask_cudf.io.*. Also, the groupby, sorting, and accessors modules have simply moved. It should be uncommon for down-stream code to import from these modules. It's also worth noting that query-planning was already causing problems for these users if they were doing this.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@rjzamora rjzamora added 2 - In Progress Currently a work in progress dask Dask issue non-breaking Non-breaking change labels Oct 29, 2024
@rjzamora rjzamora self-assigned this Oct 29, 2024
@rjzamora rjzamora requested a review from a team as a code owner October 29, 2024 21:49
@github-actions github-actions bot added the Python Affects Python cuDF API. label Oct 29, 2024
@rjzamora rjzamora added improvement Improvement / enhancement to an existing function breaking Breaking change and removed non-breaking Non-breaking change labels Oct 29, 2024
# For dask>2024.2.0, we can silence the loud deprecation
# warning before importing `dask.dataframe` (this won't
# do anything for dask==2024.2.0)
config.set({"dataframe.query-planning-warning": False})
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No longer relevant.

_PATCHED = False


def _patch_dask_expr():
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most of this PR is just moving around existing code. This change is actually new - We used to implicitly "patch" dask-expr when the expr module was imported. I'm trying to make the patching a bit more explicit (but I only want to allow this function to do its thing once).

@madsbk
Copy link
Member

madsbk commented Oct 30, 2024

If _accessors.py & co. are all private, would it make sense to move them to a sub-folder and avoid the _ prefix?

@wence-
Copy link
Contributor

wence- commented Oct 30, 2024

Question: is there any user-visible change to this refactoring, such that we must warn users about the deprecation? Or is it that we can just swap out the implementation from under them?

@rjzamora
Copy link
Member Author

If _accessors.py & co. are all private, would it make sense to move them to a sub-folder and avoid the _ prefix?

I'm happy to do whatever others prefer. I just want to choose some convention that helps us move in the direction of making everything other than dask_cudf.<function-or-class> private. Dask-expr uses a flat directory of .py files with the _ prefix. We can certainly put everything in a _-prefixed directory instead.

Question: is there any user-visible change to this refactoring, such that we must warn users about the deprecation? Or is it that we can just swap out the implementation from under them?

There may be some code that people were importing directly from dask_cudf.io (e.g. dask_cudf.io.parquet.create_metadata_file). If I understand correctly, everything else is already broken when query-planning is enabled, so we would only cause "new" problems for legacy-API users (and the legacy API will definitely be gone for 25.02).

With that said, I'll try to make some changes to better-warn people who try to import things from "deprecated" paths.

@madsbk
Copy link
Member

madsbk commented Oct 31, 2024

@rjzamora agree, the important thing is to mark them private.
What about something like:

dask_cudf/
├── _expr/
│   ├── collection.py
│   ├── expr.py
│   ├── groupby.py
│   ├── accessors.py

@rjzamora
Copy link
Member Author

rjzamora commented Nov 1, 2024

What about something like:

Sounds good @madsbk - I made this change.

Comment on lines -41 to 55
def raise_not_implemented_error(attr_name):
def _deprecated_api(old_api, new_api=None, rec=None):
def inner_func(*args, **kwargs):
if new_api:
# Use alternative
msg = f"{old_api} is now deprecated. "
msg += rec or f"Please use {new_api} instead."
warnings.warn(msg, FutureWarning)
new_attr = new_api.split(".")
module = import_module(".".join(new_attr[:-1]))
return getattr(module, new_attr[-1])(*args, **kwargs)

# No alternative - raise an error
raise NotImplementedError(
f"Top-level {attr_name} API is not available for dask-expr."
f"{old_api} is no longer supported. " + (rec or "")
)

return inner_func
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: I decided to expand this utility to help deal with the many dask_cudf.io APIs that are no longer directly accessible with query-planning on.


to_orc = _deprecated_api(
"dask_cudf.to_orc",
new_api="dask_cudf._legacy.io.to_orc",
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The legacy to_orc API is actually "fine" (it's compatible with query-planning stuff), but we might as well discourage this kind of usage.

Copy link
Member

@madsbk madsbk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, thanks @rjzamora

@rjzamora rjzamora added 3 - Ready for Review Ready for review by team and removed 2 - In Progress Currently a work in progress labels Nov 4, 2024
@rjzamora rjzamora changed the title [WIP] Refactor Dask cuDF legacy code Refactor Dask cuDF legacy code Nov 4, 2024
@rjzamora rjzamora added 5 - Ready to Merge Testing and reviews complete, ready to merge 3 - Ready for Review Ready for review by team and removed 3 - Ready for Review Ready for review by team 5 - Ready to Merge Testing and reviews complete, ready to merge labels Nov 4, 2024
@rjzamora rjzamora added 5 - Ready to Merge Testing and reviews complete, ready to merge and removed 3 - Ready for Review Ready for review by team labels Nov 4, 2024
@rjzamora
Copy link
Member Author

rjzamora commented Nov 4, 2024

/merge

@rapids-bot rapids-bot bot merged commit 45563b3 into rapidsai:branch-24.12 Nov 4, 2024
102 checks passed
@rjzamora rjzamora deleted the refactor-dask-cudf branch November 4, 2024 21:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
5 - Ready to Merge Testing and reviews complete, ready to merge breaking Breaking change dask Dask issue improvement Improvement / enhancement to an existing function Python Affects Python cuDF API.
Projects
Status: Done
Status: Done
Development

Successfully merging this pull request may close these issues.

3 participants