[FEA] Support more flexible construction of nested columns in pylibcudf #17192

Open
vyasr opened this issue Oct 28, 2024 · 0 comments
Assignees
Labels
feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. pylibcudf Issues specific to the pylibcudf package Python Affects Python cuDF API.

Comments


vyasr commented Oct 28, 2024

Is your feature request related to a problem? Please describe.
Currently pylibcudf exposes only a subset of the factories in libcudf. When they were added in #15257, we omitted the factories for nested types because of open questions around ownership and around what inputs columns should be constructible from. We also have not seriously considered how to create pylibcudf columns of list or string types whose underlying data and offset arrays are views into other arrays. This kind of construction can be done in libcudf by creating a column_view manually, but doing so requires a thorough understanding of Arrow data layouts and of how libcudf implements them (especially for strings since the large strings refactor).

These gaps are particularly problematic because strings, lists, and structs are the data types for which pylibcudf may have the most to offer. Beyond providing a higher-performance, low-level API that cudf users can reach for when necessary, pylibcudf can expose libcudf functionality for these types that has no home in cudf at all. Making it possible to work with these types transparently in pylibcudf is therefore a high priority, since it would address use cases for which we currently have no satisfactory solution.
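
For readers less familiar with the Arrow layouts referred to above, here is a minimal plain-Python sketch of how a list column and a string column are each described by an offsets array plus a flat data array. It is an illustration only: it does not use pylibcudf or libcudf, it ignores null masks and nested children, and the helper names `decode_list_column` and `decode_string_column` are hypothetical.

```python
# Plain-Python illustration of Arrow variable-size layouts (no GPU required).
# Assumption: no null mask, no nesting beyond one level.

def decode_list_column(offsets, values):
    """Row i of a list column is values[offsets[i]:offsets[i + 1]]."""
    return [values[offsets[i]:offsets[i + 1]] for i in range(len(offsets) - 1)]


def decode_string_column(offsets, chars):
    """Row i of a string column is the UTF-8 bytes chars[offsets[i]:offsets[i + 1]]."""
    return [
        chars[offsets[i]:offsets[i + 1]].decode("utf-8")
        for i in range(len(offsets) - 1)
    ]


# Three rows each: n rows are described by n + 1 offsets.
list_rows = decode_list_column([0, 2, 2, 5], [10, 20, 30, 40, 50])
string_rows = decode_string_column([0, 2, 2, 7], b"hiworld")

print(list_rows)    # [[10, 20], [], [30, 40, 50]]
print(string_rows)  # ['hi', '', 'world']
```

A column whose offsets and data are views into other arrays is the same picture with the two buffers owned elsewhere, which is where the ownership questions come from.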

Describe the solution you'd like
We should investigate the best ways to enable construction of pylibcudf columns of nested types, including from other data sources such as pairs of CuPy arrays, and we should make these constructors as easy to use as possible.
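
As one possible starting point for the discussion, the sketch below shows the kind of invariant checking a constructor taking an (offsets, values) pair might perform before building a list column. Everything here is an assumption for illustration: `check_offsets` is a hypothetical helper, not an existing pylibcudf or libcudf function, and it works on plain Python sequences rather than device arrays.

```python
# Hypothetical validation for an (offsets, values) pair; not a pylibcudf API.

def check_offsets(offsets, num_values):
    """Return the row count, or raise ValueError if the offsets are inconsistent.

    Assumption: this sketch requires n + 1 offsets to describe n rows.
    """
    if len(offsets) == 0:
        raise ValueError("offsets needs at least one entry (n rows -> n + 1 offsets)")
    if offsets[0] < 0:
        raise ValueError("offsets must be non-negative")
    # Each row's end must not come before its start.
    for prev, cur in zip(offsets, offsets[1:]):
        if cur < prev:
            raise ValueError("offsets must be non-decreasing")
    if offsets[-1] > num_values:
        raise ValueError("last offset points past the end of the values array")
    return len(offsets) - 1


num_rows = check_offsets([0, 2, 2, 5], num_values=5)
print(num_rows)  # 3
```

On device data these checks would need a kernel launch or a copy to host, so a real constructor might make them optional.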

Additional context
Where appropriate, we should consider adding constructors directly to libcudf as well. While it is possible to do everything we need with low-level libcudf APIs, one of the major synergies I anticipate between pylibcudf and libcudf is that pylibcudf will motivate usability improvements in libcudf that might otherwise have little impetus behind them. This is one such case where improving constructors directly in libcudf to help pylibcudf users can help a wider range of users, so we should seize the opportunity if it presents itself.

@vyasr vyasr added the feature request New feature or request label Oct 28, 2024
@vyasr vyasr added libcudf Affects libcudf (C++/CUDA) code. pylibcudf Issues specific to the pylibcudf package Python Affects Python cuDF API. labels Oct 28, 2024
Projects
Status: Todo