[FEA] Add function for "deduplicate map" to libcudf #17236

Open
GregoryKimball opened this issue Nov 1, 2024 · 0 comments
Labels: feature request (New feature or request), libcudf (Affects libcudf (C++/CUDA) code)

GregoryKimball commented Nov 1, 2024

Is your feature request related to a problem? Please describe.

In cuDF, map types are often modeled as list<struct<string, TYPE>>. cuDF supports aggregations with list<...> payloads, using the MERGE_LISTS and MERGE_SETS aggregation kinds. However, the common "map concat" use case requires deduplication over the keys in each row.

For example:

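import cudf

# Each row is one map, stored as a list of {key, value} structs.
input_data = cudf.DataFrame({'a': [
    [{'key': 'k1', 'value': 'red'}, {'key': 'k2', 'value': 'green'}],
    [{'key': 'k1', 'value': 'orange'}, {'key': 'k1', 'value': 'blue'}],
    [{'key': 'k1', 'value': 'red'}, {'key': 'k2', 'value': 'red'}],
    [{'key': 'k1', 'value': 'red'}, {'key': 'k1', 'value': 'red'}],
]})

# One valid result: each key appears at most once per row. With keep-any
# semantics, the second row could keep 'blue' instead of 'orange'.
valid_output = cudf.DataFrame({'a': [
    [{'key': 'k1', 'value': 'red'}, {'key': 'k2', 'value': 'green'}],
    [{'key': 'k1', 'value': 'orange'}],
    [{'key': 'k1', 'value': 'red'}, {'key': 'k2', 'value': 'red'}],
    [{'key': 'k1', 'value': 'red'}],
]})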

This transformation is roughly equivalent to a segmented stream compaction with a custom equality condition. The algorithm could segment by row, run distinct with "keep any" semantics on the "key" child column, and then gather both the "key" and "value" child columns into the result. I believe the order of keys within a row does not matter, but I could imagine that some applications would want to apply a segmented sort to the map column.
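
The segment, distinct, and gather steps described above can be sketched in plain Python over a flattened layout (row offsets plus "key" and "value" child columns). This is only a host-side illustration of the intended semantics, not libcudf code. The helper name `dedup_map_rows` is hypothetical, and keeping the first occurrence of each key is an assumption made for the example; "keep any" would allow any occurrence to survive.

```python
def dedup_map_rows(offsets, keys, values):
    """Drop duplicate keys within each row of a flattened list<struct<key, value>>.

    offsets[i]:offsets[i + 1] delimits row i in the child columns.
    Hypothetical helper: keeps the first occurrence of each key per row.
    """
    gather_map = []   # child indices that survive deduplication
    new_offsets = [0]
    for row in range(len(offsets) - 1):
        seen = set()  # keys already seen in this row (segment)
        for idx in range(offsets[row], offsets[row + 1]):
            if keys[idx] not in seen:  # equality is tested on the key only
                seen.add(keys[idx])
                gather_map.append(idx)
        new_offsets.append(len(gather_map))
    # Gather both child columns with the same map so pairs stay aligned.
    new_keys = [keys[i] for i in gather_map]
    new_values = [values[i] for i in gather_map]
    return new_offsets, new_keys, new_values


# Flattened form of the four-row example above.
offsets = [0, 2, 4, 6, 8]
keys = ['k1', 'k2', 'k1', 'k1', 'k1', 'k2', 'k1', 'k1']
values = ['red', 'green', 'orange', 'blue', 'red', 'red', 'red', 'red']

new_offsets, new_keys, new_values = dedup_map_rows(offsets, keys, values)
print(new_offsets)  # [0, 2, 3, 5, 6]
```

The result has the same row count as the input; only the child columns and offsets shrink.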

Describe the solution you'd like
Add a libcudf API that takes a list<struct<string, TYPE>> column and performs map deduplication, returning a list<struct<string, TYPE>> column with duplicate keys removed in each row.

Describe alternatives you've considered
We can't use MERGE_SETS because it would only deduplicate entries where both the key and the value match (needs confirmation).

We can't use distinct because it's fine for the same key to occur in multiple rows.
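
To make the difference between these alternatives concrete, here is a small plain-Python sketch. It does not call libcudf; it only models the assumed behaviors: set-style deduplication comparing whole structs (the point flagged above as needing confirmation), a table-wide distinct that ignores row boundaries, and the desired per-row deduplication on keys.

```python
rows = [
    [{'key': 'k1', 'value': 'orange'}, {'key': 'k1', 'value': 'blue'}],
    [{'key': 'k1', 'value': 'red'}],
]

# Set-style dedup compares whole (key, value) structs: both entries of row 0 survive.
whole_struct = [{(e['key'], e['value']) for e in row} for row in rows]

# Table-wide distinct on keys ignores row boundaries: row 1 would lose 'k1'.
seen_globally = set()
table_distinct = []
for row in rows:
    kept = [e for e in row if e['key'] not in seen_globally]
    kept = kept[:1] if kept else []  # at most one 'k1' overall in this toy data
    seen_globally.update(e['key'] for e in row)
    table_distinct.append(kept)

# Desired map dedup: one entry per key within each row, independently per row.
per_row = []
for row in rows:
    by_key = {}
    for e in row:
        by_key.setdefault(e['key'], e)  # first occurrence wins (an assumption)
    per_row.append(list(by_key.values()))

print([len(s) for s in whole_struct])    # [2, 1]
print([len(r) for r in table_distinct])  # [1, 0]
print([len(r) for r in per_row])         # [1, 1]
```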

Additional context
'map concat' is a common aggregation kind in feature engineering. We can unblock this operation in the short term by running a MERGE_LISTS aggregation and then applying the map deduplication in post-processing. I expect that we could find a performance improvement later by adding a MAP_CONCAT aggregation kind.
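
The short-term workaround (merge the lists per group, then deduplicate keys) can be sketched in plain Python as follows. Both helpers are hypothetical stand-ins: `merge_lists` only emulates what a MERGE_LISTS groupby aggregation would produce, and `dedup_by_key` keeps the first occurrence of each key as an arbitrary choice. The grouping labels are made up for the example.

```python
from collections import defaultdict


def merge_lists(groups, maps):
    """Concatenate the map rows of each group (stand-in for MERGE_LISTS)."""
    merged = defaultdict(list)
    for group, entries in zip(groups, maps):
        merged[group].extend(entries)
    return dict(merged)


def dedup_by_key(entries):
    """Keep one entry per key; first occurrence is an arbitrary choice here."""
    seen = set()
    out = []
    for entry in entries:
        if entry['key'] not in seen:
            seen.add(entry['key'])
            out.append(entry)
    return out


groups = ['a', 'a', 'b']  # hypothetical grouping column
maps = [
    [{'key': 'k1', 'value': 'red'}],
    [{'key': 'k1', 'value': 'blue'}, {'key': 'k2', 'value': 'green'}],
    [{'key': 'k1', 'value': 'orange'}],
]

# Step 1: merge lists per group. Step 2: deduplicate keys in post-processing.
map_concat = {g: dedup_by_key(m) for g, m in merge_lists(groups, maps).items()}
print(map_concat)
```

A dedicated MAP_CONCAT aggregation could presumably avoid materializing the merged lists before deduplication, which is where the expected performance gain would come from.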
