[FEA] Add function for "deduplicate map" to libcudf #17236

GregoryKimball · 2024-11-01T21:36:01Z

Is your feature request related to a problem? Please describe.

In cuDF, map types are often modeled as list<struct<string, TYPE>>. cuDF supports aggregations with list<...> payloads, using the MERGE_LISTS and MERGE_SETS aggregation kinds. However, the common "map concat" use case requires deduplication over the keys in each row.

For example:

import cudf

input_data = cudf.DataFrame({'a':[
    [{'key':'k1','value':'red'}, {'key':'k2','value':'green'}],
    [{'key':'k1','value':'orange'}, {'key':'k1','value':'blue'}],
    [{'key':'k1','value':'red'}, {'key':'k2','value':'red'}],
    [{'key':'k1','value':'red'}, {'key':'k1','value':'red'}],
]})

# With keep-any semantics, either 'orange' or 'blue' may survive in the second row.
valid_output = cudf.DataFrame({'a':[
    [{'key':'k1','value':'red'}, {'key':'k2','value':'green'}],
    [{'key':'k1','value':'orange'}],
    [{'key':'k1','value':'red'}, {'key':'k2','value':'red'}],
    [{'key':'k1','value':'red'}],
]})

This transformation is roughly equivalent to a segmented stream compaction with a custom equality condition. The algorithm could segment by row, run a distinct with keep-any semantics on the "key" child column within each segment, and then gather both the "key" and "value" child columns into the result. I believe the order of keys within a row does not matter, but I could imagine that some applications would want to apply a segmented sort to the map column afterwards.
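As a reference for the intended semantics, here is a minimal pure-Python sketch of the per-row deduplication. It is not a libcudf implementation; the function name `dedup_map_rows` is hypothetical, and "keep first" is used as one valid instance of "keep any".

```python
def dedup_map_rows(rows):
    """Drop repeated keys within each row (segment), keeping the first occurrence.

    Hypothetical reference helper: each row is a list of {'key', 'value'} dicts.
    Keys are compared only within a row, never across rows.
    """
    result = []
    for row in rows:
        if row is None:          # a null row stays null
            result.append(None)
            continue
        seen = set()             # keys already emitted for this row
        kept = []
        for entry in row:
            if entry["key"] not in seen:
                seen.add(entry["key"])
                kept.append(entry)   # "gather" the key together with its value
        result.append(kept)
    return result


input_rows = [
    [{"key": "k1", "value": "red"}, {"key": "k2", "value": "green"}],
    [{"key": "k1", "value": "orange"}, {"key": "k1", "value": "blue"}],
    [{"key": "k1", "value": "red"}, {"key": "k2", "value": "red"}],
    [{"key": "k1", "value": "red"}, {"key": "k1", "value": "red"}],
]
deduped = dedup_map_rows(input_rows)
```

A GPU implementation would presumably work on the flattened child columns plus the list offsets rather than looping over rows, but the expected output per row is the same.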

Describe the solution you'd like
Add a libcudf API that receives a list<struct<string, TYPE>> column, and performs a map deduplication to return a column list<struct<string, TYPE>> with duplicate keys removed in each row.

Describe alternatives you've considered
We can't use MERGE_SETS because it would only dedup if both values and keys matched (needs confirmation).
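To illustrate the concern, here is a small sketch using plain Python sets as a stand-in for set-style merging. It only demonstrates set semantics on whole (key, value) pairs and does not confirm how MERGE_SETS behaves in libcudf.

```python
# One row of a map column, written as (key, value) pairs.
row = [("k1", "orange"), ("k1", "blue"), ("k1", "orange")]

# Set-style dedup compares the whole pair, so only the exact repeat is removed.
set_merged = sorted(set(row))

# The key 'k1' still appears twice because the two values differ.
remaining_keys = [key for key, _ in set_merged]
```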

We can't use a column-wide distinct on the keys because it's fine for the same key to occur in multiple rows; duplicates should only be removed within a row.

Additional context
"Map concat" is a common aggregation kind in feature engineering. We can unblock this operation in the short term by running a MERGE_LISTS aggregation and then applying the map deduplication in post-processing. I expect that we could find a performance improvement later by adding a MAP_CONCAT aggregation kind.
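The short-term two-step approach can be sketched in pure Python as follows. This is only an illustration of the data flow; `map_concat` is a hypothetical name, the first loop stands in for a MERGE_LISTS-style groupby aggregation, and the second loop stands in for the proposed deduplication (keeping the first value seen for each key).

```python
from collections import defaultdict


def map_concat(group_ids, rows):
    """Hypothetical sketch: concatenate map rows per group, then dedup keys."""
    # Step 1: MERGE_LISTS-like aggregation, concatenating all rows of a group.
    merged = defaultdict(list)
    for group, row in zip(group_ids, rows):
        merged[group].extend(row)

    # Step 2: per-group key deduplication, keeping the first value per key.
    result = {}
    for group, entries in merged.items():
        first_by_key = {}
        for entry in entries:
            first_by_key.setdefault(entry["key"], entry)
        result[group] = list(first_by_key.values())
    return result


group_ids = ["g1", "g1", "g2"]
rows = [
    [{"key": "k1", "value": "red"}, {"key": "k2", "value": "green"}],
    [{"key": "k1", "value": "orange"}, {"key": "k3", "value": "blue"}],
    [{"key": "k1", "value": "red"}, {"key": "k1", "value": "red"}],
]
concatenated = map_concat(group_ids, rows)
```

A dedicated MAP_CONCAT aggregation could, in principle, fuse the two steps and avoid materializing the intermediate merged lists.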


GregoryKimball added the "feature request" (New feature or request) and "libcudf" (Affects libcudf (C++/CUDA) code) labels Nov 1, 2024

GregoryKimball added this to the Aggregations continuous improvement milestone Nov 1, 2024
