**Is your feature request related to a problem? Please describe.**
In cuDF, map types are often modeled as `list<struct<string, TYPE>>`. cuDF supports aggregations with `list<...>` payloads, using the `MERGE_LISTS` and `MERGE_SETS` aggregation kinds. However, the common "map concat" use case requires deduplication over the keys in each row.
This transformation is roughly equivalent to a segmented stream compaction with a custom equality condition. The algorithm could segment by row, run `distinct` with a keep-any policy on the "key" child column, and then gather over both the "key" and "value" child columns into the result. I believe the order of keys does not matter, but I can imagine that some applications would want to apply a segmented sort to the map column.
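To make the intended semantics concrete, here is a minimal host-side sketch of the per-row key dedup, not libcudf code. It models the map column as a vector of rows of (key, value) pairs, fixes the value type to `int` for illustration, and keeps the first occurrence of each key (one valid "keep any" choice); the names `map_row`, `map_column`, and `dedup_map_keys` are hypothetical.

```cpp
#include <string>
#include <utility>
#include <vector>

// Host-side stand-ins for a list<struct<string, int>> column.
using map_row    = std::vector<std::pair<std::string, int>>;
using map_column = std::vector<map_row>;

// For each row independently, drop pairs whose key was already seen in that
// row. Keys are free to repeat across different rows.
map_column dedup_map_keys(map_column const& column)
{
  map_column result;
  result.reserve(column.size());
  for (auto const& row : column) {
    map_row out;
    for (auto const& kv : row) {
      bool seen = false;
      for (auto const& prev : out) {
        if (prev.first == kv.first) {
          seen = true;
          break;
        }
      }
      if (!seen) { out.push_back(kv); }
    }
    result.push_back(std::move(out));
  }
  return result;
}
```

A real implementation would run this as a segmented pass on device memory rather than a per-row loop, but the input/output contract is the same.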
**Describe the solution you'd like**
Add a libcudf API that receives a `list<struct<string, TYPE>>` column and performs a map deduplication, returning a `list<struct<string, TYPE>>` column with duplicate keys removed in each row.
**Describe alternatives you've considered**
We can't use `MERGE_SETS` because it would only deduplicate entries whose keys and values both match (needs confirmation).
We can't use `distinct` because it's fine for the same key to occur in multiple rows.
**Additional context**
"map concat" is a common aggregation kind in feature engineering. We can unblock this operation in the short term by running a `MERGE_LISTS` aggregation and then applying the map deduplication in post-processing. I expect that we could find a performance improvement later by adding a `MAP_CONCAT` aggregation kind.
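The interim plan above can be sketched on the host as two passes: concatenate the partial lists for a group (what `MERGE_LISTS` produces) and then run the key dedup as post-processing. All names here (`merge_lists`, `dedup_keys`, `map_concat`) are hypothetical, the value type is fixed to `int`, and keep-first is used as one deterministic "keep any" policy.

```cpp
#include <string>
#include <utility>
#include <vector>

using map_row = std::vector<std::pair<std::string, int>>;

// MERGE_LISTS analogue: concatenate the partial lists collected for one
// group into a single row, preserving arrival order.
map_row merge_lists(std::vector<map_row> const& partials)
{
  map_row merged;
  for (auto const& part : partials) {
    merged.insert(merged.end(), part.begin(), part.end());
  }
  return merged;
}

// Post-processing pass: keep the first value seen for each key.
map_row dedup_keys(map_row const& row)
{
  map_row out;
  for (auto const& kv : row) {
    bool seen = false;
    for (auto const& prev : out) {
      if (prev.first == kv.first) {
        seen = true;
        break;
      }
    }
    if (!seen) { out.push_back(kv); }
  }
  return out;
}

// A MAP_CONCAT aggregation kind could fuse these two steps per group,
// avoiding the intermediate concatenated list.
map_row map_concat(std::vector<map_row> const& partials)
{
  return dedup_keys(merge_lists(partials));
}
```

The fused `map_concat` shows why a dedicated aggregation kind could be faster: the dedup can discard duplicates as rows are merged instead of materializing the full concatenation first.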