[FEA] Support conditional joins in cudf-polars #16926
Labels
cudf.polars
Issues specific to cudf.polars
feature request
New feature or request
libcudf
Affects libcudf (C++/CUDA) code.
Is your feature request related to a problem? Please describe.
Recently, polars gained support for a restricted class of conditional joins. Specifically those where the conditions are combinations of equality conditions and inequality conditions.
As a consequence the query optimiser in polars now turns queries like
left.join(right, on="a").filter("b" < "c")
into a conditional join ifb
andc
are not from the same table (if they are from the same table then this is rewritten as (say)left.filter("b" < "c").join(right, on="a")
: this one we handle fine). There is also a newjoin_where
primitive that directly allows user to write out conditional joins.For us, this means that queries that might previously have run are currently not supported (because we don't have handlers for the new inequality join node (called
IEJoin
).Describe the solution you'd like
Libcudf has a general conditional join implementation that takes an AST expression that is applied to determine if a pair of rows should be kept in the join output. It should be possible to run these inequality joins using that implementation.
However, since we know that we only need to support equality and inequality (for which the key columns must admit a total order) it should be possible to do something smarter by sorting the inequality columns to match and then intersecting the ranges that the inequality conditions impose. For multiple such conditions the ranges describe masks that must be permuted through to the original order (since we must sort each column pair individually), but it (in my head) seems doable in a way that isn't naively$\mathcal{O}(n m)$ when the left table has $n$ rows and the right has $m$ .
Describe alternatives you've considered
Expose the new IR node in polars and rewrite it in translation into a combination of cross joins + filters, which potentially blows up the memory requirement a lot.
The text was updated successfully, but these errors were encountered: