-
Notifications
You must be signed in to change notification settings - Fork 902
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement inequality joins by translation to conditional joins #17000
Implement inequality joins by translation to conditional joins #17000
Conversation
Needs pola-rs/polars#19104 |
Which is now merged but not yet released. |
fdfe737
to
4350006
Compare
Polars 1.11 is out, with slight updates to the IR, so we can correctly raise for dynamic groupbys and see inequality joins. These changes adapt to that and do a first pass at supporting inequality joins (by translating to cross + filter). A followup (#17000) will use libcudf's conditional joins. Authors: - Lawrence Mitchell (https://github.com/wence-) Approvers: - Bradley Dice (https://github.com/bdice) - Mike Sarahan (https://github.com/msarahan) URL: #17154
952b4ef
to
7b6141d
Compare
7b6141d
to
5d1ecbb
Compare
5d1ecbb
to
8629c74
Compare
Not currently ready for primetime, due to pola-rs/polars#19597 |
8629c74
to
2971f90
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Signposts
class ColRef(Expr): | ||
__slots__ = ("index", "table_ref") | ||
_non_child = ("dtype", "index", "table_ref") | ||
index: int | ||
table_ref: plc.expressions.TableReference |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We will translate the predicate to a libcudf AST. We get the pieces referring to the left and right table separately, but want to build a single expression that represents the predicate. So we need a way of remembering which piece came from where.
if POLARS_VERSION_GT_112: | ||
# If we sliced away some data from the start, that | ||
# shifts the row index. | ||
# But prior to 1.13, polars had this wrong, so we match behaviour | ||
# https://github.com/pola-rs/polars/issues/19607 | ||
offset += ( | ||
self.skip_rows | ||
) # pragma: no cover; polars 1.13 not yet released |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ahead of the game here.
lg, rg = plc.join.conditional_inner_join( | ||
left.table, right.table, self.ast_predicate | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Polars only supports inner joins for its inequality joins right now, so I didn't refactor to share code.
insert_colrefs( | ||
left.value, | ||
table_ref=plc.expressions.TableReference.LEFT, | ||
name_to_index={ | ||
name: i for i, name in enumerate(inp_left.schema) | ||
}, | ||
), | ||
insert_colrefs( | ||
right.value, | ||
table_ref=plc.expressions.TableReference.RIGHT, | ||
name_to_index={ | ||
name: i for i, name in enumerate(inp_right.schema) | ||
}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is how we track which table the column refers to.
return ir.Slice(schema, offset, length, filtered) | ||
return filtered | ||
|
||
return ir.ConditionalJoin(schema, predicate, node.options, inp_left, inp_right) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TODO: but perhaps not in this PR, polars implements joins that contain combinations of equality and inequalities as equality join followed by filter. We could recognise the pattern Filter(Join(...))
and turn that into code that would emit a libcudf mixed join.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Having looked at how mixed joins work in libcudf, I don't think this would work (except for inner joins).
If I have join(...).filter(...)
then I get an output that is the equality join, followed by a post filter. In mixed_XXX_join
in libcudf, the gather maps (except inner joins) have out of bounds (i.e. nullifying) entries for the output rows where any of the the equality or filter conditions do not hold.
I think the optimisation that is valid is to do a join(...).filter(...)
by pushing the filter expression into the join and using that to pre-filter the join's gather maps before using them for the whole join. But I'm not necessarily sure that's going to be worth it.
marks=pytest.mark.xfail( | ||
POLARS_VERSION_LT_113, | ||
reason="https://github.com/pola-rs/polars/issues/19597", | ||
), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is fixed, but in an unreleased version.
2e963c9
to
dc4dc94
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Non-blocking question but LGTM
If skip_rows is non-zero, we must add that to the initial row index offset.
For use in conditional and mixed joins, we need a table reference in the ast conversion.
Now that we have conditional joins exposed in pylibcudf, we can use them to implement inequality joins.
Polars doesn't deliver a complete predicate IR due to pola-rs/polars#19597.
dc4dc94
to
8cfb3e2
Compare
/merge |
Description
Implement inequality joins by using the newly-exposed conditional join from pylibcudf.
Checklist