Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement inequality joins by translation to conditional joins #17000

Merged

Conversation

wence-
Copy link
Contributor

@wence- wence- commented Oct 4, 2024

Description

Implement inequality joins by using the newly-exposed conditional join from pylibcudf.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@wence-
Copy link
Contributor Author

wence- commented Oct 4, 2024

Needs pola-rs/polars#19104

@github-actions github-actions bot added Python Affects Python cuDF API. cudf.polars Issues specific to cudf.polars labels Oct 4, 2024
@wence-
Copy link
Contributor Author

wence- commented Oct 9, 2024

Needs pola-rs/polars#19104

Which is now merged but not yet released.

@wence- wence- force-pushed the wence/fea/16926-polars-iejoin branch from fdfe737 to 4350006 Compare October 16, 2024 16:01
@github-actions github-actions bot added the pylibcudf Issues specific to the pylibcudf package label Oct 16, 2024
rapids-bot bot pushed a commit that referenced this pull request Oct 24, 2024
Polars 1.11 is out, with slight updates to the IR, so we can correctly raise for dynamic groupbys and see inequality joins.

These changes adapt to that and do a first pass at supporting inequality joins (by translating to cross + filter). A followup (#17000) will use libcudf's conditional joins.

Authors:
  - Lawrence Mitchell (https://github.com/wence-)

Approvers:
  - Bradley Dice (https://github.com/bdice)
  - Mike Sarahan (https://github.com/msarahan)

URL: #17154
@wence- wence- force-pushed the wence/fea/16926-polars-iejoin branch 2 times, most recently from 952b4ef to 7b6141d Compare October 31, 2024 18:17
@github-actions github-actions bot removed the cudf.polars Issues specific to cudf.polars label Oct 31, 2024
@wence- wence- force-pushed the wence/fea/16926-polars-iejoin branch from 7b6141d to 5d1ecbb Compare October 31, 2024 18:44
@github-actions github-actions bot added the cudf.polars Issues specific to cudf.polars label Oct 31, 2024
@wence- wence- force-pushed the wence/fea/16926-polars-iejoin branch from 5d1ecbb to 8629c74 Compare November 1, 2024 18:31
@wence- wence- changed the title WIP: Implement inequality joins by translating to cross + filter Implement inequality joins by translation to conditional joins Nov 1, 2024
@wence-
Copy link
Contributor Author

wence- commented Nov 1, 2024

Not currently ready for primetime, due to pola-rs/polars#19597

@wence- wence- force-pushed the wence/fea/16926-polars-iejoin branch from 8629c74 to 2971f90 Compare November 4, 2024 18:19
@github-actions github-actions bot removed the pylibcudf Issues specific to the pylibcudf package label Nov 4, 2024
@wence- wence- marked this pull request as ready for review November 4, 2024 18:19
@wence- wence- requested a review from a team as a code owner November 4, 2024 18:19
@wence- wence- added improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Nov 4, 2024
Copy link
Contributor Author

@wence- wence- left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Signposts

Comment on lines +254 to +258
class ColRef(Expr):
__slots__ = ("index", "table_ref")
_non_child = ("dtype", "index", "table_ref")
index: int
table_ref: plc.expressions.TableReference
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We will translate the predicate to a libcudf AST. We get the pieces referring to the left and right table separately, but want to build a single expression that represents the predicate. So we need a way of remembering which piece came from where.

Comment on lines 472 to 479
if POLARS_VERSION_GT_112:
# If we sliced away some data from the start, that
# shifts the row index.
# But prior to 1.13, polars had this wrong, so we match behaviour
# https://github.com/pola-rs/polars/issues/19607
offset += (
self.skip_rows
) # pragma: no cover; polars 1.13 not yet released
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahead of the game here.

Comment on lines 844 to 846
lg, rg = plc.join.conditional_inner_join(
left.table, right.table, self.ast_predicate
)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Polars only supports inner joins for its inequality joins right now, so I didn't refactor to share code.

Comment on lines +218 to +230
insert_colrefs(
left.value,
table_ref=plc.expressions.TableReference.LEFT,
name_to_index={
name: i for i, name in enumerate(inp_left.schema)
},
),
insert_colrefs(
right.value,
table_ref=plc.expressions.TableReference.RIGHT,
name_to_index={
name: i for i, name in enumerate(inp_right.schema)
},
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is how we track which table the column refers to.

return ir.Slice(schema, offset, length, filtered)
return filtered

return ir.ConditionalJoin(schema, predicate, node.options, inp_left, inp_right)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TODO: but perhaps not in this PR, polars implements joins that contain combinations of equality and inequalities as equality join followed by filter. We could recognise the pattern Filter(Join(...)) and turn that into code that would emit a libcudf mixed join.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having looked at how mixed joins work in libcudf, I don't think this would work (except for inner joins).

If I have join(...).filter(...) then I get an output that is the equality join, followed by a post filter. In mixed_XXX_join in libcudf, the gather maps (except inner joins) have out of bounds (i.e. nullifying) entries for the output rows where any of the the equality or filter conditions do not hold.

I think the optimisation that is valid is to do a join(...).filter(...) by pushing the filter expression into the join and using that to pre-filter the join's gather maps before using them for the whole join. But I'm not necessarily sure that's going to be worth it.

Comment on lines +134 to +137
marks=pytest.mark.xfail(
POLARS_VERSION_LT_113,
reason="https://github.com/pola-rs/polars/issues/19597",
),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is fixed, but in an unreleased version.

@wence- wence- force-pushed the wence/fea/16926-polars-iejoin branch 2 times, most recently from 2e963c9 to dc4dc94 Compare November 5, 2024 17:21
Copy link
Contributor

@mroeschke mroeschke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Non-blocking question but LGTM

If skip_rows is non-zero, we must add that to the initial row index
offset.
For use in conditional and mixed joins, we need a table reference in
the ast conversion.
Now that we have conditional joins exposed in pylibcudf, we can use
them to implement inequality joins.
Polars doesn't deliver a complete predicate IR due to
pola-rs/polars#19597.
@wence- wence- force-pushed the wence/fea/16926-polars-iejoin branch from dc4dc94 to 8cfb3e2 Compare November 6, 2024 09:34
@wence-
Copy link
Contributor Author

wence- commented Nov 8, 2024

/merge

@rapids-bot rapids-bot bot merged commit 150d8d8 into rapidsai:branch-24.12 Nov 8, 2024
102 checks passed
@wence- wence- deleted the wence/fea/16926-polars-iejoin branch November 8, 2024 08:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cudf.polars Issues specific to cudf.polars improvement Improvement / enhancement to an existing function non-breaking Non-breaking change Python Affects Python cuDF API.
Projects
Status: Done
Status: Done
Development

Successfully merging this pull request may close these issues.

[FEA] Support conditional joins in cudf-polars
3 participants