Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(hogql): Allow lazy joins on lazy tables with requested fields #20731

Merged
merged 20 commits into from
Mar 8, 2024

Conversation

Gilbert09
Copy link
Member

@Gilbert09 Gilbert09 commented Mar 6, 2024

Problem

  • When adding a LazyJoin onto persons that uses any constraint field other than person.id, the lazy_tables resolver doesn't include the constraint fields into the persons LazyTable subquery, and thus causes the lazy join to fail field resolution

Changes

  • If the lazy join uses fields on the constraint other than whats already pre-defined in the lazy tables select subquery, then pass these into the subquery
  • Passes in an alias to the subquery to cater for selecting the likes of properties.field, and update the same alias in the join function constraint
    • I wanted this to happen from within lazy_tables, but I've pushed this functionality down to the individual join_function instead. This isn't ideal, as each new lazy join may need to implement the same logic, but the complexity of finding the right ast.Field's and their associated types from within lazy_tables to update felt hackier and more prone to error
  • Fixed a bunch of mypy errors within lazy_tables
  • Added UI to the data warehouse join modal to allow for hogql expressions

Edit:

  • Updated lazy tables to support TableAliasType nodes with a child of either LazyJoinType or LazyTableType - turns out lazy tables just wasn't running on anything that was aliased before, god knows how this managed to run without erroring (or maybe it just returned bad data..)
  • Upgraded the existing logic that tells lazy joins which fields to request - this now also rewrites the constraint field chains so that property types can be used within constraints, and then removed the override logic in the individual lazy join function that was added earlier
  • LazyJoin from_field field now supports full field chains (e.g. [properties.email]), and have added an optional to_field to help support the data warehouse joins - this is only used in the override logic and is only needed when a dot notated field is used on the right side of the join constraint

How did you test this code?

  • Added clickhouse snapshotting for lazy tables to hand verify the queries are valid

Copy link
Contributor

github-actions bot commented Mar 6, 2024

Size Change: 0 B

Total Size: 815 kB

ℹ️ View Unchanged
Filename Size
frontend/dist/toolbar.js 815 kB

compressed-size-action

@Gilbert09 Gilbert09 changed the title WIP feat(hogql): Allow lazy joins on lazy tables with requested fields Mar 6, 2024
@Gilbert09 Gilbert09 requested review from mariusandra, EDsCODE and a team March 6, 2024 18:43
@Gilbert09 Gilbert09 marked this pull request as ready for review March 6, 2024 18:43
Copy link
Collaborator

@mariusandra mariusandra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks good. I did add some 🤔 inline, and I'm also wondering if there's a reason source_table_key_hogql and joining_table_key_hogql can't just be rolled into source_table_key and source_table_key?

Comment on lines 120 to 133
@pytest.mark.usefixtures("unittest_snapshot")
def test_lazy_join_on_lazy_table_with_person_properties(self):
DataWarehouseJoin(
team=self.team,
source_table_name="persons",
source_table_key="$hogql",
source_table_key_hogql="properties.email",
joining_table_name="events",
joining_table_key="event",
field_name="events",
).save()

printed = self._print_select("select events.event from persons")
assert printed == self.snapshot
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This example seems more artificial than the others. 🤔

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, might be good to test these tests with both materialized and non-materialized properties.

and node.type is not None
and node.type.tables.get(join_to_add_table_name) is not None
):
node.type.tables.pop(join_to_add_table_name)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could this cause problems in some case? Is there any time when a legit join with the same name could have been done to a table?

E.g.

   select event, person.properties.email as real_email
     from events 
left join person as person 
       on (person.properties.email = events.properties.tr00_mail)
left join person as person2
       on (person2.properties.email = events.properties.fake_mail)

I haven't followed the code to verify if this would be an issue or not, but would this run?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe in this example, this would be done on the alias and not the table itself, (e.g. person and person2). But I'll test this to verify

@Gilbert09
Copy link
Member Author

@mariusandra Gonna re-request your review, managed to redo this without having to resolve types twice etc, feels nicer now and we even support table aliases 😱

@Gilbert09 Gilbert09 requested a review from mariusandra March 7, 2024 20:09
@posthog-bot
Copy link
Contributor

📸 UI snapshots have been updated

3 snapshot changes in total. 0 added, 3 modified, 0 deleted:

  • chromium: 0 added, 3 modified, 0 deleted (diff for shard 2)
  • webkit: 0 added, 0 modified, 0 deleted

Triggered by this commit.

👉 Review this PR's diff of snapshots.

@posthog-bot
Copy link
Contributor

📸 UI snapshots have been updated

2 snapshot changes in total. 0 added, 2 modified, 0 deleted:

  • chromium: 0 added, 2 modified, 0 deleted (diff for shard 2)
  • webkit: 0 added, 0 modified, 0 deleted

Triggered by this commit.

👉 Review this PR's diff of snapshots.

@posthog-bot
Copy link
Contributor

📸 UI snapshots have been updated

1 snapshot changes in total. 0 added, 1 modified, 0 deleted:

  • chromium: 0 added, 1 modified, 0 deleted (diff for shard 2)
  • webkit: 0 added, 0 modified, 0 deleted

Triggered by this commit.

👉 Review this PR's diff of snapshots.

Copy link
Member

@EDsCODE EDsCODE left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some comments but overall makes sense. Good catch on the aliasing

frontend/src/scenes/data-warehouse/viewLinkLogic.tsx Outdated Show resolved Hide resolved
posthog/hogql/transforms/lazy_tables.py Show resolved Hide resolved

def visit_field(self, node: ast.Field):
for constraint in self.overrides:
if node.chain == constraint.chain_to_replace:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this is traversing the entire query again, is it possible that the chain_to_replace matches something unintentionally? It's a super nit edge case but if an outer query reuses an alias/table name that exists on an inner query, it could get overwritten by this override?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question, I believe our hidden alias system takes care of this for us. For example, doing this query:

select balance, email from (
   select person.customer.balance, person.properties.email from events
   limit 10
)

gives us this generated clickhouse:
image

and so I think we're good here

@Gilbert09 Gilbert09 force-pushed the tom/linking-tables-lazy-joins branch from 1b99050 to bf80a20 Compare March 8, 2024 07:28
@posthog-bot
Copy link
Contributor

📸 UI snapshots have been updated

2 snapshot changes in total. 0 added, 2 modified, 0 deleted:

  • chromium: 0 added, 2 modified, 0 deleted (diff for shard 2)
  • webkit: 0 added, 0 modified, 0 deleted

Triggered by this commit.

👉 Review this PR's diff of snapshots.

@posthog-bot
Copy link
Contributor

📸 UI snapshots have been updated

1 snapshot changes in total. 0 added, 1 modified, 0 deleted:

  • chromium: 0 added, 0 modified, 0 deleted
  • webkit: 0 added, 1 modified, 0 deleted (diff for shard 2)

Triggered by this commit.

👉 Review this PR's diff of snapshots.

@posthog-bot
Copy link
Contributor

📸 UI snapshots have been updated

1 snapshot changes in total. 0 added, 1 modified, 0 deleted:

  • chromium: 0 added, 0 modified, 0 deleted
  • webkit: 0 added, 1 modified, 0 deleted (diff for shard 2)

Triggered by this commit.

👉 Review this PR's diff of snapshots.

Copy link
Collaborator

@mariusandra mariusandra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It got hard to follow in the lazy_table.py part, but I'm going to assume all is good and/or you'll fix it if not 😅.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change is concerning

Comment on lines +397 to +405

# When joining a lazy table to another lazy table, the joined table doesn't get resolved
# Doing another pass solves this for us
if self.lazy_finder_counter < 20:
lazy_finder = LazyFinder()
lazy_finder.visit(node)
if lazy_finder.found_lazy:
self.lazy_finder_counter = self.lazy_finder_counter + 1
self.visit_select_query(node)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This feels like code nobody will dare touch for years 😅. I wonder if there's a smarter way to just flag tables that should be revisited, instead of doing it for everything?

@Gilbert09 Gilbert09 merged commit 4a60cad into master Mar 8, 2024
80 checks passed
@Gilbert09 Gilbert09 deleted the tom/linking-tables-lazy-joins branch March 8, 2024 10:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants