Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(database): faster queries from the persons table #17811

Merged
merged 17 commits into from
Oct 13, 2023

Conversation

mariusandra
Copy link
Collaborator

@mariusandra mariusandra commented Oct 5, 2023

Problem

The existing select * from persons transformation into select * from (select ... from raw_persons) is inefficient and times out for large customers, even on the new cluster.

Changes

Swaps it out for a faster version:

SELECT * FROM raw_persons WHERE (id, version) IN (
   SELECT id, max(version) as version
   FROM raw_persons
   GROUP BY id
   HAVING ifNull(equals(argMax(raw_persons.is_deleted, raw_persons.version), 0), 0)
)

How did you test this code?

Tested the query in the SQL editor. Hoping CI will update the snapshots and report of failures.

"""
SELECT id FROM raw_persons WHERE (id, version) IN (
SELECT id, max(version) as version
FROM raw_persons
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be filtered by team id

Copy link
Collaborator Author

@mariusandra mariusandra Oct 5, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's automatically, as this is a hogql query, not a clickhouse query


query = parse_select(
"""
SELECT id FROM raw_persons WHERE (id, version) IN (
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens with this id? Does it get replaced with the actual fields we need (properties, created_at) or is this going into another subquery?

Copy link
Collaborator Author

@mariusandra mariusandra Oct 5, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

look a few lines lower:

    query.select = []
    for field_name, field_chain in requested_fields.items():
        query.select.append(
            ast.Alias(
                alias=field_name,
                expr=ast.Field(chain=["raw_persons"] + field_chain),
            )
        )
    return query

@mariusandra
Copy link
Collaborator Author

Ran some benchmarks... and I'm not sure anymore about anything.

Running the old select * from persons query against the new one for our team... while also selecting properties (blue is old, yellow is new):
image

However if I exclude properties from the queries, the picture is drastically different (blue is old, yellow is new)

image

Even in metabase, the old query returns almost instantly, whereas the new one takes many seconds.

🤔

@mariusandra
Copy link
Collaborator Author

I reworked this optimization to be a Query Modifier that can be toggled on and off as needed.

2023-10-13 13 42 27

It currently defaults to off. With this in, we can now run test the query for a whole host of customers easier.

@mariusandra mariusandra merged commit 7516088 into master Oct 13, 2023
72 checks passed
@mariusandra mariusandra deleted the persons-argmax-v2 branch October 13, 2023 13:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants