Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[batch] make batches query go brrrrrrr #14629

Merged
merged 6 commits into from
Jul 31, 2024

Conversation

ehigham
Copy link
Member

@ehigham ehigham commented Jul 19, 2024

This PR optimises the V2 batches query along with its associated subqueries.
Listing batches for the CI user now executes in 1s were previously it had taken ~14s.
Resolves #14599

@ehigham
Copy link
Member Author

ehigham commented Jul 24, 2024

Verified that the queries return the same results for ci by running both old and new queries and diffing the results.

@ehigham
Copy link
Member Author

ehigham commented Jul 25, 2024

I believe 14366 played a part in motivating the lateral joins. For that query, the execution pans are identical. Here, I'm seeing better performance with the more natural query. I didn't go further for compatibility with user-defined filter operations.

Copy link
Collaborator

@chrisvittal chrisvittal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay. Wrapped my head around this query and its parts. I think I'm convinced that this doesn't alter the results of this query.

batch/batch/front_end/query/query.py Outdated Show resolved Hide resolved
@@ -478,10 +482,14 @@ def __init__(self, term: str):
self.term = term

def query(self) -> Tuple[str, List[str]]:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit-picking perhaps, but is this "query" naming semantic correct when this is really producing a WHERE clause? It's also a little confusing downstream in the query_v2 function where all of these additions are called "queries" when they're really just lists of clauses being generated (or not) for a single larger query.

Copy link
Member Author

@ehigham ehigham Jul 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think some SQL implementations might call this an Expression - I didn't find a definite grammar from a quick google. I'm not going to make these kinds of changes in this PR. If you'd like to make a follow-up change to improve this that might help clean up confusion.

EXISTS (
SELECT NULL
FROM job_group_attributes AS attrs
WHERE attrs.batch_id = batches.id
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see this new query explicitly mentions batches.id which wasn't part of the previous query. Are we introducing a new field here? Or was this just implicitly included before? Is it possible to document somehow the tables we expect to run these queries against, and what assumptions they're making about what other tables have already have been joined in (and on what keys)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good point - I should just use the job_groups table.

Comment on lines +138 to +139
FROM batches
INNER JOIN job_groups
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I might benefit from a in-person explanation for why these are the same. It seems possible that

job_groups LEFT JOIN batches would give a different result to batches INNER JOIN job_groups

Like... wouldn't a job group that isn't represented in batches would appear in the left query but not the right query?

But maybe there's some data-model reason why that's not the case? (eg that it's impossible to be a job group without a batch?)

This question also applies for all the other joins down the chain

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The aim of the query is to return a list of batches annotated with some high level metadata like cost, job statuses and whether or not the batch has been cancelled.

To convince yourself that they're the same, you also need to look at the data in the where-clause (defined above this f-string). Of note, we're filtering for job groups with job_group_id = 0.

All batches have at least one job group with job_group_id = 0 (ie the root job group). Performing and inner join or left join will yield the same answer. Operationally I'm doing batches inner join (job_groups filtered by job_group_id = 0).

A job group cannot exist outside of a batch.

Copy link
Member Author

@ehigham ehigham Jul 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A batch is cancelled when a record exists in the job_groups_cancelled table with the same id and job_group_id = 0, ie:

SELECT B.id, C.id IS NOT NULL AS cancelled
FROM batches AS B
LEFT JOIN job_groups_cancelled AS C USING (id)
WHERE C.job_group_id = 0;

Copy link
Member Author

@ehigham ehigham Jul 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For efficency, we do not add a record into job_groups_cancelled for every cancelled job group.
Instead, a job group is cancelled if a record exists in job_groups_cancelled for an ancestral job group.

The table job_group_self_and_ancestors tells us a job group's ancestor_id and level for each job group.
For the root job group, this ancestor_id is the job group's job_group_id (as the table name implies).

SELECT G.batch_id
     , G.job_group_id
     , C.job_group_id IS NOT NULL AS cancelled
FROM job_groups AS G
INNER JOIN job_groups_self_and_ancestors AS A
   ON G.batch_id     = A.batch_id
  AND G.job_group_id = A.job_group_id
LEFT JOIN job_groups_cancelled AS C
   ON G.batch_id    = C.id
  AND A.ancestor_id = C.job_group_id;

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I understand it, job_groups_n_jobs_in_complete_states contains aggregated information for the job group hierarchy so we only need the record for job_group_id 0.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The lateral join on the table derived from aggregated_job_group_costs is to avoid computing the costs of the batch for all batches and then joining.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This all is beginning to make sense. Since I tripped over this many times understanding the rest of the query, how would you feel about doing something like INNER JOIN job_groups AS root_job_group here, to make it a little easier for humans to parse the tables' name/intention/usage in the rest of the query?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That might indeed make the query more human readable, but names like job_groups are exposed to user defined queries that you see in other files. A follow-up change would improve that but harder to test.

@ehigham ehigham requested a review from cjllanwarne July 30, 2024 16:16
Copy link
Collaborator

@cjllanwarne cjllanwarne left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I've convinced myself I understand this now. I've requested changes because I have a minor suggestion, but feel free to push back on that, in which case I'll convert into an approval

Comment on lines +489 to +490
WHERE attrs.batch_id = job_groups.batch_id
AND attrs.job_group_id = job_groups.job_group_id
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does attrs.job_group_id = job_groups.job_group_id not imply attrs.batch_id = job_groups.batch_id?

Copy link
Member Author

@ehigham ehigham Jul 31, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, job_group_id is unique to the batch and is not globally unique.

Comment on lines +143 to +144
INNER JOIN billing_project_users
ON batches.billing_project = billing_project_users.billing_project
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also not in this PR, but thinking out loud - this took me a while to parse since it's not used elsewhere in the query and not in the return value. Eventually I noticed it's part of a where clause that's generated above and joined in below.

So thinking out loud, that feels a little icky because we have to remember to add this here and we have to get the table names right for a where clause we happen to be adding below. I wonder whether we could change the Query interface from its current two parts (where clause, parameters) to three parts (join clause, where clause, parameters) to make that a little harder to mix up.

Copy link
Member Author

@ehigham ehigham Jul 31, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Completely agree - there's a lot to remember here. This part of the query (plus the billing_project_users.user = ? in the WHERE clause) is what controls what users can see.
Your suggestion would be really useful for #14410 FYI

ON batches.id = job_groups_n_jobs_in_complete_states.id
AND job_groups.job_group_id = job_groups_n_jobs_in_complete_states.job_group_id
LEFT JOIN (SELECT *, 1 AS cancelled FROM job_groups_cancelled) AS cancelled_t
ON batches.id = cancelled_t.id
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not in this PR either, but it was confusing to me that the ID field in a table called job_groups_cancelled represents the batch ID, not the job group ID

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this was one field that was missed when job groups were added. Created #14646

Comment on lines +138 to +139
FROM batches
INNER JOIN job_groups
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This all is beginning to make sense. Since I tripped over this many times understanding the rest of the query, how would you feel about doing something like INNER JOIN job_groups AS root_job_group here, to make it a little easier for humans to parse the tables' name/intention/usage in the rest of the query?

@ehigham ehigham requested a review from cjllanwarne July 31, 2024 15:41
@ehigham ehigham added WIP and removed WIP labels Jul 31, 2024
Copy link
Collaborator

@cjllanwarne cjllanwarne left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree this change is equivalent to what was there before and should be much more performant 👍

@hail-ci-robot hail-ci-robot merged commit 21ad23a into hail-is:main Jul 31, 2024
3 checks passed
@ehigham ehigham deleted the ehigham/batches-go-brrrrr branch July 31, 2024 20:08
ehigham added a commit to ehigham/hail that referenced this pull request Aug 1, 2024
hail-ci-robot pushed a commit that referenced this pull request Aug 1, 2024
hail-ci-robot pushed a commit that referenced this pull request Aug 8, 2024
#14629 improved the speed of listing a user's batches but introduced a
large
regression for listing all batches readble by that user. This change
fixes that
regression by making use index hints and `STRAIGHT_JOIN`s.

The index hint tells MySQL to never consider the index `batches_deleted`
as it
has very low cardinality. In some forms of this query, the planner tries
to use
it to its peril.

A problem in query 0 with #14629 (see below) was that fewer filters on
batches
made the optimiser consider joins in a suboptimal order - it did a table
scan
on `job_groups` first then sorted the results by to `batches.id DESC`
instead
of doing an index scan on `batches` in reverse.

Using `STRAIGHT_JOIN`s instead of `INNER JOIN` mades the optimiser start
from
`batches` and read its index in reverse before considering other tables
in
subsequent joins. From the
[documentation](https://dev.mysql.com/doc/refman/8.4/en/join.html):

> STRAIGHT_JOIN is similar to JOIN, except that the left table is always
read
before the right table. This can be used for those (few) cases for which
the
  join optimizer processes the tables in a suboptimal order.

This is advantageous for a couple of reasons:
- We want to list newer batches first
- For this query, the `batches` table has more applicables indexes
- We want the variable to order by to be in the primary key of the first
  table so we can read the index in reverse

Before and after timings, collected by running the query 5 times, then
using
profiles gathered by MySQL.
```
+-------+---------------------------------------------------*
| query |  description                                      |                                                                                                                                                                                                                                                         
+-------+---------------------------------------------------+
|     0 | All batches accessible to user `ci`               |
|     1 | All batches accessible to user `ci` owned by `ci` |
+-------+---------------------------------------------------*

+-------+--------+--------------------------------------------------------+------------+------------+
| query | branch | timings                                                |    mean    |    stdev   |                                                                                                                                                                                                                                             
+-------+--------+--------------------------------------------------------+------------+------------+
|     0 |  main  | 0.05894400,0.05207850,0.07067875,0.06281800,0.060250   | 0.06095385 | 0.00602255 |
|     1 |  main  | 14.1106150,12.2619323,13.8442850,12.0749633,14.0297822 | 13.2643156 | 0.90087263 |
+-------+--------+--------------------------------------------------------+------------+------------+
|     0 |   PR   | 0.04717375,0.04974350,0.04312150,0.04070850,0.04193650 | 0.04453675 | 0.00339069 |
|     1 |   PR   | 0.04423925,0.03967550,0.03935425,0.04056875,0.05286850 | 0.04334125 | 0.00507140 |
+-------+--------+--------------------------------------------------------+------------+------------+
```

I'm hopeful that this won't introduce regressions for most use cases.
While I
haven't benchmarked other queries, the MySQL client does feel more
responsive
for a wider array of users. One notable exception is for the user
`dking` who
owns 3.7x more batches than has access to, of which all have been
deleted. I
don't think this is a common enough use case to make this query even
more
complicated than it already is.

Resolves #14599
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[batch] Searching in the batches UI is slow for users with many batches
4 participants