Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[batch] make batches query go brrrrrrr #14629

Merged
merged 6 commits into from
Jul 31, 2024
Merged
Show file tree
Hide file tree
Changes from 4 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
37 changes: 25 additions & 12 deletions batch/batch/front_end/query/query.py
Original file line number Diff line number Diff line change
Expand Up @@ -455,10 +455,14 @@ def __init__(self, term: str):
self.term = term

def query(self) -> Tuple[str, List[str]]:
sql = """
((job_groups.batch_id, job_groups.job_group_id) IN
(SELECT batch_id, job_group_id FROM job_group_attributes
WHERE `key` = %s OR `value` = %s))
sql = """\
EXISTS (
SELECT NULL
FROM job_group_attributes AS attrs
WHERE attrs.batch_id = batches.id
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see this new query explicitly mentions batches.id which wasn't part of the previous query. Are we introducing a new field here? Or was this just implicitly included before? Is it possible to document somehow the tables we expect to run these queries against, and what assumptions they're making about what other tables have already have been joined in (and on what keys)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good point - I should just use the job_groups table.

AND attrs.job_group_id = job_groups.job_group_id
AND (attrs.`key` = %s or attrs.`value` = %s)
ehigham marked this conversation as resolved.
Show resolved Hide resolved
)
"""
return (sql, [self.term, self.term])

Expand All @@ -478,10 +482,14 @@ def __init__(self, term: str):
self.term = term

def query(self) -> Tuple[str, List[str]]:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit-picking perhaps, but is this "query" naming semantic correct when this is really producing a WHERE clause? It's also a little confusing downstream in the query_v2 function where all of these additions are called "queries" when they're really just lists of clauses being generated (or not) for a single larger query.

Copy link
Member Author

@ehigham ehigham Jul 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think some SQL implementations might call this an Expression - I didn't find a definite grammar from a quick google. I'm not going to make these kinds of changes in this PR. If you'd like to make a follow-up change to improve this that might help clean up confusion.

sql = """
((job_groups.batch_id, job_groups.job_group_id) IN
(SELECT batch_id, job_group_id FROM job_group_attributes
WHERE `key` LIKE %s OR `value` LIKE %s))
sql = """\
EXISTS (
SELECT NULL
FROM job_group_attributes AS attrs
WHERE attrs.batch_id = batches.id
AND attrs.job_group_id = job_groups.job_group_id
AND (attrs.`key` LIKE %s OR attrs.`value` LIKE %s)
)
"""
escaped_term = f'%{self.term}%'
return (sql, [escaped_term, escaped_term])
Expand All @@ -505,10 +513,15 @@ def query(self) -> Tuple[str, List[str]]:
value = self.value
if isinstance(self.operator, PartialMatchOperator):
value = f'%{value}%'
sql = f"""
((job_groups.batch_id, job_groups.job_group_id) IN
(SELECT batch_id, job_group_id FROM job_group_attributes
WHERE `key` = %s AND `value` {op} %s))
sql = f"""\
EXISTS (
SELECT NULL
FROM job_group_attributes AS attrs
WHERE attrs.batch_id = batches.id
AND attrs.job_group_id = job_groups.job_group_id
AND attrs.`key` = %s
AND attrs.`value` {op} %s
)
"""
return (sql, [self.key, value])

Expand Down
58 changes: 32 additions & 26 deletions batch/batch/front_end/query/query_v2.py
Original file line number Diff line number Diff line change
Expand Up @@ -125,40 +125,46 @@ def parse_list_batches_query_v2(user: str, q: str, last_batch_id: Optional[int])
where_conditions.append(f'({cond})')
where_args += args

sql = f"""
SELECT batches.*,
cancelled_t.cancelled IS NOT NULL AS cancelled,
sql = f"""\
SELECT
batches.*,
cancelled_t.cancelled <=> 1 AS cancelled,
job_groups_n_jobs_in_complete_states.n_completed,
job_groups_n_jobs_in_complete_states.n_succeeded,
job_groups_n_jobs_in_complete_states.n_failed,
job_groups_n_jobs_in_complete_states.n_cancelled,
cost_t.cost, cost_t.cost_breakdown
FROM job_groups
LEFT JOIN batches ON batches.id = job_groups.batch_id
LEFT JOIN billing_projects ON batches.billing_project = billing_projects.name
LEFT JOIN job_groups_n_jobs_in_complete_states ON job_groups.batch_id = job_groups_n_jobs_in_complete_states.id AND job_groups.job_group_id = job_groups_n_jobs_in_complete_states.job_group_id
LEFT JOIN LATERAL (
SELECT 1 AS cancelled
FROM job_group_self_and_ancestors
INNER JOIN job_groups_cancelled
ON job_group_self_and_ancestors.batch_id = job_groups_cancelled.id AND
job_group_self_and_ancestors.ancestor_id = job_groups_cancelled.job_group_id
WHERE job_groups.batch_id = job_group_self_and_ancestors.batch_id AND
job_groups.job_group_id = job_group_self_and_ancestors.job_group_id
) AS cancelled_t ON TRUE
STRAIGHT_JOIN billing_project_users ON batches.billing_project = billing_project_users.billing_project
LEFT JOIN LATERAL (
SELECT COALESCE(SUM(`usage` * rate), 0) AS cost, JSON_OBJECTAGG(resources.resource, COALESCE(`usage` * rate, 0)) AS cost_breakdown
FROM (
SELECT resource_id, CAST(COALESCE(SUM(`usage`), 0) AS SIGNED) AS `usage`
cost_t.cost,
cost_t.cost_breakdown
FROM batches
INNER JOIN job_groups
Comment on lines +138 to +139
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I might benefit from a in-person explanation for why these are the same. It seems possible that

job_groups LEFT JOIN batches would give a different result to batches INNER JOIN job_groups

Like... wouldn't a job group that isn't represented in batches would appear in the left query but not the right query?

But maybe there's some data-model reason why that's not the case? (eg that it's impossible to be a job group without a batch?)

This question also applies for all the other joins down the chain

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The aim of the query is to return a list of batches annotated with some high level metadata like cost, job statuses and whether or not the batch has been cancelled.

To convince yourself that they're the same, you also need to look at the data in the where-clause (defined above this f-string). Of note, we're filtering for job groups with job_group_id = 0.

All batches have at least one job group with job_group_id = 0 (ie the root job group). Performing and inner join or left join will yield the same answer. Operationally I'm doing batches inner join (job_groups filtered by job_group_id = 0).

A job group cannot exist outside of a batch.

Copy link
Member Author

@ehigham ehigham Jul 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A batch is cancelled when a record exists in the job_groups_cancelled table with the same id and job_group_id = 0, ie:

SELECT B.id, C.id IS NOT NULL AS cancelled
FROM batches AS B
LEFT JOIN job_groups_cancelled AS C USING (id)
WHERE C.job_group_id = 0;

Copy link
Member Author

@ehigham ehigham Jul 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For efficency, we do not add a record into job_groups_cancelled for every cancelled job group.
Instead, a job group is cancelled if a record exists in job_groups_cancelled for an ancestral job group.

The table job_group_self_and_ancestors tells us a job group's ancestor_id and level for each job group.
For the root job group, this ancestor_id is the job group's job_group_id (as the table name implies).

SELECT G.batch_id
     , G.job_group_id
     , C.job_group_id IS NOT NULL AS cancelled
FROM job_groups AS G
INNER JOIN job_groups_self_and_ancestors AS A
   ON G.batch_id     = A.batch_id
  AND G.job_group_id = A.job_group_id
LEFT JOIN job_groups_cancelled AS C
   ON G.batch_id    = C.id
  AND A.ancestor_id = C.job_group_id;

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I understand it, job_groups_n_jobs_in_complete_states contains aggregated information for the job group hierarchy so we only need the record for job_group_id 0.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The lateral join on the table derived from aggregated_job_group_costs is to avoid computing the costs of the batch for all batches and then joining.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This all is beginning to make sense. Since I tripped over this many times understanding the rest of the query, how would you feel about doing something like INNER JOIN job_groups AS root_job_group here, to make it a little easier for humans to parse the tables' name/intention/usage in the rest of the query?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That might indeed make the query more human readable, but names like job_groups are exposed to user defined queries that you see in other files. A follow-up change would improve that but harder to test.

ON batches.id = job_groups.batch_id
INNER JOIN billing_projects
ON batches.billing_project = billing_projects.name
INNER JOIN billing_project_users
ON batches.billing_project = billing_project_users.billing_project
Comment on lines +143 to +144
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also not in this PR, but thinking out loud - this took me a while to parse since it's not used elsewhere in the query and not in the return value. Eventually I noticed it's part of a where clause that's generated above and joined in below.

So thinking out loud, that feels a little icky because we have to remember to add this here and we have to get the table names right for a where clause we happen to be adding below. I wonder whether we could change the Query interface from its current two parts (where clause, parameters) to three parts (join clause, where clause, parameters) to make that a little harder to mix up.

Copy link
Member Author

@ehigham ehigham Jul 31, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Completely agree - there's a lot to remember here. This part of the query (plus the billing_project_users.user = ? in the WHERE clause) is what controls what users can see.
Your suggestion would be really useful for #14410 FYI

INNER JOIN job_groups_n_jobs_in_complete_states
ON batches.id = job_groups_n_jobs_in_complete_states.id
AND job_groups.job_group_id = job_groups_n_jobs_in_complete_states.job_group_id
LEFT JOIN (SELECT *, 1 AS cancelled FROM job_groups_cancelled) AS cancelled_t
ON batches.id = cancelled_t.id
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not in this PR either, but it was confusing to me that the ID field in a table called job_groups_cancelled represents the batch ID, not the job group ID

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this was one field that was missed when job groups were added. Created #14646

AND job_groups.job_group_id = cancelled_t.job_group_id
INNER JOIN LATERAL (
WITH resource_costs AS (
SELECT
resource_id,
CAST(COALESCE(SUM(`usage`), 0) AS SIGNED) AS `usage`
FROM aggregated_job_group_resources_v3
WHERE job_groups.batch_id = aggregated_job_group_resources_v3.batch_id AND job_groups.job_group_id = aggregated_job_group_resources_v3.job_group_id
WHERE batch_id = batches.id
GROUP BY resource_id
) AS usage_t
LEFT JOIN resources ON usage_t.resource_id = resources.resource_id
)
SELECT
COALESCE(SUM(`usage` * rate), 0) AS cost,
JSON_OBJECTAGG(resource, COALESCE(`usage` * rate, 0)) AS cost_breakdown
FROM resource_costs
INNER JOIN resources USING (resource_id)
) AS cost_t ON TRUE
WHERE {' AND '.join(where_conditions)}
ORDER BY job_groups.batch_id DESC
ORDER BY batches.id DESC
LIMIT 51;
"""

Expand Down