
JobCountByState query performance #106

Open
ligustah opened this issue Aug 4, 2024 · 5 comments

Comments

@ligustah

ligustah commented Aug 4, 2024

Hey,

I'm really enjoying River so far, so thank you for that!

I was playing around with River UI, but on a table with just over 5 million jobs the queries become so slow that the API requests time out and I just get a white screen (specifically the JobCountByState query). I haven't looked into it too much, so I'm not sure if it's just missing an index somewhere.

This is happening both on my local M2 MacBook Pro and on a small AlloyDB instance.

EDIT:

For some numbers, that query took 9s to complete on AlloyDB (which admittedly is under heavy use at the moment). The same query took just over 3s on my otherwise idle MacBook.

@bgentry
Contributor

bgentry commented Aug 4, 2024

Thanks for the report! Any chance you can run that SQL query with EXPLAIN ANALYZE and share the results? Also, what’s your Postgres version?
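
For anyone else reproducing this, a rough way to capture such a plan (sketch only; the exact statement River UI issues for JobCountByState may include additional filters, as can be seen in the plans posted below):

```sql
-- Count-by-state aggregate as a stand-in for the JobCountByState query.
-- BUFFERS shows whether the scan was served from cache or from disk.
EXPLAIN (ANALYZE, BUFFERS)
SELECT state, count(*)
FROM river_job
GROUP BY state;

SELECT version();  -- reports the Postgres version
```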

@ligustah
Author

ligustah commented Aug 5, 2024

Both databases have about the same number of jobs in total. AlloyDB is under load; my local DB is mostly idle. Running the UI against AlloyDB fails at least one of these two requests quite consistently (after 5s):
[screenshot of the two failing requests]

AlloyDB (PostgreSQL 15.5 on x86_64-pc-linux-gnu, compiled by Debian clang version 12.0.1, 64-bit):

Finalize GroupAggregate  (cost=204130.73..204131.24 rows=2 width=12) (actual time=4117.885..4125.452 rows=3 loops=1)
  Group Key: state
  ->  Gather Merge  (cost=204130.73..204131.20 rows=4 width=12) (actual time=4117.876..4125.442 rows=9 loops=1)
        Workers Planned: 2
        Workers Launched: 2
        ->  Sort  (cost=203130.71..203130.72 rows=2 width=12) (actual time=4077.823..4077.825 rows=3 loops=3)
              Sort Key: state
              Sort Method: quicksort  Memory: 25kB
              Worker 0:  Sort Method: quicksort  Memory: 25kB
              Worker 1:  Sort Method: quicksort  Memory: 25kB
              ->  Partial HashAggregate  (cost=203130.68..203130.70 rows=2 width=12) (actual time=4077.791..4077.792 rows=3 loops=3)
                    Group Key: state
                    Batches: 1  Memory Usage: 24kB
                    Worker 0:  Batches: 1  Memory Usage: 24kB
                    Worker 1:  Batches: 1  Memory Usage: 24kB
                    ->  Parallel Seq Scan on river_job  (cost=0.00..191909.63 rows=2244210 width=4) (actual time=0.007..2475.571 rows=1800885 loops=3)
                          Filter: ((queue IS NOT NULL) AND (scheduled_at IS NOT NULL) AND (id IS NOT NULL) AND (priority > 0))
Planning Time: 0.121 ms
Execution Time: 4125.499 ms
available,1119960
completed,4282637
running,57

Local (PostgreSQL 16.3 (Debian 16.3-1.pgdg120+1) on aarch64-unknown-linux-gnu, compiled by gcc (Debian 12.2.0-14) 12.2.0, 64-bit):

Finalize GroupAggregate  (cost=129340.86..129341.37 rows=2 width=12) (actual time=1378.681..1384.518 rows=3 loops=1)
  Group Key: state
  ->  Gather Merge  (cost=129340.86..129341.33 rows=4 width=12) (actual time=1378.662..1384.500 rows=7 loops=1)
        Workers Planned: 2
        Workers Launched: 2
        ->  Sort  (cost=128340.84..128340.84 rows=2 width=12) (actual time=1206.975..1206.976 rows=2 loops=3)
              Sort Key: state
              Sort Method: quicksort  Memory: 25kB
              Worker 0:  Sort Method: quicksort  Memory: 25kB
              Worker 1:  Sort Method: quicksort  Memory: 25kB
              ->  Partial HashAggregate  (cost=128340.81..128340.83 rows=2 width=12) (actual time=1206.761..1206.761 rows=2 loops=3)
                    Group Key: state
                    Batches: 1  Memory Usage: 24kB
                    Worker 0:  Batches: 1  Memory Usage: 24kB
                    Worker 1:  Batches: 1  Memory Usage: 24kB
                    ->  Parallel Seq Scan on river_job  (cost=0.00..117064.01 rows=2255360 width=4) (actual time=26.131..1074.503 rows=1801554 loops=3)
                          Filter: ((queue IS NOT NULL) AND (scheduled_at IS NOT NULL) AND (id IS NOT NULL) AND (priority > 0))
Planning Time: 7.718 ms
JIT:
  Functions: 27
"  Options: Inlining false, Optimization false, Expressions true, Deforming true"
"  Timing: Generation 18.873 ms, Inlining 0.000 ms, Optimization 7.710 ms, Emission 60.610 ms, Total 87.193 ms"
Execution Time: 1400.749 ms
available,5392821
completed,11839
retryable,2

@brandur
Collaborator

brandur commented Aug 6, 2024

Thanks @ligustah. We're looking into this.

If you're able to run this:

REINDEX INDEX CONCURRENTLY river_job_prioritized_fetching_index;

It'll very likely fix the problem, although if a whole bunch of new jobs churn through the table, it may reoccur.

We're looking into a longer term solution.
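
One quick way to tell whether index bloat was contributing is to compare the index's on-disk size before and after the reindex (a sketch; `pg_relation_size` and `pg_size_pretty` are standard Postgres functions):

```sql
-- Run before and after REINDEX; a large drop suggests bloat was part of the problem.
SELECT pg_size_pretty(pg_relation_size('river_job_prioritized_fetching_index'));
```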

brandur added a commit that referenced this issue Aug 7, 2024
This one's related to trying to find a solution for #106. After messing
around with query plans a lot on the "count by state" query, I came to
the conclusion in the end that Postgres might actually be doing the
right thing by falling back to a sequential scan, or at least only the
minimally wrong thing. Even forcing the count query to run against a
well-used index is a fairly slow operation when there are many jobs in
the database. It's hard to provide specifics because caching affects the
result so much (so running the same query twice in a row can produce
vastly different timings), but I've seen the index version take _longer_
than the seq scan in some cases.

So here, I'm proposing a radically different solution in which we add
some infrastructure to the River UI API server that lets it run slow
queries periodically in the background, then have API endpoints take
advantage of those cached results instead of having to run each
operation themselves, thereby making their responses ~instant.

I've written it such that this caching only kicks in when we know we're
working with a very large data set where it actually matters (currently
defined as > 1M rows), with the idea being that for smaller databases
we'll continue to run queries in-band so that results look as fresh and
real-time as possible.

To support this, I've had to make some changes to the River UI API
server/handler so that it has a `Start` function that can be invoked to
start background utilities like the query cache. It's a considerable
change, but I think it leaves us in a more sustainable place API-wise
because we may want to add other background utilities later on, and
returning an `http.Handler` isn't enough: even if you were to start
goroutines from `NewHandler`, there'd be no way to stop those goroutines
again, which is far from ideal (and problematic for anything that wants
to check for leaks with goleak).

I'm also going to propose that we increase the default API endpoint
timeout from 5 seconds to 10 seconds. When I load in 3 to 5 million job
rows, I see count queries taking right around that 3 to 5 second range.
Since the original number of 5 seconds was a little arbitrary anyway, it
can't hurt to give those queries a little more leeway. A problem that
could still occur even with my proposal here is that if a user starts
River UI and then immediately hits the UI, there won't be any cached
results yet, so the count query will go to the database directly, and
that may still cause a timeout at 5 seconds.

I've only applied caching to the count query so far, but I've written
the `QueryCacher` code such that it can cleanly support other queries if
we care to add them.
brandur added a commit that referenced this issue Aug 7, 2024
brandur added a commit that referenced this issue Aug 13, 2024
brandur added a commit that referenced this issue Aug 13, 2024
brandur added a commit that referenced this issue Aug 13, 2024
brandur added a commit that referenced this issue Aug 13, 2024
brandur added a commit that referenced this issue Aug 13, 2024
brandur added a commit that referenced this issue Aug 13, 2024
brandur added a commit that referenced this issue Aug 13, 2024
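
To make the caching approach described in that commit message concrete, here's a minimal, hypothetical sketch of a periodic background query cache in Go. The names (`CountCache`, `loadCounts`) and the 10-second refresh interval are illustrative only; River UI's actual `QueryCacher` may be structured quite differently:

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// CountCache periodically refreshes a slow count-by-state query in the
// background so request handlers can serve a cached snapshot instead of
// running the aggregate on every page load. Hypothetical sketch of the
// pattern described above, not River UI's actual QueryCacher.
type CountCache struct {
	loadCounts func(ctx context.Context) (map[string]int64, error)
	interval   time.Duration

	mu     sync.RWMutex
	counts map[string]int64
}

func NewCountCache(load func(ctx context.Context) (map[string]int64, error)) *CountCache {
	return &CountCache{loadCounts: load, interval: 10 * time.Second}
}

// Start runs the refresh loop until ctx is cancelled, which gives callers
// (and leak checkers like goleak) a clean way to stop the goroutine.
func (c *CountCache) Start(ctx context.Context) {
	c.refresh(ctx) // prime the cache once at startup

	ticker := time.NewTicker(c.interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			c.refresh(ctx)
		}
	}
}

func (c *CountCache) refresh(ctx context.Context) {
	counts, err := c.loadCounts(ctx)
	if err != nil {
		return // keep serving the previous snapshot on error
	}
	c.mu.Lock()
	c.counts = counts
	c.mu.Unlock()
}

// Counts returns the latest snapshot; it's nil until the first refresh
// succeeds, in which case a handler would fall back to querying directly.
func (c *CountCache) Counts() map[string]int64 {
	c.mu.RLock()
	defer c.mu.RUnlock()
	return c.counts
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	cache := NewCountCache(func(ctx context.Context) (map[string]int64, error) {
		// Stand-in for the slow "SELECT state, count(*) ... GROUP BY state" query.
		return map[string]int64{"available": 1119960, "completed": 4282637}, nil
	})
	go cache.Start(ctx)

	time.Sleep(50 * time.Millisecond) // let the initial refresh run
	fmt.Println(cache.Counts())
}
```

The key design point from the commit message is preserved here: the refresh loop only stops when its context is cancelled, so a `Start`-style entry point can shut it down cleanly.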
@bgentry
Contributor

bgentry commented Aug 26, 2024

@ligustah hi there, any chance you've been able to test out the REINDEX CONCURRENTLY suggestion to see if it resolves your querying issues? We also just released v0.4.0 with #108, which should help reduce the burden of these queries a bit.

@ligustah
Author

> @ligustah hi there, any chance you've been able to test out the REINDEX CONCURRENTLY suggestion to see if it resolves your querying issues? We also just released v0.4.0 with #108, which should help reduce the burden of these queries a bit.

My test wasn't particularly scientific, but it seemed like it didn't make much of a difference.
