Optimize repository history #120
Note: it turned out we already have deduplication of fresh entries in …
We also don't really want to store historical data at per-update granularity: it also makes sense to keep ~one point per day for history older than a year.
And we can do the same for total statistics. Not that it would conserve much space, but for consistency at the very least.
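The thinning described above could be done with a periodic query along these lines. This is only a sketch: the table name `repositories_history_new` and the columns `repository_id` and `ts` are assumptions, not the actual schema.

```sql
-- Keep only the earliest point per day for history older than a year.
-- All table and column names here are assumed for illustration.
DELETE FROM repositories_history_new h
WHERE h.ts < now() - interval '1 year'
  AND h.ts <> (
      SELECT min(h2.ts)
      FROM repositories_history_new h2
      WHERE h2.repository_id = h.repository_id
        AND date_trunc('day', h2.ts) = date_trunc('day', h.ts)
  );
```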
Before this, graphs didn't include the segment which intersects the left boundary of the graph (the 21-day boundary, in other words). This was not visible because on the 21-day range, updates are rather frequent, and the leftmost segment starts just after the boundary. However, if there's a huge gap in updates, or history is deduplicated to remove points which do not have any counter changes (see #120), we'll see a missing segment. Fix that by fetching an extra data point just before the graph range.
Also implement a way to use the new history for these (#120)
It also turned out that we produce history for removed repositories. This should be fixed too. We can then clean up history generated outside of repository lifetime (based on …
We can probably also remove history data for repositories which have not been active for more than a year. That's more than 10% of entries.
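Such a cleanup might look as follows. This is a sketch under assumptions: a `repositories` table with `id` and `last_seen` columns, and a per-repository history table named `repositories_history_new`, none of which are confirmed by this issue.

```sql
-- Drop history for repositories with no activity for over a year.
-- Table and column names (repositories.last_seen etc.) are assumptions.
DELETE FROM repositories_history_new h
WHERE NOT EXISTS (
    SELECT 1
    FROM repositories r
    WHERE r.id = h.repository_id
      AND r.last_seen > now() - interval '1 year'
);
```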
This is a major performance optimization, not only because we no longer need to fetch, unpack and process several megabytes of jsonb, but also because the reworked SQL code in the Rust implementation of graphs contained a performance regression which made this problem worse. It turns out that if there are multiple expressions which extract a field from a single jsonb column (e.g. snapshot->{repository}->>{field}), it is processed from scratch each time (which, as I understand, involves a TOAST fetch, decompression and jsonb processing), so relative graphs (*_percent, *_by_*) became 3x slower in SQL. This could be fixed by adding an intermediate query which extracts the repository data once, but there's no reason not to switch to the new history, as there were no visible discrepancies after several update cycles. Old history mode remains as a reference for around a week, after which it's safe to remove the old history mode and the old history table completely.
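The intermediate-query workaround mentioned above can be sketched like this. It is illustrative only: the table name, the 'freebsd' repository key and the counter fields are assumptions.

```sql
-- Extract the per-repository jsonb object once in a subquery, then reference
-- its fields, instead of repeating snapshot->'freebsd'->>'...' and paying for
-- the TOAST fetch, decompression and jsonb parsing on every extraction.
SELECT
    ts,
    100.0 * (repo->>'num_packages_newest')::float
          / nullif((repo->>'num_packages')::float, 0) AS newest_percent
FROM (
    SELECT ts, snapshot->'freebsd' AS repo
    FROM repositories_history
) AS unpacked
ORDER BY ts;
```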
The production database still has the jsonb column, and the updater still fills it, because some old webapp endpoints which need it are still in use. Here, however, pretend it doesn't exist, so it can be transparently dropped from the db when no longer needed.
Repositories history (the one used for graphs) is currently stored as a timestamped json of all per-repository counters, like this:
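The shape is roughly the following, with the timestamp stored in a separate column; the repository and counter names below are illustrative assumptions, not the actual ones:

```json
{
    "freebsd": { "num_packages": 32000, "num_packages_newest": 21000, "num_problems": 120 },
    "debian":  { "num_packages": 29000, "num_packages_newest": 18000, "num_problems": 340 }
}
```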
At first glance, it looks like the JSON overhead here is tremendous, and converting the counters into classic SQL columns, also splitting these into per-repository entries, would be much more efficient. In practice that's not the case, and doing so produces a table of more than 2x the size (988 MB jsonb, 2316 MB columns), not counting index sizes. I assume the jsonb variant benefits from TOAST compression and from the lack of postgresql per-row overhead (with 381 active repositories, that's around 9kb per history point).
However, the conversion allows handling each repository independently, which makes lookups faster (no need to fetch counters for 400 repositories to get just one of them), and also allows dropping history points which did not have any counter changes. The latter shrinks the table to 387 MB (+145 MB index), at the cost of either periodic cleanup or more complex history update logic (if the previous history point is the same, update its timestamp instead of adding a new one), which I'd rather avoid implementing in the current sql-based updater.
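For reference, the more complex update logic could look roughly like this. It is only a sketch: the surrogate `id` column, the counter columns and the parameter placeholders are all assumptions.

```sql
-- If the latest point for this repository has the same counters, bump its
-- timestamp; otherwise insert a new point. $1 = repository id, $2/$3 = counters.
WITH last AS (
    SELECT id, num_packages, num_problems
    FROM repositories_history_new
    WHERE repository_id = $1
    ORDER BY ts DESC
    LIMIT 1
), bumped AS (
    UPDATE repositories_history_new h
    SET ts = now()
    FROM last
    WHERE h.id = last.id
      AND last.num_packages = $2
      AND last.num_problems = $3
    RETURNING h.id
)
INSERT INTO repositories_history_new (repository_id, ts, num_packages, num_problems)
SELECT $1, now(), $2, $3
WHERE NOT EXISTS (SELECT 1 FROM bumped);
```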
Some more tests are needed, but it looks like the columns variant is more viable. At the very least, we should pick one, as currently both histories are stored, which takes precious disk space.
Migration query:
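A non-deduplicating migration might look roughly like this (a sketch: the table names, the `repositories` `id`/`name` columns, and the counter set are assumptions, with only two counters shown):

```sql
-- Explode the timestamped jsonb snapshots into per-repository rows.
INSERT INTO repositories_history_new (repository_id, ts, num_packages, num_problems)
SELECT
    r.id,
    h.ts,
    (h.snapshot->r.name->>'num_packages')::integer,
    (h.snapshot->r.name->>'num_problems')::integer
FROM repositories_history h
JOIN repositories r ON h.snapshot ? r.name;
```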
Cleanup query:
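The cleanup could be sketched as follows (same caveats: the surrogate `id` column and the counter columns are assumptions):

```sql
-- Remove points whose counters are identical to the previous point for the
-- same repository; the first point of each run of duplicates is kept.
DELETE FROM repositories_history_new
WHERE id IN (
    SELECT id
    FROM (
        SELECT id,
               (num_packages = lag(num_packages) OVER w
                AND num_problems = lag(num_problems) OVER w) AS is_dup
        FROM repositories_history_new
        WINDOW w AS (PARTITION BY repository_id ORDER BY ts)
    ) t
    WHERE is_dup
);
```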
Migration without deduplication followed by cleanup produces the same result as the deduplicating migration. However, cleanup on the 2GB table takes a very long time, so it may be faster if limited to a single repository. Cleanup on a mostly deduplicated table is fast, and is even faster if limited by age, so it's safe to run periodically or after each update.
An additional thing: for this to work, SQL queries for the /graph/repo/* endpoints should be tweaked to fetch one additional row just before the desired 21-day range, since the history is now sparse and the graph may contain only one recent data point, from which it's not possible to plot any lines, so we need an extra point (in fact, we always need it to avoid a gap at the beginning of the graph).
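The extra-point fetch can be sketched as a union of the window and the single preceding point (names and the single counter column are assumptions):

```sql
-- Fetch the 21-day window plus the single point just before it, so the
-- leftmost segment can be drawn across the graph boundary.
(
    SELECT ts, num_packages
    FROM repositories_history_new
    WHERE repository_id = $1
      AND ts < now() - interval '21 days'
    ORDER BY ts DESC
    LIMIT 1
)
UNION ALL
SELECT ts, num_packages
FROM repositories_history_new
WHERE repository_id = $1
  AND ts >= now() - interval '21 days'
ORDER BY ts;
```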