Optimize repository history #120

AMDmi3 · 2024-11-19T15:12:57Z

Repositories history (the one used for graphs) is currenty stored as timestamped json of all per-repository counters, like this:

{
  "t2": {
    "num_problems": 1799,
    "num_maintainers": 74,
    "num_metapackages": 5398,
    "num_metapackages_newest": 2861,
    "num_metapackages_unique": 473,
    "num_metapackages_outdated": 1761,
    "num_metapackages_comparable": 4668,
    "num_metapackages_vulnerable": 194,
    "num_metapackages_problematic": 269
  },
  "aur": {
    "num_problems": 7578,
    "num_maintainers": 14430,
    "num_metapackages": 75715,
    "num_metapackages_newest": 23697,
    "num_metapackages_unique": 36915,
    "num_metapackages_outdated": 8363,
    "num_metapackages_comparable": 32114,
    "num_metapackages_vulnerable": 382,
    "num_metapackages_problematic": 276
  },
  "mpr": {
    "num_problems": 36,
    "num_maintainers": 110,
    "num_metapackages": 713,
    "num_metapackages_newest": 173,
    "num_metapackages_unique": 125,
    "num_metapackages_outdated": 309,
    "num_metapackages_comparable": 491,
    "num_metapackages_vulnerable": 31,
    "num_metapackages_problematic": 12
  },
  "pld": {
    "num_problems": 0,
    "num_maintainers": 0,
    "num_metapackages": 12567,
    "num_metapackages_newest": 5258,
    "num_metapackages_unique": 1595,
    ...

At a first glance, it looks like JSON overhead here is tremendous and converting the counters into classic SQL columns, also splitting these into per-repository entries would be much more optimal. In practice that's not the case, and doing so produces table of more than 2x size (988 MB jsonb, 2316 MB columns), not counting index sizes. I assume that jsonb variant benefits from TOAST compression and from lack of postgresql row overhead (with 381 active repositories, that's around 9kb per history point).

However, the conversion allows to handle each repository independently, which makes lookups faster (no need to fetch counters for 400 repositories for only one of them), and also allows to drop history points which did not have any counter changes. The latter allows to shrink the table to 387 MB (+145 MB index), at the cost of need for periodic cleanup, or more complex history update logic (if previous history point is the same, update its timestamp instead of adding a new one), which I'd like to avoid to implement with current sql-based updater.

Some more tests are needed, but it looks like the columns variant is more viable. At the very least, we should pick one as currently both histories are stored which takes precious disk space.

Migration query:

TRUNCATE repositories_history_new; 

WITH translated AS (
    SELECT
        (SELECT id FROM repositories WHERE name = key) AS repository_id,
        ts,
        (value->>'num_problems')::integer AS num_problems,
        (value->>'num_maintainers')::integer AS num_maintainers,
        (value->>'num_metapackages')::integer AS num_projects,
        (value->>'num_metapackages_unique')::integer AS num_projects_unique,
        (value->>'num_metapackages_newest')::integer AS num_projects_newest,
        (value->>'num_metapackages_outdated')::integer AS num_projects_outdated,
        (value->>'num_metapackages_comparable')::integer AS num_projects_comparable,
        (value->>'num_metapackages_problematic')::integer AS num_projects_problematic,
        (value->>'num_metapackages_vulnerable')::integer AS num_projects_vulnerable
    FROM repositories_history, jsonb_each(snapshot) as j(key, value)
), extended AS (
    select
        *,
        lag(num_problems) OVER (PARTITION BY repository_id ORDER BY ts) AS prev_num_problems,
        lag(num_maintainers) OVER (PARTITION BY repository_id ORDER BY ts) AS prev_num_maintainers,
        lag(num_projects) OVER (PARTITION BY repository_id ORDER BY ts) AS prev_num_projects,
        lag(num_projects_unique) OVER (PARTITION BY repository_id ORDER BY ts) AS prev_num_projects_unique,
        lag(num_projects_newest) OVER (PARTITION BY repository_id ORDER BY ts) AS prev_num_projects_newest,
        lag(num_projects_outdated) OVER (PARTITION BY repository_id ORDER BY ts) AS prev_num_projects_outdated,
        lag(num_projects_comparable) OVER (PARTITION BY repository_id ORDER BY ts) AS prev_num_projects_comparable,
        lag(num_projects_problematic) OVER (PARTITION BY repository_id ORDER BY ts) AS prev_num_projects_problematic,
        lag(num_projects_vulnerable) OVER (PARTITION BY repository_id ORDER BY ts) AS prev_num_projects_vulnerable,

        lead(num_problems) OVER (PARTITION BY repository_id ORDER BY ts) AS next_num_problems,
        lead(num_maintainers) OVER (PARTITION BY repository_id ORDER BY ts) AS next_num_maintainers,
        lead(num_projects) OVER (PARTITION BY repository_id ORDER BY ts) AS next_num_projects,
        lead(num_projects_unique) OVER (PARTITION BY repository_id ORDER BY ts) AS next_num_projects_unique,
        lead(num_projects_newest) OVER (PARTITION BY repository_id ORDER BY ts) AS next_num_projects_newest,
        lead(num_projects_outdated) OVER (PARTITION BY repository_id ORDER BY ts) AS next_num_projects_outdated,
        lead(num_projects_comparable) OVER (PARTITION BY repository_id ORDER BY ts) AS next_num_projects_comparable,
        lead(num_projects_problematic) OVER (PARTITION BY repository_id ORDER BY ts) AS next_num_projects_problematic,
        lead(num_projects_vulnerable) OVER (PARTITION BY repository_id ORDER BY ts) AS next_num_projects_vulnerable
    FROM translated
    WHERE repository_id IS NOT NULL
)
INSERT INTO repositories_history_new
SELECT repository_id, ts, num_problems, num_maintainers, num_projects, num_projects_unique, num_projects_newest, num_projects_outdated, num_projects_comparable, num_projects_problematic, num_projects_vulnerable
FROM extended
WHERE num_problems IS DISTINCT FROM prev_num_problems
    OR num_maintainers IS DISTINCT FROM prev_num_maintainers
    OR num_projects IS DISTINCT FROM prev_num_projects
    OR num_projects_unique IS DISTINCT FROM prev_num_projects_unique
    OR num_projects_newest IS DISTINCT FROM prev_num_projects_newest
    OR num_projects_outdated IS DISTINCT FROM prev_num_projects_outdated
    OR num_projects_comparable IS DISTINCT FROM prev_num_projects_comparable
    OR num_projects_problematic IS DISTINCT FROM prev_num_projects_problematic
    OR num_projects_vulnerable IS DISTINCT FROM prev_num_projects_vulnerable

    OR num_problems IS DISTINCT FROM next_num_problems
    OR num_maintainers IS DISTINCT FROM next_num_maintainers
    OR num_projects IS DISTINCT FROM next_num_projects
    OR num_projects_unique IS DISTINCT FROM next_num_projects_unique
    OR num_projects_newest IS DISTINCT FROM next_num_projects_newest
    OR num_projects_outdated IS DISTINCT FROM next_num_projects_outdated
    OR num_projects_comparable IS DISTINCT FROM next_num_projects_comparable
    OR num_projects_problematic IS DISTINCT FROM next_num_projects_problematic
    OR num_projects_vulnerable IS DISTINCT FROM next_num_projects_vulnerable
;

Cleanup query:

WITH extended AS (
    SELECT *,
        lag(num_problems) OVER (PARTITION BY repository_id ORDER BY ts) AS prev_num_problems,
        lag(num_maintainers) OVER (PARTITION BY repository_id ORDER BY ts) AS prev_num_maintainers,
        lag(num_projects) OVER (PARTITION BY repository_id ORDER BY ts) AS prev_num_projects,
        lag(num_projects_unique) OVER (PARTITION BY repository_id ORDER BY ts) AS prev_num_projects_unique,
        lag(num_projects_newest) OVER (PARTITION BY repository_id ORDER BY ts) AS prev_num_projects_newest,
        lag(num_projects_outdated) OVER (PARTITION BY repository_id ORDER BY ts) AS prev_num_projects_outdated,
        lag(num_projects_comparable) OVER (PARTITION BY repository_id ORDER BY ts) AS prev_num_projects_comparable,
        lag(num_projects_problematic) OVER (PARTITION BY repository_id ORDER BY ts) AS prev_num_projects_problematic,
        lag(num_projects_vulnerable) OVER (PARTITION BY repository_id ORDER BY ts) AS prev_num_projects_vulnerable,

        lead(num_problems) OVER (PARTITION BY repository_id ORDER BY ts) AS next_num_problems,
        lead(num_maintainers) OVER (PARTITION BY repository_id ORDER BY ts) AS next_num_maintainers,
        lead(num_projects) OVER (PARTITION BY repository_id ORDER BY ts) AS next_num_projects,
        lead(num_projects_unique) OVER (PARTITION BY repository_id ORDER BY ts) AS next_num_projects_unique,
        lead(num_projects_newest) OVER (PARTITION BY repository_id ORDER BY ts) AS next_num_projects_newest,
        lead(num_projects_outdated) OVER (PARTITION BY repository_id ORDER BY ts) AS next_num_projects_outdated,
        lead(num_projects_comparable) OVER (PARTITION BY repository_id ORDER BY ts) AS next_num_projects_comparable,
        lead(num_projects_problematic) OVER (PARTITION BY repository_id ORDER BY ts) AS next_num_projects_problematic,
        lead(num_projects_vulnerable) OVER (PARTITION BY repository_id ORDER BY ts) AS next_num_projects_vulnerable
    FROM repositories_history_new
	--WHERE repository_id = 113
	--WHERE ts > now() - interval '1 day'
	ORDER BY ts
), duplicates AS (
	SELECT repository_id, ts
	FROM extended
	WHERE num_problems IS NOT DISTINCT FROM prev_num_problems
		AND num_maintainers IS NOT DISTINCT FROM prev_num_maintainers
		AND num_projects IS NOT DISTINCT FROM prev_num_projects
		AND num_projects_unique IS NOT DISTINCT FROM prev_num_projects_unique
		AND num_projects_newest IS NOT DISTINCT FROM prev_num_projects_newest
		AND num_projects_outdated IS NOT DISTINCT FROM prev_num_projects_outdated
		AND num_projects_comparable IS NOT DISTINCT FROM prev_num_projects_comparable
		AND num_projects_problematic IS NOT DISTINCT FROM prev_num_projects_problematic
		AND num_projects_vulnerable IS NOT DISTINCT FROM prev_num_projects_vulnerable

		AND num_problems IS NOT DISTINCT FROM next_num_problems
		AND num_maintainers IS NOT DISTINCT FROM next_num_maintainers
		AND num_projects IS NOT DISTINCT FROM next_num_projects
		AND num_projects_unique IS NOT DISTINCT FROM next_num_projects_unique
		AND num_projects_newest IS NOT DISTINCT FROM next_num_projects_newest
		AND num_projects_outdated IS NOT DISTINCT FROM next_num_projects_outdated
		AND num_projects_comparable IS NOT DISTINCT FROM next_num_projects_comparable
		AND num_projects_problematic IS NOT DISTINCT FROM next_num_projects_problematic
		AND num_projects_vulnerable IS NOT DISTINCT FROM next_num_projects_vulnerable
)
DELETE FROM repositories_history_new
USING duplicates
WHERE repositories_history_new.repository_id = duplicates.repository_id
	AND repositories_history_new.ts = duplicates.ts;

Migration without deduplication then cleanup produces the same result as deduplicating migration. However cleanup on 2GB table takes very long time, so it may be faster if it's limited with repository. Cleanup on mostly deduplicated table is fast, and is even faster if limited with age, so it's safe to run periodically or after the update.

Additional thing: for this to work, SQL queries for /graph/repo/* endpoints should be tweaked to fetch additional row just before desired range of 21 day, as the history is now sparse and the graph may now only contain one recent data point, with which it's not possible to plot any lines, so we need an extra point (in fact, we need it always not to have a gap at the beginning of the graph).

AMDmi3 · 2024-11-19T16:32:07Z

Note: it turned out we already have deduplication of fresh entries in repositories_history_new.

AMDmi3 · 2024-11-19T17:03:14Z

We also don't really want to store historical data on per-update granularity - it also makes sense to leave ~one point per day for history older than a year.

AMDmi3 · 2024-11-19T17:03:50Z

And we can do the same for total statistics. Not that it would conserve much space, but for consistency at the very least.

Before this, graphs didn't include the segment which intersects left boundary of a graph (21 day boundary in other words). This was not visible because on 21 day range, updates are rather frequent, and the leftmost segment starts just after the boundary. However, if there's huge gap in updates, or history is deduplicated to remove points which do not have any counter changes (see #120), we'll see a missing segment. Fix that by fetching an extra data point just before the graph range.

Also implement a way to use new history for these (#120)

AMDmi3 · 2024-11-20T19:29:56Z

It also turned out that we produce history for removed repositories. This should be fixed too. We can then clean up history generated outside of repository lifetime (based on repository.last_seen).

AMDmi3 · 2024-11-20T19:53:58Z

We can probably also remove history data for repositories which have not been active for more than a year. That's more than 10% entries.

This is major performance optimization not only because we no longer need to fetch, unpack and process several megabytes of jsonb, but also because reworked SQL code in Rust implementation of graphs contained performance regression which made this problem worse. It turns out that if there are multiple statements which extract field from single jsonb field (e.g. snapshot->{repository}->>{field}), it's processed from scratch each time (which as I understand involve TOAST fetch, decompression and jsonb processing), so relative graphs (*_percent, *_by_*) became 3x slower in SQL. This can be fixed by adding intermediate query which extracts repository data once, but there's no reason not to switch to a new history, as there were no visible discrepancies after several update cycles. Old history mode remains as a reference for around a week, after which it's safe to remove old history mode and old history table completely.

The production database still has jsonb column, and the updater still fills it, for that's required as some old webapp endpoints which need it are used. Here, however, pretend as if it doesn't exists, so it can be transparently dropped from the db when no longer needed.

AMDmi3 mentioned this issue Nov 19, 2024

Implement all remaining webapp HTTP endpoints #44

Open

45 tasks

AMDmi3 added a commit that referenced this issue Nov 19, 2024

webapp: Implement repository graphs (for absolute values only)

6696695

Also implement a way to use new history for these (#120)

AMDmi3 added a commit that referenced this issue Dec 4, 2024

webapp: Remove webapp support for old repository history (#120)

8d35dc6

AMDmi3 added a commit that referenced this issue Dec 16, 2024

webapp: Update history tables schemas (#120)

385f112

AMDmi3 added a commit that referenced this issue Dec 16, 2024

webapp: Update history tables schemas (#120)

52eca9e

AMDmi3 added a commit that referenced this issue Dec 16, 2024

webapp: Switch to flat statistics history table (#120)

81af3f6

AMDmi3 added a commit to repology/repology-updater that referenced this issue Dec 17, 2024

Rename repositories_history table (repology/repology-rs#120)

ded6174

AMDmi3 added a commit that referenced this issue Dec 17, 2024

webapp: Rename repositories_history table (#120)

113e22d

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Optimize repository history #120

Optimize repository history #120

AMDmi3 commented Nov 19, 2024 •

edited

Loading

AMDmi3 commented Nov 19, 2024

AMDmi3 commented Nov 19, 2024

AMDmi3 commented Nov 19, 2024

AMDmi3 commented Nov 20, 2024 •

edited

Loading

AMDmi3 commented Nov 20, 2024

Optimize repository history #120

Optimize repository history #120

Comments

AMDmi3 commented Nov 19, 2024 • edited Loading

AMDmi3 commented Nov 19, 2024

AMDmi3 commented Nov 19, 2024

AMDmi3 commented Nov 19, 2024

AMDmi3 commented Nov 20, 2024 • edited Loading

AMDmi3 commented Nov 20, 2024

AMDmi3 commented Nov 19, 2024 •

edited

Loading

AMDmi3 commented Nov 20, 2024 •

edited

Loading