Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize repository history #120

Open
AMDmi3 opened this issue Nov 19, 2024 · 5 comments
Open

Optimize repository history #120

AMDmi3 opened this issue Nov 19, 2024 · 5 comments
Labels
Component: updater Component: webapp repology-webapp, repology web application Effort: intermediate A class or a huge function needs to be written Endpoint: graphs Priority: soon Should be done in a few days Topic: database SQL code mostly Topic: performance Both optimizations and perf monitoring Type: refactoring Internal improvements which do not change behavior

Comments

@AMDmi3
Copy link
Member

AMDmi3 commented Nov 19, 2024

Repositories history (the one used for graphs) is currenty stored as timestamped json of all per-repository counters, like this:

{
  "t2": {
    "num_problems": 1799,
    "num_maintainers": 74,
    "num_metapackages": 5398,
    "num_metapackages_newest": 2861,
    "num_metapackages_unique": 473,
    "num_metapackages_outdated": 1761,
    "num_metapackages_comparable": 4668,
    "num_metapackages_vulnerable": 194,
    "num_metapackages_problematic": 269
  },
  "aur": {
    "num_problems": 7578,
    "num_maintainers": 14430,
    "num_metapackages": 75715,
    "num_metapackages_newest": 23697,
    "num_metapackages_unique": 36915,
    "num_metapackages_outdated": 8363,
    "num_metapackages_comparable": 32114,
    "num_metapackages_vulnerable": 382,
    "num_metapackages_problematic": 276
  },
  "mpr": {
    "num_problems": 36,
    "num_maintainers": 110,
    "num_metapackages": 713,
    "num_metapackages_newest": 173,
    "num_metapackages_unique": 125,
    "num_metapackages_outdated": 309,
    "num_metapackages_comparable": 491,
    "num_metapackages_vulnerable": 31,
    "num_metapackages_problematic": 12
  },
  "pld": {
    "num_problems": 0,
    "num_maintainers": 0,
    "num_metapackages": 12567,
    "num_metapackages_newest": 5258,
    "num_metapackages_unique": 1595,
    ...

At a first glance, it looks like JSON overhead here is tremendous and converting the counters into classic SQL columns, also splitting these into per-repository entries would be much more optimal. In practice that's not the case, and doing so produces table of more than 2x size (988 MB jsonb, 2316 MB columns), not counting index sizes. I assume that jsonb variant benefits from TOAST compression and from lack of postgresql row overhead (with 381 active repositories, that's around 9kb per history point).

However, the conversion allows to handle each repository independently, which makes lookups faster (no need to fetch counters for 400 repositories for only one of them), and also allows to drop history points which did not have any counter changes. The latter allows to shrink the table to 387 MB (+145 MB index), at the cost of need for periodic cleanup, or more complex history update logic (if previous history point is the same, update its timestamp instead of adding a new one), which I'd like to avoid to implement with current sql-based updater.

Some more tests are needed, but it looks like the columns variant is more viable. At the very least, we should pick one as currently both histories are stored which takes precious disk space.

Migration query:

TRUNCATE repositories_history_new; 

WITH translated AS (
    SELECT
        (SELECT id FROM repositories WHERE name = key) AS repository_id,
        ts,
        (value->>'num_problems')::integer AS num_problems,
        (value->>'num_maintainers')::integer AS num_maintainers,
        (value->>'num_metapackages')::integer AS num_projects,
        (value->>'num_metapackages_unique')::integer AS num_projects_unique,
        (value->>'num_metapackages_newest')::integer AS num_projects_newest,
        (value->>'num_metapackages_outdated')::integer AS num_projects_outdated,
        (value->>'num_metapackages_comparable')::integer AS num_projects_comparable,
        (value->>'num_metapackages_problematic')::integer AS num_projects_problematic,
        (value->>'num_metapackages_vulnerable')::integer AS num_projects_vulnerable
    FROM repositories_history, jsonb_each(snapshot) as j(key, value)
), extended AS (
    select
        *,
        lag(num_problems) OVER (PARTITION BY repository_id ORDER BY ts) AS prev_num_problems,
        lag(num_maintainers) OVER (PARTITION BY repository_id ORDER BY ts) AS prev_num_maintainers,
        lag(num_projects) OVER (PARTITION BY repository_id ORDER BY ts) AS prev_num_projects,
        lag(num_projects_unique) OVER (PARTITION BY repository_id ORDER BY ts) AS prev_num_projects_unique,
        lag(num_projects_newest) OVER (PARTITION BY repository_id ORDER BY ts) AS prev_num_projects_newest,
        lag(num_projects_outdated) OVER (PARTITION BY repository_id ORDER BY ts) AS prev_num_projects_outdated,
        lag(num_projects_comparable) OVER (PARTITION BY repository_id ORDER BY ts) AS prev_num_projects_comparable,
        lag(num_projects_problematic) OVER (PARTITION BY repository_id ORDER BY ts) AS prev_num_projects_problematic,
        lag(num_projects_vulnerable) OVER (PARTITION BY repository_id ORDER BY ts) AS prev_num_projects_vulnerable,

        lead(num_problems) OVER (PARTITION BY repository_id ORDER BY ts) AS next_num_problems,
        lead(num_maintainers) OVER (PARTITION BY repository_id ORDER BY ts) AS next_num_maintainers,
        lead(num_projects) OVER (PARTITION BY repository_id ORDER BY ts) AS next_num_projects,
        lead(num_projects_unique) OVER (PARTITION BY repository_id ORDER BY ts) AS next_num_projects_unique,
        lead(num_projects_newest) OVER (PARTITION BY repository_id ORDER BY ts) AS next_num_projects_newest,
        lead(num_projects_outdated) OVER (PARTITION BY repository_id ORDER BY ts) AS next_num_projects_outdated,
        lead(num_projects_comparable) OVER (PARTITION BY repository_id ORDER BY ts) AS next_num_projects_comparable,
        lead(num_projects_problematic) OVER (PARTITION BY repository_id ORDER BY ts) AS next_num_projects_problematic,
        lead(num_projects_vulnerable) OVER (PARTITION BY repository_id ORDER BY ts) AS next_num_projects_vulnerable
    FROM translated
    WHERE repository_id IS NOT NULL
)
INSERT INTO repositories_history_new
SELECT repository_id, ts, num_problems, num_maintainers, num_projects, num_projects_unique, num_projects_newest, num_projects_outdated, num_projects_comparable, num_projects_problematic, num_projects_vulnerable
FROM extended
WHERE num_problems IS DISTINCT FROM prev_num_problems
    OR num_maintainers IS DISTINCT FROM prev_num_maintainers
    OR num_projects IS DISTINCT FROM prev_num_projects
    OR num_projects_unique IS DISTINCT FROM prev_num_projects_unique
    OR num_projects_newest IS DISTINCT FROM prev_num_projects_newest
    OR num_projects_outdated IS DISTINCT FROM prev_num_projects_outdated
    OR num_projects_comparable IS DISTINCT FROM prev_num_projects_comparable
    OR num_projects_problematic IS DISTINCT FROM prev_num_projects_problematic
    OR num_projects_vulnerable IS DISTINCT FROM prev_num_projects_vulnerable

    OR num_problems IS DISTINCT FROM next_num_problems
    OR num_maintainers IS DISTINCT FROM next_num_maintainers
    OR num_projects IS DISTINCT FROM next_num_projects
    OR num_projects_unique IS DISTINCT FROM next_num_projects_unique
    OR num_projects_newest IS DISTINCT FROM next_num_projects_newest
    OR num_projects_outdated IS DISTINCT FROM next_num_projects_outdated
    OR num_projects_comparable IS DISTINCT FROM next_num_projects_comparable
    OR num_projects_problematic IS DISTINCT FROM next_num_projects_problematic
    OR num_projects_vulnerable IS DISTINCT FROM next_num_projects_vulnerable
;

Cleanup query:

WITH extended AS (
    SELECT *,
        lag(num_problems) OVER (PARTITION BY repository_id ORDER BY ts) AS prev_num_problems,
        lag(num_maintainers) OVER (PARTITION BY repository_id ORDER BY ts) AS prev_num_maintainers,
        lag(num_projects) OVER (PARTITION BY repository_id ORDER BY ts) AS prev_num_projects,
        lag(num_projects_unique) OVER (PARTITION BY repository_id ORDER BY ts) AS prev_num_projects_unique,
        lag(num_projects_newest) OVER (PARTITION BY repository_id ORDER BY ts) AS prev_num_projects_newest,
        lag(num_projects_outdated) OVER (PARTITION BY repository_id ORDER BY ts) AS prev_num_projects_outdated,
        lag(num_projects_comparable) OVER (PARTITION BY repository_id ORDER BY ts) AS prev_num_projects_comparable,
        lag(num_projects_problematic) OVER (PARTITION BY repository_id ORDER BY ts) AS prev_num_projects_problematic,
        lag(num_projects_vulnerable) OVER (PARTITION BY repository_id ORDER BY ts) AS prev_num_projects_vulnerable,

        lead(num_problems) OVER (PARTITION BY repository_id ORDER BY ts) AS next_num_problems,
        lead(num_maintainers) OVER (PARTITION BY repository_id ORDER BY ts) AS next_num_maintainers,
        lead(num_projects) OVER (PARTITION BY repository_id ORDER BY ts) AS next_num_projects,
        lead(num_projects_unique) OVER (PARTITION BY repository_id ORDER BY ts) AS next_num_projects_unique,
        lead(num_projects_newest) OVER (PARTITION BY repository_id ORDER BY ts) AS next_num_projects_newest,
        lead(num_projects_outdated) OVER (PARTITION BY repository_id ORDER BY ts) AS next_num_projects_outdated,
        lead(num_projects_comparable) OVER (PARTITION BY repository_id ORDER BY ts) AS next_num_projects_comparable,
        lead(num_projects_problematic) OVER (PARTITION BY repository_id ORDER BY ts) AS next_num_projects_problematic,
        lead(num_projects_vulnerable) OVER (PARTITION BY repository_id ORDER BY ts) AS next_num_projects_vulnerable
    FROM repositories_history_new
	--WHERE repository_id = 113
	--WHERE ts > now() - interval '1 day'
	ORDER BY ts
), duplicates AS (
	SELECT repository_id, ts
	FROM extended
	WHERE num_problems IS NOT DISTINCT FROM prev_num_problems
		AND num_maintainers IS NOT DISTINCT FROM prev_num_maintainers
		AND num_projects IS NOT DISTINCT FROM prev_num_projects
		AND num_projects_unique IS NOT DISTINCT FROM prev_num_projects_unique
		AND num_projects_newest IS NOT DISTINCT FROM prev_num_projects_newest
		AND num_projects_outdated IS NOT DISTINCT FROM prev_num_projects_outdated
		AND num_projects_comparable IS NOT DISTINCT FROM prev_num_projects_comparable
		AND num_projects_problematic IS NOT DISTINCT FROM prev_num_projects_problematic
		AND num_projects_vulnerable IS NOT DISTINCT FROM prev_num_projects_vulnerable

		AND num_problems IS NOT DISTINCT FROM next_num_problems
		AND num_maintainers IS NOT DISTINCT FROM next_num_maintainers
		AND num_projects IS NOT DISTINCT FROM next_num_projects
		AND num_projects_unique IS NOT DISTINCT FROM next_num_projects_unique
		AND num_projects_newest IS NOT DISTINCT FROM next_num_projects_newest
		AND num_projects_outdated IS NOT DISTINCT FROM next_num_projects_outdated
		AND num_projects_comparable IS NOT DISTINCT FROM next_num_projects_comparable
		AND num_projects_problematic IS NOT DISTINCT FROM next_num_projects_problematic
		AND num_projects_vulnerable IS NOT DISTINCT FROM next_num_projects_vulnerable
)
DELETE FROM repositories_history_new
USING duplicates
WHERE repositories_history_new.repository_id = duplicates.repository_id
	AND repositories_history_new.ts = duplicates.ts;

Migration without deduplication then cleanup produces the same result as deduplicating migration. However cleanup on 2GB table takes very long time, so it may be faster if it's limited with repository. Cleanup on mostly deduplicated table is fast, and is even faster if limited with age, so it's safe to run periodically or after the update.

Additional thing: for this to work, SQL queries for /graph/repo/* endpoints should be tweaked to fetch additional row just before desired range of 21 day, as the history is now sparse and the graph may now only contain one recent data point, with which it's not possible to plot any lines, so we need an extra point (in fact, we need it always not to have a gap at the beginning of the graph).

@AMDmi3 AMDmi3 added Component: webapp repology-webapp, repology web application Effort: intermediate A class or a huge function needs to be written Priority: soon Should be done in a few days Type: refactoring Internal improvements which do not change behavior Component: updater Topic: performance Both optimizations and perf monitoring Topic: database SQL code mostly Endpoint: graphs labels Nov 19, 2024
@AMDmi3
Copy link
Member Author

AMDmi3 commented Nov 19, 2024

Note: it turned out we already have deduplication of fresh entries in repositories_history_new.

@AMDmi3
Copy link
Member Author

AMDmi3 commented Nov 19, 2024

We also don't really want to store historical data on per-update granularity - it also makes sense to leave ~one point per day for history older than a year.

@AMDmi3
Copy link
Member Author

AMDmi3 commented Nov 19, 2024

And we can do the same for total statistics. Not that it would conserve much space, but for consistency at the very least.

AMDmi3 added a commit that referenced this issue Nov 19, 2024
Before this, graphs didn't include the segment which intersects left
boundary of a graph (21 day boundary in other words). This was not
visible because on 21 day range, updates are rather frequent, and the
leftmost segment starts just after the boundary.

However, if there's huge gap in updates, or history is deduplicated
to remove points which do not have any counter changes (see #120),
we'll see a missing segment. Fix that by fetching an extra data point
just before the graph range.
AMDmi3 added a commit that referenced this issue Nov 19, 2024
Also implement a way to use new history for these (#120)
@AMDmi3
Copy link
Member Author

AMDmi3 commented Nov 20, 2024

It also turned out that we produce history for removed repositories. This should be fixed too. We can then clean up history generated outside of repository lifetime (based on repository.last_seen).

@AMDmi3
Copy link
Member Author

AMDmi3 commented Nov 20, 2024

We can probably also remove history data for repositories which have not been active for more than a year. That's more than 10% entries.

AMDmi3 added a commit that referenced this issue Nov 21, 2024
This is major performance optimization not only because we no longer
need to fetch, unpack and process several megabytes of jsonb, but also
because reworked SQL code in Rust implementation of graphs contained
performance regression which made this problem worse. It turns out that
if there are multiple statements which extract field from single jsonb
field (e.g. snapshot->{repository}->>{field}), it's processed from
scratch each time (which as I understand involve TOAST fetch,
decompression and jsonb processing), so relative graphs (*_percent,
*_by_*) became 3x slower in SQL. This can be fixed by adding
intermediate query which extracts repository data once, but there's no
reason not to switch to a new history, as there were no visible
discrepancies after several update cycles.

Old history mode remains as a reference for around a week, after which
it's safe to remove old history mode and old history table completely.
AMDmi3 added a commit to repology/repology-updater that referenced this issue Dec 17, 2024
AMDmi3 added a commit that referenced this issue Dec 17, 2024
The production database still has jsonb column, and the updater still
fills it, for that's required as some old webapp endpoints which need
it are used. Here, however, pretend as if it doesn't exists, so it
can be transparently dropped from the db when no longer needed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Component: updater Component: webapp repology-webapp, repology web application Effort: intermediate A class or a huge function needs to be written Endpoint: graphs Priority: soon Should be done in a few days Topic: database SQL code mostly Topic: performance Both optimizations and perf monitoring Type: refactoring Internal improvements which do not change behavior
Projects
None yet
Development

No branches or pull requests

1 participant