Optimize repository and maintainer+repository events #58

AMDmi3 · 2024-10-31T15:40:28Z

We store a lot of events per repository (up to 1.2M as of now) and per maintainer+repository (up to 112k as of now) of which it's not possible to see more than 500 as feeds do not have pagination (and nobody have ever asked for it). Though we delete events older than a year, a lot of useless events still remain. We may want to introduce better cleanup procedure for these tables.

This would conserve a lot of space: events currently use around 1G of disk space (including indexes), and removing events starting with 500 per each repository/maintainer drops 97% repository entries and 52% repository + maintainer entries. We may want to introduce pagination though (esp for repository events) and bump it to e.g 10k, and even with that may still drop 80% and 30% resp.

Quick & dirty query to extract event ids to remove:

with numbered_events as (
    select
        id,
        row_number() over (partition by repository_id order by ts desc, id) as idx
    from repository_events
), removals as (
    select id from repository_events where ts < now() - interval '1 year'
    union
    select id from numbered_events where idx > 500
)
select count(*) from removals;

with numbered_events as (
    select
        id,
        row_number() over (partition by repository_id,maintainer_id order by ts desc, id) as idx
    from maintainer_repo_metapackages_events
), removals as (
    select id from maintainer_repo_metapackages_events where ts < now() - interval '1 year'
    union
    select id from numbered_events where idx > 500
)
select count(*) from removals;

While here, we may consider adding id columns to event table indexes, which would fix the need to sort a lot of data in rare cases where there's a lot of events with single timestamp.

AMDmi3 added Effort: simple A few lines fix Priority: someday Normal priority Type: refactoring Internal improvements which do not change behavior Topic: performance Both optimizations and perf monitoring Topic: database SQL code mostly labels Oct 31, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Optimize repository and maintainer+repository events #58

Optimize repository and maintainer+repository events #58

AMDmi3 commented Oct 31, 2024

Optimize repository and maintainer+repository events #58

Optimize repository and maintainer+repository events #58

Comments

AMDmi3 commented Oct 31, 2024