Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize repository and maintainer+repository events #58

Open
AMDmi3 opened this issue Oct 31, 2024 · 0 comments
Open

Optimize repository and maintainer+repository events #58

AMDmi3 opened this issue Oct 31, 2024 · 0 comments
Labels
Effort: simple A few lines fix Priority: someday Normal priority Topic: database SQL code mostly Topic: performance Both optimizations and perf monitoring Type: refactoring Internal improvements which do not change behavior

Comments

@AMDmi3
Copy link
Member

AMDmi3 commented Oct 31, 2024

We store a lot of events per repository (up to 1.2M as of now) and per maintainer+repository (up to 112k as of now) of which it's not possible to see more than 500 as feeds do not have pagination (and nobody have ever asked for it). Though we delete events older than a year, a lot of useless events still remain. We may want to introduce better cleanup procedure for these tables.

This would conserve a lot of space: events currently use around 1G of disk space (including indexes), and removing events starting with 500 per each repository/maintainer drops 97% repository entries and 52% repository + maintainer entries. We may want to introduce pagination though (esp for repository events) and bump it to e.g 10k, and even with that may still drop 80% and 30% resp.

Quick & dirty query to extract event ids to remove:

with numbered_events as (
    select
        id,
        row_number() over (partition by repository_id order by ts desc, id) as idx
    from repository_events
), removals as (
    select id from repository_events where ts < now() - interval '1 year'
    union
    select id from numbered_events where idx > 500
)
select count(*) from removals;

with numbered_events as (
    select
        id,
        row_number() over (partition by repository_id,maintainer_id order by ts desc, id) as idx
    from maintainer_repo_metapackages_events
), removals as (
    select id from maintainer_repo_metapackages_events where ts < now() - interval '1 year'
    union
    select id from numbered_events where idx > 500
)
select count(*) from removals;

While here, we may consider adding id columns to event table indexes, which would fix the need to sort a lot of data in rare cases where there's a lot of events with single timestamp.

@AMDmi3 AMDmi3 added Effort: simple A few lines fix Priority: someday Normal priority Type: refactoring Internal improvements which do not change behavior Topic: performance Both optimizations and perf monitoring Topic: database SQL code mostly labels Oct 31, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Effort: simple A few lines fix Priority: someday Normal priority Topic: database SQL code mostly Topic: performance Both optimizations and perf monitoring Type: refactoring Internal improvements which do not change behavior
Projects
None yet
Development

No branches or pull requests

1 participant