Cleanup ZIMs - procedure & tooling #27

Open
benoit74 opened this issue Jan 22, 2024 · 1 comment

Comments

@benoit74
Contributor

Currently, we do not have any precise procedure or tooling around cleanup of ZIMs.

There are many topics that should be considered:

  • we sometimes need to remove files from production (ZIM no longer allowed to be published, reorganization of ZIM names / splitting of content across ZIM files, Zimfarm configuration error, ...):
    • these files should probably be moved to a temporary trash and kept there for a few days before final deletion (mistakes also happen when deleting content)
    • for the past few weeks I have been moving them to .hidden/to_delete, with one folder per month
  • we regularly build new ZIMs for custom apps in .hidden/custom_apps but we want to keep only the latest version of each ZIM (see "Older zim files are not deleted in /custom_apps", zimfarm#905)
    • but there too we will probably have to delete some apps which are no longer published or have been renamed
  • we build a lot of ZIMs to .hidden/dev and most of them have no reason to be kept long term
    • but we want to keep some of them (i.e. we cannot say "delete everything which is more than 1 month old")
  • we build files for some projects (bsf, endless)
    • but here again we are probably only interested in keeping the latest (or two latest) versions of each book
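
To illustrate the "temporary trash with one folder per month" idea above, here is a minimal Python sketch. The function name `move_to_trash` and the `YYYY-MM` folder naming are assumptions made for the example, not an existing tool or an agreed convention.

```python
import shutil
from datetime import date
from pathlib import Path
from typing import Optional


def move_to_trash(zim_path: Path, trash_root: Path, today: Optional[date] = None) -> Path:
    """Move a file to trash_root/YYYY-MM/ instead of deleting it (hypothetical helper)."""
    today = today or date.today()
    # One folder per month, e.g. to_delete/2024-01 (assumed naming)
    month_dir = trash_root / today.strftime("%Y-%m")
    month_dir.mkdir(parents=True, exist_ok=True)
    target = month_dir / zim_path.name
    # Refuse to overwrite a file that is already waiting in the trash
    if target.exists():
        raise FileExistsError(f"{target} is already in the trash")
    shutil.move(str(zim_path), str(target))
    return target
```

Refusing to overwrite keeps the trash useful as a safety net: a file trashed twice in the same month has to be handled by a human.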

I would propose to:

  • make official the decision to never directly delete a ZIM from production but to move it to a temporary trash instead
  • build a small tool which would:
    • contain rules about what has to be kept / deleted
    • every day:
      • list files in watched directories
      • mark files that should be deleted according to the rules
      • unmark files that should not be deleted anymore according to the rules (which have probably been updated)
      • delete files that have been marked for more than 7 days
      • report actions in a Slack channel
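
The daily steps listed above (mark, unmark, delete after 7 days, report) could look roughly like the sketch below. It is only an illustration: `daily_pass`, the `marks` mapping and the report strings are hypothetical, and actually posting the report to Slack is left out.

```python
from datetime import date, timedelta

GRACE_PERIOD = timedelta(days=7)  # files marked for more than 7 days get deleted


def daily_pass(marks: dict, candidates: set, today: date):
    """Run one daily pass.

    marks: path -> date the file was first marked (state kept between runs)
    candidates: paths the rules currently select for deletion
    Returns (new_marks, to_delete, report_lines).
    """
    new_marks, to_delete, report = {}, [], []
    for path in sorted(candidates):
        if path not in marks:
            report.append(f"marked {path}")
        marked_on = marks.get(path, today)
        if today - marked_on > GRACE_PERIOD:
            to_delete.append(path)
            report.append(f"deleted {path}")
        else:
            new_marks[path] = marked_on
    # Files no longer selected by the (possibly updated) rules lose their mark
    for path in sorted(set(marks) - candidates):
        report.append(f"unmarked {path}")
    return new_marks, to_delete, report
```

Keeping the pass a pure function over `marks` and `candidates` makes it easy to dry-run: the caller decides whether to really delete `to_delete` and where to send `report_lines`.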

The idea of marking files comes from the fact that:

  • it seems preferable to process things "on-the-fly" (rather than only once a month) to keep storage usage flat and avoid situations where cleanup takes a long time
  • in any case, the machine needs a processable list of things to clean up (e.g. we cannot list files on the 1st of the month and clean up on the 7th without such a list, because files that started matching the rules in between would also be deleted if the machine does not know they were not there on the 1st)

It has some drawbacks:

  • we need to keep a list of marked files (but it is not very important data, we can rebuild it)
  • there will probably be a kind of "fatigue" with new files marked every day, and people will begin to pay less attention to it
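
Regarding the list of marked files: since it is not critical data, a small JSON file is probably enough. The sketch below assumes a hypothetical `marks.json` mapping each file path to the ISO date it was first marked; the function names are made up for the example.

```python
import json
from datetime import date
from pathlib import Path


def load_marks(state_file: Path) -> dict:
    """Read path -> first-marked date; a missing file just means nothing is marked."""
    if not state_file.exists():
        return {}
    raw = json.loads(state_file.read_text())
    return {name: date.fromisoformat(day) for name, day in raw.items()}


def save_marks(state_file: Path, marks: dict) -> None:
    """Write the marks via a temporary file so a crash cannot leave a half-written list."""
    tmp = state_file.with_name(state_file.name + ".tmp")
    payload = {name: day.isoformat() for name, day in sorted(marks.items())}
    tmp.write_text(json.dumps(payload, indent=2))
    tmp.replace(state_file)
```

If the file is lost, every candidate is simply marked again on the next run and the 7-day delay restarts, which errs on the side of keeping files.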

Proposed rules (in TOML, because it is a config file format for humans and I expect to write the tool in Python, which promotes TOML significantly, but I do not feel strongly about it):

[delete_rules.dev]
folder="/data/hidden-zim"
delete_rule="file_older_than_days"
delete_threshold=30
force_keep=[
  "manioc.org_fr_all_2023-01.zim"
]

[delete_rules.custom_apps]
folder="/data/custom-apps"
delete_rule="all_but_last_book"
force_delete=[
  "my_outdated_app_2023-01.zim"
]

[delete_rules.to_delete]
folder="/data/to_delete"
delete_rule="last_folder_older_than_days"
delete_threshold=30
delete_empty_folders=true

With the following meanings:

  • [delete_rules.xxx]: this is the configuration of the deletion rule xxx (I imagine the tool will be able to do other stuff in the future)
  • folder: path to process for cleanup
  • delete_rule: how to decide what has to be cleaned
    • file_older_than_days: delete files older than a given number of days
    • all_but_last_book: delete files which are not the latest version of a book (based on the ZIM naming convention)
    • last_folder_older_than_days: delete folders which are older than a given number of days AND are the last folder in the tree (i.e. they do not contain another folder)
  • delete_threshold: the threshold for the deletion rule
  • force_delete: a list of files to delete regardless of the rule
  • force_keep: a list of files to keep regardless of the rule
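
To make two of these rules more concrete, here is a minimal sketch. It rests on several assumptions that would need to be confirmed: ZIM filenames end in `_YYYY-MM.zim`, "older than" is measured on the modification time, and `force_keep` wins over `force_delete` when a file is in both lists. Function names are hypothetical.

```python
import re
from collections import defaultdict
from pathlib import Path

# Assumed naming convention: <book>_<YYYY-MM>.zim
PERIOD_RE = re.compile(r"^(?P<book>.+)_(?P<period>\d{4}-\d{2})\.zim$")


def all_but_last_book(filenames, force_keep=(), force_delete=()):
    """Select every version of a book except the most recent one."""
    by_book = defaultdict(list)
    for name in filenames:
        match = PERIOD_RE.match(name)
        if match:  # files not following the convention are never auto-selected
            by_book[match["book"]].append((match["period"], name))
    selected = set()
    for versions in by_book.values():
        versions.sort()  # YYYY-MM sorts chronologically as text
        selected.update(name for _, name in versions[:-1])
    selected |= set(force_delete) & set(filenames)
    selected -= set(force_keep)  # assumption: keeping wins over deleting
    return sorted(selected)


def leaf_folders_older_than(root: Path, threshold_days: int, now: float):
    """Select folders with no sub-folder whose mtime is older than the threshold."""
    cutoff = now - threshold_days * 86400
    selected = []
    for folder in sorted(p for p in root.rglob("*") if p.is_dir()):
        has_subfolder = any(child.is_dir() for child in folder.iterdir())
        if not has_subfolder and folder.stat().st_mtime < cutoff:
            selected.append(folder)
    return selected
```

Both functions only return what would be selected; feeding their output into the marking step keeps the actual deletion in a single place.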

I think that this tool will be used for other cleanup duties:

WDYT?

@rgaudin
Member

rgaudin commented Jan 23, 2024

LGTM; I can't find the other discussion but found this (don't look at the rest of the ticket), which is a bit similar. I find your approach better in several ways: a commit to mark stuff we want to keep ~forever (so we'll get a commit message) and a short delay before deletion (otherwise there's the risk of postponing it then missing the deadline)
