Add new standard metadata (optional) to differentiate content harvesting period from ZIM creation date #40

Open
benoit74 opened this issue Jul 22, 2024 · 1 comment

Comments

@benoit74
Contributor

Currently, ZIM metadata has only one standard date entry, named Date. The documentation specifically states that this is the ZIM creation date.

There is no standard metadata to store information about when the ZIM content has been captured / fetched / crawled / scraped / ...

Given that we regularly rebuild ZIMs (see ZIM Update v2 at #35 and https://wiki.openzim.org/wiki/ZIM_Updates) and that we increasingly process content harvested at a different time than the ZIM creation (all stackexchange, some zimit with WARCs reprocessed), it is useful to consider adding a new standard metadata to store this information.

Given that content (e.g. with zimit) can be scraped across multiple days, it seems important that the date can in fact be a range, from-to.

Just like the current Date metadata, I think we should keep this metadata easy to understand and to parse by limiting it to a day, not a day+time.

Given that some content might come with lower precision than a day (e.g. when a content provider says "this is the content for April 2023, do not mind which day I published it"), I think we need to allow passing only a month or only a year in this metadata.

I hence propose to introduce this new standard ZIM metadata:

  • Name: ContentDate
  • Mandatory: No
  • Description: Date of the content, i.e. when the content was fetched to create the ZIM; preferably a day (ISO format YYYY-MM-DD), but can be a month (YYYY-MM) or a year (YYYY) if daily precision makes no sense; can be a single value or a range from start to end, in the format "from,to"
  • Examples: 2012-11 or 2023-01-12,2023-01-15
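
To make the proposed format concrete, here is a minimal Python sketch of how a reader or scraper might validate a ContentDate value. The function names (`parse_part`, `parse_content_date`) are hypothetical and not part of any existing openZIM library. The sketch also assumes two things the proposal does not state: that no whitespace is allowed around the comma, and that the start of a range must not come after its end.

```python
import re
from datetime import date

# One date at year, month, or day precision: YYYY, YYYY-MM, or YYYY-MM-DD.
_PART = re.compile(r"([0-9]{4})(?:-([0-9]{2})(?:-([0-9]{2}))?)?")


def parse_part(text):
    """Return (year, month, day); missing components are None."""
    match = _PART.fullmatch(text)
    if match is None:
        raise ValueError(f"invalid date: {text!r}")
    year, month, day = (
        int(group) if group is not None else None for group in match.groups()
    )
    # Let datetime.date reject impossible values such as month 13 or
    # February 30; missing components default to 1 for this check only.
    date(year, month or 1, day or 1)
    return (year, month, day)


def parse_content_date(value):
    """Parse a ContentDate string into a (start, end) pair.

    A single value yields identical start and end.
    """
    pieces = value.split(",")
    if len(pieces) > 2:
        raise ValueError(f"expected 'date' or 'from,to', got {value!r}")
    start = parse_part(pieces[0])
    end = parse_part(pieces[-1])

    # Assumption: ordering is checked only on the components that both
    # ends provide, so mixed precisions such as "2023,2023-05" pass.
    shared_start, shared_end = [], []
    for a, b in zip(start, end):
        if a is None or b is None:
            break
        shared_start.append(a)
        shared_end.append(b)
    if shared_start > shared_end:
        raise ValueError(f"range start is after range end: {value!r}")
    return start, end
```

With the two examples above, `parse_content_date("2012-11")` returns `((2012, 11, None), (2012, 11, None))` and `parse_content_date("2023-01-12,2023-01-15")` returns `((2023, 1, 12), (2023, 1, 15))`.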

WDYT?

@mgautierfr

Related to #9
