Blacklist requests that are duplicates of existing resources or bound to fail #28

Open
Popolechien opened this issue Mar 2, 2022 · 12 comments
Labels
enhancement New feature or request prio1

Comments

@Popolechien
Contributor

Following openzim/zimit#113, we should think about implementing an easily editable list (hosted on drive.kiwix.org?) of blacklisted sites that cannot be requested on zimit, e.g.:

  • kiwix.org subdomains (download and library);
  • very large corporate websites (e.g. Facebook, Twitter, Reddit, YouTube, etc.);
  • websites for which past scraping attempts failed.

It's probably a matter for a separate ticket, but requests for websites we already have a scraper for (Wikipedia, Stack Overflow, etc.) should also be soft-blocked, with the user offered a direct link to the ZIM file.
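
To make the hard block / soft block distinction concrete, here is a minimal sketch of what such a check could look like. Nothing here reflects existing zimit code: the names `HARD_BLOCKED`, `SOFT_BLOCKED`, `Decision` and `check_request`, the listed domains, and the `example.org` links are all assumptions for illustration, and the real list would be loaded from wherever the editable list ends up being hosted.

```python
from dataclasses import dataclass
from typing import Optional
from urllib.parse import urlsplit

# Hypothetical entries for illustration only; a real list would be loaded
# from an externally editable document.
HARD_BLOCKED = {"kiwix.org", "facebook.com", "twitter.com", "reddit.com", "youtube.com"}

# Hypothetical mapping: site we already have a scraper for -> link to offer instead.
SOFT_BLOCKED = {
    "wikipedia.org": "https://example.org/zims/wikipedia",
    "stackoverflow.com": "https://example.org/zims/stackoverflow",
}


@dataclass(frozen=True)
class Decision:
    allowed: bool
    reason: str = ""
    zim_url: Optional[str] = None  # set only for soft blocks


def _matches(host: str, domain: str) -> bool:
    # Match the domain itself and any of its subdomains, but not look-alikes
    # such as "notfacebook.com".
    return host == domain or host.endswith("." + domain)


def check_request(url: str) -> Decision:
    """Decide whether a requested URL should be accepted."""
    host = (urlsplit(url).hostname or "").lower()
    if not host:
        return Decision(False, "invalid URL")
    # Soft block: refuse the request but point the user to the existing ZIM.
    for domain, zim_url in SOFT_BLOCKED.items():
        if _matches(host, domain):
            return Decision(False, "a ZIM already exists for this site", zim_url)
    # Hard block: refuse the request outright.
    for domain in HARD_BLOCKED:
        if _matches(host, domain):
            return Decision(False, "this site is blacklisted")
    return Decision(True)
```

With these sample entries, `check_request("https://en.wikipedia.org/wiki/Kiwix")` is refused with a `zim_url` to show the user, while an unlisted site is allowed through. Matching on the hostname suffix keeps the list short, since one entry covers all subdomains.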

@Popolechien Popolechien added the enhancement New feature or request label Mar 2, 2022
@rgaudin
Member

rgaudin commented Mar 2, 2022

Can you move your comment to #25 and close this? This is the scraper's repo.

@Popolechien Popolechien transferred this issue from openzim/zimit Mar 2, 2022
@Popolechien
Contributor Author

@rgaudin Moved it, but I'd keep it open as this ticket is a little bit different.

@rgaudin
Member

rgaudin commented Mar 2, 2022

This one's better; closing the other one, but the problem raised there remains: where do we point users for content that we know already exists?

@Popolechien
Contributor Author

Is your question about the case where there are several versions of the same ZIM (e.g., Wikipedia mini/nopic/maxi)?

The basic assumption here is that zimit provides a copy of the real thing, so we should send users the maxi ZIM file.

@stale

stale bot commented May 3, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will now be reviewed manually. Thank you for your contributions.

@kelson42
Contributor

kelson42 commented Nov 4, 2023

See also #33

@benoit74
Collaborator

I've started to document the blacklist entries I encounter during maintenance tasks at https://docs.google.com/spreadsheets/d/1mBjWT0hLmeg6EqT4nNEfCzLU8hGSzYs4IgbWDInhPqA/edit?gid=0#gid=0

@rgaudin
Member

rgaudin commented Oct 28, 2024

Should we add a link to it in the routine? Should we count them in some way?

@benoit74
Collaborator

Added the link to the routine; indeed, it helps to have the link at hand.
About counting them, what would be the added value? (Nothing against it, but I don't get why we would want to do this, and it seems cumbersome / complex to implement.)

@rgaudin
Member

rgaudin commented Oct 28, 2024

That's why I asked. The value would be in gauging their relative importance, should the eventual actions need to be prioritized.

@Popolechien
Contributor Author

I've added two more to the list.
Which routine are we talking about?

@benoit74
Collaborator

The weekly infra routine (manual checks we do every week to ensure infra is up and running).
