Block users submitting more than one task in parallel to ensure fair use #56
Why not be transparent about the queue size and the (very wrong) ETA? That way we don't have to track or patronize users, and we'd get the same result. With this information, many of the requests we get would simply not be sent, because some users would lose interest in the ZIM if it can't be retrieved shortly. I understand the service may look bad when the queue is very long, but it's fair and respectful to tell the user, instead of letting them think they submitted a request and got zero response. |
Announcing the ETA is closer to what #32 proposes (it mostly speaks about rank, but ETA could be added). This issue suggests that it might not be enough to make users reasonable. That might be wrong, I don't know, but at least it is tracked in an issue now. |
Ah sorry, I didn't click the issue; I believed it was the ticket we mentioned yesterday about blocking similar requests. |
Let me try to rephrase the problem: we should deliver ZIM files within 24 hours (if I remember correctly the SLA we set for ourselves), and we fail because we have too many requests (and too little hardware). This ticket is an attempt to reduce the number of requests at the source, by reducing the number of "abusive" ones. The abusive behaviour we want to avoid is users launching many requests in parallel, in particular when it makes little sense, i.e. for the same websites. To reduce the "abuses", we have two options which are not mutually exclusive: use pedagogy and inform the users AND/OR forbid certain user actions. You have focused your comments on informing users about the delay and the size of the queue. I'm not against this, but IMHO it's more important to respect the SLA. And I prefer to have frustrated users because the service does not deliver many ZIM files in parallel rather than all users waiting for days. Therefore I propose:
|
I'm fine with the idea of forbidding certain user actions. I wasn't aware of any SLA (good to know there is one), and I had understood (probably wrongly) that we didn't want to block users at all, and that even tracking them by IP/cookie was a concern. For peace of mind, I prefer being able to block "abuse" rather than hoping for users to be reasonable. My only concern with the proposal is that an approach based on IP cannot work (schools, universities, companies, ...). An approach based on cookies is pretty fragile: cookies are easy to delete, and once a user finds the trick, it might spread quite fast in the community. If it were just for pedagogy, we could say that we do not mind. If the goal is to forbid some actions, then we could spend time implementing something to block users ... and be back to square one (only pedagogy) within a few months. If we want to block users, we need something more robust than cookies/IP, which usually also means something more intrusive and usually not free (in terms of money at least). I don't have much to propose, unfortunately. |
Discussed live: we need to use a hybrid approach:
This might block legitimate distinct users who come from the same IP but are visiting the site for the first time ... we consider this acceptable for a free service, and because it will be the case only for up to 24h (until the currently ongoing task is completed) |
After some thought, I wonder if it is really worth adding a cookie. It makes the fair use blocking easier to circumvent. And users behind a single IP are probably from big companies or universities, for which we can consider deploying a custom Zimit service if the need is significant. |
We also discussed that, when a user is blocked, the message must clearly state the reason for the blockage (fair use of a free service) and the fact that we are open to deploying custom services for those who need it. @Popolechien do you have any idea of phrasing / design for this? |
Isn't a cookie browser-based? It would also be interesting to know how many requests are made with an email address vs. without (which we could also force, because honestly only a tiny fraction of users is going to leave a window open until they get a result, and I suspect a lot of duplicate requests come from people not realizing results are not immediate and then restarting the query, this time with a request for an email ping). Edit: I realize that I missed an earlier comment
As in: same request from same IP but without cookie = block? (I have no opinion really on this, I could see several scenarios warranting a pass rather than a block, incl. from a large IP block) |
You have details about this in the export I made the other day
Yes, but it will be based on the user's IP, so there is no such thing as a large IP block, only big companies or universities all "hiding" behind one single public IP. Could you explain your other scenarios? |
Top of my head:
|
Since the blockage will be gone once the task finishes, all but the first scenario you mention are only temporary and point to other potential limitations in the current UI. The only real concern for me is the case where many users are behind the same IP, since one user might be blocked without having been involved at all in this blockage. But again, if many users are behind the same IP, it is probably fair to still limit them to one task at a time; they can always switch to a different IP (their phone, their mum's internet, at home, ...). |
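To make the "one task at a time per IP, lifted when the task finishes" rule discussed above more concrete, here is a minimal in-memory sketch. It is not taken from the Zimit code base: the class name `FairUseLimiter`, its methods, and the wording of the refusal message are all hypothetical, and a real service would need persistent storage and a trustworthy way to obtain the client IP.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class FairUseLimiter:
    """Allow at most one ongoing task per client IP (hypothetical helper)."""

    # Maps a client IP to the id of its currently running task.
    active: dict[str, str] = field(default_factory=dict)

    def try_submit(self, ip: str, task_id: str) -> tuple[bool, str]:
        """Register a task, or refuse it if this IP already has one running."""
        if ip in self.active:
            # Assumed wording: states the reason (fair use) and the custom-service option.
            return False, (
                "This is a free, fair-use service and a task is already running "
                "from your network. Please wait for it to complete, or contact "
                "us if you need a custom deployment."
            )
        self.active[ip] = task_id
        return True, "Task accepted."

    def complete(self, task_id: str) -> None:
        """Lift the block once the task has finished (whether it succeeded or not)."""
        for ip, running in list(self.active.items()):
            if running == task_id:
                del self.active[ip]


limiter = FairUseLimiter()
print(limiter.try_submit("203.0.113.7", "task-1")[0])  # first task from this IP
print(limiter.try_submit("203.0.113.7", "task-2")[0])  # second task while the first runs
limiter.complete("task-1")
print(limiter.try_submit("203.0.113.7", "task-3")[0])  # after completion
```

Keying only on the IP reproduces the trade-off described in the thread: unrelated users behind the same public IP share a single slot until the running task completes.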
It's not limited to that.
? |
https://docs.google.com/spreadsheets/d/1GaebcExX7d4jq3ndB6zKnSRElz40fs2bccrGjpNtRs0/edit?usp=sharing |
Ok thanks a lot. I see that about 30% of requests are anonymous, but then we can't know for sure which ones were requested a second time with an email address. Excluding these, a third of email users entered more than one query, which answers the initial question in this thread. Other stats of interest: about 5% of requests are duplicates of existing zim files, another 5% should be seen as unrealistic (e.g. youtube or google translate), and yet another 5% are naughty (yes, I mean pr0n) requests. Duplicates (same address requested twice or more, though possibly by different people) represent about 1/3 of all requests. |
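For anyone wanting to recompute this kind of figure from an export, the sketch below shows one possible way to do it. The function names, the sample data, and the definitions are assumptions, not the method used on the spreadsheet: here a request counts as a duplicate when its URL appears at least twice (after trimming whitespace and a trailing slash), and a "multi-request user" is an email address seen on more than one request.

```python
from __future__ import annotations

from collections import Counter


def duplicate_share(urls: list[str]) -> float:
    """Fraction of requests whose URL was requested two or more times."""
    if not urls:
        return 0.0
    # Naive normalization (assumption): trim whitespace and a trailing slash.
    counts = Counter(url.strip().rstrip("/") for url in urls)
    duplicated = sum(n for n in counts.values() if n > 1)
    return duplicated / len(urls)


def multi_request_user_share(emails: list[str | None]) -> float:
    """Fraction of distinct email addresses that submitted more than one request."""
    # Anonymous requests (None or empty email) are excluded from the count.
    counts = Counter(e.strip().lower() for e in emails if e and e.strip())
    if not counts:
        return 0.0
    return sum(1 for n in counts.values() if n > 1) / len(counts)


# Made-up sample rows, for illustration only.
sample_urls = [
    "https://example.org/",
    "https://example.org",
    "https://example.com/docs",
    "https://example.net",
    "https://example.com/docs",
    "https://example.edu",
]
sample_emails = ["a@example.org", "A@example.org", "b@example.org", None, ""]

print(duplicate_share(sample_urls))
print(multi_request_user_share(sample_emails))
```

As noted above, requests without an email cannot be attributed to anyone, so the second figure only describes users who left an address.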
Currently we see a lot of stuff which looks like multiple requests coming from the same user.
We cannot be sure in all cases because we do not track users and do not force them to enter their emails, but in some cases we have the email and it is clear, and in other cases the URLs are so closely related that it seems obvious.
While #32 would help, it might not be sufficient.
I suggest that we warn the user when they already have at least 1 task in the pipe, with something like "This is a fair-use free service; please avoid submitting too many tasks at the same time. You already have xxx task(s) in the pipe, please wait for them to complete."
Detecting the user and their associated tasks could be done with a tracking cookie. We should force the user to accept this cookie (this is a free service, we can impose some constraints) and make it clear that this tracking cookie is used only for fair use of the service and for some internal statistics.
I don't think that blocking the user would help: first because it will cause frustration, and second because, whatever system we put in place, if we want to keep the possibility for the user to stay anonymous, we cannot enforce a blockage; it will always be possible / easy to circumvent.
I don't think that inventing something around "similar-looking URLs" would help either, since it might block two users who do not know each other but request almost the same task at the same time, and again it would cause frustration while being difficult to implement.
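The warning based on a tracking cookie suggested above could look roughly like the following. This is only a sketch under assumptions: the cookie name `fair_use_id`, both function names, and the `pending` mapping (client id to list of queued task ids) are hypothetical, and the message text is just one possible phrasing. It only uses the standard library (`http.cookies.SimpleCookie` and `uuid`).

```python
from __future__ import annotations

import uuid
from http.cookies import SimpleCookie

COOKIE_NAME = "fair_use_id"  # hypothetical cookie name


def get_or_create_client_id(cookie_header: str | None) -> tuple[str, bool]:
    """Return (client_id, is_new) from a raw Cookie request header."""
    cookie = SimpleCookie()
    if cookie_header:
        cookie.load(cookie_header)
    if COOKIE_NAME in cookie:
        return cookie[COOKIE_NAME].value, False
    # No tracking cookie yet: issue a fresh random identifier.
    return uuid.uuid4().hex, True


def fair_use_warning(client_id: str, pending: dict[str, list[str]]) -> str | None:
    """Return a warning if this client already has tasks in the pipe, else None."""
    count = len(pending.get(client_id, []))
    if count == 0:
        return None
    return (
        "This is a free, fair-use service. You already have "
        f"{count} task(s) in the pipe; please wait for them to complete "
        "before submitting more."
    )


# Illustrative usage with made-up values.
client_id, is_new = get_or_create_client_id("fair_use_id=abc123; theme=dark")
print(fair_use_warning(client_id, {"abc123": ["task-1"]}))
```

Since the cookie can simply be deleted, this only supports the warning approach; as argued above, it cannot enforce anything.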