Block users submitting more than one task in parallel to ensure fair use #56

Open
benoit74 opened this issue Jun 4, 2024 · 15 comments
Labels
enhancement New feature or request question Further information is requested

Comments

@benoit74
Collaborator

benoit74 commented Jun 4, 2024

Currently we see many tasks which look like multiple requests coming from the same user.

We cannot be sure in all cases because we do not track users and do not force them to enter their emails, but in some cases we have the email and it is clear, and in other cases the URLs are so closely related that it seems obvious.

While #32 would help, it might not be sufficient.

I suggest that we warn users when they already have at least one task in the pipe, with something like "This is a fair-use free service, please avoid submitting too many tasks at the same time. You already have xxx tasks in the pipe, please wait for them to complete."

Detecting the user and their associated tasks could be done with a tracking cookie. We should force the user to accept this cookie (this is a free service, we can impose some constraints) and make it clear that this tracking cookie is used only for fair use of the service and for some internal statistics.

I don't think that blocking the user would help. First, it will cause frustration. Second, whatever system we put in place, if we want to keep the possibility for the user to stay anonymous, we cannot enforce a blockage: it will always be possible and easy to circumvent.

I don't think that inventing something around "similar-looking URLs" would help either, since this might block two users who do not know each other and request almost the same task at the same time. Again, it would cause frustration while being difficult to implement.

@benoit74 added the enhancement and question labels on Jun 4, 2024
@rgaudin
Member

rgaudin commented Jun 4, 2024

Why not be transparent about the queue size and the (very wrong) ETA? This way we don't have to track or patronize users and we'd get the same results.
It could be as simple as a message saying "There are currently 516 tasks in the pipe, your request is expected to be delivered within… 30 days".

With such information, many of the requests we get would simply not be sent, because some users would lose interest in the ZIM if it can't be retrieved shortly.

I understand the service may look bad when the queue is very long, but it is fair and respectful to the user to announce it, instead of leaving them thinking they submitted a request and got zero response.
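A minimal sketch of how such a message could be built. The worker count and average task duration are assumptions chosen for illustration (neither figure appears in this thread), and `queue_message` is a hypothetical helper, not part of Zimit.

```python
import math


def queue_message(pending_tasks: int, workers: int, avg_task_hours: float) -> str:
    """Build a user-facing notice from the queue size and a rough ETA.

    All three inputs are illustrative assumptions. The ETA is a crude
    estimate (tasks processed in waves of `workers`), so it will often be wrong.
    """
    if workers < 1:
        raise ValueError("workers must be at least 1")
    # The new request is queued behind every pending task.
    waves = math.ceil((pending_tasks + 1) / workers)
    eta_days = math.ceil(waves * avg_task_hours / 24)
    noun = "task" if pending_tasks == 1 else "tasks"
    return (
        f"There are currently {pending_tasks} {noun} in the pipe, "
        f"your request is expected to be delivered within {eta_days} day(s)."
    )


if __name__ == "__main__":
    print(queue_message(pending_tasks=516, workers=2, avg_task_hours=2.0))
```

Rounding the ETA up to whole days keeps the promise conservative, which seems preferable given how unreliable the estimate is.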

@benoit74
Collaborator Author

benoit74 commented Jun 4, 2024

Announcing ETA is more the proposition of #32 (which mostly speaks about rank, but ETA could be added).

This issue suggests that it might not be enough to make users reasonable. It might be wrong, I don't know, at least it is tracked in an issue now.

@rgaudin
Member

rgaudin commented Jun 4, 2024

Ah sorry, I didn't click the issue; I believed it was the ticket we mentioned yesterday about blocking similar requests.
I guess I stand by what I proposed 18m ago 😅

@kelson42
Contributor

kelson42 commented Jun 17, 2024

Let me try to rephrase the problem: we should deliver the ZIM files within 24 hours (if I remember properly the SLA we have set for ourselves) and we fail because we have too many requests (and too little hardware). This ticket is an attempt to reduce the number of requests by reducing the number of "abusive" ones, so at the source.

The abusive behaviour we want to avoid is users launching many requests in parallel, in particular when it makes little sense, i.e. for the same web sites.

To reduce the "abuses", we have two ways which are not mutually exclusive: use pedagogy and inform the users AND/OR forbid certain user actions.

You have focused your comments on informing users about the delay/size of the queue. I'm not against this, but IMHO it's more important to respect the SLA. And I prefer to have frustrated users because the service does not deliver many ZIM files in parallel than to have all users waiting for days.

Therefore I propose:

  • Identify users correctly based on cookie/IP
  • Allow users to cancel one of their old requests (this is necessary if they identify an error in their configuration early)
  • Forbid two requests in parallel
  • Encourage users to contact us if they want more quota
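As a rough illustration of the second and third points, here is a minimal in-memory sketch. The `OneTaskPerUser` class, its method names and the opaque `user_key` (however the user ends up being identified) are all hypothetical; a real service would persist this state rather than keep it in a dictionary.

```python
from typing import Dict, Optional


class OneTaskPerUser:
    """Hypothetical registry allowing a single active task per user key."""

    def __init__(self) -> None:
        # Maps a user key (cookie, IP, or a mix) to that user's active task id.
        self._active: Dict[str, str] = {}

    def submit(self, user_key: str, task_id: str) -> bool:
        """Register a task; return False if the user already has one active."""
        if user_key in self._active:
            return False
        self._active[user_key] = task_id
        return True

    def cancel(self, user_key: str) -> Optional[str]:
        """Cancel the user's active task and return its id, or None if there is none."""
        return self._active.pop(user_key, None)

    def complete(self, task_id: str) -> None:
        """Free the slot of whichever user owned the finished task."""
        for key, active_id in list(self._active.items()):
            if active_id == task_id:
                del self._active[key]


if __name__ == "__main__":
    registry = OneTaskPerUser()
    print(registry.submit("user-a", "task-1"))  # first request is accepted
    print(registry.submit("user-a", "task-2"))  # parallel request is refused
    print(registry.cancel("user-a"))            # user cancels the old request
    print(registry.submit("user-a", "task-2"))  # and can now submit again
```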

@benoit74
Collaborator Author

I'm fine with the idea of forbidding certain user actions.

I wasn't aware of any SLA (good to know there is one), and I had understood (probably wrongly) that we didn't want to block users at all, and that even tracking them with IP/cookie was a concern.

For peace of mind, I prefer being able to block "abuse" rather than hoping for users to be reasonable.

My only concern with what is proposed is that an approach based on IP cannot work (schools, universities, companies, ...). An approach based on cookies is pretty fragile: cookies are easy to delete, and once a user finds the trick, it might spread quite fast in the community. If it were just for pedagogy, we could say that we do not mind. If the goal is to forbid some actions, then we could spend time implementing something to block users ... and be back to square one (only pedagogy) within a few months. If we want to block users, we need something more robust than cookies/IP, which usually also means something more intrusive and not free (in terms of money at least). I don't have much to propose, unfortunately.

@benoit74 changed the title from "Warn the users about the number of tasks already pending to ease fair use" to "Block users submitting more than one task in parallel to ensure fair use" on Jun 21, 2024
@benoit74
Collaborator Author

Discussed live: we need to use a hybrid approach:

  • combine IP + cookie
  • if user comes without a cookie, consider tasks from same IP to decide if there is already a requested / ongoing task
  • if user comes with IP + cookie, consider both criteria to decide if there is already a requested / ongoing task

This might block legitimate distinct users who share an IP and are coming to the site for the first time. We consider this acceptable for a free service, and because it will only last up to 24h (until the currently ongoing task is completed).
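The thread does not spell out how the two criteria combine when a cookie is present, so the sketch below encodes one possible reading, stated as an assumption: a request without a cookie is matched on IP alone, while a request with a cookie is blocked by an active task carrying the same cookie, or by an active cookie-less task from the same IP. The `Task` fields, the status names and `has_active_task` are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, Optional

# Assumed status names for tasks that still occupy the user's slot.
ACTIVE_STATUSES = {"requested", "ongoing"}


@dataclass(frozen=True)
class Task:
    ip: str
    cookie_id: Optional[str]  # None when the task was submitted without a cookie
    status: str


def has_active_task(tasks: Iterable[Task], ip: str, cookie_id: Optional[str]) -> bool:
    """Return True if the requester should be considered to already have a task."""
    for task in tasks:
        if task.status not in ACTIVE_STATUSES:
            continue
        if cookie_id is None:
            # No cookie: fall back to the IP alone.
            if task.ip == ip:
                return True
        elif task.cookie_id == cookie_id:
            # Same cookie: same browser, whatever the IP.
            return True
        elif task.cookie_id is None and task.ip == ip:
            # Cookie-less task from the same IP: possibly the same user earlier on.
            return True
    return False


if __name__ == "__main__":
    tasks = [
        Task(ip="203.0.113.7", cookie_id=None, status="ongoing"),
        Task(ip="198.51.100.2", cookie_id="abc", status="requested"),
    ]
    print(has_active_task(tasks, "198.51.100.2", None))   # first-time visitor, shared IP
    print(has_active_task(tasks, "198.51.100.2", "xyz"))  # known visitor, shared IP
```

Under this reading a returning user on a shared IP is not blocked by a colleague's task. The cost is that the cookie becomes the weaker link, since obtaining a fresh cookie is enough to get past the IP check.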

@benoit74
Collaborator Author

After some thought, I wonder if it is really worth adding a cookie. It makes the fair-use blocking easier to circumvent. And users behind a single IP are probably from big companies or universities, for which we can consider deploying a custom Zimit service if the need is significant.

@benoit74
Collaborator Author

What has also been discussed is that, when the user is blocked, the message must clearly indicate the reason for the blockage (fair use of a free service) and the fact that we are open to deploying custom services for those who need them.

@Popolechien do you have any idea of phrasing / design on this?

@Popolechien
Contributor

Popolechien commented Jun 21, 2024

Isn't a cookie browser-based? It would also be interesting to know how many requests are made with an email address vs. not (which we could also force, because honestly only a tiny fraction of users are going to leave a window open until they get a result, and I suspect a lot of duplicate requests come from people not realizing results are not immediate and then restarting the query, this time with a request for an email ping).

Edit: I realize that I missed an earlier comment

if user comes without a cookie, consider tasks from same IP to decide if there is already a requested / ongoing task

As in: same request from same IP but without cookie = block? (I have no opinion really on this, I could see several scenarios warranting a pass rather than a block, incl. from a large IP block)

@benoit74
Collaborator Author

how many requests are made with an email address vs. not

You have details about this in the export I made the other day.

As in: same request from same IP but without cookie = block? (I have no opinion really on this, I could see several scenarios warranting a pass rather than a block, incl. from a large IP block)

Yes, but it will be based on the user's IP, so there is nothing like a large IP block, only big companies or universities all "hiding" behind one single public IP. Could you explain your other scenarios?

@Popolechien
Contributor

Off the top of my head:

  • as you mentioned, separate individuals behind the same IP
  • a request made without an email address, where the user sees it is not lightning fast and is unable to cancel the previous order
  • a change in priorities, e.g. a user has several websites they want to zim up but would rather start with the most important one (though here it is a secondary issue of being able to cancel jobs)
  • etc.

@benoit74
Collaborator Author

Since the blockage will be gone once the task finishes, all but the first scenario you mention are only temporary and point to other potential limitations in the current UI. The only real concern for me is the case where many users are behind the same IP, since one user might be blocked without having been involved at all in this blockage. But again, if many users are behind the same IP, it is probably fair to still limit them to one task at a time; they can always switch to a different IP (their phone, their mum's internet, at home, ...).

@rgaudin
Member

rgaudin commented Jun 21, 2024

And users behind a single IP are probably from big companies or universities, for which we can consider deploying a custom Zimit service if need is significant.

It's not limited to that.

  • Users behind a VPN share their server's exit IP
  • Many ISPs share public IPs across many users on mobile
  • ISPs with a limited number of IPs (most African ones) even rotate IPs across customers quickly, on residential fixed connections as well

you have details about this is in the export I made the other day

?

@benoit74
Collaborator Author

you have details about this is in the export I made the other day

https://docs.google.com/spreadsheets/d/1GaebcExX7d4jq3ndB6zKnSRElz40fs2bccrGjpNtRs0/edit?usp=sharing

@Popolechien
Contributor

Popolechien commented Jun 24, 2024

Ok thanks a lot. I see that about 30% of requests are anonymous, but then we can't know for sure which ones were requested a second time with an email address. Excluding these, a third of email users entered more than one query, which answers the initial question in this thread.

Other stats of interest: about 5% of requests are duplicates of existing zim files, another 5% should be seen as unrealistic (e.g. youtube or google translate), and yet another 5% are naughty (yes, I mean pr0n) requests.

Duplicates (same address requested twice or more, though possibly by different people) represent about 1/3 of all requests.
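For anyone wanting to recompute this kind of figure from an export, a minimal sketch follows. It assumes the requested URLs are available as a plain list, that "duplicates" counts every request whose address appears at least twice, and that stripping whitespace and a trailing slash is enough normalization. All three are assumptions; `duplicate_share` and the sample URLs are purely illustrative.

```python
from collections import Counter
from typing import Iterable


def duplicate_share(urls: Iterable[str]) -> float:
    """Fraction of requests whose URL was requested at least twice."""
    # Crude normalization (an assumption): trim whitespace and a trailing slash.
    normalized = [url.strip().rstrip("/") for url in urls]
    if not normalized:
        return 0.0
    counts = Counter(normalized)
    duplicated = sum(n for n in counts.values() if n > 1)
    return duplicated / len(normalized)


if __name__ == "__main__":
    sample = [
        "https://example.org/",
        "https://example.org",
        "https://example.com/a",
        "https://example.net",
        "https://example.net",
        "https://example.io",
    ]
    print(f"{duplicate_share(sample):.0%} of requests are duplicates")
```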

4 participants