
Bad upload performance / no long tail cancellation in retry situations like too many offline nodes #164

Open
littleskunk opened this issue Dec 11, 2023 · 3 comments
Labels: bug (Something isn't working)

Comments

@littleskunk
Member

We noticed a performance issue in the storj select network. One of the storage node providers had issues, and we believe a good number of nodes were offline for an hour or so. During that time the upload performance was impacted. We didn't have a good explanation why, so Igor and I reproduced it on the QA satellite. In total, the QA satellite has 146 nodes available for uploads. I took 40 nodes offline to create a situation similar to the one in the storj select network. Igor uploaded a file with an old and a new uplink version. The behavior of both uplink versions was the same.

Here is an example log: olduplink.txt

The upload took a surprisingly long time, the same issue we observed in production. Here is the timeline:
Round 1: Uplink connects to 110 nodes. 33 nodes are offline, and it takes just 1 second to notice that. 76 uploads succeed after just 10 seconds. The real problem is one slow node that takes a minute before the uplink errors out. -> Because a retry round is needed, there is no long tail cancellation and we wait for the slowest node to finish or fail.
Round 2: 34 pieces to upload. 8 connection errors after 1 second. 4 successful uploads. The upload finished with 80 pieces. This again took just a few seconds.

The gap between our expectation (should finish in seconds) and the current implementation (takes at least a minute if not more) is that the offline nodes error out fast, but the retry is slowed down by the slowest node in the mix. So a single slow node combined with too many offline nodes can destroy performance.

Would it be possible to kick off the retry for the offline nodes more or less right away, without waiting for the slow node? Or a more aggressive long tail cancellation that also triggers if successful + failed uploads > 80 (+ 10 safety threshold or so to avoid false positives). In this situation it would have canceled the slow node after 10 seconds instead of waiting more than a minute.
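
A minimal sketch of the proposed trigger (hypothetical names, not uplink's actual code), assuming the uploader counts finished piece attempts and holds a cancel function for the still-running ones:

```go
package uploadsketch

import "context"

// checkLongTail is an illustrative callback run after every piece upload
// attempt finishes. needed is the success threshold (80 in this example),
// safety is a small margin against false positives (e.g. 10).
func checkLongTail(successful, failed, needed, safety int, cancelRemaining context.CancelFunc) {
	// Proposed aggressive long tail cancellation: once the number of
	// finished attempts (successes plus failures) exceeds the success
	// threshold plus a safety margin, the remaining slow uploads are either
	// redundant or will be covered by a retry round anyway, so cancel them
	// instead of waiting for the slowest node.
	if successful+failed > needed+safety {
		cancelRemaining()
	}
}
```

With the numbers from round 1 above (76 successful + 33 failed = 109 > 80 + 10), this check would fire after roughly 10 seconds and free the upload from the one slow node.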

littleskunk added the bug (Something isn't working) label on Dec 11, 2023
shaupt131 moved this from Backlog to Up Next in Team Satellite on Dec 20, 2023
@iglesiasbrandon
Contributor

@littleskunk we are not sure if this issue is still relevant. We think we implemented some code changes that might have resolved it. Can you take a look and let us know?

@pwilloughby
Contributor

pwilloughby commented Sep 26, 2024

@iglesiasbrandon This is still relevant, especially for placements with a smaller long tail and fewer nodes. There are no longer any functioning dial timeouts in uplink, see https://review.dev.storj.io/c/storj/uplink/+/9815. Enough offline nodes cause an upload to hang until the OS TCP SYN retries are exhausted (2 minutes 7 seconds). Enough slow nodes cause the upload to hang until the message timeout (10 minutes!). This is the root cause of https://github.com/storj/customer-issues/issues/2112. Given a mix of normal, slow, and offline nodes, uplink should succeed with minimal slowdown.
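
For illustration only (this is not uplink's actual dial path), a context-bounded dial like the sketch below is the kind of behavior being asked for: an offline node should fail the connect within a few seconds rather than hanging until the kernel exhausts its SYN retransmits (~2m07s with Linux defaults).

```go
package dialsketch

import (
	"context"
	"net"
	"time"
)

// dialWithTimeout bounds the TCP connect phase with its own deadline so an
// unreachable node returns an error quickly instead of blocking the upload
// until the OS gives up on SYN retransmits.
func dialWithTimeout(ctx context.Context, addr string) (net.Conn, error) {
	dialCtx, cancel := context.WithTimeout(ctx, 5*time.Second) // assumed budget, tune as needed
	defer cancel()

	var d net.Dialer
	return d.DialContext(dialCtx, "tcp", addr)
}
```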

@pwilloughby
Contributor

pwilloughby commented Sep 30, 2024

Offline nodes, 2 minute timeout. (There's a trace with the 10 minute message timeout below this one.)
[trace screenshot: 2m11s]
Slow nodes, 10 minute message timeout. tcp_syn_retries was shortened to 4 in this run.
[trace screenshot: syn_retries4]
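
As a side note, the shortened SYN retry count used in that run doesn't have to be a system-wide sysctl change; on Linux the same effect can be scoped to a single dialer via the TCP_SYNCNT socket option. A rough sketch (illustrative only, not uplink code), using golang.org/x/sys/unix:

```go
//go:build linux

package dialsketch

import (
	"net"
	"syscall"

	"golang.org/x/sys/unix"
)

// synLimitedDialer returns a dialer whose sockets retransmit SYN at most
// `retries` times before the connect fails, mirroring a lowered
// net.ipv4.tcp_syn_retries without touching the global setting.
func synLimitedDialer(retries int) *net.Dialer {
	return &net.Dialer{
		Control: func(network, address string, c syscall.RawConn) error {
			var sockErr error
			err := c.Control(func(fd uintptr) {
				sockErr = unix.SetsockoptInt(int(fd), unix.IPPROTO_TCP, unix.TCP_SYNCNT, retries)
			})
			if err != nil {
				return err
			}
			return sockErr
		},
	}
}
```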
