We noticed a performance issue in the Storj Select network. One of the storage node providers had issues, and we believe a good number of nodes were offline for an hour or so. During that time upload performance was impacted. We didn't have a good explanation why, so Igor and I reproduced it on the QA satellite. In total, the QA satellite has 146 nodes available for uploads. I took 40 nodes offline to create a situation similar to the one in the Storj Select network. Igor uploaded a file with an old and a new uplink version. The behavior of both uplink versions was the same.
Here is an example log: olduplink.txt
The upload took a surprisingly long time, the same issue we observed in production. Here is the timeline:
Round 1: Uplink connects to 110 nodes. 33 nodes are offline, and it takes just 1 second to notice that. 76 successful uploads after just 10 seconds. The real problem is one slow node that takes a minute before the uplink errors out. -> Because a retry is needed there is no long tail cancellation, so we wait for the slowest node to finish or fail.
Round 2: 34 pieces left to upload. 8 connection errors after 1 second. 4 successful uploads. Upload finished with 80 pieces. All of this again took just a few seconds.
The gap between our expectation (should finish in seconds) and the current implementation (takes at least a minute, if not more) is that the offline nodes error out fast, but the retry is held back by the slowest node in the mix. So a single slow node combined with too many offline nodes can destroy the performance.
Would it be possible to kick off the retry for the offline nodes more or less right away, without waiting for the slow node? Or a more aggressive long tail cancellation that also triggers once successful + failed uploads > 80 (plus a safety threshold of 10 or so to avoid false positives)? In this situation that would have cancelled the slow node after 10 seconds instead of waiting more than a minute.
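A minimal sketch of the proposed cancellation rule, purely for illustration. The constant names and `shouldCancelLongTail` are hypothetical and not part of the actual uplink code; the numbers in `main` are taken from Round 1 above.

```go
package main

import "fmt"

const (
	optimalShares   = 80 // pieces needed before long tail cancellation normally kicks in
	safetyThreshold = 10 // margin so a few spurious early failures don't trigger cancellation
)

// shouldCancelLongTail reports whether waiting for the remaining in-flight
// uploads (e.g. one slow node) can still help. It combines the existing rule
// (enough successes) with the proposed one: once successes plus failures
// exceed the optimal count plus a safety margin, cancel the stragglers and
// start the retry for the failed pieces right away.
func shouldCancelLongTail(successful, failed int) bool {
	if successful >= optimalShares {
		return true
	}
	return successful+failed > optimalShares+safetyThreshold
}

func main() {
	// Round 1 from the issue: 76 successes and 33 fast connection failures
	// roughly 10 seconds in. 76+33 = 109 > 90, so the slow node would be
	// cancelled instead of delaying the retry by more than a minute.
	fmt.Println(shouldCancelLongTail(76, 33)) // true
}
```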
@littleskunk we are not sure if this issue is still relevant. We think some code changes we implemented might have resolved it. Can you take a look and let us know?
@iglesiasbrandon This is still relevant, especially for placements with a smaller long tail and fewer nodes. There are no longer any functioning dial timeouts in uplink, see https://review.dev.storj.io/c/storj/uplink/+/9815. Enough offline nodes cause an upload to hang until the OS TCP SYN retries are exhausted (2 minutes 7 seconds). Enough slow nodes cause the upload to hang until the message timeout (10 minutes!). This is the root cause of https://github.com/storj/customer-issues/issues/2112. Given a mix of normal, slow, and offline nodes, uplink should succeed with minimal slowdown.
Offline nodes, 2 minute timeout. (There's a trace with the 10 minute message timeout below this one)
Slow nodes, 10 minute message timeout. tcp_syn_retries was shortened to 4 in this run.
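For illustration, a generic sketch of why an application-level dial deadline matters here. This is not uplink's actual dialer (which goes through its own RPC/QUIC layers); it just shows that a per-dial context timeout caps how long an offline node can stall things, independent of the kernel's tcp_syn_retries setting. The 5-second value, dialNode name, and test address are assumptions for the example.

```go
package main

import (
	"context"
	"fmt"
	"net"
	"time"
)

// dialNode dials a storage node with a per-attempt deadline. Without one,
// connecting to an offline node blocks until the OS gives up on TCP SYN
// retransmissions (~2m07s with Linux defaults).
func dialNode(ctx context.Context, addr string) (net.Conn, error) {
	dialCtx, cancel := context.WithTimeout(ctx, 5*time.Second)
	defer cancel()

	var d net.Dialer
	return d.DialContext(dialCtx, "tcp", addr)
}

func main() {
	// 192.0.2.1 (TEST-NET-1) is unroutable, so this fails after ~5 seconds
	// rather than hanging for the full SYN retry window.
	start := time.Now()
	_, err := dialNode(context.Background(), "192.0.2.1:28967")
	fmt.Println(err, "after", time.Since(start).Round(time.Second))
}
```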