We noticed a performance issue in the Storj Select network. One of the storage node providers had issues, and we believe a good number of nodes were offline for an hour or so. During that time upload performance was impacted. We didn't have a good explanation why, so Igor and I reproduced it on the QA satellite. In total, the QA satellite has 146 nodes available for uploads. I took 40 nodes offline to create a situation similar to the one in the Storj Select network. Igor uploaded a file with an old and a new uplink version. The behavior of both uplink versions was the same.
Here is an example log: olduplink.txt
The upload took a surprisingly long time, the same issue we observed in production. Here is the timeline:
Round 1: Uplink connects to 110 nodes. 33 nodes are offline, and it takes just 1 second to notice that. 76 successful uploads after just 10 seconds. The real problem is one slow node that takes a minute before the uplink errors out. -> Because a retry is needed there is no long tail cancellation, so we wait for the slowest node to finish or fail.
Round 2: 34 pieces left to upload. 8 connection errors after 1 second. 4 successful uploads. Upload finished with 80 pieces. All of this again took just a few seconds.
The gap between our expectation (should finish in seconds) and the current implementation (takes at least a minute, if not more) is that the offline nodes error out fast, but the retry is held back by the slowest node in the mix. So a single slow node combined with too many offline nodes can destroy the performance.
Would it be possible to kick off the retry for the offline nodes more or less right away, without waiting for the slow node? Or a more aggressive long tail cancellation that also triggers once successful + failed uploads > 80 (plus a safety threshold of 10 or so to avoid false positives)? In this situation that would have cancelled the slow node after 10 seconds instead of waiting more than a minute.
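A minimal sketch of the proposed cancellation rule, purely for illustration. The constant names and `shouldCancelLongTail` are hypothetical and not part of the actual uplink code; the numbers in `main` are taken from Round 1 above.

```go
package main

import "fmt"

const (
	optimalShares   = 80 // pieces needed before long tail cancellation normally kicks in
	safetyThreshold = 10 // margin so a few spurious early failures don't trigger cancellation
)

// shouldCancelLongTail reports whether waiting for the remaining in-flight
// uploads (e.g. one slow node) can still help. It combines the existing rule
// (enough successes) with the proposed one: once successes plus failures
// exceed the optimal count plus a safety margin, cancel the stragglers and
// start the retry for the failed pieces right away.
func shouldCancelLongTail(successful, failed int) bool {
	if successful >= optimalShares {
		return true
	}
	return successful+failed > optimalShares+safetyThreshold
}

func main() {
	// Round 1 from the issue: 76 successes and 33 fast connection failures
	// roughly 10 seconds in. 76+33 = 109 > 90, so the slow node would be
	// cancelled instead of delaying the retry by more than a minute.
	fmt.Println(shouldCancelLongTail(76, 33)) // true
}
```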
@littleskunk we are not sure if this issue is still relevant. We think some code changes we implemented might have resolved it. Can you take a look and let us know?
@iglesiasbrandon This is still relevant, especially for placements with a smaller long tail and fewer nodes. There are no longer any functioning dial timeouts in uplink, see https://review.dev.storj.io/c/storj/uplink/+/9815. Enough offline nodes cause an upload to hang until the OS TCP SYN retries are exhausted (2 minutes 7 seconds). Enough slow nodes cause the upload to hang until the message timeout (10 minutes!). This is the root cause of https://github.com/storj/customer-issues/issues/2112. Given a mix of normal, slow, and offline nodes, uplink should succeed with minimal slowdown.
Offline nodes, 2 minute timeout. (There's a trace with the 10 minute message timeout below this one)
Slow nodes, 10 minute message timeout. tcp_syn_retries was shortened to 4 in this run.
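For illustration, a generic sketch of why an application-level dial deadline matters here. This is not uplink's actual dialer (which goes through its own RPC/QUIC layers); it just shows that a per-dial context timeout caps how long an offline node can stall things, independent of the kernel's tcp_syn_retries setting. The 5-second value, dialNode name, and test address are assumptions for the example.

```go
package main

import (
	"context"
	"fmt"
	"net"
	"time"
)

// dialNode dials a storage node with a per-attempt deadline. Without one,
// connecting to an offline node blocks until the OS gives up on TCP SYN
// retransmissions (~2m07s with Linux defaults).
func dialNode(ctx context.Context, addr string) (net.Conn, error) {
	dialCtx, cancel := context.WithTimeout(ctx, 5*time.Second)
	defer cancel()

	var d net.Dialer
	return d.DialContext(dialCtx, "tcp", addr)
}

func main() {
	// 192.0.2.1 (TEST-NET-1) is unroutable, so this fails after ~5 seconds
	// rather than hanging for the full SYN retry window.
	start := time.Now()
	_, err := dialNode(context.Background(), "192.0.2.1:28967")
	fmt.Println(err, "after", time.Since(start).Round(time.Second))
}
```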