-
Notifications
You must be signed in to change notification settings - Fork 34
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adjust connect/reconnect timeouts #42
Comments
A 5000ms timeout sounds reasonable, but then I would think it could retry every 1000ms until session restablishment. Is there any good reason to have it longer? We would want it to reestablish the connection as soon as possible. |
1000ms is acceptable as well, as long as it's not hammering it that's probably fine for rescheduling the retry. The 5000ms on connect timeout probably needs to remain that just to deal with any blips on DNS or services all reconnecting at once and a delayed ACK from the system. |
Some additional new information related to this, we no longer have session restoring, but we are introducing unroutable message holding, the intent is to give a service such as FS a chance to reconnect before we give up trying to deliver messages destined for a specific node. |
I followed the path and it looks like there is the 10000ms connect timeout, after which it has a 1000ms sleep. It does this twice in a loop, and if this fails then it bails out on creating the connection, which drops way back out to where the session is managed, if the connection fails then it schedules to try the whole thing again in 5000ms.
This means that in the worst case with slow connect failure it could take 10 seconds to timeout before it waits another 1 second to try again. After waiting another 10 seconds it would wait one second, bail out, reschedule in 5 seconds, and then try again.
Collectively this is a little long and could cause missing session restore windows easier. So I think to bring it in line with what other services currently have we should turn down the connect timeout to 5000ms then remove the 2 retries with the 1000ms wait entirely. And leave the rescheduling at 5000ms. This means at most it'll take upto 10 seconds to retry if the connect isn't failing fast, and as little as 5s if it does, which is in line with other services.
The text was updated successfully, but these errors were encountered: