Adjust connect/reconnect timeouts #42

Astaelan · 2019-08-29T17:58:35Z

I followed the path and it looks like there is the 10000ms connect timeout, after which it has a 1000ms sleep. It does this twice in a loop, and if this fails then it bails out on creating the connection, which drops way back out to where the session is managed, if the connection fails then it schedules to try the whole thing again in 5000ms.
This means that in the worst case with slow connect failure it could take 10 seconds to timeout before it waits another 1 second to try again. After waiting another 10 seconds it would wait one second, bail out, reschedule in 5 seconds, and then try again.
Collectively this is a little long and could cause missing session restore windows easier. So I think to bring it in line with what other services currently have we should turn down the connect timeout to 5000ms then remove the 2 retries with the 1000ms wait entirely. And leave the rescheduling at 5000ms. This means at most it'll take upto 10 seconds to retry if the connect isn't failing fast, and as little as 5s if it does, which is in line with other services.

emcgee · 2019-08-29T18:35:47Z

A 5000ms timeout sounds reasonable, but then I would think it could retry every 1000ms until session restablishment. Is there any good reason to have it longer? We would want it to reestablish the connection as soon as possible.

Astaelan · 2019-08-29T18:37:43Z

1000ms is acceptable as well, as long as it's not hammering it that's probably fine for rescheduling the retry. The 5000ms on connect timeout probably needs to remain that just to deal with any blips on DNS or services all reconnecting at once and a delayed ACK from the system.

Astaelan · 2020-08-20T18:47:37Z

Some additional new information related to this, we no longer have session restoring, but we are introducing unroutable message holding, the intent is to give a service such as FS a chance to reconnect before we give up trying to deliver messages destined for a specific node.
To address this properly, we will need to ensure that the C client does properly attempt to reconnect quickly, at least once or twice within the first 3-4 seconds of losing the connection to ensure that the client even has a chance of getting reconnected before the unroutable messages are dropped.
As a side issue, we might look at increasing the default timeout of messages from 5 seconds to 10 seconds, this would cause clients to wait a bit longer before timing out and give services a better chance to reconnect and resume operations after a short network issue. See https://github.com/signalwire/cloud-product/issues/224

Astaelan added the enhancement New feature or request label Aug 29, 2019

Astaelan self-assigned this Aug 29, 2019

Astaelan removed their assignment Oct 16, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Adjust connect/reconnect timeouts #42

Adjust connect/reconnect timeouts #42

Astaelan commented Aug 29, 2019

emcgee commented Aug 29, 2019

Astaelan commented Aug 29, 2019

Astaelan commented Aug 20, 2020

Adjust connect/reconnect timeouts #42

Adjust connect/reconnect timeouts #42

Comments

Astaelan commented Aug 29, 2019

emcgee commented Aug 29, 2019

Astaelan commented Aug 29, 2019

Astaelan commented Aug 20, 2020