Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adjust connect/reconnect timeouts #42

Open
Astaelan opened this issue Aug 29, 2019 · 3 comments
Open

Adjust connect/reconnect timeouts #42

Astaelan opened this issue Aug 29, 2019 · 3 comments
Labels
enhancement New feature or request

Comments

@Astaelan
Copy link
Contributor

I followed the path and it looks like there is the 10000ms connect timeout, after which it has a 1000ms sleep. It does this twice in a loop, and if this fails then it bails out on creating the connection, which drops way back out to where the session is managed, if the connection fails then it schedules to try the whole thing again in 5000ms.
This means that in the worst case with slow connect failure it could take 10 seconds to timeout before it waits another 1 second to try again. After waiting another 10 seconds it would wait one second, bail out, reschedule in 5 seconds, and then try again.
Collectively this is a little long and could cause missing session restore windows easier. So I think to bring it in line with what other services currently have we should turn down the connect timeout to 5000ms then remove the 2 retries with the 1000ms wait entirely. And leave the rescheduling at 5000ms. This means at most it'll take upto 10 seconds to retry if the connect isn't failing fast, and as little as 5s if it does, which is in line with other services.

@Astaelan Astaelan added the enhancement New feature or request label Aug 29, 2019
@Astaelan Astaelan self-assigned this Aug 29, 2019
@emcgee
Copy link
Member

emcgee commented Aug 29, 2019

A 5000ms timeout sounds reasonable, but then I would think it could retry every 1000ms until session restablishment. Is there any good reason to have it longer? We would want it to reestablish the connection as soon as possible.

@Astaelan
Copy link
Contributor Author

1000ms is acceptable as well, as long as it's not hammering it that's probably fine for rescheduling the retry. The 5000ms on connect timeout probably needs to remain that just to deal with any blips on DNS or services all reconnecting at once and a delayed ACK from the system.

@Astaelan
Copy link
Contributor Author

Some additional new information related to this, we no longer have session restoring, but we are introducing unroutable message holding, the intent is to give a service such as FS a chance to reconnect before we give up trying to deliver messages destined for a specific node.
To address this properly, we will need to ensure that the C client does properly attempt to reconnect quickly, at least once or twice within the first 3-4 seconds of losing the connection to ensure that the client even has a chance of getting reconnected before the unroutable messages are dropped.
As a side issue, we might look at increasing the default timeout of messages from 5 seconds to 10 seconds, this would cause clients to wait a bit longer before timing out and give services a better chance to reconnect and resume operations after a short network issue. See https://github.com/signalwire/cloud-product/issues/224

@Astaelan Astaelan removed their assignment Oct 16, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants