-
Notifications
You must be signed in to change notification settings - Fork 41
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve handling of aborted connections #26
Comments
Potential duplication, in which case this issue can be closed. That issue refers to a lot of the same problems discussed in #9. |
At least for TCP connections, this is not compatible with the transit protocol whatsoever. |
That's correct. I think this is a breaking change which requires the clients to implement handling for this. Alternatively operators might change the tcp keepalive time from the default of 2 hours to something much smaller(e g. a minute) which would require no changes to protocol or clients. |
One thing to note is that we should distinguish between an idle connection and a broken connection. Keeping an idle connection around for a few hours is fine, but waiting the same time for an ACK when sending a message is not. I don't know how flexible TCP options are on that front though. |
Maybe "Idle connection" can be defined as "Last activity was since X duration", which can be configurable. For those connections that are deemed idle, sending a message will determine if it is broken or just idle. Additionally, for connections that are not broken but idle for a long enough time, probably the server should also close these, but after a much longer period than the one used for checking if they're broken or not. |
I don't think the relay will ever be able to choose a good timeout: no matter what is chosen, some client will think it is too long (or too short). The relay doesn't know what protocol is running over the connections, either, so this becomes especially hard. Only the clients can decide how long they "should" wait. The only problem is when the sender disappears (without any TCP level close etc) for the file-transfer use case. When this happens, the receiver will see the same symptoms from a "slow" sender and from a "gone" sender (namely: data stops arriving). So, their tolerance for waiting (to see if more data shows up) is all that matters. (For other protocols, it may be completely fine for no data to show up for minutes or more -- consider an interactive chat application, for example, or a Chess game). Therefore, I don't see what handling could be improved at the relay so long as we're propagating closes and error properly -- the relay will only learn a connection is dead when it tries to send data (and in those cases, it should then close the "other" side). If there are cases where that isn't currently happening, we should fix them. No timeouts should be added to the relay. |
I agree that the relay could do good to clients by not closing the connections if they're idle but not broken, because there might be legitimate use cases for which the connection can be idle for a long time. It's up to clients to decide if this is unacceptable or not, so long as the other side is still connected but not sending anything. I still think the inability to detect broken connections at all might be a problem because if both sides are gone, the relay still keeps whatever buffers or state associated with that pair until it gets an error when writing to either side, which it won't do. Unfortunately though, this looks like it will require a change in the protocol. |
The relay is here to support any client-to-client protocol. The file-transfer use-case is just one. Clients are free to use a protocol that has better failure characteristics (e.g. periodic pings) but the file-transfer v1 protocol doesn't have that. Someone deploying the relay could change TCP-level options (e.g. keepalive) if they determine a need to free resources more quickly. Note also that Dilation solves this, because it does have pings (again, according to #9). If deployments of the relay run into problems, perhaps additional features could be proposed -- many of these discussed in #9 so I think that's the right ticket to collect any additional ideas. |
Currently when either side closes the TCP connectionn, the relay closes the other connection in the pair. However, when one side aborts(i.e. without closing the connection), the relay relies on TCP keepalive for recognizing this and closing the other side's connection.
From the relay's perspective, this could be a potential issue if enough of these aborted connections are created and the relay keeps all that state in-memory. Additionally, the open file limit might be reached which prevents further connections to the relay.
From the client's perspective, it would be good for the relay to more proactively detect and act on this so that clients are not stuck waiting needlessly.
A potential fix might be to let the relay detect idle connections and send a "ping" message to the idle side, if there is no response or if the connection is broken, then the connection being closed/aborted will cause a write error which will allow for closing and cleanup to happen. This would require changes to clients to handle this and respond with a "pong".
The text was updated successfully, but these errors were encountered: