Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve handling of aborted connections #26

Closed
ewanas opened this issue Jun 2, 2022 · 8 comments
Closed

Improve handling of aborted connections #26

ewanas opened this issue Jun 2, 2022 · 8 comments

Comments

@ewanas
Copy link

ewanas commented Jun 2, 2022

Currently when either side closes the TCP connectionn, the relay closes the other connection in the pair. However, when one side aborts(i.e. without closing the connection), the relay relies on TCP keepalive for recognizing this and closing the other side's connection.

From the relay's perspective, this could be a potential issue if enough of these aborted connections are created and the relay keeps all that state in-memory. Additionally, the open file limit might be reached which prevents further connections to the relay.

From the client's perspective, it would be good for the relay to more proactively detect and act on this so that clients are not stuck waiting needlessly.

A potential fix might be to let the relay detect idle connections and send a "ping" message to the idle side, if there is no response or if the connection is broken, then the connection being closed/aborted will cause a write error which will allow for closing and cleanup to happen. This would require changes to clients to handle this and respond with a "pong".

@ewanas
Copy link
Author

ewanas commented Jun 2, 2022

Potential duplication, in which case this issue can be closed. That issue refers to a lot of the same problems discussed in #9.

@piegamesde
Copy link
Member

A potential fix might be to let the relay detect idle connections and send a "ping" message to the idle side

At least for TCP connections, this is not compatible with the transit protocol whatsoever.

@ewanas
Copy link
Author

ewanas commented Jun 2, 2022

That's correct. I think this is a breaking change which requires the clients to implement handling for this.

Alternatively operators might change the tcp keepalive time from the default of 2 hours to something much smaller(e g. a minute) which would require no changes to protocol or clients.

@piegamesde
Copy link
Member

One thing to note is that we should distinguish between an idle connection and a broken connection. Keeping an idle connection around for a few hours is fine, but waiting the same time for an ACK when sending a message is not. I don't know how flexible TCP options are on that front though.

@ewanas
Copy link
Author

ewanas commented Jun 2, 2022

Maybe "Idle connection" can be defined as "Last activity was since X duration", which can be configurable. For those connections that are deemed idle, sending a message will determine if it is broken or just idle.

Additionally, for connections that are not broken but idle for a long enough time, probably the server should also close these, but after a much longer period than the one used for checking if they're broken or not.

@meejah
Copy link
Member

meejah commented Jun 2, 2022

I don't think the relay will ever be able to choose a good timeout: no matter what is chosen, some client will think it is too long (or too short). The relay doesn't know what protocol is running over the connections, either, so this becomes especially hard.

Only the clients can decide how long they "should" wait.

The only problem is when the sender disappears (without any TCP level close etc) for the file-transfer use case. When this happens, the receiver will see the same symptoms from a "slow" sender and from a "gone" sender (namely: data stops arriving). So, their tolerance for waiting (to see if more data shows up) is all that matters.

(For other protocols, it may be completely fine for no data to show up for minutes or more -- consider an interactive chat application, for example, or a Chess game).

Therefore, I don't see what handling could be improved at the relay so long as we're propagating closes and error properly -- the relay will only learn a connection is dead when it tries to send data (and in those cases, it should then close the "other" side). If there are cases where that isn't currently happening, we should fix them. No timeouts should be added to the relay.

@ewanas
Copy link
Author

ewanas commented Jun 2, 2022

I agree that the relay could do good to clients by not closing the connections if they're idle but not broken, because there might be legitimate use cases for which the connection can be idle for a long time. It's up to clients to decide if this is unacceptable or not, so long as the other side is still connected but not sending anything.

I still think the inability to detect broken connections at all might be a problem because if both sides are gone, the relay still keeps whatever buffers or state associated with that pair until it gets an error when writing to either side, which it won't do. Unfortunately though, this looks like it will require a change in the protocol.

@meejah
Copy link
Member

meejah commented Jun 2, 2022

The relay is here to support any client-to-client protocol. The file-transfer use-case is just one. Clients are free to use a protocol that has better failure characteristics (e.g. periodic pings) but the file-transfer v1 protocol doesn't have that.

Someone deploying the relay could change TCP-level options (e.g. keepalive) if they determine a need to free resources more quickly.

Note also that Dilation solves this, because it does have pings (again, according to #9).

If deployments of the relay run into problems, perhaps additional features could be proposed -- many of these discussed in #9 so I think that's the right ticket to collect any additional ideas.

@meejah meejah closed this as completed Jun 2, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants