Failures to recover from unresponsive TCP connection when ACM is enabled #1461

frankoid · 2023-03-07T14:45:23Z

frankoid
Mar 7, 2023

Hi 👋

We have an application that runs on multiple Kubernetes clusters in AWS. Many of the services that make up the application are written in C++ and use Ice to communicate with each other.

We recently had an incident where our application returned errors to callers for about 15 minutes and our logs contained many error logs like this one:

Failed to look up proxy MyObject3.Proxy (MyObject -t -e 1.1:ssl -h shard3-ice.prod.example.com -p 443 -t 7500) for shard 3; N3Ice26InvocationTimeoutExceptionE: src/Ice/OutgoingAsync.cpp:845: ::Ice::InvocationTimeoutException:

We are using the following ACM settings. The ACM timeout is set quite low to make sure it’s below the InvocationTimeout - we have previously experienced similar problems when the InvocationTimeout is shorter than the ACM timeout:

Ice.ACM.Server.Close=1      # CloseOnIdle
Ice.ACM.Server.Heartbeat=1  # HeartbeatOnDispatch
Ice.ACM.Server.Timeout=4
Ice.ACM.Client.Close=3     # CloseOnInvocationAndIdle
Ice.ACM.Client.Heartbeat=0 # HeartbeatOff (heartbeat is off here but enabled on the server end in girogate-main)
Ice.ACM.Client.Timeout=4

# a variety of different invocation timeouts are in use, all 5000 (5s) or greater

We have been able to reproduce the problem in a non-production environment for the particular client and server services involved. To reproduce we sent a steady stream of traffic (2 requests per second) and made a network connection become unresponsive whilst still allowing new connections to succeed (details of reproduction method below). When we run netstat on the client we see that the bad (unresponsive) connection remains open, so it seems that ACM is not closing the bad connection for some reason - I’d expect it to close this connection after 4 seconds because no heartbeat will have been received from the server.

The same issue doesn’t reproduce when we try it with different clients and servers and we’re not sure what is different about this client & server that’s causing the issue to happen here and not elsewhere. We have worked around the issue by making this client set a different ice_connectionId every time it looks up a proxy as described in the Ice docs, however because we don’t understand why the problem happens we don’t know whether other parts of the system might suffer from the same problem. The overall system is quite large so it would be a lot of effort to apply this workaround to all Ice connections. It might also affect performance or stability - we use TLS so establishing a new connection for each proxy uses more CPU, and establishing may more connections might mean we hit a limit.

Does anyone have any idea what might be going on or how we can investigate?

Appendix: Steps to Reproduce

Background: multiple replicas of the server pod run, so the server hostname resolves to multiple IP addresses. I’ve replaced the actual server hostname with server.example.com below for confidentiality reasons.

Install Chaos Mesh in your developer k8s cluster.
Edit client k8s manifests to reduce replicast to 1 and allow root access so that /etc/hosts can be edited.
Get a shell on the client container and add an /etc/hosts file entry pointing server.example.com at one of its IPs.
Run a test client in a loop continually making 2 requests per second to exercise the client & server. We use our load test running at a low load setting here.
Use Chaos Mesh to introduce a partition between the client pod and the IP address added to the hosts file in step 3.
Change the hosts file in the client pod to point server.example.com at another of its IPs that is not affected by the partition affected in the previous step (remove the partitioned IP address, so there is a only single entry for server.example.com pointing at the good IP address).

Expected behaviour: the client’s Ice ACM notices that its connection to the partitioned server IP is no longer responsive, so it establishes a new connection - this uses the still-working IP address added to the hosts file in step 6.

Actual behaviour: the client fails with InvocationTimeoutExceptions for about 15 minutes before a new connection is established and it requests start to succeed again. Running netstat -a in the client pod shows a connection to the partitioned IP address (added to /etc/hosts in step 3) and no connections to the working IP address (added to /etc/hosts in step 6).

bentoi · 2023-03-08T11:15:59Z

bentoi
Mar 8, 2023
Maintainer

Hi,

Which Ice version do you use?

We recently had an incident where our application returned errors to callers for about 15 minutes and our logs contained many error logs like this one:

If I understand it correctly, your application establishes the connection and has therefore the "client" role. The "callers" you mention have the "server" role and accept connections from your application. Is this accurate?

We are using the following ACM settings.

Are these settings for both the client and server? Or do you use different ACM settings for each process?

The ACM timeout is set quite low to make sure it’s below the InvocationTimeout - we have previously experienced similar problems when the InvocationTimeout is shorter than the ACM timeout

It's not clear why it's necessary. The invocation timeout should be triggered regardless of the ACM timeout. You can have a 30s ACM timeout and an invocation timeout of 5s should trigger an InvocationTimeoutException exception. And if you have a 2 minutes invocation timeout and the connection is lost within these 2 minutes, ACM is supposed to detect the connection loss and trigger the invocation failure (regardless of the invocation timeout set).

I’d expect it to close this connection after 4 seconds because no heartbeat will have been received from the server.

It's actually not quite how ACM works. An Ice client not receiving heartbeats doesn't necessarily implies that the connection with the server is broken (depending on the server configuration, the server might not necessarily send heartbeats or perhaps the client is using an old Ice version which didn't support heartbeats).

As long as writes are successful, the connection is not considered broken -- even if it doesn't receive heartbeats within the ACM timeout period. In other words, if your application still successfully sends invocations while waiting for an invocation response, the connection is still considered healthy. Ice relies on the TCP stack to eventually report the failure to send data (either when sending heartbeats or invocations).

For more information on ACM see: https://doc.zeroc.com/ice/latest/client-server-features/connection-management/active-connection-management

We need to identify exactly the source of your issue but from your description it sounds like the TCP stack in your K8s environment does not detect the failure when sending the data and Ice therefore assumes that the data is successfully delivered by the recipient.

1 reply

frankoid Mar 9, 2023
Author

Thanks for the reply.

Which Ice version do you use?

We use the Ubuntu 20.04 package, version 3.7.3-1build2.

We recently had an incident where our application returned errors to callers for about 15 minutes and our logs contained many error logs like this one:

If I understand it correctly, your application establishes the connection and has therefore the "client" role. The "callers" you mention have the "server" role and accept connections from your application. Is this accurate?

By "callers" I mean external callers making HTTP requests into our system.

We actually have multiple Ice "hops" involved in servicing each HTTP request.

By "client" I mean an application/service that makes outbound Ice calls. By "server" I mean an application that accepts inbound Ice calls. Some of our applications act as both clients and servers - they accept inbound Ice calls and also make outbound Ice calls.

We are using the following ACM settings.

Are these settings for both the client and server? Or do you use different ACM settings for each process?

In most cases we use the same settings in all applications and we include both the Ice.ACM.Server.* and the Ice.ACM.Client.* settings in all applications even though I don't think the Ice.ACM.Server.* settings are strictly needed for clients and vice versa.

The ACM timeout is set quite low to make sure it’s below the InvocationTimeout - we have previously experienced similar problems when the InvocationTimeout is shorter than the ACM timeout

It's not clear why it's necessary. The invocation timeout should be triggered regardless of the ACM timeout. You can have a 30s ACM timeout and an invocation timeout of 5s should trigger an InvocationTimeoutException exception. And if you have a 2 minutes invocation timeout and the connection is lost within these 2 minutes, ACM is supposed to detect the connection loss and trigger the invocation failure (regardless of the invocation timeout set).

The problem we saw was that when we had an invocation timeout of 5s and an ACM timeout of 30s and a TCP connection stopped responding (traffic stops flowing due to an infrastructure issue) then we'd see the following behaviour:

Ice throws InvocationTimeoutException after 5 seconds - this is what I'd expect.
When we keep making calls, Ice continues to throw InvocationTimeoutException after 5 seconds for each call. I would expect that 30 seconds after the TCP connection stopped responding then Ice should establish a new connection, and assuming that connection works (traffic flows) then subsequent calls should succeed without any exception. What we observed was that we'd still get InvocationTimeoutExceptions long after the 30 second ACM timeout.

It's as if the low invocation timeout was preventing Ice from "noticing" that the original connection was bad (I'm speculating here).

Reducing the ACM timeout from 30s to 4s (i.e. lower than the 5s invocation timeout) improved the behaviour we saw in the "existing TCP connection becomes unresponsive but new TCP connections will work" scenario I described above - in our testing it caused the system to recover quickly (within a few seconds) in this scenario rather than continuing to experience InvocationTimeoutExceptions for a long time (about 15 minutes). However, although this change generally improved the behaviour we still have one particular part of our application where even with the 4s ACM timeout the application still fails to recover quickly in this situation (we still get InvocationTimeoutExceptions for a long time after a TCP connection becomes unresponsive).

I’d expect it to close this connection after 4 seconds because no heartbeat will have been received from the server.

It's actually not quite how ACM works. An Ice client not receiving heartbeats doesn't necessarily implies that the connection with the server is broken (depending on the server configuration, the server might not necessarily send heartbeats or perhaps the client is using an old Ice version which didn't support heartbeats).

Yes, I've seen that ACM considers other activity not just heartbeats, I was oversimplifying when I said "I’d expect it to close this connection after 4 seconds because no heartbeat will have been received from the server".

As long as writes are successful, the connection is not considered broken -- even if it doesn't receive heartbeats within the ACM timeout period. In other words, if your application still successfully sends invocations while waiting for an invocation response, the connection is still considered healthy. Ice relies on the TCP stack to eventually report the failure to send data (either when sending heartbeats or invocations).

This is interesting and might be related to our problem. What does Ice consider a "successful write"? Whilst testing unresponsive connections by introducing a network partition I've seen that the Send-Q reported by netstat -a for the TCP connection Ice is using grows. Perhaps Ice is "successfully" writing data to the network socket but the data is not being received by the server, i.e. the client OS (Linux) has sent the TCP packets but not received ACKs for them because of the network partition?

For more information on ACM see: https://doc.zeroc.com/ice/latest/client-server-features/connection-management/active-connection-management

We need to identify exactly the source of your issue but from your description it sounds like the TCP stack in your K8s environment does not detect the failure when sending the data and Ice therefore assumes that the data is successfully delivered by the recipient.

bentoi · 2023-03-09T15:20:25Z

bentoi
Mar 9, 2023
Maintainer

This is interesting and might be related to our problem. What does Ice consider a "successful write"? Whilst testing unresponsive connections by introducing a network partition I've seen that the Send-Q reported by netstat -a for the TCP connection Ice is using grows. Perhaps Ice is "successfully" writing data to the network socket but the data is not being received by the server, i.e. the client OS (Linux) has sent the TCP packets but not received ACKs for them because of the network partition?

It's indeed what I suspect as well.

A successful write is a write for which the socket send system call succeeds, meaning that the data was successfully copied to the socket send buffer (aka Send-Q). The TCP stack is responsible for sending the data under the hood and should eventually trigger a failure from the send system call if a connection loss is detected (the TCP stack didn't receive a ACK for sometime). Ice relies on this to detect the connection failure (in addition to the receival of heartbeats if the peer sends heartbeats).

Here are two things you could try to help us pint point the issue.

The first thing is to try to fill up the send buffer quicker:

use a 1KB payload for the request you send every 500ms
eventually decrease the send/recv buffer size in the client and the server configuration (32KB for example).

Ice is supposed to teardown the connection if writes don't progress after the connection timeout. In your case, it's the 7500ms timeout set on your endpoint.

The second thing you could try is to tune the TCP keep alive mechanism. This requires root access to the host. See this page for information on TCP keep alives: https://tldp.org/HOWTO/TCP-Keepalive-HOWTO/usingkeepalive.html

By default, TCP keep alive starts after 2 hours. This isn't suitable for your deployment. You could try the following settings to detect a dead connection sooner:

net.ipv4.tcp_keepalive_time = 30
net.ipv4.tcp_keepalive_intvl = 5
net.ipv4.tcp_keepalive_probes = 6

TCP will start sending keep alive probes after 30s and will declare the connection dead after sending 6 unanswered probes at 5s intervals.

Let us know how it goes.

1 reply

frankoid Mar 14, 2023
Author

Thanks, we will do some experiments and let you know how it goes.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ZeroC Ice

Failures to recover from unresponsive TCP connection when ACM is enabled #1461

{{title}}

Replies: 2 comments 2 replies

{{title}}

{{title}}

{{title}}

{{title}}

Select a reply

ZeroC Ice

Failures to recover from unresponsive TCP connection when ACM is enabled #1461

frankoid Mar 7, 2023

Appendix: Steps to Reproduce

Replies: 2 comments · 2 replies

bentoi Mar 8, 2023 Maintainer

frankoid Mar 9, 2023 Author

bentoi Mar 9, 2023 Maintainer

frankoid Mar 14, 2023 Author

frankoid
Mar 7, 2023

Replies: 2 comments 2 replies

bentoi
Mar 8, 2023
Maintainer

frankoid Mar 9, 2023
Author

bentoi
Mar 9, 2023
Maintainer

frankoid Mar 14, 2023
Author