-
Hi, which Ice version do you use?
If I understand it correctly, your application establishes the connection and therefore has the "client" role. The "callers" you mention have the "server" role and accept connections from your application. Is this accurate?
Are these settings for both the client and server? Or do you use different ACM settings for each process?
It's not clear why this is necessary. The invocation timeout should be triggered regardless of the ACM timeout: you can have a 30s ACM timeout, and an invocation timeout of 5s should still trigger an InvocationTimeoutException after 5 seconds.
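As a side note, here is a minimal sketch of what a per-proxy 5s invocation timeout looks like in the Ice C++11 mapping; the proxy string and commented-out operation are hypothetical, not taken from this discussion:

```cpp
// Minimal sketch (hypothetical names): a per-proxy invocation timeout in the
// Ice C++11 mapping. Invocations on "timed" throw Ice::InvocationTimeoutException
// if no reply arrives within 5s, independently of any ACM settings.
#include <Ice/Ice.h>

int main(int argc, char* argv[])
{
    Ice::CommunicatorHolder ich(argc, argv);
    auto base = ich->stringToProxy("service:tcp -h server.example.com -p 10000");
    auto timed = base->ice_invocationTimeout(5000); // milliseconds
    // timed->ice_ping(); // would throw InvocationTimeoutException after 5s with no reply
    return 0;
}
```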
It's actually not quite how ACM works. An Ice client not receiving heartbeats doesn't necessarily imply that the connection with the server is broken (depending on the server configuration, the server might not necessarily send heartbeats, or perhaps the client is using an old Ice version which didn't support heartbeats). As long as writes are successful, the connection is not considered broken, even if no heartbeats are received within the ACM timeout period. In other words, if your application still successfully sends invocations while waiting for an invocation response, the connection is still considered healthy. Ice relies on the TCP stack to eventually report the failure to send data (either when sending heartbeats or invocations). For more information on ACM see: https://doc.zeroc.com/ice/latest/client-server-features/connection-management/active-connection-management

We need to identify exactly the source of your issue, but from your description it sounds like the TCP stack in your K8s environment does not detect the failure when sending the data, and Ice therefore assumes that the data is successfully delivered to the recipient.
-
It's indeed what I suspect as well. A successful write is a write for which the socket send system call succeeds, meaning that the data was successfully copied to the socket send buffer (aka Send-Q). The TCP stack is responsible for sending the data under the hood and should eventually trigger a failure from the send system call if a connection loss is detected (the TCP stack didn't receive an ACK for some time). Ice relies on this to detect the connection failure (in addition to the receipt of heartbeats, if the peer sends heartbeats). Here are two things you could try to help us pinpoint the issue. The first thing is to try to fill up the send buffer quicker:
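The property snippet that originally followed isn't preserved in this extract; as a rough sketch, one way to fill the send buffer quicker is to shrink it with the `Ice.TCP.SndSize` property (the value here is purely illustrative):

```
# Illustrative value, not a recommendation: a small socket send buffer fills up
# quickly when the peer stops acknowledging data, so a stalled write surfaces sooner.
Ice.TCP.SndSize=4096
```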
Ice is supposed to tear down the connection if writes don't progress after the connection timeout. In your case, it's the 7500ms timeout set on your endpoint. The second thing you could try is to tune the TCP keep-alive mechanism. This requires root access to the host. See this page for information on TCP keep-alives: https://tldp.org/HOWTO/TCP-Keepalive-HOWTO/usingkeepalive.html By default, TCP keep-alive starts after 2 hours. This isn't suitable for your deployment. You could try the following settings to detect a dead connection sooner:
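The exact settings from the original reply aren't preserved here, but the following sysctls are a sketch that matches the behaviour described just below (probing starts after 30s, 6 probes at 5s intervals):

```
# Sketch only; run as root on the client host.
sysctl -w net.ipv4.tcp_keepalive_time=30    # start sending probes after 30s of inactivity
sysctl -w net.ipv4.tcp_keepalive_intvl=5    # send a probe every 5s
sysctl -w net.ipv4.tcp_keepalive_probes=6   # declare the connection dead after 6 unanswered probes
```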
TCP will start sending keep-alive probes after 30s and will declare the connection dead after sending 6 unanswered probes at 5s intervals. Let us know how it goes.
-
Hi 👋
We have an application that runs on multiple Kubernetes clusters in AWS. Many of the services that make up the application are written in C++ and use Ice to communicate with each other.
We recently had an incident where our application returned errors to callers for about 15 minutes and our logs contained many error logs like this one:
We are using the following ACM settings. The ACM timeout is set quite low to make sure it’s below the InvocationTimeout - we have previously experienced similar problems when the InvocationTimeout is shorter than the ACM timeout:
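The actual property values were lost from this extract; purely for illustration, client-side ACM settings of roughly this shape are what the description suggests (the 4s timeout matches the expectation mentioned below; the other values are guesses):

```
# Illustration only, not the configuration used in the incident.
Ice.ACM.Client.Timeout=4                    # seconds, kept below the invocation timeout
Ice.ACM.Client.Heartbeat=HeartbeatAlways
Ice.ACM.Client.Close=CloseOnIdle
```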
We have been able to reproduce the problem in a non-production environment for the particular client and server services involved. To reproduce it we sent a steady stream of traffic (2 requests per second) and made a network connection become unresponsive whilst still allowing new connections to succeed (details of the reproduction method below). When we run `netstat` on the client we see that the bad (unresponsive) connection remains open, so it seems that ACM is not closing the bad connection for some reason. I'd expect it to close this connection after 4 seconds because no heartbeat will have been received from the server.

The same issue doesn't reproduce when we try it with different clients and servers, and we're not sure what is different about this client and server that causes the issue to happen here and not elsewhere. We have worked around the issue by making this client set a different `ice_connectionId` every time it looks up a proxy, as described in the Ice docs. However, because we don't understand why the problem happens, we don't know whether other parts of the system might suffer from the same problem. The overall system is quite large, so it would be a lot of effort to apply this workaround to all Ice connections. It might also affect performance or stability: we use TLS, so establishing a new connection for each proxy uses more CPU, and establishing many more connections might mean we hit a limit.

Does anyone have any idea what might be going on or how we can investigate?
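For context, here is a minimal sketch of the `ice_connectionId` workaround (not our actual code; the proxy string and id scheme are made up):

```cpp
// Sketch of the ice_connectionId workaround: a proxy with a unique connection id
// never shares a cached connection with earlier proxies, so a dead cached
// connection cannot be reused. Names below are hypothetical.
#include <Ice/Ice.h>
#include <atomic>
#include <string>

std::shared_ptr<Ice::ObjectPrx>
lookupWithFreshConnection(const std::shared_ptr<Ice::Communicator>& communicator)
{
    static std::atomic<int> counter{0};
    auto base = communicator->stringToProxy("service:ssl -h server.example.com -p 10000");
    // The cost: each distinct connection id forces a new (TLS) connection.
    return base->ice_connectionId("lookup-" + std::to_string(counter++));
}
```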
Appendix: Steps to Reproduce
Background: multiple replicas of the server pod run, so the server hostname resolves to multiple IP addresses. I've replaced the actual server hostname with `server.example.com` below for confidentiality reasons.

1. Install Chaos Mesh in your developer k8s cluster.
2. Edit the client k8s manifests to reduce replicas to 1 and allow root access so that /etc/hosts can be edited.
3. Get a shell on the client container and add an /etc/hosts entry pointing `server.example.com` at one of its IPs.
4. Run a test client in a loop continually making 2 requests per second to exercise the client & server. We use our load test running at a low load setting here.
5. Use Chaos Mesh to introduce a partition between the client pod and the IP address added to the hosts file in step 3.
6. Change the hosts file in the client pod to point `server.example.com` at another of its IPs that is not affected by the partition introduced in the previous step (remove the partitioned IP address, so there is only a single entry for `server.example.com`, pointing at the good IP address).

Expected behaviour: the client's Ice ACM notices that its connection to the partitioned server IP is no longer responsive, so it establishes a new connection; this uses the still-working IP address added to the hosts file in step 6.

Actual behaviour: the client fails with InvocationTimeoutExceptions for about 15 minutes before a new connection is established and its requests start to succeed again. Running `netstat -a` in the client pod shows a connection to the partitioned IP address (added to /etc/hosts in step 3) and no connections to the working IP address (added to /etc/hosts in step 6).