Troubleshoot zookeeper connection failures #310
base: static-props-some-day
and I'm guessing it could be due to throttling. Kafka, for example, reports:

    kafka.zookeeper.ZooKeeperClientTimeoutException: Timed out waiting for connection while in state: CONNECTING

And zookeeper logs include things like:

    java.io.EOFException
        at java.base/java.io.DataInputStream.readInt(DataInputStream.java:397)
        at org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.run(QuorumCnxManager.java:1206)
    [2020-03-06 12:35:45,056] WARN Connection broken for id 1, my id = 4, error = (org.apache.zookeeper.server.quorum.QuorumCnxManager)
    java.net.SocketException: Socket closed
        at java.base/java.net.SocketInputStream.socketRead0(Native Method)
        at java.base/java.net.SocketInputStream.socketRead(SocketInputStream.java:115)
        at java.base/java.net.SocketInputStream.read(SocketInputStream.java:168)
        at java.base/java.net.SocketInputStream.read(SocketInputStream.java:140)
        at java.base/java.io.BufferedInputStream.fill(BufferedInputStream.java:252)
        at java.base/java.io.BufferedInputStream.read(BufferedInputStream.java:271)
        at java.base/java.io.DataInputStream.readInt(DataInputStream.java:392)
        at org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.run(QuorumCnxManager.java:1206)

I've been able to resolve it twice with this memory limit increase, but of course it could also be that the restart resolves the issue.
through the service. This helps troubleshoot issues like #310 by pointing out (by podIP) which actual zookeeper connection failed. Also, I like the simplification.
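A rough sketch of the same idea from outside the broker: resolve a name to its individual addresses and attempt a TCP connect to each, so that a failure is attributed to one IP rather than to the service as a whole. The function names, the example hostname and port 2181 are assumptions for illustration, not something taken from this PR.

```python
import socket


def resolve_all(host, port):
    """Return the distinct IP addresses that `host` resolves to."""
    infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    return sorted({info[4][0] for info in infos})


def probe(ip, port, timeout=2.0):
    """Try a single TCP connect and return (ok, detail)."""
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            return True, "connected"
    except OSError as exc:
        # Covers refused connections, timeouts and unreachable hosts.
        return False, f"{type(exc).__name__}: {exc}"


def probe_service(host, port, timeout=2.0):
    """Probe every address behind `host` and return {ip: (ok, detail)}."""
    return {ip: probe(ip, port, timeout) for ip in resolve_all(host, port)}


# Example call (hypothetical headless service name, adjust to your cluster):
#   for ip, (ok, detail) in probe_service("zookeeper.kafka", 2181).items():
#       print(ip, "OK" if ok else "FAILED", detail)
```

A successful TCP connect only shows that the port is reachable; it says nothing about whether the zookeeper instance behind it is actually serving requests.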
The issue is present in clusters with preemptible nodes as well. I just found https://issues.apache.org/jira/browse/ZOOKEEPER-2938 with fix https://issues.apache.org/jira/browse/ZOOKEEPER-2164, so on the next failure I'll try deleting only the leader. Until now I've scaled down and up again.
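To delete only the leader you first need to know which pod it is. A minimal sketch, assuming the `srvr` four-letter command is permitted on the servers (as far as I know it is the one command whitelisted by default in ZooKeeper 3.5+) and that its reply contains a `Mode:` line. The helper names and the host list are hypothetical.

```python
import socket


def parse_mode(text):
    """Extract the value of the 'Mode:' line from a srvr reply, or None."""
    for line in text.splitlines():
        if line.startswith("Mode:"):
            return line.split(":", 1)[1].strip()
    return None


def srvr_mode(host, port=2181, timeout=2.0):
    """Send `srvr` to one server and return its reported mode."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(b"srvr")
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # the server closes the connection after replying
                break
            chunks.append(data)
    return parse_mode(b"".join(chunks).decode("utf-8", errors="replace"))


def find_leader(hosts, port=2181, timeout=2.0):
    """Return the first host reporting Mode: leader, skipping unreachable ones."""
    for host in hosts:
        try:
            if srvr_mode(host, port, timeout) == "leader":
                return host
        except OSError:
            continue
    return None


# Example call (hypothetical pod DNS names):
#   find_leader(["pzoo-0.pzoo.kafka", "pzoo-1.pzoo.kafka", "pzoo-2.pzoo.kafka"])
```

If no server answers or none reports `leader`, `find_leader` returns `None`, which in a broken quorum is itself useful information.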
https://zookeeper.apache.org/doc/r3.6.1/releasenotes.html contains:
They might not be the cause of the timeouts, but they definitely pose difficulties for stabilization on ephemeral/shared-core nodes.
After testing both Zookeeper 3.5.7 and 3.6.1 with a lot of different configs, the essence of the failure remains: a kafka pod fails to start up, with timeouts like:
The number of retries depends on the combination of
Kafka 2.5.0 also increases the default session timeout to 18s, based on reasoning around transient instability vs. genuine failures. I haven't found any effect of varying these timeouts on the actual cause of the crashloop.
Yolean/kubernetes-kafka#310 because when I investigated the timeouts there I found reports that similar errors had been fixed by cleaning up (or trashing) persisted state.

> Detected a direct/mapped ByteBuffer in the image heap.
> A direct ByteBuffer has a pointer to unmanaged C memory,
> and C memory from the image generator is not available at image run time.
> A mapped ByteBuffer references a file descriptor,
> which is no longer open and mapped at run time.

    Error: com.oracle.graal.pointsto.constraints.UnsupportedFeatureException: Detected a direct/mapped ByteBuffer in the image heap. A direct ByteBuffer has a pointer to unmanaged C memory, and C memory from the image generator is not available at image run time. A mapped ByteBuffer references a file descriptor, which is no longer open and mapped at run time. To see how this object got instantiated use -H:+TraceClassInitialization. The object was probably created by a class initializer and is reachable from a static field. You can request class initialization at image run time by using the option --initialize-at-run-time=<class-name>. Or you can write your own initialization methods and call them explicitly from your main entry point.
    Detailed message:
    Trace:
        at parsing org.apache.zookeeper.server.persistence.FilePadding.padFile(FilePadding.java:78)
    Call path from entry point to org.apache.zookeeper.server.persistence.FilePadding.padFile(FileChannel):
        at org.apache.zookeeper.server.persistence.FilePadding.padFile(FilePadding.java:76)
I've experimented with a 3-kafka 5-zk cluster on a 3-node GKE e2-small cluster, i.e. just enough resources to run the Kafka stack together with some basic infra. During tests I occasionally get kafka pods restarting because they fail to connect to zookeeper. Clients like the kafka-topics CLI also had trouble connecting. No zookeeper restarts, no OOMKilled. See the first commit comment for some stack traces.