-
I do not see anything that would be wrong. You get an exact description of the issue, and the cluster is not considered Ready because it is not 100% ready. The actual impact might differ depending on your configuration on the Kafka side, on the client side, on which topics each client is using, and so on, so it is impossible to say exactly what a single broker failure means. You also say that "the endpoints are retired and healthy brokers are unreachable, isolating the entire cluster", but Strimzi does not do anything like that.
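For anyone trying to find that "exact description": it is reported in the `status.conditions` of the `Kafka` custom resource. A minimal way to inspect it, assuming a cluster named `my-cluster` in namespace `kafka` (both are placeholders, adjust to your deployment):

```sh
# Placeholders: cluster "my-cluster", namespace "kafka"; adjust to your deployment.
# The Ready/NotReady condition in the Kafka resource status carries the reason and message.
kubectl get kafka my-cluster -n kafka -o yaml

# Or print just the condition messages:
kubectl get kafka my-cluster -n kafka -o jsonpath='{.status.conditions[*].message}'
```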
-
Bug Description
I was upgrading the operator (0.35.0 -> 0.36.1) when I ran into an issue where the operator took the entire cluster down.
Before the upgrade, `terminationGracePeriodSeconds` was not set (so the default of 30s applied). The rolling update starts with the first node, which unfortunately does not have time to stop gracefully before it is killed, and it has to run recovery on restart. `operationTimeoutMs` was set to 2h. The operator eventually times out waiting for the pod to be ready and sets `FatalError` on the Kafka resource. As a consequence, the endpoints are retired and healthy brokers are unreachable, isolating the entire cluster.
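For context, a minimal sketch of where the grace period is configured, assuming a standard Strimzi `Kafka` custom resource; the cluster name, replica counts, listener, and storage settings below are placeholders, not my actual configuration:

```yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: my-cluster            # placeholder name
spec:
  kafka:
    replicas: 3
    listeners:
      - name: plain
        port: 9092
        type: internal
        tls: false
    storage:
      type: ephemeral
    template:
      pod:
        # Kubernetes default is 30s; a loaded broker may need longer to shut down
        # cleanly, otherwise it is killed mid-shutdown and runs log recovery on restart.
        terminationGracePeriodSeconds: 120
  zookeeper:
    replicas: 3
    storage:
      type: ephemeral
```

The timeout referred to as `operationTimeoutMs` is, as far as I understand, the Cluster Operator's `STRIMZI_OPERATION_TIMEOUT_MS` environment variable (or the corresponding Helm chart value), so 2h means the operator waits up to two hours for the rolled pod to become ready before giving up.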
This is somewhat related to #5263.
Steps to reproduce
No response
Expected behavior
The operator would not set the Kafka instance to `NotReady` if a single broker has issues.
Strimzi version
0.36.1
Kubernetes version
1.28
Installation method
HelmRelease via flux v1
Infrastructure
OKD 4.15.0-0.okd-2024-03-10-010116 (OpenShift upstream)
Configuration files and logs
No response
Additional context
No response