-
I do not see anything that would be wrong. You get an exact description of the issue, and the cluster is not considered Ready because it is not 100% ready. The actual impact might differ depending on your configuration on the Kafka side, on the client side, on which topics each client is using, and so on, so it is impossible to say exactly what a single broker failure means. You also say that "the endpoints are retired and healthy brokers are unreachable, isolating the entire cluster", but Strimzi does not do anything like that.
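For anyone trying to find that "exact description": it is reported in the `status.conditions` of the `Kafka` custom resource. A minimal way to inspect it, assuming a cluster named `my-cluster` in namespace `kafka` (both are placeholders, adjust to your deployment):

```sh
# Placeholders: cluster "my-cluster", namespace "kafka"; adjust to your deployment.
# The Ready/NotReady condition in the Kafka resource status carries the reason and message.
kubectl get kafka my-cluster -n kafka -o yaml

# Or print just the condition messages:
kubectl get kafka my-cluster -n kafka -o jsonpath='{.status.conditions[*].message}'
```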
-
Bug Description
I was upgrading the operator (0.35.0 -> 0.36.1) when I ran into an issue where the operator took the entire cluster down.
Before the upgrade, `terminationGracePeriodSeconds` was not set (so the default of 30s applied). The rolling update starts with the first node, which unfortunately does not have time to stop gracefully before it is killed, and it has to run recovery on restart. `operationTimeoutMs` was set to 2h. The operator eventually times out waiting for the pod to be ready and sets `FatalError` on the Kafka resource. As a consequence, the endpoints are retired and healthy brokers are unreachable, isolating the entire cluster.
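For context, a minimal sketch of where the grace period is configured, assuming a standard Strimzi `Kafka` custom resource; the cluster name, replica counts, listener, and storage settings below are placeholders, not my actual configuration:

```yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: my-cluster            # placeholder name
spec:
  kafka:
    replicas: 3
    listeners:
      - name: plain
        port: 9092
        type: internal
        tls: false
    storage:
      type: ephemeral
    template:
      pod:
        # Kubernetes default is 30s; a loaded broker may need longer to shut down
        # cleanly, otherwise it is killed mid-shutdown and runs log recovery on restart.
        terminationGracePeriodSeconds: 120
  zookeeper:
    replicas: 3
    storage:
      type: ephemeral
```

The timeout referred to as `operationTimeoutMs` is, as far as I understand, the Cluster Operator's `STRIMZI_OPERATION_TIMEOUT_MS` environment variable (or the corresponding Helm chart value), so 2h means the operator waits up to two hours for the rolled pod to become ready before giving up.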
This is somewhat related to #5263.
Steps to reproduce
No response
Expected behavior
The operator would not set the Kafka instance to `NotReady` if a single broker has issues.
Strimzi version
0.36.1
Kubernetes version
1.28
Installation method
HelmRelease via flux v1
Infrastructure
OKD 4.15.0-0.okd-2024-03-10-010116 (OpenShift upstream)
Configuration files and logs
No response
Additional context
No response