
Switch to Parallel start for Kafka and Zookeeper #226

Merged: 1 commit merged into master on Nov 29, 2018

Conversation

@solsson (Contributor) commented Nov 28, 2018

Is there any reason to keep using OrderedReady with either Kafka or Zookeeper? We still get ordered rolling updates after #55. The problem with OrderedReady is that, for kafka and pzoo, if the zone holding the volume for index 0 is unavailable, none of the other pods can be restarted.
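
For reference, the setting in question is podManagementPolicy on the StatefulSet spec. A minimal sketch of the change, using illustrative names and values rather than this repo's actual manifests:

    apiVersion: apps/v1
    kind: StatefulSet
    metadata:
      name: kafka
      namespace: kafka
    spec:
      podManagementPolicy: Parallel   # previously OrderedReady (the default); pods no longer wait for lower ordinals to be Ready before starting
      serviceName: broker
      replicas: 3
      selector:
        matchLabels:
          app: kafka
      template:
        metadata:
          labels:
            app: kafka
        spec:
          containers:
          - name: broker
            image: example/kafka      # placeholder image, not the image used in this repo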

@sibtainabbas10 I noticed you used Parallel in #28 back in January. Have you seen any issues?

@solsson solsson added this to the 5.0 - Java 11 milestone Nov 28, 2018
@solsson solsson requested a review from atamon November 28, 2018 15:29
@atamon (Contributor) left a comment

The only time we tried this in production was for minio, right? Their distribution/replication strategy even required us to. I can't remember noticing anything strange in how Kubernetes behaved, though we did have quite a few crashlooping pods because the minio startup sequence was misbehaving for us.

@solsson (Contributor, Author) commented Nov 28, 2018

I can't come up with any reason to stick with OrderedReady.

I just tested applying all manifests at the same time in a fresh minikube (k apply -f zookeeper/ && k apply -f kafka/ && k apply -f kafka/test/ after configuration + rbac). It took 2:30 min for Kafka to become ready, with 3 restarts per kafka pod, and a further 1 minute for the tests to go ready.

@solsson solsson merged commit c41f9f9 into master Nov 29, 2018
@stigok commented Nov 29, 2018

But a problem arises when the StatefulSet changes and the pods are switched out. We'd definitely not want to take down the whole cluster at once. In a Deployment resource we could set maxUnavailable and maxSurge, but that won't work for a StatefulSet.

@solsson (Contributor, Author) commented Nov 29, 2018

when the StatefulSet changes and the pods are switched out

@stigok I tested that by editing memory limits and it behaved as expected: never more than one pod unready. Did you observe a different behavior?
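
For context on why updates stay one-at-a-time even with Parallel: podManagementPolicy only affects scaling and initial startup, while rolling updates are governed separately by spec.updateStrategy. A sketch of the relevant fields (the partition value is shown only for illustration):

    spec:
      podManagementPolicy: Parallel
      updateStrategy:
        type: RollingUpdate        # pods are replaced one at a time, highest ordinal first, waiting for each to become Ready
        rollingUpdate:
          partition: 0             # optional; only pods with ordinal >= partition are updated, 0 means all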

@solsson (Contributor, Author) commented Nov 30, 2018

I've verified again, on k8s 1.11.2, and the update behavior is as expected.

rolling-update-test1.txt

The kafkacat-based test did have a short unreadiness period when a partition leader was down, but only for one of the three partitions. I'm not sure how to interpret that.

test-kafkacat [0] offset 396
Last message is 10.255 old:
Test kafkacat-hrhwr@2018-11-30T08:00:21,053167482+00:00
test-kafkacat [0] offset 398
test-kafkacat [0] offset 399
test-kafkacat [0] offset 400
test-kafkacat [0] offset 401
% ERROR: offsets_for_times failed: Broker: Not leader for partition
Last message is 10.205 old:
Test kafkacat-hrhwr@2018-11-30T08:01:21,051927817+00:00
Last message is 20.207 old:
Test kafkacat-hrhwr@2018-11-30T08:01:21,051927817+00:00
test-kafkacat [0] offset 405

I've also tested the following bad patch. It results in the desired behavior that only kafka-2 goes unready, and the rollout process then stops.

         command:
-        - ./bin/kafka-server-start.sh
+        - tail
+        - -f
         - /etc/kafka/server.properties

I was positively surprised that kubectl -n kafka rollout undo statefulset kafka resolved the situation, behaving like it would for a Deployment.
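
A sketch of that recovery path; the status and history commands are added here for illustration, only the undo command comes from the comment above:

    # watch the rollout hang on the broken pod
    kubectl -n kafka rollout status statefulset kafka
    # confirm there is a previous revision to roll back to
    kubectl -n kafka rollout history statefulset kafka
    # revert to the previous revision
    kubectl -n kafka rollout undo statefulset kafka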

@stigok commented Dec 10, 2018

You are absolutely right. It worked like a charm when applying the JMX exporter patch now 👍
