Control plane failure modes for high-availability documentation #43849
Comments
There are no sig labels on this issue. Please add an appropriate label by using one of the following commands:
/sig <group-name>
/wg <group-name>
/committee <group-name>
Please see the group list for a listing of the SIGs, working groups, and committees available. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
This issue is currently awaiting triage. If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/transfer website
that is where the k8s documentation is located.
when speaking about "majority", is this about etcd's Raft algorithm? k8s core doesn't have this requirement directly.
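For reference, here is a minimal sketch of the quorum arithmetic that "majority" usually refers to (it is etcd's Raft requirement, not something the rest of the k8s core imposes directly): an N-member etcd cluster needs floor(N/2)+1 members to commit writes.

```go
// Quorum math for an N-member etcd cluster: writes need a Raft majority,
// i.e. floor(N/2)+1 members, so the cluster tolerates N-quorum member failures.
package main

import "fmt"

func quorum(members int) int {
	return members/2 + 1
}

func main() {
	for _, n := range []int{1, 3, 5} {
		q := quorum(n)
		fmt.Printf("members=%d quorum=%d tolerated failures=%d\n", n, q, n-q)
	}
}
```

So a three-member control plane keeps committing writes with one member down, but loses write availability (and leader election) once two are down; I believe the API server can still serve some cached reads in that state, but that is exactly the kind of behaviour the docs should pin down.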
/kind feature
It'd be good to understand the gaps: what should https://kubernetes.io/docs/setup/production-environment/tools/kubeadm/high-availability/ cover that it doesn't?
/close
the ticket has missing information; questions were not answered.
@neolit123: Closing this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Hi all, really sorry for the delay in elaborating on this issue! The context is that my team is working on Kubernetes reliability (as part of a product) and we want to understand the failure modes of the control plane. I had a chat with Han Kang about this offline, and I wanted to amend the details of this issue with what I think is missing based on our conversation, but I wanted to review the links you all sent first to see if I was missing something. @sftim thank you very much for sending it over! The part I want most is the expected restrictions when one or more nodes of the control plane are down. We're currently working with a setup that treats HA as three control plane nodes, so we were trying to understand the consequences of:
So what I was asking was "what Kubernetes customers can expect in case of failure of their control plane nodes". Let me know if this makes sense, and sorry again for the delay.
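To make "what customers can expect" concrete, here is a rough, hypothetical sketch (not something from the existing docs) that probes the kube-apiserver's /readyz endpoint through client-go; the admin.conf path is the kubeadm default and just an assumption here. Running it against each surviving API server endpoint shows which readiness checks, including the etcd check, still pass while a control plane node is down.

```go
// Hypothetical sketch: ask a kube-apiserver for its verbose /readyz report,
// which lists each readiness check (etcd, informer sync, ...) individually.
package main

import (
	"context"
	"fmt"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Assumption: kubeadm's default admin kubeconfig; adjust for your setup.
	config, err := clientcmd.BuildConfigFromFlags("", "/etc/kubernetes/admin.conf")
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// "verbose" makes the endpoint enumerate every individual check.
	body, err := clientset.Discovery().RESTClient().
		Get().
		AbsPath("/readyz").
		Param("verbose", "true").
		DoRaw(context.TODO())
	if err != nil {
		fmt.Println("apiserver not ready:", err)
		return
	}
	fmt.Println(string(body))
}
```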
what you are talking about makes sense. @royalsflush please include more detail in the OP post: i don't mind us including more documentation about failures and recovery of the CP, as the documentation is lacking.
/reopen
@neolit123: Reopened this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/sig architecture
Please revise (edit) the original issue description @royalsflush to explain what you want added to the documentation. You could write this as a user story or as a definition of done.
/assign
(I can take this, if y'all don't mind)
Thanks @logicalhan. These things are important.
/triage accepted
I would add that we ideally ought to cover some of the less common situations too. I'll outline some below. What I hope is that someone carefully reading the docs can answer what the expected outcome is, without actually setting up a cluster or reading any source code. Eg:
I'm sure we could think up more; maybe we even have a list already? We can produce - and publish - docs without meeting this ideal; I've mentioned it so we understand where we'd like to end up.
Additional scenarios:
I may group answers based on local or remote etcd hosts, since the answers are likely skewed to that distinction anyway.
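Whichever grouping we pick (stacked/local etcd vs. an external etcd cluster), the per-scenario answers mostly hinge on which etcd members are reachable and which one is leading. A rough sketch with the etcd v3 Go client, using made-up endpoints; a real kubeadm stacked-etcd setup would additionally need the TLS client certificates under /etc/kubernetes/pki/etcd:

```go
// Hypothetical sketch: query each etcd endpoint's status to see which members
// respond and which member currently holds Raft leadership.
package main

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	// Assumption: illustrative endpoints; substitute the real member URLs
	// and add TLS config for a production cluster.
	endpoints := []string{
		"https://10.0.0.1:2379",
		"https://10.0.0.2:2379",
		"https://10.0.0.3:2379",
	}
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   endpoints,
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	for _, ep := range endpoints {
		ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
		status, err := cli.Status(ctx, ep)
		cancel()
		if err != nil {
			fmt.Printf("%s: unreachable: %v\n", ep, err)
			continue
		}
		fmt.Printf("%s: isLeader=%v raftTerm=%d\n",
			ep, status.Leader == status.Header.MemberId, status.RaftTerm)
	}
}
```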
These questions need not appear in the page; you could think of them as unit tests for the docs. In other words, if a reviewer picks a question, can they - just by reading what's in the page - work out what the answer must be? (We could even ask a large language model to help us check.)
I dig the framing.
#43903 feels slightly relevant (only slightly, though). I don't know how much we want to also cover upgrades and how they impact failure modes.
+1 to cover upgrades and rollback. in KEP PRRs we require "downgradability" of k8s features, but etcd by design does not support downgrade well yet. kubeadm as a whole also does not support downgrades; it supports rollback in case of component failure, but that may or may not work, depending on:
it's a bug in kubeadm's api-machinery usage, and the etcd upgrade failure will trigger a rollback unless the user works around it.
+1 @sftim, can you reshare the docs for the gaps?
I don't understand what you'd like me to do here @kumarankit999. How would you know when I'd done what you're asking (can you frame it as a definition of done)? If you mean #43849 (comment), I was the person who asked the question, and I do not have the answer to it.
This issue has not been updated in over 1 year, and should be re-triaged. You can:
- Confirm that this issue is still relevant with /triage accepted (org members only)
- Close this issue with /close
For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/
/remove-triage accepted
We likely need some brief documentation on what customers can expect in terms of the reliability of the control plane. We discussed the "majority" vs "less than majority" buckets of problems; it would be great to have documentation that we can point to in order to justify our reliability stance.