Cluster is not recovering after failure "Could not find healthy leader ip, aborting remove and delete operation..." #150
Comments
Hi, the cluster API exposes entry points like
General note: clusters that have only masters without followers are at risk of losing sync between their configurations. It is something we noticed not long ago, and we are still looking for a good method to detect and handle it properly.

I looked into the log file; it seems like the cluster lost quorum, and this is a point of failure for the cluster. The way to detect it is to see if a watcher terminal for

We are working on a good way to trigger auto-reset for those cases. Currently we are not convinced we are able to distinguish well between cases that can be recovered and cases that cannot, which is why we are enabling the
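For illustration only, here is a minimal Go sketch (not the operator's actual code) of one way to watch for the quorum-loss condition described above: it polls `CLUSTER INFO` through the go-redis client and checks the `cluster_state` field. The pod address, polling interval, and all names are assumptions for the example, not taken from the project.

```go
// Minimal sketch: poll CLUSTER INFO and report whether this node still
// considers the cluster healthy (quorum held, all slots covered).
package main

import (
	"context"
	"fmt"
	"strings"
	"time"

	"github.com/redis/go-redis/v9"
)

func clusterHealthy(ctx context.Context, rdb *redis.Client) (bool, error) {
	info, err := rdb.ClusterInfo(ctx).Result()
	if err != nil {
		return false, err // an unreachable node is itself a signal
	}
	// CLUSTER INFO returns CRLF-separated lines such as "cluster_state:ok";
	// anything other than "ok" means the node sees the cluster as down
	// (e.g. lost quorum or uncovered slots).
	for _, line := range strings.Split(info, "\r\n") {
		if strings.HasPrefix(line, "cluster_state:") {
			return strings.TrimPrefix(line, "cluster_state:") == "ok", nil
		}
	}
	return false, fmt.Errorf("cluster_state not found in CLUSTER INFO output")
}

func main() {
	// Address is an assumed example, not a real pod name from the project.
	rdb := redis.NewClient(&redis.Options{Addr: "redis-node-0:6379"})
	ctx := context.Background()
	for range time.Tick(5 * time.Second) {
		ok, err := clusterHealthy(ctx, rdb)
		fmt.Printf("healthy=%v err=%v\n", ok, err)
	}
}
```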
Thanks Natali! Anyway, a question: are there any operator configurations that control when the operator detects "Lost nodes detcted on some of cluster nodes" (e.g. retries, timeouts, etc.)?
Regarding the image with the long list of keyslot ranges: when the cluster loses its handle on some of the slots (a reshard that has been interrupted, or loss of a master and all of its followers, for example), the operator changes the cluster state to "Fix", which will "trap" the reconcile loop into applying

We will see the log line that implies loss of a redis node every time the "cluster view" (a map that lists each of the pods existing in the cluster) differs from the "state view" (a map that lists each of the pods theoretically expected according to the spec). When a node that is expected to be a leader is missing from the cluster view, it triggers a search for a redis node that appears in the cluster view and has a leader name equal to the leader from the state map that we know is missing; if none is found in that search, it declares "loss of master and all of its followers" to the logs. Sometimes this is not the real case (for example: we have only masters, and during the upgrade process the rolled pod starts to reshard its slots to another redis node; it is kept in the state map but deleted as an actual pod), but we do declare it as a case of losing a pod set, as it requires the same handling as a real loss, and there is currently no other good way to distinguish the cases. In any case, running fix and rebalance after such an operation is always recommended, as it is a sensitive process.

At the heart of the idea of making the operator apply "self recovery / self maintenance", we purposely did not implement a retry mechanism per operation; we save the state and attempt to perform the mitigation in the next reconcile loop. A good example that helps to understand the rationale behind this is when a loss of node(s) leads to misalignment between node tables: at that point it does not matter how much we retry, we cannot add new nodes until the tables are cleared of non-responsive nodes and the cluster is fixed and rebalanced, with proper waiting for the nodes to agree on the new configuration. This routine can only be guaranteed before the next attempt to add a node if the next reconcile loop is triggered.

So, as we see it, we don't want to follow a hard rule of "trigger an SRE for any case of 3 failures in a row", as it could be a case that can be fixed on its own within a finite number of reconcile loops; but we also don't want to stay blind to a case where the operator and cluster are unable to come back to a healthy state. This is why we manage the counter of
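To make the "cluster view" vs "state view" comparison above concrete, here is an illustrative Go sketch. The types, field names (`Node`, `LeaderName`), and the convention that a leader is a node whose leader name equals its own name are assumptions for the example, not the operator's actual data structures.

```go
// Illustrative sketch only: flag expected leaders for which neither the
// leader itself nor any node reporting it as leader appears in the
// observed cluster view ("loss of master and all of its followers").
package main

import "fmt"

// Node describes a redis pod; LeaderName names the leader a follower
// replicates (hypothetical fields, assumed for this example).
type Node struct {
	Name       string
	LeaderName string
}

func findLostLeaderSets(stateView, clusterView map[string]Node) []string {
	var lost []string
	for name, expected := range stateView {
		if expected.LeaderName != name {
			continue // only inspect nodes expected to be leaders
		}
		if _, ok := clusterView[name]; ok {
			continue // leader pod is present in the cluster view
		}
		// Leader is missing; look for any observed node that follows it.
		replacementFound := false
		for _, observed := range clusterView {
			if observed.LeaderName == name {
				replacementFound = true
				break
			}
		}
		if !replacementFound {
			lost = append(lost, name)
		}
	}
	return lost
}

func main() {
	stateView := map[string]Node{
		"redis-leader-0": {Name: "redis-leader-0", LeaderName: "redis-leader-0"},
		"redis-leader-1": {Name: "redis-leader-1", LeaderName: "redis-leader-1"},
	}
	clusterView := map[string]Node{
		"redis-leader-0": {Name: "redis-leader-0", LeaderName: "redis-leader-0"},
		// redis-leader-1 and all of its followers are gone
	}
	fmt.Println(findLostLeaderSets(stateView, clusterView)) // [redis-leader-1]
}
```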
The cluster is not recovering; the operator is stuck in a loop looking for a leader. Only one node is left out of 3 (no followers). See the attached operator log:
cluster_failure.log
Steps taken (not shown in the log) that didn't help:
Any idea?
@voltbit @NataliAharoniPayu