Design and infrastructure

Natali Aharoni edited this page Jul 11, 2022 · 33 revisions

The Redis operator is a state machine application that automatically triggers Redis cluster maintenance routines for as long as it is running. Once every 20 seconds (or immediately when a change in the cluster appears), it samples the status of the Redis cluster pods, determines whether an unusual event has occurred that requires action, and changes its state accordingly to one of the following: Initializing/Resetting, Ready/Healthy, Recovering, Upgrading or Scaling.

[Figure: stateMachine]
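The sampling behaviour described above (a fixed 20-second interval, cut short when a cluster change is noticed) can be illustrated with a minimal Python sketch. The names `ClusterState`, `wait_for_next_cycle` and `change_detected` are hypothetical and only serve the illustration; the operator's actual implementation is not shown on this page.

```python
import threading
from enum import Enum


class ClusterState(Enum):
    # The five states named in the text above.
    INITIALIZING = "Initializing/Resetting"
    READY = "Ready/Healthy"
    RECOVERING = "Recovering"
    UPGRADING = "Upgrading"
    SCALING = "Scaling"


RECONCILE_INTERVAL_SECONDS = 20  # interval stated in the text


def wait_for_next_cycle(change_detected: threading.Event,
                        interval: float = RECONCILE_INTERVAL_SECONDS) -> str:
    """Block until a cluster change is signalled or the interval elapses.

    Returns "change" when woken early by the event, "timer" otherwise.
    """
    triggered = change_detected.wait(timeout=interval)
    change_detected.clear()  # re-arm the event for the next cycle
    return "change" if triggered else "timer"
```

A watcher thread (an assumption of this sketch) would call `change_detected.set()` when it observes a pod change, which wakes the loop before the 20 seconds are over.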

RedisCluster custom resource

The RedisCluster CR is similar to a StatefulSet - it describes the layout of the Redis cluster in terms of leader and replica node count.

Recovery flow

The operator can automatically detect a set of potentially harmful scenarios, confirm each one case by case, estimate the potential damage, report it in the logs, trigger a custom mitigation step, and assess whether the cluster is ready to be recovered to its expected state. Eventually it runs a recovery flow that relies on the cluster spec as it was before the event that led to the harmful scenario (in terms of the expected number of leaders and followers in the cluster).

The goal of this flow is to heal the cluster and restore it to the last form that was reported in the spec as optimal for the handled traffic loads.

Supported scenarios:

  • Loss of single or multiple followers
  • Loss of single or multiple leaders (can be healed as long as quorum is kept)
  • Loss of several followers and leaders at the same time (can be healed as long as quorum is kept)
  • Loss of a leader and all of its followers (if this happens to multiple replica sets, it can be healed as long as quorum is kept)
  • Loss of slot coverage due to a failed resharding process (interrupted slot movement between leaders)
  • Loss of agreement between the nodes' configuration tables (for example, a node that refuses to forget a lost node or to accept a new leader in the cluster)
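Several of the scenarios above are conditioned on quorum being kept. The page does not say how the operator evaluates quorum, so the following is only a sketch under the assumption that quorum means a strict majority of leaders is still reachable, which is the majority rule Redis Cluster itself uses for failover. The function name is hypothetical.

```python
def quorum_kept(total_leaders: int, reachable_leaders: int) -> bool:
    """Return True when a strict majority of leaders is still reachable.

    Assumption: quorum == strict majority of leaders. The operator's
    real check may differ.
    """
    return 2 * reachable_leaders > total_leaders
```

Under this assumption a 3-leader cluster tolerates the loss of one leader, while a 4-leader cluster that loses two no longer has a majority.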

Cluster loss of balance is handled with the following flow:

[Figure: stateMachine]

Recovery flow algorithm:

[Figure: stateMachine]

Node lifecycle

A Redis cluster is defined by a set of nodes that are aware of each other and share a well-defined topology. The topology dictates the responsibility ranges over the data slots, and a hierarchy that determines which nodes are allowed to write data and serve as the source that other nodes replicate from and sync with.

  • Master node in cluster: a node that holds responsibility over a slot range and is allowed to write data. As such, it is marked as one of the 'dictating' nodes (other nodes need to make sure they are synced with it and that their configuration tables are aligned with it).
  • Follower node in cluster: a node that holds responsibility over a slot range only for reading the data. As such, it is marked as a node that needs to continuously sync with the master node it is linked to. If a master node stops responding, an election is performed by the other masters in the cluster, and one of its followers becomes master itself.
  • Replica node in cluster: a node that is part of a set/group of replicas, consisting of one master and zero or more followers, all of which have access to (hold responsibility for) the same range of slots (data).
  • Leader: a replica node in the Redis cluster that the operator considers the node that should hold the 'master' title the vast majority of the time. If that is not its current status and the node is healthy, the operator will continuously trigger failover to this node so that it reclaims the 'master' title. Another replica in the set will be titled temporary master only if the leader node has stopped responding, is currently being re-created, or is being deprecated and upgraded.
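The difference between the operator's 'leader' designation and the Redis 'master' role can be expressed as a small predicate. This is an illustrative sketch only: `ReplicaNode` and `needs_failover` are hypothetical names, not the operator's API.

```python
from dataclasses import dataclass


@dataclass
class ReplicaNode:
    name: str
    is_leader: bool  # designation assigned by the operator
    is_master: bool  # role the node currently holds in the Redis cluster
    healthy: bool


def needs_failover(node: ReplicaNode) -> bool:
    """A healthy leader that is not currently master should be promoted back."""
    return node.is_leader and node.healthy and not node.is_master
```

A leader that is unhealthy or being re-created is left alone, so a temporary master keeps the title until the leader is back.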

The operator manages the creation of Redis nodes using one of two flows:

Creation of new follower

Consider an existing (successfully initiated) Redis cluster in which one of the nodes suddenly stopped responding or lost its connection.

This lost node is part of a replica set in the Redis cluster, and it holds one of the titles: master or follower. If it is the only replica in the replica set, the flow is considered a recovery that requires creation of a new leader (see next section).

Otherwise, it is one of several replicas in the set, and it doesn't matter whether the node was a master or a follower, as the creation process is identical in both cases up to the sync state.

This recovery process requires creating a follower node that will replicate from the existing master in the set (either the original one or a temporary one). If this node is marked as a leader by the operator, then in addition to all the follower creation steps, an extra step fails over to the node so that it claims the 'master' title. This flow can also be triggered by a request to scale up followers in the cluster.
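The choice between the two creation flows, as described in this section, can be summarised in a short sketch. `CreationFlow` and `plan_recreation` are hypothetical names introduced for illustration; the inputs are assumed to be known from the operator's view of the replica set.

```python
from enum import Enum
from typing import Tuple


class CreationFlow(Enum):
    NEW_LEADER = "create new leader"
    NEW_FOLLOWER = "create new follower"


def plan_recreation(surviving_replicas_in_set: int,
                    is_designated_leader: bool) -> Tuple[CreationFlow, bool]:
    """Return (flow to run, whether a failover to the new node follows sync)."""
    if surviving_replicas_in_set == 0:
        # The lost node was the only replica in its set: nothing to
        # replicate from, so the new-leader flow is used.
        return CreationFlow.NEW_LEADER, False
    # Some replica survived and acts as (possibly temporary) master.
    # The node is re-created as a follower, then promoted if the
    # operator marks it as the leader of the set.
    return CreationFlow.NEW_FOLLOWER, is_designated_leader
```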

Creation of new leader

Creating a new leader is basically a flow of 'scaling up' the number of leaders in the cluster, i.e. creating a node that will take part in the next rebalance operation and, at the end of it, will hold responsibility over a range of slots, with the ability to populate them with data. This flow is triggered in two cases:

  • Loss of a complete replica set, master and all of its followers (zero or more)
  • Scale up cluster

[Figure: stateMachine]

Upgrade flow

The operator supports rolling updates.

A rolling restart can be triggered:

  • When a change in the resource metadata appears
  • Manually, on demand, using the operator server entry point /upgrade

Upgrade/Update flow algorithm:

  • Get list of cluster pods
  • Retrieve the number of nodes allowed to be handled at once in each cycle (in small clusters it is determined by heuristics, and in large clusters it can be determined by the user; see the configuration file documentation)
  • Scan for not-up-to-date leaders
  • For each leader that is not up to date: as long as the number of nodes handled in this cycle has not reached the maximum allowed, fail over the leader to one of its replicas or, if there are no other replicas in the set, reshard the leader (move all slots to another healthy leader); then have all other healthy nodes forget the node, and delete the pod.
  • If the number of handled nodes reached the maximum allowed, return and set the state to [Recovery]
  • Recover all lost nodes by re-creating them; all the new nodes are now up to date.
  • If there are still leaders that are not up to date, continue the same routine
  • Otherwise, perform the same process with the followers (failover/reshard will be triggered only if the replica was mistakenly left as master. This is not supposed to happen, but the case is handled if it does; in any case it will be fixed in the recovery reconcile loops).
  • If no not-up-to-date nodes are left, declare a successful upgrade and set the state to recovery. If the recovery process ends without handling any scenario, the state will be declared as [Ready].
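The leader part of one upgrade cycle can be sketched as follows. This is a simplified illustration of the steps listed above, not the operator's code: the `Leader` type, its fields, and the action labels "failover" and "reshard" are hypothetical, and the forget/delete/re-create steps are left out.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Leader:
    name: str
    up_to_date: bool
    other_replicas: int  # replicas in the same set besides this node


def plan_upgrade_cycle(leaders: List[Leader],
                       max_per_cycle: int) -> List[Tuple[str, str]]:
    """Pick the leaders to handle in this cycle and the step each one needs."""
    actions: List[Tuple[str, str]] = []
    for leader in leaders:
        if leader.up_to_date:
            continue
        if len(actions) >= max_per_cycle:
            break  # limit reached: the rest waits for a later cycle
        # Fail over when another replica can take over, otherwise
        # move the slots away by resharding.
        step = "failover" if leader.other_replicas > 0 else "reshard"
        actions.append((leader.name, step))
    return actions
```

After the planned nodes are handled, the state moves to recovery, the lost nodes are re-created, and the routine repeats until the plan comes back empty.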

Performing the process reliably:

Every major change, and every trigger of a flow that intentionally terminates pods (in particular at a scheduled rate), is risky. Although the vast majority of possible disaster flows should be supported by the operator at this point, it is still recommended that the flow be supervised by someone who can access the logs and track the cluster view.

The operator manages two counters as part of its routine:

  • Number of non-healthy reconcile loops that got reported since the last healthy one

Can be used as a metric that triggers an alert when it crosses a threshold chosen to indicate a concerning state.

  • Number of healthy reconcile loops in a row since the last unhealthy one

When this counter reaches 10 healthy loops in a row, it stops being updated and a log is printed at info level: "Cluster is in finalized state". This indicates that the last process ended successfully and that we have solid ground to assume the cluster is stable.

Both can be viewed in the redis-cluster-state-map configmap, using the command

watch kubectl -n <namespace of the cluster> describe configmap redis-cluster-state-map
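The behaviour of the two counters can be modelled with a short sketch. The class and field names are hypothetical (the actual keys in the configmap are not listed on this page), and the sketch assumes the non-healthy counter resets on every healthy loop, as "since the last healthy one" suggests.

```python
from dataclasses import dataclass

FINALIZED_THRESHOLD = 10  # healthy loops in a row, as stated in the text


@dataclass
class ReconcileCounters:
    non_healthy_since_last_healthy: int = 0
    healthy_in_a_row: int = 0

    def record(self, healthy: bool) -> bool:
        """Update the counters for one reconcile loop.

        Returns True only on the loop that reaches the threshold, which
        is when the "Cluster is in finalized state" log would be printed.
        """
        if not healthy:
            self.non_healthy_since_last_healthy += 1
            self.healthy_in_a_row = 0
            return False
        self.non_healthy_since_last_healthy = 0
        if self.healthy_in_a_row >= FINALIZED_THRESHOLD:
            return False  # counter is no longer updated past the threshold
        self.healthy_in_a_row += 1
        return self.healthy_in_a_row == FINALIZED_THRESHOLD
```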

MaxSurge and MaxUnavailable options are currently not supported.

Upgrading from v1.x.x of the operator to v2.x.x

Our latest operator update required breaking changes in order to support the new scale feature efficiently.

  • Redis pods (pods that host a Redis node process) are now labeled with 'node-name' and 'leader-name' instead of 'node-number' and 'leader-number', which means the operator parses the labels differently when listing the resources for maintenance purposes. The operator implements a flow that detects the old case and informs the user about it.

After several reconcile loops, the operator will enter a healthy state by itself and will print a log line stating "Upgrade is required and can be triggered by using the entry point /upgrade".

Please trigger it only when this line appears, and not before. The upgrade can easily be run using the following commands:

kubectl -n <namespace of the cluster> port-forward <name of the *elected* manager pod> <chosen port>:8080

curl -X POST localhost:<chosen port>/upgrade

  • The recovery process was decoupled from relying on the rdc spec to estimate the required number of leaders and followers that should be present when the recovery flow reaches the desired state. Instead it relies on a state map named 'redis-cluster-state-map', which holds the theoretical cluster state data. The operator implements an automatic flow that derives the theoretical state of the cluster from the spec at any point if it cannot find the state map using the k8s API. The state map will appear automatically as soon as the new operator is deployed and elected. It is good practice to look for it and follow its form during the upgrade by running

watch kubectl -n <namespace of the cluster> describe configmap redis-cluster-state-map

  • The naming convention for Redis pods changed to 'redis-node-x' for leaders and 'redis-node-x-y' for followers, where x indicates the number of the leader the pod is related to (example: redis-node-0 is a leader, redis-node-0-1 and redis-node-0-2 are its followers). The operator will manage this change automatically by deprecating old non-matching pods and creating new pods that match the new format in their place.
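To make the new naming convention concrete, here is a small sketch that lists the pod names one would expect for a given layout. The function name is hypothetical, and the numbering (leaders from 0, followers from 1) is an assumption inferred from the example redis-node-0, redis-node-0-1, redis-node-0-2.

```python
from typing import List


def expected_pod_names(leader_count: int, followers_per_leader: int) -> List[str]:
    """List pod names following 'redis-node-x' / 'redis-node-x-y'."""
    names: List[str] = []
    for x in range(leader_count):
        names.append(f"redis-node-{x}")  # leader of replica set x
        for y in range(1, followers_per_leader + 1):
            names.append(f"redis-node-{x}-{y}")  # follower y of leader x
    return names
```

During the upgrade, pods whose names fall outside such a list would be the ones the operator deprecates and replaces.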

Scale flow

The cluster supports the following scale flows:

  • Scale up leaders
  • Scale up followers
  • Scale down leaders
  • Scale down followers

The flow can be triggered by editing the spec of the rdc resource, using the parameters 'leaderCount' (the expected number of leaders) and 'leaderFollowersCount' (the expected number of followers per leader). The rdc can be edited so that both change at the same time (leaders scale first, followers second), regardless of the direction of change (for example, you can scale up leaders and scale down followers, or vice versa; any combination is supported).

In case of failure, the operator implements a self-detecting mechanism that recovers the cluster and re-enters scale mode by itself when a scale request is incomplete (has not been finalized to the required state).
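The ordering rule for a combined change (leaders first, followers second) can be sketched as below. The function and its step strings are hypothetical and only illustrate the ordering; the parameter names mirror the 'leaderCount' and 'leaderFollowersCount' spec fields.

```python
from typing import List


def scale_steps(current_leaders: int, current_followers: int,
                leader_count: int, leader_followers_count: int) -> List[str]:
    """Return the scale steps in the order they would run."""
    steps: List[str] = []
    # Leaders always scale first.
    if leader_count != current_leaders:
        direction = "up" if leader_count > current_leaders else "down"
        steps.append(f"scale {direction} leaders: {current_leaders} -> {leader_count}")
    # Followers (per leader) scale second.
    if leader_followers_count != current_followers:
        direction = "up" if leader_followers_count > current_followers else "down"
        steps.append(
            f"scale {direction} followers: {current_followers} -> {leader_followers_count}"
        )
    return steps
```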

Troubleshooting

Among other improvements, the operator was also upgraded in terms of the rate at which it can handle multiple loss or misalignment events. This was implemented by giving the operator the ability to run some of the flows concurrently.

At the heart of an operator that continuously samples and heals the Redis cluster stands a particular paradigm: a big part of the code is written in the hope that it will never be used, but it is there in case it is needed one day. This somewhat contradicts the paradigm behind event-driven systems, which considers it good practice to keep only code that we intend to use, because a system driven by sudden events always has the potential to trigger unconventional flows, whether or not they are fully supported.

In general, there is always the risk that the operator will face a sudden event that leads it to a blocking scenario, due to an internal or external signal that blocks it from progressing.

After simulating several of those scenarios, we are introducing a new troubleshooting feature as well as suggested practices that can serve the user in case things go south. During a disaster that prevents the operator from performing properly, the operator can serve as a tool that triggers some of the flows atomically, by manual action. The list of exposed entry points:

  • /fix: will trigger cluster fix
  • /rebalance: will trigger cluster rebalance
  • /forgetLostNodes: will trigger a flow that clears non-responding cluster nodes from the cluster tables
  • /forceReconcile: will trigger a reconcile loop manually in the middle of the current action (and cancel the activity of the current one)

All of the above actions can be used for temporary manipulation of the cluster until some stability is reached. Later on, a decision to re-create the operator manager pod can be taken as an act of 'resetting the server safely' (somewhat similar to performing CPR on the cluster before resetting).

  • /reset: will terminate all Redis cluster pods and initialize a new cluster.

[Warning] This action is protected by the flag 'exposeSensitiveEntryPoints' (see the config file documentation), and will definitely lead to data loss if the cluster is populated.
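For scripted intervention, the non-destructive entry points can be wrapped in a small helper. This sketch assumes the entry points are reached the same way as /upgrade elsewhere on this page, i.e. with an HTTP POST through a local port-forward to the manager pod; the page only shows that pattern for /upgrade, so treat the method as an assumption. `MANUAL_ENTRY_POINTS` and `build_request` are hypothetical names, and /reset is deliberately left out because of the data loss warning.

```python
import urllib.request

# Non-destructive entry points listed above; /reset is excluded on purpose.
MANUAL_ENTRY_POINTS = ("/fix", "/rebalance", "/forgetLostNodes", "/forceReconcile")


def build_request(entry_point: str, local_port: int) -> urllib.request.Request:
    """Build (but do not send) a POST request for a manual entry point."""
    if entry_point not in MANUAL_ENTRY_POINTS:
        raise ValueError(f"unsupported entry point: {entry_point}")
    url = f"http://localhost:{local_port}{entry_point}"
    return urllib.request.Request(url, method="POST")
```

With a port-forward in place, the request would be sent with `urllib.request.urlopen(build_request("/fix", <chosen port>))`.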

Failing scale & AZ awareness feature

The operator implements AZ awareness by using strong (hard) anti-affinity rules that prevent Redis pods from being scheduled on the same k8s workers. The purpose is to separate leaders from each other, as well as leaders from their replicas.

This can lead to a problematic loop when not enough resources are available on the cluster: the operator will continuously try to create pods and wait for them to gain IP addresses, while on the other hand they will be stuck without being scheduled on available k8s worker nodes. The operator will keep waiting for them to be initialized / terminated without being able to perform any action that breaks this waiting loop. Handling this scenario automatically is not yet supported, and as a best practice we suggest making sure there are enough resources before deploying / scaling.
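A rough pre-flight check along these lines can be sketched as follows. It rests on a simplifying assumption that is not confirmed by this page: that the hard anti-affinity rules mean no two Redis pods can share a worker, so the cluster needs at least one schedulable worker per pod. Both function names are hypothetical, and real capacity also depends on CPU, memory and other scheduling constraints.

```python
def total_redis_pods(leader_count: int, followers_per_leader: int) -> int:
    """Each leader plus its followers makes one replica set."""
    return leader_count * (1 + followers_per_leader)


def enough_workers(schedulable_workers: int, leader_count: int,
                   followers_per_leader: int) -> bool:
    """Assumption: one Redis pod per worker because of hard anti-affinity."""
    return schedulable_workers >= total_redis_pods(leader_count, followers_per_leader)
```

For example, 3 leaders with 2 followers each would need 9 workers under this assumption, so a request to scale to that layout on 8 workers is a candidate for the waiting loop described above.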

In case the operator got stuck in the loop:

  • Scale down by editing the rdc to have the minimal possible number of nodes for the cluster (the absolute minimum for the cluster is 3 leaders and zero followers).
  • Manually delete the nodes to be removed from the redis-cluster-state-map configmap.
  • Manually delete the pods to be removed using kubectl commands.
  • Trigger /fix, /rebalance, /forgetLostNodes, /forceReconcile by using the server entry points as manual intervention until stability is reached.
  • If none of the above works, a cluster reset will be needed. It won't save the written data, but at least it can prevent the Redis cluster clients from continuing to time out while attempting to send additional data.

Limitations

  1. The operator uniquely identifies all Redis nodes by their IP addresses. This introduces the following limitations:

    1. All pods must run on the same network and have different IPs.
    2. Only one Redis instance can run on a pod.
  2. Only one custom resource is managed per namespace. Each Redis cluster needs its own operator deployment and a unique selector label for the pods.

  3. No cross-namespace support: the RedisCluster and the operator deployment must be in the same namespace.

Mandatory Kubernetes labels

Some labels are important for the functioning of the cluster and they must be present:

TODO
