Redis can't recover after node is down with redis-replicas + HA sentinel #1052

Open
sho34215 opened this issue Aug 27, 2024 · 1 comment
Labels: bug

@sho34215

redis-operator version:
V0.18.0

Does this issue reproduce with the latest release?
Yes

What operating system and processor architecture are you using (kubectl version)?
Ubuntu 22 with Kubernetes (k3s via k3d)

kubectl version Output
$ kubectl version
Client Version: v1.28.2
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.30.3+k3s1
WARNING: version difference between client (1.28) and server (1.30) exceeds the supported minor version skew of +/-1

What did you do?

Create local cluster with k3d

k3d cluster create redis-operator-ot-container --servers 3 --agents 3
kubectl taint nodes k3d-redis-operator-ot-container-server-0 node-role.kubernetes.io/master=:NoSchedule
kubectl taint nodes k3d-redis-operator-ot-container-server-1 node-role.kubernetes.io/master=:NoSchedule
kubectl taint nodes k3d-redis-operator-ot-container-server-2 node-role.kubernetes.io/master=:NoSchedule
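
Optionally, confirm the taints were applied before moving on:

kubectl describe nodes k3d-redis-operator-ot-container-server-0 \
  k3d-redis-operator-ot-container-server-1 \
  k3d-redis-operator-ot-container-server-2 | grep Taints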

Create namespace

kubectl create namespace redis-dev-ot-operator

Configure Helm repositories

helm repo add ot-helm https://ot-container-kit.github.io/helm-charts/
helm repo update

Install the OT redis-operator

helm upgrade redis-operator ot-helm/redis-operator --install --namespace redis-dev-ot-operator
helm test redis-operator --namespace redis-dev-ot-operator

Create redis sentinel

helm upgrade redis-sentinel ot-helm/redis-sentinel --install --namespace redis-dev-ot-operator

Create redis replication

helm upgrade redis-replication ot-helm/redis-replication --install --namespace redis-dev-ot-operator

At the end you should have 3 agent nodes, each running 1 sentinel and 1 replica. One replica is the master and the others are slaves.
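
To verify that layout, the pod-to-node spread can be checked with:

kubectl get pods -n redis-dev-ot-operator -o wide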

Then look up the name of the agent node where the Redis master is deployed.
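
For example (a sketch; the sentinel pod name below is assumed to follow the chart's StatefulSet naming, e.g. redis-sentinel-sentinel-0):

# Ask a sentinel which IP it currently considers the master (master name myMaster and port 26379 as in the log below)
kubectl exec -n redis-dev-ot-operator redis-sentinel-sentinel-0 -c redis-sentinel-sentinel -- \
  redis-cli -p 26379 SENTINEL get-master-addr-by-name myMaster
# Match the returned IP against the pod list to find the node
kubectl get pods -n redis-dev-ot-operator -o wide | grep <returned-ip>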

Test a chaos scenario where the node hosting the master goes down:

kubectl drain <node>
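
On a cluster running DaemonSets or pods with emptyDir volumes, drain typically needs extra flags (assumed to apply to this k3d setup as well):

kubectl drain <node> --ignore-daemonsets --delete-emptydir-data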

What did you expect to see?
The sentinel and replica pods are redeployed, and the sentinels run a new master election.
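
For reference, a successful failover should show up in the sentinel log as a +switch-master event; a sketch of how to watch for it (the pod name is a placeholder):

kubectl logs -n redis-dev-ot-operator <sentinel-pod> -c redis-sentinel-sentinel | grep -E '\+sdown|\+odown|\+switch-master'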

What did you see instead?
The sentinel and replica pods are redeployed onto a node that already has sentinel and replica instances.
As a result, no master election happens and each sentinel is stuck.
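
For illustration only: the co-scheduling described above is what a required podAntiAffinity rule would normally prevent. The values sketch below assumes the redis-replication chart exposes an affinity value and that the pods carry an app: redis-replication label; neither assumption has been checked against the ot-helm charts.

cat > replication-values.yaml <<'EOF'
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: redis-replication   # assumed pod label; check with: kubectl get pods --show-labels
        topologyKey: kubernetes.io/hostname
EOF
helm upgrade redis-replication ot-helm/redis-replication --install \
  --namespace redis-dev-ot-operator -f replication-values.yaml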

Below is the log from one of the sentinels:

Defaulted container "redis-sentinel-sentinel" out of: redis-sentinel-sentinel, redis-exporter
Running sentinel without TLS mode
ACL_MODE is not true, skipping ACL file modification
Starting  sentinel service .....
7:X 27 Aug 2024 09:41:07.340 * oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
7:X 27 Aug 2024 09:41:07.340 * Redis version=7.2.1, bits=64, commit=00000000, modified=0, pid=7, just started
7:X 27 Aug 2024 09:41:07.340 * Configuration loaded
7:X 27 Aug 2024 09:41:07.341 * monotonic clock: POSIX clock_gettime
7:X 27 Aug 2024 09:41:07.345 # Failed to write PID file: Permission denied
7:X 27 Aug 2024 09:41:07.345 * Running mode=sentinel, port=26379.
7:X 27 Aug 2024 09:41:07.353 * Sentinel new configuration saved on disk
7:X 27 Aug 2024 09:41:07.353 * Sentinel ID is 772d1d4234446162e55d26c4472ad3e5b2d52f28
7:X 27 Aug 2024 09:41:07.353 # +monitor master myMaster 10.42.3.4 6379 quorum 2
7:X 27 Aug 2024 09:41:12.352 # +sdown master myMaster 10.42.3.4 6379
sho34215 added the bug label on Aug 27, 2024
@Jamesits

Had a very similar incident after a node autoscaling event followed by a pod rebalance. One thing worth noting is that my Redis replication nodes (the non-master ones) printed a lot of log lines like this:

2024-11-16T05:42:17.116997335Z 1:S 16 Nov 2024 05:42:17.116 * Connecting to MASTER :6379
2024-11-16T05:42:17.117026707Z 1:S 16 Nov 2024 05:42:17.116 # Unable to connect to MASTER: Invalid argument
2024-11-16T05:42:18.119024659Z 1:S 16 Nov 2024 05:42:18.118 * Connecting to MASTER :6379
2024-11-16T05:42:18.119048774Z 1:S 16 Nov 2024 05:42:18.118 # Unable to connect to MASTER: Invalid argument
2024-11-16T05:42:19.120973784Z 1:S 16 Nov 2024 05:42:19.120 * Connecting to MASTER :6379
2024-11-16T05:42:19.121016883Z 1:S 16 Nov 2024 05:42:19.120 # Unable to connect to MASTER: Invalid argument
2024-11-16T05:42:20.123027140Z 1:S 16 Nov 2024 05:42:20.122 * Connecting to MASTER :6379

While the master Redis node prints:

2024-11-16T01:33:11.144362406Z Setting up redis in standalone mode
2024-11-16T01:33:11.144763663Z Running without TLS mode
2024-11-16T01:33:11.144770083Z ACL_MODE is not true, skipping ACL file modification
2024-11-16T01:33:11.144772981Z Starting redis service in standalone mode.....
2024-11-16T01:33:11.150182784Z 1:C 16 Nov 2024 01:33:11.149 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
2024-11-16T01:33:11.150194669Z 1:C 16 Nov 2024 01:33:11.150 # Redis version=7.0.15, bits=64, commit=00000000, modified=0, pid=1, just started
2024-11-16T01:33:11.150198672Z 1:C 16 Nov 2024 01:33:11.150 # Configuration loaded
2024-11-16T01:33:11.150548785Z 1:M 16 Nov 2024 01:33:11.150 * monotonic clock: POSIX clock_gettime
2024-11-16T01:33:11.151035317Z 1:M 16 Nov 2024 01:33:11.150 * Running mode=standalone, port=6379.
2024-11-16T01:33:11.151042926Z 1:M 16 Nov 2024 01:33:11.151 # Server initialized
2024-11-16T01:33:11.156624576Z 1:M 16 Nov 2024 01:33:11.156 * Reading RDB base file on AOF loading...
2024-11-16T01:33:11.156637212Z 1:M 16 Nov 2024 01:33:11.156 * Loading RDB produced by version 7.0.15
2024-11-16T01:33:11.156640260Z 1:M 16 Nov 2024 01:33:11.156 * RDB age 119424 seconds
2024-11-16T01:33:11.156643125Z 1:M 16 Nov 2024 01:33:11.156 * RDB memory usage when created 5.28 Mb
2024-11-16T01:33:11.156689047Z 1:M 16 Nov 2024 01:33:11.156 * RDB is base AOF
2024-11-16T01:33:11.181245948Z 1:M 16 Nov 2024 01:33:11.181 * Done loading RDB, keys loaded: 621, keys expired: 0.
2024-11-16T01:33:11.181282094Z 1:M 16 Nov 2024 01:33:11.181 * DB loaded from base file appendonly.aof.7.base.rdb: 0.027 seconds
2024-11-16T01:33:12.567363747Z 1:M 16 Nov 2024 01:33:12.567 * DB loaded from incr file appendonly.aof.7.incr.aof: 1.386 seconds
2024-11-16T01:33:12.567387139Z 1:M 16 Nov 2024 01:33:12.567 * DB loaded from append only file: 1.413 seconds
2024-11-16T01:33:12.567390607Z 1:M 16 Nov 2024 01:33:12.567 * Opening AOF incr file appendonly.aof.7.incr.aof on server start
2024-11-16T01:33:12.567394224Z 1:M 16 Nov 2024 01:33:12.567 * Ready to accept connections
2024-11-16T01:34:12.099062295Z 1:M 16 Nov 2024 01:34:12.098 * 10000 changes in 60 seconds. Saving...
2024-11-16T01:34:12.099330444Z 1:M 16 Nov 2024 01:34:12.099 * Background saving started by pid 57
2024-11-16T01:34:12.123990362Z 57:C 16 Nov 2024 01:34:12.123 * DB saved on disk
2024-11-16T01:34:12.124270877Z 57:C 16 Nov 2024 01:34:12.124 * Fork CoW for RDB: current 0 MB, peak 0 MB, average 0 MB
2024-11-16T01:34:12.199655561Z 1:M 16 Nov 2024 01:34:12.199 * Background saving terminated with success

It looks like none of the Redis instances is receiving the correct replication config; note the empty master host in the "Connecting to MASTER :6379" lines above.
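
A quick way to check that on each data pod (pod name and namespace are placeholders):

kubectl exec -n <namespace> <redis-replication-pod> -- \
  redis-cli -p 6379 INFO replication | grep -E 'role:|master_host:|master_link_status:'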
