Try harder to failover on recover from master loss #250

Open
applike-ss opened this issue Oct 16, 2024 · 1 comment

@applike-ss

Regarding this:

// TODO: Why does this fail every now and then?

and this:

// Should replication be continued if it fails?

We just observed this behavior, and in the logs I discovered this error: `error running SLAVE OF command: dial tcp 10.138.59.180:9999: i/o timeout`, so I assume one of the following happened:

  • network issue
  • dragonfly main/networking thread blocked
  • dragonfly crashed without killing the process

Because of this, I would like to suggest the following changes (a rough sketch follows the list):

  • Check via redis client that the operator can talk to the new master before promoting it
  • Check via redis client that the operator can talk to the (now) replicas before setting them to slaves of the new master
  • Kill the pod if the operator can't talk to it after X tries (configurable? 0 meaning do not kill it?)
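To illustrate the first two suggestions, here is a minimal Go sketch (not the operator's actual code; the `ensureReachable` name, the go-redis client and the `maxTries` parameter are assumptions for illustration) of a reachability check with a configurable retry count that the operator could run before promoting a master or re-pointing a replica:

```go
// Hypothetical sketch: verify a pod's Dragonfly endpoint answers PING
// before promoting it or issuing SLAVE OF against it.
package failover

import (
	"context"
	"fmt"
	"time"

	"github.com/redis/go-redis/v9"
)

// ensureReachable pings addr up to maxTries times (0 disables the check)
// and returns an error if the instance never responds.
func ensureReachable(ctx context.Context, addr string, maxTries int) error {
	if maxTries == 0 {
		return nil // check disabled
	}
	client := redis.NewClient(&redis.Options{
		Addr:        addr,
		DialTimeout: 2 * time.Second,
	})
	defer client.Close()

	var lastErr error
	for i := 0; i < maxTries; i++ {
		if lastErr = client.Ping(ctx).Err(); lastErr == nil {
			return nil
		}
		time.Sleep(time.Second)
	}
	return fmt.Errorf("instance %s unreachable after %d tries: %w", addr, maxTries, lastErr)
}
```

The operator could call something like this against the candidate master before promotion and against each replica before issuing SLAVE OF, and only fall back to killing the pod once the check has exhausted its retries.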
@Pothulapati
Collaborator

Thanks @applike-ss for the issue!

All the suggestions seem valid and easy enough to implement.
