Try harder to failover on recover from master loss #250

Open
applike-ss opened this issue Oct 16, 2024 · 1 comment

@applike-ss

Regarding this:

// TODO: Why does this fail every now and then?

and this:

// Should replication be continued if it fails?

We just observed this behavior, and in the logs I discovered this error: `error running SLAVE OF command: dial tcp 10.138.59.180:9999: i/o timeout`, so I assume one of the following happened:

  • network issue
  • dragonfly main/networking thread blocked
  • dragonfly crashed without killing the process

Because of this, I would like to suggest the following changes (a rough sketch follows the list):

  • Check via redis client that the operator can talk to the new master before promoting it
  • Check via redis client that the operator can talk to the (now) replicas before setting them to slaves of the new master
  • Kill the pod if the operator can't talk to it after X tries (configurable? 0 meaning do not kill it?)
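To illustrate the first two suggestions, here is a minimal Go sketch (not the operator's actual code; the `ensureReachable` name, the go-redis client and the `maxTries` parameter are assumptions for illustration) of a reachability check with a configurable retry count that the operator could run before promoting a master or re-pointing a replica:

```go
// Hypothetical sketch: verify a pod's Dragonfly endpoint answers PING
// before promoting it or issuing SLAVE OF against it.
package failover

import (
	"context"
	"fmt"
	"time"

	"github.com/redis/go-redis/v9"
)

// ensureReachable pings addr up to maxTries times (0 disables the check)
// and returns an error if the instance never responds.
func ensureReachable(ctx context.Context, addr string, maxTries int) error {
	if maxTries == 0 {
		return nil // check disabled
	}
	client := redis.NewClient(&redis.Options{
		Addr:        addr,
		DialTimeout: 2 * time.Second,
	})
	defer client.Close()

	var lastErr error
	for i := 0; i < maxTries; i++ {
		if lastErr = client.Ping(ctx).Err(); lastErr == nil {
			return nil
		}
		time.Sleep(time.Second)
	}
	return fmt.Errorf("instance %s unreachable after %d tries: %w", addr, maxTries, lastErr)
}
```

The operator could call something like this against the candidate master before promotion and against each replica before issuing SLAVE OF, and only fall back to killing the pod once the check has exhausted its retries.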
@Pothulapati
Collaborator

Thanks @applike-ss for the issue!

All the suggestions seem valid and easy enough to implement.
