This repository has been archived by the owner on Jan 19, 2024. It is now read-only.

Readiness probe prevents recovery #220

Open
phoerious opened this issue Oct 5, 2021 · 0 comments

Describe the bug
The readiness probe prevents recovery of large databases. When I restart a node with a clean data directory (e.g. after a hard drive failure), the server takes a long time to come back up. In particular, the step "Started downloading snapshot for database XXX" can take several minutes or longer. That step never finishes, because the readiness probe always kills the container before the download is done.

As a workaround, I have set the timeout to one hour via the Helm chart values YAML. It would be great if there were a more intelligent way to handle this: a shorter timeout is definitely useful under normal circumstances, and I don't want to have to reinstall the Helm chart and restart the whole deployment before and after each single-node recovery.
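
To make the one-hour workaround concrete, here is a rough sketch of the arithmetic behind a probe's time budget. It assumes the standard Kubernetes probe fields `initialDelaySeconds`, `periodSeconds` and `failureThreshold`; how the chart's values map onto those fields is not shown here, and the function names and the 10-second period are purely illustrative.

```python
import math


def probe_budget_seconds(period_seconds: int,
                         failure_threshold: int,
                         initial_delay_seconds: int = 0) -> int:
    """Approximate time a container gets before the probe gives up.

    Assumption: the probe first runs after initial_delay_seconds, then
    every period_seconds, and gives up after failure_threshold
    consecutive failures.
    """
    return initial_delay_seconds + period_seconds * failure_threshold


def failure_threshold_for(budget_seconds: int, period_seconds: int) -> int:
    """Smallest failure threshold that covers the desired budget."""
    return math.ceil(budget_seconds / period_seconds)


# Illustrative numbers: a one-hour budget with a probe every 10 seconds.
threshold = failure_threshold_for(3600, 10)
print(threshold, probe_budget_seconds(10, threshold))
```

With these numbers, the same one-hour budget applies on every restart, not only during a recovery, which is why a fixed value is an awkward fit for both cases.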

To Reproduce
Steps to reproduce the behavior:

  1. Create a large database
  2. Kill off one of the nodes and delete its hard drive
  3. Try to recover the deleted node

Expected behavior
Recovery should succeed.
