Readiness probe prevents recovery #220

phoerious · 2021-10-05T16:02:03Z

Describe the bug
The readiness probe prevents recovery of large databases. If I restart a node from a clean data directory (e.g. after a hard drive failure), it takes a long time for the server to come back up. In particular, the step "Started downloading snapshot for database XXX" can take several minutes or longer. Unfortunately, this step never finishes, because the readiness probe always kills the container before the download is done.

For now I have set the timeout to one hour via the Helm chart values YAML, but it would be great if there were a more intelligent way to do this. A shorter timeout is definitely useful under normal circumstances, and I don't want to have to reinstall the Helm chart and restart the whole deployment before and after each single-node recovery.
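
To make the one-hour workaround concrete, here is a rough sketch of the arithmetic, assuming the chart's timeout ends up in the standard Kubernetes probe fields `initialDelaySeconds`, `periodSeconds` and `failureThreshold`. I have not verified how the chart maps its values onto these fields, and the helper names below are hypothetical. Under that assumption, a probe tolerates roughly `initialDelaySeconds + periodSeconds * failureThreshold` seconds of failed checks before it gives up.

```python
def probe_budget_seconds(initial_delay_seconds: int,
                         period_seconds: int,
                         failure_threshold: int) -> int:
    """Approximate time a probe tolerates failures before giving up.

    Assumption: the probe first runs after initial_delay_seconds and then
    fails failure_threshold times in a row, once every period_seconds.
    """
    return initial_delay_seconds + period_seconds * failure_threshold


def min_failure_threshold(target_seconds: int,
                          period_seconds: int,
                          initial_delay_seconds: int = 0) -> int:
    """Smallest failure threshold whose budget covers target_seconds."""
    remaining = max(target_seconds - initial_delay_seconds, 0)
    # Ceiling division without floats; a probe needs a threshold of at least 1.
    return max(1, -(-remaining // period_seconds))


# Illustrative values only, not taken from the chart defaults.
one_hour = 60 * 60
threshold = min_failure_threshold(one_hour, period_seconds=10)
budget = probe_budget_seconds(0, 10, threshold)
```

With a 10-second period, covering one hour needs a threshold of 360. That also shows why this is a blunt workaround: the same budget applies on every restart, not only during a snapshot download.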

To Reproduce
Steps to reproduce the behavior:

1. Create a large database
2. Kill off one of the nodes and delete its hard drive
3. Try to recover the deleted node

Expected behavior
Recovery should succeed.
