Example Prometheus Rule to monitor Velero seems bad #562

Open
savar opened this issue Apr 4, 2024 · 0 comments
savar commented Apr 4, 2024

This is a copy of vmware-tanzu/velero#7132, so please also check the conversation there for further input.

Our three clusters reported differently even though Velero was affected in the same way in all of them.

Checking the implemented PrometheusRule again, it matches what is described in vmware-tanzu/velero#2725, but is this the right choice? The metric velero_backup_attempt_total only grows as long as the Pod does not restart. Using the ratio of failed attempts to an ever-growing total should therefore leave you almost blind to failures after a long period of "all is good".

Assume you run 20 backups per day for a specific schedule, this works fine for a year, and your Pod doesn't restart in that time (which is possible). You would then have 7300 successful attempts. If you used the example query and checked for a failure rate of more than 25%, you would need 2434 failed attempts before you hit that mark, because every failed attempt also increases the total. That is ~122 days before you even realize that your backups aren't working anymore.
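To make the arithmetic above easy to check, here is a minimal Python sketch. The function name `failures_until_alert` is made up for illustration and is not part of Velero or Prometheus. It assumes the alert fires when failed / total is strictly greater than the threshold, and that each failed attempt also counts towards the total.

```python
from fractions import Fraction
import math


def failures_until_alert(successes: int, threshold: Fraction) -> int:
    """Smallest f such that f / (successes + f) > threshold.

    Rearranged: f > threshold * successes / (1 - threshold).
    Fractions are used so the boundary case is exact.
    """
    bound = threshold * successes / (1 - threshold)
    return math.floor(bound) + 1


backups_per_day = 20
successes = backups_per_day * 365            # one healthy year: 7300 attempts
needed = failures_until_alert(successes, Fraction(1, 4))
days_blind = needed / backups_per_day        # days of pure failure before the alert

print(needed, round(days_blind, 1))
```

With the numbers from the example this gives 2434 failed attempts, i.e. roughly 122 days of failing backups.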

I am not sure what the best approach would be, but options include using increase() over a shorter window instead of "the whole time the Pod has been running", or using velero_backup_last_successful_timestamp or other metrics instead.
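As a rough illustration of why a shorter window helps, the following Python sketch simulates the two counters with one sample per day and compares the lifetime ratio with a ratio computed only over the last day. It is a toy model, not PromQL: the function `simulate` and its parameters are hypothetical, and the windowed difference only approximates what increase() would return on real scrape data.

```python
def simulate(healthy_days, broken_days, per_day=20, threshold=0.25, window_days=1):
    """Return (days until lifetime-ratio alert, days until windowed-ratio alert).

    Both values count days since backups started failing; None means the
    alert never fired within the simulated period.
    """
    attempts, failures = [], []      # cumulative counter samples, one per day
    a = f = 0
    for day in range(healthy_days + broken_days):
        a += per_day                 # every run is an attempt
        if day >= healthy_days:
            f += per_day             # after the break, every run fails
        attempts.append(a)
        failures.append(f)

    cumulative_alert = windowed_alert = None
    for i in range(healthy_days, len(attempts)):
        days_broken = i - healthy_days + 1
        # Ratio over the whole lifetime of the counters.
        if cumulative_alert is None and failures[i] / attempts[i] > threshold:
            cumulative_alert = days_broken
        # Ratio over the last `window_days` only (counter differences).
        j = i - window_days
        a0 = attempts[j] if j >= 0 else 0
        f0 = failures[j] if j >= 0 else 0
        if windowed_alert is None and (failures[i] - f0) / (attempts[i] - a0) > threshold:
            windowed_alert = days_broken
    return cumulative_alert, windowed_alert


print(simulate(healthy_days=365, broken_days=200))
```

Under these assumptions the lifetime ratio only crosses 25% after about four months of failures, while the one-day window crosses it on the first failing day. A real rule would also have to decide how to handle windows with no attempts at all.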

My main question here is: should we change the example (even though it is just a comment) so that people don't simply copy and paste it and use it as their way of monitoring Velero backups?

jenting added the velero label May 13, 2024