/sys/health doesn't accurately reflect the health of Vault #28846
Comments
Hi @askmike1 👋 Thank you for raising this issue. We've discussed this use case internally and unfortunately we're erring on the side of not implementing this enhancement. We believe this use case is better solved by a dedicated infrastructure monitoring tool, with alerting/automation to ensure that the disks have enough free space for Vault to operate well, and we recommend that approach.
@kubawi let's ignore the disk space issue for a second. Shouldn't the health check be able to accurately say whether Vault is operational or not? If I am unable to log in, read or write to Vault, I would classify that as being unhealthy, regardless of whether the cause is disk space or anything else.
@askmike1 Allow me to put on my pedantry hat for just a moment. The tricky part with a situation like this is where do we draw the line? If we take this reasoning to its logical conclusion, then Vault becomes not only a secrets manager but a monitor of all host level metrics as well. If the NIC is down and people can't reach Vault over the network, technically it's unhealthy by the same reasoning. Same goes for overloaded CPU, not enough RAM, disk IO exceeding allocated IOPS, etc. OTOH, one could argue that it's not actually a Vault problem, per se, since if you clear up disk space, Vault is now operational again, and we didn't have to touch Vault at all to make that happen. The same goes for any of the host level metrics I mentioned above. It's definitely a nuanced point, though, and I can appreciate your point of view. Have a great day!
But again, ignoring what the issue is, if Vault itself is unusable, it is not healthy. I don't care about network issues getting to Vault. I care that once I do get to Vault, if I'm unable to log in/write/read, it is not actually healthy. It doesn't matter if the issue is disk space, CPU, memory or anything else. The health check doesn't need to say why Vault isn't working, but it should be able to say that it isn't working, especially when we are using an audit mechanism provided by Vault and that mechanism is what causes everything to stop working. Perhaps an additional ask would be for a failing audit mechanism not to cause the rest of Vault to stop working. If Vault itself was up and running, but the backend that holds the data (be it Consul, DynamoDB or something else) was down, would the health check still show as successful?
Is your feature request related to a problem? Please describe.
We have file-based audit logs in place. The file system reached 100% full. /sys/health still reported everything as healthy, even though all commands were failing due to the inability to access the file system. This meant our AWS Target Group still saw the node as healthy and didn't take it out of rotation.
Describe the solution you'd like
/sys/health should provide a deeper health check that more properly checks for issues with Vault (in this case, logging in was failing, in addition to most read/write commands, because the file system was full).
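To illustrate the kind of deeper check meant here, below is a minimal sketch of an external probe, not anything Vault ships. It first calls `/v1/sys/health` and then makes one real authenticated request (`/v1/auth/token/lookup-self`), on the assumption that such a request goes through the audit devices and would therefore fail when the audit log cannot be written. The names `deep_health` and `http_status`, and the idea of giving the probe its own low-privilege token, are hypothetical.

```python
import urllib.error
import urllib.request
from typing import Callable, Dict, Optional


def http_status(url: str, headers: Optional[Dict[str, str]] = None,
                timeout: float = 2.0) -> int:
    """Return the HTTP status code for a GET, or 0 if the request could not complete."""
    req = urllib.request.Request(url, headers=headers or {})
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # Non-2xx responses still carry a status code we want to inspect.
        return err.code
    except OSError:
        # Connection refused, DNS failure, timeout, etc.
        return 0


def deep_health(addr: str, token: str,
                fetch: Callable[..., int] = http_status) -> bool:
    """Healthy only if /sys/health is 200 AND an authenticated request succeeds."""
    # Step 1: the existing health endpoint (200 = initialized, unsealed, active).
    if fetch(f"{addr}/v1/sys/health") != 200:
        return False
    # Step 2: a real request using the probe's token. Assumption: this request
    # is audited, so it fails when the audit device cannot write.
    status = fetch(f"{addr}/v1/auth/token/lookup-self", {"X-Vault-Token": token})
    return status == 200
```

Something like `deep_health("https://vault.example.internal:8200", probe_token)` could then back whatever endpoint the load balancer polls; the address and token here are placeholders.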
Describe alternatives you've considered
We are putting workarounds in place to prevent this, but a proper solution is still needed. The only other option would be putting something like nginx in front of Vault, which seems unnecessary.
Explain any additional use-cases
N/A
Additional context
N/A