/sys/health doesn't accurately reflect the health of Vault #28846

askmike1 · 2024-11-06T17:59:35Z

Is your feature request related to a problem? Please describe.
We have file-based audit logs in place. The file system reached 100% full. /sys/health still reported everything as healthy, even though all commands were failing because of the inability to access the file system. As a result, our AWS Target Group still saw the node as healthy and didn't take it out of rotation.

Describe the solution you'd like
/sys/health should provide a deeper health check that detects issues with Vault more reliably (in this case, logging in was failing, along with most read/write commands, because the disk was full).

Describe alternatives you've considered
We are putting workarounds in place to prevent this, but a proper solution is still needed, other than putting something like nginx in front of Vault, which seems unnecessary.

Explain any additional use-cases
N/A

Additional context
N/A

kubawi · 2024-11-25T17:34:05Z

Hi @askmike1 👋 Thank you for raising this issue.

We've discussed this use case internally and unfortunately we're erring on the side of not implementing this enhancement. We believe that this use case is better solved by using a dedicated infrastructure monitoring tool, with alerting/automation to ensure that the disks have enough free space for Vault to operate well, and we recommend that approach.
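
As a rough illustration of the kind of external disk check recommended above, here is a minimal Python sketch. It is not part of Vault or of any particular monitoring tool. The function names and the 10% threshold are hypothetical, and the path passed in is assumed to be the directory holding the file audit log.

```python
import shutil


def free_fraction(path: str) -> float:
    """Return the free fraction (0.0 to 1.0) of the filesystem containing `path`."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        # Some pseudo-filesystems report a total of zero; treat them as full.
        return 0.0
    return usage.free / usage.total


def audit_disk_ok(path: str, min_free_fraction: float = 0.10) -> bool:
    """True if the audit log filesystem has at least `min_free_fraction` free.

    The 10% default is an arbitrary illustrative threshold, not a Vault
    recommendation.
    """
    return free_fraction(path) >= min_free_fraction
```

A scheduled job or monitoring agent could call `audit_disk_ok` with the audit log directory and raise an alert when it returns `False`, before the disk is completely full.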

askmike1 · 2024-11-25T17:59:09Z

@kubawi let's ignore the disk space issue for a second. Shouldn't the health check be able to accurately say whether Vault is operational or not? If I am unable to log in, read or write to Vault, I would classify that as unhealthy, regardless of whether the issue is disk space or anything else.

raskchanky · 2024-11-25T18:50:42Z

@askmike1 Allow me to put on my pedantry hat for just a moment.

The tricky part with a situation like this is where do we draw the line? If we take this reasoning to its logical conclusion, then Vault becomes not only a secrets manager but a monitor of all host-level metrics as well. If the NIC is down and people can't reach Vault over the network, technically it's unhealthy by the same reasoning. Same goes for overloaded CPU, not enough RAM, disk IO exceeding allocated IOPS, etc.

OTOH, one could argue that it's not actually a Vault problem, per se, since if you clear up disk space, Vault is now operational again, and we didn't have to touch Vault at all to make that happen. The same goes for any of the host-level metrics I mentioned above.

It's definitely a nuanced point, though, and I can appreciate your point of view. Have a great day!

askmike1 · 2024-11-25T19:44:16Z

But again, ignoring what the issue is, if Vault itself is unusable, it is not healthy. I don't care about network issues on the way to Vault. I care that once I do reach Vault, if I'm unable to log in, write or read, it is not actually healthy. It doesn't matter whether the issue is disk space, CPU, memory or anything else. The health check doesn't need to say why Vault isn't working, but it should be able to say that it isn't working.

This is especially true when the audit mechanism that is causing everything to fail is one provided by Vault. Perhaps an additional ask would be that a failing audit mechanism should not cause the rest of Vault to stop working.

If Vault itself was up and running, but the backend that holds the data (be it Consul, DynamoDB or something else) was down, would the healthcheck still show as successful?
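
To make the ask concrete, here is a minimal Python sketch of a "deep" probe that runs outside Vault. It is an illustration under stated assumptions, not existing Vault behavior. It treats a node as healthy only if /sys/health returns 200 (by default, the code for an initialized, unsealed, active node) and a canary operation also succeeds. `deep_health`, `http_status` and the `canary` callable are hypothetical names. The canary is assumed to perform a real authenticated read or write against a throwaway path and to raise an exception on failure.

```python
import urllib.error
import urllib.request
from typing import Callable


def http_status(url: str, timeout: float = 2.0) -> int:
    """Return the HTTP status code for `url`, or 0 if the request fails outright."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # Non-2xx responses (for example 503 when sealed) arrive as HTTPError.
        return err.code
    except OSError:
        # Connection refused, DNS failure, timeout, and similar errors.
        return 0


def deep_health(get_status: Callable[[], int], canary: Callable[[], None]) -> bool:
    """Healthy only if the shallow check returns 200 AND the canary succeeds.

    `canary` is a hypothetical caller-supplied function that exercises a real
    operation (login, read or write) and raises if that operation fails.
    """
    if get_status() != 200:
        return False
    try:
        canary()
    except Exception:
        # Any failure of the real operation marks the node unhealthy.
        return False
    return True
```

Usage would look like `deep_health(lambda: http_status(health_url), my_canary)`, where `health_url` points at the node's /sys/health endpoint and `my_canary` is supplied by the operator. The result would still need to be exposed to the load balancer by some small service, which is the extra component this issue is hoping to avoid.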

mpalmi added the enhancement, core/audit, and feature-request labels Nov 7, 2024

kubawi self-assigned this Nov 21, 2024

raskchanky closed this as completed Nov 25, 2024
