fix(vault): apply retry count maximum threshold for secret rotation #13194
Conversation
LGTM, but I'd like to hear @jschmid1's opinion, as I believe he knows this code much better than I do.
I wonder if we really want this. There are certainly cases where a vault backend is down for a prolonged period of time. As a user I would expect things to continue working when my backend comes back up, right?
I agree that we need an upper limit of sorts, but this approach seems too drastic. What do others think?
@jschmid1 Agreed, and this is also my major concern about this approach. I've been considering this issue and its solution for a long time (you can also see another attempt in #13193). Given our current implementation, the only approaches I think are applicable are:

1. manage a bi-directional relationship dict between entities/configurations and vault refs, and add triggers in CRUD events / declarative config loading to keep the relationship up to date, or
2. remove these refs directly when dereferencing fails too many times, or
3. iterate over all entities/configs to check the vault refs and update/clean them up accordingly, or
4. modify the log lines that complain about the dereference failure when the secret no longer exists on the secret manager backend.

All of these plans have pros and cons, and 2) seems relatively simple (that's why I said "keeps simplicity") and reasonable compared with the others. Again, the issue is small but the solution is still debatable. For 2) I think we can also adjust the max retry number to give a larger time window for the vault backend to recover. But then it makes no sense to me that, without TTLs (especially the resurrect_ttl) properly configured, Kong can still work normally after the vault backend has been down for such a long time (say several hours). Just my 2 cents; since reports of this issue are showing up more frequently nowadays, I decided to go for 2) to deliver a reasonable fix. Anyway, I would be glad to hear if there are better ideas to solve this issue!
@@ -1274,7 +1283,21 @@ local function new(self)
-- we should refresh the secret at this point
local ok, err = get_from_vault(reference, strategy, config, new_cache_key, parsed_reference)
I had a similar thought. Also, maybe a setting that allows defining an upper bound in time instead of a number of tries would be more user friendly (set an optional maximum downtime that is acceptable), but I guess that would make the implementation more complex.
Would the error returned on this line give any information that helps determine whether the secret was deleted or the vault was down / did not respond? If so, maybe we could only remove the entry when we know it was deleted.
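For illustration, acting on that idea might look like the sketch below (Lua, not Kong's actual code): the error string returned by `get_from_vault` would be matched against a hypothetical list of "secret deleted" patterns, and the cached entry dropped only when one matches.

```lua
-- Sketch only: the patterns below are made up; real secret manager backends
-- word "secret not found" differently, which is exactly the concern raised
-- in the next reply.
local NOT_FOUND_PATTERNS = { "not found", "404", "secret is deleted" }

local function secret_was_deleted(err)
  if type(err) ~= "string" then
    return false
  end
  local msg = err:lower()
  for _, pattern in ipairs(NOT_FOUND_PATTERNS) do
    if msg:find(pattern, 1, true) then
      return true
    end
  end
  return false
end
```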
The error message for "the secret was deleted" depends on the implementation of the secret manager product, so it is very hard to tell...
Personally I like the time-based upper bound solution. We can track last_seen and compare it with the defined upper limit (which should be configurable).
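A rough sketch of that time-based variant, assuming a hypothetical max_secret_downtime setting and a per-reference last_seen timestamp (none of these names exist in Kong today):

```lua
-- Sketch only: record when each reference was last fetched successfully and
-- stop rotating once the configured downtime window has been exceeded.
local rotation_state = {}          -- reference -> last successful fetch (epoch seconds)
local max_secret_downtime = 3600   -- hypothetical, would come from configuration

local function record_success(reference)
  rotation_state[reference] = ngx.now()
end

local function should_keep_rotating(reference)
  local last_seen = rotation_state[reference]
  if not last_seen then
    return true  -- never fetched successfully yet, keep trying
  end
  return (ngx.now() - last_seen) <= max_secret_downtime
end
```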
Alright, let me put up another PR (that implements a time-based upper bound) and see if that is better than this one.
@jschmid1 @samugi What do you think if we just log the error X times (and after that stop printing this error log, since the error log is what the user actually cares about), but keep the background rotation going? That would keep rotation working even if the vault backend has been down for a long time, and it would not flood the user's logging system with useless errors.
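A sketch of that "log N times, then go quiet but keep rotating" idea; MAX_ERROR_LOGS and the per-worker counter table are illustrative names, not anything in the codebase, and a shared dict would be needed to coordinate the counter across workers:

```lua
-- Sketch only: cap how often a rotation failure is logged per reference while
-- leaving the rotation loop itself untouched.
local MAX_ERROR_LOGS = 5
local error_log_counts = {}

local function log_rotation_failure(reference, err)
  local count = (error_log_counts[reference] or 0) + 1
  error_log_counts[reference] = count
  if count <= MAX_ERROR_LOGS then
    ngx.log(ngx.WARN, "could not rotate secret ", reference, ": ", err)
  end
end

local function reset_rotation_failure(reference)
  error_log_counts[reference] = nil  -- backend recovered, start logging again next time
end
```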
Summary
This PR adds a maximum retry threshold for secret rotation in the vault. If a secret has exceeded the threshold, the vault will remove it from the shared dict and no longer rotate it in the future. Note that this solution keeps things simple while fixing the issue; a sketch of the idea follows below.
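To illustrate the idea (a simplified sketch, not the actual patch), the rotation path could count consecutive failures per cache key in the shared dict and evict the entry once a threshold is exceeded; the threshold value and the "#failures" key suffix below are made up for the example:

```lua
-- Sketch only: give up on a secret after too many consecutive rotation failures.
local MAX_ROTATION_FAILURES = 10   -- hypothetical threshold

local function rotate_reference(shdict, reference, strategy, config, cache_key, parsed_reference)
  local ok, err = get_from_vault(reference, strategy, config, cache_key, parsed_reference)
  if ok then
    shdict:delete(cache_key .. "#failures")  -- healthy again, reset the counter
    return true
  end

  -- incr with init = 0 creates the counter on the first failure
  local failures = shdict:incr(cache_key .. "#failures", 1, 0)
  if failures and failures >= MAX_ROTATION_FAILURES then
    shdict:delete(cache_key)                 -- stop rotating this secret
    shdict:delete(cache_key .. "#failures")
    return nil, "giving up after " .. failures .. " failed rotations: " .. tostring(err)
  end

  return nil, err
end
```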
Checklist
- A changelog file has been created under changelog/unreleased/kong, or the skip-changelog label has been added to the PR if a changelog is unnecessary (see README.md).
Issue reference
FTI-5775