fix(vault): apply retry count maximum threshold for secret rotation #13194

Closed
wants to merge 5 commits

Conversation

@windmgc (Member) commented Jun 11, 2024

Summary

This PR adds a maximum threshold for secret rotation retries in the vault. If a secret has exceeded the threshold, the vault will remove it from the shared dict and no longer rotate it. Note that this solution prioritizes simplicity in fixing the issue.
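
Below is a minimal standalone sketch of the idea in plain Lua, with an ordinary table standing in for the shared dict and a stubbed backend call; names such as `MAX_ROTATION_FAILURES`, `secrets`, and `rotate_secret` are illustrative and are not the identifiers used in this PR:

```lua
-- Illustrative sketch only (not the PR's actual code): count consecutive
-- rotation failures per secret and drop the secret once a threshold is hit.
local MAX_ROTATION_FAILURES = 10

-- stands in for the shared dict that caches secrets scheduled for rotation
local secrets = {
  ["{vault://my-vault/db-password}"] = { value = "old-value", failures = 0 },
}

-- stands in for the real dereference call against the vault backend
local function get_from_vault(reference)
  return nil, "connection refused"  -- simulate a backend that is down
end

local function rotate_secret(reference, entry)
  local value, err = get_from_vault(reference)
  if value then
    entry.value, entry.failures = value, 0  -- a successful fetch resets the counter
    return true
  end

  entry.failures = entry.failures + 1
  if entry.failures >= MAX_ROTATION_FAILURES then
    secrets[reference] = nil  -- threshold exceeded: never rotate this secret again
    return nil, "giving up on " .. reference .. ": " .. err
  end

  return nil, err
end

for reference, entry in pairs(secrets) do
  print(rotate_secret(reference, entry))
end
```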

Checklist

  • The Pull Request has tests
  • A changelog file has been created under changelog/unreleased/kong, or the skip-changelog label added on the PR if a changelog is unnecessary (see README.md)
  • There is a user-facing docs PR against https://github.com/Kong/docs.konghq.com - PUT DOCS PR HERE

Issue reference

FTI-5775

@github-actions github-actions bot added the core/pdk and cherry-pick kong-ee (schedule this PR for cherry-picking to kong/kong-ee) labels Jun 11, 2024
@pull-request-size pull-request-size bot added size/M and removed size/S labels Jun 11, 2024
@windmgc windmgc requested review from bungle and jschmid1 June 12, 2024 06:46
@windmgc (Member, Author) commented Jun 25, 2024

It seems more and more users are reporting this as an issue.
@bungle @jschmid1 Could you help review this small PR? Thanks!

kong/pdk/vault.lua (review thread resolved)
kong/pdk/vault.lua (outdated review thread resolved)
@samugi (Member) left a comment

lgtm, but I'd like to hear @jschmid1's opinion as I believe he knows this code much better than I do.

@jschmid1 (Contributor) left a comment

I wonder if we really want this. There are certainly cases where a vault backend is down for a prolonged period of time. As a user I would expect things to continue working when my backend comes back up, right?

I agree that we need an upper limit of sorts, but that approach seems too drastic. What do others think?

@windmgc (Member, Author) commented Jul 10, 2024

> I wonder if we really want this. There are certainly cases where a vault backend is down for a prolonged period of time. As a user I would expect things to continue working when my backend comes back up, right?

@jschmid1 Agreed, and this is also my major concern about this approach.

I've been considering this issue and its solution for a long time (you can also see another attempt in #13193). With our current implementation, the only approaches I think are applicable are:

1. Manage a bi-directional relationship dict between entities/configurations and vault refs, and add triggers in CRUD events/declarative config loading to keep that relationship updated.
2. Remove these refs directly when they fail too many times.
3. Iterate all entities/configs to check the vault refs and clean them up accordingly.
4. Modify the log lines that complain about dereference failures caused by the secret no longer existing on the secret manager backend.

All of these plans have pros and cons, and 2) seems relatively simple (which is why I described it as keeping simplicity) and reasonable compared with the others. Again, the issue is small, but the solution is still debatable.

For 2), I think we can also raise the max retry number to give the vault backend a larger time window to recover. But then it makes no sense to me that, without the TTLs (especially resurrect_ttl) properly configured, Kong could still work normally after the vault backend has been down for such a long time (say, several hours).
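
To make the TTL point concrete, here is a tiny plain-Lua illustration of how the vault TTLs bound the useful lifetime of a cached value once the backend goes down; the semantics and numbers are my reading of the ttl/resurrect_ttl settings, not something taken from this PR:

```lua
-- Illustrative only: a value stays usable for its ttl and may then be served
-- stale for at most resurrect_ttl more seconds, so with proper TTLs a secret
-- cannot keep "working normally" through hours of backend downtime.
local ttl           = 300   -- refresh interval for the secret, in seconds
local resurrect_ttl = 3600  -- how long a stale value may still be served

local max_usable_downtime = ttl + resurrect_ttl
print(("with these settings a secret survives at most %d seconds of backend downtime")
      :format(max_usable_downtime))
```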

Just my 2 cents. Since reports of this issue are showing up more frequently nowadays, I decided to go with 2) to deliver a reasonable solution. Anyway, I would be glad to hear any better ideas for solving this!

@@ -1274,7 +1283,21 @@ local function new(self)
-- we should refresh the secret at this point
local ok, err = get_from_vault(reference, strategy, config, new_cache_key, parsed_reference)
Member

> I wonder if we really want this. There are certainly cases where a vault backend is down for a prolonged period of time. As a user I would expect things to continue working when my backend comes back up, right?
>
> I agree that we need an upper limit of sorts, but that approach seems too drastic. What do others think?

I had a similar thought. Also, maybe a setting that defines an upper bound in time instead of a number of tries would be more user-friendly (set an optional maximum downtime that is acceptable), but I guess that'd make the implementation more complex.

Would the error returned on this line give any information that helps determine whether the secret was deleted or the vault was down / did not respond? If so, maybe we could only remove the entry when we know it was deleted.

Member Author

Whether the error message says that the secret was deleted depends on the implementation of the secret manager product, so it's very hard to tell...

Contributor

Personally I like the time-based upper bound solution. We can track last_seen and compare it with the defined upper limit (which should be configurable).
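
For illustration, a standalone sketch of that time-based alternative in plain Lua; `MAX_DOWNTIME_SECONDS`, `last_seen`, and the other names are illustrative and do not correspond to any existing Kong configuration option:

```lua
-- Illustrative sketch only: give up on a secret based on how long it has been
-- since the last successful fetch, rather than on a number of failed attempts.
local MAX_DOWNTIME_SECONDS = 6 * 60 * 60  -- e.g. stop retrying after six hours

local function rotate_secret(secrets, reference, fetch, now)
  local entry = secrets[reference]
  local value = fetch(reference)

  if value then
    entry.value, entry.last_seen = value, now  -- refresh last_seen on every success
    return true
  end

  if now - entry.last_seen > MAX_DOWNTIME_SECONDS then
    secrets[reference] = nil  -- unreachable for too long: stop rotating this secret
    return nil, "max downtime exceeded for " .. reference
  end

  return nil, "rotation failed, will retry"
end

-- usage: the secret was last fetched seven hours ago and the backend still fails
local secrets = { ["{vault://my-vault/db-password}"] = { value = "old", last_seen = 0 } }
print(rotate_secret(secrets, "{vault://my-vault/db-password}",
                    function() return nil end, 7 * 60 * 60))
```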

Member Author

Alright, let me put up another PR (that implements a time-based upper bound) and see if that is better than this one.

Member Author

@jschmid1 @samugi What do you think if we just log the error X times (and after that stop printing this error log, since the error log is what the user actually cares about), but keep the background rotation running? That would keep rotation working even if the vault backend has been down for a long time, and not flood the user's logging system with useless errors.
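
For illustration, a standalone sketch of that "cap the logging, keep rotating" idea in plain Lua; `MAX_LOGGED_ERRORS` and the other names are illustrative only:

```lua
-- Illustrative sketch only: rotation keeps being retried forever; only the
-- amount of error logging per secret is capped until the next success.
local MAX_LOGGED_ERRORS = 5

local function rotate_secret(entry, reference, fetch, log)
  local value, err = fetch(reference)

  if value then
    entry.value, entry.logged_errors = value, 0  -- a recovery re-arms the logging
    return true
  end

  entry.logged_errors = (entry.logged_errors or 0) + 1
  if entry.logged_errors <= MAX_LOGGED_ERRORS then
    log("could not rotate " .. reference .. ": " .. tostring(err))
  end

  return nil, err
end

-- usage: after the fifth failure the error is no longer logged, but rotation continues
local entry = { value = "old" }
for _ = 1, 8 do
  rotate_secret(entry, "{vault://my-vault/db-password}",
                function() return nil, "timeout" end, print)
end
```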

@team-eng-enablement team-eng-enablement added and then removed the author/community (PRs from the open-source community, not Kong Inc) label Aug 13, 2024
@windmgc windmgc closed this Aug 22, 2024