fix(vault): apply retry count maximum threshold for secret rotation #13194

Closed
wants to merge 5 commits

Conversation

@windmgc (Member) commented Jun 11, 2024

Summary

This PR adds a maximum threshold for secret rotation retries in the vault. If a secret has exceeded the threshold, the vault will remove it from the shared dict and no longer rotate it. Note that this solution prioritizes simplicity in fixing the issue.
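
Below is a minimal standalone sketch of the idea in plain Lua, with an ordinary table standing in for the shared dict and a stubbed backend call; names such as `MAX_ROTATION_FAILURES`, `secrets`, and `rotate_secret` are illustrative and are not the identifiers used in this PR:

```lua
-- Illustrative sketch only (not the PR's actual code): count consecutive
-- rotation failures per secret and drop the secret once a threshold is hit.
local MAX_ROTATION_FAILURES = 10

-- stands in for the shared dict that caches secrets scheduled for rotation
local secrets = {
  ["{vault://my-vault/db-password}"] = { value = "old-value", failures = 0 },
}

-- stands in for the real dereference call against the vault backend
local function get_from_vault(reference)
  return nil, "connection refused"  -- simulate a backend that is down
end

local function rotate_secret(reference, entry)
  local value, err = get_from_vault(reference)
  if value then
    entry.value, entry.failures = value, 0  -- a successful fetch resets the counter
    return true
  end

  entry.failures = entry.failures + 1
  if entry.failures >= MAX_ROTATION_FAILURES then
    secrets[reference] = nil  -- threshold exceeded: never rotate this secret again
    return nil, "giving up on " .. reference .. ": " .. err
  end

  return nil, err
end

for reference, entry in pairs(secrets) do
  print(rotate_secret(reference, entry))
end
```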

Checklist

  • The Pull Request has tests
  • A changelog file has been created under changelog/unreleased/kong, or the skip-changelog label added on the PR if a changelog is unnecessary (see README.md)
  • There is a user-facing docs PR against https://github.com/Kong/docs.konghq.com - PUT DOCS PR HERE

Issue reference

FTI-5775

@github-actions github-actions bot added the core/pdk and cherry-pick kong-ee (schedule this PR for cherry-picking to kong/kong-ee) labels Jun 11, 2024
@pull-request-size pull-request-size bot added size/M and removed size/S labels Jun 11, 2024
@windmgc windmgc requested review from bungle and jschmid1 June 12, 2024 06:46
@windmgc (Member, Author) commented Jun 25, 2024

It seems more and more users are reporting this as an issue.
@bungle @jschmid1 Could you help review this small PR? Thanks!

kong/pdk/vault.lua (review thread resolved)
kong/pdk/vault.lua (outdated review thread resolved)
@samugi (Member) left a comment

lgtm, but I'd like to hear @jschmid1's opinion as I believe he knows this code much better than I do.

@jschmid1 (Contributor) left a comment

I wonder if we really want this. There are certainly cases where a vault backend is down for a prolonged period of time. As a user I would expect things to continue working when my backend comes back up, right?

I agree that we need an upper limit of sorts, but that approach seems too drastic. What do others think?

@windmgc (Member, Author) commented Jul 10, 2024

> I wonder if we really want this. There are certainly cases where a vault backend is down for a prolonged period of time. As a user I would expect things to continue working when my backend comes back up, right?

@jschmid1 Agreed, and this is also my major concern about this approach.

I've been considering this issue and its solution for a long time (you can also see another attempt in #13193). With our current implementation, the only approaches I think are applicable are:

1. Manage a bi-directional relationship dict between entities/configurations and vault refs, and add triggers in CRUD events/declarative config loading to keep that relationship updated.
2. Remove these refs directly when they fail too many times.
3. Iterate all entities/configs to check the vault refs and clean them up accordingly.
4. Modify the log lines that complain about dereference failures caused by the secret no longer existing on the secret manager backend.

All of these plans have pros and cons, and 2) seems relatively simple (which is why I described it as keeping simplicity) and reasonable compared with the others. Again, the issue is small, but the solution is still debatable.

For 2), I think we can also raise the max retry number to give the vault backend a larger time window to recover. But then it makes no sense to me that, without the TTLs (especially resurrect_ttl) properly configured, Kong could still work normally after the vault backend has been down for such a long time (say, several hours).
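
To make the TTL point concrete, here is a tiny plain-Lua illustration of how the vault TTLs bound the useful lifetime of a cached value once the backend goes down; the semantics and numbers are my reading of the ttl/resurrect_ttl settings, not something taken from this PR:

```lua
-- Illustrative only: a value stays usable for its ttl and may then be served
-- stale for at most resurrect_ttl more seconds, so with proper TTLs a secret
-- cannot keep "working normally" through hours of backend downtime.
local ttl           = 300   -- refresh interval for the secret, in seconds
local resurrect_ttl = 3600  -- how long a stale value may still be served

local max_usable_downtime = ttl + resurrect_ttl
print(("with these settings a secret survives at most %d seconds of backend downtime")
      :format(max_usable_downtime))
```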

Just my 2 cents. Since reports of this issue are showing up more frequently nowadays, I decided to go with 2) to deliver a reasonable solution. Anyway, I would be glad to hear any better ideas for solving this!

@@ -1274,7 +1283,21 @@ local function new(self)
-- we should refresh the secret at this point
local ok, err = get_from_vault(reference, strategy, config, new_cache_key, parsed_reference)
Member

> I wonder if we really want this. There are certainly cases where a vault backend is down for a prolonged period of time. As a user I would expect things to continue working when my backend comes back up, right?
>
> I agree that we need an upper limit of sorts, but that approach seems too drastic. What do others think?

I had a similar thought. Also, maybe a setting that defines an upper bound in time instead of a number of tries would be more user-friendly (set an optional maximum downtime that is acceptable), but I guess that'd make the implementation more complex.

Would the error returned on this line give any information that helps determine whether the secret was deleted or the vault was down / did not respond? If so, maybe we could only remove the entry when we know it was deleted.

Member Author

Whether the error message says that the secret was deleted depends on the implementation of the secret manager product, so it's very hard to tell...

Contributor

Personally I like the time-based upper bound solution. We can track last_seen and compare it with the defined upper limit (which should be configurable).
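
For illustration, a standalone sketch of that time-based alternative in plain Lua; `MAX_DOWNTIME_SECONDS`, `last_seen`, and the other names are illustrative and do not correspond to any existing Kong configuration option:

```lua
-- Illustrative sketch only: give up on a secret based on how long it has been
-- since the last successful fetch, rather than on a number of failed attempts.
local MAX_DOWNTIME_SECONDS = 6 * 60 * 60  -- e.g. stop retrying after six hours

local function rotate_secret(secrets, reference, fetch, now)
  local entry = secrets[reference]
  local value = fetch(reference)

  if value then
    entry.value, entry.last_seen = value, now  -- refresh last_seen on every success
    return true
  end

  if now - entry.last_seen > MAX_DOWNTIME_SECONDS then
    secrets[reference] = nil  -- unreachable for too long: stop rotating this secret
    return nil, "max downtime exceeded for " .. reference
  end

  return nil, "rotation failed, will retry"
end

-- usage: the secret was last fetched seven hours ago and the backend still fails
local secrets = { ["{vault://my-vault/db-password}"] = { value = "old", last_seen = 0 } }
print(rotate_secret(secrets, "{vault://my-vault/db-password}",
                    function() return nil end, 7 * 60 * 60))
```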

Member Author

Alright, let me put up another PR (that implements a time-based upper bound) and see if that is better than this one.

Member Author

@jschmid1 @samugi What do you think if we just log the error X times (and after that stop printing this error log, since the error log is what the user actually cares about), but keep the background rotation running? That would keep rotation working even if the vault backend has been down for a long time, and not flood the user's logging system with useless errors.
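
For illustration, a standalone sketch of that "cap the logging, keep rotating" idea in plain Lua; `MAX_LOGGED_ERRORS` and the other names are illustrative only:

```lua
-- Illustrative sketch only: rotation keeps being retried forever; only the
-- amount of error logging per secret is capped until the next success.
local MAX_LOGGED_ERRORS = 5

local function rotate_secret(entry, reference, fetch, log)
  local value, err = fetch(reference)

  if value then
    entry.value, entry.logged_errors = value, 0  -- a recovery re-arms the logging
    return true
  end

  entry.logged_errors = (entry.logged_errors or 0) + 1
  if entry.logged_errors <= MAX_LOGGED_ERRORS then
    log("could not rotate " .. reference .. ": " .. tostring(err))
  end

  return nil, err
end

-- usage: after the fifth failure the error is no longer logged, but rotation continues
local entry = { value = "old" }
for _ = 1, 8 do
  rotate_secret(entry, "{vault://my-vault/db-password}",
                function() return nil, "timeout" end, print)
end
```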

@team-eng-enablement team-eng-enablement added and then removed the author/community (PRs from the open-source community, not Kong Inc) label Aug 13, 2024
@windmgc windmgc closed this Aug 22, 2024