Retry obtaining KUBECONFIG file lock #3782
base: main
Conversation
When there are multiple concurrent create or delete operations, there can be different processes trying to lock the KUBECONFIG file at the same time. This leads to small windows where obtaining the lock fails.

We don't fail the operation if something like "kind delete cluster" isn't able to remove its settings from KUBECONFIG. This is ideal and what most users would expect, but it also means there may be some leftover crud that they are not aware of.

This adds some basic retries to obtaining the file lock. It potentially adds a very small delay (in a much longer process), but makes it more likely that the process is able to get the lock and complete the KUBECONFIG update.

Signed-off-by: Sean McGinnis <[email protected]>
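As a rough sketch of the approach (not the PR's exact code; the helper name, retry count, and delay below are assumptions), retrying acquisition of a client-go style lock file could look like this:

package kubeconfig

import (
	"fmt"
	"os"
	"time"
)

// lockFileWithRetry is a hypothetical helper: it tries to create the
// KUBECONFIG lock file the same way client-go does (O_CREATE|O_EXCL),
// and retries briefly instead of failing on the first conflict.
func lockFileWithRetry(filename string) error {
	lockName := filename + ".lock"
	var err error
	for i := 0; i < 5; i++ {
		var f *os.File
		f, err = os.OpenFile(lockName, os.O_CREATE|os.O_EXCL|os.O_RDONLY, 0)
		if err == nil {
			// Lock acquired; the caller removes the lock file when done.
			return f.Close()
		}
		if !os.IsExist(err) {
			// Unexpected error (permissions, bad path, ...); don't retry.
			return err
		}
		// Another process holds the lock; wait briefly and try again.
		time.Sleep(100 * time.Millisecond)
	}
	return fmt.Errorf("failed to lock %q: %w", filename, err)
}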
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: stmcginnis. The full list of commands accepted by this bot can be found here.
if err != nil {
	return err
}

// Retry obtaining the file lock a few times to accommodate concurrent operations.
I'm not sure this is safe. If there is a lock, we're racing with another process anyhow (which may logically conflict), and we're deviating in behavior from client-go.
I think we probably need to retry the entire kubeconfig edit call instead?
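For comparison, retrying the entire kubeconfig edit might look roughly like the sketch below; editKubeconfig, errLocked, and the retry parameters are placeholders, not real kind or client-go identifiers.

package kubeconfig

import (
	"errors"
	"time"
)

// errLocked stands in for whatever error the lock helper returns when
// another process already holds the KUBECONFIG lock file.
var errLocked = errors.New("kubeconfig is locked by another process")

// editKubeconfig stands in for the existing lock -> read -> modify ->
// write -> unlock sequence.
func editKubeconfig(path string, modify func([]byte) ([]byte, error)) error {
	// ... acquire lock, read, apply modify, write, release lock ...
	return nil
}

// editWithRetry retries the whole edit, so each attempt re-acquires the
// lock and re-reads the current file contents rather than reusing a
// potentially stale view from a failed attempt.
func editWithRetry(path string, modify func([]byte) ([]byte, error)) error {
	var err error
	for i := 0; i < 5; i++ {
		if err = editKubeconfig(path, modify); !errors.Is(err, errLocked) {
			return err
		}
		time.Sleep(100 * time.Millisecond)
	}
	return err
}

The practical difference from retrying only the lock acquisition is that a fresh read happens on every attempt, which also covers the case where the other process changed the file in the meantime.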
I believe this should be the safe route. Once we obtain the lock, there shouldn't be any concerns about other client-go users (or other libraries) that also use file locks to update the KUBECONFIG. All of the update operations happen within the window between obtaining the file lock and removing it.
When we update the contents, we do both read and write within this locked section.
Same with removing an entry. The file lock is obtained first, and only then are updates made.
The only difference from client-go is that client-go will immediately fail if another process already has that lock. We will be a little more robust and just retry, where typical operations should complete within milliseconds.
So I believe the only risk is if there is another process that is not using the same file locking mechanism to control access to KUBECONFIG. As long as it operates under the same principle of getting the lock before reading and updating, then the only difference is we might be able to wait until the other process is done before completely failing the operation.
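To make that locked window concrete, the read-modify-write sequence described above is roughly the following (helper names are illustrative, not kind's actual internals):

package kubeconfig

import "os"

// lockFile and unlockFile are illustrative stand-ins for kind's lock helpers.
func lockFile(path string) error {
	f, err := os.OpenFile(path+".lock", os.O_CREATE|os.O_EXCL|os.O_RDONLY, 0)
	if err != nil {
		return err
	}
	return f.Close()
}

func unlockFile(path string) error {
	return os.Remove(path + ".lock")
}

// write shows the window the comment describes: the lock is taken before
// the file is read and released only after the new contents are written,
// so no other cooperating process can read or write in between.
func write(path string, modify func([]byte) ([]byte, error)) error {
	if err := lockFile(path); err != nil {
		return err
	}
	defer unlockFile(path)

	// Read the current contents while holding the lock.
	current, err := os.ReadFile(path)
	if err != nil && !os.IsNotExist(err) {
		return err
	}

	// Compute the updated contents (add or remove a cluster entry, etc.).
	updated, err := modify(current)
	if err != nil {
		return err
	}

	// Write back before releasing the lock.
	return os.WriteFile(path, updated, 0600)
}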
Related: #3781