[WIP] KEP-3926: show how we address client cache inconsistency issue #4949
Open
tkashem wants to merge 1 commit into kubernetes:master from tkashem:kep-3926-changes
+124 −1
@@ -93,6 +93,7 @@ tags, and then generate with `hack/update-toc.sh`.
- [Retrieving a failing resource](#retrieving-a-failing-resource)
- [Deleting a failing resource](#deleting-a-failing-resource)
- [Protecting unconditional deletion](#protecting-unconditional-deletion)
+- [Propagating with WATCH](#propagating-with-watch)
- [Test Plan](#test-plan)
- [Prerequisite testing updates](#prerequisite-testing-updates)
- [Unit tests](#unit-tests)
@@ -405,7 +406,118 @@ deletions should therefore have their own extra admission.
The unconditional deletion admission:
1. checks if a "delete" request contains the `IgnoreStoreReadErrorWithClusterBreakingPotential` option
-2. if it does, it checks the request user's RBAC for the `delete-ignore-read-errors` verb on the given resource
+2. if it does, it checks the request user's RBAC for the `unsafe-delete-ignore-read-errors` verb on the given resource
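As a rough illustration of step 2, here is a sketch of how an admission plugin might consult the authorizer. The helper name and wiring are hypothetical; the `admission` and `authorizer` interfaces are the standard apiserver ones:

```go
package unsafedelete

import (
	"context"
	"fmt"

	"k8s.io/apiserver/pkg/admission"
	"k8s.io/apiserver/pkg/authorization/authorizer"
)

// authorizeUnsafeDelete asks the authorizer whether the requesting user
// holds the unsafe-delete verb on the resource being deleted.
func authorizeUnsafeDelete(ctx context.Context, authz authorizer.Authorizer, attrs admission.Attributes) error {
	record := authorizer.AttributesRecord{
		User:            attrs.GetUserInfo(),
		Verb:            "unsafe-delete-ignore-read-errors",
		Namespace:       attrs.GetNamespace(),
		APIGroup:        attrs.GetResource().Group,
		APIVersion:      attrs.GetResource().Version,
		Resource:        attrs.GetResource().Resource,
		Name:            attrs.GetName(),
		ResourceRequest: true,
	}
	decision, reason, err := authz.Authorize(ctx, record)
	if err != nil {
		return err
	}
	if decision != authorizer.DecisionAllow {
		return admission.NewForbidden(attrs, fmt.Errorf("user may not unsafe-delete: %s", reason))
	}
	return nil
}
```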
#### Propagating with WATCH

When a corrupt object is deleted with `unsafe-delete-ignore-read-errors`, a
watcher of this resource will not be able to transform the old data from the
storage, or decode it into an object [1]. This causes the watch to emit a
`watch.ERROR` event.

A client backed by an informer already has the object in its cache; since the
client never receives a `watch.DELETED` event, the object remains in the lister
cache. This creates an inconsistency: retrieving the object from the cache
yields the object, but getting it from the storage returns a `corrupt object`
error.
There are a few factors we need to consider to understand how the client is impacted:

- a) whether the client has been updated yet
- b) whether the watch cache is enabled in the kube-apiserver
We can review the following options:
*A: Send an `ERROR` event with a `metav1.Status` object*:
When the `watcher` receives a `DELETED` event from etcd and fails to
transform or decode the old data associated with the deleted resource, it
will return an `ERROR` event with a `metav1.Status` object whose reason is
`StatusReasonStoreReadError`:
```
{
  type: "ERROR",
  object: {
    "status": {
      "message": "corrupt object has been deleted - data from the storage is not transformable - ...",
      "reason": "StorageReadError"
    }
  }
}
```
Upon receiving this error, the existing reflector implementation should end the
current watch and restart with a new list+watch, which will result in
the old cache being replaced.
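For illustration, a minimal sketch of how a hand-rolled watch consumer (one not using an informer) could react to this event; the reason string mirrors the example above and may not match the final constant:

```go
package main

import (
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/watch"
)

// consumeWatch drains events and reports whether the caller should fall
// back to a fresh list+watch, as the reflector would after an ERROR.
func consumeWatch(w watch.Interface) (relist bool) {
	defer w.Stop()
	for ev := range w.ResultChan() {
		switch ev.Type {
		case watch.Error:
			if st, ok := ev.Object.(*metav1.Status); ok &&
				st.Reason == metav1.StatusReason("StorageReadError") {
				// A corrupt object was unsafe-deleted; our local copy
				// is stale. Relist to rebuild the cache from scratch.
				fmt.Printf("store read error: %s\n", st.Message)
			}
			return true
		case watch.Deleted:
			// normal deletion handling elsewhere
		}
	}
	return false
}
```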
Pros:
- existing clients will recover without any code change
- no changes to the watch API

Cons:
- relisting is expensive, but it is limited to the client(s) watching the resource(s) being unsafe-deleted
*B: Send an `ERROR` event with a partial object of the type being watched*:
We will change the watch to emit a `watch.ERROR` event with a partial object of the
type being watched that carries the following information:
- Namespace/Name: identifies the object
- ResourceVersion: the revision of the key-value store after the delete operation
- Annotation: `"unsafe-delete-ignore-read-errors": "true"`, to indicate that the
object associated with the DELETED event from the storage was corrupt
We will change the reflector to interpret this error and do the following:
- delete the object from its store, so clients are in sync with the current state
- set `LastSyncResourceVersion` to the `ResourceVersion` from the partial
object, which advances it to the revision of the storage after the delete operation
This is what it would look like on the wire:
```
{
  type: "ERROR",
  object: {
    "ObjectMeta": {
      "Name": "foo",
      "Namespace": "test-ns",
      "ResourceVersion": "3",
      "Annotations": {
        "kubernetes.io/unsafe-delete-ignore-read-errors": "true"
      }
    }
  }
}
```
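A hypothetical sketch of the reflector-side handling, with `store` and `setLastSyncRV` standing in for the reflector's internal state (the real reflector keeps these unexported); the annotation key follows the example above:

```go
package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/api/meta"
	"k8s.io/apimachinery/pkg/watch"
	"k8s.io/client-go/tools/cache"
)

const unsafeDeleteAnnotation = "kubernetes.io/unsafe-delete-ignore-read-errors"

// handleUnsafeDeleteError interprets the partial-object ERROR event from
// option B: drop the stale entry and advance the watch past the delete.
func handleUnsafeDeleteError(ev watch.Event, store cache.Store, setLastSyncRV func(string)) error {
	m, err := meta.Accessor(ev.Object)
	if err != nil {
		return err // not a partial object; fall back to a full relist
	}
	if m.GetAnnotations()[unsafeDeleteAnnotation] != "true" {
		return fmt.Errorf("not an unsafe-delete error event")
	}
	// Delete the corrupt object so the cache matches the storage.
	if err := store.Delete(ev.Object); err != nil {
		return err
	}
	// Advance to the storage revision after the delete so the watch can
	// resume from there instead of relisting.
	setLastSyncRV(m.GetResourceVersion())
	return nil
}
```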
Pros: a new client recovers and catches up with the store without a restart
Cons: an old client behaves as it does today
[1] https://github.com/kubernetes/kubernetes/blob/2bb886ce2a6ae5a83031efd225e8a17dd7ecba03/staging/src/k8s.io/apiserver/pkg/storage/etcd3/watcher.go#L681-L697
[2] https://github.com/kubernetes/kubernetes/pull/127513/commits/46c3fdc4eeda8d3ffb1839fff196c5fece75066d#diff-9ccdf713e010f73dbebd01e936cb0077fc63e4f5ab941d865ded42da219d84ec
*C: Send a `DELETED` event with a partial object of the type being watched*:
Today the semantics of a `DELETED` event are:
- 1) the `Object` associated with the event is the whole object as it was
- 2) the `ResourceVersion` of the object is the revision of the storage
after the delete operation
Upon failure to transform or decode the object data, if we return a `DELETED`
event with a partial object similar to `B`, then we respect `2` but not `1`.
It's worth noting that a client can define cache keys based on fields that are
not limited to `Namespace` or `Name`, e.g. someone could derive a key from the
value of a label or an annotation.
Pros: existing clients work as expected without any code change
Cons: the partial object can cause a client that uses custom keys to fail
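To make the Cons concrete, here is a sketch of a custom key function that would break on such a partial object; keying by an `app` label is a made-up example:

```go
package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/api/meta"
)

// keyByAppLabel keys cache entries by a label value instead of
// namespace/name. The partial object carried by the DELETED event has
// no labels, so the deletion can never be mapped back to the cached
// entry and the stale object is never evicted.
func keyByAppLabel(obj interface{}) (string, error) {
	m, err := meta.Accessor(obj)
	if err != nil {
		return "", err
	}
	app, ok := m.GetLabels()["app"]
	if !ok {
		return "", fmt.Errorf("object %s/%s has no app label", m.GetNamespace(), m.GetName())
	}
	return app, nil
}

// store := cache.NewStore(keyByAppLabel) // Delete() on the partial object fails here
```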
*D: Introduce a new watch event*:
This seems more intrusive than necessary, so I will not go into detail here.
Add an integration test to show that an informer is consistent with the state
in the storage after the corrupt resource is unsafe-deleted.
### Test Plan
@@ -491,6 +603,17 @@ At this time only integration tests are considered.

### Graduation Criteria

#### Beta
- API validation must be resilient to future additions to the API
- Unfortunately, the dynamic encryption config reload takes about 1m, so we can't use `wait.ForeverTestTimeout`
in the integration test yet; I have left a TODO to investigate and improve the reload time
- Allow `DryRun` together with `ignoreStoreReadErrorWithClusterBreakingPotential`:

  We will check whether the user has permission to perform `unsafe-delete-ignore-read-errors`
  on the resource
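A sketch of the client-side dry-run flow, assuming the `IgnoreStoreReadErrorWithClusterBreakingPotential` field lands in `metav1.DeleteOptions` as proposed; the resource and names are illustrative:

```go
package main

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// dryRunUnsafeDelete exercises the proposed option without mutating
// storage: admission (including the RBAC verb check) still runs.
func dryRunUnsafeDelete(ctx context.Context, client kubernetes.Interface) error {
	ignore := true
	return client.CoreV1().Secrets("test-ns").Delete(ctx, "corrupt-secret", metav1.DeleteOptions{
		DryRun: []string{metav1.DryRunAll},
		// Proposed field; the name follows the KEP, not a released API.
		IgnoreStoreReadErrorWithClusterBreakingPotential: &ignore,
	})
}
```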
<!--
**Note:** *Not required until targeted at a release.*
Review comment:
> Not sure I can parse the sentence. You are saying the cache will be stuck?