
(bugfix): reduce frequency of update requests for copied CSVs #3411

Open
wants to merge 4 commits into master
Conversation

everettraven
Contributor

Description of the change:
Reduces the frequency of update requests for copied CSVs by:

  • Adding the olm.operatorframework.io/nonStatusCopyHash annotation to the copied CSVs, populated with a hash of all the non-status fields of the original CSV.
  • Adding the olm.operatorframework.io/statusCopyHash annotation to the copied CSVs, populated with a hash of all the status fields of the original CSV.
  • Updating the above annotations as necessary on the copied CSVs when changes are made to the original CSV

This appears to have been the desired behavior of this function, as evidenced by:

existingNonStatus := existing.Annotations["$copyhash-nonstatus"]
existingStatus := existing.Annotations["$copyhash-status"]

The problem with the previous implementation was that these annotations were never actually set on the copied CSVs, resulting in a hot loop of update requests being made for every copied CSV.

Motivation for the change:

  • Reduce the frequency of UPDATE requests made to the Kubernetes API Server for copied CSVs

Architectural changes:

Testing remarks:

Reviewer Checklist

  • Implementation matches the proposed design, or proposal is updated to match implementation
  • Sufficient unit test coverage
  • Sufficient end-to-end test coverage
  • Bug fixes are accompanied by regression test(s)
  • e2e tests and flake fixes are accompanied by evidence of flake testing, e.g. executing the test 100(0) times
  • tech debt/todo is accompanied by issue link(s) in comments in the surrounding code
  • Tests are comprehensible, e.g. Ginkgo DSL is being used appropriately
  • Docs updated or added to /doc
  • Commit messages sensible and descriptive
  • Tests marked as [FLAKE] are truly flaky and have an issue
  • Code is properly formatted

@benluddy
Contributor

benluddy commented Oct 8, 2024

Those annotations were set on the CSVs in the informer's cache when the hash comparison was introduced (cf618d3). The observation was that CSV copies needed to be written if and only if the copy was out of sync with the original, so caching a pruned object with a hash was much cheaper than caching complete objects while still being able to stomp CSV copies at the appropriate times. This seems to have regressed as part of #3003.

Edit: Breaking the hash comparison would have masked this with the unnecessary update requests, but using a partialobjectmeta informer for CSV copies means that olm-operator doesn't have enough information to determine whether or not a copy is in sync with the original. Even if a hash of the original is included in an annotation when olm-operator writes a copy, if someone else makes a change to a copy's spec, that won't be observable by olm-operator.

@bentito
Contributor

bentito commented Oct 9, 2024

@benluddy I think we want to push ahead with this PR for a couple reasons:

  1. reverting OCPBUGS-17157: improve watching semantics around CSVs #3003 would be difficult at this late date because it had follow-on work, AND we need the memory reduction.
  2. there have not been any OLM users noticing the problems introduced by OCPBUGS-17157: improve watching semantics around CSVs #3003.
  3. this PR suppresses updates to the API server when the hash annotations match, which does help users who have noticed the increased activity.

@benluddy
Contributor

benluddy commented Oct 9, 2024

there have not been any OLM users noticing the introduced problems from #3003

You should find with this PR that you can issue a write to a copied CSV that puts it out of sync with the original CSV, and this control loop will not bring the copy back into sync with the original.

Edit: I am a little confused by this response. Isn't this PR an attempt to fix a problem that was both reported by a user and introduced by the linked PR?

@everettraven
Contributor Author

there have not been any OLM users noticing the introduced problems from #3003

You should find with this PR that you can issue a write to a copied CSV that puts it out of sync with the original CSV, and this control loop will not bring the copy back into sync with the original.

Edit: I am a little confused by this response. Isn't this PR an attempt to fix a problem that was both reported by a user and introduced by the linked PR?

@benluddy I think what @bentito is trying to say is that we have received no complaints from users about the specific bug you called out, namely that writes to a copied CSV that don't change the annotations would not be reverted.

My understanding of the current plan, and I may be a tad out of the loop due to hurricane preparations, is:

  • Merge this PR to fix a critical bug causing users higher logging costs and load on the API server
  • If bug reports come in regarding copied CSV writes not being reverted, investigate how to resolve that

I agree that reverting #3003 is the right approach to fix both the critical bug and the bug you called out that was introduced by the use of the PartialObjectMetadata informer. There are folks with concerns about reverting that change because we needed the performance gain to enable folks running OLM on the edge. Reverting #3003 and finding another way to achieve enough of a performance gain for the edge use cases is probably still the best thing to do, but the priority is being placed on stopping the constant update requests ASAP, and the consensus of the current maintainers seems to be that this PR is what allows us to achieve that and gives us some breathing room to re-assess.

I know it isn't a great answer, but hopefully this gives some insight as to the decisions the team has made re: continuing with this PR.

@kevinrizza
Member

@everettraven I don't think Ben is suggesting that we revert #3003, I think he is suggesting that we don't introduce another regression on top of the previous one as a fix. With this change, I think we are putting ourselves in a situation where we are going to ignore spec changes in the copied csvs, which I'm pretty sure are considered in resolution.

@everettraven
Contributor Author

@kevinrizza The regression is technically caused by #3003. We would just start to surface it with this PR, because we add back the annotations that prevent updating 100% of the time.

That being said, maybe I'm missing that there is a different way to do this?

@everettraven
Contributor Author

everettraven commented Oct 9, 2024

Thinking about this a bit more, maybe we can remove the hash annotation and always calculate the hash of the copied CSV and compare that?

EDIT: This isn't possible without reverting #3003 because the lister returns a PartialObjectMeta which doesn't contain spec or status information

EDIT EDIT: This could be possible, likely with significant refactoring. I haven't had time to dig into this more. @bentito is taking over this work since I'll be impacted by Hurricane Milton.

@benluddy
Contributor

benluddy commented Oct 9, 2024

It seems likely that some of the observed memory reductions versus caching pruned CSVs came from lowering the high-water mark of transient allocations from decoding a big list response (unlike CSVs, or any CRD-backed resource, a PartialObjectMetadataList can be encoded as Protobuf and not only JSON).

EDIT: This isn't possible without reverting #3003 because the lister returns a PartialObjectMeta which doesn't contain spec or status information

Right, without the full object, the copy loop can't see all the fields that might have diverged from the original. The metadata.generation field is automatically incremented for all CRs on updates that change non-metadata fields (https://github.com/kubernetes/kubernetes/blob/d9c46d8ecb1ede9be30545c9803e17682fcc4b50/staging/src/k8s.io/apiextensions-apiserver/pkg/registry/customresource/strategy.go#L185) so you might be able to keep track of the last observed generation per copy with only metadata available. The only problem would be that the status subresource is enabled on the CSV CRD, so status updates don't cause metadata.generation to be incremented.
