Convert fleet-agent to controller-runtime #1772

Merged: manno merged 10 commits into master from the fleet-agent-controller-runtime branch on Nov 30, 2023
Conversation

manno (Member) commented on Sep 11, 2023

Refers to #1734

Requires:

Follow up:

manno changed the title from Fleet agent controller runtime to Fleet agent controller runtime [skip ci] on Sep 11, 2023
manno force-pushed the fleet-agent-controller-runtime branch 2 times, most recently from 1790703 to 2857d39, on September 22, 2023 16:12
manno force-pushed the fleet-agent-controller-runtime branch from 2857d39 to c652c1e on September 29, 2023 15:41
manno force-pushed the fleet-agent-controller-runtime branch 3 times, most recently from d1589ae to 4d2ee4f, on October 23, 2023 16:06
manno force-pushed the fleet-agent-controller-runtime branch from c26a7e7 to a67c52a on October 26, 2023 13:54
manno mentioned this pull request on Oct 26, 2023
manno force-pushed the fleet-agent-controller-runtime branch 3 times, most recently from 45266ca to e449b48, on October 31, 2023 10:37
manno force-pushed the fleet-agent-controller-runtime branch from e449b48 to 46225bf on November 6, 2023 10:57
manno changed the title from Fleet agent controller runtime [skip ci] to Fleet agent controller runtime on Nov 22, 2023
manno force-pushed the fleet-agent-controller-runtime branch 5 times, most recently from 7a53be3 to 2919ee6, on November 24, 2023 15:44
manno marked this pull request as ready for review on November 24, 2023 15:45
manno requested a review from a team as a code owner on November 24, 2023 15:45
dev/setup-k3ds: review thread resolved (outdated)
internal/manifest/lookup.go: review thread resolved
internal/helmdeployer/helmcache/secret.go: review thread resolved
internal/cmd/agent/deployer/monitor/updatestatus.go: review thread resolved (outdated)
Comment on lines +111 to +115
// This mechanism of triggering requeues for changes is not ideal.
// It's a workaround since we can't enqueue directly from the trigger
// mini controller. Triggering via a status update is expensive.
// It's hard to compute a stable hash to make this idempotent, because
// the hash would need to be computed over the whole change. We can't
// just use the resource version of the bundle deployment. We would
// need to look at the deployed resources and compute a hash over them.
// However this status update happens for every changed resource, maybe
// multiple times per resource. It will also trigger on a resync.
Contributor:
Happy to talk or brainstorm about this, but could it be an option to order all resources by kind + key, hash their resource versions and compare that against SyncGeneration? This should avoid unnecessary re-deployments, for example on fleet-agent restarts?

Also, we should consider using Generation instead of the ResourceVersion. So status updates are omitted and don't requeue bundle deployments?
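
For illustration, a minimal Go sketch of the ordering-and-hashing idea above; the helper name is hypothetical and not part of this PR. It sorts the deployed resources by GVK and namespace/name, then hashes their resource versions, so the digest is order-independent and could be compared against a stored value such as SyncGeneration:

package agent

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
)

// hashDeployedResources (hypothetical) builds a stable digest over the
// deployed resources: sort by GVK and namespace/name, then hash the
// resource versions. The result can be recomputed and compared without
// redeploying anything.
func hashDeployedResources(objs []*unstructured.Unstructured) string {
	keys := make([]string, 0, len(objs))
	for _, obj := range objs {
		keys = append(keys, fmt.Sprintf("%s|%s/%s|%s",
			obj.GroupVersionKind().String(),
			obj.GetNamespace(), obj.GetName(),
			obj.GetResourceVersion()))
	}
	sort.Strings(keys)

	h := sha256.New()
	for _, k := range keys {
		h.Write([]byte(k))
		h.Write([]byte{'\n'})
	}
	return hex.EncodeToString(h.Sum(nil))
}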

Member Author (manno):
Yes let's talk. I'm currently assuming it's cheaper to trigger than to compute a proper hash.

Contributor:
yes, let's talk about this

Contributor:

> Yes let's talk. I'm currently assuming it's cheaper to trigger than to compute a proper hash.

I'm not sure it's cheaper, especially since BundleDeployment updates may in turn trigger updates on other resources on the upstream cluster (e.g. updating the status of Bundle, GitRepo or Cluster resources). Also, beware of the multiplying factor in the case of many downstream clusters.

Member Author (manno), Nov 27, 2023:
BundleDeployments update all the time, at least every 15 minutes, for the heartbeat timestamp in the "agent" status fields. We will need to ignore some of those status updates when we convert the fleetcontroller.

Since the new controller framework deduplicates events, this might not be too bad.

The alternative is to list all the resources of the bundle. When two resources change, like in the integration test, this will happen twice: some computation, then two updates to the bd status. That also happens every time we deploy. There is a TriggerSleep delay; for any change after that, the mini controller updates the bundledeployment.
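
One stock way to ignore status-only updates (in line with the Generation-vs-ResourceVersion suggestion above) is a generation-based event filter when wiring the watch. A minimal sketch with standard controller-runtime predicates; the function name, reconciler wiring and fleet import path are assumptions, not necessarily what this PR does:

package agent

import (
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/builder"
	"sigs.k8s.io/controller-runtime/pkg/predicate"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"

	fleetv1 "github.com/rancher/fleet/pkg/apis/fleet.cattle.io/v1alpha1"
)

// setupBundleDeploymentWatch (illustrative) filters events with
// GenerationChangedPredicate: status writes, such as the heartbeat
// timestamp, do not bump metadata.generation, so they would not requeue
// the bundledeployment.
func setupBundleDeploymentWatch(mgr ctrl.Manager, r reconcile.Reconciler) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&fleetv1.BundleDeployment{},
			builder.WithPredicates(predicate.GenerationChangedPredicate{})).
		Complete(r)
}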

internal/cmd/agent/apply.go: review thread resolved
internal/cmd/agent/controller.go: review thread resolved (outdated)
manno force-pushed the fleet-agent-controller-runtime branch from 262777e to a138a30 on November 27, 2023 14:00
.golangci.json: review thread resolved (outdated)
Scheme: scheme,
Metrics: metricsserver.Options{BindAddress: metricsAddr},
HealthProbeBindAddress: probeAddr,
LeaderElection: false,
Contributor:
Are we sure we want to remove leader election? If that is the case, do we want to do it in gitjob and fleet controller as well?

We would be losing the ability to be highly available, and we would see some inconsistent behaviour if someone manually scales the statefulset.

Member Author (manno):
We already removed leader election in https://github.com/rancher/fleet/pull/1905/files
We can add a follow up item to bring it back. We'll need to discuss how that works together with the upcoming sharding story.
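
For that follow-up, turning leader election back on is mostly a matter of manager options. A minimal sketch mirroring the options quoted above; the election ID and namespace are placeholders, not values from this PR:

package agent

import (
	"k8s.io/apimachinery/pkg/runtime"
	ctrl "sigs.k8s.io/controller-runtime"
	metricsserver "sigs.k8s.io/controller-runtime/pkg/metrics/server"
)

// newManager (illustrative) enables leader election: only the lease holder
// runs the reconcilers, so extra replicas of the statefulset are safe.
func newManager(scheme *runtime.Scheme, metricsAddr, probeAddr, namespace string) (ctrl.Manager, error) {
	return ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		Scheme:                  scheme,
		Metrics:                 metricsserver.Options{BindAddress: metricsAddr},
		HealthProbeBindAddress:  probeAddr,
		LeaderElection:          true,
		LeaderElectionID:        "fleet-agent-lock", // placeholder lease name
		LeaderElectionNamespace: namespace,
	})
}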

merr = append(merr, fmt.Errorf("failed refreshing drift detection: %w", err))
}

err = r.Cleanup.CleanupReleases(ctx, key, bd)
Contributor:
We iterate through all BundleDeployments each time any of them changes to see if they need to be cleaned up. That's not very efficient. We can use finalizers as I mentioned in the previous comment

Member Author (manno):
Yes, that's old logic again (https://github.com/rancher/fleet/blob/master/internal/cmd/agent/deployer/cleanup/cleanup.go). I'd rather not fix it within this PR, but I'll add it to the follow-up items.

Contributor:
Maybe we can fix both of these using finalizers in a separate PR?
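
A rough sketch of the finalizer pattern being suggested; the finalizer name, function name and cleanup callback are made up for illustration and are not the PR's code:

package agent

import (
	"context"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"

	fleetv1 "github.com/rancher/fleet/pkg/apis/fleet.cattle.io/v1alpha1"
)

const cleanupFinalizer = "fleet.cattle.io/bundle-deployment-cleanup" // made-up name

// reconcileCleanup (illustrative) runs release cleanup only when the
// bundledeployment is actually being deleted, instead of scanning all
// bundledeployments on every change.
func reconcileCleanup(ctx context.Context, c client.Client, bd *fleetv1.BundleDeployment,
	cleanupReleases func(context.Context, *fleetv1.BundleDeployment) error) (ctrl.Result, error) {
	if bd.GetDeletionTimestamp().IsZero() {
		// Live object: make sure the finalizer is set so deletion waits for us.
		if controllerutil.AddFinalizer(bd, cleanupFinalizer) {
			return ctrl.Result{}, c.Update(ctx, bd)
		}
		return ctrl.Result{}, nil
	}
	// Being deleted: clean up once, then release the finalizer.
	if controllerutil.ContainsFinalizer(bd, cleanupFinalizer) {
		if err := cleanupReleases(ctx, bd); err != nil {
			return ctrl.Result{}, err
		}
		controllerutil.RemoveFinalizer(bd, cleanupFinalizer)
		return ctrl.Result{}, c.Update(ctx, bd)
	}
	return ctrl.Result{}, nil
}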

internal/cmd/agent/deployer/driftdetect/driftdetect.go: review thread resolved (outdated)

// TODO is this needed with drift correction?
if len(bd.Status.ModifiedStatus) > 0 && monitor.ShouldRedeploy(bd) {
result.RequeueAfter = durations.MonitorBundleDelay
Contributor:
why do we want to requeue here?

Member Author (manno):
It's old behavior. I'm not sure we can remove it.

Contributor:
If we don't remove it, I think we will always requeue if there is drift. I think this should be removed, since the reconciler is triggered when one of the resources changes. We should avoid using RequeueAfter, to prevent unnecessary (and maybe infinite?) reconcile calls.
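
For reference, the two result shapes under discussion, as a small illustrative helper (not the PR's code; names are made up):

package agent

import (
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

// monitorResult (illustrative) contrasts the two options: rely purely on
// watch events, or fall back to a timed requeue while drift persists.
func monitorResult(drifted bool, redeployDelay time.Duration) ctrl.Result {
	if !drifted {
		// Watch-driven: an empty result means "wait for the next event".
		return ctrl.Result{}
	}
	// Timer-driven: RequeueAfter re-runs the reconciler after the delay even
	// without an event, which can loop as long as the drift is not resolved.
	return ctrl.Result{RequeueAfter: redeployDelay}
}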

manno force-pushed the fleet-agent-controller-runtime branch 3 times, most recently from c10f21e to 1470596, on November 27, 2023 16:16
integrationtests/agent/suite_test.go: review thread resolved (outdated)
internal/cmd/agent/apply.go: review thread resolved
internal/cmd/agent/apply.go: review thread resolved
internal/helmdeployer/deployer.go: review thread resolved
internal/cmd/agent/root.go: review thread resolved
integrationtests/agent/bundle_deployment_status_test.go: review thread resolved (outdated)
integrationtests/agent/bundle_deployment_status_test.go: review thread resolved (outdated)
integrationtests/agent/bundle_deployment_status_test.go: review thread resolved (outdated)
integrationtests/agent/bundle_deployment_status_test.go: review thread resolved (outdated)
* new bundledeployment controller + reconciler
* move code from "handlers" into the deployer packages
* no more leader election for bundledeployment controller
* replace agent's logger with logr.Logger
* move newMappers into clusterstatus package, still uses wrangler
* move "list resources" up into reconciler, don't fetch twice
* agent reconciler only sets status when exiting
* trigger bundledeployment via status on updates
* move requeue and drift correction into reconciler
* simplify bundlestatus condition handling
* add controller-runtime args to agent
  * zap logging config
  * kubeconfig
* increase TriggerSleep delay by 3s
manno force-pushed the fleet-agent-controller-runtime branch from 112a27e to 5fc1324 on November 29, 2023 11:38
manno disabled auto-merge on November 30, 2023 11:06
manno merged commit d0178fa into master on Nov 30, 2023
10 checks passed
manno deleted the fleet-agent-controller-runtime branch on November 30, 2023 11:07
manno changed the title from Fleet agent controller runtime to Convert fleet-agent to controller-runtime on Jan 11, 2024