Convert fleet-agent to controller-runtime #1772

Merged: manno merged 10 commits into master from the fleet-agent-controller-runtime branch on Nov 30, 2023
Conversation

manno (Member) commented on Sep 11, 2023

Refers to #1734

Requires:

Follow up:

manno changed the title from Fleet agent controller runtime to Fleet agent controller runtime [skip ci] on Sep 11, 2023
manno force-pushed the fleet-agent-controller-runtime branch 2 times, most recently from 1790703 to 2857d39, on September 22, 2023 16:12
manno force-pushed the fleet-agent-controller-runtime branch from 2857d39 to c652c1e on September 29, 2023 15:41
manno force-pushed the fleet-agent-controller-runtime branch 3 times, most recently from d1589ae to 4d2ee4f, on October 23, 2023 16:06
manno force-pushed the fleet-agent-controller-runtime branch from c26a7e7 to a67c52a on October 26, 2023 13:54
manno mentioned this pull request on Oct 26, 2023
manno force-pushed the fleet-agent-controller-runtime branch 3 times, most recently from 45266ca to e449b48, on October 31, 2023 10:37
manno force-pushed the fleet-agent-controller-runtime branch from e449b48 to 46225bf on November 6, 2023 10:57
manno changed the title from Fleet agent controller runtime [skip ci] to Fleet agent controller runtime on Nov 22, 2023
manno force-pushed the fleet-agent-controller-runtime branch 5 times, most recently from 7a53be3 to 2919ee6, on November 24, 2023 15:44
manno marked this pull request as ready for review on November 24, 2023 15:45
manno requested a review from a team as a code owner on November 24, 2023 15:45
dev/setup-k3ds: review thread resolved (outdated)
internal/manifest/lookup.go: review thread resolved
internal/helmdeployer/helmcache/secret.go: review thread resolved
internal/cmd/agent/deployer/monitor/updatestatus.go: review thread resolved (outdated)
Comment on lines +111 to +115
// This mechanism of triggering requeues for changes is not ideal.
// It's a workaround since we can't enqueue directly from the trigger
// mini controller. Triggering via a status update is expensive.
// It's hard to compute a stable hash to make this idempotent, because
// the hash would need to be computed over the whole change. We can't
// just use the resource version of the bundle deployment. We would
// need to look at the deployed resources and compute a hash over them.
// However this status update happens for every changed resource, maybe
// multiple times per resource. It will also trigger on a resync.
Contributor:
Happy to talk or brainstorm about this, but could it be an option to order all resources by kind + key, hash their resource versions and compare that against SyncGeneration? This should avoid unnecessary re-deployments, for example on fleet-agent restarts?

Also, we should consider using Generation instead of the ResourceVersion. So status updates are omitted and don't requeue bundle deployments?
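
For illustration, a minimal Go sketch of the ordering-and-hashing idea above; the helper name is hypothetical and not part of this PR. It sorts the deployed resources by GVK and namespace/name, then hashes their resource versions, so the digest is order-independent and could be compared against a stored value such as SyncGeneration:

package agent

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
)

// hashDeployedResources (hypothetical) builds a stable digest over the
// deployed resources: sort by GVK and namespace/name, then hash the
// resource versions. The result can be recomputed and compared without
// redeploying anything.
func hashDeployedResources(objs []*unstructured.Unstructured) string {
	keys := make([]string, 0, len(objs))
	for _, obj := range objs {
		keys = append(keys, fmt.Sprintf("%s|%s/%s|%s",
			obj.GroupVersionKind().String(),
			obj.GetNamespace(), obj.GetName(),
			obj.GetResourceVersion()))
	}
	sort.Strings(keys)

	h := sha256.New()
	for _, k := range keys {
		h.Write([]byte(k))
		h.Write([]byte{'\n'})
	}
	return hex.EncodeToString(h.Sum(nil))
}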

Member Author (manno):
Yes let's talk. I'm currently assuming it's cheaper to trigger than to compute a proper hash.

Contributor:
yes, let's talk about this

Contributor:

> Yes let's talk. I'm currently assuming it's cheaper to trigger than to compute a proper hash.

I'm not sure it's cheaper, especially since BundleDeployment updates may in turn trigger updates on other resources on the upstream cluster (e.g. updating the status of Bundle, GitRepo or Cluster resources). Also, beware of the multiplying factor in the case of many downstream clusters.

Member Author (manno), Nov 27, 2023:
BundleDeployments update all the time, at least every 15 minutes, for the heartbeat timestamp in the "agent" status fields. We will need to ignore some of those status updates when we convert the fleetcontroller.

Since the new controller framework deduplicates events, this might not be too bad.

The alternative is to list all the resources of the bundle. When two resources change, like in the integration test, this will happen twice: some computation, then two updates to the bd status. That also happens every time we deploy. There is a TriggerSleep delay; for any change after that, the mini controller updates the bundledeployment.
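
One stock way to ignore status-only updates (in line with the Generation-vs-ResourceVersion suggestion above) is a generation-based event filter when wiring the watch. A minimal sketch with standard controller-runtime predicates; the function name, reconciler wiring and fleet import path are assumptions, not necessarily what this PR does:

package agent

import (
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/builder"
	"sigs.k8s.io/controller-runtime/pkg/predicate"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"

	fleetv1 "github.com/rancher/fleet/pkg/apis/fleet.cattle.io/v1alpha1"
)

// setupBundleDeploymentWatch (illustrative) filters events with
// GenerationChangedPredicate: status writes, such as the heartbeat
// timestamp, do not bump metadata.generation, so they would not requeue
// the bundledeployment.
func setupBundleDeploymentWatch(mgr ctrl.Manager, r reconcile.Reconciler) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&fleetv1.BundleDeployment{},
			builder.WithPredicates(predicate.GenerationChangedPredicate{})).
		Complete(r)
}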

internal/cmd/agent/apply.go: review thread resolved
internal/cmd/agent/controller.go: review thread resolved (outdated)
manno force-pushed the fleet-agent-controller-runtime branch from 262777e to a138a30 on November 27, 2023 14:00
.golangci.json: review thread resolved (outdated)
Scheme: scheme,
Metrics: metricsserver.Options{BindAddress: metricsAddr},
HealthProbeBindAddress: probeAddr,
LeaderElection: false,
Contributor:
Are we sure we want to remove leader election? If that is the case, do we want to do it in gitjob and fleet controller as well?

We would be losing the ability to be highly available, and we would see some inconsistent behaviour if someone manually scales the statefulset.

Member Author (manno):
We already removed leader election in https://github.com/rancher/fleet/pull/1905/files
We can add a follow up item to bring it back. We'll need to discuss how that works together with the upcoming sharding story.
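
For that follow-up, turning leader election back on is mostly a matter of manager options. A minimal sketch mirroring the options quoted above; the election ID and namespace are placeholders, not values from this PR:

package agent

import (
	"k8s.io/apimachinery/pkg/runtime"
	ctrl "sigs.k8s.io/controller-runtime"
	metricsserver "sigs.k8s.io/controller-runtime/pkg/metrics/server"
)

// newManager (illustrative) enables leader election: only the lease holder
// runs the reconcilers, so extra replicas of the statefulset are safe.
func newManager(scheme *runtime.Scheme, metricsAddr, probeAddr, namespace string) (ctrl.Manager, error) {
	return ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		Scheme:                  scheme,
		Metrics:                 metricsserver.Options{BindAddress: metricsAddr},
		HealthProbeBindAddress:  probeAddr,
		LeaderElection:          true,
		LeaderElectionID:        "fleet-agent-lock", // placeholder lease name
		LeaderElectionNamespace: namespace,
	})
}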

merr = append(merr, fmt.Errorf("failed refreshing drift detection: %w", err))
}

err = r.Cleanup.CleanupReleases(ctx, key, bd)
Contributor:
We iterate through all BundleDeployments each time any of them changes to see if they need to be cleaned up. That's not very efficient. We can use finalizers as I mentioned in the previous comment

Member Author (manno):
Yes, that's old logic again (https://github.com/rancher/fleet/blob/master/internal/cmd/agent/deployer/cleanup/cleanup.go). I'd rather not fix it within this PR, but I'll add it to the follow-up items.

Contributor:
Maybe we can fix both of these using finalizers in a separate PR?
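
A rough sketch of the finalizer pattern being suggested; the finalizer name, function name and cleanup callback are made up for illustration and are not the PR's code:

package agent

import (
	"context"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"

	fleetv1 "github.com/rancher/fleet/pkg/apis/fleet.cattle.io/v1alpha1"
)

const cleanupFinalizer = "fleet.cattle.io/bundle-deployment-cleanup" // made-up name

// reconcileCleanup (illustrative) runs release cleanup only when the
// bundledeployment is actually being deleted, instead of scanning all
// bundledeployments on every change.
func reconcileCleanup(ctx context.Context, c client.Client, bd *fleetv1.BundleDeployment,
	cleanupReleases func(context.Context, *fleetv1.BundleDeployment) error) (ctrl.Result, error) {
	if bd.GetDeletionTimestamp().IsZero() {
		// Live object: make sure the finalizer is set so deletion waits for us.
		if controllerutil.AddFinalizer(bd, cleanupFinalizer) {
			return ctrl.Result{}, c.Update(ctx, bd)
		}
		return ctrl.Result{}, nil
	}
	// Being deleted: clean up once, then release the finalizer.
	if controllerutil.ContainsFinalizer(bd, cleanupFinalizer) {
		if err := cleanupReleases(ctx, bd); err != nil {
			return ctrl.Result{}, err
		}
		controllerutil.RemoveFinalizer(bd, cleanupFinalizer)
		return ctrl.Result{}, c.Update(ctx, bd)
	}
	return ctrl.Result{}, nil
}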

internal/cmd/agent/deployer/driftdetect/driftdetect.go: review thread resolved (outdated)

// TODO is this needed with drift correction?
if len(bd.Status.ModifiedStatus) > 0 && monitor.ShouldRedeploy(bd) {
result.RequeueAfter = durations.MonitorBundleDelay
Contributor:
why do we want to requeue here?

Member Author (manno):
It's old behavior. I'm not sure we can remove it.

Contributor:
If we don't remove it, I think we will always requeue if there is drift. I think this should be removed, since the reconciler is triggered when one of the resources changes. We should avoid using RequeueAfter, to prevent unnecessary (and maybe infinite?) reconcile calls.
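
For reference, the two result shapes under discussion, as a small illustrative helper (not the PR's code; names are made up):

package agent

import (
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

// monitorResult (illustrative) contrasts the two options: rely purely on
// watch events, or fall back to a timed requeue while drift persists.
func monitorResult(drifted bool, redeployDelay time.Duration) ctrl.Result {
	if !drifted {
		// Watch-driven: an empty result means "wait for the next event".
		return ctrl.Result{}
	}
	// Timer-driven: RequeueAfter re-runs the reconciler after the delay even
	// without an event, which can loop as long as the drift is not resolved.
	return ctrl.Result{RequeueAfter: redeployDelay}
}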

manno force-pushed the fleet-agent-controller-runtime branch 3 times, most recently from c10f21e to 1470596, on November 27, 2023 16:16
integrationtests/agent/suite_test.go: review thread resolved (outdated)
internal/cmd/agent/apply.go: review thread resolved
internal/cmd/agent/apply.go: review thread resolved
internal/helmdeployer/deployer.go: review thread resolved
internal/cmd/agent/root.go: review thread resolved
integrationtests/agent/bundle_deployment_status_test.go: review thread resolved (outdated)
integrationtests/agent/bundle_deployment_status_test.go: review thread resolved (outdated)
integrationtests/agent/bundle_deployment_status_test.go: review thread resolved (outdated)
integrationtests/agent/bundle_deployment_status_test.go: review thread resolved (outdated)
* new bundledeployment controller + reconciler
* move code from "handlers" into the deployer packages
* no more leader election for bundledeployment controller
* replace agent's logger with logr.Logger
* move newMappers into clusterstatus package, still uses wrangler
* move "list resources" up into reconciler, don't fetch twice
* agent reconciler only sets status when exiting
* trigger bundledeployment via status on updates
* move requeue and drift correction into reconciler
* simplify bundlestatus condition handling
* add controller-runtime args to agent
  * zap logging config
  * kubeconfig
* increase TriggerSleep delay by 3s
manno force-pushed the fleet-agent-controller-runtime branch from 112a27e to 5fc1324 on November 29, 2023 11:38
manno disabled auto-merge on November 30, 2023 11:06
manno merged commit d0178fa into master on Nov 30, 2023
10 checks passed
manno deleted the fleet-agent-controller-runtime branch on November 30, 2023 11:07
manno changed the title from Fleet agent controller runtime to Convert fleet-agent to controller-runtime on Jan 11, 2024