[SURE-8794] Deploying ClusterGroup from GitRepo results in loop #2859

Open

p-se opened this issue Sep 17, 2024 · 3 comments

@p-se
Contributor

p-se commented Sep 17, 2024

Deploying a ClusterGroup from a GitRepo that also contains other GitRepo resources using the newly created ClusterGroup results in a loop.

This loop repeatedly triggers the ClusterGroup and appends to its status message, which grows endlessly until etcd's size limit is hit, at which point Fleet is supposedly blocked.

The issue can be reproduced by adding this GitRepo resource to the cluster. It was reproducible on the latest Fleet development version at the time and did not require a Rancher installation. The cluster was prepared using dev/setup-multi-cluster.
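For context, a minimal sketch of the kind of GitRepo resource described above, assuming a setup like the one created by dev/setup-multi-cluster; the name, namespace, repository URL and path are placeholders, not the actual resource linked in the report:

```yaml
kind: GitRepo
apiVersion: fleet.cattle.io/v1alpha1
metadata:
  name: clustergroup-loop          # hypothetical name
  namespace: fleet-local           # assumption: deploys to the local cluster, where ClusterGroup/GitRepo resources live
spec:
  repo: https://example.com/repro-repo   # placeholder, not the repository from the report
  paths:
  - clustergroups                        # directory containing a ClusterGroup plus another GitRepo that uses it (see sketch further below)
```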

@rancherbot rancherbot added this to Fleet Sep 17, 2024
@github-project-automation github-project-automation bot moved this to 🆕 New in Fleet Sep 17, 2024
@p-se p-se moved this from 🆕 New to To Triage in Fleet Sep 17, 2024
@p-se p-se added JIRA Must shout kind/bug labels Sep 17, 2024
@kkaempf kkaempf added this to the v2.9.3 milestone Sep 17, 2024
@kkaempf kkaempf modified the milestones: v2.9.3, 2.9.4 Oct 2, 2024
@manno manno moved this from To Triage to 📋 Backlog in Fleet Oct 23, 2024
@manno manno unassigned p-se Oct 23, 2024
@manno manno modified the milestones: v2.9.4, v2.11.0, v2.9.5 Oct 23, 2024
@p-se p-se self-assigned this Oct 25, 2024
@p-se p-se moved this from 📋 Backlog to 🏗 In progress in Fleet Oct 25, 2024
p-se added a commit to p-se/fleet that referenced this issue Oct 31, 2024
Prevents fleet from crashing due to resources exceeding etcd's configured size limit.

Deduplicating messages should only be necessary for edge cases which are not officially supported by fleet but result in ever-increasing message sizes.

The growth is caused by messages being copied from one resource to another and back again, with every resource adding its own status to the message. This only happens if a cluster group is deployed by a GitRepo, which results in a bundle containing a cluster group. That bundle can only become ready once the cluster group is ready, but if the cluster group points to the cluster of the bundle that deployed it, this can never happen. The user is expected to fix this situation, but deduplicating the messages prevents them from growing to the point where etcd's limit is reached and fleet crashes.

Deduplicating the messages also avoids frequently changing the status of resources, which results in fewer controllers being triggered.
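
To illustrate the cycle described in the commit message, here is a hypothetical sketch of what the repository deployed by such a GitRepo might contain; all names, labels and URLs are placeholders:

```yaml
# A ClusterGroup that selects the same cluster the bundle is deployed to,
# plus a second GitRepo targeting that ClusterGroup.
kind: ClusterGroup
apiVersion: fleet.cattle.io/v1alpha1
metadata:
  name: loop-group                 # hypothetical name
  namespace: fleet-local
spec:
  selector:
    matchLabels:
      env: dev                     # assumed label on the cluster the parent bundle is deployed to
---
kind: GitRepo
apiVersion: fleet.cattle.io/v1alpha1
metadata:
  name: loop-child                 # hypothetical name
  namespace: fleet-local
spec:
  repo: https://example.com/other-repo   # placeholder
  targets:
  - clusterGroup: loop-group             # targets the ClusterGroup created by the parent GitRepo
```

The bundle that contains loop-group can only become ready once loop-group itself is ready, but loop-group selects the very cluster that bundle is deployed to, so readiness never settles and the status messages keep being copied back and forth.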
p-se added a commit to p-se/fleet that referenced this issue Oct 31, 2024 (same commit message as above)
p-se added a commit to p-se/fleet that referenced this issue Oct 31, 2024 (same commit message as above)
p-se added a commit to p-se/fleet that referenced this issue Nov 5, 2024 (same commit message as above)
@p-se p-se moved this from 🏗 In progress to 👀 In review in Fleet Nov 5, 2024
@p-se p-se moved this from 👀 In review to Needs QA review in Fleet Nov 6, 2024
@mmartin24 mmartin24 self-assigned this Nov 12, 2024
@manno manno modified the milestones: v2.9.5, v2.10.1 Dec 3, 2024
@rancher rancher deleted a comment from rancherbot Dec 3, 2024
@p-se
Contributor Author

p-se commented Dec 3, 2024

/backport v2.10.1

@mmartin24
Collaborator

mmartin24 commented Dec 4, 2024

I tested this in v2.10-d8667221a2eec48d4350d00a9d39aee54f00f810-head with the fleet chart fleet:105.0.1+up0.11.1.

I saw a significant improvement over 2.8.5, where the logs grow every few seconds and fill the page within a matter of minutes, as can be seen in this screenshot:

[screenshot: log growth on 2.8.5]

When I checked in v2.10-d8667221a2eec48d4350d00a9d39aee54f00f810-head with the fleet chart fleet:105.0.1+up0.11.1, the log growth was significantly lower; however, it was still present after a few hours, as can be seen in this screenshot:

[screenshot: log growth on 2.10 after a few hours]

@p-se is this expected?

@p-se
Contributor Author

p-se commented Dec 4, 2024

@p-se is this expected?

No, it is not expected for the status to keep growing! That said, I'm not sure whether the fix is included in the versions you used for testing.

Projects
Status: Needs QA review