application-controller Watch failed #15464

Open · yellowhat opened this issue Sep 12, 2023 · 10 comments · Label: bug
@yellowhat

Checklist:

  • I've searched in the docs and FAQ for my answer: https://bit.ly/argocd-faq.
  • I've included steps to reproduce the bug.
  • I've pasted the output of argocd version.

Describe the bug

Hi,
I am using the argo-cd 5.46.2 helm chart.

I have noticed that every 12 hours the application-controller throws the following error:

 retrywatcher.go:130] "Watch failed" err="context canceled"

According to this discussion, some watch permissions are missing.

Currently, the role associated with the application-controller service account has watch on secrets and configmaps:

---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: argo-cd-application-controller
  namespace: argo-cd
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: argo-cd-application-controller
subjects:
- kind: ServiceAccount
  name: argocd-application-controller
  namespace: argo-cd

---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: argo-cd-application-controller
  namespace: argo-cd
rules:
- apiGroups:
  - ""
  resources:
  - secrets
  - configmaps
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - argoproj.io
  resources:
  - applications
  - appprojects
  verbs:
  - create
  - get
  - list
  - watch
  - update
  - patch
  - delete
- apiGroups:
  - ""
  resources:
  - events
  verbs:
  - create

Is there something else missing?
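
For what it's worth, one way to sanity-check the effective permissions granted by the role binding above (namespace and service account names are taken from the manifests; adjust if yours differ):

kubectl auth can-i watch secrets -n argo-cd \
  --as=system:serviceaccount:argo-cd:argocd-application-controller
kubectl auth can-i watch configmaps -n argo-cd \
  --as=system:serviceaccount:argo-cd:argocd-application-controller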

To Reproduce

kubectl logs argo-cd-application-controller-0 | grep Watch

Expected behavior

No error

Version

$ argocd version
argocd: v2.8.3+77556d9
  BuildDate: 2023-09-07T16:05:43Z
  GitCommit: 77556d9e64304c27c718bb0794676713628e435e
  GitTreeState: clean
  GoVersion: go1.20.6
  Compiler: gc
  Platform: linux/amd64

Logs

E0912 08:10:34.158858       7 retrywatcher.go:130] "Watch failed" err="context canceled"
E0912 08:10:34.158977       7 retrywatcher.go:130] "Watch failed" err="context canceled"
E0912 08:10:34.161448       7 retrywatcher.go:130] "Watch failed" err="context canceled"
E0912 08:10:34.162382       7 retrywatcher.go:130] "Watch failed" err="context canceled"
E0912 08:10:34.158558       7 retrywatcher.go:130] "Watch failed" err="context canceled"
E0912 08:10:34.162246       7 retrywatcher.go:130] "Watch failed" err="context canceled"
yellowhat added the bug label on Sep 12, 2023
mimartin12 commented Oct 31, 2023

I am experiencing the same. Every 12 hours, I get about 40 or so errors that all say err="context canceled".
Most of these errors appear after attempting to sync an externally managed cluster. The cluster does sync eventually, but these errors are thrown first.

Time | Host | Message
-----|------|--------
14:39:19 UTC | aks-general-00000-vmss000002-argocd | "Watch failed" err="context canceled"

(the same entry repeats 10 times at 14:39:19 UTC)

@dmarquez-splunk

We are still seeing this issue in Argo CD 2.11.2, and it is causing deployment outages for some of our users. We have one installation with multiple controllers that manage 40+ clusters.

@gdsoumya (Member)

This might be unrelated, but if you are using a limited RBAC for the Argo CD application controller instead of the admin RBAC with permission to all resources on the cluster, you might want to either manually configure resource inclusions/exclusions or use the respectRBAC feature, which lets Argo CD automatically figure out which resources it has access to and needs to monitor/watch. A rough sketch of both options follows the reference below.

Ref:

  1. https://argo-cd.readthedocs.io/en/stable/operator-manual/declarative-setup/#resource-exclusioninclusion
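
For reference, a minimal sketch of what either option can look like in the argocd-cm ConfigMap. The excluded Event kind is purely illustrative, and with the argo-cd Helm chart these keys would normally be set under configs.cm rather than applied directly:

---
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cm
  namespace: argo-cd
  labels:
    app.kubernetes.io/part-of: argocd
data:
  # Option 1: explicitly exclude kinds the controller cannot access anyway
  # (the Event kind here is only an example)
  resource.exclusions: |
    - apiGroups:
      - "*"
      kinds:
      - Event
      clusters:
      - "*"
  # Option 2: let Argo CD discover which resources it may watch
  # (accepted values are "normal" and "strict")
  resource.respectRBAC: "normal"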

@colinodell

We are also seeing 75-200 of these log entries from each application controller every 12 hours on v2.11.3. The timing correlates with the cluster's cache age dropping to 0:

[image: cluster cache age graph]

Here's a zoomed-in look at a 15 minute window:

[image: zoomed-in 15-minute window]

I don't know what this correlation means but thought it might be worth sharing.
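
For anyone who wants to look at the same metric directly, a rough sketch, assuming the controller serves metrics on the default port 8082, that the StatefulSet is named argo-cd-application-controller as in the logs above, and that the cache-age metric is named argocd_cluster_cache_age_seconds (verify all three against your installation):

kubectl -n argo-cd port-forward statefulset/argo-cd-application-controller 8082:8082 &
curl -s http://localhost:8082/metrics | grep argocd_cluster_cache_age_seconds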

@AurimasNav

This morning I found that the controller had been logging this error every second for the whole night:

E0918 05:28:13.841461 7 retrywatcher.go:130] "Watch failed" err="context canceled"
E0918 05:28:14.842231 7 retrywatcher.go:130] "Watch failed" err="context canceled"
E0918 05:28:15.842669 7 retrywatcher.go:130] "Watch failed" err="context canceled"

There are problems with my Argo CD, but this message does not help identify the cause.

@TechDawg270

Experiencing this issue as well. ArgoCD version is v2.12.6. I took a look at the cluster cache age as @colinodell mentioned above, and it is the same pattern for our errors: it occurs every 12 hours and correlates with the cluster's cache age dropping to 0.

dee-kryvenko commented Jan 13, 2025

I see three other similar issues, all marked as "resolved", although this is still happening on v2.13.3 for me at least once a day, sometimes multiple times a day.

It also looks like #20785 is related, and there is another issue, #14134, requesting that more context be added to this "Watch failed" err="context canceled" log entry.

Anyway, to sum it up: I have tried multiple sharding algorithms, my pods are not starved for resources, and everything looks fine, but the controllers get deadlocked. Resource counts go down, queue depth drops to zero, thousands of "Watch failed" err="context canceled" entries are logged, and nothing works until the controllers are restarted. The controllers are still alive, by the way, and still serve the other subset of clusters, so it appears to be a thread-level issue.

My cache age metric also looks weird; here is an example for one of the clusters that is currently stuck. The spike is when the issue started, but how come the cache age was at 0 before the issue started, while the controller was working just fine?

[screenshot: cache age metric for the stuck cluster]

Also see #20785 (comment)

dee-kryvenko commented Jan 13, 2025

Never mind the weird cache age: this particular one turned out to be one of the "test" clusters, which is indeed unstable for other reasons. Everything else still stands, and on the other clusters that do hang, the cache age looks the same for me as it does for everyone else.

[screenshot: cache age on an affected cluster]

@dee-kryvenko

I managed to capture goroutine pprof on a controller in the stuck state.

goroutine.txt
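
In case it helps others capture the same data, a rough sketch of one way to do it, assuming the controller exposes the standard Go /debug/pprof endpoints on its metrics port (8082 by default; whether profiling is actually enabled can depend on your version and flags, so verify that first):

kubectl -n argo-cd port-forward pod/argo-cd-application-controller-0 8082:8082 &
curl -s "http://localhost:8082/debug/pprof/goroutine?debug=2" > goroutine.txt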

@dee-kryvenko

As I continue my troubleshooting, I've found a potentially related problem that might be triggering this bug. I filed it as a new ticket, #21506, because it goes beyond the scope of the deadlock described here; in my opinion it would still be an issue even if it didn't cause a deadlock, not to mention that the two may not even be related at all.
