application-controller Watch failed #15464

Open · yellowhat opened this issue Sep 12, 2023 · 10 comments · Label: bug
@yellowhat

Checklist:

  • I've searched in the docs and FAQ for my answer: https://bit.ly/argocd-faq.
  • I've included steps to reproduce the bug.
  • I've pasted the output of argocd version.

Describe the bug

Hi,
I am using the argo-cd 5.46.2 helm chart.

I have noticed that every 12 hours the application-controller throws the following error:

 retrywatcher.go:130] "Watch failed" err="context canceled"

According to this discussion, some watch permissions are missing.

Currently, the role associated with the application-controller service account has watch on secrets and configmaps:

---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: argo-cd-application-controller
  namespace: argo-cd
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: argo-cd-application-controller
subjects:
- kind: ServiceAccount
  name: argocd-application-controller
  namespace: argo-cd

---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: argo-cd-application-controller
  namespace: argo-cd
rules:
- apiGroups:
  - ""
  resources:
  - secrets
  - configmaps
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - argoproj.io
  resources:
  - applications
  - appprojects
  verbs:
  - create
  - get
  - list
  - watch
  - update
  - patch
  - delete
- apiGroups:
  - ""
  resources:
  - events
  verbs:
  - create

Is there something else missing?
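
For what it's worth, one way to sanity-check the effective permissions granted by the role binding above (namespace and service account names are taken from the manifests; adjust if yours differ):

kubectl auth can-i watch secrets -n argo-cd \
  --as=system:serviceaccount:argo-cd:argocd-application-controller
kubectl auth can-i watch configmaps -n argo-cd \
  --as=system:serviceaccount:argo-cd:argocd-application-controller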

To Reproduce

kubectl logs argo-cd-application-controller-0 | grep Watch

Expected behavior

No error

Version

$ argocd version
argocd: v2.8.3+77556d9
  BuildDate: 2023-09-07T16:05:43Z
  GitCommit: 77556d9e64304c27c718bb0794676713628e435e
  GitTreeState: clean
  GoVersion: go1.20.6
  Compiler: gc
  Platform: linux/amd64

Logs

E0912 08:10:34.158858       7 retrywatcher.go:130] "Watch failed" err="context canceled"
E0912 08:10:34.158977       7 retrywatcher.go:130] "Watch failed" err="context canceled"
E0912 08:10:34.161448       7 retrywatcher.go:130] "Watch failed" err="context canceled"
E0912 08:10:34.162382       7 retrywatcher.go:130] "Watch failed" err="context canceled"
E0912 08:10:34.158558       7 retrywatcher.go:130] "Watch failed" err="context canceled"
E0912 08:10:34.162246       7 retrywatcher.go:130] "Watch failed" err="context canceled"
yellowhat added the bug label on Sep 12, 2023
mimartin12 commented Oct 31, 2023

I am experiencing the same. Every 12 hours, I get about 40 or so errors that all say err="context canceled".
Most of these errors appear after attempting to sync an externally managed cluster. The cluster does sync eventually, but these errors are thrown first.

Time | Host | Message
-----|------|--------
14:39:19 UTC | aks-general-00000-vmss000002-argocd | "Watch failed" err="context canceled"

(the same entry repeats 10 times at 14:39:19 UTC)

@dmarquez-splunk

We are still seeing this issue in Argo CD 2.11.2, and it is causing deployment outages for some of our users. We have one installation with multiple controllers that manage 40+ clusters.

@gdsoumya (Member)

This might be unrelated, but if you are using a limited RBAC for the Argo CD application controller instead of the admin RBAC with permission to all resources on the cluster, you might want to either manually configure resource inclusions/exclusions or use the respectRBAC feature, which lets Argo CD automatically figure out which resources it has access to and needs to monitor/watch. A rough sketch of both options follows the reference below.

Ref:

  1. https://argo-cd.readthedocs.io/en/stable/operator-manual/declarative-setup/#resource-exclusioninclusion
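
For reference, a minimal sketch of what either option can look like in the argocd-cm ConfigMap. The excluded Event kind is purely illustrative, and with the argo-cd Helm chart these keys would normally be set under configs.cm rather than applied directly:

---
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cm
  namespace: argo-cd
  labels:
    app.kubernetes.io/part-of: argocd
data:
  # Option 1: explicitly exclude kinds the controller cannot access anyway
  # (the Event kind here is only an example)
  resource.exclusions: |
    - apiGroups:
      - "*"
      kinds:
      - Event
      clusters:
      - "*"
  # Option 2: let Argo CD discover which resources it may watch
  # (accepted values are "normal" and "strict")
  resource.respectRBAC: "normal"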

@colinodell

We are also seeing 75-200 of these log entries from each application controller every 12 hours on v2.11.3. The timing correlates with the cluster's cache age dropping to 0:

[image: cluster cache age graph]

Here's a zoomed-in look at a 15 minute window:

[image: zoomed-in 15-minute window]

I don't know what this correlation means but thought it might be worth sharing.
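
For anyone who wants to look at the same metric directly, a rough sketch, assuming the controller serves metrics on the default port 8082, that the StatefulSet is named argo-cd-application-controller as in the logs above, and that the cache-age metric is named argocd_cluster_cache_age_seconds (verify all three against your installation):

kubectl -n argo-cd port-forward statefulset/argo-cd-application-controller 8082:8082 &
curl -s http://localhost:8082/metrics | grep argocd_cluster_cache_age_seconds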

@AurimasNav

This morning I found that the controller had been logging this error every second for the whole night:

E0918 05:28:13.841461 7 retrywatcher.go:130] "Watch failed" err="context canceled"
E0918 05:28:14.842231 7 retrywatcher.go:130] "Watch failed" err="context canceled"
E0918 05:28:15.842669 7 retrywatcher.go:130] "Watch failed" err="context canceled"

There are problems with my Argo CD, but this message does not help identify the cause.

@TechDawg270

Experiencing this issue as well. ArgoCD version is v2.12.6. I took a look at the cluster cache age as @colinodell mentioned above, and it is the same pattern for our errors: it occurs every 12 hours and correlates with the cluster's cache age dropping to 0.

dee-kryvenko commented Jan 13, 2025

I see three other similar issues, all marked as "resolved", although this is still happening on v2.13.3 for me at least once a day, sometimes multiple times a day.

It also looks like #20785 is related, and there is another issue, #14134, requesting that more context be added to this "Watch failed" err="context canceled" log entry.

Anyway, to sum it up: I have tried multiple sharding algorithms, my pods are not starved for resources, and everything looks fine, but the controllers get deadlocked. Resource counts go down, queue depth drops to zero, thousands of "Watch failed" err="context canceled" entries are logged, and nothing works until the controllers are restarted. The controllers are still alive, by the way, and still serve the other subset of clusters, so it appears to be a thread-level issue.

My cache age metric also looks weird; here is an example for one of the clusters that is currently stuck. The spike is when the issue started, but how come the cache age was at 0 before the issue started, while the controller was working just fine?

[screenshot: cache age metric for the stuck cluster]

Also see #20785 (comment)

dee-kryvenko commented Jan 13, 2025

Never mind the weird cache age: this particular one turned out to be one of the "test" clusters, which is indeed unstable for other reasons. Everything else still stands, and on the other clusters that do hang, the cache age looks the same for me as it does for everyone else.

[screenshot: cache age on an affected cluster]

@dee-kryvenko

I managed to capture goroutine pprof on a controller in the stuck state.

goroutine.txt
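
In case it helps others capture the same data, a rough sketch of one way to do it, assuming the controller exposes the standard Go /debug/pprof endpoints on its metrics port (8082 by default; whether profiling is actually enabled can depend on your version and flags, so verify that first):

kubectl -n argo-cd port-forward pod/argo-cd-application-controller-0 8082:8082 &
curl -s "http://localhost:8082/debug/pprof/goroutine?debug=2" > goroutine.txt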

@dee-kryvenko

As I continue my troubleshooting, I've found a potentially related problem that might be triggering this bug. I filed it as a new ticket, #21506, because it goes beyond the scope of the deadlock described here; in my opinion it would still be an issue even if it didn't cause a deadlock, not to mention that the two may not even be related at all.
