cni-repair controller (#306)
Fixes linkerd/linkerd2#11073

This fixes an issue where injected pods cannot acquire a proper network config because `linkerd-cni` and/or the cluster's network CNI haven't fully started. Such pods are left in a permanent crash loop, and once the CNI is ready they need to be restarted externally, which is what this controller does.

The `linkerd-cni-repair-controller` watches events on pods on the current node that have been injected but are in a terminated state and whose `linkerd-network-validator` container exited with code 95, and deletes them so they can restart with a proper network config.
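The selection rule described above can be sketched as a plain predicate. This is a minimal illustration, not the controller's actual code: `PodInfo`, `ContainerStatus`, `ContainerState`, and `needs_repair` are hypothetical stand-ins for the Kubernetes pod status fields, and the sketch treats "terminated" as the validator container's state.

```rust
// Hypothetical, simplified stand-ins for the pod status fields mentioned in
// the description; the real controller's types are not shown in this commit page.
#[derive(Debug, Clone, PartialEq)]
pub enum ContainerState {
    Waiting,
    Running,
    Terminated { exit_code: i32 },
}

#[derive(Debug, Clone)]
pub struct ContainerStatus {
    pub name: String,
    pub state: ContainerState,
}

#[derive(Debug, Clone)]
pub struct PodInfo {
    pub node_name: String,
    // Assumption: whether the pod was injected is already known to the caller.
    pub injected: bool,
    pub container_statuses: Vec<ContainerStatus>,
}

// Container name and exit code taken from the commit description.
pub const VALIDATOR_CONTAINER: &str = "linkerd-network-validator";
pub const VALIDATOR_FAILURE_EXIT_CODE: i32 = 95;

/// Returns true when the pod runs on `current_node`, has been injected, and
/// its validator container terminated with the failure exit code.
pub fn needs_repair(pod: &PodInfo, current_node: &str) -> bool {
    pod.node_name == current_node
        && pod.injected
        && pod.container_statuses.iter().any(|c| {
            c.name == VALIDATOR_CONTAINER
                && c.state
                    == ContainerState::Terminated {
                        exit_code: VALIDATOR_FAILURE_EXIT_CODE,
                    }
        })
}
```

A pod for which `needs_repair` returns true would be the one to delete; pods on other nodes, or whose validator exited with any other code, are left alone.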

The controller is to be deployed as an additional container in the `linkerd-cni` DaemonSet (addressed in linkerd/linkerd2#11699).

This exposes two custom counter metrics: `linkerd_cni_repair_controller_queue_overflow` (in the spirit of the destination controller's `endpoint_updates_queue_overflow`) and `linkerd_cni_repair_controller_deleted`.
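One way to picture how two such counters might relate to a work queue is sketched below, using only the standard library. It assumes a bounded queue where the overflow counter increments when the queue is full and the deleted counter increments once per processed pod. The names `Metrics`, `new_queue`, `enqueue`, and `drain` are hypothetical, and the controller's real metrics wiring may differ.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::mpsc::{sync_channel, Receiver, SyncSender, TrySendError};

// Hypothetical in-memory counters standing in for the two exported metrics.
#[derive(Default)]
pub struct Metrics {
    pub queue_overflow: AtomicU64,
    pub deleted: AtomicU64,
}

/// Creates a bounded queue of pod names with the given capacity.
pub fn new_queue(capacity: usize) -> (SyncSender<String>, Receiver<String>) {
    sync_channel(capacity)
}

/// Tries to enqueue a pod without blocking. When the queue is full the pod
/// is dropped and the overflow counter is incremented.
pub fn enqueue(tx: &SyncSender<String>, metrics: &Metrics, pod: String) -> bool {
    match tx.try_send(pod) {
        Ok(()) => true,
        Err(TrySendError::Full(_)) => {
            metrics.queue_overflow.fetch_add(1, Ordering::Relaxed);
            false
        }
        // The receiving side is gone; nothing to count here.
        Err(TrySendError::Disconnected(_)) => false,
    }
}

/// Processes every pod currently queued, calling `delete` for each one and
/// incrementing the deleted counter afterwards.
pub fn drain(rx: &Receiver<String>, metrics: &Metrics, mut delete: impl FnMut(&str)) {
    while let Ok(pod) = rx.try_recv() {
        delete(&pod);
        metrics.deleted.fetch_add(1, Ordering::Relaxed);
    }
}
```

Dropping on overflow rather than blocking keeps the event watcher responsive; the counter then makes the dropped work visible to operators.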
alpeb authored Jan 2, 2024
1 parent 7417ddd commit 67cc03d
Showing 11 changed files with 2,108 additions and 160 deletions.
3 changes: 0 additions & 3 deletions .dockerignore
@@ -1,5 +1,2 @@
Cargo.toml
Cargo.lock
rust-toolchain
validator/
target/