CNIMetricsHelper erroring on polling pods. "Failed to grab CNI endpoint" #3071

Open · taer opened this issue Oct 14, 2024 · 2 comments

taer commented Oct 14, 2024

I am having an issue similar to the one reported in #1912.

I installed cni-metrics-helper via the Helm chart (chart version 1.18.5) on a 1.30 EKS cluster, with all of the addons updated within roughly the last week.

My logs show a failure when the helper attempts to pull metrics from the aws-node pods:

{"level":"info","ts":"2024-10-14T20:06:11.547Z","caller":"cni-metrics-helper/main.go:69","msg":"Constructed new logger instance"}                                                          
{"level":"info","ts":"2024-10-14T20:06:11.548Z","caller":"runtime/proc.go:271","msg":"Starting CNIMetricsHelper. Sending metrics to CloudWatch: false, Prometheus: true, LogLevel DEBUG, me
tricUpdateInterval 30"}                                                                                                                                                                    
{"level":"info","ts":"2024-10-14T20:06:41.588Z","caller":"runtime/proc.go:271","msg":"Collecting metrics ..."}                                                                             
{"level":"info","ts":"2024-10-14T20:06:41.689Z","caller":"metrics/cni_metrics.go:211","msg":"Total aws-node pod count: 5"}                                                                 
{"level":"debug","ts":"2024-10-14T20:06:41.689Z","caller":"metrics/metrics.go:439","msg":"Total TargetList pod count: 5"}                                                                  
{"level":"error","ts":"2024-10-14T20:08:51.287Z","caller":"metrics/metrics.go:399","msg":"grabMetricsFromTarget: Failed to grab CNI endpoint: the server is currently unable to handle the request (get pods aws-node-n929t:61678)"}                                                                                                                                                  
{"level":"error","ts":"2024-10-14T20:11:02.359Z","caller":"metrics/metrics.go:399","msg":"grabMetricsFromTarget: Failed to grab CNI endpoint: the server is currently unable to handle the request (get pods aws-node-xlz6m:61678)"}                                                                                                                                                  
{"level":"error","ts":"2024-10-14T20:13:13.431Z","caller":"metrics/metrics.go:399","msg":"grabMetricsFromTarget: Failed to grab CNI endpoint: the server is currently unable to handle the request (get pods aws-node-6kvmk:61678)"}                                                                                                                                                  
{"level":"error","ts":"2024-10-14T20:15:24.503Z","caller":"metrics/metrics.go:399","msg":"grabMetricsFromTarget: Failed to grab CNI endpoint: the server is currently unable to handle the request (get pods aws-node-8gnpw:61678)"}                                                                                                                                                  
{"level":"error","ts":"2024-10-14T20:17:35.575Z","caller":"metrics/metrics.go:399","msg":"grabMetricsFromTarget: Failed to grab CNI endpoint: the server is currently unable to handle the request (get pods aws-node-fj6n5:61678)"}                                                                                                                                                  
{"level":"info","ts":"2024-10-14T20:17:35.575Z","caller":"runtime/proc.go:271","msg":"Collecting metrics ..."}                                                                             
{"level":"info","ts":"2024-10-14T20:17:35.576Z","caller":"metrics/cni_metrics.go:211","msg":"Total aws-node pod count: 5"}                                                                 
{"level":"debug","ts":"2024-10-14T20:17:35.576Z","caller":"metrics/metrics.go:439","msg":"Total TargetList pod count: 5"}                                                                  
{"level":"error","ts":"2024-10-14T20:19:46.647Z","caller":"metrics/metrics.go:399","msg":"grabMetricsFromTarget: Failed to grab CNI endpoint: the server is currently unable to handle the request (get pods aws-node-6kvmk:61678)"}                                                                                                                                                  
{"level":"error","ts":"2024-10-14T20:21:57.719Z","caller":"metrics/metrics.go:399","msg":"grabMetricsFromTarget: Failed to grab CNI endpoint: the server is currently unable to handle the request (get pods aws-node-8gnpw:61678)"}        

My Helm config is:

env:
  USE_CLOUDWATCH: "false"
  USE_PROMETHEUS: "true"
  AWS_VPC_K8S_CNI_LOGLEVEL: "DEBUG"

Other than that, there is little extra config. The Helm release targets the kube-system namespace.

The ClusterRoleBinding seems correct:

roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cni-metrics-helper
subjects:
  - kind: ServiceAccount
    name: cni-metrics-helper
    namespace: kube-system

The ServiceAccount is in the right place:

$ k get sa -n kube-system cni-metrics-helper
NAME                 SECRETS   AGE
cni-metrics-helper   0         48m
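
To double-check the RBAC side, the API server can also be asked directly whether that ServiceAccount is allowed to hit the pods/proxy subresource (just a sanity-check sketch; it impersonates the ServiceAccount, so it needs impersonation rights for whoever runs it):

kubectl auth can-i get pods --subresource=proxy \
  --as=system:serviceaccount:kube-system:cni-metrics-helper \
  -n kube-system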

I traced the call down to https://github.com/aws/amazon-vpc-cni-k8s/blob/master/cmd/cni-metrics-helper/metrics/metrics.go#L89:

	rawOutput, err := k8sClient.CoreV1().RESTClient().Get().
		Namespace(namespace).
		Resource("pods").
		SubResource("proxy").
		Name(fmt.Sprintf("%v:%v", podName, port)).
		Suffix("metrics").
		Do(ctx).Raw()
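
That Go call boils down to a pod-proxy GET against the API server, so the same request can be reproduced by hand (a sketch; the pod name is one of the failing ones from the log above):

kubectl get --raw "/api/v1/namespaces/kube-system/pods/aws-node-n929t:61678/proxy/metrics"

If this also fails with "the server is currently unable to handle the request", the problem is most likely between the API server and the pod on port 61678 rather than in the metrics helper itself.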

We have Istio installed on this cluster, but it is not in the kube-system namespace.

The other issue mentioned needing to set the region and the cluster ID. I set those manually to see if it would help, and no dice:

        - name: AWS_CLUSTER_ID
          value: k8s-wl-snd-use1-default
        - name: AWS_REGION
          value: us-east-1
        - name: AWS_VPC_K8S_CNI_LOGLEVEL
          value: DEBUG
        - name: USE_CLOUDWATCH
          value: 'false'
        - name: USE_PROMETHEUS
          value: 'true'

There is no security group rule blocking inter-node communication on ports above 1024.
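
I also haven't ruled out the endpoint itself; something along these lines should show whether ipamd is actually serving metrics on 61678 and whether the DaemonSet disables them (a sketch; it assumes curl is present in the aws-node image):

# hit the endpoint from inside one of the failing pods (assumes curl exists in the image)
kubectl exec -n kube-system aws-node-n929t -c aws-node -- curl -s http://localhost:61678/metrics | head

# look for a DISABLE_METRICS setting on the DaemonSet
kubectl get ds aws-node -n kube-system -o yaml | grep -i -A1 DISABLE_METRICS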

Thanks!

taer added the bug label Oct 14, 2024
jaydeokar self-assigned this Nov 6, 2024

jaydeokar (Contributor) commented

What's the scale of your cluster? Are the nodes healthy when the helper tries to pull metrics from the failed pods?
We have not seen this behavior recently. Is there anything installed on the nodes that might be blocking the connections? Any NetworkPolicy?
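
For reference, listing policies across all namespaces should be enough to rule that part out (just a suggestion):

kubectl get networkpolicy -A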

dshehbaj (Member) commented

Hi @taer,

I'm working on reproducing the error you've encountered by setting up an EKS environment similar to yours. To better understand your setup, could you please provide the following information:

1. EKS Cluster Setup

Could you share details about how your EKS cluster was created?

2. Helm Chart Installation Method

Which approach are you using to install the helm chart?

  • Automatic Installation:

    helm install cni-metrics-helper --namespace kube-system eks/cni-metrics-helper
  • Manual Installation (with custom configuration):

    helm install cni-metrics-helper --namespace kube-system ./charts/cni-metrics-helper

For more details, you can refer to the documentation.

3. Configuration Details

Could you please share:

  • The manifest for the CNI Metrics Helper Deployment
  • Pod configurations

I've attempted to reproduce this issue but haven't been successful. For reference, here's my test setup:

eksctl create cluster \
  --name <cluster_name> \
  --version 1.30 \
  --region us-west-2 \
  --with-oidc \
  --nodegroup-name <node_group_name> \
  --node-type t3.small \
  --nodes 3 \
  --nodes-min 1 \
  --nodes-max 3 \
  --managed

I then manually installed the helm chart with configurations matching yours. The additional context you provide will help me better understand and troubleshoot the issue you're experiencing.
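
Concretely, the manual install looked roughly like this, with the values copied from your report:

cat <<'EOF' > values.yaml
env:
  USE_CLOUDWATCH: "false"
  USE_PROMETHEUS: "true"
  AWS_VPC_K8S_CNI_LOGLEVEL: "DEBUG"
EOF

helm install cni-metrics-helper --namespace kube-system ./charts/cni-metrics-helper -f values.yaml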
