found duplicate series for the match group #33

Open
marianobilli opened this issue Apr 23, 2022 · 4 comments
Labels: bug (Something isn't working)

marianobilli commented Apr 23, 2022

Error explanation

Dears, I found an issue in the Prometheus server when it processes the rules: if two different machine types end up with the same IP, and hence the same Kubernetes node name, the rule fails because Prometheus does not know how to join the metrics.

Error message (in the Prometheus server):

err="found duplicate series for the match group {node=\"ip-172-30-42-190.eu-west-1.compute.internal\"} on the right hand-side of the operation:
[
{label_beta_kubernetes_io_instance_type=\"m5d.2xlarge\", label_eks_amazonaws_com_capacity_type=\"ON_DEMAND\", node=\"ip-172-30-42-190.eu-west-1.compute.internal\"}, 
{label_beta_kubernetes_io_instance_type=\"m5d.xlarge\", label_eks_amazonaws_com_capacity_type=\"ON_DEMAND\", node=\"ip-172-30-42-190.eu-west-1.compute.internal\"}];
many-to-many matching not allowed: matching labels must be unique on one side"
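For context, this is the usual one-to-many match failing because the right-hand side has two kube_node_labels-style series for the same node. The failing pattern is roughly the sketch below (the left-hand metric is only illustrative, not the exact rule):

    kube_node_status_allocatable{resource="cpu"}
    * on (node) group_left(label_beta_kubernetes_io_instance_type)
    kube_node_labels

As soon as kube_node_labels returns more than one series for node="ip-172-30-42-190.eu-west-1.compute.internal", Prometheus refuses the join.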

My solution

I've developed node-to-pod-labeler, which scans the whole cluster: as soon as a pod is assigned to a node, it transfers the relevant labels from the node to the pod. kube-state-metrics then exposes these labels as part of the pod metrics, so the rules can be simplified and this error avoided.

I'm sharing it in case you find it useful.

@kaskol10 (Contributor)

Thanks for your contribution @marianobilli. We'll take a look at it and study how to solve this error.

@kaskol10 kaskol10 self-assigned this Apr 25, 2022
@kaskol10 kaskol10 added the bug Something isn't working label Apr 25, 2022
@marianobilli (Author)

If you are interested, here are my adjusted rules. Basically, instead of joining via node to kube_node_labels, I am now joining via pod to kube_pod_labels. Unlike node, pod will truly match 1:1 with a kube_pod_labels result.

In order to have these labels in kube_pod_labels, I added this configuration to kube-state-metrics:

--metric-labels-allowlist=pods=[eks.amazonaws.com/capacityType,node.kubernetes.io/instance-type,beta.kubernetes.io/instance-type]
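With that allowlist in place, kube-state-metrics exposes the allowlisted pod labels as label_* labels on kube_pod_labels (the Kubernetes label keys are sanitized into Prometheus-safe label names), so each pod carries its own copy of the instance-type and capacity-type information. Roughly like this (values are illustrative):

    kube_pod_labels{namespace="default", pod="my-app-7d4b9c8f6d-abcde", node="ip-172-30-42-190.eu-west-1.compute.internal", label_eks_amazonaws_com_capacity_type="ON_DEMAND", label_beta_kubernetes_io_instance_type="m5d.xlarge"} 1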

The adjusted Prometheus rules:

      - expr: |-
          sum by (cluster, namespace, pod, container) (
            irate(container_cpu_usage_seconds_total{image!=""}[5m])
          ) * on (cluster, namespace, pod) group_left(node, label_beta_kubernetes_io_instance_type) topk by (cluster, namespace, pod) (
            1, max by(cluster, namespace, pod, node, label_beta_kubernetes_io_instance_type) (kube_pod_labels{node!=""})
          )
        record: node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate
      - expr: |-
          kube_pod_container_resource_requests{resource="nvidia_com_gpu",}  * on (namespace, pod, cluster)
          group_left() max by (namespace, pod, cluster) (
            (kube_pod_status_phase{phase=~"Pending|Running"} == 1)
          )
        record: cluster:namespace:pod_gpu:active:kube_pod_container_resource_requests
      - expr: |-
          (
            (
              sum by(namespace, node, pod) (cluster:namespace:pod_gpu:active:kube_pod_container_resource_requests)
              * on (pod) group_left(label_eks_amazonaws_com_capacity_type, label_beta_kubernetes_io_instance_type)
              sum by (label_eks_amazonaws_com_capacity_type, label_beta_kubernetes_io_instance_type, pod) (kube_pod_labels{label_eks_amazonaws_com_capacity_type="ON_DEMAND"})
            )
            * ignoring(namespace, node, pod) group_left(label_eks_amazonaws_com_capacity_type, label_beta_kubernetes_io_instance_type)
            sum by (label_eks_amazonaws_com_capacity_type, label_beta_kubernetes_io_instance_type) (instance_cost{label_eks_amazonaws_com_capacity_type="ON_DEMAND"})
          )
        record: capacity_instance_namespace_node_pod:pod_gpu_requests_instance_gpu_price:on_demand_pod_gpu_requests_cost
      - expr: |-
          (
            (
              (sum by(namespace, node, pod) (cluster:namespace:pod_memory:active:kube_pod_container_resource_requests) /1024/1024/1024)
              * on (pod) group_left(label_eks_amazonaws_com_capacity_type, label_beta_kubernetes_io_instance_type)
              sum by (label_eks_amazonaws_com_capacity_type, label_beta_kubernetes_io_instance_type, pod) (kube_pod_labels{label_eks_amazonaws_com_capacity_type="ON_DEMAND"})
            )
            * ignoring(namespace, node, pod) group_left(label_eks_amazonaws_com_capacity_type, label_beta_kubernetes_io_instance_type)
            sum by (label_eks_amazonaws_com_capacity_type, label_beta_kubernetes_io_instance_type) (instance_mem_price{label_eks_amazonaws_com_capacity_type="ON_DEMAND"})
          )
        record: capacity_instance_namespace_node_pod:pod_memory_requests_instance_mem_price:on_demand_pod_mem_requests_cost
      - expr: |-
          (
            (
              sum by(namespace, node, pod) (cluster:namespace:pod_cpu:active:kube_pod_container_resource_requests)
              * on (pod) group_left(label_eks_amazonaws_com_capacity_type, label_beta_kubernetes_io_instance_type)
              sum by (label_eks_amazonaws_com_capacity_type, label_beta_kubernetes_io_instance_type, pod) (kube_pod_labels{label_eks_amazonaws_com_capacity_type="ON_DEMAND"})
            )
            * ignoring(namespace, node, pod) group_left(label_eks_amazonaws_com_capacity_type, label_beta_kubernetes_io_instance_type)
            sum by (label_eks_amazonaws_com_capacity_type, label_beta_kubernetes_io_instance_type) (instance_cpu_price{label_eks_amazonaws_com_capacity_type="ON_DEMAND"})
          )
        record: capacity_instance_namespace_node_pod:pod_cpu_requests_instance_cpu_price:on_demand_pod_cpu_requests_cost
      - expr: |-
          (
            (
              (sum by (namespace, node, pod) (container_memory_working_set_bytes{name!=""}) /1024/1024/1024)
              * on (pod) group_left(label_eks_amazonaws_com_capacity_type, label_beta_kubernetes_io_instance_type)
              sum by (label_eks_amazonaws_com_capacity_type, label_beta_kubernetes_io_instance_type, pod) (kube_pod_labels{label_eks_amazonaws_com_capacity_type="ON_DEMAND"})
            )
            * ignoring(namespace, node, pod) group_left(label_eks_amazonaws_com_capacity_type, label_beta_kubernetes_io_instance_type)
            sum by (label_eks_amazonaws_com_capacity_type, label_beta_kubernetes_io_instance_type) (instance_mem_price{label_eks_amazonaws_com_capacity_type="ON_DEMAND"})
          )
        record: capacity_instance_namespace_node_pod:pod_memory_usage_instance_mem_price:on_demand_pod_mem_usage_cost
      - expr: |-
          (
            (
              sum by(namespace, node, pod) (node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate)
              * on (pod) group_left(label_eks_amazonaws_com_capacity_type, label_beta_kubernetes_io_instance_type)
              sum by (label_eks_amazonaws_com_capacity_type, label_beta_kubernetes_io_instance_type, pod) (kube_pod_labels{label_eks_amazonaws_com_capacity_type="ON_DEMAND"})
            )
            * ignoring(namespace, node, pod) group_left(label_eks_amazonaws_com_capacity_type, label_beta_kubernetes_io_instance_type)
            sum by (label_eks_amazonaws_com_capacity_type, label_beta_kubernetes_io_instance_type) (instance_cpu_price{label_eks_amazonaws_com_capacity_type="ON_DEMAND"})
          )
        record: capacity_instance_namespace_node_pod:pod_cpu_usage_instance_cpu_price:on_demand_pod_cpu_usage_cost
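As a quick sanity check that the right-hand side really is unique per pod (it should return no series if the 1:1 assumption holds), something like:

    count by (pod) (kube_pod_labels) > 1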

@marianobilli (Author)

I realized that for these 4 recording rules this will still be an issue:

capacity_instance_node_resource:kube_node_status_allocatable_idle_instance_cpu_price:on_demand_idle_cpu_cost
capacity_instance_node_resource:kube_node_status_allocatable_idle_instance_mem_price:on_demand_idle_mem_cost
capacity_instance_node_resource:kube_node_status_shared_instance_cpu_price:on_demand_shared_cpu_cost
capacity_instance_node_resource:kube_node_status_shared_instance_mem_price:on_demand_shared_mem_cost
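These are node-level rules, so the pod-label join above does not help; I assume they still join node metrics to kube_node_labels on node, which is exactly the pattern that fails when two nodes report the same name. One way to at least avoid the error would be to reuse the same topk/max dedup idiom as in the rules above, this time on the node side. A rough sketch (the left-hand metric is illustrative; note it picks one of the duplicate label sets arbitrarily, so it avoids the error rather than truly fixing it):

    kube_node_status_allocatable{resource="cpu"}
    * on (node) group_left(label_eks_amazonaws_com_capacity_type, label_beta_kubernetes_io_instance_type)
    topk by (node) (
      1, max by (node, label_eks_amazonaws_com_capacity_type, label_beta_kubernetes_io_instance_type) (kube_node_labels)
    )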


marianobilli commented Apr 28, 2022

I think the solution could be to use resource-based hostname naming for the EKS nodes; it can be configured on the launch template of the managed node groups:
https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-instance-naming.html

The problem for me is that the Terraform EKS module does not expose the hostname_type parameter: https://github.com/terraform-aws-modules/terraform-aws-eks/blob/dc8a6eecddc2c6957ba309a939ee46f39b946461/modules/eks-managed-node-group/main.tf#L45

So it will require a PR to them.
