found duplicate series for the match group #33

Open
marianobilli opened this issue Apr 23, 2022 · 4 comments
Labels: bug (Something isn't working)

marianobilli commented Apr 23, 2022

Error explanation

Dears, I found an issue in the Prometheus server when it processes the rules: if two different machine types end up with the same IP, and hence the same Kubernetes node name, the rule fails because Prometheus does not know how to join the metrics.

Error message (in the Prometheus server):

err="found duplicate series for the match group {node=\"ip-172-30-42-190.eu-west-1.compute.internal\"} on the right hand-side of the operation:
[
{label_beta_kubernetes_io_instance_type=\"m5d.2xlarge\", label_eks_amazonaws_com_capacity_type=\"ON_DEMAND\", node=\"ip-172-30-42-190.eu-west-1.compute.internal\"}, 
{label_beta_kubernetes_io_instance_type=\"m5d.xlarge\", label_eks_amazonaws_com_capacity_type=\"ON_DEMAND\", node=\"ip-172-30-42-190.eu-west-1.compute.internal\"}];
many-to-many matching not allowed: matching labels must be unique on one side"
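For context, this is the usual one-to-many match failing because the right-hand side has two kube_node_labels-style series for the same node. The failing pattern is roughly the sketch below (the left-hand metric is only illustrative, not the exact rule):

    kube_node_status_allocatable{resource="cpu"}
    * on (node) group_left(label_beta_kubernetes_io_instance_type)
    kube_node_labels

As soon as kube_node_labels returns more than one series for node="ip-172-30-42-190.eu-west-1.compute.internal", Prometheus refuses the join.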

My solution

I've developed node-to-pod-labeler, which scans the whole cluster: as soon as a pod is assigned to a node, it transfers the relevant labels from the node to the pod. kube-state-metrics then exposes these labels as part of the pod metrics, so the rules can be simplified and this error avoided.

I'm sharing it in case you find it useful.

@kaskol10 (Contributor)

Thanks for your contribution @marianobilli. We'll take a look at it and study how to solve this error.

@kaskol10 kaskol10 self-assigned this Apr 25, 2022
@kaskol10 kaskol10 added the bug Something isn't working label Apr 25, 2022
@marianobilli (Author)

If you are interested, here are my adjusted rules. Basically, instead of joining via node to kube_node_labels, I am now joining via pod to kube_pod_labels. Unlike node, pod will truly match 1:1 with a kube_pod_labels result.

In order to have these labels in kube_pod_labels, I added this configuration to kube-state-metrics:

--metric-labels-allowlist=pods=[eks.amazonaws.com/capacityType,node.kubernetes.io/instance-type,beta.kubernetes.io/instance-type]
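With that allowlist in place, kube-state-metrics exposes the allowlisted pod labels as label_* labels on kube_pod_labels (the Kubernetes label keys are sanitized into Prometheus-safe label names), so each pod carries its own copy of the instance-type and capacity-type information. Roughly like this (values are illustrative):

    kube_pod_labels{namespace="default", pod="my-app-7d4b9c8f6d-abcde", node="ip-172-30-42-190.eu-west-1.compute.internal", label_eks_amazonaws_com_capacity_type="ON_DEMAND", label_beta_kubernetes_io_instance_type="m5d.xlarge"} 1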

The adjusted Prometheus rules:

      - expr: |-
          sum by (cluster, namespace, pod, container) (
            irate(container_cpu_usage_seconds_total{image!=""}[5m])
          ) * on (cluster, namespace, pod) group_left(node, label_beta_kubernetes_io_instance_type) topk by (cluster, namespace, pod) (
            1, max by(cluster, namespace, pod, node, label_beta_kubernetes_io_instance_type) (kube_pod_labels{node!=""})
          )
        record: node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate
      - expr: |-
          kube_pod_container_resource_requests{resource="nvidia_com_gpu",}  * on (namespace, pod, cluster)
          group_left() max by (namespace, pod, cluster) (
            (kube_pod_status_phase{phase=~"Pending|Running"} == 1)
          )
        record: cluster:namespace:pod_gpu:active:kube_pod_container_resource_requests
      - expr: |-
          (
            (
              sum by(namespace, node, pod) (cluster:namespace:pod_gpu:active:kube_pod_container_resource_requests)
              * on (pod) group_left(label_eks_amazonaws_com_capacity_type, label_beta_kubernetes_io_instance_type)
              sum by (label_eks_amazonaws_com_capacity_type, label_beta_kubernetes_io_instance_type, pod) (kube_pod_labels{label_eks_amazonaws_com_capacity_type="ON_DEMAND"})
            )
            * ignoring(namespace, node, pod) group_left(label_eks_amazonaws_com_capacity_type, label_beta_kubernetes_io_instance_type)
            sum by (label_eks_amazonaws_com_capacity_type, label_beta_kubernetes_io_instance_type) (instance_cost{label_eks_amazonaws_com_capacity_type="ON_DEMAND"})
          )
        record: capacity_instance_namespace_node_pod:pod_gpu_requests_instance_gpu_price:on_demand_pod_gpu_requests_cost
      - expr: |-
          (
            (
              (sum by(namespace, node, pod) (cluster:namespace:pod_memory:active:kube_pod_container_resource_requests) /1024/1024/1024)
              * on (pod) group_left(label_eks_amazonaws_com_capacity_type, label_beta_kubernetes_io_instance_type)
              sum by (label_eks_amazonaws_com_capacity_type, label_beta_kubernetes_io_instance_type, pod) (kube_pod_labels{label_eks_amazonaws_com_capacity_type="ON_DEMAND"})
            )
            * ignoring(namespace, node, pod) group_left(label_eks_amazonaws_com_capacity_type, label_beta_kubernetes_io_instance_type)
            sum by (label_eks_amazonaws_com_capacity_type, label_beta_kubernetes_io_instance_type) (instance_mem_price{label_eks_amazonaws_com_capacity_type="ON_DEMAND"})
          )
        record: capacity_instance_namespace_node_pod:pod_memory_requests_instance_mem_price:on_demand_pod_mem_requests_cost
      - expr: |-
          (
            (
              sum by(namespace, node, pod) (cluster:namespace:pod_cpu:active:kube_pod_container_resource_requests)
              * on (pod) group_left(label_eks_amazonaws_com_capacity_type, label_beta_kubernetes_io_instance_type)
              sum by (label_eks_amazonaws_com_capacity_type, label_beta_kubernetes_io_instance_type, pod) (kube_pod_labels{label_eks_amazonaws_com_capacity_type="ON_DEMAND"})
            )
            * ignoring(namespace, node, pod) group_left(label_eks_amazonaws_com_capacity_type, label_beta_kubernetes_io_instance_type)
            sum by (label_eks_amazonaws_com_capacity_type, label_beta_kubernetes_io_instance_type) (instance_cpu_price{label_eks_amazonaws_com_capacity_type="ON_DEMAND"})
          )
        record: capacity_instance_namespace_node_pod:pod_cpu_requests_instance_cpu_price:on_demand_pod_cpu_requests_cost
      - expr: |-
          (
            (
              (sum by (namespace, node, pod) (container_memory_working_set_bytes{name!=""}) /1024/1024/1024)
              * on (pod) group_left(label_eks_amazonaws_com_capacity_type, label_beta_kubernetes_io_instance_type)
              sum by (label_eks_amazonaws_com_capacity_type, label_beta_kubernetes_io_instance_type, pod) (kube_pod_labels{label_eks_amazonaws_com_capacity_type="ON_DEMAND"})
            )
            * ignoring(namespace, node, pod) group_left(label_eks_amazonaws_com_capacity_type, label_beta_kubernetes_io_instance_type)
            sum by (label_eks_amazonaws_com_capacity_type, label_beta_kubernetes_io_instance_type) (instance_mem_price{label_eks_amazonaws_com_capacity_type="ON_DEMAND"})
          )
        record: capacity_instance_namespace_node_pod:pod_memory_usage_instance_mem_price:on_demand_pod_mem_usage_cost
      - expr: |-
          (
            (
              sum by(namespace, node, pod) (node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate)
              * on (pod) group_left(label_eks_amazonaws_com_capacity_type, label_beta_kubernetes_io_instance_type)
              sum by (label_eks_amazonaws_com_capacity_type, label_beta_kubernetes_io_instance_type, pod) (kube_pod_labels{label_eks_amazonaws_com_capacity_type="ON_DEMAND"})
            )
            * ignoring(namespace, node, pod) group_left(label_eks_amazonaws_com_capacity_type, label_beta_kubernetes_io_instance_type)
            sum by (label_eks_amazonaws_com_capacity_type, label_beta_kubernetes_io_instance_type) (instance_cpu_price{label_eks_amazonaws_com_capacity_type="ON_DEMAND"})
          )
        record: capacity_instance_namespace_node_pod:pod_cpu_usage_instance_cpu_price:on_demand_pod_cpu_usage_cost
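As a quick sanity check that the right-hand side really is unique per pod (it should return no series if the 1:1 assumption holds), something like:

    count by (pod) (kube_pod_labels) > 1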

@marianobilli (Author)

I realized that for these 4 recording rules this will still be an issue:

capacity_instance_node_resource:kube_node_status_allocatable_idle_instance_cpu_price:on_demand_idle_cpu_cost
capacity_instance_node_resource:kube_node_status_allocatable_idle_instance_mem_price:on_demand_idle_mem_cost
capacity_instance_node_resource:kube_node_status_shared_instance_cpu_price:on_demand_shared_cpu_cost
capacity_instance_node_resource:kube_node_status_shared_instance_mem_price:on_demand_shared_mem_cost
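These are node-level rules, so the pod-label join above does not help; I assume they still join node metrics to kube_node_labels on node, which is exactly the pattern that fails when two nodes report the same name. One way to at least avoid the error would be to reuse the same topk/max dedup idiom as in the rules above, this time on the node side. A rough sketch (the left-hand metric is illustrative; note it picks one of the duplicate label sets arbitrarily, so it avoids the error rather than truly fixing it):

    kube_node_status_allocatable{resource="cpu"}
    * on (node) group_left(label_eks_amazonaws_com_capacity_type, label_beta_kubernetes_io_instance_type)
    topk by (node) (
      1, max by (node, label_eks_amazonaws_com_capacity_type, label_beta_kubernetes_io_instance_type) (kube_node_labels)
    )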


marianobilli commented Apr 28, 2022

I think the solution could be to use resource-based hostname naming for the EKS nodes; it can be configured on the launch template of the managed node groups:
https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-instance-naming.html

The problem for me is that the Terraform EKS module does not expose the hostname_type parameter: https://github.com/terraform-aws-modules/terraform-aws-eks/blob/dc8a6eecddc2c6957ba309a939ee46f39b946461/modules/eks-managed-node-group/main.tf#L45

So it will require a PR to them.
