Standardizing the Representation of Cluster Switch Network Topology #4962

Open
4 tasks
dmitsh opened this issue Nov 14, 2024 · 9 comments
Labels
sig/network Categorizes an issue or PR as relevant to SIG Network.

Comments

@dmitsh

dmitsh commented Nov 14, 2024

Enhancement Description

  • One-line enhancement description (can be used as a release note):
    Standardizing Cluster Network Topology Representation

  • Kubernetes Enhancement Proposal:
    KEP-4962: Standardizing the Representation of Cluster Switch Network Topology #4965

  • Discussion Link:

  • Primary contact (assignee):
    @dmitsh
    @sanjaychatterjee

  • Responsible SIGs:
    /sig network

  • Enhancement target (which target equals to which milestone):

    • Alpha release target (x.y): 1.33
    • Beta release target (x.y):
    • Stable release target (x.y):
  • Alpha

    • KEP (k/enhancements) update PR(s):
    • Code (k/k) update PR(s):
    • Docs (k/website) update PR(s):
@k8s-ci-robot
Contributor

@dmitsh: The label(s) sig/networking cannot be applied, because the repository doesn't have them.

In response to this:

Enhancement Description

  • One-line enhancement description (can be used as a release note):
    Standardizing Cluster Network Topology Representation

  • Kubernetes Enhancement Proposal:

  • Discussion Link:

  • Primary contact (assignee):
    @dmitsh
    @sanjaychatterjee

  • Responsible SIGs:
    /sig networking

  • Enhancement target (which target equals to which milestone):

  • Alpha release target (x.y): 0.1

  • Beta release target (x.y):

  • Stable release target (x.y):

  • Alpha

  • KEP (k/enhancements) update PR(s):

  • Code (k/k) update PR(s):

  • Docs (k/website) update PR(s):

Please keep this description up to date. This will help the Enhancement Team to track the evolution of the enhancement efficiently.

Summary

This document proposes a standard for declaring cluster network topology in Kubernetes, representing the hierarchy of nodes, switches, and interconnects. In this context, a switch can refer to a physical network device or to a collection of such devices that are in close proximity and serve the same function.

Motivation

Understanding the cluster network topology is essential for optimizing the placement of workloads that require intensive inter-node communication. Currently, there is no standardized way to represent this information in Kubernetes, making it challenging to develop control plane components and applications that can leverage topology awareness.

This information might be useful for various components and features, including:

  • Pod affinity sections in deployment and pod specs
  • Kueue network-aware scheduling
  • Future development of native scheduler plugins for topology-aware scheduling, for example:
    • Topology-aware gang scheduling plugin
    • DRA scheduler plugin

Cluster Topology Sources

Cluster topology information can be derived from various sources:

  • Provided directly by a Cloud Service Provider (CSP)
  • Extracted from a CSP using specialized tools like "topograph"
  • Manually set up by cluster administrators
  • A combination of the above methods to ensure comprehensive coverage

Proposal

We propose new node label and annotation types to capture network topology information:

  • Network Topology Label
  • Network QoS Annotation

Network Topology Label

Format: network.topology.kubernetes.io/<nw-switch-type>: <switch-name>

  • <nw-switch-type>: Logical type of the network switch (can be one of the reserved names or a custom name)
    • Reserved names: accelerator, block, datacenter, zone
  • <switch-name>: Unique identifier for the switch
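
As a rough illustration of the label format above, the following Python sketch builds a label key/value pair and extracts topology labels from a node's label map. The helper names `topology_label` and `parse_topology_labels` are hypothetical and not part of the proposal; only the `network.topology.kubernetes.io/` prefix and the reserved names come from the text.

```python
# Hypothetical helpers for illustration only; not part of the proposal.
LABEL_PREFIX = "network.topology.kubernetes.io/"
RESERVED_TYPES = ("accelerator", "block", "datacenter", "zone")


def topology_label(switch_type: str, switch_name: str) -> tuple[str, str]:
    """Return the (key, value) pair for a Network Topology Label."""
    if not switch_type or "/" in switch_type:
        raise ValueError(f"invalid switch type: {switch_type!r}")
    return LABEL_PREFIX + switch_type, switch_name


def parse_topology_labels(labels: dict[str, str]) -> dict[str, str]:
    """Extract {switch_type: switch_name} from a node's labels.

    Labels that do not carry the topology prefix are ignored.
    """
    return {
        key[len(LABEL_PREFIX):]: value
        for key, value in labels.items()
        if key.startswith(LABEL_PREFIX)
    }
```

A consumer could then check `switch_type in RESERVED_TYPES` to tell reserved network types from custom ones.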

Network QoS Annotation

Format: network.qos.kubernetes.io/switches: <QoS>

  • <QoS>: A JSON object in which each key is a switch name (matching the value of a Network Topology Label) and each value is an object containing:
    • distance: Numerical value representing the distance in hops from the node to the switch, required
    • latency: Link latency (e.g., 200 ms), optional
    • bandwidth: Link bandwidth (e.g., 100 Gbps), optional

This structure can be easily extended with additional network QoS metrics in the future.
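
A minimal sketch of how a consumer might decode this annotation, assuming the value is a plain JSON string stored under the annotation key. The `SwitchQoS` dataclass and `parse_qos` function are hypothetical names introduced here for illustration. Unknown metric keys are ignored, which is one possible way to tolerate metrics added later.

```python
import json
from dataclasses import dataclass
from typing import Optional

QOS_ANNOTATION = "network.qos.kubernetes.io/switches"


@dataclass
class SwitchQoS:
    """Hypothetical container for the per-switch metrics named in the proposal."""
    distance: Optional[int] = None   # hops from the node to the switch
    latency: Optional[str] = None    # kept as the raw string, e.g. "2us"
    bandwidth: Optional[str] = None  # kept as the raw string, e.g. "100Gbps"


def parse_qos(annotations: dict[str, str]) -> dict[str, SwitchQoS]:
    """Decode the QoS annotation into {switch_name: SwitchQoS}."""
    raw = annotations.get(QOS_ANNOTATION)
    if raw is None:
        return {}
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("QoS annotation must be a JSON object")
    result = {}
    for switch, metrics in data.items():
        if not isinstance(metrics, dict):
            raise ValueError(f"metrics for {switch!r} must be a JSON object")
        # Only the three metrics named in the proposal are read; others are skipped.
        result[switch] = SwitchQoS(
            distance=metrics.get("distance"),
            latency=metrics.get("latency"),
            bandwidth=metrics.get("bandwidth"),
        )
    return result
```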

Reserved Network Types

We have introduced reserved network types to better accommodate common network hierarchies. These reserved network types include the following predefined names and characteristics:

  1. accelerator: Network interconnect for direct accelerator communication (e.g., Multi-node NVLink interconnect between NVIDIA GPUs)
  2. block: Rack-level switches connecting hosts in one or more racks as a block.
  3. datacenter: Spine-level switches connecting multiple blocks inside a datacenter.
  4. zone: Zonal switches connecting multiple datacenters inside an availability zone.

When using reserved network types, Network QoS Annotations become optional. In the absence of these annotations, it is assumed that performance within each network layer is uniform.

The scheduler will prioritize switches in the order listed above, from accelerator (closest to the node) to zone (farthest), providing a standardized approach to network-aware scheduling across a range of configurations.

If provided, Network QoS Annotations can be used to refine and enhance the details of link performance, enabling more precise scheduling decisions.
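
To illustrate how the reserved ordering could be consumed, the sketch below finds the closest reserved network layer that two nodes share by comparing their labels in the order accelerator, block, datacenter, zone. The function name `closest_shared_level` is hypothetical, and this is only one possible interpretation of "prioritize"; the proposal does not specify a scheduler algorithm.

```python
from typing import Optional

LABEL_PREFIX = "network.topology.kubernetes.io/"
# Order taken from the reserved network types list: closest layer first.
RESERVED_ORDER = ("accelerator", "block", "datacenter", "zone")


def closest_shared_level(
    labels_a: dict[str, str], labels_b: dict[str, str]
) -> Optional[str]:
    """Return the first reserved type for which both nodes name the same switch.

    Returns None when the nodes share no reserved-type switch.
    """
    for level in RESERVED_ORDER:
        key = LABEL_PREFIX + level
        switch = labels_a.get(key)
        if switch is not None and switch == labels_b.get(key):
            return level
    return None
```

Under this reading, two nodes that share a block but sit in different accelerator domains would yield "block", and a scheduler could prefer node pairs whose shared level appears earlier in the order.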

Example of Network Topology Labels with reserved network types:

network.topology.kubernetes.io/accelerator: nvl72-a
network.topology.kubernetes.io/block: block-b
network.topology.kubernetes.io/datacenter: dc-c
network.topology.kubernetes.io/zone: zone-d

Example of a Network QoS Annotation that complements the example above:

network.qos.kubernetes.io/switches: {
  "nvl72-a": {
     "latency": "2us",
     "bandwidth": "100Gbps"
  },
  "block-b": {
     "latency": "50us",
     "bandwidth": "40Gbps"
  },
  "dc-c": {
     "latency": "500us",
     "bandwidth": "20Gbps"
  },
  "zone-d": {
     "latency": "1ms",
     "bandwidth": "10Gbps"
  }
}
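
The proposal does not define a grammar for latency and bandwidth strings. The sketch below assumes a number followed by a unit suffix, as in the examples ("2us", "100Gbps"), and converts it to seconds or bits per second so that links can be compared. The unit tables and the function names `latency_seconds` and `bandwidth_bps` are assumptions made for illustration.

```python
import re

# Assumed unit suffixes; the proposal only shows examples such as "2us" and "100Gbps".
_LATENCY_UNITS = {"ns": 1e-9, "us": 1e-6, "ms": 1e-3, "s": 1.0}
_BANDWIDTH_UNITS = {"bps": 1.0, "Kbps": 1e3, "Mbps": 1e6, "Gbps": 1e9, "Tbps": 1e12}
_QUANTITY = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*([A-Za-z]+)\s*$")


def _parse_quantity(text: str, units: dict[str, float]) -> float:
    """Split '<number><unit>' and scale the number by the unit's factor."""
    match = _QUANTITY.match(text)
    if match is None or match.group(2) not in units:
        raise ValueError(f"unrecognized quantity: {text!r}")
    return float(match.group(1)) * units[match.group(2)]


def latency_seconds(text: str) -> float:
    """Convert a latency string such as '50us' to seconds."""
    return _parse_quantity(text, _LATENCY_UNITS)


def bandwidth_bps(text: str) -> float:
    """Convert a bandwidth string such as '40Gbps' to bits per second."""
    return _parse_quantity(text, _BANDWIDTH_UNITS)
```

With values in common units, a consumer could, for example, rank the switches in the annotation above by ascending latency or descending bandwidth.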

Extensibility and Future-Proofing

This proposal is designed with extensibility in mind, enabling the use of custom network types. This ensures that the standard can adapt to future advancements in cluster networking without requiring significant overhauls.

For custom network types, Network QoS Annotations are required, with distance being the minimum mandatory metric. Specifying latency and bandwidth is optional, but including them can offer a more detailed view of link performance, enabling more efficient scheduling decisions.

Example of network topology with custom network types:

Node Labels:

network.topology.kubernetes.io/area: sw-a
network.topology.kubernetes.io/sector: block-b
network.topology.kubernetes.io/center: center-c

Node Annotations:

network.qos.kubernetes.io/switches: {
  "sw-a": {
     "distance": 1,
     "latency": "100ns",
     "bandwidth": "40Gbps"
  },
  "block-b": {
     "distance": 2,
     "latency": "500ns",
     "bandwidth": "20Gbps"
  },
  "center-c": {
     "distance": 3,
     "latency": "1ms",
     "bandwidth": "10Gbps"
  }
}
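
Because custom network types have no predefined order, a consumer would presumably derive the hierarchy from the distance metric. The following sketch joins a node's custom topology labels with its QoS annotation, rejects custom types that lack a distance, and returns the levels sorted from nearest to farthest. The function name `custom_hierarchy` is hypothetical, and treating distance as an integer hop count is an assumption based on the wording "distance in hops".

```python
import json

LABEL_PREFIX = "network.topology.kubernetes.io/"
QOS_ANNOTATION = "network.qos.kubernetes.io/switches"
RESERVED_TYPES = {"accelerator", "block", "datacenter", "zone"}


def custom_hierarchy(
    labels: dict[str, str], annotations: dict[str, str]
) -> list[tuple[str, str, int]]:
    """Return (switch_type, switch_name, distance) for custom types, nearest first."""
    qos = json.loads(annotations.get(QOS_ANNOTATION, "{}"))
    levels = []
    for key, switch in labels.items():
        if not key.startswith(LABEL_PREFIX):
            continue
        switch_type = key[len(LABEL_PREFIX):]
        if switch_type in RESERVED_TYPES:
            continue  # reserved types do not require QoS annotations
        distance = qos.get(switch, {}).get("distance")
        # bool is a subclass of int in Python, so exclude it explicitly.
        if not isinstance(distance, int) or isinstance(distance, bool):
            raise ValueError(
                f"custom network type {switch_type!r} ({switch!r}) needs an integer distance"
            )
        levels.append((switch_type, switch, distance))
    return sorted(levels, key=lambda level: level[2])
```

Applied to the labels and annotation in the example above, this would order the levels as area, sector, center.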

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Nov 14, 2024
@adrianmoisey
Member

/sig network

@k8s-ci-robot k8s-ci-robot added sig/network Categorizes an issue or PR as relevant to SIG Network. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Nov 14, 2024
@dmitsh dmitsh changed the title Standardizing Cluster Network Topology Representation KEP-4962: Standardizing Cluster Network Topology Representation Nov 14, 2024
@dmitsh
Author

dmitsh commented Nov 14, 2024

/cc @aojea

@alaypatel07
Contributor

/cc

@ritazh
Member

ritazh commented Nov 15, 2024

/cc

@kannon92
Contributor

Thank you for creating the issue. Please follow our KEP template.

The issue is a placeholder for tracking and your issue should follow the KEP template.

We would need to give feedback on the PR rather than as part of a GitHub issue.

@kannon92
Contributor

Please remove text starting from Summary, create a PR with https://github.com/kubernetes/enhancements/blob/master/keps/NNNN-kep-template/README.md filled out with all your areas.

@dmitsh
Author

dmitsh commented Nov 15, 2024

Please remove text starting from Summary, create a PR with https://github.com/kubernetes/enhancements/blob/master/keps/NNNN-kep-template/README.md filled out with all your areas.

Thanks, created the PR

@BenTheElder
Member

/title Standardizing Cluster Network Topology Representation

"KEP" in the title of the issue is redundant and the issue number is already the identifier for this page.

@BenTheElder BenTheElder changed the title KEP-4962: Standardizing Cluster Network Topology Representation Standardizing Cluster Network Topology Representation Nov 18, 2024
@dmitsh dmitsh changed the title Standardizing Cluster Network Topology Representation Standardizing the Representation of Cluster Switch Network Topology Nov 20, 2024