From 4e1505819d8e6d82c090abbd2e4825551cd567ad Mon Sep 17 00:00:00 2001
From: rszarecki <46606165+rszarecki@users.noreply.github.com>
Date: Wed, 6 Dec 2023 19:56:42 -0800
Subject: [PATCH] Pipeline-cntr-guide (#961)

* Create Integrated-Circuit_pipeline_ggregated_counters_guide.md

---------

Co-authored-by: Darren Loher
Co-authored-by: Roland Phung
---
 ...cuit_pipeline_aggregated_counters_guide.md | 72 ++++++++++++++++++
 ...openconfig-platform-pipeline-counters.yang | 74 +++++++++++++++++--
 2 files changed, 141 insertions(+), 5 deletions(-)
 create mode 100644 doc/Integrated-Circuit_pipeline_aggregated_counters_guide.md

diff --git a/doc/Integrated-Circuit_pipeline_aggregated_counters_guide.md b/doc/Integrated-Circuit_pipeline_aggregated_counters_guide.md
new file mode 100644
index 000000000..3dfc5087a
--- /dev/null
+++ b/doc/Integrated-Circuit_pipeline_aggregated_counters_guide.md
@@ -0,0 +1,72 @@
+# Integrated Circuit aggregated pipeline counters guide
+## Introduction
+This guide discusses the semantics of the different counters provided under the
+`openconfig-platform/components/component/integrated-circuit/pipeline-counters` container.
+The `INTEGRATED_CIRCUIT`, or I-C, in this document refers to the OpenConfig [INTEGRATED_CIRCUIT](https://github.com/openconfig/public/blob/5d38d8531ef9c5b998262207eb6dbdae8968f9fe/release/models/platform/openconfig-platform-types.yang#L346) component type, which is typically an ASIC or NPU (or a combination of both) that provides packet processing capabilities.
+
+## Per-block packets/octets counters
+[TODO] more detailed description
+## Drop packets/octets counters
+The `/components/component/integrated-circuit/pipeline-counters/drop` container collects counters related to packets dropped by the `INTEGRATED_CIRCUIT`.
+### Aggregated drop counters
+These 4 counters should cover all packets dropped by the I-C that are not already counted under the `/interfaces` tree. For example, a packet dropped by a WRED QoS policy should be counted only by the appropriate `/interfaces` path, [dropped-pkts](https://github.com/openconfig/public/blob/5d38d8531ef9c5b998262207eb6dbdae8968f9fe/release/models/qos/openconfig-qos-interfaces.yang#L375).
+
+The aggregated drop counters are modeled as follows:
+```
+module: openconfig-platform
+  +--rw components
+     +--rw component* [name]
+        +--rw integrated-circuit
+           +--ro oc-ppc:pipeline-counters
+              +--ro oc-ppc:drop
+                 +--ro oc-ppc:state
+                    +--ro oc-ppc:adverse-aggregate?            oc-yang:counter64
+                    +--ro oc-ppc:congestion-aggregate?         oc-yang:counter64
+                    +--ro oc-ppc:packet-processing-aggregate?  oc-yang:counter64
+                    +--ro oc-ppc:urpf-aggregate?               oc-yang:counter64
+```
+#### urpf-aggregate
+
+##### Usability
+Increments of this counter typically signal some form of attack using spoofed source addresses, usually of the DDoS class.
+
+#### packet-processing-aggregate
+
+##### Usability
+Increments of this counter are expected during convergence events as well as during stable operation. However, a rapid increase in the drop rate **may** signal that the network is unhealthy and typically requires further investigation.
+A further breakdown of this counter, if available as a vendor extension under the `/openconfig-platform:components/component/integrated-circuit/openconfig-platform-pipeline-counters:pipeline-counters/drop/vendor` container, can help narrow down the cause of drops.
+
+If prolonged packet drops are found to be caused by a missing FIB entry for incoming packets, this suggests an inconsistency between the network control plane protocols (BGP, IGP, RSVP, gRIBI), the FIB calculated by the controller card, and the FIB programmed into the given I-C.
+
+If the implementation supports the `urpf-aggregate` counter, packets discarded due to uRPF should not be counted in `packet-processing-aggregate`. Otherwise, packets discarded by uRPF should be counted against this counter.
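The attribution rules in this guide amount to a lookup from drop cause to counter. The Python sketch below is illustrative only: `DropReason`, `counters_for`, the `supports_urpf_aggregate` flag and the QoS label string are hypothetical names, not part of any OpenConfig model or vendor API. It follows this guide's uRPF rule (only `urpf-aggregate` when the implementation supports it); the `urpf-aggregate` YANG description in this patch instead asks for both `urpf-aggregate` and `packet-processing-aggregate` to be incremented, so an implementation following the model text would return both for that case.

```python
from enum import Enum, auto

# Hypothetical sketch: counter names come from this guide; the QoS entry is a
# label for the /interfaces dropped-pkts counter, not an exact OpenConfig path.
ADVERSE = "adverse-aggregate"
CONGESTION = "congestion-aggregate"
PACKET_PROCESSING = "packet-processing-aggregate"
URPF = "urpf-aggregate"
IN_ERRORS = "/interfaces/interface/state/counters/in-errors"
QOS_DROPPED = "interfaces: QoS queue dropped-pkts"


class DropReason(Enum):
    """Illustrative drop causes an I-C implementation might distinguish."""
    NO_FIB_MATCH = auto()              # destination matches no FIB entry
    DISCARD_NEXT_HOP = auto()          # FIB entry points to a discard next-hop
    ACL_DENY = auto()                  # ACL/packet filter decision
    TTL_EXPIRED = auto()               # TTL = 1
    MTU_NO_FRAGMENT = auto()           # exceeds egress MTU, cannot be fragmented
    URPF_FAIL = auto()                 # failed uRPF check
    PPS_OVERLOAD = auto()              # packet rate above what the I-C can process
    INTERNAL_BUS_CONGESTION = auto()   # e.g. external buffer memory I/O saturated
    PROCESSING_BUDGET = auto()         # ACL/filter needs more cycles than budgeted
    CORRUPTED_STATE = auto()           # corrupted programming state or descriptors
    INGRESS_CORRUPTED_PACKET = auto()  # corrupted packet received on an interface
    QUEUE_TAIL_OR_AQM = auto()         # egress QoS queue tail drop or AQM (e.g. WRED)


# Legitimate forwarding decisions taken on parsable packets.
_PACKET_PROCESSING = {
    DropReason.NO_FIB_MATCH,
    DropReason.DISCARD_NEXT_HOP,
    DropReason.ACL_DENY,
    DropReason.TTL_EXPIRED,
    DropReason.MTU_NO_FRAGMENT,
}
# I-C performance limits exceeded while processing valid packets.
_CONGESTION = {
    DropReason.PPS_OVERLOAD,
    DropReason.INTERNAL_BUS_CONGESTION,
    DropReason.PROCESSING_BUDGET,
}


def counters_for(reason: DropReason, supports_urpf_aggregate: bool = True) -> set:
    """Return the counters a single drop with the given cause should increment."""
    if reason is DropReason.QUEUE_TAIL_OR_AQM:
        # Egress interface congestion: /interfaces only, never congestion-aggregate.
        return {QOS_DROPPED}
    if reason is DropReason.INGRESS_CORRUPTED_PACKET:
        # Points at the cable/transceiver, not the I-C: in-errors only.
        return {IN_ERRORS}
    if reason is DropReason.CORRUPTED_STATE:
        return {ADVERSE}
    if reason in _CONGESTION:
        return {CONGESTION}
    if reason is DropReason.URPF_FAIL:
        # This guide's rule: urpf-aggregate when supported, otherwise
        # packet-processing-aggregate (never both).
        return {URPF} if supports_urpf_aggregate else {PACKET_PROCESSING}
    if reason in _PACKET_PROCESSING:
        return {PACKET_PROCESSING}
    raise ValueError(f"unclassified drop reason: {reason}")
```

For example, `counters_for(DropReason.QUEUE_TAIL_OR_AQM)` returns only the `/interfaces` QoS label, which keeps `congestion-aggregate` a clean signal of the I-C hitting its own performance limits.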
+
+#### congestion-aggregate
+
+
+##### Usability
+Increments of this counter signal that the given I-C is overwhelmed by the volume of incoming traffic and the complexity of the packet processing it requires.
+
+#### adverse-aggregate
+##### Usability
+Increments of this counter generally signal a hardware defect (e.g. memory errors or signal integrity issues) or a (micro)code software defect.
+
+#### Queue tail and AQM drops exception discussion
+Drops associated with QoS queue tail drops or AQM are the result of egress interface congestion. This is NOT the same as I-C congestion, and such drops should be counted using `/interfaces` counters, because they are an expected state from the platform (router) point of view. They may not be an expected state from a network design point of view, but the INTEGRATED_CIRCUIT is behaving according to its design.
+
+The OpenConfig definition for [congestion-aggregate](https://github.com/openconfig/public/blob/5d38d8531ef9c5b998262207eb6dbdae8968f9fe/release/models/platform/openconfig-platform-pipeline-counters.yang#L1096-L1099) excludes "queue drop counters". It is desirable not to count QoS queue drops under `congestion-aggregate`, in order to keep a clear signal of the I-C hitting its performance limits rather than blending it with basic egress interface speed limitations.
+
+### Per-Block drop counters
+[TODO] more detailed description of the standard OpenConfig drop counters defined for the Interface, Lookup, Queueing, Fabric and Host-Interface blocks. Also discuss the relationship with the control plane traffic packets/octets counters.
+### Vendor extensions
+Please refer to [Vendor-Specific Augmentation for Pipeline Counter](vendor_counter_guide.md)
+## Error counters
+These leaves **do not** count **packets or bytes**.
+They count error events.
+
+For example, corruption (soft errors) of on-chip, HBM or off-chip memory buffers that is not already counted as queue drops for interfaces.
+
+[TODO] more detailed description
+## Control plane traffic packets/octets counters
+[TODO] more detailed description. Also discuss the relationship with Host-Interface block counters.
+### Standard OpenConfig counters
+### Vendor extensions
diff --git a/release/models/platform/openconfig-platform-pipeline-counters.yang b/release/models/platform/openconfig-platform-pipeline-counters.yang
index b7d81c962..9a28d28c2 100644
--- a/release/models/platform/openconfig-platform-pipeline-counters.yang
+++ b/release/models/platform/openconfig-platform-pipeline-counters.yang
@@ -65,10 +65,16 @@ module openconfig-platform-pipeline-counters {
     5 blocks, is to have the abililty to receive all drop counters from all
     5 blocks, for example, with one request.";
 
-  oc-ext:openconfig-version "0.5.0";
+  oc-ext:openconfig-version "0.5.1";
   oc-ext:catalog-organization "openconfig";
   oc-ext:origin "openconfig";
 
+  revision "2023-10-08" {
+    description
+      "More detailed description of pipeline aggregated drop counters";
+    reference "0.5.1";
+  }
+
   revision "2023-09-26" {
     description
       "Add no-route aggregate drop counter.";
@@ -1093,7 +1099,18 @@ module openconfig-platform-pipeline-counters {
         "This captures the aggregation of all counters where the switch is
         unexpectedly dropping packets. Occurrence of these drops on a stable
         (no recent hardware or config changes) and otherwise healthy
-        switch needs further investigation.";
+        switch needs further investigation.
+        This leaf counts packets discarded as a result of corrupted
+        programming state in an INTEGRATED_CIRCUIT or of corrupted
+        packet descriptor data structures.
+
+        Note: corrupted packets received on ingress interfaces should be counted
+        in `/interfaces/interface/state/counters/in-errors` and NOT counted as
+        adverse-aggregate. This is because incoming corrupted packets are NOT
+        a signal of an adverse state of the INTEGRATED_CIRCUIT but rather of an
+        entity adjacent to the interface (such as a cable or transceiver). Therefore
+        such drops SHOULD NOT be counted as adverse-aggregate, to preserve
+        a clean signal of INTEGRATED_CIRCUIT adverse state.";
     }
 
     leaf congestion-aggregate {
@@ -1102,7 +1119,31 @@ module openconfig-platform-pipeline-counters {
         "This tracks the aggregation of all counters where the expected
         conditions of packet drops due to internal congestion in some block of
         the hardware that may not be visible in through other congestion
-        indicators like interface discards or queue drop counters.";
+        indicators like interface discards or queue drop counters.
+
+        This leaf counts packets discarded as a result of exceeding the
+        performance limits of an INTEGRATED_CIRCUIT while it processes
+        non-corrupted packets using legitimate, non-corrupted programming
+        state of the INTEGRATED_CIRCUIT.
+
+        A typical example is overloading a given IC with a higher packet rate (pps)
+        than the chip can handle. For example, assume chip X can receive
+        3.6Tbps of incoming traffic but can process only 2000 Mpps. If the average incoming
+        packet size is 150B, at full ingress rate this becomes 3000 Mpps. Hence
+        1/3 of the packets would be dropped and should be counted against
+        congestion-aggregate.
+
+        Another example is the case where an INTEGRATED_CIRCUIT internal data bus is
+        too narrow or slow to handle the traffic. For example, assume chip X needs to send
+        3Tbps of traffic to an external buffer memory that has only 2Tbps of access I/O. In
+        this case, packets would be discarded because of congestion on the memory I/O bus,
+        which is part of the INTEGRATED_CIRCUIT. Depending on the design of the
+        INTEGRATED_CIRCUIT, packets could be discarded even if interface queues are
+        not full; hence this scenario is NOT treated as QoS queue tail drops or WRED drops.
+
+        Yet another example is the case where an extremely large and long
+        ACL/filter requires more cycles to process than the INTEGRATED_CIRCUIT
+        has budgeted.
"; } leaf packet-processing-aggregate { @@ -1110,7 +1151,25 @@ module openconfig-platform-pipeline-counters { description "This aggregation of counters represents the conditions in which packets are dropped due to legitimate forwarding decisions (ACL drops, - No Route etc.)"; + No Route etc.) + This counter counts packet discarded as result of processing + non-corrupted packet against legitimate, non-corrupted state + of INTEGRATED_CIRCUIT program (FIB content, ACL content, rate-limiting token-buckets) + which mandate packet drop. The examples of this class of discard are: + - dropping packets which destination address to no match any FIB entry + - dropping packets which destination address matches FIB entry pointing + to discard next-hop (e.g. route to null0) + - dropping packts due to ACL/packet filter decission + - dropping packets due to its TTL = 1 + - dropping packets due to its size exceeds egress interface MTU and + packet can't be fragmented (IPv6 or do not fragment bit is set) + - dropping packets due to uRPF rules (note: packet is counted here and + in separate, urpf-aggregate counter simultaneously) + - etc + + Note:The INTEGRATED_CIRCUIT is doing exactly what it is programmed + to do, and the packet is parsable. + "; } leaf urpf-aggregate { @@ -1119,7 +1178,12 @@ module openconfig-platform-pipeline-counters { "This aggregation of counters represents the conditions in which packets are dropped due to failing uRPF lookup check. This counter and the packet-processing-aggregate counter should be incremented - for each uRPF packet drop."; + for each uRPF packet drop. + This counter counts packet discarded as result of Unicast Reverse + Path Forwarding verification."; + reference + "RFC2827: Network Ingress Filtering: Defeating Denial of Service Attacks which employ IP Source Address Spoofing + RFC3704: Ingress Filtering for Multihomed Networks"; } leaf no-route {