Skip to content

Commit

Permalink
Add KIC failure mode page (#7793)
Browse files Browse the repository at this point in the history
* add kic failure mode page

* fix vale and broken link

* Apply copyedits and add errors table

Signed-off-by: Diana <[email protected]>

* Apply suggestions from code review

Co-authored-by: lena-larionova <[email protected]>

* fix anchor links

---------

Signed-off-by: Diana <[email protected]>
Co-authored-by: Diana <[email protected]>
Co-authored-by: lena-larionova <[email protected]>
  • Loading branch information
3 people authored Sep 26, 2024
1 parent e1b19ac commit a19feae
Show file tree
Hide file tree
Showing 2 changed files with 109 additions and 0 deletions.
2 changes: 2 additions & 0 deletions app/_data/docs_nav_kic_3.3.x.yml
Original file line number Diff line number Diff line change
Expand Up @@ -219,3 +219,5 @@ items:
url: /reference/ingress-to-gateway-migration
- text: Required Permissions for Installation
url: /reference/required-permissions
- text: Categories of Failures
url: /reference/failure-modes
107 changes: 107 additions & 0 deletions app/_src/kubernetes-ingress-controller/reference/failure-modes.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,107 @@
---
title: Failure Modes and Processing in KIC
type: reference
---

This reference describes the different types of {{site.kic_product_name}} failure modes and how it processes them.

When you run {{site.kic_product_name}}, you can encounter the following failures:

| Error Example | Failure Mode |
|-----------------------------------|-----------------------------------------|
| `Reconciler error` in logs | [Errors in reconciling Kubernetes resources](#errors-in-reconciling-kubernetes-resources) |
| Non-existent service referenced by an `Ingress`. <br>Example: `Ingress` with a non-existent backend service | [Failures in translating configuration](#failures-in-translating-configuration) |
| {{site.base_gateway}} rejected configuration <br> Example: `Ingress` with invalid regex in the path | [Failures in applying configuration to {{site.base_gateway}}](#failures-in-applying-configuration-to-kong-gateway) |
| Errors when sending configuration to {{site.konnect_product_name}} <br> Example: Failed request logs | [Failures in uploading configuration to {{site.konnect_short_name}}](#failures-in-uploading-configuration-to-konnect) |

{{site.kic_product_name}} uses different methods to process each failure type, and creates error logs or other evidence, like Prometheus metrics and Kubernetes events, so you can observe and track the failures.


## Errors in reconciling Kubernetes resources

When the controllers reconciling a specific kind of Kubernetes resource run into errors in reconciling the resource, a `Reconciler error` log line is recorded and the resource is re-queued for another round of reconciliation.

Thee Prometheus metric `controller_runtime_reconcile_errors_total` stores the total number of reconcile errors per controller from the start of {{site.kic_product_name}}. Search for the `Reconciler error` keyword in the {{site.kic_product_name}} container logs to see detailed errors.

## Failures in translating configuration

When {{site.kic_product_name}} finds Kubernetes resources that can't be correctly translated to {{site.base_gateway}} configuration (for example, an `Ingress` is using a non-existent `Service` as its backend), a translation failure is generated with the namespace and name of the objects causing the failure.

The Kubernetes objects causing translation failures are not translated to {{site.base_gateway}} configuration in the translation process.
You can use Kubernetes events and Prometheus metrics to observe the translation failures.
If {{site.kic_product_name}} is integrated with {{site.konnect_product_name}}, it will report that a translation error happened in the uploading node status.

{{site.kic_product_name}} collects all translation failures and generates a Kubernetes `Event` with the `Warning` type and the `KongConfigurationTranslationFailed` reason for each causing object in a translation failure. Prometheus metrics could also reflect the statistics of translation failures:
* `ingress_controller_translation_broken_resource_count` is the number of translation failures that happened in the latest translation
* `ingress_controller_translation_count` with the `success=false` label is the total number of translation procedures where translation failures happened

You can use `kubectl get events -n <namespace> --field-selector reason="KongConfigurationTranslationFailed"` to fetch events generated for translation failures. For example, if an `Ingress` named `ing-1` in the namespace `test` used a non-existent `Service` as its backend, you could get the event with the following command:

```bash
kubectl get events -n test --field-selector reason="KongConfigurationTranslationFailed"

LAST SEEN TYPE REASON OBJECT MESSAGE
18m Warning KongConfigurationTranslationFailed ingress/ing-1 failed to resolve Kubernetes Service for backend: failed to fetch Service test/httpbin-deployment-1: Service test/httpbin-deployment-1 not found
```

## Failures in applying configuration to {{site.base_gateway}}

When {{site.kic_product_name}} fails to apply translated {{site.base_gateway}} configuration to {{site.base_gateway}}, {{site.kic_product_name}} will try to recover from the failure and record the failure into logs, Kubernetes events, and Prometheus metrics.
Recovery usually fails because the translated configuration is rejected by {{site.base_gateway}}.

If {{site.kic_product_name}} fails to apply the translated configuration, it then tries to apply the last successful {{site.base_gateway}} configuration to new instances of {{site.base_gateway}} to attempt a best effort at making them available.
If the `FallbackConfiguration` feature gate is enabled, {{site.kic_product_name}} discovers the Kubernetes objects that caused the invalid configuration, and tries to build a fallback configuration from valid objects and parts of the last valid configuration that are built from the broken objects. See [fallback configuration][fallback configuration] for more information.

### Debugging configuration failures

You can observe failures in applying configuration from Kubernetes events and Prometheus metrics:
* {{site.kic_product_name}} generates an event with the `Warning` type and the `KongConfigurationApplyFailed` reason attached to the pod itself when it fails to apply the configuration.
* For each object that causes the invalid configuration, {{site.kic_product_name}} generates a `Warning` event type and the `KongConfigurationApplyFailed` reason attached to the object.
* The Prometheus metric `ingress_controller_configuration_push_count` with the `success=false` label shows the total number of failures from applying the configuration by reason and URL of {{site.base_gateway}} admin API.
* The Prometheus metric `ingress_controller_configuration_push_broken_resource_count` reflects the number of Kubernetes resources that caused the error in the last configuration push.

If the CLI flag `--dump-config` is enabled, then the `/debug/config/raw-error` endpoint is enabled on the debug server port of {{site.kic_product_name}}, which will fetch the raw error returned from {{site.base_gateway}} if applying the configuration fails.

For example, let's say you create an `Ingress` with the `ImplementationSpecific` path type and an invalid regex in `Path` (which can only be only be done when the validating webhook is disabled, otherwise it will be rejected by the webhook):

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
annotations:
konghq.com/strip-path: "true"
name: ingress-invalid-regex
namespace: default
spec:
ingressClassName: kong
rules:
- http:
paths:
- backend:
service:
name: httpbin-deployment
port:
number: 80
path: /~^^/a$
pathType: ImplementationSpecific
```
You can get the Kubernetes events:
```bash
kubectl get events --all-namespaces --field-selector reason=KongConfigurationApplyFailed
NAMESPACE LAST SEEN TYPE REASON OBJECT MESSAGE
default 2m9s Warning KongConfigurationApplyFailed ingress/ingress-invalid-regex invalid paths.1: should start with: / (fixed path) or ~/ (regex path)
kong 15s Warning KongConfigurationApplyFailed pod/kong-controller-779cb796f4-7q7c2 failed to apply Kong configuration to https://10.244.1.43:8444: HTTP status 400 (message: "failed posting new config to /config")
```
Both the events attached to the invalid ingress and attached to the {{site.kic_product_name}} pod are recorded.
## Failures in uploading configuration to {{site.konnect_short_name}}
When {{site.kic_product_name}} is integrated with {{site.konnect_short_name}} and it fails to send configuration to {{site.konnect_short_name}}, it generates error logs for failed requests, records the failures to Prometheus metrics, and updates the node status of itself in {{site.konnect_short_name}}:
* {{site.kic_product_name}} parses errors returned from {{site.konnect_short_name}} when uploading the configuration fails. It logs a line at the error level for each {{site.base_gateway}} entity that failed to create/update/delete, with the message `Failed to send request to Konnect`.
* The Prometheus metrics `ingress_controller_configuration_push_count` and `ingress_controller_configuration_push_duration_milliseconds_bucket` can also reflect configuration upload failures to {{site.konnect_short_name}}, where the `dataplane` label is the URL of {{site.konnect_short_name}} and `success=false` APIs.

[fallback configuration]: /kubernetes-ingress-controller/{{page.release}}/guides/high-availability/fallback-config

0 comments on commit a19feae

Please sign in to comment.