Skip to content

Commit

Permalink
review comments, to be squashed
Browse files Browse the repository at this point in the history
  • Loading branch information
machine424 committed Nov 28, 2024
1 parent bfb53ec commit 309d93a
Showing 1 changed file with 61 additions and 17 deletions.
78 changes: 61 additions & 17 deletions enhancements/monitoring/early-monitoring-config-validation.md
Original file line number Diff line number Diff line change
Expand Up @@ -4,16 +4,15 @@ authors:
- "@machine424"
reviewers:
- "@openshift/openshift-team-monitoring"
- TBD
- "@vrutkovs"
approvers:
- "@openshift/openshift-team-monitoring"
- TBD
- "@jan--f"
api-approvers:
- TBD
- "@JoelSpeed"
creation-date: 2024-11-13
last-updated: 2024-11-13
last-updated: 2024-11-28
tracking-link:
- TBD
- https://issues.redhat.com/browse/MON-4092
---

# Early Monitoring Config Validation
Expand All @@ -27,17 +26,17 @@ shorter feedback loops and enhance user experience.

## Motivation

CMO currently uses ConfigMaps to store configurations for the Platform and User Workload monitoring stacks. Due to the limitations of this approach, a migration to CRD based configs is planned ([OBSDA-212](https://issues.redhat.com/browse/OBSDA-212)). In the interim, enhancing the validation process for these ConfigMaps would be highly beneficial.
CMO currently uses ConfigMaps to store configurations for the Platform and User Workload monitoring stacks. Due to the limitations of this approach, a migration to CRD-based configurations is planned ([OBSDA-212](https://issues.redhat.com/browse/OBSDA-212)). In the interim, enhancing the validation process for these ConfigMaps would be highly beneficial. This is because the required changes are much smaller in scope than the CRD migration, primarily involving changes on the CMO side. For more details, see [the implementation](https://github.com/openshift/cluster-monitoring-operator/pull/2490).

Insights show that in `2024`, there were more than `650` unique CMO failures related to parsing issues that lasted over `1h`, with some going unnoticed for over `215` days. The total duration of all failures exceeded `10` years.
[Insights](https://docs.openshift.com/container-platform/4.17/support/remote_health_monitoring/about-remote-health-monitoring.html) show that in `2024`, there were more than `650` unique CMO failures related to parsing issues that lasted over `1h`, with some going unnoticed for over `215` days. The total duration of all failures exceeded `10` years.

### User Stories

As a user, if my configuration is invalid (malformed JSON/YAML, contains invalid, no longer supported, or duplicated fields), I do not want to have to check the operator’s status or logs or wait for an alert to be notified of the issue.

Such situations may lead me to suspect other issues incorrectly within the monitoring stack, causing me to solicit help from colleagues or support.

The existing signals take time to propagate and can easily be missed, resulting in a poor user experience. A shorter feedback loop, where invalid configurations are rejected when trying to push them into the Configmaps (as with CRDs), would be more user-friendly.
The existing signals (CMO becoming degraded and emitting logs. And all the other signals that result from that: alerts etc.) take time to propagate and can easily be missed, resulting in a poor user experience. A shorter feedback loop, where invalid configurations are rejected when trying to push them into the Configmaps (as with CRDs), would be more user-friendly.

### Goals

Expand All @@ -50,17 +49,19 @@ The existing signals take time to propagate and can easily be missed, resulting
- This addition does not intend to replace or render obsolete the existing `UserWorkloadInvalidConfiguration`/`InvalidConfiguration` related signals in the operator status/logs/alerts.
- This proposal does not intend to prevent or postpone the planned transition to CRDs for enhanced validation capabilities. Instead, it will prepare the way for it, provide a preview of what will happen with CRDs, educate users about it, and ease the migration.
- Some ConfigMap changes may bypass the CMO validation logic if the CMO operator is down for some reason; these changes will not be validated (best-effort approach).
- ConfigMaps with invalid monitoring configurations deployed before the webhook is enabled (before upgrading to the version that enables the validation webhook on CMO) will not be flagged or adjusted. The webhook will only intervene on them during subsequent changes, if any.
- ConfigMaps with invalid monitoring configurations deployed before the webhook is enabled (before upgrading to the version that enables the validation webhook on CMO) will not be flagged or adjusted. The webhook will only intervene on them during subsequent changes, if any. The approach presented in [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy) will help ensure that the configs in the two ConfigMaps are valid before the upgrade.

## Proposal

Implement and expose a validation webhook in CMO. This webhook will intercept `CREATE` and `UPDATE` actions on the platform and UW monitoring ConfigMaps. It will attempt to fetch the configuration within the ConfigMap, unmarshal/parse it, identify potential errors (such as malformed JSON/YAML, unknown field names, or duplicated fields), and reject the request if such issues are found.

For more detailed information, the implementation can be reviewed [here](https://github.com/openshift/cluster-monitoring-operator/pull/2490).

### Workflow Description

The webhook will be enabled by default.

The `ValidatingWebhookConfiguration` will ensure the webhook only intervenes on changes to the two ConfigMaps: `openshift-monitoring/cluster-monitoring-config` and `openshift-user-workload-monitoring/user-workload-monitoring-config`.
The `matchConditions` will ensure the webhook only intervenes on changes to the two ConfigMaps: `openshift-monitoring/cluster-monitoring-config` and `openshift-user-workload-monitoring/user-workload-monitoring-config`.

```yaml
matchConditions:
Expand Down Expand Up @@ -130,36 +131,76 @@ webhooks:
### Topology Considerations
The presence of the `monitoringconfigmaps.openshift.io` `ValidatingWebhookConfiguration` determines whether the early validation logic will be enabled.
This resource is managed by the Cluster Version Operator (CVO) and is intended to accompany the CMO Deployment whenever it is deployed.
The e2e and payload tests for the different profiles/topologies should help ensure that the addition of the early validation logic does not cause any issues.
Given that this is a nice-to-have feature, each profile/topology should decide whether they want to enforce it by ensuring the `ValidatingWebhookConfiguration` resource is present and is takedn into acount (or prevent its addition if deemed unnecessary, although we advise against that).
Note that this is not a first; the monitoring stack [already uses validation webhooks](https://github.com/openshift/cluster-monitoring-operator/tree/release-4.18/assets/admission-webhook).
#### Hypershift / Hosted Control Planes
The early configuration validation could be deployed on both the management and hosted clusters; wherever CMO is deployed.
#### Standalone Clusters
#### Single-node Deployments or MicroShift
### Implementation Details/Notes/Constraints
To avoid any divergence (the validate webhook producing false positives), the webhook will be
running the same code (a subset of the checks) that CMO runs when laoding and applying the config.
running the same code (a subset of the checks) that CMO runs when loading and applying the config.
CMO will expose the webhook at `:8443/validate-webhook/monitoringconfigmaps`.
### Risks and Mitigations
In the event that the validation webhook makes incorrect decisions (which we aim to keep rare, as the webhook will run the same code that CMO uses when applying the configuration), users will have the option to temporarily bypass the CMO webhook. This can be done by adding the label `monitoringconfigmaps.openshift.io/skip-validate-webhook: true` to the ConfigMaps.
In the event that the validation webhook makes incorrect decisions (which we aim to keep rare, as the webhook will run a subset of the validation code that CMO runs when applying the configuration), users will have the option to temporarily bypass the CMO webhook in case of a bug or a misbehaviour in the code paths engaged by the webhook (CMO server, apiserver etc.).
This can be done by adding the label `monitoringconfigmaps.openshift.io/skip-validate-webhook: true` to the ConfigMaps.
The label is alos used to simulate and test scenarios where the webhook is skipped (e.g., CMO pod down).
Additionally, the webhook endpoint will not perform client authentication on `/validate-webhook/monitoringconfigmaps`. Another proposal will be initiated to discuss how to facilitate easier identification of requests from the apiserver for webhooks in OCP.
### Drawbacks
Some users who may have been relying on or exploiting the lack of pre-validation will need to adapt, as their invalid changes to the ConfigMaps will now be denied by the apiserver. This adaptation is necessary for the future transition to CRD-based configuration. Tightening the configurations now serves as a preparatory step for the upcoming CRD-based configuration effort.
This drawback applies if the cluster upgrade bypasses the mechanism detailed in the [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy), which ensures the ConfigMaps are adjusted before the upgrade.
## Open Questions [optional]
## Test Plan
Since the webhook will be enabled by default, all existing tests that create or update the ConfigMaps holding the monitoring configuration are considered tests for the webhook itself.
Additionally, unit and e2e tests will be added (or adjusted) to better highlight invalid configurations scenarios.
Additionally, unit and e2e tests will be added (or adjusted) to better highlight invalid configurations scenarios similar to those mentioned below:
### Malformed YAML.
```
config.yaml: |
prometheusK8s::
retention: 1d
```
### Invalid field.
```
config.yaml: |
prometheusk8s:
...
```
### No longer supported field.
```
config.yaml: |
grafana:
...
```
Invalid configuration related tests outside the CMO repository will also need to be adjusted accordingly.
Expand All @@ -181,13 +222,15 @@ N/A
### Removing a deprecated feature
Once CRD-based configuration is GA, configuration via ConfigMaps will no longer be allowed, and the webhook logic will be removed.
Once CRD-based configuration is GA and configuration via ConfigMaps will is longer allowed, the webhook logic will become useless and will be removed.
## Upgrade / Downgrade Strategy
Even after CMO is upgraded to a version with the webhook enabled, as long as the existing monitoring config ConfigMaps are not updated, they will not be flagged by the webhook.
A change in `4.17.z` will make CMO report `upgradeable=false` if the existing configs contain malformed JSON/YAML, invalid fields, no longer supported fields, or duplicated fields. We will ensure clusters reach that version before being able to upgrade to `4.18`. This will help avoid blocking implicit or unplanned changes to ConfigMaps with invalid configs during the upgrade.
[A change in `4.17.5`](https://issues.redhat.com/browse/OCPBUGS-43690) will make CMO report `upgradeable=false` if the existing configs contain malformed JSON/YAML, invalid fields, no longer supported fields, or duplicated fields. We will ensure clusters reach that version before being able to upgrade to `4.18`. This will help avoid blocking implicit or unplanned changes to ConfigMaps with invalid configs during the upgrade.
(Credit goes to "@simonpasquier" for that idea.)
Upgrades will be covered by existing upgrade tests.
Expand Down Expand Up @@ -233,6 +276,7 @@ Cross-referencing these with the apiserver logs should provide detailed insights
## Alternatives
Wait for CRD based configs to be GA.
Wait for CRD based configs to be GA. See [OBSDA-212](https://issues.redhat.com/browse/OBSDA-212) and the related [proposal](https://github.com/openshift/enhancements/pull/1627).
As a reminder, this proposal does not intend to prevent or postpone the planned CRD based effort, see [Motivation](#motivation) and [Non-Goals](#non-goals)
## Infrastructure Needed [optional]

0 comments on commit 309d93a

Please sign in to comment.