Merge pull request #551 from slintes/external-remediation-template
Propose to backport the "external remediation template" feature
openshift-merge-robot authored Apr 19, 2021
2 parents 0a070cf + 817c3d9 commit a658e5a
Showing 1 changed file with 242 additions and 0 deletions.
242 changes: 242 additions & 0 deletions enhancements/machine-api/external-remediations.md
@@ -0,0 +1,242 @@
---
title: external-remediations
authors:
- "@slintes"
reviewers:
- "@beekhof"
- "@n1r1"
approvers:
- "@JoelSpeed"
- "@michaelgugino"
- "@enxebre"
creation-date: 2020-11-29
last-updated: 2020-11-29
status: implementable
see-also:
- https://github.com/kubernetes-sigs/cluster-api/blob/master/docs/proposals/20191030-machine-health-checking.md
- https://github.com/kubernetes-sigs/cluster-api/pull/3606
---

# External remediations

## Release Signoff Checklist

- [ ] Enhancement is `implementable`
- [ ] Design details are appropriately documented from clear requirements
- [ ] Test plan is defined
- [ ] Graduation criteria for dev preview, tech preview, GA
- [ ] User-facing documentation is created in [openshift-docs](https://github.com/openshift/openshift-docs/)

## Summary

By using `MachineHealthChecks` a cluster admin can configure automatic remediation of unhealthy machines and nodes.
The machine healthcheck controller's remediation strategy is deleting the machine, and letting the cloud provider
create a new one. This isn't the best remediation strategy in all environments.

There is already a mechanism to provide an alternative, external remediation strategy, by adding an annotation to the
`MachineHealthCheck` and then to `Machine`s. However, this isn't very maintainable and diverges from upstream.

With this enhancement we propose a better, future-proof mechanism that aligns us with the mechanism implemented upstream.
This proposal is a backport of parts of the upstream machine healthcheck proposal [0], which
is already implemented [1].

- [0] [upstream machine healthcheck proposal](https://github.com/kubernetes-sigs/cluster-api/blob/master/docs/proposals/20191030-machine-health-checking.md)
- [1] [upstream machine healthcheck implementation](https://github.com/kubernetes-sigs/cluster-api/pull/3606)

## Motivation

- Hardware-based clusters are significantly slower to (re)provision unhealthy machines,
so they need a remediation flow that includes at least one attempt at power-cycling unhealthy nodes.
- Other environments and vendors also have specific remediation requirements, so there is a need to provide a generic
mechanism for implementing custom remediation logic.

### Goals

- Create the ability to define customized remediation flows outside of the Machine Health Check and Machine API
Operator codebase.
- Migrate the existing external remediation mechanism to the new one.
- Ensure that at most one external mechanism will attempt to recover the node.

### Non-Goals

TBD

## Proposal

We propose modifying the MachineHealthCheck CRD to add a new field, `externalRemediationTemplate`, an ObjectReference to
a provider-specific remediation template CR.

### User Stories

#### Story 1

As an admin of a hardware based cluster, I would like unhealthy nodes to be power-cycled, so that I can recover
from transient errors faster and begin application recovery sooner.

#### Story 2

As an admin of a hardware-based cluster, I would like unhealthy nodes to be power-cycled, so that I can detect
non-transient issues faster.
If automatic power-cycling doesn't resolve the issue, the admin can rule out transient causes such as software bugs.

### Implementation Details/Notes/Constraints

If no value for externalRemediationTemplate is defined for the MachineHealthCheck CR, the existing remediation flow
is preserved.

If a value for externalRemediationTemplate is supplied and the Machine enters an unhealthy state, the template is
instantiated with the same name and namespace as the target Machine. This hands the remediation flow over to an
External Remediation Controller (ERC) watching for that CR.

No further action (deletion or applying conditions) is taken by the MachineHealthCheck controller until the
Node becomes healthy. After that, it locates and deletes the instantiated External Machine Remediation (EMR) CR.

We use the same name and namespace for the External Machine Remediation (EMR) CR to ensure uniqueness and reduce the
possibility of multiple parallel remediations of the same Machine.

The lifespan of the EMRs is that of the remediation process, and they are not intended to be a record of past events.
The EMR will also contain an ownerRef to the Machine, to ensure that it does not outlive the Machine it references.

The only signaling between the MHC and the external controller watching EMR CRs is the creation and deletion of the
EMR itself. Any actions or changes that admins should be informed about should be emitted as events for consoles
and UIs to consume if necessary. These events are informational only and neither trigger nor expect any behaviour
from the MHC, Node, or Machine.

When the external remediation controller detects the new EMR it starts remediation and performs whatever actions
it deems appropriate until the EMR is deleted by the MHC. When and how to retry remediation in the event that
an EMR is not deleted after the ERC considers remediation complete is left to the ERC.

The ERC may wish to register a finalizer on its CR to ensure it has an opportunity to perform any additional cleanup
in the case that the unhealthy state was transient and the Node returned to a healthy state prior to the completion
of the full custom ERC flow.
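
As an illustration of the finalizer handling, the following sketch shows one possible shape of an ERC reconcile step. It is not taken from any existing controller: the `EMR` struct, the `reconcile` function and its callbacks are hypothetical, and the finalizer name is borrowed from the Metal3Remediation example in this document.

```go
package main

import "fmt"

// remediationFinalizer is the finalizer name used in this document's example CR.
const remediationFinalizer = "remediation.metal3.io"

// EMR holds only the metadata this sketch needs.
type EMR struct {
	Finalizers []string
	Deleting   bool // true once deletion was requested (by the MHC)
}

func hasFinalizer(e *EMR) bool {
	for _, f := range e.Finalizers {
		if f == remediationFinalizer {
			return true
		}
	}
	return false
}

func removeFinalizer(e *EMR) {
	kept := e.Finalizers[:0]
	for _, f := range e.Finalizers {
		if f != remediationFinalizer {
			kept = append(kept, f)
		}
	}
	e.Finalizers = kept
}

// reconcile performs one step and returns a label for what it did.
// remediate and cleanup are provider-specific callbacks.
func reconcile(e *EMR, remediate, cleanup func()) string {
	if e.Deleting {
		if hasFinalizer(e) {
			// The MHC deleted the EMR, possibly before the custom flow
			// finished: undo partial work, then release the object.
			cleanup()
			removeFinalizer(e)
			return "cleaned-up"
		}
		return "nothing-to-do"
	}
	if !hasFinalizer(e) {
		// Register the finalizer before doing anything that needs cleanup.
		e.Finalizers = append(e.Finalizers, remediationFinalizer)
		return "finalizer-added"
	}
	remediate()
	return "remediating"
}

func main() {
	emr := &EMR{}
	noop := func() {}
	fmt.Println(reconcile(emr, noop, noop))
	fmt.Println(reconcile(emr, noop, noop))
	emr.Deleting = true
	fmt.Println(reconcile(emr, noop, noop))
}
```

Adding the finalizer before the first remediation action means every action that might need undoing happens while the ERC is still guaranteed a chance to clean up.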

#### MHC struct enhancement

```go
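// corev1 refers to the "k8s.io/api/core/v1" package.
type MachineHealthCheckSpec struct {
	// ... existing fields omitted ...

	// ExternalRemediationTemplate references a provider-specific remediation
	// template CR. If unset, the existing remediation flow is preserved.
	// +optional
	ExternalRemediationTemplate *corev1.ObjectReference `json:"externalRemediationTemplate,omitempty"`
}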
```

#### Example CRs

MachineHealthCheck:
```yaml
kind: MachineHealthCheck
apiVersion: machine.openshift.io/v1beta1
metadata:
  name: REMEDIATION_GROUP
  namespace: NAMESPACE_OF_UNHEALTHY_MACHINE
spec:
  selector:
    matchLabels:
      ...
  externalRemediationTemplate:
    kind: Metal3RemediationTemplate
    apiVersion: remediation.metal3.io/v1alphaX
    name: M3_REMEDIATION_GROUP
```

Metal3RemediationTemplate:
```yaml
kind: Metal3RemediationTemplate
apiVersion: remediation.metal3.io/v1alphaX
metadata:
  name: M3_REMEDIATION_GROUP
  namespace: openshift-machine-api
spec:
  template:
    spec:
      strategy: escalate
      deleteAfterRetries: 10
      powerOnTimeoutSeconds: 600
      powerOffTimeoutSeconds: 120
```

Metal3Remediation:
```yaml
apiVersion: remediation.metal3.io/v1alphaX
kind: Metal3Remediation
metadata:
  name: NAME_OF_UNHEALTHY_MACHINE
  namespace: openshift-machine-api
  finalizers:
  - remediation.metal3.io
  ownerReferences:
  - apiVersion: machine.openshift.io/v1beta1
    kind: Machine
    name: NAME_OF_UNHEALTHY_MACHINE
    uid: ...
spec:
  strategy: escalate
  deleteAfterRetries: 10
  powerOnTimeoutSeconds: 600
  powerOffTimeoutSeconds: 120
status:
  phase: power-off
  retryCount: 1
```

### Risks and Mitigations

No known risks.

## Design Details

### Open Questions

See the open questions under "Removing a deprecated feature" and "Upgrade / Downgrade Strategy".

### Test Plan

The existing external remediation tests will be reviewed / adapted / extended as needed, and the upstream tests will
be backported as well.

### Graduation Criteria

TBD

#### Examples

TBD

##### Dev Preview -> Tech Preview

TBD

##### Tech Preview -> GA

TBD

##### Removing a deprecated feature

- The annotation based external remediation needs to be deprecated
- Open question: for how long do we need to support both mechanisms in parallel (if at all)?

### Upgrade / Downgrade Strategy

- Open question: do we need an automatic MHC conversion from the existing annotation based mechanism to the new one?

### Version Skew Strategy

There is a dependency between the machine-api-operator (which contains the machine healthcheck controller) and
cluster-api-provider-baremetal (which provides the current external baremetal remediation controller). Both are part
of the OCP release payload, so there can be a short-lived version skew during upgrades. This isn't a
problem though, because an updated MHC can only be applied (or an automatic conversion can only happen) when both
controllers and their CRDs are updated: the MHC needs to have the new template field, and the remediation CRD and
its template CRD need to exist.

## Implementation History

- [x] 11/30/2020: Opened enhancement PR

## Drawbacks

No known drawbacks.

## Alternatives

- Keep the existing annotation based mechanism.
