
Merge pull request #54019 from openshift-cherrypick-robot/cherry-pick-50952-to-enterprise-4.12

[enterprise-4.12] TELCODOCS 477 CNF-3882 4.12 ACM Topology Aware Lifecycle Manager
jeana-redhat authored Dec 19, 2022
2 parents 1623974 + f3e75b3 commit 9564cc3
Showing 12 changed files with 510 additions and 218 deletions.
@@ -14,5 +14,8 @@ The {cgu-operator-first} manages the deployment of {rh-rhacm-first} policies for
* The update order of the clusters
* The set of policies remediated to the cluster
* The order of policies remediated to the cluster
* The assignment of a canary cluster
For {sno}, the {cgu-operator-first} can create a backup of a deployment before an upgrade. If the upgrade fails, you can recover the previous version and restore the cluster to a working state without needing to reprovision applications.

{cgu-operator} supports the orchestration of the {product-title} y-stream and z-stream updates, and day-two operations on y-streams and z-streams.
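
For example, a minimal `ClusterGroupUpgrade` CR that expresses these choices might look like the following sketch (cluster and policy names are illustrative):

[source,yaml]
----
apiVersion: ran.openshift.io/v1alpha1
kind: ClusterGroupUpgrade
metadata:
  name: cgu-example
  namespace: default
spec:
  clusters: # the set and order of clusters to update
  - spoke1
  - spoke2
  managedPolicies: # the policies to remediate, in order
  - example-policy
  remediationStrategy:
    canaries: # canary clusters are updated before the rest
    - spoke1
    maxConcurrency: 1
    timeout: 240
----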
525 changes: 372 additions & 153 deletions modules/cnf-topology-aware-lifecycle-manager-about-cgu-crs.adoc

Large diffs are not rendered by default.

30 changes: 21 additions & 9 deletions modules/cnf-topology-aware-lifecycle-manager-apply-policies.adoc
@@ -41,11 +41,13 @@ spec:
  remediationStrategy:
    maxConcurrency: 2 <3>
    timeout: 240 <4>
  batchTimeoutAction: <5>
----
<1> The names of the policies to apply.
<2> The list of clusters to update.
<3> The `maxConcurrency` field specifies the number of clusters updated at the same time.
<4> The update timeout in minutes.
<5> Controls what happens if a batch times out. Possible values are `abort` or `continue`. If unspecified, the default is `continue`.

. Create the `ClusterGroupUpgrade` CR by running the following command:
+
@@ -65,8 +67,8 @@ $ oc get cgu --all-namespaces
+
[source,terminal]
----
NAMESPACE   NAME    AGE     STATE        DETAILS
default     cgu-1   8m55s   NotEnabled   Not Enabled
----

.. Check the status of the update by running the following command:
@@ -85,10 +87,10 @@ $ oc get cgu -n default cgu-1 -ojsonpath='{.status}' | jq
"conditions": [
{
"lastTransitionTime": "2022-02-25T15:34:07Z",
"message": "The ClusterGroupUpgrade CR is not enabled", <1>
"reason": "UpgradeNotStarted",
"message": "Not enabled", <1>
"reason": "NotEnabled",
"status": "False",
"type": "Ready"
"type": "Progressing"
}
],
"copiedPolicies": [
@@ -204,11 +206,21 @@ $ oc get cgu -n default cgu-1 -ojsonpath='{.status}' | jq
"computedMaxConcurrency": 2,
"conditions": [ <1>
{
"lastTransitionTime": "2022-02-25T15:33:07Z",
"message": "All selected clusters are valid",
"reason": "ClusterSelectionCompleted",
"status": "True",
"type": "ClustersSelected",
"lastTransitionTime": "2022-02-25T15:33:07Z",
"message": "Completed validation",
"reason": "ValidationCompleted",
"status": "True",
"type": "Validated",
"lastTransitionTime": "2022-02-25T15:34:07Z",
"message": "The ClusterGroupUpgrade CR has upgrade policies that are still non compliant",
"reason": "UpgradeNotCompleted",
"status": "False",
"type": "Ready"
"message": "Remediating non-compliant policies",
"reason": "InProgress",
"status": "True",
"type": "Progressing"
}
],
"copiedPolicies": [
38 changes: 29 additions & 9 deletions modules/cnf-topology-aware-lifecycle-manager-backup-concept.adoc
@@ -8,17 +8,37 @@

For {sno}, the {cgu-operator-first} can create a backup of a deployment before an upgrade. If the upgrade fails, you can recover the previous version and restore the cluster to a working state without needing to reprovision applications.

To use the backup feature, you first create a `ClusterGroupUpgrade` CR with the `backup` field set to `true`. To ensure that the contents of the backup are up to date, the backup is not taken until you set the `enable` field in the `ClusterGroupUpgrade` CR to `true`.
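
A minimal sketch of the relevant `spec` fields (other required fields are omitted):

[source,yaml]
----
spec:
  backup: true # request a backup before the update starts
  enable: true # the backup is only taken once the CR is enabled
----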

{cgu-operator} uses the `BackupSucceeded` condition to report the status and reasons as follows:

* `true`
+
Backup is completed for all clusters or the backup run has completed but failed for one or more clusters. If backup fails for any cluster, the update does not proceed for that cluster.
* `false`
+
Backup is still in progress for one or more clusters or has failed for all clusters. The backup process running in the spoke clusters can have the following statuses:
+
** `PreparingToStart`
+
The first reconciliation pass is in progress. The {cgu-operator} deletes any spoke backup namespace and hub view resources that have been created in a failed upgrade attempt.
** `Starting`
+
The backup prerequisites and backup job are being created.
** `Active`
+
The backup is in progress.
** `Succeeded`
+
The backup succeeded.
** `BackupTimeout`
+
Artifact backup is partially done.
** `UnrecoverableError`
+
The backup has ended with a non-zero exit code.
[NOTE]
====
If the backup of a cluster fails and enters the `BackupTimeout` or `UnrecoverableError` state, the cluster update does not proceed for that cluster. Updates to other clusters are not affected and continue.
====
18 changes: 11 additions & 7 deletions modules/cnf-topology-aware-lifecycle-manager-backup-feature.adoc
@@ -50,7 +50,7 @@ nodes:

.Procedure

. Save the contents of the `ClusterGroupUpgrade` CR with the `backup` and `enable` fields set to `true` in the `clustergroupupgrades-group-du.yaml` file:
+
[source,yaml]
----
@@ -65,7 +65,7 @@ spec:
  clusters:
  - cnfdb1
  - cnfdb2
  enable: true
  managedPolicies:
  - du-upgrade-platform-upgrade
  remediationStrategy:
@@ -101,21 +101,25 @@ $ oc get cgu -n ztp-group-du-sno du-upgrade-4918 -o jsonpath='{.status}'
],
"status": {
"cnfdb1": "Succeeded",
"cnfdb2": "Succeeded"
"cnfdb2": "Failed" <1>
}
},
"computedMaxConcurrency": 1,
"conditions": [
{
"lastTransitionTime": "2022-04-05T10:37:19Z",
"message": "Backup is completed",
"reason": "BackupCompleted",
"status": "True",
"type": "BackupDone"
"message": "Backup failed for 1 cluster", <2>
"reason": "PartiallyDone", <3>
"status": "True", <4>
"type": "Succeeded"
}
],
"precaching": {
"spec": {}
},
"status": {}
----
<1> Backup has failed for one cluster.
<2> The message confirms that the backup failed for one cluster.
<3> The backup was partially successful.
<4> The backup process has finished.
@@ -34,7 +34,7 @@ $ oc delete cgu/du-upgrade-4918 -n ztp-group-du-sno
+
[source,terminal]
----
$ ostree admin status
----
.Example outputs
+
@@ -16,4 +16,6 @@ If a spoke cluster does not report any compliant state to {rh-rhacm}, the manage
* If a policy's `status.status` is missing, {cgu-operator} produces an error.
* If a cluster's compliance status is missing in the policy's `status.status` field, {cgu-operator} considers that cluster to be non-compliant with that policy.
The `ClusterGroupUpgrade` CR's `batchTimeoutAction` determines what happens if an upgrade fails for a cluster. You can specify `continue` to skip the failing cluster and continue to upgrade other clusters, or specify `abort` to stop the policy remediation for all clusters. Once the timeout elapses, {cgu-operator} removes all enforce policies to ensure that no further updates are made to clusters.
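
For example, the following `spec` fragment is a sketch that stops remediation for all clusters when a batch times out (other required fields are omitted):

[source,yaml]
----
spec:
  remediationStrategy:
    maxConcurrency: 2
    timeout: 240
  batchTimeoutAction: abort # the default, continue, skips the failing cluster instead
----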

For more information about {rh-rhacm} policies, see link:https://access.redhat.com/documentation/en-us/red_hat_advanced_cluster_management_for_kubernetes/{rh-rhacm-version}/html-single/governance/index#policy-overview[Policy overview].
52 changes: 42 additions & 10 deletions modules/cnf-topology-aware-lifecycle-manager-precache-concept.adoc
@@ -6,23 +6,55 @@
[id="talo-precache-feature-concept_{context}"]
= Using the container image pre-cache feature

Clusters might have limited bandwidth to access the container image registry, which can cause a timeout before the updates are completed.

[NOTE]
====
The start time of the update is not set by {cgu-operator}. You apply the `ClusterGroupUpgrade` CR at the beginning of the update, either manually or by external automation.
====

The container image pre-caching starts when the `preCaching` field is set to `true` in the `ClusterGroupUpgrade` CR.

{cgu-operator} uses the `PrecacheSpecValid` condition to report status information as follows:

* `true`
+
The pre-caching spec is valid and consistent.
* `false`
+
The pre-caching spec is incomplete.
{cgu-operator} uses the `PrecachingSucceeded` condition to report status information as follows:

* `true`
+
{cgu-operator} has concluded the pre-caching process. If pre-caching fails for any cluster, the update fails for that cluster but proceeds for all other clusters. A message informs you if pre-caching has failed for any clusters.
* `false`
+
Pre-caching is still in progress for one or more clusters or has failed for all clusters.
After a successful pre-caching process, you can start remediating policies. The remediation actions start when the `enable` field is set to `true`. If there is a pre-caching failure on a cluster, the upgrade fails for that cluster. The upgrade process continues for all other clusters that have a successful pre-cache.
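
A sketch of the resulting two-step flow in the `spec` fields (other required fields are omitted):

[source,yaml]
----
spec:
  preCaching: true # start pulling the container images on the spoke clusters
  enable: false # set to true after pre-caching succeeds to start remediation
----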

The pre-caching process can be in the following statuses:

* `NotStarted`
+
This is the initial state all clusters are automatically assigned to on the first reconciliation pass of the `ClusterGroupUpgrade` CR. In this state, {cgu-operator} deletes any pre-caching namespace and hub view resources of spoke clusters that remain from previous incomplete updates. {cgu-operator} then creates a new `ManagedClusterView` resource for the spoke pre-caching namespace to verify its deletion in the `PrecachePreparing` state.
* `PreparingToStart`
+
Cleaning up any remaining resources from previous incomplete updates is in progress.
* `Starting`
+
Pre-caching job prerequisites and the job are created.
* `Active`
+
The job is in "Active" state.
* `Succeeded`
+
The pre-cache job succeeded.
* `PrecacheTimeout`
+
The artifact pre-caching is partially done.
* `UnrecoverableError`
+
The job ends with a non-zero exit code.
28 changes: 11 additions & 17 deletions modules/cnf-topology-aware-lifecycle-manager-precache-feature.adoc
@@ -39,7 +39,7 @@ spec:
----
<1> The `preCaching` field is set to `true`, which enables {cgu-operator} to pull the container images before starting the update.

. When you want to start pre-caching, apply the `ClusterGroupUpgrade` CR by running the following command:
+
[source,terminal]
----
@@ -59,8 +59,8 @@ $ oc get cgu -A
+
[source,terminal]
----
NAMESPACE          NAME              AGE   STATE        DETAILS
ztp-group-du-sno   du-upgrade-4918   10s   InProgress   Precaching is required and not done <1>
----
<1> The CR is created.

@@ -77,19 +77,12 @@ $ oc get cgu -n ztp-group-du-sno du-upgrade-4918 -o jsonpath='{.status}'
----
{
"conditions": [
{
"lastTransitionTime": "2022-01-27T19:07:24Z",
"message": "Precaching is required and not done",
"reason": "InProgress",
"status": "False",
"type": "PrecachingSucceeded"
},
{
"lastTransitionTime": "2022-01-27T19:07:34Z",
@@ -101,17 +94,18 @@ $ oc get cgu -n ztp-group-du-sno du-upgrade-4918 -o jsonpath='{.status}'
],
"precaching": {
"clusters": [
"cnfdb1" <2>
"cnfdb1" <1>
"cnfdb2"
],
"spec": {
"platformImage": "image.example.io"},
"status": {
"cnfdb1": "Active"}
"cnfdb1": "Active"
"cnfdb2": "Succeeded"}
}
}
----
<1> Displays the list of identified clusters.

. Check the status of the pre-caching job by running the following command on the spoke cluster:
+
@@ -155,7 +149,7 @@ $ oc get cgu -n ztp-group-du-sno du-upgrade-4918 -o jsonpath='{.status}'
"message": "Precaching is completed",
"reason": "PrecachingCompleted",
"status": "True",
"type": "PrecachingDone" <1>
"type": "PrecachingSucceeded" <1>
}
----
<1> The pre-cache tasks are done.
28 changes: 18 additions & 10 deletions modules/cnf-topology-aware-lifecycle-manager-troubleshooting.adoc
@@ -220,9 +220,9 @@ spoke3 true https://api.spoke3.testlab.com:6443 True
<1> The value of the `AVAILABLE` field is `True` for the managed clusters.

[discrete]
=== Checking clusterLabelSelector

Issue:: You want to check if the `clusterLabelSelector` field specified in the `ClusterGroupUpgrade` CR matches at least one of the managed clusters.

Resolution:: Run the following command:
+
@@ -250,16 +250,14 @@ Issue:: You want to check if the canary clusters are present in the list of clus
[source,yaml]
----
spec:
  clusters:
  - spoke1
  - spoke3
  remediationStrategy:
    canaries:
    - spoke3
    maxConcurrency: 2
    timeout: 240
  clusterLabelSelectors:
  - matchLabels:
      upgrade: true
----

Resolution:: Run the following commands:
@@ -276,7 +274,7 @@ $ oc get cgu lab-upgrade -ojsonpath='{.spec.clusters}'
["spoke1", "spoke3"]
----

. Check if the canary clusters are present in the list of clusters that match `clusterLabelSelector` labels by running the following command:
+
[source,terminal]
----
@@ -294,7 +292,7 @@ spoke3 true https://api.spoke3.testlab.com:6443 True Tr

[NOTE]
====
A cluster can be present in `spec.clusters` and also be matched by the `spec.clusterLabelSelector` label.
====

[discrete]
@@ -367,7 +365,7 @@ $ oc get cgu lab-upgrade -ojsonpath='{.status.conditions}'
+
[source,json]
----
{"lastTransitionTime":"2022-02-17T22:25:28Z", "message":"The ClusterGroupUpgrade CR has managed policies that are missing:[policyThatDoesntExist]", "reason":"UpgradeCannotStart", "status":"False", "type":"Ready"}
{"lastTransitionTime":"2022-02-17T22:25:28Z", "message":"Missing managed policies:[policyList]", "reason":"NotAllManagedPoliciesExist", "status":"False", "type":"Validated"}
----

[discrete]
@@ -435,3 +433,13 @@ ERROR controller-runtime.manager.controller.clustergroupupgrade Reconciler error
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
----
<1> Displays the error.

[discrete]
=== Clusters are not compliant with some policies after a `ClusterGroupUpgrade` CR has completed

Issue:: The policy compliance status that {cgu-operator} uses to decide if remediation is needed has not yet fully updated for all clusters.
This may be because:
+
* The CGU was run too soon after a policy was created or updated.
* The remediation of a policy affects the compliance of subsequent policies in the `ClusterGroupUpgrade` CR.

Resolution:: Create and apply a new `ClusterGroupUpgrade` CR with the same specification.
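+
For example, assuming the original CR is named `cgu-1` in the `default` namespace and is saved in `cgu-1.yaml`, the replacement might look like this:
+
[source,terminal]
----
$ oc delete cgu cgu-1 -n default
$ oc apply -f cgu-1.yaml
----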
