
Merge pull request #54019 from openshift-cherrypick-robot/cherry-pick-50952-to-enterprise-4.12

[enterprise-4.12] TELCODOCS 477 CNF-3882 4.12 ACM Topology Aware Lifecycle Manager
jeana-redhat authored Dec 19, 2022
2 parents 1623974 + f3e75b3 commit 9564cc3
Showing 12 changed files with 510 additions and 218 deletions.
@@ -14,5 +14,8 @@ The {cgu-operator-first} manages the deployment of {rh-rhacm-first} policies for
* The update order of the clusters
* The set of policies remediated to the cluster
* The order of policies remediated to the cluster
* The assignment of a canary cluster
For {sno}, the {cgu-operator-first} can create a backup of a deployment before an upgrade. If the upgrade fails, you can recover the previous version and restore the cluster to a working state without needing to reprovision applications.

{cgu-operator} supports the orchestration of the {product-title} y-stream and z-stream updates, and day-two operations on y-streams and z-streams.
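
For example, a minimal `ClusterGroupUpgrade` CR that expresses these choices might look like the following sketch (cluster and policy names are illustrative):

[source,yaml]
----
apiVersion: ran.openshift.io/v1alpha1
kind: ClusterGroupUpgrade
metadata:
  name: cgu-example
  namespace: default
spec:
  clusters: # the set and order of clusters to update
  - spoke1
  - spoke2
  managedPolicies: # the policies to remediate, in order
  - example-policy
  remediationStrategy:
    canaries: # canary clusters are updated before the rest
    - spoke1
    maxConcurrency: 1
    timeout: 240
----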
525 changes: 372 additions & 153 deletions modules/cnf-topology-aware-lifecycle-manager-about-cgu-crs.adoc

Large diffs are not rendered by default.

30 changes: 21 additions & 9 deletions modules/cnf-topology-aware-lifecycle-manager-apply-policies.adoc
@@ -41,11 +41,13 @@ spec:
  remediationStrategy:
    maxConcurrency: 2 <3>
    timeout: 240 <4>
  batchTimeoutAction: <5>
----
<1> The names of the policies to apply.
<2> The list of clusters to update.
<3> The `maxConcurrency` field specifies the number of clusters updated at the same time.
<4> The update timeout in minutes.
<5> Controls what happens if a batch times out. Possible values are `abort` or `continue`. If unspecified, the default is `continue`.

. Create the `ClusterGroupUpgrade` CR by running the following command:
+
@@ -65,8 +67,8 @@ $ oc get cgu --all-namespaces
+
[source,terminal]
----
NAMESPACE   NAME    AGE     STATE        DETAILS
default     cgu-1   8m55s   NotEnabled   Not Enabled
----

.. Check the status of the update by running the following command:
@@ -85,10 +87,10 @@ $ oc get cgu -n default cgu-1 -ojsonpath='{.status}' | jq
"conditions": [
{
"lastTransitionTime": "2022-02-25T15:34:07Z",
"message": "The ClusterGroupUpgrade CR is not enabled", <1>
"reason": "UpgradeNotStarted",
"message": "Not enabled", <1>
"reason": "NotEnabled",
"status": "False",
"type": "Ready"
"type": "Progressing"
}
],
"copiedPolicies": [
@@ -204,11 +206,21 @@ $ oc get cgu -n default cgu-1 -ojsonpath='{.status}' | jq
"computedMaxConcurrency": 2,
"conditions": [ <1>
{
"lastTransitionTime": "2022-02-25T15:33:07Z",
"message": "All selected clusters are valid",
"reason": "ClusterSelectionCompleted",
"status": "True",
"type": "ClustersSelected",
"lastTransitionTime": "2022-02-25T15:33:07Z",
"message": "Completed validation",
"reason": "ValidationCompleted",
"status": "True",
"type": "Validated",
"lastTransitionTime": "2022-02-25T15:34:07Z",
"message": "The ClusterGroupUpgrade CR has upgrade policies that are still non compliant",
"reason": "UpgradeNotCompleted",
"status": "False",
"type": "Ready"
"message": "Remediating non-compliant policies",
"reason": "InProgress",
"status": "True",
"type": "Progressing"
}
],
"copiedPolicies": [
38 changes: 29 additions & 9 deletions modules/cnf-topology-aware-lifecycle-manager-backup-concept.adoc
@@ -8,17 +8,37 @@

For {sno}, the {cgu-operator-first} can create a backup of a deployment before an upgrade. If the upgrade fails, you can recover the previous version and restore the cluster to a working state without needing to reprovision applications.

To use the backup feature, you first create a `ClusterGroupUpgrade` CR with the `backup` field set to `true`. To ensure that the contents of the backup are up to date, the backup is not taken until you set the `enable` field in the `ClusterGroupUpgrade` CR to `true`.
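
A minimal sketch of the relevant `spec` fields (other required fields are omitted):

[source,yaml]
----
spec:
  backup: true # request a backup before the update starts
  enable: true # the backup is only taken once the CR is enabled
----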

{cgu-operator} uses the `BackupSucceeded` condition to report the status and reasons as follows:

* `true`
+
Backup is completed for all clusters or the backup run has completed but failed for one or more clusters. If backup fails for any cluster, the update does not proceed for that cluster.
* `false`
+
Backup is still in progress for one or more clusters or has failed for all clusters. The backup process running in the spoke clusters can have the following statuses:
+
** `PreparingToStart`
+
The first reconciliation pass is in progress. The {cgu-operator} deletes any spoke backup namespace and hub view resources that have been created in a failed upgrade attempt.
** `Starting`
+
The backup prerequisites and backup job are being created.
** `Active`
+
The backup is in progress.
** `Succeeded`
+
The backup succeeded.
** `BackupTimeout`
+
Artifact backup is partially done.
** `UnrecoverableError`
+
The backup has ended with a non-zero exit code.
[NOTE]
====
If the backup of a cluster fails and enters the `BackupTimeout` or `UnrecoverableError` state, the cluster update does not proceed for that cluster. Updates to other clusters are not affected and continue.
====
18 changes: 11 additions & 7 deletions modules/cnf-topology-aware-lifecycle-manager-backup-feature.adoc
@@ -50,7 +50,7 @@ nodes:

.Procedure

. Save the contents of the `ClusterGroupUpgrade` CR with the `backup` and `enable` fields set to `true` in the `clustergroupupgrades-group-du.yaml` file:
+
[source,yaml]
----
@@ -65,7 +65,7 @@ spec:
  clusters:
  - cnfdb1
  - cnfdb2
  enable: true
  managedPolicies:
  - du-upgrade-platform-upgrade
  remediationStrategy:
@@ -101,21 +101,25 @@ $ oc get cgu -n ztp-group-du-sno du-upgrade-4918 -o jsonpath='{.status}'
],
"status": {
"cnfdb1": "Succeeded",
"cnfdb2": "Succeeded"
"cnfdb2": "Failed" <1>
}
},
"computedMaxConcurrency": 1,
"conditions": [
{
"lastTransitionTime": "2022-04-05T10:37:19Z",
"message": "Backup is completed",
"reason": "BackupCompleted",
"status": "True",
"type": "BackupDone"
"message": "Backup failed for 1 cluster", <2>
"reason": "PartiallyDone", <3>
"status": "True", <4>
"type": "Succeeded"
}
],
"precaching": {
"spec": {}
},
"status": {}
----
<1> Backup has failed for one cluster.
<2> The message confirms that the backup failed for one cluster.
<3> The backup was partially successful.
<4> The backup process has finished.
@@ -34,7 +34,7 @@ $ oc delete cgu/du-upgrade-4918 -n ztp-group-du-sno
+
[source,terminal]
----
$ ostree admin status
----
.Example outputs
+
@@ -16,4 +16,6 @@ If a spoke cluster does not report any compliant state to {rh-rhacm}, the manage
* If a policy's `status.status` is missing, {cgu-operator} produces an error.
* If a cluster's compliance status is missing in the policy's `status.status` field, {cgu-operator} considers that cluster to be non-compliant with that policy.
The `ClusterGroupUpgrade` CR's `batchTimeoutAction` determines what happens if an upgrade fails for a cluster. You can specify `continue` to skip the failing cluster and continue to upgrade other clusters, or specify `abort` to stop the policy remediation for all clusters. Once the timeout elapses, {cgu-operator} removes all enforce policies to ensure that no further updates are made to clusters.
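
For example, the following `spec` fragment is a sketch that stops remediation for all clusters when a batch times out (other required fields are omitted):

[source,yaml]
----
spec:
  remediationStrategy:
    maxConcurrency: 2
    timeout: 240
  batchTimeoutAction: abort # the default, continue, skips the failing cluster instead
----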

For more information about {rh-rhacm} policies, see link:https://access.redhat.com/documentation/en-us/red_hat_advanced_cluster_management_for_kubernetes/{rh-rhacm-version}/html-single/governance/index#policy-overview[Policy overview].
52 changes: 42 additions & 10 deletions modules/cnf-topology-aware-lifecycle-manager-precache-concept.adoc
@@ -6,23 +6,55 @@
[id="talo-precache-feature-concept_{context}"]
= Using the container image pre-cache feature

Clusters might have limited bandwidth to access the container image registry, which can cause a timeout before the updates are completed.

[NOTE]
====
The start time of the update is not set by {cgu-operator}. You apply the `ClusterGroupUpgrade` CR at the beginning of the update, either manually or by external automation.
====

The container image pre-caching starts when the `preCaching` field is set to `true` in the `ClusterGroupUpgrade` CR.

{cgu-operator} uses the `PrecacheSpecValid` condition to report status information as follows:

* `true`
+
The pre-caching spec is valid and consistent.
* `false`
+
The pre-caching spec is incomplete.
{cgu-operator} uses the `PrecachingSucceeded` condition to report status information as follows:

* `true`
+
{cgu-operator} has concluded the pre-caching process. If pre-caching fails for any cluster, the update fails for that cluster but proceeds for all other clusters. A message informs you if pre-caching has failed for any clusters.
* `false`
+
Pre-caching is still in progress for one or more clusters or has failed for all clusters.
After a successful pre-caching process, you can start remediating policies. The remediation actions start when the `enable` field is set to `true`. If there is a pre-caching failure on a cluster, the upgrade fails for that cluster. The upgrade process continues for all other clusters that have a successful pre-cache.
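
A sketch of the resulting two-step flow in the `spec` fields (other required fields are omitted):

[source,yaml]
----
spec:
  preCaching: true # start pulling the container images on the spoke clusters
  enable: false # set to true after pre-caching succeeds to start remediation
----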

The pre-caching process can be in the following statuses:

* `NotStarted`
+
This is the initial state all clusters are automatically assigned to on the first reconciliation pass of the `ClusterGroupUpgrade` CR. In this state, {cgu-operator} deletes any pre-caching namespace and hub view resources of spoke clusters that remain from previous incomplete updates. {cgu-operator} then creates a new `ManagedClusterView` resource for the spoke pre-caching namespace to verify its deletion in the `PrecachePreparing` state.
* `PreparingToStart`
+
Cleaning up any remaining resources from previous incomplete updates is in progress.
* `Starting`
+
Pre-caching job prerequisites and the job are created.
* `Active`
+
The job is in "Active" state.
* `Succeeded`
+
The pre-cache job succeeded.
* `PrecacheTimeout`
+
The artifact pre-caching is partially done.
* `UnrecoverableError`
+
The job ends with a non-zero exit code.
28 changes: 11 additions & 17 deletions modules/cnf-topology-aware-lifecycle-manager-precache-feature.adoc
@@ -39,7 +39,7 @@ spec:
----
<1> The `preCaching` field is set to `true`, which enables {cgu-operator} to pull the container images before starting the update.

. When you want to start pre-caching, apply the `ClusterGroupUpgrade` CR by running the following command:
+
[source,terminal]
----
@@ -59,8 +59,8 @@ $ oc get cgu -A
+
[source,terminal]
----
NAMESPACE          NAME              AGE   STATE        DETAILS
ztp-group-du-sno   du-upgrade-4918   10s   InProgress   Precaching is required and not done <1>
----
<1> The CR is created.

@@ -77,19 +77,12 @@ $ oc get cgu -n ztp-group-du-sno du-upgrade-4918 -o jsonpath='{.status}'
----
{
"conditions": [
{
"lastTransitionTime": "2022-01-27T19:07:24Z",
"message": "Precaching is required and not done",
"reason": "InProgress",
"status": "False",
"type": "PrecachingSucceeded"
},
{
"lastTransitionTime": "2022-01-27T19:07:34Z",
@@ -101,17 +94,18 @@ $ oc get cgu -n ztp-group-du-sno du-upgrade-4918 -o jsonpath='{.status}'
],
"precaching": {
"clusters": [
"cnfdb1" <2>
"cnfdb1" <1>
"cnfdb2"
],
"spec": {
"platformImage": "image.example.io"},
"status": {
"cnfdb1": "Active"}
"cnfdb1": "Active"
"cnfdb2": "Succeeded"}
}
}
----
<1> Displays the list of identified clusters.

. Check the status of the pre-caching job by running the following command on the spoke cluster:
+
@@ -155,7 +149,7 @@ $ oc get cgu -n ztp-group-du-sno du-upgrade-4918 -o jsonpath='{.status}'
"message": "Precaching is completed",
"reason": "PrecachingCompleted",
"status": "True",
"type": "PrecachingDone" <1>
"type": "PrecachingSucceeded" <1>
}
----
<1> The pre-cache tasks are done.
28 changes: 18 additions & 10 deletions modules/cnf-topology-aware-lifecycle-manager-troubleshooting.adoc
@@ -220,9 +220,9 @@ spoke3 true https://api.spoke3.testlab.com:6443 True
<1> The value of the `AVAILABLE` field is `True` for the managed clusters.

[discrete]
=== Checking clusterLabelSelector

Issue:: You want to check if the `clusterLabelSelector` field specified in the `ClusterGroupUpgrade` CR matches at least one of the managed clusters.

Resolution:: Run the following command:
+
@@ -250,16 +250,14 @@ Issue:: You want to check if the canary clusters are present in the list of clus
[source,yaml]
----
spec:
  clusters:
  - spoke1
  - spoke3
  remediationStrategy:
    canaries:
    - spoke3
    maxConcurrency: 2
    timeout: 240
  clusterLabelSelectors:
  - matchLabels:
      upgrade: true
----

Resolution:: Run the following commands:
@@ -276,7 +274,7 @@ $ oc get cgu lab-upgrade -ojsonpath='{.spec.clusters}'
["spoke1", "spoke3"]
----

. Check if the canary clusters are present in the list of clusters that match `clusterLabelSelector` labels by running the following command:
+
[source,terminal]
----
@@ -294,7 +292,7 @@ spoke3 true https://api.spoke3.testlab.com:6443 True Tr

[NOTE]
====
A cluster can be present in `spec.clusters` and also be matched by the `spec.clusterLabelSelector` label.
====

[discrete]
@@ -367,7 +365,7 @@ $ oc get cgu lab-upgrade -ojsonpath='{.status.conditions}'
+
[source,json]
----
{"lastTransitionTime":"2022-02-17T22:25:28Z", "message":"The ClusterGroupUpgrade CR has managed policies that are missing:[policyThatDoesntExist]", "reason":"UpgradeCannotStart", "status":"False", "type":"Ready"}
{"lastTransitionTime":"2022-02-17T22:25:28Z", "message":"Missing managed policies:[policyList]", "reason":"NotAllManagedPoliciesExist", "status":"False", "type":"Validated"}
----

[discrete]
@@ -435,3 +433,13 @@ ERROR controller-runtime.manager.controller.clustergroupupgrade Reconciler error
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
----
<1> Displays the error.

[discrete]
=== Clusters are not compliant with some policies after a `ClusterGroupUpgrade` CR has completed

Issue:: The policy compliance status that {cgu-operator} uses to decide if remediation is needed has not yet fully updated for all clusters.
This may be because:
+
* The CGU was run too soon after a policy was created or updated.
* The remediation of a policy affects the compliance of subsequent policies in the `ClusterGroupUpgrade` CR.

Resolution:: Create and apply a new `ClusterGroupUpgrade` CR with the same specification.
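+
For example, assuming the original CR is named `cgu-1` in the `default` namespace and is saved in `cgu-1.yaml`, the replacement might look like this:
+
[source,terminal]
----
$ oc delete cgu cgu-1 -n default
$ oc apply -f cgu-1.yaml
----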
