docs: add last replica timeout behavior to important notes
Longhorn 8711

Signed-off-by: Eric Weber <[email protected]>
ejweber authored and derekbit committed Sep 3, 2024
1 parent 251f47d commit 15f8f3d
Showing 4 changed files with 20 additions and 6 deletions.
7 changes: 7 additions & 0 deletions content/docs/1.7.1/important-notes/_index.md
@@ -18,6 +18,7 @@ Please see [here](https://github.com/longhorn/longhorn/releases/tag/v{{< current
- [Resilience](#resilience)
- [RWX Volumes Fast Failover](#rwx-volumes-fast-failover)
- [Timeout Configuration for Replica Rebuilding and Snapshot Cloning](#timeout-configuration-for-replica-rebuilding-and-snapshot-cloning)
+- [Change in Engine Replica Timeout Behavior](#change-in-engine-replica-timeout-behavior)
- [Data Integrity and Reliability](#data-integrity-and-reliability)
- [Support Periodic and On-Demand Full Backups to Enhance Backup Reliability](#support-periodic-and-on-demand-full-backups-to-enhance-backup-reliability)
- [High Availability of Backing Images](#high-availability-of-backing-images)
@@ -159,6 +160,12 @@ RWX Volumes fast failover is introduced in Longhorn v1.7.0 to improve resilience
Starting with v1.7.0, Longhorn supports configuration of timeouts for replica rebuilding and snapshot cloning. Before v1.7.0, the replica rebuilding timeout was capped at 24 hours, which could cause failures for large volumes in slow bandwidth environments. The default timeout is still 24 hours but you can adjust it to accommodate different environments. For more information, see [Long gRPC Timeout](../references/settings/#long-grpc-timeout).
+
+### Change in Engine Replica Timeout Behavior
+
+In versions earlier than v1.7.1, the [Engine Replica Timeout](../references/settings#engine-replica-timeout) setting
+applied equally to all replicas of a V1 volume. In v1.7.1, a V1 engine marks the last active replica as failed only
+after twice the configured timeout has elapsed.
## Data Integrity and Reliability
### Support Periodic and On-Demand Full Backups to Enhance Backup Reliability
6 changes: 3 additions & 3 deletions content/docs/1.7.1/references/settings.md
@@ -424,9 +424,9 @@ The default minimum number of backing image copies Longhorn maintains.
The time in seconds a v1 engine will wait for a response from a replica before marking it as failed. Values between 8
and 30 are allowed. The engine replica timeout is only in effect while there are I/O requests outstanding.

-This timeout only applies as-configured to additional replicas. A v1 engine will not mark the final replica for a
-running volume as failed until twice the configured timeout. This behavior is intended to balance volume responsiveness
-with volume availability:
+This setting applies as-configured only to additional replicas. A v1 engine marks the last active replica as failed
+only after twice the configured timeout has elapsed. This behavior is intended to balance volume responsiveness with
+volume availability:

- The engine can quickly (after the configured timeout) ignore individual replicas that become unresponsive in favor of
other available ones. This ensures future I/O will not be held up.
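The timeout behavior the changed paragraph describes can be sketched as a small model. This is an illustrative sketch, not Longhorn's actual code: the function name and the clamping of out-of-range values to the documented 8-30 second range are assumptions.

```go
package main

import "fmt"

// effectiveTimeoutSeconds models the documented Engine Replica Timeout
// behavior: the configured value applies as-is to additional replicas,
// while the last active replica is only marked failed after twice the
// configured timeout. Hypothetical helper, not Longhorn's implementation.
func effectiveTimeoutSeconds(configured int, isLastActiveReplica bool) int {
	// The setting only allows values between 8 and 30 seconds.
	if configured < 8 {
		configured = 8
	} else if configured > 30 {
		configured = 30
	}
	// Doubling gives the engine extra tolerance before failing the
	// last replica, since doing so makes the volume faulted.
	if isLastActiveReplica {
		return configured * 2
	}
	return configured
}

func main() {
	fmt.Println(effectiveTimeoutSeconds(8, false)) // additional replica: 8
	fmt.Println(effectiveTimeoutSeconds(8, true))  // last active replica: 16
}
```

With the default of 8 seconds, an unresponsive additional replica is dropped after 8 seconds, while the final remaining replica is given 16 seconds before the engine gives up on it.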
7 changes: 7 additions & 0 deletions content/docs/1.8.0/important-notes/_index.md
@@ -18,6 +18,7 @@ Please see [here](https://github.com/longhorn/longhorn/releases/tag/v{{< current
- [Resilience](#resilience)
- [RWX Volumes Fast Failover](#rwx-volumes-fast-failover)
- [Timeout Configuration for Replica Rebuilding and Snapshot Cloning](#timeout-configuration-for-replica-rebuilding-and-snapshot-cloning)
+- [Change in Engine Replica Timeout Behavior](#change-in-engine-replica-timeout-behavior)
- [Data Integrity and Reliability](#data-integrity-and-reliability)
- [Support Periodic and On-Demand Full Backups to Enhance Backup Reliability](#support-periodic-and-on-demand-full-backups-to-enhance-backup-reliability)
- [High Availability of Backing Images](#high-availability-of-backing-images)
@@ -161,6 +162,12 @@ RWX Volumes fast failover is introduced in Longhorn v1.7.0 to improve resilience
Starting with v1.7.0, Longhorn supports configuration of timeouts for replica rebuilding and snapshot cloning. Before v1.7.0, the replica rebuilding timeout was capped at 24 hours, which could cause failures for large volumes in slow bandwidth environments. The default timeout is still 24 hours but you can adjust it to accommodate different environments. For more information, see [Long gRPC Timeout](../references/settings/#long-grpc-timeout).
+
+### Change in Engine Replica Timeout Behavior
+
+In versions earlier than v1.8.0, the [Engine Replica Timeout](../references/settings#engine-replica-timeout) setting
+applied equally to all replicas of a V1 volume. In v1.8.0, a V1 engine marks the last active replica as failed only
+after twice the configured timeout has elapsed.
## Data Integrity and Reliability
### Support Periodic and On-Demand Full Backups to Enhance Backup Reliability
6 changes: 3 additions & 3 deletions content/docs/1.8.0/references/settings.md
@@ -424,9 +424,9 @@ The default minimum number of backing image copies Longhorn maintains.
The time in seconds a v1 engine will wait for a response from a replica before marking it as failed. Values between 8
and 30 are allowed. The engine replica timeout is only in effect while there are I/O requests outstanding.

-This timeout only applies as-configured to additional replicas. A v1 engine will not mark the final replica for a
-running volume as failed until twice the configured timeout. This behavior is intended to balance volume responsiveness
-with volume availability:
+This setting applies as-configured only to additional replicas. A v1 engine marks the last active replica as failed
+only after twice the configured timeout has elapsed. This behavior is intended to balance volume responsiveness with
+volume availability:

- The engine can quickly (after the configured timeout) ignore individual replicas that become unresponsive in favor of
other available ones. This ensures future I/O will not be held up.