Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Merge pull request #2492 from EnterpriseDB/release/2022-03-25
Release: 2022-03-25
Showing 76 changed files with 1,137 additions and 862 deletions.
95 changes: 57 additions & 38 deletions
advocacy_docs/kubernetes/cloud_native_postgresql/api_reference.mdx
79 changes: 79 additions & 0 deletions
advocacy_docs/kubernetes/cloud_native_postgresql/failover.mdx
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,79 @@ | ||
--- | ||
title: 'Automated failover' | ||
originalFilePath: 'src/failover.md' | ||
product: 'Cloud Native Operator' | ||
--- | ||
|
||
In the case of unexpected errors on the primary, the cluster will go into | ||
**failover mode**. This may happen, for example, when: | ||
|
||
- The primary pod has a disk failure | ||
- The primary pod is deleted | ||
- The `postgres` container on the primary has any kind of sustained failure | ||
|
||
In the failover scenario, the primary cannot be assumed to be working properly. | ||
|
||
After cases like the ones above, the readiness probe for the primary pod will start | ||
failing. This will be picked up in the controller's reconciliation loop. The | ||
controller will initiate the failover process, in two steps: | ||
|
||
1. First, it will mark the `TargetPrimary` as `pending`. This change of state will | ||
force the primary pod to shut down, to ensure the WAL receivers on the replicas | ||
will stop. The cluster will be marked in failover phase ("Failing over"). | ||
2. Once all WAL receivers are stopped, there will be a leader election, and a | ||
new primary will be named. The chosen instance will initiate promotion to | ||
primary, and, after this is completed, the cluster will resume normal operations. | ||
Meanwhile, the former primary pod will restart, detect that it is no longer | ||
the primary, and become a replica node. | ||
|
||
!!! Important | ||
The two-phase procedure helps ensure the WAL receivers can stop in an orderly | ||
fashion, and that the failing primary will not start streaming WALs again upon | ||
restart. These safeguards prevent timeline discrepancies between the new primary | ||
and the replicas. | ||
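
The two steps above can be sketched as a small state model. This is an illustrative sketch only, not the operator's code: the `Instance` and `Cluster` classes, the `received_lsn` field, and the rule of electing the replica that received the most WAL are assumptions made for the example. The document only states that a leader election takes place.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Instance:
    name: str
    role: str                           # "primary" or "replica"
    running: bool = True
    wal_receiver_running: bool = False
    received_lsn: int = 0               # hypothetical measure of replication progress


@dataclass
class Cluster:
    instances: List[Instance]
    target_primary: str
    phase: str = "Healthy"


def begin_failover(cluster: Cluster) -> None:
    """Step 1: mark TargetPrimary as pending and shut the primary down."""
    cluster.target_primary = "pending"
    cluster.phase = "Failing over"
    for inst in cluster.instances:
        if inst.role == "primary":
            inst.running = False               # the primary pod is forced to shut down
        else:
            inst.wal_receiver_running = False  # nothing left to stream from


def finish_failover(cluster: Cluster) -> Optional[Instance]:
    """Step 2: elect and promote a new primary once all WAL receivers stopped."""
    if any(i.wal_receiver_running for i in cluster.instances):
        return None  # not safe to elect yet
    candidates = [i for i in cluster.instances if i.role == "replica"]
    if not candidates:
        return None
    # Assumed election rule: prefer the replica that received the most WAL.
    winner = max(candidates, key=lambda i: i.received_lsn)
    for inst in cluster.instances:
        inst.role = "replica"            # the former primary comes back as a replica
        inst.running = True
        inst.wal_receiver_running = True
    winner.role = "primary"
    winner.wal_receiver_running = False  # a primary does not receive WAL
    cluster.target_primary = winner.name
    cluster.phase = "Healthy"
    return winner
```

Keeping the election behind the "all WAL receivers stopped" check mirrors the ordering guarantee described in the note above: no instance is promoted while a replica might still be streaming from the failing primary.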
|
||
During the time the failing primary is being shut down: | ||
|
||
1. It will first try a PostgreSQL *fast shutdown* with | ||
`.spec.switchoverDelay` seconds as timeout. This graceful shutdown will attempt | ||
to archive pending WALs. | ||
2. If the fast shutdown fails, or its timeout is exceeded, a PostgreSQL | ||
*immediate shutdown* is initiated. | ||
|
||
!!! Info | ||
"Fast" mode does not wait for PostgreSQL clients to disconnect and will | ||
terminate an online backup in progress. All active transactions are rolled back | ||
and clients are forcibly disconnected, then the server is shut down. | ||
"Immediate" mode will abort all PostgreSQL server processes immediately, | ||
without a clean shutdown. | ||
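
The escalation from fast to immediate shutdown can be sketched as follows. This is a minimal illustration under stated assumptions: `fast_shutdown` and `immediate_shutdown` are hypothetical callables standing in for whatever the instance manager actually invokes, and only the ordering and the use of `.spec.switchoverDelay` as the timeout come from the text above.

```python
from typing import Callable


def shut_down_primary(
    fast_shutdown: Callable[[float], bool],
    immediate_shutdown: Callable[[], None],
    switchover_delay: float,
) -> str:
    """Try a fast shutdown first; escalate to immediate on failure or timeout.

    fast_shutdown(timeout) is a hypothetical callable that returns True when
    PostgreSQL stopped cleanly within `timeout` seconds, and False otherwise.
    """
    try:
        if fast_shutdown(switchover_delay):
            return "fast"
    except Exception:
        pass  # a failed fast shutdown is treated like an exceeded timeout
    immediate_shutdown()
    return "immediate"
```

The return value records which mode actually stopped the server, which is the piece of information that matters for the data-loss discussion that follows.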
|
||
## RTO and RPO impact | ||
|
||
Failover may result in the service being impacted and/or data being lost: | ||
|
||
1. During the time when the primary has started to fail, and before the controller | ||
starts failover procedures, queries in transit, WAL writes, checkpoints and | ||
similar operations may fail. | ||
2. Once the fast shutdown command has been issued, the cluster will no longer | ||
accept connections, so service will be impacted but no data | ||
will be lost. | ||
3. If the fast shutdown fails, the immediate shutdown will stop any pending | ||
processes, including WAL writing. Data may be lost. | ||
4. During the time the primary is shutting down and a new primary hasn't yet | ||
started, the cluster will operate without a primary and thus be impaired - but | ||
with no data loss. | ||
|
||
!!! Note | ||
The timeout that controls fast shutdown is set by `.spec.switchoverDelay`, | ||
as in the case of a switchover. Increasing the time for fast shutdown is safer | ||
from an RPO point of view, but possibly delays the return to normal operation - | ||
negatively affecting RTO. | ||
|
||
!!! Warning | ||
As already mentioned in the ["Instance Manager" section](instance_manager.md) | ||
when explaining the switchover process, the `.spec.switchoverDelay` option | ||
affects the RPO and RTO of your PostgreSQL database. Setting it to a low value | ||
might favor RTO over RPO but lead to data loss at cluster level and/or backup | ||
level. Conversely, setting it to a high value might remove the risk of | ||
data loss while leaving the cluster without an active primary for a longer time | ||
during the switchover. |
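
The trade-off controlled by `.spec.switchoverDelay` can be made concrete with a toy model. This is a deliberately simplified sketch: the function name, the `fast_shutdown_needs` parameter (how long a fast shutdown would take to finish, including WAL archiving), and the assumption that an immediate shutdown adds no further time are all hypothetical and chosen only for illustration.

```python
def shutdown_outcome(switchover_delay: float, fast_shutdown_needs: float) -> dict:
    """Toy model of the shutdown phase; both arguments are in seconds."""
    if fast_shutdown_needs <= switchover_delay:
        # The fast shutdown completes in time: pending WALs can be archived.
        return {
            "mode": "fast",
            "shutdown_seconds": fast_shutdown_needs,
            "data_loss_possible": False,
        }
    # The timeout is exceeded: escalate to an immediate shutdown.
    return {
        "mode": "immediate",
        "shutdown_seconds": switchover_delay,
        "data_loss_possible": True,
    }


if __name__ == "__main__":
    # Compare a low and a high delay for a shutdown that needs 20 seconds.
    for delay in (5, 60):
        print(delay, shutdown_outcome(delay, fast_shutdown_needs=20))
```

In this model a low delay bounds the time spent shutting down (better RTO) at the cost of possible data loss, while a high delay lets the fast shutdown finish (better RPO) but can keep the cluster without a primary for longer when the shutdown is slow.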