Ensure strict failover/switchover definition difference (patroni#2784)
- Don't set leader in the failover key from patronictl failover
- Show a warning and execute a switchover if the leader option is provided to the patronictl failover command
- Be more precise in the log messages
- Allow failover to an async candidate in sync mode
- Check if the candidate is the same as the leader specified in the API
- Fix and extend some tests
- Add documentation
hughcapet authored Sep 12, 2023
1 parent 3c24c33 commit b31a4d5
Showing 8 changed files with 673 additions and 262 deletions.
2 changes: 1 addition & 1 deletion docs/pause.rst
@@ -19,7 +19,7 @@ When Patroni runs in a paused mode, it does not change the state of PostgreSQL,

- For the Postgres primary with the leader lock Patroni updates the lock. If the node with the leader lock stops being the primary (i.e. is demoted manually), Patroni will release the lock instead of promoting the node back.

- Manual unscheduled restart, reinitialize and manual failover are allowed. Manual failover is only allowed if the node to failover to is specified. In the paused mode, manual failover does not require a running primary node.
- Manual unscheduled restart, manual unscheduled failover/switchover and reinitialize are allowed. No scheduled action is allowed. Manual switchover is only allowed if the node to switch over to is specified.

- If 'parallel' primaries are detected by Patroni, it emits a warning, but does not demote the primary without the leader lock.

102 changes: 87 additions & 15 deletions docs/rest_api.rst
@@ -560,39 +560,111 @@ The above call removes ``postgresql.parameters.max_connections`` from the dynami
Switchover and failover endpoints
---------------------------------

``POST /switchover`` or ``POST /failover``. These endpoints are very similar to each other. There are a couple of minor differences though:
.. _switchover_api:

1. The failover endpoint allows to perform a manual failover when there are no healthy nodes, but at the same time it will not allow you to schedule a switchover.
Switchover
^^^^^^^^^^

2. The switchover endpoint is the opposite. It works only when the cluster is healthy (there is a leader) and allows to schedule a switchover at a given time.
The ``/switchover`` endpoint works only when the cluster is healthy (there is a leader). It also allows scheduling a switchover at a given time.

When calling the ``/switchover`` endpoint a candidate can be specified but is not required, in contrast to the ``/failover`` endpoint. If a candidate is not provided, all the eligible nodes of the cluster will participate in the leader race after the leader has stepped down.

In the JSON body of the ``POST`` request you must specify at least the ``leader`` or ``candidate`` fields and optionally the ``scheduled_at`` field if you want to schedule a switchover at a specific time.
In the JSON body of the ``POST`` request you must specify the ``leader`` field. The ``candidate`` and the ``scheduled_at`` fields are optional and can be used to schedule a switchover at a specific time.

Depending on the situation, requests might return different HTTP status codes and bodies. Status code **200** is returned when the switchover or failover successfully completed. If the switchover was successfully scheduled, Patroni will return HTTP status code **202**. In case something went wrong, the error status code (one of **400**, **412**, or **503**) will be returned with some details in the response body.

Example: perform a failover to the specific node:
``DELETE /switchover`` can be used to delete the currently scheduled switchover.
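
**Example:** delete the currently scheduled switchover (a minimal sketch; the response text is illustrative and may differ between Patroni versions)

.. code-block:: bash

    $ curl -s http://localhost:8008/switchover -XDELETE
    scheduled switchover deleted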

**Example:** perform a switchover to any healthy standby

.. code-block:: bash

    $ curl -s http://localhost:8008/switchover -XPOST -d '{"leader":"postgresql1"}'
    Successfully switched over to "postgresql2"

**Example:** perform a switchover to a specific node

.. code-block:: bash

    $ curl -s http://localhost:8008/switchover -XPOST -d \
        '{"leader":"postgresql1","candidate":"postgresql2"}'
    Successfully switched over to "postgresql2"

**Example:** schedule a switchover from the leader to any other healthy standby in the cluster at a specific time.

.. code-block:: bash

    $ curl -s http://localhost:8009/failover -XPOST -d '{"candidate":"postgresql1"}'
    Successfully failed over to "postgresql1"

    $ curl -s http://localhost:8008/switchover -XPOST -d \
        '{"leader":"postgresql0","scheduled_at":"2019-09-24T12:00+00"}'
    Switchover scheduled

Example: schedule a switchover from the leader to any other healthy replica in the cluster at a specific time:
Failover
^^^^^^^^

The ``/failover`` endpoint can be used to perform a manual failover when there are no healthy nodes (e.g. to an asynchronous standby when none of the synchronous standbys is healthy enough to promote). However, the cluster is not required to be leaderless - a failover can also be run against a healthy cluster.

In the JSON body of the ``POST`` request you must specify the ``candidate`` field. If the ``leader`` field is specified, a switchover is triggered instead.

**Example:**

.. code-block:: bash

    $ curl -s http://localhost:8008/switchover -XPOST -d \
        '{"leader":"postgresql0","scheduled_at":"2019-09-24T12:00+00"}'
    Switchover scheduled

    $ curl -s http://localhost:8008/failover -XPOST -d '{"candidate":"postgresql1"}'
    Successfully failed over to "postgresql1"

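**Example:** as described above, a failover request that also contains the ``leader`` field is executed as a switchover (Patroni logs a warning on the server side). This is a sketch of the expected exchange; the exact response text may differ.

.. code-block:: bash

    $ curl -s http://localhost:8008/failover -XPOST -d \
        '{"leader":"postgresql0","candidate":"postgresql1"}'
    Successfully switched over to "postgresql1"
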
.. warning::
    :ref:`Be very careful <failover_healthcheck>` when using this endpoint, as this can cause data loss in certain situations. In most cases, :ref:`the switchover endpoint <switchover_api>` satisfies the administrator's needs.


``POST /switchover`` and ``POST /failover`` endpoints are used by ``patronictl switchover`` and ``patronictl failover``, respectively.

``DELETE /switchover`` is used by ``patronictl flush <cluster-name> switchover``.
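
The corresponding ``patronictl`` invocations might look like the following (a sketch that assumes a cluster named ``batman``; the cluster and member names are placeholders):

.. code-block:: bash

    # switchover to a specific candidate
    $ patronictl switchover batman --leader postgresql0 --candidate postgresql1

    # manual failover to a specific candidate
    $ patronictl failover batman --candidate postgresql1

    # remove a scheduled switchover
    $ patronictl flush batman switchover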

.. list-table:: Failover/Switchover comparison
   :widths: 25 25 25
   :header-rows: 1

   * -
     - Failover
     - Switchover
   * - Requires leader specified
     - no
     - yes
   * - Requires candidate specified
     - yes
     - no
   * - Can be run in pause
     - yes
     - yes (only to a specific candidate)
   * - Can be scheduled
     - no
     - yes (if not in pause)
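
For example, trying to schedule a failover is rejected, because only a switchover can be scheduled. The sketch below assumes the message text introduced by this commit; the exact wording and status code may vary.

.. code-block:: bash

    $ curl -s http://localhost:8008/failover -XPOST -d \
        '{"candidate":"postgresql1","scheduled_at":"2019-09-24T12:00+00"}'
    Failover can't be scheduled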

.. _failover_healthcheck:

Healthy standby
^^^^^^^^^^^^^^^

There are several checks that a cluster member must pass to be able to participate in the leader race during a switchover or to become the new leader as a failover/switchover candidate:

Depending on the situation the request might finish with a different HTTP status code and body. The status code **200** is returned when the switchover or failover successfully completed. If the switchover was successfully scheduled, Patroni will return HTTP status code **202**. In case something went wrong, the error status code (one of **400**, **412** or **503**) will be returned with some details in the response body. For more information please check the source code of ``patroni/api.py:do_POST_failover()`` method.
- be reachable via the Patroni API;
- not have the ``nofailover`` tag set to ``true``;
- have a fully functional watchdog (if it is required by the configuration);
- in case of a switchover in a healthy cluster or an automatic failover, not exceed the maximum replication lag (``maximum_lag_on_failover`` :ref:`configuration parameter <dynamic_configuration>`);
- in case of a switchover in a healthy cluster or an automatic failover, not have a timeline number smaller than the cluster timeline if the ``check_timeline`` :ref:`configuration parameter <dynamic_configuration>` is set to ``true``;
- in :ref:`synchronous mode <synchronous_mode>`:

- ``DELETE /switchover``: delete the scheduled switchover
  - In case of a switchover (both with and without a candidate): be listed in the ``/sync`` key members;
  - For a failover in both healthy and unhealthy clusters, this check is omitted.

The ``POST /switchover`` and ``POST /failover`` endpoints are used by ``patronictl switchover`` and ``patronictl failover``, respectively.
The ``DELETE /switchover`` endpoint is used by ``patronictl flush <cluster-name> switchover``.
.. warning::
    In case of a manual failover in a cluster without a leader, a candidate will be allowed to promote even if:

    - it is not in the ``/sync`` key members when synchronous mode is enabled;
    - its lag exceeds the maximum replication lag allowed;
    - its timeline number is smaller than the last known cluster timeline.
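
**Example:** in a cluster that still has a leader, a candidate that fails these checks is rejected with an error status such as **412**. The response below is a hypothetical exchange; the exact message text depends on which check failed and on the Patroni version.

.. code-block:: bash

    $ curl -s http://localhost:8008/failover -XPOST -d '{"candidate":"postgresql2"}'
    failover is not possible: no good candidates have been found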


Restart endpoint
24 changes: 16 additions & 8 deletions patroni/api.py
@@ -1074,7 +1074,8 @@ def do_POST_failover(self, action: str = 'failover') -> None:
* ``412``: if operation is not possible;
* ``503``: if unable to register the operation to the DCS;
* HTTP status returned by :func:`parse_schedule`, if any error was observed while parsing the schedule;
* HTTP status returned by :func:`poll_failover_result` if the operation has been processed immediately.
* HTTP status returned by :func:`poll_failover_result` if the operation has been processed immediately;
* ``400``: if none of the above applies.
.. note::
If unable to parse the request body, then the request is silently discarded.
@@ -1101,15 +1102,22 @@ def do_POST_failover(self, action: str = 'failover') -> None:
data = 'Switchover could be performed only from a specific leader'

if not data and scheduled_at:
if not leader:
data = 'Scheduled {0} is possible only from a specific leader'.format(action)
if not data and global_config.is_paused:
data = "Can't schedule {0} in the paused state".format(action)
if not data:
if action == 'failover':
data = "Failover can't be scheduled"
elif global_config.is_paused:
data = "Can't schedule switchover in the paused state"
else:
(status_code, data, scheduled_at) = self.parse_schedule(scheduled_at, action)

if not data and global_config.is_paused and not candidate:
data = action.title() + ' is possible only to a specific candidate in a paused state'
data = 'Switchover is possible only to a specific candidate in a paused state'

if action == 'failover' and leader:
logger.warning('received failover request with leader specified - performing switchover instead')
action = 'switchover'

if not data and leader == candidate:
data = 'Switchover target and source are the same'

if not data and not scheduled_at:
data = self.is_failover_possible(cluster, leader, candidate, action)
@@ -1126,7 +1134,7 @@ def do_POST_failover(self, action: str = 'failover') -> None:
status_code, data = self.poll_failover_result(cluster.leader and cluster.leader.name,
candidate, action)
else:
data = 'failed to write {0} key into DCS'.format(action)
data = 'failed to write failover key into DCS'
status_code = 503
# pyright thinks ``status_code`` can be ``None`` because ``parse_schedule`` call may return ``None``. However,
# if that's the case, ``status_code`` will be overwritten somewhere between ``parse_schedule`` and
72 changes: 45 additions & 27 deletions patroni/ctl.py
@@ -46,6 +46,7 @@
except ImportError: # pragma: no cover
from cdiff import markup_to_pager, PatchStream # pyright: ignore [reportMissingModuleSource]

from .config import Config, get_global_config
from .dcs import get_dcs as _get_dcs, AbstractDCS, Cluster, Member
from .exceptions import PatroniException
from .postgresql.misc import postgres_version_to_int
@@ -225,8 +226,6 @@ def load_config(path: str, dcs_url: Optional[str]) -> Dict[str, Any]:
:raises:
:class:`PatroniCtlException`: if *path* does not exist or is not readable.
"""
from patroni.config import Config

if not (os.path.exists(path) and os.access(path, os.R_OK)):
if path != CONFIG_FILE_PATH: # bail if non-default config location specified but file not found / readable
raise PatroniCtlException('Provided config file {0} not existing or no read rights.'
@@ -1013,7 +1012,6 @@ def reload(obj: Dict[str, Any], cluster_name: str, member_names: List[str],
if r.status == 200:
click.echo('No changes to apply on member {0}'.format(member.name))
elif r.status == 202:
from patroni.config import get_global_config
config = get_global_config(cluster)
click.echo('Reload request received for member {0} and will be processed within {1} seconds'.format(
member.name, config.get('loop_wait') or dcs.loop_wait)
@@ -1095,7 +1093,6 @@ def restart(obj: Dict[str, Any], cluster_name: str, group: Optional[int], member
content['postgres_version'] = version

if scheduled_at:
from patroni.config import get_global_config
if get_global_config(cluster).is_paused:
raise PatroniCtlException("Can't schedule restart in the paused state")
content['schedule'] = scheduled_at.isoformat()
@@ -1223,19 +1220,22 @@ def _do_failover_or_switchover(obj: Dict[str, Any], action: str, cluster_name: s
dcs = get_dcs(obj, cluster_name, group)
cluster = dcs.get_cluster()

if action == 'switchover' and (cluster.leader is None or not cluster.leader.name):
raise PatroniCtlException('This cluster has no leader')
global_config = get_global_config(cluster)

if leader is None:
if force or action == 'failover':
leader = cluster.leader and cluster.leader.name
else:
from patroni.config import get_global_config
prompt = 'Standby Leader' if get_global_config(cluster).is_standby_cluster else 'Primary'
leader = click.prompt(prompt, type=str, default=(cluster.leader and cluster.leader.member.name))
# leader has to be defined for switchover only
if action == 'switchover':
if cluster.leader is None or not cluster.leader.name:
raise PatroniCtlException('This cluster has no leader')

if leader is not None and cluster.leader and cluster.leader.member.name != leader:
raise PatroniCtlException('Member {0} is not the leader of cluster {1}'.format(leader, cluster_name))
if leader is None:
if force:
leader = cluster.leader.name
else:
prompt = 'Standby Leader' if global_config.is_standby_cluster else 'Primary'
leader = click.prompt(prompt, type=str, default=(cluster.leader and cluster.leader.name))

if cluster.leader.name != leader:
raise PatroniCtlException(f'Member {leader} is not the leader of cluster {cluster_name}')

# excluding members with nofailover tag
candidate_names = [str(m.name) for m in cluster.members if m.name != leader and not m.nofailover]
@@ -1255,7 +1255,16 @@ def _do_failover_or_switchover(obj: Dict[str, Any], action: str, cluster_name: s
raise PatroniCtlException(action.title() + ' target and source are the same.')

if candidate and candidate not in candidate_names:
raise PatroniCtlException('Member {0} does not exist in cluster {1}'.format(candidate, cluster_name))
raise PatroniCtlException(
f'Member {candidate} does not exist in cluster {cluster_name} or is tagged as nofailover')

if all((not force,
action == 'failover',
global_config.is_synchronous_mode,
not cluster.sync.is_empty,
not cluster.sync.matches(candidate, True))):
if not click.confirm(f'Are you sure you want to failover to the asynchronous node {candidate}'):
raise PatroniCtlException('Aborting ' + action)

scheduled_at_str = None
scheduled_at = None
@@ -1268,25 +1277,29 @@ def _do_failover_or_switchover(obj: Dict[str, Any], action: str, cluster_name: s

scheduled_at = parse_scheduled(scheduled)
if scheduled_at:
from patroni.config import get_global_config
if get_global_config(cluster).is_paused:
if global_config.is_paused:
raise PatroniCtlException("Can't schedule switchover in the paused state")
scheduled_at_str = scheduled_at.isoformat()

failover_value = {'leader': leader, 'candidate': candidate, 'scheduled_at': scheduled_at_str}
failover_value = {'candidate': candidate}
if action == 'switchover':
failover_value['leader'] = leader
if scheduled_at_str:
failover_value['scheduled_at'] = scheduled_at_str

logging.debug(failover_value)

# By now we have established that the leader exists and the candidate exists
if not force:
demote_msg = ', demoting current leader ' + leader if leader else ''
demote_msg = f', demoting current leader {cluster.leader.name}' if cluster.leader else ''
if scheduled_at_str:
if not click.confirm('Are you sure you want to schedule {0} of cluster {1} at {2}{3}?'
.format(action, cluster_name, scheduled_at_str, demote_msg)):
# only switchover can be scheduled
if not click.confirm(f'Are you sure you want to schedule switchover of cluster '
f'{cluster_name} at {scheduled_at_str}{demote_msg}?'):
# action as a var to catch a regression in the tests
raise PatroniCtlException('Aborting scheduled ' + action)
else:
if not click.confirm('Are you sure you want to {0} cluster {1}{2}?'
.format(action, cluster_name, demote_msg)):
if not click.confirm(f'Are you sure you want to {action} cluster {cluster_name}{demote_msg}?'):
raise PatroniCtlException('Aborting ' + action)

r = None
@@ -1332,6 +1345,8 @@ def failover(obj: Dict[str, Any], cluster_name: str, group: Optional[int],
.. note::
If *leader* is given perform a switchover instead of a failover.
This behavior is deprecated. ``--leader`` option support will be
removed in the next major release.
.. seealso::
Refer to :func:`_do_failover_or_switchover` for details.
@@ -1345,7 +1360,12 @@ def failover(obj: Dict[str, Any], cluster_name: str, group: Optional[int],
:param candidate: name of a standby member to be promoted. Nodes that are tagged with ``nofailover`` cannot be used.
:param force: perform the failover or switchover without asking for confirmations.
"""
action = 'switchover' if leader else 'failover'
action = 'failover'
if leader:
action = 'switchover'
click.echo(click.style(
'Supplying a leader name using this command is deprecated and will be removed in a future version of'
' Patroni, change your scripts to use `switchover` instead.\nExecuting switchover!', fg='red'))
_do_failover_or_switchover(obj, action, cluster_name, group, leader, candidate, force)


Expand Down Expand Up @@ -1718,7 +1738,6 @@ def wait_until_pause_is_applied(dcs: AbstractDCS, paused: bool, old_cluster: Clu
:param old_cluster: original cluster information before pause or unpause has been requested. Used to report which
nodes are still pending to have ``pause`` equal *paused* at a given point in time.
"""
from patroni.config import get_global_config
config = get_global_config(old_cluster)

click.echo("'{0}' request sent, waiting until it is recognized by all nodes".format(paused and 'pause' or 'resume'))
@@ -1756,7 +1775,6 @@ def toggle_pause(config: Dict[str, Any], cluster_name: str, group: Optional[int]
* ``pause`` state is already *paused*; or
* cluster contains no accessible members.
"""
from patroni.config import get_global_config
dcs = get_dcs(config, cluster_name, group)
cluster = dcs.get_cluster()
if get_global_config(cluster).is_paused == paused: