diff --git a/advocacy_docs/partner_docs/DBeaverPRO/03-SolutionSummary.mdx b/advocacy_docs/partner_docs/DBeaverPRO/03-SolutionSummary.mdx index bcd5c33169e..43af8a06a49 100644 --- a/advocacy_docs/partner_docs/DBeaverPRO/03-SolutionSummary.mdx +++ b/advocacy_docs/partner_docs/DBeaverPRO/03-SolutionSummary.mdx @@ -6,5 +6,5 @@ description: 'Brief explanation of the solution and its purpose' DBeaver PRO is a SQL client software application and universal database management tool for EDB Postgres Advanced Server and EDB Postgres Extended Server. With DBeaver PRO you can manipulate your data like you would in a regular spreadsheet. You have the ability to view, create, modify, save, and delete all Postgres data types. The features resemble those of a regular spreadsheet, as you can create analytical reports based on records from different data storages and export information in an appropriate format. DBeaver PRO also provides you with a powerful editor for SQL, data and schema migration, monitoring of database connection sessions, and other administration features.
- +
diff --git a/advocacy_docs/partner_docs/DBeaverPRO/Images/UpdatedDBeaverArc.png b/advocacy_docs/partner_docs/DBeaverPRO/Images/UpdatedDBeaverArc.png new file mode 100644 index 00000000000..7778b4cff74 --- /dev/null +++ b/advocacy_docs/partner_docs/DBeaverPRO/Images/UpdatedDBeaverArc.png @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:75b711dd1f73d922486b23950819b7176fb4e8f5912eafee8cba526c881b0881 +size 65070 diff --git a/product_docs/docs/biganimal/release/overview/02_high_availability.mdx b/product_docs/docs/biganimal/release/overview/02_high_availability.mdx index a9f8278c76c..16cbdfb9284 100644 --- a/product_docs/docs/biganimal/release/overview/02_high_availability.mdx +++ b/product_docs/docs/biganimal/release/overview/02_high_availability.mdx @@ -28,7 +28,7 @@ In case of unrecoverable failure of the primary, a restore from a backup is requ ![*BigAnimal Cluster4*](images/Single-Node-Diagram-2x.png) -## Standard high availability +## High availability The high availability option is provided to minimize downtime in cases of failures. High-availability clusters—one *primary* and one or two *standby replicas*—are configured automatically, with standby replicas staying up to date through physical streaming replication. diff --git a/product_docs/docs/epas/14/epas_guide/03_database_administration/05_edb_audit_logging/09_object_auditing.mdx b/product_docs/docs/epas/14/epas_guide/03_database_administration/05_edb_audit_logging/09_object_auditing.mdx index 46ea81d2113..8a083930e15 100644 --- a/product_docs/docs/epas/14/epas_guide/03_database_administration/05_edb_audit_logging/09_object_auditing.mdx +++ b/product_docs/docs/epas/14/epas_guide/03_database_administration/05_edb_audit_logging/09_object_auditing.mdx @@ -9,7 +9,7 @@ The object-level auditing allows selective auditing of objects for specific Data Use the following syntax to specify an `edb_audit_statement` parameter value for `SELECT`, `UPDATE`, `DELETE`, or `INSERT` statements: ```text -{ select | update | delete | insert }@groupname +{ select | update | delete | insert }{@ | -}groupname ``` ## Example diff --git a/product_docs/docs/migration_portal/4.0/04_mp_migrating_database/03_mp_schema_migration.mdx b/product_docs/docs/migration_portal/4.0/04_mp_migrating_database/03_mp_schema_migration.mdx index 35c14928759..c59eb4b0230 100644 --- a/product_docs/docs/migration_portal/4.0/04_mp_migrating_database/03_mp_schema_migration.mdx +++ b/product_docs/docs/migration_portal/4.0/04_mp_migrating_database/03_mp_schema_migration.mdx @@ -30,8 +30,8 @@ You can migrate schemas to an existing on-premises EDB Postgres Advanced Server 2. Select one or more schemas to migrate to EDB Postgres Advanced Server. - !!! Note - If your schemas aren't 100% compatible, a banner appears. Complete the Contact Us form as needed. + !!! Note + If your schemas aren't 100% compatible, a banner appears. Complete the Contact Us form as needed. 3. To download the assessed schemas, select **Download SQL file**. @@ -42,20 +42,20 @@ You can migrate schemas to an existing on-premises EDB Postgres Advanced Server 5. To import the schemas, run: -- On CLI: + - On CLI: - ```text - \i c:\users\...\HARP offers multiple options for Distributed Consensus Service (DCS) source: etcd and BDR. The BDR consensus option can be used in deployments where etcd isn't present. Use of the BDR consensus option is no longer considered beta and is now supported for use in production environments.
| +| Enhancement | Transport layer proxy now generally available.HARP offers multiple proxy options for routing connections between the client application and database: application layer (L7) and transport layer (L4). A network layer 4 (transport layer) proxy simply forwards network packets, while a layer 7 (application layer) proxy terminates network traffic. The transport layer proxy, previously called simple proxy, is no longer considered beta and is now supported for use in production environments.
| diff --git a/product_docs/docs/pgd/3.7/harp/01_release_notes/harp2.0.3_rel_notes.mdx b/product_docs/docs/pgd/3.7/harp/01_release_notes/harp2.0.3_rel_notes.mdx new file mode 100644 index 00000000000..75722ff6794 --- /dev/null +++ b/product_docs/docs/pgd/3.7/harp/01_release_notes/harp2.0.3_rel_notes.mdx @@ -0,0 +1,11 @@ +--- +title: "Version 2.0.3" +--- + +This is a patch release of HARP 2 that includes fixes for issues identified +in previous versions. + +| Type | Description | +| ---- |------------ | +| Enhancement | HARP Proxy supports read-only user dedicated TLS Certificate (RT78516) | +| Bug Fix | HARP Proxy continues to try and connect to DCS instead of exiting after 50 seconds. (RT75406) | diff --git a/product_docs/docs/pgd/3.7/harp/01_release_notes/harp2.1.0_rel_notes.mdx b/product_docs/docs/pgd/3.7/harp/01_release_notes/harp2.1.0_rel_notes.mdx new file mode 100644 index 00000000000..bb7e5b16d47 --- /dev/null +++ b/product_docs/docs/pgd/3.7/harp/01_release_notes/harp2.1.0_rel_notes.mdx @@ -0,0 +1,17 @@ +--- +title: "Version 2.1.0" +--- + +This is a minor release of HARP 2 that includes new features as well +as fixes for issues identified in previous versions. + +| Type | Description | +| ---- |------------ | +| Feature | The BDR DCS now uses a push notification from the consensus rather than through polling nodes.This change reduces the time for new leader selection and the load that HARP does on the BDR DCS since it doesn't need to poll in short intervals anymore.
| Feature | TPA now restarts each HARP Proxy one at a time and waits for it to come back, reducing any downtime incurred by the application during software upgrades. |
| +| Bug Fix | `harpctl promoteThe use of HARP Router to translate DCS contents into appropriate online or offline states for HTTP-based URI requests meant a load balancer or HAProxy was necessary to determine the lead master. HARP Proxy now does this automatically without periodic iterative status checks.
| +| Feature | Utilizes DCS key subscription to respond directly to state changes.With relevant cluster state changes, the cluster responds immediately, resulting in improved failover and switchover times.
| +| Feature | Compatibility with etcd SSL settings.It is now possible to communicate with etcd through SSL encryption.
| +| Feature | Zero transaction lag on switchover.Transactions are not routed to the new lead node until all replicated transactions are replayed, thereby reducing the potential for conflicts.
+| Feature | Experimental BDR Consensus layer.Using BDR Consensus as the Distributed Consensus Service (DCS) reduces the amount of change needed for implementations.
+| Feature | Experimental built-in proxy.Proxy implementation for increased session control.
| diff --git a/product_docs/docs/pgd/3.7/harp/01_release_notes/index.mdx b/product_docs/docs/pgd/3.7/harp/01_release_notes/index.mdx new file mode 100644 index 00000000000..6905f183a81 --- /dev/null +++ b/product_docs/docs/pgd/3.7/harp/01_release_notes/index.mdx @@ -0,0 +1,26 @@ +--- +title: Release Notes +navigation: +- harp2.1.0_rel_notes +- harp2.0.3_rel_notes +- harp2.0.2_rel_notes +- harp2.0.1_rel_notes +- harp2_rel_notes +--- + +High Availability Routing for Postgres (HARP) is a cluster-management tool for +[Bi-directional Replication (BDR)](/bdr/latest) clusters. The core design of +the tool is to route all application traffic in a single data center or +region to only one node at a time. This node, designated the lead master, acts +as the principle write target to reduce the potential for data conflicts. + +The release notes in this section provide information on what was new in each release. + +| Version | Release Date | +| ----------------------- | ------------ | +| [2.1.1](harp2.1.1_rel_notes) | 2022 June 21 | +| [2.1.0](harp2.1.0_rel_notes) | 2022 May 17 | +| [2.0.3](harp2.0.3_rel_notes) | 2022 Mar 31 | +| [2.0.2](harp2.0.2_rel_notes) | 2022 Feb 24 | +| [2.0.1](harp2.0.1_rel_notes) | 2021 Jan 31 | +| [2.0.0](harp2_rel_notes) | 2021 Dec 01 | diff --git a/product_docs/docs/pgd/3.7/harp/02_overview.mdx b/product_docs/docs/pgd/3.7/harp/02_overview.mdx new file mode 100644 index 00000000000..7db92e093cd --- /dev/null +++ b/product_docs/docs/pgd/3.7/harp/02_overview.mdx @@ -0,0 +1,246 @@ +--- +navTitle: Overview +title: HARP functionality overview +--- + +HARP is a new approach to high availability for BDR +clusters. It leverages a consensus-driven quorum to determine the correct connection endpoint +in a semi-exclusive manner to prevent unintended multi-node writes from an +application. + +## The importance of quorum + +The central purpose of HARP is to enforce full quorum on any Postgres cluster +it manages. Quorum is a term applied to a voting body that +mandates a certain minimum of attendees are available to make a decision. More simply: majority rules. + +For any vote to end in a result other than a tie, an odd number of +nodes must constitute the full cluster membership. Quorum, however, doesn't +strictly demand this restriction; a simple majority is enough. This means +that in a cluster of N nodes, quorum requires a minimum of N/2+1 nodes to hold +a meaningful vote. + +All of this ensures the cluster is always in agreement regarding the node +that is "in charge." For a BDR cluster consisting of multiple nodes, this +determines the node that is the primary write target. HARP designates this node +as the lead master. + +## Reducing write targets + +The consequence of ignoring the concept of quorum, or not applying it +well enough, can lead to a "split brain" scenario where the "correct" write +target is ambiguous or unknowable. In a standard Postgres cluster, it's +important that only a single node is ever writable and sending replication +traffic to the remaining nodes. + +Even in multi-master-capable approaches such as BDR, it can be help to +reduce the amount of necessary conflict management to derive identical data +across the cluster. In clusters that consist of multiple BDR nodes per physical +location or region, this usually means a single BDR node acts as a "leader" and +remaining nodes are "shadow." These shadow nodes are still writable, but writing to them is discouraged unless absolutely necessary. 
+ +By leveraging quorum, it's possible for all nodes to agree on the exact +Postgres node to represent the entire cluster or a local BDR region. Any +nodes that lose contact with the remainder of the quorum, or are overruled by +it, by definition can't become the cluster leader. + +This restriction prevents split-brain situations where writes unintentionally reach two +Postgres nodes. Unlike technologies such as VPNs, proxies, load balancers, or +DNS, you can't circumvent a quorum-derived consensus by misconfiguration or +network partitions. So long as it's possible to contact the consensus layer to +determine the state of the quorum maintained by HARP, only one target is ever +valid. + +## Basic architecture + +The design of HARP comes in essentially two parts, consisting of a manager and +a proxy. The following diagram describes how these interact with a single +Postgres instance: + +![HARP Unit](images/ha-unit.png) + +The consensus layer is an external entity where Harp Manager maintains +information it learns about its assigned Postgres node, and HARP Proxy +translates this information to a valid Postgres node target. Because Proxy +obtains the node target from the consensus layer, several such instances can +exist independently. + +While using BDR as the consensus layer, each server node resembles this +variant instead: + +![HARP Unit w/BDR Consensus](images/ha-unit-bdr.png) + +In either case, each unit consists of the following elements: + +* A Postgres or EDB instance +* A consensus layer resource, meant to track various attributes of the Postgres + instance +* A HARP Manager process to convey the state of the Postgres node to the + consensus layer +* A HARP Proxy service that directs traffic to the proper lead master node, + as derived from the consensus layer + +Not every application stack has access to additional node resources +specifically for the Proxy component, so it can be combined with the +application server to simplify the stack. + +This is a typical design using two BDR nodes in a single data center organized in a lead master/shadow master configuration: + +![HARP Cluster](images/ha-ao.png) + +When using BDR as the HARP consensus layer, at least three +fully qualified BDR nodes must be present to ensure a quorum majority. (Not shown in the diagram are connections between BDR nodes.) + +![HARP Cluster w/BDR Consensus](images/ha-ao-bdr.png) + +## How it works + +When managing a BDR cluster, HARP maintains at most one leader node per +defined location. This is referred to as the lead master. Other BDR +nodes that are eligible to take this position are shadow master state until they take the leader role. + +Applications can contact the current leader only through the proxy service. +Since the consensus layer requires quorum agreement before conveying leader +state, proxy services direct traffic to that node. + +At a high level, this mechanism prevents simultaneous application interaction with +multiple nodes. + +### Determining a leader + +As an example, consider the role of lead master in a locally subdivided +BDR Always-On group as can exist in a single data center. When any +Postgres or Manager resource is started, and after a configurable refresh +interval, the following must occur: + +1. The Manager checks the status of its assigned Postgres resource. + - If Postgres isn't running, try again after configurable timeout. + - If Postgres is running, continue. +2. The Manager checks the status of the leader lease in the consensus layer. 
+ - If the lease is unclaimed, acquire it and assign the identity of + the Postgres instance assigned to this manager. This lease duration is + configurable, but setting it too low can result in unexpected leadership + transitions. + - If the lease is already claimed by us, renew the lease TTL. + - Otherwise do nothing. + +A lot more occurs, but this simplified version explains +what's happening. The leader lease can be held by only one node, and if it's +held elsewhere, HARP Manager gives up and tries again later. + +!!! Note + Depending on the chosen consensus layer, rather than repeatedly looping to + check the status of the leader lease, HARP subscribes to notifications. In this case, it can respond immediately any time the state of the + lease changes rather than polling. Currently this functionality is + restricted to the etcd consensus layer. + +This means HARP itself doesn't hold elections or manage quorum, which is +delegated to the consensus layer. A quorum of the consensus layer must acknowledge the act of obtaining the lease, so if the request succeeds, +that node leads the cluster in that location. + +### Connection routing + +Once the role of the lead master is established, connections are handled +with a similar deterministic result as reflected by HARP Proxy. Consider a case +where HARP Proxy needs to determine the connection target for a particular backend +resource: + +1. HARP Proxy interrogates the consensus layer for the current lead master in + its configured location. +2. If this is unset or in transition: + - New client connections to Postgres are barred, but clients + accumulate and are in a paused state until a lead master appears. + - Existing client connections are allowed to complete current transactions + and are then reverted to a similar pending state as new connections. +3. Client connections are forwarded to the lead master. + +The interplay shown in this case doesn't require any +interaction with either HARP Manager or Postgres. The consensus layer +is the source of all truth from the proxy's perspective. + +### Colocation + +The arrangement of the work units is such that their organization must follow these principles: + +1. The manager and Postgres units must exist concomitantly in the same + node. +2. The contents of the consensus layer dictate the prescriptive role of all + operational work units. + +This arrangement delegates cluster quorum responsibilities to the consensus layer, +while HARP leverages it for critical role assignments and key/value storage. +Neither storage nor retrieval succeeds if the consensus layer is inoperable +or unreachable, thus preventing rogue Postgres nodes from accepting +connections. + +As a result, the consensus layer generally exists outside of HARP or HARP-managed nodes for maximum safety. Our reference diagrams show this separation, although it isn't required. + +!!! Note + To operate and manage cluster state, BDR contains its own + implementation of the Raft Consensus model. You can configure HARP to + leverage this same layer to reduce reliance on external dependencies and + to preserve server resources. However, certain drawbacks to this + approach are discussed in + [Consensus layer](09_consensus-layer). + +## Recommended architecture and use + +HARP was primarily designed to represent a BDR Always-On architecture that +resides in two or more data centers and consists of at least five BDR +nodes. This configuration doesn't count any logical standby nodes. 
+ +The following diagram shows the current and standard representation: + +![BDR Always-On Reference Architecture](images/bdr-ao-spec.png) + +In this diagram, HARP Manager exists on BDR Nodes 1-4. The initial state +of the cluster is that BDR Node 1 is the lead master of DC A, and BDR +Node 3 is the lead master of DC B. + +This configuration results in any HARP Proxy resource in DC A connecting to BDR Node 1 +and the HARP Proxy resource in DC B connecting to BDR Node 3. + +!!! Note + While this diagram shows only a single HARP Proxy per DC, this is + an example only and should not be considered a single point of failure. Any + number of HARP Proxy nodes can exist, and they all direct application + traffic to the same node. + +### Location configuration + +For multiple BDR nodes to be eligible to take the lead master lock in +a location, you must define a location in the `config.yml` configuration +file. + +To reproduce the BDR Always-On reference architecture shown in the diagram, include these lines in the `config.yml` +configuration for BDR Nodes 1 and 2: + +```yaml +location: dca +``` + +For BDR Nodes 3 and 4, add: + +```yaml +location: dcb +``` + +This applies to any HARP Proxy nodes that are designated in those respective +data centers as well. + +### BDR 3.7 compatibility + +BDR 3.7 and later offers more direct location definition by assigning a +location to the BDR node. This is done by calling the following SQL +API function while connected to the BDR node. So for BDR Nodes 1 and 2, you +might do this: + +```sql +SELECT bdr.set_node_location('dca'); +``` + +And for BDR Nodes 3 and 4: + +```sql +SELECT bdr.set_node_location('dcb'); +``` diff --git a/product_docs/docs/pgd/3.7/harp/03_installation.mdx b/product_docs/docs/pgd/3.7/harp/03_installation.mdx new file mode 100644 index 00000000000..b30b6d2bdf2 --- /dev/null +++ b/product_docs/docs/pgd/3.7/harp/03_installation.mdx @@ -0,0 +1,128 @@ +--- +navTitle: Installation +title: Installation +--- + +A standard installation of HARP includes two system services: + +* HARP Manager (`harp-manager`) on the node being managed +* HARP Proxy (`harp-proxy`) elsewhere + +There are two ways to install and configure these services to manage +Postgres for proper quorum-based connection routing. + +## Software versions + +HARP has dependencies on external software. These must fit a minimum +version as listed here. + +| Software | Min version | +|-----------|---------| +| etcd | 3.4 | +| PgBouncer | 1.14 | + +## TPAExec + +The easiest way to install and configure HARP is to use the EDB TPAexec utility +for cluster deployment and management. For details on this software, see the +[TPAexec product page](https://www.enterprisedb.com/docs/pgd/latest/deployments/tpaexec/). + +!!! Note + TPAExec is currently available only through an EULA specifically dedicated + to BDR cluster deployments. If you can't access the TPAExec URL, + contact your sales or account representative. + +Configure TPAexec to recognize that cluster routing is +managed through HARP by ensuring the TPA `config.yml` file contains these +attributes: + +```yaml +cluster_vars: + failover_manager: harp +``` + +!!! 
Note + Versions of TPAexec earlier than 21.1 require a slightly different approach: + + ```yaml + cluster_vars: + enable_harp: true + ``` + +After this, install HARP by invoking the `tpaexec` commands +for making cluster modifications: + +```bash +tpaexec provision ${CLUSTER_DIR} +tpaexec deploy ${CLUSTER_DIR} +``` + +No other modifications are necessary apart from cluster-specific +considerations. + + +## Package installation + +Currently CentOS/RHEL packages are provided by the EDB packaging +infrastructure. For details, see the [HARP product +page](https://www.enterprisedb.com/docs/harp/latest/). + +### etcd packages + +Currently `etcd` packages for many popular Linux distributions aren't +available by their standard public repositories. EDB has therefore packaged +`etcd` for RHEL and CentOS versions 7 and 8, Debian, and variants such as +Ubuntu LTS. You need access to our HARP package repository to use +these libraries. + +## Consensus layer + +HARP requires a distributed consensus layer to operate. Currently this must be +either `bdr` or `etcd`. If using fewer than three BDR nodes, you might need to rely on `etcd`. Otherwise any BDR service outage reduces the +consensus layer to a single node and thus prevents node consensus and disables +Postgres routing. + +### etcd + +If you're using `etcd` as the consensus layer, `etcd` must be installed either +directly on the Postgres nodes or in a separate location they can access. + +To set `etcd` as the consensus layer, include this code in the HARP `config.yml` +configuration file: + +```yaml +dcs: + driver: etcd + endpoints: + - host1:2379 + - host2:2379 + - host3:2379 +``` + +When using TPAExec, all configured etcd endpoints are entered here +automatically. + +### BDR + +The `bdr` native consensus layer is available from BDR 3.6.21 and 3.7.3. This +consensus layer model requires no supplementary software when managing routing +for a BDR cluster. + +To ensure quorum is possible in the cluster, always +use more than two nodes so that BDR's consensus layer remains responsive during node +maintenance or outages. + +To set BDR as the consensus layer, include this in the `config.yml` +configuration file: + +```yaml +dcs: + driver: bdr + endpoints: + - host=host1 dbname=bdrdb user=harp_user + - host=host2 dbname=bdrdb user=harp_user + - host=host3 dbname=bdrdb user=harp_user +``` + +The endpoints for a BDR consensus layer follow the +standard Postgres DSN connection format. diff --git a/product_docs/docs/pgd/3.7/harp/04_configuration.mdx b/product_docs/docs/pgd/3.7/harp/04_configuration.mdx new file mode 100644 index 00000000000..10ac24fba06 --- /dev/null +++ b/product_docs/docs/pgd/3.7/harp/04_configuration.mdx @@ -0,0 +1,540 @@ +--- +navTitle: Configuration +title: Configuring HARP for cluster management +--- + +The HARP configuration file follows a standard YAML-style formatting that was simplified for readability. This file is located in the `/etc/harp` +directory by default and is named `config.yml` + +You can explicitly provide the configuration file location to all HARP +executables by using the `-f`/`--config` argument. + +## Standard configuration + +HARP essentially operates as three components: + +* HARP Manager +* HARP Proxy +* harpctl + +Each of these use the same standard `config.yml` configuration format, which always include the following sections: + +* `cluster.name` — The name of the cluster to target for all operations. +* `dcs` — DCS driver and connection configuration for all endpoints. 
+ +This means a standard preamble is always included for HARP +operations, such as the following: + +```yaml +cluster: + name: mycluster + +dcs: + ... +``` + +Other sections are optional or specific to the named HARP +component. + +### Cluster name + +The `name` entry under the `cluster` heading is required for all +interaction with HARP. Each HARP cluster has a name for both disambiguation +and for labeling data in the DCS for the specific cluster. + +HARP Manager writes information about the cluster here for consumption by +HARP Proxy and harpctl. HARP Proxy services direct traffic to nodes in +this cluster. The `harpctl` management tool interacts with this cluster. + +### DCS settings + +Configuring the consensus layer is key to HARP functionality. Without the DCS, +HARP has nowhere to store cluster metadata, can't hold leadership elections, +and so on. Therefore this portion of the configuration is required, and +certain elements are optional. + +Specify all elements under a section named `dcs` with these multiple +supplementary entries: + +- **`driver`**: Required type of consensus layer to use. + Currently can be `etcd` or `bdr`. Support for `bdr` as a consensus layer is + experimental. Using `bdr` as the consensus layer reduces the + additional software for consensus storage but expects a minimum of three + full BDR member nodes to maintain quorum during database maintenance. + +- **`endpoints`**: Required list of connection strings to contact the DCS. + List every node of the DCS here if possible. This ensures HARP + continues to function as long as a majority of the DCS can still + operate and be reached by the network. + + Format when using `etcd` as the consensus layer is as follows: + + ```yaml + dcs: + endpoints: + - host1:2379 + - host2:2379 + - host3:2379 + ``` + Format when using the experimental `bdr` consensus layer is as follows: + + ```yaml + dcs: + # only DSN format is supported + endpoints: + - "host=host1 port=5432 dbname=bdrdb user=postgres" + - "host=host2 port=5432 dbname=bdrdb user=postgres" + - "host=host3 port=5432 dbname=bdrdb user=postgres" + ``` +Currently, `bdr` consensus layer requires the first endpoint to point to the local postgres instance. + +- **`request_timeout`**: Time in milliseconds to consider a request as failed. + If HARP makes a request to the DCS and receives no response in this time, it considers the operation as failed. This can cause the issue + to be logged as an error or retried, depending on the nature of the request. + Default: 250. + +The following DCS SSL settings apply only when ```driver: etcd``` is set in the +configuration file: + +- **`ssl`**: Either `on` or `off` to enable SSL communication with the DCS. + Default: `off` + +- **`ssl_ca_file`**: Client SSL certificate authority (CA) file. + +- **`ssl_cert_file`**: Client SSL certificate file. + +- **`ssl_key_file`**: Client SSL key file. + +#### Example + +This example shows how to configure HARP to contact an etcd DCS +consisting of three nodes: + +```yaml +dcs: + driver: etcd + endpoints: + - host1:2379 + - host2:2379 + - host3:2379 +``` + +### HARP Manager specific + +Besides the generic service options required for all HARP components, Manager +needs other settings: + +- **`log_level`**: One of `DEBUG`, `INFO`, `WARNING`, `ERROR`, or `CRITICAL`, + which might alter the amount of log output from HARP services. + +- **`name`**: Required name of the Postgres node represented by this Manager. 
+ Since Manager can represent only a specific node, that node is named here and + also serves to name this Manager. If this is a BDR node, it must match the + value used at node creation when executing the + `bdr.create_node(node_name, ...)` function and as reported by the + `bdr.local_node_summary.node_name` view column. Alphanumeric characters + and underscores only. + +- **`start_command`**: This can be used instead of the information in DCS for + starting the database to monitor. This is required if using bdr as the + consensus layer. + +- **`status_command`**: This can be used instead of the information in DCS for + the Harp Manager to determine whether the database is running. This is + required if using bdr as the consensus layer. + +- **`stop_command`**: This can be used instead of the information in DCS for + stopping the database. + +- **`db_retry_wait_min`**: The initial time in seconds to wait if Harp Manager cannot + connect to the database before trying again. Harp Manager will increase the + wait time with each attempt, up to the `db_retry_wait_max` value. + +- **`db_retry_wait_max`**: The maximum time in seconds to wait if Harp Manager cannot + connect to the database before trying again. + + +Thus a complete configuration example for HARP Manager might look like this: + +```yaml +cluster: + name: mycluster + +dcs: + driver: etcd + endpoints: + - host1:2379 + - host2:2379 + - host3:2379 + +manager: + name: node1 + log_level: INFO +``` + +This configuration is essentially the DCS contact information, any associated +service customizations, the name of the cluster, and the name of the +node. All other settings are associated with the node and is stored +in the DCS. + +Read the [Node bootstrapping](05_bootstrapping) for more about +specific node settings and initializing nodes to be managed by HARP Manager. + +### HARP Proxy specific + +Some configuration options are specific to HARP Proxy. These affect how the +daemon operates and thus are currently located in `config.yml`. + +Specify Proxy-based settings under a `proxy` heading, and include: + +- **`location`**: Required name of location for HARP Proxy to represent. + HARP Proxy nodes are directly tied to the location where they are running, as + they always direct traffic to the current lead master node. Specify location + for any defined proxy. + +- **`log_level`**: One of `DEBUG`, `INFO`, `WARNING`, `ERROR`, or `CRITICAL`, + which might alter the amount of log output from HARP services. + + * Default: `INFO` + +- **`name`**: Name of this specific proxy. + Each proxy node is named to ensure any associated statistics or operating + state are available in status checks and other interactive events. + +- **`type`**: Specifies whether to use pgbouncer or the experimental built-in passthrough proxy. All proxies must use the same proxy type. We recommend to experimenting with only the simple proxy in combination with the experimental BDR DCS. + Can be `pgbouncer` or `builtin`. + + * Default: `pgbouncer` + +- **`pgbouncer_bin_dir`**: Directory where PgBouncer binaries are located. + As HARP uses PgBouncer binaries, it needs to know where they are + located. This can be depend on the platform or distribution, so it has no + default. Otherwise, the assumption is that the appropriate binaries are in the + environment's `PATH` variable. + +#### Example + +HARP Proxy requires the cluster name, DCS connection settings, location, and +name of the proxy in operation. 
For example: + +```yaml +cluster: + name: mycluster + +dcs: + driver: etcd + endpoints: + - host1:2379 + - host2:2379 + - host3:2379 + +proxy: + name: proxy1 + location: dc1 + pgbouncer_bin_dir: /usr/sbin +``` + +All other attributes are obtained from the DCS on proxy startup. + +## Runtime directives + +While it is possible to configure HARP Manager, HARP Proxy, or harpctl with a +minimum of YAML in the `config.yml` file, some customizations are held in +the DCS. These values must either be initialized via bootstrap or set +specifically with `harpctl set` directives. + +### Cluster-wide + +Set these settings under a `cluster` YAML heading during bootstrap, or +modify them with a `harpctl set cluster` command. + +- **`event_sync_interval`**: Time in milliseconds to wait for synchronization. + When events occur in HARP, they do so asynchronously across the cluster. + HARP managers start operating immediately when they detect metadata changes, + and HARP proxies might pause traffic and start reconfiguring endpoints. This is + a safety interval that roughly approximates the maximum amount of + event time skew that exists between all HARP components. + + For example, suppose Node A goes offline and HARP Manager on Node B commonly + receives this event 5 milliseconds before Node C. A setting of at least 5 ms + is then needed to ensure all HARP Manager services receive the + event before they begin to process it. + + This also applies to HARP Proxy. + +### Node directives + +You can change most node-oriented settings and then apply them while HARP +Manager is active. These items are retained in the DCS after initial bootstrap, +and thus you can modify them without altering a configuration file. + +Set these settings under a `node` YAML heading during bootstrap, or +modify them with a `harpctl set node` command. + +- **`node_type`**: The type of this database node, either `bdr` or `witness`. You can't promote a + witness node to leader. + +- **`camo_enforcement`**: Whether to strictly enforce CAMO queue state. + When set to `strict`, HARP never allows switchover or failover to a BDR + CAMO partner node unless it's fully caught up with the entire CAMO queue at + the time of the migration. When set to `lag_only`, only standard lag + thresholds such as `maximum_camo_lag` are applied. + +- **`dcs_reconnect_interval`**: The interval, measured in milliseconds, between attempts that a disconnected node tries to reconnect to the DCS. + + * Default: 1000. + +- **`dsn`**: Required full connection string to the managed Postgres node. + This parameter applies equally to all HARP services and enables + micro-architectures that run only one service per container. + + !!! Note + HARP sets the `sslmode` argument to `require` by default and prevents + connections to servers that don't require SSL. To disable this behavior, + explicitly set this parameter to a more permissive value such as + `disable`, `allow`, or `prefer`. + +- **`db_data_dir`**: Required Postgres data directory. + This is required by HARP Manager to start, stop, or reload the Postgres + service. It's also the default location for configuration files, which you can use + later for controlling promotion of streaming replicas. + +- **`db_conf_dir`**: Location of Postgres configuration files. + Some platforms prefer storing Postgres configuration files away from the + Postgres data directory. In these cases, set this option to that + expected location. + +- **`db_log_file`**: Location of Postgres log file. 
+ + * Default: `/tmp/pg_ctl.out` + +- **`fence_node_on_dcs_failure`**: If HARP can't reach the DCS, several readiness keys and the leadership lease expire. This implicitly prevents a node from routing consideration. However, such a node isn't officially fenced, and the Manager doesn't stop monitoring the database if `stop_database_when_fenced` is set to `false`. + + * Default: False + +- **`leader_lease_duration`**: Amount of time in seconds the lead master + lease persists if not refreshed. This allows any HARP Manager a certain + grace period to refresh the lock, before expiration allows another node to + obtain the lead master lock instead. + + * Default: 6 + +- **`lease_refresh_interval`**: Amount of time in milliseconds between + refreshes of the lead master lease. This essentially controls the time + between each series of checks HARP Manager performs against its assigned + Postgres node and when the status of the node is updated in the consensus + layer. + + * Default: 2000 +- **`max_dcs_failures`**: The amount of DCS request failures before marking a node as fenced according to `fence_node_on_dcs_failure`. This setting prevents transient communication disruptions from shutting down database nodes. + + * Default: 10 + +- **`maximum_lag`**: Highest allowable variance (in bytes) between last + recorded LSN of previous lead master and this node before being allowed to + take the lead master lock. This setting prevents nodes experiencing terminal amounts + of lag from taking the lead master lock. Set to `-1` to disable this check. + + * Default: -1 + +- **`maximum_camo_lag`**: Highest allowable variance (in bytes) between last + received LSN and applied LSN between this node and its CAMO partners. + This applies only to clusters where CAMO is both available and enabled. + Thus this applies only to BDR EE clusters where `pg2q.enable_camo` is set. + For clusters with particularly stringent CAMO apply queue restrictions, set + this very low or even to `0` to avoid any unapplied CAMO transactions. Set to + `-1` to disable this check. + + * Default: -1 + +- **`ready_status_duration`**: Amount of time in seconds the node's readiness + status persists if not refreshed. This is a failsafe that removes a + node from being contacted by HARP Proxy if the HARP Manager in charge of it + stops operating. + + * Default: 30 + +- **`db_bin_dir`**: Directory where Postgres binaries are located. + As HARP uses Postgres binaries, such as `pg_ctl`, it needs to know where + they're located. This can depend on the platform or distribution, so it has no + default. Otherwise, the assumption is that the appropriate binaries are in the + environment's `PATH` variable. + +- **`priority`**: Any numeric value. + Any node where this option is set to `-1` can't take the lead master role, even when attempting to explicitly set the lead master using `harpctl`. + + * Default: 100 + +- **`stop_database_when_fenced`**: Rather than removing a node from all possible routing, stop the database on a node when it is fenced. This is an extra safeguard to prevent data from other sources than HARP Proxy from reaching the database or in case proxies can't disconnect clients for some other reason. + + * Default: False + +- **`consensus_timeout`**: Amount of milliseconds before aborting a read or + write to the consensus layer. If the consensus layer loses + quorum or becomes unreachable, you want near-instant errors rather than + infinite timeouts. This prevents blocking behavior in such cases. 
+ When using `bdr` as the consensus layer, the highest recognized timeout + is 1000 ms. + + * Default: 250 + +- **`use_unix_socket`**: Specifies for HARP Manager to prefer to use + Unix sockets to connect to the database. + + * Default: False + +All of these runtime directives can be modified via `harpctl`. Consider if you +want to decrease the `lease_refresh_interval` to 100ms on `node1`: + +```bash +harpctl set node node1 lease_refresh_interval=100 +``` + +### Proxy directives + +You can change certain settings to the proxy while the service is active. These +items are retained in the DCS after initial bootstrap, and thus you can modify them +without altering a configuration file. Many of these settings are direct +mappings to their PgBouncer equivalent, and we will note these where relevant. + +Set these settings under a `proxies` YAML heading during bootstrap, or +modify them with a `harpctl set proxy` command. +Properties set by `harpctl set proxy` require a restart of the proxy. + +- **`auth_file`**: The full path to a PgBouncer-style `userlist.txt` file. + HARP Proxy uses this file to store a `pgbouncer` user that has + access to PgBouncer's Admin database. You can use this for other users + as well. Proxy modifies this file to add and modify the password for the + `pgbouncer` user. + + * Default: `/etc/harp/userlist.txt` + +- **`auth_type`**: The type of Postgres authentication to use for password + matching. This is actually a PgBouncer setting and isn't fully compatible + with the Postgres `pg_hba.conf` capabilities. We recommend using `md5`, `pam` + `cert`, or `scram-sha-256`. + + * Default: `md5` + +- **`auth_query`**: Query to verify a user’s password with Postgres. + Direct access to `pg_shadow` requires admin rights. It’s better to use a + non-superuser that calls a `SECURITY DEFINER` function instead. If using + TPAexec to create a cluster, a function named `pgbouncer_get_auth` is + installed on all databases in the `pg_catalog` namespace to fulfill this + purpose. + +- **`auth_user`**: If `auth_user` is set, then any user not specified in + `auth_file` is queried through the `auth_query` query from `pg_shadow` + in the database, using `auth_user`. The password of `auth_user` is + taken from `auth_file`. + +- **`client_tls_ca_file`**: Root certificate file to validate client + certificates. Requires `client_tls_sslmode` to be set. + +- **`client_tls_cert_file`**: Certificate for private key. Clients can + validate it. Requires `client_tls_sslmode` to be set. + +- **`client_tls_key_file`**: Private key for PgBouncer to accept client + connections. Requires `client_tls_sslmode` to be set. + +- **`client_tls_protocols`**: TLS protocol versions allowed for + client connections. + Allowed values: `tlsv1.0`, `tlsv1.1`, `tlsv1.2`, `tlsv1.3`. + Shortcuts: `all` (tlsv1.0,tlsv1.1,tlsv1.2,tlsv1.3), + `secure` (tlsv1.2,tlsv1.3), `legacy` (all). + + * Default: `secure` + +- **`client_tls_sslmode`**: Whether to enable client SSL functionality. + Possible values are `disable`, `allow`, `prefer`, `require`, `verify-ca`, and `verify-full`. + + * Default: `disable` + +- **`database_name`**: Required name that represents the database clients + use when connecting to HARP Proxy. This is a stable endpoint that doesn't + change and points to the current node, database name, port, etc., + necessary to connect to the lead master. You can use the global value `*` + here so all connections get directed to this target regardless of database + name. 
+ +- **`default_pool_size`**: The maximum number of active connections to allow + per database/user combination. This is for connection pooling purposes + but does nothing in session pooling mode. This is a PgBouncer setting. + + * Default: 25 + +- **`ignore_startup_parameters`**: By default, PgBouncer allows only + parameters it can keep track of in startup packets: `client_encoding`, + `datestyle`, `timezone`, and `standard_conforming_strings`. All other + parameters raise an error. To allow other parameters, you can specify them here so that PgBouncer knows that they are handled by the admin + and it can ignore them. Often, you need to set this to + `extra_float_digits` for Java applications to function properly. + + * Default: `extra_float_digits` + +- **`listen_address`**: IP addresses where Proxy should listen for + connections. Used by pgbouncer and builtin proxy. + + * Default: 0.0.0.0 + +- **`listen_port`**: System port where Proxy listens for connections. + Used by pgbouncer and builtin proxy. + + * Default: 6432 + +- **`max_client_conn`**: The total maximum number of active client + connections that are allowed on the proxy. This can be many orders of + magnitude greater than `default_pool_size`, as these are all connections that + have yet to be assigned a session or have released a session for use by + another client connection. This is a PgBouncer setting. + + * Default: 100 + +- **`monitor_interval`**: Time in seconds between Proxy checks of PgBouncer. + Since HARP Proxy manages PgBouncer as the actual connection management + layer, it needs to periodically check various status and stats to verify + it's still operational. You can also log or register some of this information to the DCS. + + * Default: 5 + +- **`server_tls_protocols`**: TLS protocol versions are allowed for + server connections. + Allowed values: `tlsv1.0`, `tlsv1.1`, `tlsv1.2`, `tlsv1.3`. + Shortcuts: `all` (tlsv1.0,tlsv1.1,tlsv1.2,tlsv1.3), + `secure` (tlsv1.2,tlsv1.3), `legacy` (all). + + * Default: `secure` + +- **`server_tls_sslmode`**: Whether to enable server SSL functionality. + Possible values are `disable`, `allow`, `prefer`, `require`, `verify-ca`, and `verify-full`. + + * Default: `disable` + +- **`session_transfer_mode`**: Method by which to transfer sessions. + Possible values are `fast`, `wait`, and `reconnect`. + + * Default: `wait` + + This property isn't used by the builtin proxy. + +- **`server_transfer_timeout`**: The number of seconds Harp Proxy waits before giving up on a PAUSE and issuing a KILL command. + + * Default: 30 + +The following two options apply only when using the built-in proxy. + +- **`keepalive`**: The number of seconds the built-in proxy waits before sending a keepalive message to an idle leader connection. + + * Default: 5 + + +- **`timeout`**: The number of seconds the built-in proxy waits before giving up on connecting to the leader. + + * Default: 1 + +When using `harpctl` to change any of these settings for all proxies, use the +`global` keyword in place of the proxy name. For example: + +```bash +harpctl set proxy global max_client_conn=1000 +``` diff --git a/product_docs/docs/pgd/3.7/harp/05_bootstrapping.mdx b/product_docs/docs/pgd/3.7/harp/05_bootstrapping.mdx new file mode 100644 index 00000000000..55d78e8dac4 --- /dev/null +++ b/product_docs/docs/pgd/3.7/harp/05_bootstrapping.mdx @@ -0,0 +1,194 @@ +--- +navTitle: Bootstrapping +title: Cluster bootstrapping +--- + +To use HARP, a minimum amount of metadata must exist in the DCS. 
The +process of "bootstrapping" a cluster essentially means initializing node, +location, and other runtime configuration either all at once or on a +per-resource basis. + +This process is governed through the `harpctl apply` command. For more +information, see [harpctl command-line tool](08_harpctl). + +Set up the DCS and make sure it is functional before bootstrapping. + +!!! Important + You can combine any or all of + these example into a single YAML document and apply it all at once. + +## Cluster-wide bootstrapping + +Some settings are applied cluster-wide and you can specify them during +bootstrapping. Currently this applies only to the `event_sync_interval` +runtime directive, but others might be added later. + +The format is as follows: + +```yaml +cluster: + name: mycluster + event_sync_interval: 100 +``` + +Assuming that file was named `cluster.yml`, you then apply it with the +following: + +```bash +harpctl apply cluster.yml +``` + +If the cluster name isn't already defined in the DCS, this also +initializes that value. + +!!! Important + The cluster name parameter specified here always overrides the cluster + name supplied in `config.yml`. The assumption is that the bootstrap file + supplies all necessary elements to bootstrap a cluster or some portion of + its larger configuration. A `config.yml` file is primarily meant to control + the execution of HARP Manager, HARP Proxy, or `harpctl` specifically. + +## Location bootstrapping + +Every HARP node is associated with at most one location. This location can be +a single data center, a grouped region consisting of multiple underlying +servers, an Amazon availability zone, and so on. This is a logical +structure that allows HARP to group nodes together such that only one +represents the nodes in that location as the lead master. + +Thus it is necessary to initialize one or more locations. The format for this +is as follows: + +```yaml +cluster: + name: mycluster + +locations: + - location: dc1 + - location: dc2 +``` + +Assuming that file was named `locations.yml`, you then apply it with the +following: + +```bash +harpctl apply locations.yml +``` + +When performing any manipulation of the cluster, include the name as a preamble so the changes are directed to the right place. + +Once locations are bootstrapped, they show up with a quick examination: + +```bash +> harpctl get locations + +Cluster Location Leader Previous Leader Target Leader Lease Renewals +------- -------- ------ --------------- ------------- -------------- +mycluster dc1In case of a failure to connect to a CAMO partner to verify its configuration and check the status of transactions, do not retry immediately (leading to a fully busy pglogical manager process), but throttle down repeated attempts to reconnect and checks to once per minute.
+| Improvement | RPerformance and scalability | Implement buffered read for LCR segment file (BDR-1422)Implement LCR segment file buffering so that multiple LCR chunks can be read at a time. This should reduce I/O and improve CPU usage of Wal Senders when using the Decoding Worker.
+| Improvement | Performance and scalability | Avoid unnecessary LCR segment reads (BDR-1426)BDR now attempts to only read new LCR segments when there is at least one available. This reduces I/O load when Decoding Worker is enabled.
+| Improvement | Performance and scalability | Performance of COPY replication including the initial COPY during join has been greatly improved for partitioned tables (BDR-1479)For large tables this can improve the load times by order of magnitude or more.
+| Bug fix | Performance and scalability | Fix the parallel apply worker selection (BDR-1761)This makes parallel apply work again. In 4.0.0 parallel apply was never in effect due to this bug.
+| Bug fix | Reliability and operability | Fix Raft snapshot handling of `bdr.camo_pairs` (BDR-1753)The previous release would not correctly propagate changes to the CAMO pair configuration when they were received via Raft snapshot.
+| Bug fix | Reliability and operability | Correctly handle Raft snapshots from BDR 3.7 after upgrades (BDR-1754) +| Bug fix | Reliability and operability | Upgrading a CAMO configured cluster taking into account the `bdr.camo_pairs` in the snapshot while still excluding the ability to perform in place upgrade of a cluster (due to upgrade limitations unrelated to CAMO). +| Bug fix | Reliability and operability | Switch from CAMO to Local Mode only after timeouts (RT74892)Do not use the `catchup_interval` estimate when switching from CAMO protected to Local Mode, as that could induce inadvertent switching due to load spikes. Use the estimate only when switching from Local Mode back to CAMO protected (to prevent toggling forth and back due to lag on the CAMO partner).
+| Bug fix | Reliability and operability | Fix replication set cache invalidation when published replication set list have changed (BDR-1715)In previous versions we could use stale information about which replication sets (and as a result which tables) should be published until the subscription has reconnected.
+| Bug fix | Reliability and operability | Prevent duplicate values generated locally by galloc sequence in high concurrency situations when the new chunk is used (RT76528)The galloc sequence could have temporarily produce duplicate value when switching which chunk is used locally (but not across nodes) if there were multiple sessions waiting for the new value. This is now fixed.
+| Bug fix | Reliability and operability | Address memory leak on streaming transactions (BDR-1479)For large transaction this reduces memory usage and I/O considerably when using the streaming transactions feature. This primarily improves performance of COPY replication.
+| Bug fix | Reliability and operability | Don't leave slot behind after PART_CATCHUP phase of node parting when the catchup source has changed while the node was parting (BDR-1716)When node is being removed (parted) from BDR group, we do so called catchup in order to forward any missing changes from that node between remaining nodes in order to keep the data on all nodes consistent. This requires an additional replication slot to be created temporarily. Normally this replication slot is removed at the end of the catchup phase, however in certain scenarios where we have to change the source node for the changes, this slot could have previously been left behind. From this version, this slot is always correctly removed.
+| Bug fix | Reliability and operability | Ensure that the group slot is moved forward when there is only one node in the BDR groupThis prevents disk exhaustion due to WAL accumulation when the group is left running with just single BDR node for a prolonged period of time. This is not recommended setup but the WAL accumulation was not intentional.
+| Bug fix | Reliability and operability | Advance Raft protocol version when there is only one node in the BDR groupSingle node clusters would otherwise always stay on oldest support protocol until another node was added. This could limit available feature set on that single node.
+ +## Upgrades + +This release supports upgrading from the following versions of BDR: + +- 3.7.14 +- 4.0.0 and higher + +Please make sure you read and understand the process and limitations described +in the [Upgrade Guide](/pgd/latest/upgrades/) before upgrading. diff --git a/product_docs/docs/pgd/4/overview/bdr/release_notes/bdr4.0.2_rel_notes.mdx b/product_docs/docs/pgd/4/overview/bdr/release_notes/bdr4.0.2_rel_notes.mdx new file mode 100644 index 00000000000..d39bfb08be5 --- /dev/null +++ b/product_docs/docs/pgd/4/overview/bdr/release_notes/bdr4.0.2_rel_notes.mdx @@ -0,0 +1,37 @@ +--- +title: "BDR 4.0.2" +--- + +This is a maintenance release for BDR 4.0 which includes minor +improvements as well as fixes for issues identified in previous +versions. + +| Type | Category | Description | +| ---- | -------- | ----------- | +| Improvement | Reliability and operability | Add `bdr.max_worker_backoff_delay` (BDR-1767)This changes the handling of the backoff delay to exponentially increase from `bdr.min_worker_backoff_delay` to `bdr.max_worker_backoff_delay` in presence of repeated errors. This reduces log spam and in some cases also prevents unnecessary connection attempts.
+| Improvement | User Experience | Add `execute_locally` option to `bdr.replicate_ddl_command()` (RT73533)This allows optional queueing of ddl commands for replication to other groups without executing it locally.
+| Improvement | User Experience | Change ERROR on consensus issue during JOIN to WARNINGThe reporting of these transient errors was confusing as they were also shown in bdr.worker_errors. These are now changed to WARNINGs.
+| Bug fix | Reliability and operability | WAL decoder confirms end LSN of the running transactions record (BDR-1264)Confirm end LSN of the running transactions record processed by WAL decoder so that the WAL decoder slot remains up to date and WAL senders get the candidate in timely manner.
+| Bug fix | Reliability and operability | Don't wait for autopartition tasks to complete on parting nodes (BDR-1867)When a node has started parting process, it makes no sense to wait for autopartition tasks on such nodes to finish since it's not part of the group anymore.
+| Bug fix | User Experience | Improve handling of node name reuse during parallel join (RT74789)Nodes now have a generation number so that it's easier to identify the name reuse even if the node record is received as part of a snapshot.
+| Bug fix | Reliability and operability | Fix locking and snapshot use during node management in the BDR manager process (RT74789)When processing multiple actions in the state machine, make sure to reacquire the lock on the processed node and update the snapshot to make sure all updates happening through consensus are taken into account.
+| Bug fix | Reliability and operability | Improve cleanup of catalogs on local node dropDrop all groups, not only the primary one and drop all the node state history info as well.
+| Bug fix | User Experience | Improve error checking for join request in bdr_init_physicalPreviously bdr_init_physical would simply wait forever when there was any issue with the consensus request, now we do same checking as the logical join does.
+| Bug fix | Reliability and operability | Improve handling of various timeouts and sleeps in consensusThis reduces the amount of new consensus votes needed when processing many consensus requests or time consuming consensus requests, for example during join of a new node.
+| Bug fix | Reliability and operability | Fix handling of `wal_receiver_timeout` (BDR-1848)The `wal_receiver_timeout` has not been triggered correctly due to a regression in BDR 3.7 and 4.0.
+| Bug fix | Reliability and operability | Limit the `bdr.standby_slot_names` check when reporting flush position only to physical slots (RT77985, RT78290)Otherwise flush progress is not reported in presence of disconnected nodes when using `bdr.standby_slot_names`.
+| Bug fix | Reliability and operability | Fix replication of data types created during bootstrap (BDR-1784) +| Bug fix | Reliability and operability | Fix replication of arrays of builtin types that don't have binary transfer support (BDR-1042) +| Bug fix | Reliability and operability | Prevent CAMO configuration warnings if CAMO is not being used (BDR-1825) + +## Upgrades + +This release supports upgrading from the following versions of BDR: + +- 4.0.0 and higher + +The upgrade path from BDR 3.7 is not currently stable and needs to be +considered beta. Tests should be performed with at least BDR 3.7.15. + +Please make sure you read and understand the process and limitations described +in the [Upgrade Guide](/pgd/latest/upgrades/) before upgrading. diff --git a/product_docs/docs/pgd/4/overview/bdr/release_notes/bdr4.1.0_rel_notes.mdx b/product_docs/docs/pgd/4/overview/bdr/release_notes/bdr4.1.0_rel_notes.mdx new file mode 100644 index 00000000000..859bb4f68e4 --- /dev/null +++ b/product_docs/docs/pgd/4/overview/bdr/release_notes/bdr4.1.0_rel_notes.mdx @@ -0,0 +1,68 @@ +--- +title: "BDR 4.1.0" +--- + +This is a minor release of BDR 4 which includes new features as well +as fixes for issues identified in previous versions. + +| Type | Category | Description | +| ---- | -------- | ----------- | +| Feature | Reliability and operability | Support in-place major upgrade of Postgres on a BDR nodeThis BDR release includes a new command-line utility `bdr_pg_upgrade` which uses `pg_upgrade` to do a major version upgrade of Postgres on a BDR node.
This reduces the time and network bandwidth necessary to do major version upgrades of Postgres in a BDR cluster.
+| Feature | Performance and scalability | Replication Lag ControlAdd configuration for a replication lag threshold after which the transaction commits get throttled. This allows limiting RPO without incurring the latency impact on every transaction that comes with synchronous replication.
+| Feature | UX / Initial experience | Distributed sequences by defaultThe default value of `bdr.default_sequence_kind` has been changed to `'distributed'`, which is a new kind of sequence that uses SnowFlakeId for `bigserial` and Galloc sequences for the `serial` column type.
+| Feature | UX | Simplified synchronous replication configurationNew syntax for specifying the synchronous replication options, with focus on BDR groups and SQL based management (as opposed to config file).
In future versions this will also replace the current Eager Replication and CAMO configuration options.
+| Feature | High availability and disaster recovery | Group CommitThe initial kind of synchronous commit that can be configured via the new configuration syntax.
+| Feature | High availability and disaster recovery | Allow a Raft request to be required for CAMO switching to Local Mode (RT78928)Add a `require_raft` flag to the CAMO pairing configuration which controls the behavior of switching from CAMO protected to Local Mode, introducing the option to require a majority of nodes to be connected to allow switching to Local Mode.
+| Feature | High availability and disaster recovery | Allow replication to continue on `ALTER TABLE ... DETACH PARTITION CONCURRENTLY` of already detached partition (RT78362)Similarly to how BDR 4 handles `CREATE INDEX CONCURRENTLY` when the same index already exists, we now allow replication to continue when `ALTER TABLE ... DETACH PARTITION CONCURRENTLY` is received for a partition that has already been detached.
+| Feature | User Experience | Add additional filtering options to DDL filters.DDL filters allow for replication of different DDL statements to different replication sets. Similar to how table membership in replication set allows DML on different tables to be replicated via different replication sets.
This release adds new controls that make it easier to use the DDL filters:
+- `query_match` - if defined, the query must match this regex
+- `exclusive` - if true, other matched filters are not taken into consideration (that is, only the exclusive filter is applied); when multiple exclusive filters match, an error is thrown
+When enabled, this changes the behavior of the `LOCK TABLE` command to take a global DML lock
+| Feature | Performance and scalability | Implement buffered write for LCR segment fileThis should reduce I/O and improve CPU usage of the Decoding Worker.
+| Feature | User Experience | Add support for partial unique index lookups for conflict detection (RT78368).Indexes on expressions are, however, still not supported for conflict detection.
+| Feature | User Experience | Add additional statistics to `bdr.stat_subscription`:This allows optional queueing of ddl commands for replication to other groups without executing it locally.
+| Feature | User Experience | Add `fast` argument to `bdr.alter_subscription_disable()` (RT79798)The argument only influences the behavior of `immediate`. When set to `true` (default) it will stop the workers without letting them finish the current work.
+| Feature | User Experience | Keep the `bdr.worker_error` records permanently for all types of workers.BDR used to remove receiver and writer errors when those workers managed to replicate the LSN that was previously resulting in error. However this was inconsistent with how other workers behaved, as other worker errors were permanent and it also made the troubleshooting of past issues harder. So keep the last error record permanently for every worker type.
+| Feature | User Experience | Simplify `bdr.{add,remove}_camo_pair` functions to return void. +| Feature | Initial Experience | Add connectivity/lag check before taking global lock.So that the application or user doesn't have to wait minutes for a lock timeout when there are obvious connectivity issues.
Can be set to DEBUG, LOG, WARNING (default) or ERROR.
+| Feature | Initial Experience | Only log conflicts to conflict log table by default. They are no longer logged to the server log file by default, but this can be overridden. +| Feature | User Experience | Improve reporting of remote errors during node join. +| Feature | Reliability and operability | Make autopartition worker's max naptime configurable. +| Feature | User Experience | Add ability to request partitions up to the given upper bound with autopartition. +| Feature | Initial Experience | Don't try to replicate DDL run on a subscribe-only node. It has nowhere to replicate so any attempt to do so will fail. This is the same as how logical standbys behave. +| Feature | User Experience | Add `bdr.accept_connections` configuration variable. When `false`, walsender connections to replication slots using BDR output plugin will fail. This is useful primarily during restore of single node from backup. +| Bug fix | Reliability and operability | Keep the `lock_timeout` as configured on non-CAMO-partner BDR nodesA CAMO partner uses a low `lock_timeout` when applying transactions from its origin node. This was inadvertently done for all BDR nodes rather than just the CAMO partner, which may have led to spurious `lock_timeout` errors on pglogical writer processes on normal BDR nodes.
+| Bug fix | User Experience | Show a proper wait event for CAMO / Eager confirmation waits (RT75900)Show correct "BDR Prepare Phase"/"BDR Commit Phase" in `bdr.stat_activity` instead of the default “unknown wait event”.
+| Bug fix | User Experience | Reduce log for bdr.run_on_nodes (RT80973)Don't log when setting `bdr.ddl_replication` to off if it's done with the "run_on_nodes" variants of the function. This eliminates the flood of logs for monitoring functions.
+| Bug fix | Reliability and operability | Fix replication of arrays of composite types and arrays of builtin types that don't support binary network encoding +| Bug fix | Reliability and operability | Fix replication of data types created during bootstrap +| Bug fix | Performance and scalability | Confirm end LSN of the running transactions record processed by the WAL decoder so that the WAL decoder slot remains up to date and WAL senders get the candidate in a timely manner. +| Bug fix | Reliability and operability | Don't wait for autopartition tasks to complete on parting nodes +| Bug fix | Reliability and operability | Limit the `bdr.standby_slot_names` check when reporting flush position only to physical slots (RT77985, RT78290)Otherwise flush progress is not reported in presence of disconnected nodes when using `bdr.standby_slot_names`.
+| Bug fix | Reliability and operability | Request feedback reply from walsender if we are close to wal_receiver_timeout +| Bug fix | Reliability and operability | Don't record dependency of auto-partitioned table on BDR extension more than once.This resulted in "ERROR: unexpected number of extension dependency records" errors from auto-partition and broken replication on conflicts when this happens.
+Note that existing broken tables still need to be fixed manually by removing the double dependency from `pg_depend`.
+| Bug fix | Reliability and operability | Improve keepalive handling in receiver.Don't update the position based on a keepalive when in the middle of a streaming transaction, as we might lose data on crash if we do that.
+There is also new flush and signalling logic that should improve latency in low TPS scenarios. +| Bug fix | Reliability and operability | Only do post `CREATE` commands processing when BDR node exists in the database. +| Bug fix | Reliability and operability | Don't try to log ERROR conflicts to conflict history table. +| Bug fix | Reliability and operability | Fixed segfault where a conflict_slot was being used after it was released during multi-insert (COPY) (RT76439). +| Bug fix | Reliability and operability | Prevent walsender processes spinning when facing lagging standby slots (RT80295, RT78290).Correct signaling to reset a latch so that a walsender process doesn't consume 100% of a CPU in case one of the standby slots is lagging behind.
+| Bug fix | Reliability and operability | Fix handling of `wal_sender_timeout` when `bdr.standby_slot_names` are used (RT78290) +| Bug fix | Reliability and operability | Make ALTER TABLE lock the underlying relation only once (RT80204). +| Bug fix | User Experience | Fix reporting of disconnected slots in `bdr.monitor_local_replslots`. They could have been previously reported as missing instead of disconnected. +| Bug fix | Reliability and operability | Fix apply timestamp reporting for down subscriptions in `bdr.get_subscription_progress()` function and in the `bdr.subscription_summary` that uses that function. It would report garbage value before. +| Bug fix | Reliability and operability | Fix snapshot handling in various places in BDR workers. +| Bug fix | User Experience | Be more consistent about reporting timestamps and LSNs as NULLs in monitoring functions when there is no available value for those. +| Bug fix | Reliability and operability | Reduce log information when switching between writer processes. +| Bug fix | Reliability and operability | Don't do superuser check when configuration parameter was specified on PG command-line. We can't do transactions there yet and it's guaranteed to be superuser changed at that stage. +| Bug fix | Reliability and operability | Use 64 bits for calculating lag size in bytes. To eliminate risk of overflow with large lag. + + +### Upgrades + +This release supports upgrading from the following versions of BDR: + +- 4.0.0 and higher +- 3.7.15 +- 3.7.16 + +Please make sure you read and understand the process and limitations described +in the [Upgrade Guide](/pgd/latest/upgrades/) before upgrading. diff --git a/product_docs/docs/pgd/4/overview/bdr/release_notes/bdr4.1.1_rel_notes.mdx b/product_docs/docs/pgd/4/overview/bdr/release_notes/bdr4.1.1_rel_notes.mdx new file mode 100644 index 00000000000..e23b928a0cf --- /dev/null +++ b/product_docs/docs/pgd/4/overview/bdr/release_notes/bdr4.1.1_rel_notes.mdx @@ -0,0 +1,29 @@ +--- +title: "BDR 4.1.1" +--- + +This is a maintenance release of BDR 4 which includes new features as well +as fixes for issues identified in previous versions. + + +| Type | Category | Description | +| ---- | -------- | ----------- | +| Feature | User Experience | Add generic function bdr.is_node_connected returns true if the walsender for a given peer is active. | +| Feature | User Experience | Add generic function bdr.is_node_ready returns boolean if the lag is under a specific span. | +| Bug fix | User Experience | Add support for a `--link` argument to bdr_pg_upgrade for using hard-links. | +| Bug fix | User Experience | Prevent removing a `bdr.remove_commit_scope` if still referenced by any `bdr.node_group` as the default commit scope. | +| Bug fix | Reliability and operability | Correct Raft based switching to Local Mode for CAMO pairs of nodes (RT78928) | +| Bug fix | Reliability and operability | Prevent a potential segfault in bdr.drop_node for corner cases (RT81900) | +| Bug fix | User Experience | Prevent use of CAMO or Eager All Node transactions in combination with transaction streamingThis allows BDR to stream a large transaction (greater than `logical_decoding_work_mem` in size) either to a file on the downstream or to a writer process. This ensures that the transaction is decoded even before it's committed, thus improving parallelism. Further, the transaction can even be applied concurrently if streamed straight to a writer. This improves parallelism even more.
When large transactions are streamed to files, they are decoded and the decoded changes are sent to the downstream even before they are committed. The changes are written to a set of files and applied when the transaction finally commits. If the transaction aborts, the changes are discarded, thus wasting resources on both upstream and downstream.
Sub-transactions are also handled automatically.
+This feature is available on PostgreSQL 14, EDB Postgres Extended 13+ and EDB Postgres Advanced 14. See the [Choosing a Postgres distribution](/pgd/latest/choosing_server/) appendix for more details on which features can be used on which versions of Postgres.
+| Feature | Compatibility | The differences that existed in earlier versions of BDR between the standard and enterprise editions have been removed. With BDR 4.0 there is one extension for each supported Postgres distribution and version, i.e., PostgreSQL v12-14, EDB Postgres Extended v12-14, and EDB Postgres Advanced 12-14.Not all features are available on all versions of PostgreSQL; the available features are reported via feature flags using either the `bdr_config` command-line utility or the `bdr.bdr_features()` database function. See [Choosing a Postgres distribution](/pgd/latest/choosing_server/) for more details.
+| Feature | User Experience | There is no pglogical 4.0 extension that corresponds to the BDR 4.0 extension. BDR no longer has a requirement for pglogical.This also means that only the BDR extension and schema exist, and any configuration parameters were renamed from `pglogical.` to `bdr.`.
+| Feature | Initial experience | Some configuration options have changed defaults for better post-install experience:To reduce chances of misconfiguration and make CAMO pairs within the BDR cluster known globally, move the CAMO configuration from the individual node's postgresql.conf to BDR system catalogs managed by Raft. This can, for example, prevent inadvertently dropping a node that's still configured to be a CAMO partner for another active node.
Please see the [Upgrades chapter](/pgd/latest/upgrades/#upgrading-a-camo-enabled-cluster) for details on the upgrade process.
This deprecates GUCs `bdr.camo_partner_of` and `bdr.camo_origin_for` and replaces the functions `bdr.get_configured_camo_origin_for()` and `get_configured_camo_partner_of` with `bdr.get_configured_camo_partner`.
+ +## Upgrades + +This release supports upgrading from the following version of BDR: + +- 3.7.13.1 + +Please make sure you read and understand the process and limitations described +in the [Upgrade Guide](/pgd/latest/upgrades/) before upgrading. diff --git a/product_docs/docs/pgd/4/overview/bdr/release_notes/index.mdx b/product_docs/docs/pgd/4/overview/bdr/release_notes/index.mdx new file mode 100644 index 00000000000..bfac5a52d48 --- /dev/null +++ b/product_docs/docs/pgd/4/overview/bdr/release_notes/index.mdx @@ -0,0 +1,25 @@ +--- +title: Release Notes +navigation: +- bdr4.1.1_rel_notes +- bdr4.1.0_rel_notes +- bdr4.0.2_rel_notes +- bdr4.0.1_rel_notes +- bdr4_rel_notes +--- + +BDR is a PostgreSQL extension providing multi-master replication and data +distribution with advanced conflict management, data-loss protection, and +throughput up to 5X faster than native logical replication, and enables +distributed PostgreSQL clusters with a very high availability. + +The release notes in this section provide information on what was new in each release. + +| Version | Release Date | +| ----------------------- | ------------ | +| [4.1.1](bdr4.1.1_rel_notes) | 2022 June 21 | +| [4.1.0](bdr4.1.0_rel_notes) | 2022 May 17 | +| [4.0.2](bdr4.0.2_rel_notes) | 2022 Feb 15 | +| [4.0.1](bdr4.0.1_rel_notes) | 2022 Jan 18 | +| [4.0.0](bdr4_rel_notes) | 2021 Dec 01 | + diff --git a/product_docs/docs/pgd/4/overview/bdr/repsets.mdx b/product_docs/docs/pgd/4/overview/bdr/repsets.mdx new file mode 100644 index 00000000000..3e51c74036d --- /dev/null +++ b/product_docs/docs/pgd/4/overview/bdr/repsets.mdx @@ -0,0 +1,664 @@ +--- +title: Replication sets + + +--- + +A replication set is a group of tables that a BDR node can subscribe to. +You can use replication sets to create more complex replication topologies +than regular symmetric multi-master where each node is an exact copy of the other +nodes. + +Every BDR group creates a replication set with the same name as +the group. This replication set is the default replication set, which is +used for all user tables and DDL replication. All nodes are subscribed to it. +In other words, by default all user tables are replicated between all nodes. + +## Using replication sets + +You can create replication sets using `create_replication_set()`, +specifying whether to include insert, update, delete, or truncate actions. +One option lets you add existing tables to the set, and +a second option defines whether to add tables when they are +created. + +You can also manually define the tables to add or remove from a +replication set. + +Tables included in the replication set are maintained when the node +joins the cluster and afterwards. + +Once the node is joined, you can still remove tables from the replication +set, but you must add new tables using a resync operation. + +By default, a newly defined replication set doesn't replicate DDL or BDR +administration function calls. Use `replication_set_add_ddl_filter` +to define the commands to replicate. + +BDR creates replication set definitions on all nodes. Each node can then be +defined to publish or subscribe to each replication set using +`alter_node_replication_sets`. + +You can use functions to alter these definitions later or to drop the replication +set. + +!!! Note + Don't use the default replication set for selective replication. + Don't drop or modify the default replication set on any of + the BDR nodes in the cluster as it is also used by default for DDL + replication and administration function calls. 
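+
+For example, a minimal sketch of setting up a selective replication set might look like the following. The set, table, node, and group names are illustrative only; the functions used here are described in detail later in this section:
+
+```sql
+-- Create a replication set that replicates all DML actions
+SELECT bdr.create_replication_set('myrepset');
+
+-- Add an existing table to the new set
+SELECT bdr.replication_set_add_table('public.mytable', 'myrepset');
+
+-- Configure the sets the local node publishes and subscribes to;
+-- 'mynode' is the local node and 'bdrgroup' stands for the group default set
+SELECT bdr.alter_node_replication_sets('mynode', ARRAY['bdrgroup', 'myrepset']);
+```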
+ +## Behavior of partitioned tables + +BDR supports partitioned tables transparently, meaning that you can add a partitioned +table to a replication set. +Changes that involve any of the partitions are replicated downstream. + +!!! Note + When partitions are replicated through a partitioned table, the + statements executed directly on a partition are replicated as they + were executed on the parent table. The exception is the `TRUNCATE` command, + which always replicates with the list of affected tables or partitions. + +You can add individual partitions to the replication set, in +which case they are replicated like regular tables (to the table of the +same name as the partition on the downstream). This has some performance +advantages if the partitioning definition is the same on both provider and +subscriber, as the partitioning logic doesn't have to be executed. + +!!! Note + If a root partitioned table is part of any replication set, memberships + of individual partitions are ignored. only the membership of that root + table is taken into account. + +## Behavior with foreign keys + +A foreign key constraint ensures that each row in the referencing +table matches a row in the referenced table. Therefore, if the +referencing table is a member of a replication set, the referenced +table must also be a member of the same replication set. + +The current version of BDR doesn't automatically check for or enforce +this condition. When adding a table to a replication set, the database administrator must +make sure +that all the tables referenced by foreign keys are also added. + +You can use the following query to list all the foreign keys and +replication sets that don't satisfy this requirement. +The referencing table is a member of the replication set, while the +referenced table isn't: + +```sql +SELECT t1.relname, + t1.nspname, + fk.conname, + t1.set_name + FROM bdr.tables AS t1 + JOIN pg_catalog.pg_constraint AS fk + ON fk.conrelid = t1.relid + AND fk.contype = 'f' + WHERE NOT EXISTS ( + SELECT * + FROM bdr.tables AS t2 + WHERE t2.relid = fk.confrelid + AND t2.set_name = t1.set_name +); +``` + +The output of this query looks like the following: + +```sql + relname | nspname | conname | set_name +---------+---------+-----------+---------- + t2 | public | t2_x_fkey | s2 +(1 row) +``` + +This means that table `t2` is a member of replication set `s2`, but the +table referenced by the foreign key `t2_x_fkey` isn't. + +The `TRUNCATE CASCADE` command takes into account the +replication set membership before replicating the command. For example: + +```sql +TRUNCATE table1 CASCADE; +``` + +This becomes a `TRUNCATE` without cascade on all the tables that are +part of the replication set only: + +```sql +TRUNCATE table1, referencing_table1, referencing_table2 ... +``` + +## Replication set management + +Management of replication sets. + +With the exception of `bdr.alter_node_replication_sets`, the following +functions are considered to be `DDL`. DDL replication and global locking +apply to them, if that's currently active. See [DDL replication](ddl). + +### bdr.create_replication_set + +This function creates a replication set. + +Replication of this command is affected by DDL replication configuration +including DDL filtering settings. 
+ +#### Synopsis + +```sql +bdr.create_replication_set(set_name name, + replicate_insert boolean DEFAULT true, + replicate_update boolean DEFAULT true, + replicate_delete boolean DEFAULT true, + replicate_truncate boolean DEFAULT true, + autoadd_tables boolean DEFAULT false, + autoadd_existing boolean DEFAULT true) +``` + +#### Parameters + +- `set_name` — Name of the new replication set. Must be unique across the BDR + group. +- `replicate_insert` — Indicates whether to replicate inserts into tables in this + replication set. +- `replicate_update` — Indicates whether to replicate updates of tables in this + replication set. +- `replicate_delete` — Indicates whether to replicate deletes from tables in this + replication set. +- `replicate_truncate` — Indicates whether to replicate truncates of tables in this + replication set. +- `autoadd_tables` — Indicates whether to replicate newly created (future) tables + to this replication set +- `autoadd_existing` — Indicates whether to add all existing user tables + to this replication set. This parameter has an effect only if `autoadd_tables` is + set to `true`. + +#### Notes + +By default, new replication sets don't replicate DDL or BDR administration +function calls. See [ddl filters](repsets#ddl-replication-filtering) for how to set +up DDL replication for replication sets. A preexisting DDL filter +is set up for the default group replication set that replicates all DDL and admin +function calls. It's created when the group is created but can be dropped +in case you don't want the BDR group default replication set to replicate +DDL or the BDR administration function calls. + +This function uses the same replication mechanism as `DDL` statements. This means +that the replication is affected by the [ddl filters](repsets#ddl-replication-filtering) +configuration. + +The function takes a `DDL` global lock. + +This function is transactional. You can roll back the effects with the +`ROLLBACK` of the transaction. The changes are visible to the current +transaction. + +### bdr.alter_replication_set + +This function modifies the options of an existing replication set. + +Replication of this command is affected by DDL replication configuration, +including DDL filtering settings. + +#### Synopsis + +```sql +bdr.alter_replication_set(set_name name, + replicate_insert boolean DEFAULT NULL, + replicate_update boolean DEFAULT NULL, + replicate_delete boolean DEFAULT NULL, + replicate_truncate boolean DEFAULT NULL, + autoadd_tables boolean DEFAULT NULL) +``` + +#### Parameters + +- `set_name` — Name of an existing replication set. +- `replicate_insert` — Indicates whether to replicate inserts into tables in this + replication set. +- `replicate_update` — Indicates whether to replicate updates of tables in this + replication set. +- `replicate_delete` — Indicates whether to replicate deletes from tables in this + replication set. +- `replicate_truncate` — Indicates whether to replicate truncates of tables in this + replication set. +- `autoadd_tables` — Indicates whether to add newly created (future) tables to this replication set. + +Any of the options that are set to NULL (the default) remain the same as +before. + +#### Notes + +This function uses the same replication mechanism as `DDL` statements. This means +the replication is affected by the [ddl filters](repsets#ddl-replication-filtering) +configuration. + +The function takes a `DDL` global lock. + +This function is transactional. You can roll back the effects with the +`ROLLBACK` of the transaction. 
The changes are visible to the current +transaction. + +### bdr.drop_replication_set + +This function removes an existing replication set. + +Replication of this command is affected by DDL replication configuration, +including DDL filtering settings. + +#### Synopsis + +```sql +bdr.drop_replication_set(set_name name) +``` + +#### Parameters + +- `set_name` — Name of an existing replication set. + +#### Notes + +This function uses the same replication mechanism as `DDL` statements. This means +the replication is affected by the [ddl filters](repsets#ddl-replication-filtering) +configuration. + +The function takes a `DDL` global lock. + +This function is transactional. You can roll back the effects with the +`ROLLBACK` of the transaction. The changes are visible to the current +transaction. + +!!! Warning + Don't drop a replication set that's being used by at least + another node, because doing so stops replication on that + node. If that happens, unsubscribe the affected node + from that replication set. + For the same reason, don't drop a replication set with + a join operation in progress when the node being joined + is a member of that replication set. Replication set membership is + checked only at the beginning of the join. + This happens because the information on replication set usage is + local to each node, so that you can configure it on a node before + it joins the group. + +You can manage replication set subscription for a node using `alter_node_replication_sets`. + +### bdr.alter_node_replication_sets + +This function changes the replication sets a node publishes and is subscribed to. + +#### Synopsis + +```sql +bdr.alter_node_replication_sets(node_name name, + set_names text[]) +``` + +#### Parameters + +- `node_name` — The node to modify. Currently has to be local node. +- `set_names` — Array of replication sets to replicate to the specified + node. An empty array results in the use of the group default replication set. + +#### Notes + +This function is executed only on the local node and isn't replicated in any manner. + +The replication sets listed aren't checked for existence, +since this function is designed to execute before the node joins. Be careful +to specify replication set names correctly to avoid errors. + +This allows for calling the function not only on the node that's part of the +BDR group but also on a node that hasn't joined any group yet. This approach limits +the data synchronized during the join. However, the schema is +always fully synchronized without regard to the replication sets setting. +All tables are copied across, not just the ones specified +in the replication set. You can drop unwanted tables by referring to +the `bdr.tables` catalog table. These might be removed automatically in later +versions of BDR. This is currently true even if the [ddl filters](repsets#ddl-replication-filtering) +configuration otherwise prevent replication of DDL. + +The replication sets that the node subscribes to after this call are published +by the other nodes for actually replicating the changes from those nodes to +the node where this function is executed. + +## Replication set membership + +You can add tables to or remove them from one or more replication sets. This +affects replication only of changes (DML) in those tables. Schema changes (DDL) are +handled by DDL replication set filters (see [DDL replication filtering](#ddl-replication-filtering)). 
+ +The replication uses the table membership in replication sets +with the node replication sets configuration to determine the actions to +replicate to which node. The decision is done using the union of all the +memberships and replication set options. Suppose that a table is a member +of replication set A that replicates only INSERT actions and replication set B that +replicates only UPDATE actions. Both INSERT and UPDATE act8ions are replicated if the +target node is also subscribed to both replication set A and B. + +### bdr.replication_set_add_table + +This function adds a table to a replication set. + +This adds a table to a replication set and starts replicating changes +from this moment (or rather transaction commit). Any existing data the table +might have on a node isn't synchronized. + +Replication of this command is affected by DDL replication configuration, +including DDL filtering settings. + +#### Synopsis + +```sql +bdr.replication_set_add_table(relation regclass, + set_name name DEFAULT NULL, + columns text[] DEFAULT NULL, + row_filter text DEFAULT NULL) +``` + +#### Parameters + +- `relation` — Name or Oid of a table. +- `set_name` — Name of the replication set. If NULL (the default), then the BDR + group default replication set is used. +- `columns` — Reserved for future use (currently does nothing and must be NULL). +- `row_filter` — SQL expression to be used for filtering the replicated rows. + If this expression isn't defined (that is, set to NULL, the default) then all rows are sent. + +The `row_filter` specifies an expression producing a Boolean result, with NULLs. +Expressions evaluating to True or Unknown replicate the row. A False value +doesn't replicate the row. Expressions can't contain subqueries or refer to +variables other than columns of the current row being replicated. You can't reference system +columns. + +`row_filter` executes on the origin node, not on the target node. This puts an +additional CPU overhead on replication for this specific table but +completely avoids sending data for filtered rows. Hence network +bandwidth is reduced and overhead on the target node is applied. + +`row_filter` never removes `TRUNCATE` commands for a specific table. +You can filter away `TRUNCATE` commands at the replication set level. + +You can replicate just some columns of a table. See +[Replicating between nodes with differences](appusage). + +#### Notes + +This function uses the same replication mechanism as `DDL` statements. This means +that the replication is affected by the [ddl filters](repsets#ddl-replication-filtering) +configuration. + +The function takes a `DML` global lock on the relation that's being +added to the replication set if the `row_filter` isn't NULL. Otherwise +it takes just a `DDL` global lock. + +This function is transactional. You can roll back the effects with the +`ROLLBACK` of the transaction. The changes are visible to the current +transaction. + +### bdr.replication_set_remove_table + +This function removes a table from the replication set. + +Replication of this command is affected by DDL replication configuration, +including DDL filtering settings. + +#### Synopsis + +```sql +bdr.replication_set_remove_table(relation regclass, + set_name name DEFAULT NULL) +``` + +#### Parameters + +- `relation` — Name or Oid of a table. +- `set_name` — Name of the replication set. If NULL (the default), then the BDR + group default replication set is used. + +#### Notes + +This function uses the same replication mechanism as `DDL` statements. 
This means +the replication is affected by the [ddl filters](repsets#ddl-replication-filtering) +configuration. + +The function takes a `DDL` global lock. + +This function is transactional. You can roll back the effects with the +`ROLLBACK` of the transaction. The changes are visible to the current +transaction. + +## Listing replication sets + +You can list existing replication sets with the following query: + +```sql +SELECT set_name +FROM bdr.replication_sets; +``` + +You can use this query to list all the tables in a given replication +set: + +```sql +SELECT nspname, relname +FROM bdr.tables +WHERE set_name = 'myrepset'; +``` + +In [Behavior with foreign keys](#behavior-with-foreign-keys), we show a +query that lists all the foreign keys whose referenced table isn't +included in the same replication set as the referencing table. + +Use the following SQL to show those replication sets that the +current node publishes and subscribes from: + +```sql + SELECT node_id, + node_name, + COALESCE( + pub_repsets, pub_repsets + ) AS pub_repsets, + COALESCE( + sub_repsets, sub_repsets + ) AS sub_repsets + FROM bdr.local_node_summary; +``` + +This code produces output like this: + +```sql + node_id | node_name | pub_repsets | sub_repsets +------------+-----------+---------------------------------------- + 1834550102 | s01db01 | {bdrglobal,bdrs01} | {bdrglobal,bdrs01} +(1 row) +``` + +To execute the same query against all nodes in the cluster, you can use the following query. This approach gets +the replication sets associated with all nodes at the same time. + +```sql +WITH node_repsets AS ( + SELECT jsonb_array_elements( + bdr.run_on_all_nodes($$ + SELECT + node_id, + node_name, + COALESCE( + pub_repsets, pub_repsets + ) AS pub_repsets, + COALESCE( + sub_repsets, sub_repsets + ) AS sub_repsets + FROM bdr.local_node_summary; + $$)::jsonb + ) AS j +) +SELECT j->'response'->'command_tuples'->0->>'node_id' AS node_id, + j->'response'->'command_tuples'->0->>'node_name' AS node_name, + j->'response'->'command_tuples'->0->>'pub_repsets' AS pub_repsets, + j->'response'->'command_tuples'->0->>'sub_repsets' AS sub_repsets +FROM node_repsets; +``` + +This shows, for example: + +```sql + node_id | node_name | pub_repsets | sub_repsets +------------+-----------+---------------------------------------- + 933864801 | s02db01 | {bdrglobal,bdrs02} | {bdrglobal,bdrs02} + 1834550102 | s01db01 | {bdrglobal,bdrs01} | {bdrglobal,bdrs01} + 3898940082 | s01db02 | {bdrglobal,bdrs01} | {bdrglobal,bdrs01} + 1102086297 | s02db02 | {bdrglobal,bdrs02} | {bdrglobal,bdrs02} +(4 rows) +``` + +## DDL replication filtering + +By default, the replication of all supported DDL happens by way of the default BDR +group replication set. This is achieved with a DDL filter with +the same name as the BDR group. This filter is added to the default BDR +group replication set when the BDR group is created. + +You can adjust this by changing the DDL replication filters for +all existing replication sets. These filters are independent of table +membership in the replication sets. Just like data changes, each DDL statement +is replicated only once, even if it's matched by multiple filters on +multiple replication sets. + +You can list existing DDL filters with the following query, which +shows for each filter the regular expression applied to the command +tag and to the role name: + +```sql +SELECT * FROM bdr.ddl_replication; +``` + +You can use the following functions to manipulate DDL filters. 
+They are considered to be `DDL` and are therefore subject to DDL +replication and global locking. + +### bdr.replication_set_add_ddl_filter + +This function adds a DDL filter to a replication set. + +Any DDL that matches the given filter is replicated to any node that's +subscribed to that set. This also affects replication of BDR admin functions. + +This doesn't prevent execution of DDL on any node. It only +alters whether DDL is replicated to other nodes. Suppose two nodes have +a replication filter between them that excludes all index commands. Index commands can still +be executed freely by directly connecting to +each node and executing the desired DDL on that node. + +The DDL filter can specify a `command_tag` and `role_name` to allow +replication of only some DDL statements. The `command_tag` is the same as those +used by [EVENT TRIGGERs](https://www.postgresql.org/docs/current/static/event-trigger-matrix.html) +for regular PostgreSQL commands. A typical example might be to create a +filter that prevents additional index commands on a logical standby from +being replicated to all other nodes. + +You can filter the BDR admin functions used by using a tagname matching the +qualified function name. For example, `bdr.replication_set_add_table` is the +command tag for the function of the same name. In this case, this tag allows all BDR +functions to be filtered using `bdr.*`. + +The `role_name` is used for matching against the current role that is executing +the command. Both `command_tag` and `role_name` are evaluated as regular +expressions, which are case sensitive. + +#### Synopsis + +```sql +bdr.replication_set_add_ddl_filter(set_name name, + ddl_filter_name text, + command_tag text, + role_name text DEFAULT NULL) +``` + +#### Parameters + +- `set_name` — Name of the replication set. if NULL then the BDR + group default replication set is used. +- `ddl_filter_name` — Name of the DDL filter. This must be unique across the + whole BDR group. +- `command_tag` — Regular expression for matching command tags. NULL means + match everything. +- `role_name` — Regular expression for matching role name. NULL means + match all roles. + +#### Notes + +This function uses the same replication mechanism as `DDL` statements. This means +that the replication is affected by the [ddl filters](repsets#ddl-replication-filtering) +configuration. This also means that replication of changes to ddl +filter configuration is affected by the existing ddl filter configuration. + +The function takes a `DDL` global lock. + +This function is transactional. You can roll back the effects with the +`ROLLBACK` of the transaction. The changes are visible to the current +transaction. + +To view the defined replication filters, use the view `bdr.ddl_replication`. + +#### Examples + +To include only BDR admin functions, define a filter like this: + +```sql +SELECT bdr.replication_set_add_ddl_filter('mygroup', 'mygroup_admin', $$bdr\..*$$); +``` + +To exclude everything apart from index DDL: + +```sql +SELECT bdr.replication_set_add_ddl_filter('mygroup', 'index_filter', + '^(?!(CREATE INDEX|DROP INDEX|ALTER INDEX)).*'); +``` + +To include all operations on tables and indexes but exclude all others, add +two filters: one for tables, one for indexes. 
This shows that +multiple filters provide the union of all allowed DDL commands: + +```sql +SELECT bdr.replication_set_add_ddl_filter('bdrgroup','index_filter', '^((?!INDEX).)*$'); +SELECT bdr.replication_set_add_ddl_filter('bdrgroup','table_filter', '^((?!TABLE).)*$'); +``` + +### bdr.replication_set_remove_ddl_filter + +This function removes the DDL filter from a replication set. + +Replication of this command is affected by DDL replication configuration, +including DDL filtering settings themselves. + +#### Synopsis + +```sql +bdr.replication_set_remove_ddl_filter(set_name name, + ddl_filter_name text) +``` + +#### Parameters + +- `set_name` — Name of the replication set. If NULL then the BDR + group default replication set is used. +- `ddl_filter_name` — Name of the DDL filter to remove. + +#### Notes + +This function uses the same replication mechanism as `DDL` statements. This +means that the replication is affected by the +[ddl filters](repsets#ddl-replication-filtering) configuration. +This also means that replication of changes to the DDL filter configuration is +affected by the existing DDL filter configuration. + +The function takes a `DDL` global lock. + +This function is transactional. You can roll back the effects with the +`ROLLBACK` of the transaction. The changes are visible to the current +transaction. diff --git a/product_docs/docs/pgd/4/overview/bdr/scaling.mdx b/product_docs/docs/pgd/4/overview/bdr/scaling.mdx new file mode 100644 index 00000000000..8e20d943774 --- /dev/null +++ b/product_docs/docs/pgd/4/overview/bdr/scaling.mdx @@ -0,0 +1,340 @@ +--- +title: AutoPartition +--- + +AutoPartition allows tables to grow easily to large sizes by automatic +partitioning management. This capability uses features of BDR +such as low-conflict locking of creating and dropping partitions. + +You can create new partitions regularly and then drop them when the +data retention period expires. + +BDR management is primarily accomplished by functions that can be called by SQL. +All functions in BDR are exposed in the `bdr` schema. Unless you put it into +your search_path, you need to schema-qualify the name of each function. + +## Auto creation of partitions + +`bdr.autopartition()` creates or alters the definition of automatic +range partitioning for a table. If no definition exists, it's created. +Otherwise, later executions will alter the definition. + +`bdr.autopartition()` doesn't lock the actual table. It changes the +definition of when and how new partition maintenance actions take place. + +`bdr.autopartition()` leverages the features that allow a partition to be +attached or detached/dropped without locking the rest of the table +(when the underlying Postgres version supports it). + +An ERROR is raised if the table isn't RANGE partitioned or a multi-column +partition key is used. + +A new partition is added for every `partition_increment` range of values, with +lower and upper bound `partition_increment` apart. For tables with a partition +key of type `timestamp` or `date`, the `partition_increment` must be a valid +constant of type `interval`. For example, specifying `1 Day` causes a new +partition to be added each day, with partition bounds that are one day apart. + +If the partition column is connected to a `snowflakeid`, `timeshard`, or `ksuuid` sequence, +you must specify the `partition_increment` as type `interval`. Otherwise, +if the partition key is integer or numeric, then the `partition_increment` +must be a valid constant of the same datatype. 
For example, specifying +`1000000` causes new partitions to be added every 1 million values. + +If the table has no existing partition, then the specified +`partition_initial_lowerbound` is used as the lower bound for the first +partition. If you don't specify `partition_initial_lowerbound`, then the system +tries to derive its value from the partition column type and the specified +`partition_increment`. For example, if `partition_increment` is specified as `1 Day`, +then `partition_initial_lowerbound` is set to CURRENT +DATE. If `partition_increment` is specified as `1 Hour`, then +`partition_initial_lowerbound` is set to the current hour of the current +date. The bounds for the subsequent partitions are set using the +`partition_increment` value. + +The system always tries to have a certain minimum number of advance partitions. +To decide whether to create new partitions, it uses the +specified `partition_autocreate_expression`. This can be an expression that can be evaluated by SQL, +which is evaluated every time a check is performed. For example, +for a partitioned table on column type `date`, if +`partition_autocreate_expression` is specified as `DATE_TRUNC('day',CURRENT_DATE)`, +`partition_increment` is specified as `1 Day` and +`minimum_advance_partitions` is specified as 2, then new partitions are +created until the upper bound of the last partition is less than +`DATE_TRUNC('day', CURRENT_DATE) + '2 Days'::interval`. + +The expression is evaluated each time the system checks for new partitions. + +For a partitioned table on column type `integer`, you can specify the +`partition_autocreate_expression` as `SELECT max(partcol) FROM +schema.partitioned_table`. The system then regularly checks if the maximum value of +the partitioned column is within the distance of `minimum_advance_partitions * partition_increment` +of the last partition's upper bound. Create an index on the `partcol` so that the query runs efficiently. +If the `partition_autocreate_expression` isn't specified for a partition table +on column type `integer`, `smallint`, or `bigint`, then the system +sets it to `max(partcol)`. + +If the `data_retention_period` is set, partitions are +dropped after this period. Partitions are dropped at the same time as new +partitions are added, to minimize locking. If this value isn't set, you must drop the partitions manually. + +The `data_retention_period` parameter is supported only for timestamp (and +related) based partitions. The period is calculated by considering the upper +bound of the partition. The partition is either migrated to the secondary +tablespace or dropped if either of the given period expires, relative to the +upper bound. + +By default, AutoPartition manages partitions globally. In other words, when a +partition is created on one node, the same partition is also created on all +other nodes in the cluster. So all partitions are consistent and guaranteed to +be available. For this, AutoPartition makes use of Raft. You can change this behavior +by passing `managed_locally` as `true`. In that case, all partitions +are managed locally on each node. This is useful for the case when the +partitioned table isn't a replicated table and hence it might not be necessary +or even desirable to have all partitions on all nodes. For example, the +built-in `bdr.conflict_history` table isn't a replicated table and is +managed by AutoPartition locally. Each node creates partitions for this table +locally and drops them once they are old enough. 
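+
+For illustration only (the table name and increment are assumptions), a locally managed configuration might look like this:
+
+```sql
+-- Partitions for this table are created and dropped on each node independently
+SELECT bdr.autopartition('my_local_table', '1 day',
+        managed_locally := true);
+```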
+ +You can't later change tables marked as `managed_locally` to be managed +globally and vice versa. + +Activities are performed only when the entry is marked `enabled = on`. + +You aren't expected to manually create or drop partitions for tables +managed by AutoPartition. Doing so can make the AutoPartition metadata +inconsistent and might cause it to fail. + +### Configure AutoPartition + +The `bdr.autopartition` function configures automatic partitioning of a table. + +#### Synopsis + +```sql +bdr.autopartition(relation regclass, + partition_increment text, + partition_initial_lowerbound text DEFAULT NULL, + partition_autocreate_expression text DEFAULT NULL, + minimum_advance_partitions integer DEFAULT 2, + maximum_advance_partitions integer DEFAULT 5, + data_retention_period interval DEFAULT NULL, + managed_locally boolean DEFAULT false, + enabled boolean DEFAULT on); +``` + +#### Parameters + +- `relation` — Name or Oid of a table. +- `partition_increment` — Interval or increment to next partition creation. +- `partition_initial_lowerbound` — If the table has no partition, then the + first partition with this lower bound and `partition_increment` apart upper + bound is created. +- `partition_autocreate_expression` — Used to detect if it's time to create new partitions. +- `minimum_advance_partitions` — The system attempts to always have at + least `minimum_advance_partitions` partitions. +- `maximum_advance_partitions` — Number of partitions to be created in a single + go once the number of advance partitions falls below `minimum_advance_partitions`. +- `data_retention_period` — Interval until older partitions are dropped, if + defined. This value must be greater than `migrate_after_period`. +- `managed_locally` — If true, then the partitions are managed locally. +- `enabled` — Allows activity to be disabled or paused and later resumed or reenabled. + +#### Examples + +Daily partitions, keep data for one month: + +```sql +CREATE TABLE measurement ( +logdate date not null, +peaktemp int, +unitsales int +) PARTITION BY RANGE (logdate); + +bdr.autopartition('measurement', '1 day', data_retention_period := '30 days'); +``` + +Create five advance partitions when there are only two more partitions remaining (each partition can hold 1 billion orders): + +```sql +bdr.autopartition('Orders', '1000000000', + partition_initial_lowerbound := '0', + minimum_advance_partitions := 2, + maximum_advance_partitions := 5 + ); +``` + +### Create one AutoPartition + +Use `bdr.autopartition_create_partition()` to create a standalone AutoPartition +on the parent table. + +#### Synopsis + +```sql +bdr.autopartition_create_partition(relname regclass, + partname name, + lowerb text, + upperb text, + nodes oid[]); +``` + +#### Parameters + +- `relname` — Name or Oid of the parent table to attach to. +- `partname` — Name of the new AutoPartition. +- `lowerb` — The lower bound of the partition. +- `upperb` — The upper bound of the partition. +- `nodes` — List of nodes that the new partition resides on. + +### Stopping automatic creation of partitions + +Use `bdr.drop_autopartition()` to drop the auto-partitioning rule for the +given relation. All pending work items for the relation are deleted and no new +work items are created. + +```sql +bdr.drop_autopartition(relation regclass); +``` + +#### Parameters + +- `relation` — Name or Oid of a table. 
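+
+For example, to stop automatic partition management for the `measurement` table from the earlier example (a sketch; it removes the autopartition rule and its pending work items):
+
+```sql
+SELECT bdr.drop_autopartition('measurement');
+```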
+ +### Drop one AutoPartition + +Use `bdr.autopartition_drop_partition` once a BDR AutoPartition table has been +made, as this function can specify single partitions to drop. If the partitioned +table was successfully dropped, the function returns `true`. + +#### Synopsis + +```sql +bdr.autopartition_drop_partition(relname regclass) +``` + +#### Parameters + +- `relname` — The name of the partitioned table to drop. + +### Notes + +This places a DDL lock on the parent table, before using DROP TABLE on the +chosen partition table. + +### Wait for partition creation + +Use `bdr.autopartition_wait_for_partitions()` to wait for the creation of +partitions on the local node. The function takes the partitioned table name and +a partition key column value and waits until the partition that holds that +value is created. + +The function only waits for the partitions to be created locally. It doesn't guarantee +that the partitions also exists on the remote nodes. + +To wait for the partition to be created on all BDR nodes, use the +`bdr.autopartition_wait_for_partitions_on_all_nodes()` function. This function +internally checks local as well as all remote nodes and waits until the +partition is created everywhere. + +#### Synopsis + +```sql +bdr.autopartition_wait_for_partitions(relation regclass, text bound); +``` + +#### Parameters + +- `relation` — Name or Oid of a table. +- `bound` — Partition key column value. + +#### Synopsis + +```sql +bdr.autopartition_wait_for_partitions_on_all_nodes(relation regclass, text bound); +``` + +#### Parameters + +- `relation` — Name or Oid of a table. +- `bound` — Partition key column value. + +### Find partition + +Use the `bdr.autopartition_find_partition()` function to find the partition for the +given partition key value. If partition to hold that value doesn't exist, then +the function returns NULL. Otherwise Oid of the partition is returned. + +#### Synopsis + +```sql +bdr.autopartition_find_partition(relname regclass, searchkey text); +``` + +#### Parameters + +- `relname` — Name of the partitioned table. +- `searchkey` — Partition key value to search. + +### Enable or disable AutoPartitioning + +Use `bdr.autopartition_enable()` to enable AutoPartitioning on the given table. +If AutoPartitioning is already enabled, then no action occurs. Similarly, use +`bdr.autopartition_disable()` to disable AutoPartitioning on the given table. + +#### Synopsis + +```sql +bdr.autopartition_enable(relname regclass); +``` + +#### Parameters + +- `relname` — Name of the relation to enable AutoPartitioning. + +#### Synopsis + +```sql +bdr.autopartition_disable(relname regclass); +``` + +#### Parameters + +- `relname` — Name of the relation to disable AutoPartitioning. + +#### Synopsis + +```sql +bdr.autopartition_get_last_completed_workitem(); +``` + +Return the `id` of the last workitem successfully completed on all nodes in the +cluster. + +### Check AutoPartition workers + +From using the `bdr.autopartition_work_queue_check_status` function, you can +see the status of the background workers that are doing their job to maintain +AutoPartitions. + +The workers can be seen through these views: +`autopartition_work_queue_local_status` +`autopartition_work_queue_global_status` + +#### Synopsis + +```sql +bdr.autopartition_work_queue_check_status(workid bigint + local boolean DEFAULT false); +``` + +#### Parameters + +- `workid` — The key of the AutoPartition worker. +- `local` — Check the local status only. 
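+
+A hypothetical usage sketch follows. The work item ID shown is an assumption; you can obtain a real one from the work queue views or from `bdr.autopartition_get_last_completed_workitem()`:
+
+```sql
+-- Check the status of work item 42 on the local node only
+SELECT bdr.autopartition_work_queue_check_status(42, local := true);
+```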
+ +#### Notes + +AutoPartition workers are always running in the background, even before the +`bdr.autopartition` function is called for the first time. If an invalid worker ID +is used, the function returns `unknown`. `In-progress` is the typical status. diff --git a/product_docs/docs/pgd/4/overview/bdr/security.mdx b/product_docs/docs/pgd/4/overview/bdr/security.mdx new file mode 100644 index 00000000000..ec785949a37 --- /dev/null +++ b/product_docs/docs/pgd/4/overview/bdr/security.mdx @@ -0,0 +1,386 @@ +--- +title: Security and roles + + +--- + +Only superusers can create the BDR extension. However, if you want, you can set up the `pgextwlist` extension and configure it to allow a non-superuser to create a BDR extension. + +Configuring and managing BDR doesn't require superuser access, nor is that recommended. +The privileges required by BDR are split across the following default/predefined roles, named +similarly to the PostgreSQL default/predefined roles: + +- bdr_superuser — The highest-privileged role, having access to all BDR tables and functions. +- bdr_read_all_stats — The role having read-only access to the tables, views, and functions, sufficient to understand the state of BDR. +- bdr_monitor — At the moment, the same as `bdr_read_all_stats`. To be extended later. +- bdr_application — The minimal privileges required by applications running BDR. +- bdr_read_all_conflicts — Can view all conflicts in `bdr.conflict_history`. + +These BDR roles are created when the BDR extension is +installed. See [BDR default roles](#bdr-default-roles) for more details. + +Managing BDR doesn't require that administrators have access to user data. + +Arrangements for securing conflicts are discussed in +[Logging conflicts to a table](conflicts). + +You can monitor conflicts using the `BDR.conflict_history_summary` view. + +## Catalog tables + +System catalog and Information Schema tables are always excluded from replication by BDR. + +In addition, tables owned by extensions are excluded from replication. + +## BDR functions and operators + +All BDR functions are exposed in the `bdr` schema. Any calls to these +functions must be schema qualified, rather than putting `bdr` in the +search_path. + +All BDR operators are available by way of the `pg_catalog` schema to allow users +to exclude the `public` schema from the search_path without problems. + +## Granting privileges on catalog objects + +Administrators must not grant explicit privileges on catalog +objects such as tables, views, and functions. Manage access to those objects +by granting one of the roles described in [BDR default roles](#bdr-default-roles). + +This requirement is a consequence of the flexibility that allows +joining a node group even if the nodes on either side of the join don't +have the exact same version of BDR (and therefore of the BDR +catalog). + +More precisely, if privileges on individual catalog objects were +explicitly granted, then the `bdr.join_node_group()` procedure might +fail because the corresponding GRANT statements extracted from the +node being joined might not apply to the node that is joining. + +## Role management + +Users are global objects in a PostgreSQL instance. +`CREATE USER` and `CREATE ROLE` commands are replicated automatically if they +are executed in the database where BDR is running and the +`bdr.role_replication` is turned on. 
However, if these commands are executed +in other databases in the same PostgreSQL instance, then they aren't replicated, +even if those users have rights on the BDR database. + +When a new BDR node joins the BDR group, existing users aren't automatically +copied unless the node is added using `bdr_init_physical`. This is intentional +and is an important security feature. PostgreSQL allows users to access multiple +databases, with the default being to access any database. BDR doesn't know +which users access which database and so can't safely decide +which users to copy across to the new node. + +PostgreSQL allows you to dump all users with the command: + +```shell +pg_dumpall --roles-only > roles.sql +``` + +The file `roles.sql` can then be edited to remove unwanted users before +reexecuting that on the newly created node. +Other mechanisms are possible, depending on your identity and access +management solution (IAM) but aren't automated at this time. + +## Roles and replication + +DDL changes executed by a user are applied as that same user on each node. + +DML changes to tables are replicated as the table-owning user on the target node. +We recommend but do not enforce that a table be owned by the same user on each node. + +Suppose table A is owned by user X on node1 and owned by user Y on node2. If user Y +has higher privileges than user X, this might be viewed as a privilege escalation. +Since some nodes have different use cases, we allow this but warn against it +to allow the security administrator to plan and audit this situation. + +On tables with row-level security policies enabled, changes +are replicated without reenforcing policies on apply. +This is equivalent to the changes being applied as +`NO FORCE ROW LEVEL SECURITY`, even if +`FORCE ROW LEVEL SECURITY` is specified. +If this isn't what you want, specify a row_filter that avoids +replicating all rows. We recommend but don't enforce +that the row security policies on all nodes be identical or +at least compatible. + +The user bdr_superuser controls replication for BDR and can +add or remove any table from any replication set. bdr_superuser +doesn't need any privileges +over individual tables, nor is this recommended. If you need to restrict access +to replication set functions, restricted versions of these +functions can be implemented as `SECURITY DEFINER` functions +and granted to the appropriate users. + +## Connection role + +When allocating a new BDR node, the user supplied in the DSN for the +`local_dsn` argument of `bdr.create_node` and the `join_target_dsn` of +`bdr.join_node_group` are used frequently to refer to, create, and +manage database objects. + +BDR is carefully written to prevent privilege escalation attacks even +when using a role with `SUPERUSER` rights in these DSNs. + +To further reduce the attack surface, you can specify a more restricted user +in the above DSNs. At a minimum, such a user must be +granted permissions on all nodes, such that following stipulations are +satisfied: + +- The user has the `REPLICATION` attribute. +- It is granted the `CREATE` permission on the database. +- It inherits the `bdr_superuser` role. +- It owns all database objects to replicate, either directly or from + permissions from the owner roles. + +Once all nodes are joined, the permissions can be further reduced to +just the following to still allow DML and DDL replication: + +- The user has the `REPLICATION` attribute. +- It inherits the `bdr_superuser` role. 
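+
+As a minimal sketch, such a restricted role might be set up as follows. The role
+name, password, and database name are placeholders, and ownership of the
+replicated objects isn't shown:
+
+```sql
+-- Hypothetical restricted role for use in local_dsn / join_target_dsn
+CREATE ROLE bdr_join_user LOGIN REPLICATION PASSWORD 'change-me';
+
+-- CREATE on the database and membership in bdr_superuser
+GRANT CREATE ON DATABASE bdrdb TO bdr_join_user;
+GRANT bdr_superuser TO bdr_join_user;
+
+-- After all nodes have joined, CREATE on the database can be revoked again,
+-- leaving only the REPLICATION attribute and bdr_superuser membership
+REVOKE CREATE ON DATABASE bdrdb FROM bdr_join_user;
+```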
+ +## Privilege restrictions + +BDR enforces additional restrictions, effectively preventing the +use of DDL that relies solely on TRIGGER or REFERENCES privileges. + +`GRANT ALL` still grants both TRIGGER and REFERENCES privileges, +so we recommend that you state privileges explicitly. For example, use +`GRANT SELECT, INSERT, UPDATE, DELETE, TRUNCATE` instead of `ALL`. + +### Foreign key privileges + +`ALTER TABLE ... ADD FOREIGN KEY` is supported only if the user has +SELECT privilege on the referenced table or if the referenced table +has RLS restrictions enabled that the current user can't bypass. + +Thus, the REFERENCES privilege isn't sufficient to allow creating +a foreign key with BDR. Relying solely on the REFERENCES privilege +isn't typically useful since it makes the validation check execute +using triggers rather than a table scan. It is typically too expensive +to use successfully. + +### Triggers + +In PostgreSQL, both the owner of a table and anyone who +was granted the TRIGGER privilege can create triggers. Triggers granted by the non-table owner +execute as the table owner in BDR, which might cause a security issue. +The TRIGGER privilege is seldom used and PostgreSQL Core Team has said +"The separate TRIGGER permission is something we consider obsolescent." + +BDR mitigates this problem by using stricter rules on who can create a trigger +on a table: + +- superuser +- bdr_superuser +- Owner of the table can create triggers according to same rules as in PostgreSQL + (must have EXECUTE privilege on the function used by the trigger). +- Users who have TRIGGER privilege on the table can create a trigger only if + they create the trigger using a function that is owned by the same owner as the + table and they satisfy standard PostgreSQL rules (again must have EXECUTE + privilege on the function). So if both table and function have the same owner and the + owner decided to give a user both TRIGGER privilege on the table and EXECUTE + privilege on the function, it is assumed that it is okay for that user to create + a trigger on that table using this function. +- Users who have TRIGGER privilege on the table can create triggers using + functions that are defined with the SECURITY DEFINER clause if they have EXECUTE + privilege on them. This clause makes the function always execute in the context + of the owner of the function both in standard PostgreSQL and BDR. + +This logic is built on the fact that, in PostgreSQL, the owner of the trigger +isn't the user who created it but the owner of the function used by that trigger. + +The same rules apply to existing tables, and if the existing table has triggers that +aren't owned by the owner of the table and don't use SECURITY DEFINER functions, +you can't add it to a replication set. + +These checks were added with BDR 3.6.19. An application that +relies on the behavior of previous versions can set +`bdr.backwards_compatibility` to 30618 (or lower) to behave like +earlier versions. + +BDR replication apply uses the system-level default search_path only. +Replica triggers, stream triggers, +and index expression functions might assume other search_path settings which then fail when they +execute on apply. To ensure this doesn't occur, resolve object references clearly using either the default +search_path only (always use fully qualified references to objects, e.g., schema.objectname), or set the search +path for a function using `ALTER FUNCTION ... SET search_path = ...` for the functions affected. 
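+
+For example, a brief sketch of the second approach, using a hypothetical schema
+and function name:
+
+```sql
+-- Pin the search path of a function used by a replica trigger or index
+-- expression so it resolves objects the same way during replication apply
+ALTER FUNCTION myapp.order_total(integer)
+    SET search_path = myapp, pg_catalog;
+```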
+ +## BDR default/predefined roles + +BDR predefined roles are created when the BDR extension is installed. +After BDR extension is dropped from a database, the roles continue to exist +and need to be dropped manually if required. This allows BDR to be used in multiple +databases on the same PostgreSQL instance without problem. + +The `GRANT ROLE` DDL statement doesn't participate in BDR replication. +Thus, execute this on each node of a cluster. + +### bdr_superuser + +- ALL PRIVILEGES ON ALL TABLES IN SCHEMA BDR +- ALL PRIVILEGES ON ALL ROUTINES IN SCHEMA BDR + +### bdr_read_all_stats + +SELECT privilege on + +- `bdr.conflict_history_summary` +- `bdr.ddl_epoch` +- `bdr.ddl_replication` +- `bdr.global_consensus_journal_details` +- `bdr.global_lock` +- `bdr.global_locks` +- `bdr.local_consensus_state` +- `bdr.local_node_summary` +- `bdr.node` +- `bdr.node_catchup_info` +- `bdr.node_conflict_resolvers` +- `bdr.node_group` +- `bdr.node_local_info` +- `bdr.node_peer_progress` +- `bdr.node_slots` +- `bdr.node_summary` +- `bdr.replication_sets` +- `bdr.sequences` +- `bdr.state_journal_details` +- `bdr.stat_relation` +- `bdr.stat_subscription` +- `bdr.subscription` +- `bdr.subscription_summary` +- `bdr.tables` +- `bdr.worker_errors` + +EXECUTE privilege on + +- `bdr.bdr_version` +- `bdr.bdr_version_num` +- `bdr.conflict_resolution_to_string` +- `bdr.conflict_type_to_string` +- `bdr.decode_message_payload` +- `bdr.get_global_locks` +- `bdr.get_raft_status` +- `bdr.get_relation_stats` +- `bdr.get_slot_flush_timestamp` +- `bdr.get_sub_progress_timestamp` +- `bdr.get_subscription_stats` +- `bdr.peer_state_name` +- `bdr.show_subscription_status` + +### bdr_monitor + +All privileges from `bdr_read_all_stats`, plus + +EXECUTE privilege on + +- `bdr.monitor_group_versions` +- `bdr.monitor_group_raft` +- `bdr.monitor_local_replslots` + +### bdr_application + +EXECUTE privilege on + +- All functions for column_timestamps datatypes +- All functions for CRDT datatypes +- `bdr.alter_sequence_set_kind` +- `bdr.create_conflict_trigger` +- `bdr.create_transform_trigger` +- `bdr.drop_trigger` +- `bdr.get_configured_camo_partner` +- `bdr.global_lock_table` +- `bdr.is_camo_partner_connected` +- `bdr.is_camo_partner_ready` +- `bdr.logical_transaction_status` +- `bdr.ri_fkey_trigger` +- `bdr.seq_nextval` +- `bdr.seq_currval` +- `bdr.seq_lastval` +- `bdr.trigger_get_committs` +- `bdr.trigger_get_conflict_type` +- `bdr.trigger_get_origin_node_id` +- `bdr.trigger_get_row` +- `bdr.trigger_get_type` +- `bdr.trigger_get_xid` +- `bdr.wait_for_camo_partner_queue` +- `bdr.wait_slot_confirm_lsn` + +Many of these functions have additional privileges +required before you can use them. For example, you must be +the table owner to successfully execute `bdr.alter_sequence_set_kind`. +These additional rules are described with each specific function. + +### bdr_read_all_conflicts + +BDR logs conflicts into the `bdr.conflict_history` table. Conflicts are +visible to table owners only, so no extra privileges are required +to read the conflict history. If it's useful to have a user that can +see conflicts for all tables, you can optionally grant the role +bdr_read_all_conflicts to that user. + +## Verification + +BDR was verified using the following tools and approaches. 
+ +### Coverity + +Coverity Scan was used to verify the BDR stack providing coverage +against vulnerabilities using the following rules and coding standards: + +- MISRA C +- ISO 26262 +- ISO/IEC TS 17961 +- OWASP Top 10 +- CERT C +- CWE Top 25 +- AUTOSAR + +### CIS Benchmark + +CIS PostgreSQL Benchmark v1, 19 Dec 2019 was used to verify the BDR stack. +Using the `cis_policy.yml` configuration available as an option with TPAexec +gives the following results for the Scored tests: + +| | Result | Description | +| ------ | ---------- | ----------------------------------------------------------------- | +| 1.4 | PASS | Ensure systemd Service Files Are Enabled | +| 1.5 | PASS | Ensure Data Cluster Initialized Successfully | +| 2.1 | PASS | Ensure the file permissions mask is correct | +| 2.2 | PASS | Ensure the PostgreSQL pg_wheel group membership is correct | +| 3.1.2 | PASS | Ensure the log destinations are set correctly | +| 3.1.3 | PASS | Ensure the logging collector is enabled | +| 3.1.4 | PASS | Ensure the log file destination directory is set correctly | +| 3.1.5 | PASS | Ensure the filename pattern for log files is set correctly | +| 3.1.6 | PASS | Ensure the log file permissions are set correctly | +| 3.1.7 | PASS | Ensure 'log_truncate_on_rotation' is enabled | +| 3.1.8 | PASS | Ensure the maximum log file lifetime is set correctly | +| 3.1.9 | PASS | Ensure the maximum log file size is set correctly | +| 3.1.10 | PASS | Ensure the correct syslog facility is selected | +| 3.1.11 | PASS | Ensure the program name for PostgreSQL syslog messages is correct | +| 3.1.14 | PASS | Ensure 'debug_print_parse' is disabled | +| 3.1.15 | PASS | Ensure 'debug_print_rewritten' is disabled | +| 3.1.16 | PASS | Ensure 'debug_print_plan' is disabled | +| 3.1.17 | PASS | Ensure 'debug_pretty_print' is enabled | +| 3.1.18 | PASS | Ensure 'log_connections' is enabled | +| 3.1.19 | PASS | Ensure 'log_disconnections' is enabled | +| 3.1.21 | PASS | Ensure 'log_hostname' is set correctly | +| 3.1.23 | PASS | Ensure 'log_statement' is set correctly | +| 3.1.24 | PASS | Ensure 'log_timezone' is set correctly | +| 3.2 | PASS | Ensure the PostgreSQL Audit Extension (pgAudit) is enabled | +| 4.1 | PASS | Ensure sudo is configured correctly | +| 4.2 | PASS | Ensure excessive administrative privileges are revoked | +| 4.3 | PASS | Ensure excessive function privileges are revoked | +| 4.4 | PASS | Tested Ensure excessive DML privileges are revoked | +| 5.2 | Not Tested | Ensure login via 'host' TCP/IP Socket is configured correctly | +| 6.2 | PASS | Ensure 'backend' runtime parameters are configured correctly | +| 6.7 | Not Tested | Ensure FIPS 140-2 OpenSSL Cryptography Is Used | +| 6.8 | PASS | Ensure SSL is enabled and configured correctly | +| 7.3 | PASS | Ensure WAL archiving is configured and functional | + +Test 5.2 can PASS if audited manually, but it doesn't have an +automated test. + +Test 6.7 succeeds on default deployments using CentOS, but it +requires extra packages on Debian variants. diff --git a/product_docs/docs/pgd/4/overview/bdr/sequences.mdx b/product_docs/docs/pgd/4/overview/bdr/sequences.mdx new file mode 100644 index 00000000000..776cd500eae --- /dev/null +++ b/product_docs/docs/pgd/4/overview/bdr/sequences.mdx @@ -0,0 +1,829 @@ +--- +title: Sequences + + +--- + +Many applications require that unique surrogate ids be assigned to database entries. +Often the database `SEQUENCE` object is used to produce these. 
In +PostgreSQL, these can be either: +- A manually created sequence using the +`CREATE SEQUENCE` command and retrieved by calling the `nextval()` function +- `serial` and `bigserial` columns or, alternatively, +`GENERATED BY DEFAULT AS IDENTITY` columns + +However, standard sequences in PostgreSQL aren't multi-node aware and +produce values that are unique only on the local node. This is important because +unique ids generated by such sequences cause conflict and data loss (by +means of discarded `INSERT` actions) in multi-master replication. + +## BDR global sequences + +For this reason, BDR provides an application-transparent way to generate unique +ids using sequences on bigint or bigserial datatypes across the whole BDR group, +called *global sequences*. + +BDR global sequences provide an easy way for applications to use the +database to generate unique synthetic keys in an asynchronous distributed +system that works for most—but not necessarily all—cases. + +Using BDR global sequences allows you to avoid the problems with insert +conflicts. If you define a `PRIMARY KEY` or `UNIQUE` constraint on a column +that's using a global sequence, no node can ever get +the same value as any other node. When BDR synchronizes inserts between the +nodes, they can never conflict. + +BDR global sequences extend PostgreSQL sequences, so they are crash-safe. To use +them, you must be granted the `bdr_application` role. + +There are various possible algorithms for global sequences: + +- SnowflakeId sequences +- Globally allocated range sequences + +SnowflakeId sequences generate values using an algorithm that doesn't require +inter-node communication at any point. It's faster and more robust and has the +useful property of recording the timestamp at which the values were +created. + +SnowflakeId sequences have the restriction that they work only for 64-bit BIGINT +datatypes and produce values 19 digits long, which might be too long for +use in some host language datatypes such as Javascript Integer types. +Globally allocated sequences allocate a local range of values that can +be replenished as needed by inter-node consensus, making them suitable for +either BIGINT or INTEGER sequences. + +You can create a global sequence using the `bdr.alter_sequence_set_kind()` +function. This function takes a standard PostgreSQL sequence and marks it as +a BDR global sequence. It can also convert the sequence back to the standard +PostgreSQL sequence. + +BDR also provides the configuration variable `bdr.default_sequence_kind`, which +determines the kind of sequence to create when the `CREATE SEQUENCE` +command is executed or when a `serial`, `bigserial`, or +`GENERATED BY DEFAULT AS IDENTITY` column is created. Valid settings are: + +- `local` (the default), meaning that newly created + sequences are the standard PostgreSQL (local) sequences. +- `galloc`, which always creates globally allocated range sequences. +- `snowflakeid`, which creates global sequences for BIGINT sequences that + consist of time, nodeid, and counter components. You can't use it with + INTEGER sequences (so you can use it for `bigserial` but not for `serial`). +- `timeshard`, which is the older version of SnowflakeId sequence and is provided for + backward compatibility only. The SnowflakeId is preferred. +- `distributed`, which is a special value that you can use only for + `bdr.default_sequence_kind`. It selects `snowflakeid` for `int8` + sequences (i.e., `bigserial`) and `galloc` sequence for `int4` + (i.e., `serial`) and `int2` sequences. 
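+
+For example, the following sketch (the table and column names are illustrative)
+sets the default sequence kind for the session and then relies on it when
+creating a table:
+
+```sql
+-- With 'distributed', bigserial columns get snowflakeid sequences and
+-- serial columns get galloc sequences
+SET bdr.default_sequence_kind = 'distributed';
+
+CREATE TABLE orders (
+    order_id bigserial PRIMARY KEY,  -- backed by a snowflakeid sequence
+    batch_no serial,                 -- backed by a galloc sequence
+    note     text
+);
+```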
+ +The `bdr.sequences` view shows information about individual sequence kinds. + +`currval()` and `lastval()` work correctly for all types of global sequence. + +### SnowflakeId sequences + +The ids generated by SnowflakeId sequences are loosely time ordered so you can +use them to get the approximate order of data insertion, like standard PostgreSQL +sequences. Values generated within the same millisecond might be out of order, +even on one node. The property of loose time ordering means they are suitable +for use as range partition keys. + +SnowflakeId sequences work on one or more nodes and don't require any inter-node +communication after the node join process completes. So you can continue to +use them even if there's the risk of extended network partitions. They aren't +affected by replication lag or inter-node latency. + +SnowflakeId sequences generate unique ids in a different +way from standard sequences. The algorithm uses three components for a +sequence number. The first component of the sequence is a timestamp +at the time of sequence number generation. The second component of +the sequence number is the unique id assigned to each BDR node, +which ensures that the ids from different nodes are always different. +The third component is the number generated by +the local sequence. + +While adding a unique node id to the sequence number is enough +to ensure there are no conflicts, we also want to keep another useful +property of sequences. The ordering of the sequence +numbers roughly corresponds to the order in which data was inserted +into the table. Putting the timestamp first ensures this. + +A few limitations and caveats apply to SnowflakeId sequences. + +SnowflakeId sequences are 64 bits wide and need a `bigint` or `bigserial`. +Values generated are at least 19 digits long. +There's no practical 32-bit `integer` version, so you can't use it with `serial` +sequences. Use globally allocated range sequences instead. + +For SnowflakeId there's a limit of 4096 sequence values generated per +millisecond on any given node (about 4 million sequence values per +second). In case the sequence value generation wraps around within a given +millisecond, the SnowflakeId sequence waits until the next millisecond and gets a +fresh value for that millisecond. + +Since SnowflakeId sequences encode timestamps into the sequence value, you can generate new sequence +values only within the given time frame (depending on the system clock). +The oldest timestamp that you can use is 2016-10-07, which is the epoch time for +the SnowflakeId. The values wrap to negative values in the year 2086 and +completely run out of numbers by 2156. + +Since timestamp is an important part of a SnowflakeId sequence, there's additional +protection from generating sequences with a timestamp older than the latest one +used in the lifetime of a postgres process (but not between postgres restarts). + +The `INCREMENT` option on a sequence used as input for SnowflakeId sequences is +effectively ignored. This might be relevant for applications that do sequence +ID caching, like many object-relational mapper (ORM) tools, notably Hibernate. +Because the sequence is time based, this has little practical effect since the +sequence advances to a new noncolliding value by the time the +application can do anything with the cached values. + +Similarly, you might change the `START`, `MINVALUE`, `MAXVALUE`, and `CACHE` settings +on the underlying sequence, but there's no benefit to doing +so. 
The sequence's low 14 bits are used and the rest is discarded, so +the value range limits don't affect the function's result. For the same +reason, `setval()` isn't useful for SnowflakeId sequences. + +#### Timeshard sequences + +Timeshard sequences are provided for backward compatibility with existing +installations but aren't recommended for new application use. We recommend +using the SnowflakeId sequence instead. + +Timeshard is very similar to SnowflakeId but has different limits and fewer +protections and slower performance. + +The differences between timeshard and SnowflakeId are as following: + + - Timeshard can generate up to 16384 per millisecond (about 16 million per + second), which is more than SnowflakeId. However, there's no protection + against wraparound within a given millisecond. Schemas using the timeshard + sequence must protect the use of the `UNIQUE` constraint when using timeshard values + for given column. + - The timestamp component of timeshard sequence runs out of values in + the year 2050 and, if used in combination with bigint, the values wrap + to negative numbers in the year 2033. This means that sequences generated + after 2033 have negative values. This is a considerably shorter time + span than SnowflakeId and is the main reason why SnowflakeId is preferred. + - Timeshard sequences require occasional disk writes (similar to standard local + sequences). SnowflakeIds are calculated in memory so the SnowflakeId + sequences are in general a little faster than timeshard sequences. + +### Globally allocated range sequences + +The globally allocated range (or `galloc`) sequences allocate ranges (chunks) +of values to each node. When the local range is used up, a new range is +allocated globally by consensus amongst the other nodes. This uses the key +space efficiently but requires that the local node be connected to a majority +of the nodes in the cluster for the sequence generator to progress when the +currently assigned local range is used up. + +Unlike SnowflakeId sequences, `galloc` sequences support all sequence data types +provided by PostgreSQL: `smallint`, `integer`, and `bigint`. This means that +you can use `galloc` sequences in environments where 64-bit sequences are +problematic. Examples include using integers in javascript, since that supports only +53-bit values, or when the sequence is displayed on output with limited space. + +The range assigned by each voting is currently predetermined based on the +datatype the sequence is using: + +- smallint — 1 000 numbers +- integer — 1 000 000 numbers +- bigint — 1 000 000 000 numbers + +Each node allocates two chunks of seq_chunk_size, one for the current use +plus a reserved chunk for future usage, so the values generated from any one +node increase monotonically. However, viewed globally, the values +generated aren't ordered at all. This might cause a loss of performance +due to the effects on b-tree indexes and typically means that generated +values aren't useful as range partition keys. + +The main downside of the `galloc` sequences is that once the assigned range is +used up, the sequence generator has to ask for consensus about the next range +for the local node that requires inter-node communication. This could +lead to delays or operational issues if the majority of the BDR group isn't +accessible. This might be avoided in later releases. + +The `CACHE`, `START`, `MINVALUE`, and `MAXVALUE` options work correctly +with `galloc` sequences. 
However, you need to set them before transforming +the sequence to the `galloc` kind. The `INCREMENT BY` option also works +correctly. However, you can't assign an increment value that's equal +to or more than the above ranges assigned for each sequence datatype. +`setval()` doesn't reset the global state for `galloc` sequences; don't use it. + +A few limitations apply to `galloc` sequences. BDR tracks `galloc` sequences in a +special BDR catalog [bdr.sequence_alloc](catalogs#bdrsequence_alloc). This +catalog is required to track the currently allocated chunks for the `galloc` +sequences. The sequence name and namespace is stored in this catalog. Since the +sequence chunk allocation is managed by Raft, whereas any changes to the +sequence name/namespace is managed by the replication stream, BDR currently doesn't +support renaming `galloc` sequences or moving them to another namespace or +renaming the namespace that contains a `galloc` sequence. Be +mindful of this limitation while designing application schema. + +#### Converting a local sequence to a galloc sequence + +Before transforming a local sequence to galloc, you need to take care of several +prerequisites. + +##### 1. Verify that sequence and column data type match + +Check that the sequence's data type matches the data type of the column with +which it will be used. For example, you can create a `bigint` sequence +and assign an `integer` column's default to the `nextval()` returned by that +sequence. With galloc sequences, which for `bigint` are allocated in blocks of +1 000 000 000, this quickly results in the values returned by `nextval()` +exceeding the `int4` range if more than two nodes are in use. + +The following example shows what can happen: + +```sql +CREATE SEQUENCE int8_seq; + +SELECT sequencename, data_type FROM pg_sequences; + sequencename | data_type +--------------+----------- + int8_seq | bigint +(1 row) + +CREATE TABLE seqtest (id INT NOT NULL PRIMARY KEY); + +ALTER SEQUENCE int8_seq OWNED BY seqtest.id; + +SELECT bdr.alter_sequence_set_kind('public.int8_seq'::regclass, 'galloc', 1); + alter_sequence_set_kind +------------------------- + +(1 row) + +ALTER TABLE seqtest ALTER COLUMN id SET DEFAULT nextval('int8_seq'::regclass); +``` + +After executing `INSERT INTO seqtest VALUES(DEFAULT)` on two nodes, the table +contains the following values: + +```sql +SELECT * FROM seqtest; + id +------------ + 2 + 2000000002 +(2 rows) +``` + +However, attempting the same operation on a third node fails with an +`integer out of range` error, as the sequence generated the value +`4000000002`. + +!!! Tip + You can retrieve the current data type of a sequence from the PostgreSQL + [pg_sequences](https://www.postgresql.org/docs/current/view-pg-sequences.html) + view. You can modify the data type of a sequence with `ALTER SEQUENCE ... AS ...`, + for example, `ALTER SEQUENCE public.sequence AS integer`, as long as its current + value doesn't exceed the maximum value of the new data type. + +##### 2. Set a new start value for the sequence + +When the sequence kind is altered to `galloc`, it's rewritten and restarts from +the defined start value of the local sequence. If this happens on an existing +sequence in a production database, you need to query the current value and +then set the start value appropriately. To assist with this use case, BDR +allows users to pass a starting value with the function `bdr.alter_sequence_set_kind()`. 
+
+If you're already using an offset and you have writes from multiple nodes, you
+need to check the greatest value already in use and restart the sequence to at
+least the next value.
+
+```sql
+-- determine the highest sequence value across all nodes
+SELECT max((x->'response'->0->>'nextval')::bigint)
+  FROM json_array_elements(
+         bdr.run_on_all_nodes(
+           E'SELECT nextval(\'public.sequence\');'
+         )::json
+       ) AS x;
+
+-- turn it into a galloc sequence
+SELECT bdr.alter_sequence_set_kind('public.sequence'::regclass, 'galloc', $MAX + $MARGIN);
+```
+
+Since users can't lock a sequence, you must leave a `$MARGIN` value to allow
+operations to continue while the `max()` value is queried.
+
+The `bdr.sequence_alloc` table gives information on the chunk size and the
+ranges allocated around the whole cluster.
+In this example, the sequence was started from `333`, and there are two nodes in
+the cluster. You can see four allocations in total, two per node, and a chunk
+size of 1000000, which corresponds to an integer sequence.
+
+```sql
+SELECT * FROM bdr.sequence_alloc
+  WHERE seqid = 'public.categories_category_seq'::regclass;
+          seqid          | seq_chunk_size | seq_allocated_up_to | seq_nallocs |        seq_last_alloc
+-------------------------+----------------+---------------------+-------------+-------------------------------
+ categories_category_seq |        1000000 |             4000333 |           4 | 2020-05-21 20:02:15.957835+00
+(1 row)
+```
+
+To see the ranges currently assigned to a given sequence on each node, use
+these queries:
+
+* Node `Node1` is using the range from `334` to `2000333`.
+
+```sql
+SELECT last_value AS range_start, log_cnt AS range_end
+  FROM categories_category_seq WHERE ctid = '(0,2)'; -- first range
+ range_start | range_end
+-------------+-----------
+         334 |   1000333
+(1 row)
+
+SELECT last_value AS range_start, log_cnt AS range_end
+  FROM categories_category_seq WHERE ctid = '(0,3)'; -- second range
+ range_start | range_end
+-------------+-----------
+     1000334 |   2000333
+(1 row)
+```
+
+* Node `Node2` is using the range from `2000334` to `4000333`.
+
+```sql
+SELECT last_value AS range_start, log_cnt AS range_end
+  FROM categories_category_seq WHERE ctid = '(0,2)'; -- first range
+ range_start | range_end
+-------------+-----------
+     2000334 |   3000333
+(1 row)
+
+SELECT last_value AS range_start, log_cnt AS range_end
+  FROM categories_category_seq WHERE ctid = '(0,3)'; -- second range
+ range_start | range_end
+-------------+-----------
+     3000334 |   4000333
+(1 row)
+```
+
+!!! NOTE
+    You can't combine these into a single query (like `WHERE ctid IN ('(0,2)', '(0,3)')`),
+    as that still shows only the first range.
+
+When a node finishes a chunk, it asks for consensus on a new one and gets the
+first available chunk, in this example from 4000334 to 5000333. That chunk
+becomes the new reserved chunk, and the node starts consuming the previously
+reserved chunk.
+
+## UUIDs, KSUUIDs, and other approaches
+
+There are other ways to generate globally unique ids that you can use with BDR
+without using global sequences. For example:
+
+- UUIDs and their BDR variant, KSUUIDs
+- Local sequences with a different offset per node (i.e., manual)
+- An externally coordinated natural key
+
+BDR applications can't use other methods safely:
+counter-table-based approaches relying on `SELECT ... FOR UPDATE`, `UPDATE ... RETURNING ...`,
+or similar for sequence generation don't work correctly in BDR because BDR
+doesn't take row locks between nodes. The same values are generated on
+more than one node.
For the same reason, the usual strategies for "gapless" +sequence generation don't work with BDR. In most cases, the application +coordinates generation of sequences that must be gapless from some external +source using two-phase commit. Or it generates them only on one node in +the BDR group. + +### UUIDs and KSUUIDs + +`UUID` keys instead avoid sequences entirely and +use 128-bit universal unique identifiers. These are random +or pseudorandom values that are so large that it's nearly +impossible for the same value to be generated twice. There's +no need for nodes to have continuous communication when using `UUID` keys. + +In the unlikely event of a collision, conflict detection +chooses the newer of the two inserted records to retain. Conflict logging, +if enabled, records such an event. However, it's +exceptionally unlikely to ever occur, since collisions +become practically likely only after about `2^64` keys are generated. + +The main downside of `UUID` keys is that they're somewhat inefficient in terms of space and +the network. They consume more space not only as a primary key but +also where referenced in foreign keys and when transmitted on the wire. +Also, not all applications cope well with `UUID` keys. + +BDR provides functions for working with a K-Sortable variant of `UUID` data, +known as KSUUID, which generates values that can be stored using the PostgreSQL +standard `UUID` data type. A `KSUUID` value is similar to `UUIDv1` in that +it stores both timestamp and random data, following the `UUID` standard. +The difference is that `KSUUID` is K-Sortable, meaning that it's weakly +sortable by timestamp. This makes it more useful as a database key as it +produces more compact `btree` indexes, which improves +the effectiveness of search, and allows natural time-sorting of result data. +Unlike `UUIDv1`, +`KSUUID` values don't include the MAC of the computer on which they were +generated, so there are no security concerns from using them. + +`KSUUID` v2 is now recommended in all cases. You can directly sort values generated +with regular comparison operators. + +There are two versions of `KSUUID` in BDR: v1 and v2. +The legacy `KSUUID` v1 is +deprecated but is kept in order to support existing installations. Don't +use it for new installations. +The internal contents of v1 and v2 aren't compatible. As such, the +functions to manipulate them also aren't compatible. The v2 of `KSUUID` also +no longer stores the `UUID` version number. + +### Step and offset sequences + +In offset-step sequences, a normal PostgreSQL sequence is used on each node. +Each sequence increments by the same amount and starts at differing offsets. +For example, with step 1000, node1's sequence generates 1001, 2001, 3001, and +so on. node2's sequence generates 1002, 2002, 3002, and so on. This scheme works well +even if the nodes can't communicate for extended periods. However, the designer +must specify a maximum number of nodes when establishing the +schema, and it requires per-node configuration. Mistakes can easily lead to +overlapping sequences. 
+ +It's relatively simple to configure this approach with BDR by creating the +desired sequence on one node, like this: + +``` +CREATE TABLE some_table ( + generated_value bigint primary key +); + +CREATE SEQUENCE some_seq INCREMENT 1000 OWNED BY some_table.generated_value; + +ALTER TABLE some_table ALTER COLUMN generated_value SET DEFAULT nextval('some_seq'); +``` + +Then, on each node calling `setval()`, give each node a different offset +starting value, for example: + +``` +-- On node 1 +SELECT setval('some_seq', 1); + +-- On node 2 +SELECT setval('some_seq', 2); + + -- ... etc +``` + +Be sure to allow a large enough `INCREMENT` to leave room for all +the nodes you might ever want to add, since changing it in future is difficult +and disruptive. + +If you use `bigint` values, there's no practical concern about key exhaustion, +even if you use offsets of 10000 or more. It would take hundreds of years, +with hundreds of machines, doing millions of inserts per second, to have any +chance of approaching exhaustion. + +BDR doesn't currently offer any automation for configuration of the +per-node offsets on such step/offset sequences. + +#### Composite keys + +A variant on step/offset sequences is to use a composite key composed of +`PRIMARY KEY (node_number, generated_value)`, where the +node number is usually obtained from a function that returns a different +number on each node. You can create such a function by temporarily +disabling DDL replication and creating a constant SQL function. Alternatively, you can use +a one-row table that isn't part of a replication set to store a different +value in each node. + +## Global sequence management interfaces + +BDR provides an interface for converting between a standard PostgreSQL sequence +and the BDR global sequence. + +The following functions are considered to be `DDL`, so DDL replication +and global locking applies to them. + +### bdr.alter_sequence_set_kind + +Allows the owner of a sequence to set the kind of a sequence. +Once set, `seqkind` is visible only by way of the `bdr.sequences` view. +In all other ways, the sequence appears as a normal sequence. + +BDR treats this function as `DDL`, so DDL replication and global locking applies, +if it's currently active. See [DDL Replication](ddl). + +#### Synopsis + +```sql +bdr.alter_sequence_set_kind(seqoid regclass, seqkind text) +``` + +#### Parameters + +- `seqoid` — Name or Oid of the sequence to alter. +- `seqkind` — `local` for a standard PostgreSQL sequence, `snowflakeid` or + `galloc` for globally unique BDR sequences, or `timeshard` for legacy + globally unique sequence. + +#### Notes + +When changing the sequence kind to `galloc`, the first allocated range for that +sequence uses the sequence start value as the starting point. When there are +existing values that were used by the sequence before it was changed to `galloc`, +we recommend moving the starting point so that the newly generated +values don't conflict with the existing ones using the following command: + +```sql +ALTER SEQUENCE seq_name START starting_value RESTART +``` + +This function uses the same replication mechanism as `DDL` statements. This means +that the replication is affected by the [ddl filters](repsets#ddl-replication-filtering) +configuration. + +The function takes a global `DDL` lock. It also locks the sequence locally. + +This function is transactional. You can roll back the effects with the +`ROLLBACK` of the transaction. The changes are visible to the current +transaction. 
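+
+For instance, a conversion can be tried out and undone in a single transaction
+(the sequence name is illustrative):
+
+```sql
+BEGIN;
+SELECT bdr.alter_sequence_set_kind('public.some_seq'::regclass, 'snowflakeid');
+ROLLBACK;  -- the sequence remains a standard local sequence
+```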
+ +Only the owner of the sequence can execute the `bdr.alter_sequence_set_kind` function +unless `bdr.backwards_compatibility` is +set is set to 30618 or lower. + +### bdr.extract_timestamp_from_snowflakeid + +This function extracts the timestamp component of the `snowflakeid` sequence. +The return value is of type timestamptz. + +#### Synopsis +```sql +bdr.extract_timestamp_from_snowflakeid(snowflakeid bigint) +``` + +#### Parameters + - `snowflakeid` — Value of a snowflakeid sequence. + +#### Notes + +This function executes only on the local node. + +### bdr.extract_nodeid_from_snowflakeid + +This function extracts the nodeid component of the `snowflakeid` sequence. + +#### Synopsis +```sql +bdr.extract_nodeid_from_snowflakeid(snowflakeid bigint) +``` + +#### Parameters + - `snowflakeid` — Value of a snowflakeid sequence. + +#### Notes + +This function executes only on the local node. + +### bdr.extract_localseqid_from_snowflakeid + +This function extracts the local sequence value component of the `snowflakeid` sequence. + +#### Synopsis +```sql +bdr.extract_localseqid_from_snowflakeid(snowflakeid bigint) +``` + +#### Parameters + - `snowflakeid` — Value of a snowflakeid sequence. + +#### Notes + +This function executes only on the local node. + +### bdr.timestamp_to_snowflakeid + +This function converts a timestamp value to a dummy snowflakeid sequence value. + +This is useful for doing indexed searches or comparisons of values in the +snowflakeid column and for a specific timestamp. + +For example, given a table `foo` with a column `id` that's using a `snowflakeid` +sequence, we can get the number of changes since yesterday midnight like this: + +``` +SELECT count(1) FROM foo WHERE id > bdr.timestamp_to_snowflakeid('yesterday') +``` + +A query formulated this way uses an index scan on the column `id`. + +#### Synopsis +```sql +bdr.timestamp_to_snowflakeid(ts timestamptz) +``` + +#### Parameters + - `ts` — Timestamp to use for the snowflakeid sequence generation. + +#### Notes + +This function executes only on the local node. + +### bdr.extract_timestamp_from_timeshard + +This function extracts the timestamp component of the `timeshard` sequence. +The return value is of type timestamptz. + +#### Synopsis + +```sql +bdr.extract_timestamp_from_timeshard(timeshard_seq bigint) +``` + +#### Parameters + +- `timeshard_seq` — Value of a timeshard sequence. + +#### Notes + +This function executes only on the local node. + +### bdr.extract_nodeid_from_timeshard + +This function extracts the nodeid component of the `timeshard` sequence. + +#### Synopsis + +```sql +bdr.extract_nodeid_from_timeshard(timeshard_seq bigint) +``` + +#### Parameters + +- `timeshard_seq` — Value of a timeshard sequence. + +#### Notes + +This function executes only on the local node. + +### bdr.extract_localseqid_from_timeshard + +This function extracts the local sequence value component of the `timeshard` sequence. + +#### Synopsis + +```sql +bdr.extract_localseqid_from_timeshard(timeshard_seq bigint) +``` + +#### Parameters + +- `timeshard_seq` — Value of a timeshard sequence. + +#### Notes + +This function executes only on the local node. + +### bdr.timestamp_to_timeshard + +This function converts a timestamp value to a dummy timeshard sequence value. + +This is useful for doing indexed searches or comparisons of values in the +timeshard column and for a specific timestamp. 
+ +For example, given a table `foo` with a column `id` that's using a `timeshard` +sequence, we can get the number of changes since yesterday midnight like this: + +``` +SELECT count(1) FROM foo WHERE id > bdr.timestamp_to_timeshard('yesterday') +``` + +A query formulated this way uses an index scan on the column `id`. + +#### Synopsis + +```sql +bdr.timestamp_to_timeshard(ts timestamptz) +``` + +#### Parameters + +- `ts` — Timestamp to use for the timeshard sequence generation. + +#### Notes + +This function executes only on the local node. + +## KSUUID v2 Functions + +Functions for working with `KSUUID` v2 data, K-Sortable UUID data. + +### bdr.gen_ksuuid_v2 + +This function generates a new `KSUUID` v2 value using the value of timestamp passed as an +argument or current system time if NULL is passed. +If you want to generate KSUUID automatically using the system time, pass a NULL argument. + +The return value is of type UUID. + +#### Synopsis + +```sql +bdr.gen_ksuuid_v2(timestamptz) +``` + +#### Notes + +This function executes only on the local node. + +### bdr.ksuuid_v2_cmp + +This function compares the `KSUUID` v2 values. + +It returns 1 if the first value is newer, -1 if the second value is lower, or zero if they +are equal. + +#### Synopsis + +```sql +bdr.ksuuid_v2_cmp(uuid, uuid) +``` + +#### Parameters + +- `UUID` — `KSUUID` v2 to compare. + +#### Notes + +This function executes only on the local node. + +### bdr.extract_timestamp_from_ksuuid_v2 + +This function extracts the timestamp component of `KSUUID` v2. +The return value is of type timestamptz. + +#### Synopsis + +```sql +bdr.extract_timestamp_from_ksuuid_v2(uuid) +``` + +#### Parameters + +- `UUID` — `KSUUID` v2 value to extract timestamp from. + +#### Notes + +This function executes only on the local node. + +## KSUUID v1 functions + +Functions for working with `KSUUID` v1 data, K-Sortable UUID data(v1). + +### bdr.gen_ksuuid + +This function generates a new `KSUUID` v1 value, using the current system time. +The return value is of type UUID. + +#### Synopsis + +```sql +bdr.gen_ksuuid() +``` + +#### Notes + +This function executes only on the local node. + +### bdr.uuid_v1_cmp + +This function compares the `KSUUID` v1 values. + +It returns 1 if the first value is newer, -1 if the second value is lower, or zero if they +are equal. + +#### Synopsis + +```sql +bdr.uuid_v1_cmp(uuid, uuid) +``` + +#### Notes + +This function executes only on the local node. + +#### Parameters + +- `UUID` — `KSUUID` v1 to compare. + +### bdr.extract_timestamp_from_ksuuid + +This function extracts the timestamp component of `KSUUID` v1 or `UUIDv1` values. +The return value is of type timestamptz. + +#### Synopsis + +```sql +bdr.extract_timestamp_from_ksuuid(uuid) +``` + +#### Parameters + +- `UUID` — `KSUUID` v1 value to extract timestamp from. + +#### Notes + +This function executes on the local node. diff --git a/product_docs/docs/pgd/4/overview/bdr/striggers.mdx b/product_docs/docs/pgd/4/overview/bdr/striggers.mdx new file mode 100644 index 00000000000..77e99008f49 --- /dev/null +++ b/product_docs/docs/pgd/4/overview/bdr/striggers.mdx @@ -0,0 +1,691 @@ +--- +title: Stream triggers + + +--- + +BDR introduces new types of triggers that you can use for additional +data processing on the downstream/target node. + +- Conflict triggers +- Transform triggers + +Together, these types of triggers are known as *stream triggers*. + +Stream triggers are designed to be trigger-like in syntax. 
They leverage the +PostgreSQL BEFORE trigger architecture and are likely to have similar +performance characteristics as PostgreSQL BEFORE Triggers. + +Multiple trigger definitions can use one trigger function, just as with +normal PostgreSQL triggers. +A trigger function is a program defined in this form: +`CREATE FUNCTION ... RETURNS TRIGGER`. Creating the trigger doesn't +require use of the `CREATE TRIGGER` command. Instead, create stream triggers +using the special BDR functions +`bdr.create_conflict_trigger()` and `bdr.create_transform_trigger()`. + +Once created, the trigger is visible in the catalog table `pg_trigger`. +The stream triggers are marked as `tgisinternal = true` and +`tgenabled = 'D'` and have the name suffix '\_bdrc' or '\_bdrt'. The view +`bdr.triggers` provides information on the triggers in relation to the table, +the name of the procedure that is being executed, the event that triggers it, +and the trigger type. + +Stream triggers aren't enabled for normal SQL processing. +Because of this, the `ALTER TABLE ... ENABLE TRIGGER` is blocked for stream +triggers in both its specific name variant and the ALL variant. This mechanism prevents +the trigger from executing as a normal SQL trigger. + +These triggers execute on the downstream or target node. There's no +option for them to execute on the origin node. However, you might want to consider +the use of `row_filter` expressions on the origin. + +Also, any DML that is applied while executing a stream +trigger isn't replicated to other BDR nodes and doesn't +trigger the execution of standard local triggers. This is intentional. You can use it, for example, +to log changes or conflicts captured by a +stream trigger into a table that is crash-safe and specific of that +node. See [Stream triggers examples](#stream-triggers-examples) for a working example. + +## Trigger execution during Apply + +Transform triggers execute first—once for each incoming change in the +triggering table. These triggers fire before we attempt to locate a +matching target row, allowing a very wide range of transforms to be applied +efficiently and consistently. + +Next, for UPDATE and DELETE changes, we locate the target row. If there's no +target row, then no further processing occurs for those change types. + +We then execute any normal triggers that previously were explicitly enabled +as replica triggers at table-level: + +```sql +ALTER TABLE tablename +ENABLE REPLICA TRIGGER trigger_name; +``` + +We then decide whether a potential conflict exists. If so, we then call any +conflict trigger that exists for that table. + +### Missing column conflict resolution + +Before transform triggers are executed, PostgreSQL tries to match the +incoming tuple against the row-type of the target table. + +Any column that exists on the input row but not on the target table +triggers a conflict of type `target_column_missing`. Conversely, a +column existing on the target table but not in the incoming row +triggers a `source_column_missing` conflict. The default resolutions +for those two conflict types are respectively `ignore_if_null` and +`use_default_value`. + +This is relevant in the context of rolling schema upgrades, for +example, if the new version of the schema introduces a new +column. When replicating from an old version of the schema to a new +one, the source column is missing, and the `use_default_value` +strategy is appropriate, as it populates the newly introduced column +with the default value. 
+ +However, when replicating from a node having the new schema version to +a node having the old one, the column is missing from the target +table. The `ignore_if_null` resolver isn't appropriate for a +rolling upgrade because it breaks replication as soon as the user +inserts a tuple with a non-NULL value +in the new column in any of the upgraded nodes. + +In view of this example, the appropriate setting for rolling schema +upgrades is to configure each node to apply the `ignore` resolver in +case of a `target_column_missing` conflict. + +You can do this with the following query, which you must execute +separately on each node. Replace `node1` with the actual +node name. + +```sql +SELECT bdr.alter_node_set_conflict_resolver('node1', + 'target_column_missing', 'ignore'); +``` + +#### Data loss and divergence risk + +Setting the conflict resolver to `ignore` +can lead to data loss and cluster divergence. + +Consider the following example: table `t` exists on nodes 1 and 2, but +its column `col` exists only on node 1. + +If the conflict resolver is set to `ignore`, then there can be rows on +node 1 where `c` isn't null, for example, `(pk=1, col=100)`. That row is +replicated to node 2, and the value in column `c` is discarded, +for example, `(pk=1)`. + +If column `c` is then added to the table on node 2, it is at first +set to NULL on all existing rows, and the row considered above +becomes `(pk=1, col=NULL)`. The row having `pk=1` is no longer +identical on all nodes, and the cluster is therefore divergent. + +The default `ignore_if_null` resolver isn't affected by +this risk because any row replicated to node 2 has +`col=NULL`. + +Based on this example, we recommend running LiveCompare against the +whole cluster at the end of a rolling schema upgrade where the +`ignore` resolver was used. This practice helps to ensure that you detect and fix any divergence. + +## Terminology of row-types + +We use these row-types: + +- `SOURCE_OLD` is the row before update, that is, the key. +- `SOURCE_NEW` is the new row coming from another node. +- `TARGET` is the row that exists on the node already, that is, the conflicting row. + +## Conflict triggers + +Conflict triggers execute when a conflict is detected by BDR. +They decide what happens when the conflict has occurred. + +- If the trigger function returns a row, the action is applied to the target. +- If the trigger function returns a NULL row, the action is skipped. + +For example, if the trigger is called for a `DELETE`, the trigger +returns NULL if it wants to skip the `DELETE`. If you want the `DELETE` to proceed, +then return a row value: either `SOURCE_OLD` or `TARGET` works. +When the conflicting operation is either `INSERT` or `UPDATE`, and the +chosen resolution is the deletion of the conflicting row, the trigger +must explicitly perform the deletion and return NULL. +The trigger function can perform other SQL actions as it chooses, but +those actions are only applied locally, not replicated. + +When a real data conflict occurs between two or more nodes, +two or more concurrent changes are occurring. When we apply those changes, the +conflict resolution occurs independently on each node. This means the conflict +resolution occurs once on each node and can occur with a +significant time difference between them. As a result, communication between the multiple executions of the conflict +trigger isn't possible. 
It is the responsibility of the author of the conflict trigger to +ensure that the trigger gives exactly the same result for all related events. +Otherwise, data divergence occurs. Technical Support recommends that you formally test all conflict +triggers using the isolationtester tool supplied with +BDR. + +!!! Warning + - You can specify multiple conflict triggers on a single table, but + they must match a distinct event. That is, each conflict must + match only a single conflict trigger. + - We don't recommend multiple triggers matching the same event on the same table. + They might result in inconsistent behavior and + will not be allowed in a future release. + +If the same conflict trigger matches more than one event, you can use the `TG_OP` +variable in the trigger to identify the operation that +produced the conflict. + +By default, BDR detects conflicts by observing a change of replication origin +for a row. Hence, you can call a conflict trigger even when +only one change is occurring. Since, in this case, there's no +real conflict, this conflict detection mechanism can generate +false-positive conflicts. The conflict trigger must handle all of those +identically. + +In some cases, timestamp conflict detection doesn't detect a +conflict at all. For example, in a concurrent `UPDATE`/`DELETE` where the +`DELETE` occurs just after the `UPDATE`, any nodes that see first the `UPDATE` +and then the `DELETE` don't see any conflict. If no conflict is seen, +the conflict trigger are never called. In the same situation but using +row version conflict detection, a conflict is seen, which a conflict trigger can then +handle. + +The trigger function has access to additional state information as well as +the data row involved in the conflict, depending on the operation type: + +- On `INSERT`, conflict triggers can access the `SOURCE_NEW` row from + the source and `TARGET` row. +- On `UPDATE`, conflict triggers can access the `SOURCE_OLD` and + `SOURCE_NEW` row from the source and `TARGET` row. +- On `DELETE`, conflict triggers can access the `SOURCE_OLD` row from + the source and `TARGET` row. + +You can use the function `bdr.trigger_get_row()` to retrieve `SOURCE_OLD`, `SOURCE_NEW`, +or `TARGET` rows, if a value exists for that operation. + +Changes to conflict triggers happen transactionally and are protected by +global DML locks during replication of the configuration change, similarly +to how some variants of `ALTER TABLE` are handled. + +If primary keys are updated inside a conflict trigger, it can +sometimes lead to unique constraint violations errors due to a difference +in timing of execution. +Hence, avoid updating primary keys in conflict triggers. + +## Transform triggers + +These triggers are similar to conflict triggers, except they are executed +for every row on the data stream against the specific table. The behavior of +return values and the exposed variables is similar, but transform triggers +execute before a target row is identified, so there is no `TARGET` row. + +You can specify multiple transform triggers on each table in BDR. +Transform triggers execute in alphabetical order. + +A transform trigger can filter away rows, and it can do additional operations +as needed. It can alter the values of any column or set them to `NULL`. The +return value decides the further action taken: + +- If the trigger function returns a row, it's applied to the target. +- If the trigger function returns a `NULL` row, there's no further action to + perform. Unexecuted triggers never execute. 
+- The trigger function can perform other actions as it chooses. + +The trigger function has access to additional state information as well as +rows involved in the conflict: + +- On `INSERT`, transform triggers can access the `SOURCE_NEW` row from the source. +- On `UPDATE`, transform triggers can access the `SOURCE_OLD` and `SOURCE_NEW` row from the source. +- On `DELETE`, transform triggers can access the `SOURCE_OLD` row from the source. + +You can use the function `bdr.trigger_get_row()` to retrieve `SOURCE_OLD` or `SOURCE_NEW` +rows. `TARGET` row isn't available, since this type of trigger executes before such +a target row is identified, if any. + +Transform triggers look very similar to normal BEFORE row triggers but have these +important differences: + +- A transform trigger gets called for every incoming change. + BEFORE triggers aren't called at all for `UPDATE` and `DELETE` changes + if a matching row in a table isn't found. + +- Transform triggers are called before partition table routing occurs. + +- Transform triggers have access to the lookup key via `SOURCE_OLD`, + which isn't available to normal SQL triggers. + +## Stream triggers variables + +Both conflict triggers and transform triggers have access to information about +rows and metadata by way of the predefined variables provided by the trigger API and +additional information functions provided by BDR. + +In PL/pgSQL, you can use the predefined variables that follow. + +### TG_NAME + +Data type name. This variable contains the name of the trigger actually fired. +The actual trigger name has a '\_bdrt' or '\_bdrc' suffix +(depending on trigger type) compared to the name provided during trigger creation. + +### TG_WHEN + +Data type text. This variable says `BEFORE` for both conflict and transform triggers. +You can get the stream trigger type by calling the `bdr.trigger_get_type()` +information function. See [bdr.trigger_get_type](#bdrtrigger_get_type). + +### TG_LEVEL + +Data type text: a string of `ROW`. + +### TG_OP + +Data type text: a string of `INSERT`, `UPDATE`, or `DELETE` identifying the operation for which the trigger was fired. + +### TG_RELID + +Data type oid: the object ID of the table that caused the trigger invocation. + +### TG_TABLE_NAME + +Data type name: the name of the table that caused the trigger invocation. + +### TG_TABLE_SCHEMA + +Data type name: the name of the schema of the table that caused the trigger +invocation. For partitioned tables, this is the name of the root table. + +### TG_NARGS + +Data type integer: the number of arguments given to the trigger function in +the `bdr.create_conflict_trigger()` or `bdr.create_transform_trigger()` +statement. + +### TG_ARGV\[] + +Data type array of text: the arguments from the `bdr.create_conflict_trigger()` +or `bdr.create_transform_trigger()` statement. The index counts from 0. +Invalid indexes (less than 0 or greater than or equal to `TG_NARGS`) result in +a `NULL` value. + +## Information functions + +### bdr.trigger_get_row + +This function returns the contents of a trigger row specified by an identifier +as a `RECORD`. This function returns `NULL` if called inappropriately, that is, +called with `SOURCE_NEW` when the operation type (TG_OP) is `DELETE`. + +#### Synopsis + +```sql +bdr.trigger_get_row(row_id text) +``` + +#### Parameters + +- `row_id` — identifier of the row. Can be any of `SOURCE_NEW`, `SOURCE_OLD`, and + `TARGET`, depending on the trigger type and operation (see description of + individual trigger types). 
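+
+As a sketch of how these row identifiers are typically used, the following
+conflict trigger keeps whichever row was updated most recently. The schema,
+table, and column names are hypothetical, and the resolution policy is only an
+example:
+
+```sql
+CREATE FUNCTION myapp.keep_newest_row()
+RETURNS trigger
+LANGUAGE plpgsql
+AS $$
+DECLARE
+    source_row myapp.orders%ROWTYPE;
+    target_row myapp.orders%ROWTYPE;
+BEGIN
+    source_row := bdr.trigger_get_row('SOURCE_NEW');
+    target_row := bdr.trigger_get_row('TARGET');
+
+    -- No local row found, or the incoming row is newer: apply the change
+    IF target_row IS NULL OR source_row.updated_at >= target_row.updated_at THEN
+        RETURN source_row;
+    END IF;
+
+    -- Otherwise keep the local values
+    RETURN target_row;
+END;
+$$;
+
+SELECT bdr.create_conflict_trigger('orders_keep_newest',
+                                   ARRAY['INSERT', 'UPDATE'],
+                                   'myapp.orders',
+                                   'myapp.keep_newest_row()'::regprocedure);
+```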
+ +### bdr.trigger_get_committs + +This function returns the commit timestamp of a trigger row specified by an +identifier. If not available because a row is frozen or isn't available, +returns `NULL`. Always returns `NULL` for row identifier `SOURCE_OLD`. + +#### Synopsis + +```sql +bdr.trigger_get_committs(row_id text) +``` + +#### Parameters + +- `row_id` — Identifier of the row. Can be any of `SOURCE_NEW`, `SOURCE_OLD`, and + `TARGET`, depending on trigger type and operation (see description of + individual trigger types). + +### bdr.trigger_get_xid + +This function returns the local transaction id of a TARGET row specified by an +identifier. If not available because a row is frozen or isn't available, +returns `NULL`. Always returns `NULL` for `SOURCE_OLD` and `SOURCE_NEW` row +identifiers. + +Available only for conflict triggers. + +#### Synopsis + +```sql +bdr.trigger_get_xid(row_id text) +``` + +#### Parameters + +- `row_id` — Identifier of the row. Can be any of `SOURCE_NEW`, `SOURCE_OLD`, and + `TARGET`, depending on trigger type and operation (see description of + individual trigger types). + +### bdr.trigger_get_type + +This function returns the current trigger type, which can be `CONFLICT` +or `TRANSFORM`. Returns null if called outside a stream trigger. + +#### Synopsis + +```sql +bdr.trigger_get_type() +``` + +### bdr.trigger_get_conflict_type + +This function returns the current conflict type if called inside a conflict +trigger. Otherwise, returns `NULL`. + +See [Conflict types](conflicts#list-of-conflict-types) +for possible return values of this function. + +#### Synopsis + +```sql +bdr.trigger_get_conflict_type() +``` + +### bdr.trigger_get_origin_node_id + +This function returns the node id corresponding to the origin for the trigger +row_id passed in as argument. If the origin isn't valid (which means the row +originated locally), returns the node id of the source or target node, +depending on the trigger row argument. Always returns `NULL` for row identifier +`SOURCE_OLD`. You can use this function to define conflict triggers to always favor a +trusted source node. + +#### Synopsis + +```sql +bdr.trigger_get_origin_node_id(row_id text) +``` + +#### Parameters + +- `row_id` — Identifier of the row. Can be any of `SOURCE_NEW`, `SOURCE_OLD`, and + `TARGET`, depending on trigger type and operation (see description of + individual trigger types). + +### bdr.ri_fkey_on_del_trigger + +When called as a BEFORE trigger, this function uses FOREIGN KEY information +to avoid FK anomalies. + +#### Synopsis + +```sql +bdr.ri_fkey_on_del_trigger() +``` + +## Row contents + +The `SOURCE_NEW`, `SOURCE_OLD`, and `TARGET` contents depend on the operation, REPLICA +IDENTITY setting of a table, and the contents of the target table. + +The TARGET row is available only in conflict triggers. The TARGET row +contains data only if a row was found when applying `UPDATE` or `DELETE` in the target +table. If the row isn't found, the TARGET is `NULL`. + +## Triggers notes + +Execution order for triggers: + +- Transform triggers — Execute once for each incoming row on the target. +- Normal triggers — Execute once per row. +- Conflict triggers — Execute once per row where a conflict exists. + +## Stream triggers manipulation interfaces + +You can create stream triggers only on tables with `REPLICA IDENTITY FULL` +or tables without any columns to which `TOAST` applies. + +### bdr.create_conflict_trigger + +This function creates a new conflict trigger. 
+ +#### Synopsis + +```sql +bdr.create_conflict_trigger(trigger_name text, + events text[], + relation regclass, + function regprocedure, + args text[] DEFAULT '{}') +``` + +#### Parameters + +- `trigger_name` — Name of the new trigger. +- `events` — Array of events on which to fire this trigger. Valid values are + '`INSERT`', '`UPDATE`', and '`DELETE`'. +- `relation` — Relation to fire this trigger for. +- `function` — The function to execute. +- `args` — Optional. Specifies the array of parameters the trigger function + receives on execution (contents of `TG_ARGV` variable). + +#### Notes + +This function uses the same replication mechanism as `DDL` statements. This +means that the replication is affected by the +[ddl filters](repsets#ddl-replication-filtering) configuration. + +The function takes a global DML lock on the relation on which the trigger +is being created. + +This function is transactional. You can roll back the effects with the +`ROLLBACK` of the transaction. The changes are visible to the current +transaction. + +Similar to normal PostgreSQL triggers, the `bdr.create_conflict_trigger` +function requires `TRIGGER` privilege on the `relation` and `EXECUTE` +privilege on the function. This applies with a +`bdr.backwards_compatibility` of 30619 or above. Additional +security rules apply in BDR to all triggers including conflict +triggers. See [Security and roles](security#triggers). + +### bdr.create_transform_trigger + +This function creates a transform trigger. + +#### Synopsis + +```sql +bdr.create_transform_trigger(trigger_name text, + events text[], + relation regclass, + function regprocedure, + args text[] DEFAULT '{}') +``` + +#### Parameters + +- `trigger_name` — Name of the new trigger. +- `events` — Array of events on which to fire this trigger. Valid values are + '`INSERT`', '`UPDATE`', and '`DELETE`'. +- `relation` — Relation to fire this trigger for. +- `function` — The function to execute. +- `args` — Optional. Specify array of parameters the trigger function + receives on execution (contents of `TG_ARGV` variable). + +#### Notes + +This function uses the same replication mechanism as `DDL` statements. This +means that the replication is affected by the +[ddl filters](repsets#ddl-replication-filtering) configuration. + +The function takes a global DML lock on the relation on which the trigger +is being created. + +This function is transactional. You can roll back the effects with the +`ROLLBACK` of the transaction. The changes are visible to the current +transaction. + +Similarly to normal PostgreSQL triggers, the `bdr.create_transform_trigger` +function requires the `TRIGGER` privilege on the `relation` and `EXECUTE` +privilege on the function. Additional security rules apply in BDR to all +triggers including transform triggers. See +[Security and roles](security#triggers). + +### bdr.drop_trigger + +This function removes an existing stream trigger (both conflict and transform). + +#### Synopsis + +```sql +bdr.drop_trigger(trigger_name text, + relation regclass, + ifexists boolean DEFAULT false) +``` + +#### Parameters + +- `trigger_name` — Name of an existing trigger. +- `relation` — The relation the trigger is defined for. +- `ifexists` — When set to `true`, this function ignores missing + triggers. + +#### Notes + +This function uses the same replication mechanism as `DDL` statements. This +means that the replication is affected by the +[ddl filters](repsets#ddl-replication-filtering) configuration. 
+
+The function takes a global DML lock on the relation on which the trigger
+is being dropped.
+
+This function is transactional. You can roll back the effects with the
+`ROLLBACK` of the transaction. The changes are visible to the current
+transaction.
+
+Only the owner of the `relation` can execute the `bdr.drop_trigger` function.
+
+## Stream triggers examples
+
+A conflict trigger that provides behavior similar to the `update_if_newer`
+conflict resolver:
+
+```sql
+CREATE OR REPLACE FUNCTION update_if_newer_trig_func()
+RETURNS TRIGGER
+LANGUAGE plpgsql
+AS $$
+DECLARE
+    SOURCE record;
+    TARGET record;
+BEGIN
+    SOURCE := bdr.trigger_get_row('SOURCE_NEW');
+    TARGET := bdr.trigger_get_row('TARGET');
+
+    IF (bdr.trigger_get_committs('TARGET') >
+        bdr.trigger_get_committs('SOURCE_NEW')) THEN
+        RETURN TARGET;
+    ELSE
+        RETURN SOURCE;
+    END IF;
+END;
+$$;
+```
+
+A conflict trigger that applies a delta change on a counter column and uses
+SOURCE_NEW for all other columns:
+
+```sql
+CREATE OR REPLACE FUNCTION delta_count_trg_func()
+RETURNS TRIGGER
+LANGUAGE plpgsql
+AS $$
+DECLARE
+    DELTA bigint;
+    SOURCE_OLD record;
+    SOURCE_NEW record;
+    TARGET record;
+BEGIN
+    SOURCE_OLD := bdr.trigger_get_row('SOURCE_OLD');
+    SOURCE_NEW := bdr.trigger_get_row('SOURCE_NEW');
+    TARGET := bdr.trigger_get_row('TARGET');
+
+    DELTA := SOURCE_NEW.counter - SOURCE_OLD.counter;
+    SOURCE_NEW.counter := TARGET.counter + DELTA;
+
+    RETURN SOURCE_NEW;
+END;
+$$;
+```
+
+A transform trigger that logs all changes to a log table instead of applying them:
+
+```sql
+CREATE OR REPLACE FUNCTION log_change()
+RETURNS TRIGGER
+LANGUAGE plpgsql
+AS $$
+DECLARE
+    SOURCE_NEW record;
+    SOURCE_OLD record;
+    COMMITTS timestamptz;
+BEGIN
+    SOURCE_NEW := bdr.trigger_get_row('SOURCE_NEW');
+    SOURCE_OLD := bdr.trigger_get_row('SOURCE_OLD');
+    COMMITTS := bdr.trigger_get_committs('SOURCE_NEW');
+
+    IF (TG_OP = 'INSERT') THEN
+        INSERT INTO log SELECT 'I', COMMITTS, row_to_json(SOURCE_NEW);
+    ELSIF (TG_OP = 'UPDATE') THEN
+        INSERT INTO log SELECT 'U', COMMITTS, row_to_json(SOURCE_NEW);
+    ELSIF (TG_OP = 'DELETE') THEN
+        INSERT INTO log SELECT 'D', COMMITTS, row_to_json(SOURCE_OLD);
+    END IF;
+
+    RETURN NULL; -- do not apply the change
+END;
+$$;
+```
+
+This example shows a conflict trigger that implements trusted source
+conflict detection, also known as trusted site, preferred node, or Always Wins
+resolution. This uses the `bdr.trigger_get_origin_node_id()` function to provide
+a solution that works with three or more nodes.
+
+```sql
+CREATE OR REPLACE FUNCTION test_conflict_trigger()
+RETURNS TRIGGER
+LANGUAGE plpgsql
+AS $$
+DECLARE
+    SOURCE record;
+    TARGET record;
+
+    TRUSTED_NODE bigint;
+    SOURCE_NODE bigint;
+    TARGET_NODE bigint;
+BEGIN
+    TARGET := bdr.trigger_get_row('TARGET');
+    IF (TG_OP = 'DELETE') THEN
+        SOURCE := bdr.trigger_get_row('SOURCE_OLD');
+    ELSE
+        SOURCE := bdr.trigger_get_row('SOURCE_NEW');
+    END IF;
+
+    TRUSTED_NODE := current_setting('customer.trusted_node_id');
+
+    SOURCE_NODE := bdr.trigger_get_origin_node_id('SOURCE_NEW');
+    TARGET_NODE := bdr.trigger_get_origin_node_id('TARGET');
+
+    IF (TRUSTED_NODE = SOURCE_NODE) THEN
+        RETURN SOURCE;
+    ELSIF (TRUSTED_NODE = TARGET_NODE) THEN
+        RETURN TARGET;
+    ELSE
+        RETURN NULL; -- do not apply the change
+    END IF;
+END;
+$$;
+```
diff --git a/product_docs/docs/pgd/4/overview/bdr/transaction-streaming.mdx b/product_docs/docs/pgd/4/overview/bdr/transaction-streaming.mdx
new file mode 100644
index 00000000000..61de2c341af
--- /dev/null
+++ b/product_docs/docs/pgd/4/overview/bdr/transaction-streaming.mdx
@@ -0,0 +1,160 @@
+---
+navTitle: Transaction streaming
+title: Transaction streaming
+
+
+---
+
+With logical replication, transactions are decoded concurrently on the publisher
+but aren't sent to subscribers until the transaction is committed. If the
+changes exceed `logical_decoding_work_mem` (PostgreSQL 13 and later), they're
+spilled to disk. This means that, particularly with large transactions, there's
+some delay before changes reach subscribers, and spilling can entail additional I/O
+on the publisher.
+
+Beginning with PostgreSQL 14, transactions can optionally be decoded and sent to
+subscribers before they're committed on the publisher. The subscribers save
+the incoming changes to a staging file (or set of files) and apply them when
+the transaction commits (or discard them if the transaction aborts). This makes
+it possible to apply transactions on subscribers as soon as the transaction
+commits.
+
+## BDR enhancements
+
+PostgreSQL's built-in transaction streaming has the following
+limitations:
+
+- While you no longer need to spill changes to disk on the publisher, you must write changes
+  to disk on each subscriber.
+- If the transaction aborts, the work (changes received by each subscriber
+  and the associated storage I/O) is wasted.
+
+However, starting with version 3.7, BDR supports parallel apply, enabling multiple writer
+processes on each subscriber. This capability is leveraged to provide the following enhancements:
+
+- Decoded transactions can be streamed directly to a writer on the subscriber.
+- Decoded transactions don't need to be stored on disk on subscribers.
+- You don't need to wait for the transaction to commit before starting to apply the
+  transaction on the subscriber.
+
+### Caveats
+
+- You must enable parallel apply.
+- Workloads consisting of many small and conflicting transactions can lead to
+  frequent deadlocks between writers.
+
+!!! Note
+    Direct streaming to a writer is still an experimental feature. Use it
+    with caution. Specifically, it might not work well with
+    conflict resolutions since the commit timestamp of the streaming transaction might not
+    be available. (The transaction might not yet have committed on the
+    origin.)
+ +## Configuration + +Configure transaction streaming in two locations: + +- At node level, using the GUC [bdr.default_streaming_mode](configuration#transaction-streaming) +- At group level, using the function [bdr.alter_node_group_config()](nodes#bdralter_node_group_config) + +### Node configuration using bdr.default_streaming_mode + +Permitted values are: + +- `off` +- `writer` +- `file` +- `auto` + +Default value is `auto`. + +Changing this setting requires a restart of the +pglogical receiver process for each subscription for the setting to take effect. You can achieve this with a server +restart. + +If `bdr.default_streaming_mode` is set any value other than `off`, the +subscriber requests transaction streaming from the publisher. How this is +provided can also depend on the group configuration setting. See +[Node configuration using bdr.default_streaming_mode](#node-configuration-using-bdrdefault_streaming_mode) for details. + +### Group configuration using bdr.alter_node_group_config() + +You can use the parameter `streaming_mode` in the function [bdr.alter_node_group_config()](nodes#bdralter_node_group_config) +to set the group transaction streaming configuration. + +Permitted values are: + +- `off` +- `writer` +- `file` +- `auto` +- `default` + +The default value is `default`. + +The value of the current setting is contained in the column `node_group_streaming_mode` +from the view [bdr.node_group](catalogs#bdrnode_group). The value returned is +a single char type, and the possible values are `D` (`default`), `W` (`writer`), +`F` (`file`), `A` (`auto`), and `O` (`off`). + +### Configuration setting effects + +Transaction streaming is controlled at the subscriber level by the GUC +`bdr.default_streaming_mode`. Unless set to `off` (which disables transaction +streaming), the subscriber requests transaction streaming. + +If the publisher can provide transaction streaming, it +streams transactions whenever the transaction size exceeds the threshold set in +`logical_decoding_work_mem`. The publisher usually has no control over whether +the transactions is streamed to a file or to a writer. Except for some +situations (such as COPY), it might hint for the subscriber to stream the +transaction to a writer (if possible). + +The subscriber can stream transactions received from the publisher to +either a writer or a file. The decision is based on several factors: + +- If parallel apply is off (`num_writers = 1`), then it's streamed to a file. + (writer 0 is always reserved for non-streamed transactions.) +- If parallel apply is on but all writers are already busy handling streamed + transactions, then the new transaction is streamed to a file. See + [bdr.writers](monitoring#monitoring-bdr-writers) to check BDR + writer status. + +If streaming to a writer is possible (that is, a free writer is available), then the +decision whether to stream the transaction to a writer or a file is based on +the combination of group and node settings as per the following table: + +| Group | Node | Streamed to | +| ------- | ------ | ----------- | +| off | (any) | (none) | +| (any) | off | (none) | +| writer | file | file | +| file | writer | file | +| default | writer | writer | +| default | file | file | +| default | auto | writer | +| auto | (any) | writer | + +If the group configuration is set to `auto`, or the group +configuration is `default` and the node configuration is `auto`, +then the transaction is streamed to a writer only if the +publisher hinted to do this. 
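+
+As an illustration, the following sketch combines the node and group settings so
+that large transactions are streamed to a writer whenever a free writer is
+available and the publisher allows it. The group name `mygroup` and the
+named-parameter call form are assumptions; adjust both for your environment:
+
+```sql
+-- Node level: request transaction streaming and let the publisher's hints
+-- decide the destination. Takes effect only after the pglogical receiver
+-- restarts, for example after a server restart.
+ALTER SYSTEM SET bdr.default_streaming_mode = 'auto';
+
+-- Group level: stream to a writer when the publisher hints that it's safe.
+SELECT bdr.alter_node_group_config(
+    node_group_name := 'mygroup',
+    streaming_mode := 'auto'
+);
+
+-- Verify the group-level setting (single-character code, e.g. 'A' for auto).
+SELECT node_group_streaming_mode FROM bdr.node_group;
+```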
+ +Currently the publisher hints for the subscriber to stream to the writer +for the following transaction types. These are known to be conflict free +and can be safely handled by the writer. + +- `COPY` +- `CREATE INDEX CONCURRENTLY` + +## Monitoring + +You can monitor the use of transaction streaming using the [bdr.stat_subscription](catalogs#bdrstat_subscription) +function on the subscriber node. + +- `nstream_writer` — Number of transactions streamed to a writer. +- `nstream_file` — Number of transactions streamed to file. +- `nstream_commit` — Number of committed streamed transactions. +- `nstream_abort` — Number of aborted streamed transactions. +- `nstream_start` — Number of streamed transactions that were started. +- `nstream_stop` — Number of streamed transactions that were fully received. diff --git a/product_docs/docs/pgd/4/overview/bdr/tssnapshots.mdx b/product_docs/docs/pgd/4/overview/bdr/tssnapshots.mdx new file mode 100644 index 00000000000..abf162a0699 --- /dev/null +++ b/product_docs/docs/pgd/4/overview/bdr/tssnapshots.mdx @@ -0,0 +1,60 @@ +--- +title: Timestamp-based snapshots + + +--- + +The timestamp-based snapshots allow reading data in a consistent manner by using +a user-specified timestamp rather than the usual MVCC snapshot. You can use this +to access data on different BDR nodes at a common point in time. For +example, you can use this as a way to compare data on multiple nodes for data-quality checking. + +This feature doesn't currently work with write transactions. + +Enable the use of timestamp-based snapshots using the `snapshot_timestamp` +parameter. This parameter accepts either a timestamp value or +a special value, `'current'`, which represents the current timestamp (now). If +`snapshot_timestamp` is set, queries use that timestamp to determine +visibility of rows rather than the usual MVCC semantics. + +For example, the following query returns the state of the `customers` table at +2018-12-08 02:28:30 GMT: + +```sql +SET snapshot_timestamp = '2018-12-08 02:28:30 GMT'; +SELECT count(*) FROM customers; +``` + +Without BDR, this works only with future timestamps or the +special 'current' value, so you can't use it for historical queries. + +BDR works with and improves on that feature in a multi-node environment. First, +BDR makes sure that all connections to other nodes replicate any +outstanding data that was added to the database before the specified +timestamp. This ensures that the timestamp-based snapshot is consistent across the whole +multi-master group. Second, BDR adds a parameter called +`bdr.timestamp_snapshot_keep`. This specifies a window of time when you can execute +queries against the recent history on that node. + +You can specify any interval, but be aware that VACUUM (including autovacuum) +doesn't clean dead rows that are newer than up to twice the specified +interval. This also means that transaction ids aren't freed for the same +amount of time. As a result, using this can leave more bloat in user tables. +Initially, we recommend 10 seconds as a typical setting, although you can change that as needed. + +Once the query is accepted for execution, the query might run +for longer than `bdr.timestamp_snapshot_keep` without problem, just as normal. + +Also, information about how far the snapshots were kept doesn't +survive server restart. The oldest usable timestamp for the timestamp-based +snapshot is the time of last restart of the PostgreSQL instance. 
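+
+For example, the following sketch widens the retention window on a node and then
+runs a query against recent history. It assumes `bdr.timestamp_snapshot_keep`
+accepts a standard interval-style value and is set like any other server
+parameter, and it reuses the `customers` table from the earlier example:
+
+```sql
+-- Keep roughly 10 seconds of history available for timestamp-based snapshots.
+ALTER SYSTEM SET bdr.timestamp_snapshot_keep = '10s';
+SELECT pg_reload_conf();
+
+-- Query the table as it stood 5 seconds ago. set_config() is used here
+-- because SET accepts only literal values, not expressions.
+SELECT set_config('snapshot_timestamp',
+                  (now() - interval '5 seconds')::text, false);
+SELECT count(*) FROM customers;
+```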
+ +You can combine the use of `bdr.timestamp_snapshot_keep` with the +`postgres_fdw` extension to get a consistent read across multiple nodes in a +BDR group. You can use this to run parallel queries across nodes, when used with foreign tables. + +There are no limits on the number of nodes in a multi-node query when using this +feature. + +Use of timestamp-based snapshots doesn't increase inter-node traffic or +bandwidth. Only the timestamp value is passed in addition to query data. diff --git a/product_docs/docs/pgd/4/overview/bdr/twophase.mdx b/product_docs/docs/pgd/4/overview/bdr/twophase.mdx new file mode 100644 index 00000000000..e7332335a35 --- /dev/null +++ b/product_docs/docs/pgd/4/overview/bdr/twophase.mdx @@ -0,0 +1,66 @@ +--- +navTitle: Two-phase commit +title: Explicit two-phase commit (2PC) + + +--- + +An application can explicitly opt to use two-phase commit with BDR. See +[Distributed Transaction Processing: The XA Specification](http://pubs.opengroup.org/onlinepubs/009680699/toc.pdf). + +The X/Open Distributed Transaction Processing (DTP) model envisions three +software components: + +- An application program (AP) that defines transaction boundaries and specifies + actions that constitute a transaction +- Resource managers (RMs, such as databases or file-access systems) that provide + access to shared resources +- A separate component called a transaction manager (TM) that assigns identifiers + to transactions, monitors their progress, and takes responsibility for + transaction completion and for failure recovery + +BDR supports explicit external 2PC using the `PREPARE TRANSACTION` and +`COMMIT PREPARED`/`ROLLBACK PREPARED` commands. Externally, a BDR cluster +appears to be a single resource manager to the transaction manager for a +single session. + +When `bdr.commit_scope` is `local`, the transaction is prepared only +on the local node. Once committed, changes are replicated, and +BDR then applies post-commit conflict resolution. + +Using `bdr.commit_scope` set to `local` might not seem to make sense with +explicit two-phase commit, but the option is offered to allow you +to control the tradeoff between transaction latency and robustness. + +Explicit two-phase commit doesn't work with either CAMO +or the global commit scope. Future releases might enable this combination. + +## Use + +Two-phase commits with a local commit scope work exactly like standard +PostgreSQL. Use the local commit scope and disable CAMO. + +```sql +BEGIN; + +SET LOCAL bdr.enable_camo = 'off'; +SET LOCAL bdr.commit_scope = 'local'; + +... other commands possible... +``` + +To start the first phase of the commit, the client must assign a +global transaction id, which can be any unique string identifying the +transaction: + +```sql +PREPARE TRANSACTION 'some-global-id'; +``` + +After a successful first phase, all nodes have applied the changes and +are prepared for committing the transaction. The client must then invoke +the second phase from the same node: + +```sql +COMMIT PREPARED 'some-global-id'; +``` diff --git a/product_docs/docs/pgd/4/overview/harp/01_release_notes/harp2.0.1_rel_notes.mdx b/product_docs/docs/pgd/4/overview/harp/01_release_notes/harp2.0.1_rel_notes.mdx new file mode 100644 index 00000000000..208b8eb8be2 --- /dev/null +++ b/product_docs/docs/pgd/4/overview/harp/01_release_notes/harp2.0.1_rel_notes.mdx @@ -0,0 +1,18 @@ +--- +title: "Version 2.0.1" +--- + +This is a patch release of HARP 2 that includes fixes for issues identified +in previous versions. 
+ +| Type | Description | +| ---- |------------ | +| Enhancement | Support for selecting a leader per location rather than relying on DCS like etcd to have separate setup in different locations. This still requires a majority of nodes to survive loss of a location, so an odd number of both locations and database nodes is recommended.| +| Enhancement | The BDR DCS now uses a push notification from the consensus rather than through polling nodes. This change reduces the time for new leader selection and the load that HARP does on the BDR DCS since it doesn't need to poll in short intervals anymore. | +| Enhancement | TPA now restarts each HARP Proxy one by one and wait until they come back to reduce any downtime incurred by the application during software upgrades. | +| Enhancement | The support for embedding PGBouncer directly into HARP Proxy is now deprecated and will be removed in the next major release of HARP. It's now possible to configure TPA to put PGBouncer on the same node as HARP Proxy and point to that HARP Proxy.| +| Bug Fix | `harpctl promoteHARP offers multiple options for Distributed Consensus Service (DCS) source: etcd and BDR. The BDR consensus option can be used in deployments where etcd isn't present. Use of the BDR consensus option is no longer considered beta and is now supported for use in production environments.
| +| Enhancement | Transport layer proxy now generally available.HARP offers multiple proxy options for routing connections between the client application and database: application layer (L7) and transport layer (L4). The network layer 4 or transport layer proxy simply forwards network packets, and layer 7 terminates network traffic. The transport layer proxy, previously called simple proxy, is no longer considered beta and is now supported for use in production environments.
| diff --git a/product_docs/docs/pgd/4/overview/harp/01_release_notes/harp2.0.3_rel_notes.mdx b/product_docs/docs/pgd/4/overview/harp/01_release_notes/harp2.0.3_rel_notes.mdx new file mode 100644 index 00000000000..75722ff6794 --- /dev/null +++ b/product_docs/docs/pgd/4/overview/harp/01_release_notes/harp2.0.3_rel_notes.mdx @@ -0,0 +1,11 @@ +--- +title: "Version 2.0.3" +--- + +This is a patch release of HARP 2 that includes fixes for issues identified +in previous versions. + +| Type | Description | +| ---- |------------ | +| Enhancement | HARP Proxy supports read-only user dedicated TLS Certificate (RT78516) | +| Bug Fix | HARP Proxy continues to try and connect to DCS instead of exiting after 50 seconds. (RT75406) | diff --git a/product_docs/docs/pgd/4/overview/harp/01_release_notes/harp2.1.0_rel_notes.mdx b/product_docs/docs/pgd/4/overview/harp/01_release_notes/harp2.1.0_rel_notes.mdx new file mode 100644 index 00000000000..bb7e5b16d47 --- /dev/null +++ b/product_docs/docs/pgd/4/overview/harp/01_release_notes/harp2.1.0_rel_notes.mdx @@ -0,0 +1,17 @@ +--- +title: "Version 2.1.0" +--- + +This is a minor release of HARP 2 that includes new features as well +as fixes for issues identified in previous versions. + +| Type | Description | +| ---- |------------ | +| Feature | The BDR DCS now uses a push notification from the consensus rather than through polling nodes.This change reduces the time for new leader selection and the load that HARP does on the BDR DCS since it doesn't need to poll in short intervals anymore.
| +| Feature | TPA now restarts each HARP Proxy one by one and wait until they come back to reduce any downtime incurred by the application during software upgrades. | +| Feature | The support for embedding PGBouncer directly into HARP Proxy is now deprecated and will be removed in the next major release of HARP.It's now possible to configure TPA to put PGBouncer on the same node as HARP Proxy and point to that HARP Proxy.
| +| Bug Fix | `harpctl promoteThe use of HARP Router to translate DCS contents into appropriate online or offline states for HTTP-based URI requests meant a load balancer or HAProxy was necessary to determine the lead master. HARP Proxy now does this automatically without periodic iterative status checks.
| +| Feature | Utilizes DCS key subscription to respond directly to state changes.With relevant cluster state changes, the cluster responds immediately, resulting in improved failover and switchover times.
| +| Feature | Compatibility with etcd SSL settings.It is now possible to communicate with etcd through SSL encryption.
| +| Feature | Zero transaction lag on switchover.Transactions are not routed to the new lead node until all replicated transactions are replayed, thereby reducing the potential for conflicts.
+| Feature | Experimental BDR Consensus layer.Using BDR Consensus as the Distributed Consensus Service (DCS) reduces the amount of change needed for implementations.
+| Feature | Experimental built-in proxy.Proxy implementation for increased session control.
| diff --git a/product_docs/docs/pgd/4/overview/harp/01_release_notes/index.mdx b/product_docs/docs/pgd/4/overview/harp/01_release_notes/index.mdx new file mode 100644 index 00000000000..6905f183a81 --- /dev/null +++ b/product_docs/docs/pgd/4/overview/harp/01_release_notes/index.mdx @@ -0,0 +1,26 @@ +--- +title: Release Notes +navigation: +- harp2.1.0_rel_notes +- harp2.0.3_rel_notes +- harp2.0.2_rel_notes +- harp2.0.1_rel_notes +- harp2_rel_notes +--- + +High Availability Routing for Postgres (HARP) is a cluster-management tool for +[Bi-directional Replication (BDR)](/bdr/latest) clusters. The core design of +the tool is to route all application traffic in a single data center or +region to only one node at a time. This node, designated the lead master, acts +as the principle write target to reduce the potential for data conflicts. + +The release notes in this section provide information on what was new in each release. + +| Version | Release Date | +| ----------------------- | ------------ | +| [2.1.1](harp2.1.1_rel_notes) | 2022 June 21 | +| [2.1.0](harp2.1.0_rel_notes) | 2022 May 17 | +| [2.0.3](harp2.0.3_rel_notes) | 2022 Mar 31 | +| [2.0.2](harp2.0.2_rel_notes) | 2022 Feb 24 | +| [2.0.1](harp2.0.1_rel_notes) | 2021 Jan 31 | +| [2.0.0](harp2_rel_notes) | 2021 Dec 01 | diff --git a/product_docs/docs/pgd/4/overview/harp/02_overview.mdx b/product_docs/docs/pgd/4/overview/harp/02_overview.mdx new file mode 100644 index 00000000000..7db92e093cd --- /dev/null +++ b/product_docs/docs/pgd/4/overview/harp/02_overview.mdx @@ -0,0 +1,246 @@ +--- +navTitle: Overview +title: HARP functionality overview +--- + +HARP is a new approach to high availability for BDR +clusters. It leverages a consensus-driven quorum to determine the correct connection endpoint +in a semi-exclusive manner to prevent unintended multi-node writes from an +application. + +## The importance of quorum + +The central purpose of HARP is to enforce full quorum on any Postgres cluster +it manages. Quorum is a term applied to a voting body that +mandates a certain minimum of attendees are available to make a decision. More simply: majority rules. + +For any vote to end in a result other than a tie, an odd number of +nodes must constitute the full cluster membership. Quorum, however, doesn't +strictly demand this restriction; a simple majority is enough. This means +that in a cluster of N nodes, quorum requires a minimum of N/2+1 nodes to hold +a meaningful vote. + +All of this ensures the cluster is always in agreement regarding the node +that is "in charge." For a BDR cluster consisting of multiple nodes, this +determines the node that is the primary write target. HARP designates this node +as the lead master. + +## Reducing write targets + +The consequence of ignoring the concept of quorum, or not applying it +well enough, can lead to a "split brain" scenario where the "correct" write +target is ambiguous or unknowable. In a standard Postgres cluster, it's +important that only a single node is ever writable and sending replication +traffic to the remaining nodes. + +Even in multi-master-capable approaches such as BDR, it can be help to +reduce the amount of necessary conflict management to derive identical data +across the cluster. In clusters that consist of multiple BDR nodes per physical +location or region, this usually means a single BDR node acts as a "leader" and +remaining nodes are "shadow." These shadow nodes are still writable, but writing to them is discouraged unless absolutely necessary. 
+ +By leveraging quorum, it's possible for all nodes to agree on the exact +Postgres node to represent the entire cluster or a local BDR region. Any +nodes that lose contact with the remainder of the quorum, or are overruled by +it, by definition can't become the cluster leader. + +This restriction prevents split-brain situations where writes unintentionally reach two +Postgres nodes. Unlike technologies such as VPNs, proxies, load balancers, or +DNS, you can't circumvent a quorum-derived consensus by misconfiguration or +network partitions. So long as it's possible to contact the consensus layer to +determine the state of the quorum maintained by HARP, only one target is ever +valid. + +## Basic architecture + +The design of HARP comes in essentially two parts, consisting of a manager and +a proxy. The following diagram describes how these interact with a single +Postgres instance: + +![HARP Unit](images/ha-unit.png) + +The consensus layer is an external entity where Harp Manager maintains +information it learns about its assigned Postgres node, and HARP Proxy +translates this information to a valid Postgres node target. Because Proxy +obtains the node target from the consensus layer, several such instances can +exist independently. + +While using BDR as the consensus layer, each server node resembles this +variant instead: + +![HARP Unit w/BDR Consensus](images/ha-unit-bdr.png) + +In either case, each unit consists of the following elements: + +* A Postgres or EDB instance +* A consensus layer resource, meant to track various attributes of the Postgres + instance +* A HARP Manager process to convey the state of the Postgres node to the + consensus layer +* A HARP Proxy service that directs traffic to the proper lead master node, + as derived from the consensus layer + +Not every application stack has access to additional node resources +specifically for the Proxy component, so it can be combined with the +application server to simplify the stack. + +This is a typical design using two BDR nodes in a single data center organized in a lead master/shadow master configuration: + +![HARP Cluster](images/ha-ao.png) + +When using BDR as the HARP consensus layer, at least three +fully qualified BDR nodes must be present to ensure a quorum majority. (Not shown in the diagram are connections between BDR nodes.) + +![HARP Cluster w/BDR Consensus](images/ha-ao-bdr.png) + +## How it works + +When managing a BDR cluster, HARP maintains at most one leader node per +defined location. This is referred to as the lead master. Other BDR +nodes that are eligible to take this position are shadow master state until they take the leader role. + +Applications can contact the current leader only through the proxy service. +Since the consensus layer requires quorum agreement before conveying leader +state, proxy services direct traffic to that node. + +At a high level, this mechanism prevents simultaneous application interaction with +multiple nodes. + +### Determining a leader + +As an example, consider the role of lead master in a locally subdivided +BDR Always-On group as can exist in a single data center. When any +Postgres or Manager resource is started, and after a configurable refresh +interval, the following must occur: + +1. The Manager checks the status of its assigned Postgres resource. + - If Postgres isn't running, try again after configurable timeout. + - If Postgres is running, continue. +2. The Manager checks the status of the leader lease in the consensus layer. 
+ - If the lease is unclaimed, acquire it and assign the identity of + the Postgres instance assigned to this manager. This lease duration is + configurable, but setting it too low can result in unexpected leadership + transitions. + - If the lease is already claimed by us, renew the lease TTL. + - Otherwise do nothing. + +A lot more occurs, but this simplified version explains +what's happening. The leader lease can be held by only one node, and if it's +held elsewhere, HARP Manager gives up and tries again later. + +!!! Note + Depending on the chosen consensus layer, rather than repeatedly looping to + check the status of the leader lease, HARP subscribes to notifications. In this case, it can respond immediately any time the state of the + lease changes rather than polling. Currently this functionality is + restricted to the etcd consensus layer. + +This means HARP itself doesn't hold elections or manage quorum, which is +delegated to the consensus layer. A quorum of the consensus layer must acknowledge the act of obtaining the lease, so if the request succeeds, +that node leads the cluster in that location. + +### Connection routing + +Once the role of the lead master is established, connections are handled +with a similar deterministic result as reflected by HARP Proxy. Consider a case +where HARP Proxy needs to determine the connection target for a particular backend +resource: + +1. HARP Proxy interrogates the consensus layer for the current lead master in + its configured location. +2. If this is unset or in transition: + - New client connections to Postgres are barred, but clients + accumulate and are in a paused state until a lead master appears. + - Existing client connections are allowed to complete current transactions + and are then reverted to a similar pending state as new connections. +3. Client connections are forwarded to the lead master. + +The interplay shown in this case doesn't require any +interaction with either HARP Manager or Postgres. The consensus layer +is the source of all truth from the proxy's perspective. + +### Colocation + +The arrangement of the work units is such that their organization must follow these principles: + +1. The manager and Postgres units must exist concomitantly in the same + node. +2. The contents of the consensus layer dictate the prescriptive role of all + operational work units. + +This arrangement delegates cluster quorum responsibilities to the consensus layer, +while HARP leverages it for critical role assignments and key/value storage. +Neither storage nor retrieval succeeds if the consensus layer is inoperable +or unreachable, thus preventing rogue Postgres nodes from accepting +connections. + +As a result, the consensus layer generally exists outside of HARP or HARP-managed nodes for maximum safety. Our reference diagrams show this separation, although it isn't required. + +!!! Note + To operate and manage cluster state, BDR contains its own + implementation of the Raft Consensus model. You can configure HARP to + leverage this same layer to reduce reliance on external dependencies and + to preserve server resources. However, certain drawbacks to this + approach are discussed in + [Consensus layer](09_consensus-layer). + +## Recommended architecture and use + +HARP was primarily designed to represent a BDR Always-On architecture that +resides in two or more data centers and consists of at least five BDR +nodes. This configuration doesn't count any logical standby nodes. 
+ +The following diagram shows the current and standard representation: + +![BDR Always-On Reference Architecture](images/bdr-ao-spec.png) + +In this diagram, HARP Manager exists on BDR Nodes 1-4. The initial state +of the cluster is that BDR Node 1 is the lead master of DC A, and BDR +Node 3 is the lead master of DC B. + +This configuration results in any HARP Proxy resource in DC A connecting to BDR Node 1 +and the HARP Proxy resource in DC B connecting to BDR Node 3. + +!!! Note + While this diagram shows only a single HARP Proxy per DC, this is + an example only and should not be considered a single point of failure. Any + number of HARP Proxy nodes can exist, and they all direct application + traffic to the same node. + +### Location configuration + +For multiple BDR nodes to be eligible to take the lead master lock in +a location, you must define a location in the `config.yml` configuration +file. + +To reproduce the BDR Always-On reference architecture shown in the diagram, include these lines in the `config.yml` +configuration for BDR Nodes 1 and 2: + +```yaml +location: dca +``` + +For BDR Nodes 3 and 4, add: + +```yaml +location: dcb +``` + +This applies to any HARP Proxy nodes that are designated in those respective +data centers as well. + +### BDR 3.7 compatibility + +BDR 3.7 and later offers more direct location definition by assigning a +location to the BDR node. This is done by calling the following SQL +API function while connected to the BDR node. So for BDR Nodes 1 and 2, you +might do this: + +```sql +SELECT bdr.set_node_location('dca'); +``` + +And for BDR Nodes 3 and 4: + +```sql +SELECT bdr.set_node_location('dcb'); +``` diff --git a/product_docs/docs/pgd/4/overview/harp/03_installation.mdx b/product_docs/docs/pgd/4/overview/harp/03_installation.mdx new file mode 100644 index 00000000000..b30b6d2bdf2 --- /dev/null +++ b/product_docs/docs/pgd/4/overview/harp/03_installation.mdx @@ -0,0 +1,128 @@ +--- +navTitle: Installation +title: Installation +--- + +A standard installation of HARP includes two system services: + +* HARP Manager (`harp-manager`) on the node being managed +* HARP Proxy (`harp-proxy`) elsewhere + +There are two ways to install and configure these services to manage +Postgres for proper quorum-based connection routing. + +## Software versions + +HARP has dependencies on external software. These must fit a minimum +version as listed here. + +| Software | Min version | +|-----------|---------| +| etcd | 3.4 | +| PgBouncer | 1.14 | + +## TPAExec + +The easiest way to install and configure HARP is to use the EDB TPAexec utility +for cluster deployment and management. For details on this software, see the +[TPAexec product page](https://www.enterprisedb.com/docs/pgd/latest/deployments/tpaexec/). + +!!! Note + TPAExec is currently available only through an EULA specifically dedicated + to BDR cluster deployments. If you can't access the TPAExec URL, + contact your sales or account representative. + +Configure TPAexec to recognize that cluster routing is +managed through HARP by ensuring the TPA `config.yml` file contains these +attributes: + +```yaml +cluster_vars: + failover_manager: harp +``` + +!!! 
Note + Versions of TPAexec earlier than 21.1 require a slightly different approach: + + ```yaml + cluster_vars: + enable_harp: true + ``` + +After this, install HARP by invoking the `tpaexec` commands +for making cluster modifications: + +```bash +tpaexec provision ${CLUSTER_DIR} +tpaexec deploy ${CLUSTER_DIR} +``` + +No other modifications are necessary apart from cluster-specific +considerations. + + +## Package installation + +Currently CentOS/RHEL packages are provided by the EDB packaging +infrastructure. For details, see the [HARP product +page](https://www.enterprisedb.com/docs/harp/latest/). + +### etcd packages + +Currently `etcd` packages for many popular Linux distributions aren't +available by their standard public repositories. EDB has therefore packaged +`etcd` for RHEL and CentOS versions 7 and 8, Debian, and variants such as +Ubuntu LTS. You need access to our HARP package repository to use +these libraries. + +## Consensus layer + +HARP requires a distributed consensus layer to operate. Currently this must be +either `bdr` or `etcd`. If using fewer than three BDR nodes, you might need to rely on `etcd`. Otherwise any BDR service outage reduces the +consensus layer to a single node and thus prevents node consensus and disables +Postgres routing. + +### etcd + +If you're using `etcd` as the consensus layer, `etcd` must be installed either +directly on the Postgres nodes or in a separate location they can access. + +To set `etcd` as the consensus layer, include this code in the HARP `config.yml` +configuration file: + +```yaml +dcs: + driver: etcd + endpoints: + - host1:2379 + - host2:2379 + - host3:2379 +``` + +When using TPAExec, all configured etcd endpoints are entered here +automatically. + +### BDR + +The `bdr` native consensus layer is available from BDR 3.6.21 and 3.7.3. This +consensus layer model requires no supplementary software when managing routing +for a BDR cluster. + +To ensure quorum is possible in the cluster, always +use more than two nodes so that BDR's consensus layer remains responsive during node +maintenance or outages. + +To set BDR as the consensus layer, include this in the `config.yml` +configuration file: + +```yaml +dcs: + driver: bdr + endpoints: + - host=host1 dbname=bdrdb user=harp_user + - host=host2 dbname=bdrdb user=harp_user + - host=host3 dbname=bdrdb user=harp_user +``` + +The endpoints for a BDR consensus layer follow the +standard Postgres DSN connection format. diff --git a/product_docs/docs/pgd/4/overview/harp/04_configuration.mdx b/product_docs/docs/pgd/4/overview/harp/04_configuration.mdx new file mode 100644 index 00000000000..10ac24fba06 --- /dev/null +++ b/product_docs/docs/pgd/4/overview/harp/04_configuration.mdx @@ -0,0 +1,540 @@ +--- +navTitle: Configuration +title: Configuring HARP for cluster management +--- + +The HARP configuration file follows a standard YAML-style formatting that was simplified for readability. This file is located in the `/etc/harp` +directory by default and is named `config.yml` + +You can explicitly provide the configuration file location to all HARP +executables by using the `-f`/`--config` argument. + +## Standard configuration + +HARP essentially operates as three components: + +* HARP Manager +* HARP Proxy +* harpctl + +Each of these use the same standard `config.yml` configuration format, which always include the following sections: + +* `cluster.name` — The name of the cluster to target for all operations. +* `dcs` — DCS driver and connection configuration for all endpoints. 
+ +This means a standard preamble is always included for HARP +operations, such as the following: + +```yaml +cluster: + name: mycluster + +dcs: + ... +``` + +Other sections are optional or specific to the named HARP +component. + +### Cluster name + +The `name` entry under the `cluster` heading is required for all +interaction with HARP. Each HARP cluster has a name for both disambiguation +and for labeling data in the DCS for the specific cluster. + +HARP Manager writes information about the cluster here for consumption by +HARP Proxy and harpctl. HARP Proxy services direct traffic to nodes in +this cluster. The `harpctl` management tool interacts with this cluster. + +### DCS settings + +Configuring the consensus layer is key to HARP functionality. Without the DCS, +HARP has nowhere to store cluster metadata, can't hold leadership elections, +and so on. Therefore this portion of the configuration is required, and +certain elements are optional. + +Specify all elements under a section named `dcs` with these multiple +supplementary entries: + +- **`driver`**: Required type of consensus layer to use. + Currently can be `etcd` or `bdr`. Support for `bdr` as a consensus layer is + experimental. Using `bdr` as the consensus layer reduces the + additional software for consensus storage but expects a minimum of three + full BDR member nodes to maintain quorum during database maintenance. + +- **`endpoints`**: Required list of connection strings to contact the DCS. + List every node of the DCS here if possible. This ensures HARP + continues to function as long as a majority of the DCS can still + operate and be reached by the network. + + Format when using `etcd` as the consensus layer is as follows: + + ```yaml + dcs: + endpoints: + - host1:2379 + - host2:2379 + - host3:2379 + ``` + Format when using the experimental `bdr` consensus layer is as follows: + + ```yaml + dcs: + # only DSN format is supported + endpoints: + - "host=host1 port=5432 dbname=bdrdb user=postgres" + - "host=host2 port=5432 dbname=bdrdb user=postgres" + - "host=host3 port=5432 dbname=bdrdb user=postgres" + ``` +Currently, `bdr` consensus layer requires the first endpoint to point to the local postgres instance. + +- **`request_timeout`**: Time in milliseconds to consider a request as failed. + If HARP makes a request to the DCS and receives no response in this time, it considers the operation as failed. This can cause the issue + to be logged as an error or retried, depending on the nature of the request. + Default: 250. + +The following DCS SSL settings apply only when ```driver: etcd``` is set in the +configuration file: + +- **`ssl`**: Either `on` or `off` to enable SSL communication with the DCS. + Default: `off` + +- **`ssl_ca_file`**: Client SSL certificate authority (CA) file. + +- **`ssl_cert_file`**: Client SSL certificate file. + +- **`ssl_key_file`**: Client SSL key file. + +#### Example + +This example shows how to configure HARP to contact an etcd DCS +consisting of three nodes: + +```yaml +dcs: + driver: etcd + endpoints: + - host1:2379 + - host2:2379 + - host3:2379 +``` + +### HARP Manager specific + +Besides the generic service options required for all HARP components, Manager +needs other settings: + +- **`log_level`**: One of `DEBUG`, `INFO`, `WARNING`, `ERROR`, or `CRITICAL`, + which might alter the amount of log output from HARP services. + +- **`name`**: Required name of the Postgres node represented by this Manager. 
+ Since Manager can represent only a specific node, that node is named here and + also serves to name this Manager. If this is a BDR node, it must match the + value used at node creation when executing the + `bdr.create_node(node_name, ...)` function and as reported by the + `bdr.local_node_summary.node_name` view column. Alphanumeric characters + and underscores only. + +- **`start_command`**: This can be used instead of the information in DCS for + starting the database to monitor. This is required if using bdr as the + consensus layer. + +- **`status_command`**: This can be used instead of the information in DCS for + the Harp Manager to determine whether the database is running. This is + required if using bdr as the consensus layer. + +- **`stop_command`**: This can be used instead of the information in DCS for + stopping the database. + +- **`db_retry_wait_min`**: The initial time in seconds to wait if Harp Manager cannot + connect to the database before trying again. Harp Manager will increase the + wait time with each attempt, up to the `db_retry_wait_max` value. + +- **`db_retry_wait_max`**: The maximum time in seconds to wait if Harp Manager cannot + connect to the database before trying again. + + +Thus a complete configuration example for HARP Manager might look like this: + +```yaml +cluster: + name: mycluster + +dcs: + driver: etcd + endpoints: + - host1:2379 + - host2:2379 + - host3:2379 + +manager: + name: node1 + log_level: INFO +``` + +This configuration is essentially the DCS contact information, any associated +service customizations, the name of the cluster, and the name of the +node. All other settings are associated with the node and is stored +in the DCS. + +Read the [Node bootstrapping](05_bootstrapping) for more about +specific node settings and initializing nodes to be managed by HARP Manager. + +### HARP Proxy specific + +Some configuration options are specific to HARP Proxy. These affect how the +daemon operates and thus are currently located in `config.yml`. + +Specify Proxy-based settings under a `proxy` heading, and include: + +- **`location`**: Required name of location for HARP Proxy to represent. + HARP Proxy nodes are directly tied to the location where they are running, as + they always direct traffic to the current lead master node. Specify location + for any defined proxy. + +- **`log_level`**: One of `DEBUG`, `INFO`, `WARNING`, `ERROR`, or `CRITICAL`, + which might alter the amount of log output from HARP services. + + * Default: `INFO` + +- **`name`**: Name of this specific proxy. + Each proxy node is named to ensure any associated statistics or operating + state are available in status checks and other interactive events. + +- **`type`**: Specifies whether to use pgbouncer or the experimental built-in passthrough proxy. All proxies must use the same proxy type. We recommend to experimenting with only the simple proxy in combination with the experimental BDR DCS. + Can be `pgbouncer` or `builtin`. + + * Default: `pgbouncer` + +- **`pgbouncer_bin_dir`**: Directory where PgBouncer binaries are located. + As HARP uses PgBouncer binaries, it needs to know where they are + located. This can be depend on the platform or distribution, so it has no + default. Otherwise, the assumption is that the appropriate binaries are in the + environment's `PATH` variable. + +#### Example + +HARP Proxy requires the cluster name, DCS connection settings, location, and +name of the proxy in operation. 
For example: + +```yaml +cluster: + name: mycluster + +dcs: + driver: etcd + endpoints: + - host1:2379 + - host2:2379 + - host3:2379 + +proxy: + name: proxy1 + location: dc1 + pgbouncer_bin_dir: /usr/sbin +``` + +All other attributes are obtained from the DCS on proxy startup. + +## Runtime directives + +While it is possible to configure HARP Manager, HARP Proxy, or harpctl with a +minimum of YAML in the `config.yml` file, some customizations are held in +the DCS. These values must either be initialized via bootstrap or set +specifically with `harpctl set` directives. + +### Cluster-wide + +Set these settings under a `cluster` YAML heading during bootstrap, or +modify them with a `harpctl set cluster` command. + +- **`event_sync_interval`**: Time in milliseconds to wait for synchronization. + When events occur in HARP, they do so asynchronously across the cluster. + HARP managers start operating immediately when they detect metadata changes, + and HARP proxies might pause traffic and start reconfiguring endpoints. This is + a safety interval that roughly approximates the maximum amount of + event time skew that exists between all HARP components. + + For example, suppose Node A goes offline and HARP Manager on Node B commonly + receives this event 5 milliseconds before Node C. A setting of at least 5 ms + is then needed to ensure all HARP Manager services receive the + event before they begin to process it. + + This also applies to HARP Proxy. + +### Node directives + +You can change most node-oriented settings and then apply them while HARP +Manager is active. These items are retained in the DCS after initial bootstrap, +and thus you can modify them without altering a configuration file. + +Set these settings under a `node` YAML heading during bootstrap, or +modify them with a `harpctl set node` command. + +- **`node_type`**: The type of this database node, either `bdr` or `witness`. You can't promote a + witness node to leader. + +- **`camo_enforcement`**: Whether to strictly enforce CAMO queue state. + When set to `strict`, HARP never allows switchover or failover to a BDR + CAMO partner node unless it's fully caught up with the entire CAMO queue at + the time of the migration. When set to `lag_only`, only standard lag + thresholds such as `maximum_camo_lag` are applied. + +- **`dcs_reconnect_interval`**: The interval, measured in milliseconds, between attempts that a disconnected node tries to reconnect to the DCS. + + * Default: 1000. + +- **`dsn`**: Required full connection string to the managed Postgres node. + This parameter applies equally to all HARP services and enables + micro-architectures that run only one service per container. + + !!! Note + HARP sets the `sslmode` argument to `require` by default and prevents + connections to servers that don't require SSL. To disable this behavior, + explicitly set this parameter to a more permissive value such as + `disable`, `allow`, or `prefer`. + +- **`db_data_dir`**: Required Postgres data directory. + This is required by HARP Manager to start, stop, or reload the Postgres + service. It's also the default location for configuration files, which you can use + later for controlling promotion of streaming replicas. + +- **`db_conf_dir`**: Location of Postgres configuration files. + Some platforms prefer storing Postgres configuration files away from the + Postgres data directory. In these cases, set this option to that + expected location. + +- **`db_log_file`**: Location of Postgres log file. 
+ + * Default: `/tmp/pg_ctl.out` + +- **`fence_node_on_dcs_failure`**: If HARP can't reach the DCS, several readiness keys and the leadership lease expire. This implicitly prevents a node from routing consideration. However, such a node isn't officially fenced, and the Manager doesn't stop monitoring the database if `stop_database_when_fenced` is set to `false`. + + * Default: False + +- **`leader_lease_duration`**: Amount of time in seconds the lead master + lease persists if not refreshed. This allows any HARP Manager a certain + grace period to refresh the lock, before expiration allows another node to + obtain the lead master lock instead. + + * Default: 6 + +- **`lease_refresh_interval`**: Amount of time in milliseconds between + refreshes of the lead master lease. This essentially controls the time + between each series of checks HARP Manager performs against its assigned + Postgres node and when the status of the node is updated in the consensus + layer. + + * Default: 2000 +- **`max_dcs_failures`**: The amount of DCS request failures before marking a node as fenced according to `fence_node_on_dcs_failure`. This setting prevents transient communication disruptions from shutting down database nodes. + + * Default: 10 + +- **`maximum_lag`**: Highest allowable variance (in bytes) between last + recorded LSN of previous lead master and this node before being allowed to + take the lead master lock. This setting prevents nodes experiencing terminal amounts + of lag from taking the lead master lock. Set to `-1` to disable this check. + + * Default: -1 + +- **`maximum_camo_lag`**: Highest allowable variance (in bytes) between last + received LSN and applied LSN between this node and its CAMO partners. + This applies only to clusters where CAMO is both available and enabled. + Thus this applies only to BDR EE clusters where `pg2q.enable_camo` is set. + For clusters with particularly stringent CAMO apply queue restrictions, set + this very low or even to `0` to avoid any unapplied CAMO transactions. Set to + `-1` to disable this check. + + * Default: -1 + +- **`ready_status_duration`**: Amount of time in seconds the node's readiness + status persists if not refreshed. This is a failsafe that removes a + node from being contacted by HARP Proxy if the HARP Manager in charge of it + stops operating. + + * Default: 30 + +- **`db_bin_dir`**: Directory where Postgres binaries are located. + As HARP uses Postgres binaries, such as `pg_ctl`, it needs to know where + they're located. This can depend on the platform or distribution, so it has no + default. Otherwise, the assumption is that the appropriate binaries are in the + environment's `PATH` variable. + +- **`priority`**: Any numeric value. + Any node where this option is set to `-1` can't take the lead master role, even when attempting to explicitly set the lead master using `harpctl`. + + * Default: 100 + +- **`stop_database_when_fenced`**: Rather than removing a node from all possible routing, stop the database on a node when it is fenced. This is an extra safeguard to prevent data from other sources than HARP Proxy from reaching the database or in case proxies can't disconnect clients for some other reason. + + * Default: False + +- **`consensus_timeout`**: Amount of milliseconds before aborting a read or + write to the consensus layer. If the consensus layer loses + quorum or becomes unreachable, you want near-instant errors rather than + infinite timeouts. This prevents blocking behavior in such cases. 
+
+All of these runtime directives can be modified via `harpctl`. For example, to
+decrease the `lease_refresh_interval` to 100 ms on `node1`:
+
+```bash
+harpctl set node node1 lease_refresh_interval=100
+```
+
+### Proxy directives
+
+You can change certain proxy settings while the service is active. These
+items are retained in the DCS after initial bootstrap, and thus you can modify them
+without altering a configuration file. Many of these settings are direct
+mappings to their PgBouncer equivalents, and these are noted where relevant.
+
+Set these settings under a `proxies` YAML heading during bootstrap, or
+modify them with a `harpctl set proxy` command.
+Properties set by `harpctl set proxy` require a restart of the proxy.
+
+- **`auth_file`**: The full path to a PgBouncer-style `userlist.txt` file.
+  HARP Proxy uses this file to store a `pgbouncer` user that has
+  access to PgBouncer's Admin database. You can use this for other users
+  as well. Proxy modifies this file to add and modify the password for the
+  `pgbouncer` user.
+
+  * Default: `/etc/harp/userlist.txt`
+
+- **`auth_type`**: The type of Postgres authentication to use for password
+  matching. This is actually a PgBouncer setting and isn't fully compatible
+  with the Postgres `pg_hba.conf` capabilities. We recommend using `md5`, `pam`,
+  `cert`, or `scram-sha-256`.
+
+  * Default: `md5`
+
+- **`auth_query`**: Query to verify a user's password with Postgres.
+  Direct access to `pg_shadow` requires admin rights. It's better to use a
+  non-superuser that calls a `SECURITY DEFINER` function instead. If using
+  TPAexec to create a cluster, a function named `pgbouncer_get_auth` is
+  installed on all databases in the `pg_catalog` namespace to fulfill this
+  purpose.
+
+- **`auth_user`**: If `auth_user` is set, then any user not specified in
+  `auth_file` is queried through the `auth_query` query from `pg_shadow`
+  in the database, using `auth_user`. The password of `auth_user` is
+  taken from `auth_file`.
+
+- **`client_tls_ca_file`**: Root certificate file to validate client
+  certificates. Requires `client_tls_sslmode` to be set.
+
+- **`client_tls_cert_file`**: Certificate for private key. Clients can
+  validate it. Requires `client_tls_sslmode` to be set.
+
+- **`client_tls_key_file`**: Private key for PgBouncer to accept client
+  connections. Requires `client_tls_sslmode` to be set.
+
+- **`client_tls_protocols`**: TLS protocol versions allowed for
+  client connections.
+  Allowed values: `tlsv1.0`, `tlsv1.1`, `tlsv1.2`, `tlsv1.3`.
+  Shortcuts: `all` (tlsv1.0,tlsv1.1,tlsv1.2,tlsv1.3),
+  `secure` (tlsv1.2,tlsv1.3), `legacy` (all).
+
+  * Default: `secure`
+
+- **`client_tls_sslmode`**: Whether to enable client SSL functionality.
+  Possible values are `disable`, `allow`, `prefer`, `require`, `verify-ca`, and `verify-full`.
+
+  * Default: `disable`
+
+- **`database_name`**: Required name that represents the database clients
+  use when connecting to HARP Proxy. This is a stable endpoint that doesn't
+  change and points to the current node, database name, port, etc.,
+  necessary to connect to the lead master. You can use the global value `*`
+  here so all connections get directed to this target regardless of database
+  name.
+
+- **`default_pool_size`**: The maximum number of active connections to allow
+  per database/user combination. This is for connection pooling purposes
+  but does nothing in session pooling mode. This is a PgBouncer setting.
+
+  * Default: 25
+
+- **`ignore_startup_parameters`**: By default, PgBouncer allows only
+  parameters it can keep track of in startup packets: `client_encoding`,
+  `datestyle`, `timezone`, and `standard_conforming_strings`. All other
+  parameters raise an error. To allow other parameters, you can specify them
+  here so that PgBouncer knows that they are handled by the admin and it can
+  ignore them. Often, you need to set this to
+  `extra_float_digits` for Java applications to function properly.
+
+  * Default: `extra_float_digits`
+
+- **`listen_address`**: IP addresses where Proxy should listen for
+  connections. Used by PgBouncer and the built-in proxy.
+
+  * Default: 0.0.0.0
+
+- **`listen_port`**: System port where Proxy listens for connections.
+  Used by PgBouncer and the built-in proxy.
+
+  * Default: 6432
+
+- **`max_client_conn`**: The total maximum number of active client
+  connections that are allowed on the proxy. This can be many orders of
+  magnitude greater than `default_pool_size`, as these are all connections that
+  have yet to be assigned a session or have released a session for use by
+  another client connection. This is a PgBouncer setting.
+
+  * Default: 100
+
+- **`monitor_interval`**: Time in seconds between Proxy checks of PgBouncer.
+  Since HARP Proxy manages PgBouncer as the actual connection management
+  layer, it needs to periodically check various status and stats to verify
+  it's still operational. You can also log or register some of this information to the DCS.
+
+  * Default: 5
+
+- **`server_tls_protocols`**: TLS protocol versions allowed for
+  server connections.
+  Allowed values: `tlsv1.0`, `tlsv1.1`, `tlsv1.2`, `tlsv1.3`.
+  Shortcuts: `all` (tlsv1.0,tlsv1.1,tlsv1.2,tlsv1.3),
+  `secure` (tlsv1.2,tlsv1.3), `legacy` (all).
+
+  * Default: `secure`
+
+- **`server_tls_sslmode`**: Whether to enable server SSL functionality.
+  Possible values are `disable`, `allow`, `prefer`, `require`, `verify-ca`, and `verify-full`.
+
+  * Default: `disable`
+
+- **`session_transfer_mode`**: Method by which to transfer sessions.
+  Possible values are `fast`, `wait`, and `reconnect`.
+
+  * Default: `wait`
+
+  This property isn't used by the built-in proxy.
+
+- **`server_transfer_timeout`**: The number of seconds HARP Proxy waits before giving up on a PAUSE and issuing a KILL command.
+
+  * Default: 30
+
+The following two options apply only when using the built-in proxy.
+
+- **`keepalive`**: The number of seconds the built-in proxy waits before sending a keepalive message to an idle leader connection.
+
+  * Default: 5
+
+- **`timeout`**: The number of seconds the built-in proxy waits before giving up on connecting to the leader.
+
+  * Default: 1
+
+When using `harpctl` to change any of these settings for all proxies, use the
+`global` keyword in place of the proxy name. For example:
+
+```bash
+harpctl set proxy global max_client_conn=1000
+```
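+
+To target a single proxy rather than all of them, use the proxy name in place
+of `global`. The proxy name and value below are illustrative only; as noted
+above, properties set with `harpctl set proxy` require a restart of the
+affected proxy.
+
+```bash
+harpctl set proxy proxy1 default_pool_size=50
+```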
diff --git a/product_docs/docs/pgd/4/overview/harp/05_bootstrapping.mdx b/product_docs/docs/pgd/4/overview/harp/05_bootstrapping.mdx
new file mode 100644
index 00000000000..55d78e8dac4
--- /dev/null
+++ b/product_docs/docs/pgd/4/overview/harp/05_bootstrapping.mdx
@@ -0,0 +1,194 @@
+---
+navTitle: Bootstrapping
+title: Cluster bootstrapping
+---
+
+To use HARP, a minimum amount of metadata must exist in the DCS. The
+process of "bootstrapping" a cluster essentially means initializing node,
+location, and other runtime configuration either all at once or on a
+per-resource basis.
+
+This process is governed through the `harpctl apply` command. For more
+information, see [harpctl command-line tool](08_harpctl).
+
+Set up the DCS and make sure it's functional before bootstrapping.
+
+!!! Important
+    You can combine any or all of
+    these examples into a single YAML document and apply it all at once.
+
+## Cluster-wide bootstrapping
+
+Some settings are applied cluster-wide and you can specify them during
+bootstrapping. Currently this applies only to the `event_sync_interval`
+runtime directive, but others might be added later.
+
+The format is as follows:
+
+```yaml
+cluster:
+  name: mycluster
+  event_sync_interval: 100
+```
+
+Assuming that file was named `cluster.yml`, you then apply it with the
+following:
+
+```bash
+harpctl apply cluster.yml
+```
+
+If the cluster name isn't already defined in the DCS, this also
+initializes that value.
+
+!!! Important
+    The cluster name parameter specified here always overrides the cluster
+    name supplied in `config.yml`. The assumption is that the bootstrap file
+    supplies all necessary elements to bootstrap a cluster or some portion of
+    its larger configuration. A `config.yml` file is primarily meant to control
+    the execution of HARP Manager, HARP Proxy, or `harpctl` specifically.
+
+## Location bootstrapping
+
+Every HARP node is associated with at most one location. This location can be
+a single data center, a grouped region consisting of multiple underlying
+servers, an Amazon availability zone, and so on. This is a logical
+structure that allows HARP to group nodes together such that only one
+represents the nodes in that location as the lead master.
+
+Thus it's necessary to initialize one or more locations. The format for this
+is as follows:
+
+```yaml
+cluster:
+  name: mycluster
+
+locations:
+  - location: dc1
+  - location: dc2
+```
+
+Assuming that file was named `locations.yml`, you then apply it with the
+following:
+
+```bash
+harpctl apply locations.yml
+```
+
+When performing any manipulation of the cluster, include the cluster name as a preamble so the changes are directed to the right cluster.
+
+Once locations are bootstrapped, they show up with a quick examination:
+
+```bash
+> harpctl get locations
+
+Cluster Location Leader Previous Leader Target Leader Lease Renewals
+------- -------- ------ --------------- ------------- --------------
+mycluster dc1