First pass reorg to Application Usage
Signed-off-by: Dj Walker-Morgan <[email protected]>
djw-m committed Oct 2, 2023
1 parent 5c68c06 commit 7ec801c
Showing 9 changed files with 425 additions and 349 deletions.
349 changes: 0 additions & 349 deletions product_docs/docs/pgd/5/appusage.mdx

This file was deleted.

140 changes: 140 additions & 0 deletions product_docs/docs/pgd/5/appusage/behavior.mdx
@@ -0,0 +1,140 @@
---
title: Application behavior
navTitle: Behavior
---

Much of PGD's replication behavior is transparent to applications. Understanding how it
achieves that and what elements aren't transparent is important to successfully developing
an application that works well with PGD.

### Replication behavior

PGD supports replicating changes made on one node to other nodes.

PGD, by default, replicates all changes from INSERT, UPDATE, DELETE, and TRUNCATE
operations from the source node to other nodes. Only the final changes are sent,
after all triggers and rules are processed. For example, `INSERT ... ON CONFLICT
... DO UPDATE` sends either an insert or an update, depending on what occurred on
the origin. If an update or delete affects zero rows, then no changes are sent.
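
As a minimal sketch of the upsert case, assuming a hypothetical `counter` table
whose primary key is `name`, the statement below produces exactly one row change
on the origin, and that single insert or update is what the other nodes receive:

```sql
-- Hypothetical table "counter" (name text PRIMARY KEY, hits bigint).
-- If no row for 'home' exists, the origin performs an insert;
-- otherwise it performs an update. Only that outcome is replicated.
INSERT INTO counter (name, hits)
VALUES ('home', 1)
ON CONFLICT (name) DO UPDATE
    SET hits = counter.hits + 1;
```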

You can replicate INSERT without any preconditions.

For updates and deletes to replicate on other nodes, PGD must be able to
identify the unique rows affected. PGD requires that a table have either a
PRIMARY KEY defined, a UNIQUE constraint, or an explicit REPLICA IDENTITY
defined on specific columns. If one of those isn't defined, a warning is
generated, and later updates or deletes are explicitly blocked. If REPLICA
IDENTITY FULL is defined for a table, then a unique index isn't required. In
that case, updates and deletes are allowed and use the first non-unique index
that's live, valid, not deferred, and doesn't have expressions or WHERE clauses.
Otherwise, a sequential scan is used.
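
The three options can be sketched in plain PostgreSQL DDL. All table, column,
and index names here are hypothetical and serve only as an illustration:

```sql
-- 1. A primary key is the simplest way to identify rows.
CREATE TABLE account (
    account_id bigint PRIMARY KEY,
    balance    numeric NOT NULL
);

-- 2. An explicit replica identity on a unique index.
--    PostgreSQL requires the index to be unique, non-partial,
--    non-deferrable, and built only on NOT NULL columns.
CREATE TABLE device (
    serial_no text NOT NULL,
    label     text
);
CREATE UNIQUE INDEX device_serial_no_idx ON device (serial_no);
ALTER TABLE device REPLICA IDENTITY USING INDEX device_serial_no_idx;

-- 3. No unique index: the whole old row identifies the row.
CREATE TABLE event_log (
    logged_at timestamptz NOT NULL,
    message   text
);
ALTER TABLE event_log REPLICA IDENTITY FULL;
```

The third form trades lookup speed for flexibility, since apply might fall back
to a sequential scan.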

### Truncate

You can use TRUNCATE even without a defined replication identity. Replication of
TRUNCATE commands is supported, but take care when truncating groups of tables
connected by foreign keys. When replicating a truncate action, the subscriber
truncates the same group of tables that was truncated on the origin, either
explicitly specified or implicitly collected by CASCADE, except in cases where
replication sets are defined. See [Replication sets](../repsets) for further
details and examples. This works correctly if all affected tables are part of
the same subscription. But if some tables to truncate on the subscriber have
foreign-key links to tables that aren't part of the same (or any) replication
set, then applying the truncate action on the subscriber fails.
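
A short sketch of truncating a group of tables connected by a foreign key,
using hypothetical `orders` and `order_lines` tables. It assumes both tables
are replicated together:

```sql
-- Hypothetical parent/child tables linked by a foreign key.
CREATE TABLE orders (
    order_id bigint PRIMARY KEY
);
CREATE TABLE order_lines (
    line_id  bigint PRIMARY KEY,
    order_id bigint NOT NULL REFERENCES orders (order_id)
);

-- Truncating the parent alone is rejected by PostgreSQL because
-- order_lines references it:
-- TRUNCATE orders;

-- Either name every table in the group explicitly...
TRUNCATE orders, order_lines;

-- ...or let CASCADE collect the referencing tables implicitly.
TRUNCATE orders CASCADE;
```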

### Row-level locks

Row-level locks taken implicitly by INSERT, UPDATE, and DELETE commands are
replicated as the changes are made. Table-level locks taken implicitly by
INSERT, UPDATE, DELETE, and TRUNCATE commands are also replicated. Explicit
row-level locking (`SELECT ... FOR UPDATE/FOR SHARE`) by user sessions isn't
replicated, nor are advisory locks. Information stored by transactions running
in SERIALIZABLE mode isn't replicated to other nodes. The transaction isolation
level of SERIALIZABLE is supported, but transactions aren't serialized across
nodes in the presence of concurrent transactions on multiple nodes.
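
To illustrate which locks stay local, here is a sketch that assumes a
hypothetical `account` table keyed by `account_id`. The explicit row lock and
the advisory lock protect the row on this node only, so a session on another
node can still change the same row concurrently:

```sql
BEGIN;

-- Explicit row lock: held on this node only, not replicated.
SELECT balance FROM account WHERE account_id = 42 FOR UPDATE;

-- Advisory lock: also local to this node.
SELECT pg_advisory_xact_lock(42);

-- The row change made by this statement is what gets replicated.
UPDATE account SET balance = balance - 10 WHERE account_id = 42;

COMMIT;
```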

If DML is executed on multiple nodes concurrently, then potential conflicts
might occur if executing with asynchronous replication. You must either handle
these or avoid them. Various avoidance mechanisms are possible, discussed in
[Conflicts](../consistency/conflicts).

### Sequences

Sequences need special handling, described in [Sequences](../sequences). This is
because in a cluster, sequences must be global to avoid nodes creating
conflicting values. Global sequences are available with global locking to ensure
integrity.

### Binary objects

Binary data in BYTEA columns is replicated normally, allowing "blobs" of data up
to 1 GB. Use of the PostgreSQL "large object" facility isn't supported in PGD.
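
A minimal sketch of storing binary data in a BYTEA column instead of using the
large object facility. The `document` table is hypothetical:

```sql
-- Binary payloads live in an ordinary column, so they replicate
-- like any other row data.
CREATE TABLE document (
    document_id bigint PRIMARY KEY,
    content     bytea NOT NULL
);

INSERT INTO document (document_id, content)
VALUES (1, decode('48656c6c6f', 'hex'));  -- the bytes of 'Hello'
```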

### Rules

Rules execute only on the origin node, so they aren't executed during apply,
even if they're enabled for replicas.

### Base tables only

Replication is possible only from base tables to base tables. That is, the
tables on the source and target on the subscription side must be tables, not
views, materialized views, or foreign tables. Attempts to replicate tables other
than base tables result in an error. DML changes that are made through updatable
views are resolved to base tables on the origin and then applied to the same
base table name on the target.

### Partitioned tables

PGD supports partitioned tables transparently, meaning that you can add a
partitioned table to a replication set and changes that involve any of the
partitions are replicated downstream.

### Triggers

By default, triggers execute only on the origin node. For example, an INSERT
trigger executes on the origin node and is ignored when you apply the change on
the target node. You can specify for triggers to execute on both the origin node
at execution time and on the target when it's replicated ("apply time") by using
`ALTER TABLE ... ENABLE ALWAYS TRIGGER`. Or, use the `REPLICA` option to execute
only at apply time: `ALTER TABLE ... ENABLE REPLICA TRIGGER`.
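
The following sketch shows both options on a row-level trigger. It assumes an
existing table `account` with an `account_id` column. The audit table, function,
and trigger names are hypothetical:

```sql
CREATE TABLE account_audit (
    account_id bigint,
    changed_at timestamptz DEFAULT now()
);

-- Row-level trigger function; objects are schema-qualified on purpose.
CREATE FUNCTION audit_account() RETURNS trigger
LANGUAGE plpgsql AS $$
BEGIN
    INSERT INTO public.account_audit (account_id) VALUES (NEW.account_id);
    RETURN NEW;
END;
$$;

CREATE TRIGGER account_audit_trg
    AFTER INSERT OR UPDATE ON account
    FOR EACH ROW EXECUTE FUNCTION audit_account();

-- Fire on the origin and again when the change is applied on other nodes.
ALTER TABLE account ENABLE ALWAYS TRIGGER account_audit_trg;

-- Or fire only at apply time.
ALTER TABLE account ENABLE REPLICA TRIGGER account_audit_trg;
```

Choose one of the two `ALTER TABLE` forms rather than running both. If the
table the trigger writes to is itself replicated, firing the trigger at apply
time as well might produce duplicate rows.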

Some types of trigger aren't executed on apply, even if they exist on a
table and are currently enabled. Trigger types not executed are:

- Statement-level triggers (`FOR EACH STATEMENT`)
- Per-column UPDATE triggers (`UPDATE OF column_name [, ...]`)

PGD replication apply uses the system-level default search_path. Replica
triggers, stream triggers, and index expression functions can assume other
search_path settings that then fail when they execute on apply. To prevent this
from occurring, use any of these techniques:

- Resolve object references clearly using only the default search_path.
- Always use fully qualified references to objects, for example, `schema.objectname`.
- Set the search path for a function using `ALTER FUNCTION ... SET search_path
= ...` for the functions affected.
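
For example, the last technique might look like the following sketch. The
schema `app` and the function `app.audit_account()` are hypothetical:

```sql
-- Pin the function's search_path so it resolves objects the same way
-- at apply time as it does in a user session.
ALTER FUNCTION app.audit_account() SET search_path = app, pg_catalog;
```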

PGD assumes that there are no issues related to text or other collatable
datatypes, i.e., all collations in use are available on all nodes, and the
default collation is the same on all nodes. Replicating changes uses equality
searches to locate Replica Identity values, so this doesn't have any effect
except where unique indexes are explicitly defined with nonmatching collation
qualifiers. Row filters might be affected by differences in collations if
collatable expressions were used.

### TOAST

PGD handling of very long "toasted" data in PostgreSQL is transparent to the
user. The TOAST "chunkid" values likely differ between the same row on different
nodes, but that doesn't cause any problems.

### Other restrictions

PGD can't work correctly if Replica Identity columns are marked as external.

PostgreSQL allows CHECK() constraints that contain volatile functions. Since PGD
re-executes CHECK() constraints on apply, any subsequent re-execution that
doesn't return the same result as before causes data divergence.
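
As an illustration only, compare a constraint that depends on a volatile
function with one that depends only on the row's own values. Both tables are
hypothetical:

```sql
-- Risky: clock_timestamp() is volatile, so re-evaluating the check at
-- apply time can give a different answer than it did on the origin.
CREATE TABLE booking_risky (
    booking_id bigint PRIMARY KEY,
    starts_at  timestamptz NOT NULL,
    CHECK (starts_at > clock_timestamp())
);

-- Safer: the result depends only on the row being checked,
-- so it's the same on every node.
CREATE TABLE booking (
    booking_id bigint PRIMARY KEY,
    starts_at  timestamptz NOT NULL,
    ends_at    timestamptz NOT NULL,
    CHECK (ends_at > starts_at)
);
```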

PGD doesn't restrict the use of foreign keys. Cascading FKs are allowed.
21 changes: 21 additions & 0 deletions product_docs/docs/pgd/5/appusage/dml-ddl.mdx
@@ -0,0 +1,21 @@
---
title: DML and DDL replication
navTitle: DML and DDL
---

PGD doesn't replicate the DML statement. It replicates the changes caused by the
DML statement. For example, an UPDATE that changed two rows replicates two
changes, whereas a DELETE that didn't remove any rows doesn't replicate
anything. This means that the results of executing volatile statements are
replicated, ensuring there's no divergence between nodes as might occur with
statement-based replication.
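
A small sketch of what this means in practice, assuming a hypothetical
`session_token` table in which rows 1 and 2 exist and row -1 doesn't:

```sql
-- random() is volatile: each node would compute different values if the
-- statement itself were re-executed there.
UPDATE session_token
SET    token = md5(random()::text)
WHERE  session_id IN (1, 2);
-- Two rows changed on the origin, so two row changes are replicated,
-- each carrying the token value the origin actually stored.

DELETE FROM session_token WHERE session_id = -1;
-- No rows matched, so nothing is replicated.
```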

DDL replication works differently to DML. For DDL, PGD replicates the statement,
which then executes on all nodes. So a `DROP TABLE IF EXISTS` might not
replicate anything on the local node, but the statement is still sent to other
nodes for execution if DDL replication is enabled. Full details are covered in
[DDL replication](../ddl).

PGD works to ensure that intermixed DML and DDL statements work correctly, even
in the same transaction.

35 changes: 35 additions & 0 deletions product_docs/docs/pgd/5/appusage/index.mdx
@@ -0,0 +1,35 @@
---
title: Application use
redirects:
- ../bdr/appusage
navigation:
- behavior
- nonreplicated
- dml-ddl
- nodes-with-differences
- rules
- timing
- table-access-methods
---

Developing an application with PGD is mostly the same as working with any PostgreSQL database. The difference is that you need to be aware of how your application interacts with replication. This section details how PGD behaves with applications, what SQL is and isn't replicated, how PGD handles nodes that differ from one another, and other important information for application developers.

* [Application behavior](behavior) looks at how PGD replication appears to an application. Which commands are replicated, which run locally, when row-level locks are acquired, how and where triggers fire, large objects, toast, and more are covered.

* [DML and DDL](dml-ddl) explains the differences between the two classes of SQL statement and how PGD handles them.

* [Nodes with differences](nodes-with-differences) examines how PGD works with configurations where replicated nodes have differing table structures and schemas. It also covers how to compare such nodes with LiveCompare and how to handle nodes running different PostgreSQL versions.

* [Application rules](rules) offers some general rules for applications to avoid data anomalies.

* [Timing considerations](timing) shows how asynchronous and synchronous replication might affect an application's view of data and notes functions that mitigate stale reads.

* [Table access methods](table-access-methods) (TAMs) notes which TAMs are available with PGD and how to enable them.








122 changes: 122 additions & 0 deletions product_docs/docs/pgd/5/appusage/nodes-with-differences.mdx
@@ -0,0 +1,122 @@
---
title: Nodes with differences
navTitle: Nodes with differences
---

## Replicating between nodes with differences

By default, DDL is automatically sent to all nodes. You can control this
manually, as described in [DDL replication](../ddl), and you can use it to create
differences between database schemas across nodes. PGD is designed to allow
replication to continue even with minor differences between nodes. These
features are designed to allow application schema migration without downtime or
to allow logical standby nodes for reporting or testing.

Currently, replication requires the same table name on all nodes. A future
feature might allow a mapping between different table names.

It's possible to replicate between tables with dissimilar partitioning
definitions, such as a source that's a normal table replicating to a partitioned
table, including support for updates that change partitions on the target. It
can be faster if the partitioning definition is the same on the source and
target since dynamic partition routing doesn't need to execute at apply time.
For details, see [Replication sets](../repsets).

By default, all columns are replicated.

PGD replicates data columns based on the column name. If a column has the same
name but a different datatype, PGD attempts to cast from the source type to the
target type, if casts were defined that allow that.

PGD supports replicating between tables that have a different number of columns.

If the target is missing columns that are present in the source, then PGD raises a
`target_column_missing` conflict, for which the default conflict resolver is
`ignore_if_null`. This throws an error if a non-NULL value arrives.
Alternatively, you can also configure a node with a conflict resolver of
`ignore`. This setting doesn't throw an error but silently ignores any
additional columns.

If the target has additional columns not seen in the source record, then PGD
raises a `source_column_missing` conflict, for which the default conflict
resolver is `use_default_value`. Replication proceeds if the additional columns
have a default, either NULL (if nullable) or a default expression. It throws an
error and halts replication if not.

Transform triggers can also be used on tables to provide default values or alter
the incoming data in various ways before apply.

If the source and the target have different constraints, then replication is
attempted, but it might fail if the rows from source can't be applied to the
target. Row filters can help here.

Replicating data from one schema to a more relaxed schema won't cause failures.
Replicating data from a schema to a more restrictive schema can be a source of
potential failures. The right way to solve this is to place a constraint on the
more relaxed side, so bad data can't be entered. That way, no bad data ever
arrives by replication, so it never fails the transform into the more
restrictive schema. For example, if one schema has a column of type TEXT and
another schema defines the same column as XML, add a CHECK constraint onto the
TEXT column to enforce that the text is XML.
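
A sketch of such a constraint, run on the side that stores the column as TEXT.
The table `message` and column `payload` are hypothetical, and
`xml_is_well_formed_document()` is available only in PostgreSQL builds with
libxml support:

```sql
-- Reject text that the stricter XML column on the other side
-- wouldn't accept as a well-formed document.
ALTER TABLE message
    ADD CONSTRAINT message_payload_is_xml
    CHECK (xml_is_well_formed_document(payload));
```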

You can define a table with different indexes on each node. By default, the
index definitions are replicated. See [DDL replication](../ddl) to specify how
to create an index on only a subset of nodes or just locally.

Storage parameters, such as `fillfactor` and `toast_tuple_target`, can differ
between nodes for a table without problems. An exception to that is that the
value of a table's storage parameter `user_catalog_table` must be identical on
all nodes.

A table being replicated must be owned by the same user/role on each node. See
[Security and roles](../security) for further discussion.

Roles can have different passwords for connection on each node, although by
default changes to roles are replicated to each node. See [DDL
replication](../ddl) to specify how to alter a role password on only a subset of
nodes or locally.

## Comparison between nodes with differences

LiveCompare is a tool for data comparison on a database, against PGD and non-PGD
nodes. It needs a minimum of two connections to compare against and reach a
final result.

Since LiveCompare 1.3, you can configure it with `all_bdr_nodes` set. This setting
saves you from specifying the DSN of each separate node in the cluster. An EDB
Postgres Distributed cluster has N nodes, each with its own connection
information, but LiveCompare 1.3+ needs only the initial and output connections
to complete its job. Setting `logical_replication_mode` states how all the nodes
are communicating.

All the configuration is done in a `.ini` file named `bdrLC.ini`, for example.
Find templates for this configuration file in `/etc/2ndq-livecompare/`.

While LiveCompare executes, you see N+1 progress bars, N being the number of
processes. Once all the tables are sourced, an elapsed time displays alongside
the measured transactions per second (tps). The timer keeps counting, giving you
an estimate and then a total execution time at the end.

This tool offers a lot of customization and filters, such as tables, schemas,
and replication_sets. You can stop and restart LiveCompare without losing context
information, so it can run at convenient times. After the comparison, a summary
and a DML script are generated so you can review it. Apply the DML to fix any
differences found.

## Replicating between different release levels

The other difference between nodes that you might encounter is where there are
different major versions of PostgreSQL on the nodes. PGD is designed to
replicate between different major release versions. This feature is designed to
allow major version upgrades without downtime.

PGD is also designed to replicate between nodes that have different versions of
PGD software. This feature is designed to allow version upgrades and maintenance
without downtime.

However, while it's possible to join a node with a major version in a cluster,
you can't add a node with a minor version if the cluster uses a newer protocol
version. Doing so returns an error.

Both of these features might be affected by specific restrictions. See [Release
notes](../rel_notes/) for any known incompatibilities.
28 changes: 28 additions & 0 deletions product_docs/docs/pgd/5/appusage/nonreplicated.mdx
@@ -0,0 +1,28 @@
---
title: Nonreplicated statements
navTitle: Nonreplicated
---

None of the following user commands are replicated by PGD, so their effects
occur on the local/origin node only:

- Cursor operations (DECLARE, CLOSE, FETCH)
- Execution commands (DO, CALL, PREPARE, EXECUTE, EXPLAIN)
- Session management (DEALLOCATE, DISCARD, LOAD)
- Parameter commands (SET, SHOW)
- Constraint manipulation (SET CONSTRAINTS)
- Locking commands (LOCK)
- Table maintenance commands (VACUUM, ANALYZE, CLUSTER, REINDEX)
- Async operations (NOTIFY, LISTEN, UNLISTEN)

Since the `NOTIFY` SQL command and the `pg_notify()` function aren't
replicated, notifications aren't reliable in case of failover. This means that
notifications can easily be lost at failover if a transaction is committed just
when the server crashes. Applications running `LISTEN` might miss notifications
in case of failover.

This is true in standard PostgreSQL replication, and PGD doesn't yet improve on
this.

CAMO and Eager Replication options don't allow the `NOTIFY` SQL command or the
`pg_notify()` function.
34 changes: 34 additions & 0 deletions product_docs/docs/pgd/5/appusage/rules.mdx
@@ -0,0 +1,34 @@
---
title: General rules for applications
navTitle: Application rules
---

## Background

PGD uses replica identity values to identify the rows to change. Applications
can cause difficulties if they insert, delete, and then later reuse the same
unique identifiers. This is known as the [ABA
problem](https://en.wikipedia.org/wiki/ABA_problem). PGD can't know whether the
rows are the current row, the last row, or much older rows.

Similarly, since PGD uses table names to identify the table against which
changes are replayed, a similar ABA problem exists with applications that
create, drop, and then later reuse the same object names.

## Rules for applications

These issues give rise to some simple rules for applications to follow:

- Use unique identifiers for rows (INSERT).
- Avoid modifying unique identifiers (UPDATE).
- Avoid reusing deleted unique identifiers.
- Avoid reusing dropped object names.

In the general case, breaking those rules can lead to data anomalies and
divergence. Applications can break those rules as long as certain conditions are
met, but use caution: while anomalies are unlikely, they aren't impossible. For
example, you can reuse a row value as long as the DELETE was replayed on all
nodes, including down nodes. This might normally occur in less than a second but
can take days if a severe issue occurred on one node that prevented it from
restarting correctly.
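
One way to follow these rules is to let the database generate identifiers that
are never reused. This is a sketch under stated assumptions, not a
recommendation from the PGD documentation: the `job` table is hypothetical, and
`gen_random_uuid()` is built in only from PostgreSQL 13 onward.

```sql
CREATE TABLE job (
    job_id uuid PRIMARY KEY DEFAULT gen_random_uuid(),
    name   text NOT NULL
);

-- Each insert gets a fresh identifier that was never used before...
INSERT INTO job (name) VALUES ('nightly-report');

-- ...so deleting a row and creating a replacement never reuses a key.
DELETE FROM job WHERE name = 'nightly-report';
INSERT INTO job (name) VALUES ('nightly-report');
```

Because the replacement row has a new `job_id`, a late-arriving change for the
deleted row can't be mistaken for a change to the new one.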
