Merge pull request #5416 from EnterpriseDB/release-2024-03-20b
Release 2024-03-20b
djw-m authored Mar 20, 2024
2 parents 9c0ac9a + 2c8f5a1 commit 1ba91f3
Showing 4 changed files with 399 additions and 1 deletion.
@@ -121,7 +121,7 @@ The following options aren't available when creating your cluster:

- For Google Cloud, in **Volume Type**, select **SSD Persistent Disk**.

-  In **Volume Properties**, select the disk size for your cluster, and configure the IOPS.
+  In **Volume Properties**, select the disk size for your cluster.


2. ##### Network, Logs, & Telemetry section
389 changes: 389 additions & 0 deletions product_docs/docs/biganimal/release/migration/dha_bulk_migration.mdx
@@ -0,0 +1,389 @@
---
title: Bulk loading data into DHA/PGD clusters
navTitle: Bulk loading into DHA/PGD clusters
description: This guide is specifically for environments where there is no direct access to the PGD Nodes, only PGD Proxy endpoints, such as BigAnimal’s Distributed High Availability deployments of PGD.
deepToC: true
---

## Bulk loading data into PGD clusters

**This guide is specifically for environments where there is no direct access to the PGD Nodes, only PGD Proxy endpoints, such as BigAnimal’s Distributed High Availability deployments of PGD.**

Bulk loading data into a PGD cluster can, without care, place a significant replication load on the cluster. With that in mind, this document lays out a process to mitigate that replication load.

## Provision or prepare a PGD cluster

You will need to have provisioned a PGD cluster, either manually, via TPA, or on BigAnimal. This will be the target database for the bulk load.

We recommend that you set the following Postgres GUC variables when provisioning or, if needed, after provisioning:


| GUC variable | Setting |
|----------------------|----------------------------------------------------------------------------------------------------------------------------------------------|
| maintenance_work_mem | 1GB |
| wal_sender_timeout | 60min |
| wal_receiver_timeout | 60min |
| max_wal_size | Should be either<br/> • a multiple (2 or 3) of your largest table<br/> or<br/> • more than one third of the capacity of your dedicated WAL disk (if configured) |
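
Where you have direct Postgres access to the cluster, a minimal sketch of applying these with `ALTER SYSTEM` follows (on BigAnimal, set them through the cluster's configuration settings in the console instead; the `max_wal_size` value here is an assumed example and should be sized per the table above):

```sql
ALTER SYSTEM SET maintenance_work_mem = '1GB';
ALTER SYSTEM SET wal_sender_timeout = '60min';
ALTER SYSTEM SET wal_receiver_timeout = '60min';
ALTER SYSTEM SET max_wal_size = '100GB';   -- assumed example; size per the guidance above
SELECT pg_reload_conf();                   -- reload so the new settings take effect
```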


Make a note of the target's proxy hostname and port. You will also need a user and password for the target cluster.

In the following instructions, we give examples for a cluster named `ab-cluster`, with an `ab-group` subgroup and three nodes: `ab-node-1`, `ab-node-2`, and `ab-node-3`. The cluster is accessed through a host named `ab-proxy`. On BigAnimal, a cluster is configured, by default, with an `edb_admin` user that can be used for the bulk upload.


## Identify your data source

You will need the hostname, port, database name, user, and password for your source database.

You will also need a list of the tables in the database that you wish to migrate to the target database.
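
As a quick sanity check (a sketch; the connection parameters are placeholders), you can verify these details and list the source's tables with psql:

```
psql "host=<source-hostname> port=<source-port> dbname=<source-dbname> user=<source-user>" \
  -c "select schemaname, tablename from pg_catalog.pg_tables where schemaname not in ('pg_catalog','information_schema');"
```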


## Prepare a bastion server

Create a virtual machine with your preferred operating system in the cloud to orchestrate your bulk loading.

* Use your EDB account.
  * Obtain your EDB repository token from the [EDB Repos 2.0](https://www.enterprisedb.com/repos-downloads) page.
* Set environment variables.
  * Set the `EDB_SUBSCRIPTION_TOKEN` environment variable to the repository token (see the sketch after this list).
* Configure the repositories.
  * Run the automated installer to install the repositories.
* Install the required software. You will need to install and configure:
  * psql
  * PGD CLI
  * Migration Toolkit
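
A minimal sketch of the token setup, assuming a bash shell (substitute your actual token from the EDB Repos 2.0 page):

```
# Placeholder value; use the token from https://www.enterprisedb.com/repos-downloads
export EDB_SUBSCRIPTION_TOKEN=XXXXXXXXXXXXXXXX
```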


### Configure repositories

The required software is available from the EDB repositories. You will need to install the EDB repositories on your bastion server.

* Red Hat
```
curl -1sLf "https://downloads.enterprisedb.com/$EDB_SUBSCRIPTION_TOKEN/postgres_distributed/setup.rpm.sh" | sudo -E bash
curl -1sLf "https://downloads.enterprisedb.com/$EDB_SUBSCRIPTION_TOKEN/enterprise/setup.rpm.sh" | sudo -E bash
```

* Ubuntu/Debian
```
curl -1sLf "https://downloads.enterprisedb.com/$EDB_SUBSCRIPTION_TOKEN/postgres_distributed/setup.deb.sh" | sudo -E bash
curl -1sLf "https://downloads.enterprisedb.com/$EDB_SUBSCRIPTION_TOKEN/enterprise/setup.deb.sh" | sudo -E bash
```

### Install the required software

Once the repositories have been configured, you can install the required software.

#### psql and pg_dump/pg_restore

The psql command is the interactive terminal for working with PostgreSQL. It is a client application and can be installed on any operating system. Packaged with psql are pg_dump and pg_restore, command-line utilities for dumping and restoring PostgreSQL databases.

* Ubuntu
```
sudo apt install postgresql-client-16
```

* Red Hat
```
sudo dnf install postgresql-client-16
```

To simplify logging in to the databases, create a [.pgpass](https://www.postgresql.org/docs/current/libpq-pgpass.html) file for both your source and target servers:

```
source-host:source-port:source-dbname:source-user:source-password
target-proxy:target-port:target-dbname:target-user:target-password
```

Create the file in your home directory and change its permissions to read/write only for the owner.

```
chmod 0600 $HOME/.pgpass
```
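
For our example cluster, the target line might look like this (the source line remains a placeholder; `bdrdb` is the default PGD database name):

```
source-host:source-port:source-dbname:source-user:source-password
ab-proxy:5432:bdrdb:edb_admin:your-edb-admin-password
```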

#### PGD CLI

PGD CLI is a command line interface for managing and monitoring PGD clusters. It is a Go application and can be installed on any operating system.

* Ubuntu
```
sudo apt-get install edb-pgd5-cli
```
* Red Hat
```
sudo dnf install edb-pgd5-cli
```

Create a configuration file for the PGD CLI:

```
cluster:
  name: target-cluster-name
  endpoints:
    - host=target-cluster-hostname dbname=target-cluster-dbname port=target-cluster-port user=target-cluster-user-name
```

For our example `ab-cluster`:

```
cluster:
  name: ab-cluster
  endpoints:
    - host=ab-proxy dbname=bdrdb port=5432 user=edb_admin
```

Save it as `pgd-cli-config.yml`.

See also [https://www.enterprisedb.com/docs/pgd/latest/cli/installing_cli/](https://www.enterprisedb.com/docs/pgd/latest/cli/installing_cli/)


#### Migration Toolkit

EDB's Migration Toolkit (MTK) is a command-line tool that can be used to migrate data from a source database to a target database. It is a Java application and requires a Java runtime environment to be installed.

* Ubuntu
```
sudo apt-get -y install edb-migrationtoolkit
sudo wget https://jdbc.postgresql.org/download/postgresql-42.7.2.jar -P /usr/edb/migrationtoolkit/lib
```
* Red Hat
```
sudo dnf -y install edb-migrationtoolkit
sudo wget https://jdbc.postgresql.org/download/postgresql-42.7.2.jar -P /usr/edb/migrationtoolkit/lib
```
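
If a Java runtime isn't already present on the bastion, install one first; a sketch, with package names that are assumptions and vary by distribution:

```
# Ubuntu/Debian
sudo apt-get -y install default-jre
# Red Hat
sudo dnf -y install java-11-openjdk
```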

See also [https://www.enterprisedb.com/docs/migration_toolkit/latest/installing/](https://www.enterprisedb.com/docs/migration_toolkit/latest/installing/)


## Set up and tune the target cluster

On the target cluster, within the required regional group, select one node to be the destination for the data.

If we have a group `ab-group` with `ab-node-1`, `ab-node-2` and `ab-node-3`, we may select `ab-node-1` as our destination node.


### Set up a fence

Fence off all other nodes apart from the destination node.

Connect to any node on the destination group using the `psql` command.
Use `bdr.alter_node_option` to set the `route_fence` option to `true`
for each node in the group apart from the destination node:

```sql
select bdr.alter_node_option('ab-node-2','route_fence','t');
select bdr.alter_node_option('ab-node-3','route_fence','t');
```

The next time you connect with `psql`, you will be directed to the write leader, which should be the destination node. To ensure that it is, we need to take one more step.
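
As a quick check, you can connect through the proxy with `psql` and confirm which node you have landed on; a sketch using `bdr.local_node_summary`:

```sql
-- Run after connecting via the proxy; should report the destination node
select node_name from bdr.local_node_summary;
```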


### Make the destination node both write and raft leader

To minimize the possibility of disconnections, we move the raft and write leader roles to our destination node.

Make the destination node the raft leader using `bdr.raft_leadership_transfer`:

```sql
select bdr.raft_leadership_transfer('ab-node-1',true);
```

This triggers a write leader election, which elects `ab-node-1` as write leader because you have fenced off the other nodes in the group.
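
You can confirm the change using the PGD CLI configuration file created earlier; `show-raft` reports the raft status and `show-groups` includes each group's write leader:

```
pgd -f pgd-cli-config.yml show-raft
pgd -f pgd-cli-config.yml show-groups
```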


### Record then clear default commit scopes

We need to make a record of the default commit scopes in the cluster. The next step overwrites the settings, and at the end of this process we will need to restore them. Run:

```sql
select node_group_name,default_commit_scope from bdr.node_group_summary;
```

This will produce output similar to:

```
node_group_name | default_commit_scope
-----------------+----------------------
world |
ab-group | ba001_ab-group-a
```

Record these values. We can now overwrite the settings:

```sql
select bdr.alter_node_group_option('ab-group','default_commit_scope', 'local');
```
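
Re-running the earlier summary query should now show `local` as the default commit scope for the group:

```sql
select node_group_name,default_commit_scope from bdr.node_group_summary;
```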

## Prepare to monitor the data migration

Check that the target cluster is healthy.

* Run `pgd -f pgd-cli-config.yml check-health` to check the overall health of the cluster:
```
Check Status Message
----- ------ -------
ClockSkew Ok All BDR node pairs have clockskew within permissible limit
Connection Ok All BDR nodes are accessible
Raft Ok Raft Consensus is working correctly
Replslots Ok All BDR replication slots are working correctly
Version Ok All nodes are running same BDR versions
```
(All checks should pass)

* Run `pgd -f pgd-cli-config.yml verify-cluster` to verify the configuration of the cluster:
```
Check Status Groups
----- ------ ------
There is always at least 1 Global Group and 1 Data Group Ok
There are at least 2 data nodes in a Data Group (except for the witness-only group) Ok
There is at most 1 witness node in a Data Group Ok
Witness-only group does not have any child groups Ok
There is at max 1 witness-only group iff there is even number of local Data Groups Ok
There are at least 2 proxies configured per Data Group if routing is enabled Ok
```
(All checks should pass)

* Run `pgd -f pgd-cli-config.yml show-nodes` to check the status of the nodes:
```
Node Node ID Group Type Current State Target State Status Seq ID
---- ------- ----- ---- ------------- ------------ ------ ------
ab-node-1 807899305 ab-group data ACTIVE ACTIVE Up 1
ab-node-2 2587806295 ab-group data ACTIVE ACTIVE Up 2
ab-node-3 199017004 ab-group data ACTIVE ACTIVE Up 3
```


* Run `pgd -f pgd-cli-config.yml show-raft` to confirm the raft leader.

* Run `pgd -f pgd-cli-config.yml show-replslots` to confirm the replication slots.

* Run `pgd -f pgd-cli-config.yml show-subscriptions` to confirm the subscriptions.

* Run `pgd -f pgd-cli-config.yml show-groups` to confirm the groups.

These commands will provide a snapshot of the state of the cluster before the migration begins.
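
To keep that snapshot for later comparison, you might capture each command's output to a file; a sketch, assuming a bash shell:

```
for cmd in check-health verify-cluster show-nodes show-raft show-replslots show-subscriptions show-groups; do
    pgd -f pgd-cli-config.yml $cmd > "premigration-$cmd.txt"
done
```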

## Migrating the data

This currently has to be performed in three phases.

1. Transferring the “pre-data” using pg_dump and pg_restore. This exports and imports all the data definitions.
1. Using MTK (Migration Toolkit) with the `-dataOnly` option to transfer only the data from each table, repeating as necessary for each table.
1. Transferring the “post-data” using pg_dump and pg_restore. This completes the data transfer.


### Transferring the pre-data

Use the `pg_dump` utility against the source database to dump the pre-data section in directory format.

```
pg_dump -Fd -f predata --section=pre-data -h <source-hostname> -p <source-port> -U <source-user> <source-dbname>
```

Once the pre-data has been dumped into the predata directory, it can be loaded, using `pg_restore`, into the target cluster.

```
pg_restore -Fd --section=pre-data -d "host=ab-node-1-host dbname=<target-dbname> user=<target-user> options='-cbdr.ddl_locking=off -cbdr.commit_scope=local'" predata
```

The `options=` section in the connection string to the server is important. The options disable DDL locking and set the commit scope to `local`, overriding any default commit scopes. Using `--section=pre-data` limits the restore to the configuration that precedes the data in the dump.

### Transferring the data

In this step, the Migration Toolkit will be used to transfer the table data between the source and target.

Edit `/usr/edb/migrationtoolkit/etc/toolkit.properties`.

You will need to use sudo to raise your privilege to edit it, for example: `sudo vi /usr/edb/migrationtoolkit/etc/toolkit.properties`.

```
SRC_DB_URL=jdbc:postgresql://<source-host>:<source-port>/<source-dbname>
SRC_DB_USER=<source-user>
SRC_DB_PASSWORD=<source-password>
TARGET_DB_URL=jdbc:postgresql://<target-host>:<target-port>/<target-dbname>
TARGET_DB_USER=<target-user>
TARGET_DB_PASSWORD=<target-password>
```

Edit the relevant values into the settings. Ensure that the configuration file is owned by the user you intend to run the data transfer as, and is read/write only for its owner.
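
For example, assuming you will run the data transfer as the current user:

```
sudo chown $USER /usr/edb/migrationtoolkit/etc/toolkit.properties
sudo chmod 0600 /usr/edb/migrationtoolkit/etc/toolkit.properties
```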

Now, select sets of tables in the source database that should be transferred together, ideally grouping them for redundancy in case of failure.

```
nohup /usr/edb/migrationtoolkit/bin/runMTK.sh -sourcedbtype postgres -targetdbtype postgres -loaderCount 1 -tableLoaderLimit 1 -fetchSize 4000 -parallelLoadRowLimit 1000 -truncLoad -dataOnly -tables <tablename1>,<tablename2>,... <schemaname> > mtk.log
```

This command uses the `-truncLoad` option and will drop indexes and constraints before the data is loaded, then recreate them after the loading has completed.

You can run multiple instances of this command in parallel; add an `&` to the end of the command. Ensure that you write the output from each to a different file (e.g., mtk_1.log, mtk_2.log).

For example:

```
nohup /usr/edb/migrationtoolkit/bin/runMTK.sh -sourcedbtype postgres -targetdbtype postgres -loaderCount 1 -tableLoaderLimit 1 -fetchSize 4000 -parallelLoadRowLimit 1000 -truncLoad -dataOnly -tables warehouse,district,item,new_order,orders,history public >mtk_1.log &
nohup /usr/edb/migrationtoolkit/bin/runMTK.sh -sourcedbtype postgres -targetdbtype postgres -loaderCount 1 -tableLoaderLimit 1 -fetchSize 4000 -parallelLoadRowLimit 1000 -truncLoad -dataOnly -tables customer public >mtk_2.log &
nohup /usr/edb/migrationtoolkit/bin/runMTK.sh -sourcedbtype postgres -targetdbtype postgres -loaderCount 1 -tableLoaderLimit 1 -fetchSize 4000 -parallelLoadRowLimit 1000 -truncLoad -dataOnly -tables order_line public >mtk_3.log &
nohup /usr/edb/migrationtoolkit/bin/runMTK.sh -sourcedbtype postgres -targetdbtype postgres -loaderCount 1 -tableLoaderLimit 1 -fetchSize 4000 -parallelLoadRowLimit 1000 -truncLoad -dataOnly -tables stock public >mtk_4.log &
```

This sets up four processes, each transferring a particular table or set of tables as a background process.

While this is running, monitor the lag. Log in to the destination node with psql and monitor lag with:

```sql
SELECT NOW(); SELECT pg_size_pretty( pg_database_size('bdrdb') ); SELECT * FROM bdr.node_replication_rates;
```

Once the lag has been consumed, return to the shell. You can now use `tail` to monitor the progress of the data transfer by following the log files of each process:

```
tail -f mtk_1.log mtk_2.log mtk_3.log mtk_4.log
```

### Transferring the post-data

Make sure there is no replication lag across the entire cluster before proceeding with post-data.
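
You can reuse the lag query from the previous step on the destination node, or check the replication slots across the whole cluster with the PGD CLI:

```
pgd -f pgd-cli-config.yml show-replslots
```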

Now dump the post-data section of the source database:

```
pg_dump -Fd -f postdata --section=post-data -h <source-hostname> -p <source-port> -U <source-user> <source-dbname>
```

And then load the post-data section into the target database:

```
pg_restore -Fd -d "host=ab-node-1-host dbname=<target-dbname> user=<target-user> options='-cbdr.ddl_locking=off -cbdr.commit_scope=local'" --section=post-data postdata
```

If this step fails due to a disconnection, return to monitoring lag (as above); then, when no synchronization lag is present, repeat the restore.

## Resume the cluster

### Remove the routing fences you set up earlier on the other nodes

Connect directly to the destination node via psql. Use `bdr.alter_node_option` and turn off the `route_fence` option for each node in the group apart from the destination node, which is already off.

```sql
select bdr.alter_node_option('ab-node-2','route_fence','f');
select bdr.alter_node_option('ab-node-3','route_fence','f');
```

Proxies will now be able to route to all the nodes in the group.

### Reset commit scopes

You can now restore the default commit scopes to the cluster to allow PGD to manage the replication load. Set the `default_commit_scope` for the groups to the value for [the groups that you recorded in an earlier step](#record-then-clear-default-commit-scopes).

```sql
select bdr.alter_node_group_option('ab-group','default_commit_scope', 'ba001_ab-group-a');
```

The cluster is now loaded and ready for production. For more assurance, you can run the `pgd -f pgd-cli-config.yml check-health` command to check the overall health of the cluster (and the other pgd commands from when you checked the cluster earlier).