Update the multi-ha docs (#3510)
* draft
atovpeko authored Oct 18, 2024
1 parent 89a8d30 commit a6cb09e
Showing 1 changed file with 31 additions and 24 deletions.
55 changes: 31 additions & 24 deletions use-timescale/ha-replicas/high-availability.md
@@ -13,52 +13,59 @@ cloud_ui:
# Manage high availability

For Timescale Cloud Service with very low tolerance for downtime, Timescale Cloud offers
High Availability(HA) replicas. HA replicas significantly reduce the risk of downtime and data loss due to
High Availability (HA) replicas. HA replicas significantly reduce the risk of downtime and data loss due to
system failure, and enable services to avoid downtime during routine maintenance.

This page shows you how to choose the best high availability option for your Timescale Cloud Service.

## What is HA replication?

HA replicas are exact, up-to-date copies of your database that automatically take over operations as your primary data
node if the original primary data node becomes unavailable. HA replicas are synchronous or asynchronous hot standbys
hosted in multiple AWS availability zones(AZ) that use streaming replication to minimize the chance of data loss during
failover. That is, the primary node streams its write-ahead log (WAL) to the replicas.
HA replicas are exact, up-to-date copies of your database hosted in multiple AWS availability zones (AZ) within the same region as your primary node. They automatically take over operations if the original primary data node becomes unavailable. The primary node streams its write-ahead log (WAL) to the replicas to minimize the chances of data loss during failover.

HA replicas can be synchronous or asynchronous.

- Synchronous: the primary commits its next write once the replica confirms that the previous write is complete. There is no lag between the primary and the replica. They are in the same state at all times. This is preferable if you need the highest level of data integrity. However, this affects the primary ingestion time.

- Asynchronous: the primary commits its next write without the confirmation of the previous write completion. The asynchronous HA replicas often have a lag, in both time and data, compared to the primary. This is preferable if you need the shortest primary ingest time.

![Sync and async replication](https://assets.timescale.com/docs/images/sync_async_replication_draft.png)
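
The difference between the two modes can be illustrated with a toy model. The following is a minimal sketch, not how Timescale Cloud or PostgreSQL implements streaming replication: the `Primary` and `Replica` classes, their methods, and the in-memory WAL list are all hypothetical names introduced for illustration. It shows only that a synchronous replica has confirmed every write by the time a commit returns, while an asynchronous replica can lag until it catches up.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Replica:
    """Hypothetical toy replica that applies WAL records streamed to it."""
    synchronous: bool
    applied: list = field(default_factory=list)
    pending: deque = field(default_factory=deque)  # received but not yet applied

    def receive(self, record):
        self.pending.append(record)

    def apply_pending(self):
        # Apply every record received so far, in order.
        while self.pending:
            self.applied.append(self.pending.popleft())


@dataclass
class Primary:
    """Hypothetical toy primary that streams each WAL record to its replicas."""
    replicas: list
    wal: list = field(default_factory=list)

    def commit(self, record):
        self.wal.append(record)
        for replica in self.replicas:
            replica.receive(record)
            if replica.synchronous:
                # Sync: the commit does not return until the replica confirms.
                replica.apply_pending()
            # Async: the record stays pending, so the replica may lag.

    def lag(self, replica):
        # Number of WAL records the replica has not applied yet.
        return len(self.wal) - len(replica.applied)


sync_replica = Replica(synchronous=True)
async_replica = Replica(synchronous=False)
primary = Primary(replicas=[sync_replica, async_replica])

for i in range(3):
    primary.commit(f"write-{i}")

lag_sync = primary.lag(sync_replica)
lag_async_before = primary.lag(async_replica)

async_replica.apply_pending()  # the async replica catches up later
lag_async_after = primary.lag(async_replica)
```

In this model the synchronous path does extra work inside `commit`, which mirrors the trade-off described above: no lag on the sync replica, at the cost of primary ingest time.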

HA replicas have separate unique addresses that you can use to serve read-only requests in parallel to your
primary data node. When your primary data node fails, Timescale Cloud automatically _fails over_ to
a HA replica. During failover, the read-only address is unavailable while Timescale Cloud automatically create
a new HA replica. The time to make this replica depends on several factors, including the size of your data.
You
primary data node. When your primary data node fails, Timescale Cloud automatically fails over to
an HA replica within 30 seconds. During failover, the read-only address is unavailable while Timescale Cloud automatically creates a new HA replica. The time to make this replica depends on several factors, including the size of your data.

Operations such as upgrading your Timescale Cloud Service to a new major or minor version may necessitate
a service restart. Restarts are run during the [maintenance window][upgrade]. To avoid any downtime, each data
node is updated in turn. That is, while the primary data node is updated, a replica is promoted to primary.
After the primary is updated and online, the same maintenance is performed on the HA replicas.

To ensure that all Timescale Cloud Services have minimum downtown and data loss in the most common
To ensure that all Timescale Cloud Services have minimum downtime and data loss in the most common
failure scenarios and during maintenance, [rapid recovery][rapid-recovery] is enabled by default for all services.

## Choose an HA strategy

The following HA configurations are available in Timescale Cloud:

- **Non-production**: no replica, best for developer environments.
- **High Availability**: a single replica in a separate AWS availability zone. The High availability optimized mode is
good for both price sensitive customers and those who care most about failover speed and performance.
Async replication provides faster write speeds and improved performance for apps with less stringent
consistency requirements.

- **Highest Availability**: two readable replicas in separate AWS availability zones. Available replication modes are:
- *Optimized* - two asynchronous replicas: transactions are considered complete without waiting for the replicas to
confirm. Async replication provides faster write speeds and improved performance for apps with less stringent
consistency requirements. When you access a HA read endpoint Timescale Cloud load balances across the replicas.
- *High data integrity* - one synchronous replica and one asynchronous replica: A synchronous replica is guaranteed to
always be in the exact same state as the primary, minimizing failover time and ensuring no data loss. However,
synchronous replicas reduce ingest performance and do not provide a replica endpoint

Synchronous replication ensures the highest level of data consistency and safety.

- **High availability**: a single async replica in a different AWS availability zone from your primary. Provides high availability with cost efficiency. Best for production apps.

- **Highest availability**: two replicas in different AWS availability zones from your primary. Available replication modes are:

- **High performance** - two async replicas. Provides the highest level of availability with two AZs and the ability to query the HA system. Best for absolutely critical apps.
- **High data integrity** - one sync replica and one async replica. The sync replica is identical to the primary at all times. Best for apps that can tolerate no data loss.

The following table summarizes the differences between these HA configurations:

|| High availability <br/> (1 async) | High performance <br/> (2 async) | High data integrity <br/> (1 sync + 1 async) |
|-------|----------|------------|-----|
|Write flow |The primary streams its WAL to the async replica, which may have a slight lag compared to the primary, providing 99.9% uptime SLA. |The primary streams its writes to both async replicas, providing 99.9+% uptime SLA.|The primary streams its writes to the sync and async replicas. The async replica is never ahead of the sync one.|
|Additional read replica|Recommended. Reads from the HA replica may cause availability and lag issues. |Not needed. You can still read from the HA replica even if one of them is down. Configure an additional read replica only if your read use case is significantly different from your write use case.|Highly recommended. If you run heavy queries on a sync replica, it may fall behind the primary. Specifically, if it takes too long for the replica to confirm a transaction, the next transaction is canceled.|
|Choosing the replica to read from manually| Not applicable. |Not available. Queries are load-balanced against all available HA replicas. |Not available. Queries are load-balanced against all available HA replicas.|
| Sync replication | Only async replicas are supported in this configuration. |Only async replicas are supported in this configuration. | Supported.|
| Failover flow | <ul><li>If the primary fails, the replica becomes the primary while a new node is created, with only seconds of downtime.</li><li>If the replica fails, a new async replica is created without impacting the primary. If you read from the async HA replica, those reads fail until the new replica is available.</li></ul> |<ul><li>If the primary fails, one of the replicas becomes the primary while a new node is created, with the other one still available for reads.</li><li>If the replica fails, a new async replica is created in another AZ, without impacting the primary. The newly created replica is behind the primary and the original replica while it catches up.</li></ul>|<ul><li>If the primary fails, the sync replica becomes the primary while a new node is created, with the async one still available for reads.</li><li>If the async replica fails, a new async replica is created. Heavy reads on the sync replica may delay the ingest time of the primary while a new async replica is created. Data integrity remains high but primary ingest performance may degrade.</li><li>If the sync replica fails, the async replica becomes the sync one, and a new async replica is created. The primary may experience some ingest performance degradation during this time.</li></ul>|
| Cost composition | Primary + async (2x) |Primary + 2 async (3x)|Primary + 1 async + 1 sync (3x)|
| Tier | Performance, Scale, and Enterprise |Scale and Enterprise|Scale and Enterprise|

The `High` and `Highest` HA strategies are available with the [Scale and the Enterprise][pricing-plans] pricing plans.
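
The cost and tier rows of the table can be captured in a small lookup, which may help when comparing options in a script. This is a sketch under stated assumptions: the `HA_CONFIGS` dictionary and the helper functions are hypothetical names, not part of any Timescale API, and the values are copied from the table above, with the cost multiplier treated as one unit per node.

```python
# Hypothetical lookup built from the comparison table; not a Timescale API.
HA_CONFIGS = {
    "High availability": {
        "replicas": ["async"],
        "tiers": {"Performance", "Scale", "Enterprise"},
    },
    "High performance": {
        "replicas": ["async", "async"],
        "tiers": {"Scale", "Enterprise"},
    },
    "High data integrity": {
        "replicas": ["sync", "async"],
        "tiers": {"Scale", "Enterprise"},
    },
}


def cost_multiplier(config):
    """Primary plus one unit per replica, as in the cost composition row."""
    return 1 + len(HA_CONFIGS[config]["replicas"])


def supports_sync(config):
    """True if the configuration includes a synchronous replica."""
    return "sync" in HA_CONFIGS[config]["replicas"]


def available_configs(tier):
    """Names of the HA configurations that the table lists for a pricing tier."""
    return sorted(name for name, cfg in HA_CONFIGS.items() if tier in cfg["tiers"])


for name in HA_CONFIGS:
    print(f"{name}: {cost_multiplier(name)}x, sync replica: {supports_sync(name)}")
```

For example, `available_configs("Performance")` returns only the single-replica configuration, because the table lists the two-replica modes under Scale and Enterprise.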
