Skip to content

Commit

Permalink
Merge pull request #5660 from EnterpriseDB/content/docs/biganimal/hea…
Browse files Browse the repository at this point in the history
…lth_status

BigAnimal - Health Status doc
  • Loading branch information
nidhibhammar authored May 23, 2024
2 parents b56ef17 + 15ac017 commit a705d43
Showing 1 changed file with 74 additions and 0 deletions.
Original file line number Diff line number Diff line change
@@ -0,0 +1,74 @@
---
title: "Health Status"
deepToC: true
---

The Health Status dashboard provides real-time insight into the topology and health of Postgres high-availability clusters. It supports both Primary/Standby and Distributed high-availability clusters.

The Health Status dashboard displays:
- A set of cluster-wide health indicators that helps to draw your attention immediately to the critical issues.
- A schematic view of all the nodes organized in a cluster distributed across regions displaying health and role(s) of each node.
- A replication status view displaying the status of each replication slot and the associated replication lags.

## Viewing Health Status dashboard

To view the **Health Status** dashboard from the BigAnimal portal:

1. In the left navigation of BigAnimal Portal, go to **Clusters**.

2. Select any ready high-availability or PGD cluster.

3. On the cluster detail page, select the **Health Status** tab.

The **Health Status** tab displays the dashboard with health status categorized in sections:
- [Global cluster health](#global-cluster-health)
- [Regional nodes health and roles](#regional-nodes-health-and-roles)
- [Replication status](#replication-status)

### Global cluster health

The global cluster health section displays the cluster-wide view including the following metrics:

- **Raft Status** (PGD only) indicates whether the Raft consensus algorithm is running correctly in the cluster. It verifies that one node is elected as the global leader and the Raft roles such as RAFT_LEADER and RAFT_FOLLOWER are defined correctly across all the nodes.
- **Replication Slot Status** (PGD only) indicates whether all the replication slots are in streaming state or not.
- **Clock Skew** indicates whether the node's clock is in sync and doesn't exceed a threshold of 60 seconds.
- **Proxy Status** (PGD only) provides the number of PGD proxies up and running as compared to the available proxies.
- **Node Status** provides the number of nodes that are up and running.
- **Transaction rate** provides the total number of transactions including committed and rolled back transactions per second in the cluster.


### Regional nodes health and roles

The regional nodes health and roles section displays fine-grained health status at regional and node level. It is structured as an accordion, with each element representing a group of nodes deployed in the same region. Each item displays basic information including:
- **Proxy Status** (only PGD) indicates the number of active proxies compared to the available proxies in the specified regions.
- **Node Status** indicates the number of nodes up and running as compared to the available nodes in the specified region in a text chart. It provides the status of all nodes in the specified region using boolean indicator (green (OK)/red (KO)) in a text chart.

On expanding each item, it provides a list of nodes with information like:
- Total number of active connections as compared to the maximum number of configured connections for each node.
- A **Node Ko** tag for each node if it is down.
- Memory usage percentage on a progress bar.
- Storage usage percentage on a progress bar.

For PGD, it provides the tags below the node name:
- **Raft Leader** indicates that the node is a Raft Leader locally in the region.
- **Raft Follower** indicates that the node is a Raft Follower locally in the region.
- **Global Raft Leader** indicates that the nodes is a Raft Leader globally in the cluster.
- **Global Raft Follower** indicates that the node is a Raft Follower globally in the cluster.
- **Witness** indicates that the node is a witness in the PGD cluster. See [witness node docs](/pgd/latest/node_management/witness_nodes/) for more information.

For high-availability, it provides the tags below the node name:
- **Primary** indicates if the node role is primary.

### Replication status

The Replication Status section has a matrix displaying the replication lag across all the nodes of the cluster. The matrix provides different types of replication lags for **Write**, **Replay**, **Flush**, and **Sent**. It provides the lag in both bytes and time for **Write**, **Replay**, and **Flush** whereas only in bytes for **Sent**.

!!!note
- In high-availability clusters, replication occurs only from the primary (source) to the replicas (target). So the matrix displays only one row for the source and multiple columns for the targets.
- In PGD clusters, the replications are bidirectional, so the matrix displays an equal number of rows and columns. Every node is both a source and a target.
!!!

!!!note
The data on the Health Status dashboard is dynamic and is updated continuously. However, the cluster architecture is based on a snapshot. If a new node is added or an existing node is removed, you must reload the Health Status dashboard by clicking the tab again or reloading the browser page.
!!!

1 comment on commit a705d43

@github-actions
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please sign in to comment.