From 016b99dc3365353c39b0e10a5420140533040003 Mon Sep 17 00:00:00 2001 From: Ilya Mashchenko Date: Tue, 24 Dec 2024 09:12:12 +0200 Subject: [PATCH 1/3] docs: improve on-prem troubleshooting readability (#19279) * docs: improve on-prem troubleshooting readability * Apply suggestions from code review --------- Co-authored-by: Fotis Voutsas --- .../netdata-cloud-on-prem/troubleshooting.md | 86 +++++++++++++------ 1 file changed, 61 insertions(+), 25 deletions(-) diff --git a/docs/netdata-cloud/netdata-cloud-on-prem/troubleshooting.md b/docs/netdata-cloud/netdata-cloud-on-prem/troubleshooting.md index c8a05e3f34ac06..b09a1c1d013575 100644 --- a/docs/netdata-cloud/netdata-cloud-on-prem/troubleshooting.md +++ b/docs/netdata-cloud/netdata-cloud-on-prem/troubleshooting.md @@ -11,9 +11,9 @@ These components should be monitored and managed according to your organization' ## Common Issues -### Installation cannot finish +### Timeout During Installation -If you are getting error like: +If your installation fails with this error: ``` Installing netdata-cloud-onprem (or netdata-cloud-dependency) helm chart... @@ -21,49 +21,85 @@ Installing netdata-cloud-onprem (or netdata-cloud-dependency) helm chart... Error: client rate limiter Wait returned an error: Context deadline exceeded. ``` -There are probably not enough resources available. Fortunately, it is very easy to verify with the `kubectl` utility. In the case of a full installation, switch the context to the cluster where On-Prem is being installed. For the Light PoC installation, SSH into the Ubuntu VM where `kubectl` is already installed and configured. +This error typically indicates **insufficient cluster resources**. Here's how to diagnose and resolve the issue. -To verify check if there are any `Pending` pods: +#### Diagnosis Steps -```shell -kubectl get pods -n netdata-cloud | grep -v Running -``` +> **Important** +> +> - For full installation: Ensure you're in the correct cluster context. 
+> - For Light PoC: SSH into the Ubuntu VM with `kubectl` pre-configured. +> - For Light PoC, always perform a complete uninstallation before attempting a new installation. -To check which resource is a limiting factor pick one of the `Pending` pods and issue command: +1. Check for pods stuck in Pending state: -```shell -kubectl describe pod -n netdata-cloud -``` + ```shell + kubectl get pods -n netdata-cloud | grep -v Running + ``` -At the end in an `Events` section information about insufficient `CPU` or `Memory` on available nodes should appear. -Please check the minimum requirements for your on-prem installation type or contact our support - `support@netdata.cloud`. +2. If you find Pending pods, examine the resource constraints: -> **Warning** -> -> In case of the Light PoC installations always uninstall before the next attempt. + ```shell + kubectl describe pod -n netdata-cloud + ``` + + Review the Events section at the bottom of the output. Look for messages about: + - Insufficient CPU + - Insufficient Memory + - Node capacity issues + +3. View overall cluster resources: + + ```shell + # Check resource allocation across nodes + kubectl top nodes + + # View detailed node capacity + kubectl describe nodes | grep -A 5 "Allocated resources" + ``` -### Installation finished but login does not work +#### Solution -It depends on the installation and login type, but the underlying problem is usually located in the `values.yaml` file. In the case of Light PoC installations, this is also true, but the installation script fills in the data for the user. We can split the problem into two variants: +1. Compare your available resources against the [minimum requirements](https://github.com/netdata/netdata/blob/master/docs/netdata-cloud/netdata-cloud-on-prem/installation.md#system-requirements). +2. Take one of these actions: + - Add more resources to your cluster. + - Free up existing resources. -1. 
SSO is not working - you need to check your tokens and callback URLs for a given provider. Equally important is the certificate - it needs to be trusted, and also hostname(s) under `global.public` section - make sure that FQDN is correct. -2. Mail login is not working: - 1. If you are using a Light PoC installation with MailCatcher, the problem usually appears if the wrong hostname was used during the installation. It needs to be a FQDN that matches the provided certificate. The usual error in such a case points to a invalid token. - 2. If the magic link is not arriving for MailCatcher, it's likely because the default values were changed. In the case of using your own mail server, check the `values.yaml` file in the `global.mail.smtp` section and your network settings. +### Login Issues After Installation -If you are getting the error `Something went wrong - invalid token` and you are sure that it is not related to the hostname or the mail configuration as described above, it might be related to a dirty state of Netdata secrets. During the installation, a secret called `netdata-cloud-common` is created. By default, this secret should not be deleted by Helm and is created only if it does not exist. It stores a few strings that are mandatory for Netdata Cloud On-Prem's provisioning and continuous operation. Because they are used to hash the data in the PostgreSQL database, a mismatch will cause data corruption where the old data is not readable and the new data is hashed with the wrong string. Either a new installation is needed, or contact to our support to individually analyze the complexity of the problem. +Installation may complete successfully, but login issues can occur due to configuration mismatches. This table provides a quick reference for troubleshooting common login issues after installation. 
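+
+Several of the fixes below touch `values.yaml` or interact with the `netdata-cloud-common` secret. Before changing anything, consider snapshotting that secret first — a sketch only, assuming the default `netdata-cloud` namespace:
+
+```shell
+# Back up the secret whose strings hash the PostgreSQL data (assumed namespace: netdata-cloud)
+kubectl get secret netdata-cloud-common -n netdata-cloud -o yaml > netdata-cloud-common.backup.yaml
+```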
+ +| Issue | Symptoms | Cause | Solution | +|-------------------------------|---------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------| +| SSO Login Failure | Unable to authenticate via SSO providers | - Invalid callback URLs
- Expired/invalid SSO tokens
- Untrusted certificates
- Incorrect FQDN in `global.public` | - Update SSO configuration in `values.yaml`
- Verify certificates are valid and trusted
- Ensure FQDN matches certificate | +| MailCatcher Login (Light PoC) | - Magic links not arriving
- "Invalid token" errors | - Incorrect hostname during installation
- Modified default MailCatcher values | - Reinstall with correct FQDN
- Restore default MailCatcher settings
- Ensure hostname matches certificate | +| Custom Mail Server Login | Magic links not arriving | - Incorrect SMTP configuration
- Network connectivity issues | - Update SMTP settings in `values.yaml`
- Verify network allows SMTP traffic
- Check mail server logs | +| Invalid Token Error | "Something went wrong - invalid token" message | - Mismatched `netdata-cloud-common` secret
- Database hash mismatch
- Namespace change without secret migration | - Migrate secret before namespace change
- Perform fresh installation
- Contact support for data recovery | > **Warning** > -> If you are changing the installation namespace secret netdata-cloud-common will be created again. Make sure to transfer it beforehand or wipe postgres before new installation. +> If you're modifying the installation namespace, the `netdata-cloud-common` secret will be recreated. +> +> **Before proceeding**: Back up the existing `netdata-cloud-common` secret. Alternatively, wipe the PostgreSQL database to prevent data conflicts. ### Slow Chart Loading or Chart Errors When charts take a long time to load or fail with errors, the issue typically stems from data collection challenges. The `charts` service must gather data from multiple Agents within a Room, requiring successful responses from all queried Agents. | Issue | Symptoms | Cause | Solution | -| -------------------- | --------------------------------------------------------------------------------------------------------------- | ---------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +|----------------------|-----------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| | Agent Connectivity | - Queries stall or timeout
- Inconsistent chart loading | Slow Agents or unreliable network connections prevent timely data collection | Deploy additional [Parent](/docs/observability-centralization-points/README.md) nodes to provide reliable backends. The system will automatically prefer these for queries when available | | Kubernetes Resources | - Service throttling
- Slow data processing
- Delayed dashboard updates | Resource saturation at the node level or restrictive container limits | Review and adjust container resource limits and node capacity as needed | | Database Performance | - Slow query responses
- Increased latency across services | PostgreSQL performance bottlenecks | Monitor and optimize database resource utilization:
- CPU usage
- Memory allocation
- Disk I/O performance | | Message Broker | - Delayed node status updates (online/offline/stale)
- Slow alert transitions
- Dashboard update delays | Message accumulation in Pulsar due to processing bottlenecks | - Review Pulsar configuration
- Adjust microservice resource allocation
- Monitor message processing rates | + +## Need Help? + +If issues persist: + +1. Gather the following information: + + - Installation logs + - Your cluster specifications + +2. Contact support at `support@netdata.cloud`. From 5158928b9e3c6726e50675a374d9c3b37bbaddfb Mon Sep 17 00:00:00 2001 From: Ilya Mashchenko Date: Tue, 24 Dec 2024 13:29:21 +0200 Subject: [PATCH 2/3] improvement(go.d/nats): add leafz metrics (#19282) --- src/go/plugin/go.d/collector/nats/cache.go | 78 +++++++++---- src/go/plugin/go.d/collector/nats/charts.go | 109 ++++++++++++++++++ src/go/plugin/go.d/collector/nats/collect.go | 40 ++++++- .../go.d/collector/nats/collector_test.go | 13 ++- .../plugin/go.d/collector/nats/metadata.yaml | 38 ++++++ src/go/plugin/go.d/collector/nats/restapi.go | 21 ++++ .../nats/testdata/v2.10.24/leafz.json | 21 ++++ 7 files changed, 289 insertions(+), 31 deletions(-) create mode 100644 src/go/plugin/go.d/collector/nats/testdata/v2.10.24/leafz.json diff --git a/src/go/plugin/go.d/collector/nats/cache.go b/src/go/plugin/go.d/collector/nats/cache.go index 93ddbdef3ab07c..96c6ed048cd19a 100644 --- a/src/go/plugin/go.d/collector/nats/cache.go +++ b/src/go/plugin/go.d/collector/nats/cache.go @@ -2,12 +2,17 @@ package nats +import ( + "fmt" +) + func newCache() *cache { return &cache{ accounts: make(accCache), routes: make(routeCache), inGateways: make(gwCache), outGateways: make(gwCache), + leafs: make(leafCache), } } @@ -16,6 +21,7 @@ type cache struct { routes routeCache inGateways gwCache outGateways gwCache + leafs leafCache } func (c *cache) resetUpdated() { @@ -36,17 +42,18 @@ func (c *cache) resetUpdated() { type ( accCache map[string]*accCacheEntry accCacheEntry struct { - accName string hasCharts bool updated bool + + accName string } ) -func (c *accCache) put(name string) { - acc, ok := (*c)[name] +func (c *accCache) put(ai accountInfo) { + acc, ok := (*c)[ai.Account] if !ok { - acc = &accCacheEntry{accName: name} - (*c)[name] = acc + acc = 
&accCacheEntry{accName: ai.Account} + (*c)[ai.Account] = acc } acc.updated = true } @@ -54,18 +61,19 @@ func (c *accCache) put(name string) { type ( routeCache map[uint64]*routeCacheEntry routeCacheEntry struct { - rid uint64 - remoteId string hasCharts bool updated bool + + rid uint64 + remoteId string } ) -func (c *routeCache) put(rid uint64, remoteId string) { - route, ok := (*c)[rid] +func (c *routeCache) put(ri routeInfo) { + route, ok := (*c)[ri.Rid] if !ok { - route = &routeCacheEntry{rid: rid, remoteId: remoteId} - (*c)[rid] = route + route = &routeCacheEntry{rid: ri.Rid, remoteId: ri.RemoteID} + (*c)[ri.Rid] = route } route.updated = true } @@ -73,36 +81,58 @@ func (c *routeCache) put(rid uint64, remoteId string) { type ( gwCache map[string]*gwCacheEntry gwCacheEntry struct { - gwName string - rgwName string hasCharts bool updated bool - conns map[uint64]*gwConnCacheEntry + + gwName string + rgwName string + conns map[uint64]*gwConnCacheEntry } gwConnCacheEntry struct { - gwName string - rgwName string - cid uint64 hasCharts bool updated bool + + gwName string + rgwName string + cid uint64 } ) -func (c *gwCache) put(gwName, rgwName string) { +func (c *gwCache) put(gwName, rgwName string, rgi *remoteGatewayInfo) { gw, ok := (*c)[gwName] if !ok { gw = &gwCacheEntry{gwName: gwName, rgwName: rgwName, conns: make(map[uint64]*gwConnCacheEntry)} (*c)[gwName] = gw } gw.updated = true -} -func (c *gwCache) putConn(gwName, rgwName string, cid uint64) { - c.put(gwName, rgwName) - conn, ok := (*c)[gwName].conns[cid] + conn, ok := gw.conns[rgi.Connection.Cid] if !ok { - conn = &gwConnCacheEntry{gwName: gwName, rgwName: rgwName, cid: cid} - (*c)[gwName].conns[cid] = conn + conn = &gwConnCacheEntry{gwName: gwName, rgwName: rgwName, cid: rgi.Connection.Cid} + gw.conns[rgi.Connection.Cid] = conn } conn.updated = true } + +type ( + leafCache map[string]*leafCacheEntry + leafCacheEntry struct { + hasCharts bool + updated bool + + leafName string + account string + ip string 
+ port int + } +) + +func (c *leafCache) put(li leafInfo) { + key := fmt.Sprintf("%s_%s_%s_%d", li.Name, li.Account, li.IP, li.Port) + leaf, ok := (*c)[key] + if !ok { + leaf = &leafCacheEntry{leafName: li.Name, account: li.Account, ip: li.IP, port: li.Port} + (*c)[key] = leaf + } + leaf.updated = true +} diff --git a/src/go/plugin/go.d/collector/nats/charts.go b/src/go/plugin/go.d/collector/nats/charts.go index 50767f1cea285c..5772dab90377c1 100644 --- a/src/go/plugin/go.d/collector/nats/charts.go +++ b/src/go/plugin/go.d/collector/nats/charts.go @@ -41,6 +41,11 @@ const ( prioGatewayConnMessages prioGatewayConnSubscriptions prioGatewayConnUptime + + prioLeafConnTraffic + prioLeafConnMessages + prioLeafConnSubscriptions + prioLeafRTT ) var serverCharts = func() module.Charts { @@ -391,6 +396,65 @@ var ( } ) +var leafConnChartsTmpl = module.Charts{ + leafConnTrafficTmpl.Copy(), + leafConnMessagesTmpl.Copy(), + leafConnSubscriptionsTmpl.Copy(), + leafConnRTT.Copy(), +} + +var ( + leafConnTrafficTmpl = module.Chart{ + ID: "leaf_node_conn_%s_%s_%s_%d_traffic", + Title: "Leaf Node Connection Traffic", + Units: "bytes/s", + Fam: "leaf traffic", + Ctx: "nats.leaf_node_conn_traffic", + Priority: prioLeafConnTraffic, + Type: module.Area, + Dims: module.Dims{ + {ID: "leafz_leaf_%s_%s_%s_%d_in_bytes", Name: "in", Algo: module.Incremental}, + {ID: "leafz_leaf_%s_%s_%s_%d_out_bytes", Name: "out", Mul: -1, Algo: module.Incremental}, + }, + } + leafConnMessagesTmpl = module.Chart{ + ID: "leaf_node_conn_%s_%s_%s_%d_messages", + Title: "Leaf Node Connection Messages", + Units: "messages/s", + Fam: "leaf traffic", + Ctx: "nats.leaf_node_conn_messages", + Priority: prioLeafConnMessages, + Type: module.Line, + Dims: module.Dims{ + {ID: "leafz_leaf_%s_%s_%s_%d_in_msgs", Name: "in", Algo: module.Incremental}, + {ID: "leafz_leaf_%s_%s_%s_%d_out_msgs", Name: "out", Mul: -1, Algo: module.Incremental}, + }, + } + leafConnSubscriptionsTmpl = module.Chart{ + ID: 
"leaf_node_conn_%s_%s_%s_%d_subscriptions", + Title: "Leaf Node Connection Active Subscriptions", + Units: "subscriptions", + Fam: "leaf subscriptions", + Ctx: "nats.leaf_node_conn_subscriptions", + Priority: prioLeafConnSubscriptions, + Type: module.Line, + Dims: module.Dims{ + {ID: "leafz_leaf_%s_%s_%s_%d_num_subs", Name: "active"}, + }, + } + leafConnRTT = module.Chart{ + ID: "leaf_node_conn_%s_%s_%s_%d_rtt", + Title: "Leaf Node Connection RTT", + Units: "microseconds", + Fam: "leaf rtt", + Ctx: "nats.leaf_node_conn_rtt", + Priority: prioLeafRTT, + Dims: module.Dims{ + {ID: "leafz_leaf_%s_%s_%s_%d_rtt", Name: "rtt"}, + }, + } +) + func (c *Collector) updateCharts() { c.onceAddSrvCharts.Do(c.addServerCharts) @@ -444,6 +508,17 @@ func (c *Collector) updateCharts() { }) return false }) + maps.DeleteFunc(c.cache.leafs, func(_ string, leaf *leafCacheEntry) bool { + if !leaf.updated { + c.removeLeafCharts(leaf) + return true + } + if !leaf.hasCharts { + leaf.hasCharts = true + c.addLeafCharts(leaf) + } + return false + }) } func (c *Collector) addServerCharts() { @@ -545,6 +620,35 @@ func (c *Collector) removeGatewayConnCharts(gwConn *gwConnCacheEntry, isInbound c.removeCharts(px) } +func (c *Collector) addLeafCharts(leaf *leafCacheEntry) { + charts := leafConnChartsTmpl.Copy() + + for _, chart := range *charts { + chart.ID = fmt.Sprintf(chart.ID, leaf.leafName, leaf.account, leaf.ip, leaf.port) + chart.ID = cleanChartID(chart.ID) + chart.Labels = []module.Label{ + {Key: "server_id", Value: c.srvMeta.id}, + {Key: "remote_name", Value: leaf.leafName}, + {Key: "account", Value: leaf.account}, + {Key: "ip", Value: leaf.ip}, + {Key: "port", Value: strconv.Itoa(leaf.port)}, + } + for _, dim := range chart.Dims { + dim.ID = fmt.Sprintf(dim.ID, leaf.leafName, leaf.account, leaf.ip, leaf.port) + } + } + + if err := c.Charts().Add(*charts...); err != nil { + c.Warningf("failed to add charts for leaf %s: %s", leaf.leafName, err) + } +} + +func (c *Collector) 
removeLeafCharts(leaf *leafCacheEntry) {
+	px := fmt.Sprintf("leaf_node_conn_%s_%s_%s_%d_", leaf.leafName, leaf.account, leaf.ip, leaf.port)
+	px = cleanChartID(px)
+	c.removeCharts(px)
+}
+
 func (c *Collector) removeCharts(prefix string) {
 	for _, chart := range *c.Charts() {
 		if strings.HasPrefix(chart.ID, prefix) {
@@ -553,3 +657,8 @@ func (c *Collector) removeCharts(prefix string) {
 		}
 	}
 }
+
+func cleanChartID(id string) string {
+	r := strings.NewReplacer(".", "_", " ", "_")
+	return strings.ToLower(r.Replace(id))
+}
diff --git a/src/go/plugin/go.d/collector/nats/collect.go b/src/go/plugin/go.d/collector/nats/collect.go
index 0ae46da810815d..b287911691c2f5 100644
--- a/src/go/plugin/go.d/collector/nats/collect.go
+++ b/src/go/plugin/go.d/collector/nats/collect.go
@@ -42,6 +42,9 @@ func (c *Collector) collect() (map[string]int64, error) {
 	if err := c.collectGatewayz(mx); err != nil {
 		return mx, err
 	}
+	if err := c.collectLeafz(mx); err != nil {
+		return mx, err
+	}
 
 	c.updateCharts()
 
@@ -142,7 +145,7 @@ func (c *Collector) collectAccstatz(mx map[string]int64) error {
 	}
 
 	for _, acc := range resp.AccStats {
-		c.cache.accounts.put(acc.Account)
+		c.cache.accounts.put(acc)
 
 		px := fmt.Sprintf("accstatz_acc_%s_", acc.Account)
 
@@ -172,7 +175,7 @@ func (c *Collector) collectRoutez(mx map[string]int64) error {
 	}
 
 	for _, route := range resp.Routes {
-		c.cache.routes.put(route.Rid, route.RemoteID)
+		c.cache.routes.put(route)
 
 		px := fmt.Sprintf("routez_route_id_%d_", route.Rid)
 
@@ -198,8 +201,7 @@ func (c *Collector) collectGatewayz(mx map[string]int64) error {
 	}
 
 	for name, ogw := range resp.OutboundGateways {
-		c.cache.outGateways.put(resp.Name, name)
-		c.cache.outGateways.putConn(resp.Name, name, ogw.Connection.Cid)
+		c.cache.outGateways.put(resp.Name, name, ogw)
 
 		px := fmt.Sprintf("gatewayz_outbound_gw_%s_cid_%d_", name, ogw.Connection.Cid)
 
@@ -213,9 +215,8 @@ func (c *Collector) collectGatewayz(mx map[string]int64) error {
 	}
 
 	for name, igws := range resp.InboundGateways {
-		
c.cache.inGateways.put(resp.Name, name) for _, igw := range igws { - c.cache.inGateways.putConn(resp.Name, name, igw.Connection.Cid) + c.cache.inGateways.put(resp.Name, name, igw) px := fmt.Sprintf("gatewayz_inbound_gw_%s_cid_%d_", name, igw.Connection.Cid) @@ -232,6 +233,33 @@ func (c *Collector) collectGatewayz(mx map[string]int64) error { return nil } +func (c *Collector) collectLeafz(mx map[string]int64) error { + req, err := web.NewHTTPRequestWithPath(c.RequestConfig, urlPathLeafz) + if err != nil { + return err + } + + var resp leafzResponse + if err := web.DoHTTP(c.httpClient).RequestJSON(req, &resp); err != nil { + return err + } + + for _, leaf := range resp.Leafs { + c.cache.leafs.put(leaf) + px := fmt.Sprintf("leafz_leaf_%s_%s_%s_%d_", leaf.Name, leaf.Account, leaf.IP, leaf.Port) + + mx[px+"in_bytes"] = leaf.InBytes + mx[px+"out_bytes"] = leaf.OutBytes + mx[px+"in_msgs"] = leaf.InMsgs + mx[px+"out_msgs"] = leaf.OutMsgs + mx[px+"num_subs"] = int64(leaf.NumSubs) + rtt, _ := time.ParseDuration(leaf.RTT) + mx[px+"rtt"] = rtt.Microseconds() + } + + return nil +} + func parseUptime(uptime string) (time.Duration, error) { // https://github.com/nats-io/nats-server/blob/v2.10.24/server/monitor.go#L1354 diff --git a/src/go/plugin/go.d/collector/nats/collector_test.go b/src/go/plugin/go.d/collector/nats/collector_test.go index c073dcade26988..ec5047e7cbff72 100644 --- a/src/go/plugin/go.d/collector/nats/collector_test.go +++ b/src/go/plugin/go.d/collector/nats/collector_test.go @@ -25,6 +25,7 @@ var ( dataVer210Accstatz, _ = os.ReadFile("testdata/v2.10.24/accstatz.json") dataVer210Routez, _ = os.ReadFile("testdata/v2.10.24/routez.json") dataVer210Gatewayz, _ = os.ReadFile("testdata/v2.10.24/gatewayz.json") + dataVer210Leafz, _ = os.ReadFile("testdata/v2.10.24/leafz.json") ) func Test_testDataIsValid(t *testing.T) { @@ -36,6 +37,7 @@ func Test_testDataIsValid(t *testing.T) { "dataVer210Accstatz": dataVer210Accstatz, "dataVer210Routez": dataVer210Routez, 
"dataVer210Gatewayz": dataVer210Gatewayz, + "dataVer210Leafz": dataVer210Leafz, } { require.NotNil(t, data, name) } @@ -134,7 +136,8 @@ func TestCollector_Collect(t *testing.T) { wantNumOfCharts: len(serverCharts) + len(accountChartsTmpl)*3 + len(routeChartsTmpl)*1 + - len(gatewayConnChartsTmpl)*5, + len(gatewayConnChartsTmpl)*5 + + len(leafConnChartsTmpl)*1, wantMetrics: map[string]int64{ "accstatz_acc_$G_conns": 0, "accstatz_acc_$G_leaf_nodes": 0, @@ -193,6 +196,12 @@ func TestCollector_Collect(t *testing.T) { "gatewayz_outbound_gw_region3_cid_5_out_bytes": 0, "gatewayz_outbound_gw_region3_cid_5_out_msgs": 0, "gatewayz_outbound_gw_region3_cid_5_uptime": 6, + "leafz_leaf__$G_127.0.0.1_6223_in_bytes": 0, + "leafz_leaf__$G_127.0.0.1_6223_in_msgs": 0, + "leafz_leaf__$G_127.0.0.1_6223_num_subs": 1, + "leafz_leaf__$G_127.0.0.1_6223_out_bytes": 1280000, + "leafz_leaf__$G_127.0.0.1_6223_out_msgs": 10000, + "leafz_leaf__$G_127.0.0.1_6223_rtt": 0, "routez_route_id_1_in_bytes": 4, "routez_route_id_1_in_msgs": 1, "routez_route_id_1_num_subs": 1, @@ -284,6 +293,8 @@ func caseOk(t *testing.T) (*Collector, func()) { _, _ = w.Write(dataVer210Routez) case urlPathGatewayz: _, _ = w.Write(dataVer210Gatewayz) + case urlPathLeafz: + _, _ = w.Write(dataVer210Leafz) default: w.WriteHeader(http.StatusNotFound) } diff --git a/src/go/plugin/go.d/collector/nats/metadata.yaml b/src/go/plugin/go.d/collector/nats/metadata.yaml index 39476e8ca439f0..ac040ac5295a61 100644 --- a/src/go/plugin/go.d/collector/nats/metadata.yaml +++ b/src/go/plugin/go.d/collector/nats/metadata.yaml @@ -412,3 +412,41 @@ modules: chart_type: line dimensions: - name: uptime + - name: leaf node connection + description: These metrics refer to [Leaf Node Connections](https://docs.nats.io/running-a-nats-service/nats_admin/monitoring#leaf-node-information). + labels: + - name: remote_name + description: "Unique identifier of the remote leaf node server, either its configured name or automatically assigned ID." 
+ - name: account + description: "Name of the associated account." + - name: ip + description: "IP address of the remote server." + - name: port + description: "Port used for the connection to the remote server." + metrics: + - name: nats.leaf_node_conn_traffic + description: Leaf Node Connection Traffic + unit: bytes/s + chart_type: area + dimensions: + - name: in + - name: out + - name: nats.leaf_node_conn_messages + description: Leaf Node Connection Messages + unit: messages/s + chart_type: line + dimensions: + - name: in + - name: out + - name: nats.leaf_node_conn_subscriptions + description: Leaf Node Connection Active Subscriptions + unit: subscriptions + chart_type: line + dimensions: + - name: active + - name: nats.leaf_node_conn_rtt + description: Leaf Node Connection RTT + unit: microseconds + chart_type: line + dimensions: + - name: rtt diff --git a/src/go/plugin/go.d/collector/nats/restapi.go b/src/go/plugin/go.d/collector/nats/restapi.go index 7de48ea1409788..c2a07cf8469c4c 100644 --- a/src/go/plugin/go.d/collector/nats/restapi.go +++ b/src/go/plugin/go.d/collector/nats/restapi.go @@ -19,6 +19,8 @@ const ( urlPathRoutez = "/routez" // https://docs.nats.io/running-a-nats-service/nats_admin/monitoring#gateway-information urlPathGatewayz = "/gatewayz" + // https://docs.nats.io/running-a-nats-service/nats_admin/monitoring#leaf-node-information + urlPathLeafz = "/leafz" ) var ( @@ -138,3 +140,22 @@ type ( NumSubs uint32 `json:"subscriptions"` } ) + +type ( + // https://github.com/nats-io/nats-server/blob/v2.10.24/server/monitor.go#L2163 + leafzResponse struct { + Leafs []leafInfo `json:"leafs"` + } + leafInfo struct { + Name string `json:"name"` // remote server name or id + Account string `json:"account"` + IP string `json:"ip"` + Port int `json:"port"` + RTT string `json:"rtt,omitempty"` + InMsgs int64 `json:"in_msgs"` + OutMsgs int64 `json:"out_msgs"` + InBytes int64 `json:"in_bytes"` + OutBytes int64 `json:"out_bytes"` + NumSubs uint32 
`json:"subscriptions"` + } +) diff --git a/src/go/plugin/go.d/collector/nats/testdata/v2.10.24/leafz.json b/src/go/plugin/go.d/collector/nats/testdata/v2.10.24/leafz.json new file mode 100644 index 00000000000000..7d438072fc183b --- /dev/null +++ b/src/go/plugin/go.d/collector/nats/testdata/v2.10.24/leafz.json @@ -0,0 +1,21 @@ +{ + "server_id": "NC2FJCRMPBE5RI5OSRN7TKUCWQONCKNXHKJXCJIDVSAZ6727M7MQFVT3", + "now": "2019-08-27T09:07:05.841132-06:00", + "leafnodes": 1, + "leafs": [ + { + "account": "$G", + "ip": "127.0.0.1", + "port": 6223, + "rtt": "200µs", + "in_msgs": 0, + "out_msgs": 10000, + "in_bytes": 0, + "out_bytes": 1280000, + "subscriptions": 1, + "subscriptions_list": [ + "foo" + ] + } + ] +} From 7822c5eded1374cf32cf3b736cc8323f118a56bc Mon Sep 17 00:00:00 2001 From: Netdata bot <43409846+netdatabot@users.noreply.github.com> Date: Tue, 24 Dec 2024 06:32:44 -0500 Subject: [PATCH 3/3] Regenerate integrations docs (#19283) Co-authored-by: ilyam8 <22274335+ilyam8@users.noreply.github.com> --- .../go.d/collector/nats/integrations/nats.md | 22 +++++++++++++++++++ 1 file changed, 22 insertions(+) diff --git a/src/go/plugin/go.d/collector/nats/integrations/nats.md b/src/go/plugin/go.d/collector/nats/integrations/nats.md index 1925bfd842a256..f13fb5cbab3059 100644 --- a/src/go/plugin/go.d/collector/nats/integrations/nats.md +++ b/src/go/plugin/go.d/collector/nats/integrations/nats.md @@ -186,6 +186,28 @@ Metrics: | nats.outbound_gateway_conn_subscriptions | active | subscriptions | | nats.outbound_gateway_conn_uptime | uptime | seconds | +### Per leaf node connection + +These metrics refer to [Leaf Node Connections](https://docs.nats.io/running-a-nats-service/nats_admin/monitoring#leaf-node-information). + +Labels: + +| Label | Description | +|:-----------|:----------------| +| remote_name | Unique identifier of the remote leaf node server, either its configured name or automatically assigned ID. | +| account | Name of the associated account. 
| +| ip | IP address of the remote server. | +| port | Port used for the connection to the remote server. | + +Metrics: + +| Metric | Dimensions | Unit | +|:------|:----------|:----| +| nats.leaf_node_conn_traffic | in, out | bytes/s | +| nats.leaf_node_conn_messages | in, out | messages/s | +| nats.leaf_node_conn_subscriptions | active | subscriptions | +| nats.leaf_node_conn_rtt | rtt | microseconds | + ## Alerts