[pull] master from netdata:master #254

Merged
merged 3 commits into from
Dec 24, 2024
86 changes: 61 additions & 25 deletions docs/netdata-cloud/netdata-cloud-on-prem/troubleshooting.md
@@ -11,59 +11,95 @@ These components should be monitored and managed according to your organization'

## Common Issues

### Installation cannot finish
### Timeout During Installation

If you are getting error like:
If your installation fails with this error:

```
Installing netdata-cloud-onprem (or netdata-cloud-dependency) helm chart...
[...]
Error: client rate limiter Wait returned an error: Context deadline exceeded.
```

There are probably not enough resources available. Fortunately, it is very easy to verify with the `kubectl` utility. In the case of a full installation, switch the context to the cluster where On-Prem is being installed. For the Light PoC installation, SSH into the Ubuntu VM where `kubectl` is already installed and configured.
This error typically indicates **insufficient cluster resources**. Here's how to diagnose and resolve the issue.

To verify check if there are any `Pending` pods:
#### Diagnosis Steps

```shell
kubectl get pods -n netdata-cloud | grep -v Running
```
> **Important**
>
> - For full installation: Ensure you're in the correct cluster context.
> - For Light PoC: SSH into the Ubuntu VM with `kubectl` pre-configured.
> - For Light PoC, always perform a complete uninstallation before attempting a new installation.

To check which resource is a limiting factor pick one of the `Pending` pods and issue command:
1. Check for pods stuck in Pending state:

```shell
kubectl describe pod <POD_NAME> -n netdata-cloud
```
```shell
kubectl get pods -n netdata-cloud | grep -v Running
```

At the end in an `Events` section information about insufficient `CPU` or `Memory` on available nodes should appear.
Please check the minimum requirements for your on-prem installation type or contact our support - `[email protected]`.
2. If you find Pending pods, examine the resource constraints:

> **Warning**
>
> In case of the Light PoC installations always uninstall before the next attempt.
```shell
kubectl describe pod <POD_NAME> -n netdata-cloud
```

Review the Events section at the bottom of the output. Look for messages about:
- Insufficient CPU
- Insufficient Memory
- Node capacity issues

3. View overall cluster resources:

```shell
# Check resource allocation across nodes
kubectl top nodes

# View detailed node capacity
kubectl describe nodes | grep -A 5 "Allocated resources"
```

### Installation finished but login does not work
#### Solution

It depends on the installation and login type, but the underlying problem is usually located in the `values.yaml` file. In the case of Light PoC installations, this is also true, but the installation script fills in the data for the user. We can split the problem into two variants:
1. Compare your available resources against the [minimum requirements](https://github.com/netdata/netdata/blob/master/docs/netdata-cloud/netdata-cloud-on-prem/installation.md#system-requirements).
2. Take one of these actions:
- Add more resources to your cluster.
- Free up existing resources.

1. SSO is not working - you need to check your tokens and callback URLs for a given provider. Equally important is the certificate - it needs to be trusted, and also hostname(s) under `global.public` section - make sure that FQDN is correct.
2. Mail login is not working:
1. If you are using a Light PoC installation with MailCatcher, the problem usually appears if the wrong hostname was used during the installation. It needs to be a FQDN that matches the provided certificate. The usual error in such a case points to a invalid token.
2. If the magic link is not arriving for MailCatcher, it's likely because the default values were changed. In the case of using your own mail server, check the `values.yaml` file in the `global.mail.smtp` section and your network settings.
### Login Issues After Installation

If you are getting the error `Something went wrong - invalid token` and you are sure that it is not related to the hostname or the mail configuration as described above, it might be related to a dirty state of Netdata secrets. During the installation, a secret called `netdata-cloud-common` is created. By default, this secret should not be deleted by Helm and is created only if it does not exist. It stores a few strings that are mandatory for Netdata Cloud On-Prem's provisioning and continuous operation. Because they are used to hash the data in the PostgreSQL database, a mismatch will cause data corruption where the old data is not readable and the new data is hashed with the wrong string. Either a new installation is needed, or contact to our support to individually analyze the complexity of the problem.
Installation may complete successfully, but login issues can occur due to configuration mismatches. This table provides a quick reference for troubleshooting common login issues after installation.

| Issue | Symptoms | Cause | Solution |
|-------------------------------|---------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------|
| SSO Login Failure | Unable to authenticate via SSO providers | - Invalid callback URLs<br/>- Expired/invalid SSO tokens<br/>- Untrusted certificates<br/>- Incorrect FQDN in `global.public` | - Update SSO configuration in `values.yaml`<br/>- Verify certificates are valid and trusted<br/>- Ensure FQDN matches certificate |
| MailCatcher Login (Light PoC) | - Magic links not arriving<br/>- "Invalid token" errors | - Incorrect hostname during installation<br/>- Modified default MailCatcher values | - Reinstall with correct FQDN<br/>- Restore default MailCatcher settings<br/>- Ensure hostname matches certificate |
| Custom Mail Server Login | Magic links not arriving | - Incorrect SMTP configuration<br/>- Network connectivity issues | - Update SMTP settings in `values.yaml`<br/>- Verify network allows SMTP traffic<br/>- Check mail server logs |
| Invalid Token Error | "Something went wrong - invalid token" message | - Mismatched `netdata-cloud-common` secret<br/>- Database hash mismatch<br/>- Namespace change without secret migration | - Migrate secret before namespace change<br/>- Perform fresh installation<br/>- Contact support for data recovery |

> **Warning**
>
> If you are changing the installation namespace secret netdata-cloud-common will be created again. Make sure to transfer it beforehand or wipe postgres before new installation.
> If you're modifying the installation namespace, the `netdata-cloud-common` secret will be recreated.
>
> **Before proceeding**: Back up the existing `netdata-cloud-common` secret. Alternatively, wipe the PostgreSQL database to prevent data conflicts.

### Slow Chart Loading or Chart Errors

When charts take a long time to load or fail with errors, the issue typically stems from data collection challenges. The `charts` service must gather data from multiple Agents within a Room, requiring successful responses from all queried Agents.

| Issue | Symptoms | Cause | Solution |
| -------------------- | --------------------------------------------------------------------------------------------------------------- | ---------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|----------------------|-----------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Agent Connectivity | - Queries stall or timeout<br/>- Inconsistent chart loading | Slow Agents or unreliable network connections prevent timely data collection | Deploy additional [Parent](/docs/observability-centralization-points/README.md) nodes to provide reliable backends. The system will automatically prefer these for queries when available |
| Kubernetes Resources | - Service throttling<br/>- Slow data processing<br/>- Delayed dashboard updates | Resource saturation at the node level or restrictive container limits | Review and adjust container resource limits and node capacity as needed |
| Database Performance | - Slow query responses<br/>- Increased latency across services | PostgreSQL performance bottlenecks | Monitor and optimize database resource utilization:<br/>- CPU usage<br/>- Memory allocation<br/>- Disk I/O performance |
| Message Broker | - Delayed node status updates (online/offline/stale)<br/>- Slow alert transitions<br/>- Dashboard update delays | Message accumulation in Pulsar due to processing bottlenecks | - Review Pulsar configuration<br/>- Adjust microservice resource allocation<br/>- Monitor message processing rates |

## Need Help?

If issues persist:

1. Gather the following information:

- Installation logs
- Your cluster specifications

2. Contact support at `[email protected]`.
78 changes: 54 additions & 24 deletions src/go/plugin/go.d/collector/nats/cache.go
@@ -2,12 +2,17 @@

package nats

import (
	"fmt"
)

func newCache() *cache {
	return &cache{
		accounts: make(accCache),
		routes: make(routeCache),
		inGateways: make(gwCache),
		outGateways: make(gwCache),
		leafs: make(leafCache),
	}
}

@@ -16,6 +21,7 @@ type cache struct {
	routes routeCache
	inGateways gwCache
	outGateways gwCache
	leafs leafCache
}

func (c *cache) resetUpdated() {
@@ -36,73 +42,97 @@ func (c *cache) resetUpdated() {
type (
	accCache map[string]*accCacheEntry
	accCacheEntry struct {
		accName string
		hasCharts bool
		updated bool

		accName string
	}
)

func (c *accCache) put(name string) {
	acc, ok := (*c)[name]
func (c *accCache) put(ai accountInfo) {
	acc, ok := (*c)[ai.Account]
	if !ok {
		acc = &accCacheEntry{accName: name}
		(*c)[name] = acc
		acc = &accCacheEntry{accName: ai.Account}
		(*c)[ai.Account] = acc
	}
	acc.updated = true
}

type (
	routeCache map[uint64]*routeCacheEntry
	routeCacheEntry struct {
		rid uint64
		remoteId string
		hasCharts bool
		updated bool

		rid uint64
		remoteId string
	}
)

func (c *routeCache) put(rid uint64, remoteId string) {
	route, ok := (*c)[rid]
func (c *routeCache) put(ri routeInfo) {
	route, ok := (*c)[ri.Rid]
	if !ok {
		route = &routeCacheEntry{rid: rid, remoteId: remoteId}
		(*c)[rid] = route
		route = &routeCacheEntry{rid: ri.Rid, remoteId: ri.RemoteID}
		(*c)[ri.Rid] = route
	}
	route.updated = true
}

type (
	gwCache map[string]*gwCacheEntry
	gwCacheEntry struct {
		gwName string
		rgwName string
		hasCharts bool
		updated bool
		conns map[uint64]*gwConnCacheEntry

		gwName string
		rgwName string
		conns map[uint64]*gwConnCacheEntry
	}
	gwConnCacheEntry struct {
		gwName string
		rgwName string
		cid uint64
		hasCharts bool
		updated bool

		gwName string
		rgwName string
		cid uint64
	}
)

func (c *gwCache) put(gwName, rgwName string) {
func (c *gwCache) put(gwName, rgwName string, rgi *remoteGatewayInfo) {
	gw, ok := (*c)[gwName]
	if !ok {
		gw = &gwCacheEntry{gwName: gwName, rgwName: rgwName, conns: make(map[uint64]*gwConnCacheEntry)}
		(*c)[gwName] = gw
	}
	gw.updated = true
}

func (c *gwCache) putConn(gwName, rgwName string, cid uint64) {
	c.put(gwName, rgwName)
	conn, ok := (*c)[gwName].conns[cid]
	conn, ok := gw.conns[rgi.Connection.Cid]
	if !ok {
		conn = &gwConnCacheEntry{gwName: gwName, rgwName: rgwName, cid: cid}
		(*c)[gwName].conns[cid] = conn
		conn = &gwConnCacheEntry{gwName: gwName, rgwName: rgwName, cid: rgi.Connection.Cid}
		gw.conns[rgi.Connection.Cid] = conn
	}
	conn.updated = true
}
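
Reviewer sketch (not part of the diff): the change above appears to fold the old `putConn` into a single `gwCache.put` that reads the connection ID from `rgi.Connection.Cid`. A minimal, self-contained illustration of that two-level get-or-create pattern follows. The `connInfo` and `remoteGatewayInfo` shapes are assumptions reduced to the one field the diff uses, the value receiver is a simplification of the pointer receiver in the diff, and the `nil` guard is an illustrative addition that the PR itself does not contain.

```go
package main

import "fmt"

// Hypothetical, trimmed-down stand-ins for the collector's types.
// Only the Cid field is taken from the diff; everything else is assumed.
type connInfo struct{ Cid uint64 }

type remoteGatewayInfo struct{ Connection connInfo }

type gwConnCacheEntry struct {
	updated bool
	gwName  string
	rgwName string
	cid     uint64
}

type gwCacheEntry struct {
	updated bool
	gwName  string
	rgwName string
	conns   map[uint64]*gwConnCacheEntry
}

type gwCache map[string]*gwCacheEntry

// put creates the gateway entry on first sight, then creates the
// connection entry under it, marking both as updated.
func (c gwCache) put(gwName, rgwName string, rgi *remoteGatewayInfo) {
	gw, ok := c[gwName]
	if !ok {
		gw = &gwCacheEntry{gwName: gwName, rgwName: rgwName, conns: make(map[uint64]*gwConnCacheEntry)}
		c[gwName] = gw
	}
	gw.updated = true

	// Assumption: a nil rgi means "gateway seen, no connection details".
	if rgi == nil {
		return
	}

	cid := rgi.Connection.Cid
	conn, ok := gw.conns[cid]
	if !ok {
		conn = &gwConnCacheEntry{gwName: gwName, rgwName: rgwName, cid: cid}
		gw.conns[cid] = conn
	}
	conn.updated = true
}

func main() {
	c := make(gwCache)
	c.put("gw-a", "gw-b", &remoteGatewayInfo{Connection: connInfo{Cid: 7}})
	c.put("gw-a", "gw-b", &remoteGatewayInfo{Connection: connInfo{Cid: 9}})
	fmt.Println(len(c), len(c["gw-a"].conns)) // prints "1 2"
}
```

Keeping the gateway lookup and the connection lookup in one function avoids the second map access that the old `putConn` needed, at the cost of requiring a connection argument on every call.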

type (
	leafCache map[string]*leafCacheEntry
	leafCacheEntry struct {
		hasCharts bool
		updated bool

		leafName string
		account string
		ip string
		port int
	}
)

func (c *leafCache) put(li leafInfo) {
	key := fmt.Sprintf("%s_%s_%s_%d", li.Name, li.Account, li.IP, li.Port)
	leaf, ok := (*c)[key]
	if !ok {
		leaf = &leafCacheEntry{leafName: li.Name, account: li.Account, ip: li.IP, port: li.Port}
		(*c)[key] = leaf
	}
	leaf.updated = true
}
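
Reviewer sketch (not part of the diff): the new `leafCache` keys each entry on name, account, IP, and port joined with underscores, and every `put` sets `updated`. The example below shows how such a cache could be swept between collection cycles. The `leafInfo` field types are inferred from the format verbs and the `leafCacheEntry` fields, and `leafKey`, the per-cache `resetUpdated`, and `removeStale` are hypothetical helpers added for illustration only; the diff does not show how the collector actually removes stale entries.

```go
package main

import "fmt"

// Assumed shape of leafInfo, reduced to the fields the diff reads.
type leafInfo struct {
	Name    string
	Account string
	IP      string
	Port    int
}

type leafCacheEntry struct {
	hasCharts bool
	updated   bool

	leafName string
	account  string
	ip       string
	port     int
}

type leafCache map[string]*leafCacheEntry

// leafKey mirrors the key format used in the diff.
func leafKey(li leafInfo) string {
	return fmt.Sprintf("%s_%s_%s_%d", li.Name, li.Account, li.IP, li.Port)
}

func (c leafCache) put(li leafInfo) {
	key := leafKey(li)
	leaf, ok := c[key]
	if !ok {
		leaf = &leafCacheEntry{leafName: li.Name, account: li.Account, ip: li.IP, port: li.Port}
		c[key] = leaf
	}
	leaf.updated = true
}

// resetUpdated clears the marks before a new collection cycle (hypothetical helper).
func (c leafCache) resetUpdated() {
	for _, e := range c {
		e.updated = false
	}
}

// removeStale deletes entries not seen in the current cycle and
// returns their keys (hypothetical helper).
func (c leafCache) removeStale() []string {
	var removed []string
	for k, e := range c {
		if !e.updated {
			delete(c, k) // deleting during range is safe in Go
			removed = append(removed, k)
		}
	}
	return removed
}

func main() {
	c := make(leafCache)
	a := leafInfo{Name: "leaf1", Account: "acc", IP: "10.0.0.1", Port: 7422}
	b := leafInfo{Name: "leaf2", Account: "acc", IP: "10.0.0.2", Port: 7422}

	c.put(a)
	c.put(b)

	c.resetUpdated() // next cycle begins
	c.put(a)         // only leaf1 is still reported

	fmt.Println(c.removeStale()) // prints "[leaf2_acc_10.0.0.2_7422]"
}
```

One thing worth checking in review: an underscore-joined key can collide when the components themselves contain underscores (for example, name `a_b` with account `c` versus name `a` with account `b_c`). Whether that can happen depends on what leaf node names and account names look like in practice.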