[pull] master from netdata:master #30

Merged 6 commits on Feb 26, 2024

Changes from all commits
7 changes: 3 additions & 4 deletions CHANGELOG.md
@@ -6,6 +6,9 @@

**Merged pull requests:**

- updated sizing netdata [\#17057](https://github.com/netdata/netdata/pull/17057) ([ktsaou](https://github.com/ktsaou))
- fix zpool state chart family [\#17054](https://github.com/netdata/netdata/pull/17054) ([ilyam8](https://github.com/ilyam8))
- DYNCFG: call the interceptor when a test is made on a new job [\#17052](https://github.com/netdata/netdata/pull/17052) ([ktsaou](https://github.com/ktsaou))
- fix alerts jsonschema prototype for latest dyncfg [\#17047](https://github.com/netdata/netdata/pull/17047) ([ktsaou](https://github.com/ktsaou))
- Do not use backtrace when sentry is enabled. [\#17043](https://github.com/netdata/netdata/pull/17043) ([vkalintiris](https://github.com/vkalintiris))
- Keep a count of metrics and samples collected [\#17042](https://github.com/netdata/netdata/pull/17042) ([stelfrag](https://github.com/stelfrag))
@@ -311,7 +314,6 @@
- code cleanup [\#16542](https://github.com/netdata/netdata/pull/16542) ([ktsaou](https://github.com/ktsaou))
- Assorted kickstart script fixes. [\#16537](https://github.com/netdata/netdata/pull/16537) ([Ferroin](https://github.com/Ferroin))
- wip documentation about functions table [\#16535](https://github.com/netdata/netdata/pull/16535) ([ktsaou](https://github.com/ktsaou))
- Remove openSUSE 15.4 from CI [\#16449](https://github.com/netdata/netdata/pull/16449) ([tkatsoulas](https://github.com/tkatsoulas))

## [v1.44.3](https://github.com/netdata/netdata/tree/v1.44.3) (2024-02-12)

@@ -402,9 +404,6 @@
- set journal path for logging [\#16457](https://github.com/netdata/netdata/pull/16457) ([ktsaou](https://github.com/ktsaou))
- add sbindir\_POST to PATH of bash scripts that use `systemd-cat-native` [\#16456](https://github.com/netdata/netdata/pull/16456) ([ilyam8](https://github.com/ilyam8))
- add LogNamespace to systemd units [\#16454](https://github.com/netdata/netdata/pull/16454) ([ilyam8](https://github.com/ilyam8))
- Update non-zero uuid key + child conf. [\#16452](https://github.com/netdata/netdata/pull/16452) ([vkalintiris](https://github.com/vkalintiris))
- Add missing argument. [\#16451](https://github.com/netdata/netdata/pull/16451) ([vkalintiris](https://github.com/vkalintiris))
- log flood protection to 1000 log lines / 1 minute [\#16450](https://github.com/netdata/netdata/pull/16450) ([ilyam8](https://github.com/ilyam8))

## [v1.43.2](https://github.com/netdata/netdata/tree/v1.43.2) (2023-10-30)

8 changes: 4 additions & 4 deletions README.md
@@ -41,19 +41,19 @@ It scales nicely from just a single server to thousands of servers, even in comp
Operating system metrics, container metrics, virtual machines, hardware sensors, applications metrics, OpenMetrics exporters, StatsD, and logs.

- :muscle: **Real-Time, Low-Latency, High-Resolution**<br/>
All metrics are collected per second and are on the dashboard immediately after data collection. Netdata is designed to be fast.
All metrics are collected per second and are on the dashboard immediately after data collection.

- :face_in_clouds: **Unsupervised Anomaly Detection**<br/>
Trains multiple Machine-Learning (ML) models for each metric collected and detects anomalies based on the past behavior of each metric individually.
Trains multiple Machine-Learning (ML) models for each metric and uses AI to detect anomalies based on the past behavior of each metric.

- :fire: **Powerful Visualization**<br/>
Clear and precise visualization that allows you to quickly understand any dataset, but also to filter, slice and dice the data directly on the dashboard, without the need to learn any query language.
Clear and precise visualization that lets you understand any dataset at first sight, and also filter, slice and dice the data directly on the dashboard, without the need to learn a query language.

- :bell: **Out-of-the-box Alerts**<br/>
Comes with hundreds of alerts out of the box to detect common issues and pitfalls, revealing issues that can easily go unnoticed. It supports several notification methods to let you know when your attention is needed.

- 📖 **systemd Journal Logs Explorer**<br/>
Provides a `systemd` journal logs explorer, to view, filter and analyze system and applications logs by directly accessing `systemd` journal files on individual hosts and infrastructure-wide logs centralization servers.
System and application logs of all servers are available in real-time, for filtering and analysis, on both individual nodes and infrastructure-wide logs centralization servers.

- :sunglasses: **Low Maintenance**<br/>
Fully automated in every aspect: automated dashboards, out-of-the-box alerts, auto-detection and auto-discovery of metrics, zero-touch machine-learning, easy scalability and high availability, and CI/CD friendly.
4 changes: 3 additions & 1 deletion docs/netdata-agent/sizing-netdata-agents/README.md
@@ -58,7 +58,9 @@ The following are some of the innovations the open-source Netdata agent has, tha

2. **4 bytes per sample uncompressed**

To achieve optimal memory and disk footprint, Netdata uses a custom 32-bit floating point number we have developed. This floating point number is used to store the samples collected, together with their anomaly bit. The database of Netdata is fixed-step, so it has predefined slots for every sample, allowing Netdata to store timestamps once every several hundreds samples, minimizing both its memory requirements and the disk footprint.
To achieve optimal memory and disk footprint, Netdata uses a custom 32-bit floating point number. This floating point number is used to store the samples collected, together with their anomaly bit. The database of Netdata is fixed-step, so it has predefined slots for every sample, allowing Netdata to store timestamps once every several hundred samples, minimizing both its memory requirements and the disk footprint.
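
To make this concrete, here is a minimal sketch of one way such an encoding could work — not Netdata's actual format — assuming the anomaly bit is carved out of the least-significant mantissa bit of an IEEE-754 `float`; `sample_pack()` and `sample_value()` are hypothetical helpers:

```c
/* Illustrative sketch only - not Netdata's real encoding. It shows how a
 * value and its anomaly bit can share 32 bits: repurpose the least
 * significant mantissa bit of an IEEE-754 float as the anomaly flag,
 * at the cost of a negligible amount of precision. */
#include <stdint.h>
#include <string.h>
#include <stdio.h>

#define ANOMALY_BIT 0x00000001u

static uint32_t sample_pack(float value, int anomalous) {
    uint32_t bits;
    memcpy(&bits, &value, sizeof(bits)); /* type-pun safely via memcpy */
    bits &= ~ANOMALY_BIT;                /* clear the repurposed bit   */
    if (anomalous)
        bits |= ANOMALY_BIT;
    return bits;
}

static float sample_value(uint32_t s) {
    uint32_t bits = s & ~ANOMALY_BIT;
    float value;
    memcpy(&value, &bits, sizeof(value));
    return value;
}

int main(void) {
    uint32_t s = sample_pack(98.6f, 1);
    printf("value=%f anomalous=%u\n", sample_value(s), s & ANOMALY_BIT);
    return 0;
}
```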

The final disk footprint of Netdata varies due to compression efficiency. It is usually about 0.6 bytes per sample for the high-resolution tier (per-second), 6 bytes per sample for the mid-resolution tier (per-minute) and 18 bytes per sample for the low-resolution tier (per-hour).
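
As a back-of-the-envelope example using these typical figures (actual compression ratios vary), the disk needed to keep a hypothetical 2,000 concurrently collected metrics for 30 days could be estimated like this:

```c
/* Back-of-the-envelope disk estimate for 30 days of retention,
 * using the typical compressed per-sample sizes quoted above. */
#include <stdio.h>

int main(void) {
    double metrics = 2000; /* concurrently collected metrics (assumed) */
    double days    = 30;

    double tier0 = metrics * days * 86400.0        * 0.6;  /* per-second */
    double tier1 = metrics * days * (86400.0 / 60) * 6.0;  /* per-minute */
    double tier2 = metrics * days * 24.0           * 18.0; /* per-hour   */

    printf("tier0: %.1f GiB\n", tier0 / (1024.0 * 1024 * 1024)); /* ~2.9 GiB  */
    printf("tier1: %.1f MiB\n", tier1 / (1024.0 * 1024));        /* ~494 MiB  */
    printf("tier2: %.1f MiB\n", tier2 / (1024.0 * 1024));        /* ~24.7 MiB */
    return 0;
}
```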

3. **Query priorities**

@@ -28,11 +28,11 @@ To configure database mode `ram` or `alloc`, in `netdata.conf`, set the following:

`dbengine` supports up to 5 tiers. By default, 3 tiers are used, like this:

| Tier | Resolution | Uncompressed Sample Size |
|:--------:|:--------------------------------------------------------------------------------------------:|:------------------------:|
| `tier0` | native resolution (metrics collected per-second as stored per-second) | 4 bytes |
| `tier1` | 60 iterations of `tier0`, so when metrics are collected per-second, this tier is per-minute. | 16 bytes |
| `tier2` | 60 iterations of `tier1`, so when metrics are collected per second, this tier is per-hour. | 16 bytes |
| Tier | Resolution | Uncompressed Sample Size | Usually On Disk |
|:--------:|:--------------------------------------------------------------------------------------------:|:------------------------:|:---------------:|
| `tier0` | native resolution (metrics collected per-second are stored per-second) | 4 bytes | 0.6 bytes |
| `tier1` | 60 iterations of `tier0`, so when metrics are collected per-second, this tier is per-minute. | 16 bytes | 6 bytes |
| `tier2` | 60 iterations of `tier1`, so when metrics are collected per-second, this tier is per-hour. | 16 bytes | 18 bytes |

Data are saved to disk compressed, so the actual size on disk varies depending on compression efficiency.
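
For illustration, a higher-tier slot might summarize 60 lower-tier samples as in the sketch below — the 16-byte layout (three floats plus two 16-bit counters) is an assumption for the example, not the exact on-disk format:

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical 16-byte higher-tier sample. Each slot
 * summarizes up to 60 lower-tier samples. */
typedef struct {
    float    sum;           /* sum of the source samples    */
    float    min;           /* smallest source sample       */
    float    max;           /* largest source sample        */
    uint16_t count;         /* samples actually collected   */
    uint16_t anomaly_count; /* samples flagged as anomalous */
} tier_sample_t;

static tier_sample_t aggregate(const float *v, const int *anomalous, int n) {
    tier_sample_t out = { .min = v[0], .max = v[0] };
    for (int i = 0; i < n; i++) {
        out.sum += v[i];
        if (v[i] < out.min) out.min = v[i];
        if (v[i] > out.max) out.max = v[i];
        out.count++;
        if (anomalous[i]) out.anomaly_count++;
    }
    return out;
}

int main(void) {
    float v[3] = { 1.0f, 5.0f, 3.0f };
    int   a[3] = { 0, 1, 0 };
    tier_sample_t t = aggregate(v, a, 3);
    printf("avg=%.2f min=%.2f max=%.2f anomalous=%u/%u\n",
           t.sum / t.count, t.min, t.max, t.anomaly_count, t.count);
    return 0;
}
```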

@@ -56,40 +56,46 @@ You can find information about the current disk utilization of a Netdata Parent,
```json
{
// more information about the agent
// near the end:
// then, near the end:
"db_size": [
{
"tier": 0,
"disk_used": 1677528462156,
"disk_max": 1677721600000,
"disk_percent": 99.9884881,
"from": 1706201952,
"to": 1707401946,
"retention": 1199994,
"expected_retention": 1200132,
"currently_collected_metrics": 2198777
"metrics": 43070,
"samples": 88078162001,
"disk_used": 41156409552,
"disk_max": 41943040000,
"disk_percent": 98.1245269,
"from": 1705033983,
"to": 1708856640,
"retention": 3822657,
"expected_retention": 3895720,
"currently_collected_metrics": 27424
},
{
"tier": 1,
"disk_used": 838123468064,
"disk_max": 838860800000,
"disk_percent": 99.9121032,
"from": 1702885800,
"to": 1707401946,
"retention": 4516146,
"expected_retention": 4520119,
"currently_collected_metrics": 2198777
"metrics": 72987,
"samples": 5155155269,
"disk_used": 20585157180,
"disk_max": 20971520000,
"disk_percent": 98.1576785,
"from": 1698287340,
"to": 1708856640,
"retention": 10569300,
"expected_retention": 10767675,
"currently_collected_metrics": 27424
},
{
"tier": 2,
"disk_used": 334329683032,
"disk_max": 419430400000,
"disk_percent": 79.710408,
"from": 1679670000,
"to": 1707401946,
"retention": 27731946,
"expected_retention": 34790871,
"currently_collected_metrics": 2198777
"metrics": 148234,
"samples": 314919121,
"disk_used": 5957346684,
"disk_max": 10485760000,
"disk_percent": 56.8136853,
"from": 1667808000,
"to": 1708856640,
"retention": 41048640,
"expected_retention": 72251324,
"currently_collected_metrics": 27424
}
]
}
@@ -98,6 +104,8 @@ You can find information about the current disk utilization of a Netdata Parent,
In this example:

- `tier` is the database tier.
- `metrics` is the number of unique time-series in the database.
- `samples` is the number of samples in the database.
- `disk_used` is the currently used disk space in bytes.
- `disk_max` is the configured max disk space in bytes.
- `disk_percent` is the current disk space utilization for this tier.
@@ -107,21 +107,13 @@ In this example:
- `expected_retention` is the expected retention in seconds when `disk_percent` will be 100 (divide by 3600 for hours, divide by 86400 for days).
- `currently_collected_metrics` is the number of unique time-series currently being collected for this tier.

The estimated number of samples on each tier can be calculated as follows:

```
estimated number of samples = retention / sample duration * currently_collected_metrics
```

So, for our example above:

| Tier | Sample Duration (seconds) | Estimated Number of Samples | Disk Space Used | Current Retention (days) | Expected Retention (days) | Bytes Per Sample |
|:-------:|:-------------------------:|:---------------------------:|:---------------:|:------------------------:|:-------------------------:|:----------------:|
| `tier0` | 1 | 2.64 trillion samples | 1.56 TiB | 13.8 | 13.9 | 0.64 |
| `tier1` | 60 | 165.5 billion samples | 780 GiB | 52.2 | 52.3 | 5.01 |
| `tier2` | 3600 | 16.9 billion samples | 311 GiB | 320.9 | 402.7 | 19.73 |

Note: as you can see in this example, the disk footprint per sample of `tier2` is bigger than the uncompressed sample size (19.73 bytes vs 16 bytes). This is due to the fact that samples are organized into pages and pages into extents. When Netdata is restarted frequently, it saves all data prematurely, before filling up entire pages and extents, leading to increased overheads per sample.
| Tier | # Of Metrics | # Of Samples | Disk Used | Disk Free | Current Retention | Expected Retention | Sample Size |
|-----:|-------------:|--------------:|----------:|----------:|------------------:|-------------------:|------------:|
| 0 | 43.1K | 88.1 billion | 38.4Gi | 1.88% | 44.2 days | 45.0 days | 0.46 B |
| 1 | 73.0K | 5.2 billion | 19.2Gi | 1.84% | 122.3 days | 124.6 days | 3.99 B |
| 2 | 148.3K | 315.0 million | 5.6Gi | 43.19% | 475.1 days | 836.2 days | 18.91 B |
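
The derived columns in this table can be reproduced from the `db_size` fields of the JSON output above; here is a sketch with the three tiers' values hard-coded (rounding may differ slightly from the table):

```c
/* Re-derives the table's columns from the "db_size" fields shown above. */
#include <stdio.h>

typedef struct {
    int    tier;
    double samples, disk_used, disk_max, retention, expected_retention;
} db_size_t;

int main(void) {
    db_size_t t[3] = {
        { 0, 88078162001.0, 41156409552.0, 41943040000.0,  3822657.0,  3895720.0 },
        { 1,  5155155269.0, 20585157180.0, 20971520000.0, 10569300.0, 10767675.0 },
        { 2,   314919121.0,  5957346684.0, 10485760000.0, 41048640.0, 72251324.0 },
    };

    for (int i = 0; i < 3; i++) {
        printf("tier%d: %.2f bytes/sample, %.2f%% free, "
               "%.1f days kept, %.1f days expected\n",
               t[i].tier,
               t[i].disk_used / t[i].samples,                 /* sample size   */
               100.0 * (1.0 - t[i].disk_used / t[i].disk_max),/* disk free     */
               t[i].retention / 86400.0,                      /* current days  */
               t[i].expected_retention / 86400.0);            /* expected days */
    }
    return 0;
}
```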

To configure retention, in `netdata.conf`, set the following:

2 changes: 1 addition & 1 deletion packaging/version
@@ -1 +1 @@
v1.44.0-406-nightly
v1.44.0-412-nightly
2 changes: 1 addition & 1 deletion src/collectors/proc.plugin/proc_spl_kstat_zfs.c
@@ -272,7 +272,7 @@ int update_zfs_pool_state_chart(const DICTIONARY_ITEM *item, void *pool_p, void
"zfspool",
chart_id,
NULL,
name,
"state",
"zfspool.state",
"ZFS pool state",
"boolean",
2 changes: 1 addition & 1 deletion src/daemon/config/dyncfg-intercept.c
@@ -180,7 +180,7 @@ static int dyncfg_intercept_early_error(struct rrd_function_execute *rfe, int rc
return rc;
}

static const DICTIONARY_ITEM *dyncfg_get_template_of_new_job(const char *job_id) {
const DICTIONARY_ITEM *dyncfg_get_template_of_new_job(const char *job_id) {
char id_copy[strlen(job_id) + 1];
memcpy(id_copy, job_id, sizeof(id_copy));

2 changes: 2 additions & 0 deletions src/daemon/config/dyncfg-internals.h
@@ -76,6 +76,8 @@ const DICTIONARY_ITEM *dyncfg_add_internal(RRDHOST *host, const char *id, const
int dyncfg_function_intercept_cb(struct rrd_function_execute *rfe, void *data);
void dyncfg_cleanup(DYNCFG *v);

const DICTIONARY_ITEM *dyncfg_get_template_of_new_job(const char *job_id);

bool dyncfg_is_user_disabled(const char *id);

RRDHOST *dyncfg_rrdhost_by_uuid(UUID *uuid);
46 changes: 38 additions & 8 deletions src/daemon/config/dyncfg-tree.c
@@ -204,31 +204,57 @@ static int dyncfg_config_execute_cb(struct rrd_function_execute *rfe, void *data
action = path;
path = NULL;

if(id && *id && dyncfg_cmds2id(action) == DYNCFG_CMD_REMOVE) {
const DICTIONARY_ITEM *item = dictionary_get_and_acquire_item(dyncfg_globals.nodes, id);
if(item) {
DYNCFG *df = dictionary_acquired_item_value(item);
DYNCFG_CMDS cmd = dyncfg_cmds2id(action);
const DICTIONARY_ITEM *item = dictionary_get_and_acquire_item(dyncfg_globals.nodes, id);
if(!item)
item = dyncfg_get_template_of_new_job(id);

if(!rrd_function_available(host, string2str(df->function)))
df->current.status = DYNCFG_STATUS_ORPHAN;
if(item) {
DYNCFG *df = dictionary_acquired_item_value(item);

if(!rrd_function_available(host, string2str(df->function)))
df->current.status = DYNCFG_STATUS_ORPHAN;

if(cmd == DYNCFG_CMD_REMOVE) {
bool delete = (df->current.status == DYNCFG_STATUS_ORPHAN);
dictionary_acquired_item_release(dyncfg_globals.nodes, item);
item = NULL;

if(delete) {
if(!http_access_user_has_enough_access_level_for_endpoint(rfe->user_access, df->edit_access)) {
code = dyncfg_default_response(
rfe->result.wb, HTTP_RESP_FORBIDDEN,
"dyncfg: you don't have enough edit permissions to execute this command");
goto cleanup;
}

dictionary_del(dyncfg_globals.nodes, id);
dyncfg_file_delete(id);
code = dyncfg_default_response(rfe->result.wb, 200, "");
goto cleanup;
}
}
else if(cmd == DYNCFG_CMD_TEST && df->type == DYNCFG_TYPE_TEMPLATE && df->current.status != DYNCFG_STATUS_ORPHAN) {
const char *old_rfe_function = rfe->function;
char buf2[2048];
snprintfz(buf2, sizeof(buf2), "config %s %s", dictionary_acquired_item_name(item), action);
rfe->function = buf2;
dictionary_acquired_item_release(dyncfg_globals.nodes, item);
item = NULL;
code = dyncfg_function_intercept_cb(rfe, data);
rfe->function = old_rfe_function;
return code;
}

if(item)
dictionary_acquired_item_release(dyncfg_globals.nodes, item);
}

code = HTTP_RESP_NOT_FOUND;
nd_log(NDLS_DAEMON, NDLP_ERR,
"DYNCFG: unknown config id '%s' in call: '%s'. "
"This can happen if the plugin that registered the dynamic configuration is not running now.",
action, rfe->function);
id, rfe->function);

rrd_call_function_error(
rfe->result.wb,
@@ -248,7 +274,11 @@ static int dyncfg_config_execute_cb(struct rrd_function_execute *rfe, void *data
// for which there is no id overloaded.

void dyncfg_host_init(RRDHOST *host) {
// IMPORTANT:
// This function needs to be async, although it is internal.
// The reason is that it may itself call another function that may or may not be internal (sync).

rrd_function_add(host, NULL, PLUGINSD_FUNCTION_CONFIG, 120,
1000, "Dynamic configuration", "config", HTTP_ACCESS_ANONYMOUS_DATA,
true, dyncfg_config_execute_cb, host);
false, dyncfg_config_execute_cb, host);
}