diff --git a/CHANGELOG.md b/CHANGELOG.md index 42bc0fa3be4112..c99ae104a59b9f 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -6,6 +6,9 @@ **Merged pull requests:** +- updated sizing netdata [\#17057](https://github.com/netdata/netdata/pull/17057) ([ktsaou](https://github.com/ktsaou)) +- fix zpool state chart family [\#17054](https://github.com/netdata/netdata/pull/17054) ([ilyam8](https://github.com/ilyam8)) +- DYNCFG: call the interceptor when a test is made on a new job [\#17052](https://github.com/netdata/netdata/pull/17052) ([ktsaou](https://github.com/ktsaou)) - fix alerts jsonschema prototype for latest dyncfg [\#17047](https://github.com/netdata/netdata/pull/17047) ([ktsaou](https://github.com/ktsaou)) - Do not use backtrace when sentry is enabled. [\#17043](https://github.com/netdata/netdata/pull/17043) ([vkalintiris](https://github.com/vkalintiris)) - Keep a count of metrics and samples collected [\#17042](https://github.com/netdata/netdata/pull/17042) ([stelfrag](https://github.com/stelfrag)) @@ -311,7 +314,6 @@ - code cleanup [\#16542](https://github.com/netdata/netdata/pull/16542) ([ktsaou](https://github.com/ktsaou)) - Assorted kickstart script fixes. 
[\#16537](https://github.com/netdata/netdata/pull/16537) ([Ferroin](https://github.com/Ferroin)) - wip documentation about functions table [\#16535](https://github.com/netdata/netdata/pull/16535) ([ktsaou](https://github.com/ktsaou)) -- Remove openSUSE 15.4 from CI [\#16449](https://github.com/netdata/netdata/pull/16449) ([tkatsoulas](https://github.com/tkatsoulas)) ## [v1.44.3](https://github.com/netdata/netdata/tree/v1.44.3) (2024-02-12) @@ -402,9 +404,6 @@ - set journal path for logging [\#16457](https://github.com/netdata/netdata/pull/16457) ([ktsaou](https://github.com/ktsaou)) - add sbindir\_POST to PATH of bash scripts that use `systemd-cat-native` [\#16456](https://github.com/netdata/netdata/pull/16456) ([ilyam8](https://github.com/ilyam8)) - add LogNamespace to systemd units [\#16454](https://github.com/netdata/netdata/pull/16454) ([ilyam8](https://github.com/ilyam8)) -- Update non-zero uuid key + child conf. [\#16452](https://github.com/netdata/netdata/pull/16452) ([vkalintiris](https://github.com/vkalintiris)) -- Add missing argument. [\#16451](https://github.com/netdata/netdata/pull/16451) ([vkalintiris](https://github.com/vkalintiris)) -- log flood protection to 1000 log lines / 1 minute [\#16450](https://github.com/netdata/netdata/pull/16450) ([ilyam8](https://github.com/ilyam8)) ## [v1.43.2](https://github.com/netdata/netdata/tree/v1.43.2) (2023-10-30) diff --git a/README.md b/README.md index ba0ae0b4cc4a56..1b20c452df62b6 100644 --- a/README.md +++ b/README.md @@ -41,19 +41,19 @@ It scales nicely from just a single server to thousands of servers, even in comp Operating system metrics, container metrics, virtual machines, hardware sensors, applications metrics, OpenMetrics exporters, StatsD, and logs. - :muscle: **Real-Time, Low-Latency, High-Resolution**
- All metrics are collected per second and are on the dashboard immediately after data collection. Netdata is designed to be fast. + All metrics are collected per second and are on the dashboard immediately after data collection. - :face_in_clouds: **Unsupervised Anomaly Detection**
- Trains multiple Machine-Learning (ML) models for each metric collected and detects anomalies based on the past behavior of each metric individually. + Trains multiple Machine-Learning (ML) models for each metric and uses AI to detect anomalies based on the past behavior of each metric. - :fire: **Powerful Visualization**
- Clear and precise visualization that allows you to quickly understand any dataset, but also to filter, slice and dice the data directly on the dashboard, without the need to learn any query language. + Clear and precise visualization allowing you to understand any dataset at first sight, but also to filter, slice and dice the data directly on the dashboard, without the need to learn a query language. - :bell: **Out of box Alerts**
Comes with hundreds of alerts out of the box to detect common issues and pitfalls, revealing issues that can easily go unnoticed. It supports several notification methods to let you know when your attention is needed. - 📖 **systemd Journal Logs Explorer**
- Provides a `systemd` journal logs explorer, to view, filter and analyze system and applications logs by directly accessing `systemd` journal files on individual hosts and infrastructure-wide logs centralization servers. + System and application logs of all servers are available in-real-time, for filtering and analysis, on both individual nodes and infrastructure-wide logs centralization servers. - :sunglasses: **Low Maintenance**
Fully automated in every aspect: automated dashboards, out-of-the-box alerts, auto-detection and auto-discovery of metrics, zero-touch machine-learning, easy scalability and high availability, and CI/CD friendly. diff --git a/docs/netdata-agent/sizing-netdata-agents/README.md b/docs/netdata-agent/sizing-netdata-agents/README.md index 22437c8b9d35b7..b945dc56c6c527 100644 --- a/docs/netdata-agent/sizing-netdata-agents/README.md +++ b/docs/netdata-agent/sizing-netdata-agents/README.md @@ -58,7 +58,9 @@ The following are some of the innovations the open-source Netdata agent has, tha 2. **4 bytes per sample uncompressed** - To achieve optimal memory and disk footprint, Netdata uses a custom 32-bit floating point number we have developed. This floating point number is used to store the samples collected, together with their anomaly bit. The database of Netdata is fixed-step, so it has predefined slots for every sample, allowing Netdata to store timestamps once every several hundreds samples, minimizing both its memory requirements and the disk footprint. + To achieve optimal memory and disk footprint, Netdata uses a custom 32-bit floating point number. This floating point number is used to store the samples collected, together with their anomaly bit. The database of Netdata is fixed-step, so it has predefined slots for every sample, allowing Netdata to store timestamps once every several hundreds samples, minimizing both its memory requirements and the disk footprint. + + The final disk footprint of Netdata varies due to compression efficiency. It is usually about 0.6 bytes per sample for the high-resolution tier (per-second), 6 bytes per sample for the mid-resolution tier (per-minute) and 18 bytes per sample for the low-resolution tier (per-hour). 3. 
**Query priorities** diff --git a/docs/netdata-agent/sizing-netdata-agents/disk-requirements-and-retention.md b/docs/netdata-agent/sizing-netdata-agents/disk-requirements-and-retention.md index 0c73a99de43b24..d9e879cb62ba41 100644 --- a/docs/netdata-agent/sizing-netdata-agents/disk-requirements-and-retention.md +++ b/docs/netdata-agent/sizing-netdata-agents/disk-requirements-and-retention.md @@ -28,11 +28,11 @@ To configure database mode `ram` or `alloc`, in `netdata.conf`, set the followin `dbengine` supports up to 5 tiers. By default, 3 tiers are used, like this: -| Tier | Resolution | Uncompressed Sample Size | -|:--------:|:--------------------------------------------------------------------------------------------:|:------------------------:| -| `tier0` | native resolution (metrics collected per-second as stored per-second) | 4 bytes | -| `tier1` | 60 iterations of `tier0`, so when metrics are collected per-second, this tier is per-minute. | 16 bytes | -| `tier2` | 60 iterations of `tier1`, so when metrics are collected per second, this tier is per-hour. | 16 bytes | +| Tier | Resolution | Uncompressed Sample Size | Usually On Disk | +|:--------:|:--------------------------------------------------------------------------------------------:|:------------------------:|:---------------:| +| `tier0` | native resolution (metrics collected per-second as stored per-second) | 4 bytes | 0.6 bytes | +| `tier1` | 60 iterations of `tier0`, so when metrics are collected per-second, this tier is per-minute. | 16 bytes | 6 bytes | +| `tier2` | 60 iterations of `tier1`, so when metrics are collected per second, this tier is per-hour. | 16 bytes | 18 bytes | Data are saved to disk compressed, so the actual size on disk varies depending on compression efficiency. 
@@ -56,40 +56,46 @@ You can find information about the current disk utilization of a Netdata Parent, ```json { // more information about the agent - // near the end: + // then, near the end: "db_size": [ { "tier": 0, - "disk_used": 1677528462156, - "disk_max": 1677721600000, - "disk_percent": 99.9884881, - "from": 1706201952, - "to": 1707401946, - "retention": 1199994, - "expected_retention": 1200132, - "currently_collected_metrics": 2198777 + "metrics": 43070, + "samples": 88078162001, + "disk_used": 41156409552, + "disk_max": 41943040000, + "disk_percent": 98.1245269, + "from": 1705033983, + "to": 1708856640, + "retention": 3822657, + "expected_retention": 3895720, + "currently_collected_metrics": 27424 }, { "tier": 1, - "disk_used": 838123468064, - "disk_max": 838860800000, - "disk_percent": 99.9121032, - "from": 1702885800, - "to": 1707401946, - "retention": 4516146, - "expected_retention": 4520119, - "currently_collected_metrics": 2198777 + "metrics": 72987, + "samples": 5155155269, + "disk_used": 20585157180, + "disk_max": 20971520000, + "disk_percent": 98.1576785, + "from": 1698287340, + "to": 1708856640, + "retention": 10569300, + "expected_retention": 10767675, + "currently_collected_metrics": 27424 }, { "tier": 2, - "disk_used": 334329683032, - "disk_max": 419430400000, - "disk_percent": 79.710408, - "from": 1679670000, - "to": 1707401946, - "retention": 27731946, - "expected_retention": 34790871, - "currently_collected_metrics": 2198777 + "metrics": 148234, + "samples": 314919121, + "disk_used": 5957346684, + "disk_max": 10485760000, + "disk_percent": 56.8136853, + "from": 1667808000, + "to": 1708856640, + "retention": 41048640, + "expected_retention": 72251324, + "currently_collected_metrics": 27424 } ] } @@ -98,6 +104,8 @@ You can find information about the current disk utilization of a Netdata Parent, In this example: - `tier` is the database tier. +- `metrics` is the number of unique time-series in the database. 
+- `samples` is the number of samples in the database. - `disk_used` is the currently used disk space in bytes. - `disk_max` is the configured max disk space in bytes. - `disk_percent` is the current disk space utilization for this tier. @@ -107,21 +115,13 @@ In this example: - `expected_retention` is the expected retention in seconds when `disk_percent` will be 100 (divide by 3600 for hours, divide by 86400 for days). - `currently_collected_metrics` is the number of unique time-series currently being collected for this tier. -The estimated number of samples on each tier can be calculated as follows: - -``` -estimasted number of samples = retention / sample duration * currently_collected_metrics -``` - So, for our example above: -| Tier | Sample Duration (seconds) | Estimated Number of Samples | Disk Space Used | Current Retention (days) | Expected Retention (days) | Bytes Per Sample | -|:-------:|:-------------------------:|:---------------------------:|:---------------:|:------------------------:|:-------------------------:|:----------------:| -| `tier0` | 1 | 2.64 trillion samples | 1.56 TiB | 13.8 | 13.9 | 0.64 | -| `tier1` | 60 | 165.5 billion samples | 780 GiB | 52.2 | 52.3 | 5.01 | -| `tier2` | 3600 | 16.9 billion samples | 311 GiB | 320.9 | 402.7 | 19.73 | - -Note: as you can see in this example, the disk footprint per sample of `tier2` is bigger than the uncompressed sample size (19.73 bytes vs 16 bytes). This is due to the fact that samples are organized into pages and pages into extents. When Netdata is restarted frequently, it saves all data prematurely, before filling up entire pages and extents, leading to increased overheads per sample. 
+| Tier | # Of Metrics | # Of Samples | Disk Used | Disk Free | Current Retention | Expected Retention | Sample Size | +|-----:|-------------:|--------------:|----------:|----------:|------------------:|-------------------:|------------:| +| 0 | 43.1K | 88.1 billion | 38.4Gi | 1.88% | 44.2 days | 45.0 days | 0.46 B | +| 1 | 73.0K | 5.2 billion | 19.2Gi | 1.84% | 122.3 days | 124.6 days | 3.99 B | +| 2 | 148.3K | 315.0 million | 5.6Gi | 43.19% | 475.1 days | 836.2 days | 18.91 B | To configure retention, in `netdata.conf`, set the following: diff --git a/packaging/version b/packaging/version index 8f804f2327eb72..a31f6e8ea2ecc3 100644 --- a/packaging/version +++ b/packaging/version @@ -1 +1 @@ -v1.44.0-406-nightly +v1.44.0-412-nightly diff --git a/src/collectors/proc.plugin/proc_spl_kstat_zfs.c b/src/collectors/proc.plugin/proc_spl_kstat_zfs.c index fe748aa7d42201..e6b12c31f83e91 100644 --- a/src/collectors/proc.plugin/proc_spl_kstat_zfs.c +++ b/src/collectors/proc.plugin/proc_spl_kstat_zfs.c @@ -272,7 +272,7 @@ int update_zfs_pool_state_chart(const DICTIONARY_ITEM *item, void *pool_p, void "zfspool", chart_id, NULL, - name, + "state", "zfspool.state", "ZFS pool state", "boolean", diff --git a/src/daemon/config/dyncfg-intercept.c b/src/daemon/config/dyncfg-intercept.c index dd7052e7284d75..812059f6fc3151 100644 --- a/src/daemon/config/dyncfg-intercept.c +++ b/src/daemon/config/dyncfg-intercept.c @@ -180,7 +180,7 @@ static int dyncfg_intercept_early_error(struct rrd_function_execute *rfe, int rc return rc; } -static const DICTIONARY_ITEM *dyncfg_get_template_of_new_job(const char *job_id) { +const DICTIONARY_ITEM *dyncfg_get_template_of_new_job(const char *job_id) { char id_copy[strlen(job_id) + 1]; memcpy(id_copy, job_id, sizeof(id_copy)); diff --git a/src/daemon/config/dyncfg-internals.h b/src/daemon/config/dyncfg-internals.h index df9af6fd527e58..181d2328fc16ad 100644 --- a/src/daemon/config/dyncfg-internals.h +++ b/src/daemon/config/dyncfg-internals.h @@ -76,6 
+76,8 @@ const DICTIONARY_ITEM *dyncfg_add_internal(RRDHOST *host, const char *id, const int dyncfg_function_intercept_cb(struct rrd_function_execute *rfe, void *data); void dyncfg_cleanup(DYNCFG *v); +const DICTIONARY_ITEM *dyncfg_get_template_of_new_job(const char *job_id); + bool dyncfg_is_user_disabled(const char *id); RRDHOST *dyncfg_rrdhost_by_uuid(UUID *uuid); diff --git a/src/daemon/config/dyncfg-tree.c b/src/daemon/config/dyncfg-tree.c index 0983a9ee1fbcf7..6af384daa273f8 100644 --- a/src/daemon/config/dyncfg-tree.c +++ b/src/daemon/config/dyncfg-tree.c @@ -204,31 +204,57 @@ static int dyncfg_config_execute_cb(struct rrd_function_execute *rfe, void *data action = path; path = NULL; - if(id && *id && dyncfg_cmds2id(action) == DYNCFG_CMD_REMOVE) { - const DICTIONARY_ITEM *item = dictionary_get_and_acquire_item(dyncfg_globals.nodes, id); - if(item) { - DYNCFG *df = dictionary_acquired_item_value(item); + DYNCFG_CMDS cmd = dyncfg_cmds2id(action); + const DICTIONARY_ITEM *item = dictionary_get_and_acquire_item(dyncfg_globals.nodes, id); + if(!item) + item = dyncfg_get_template_of_new_job(id); - if(!rrd_function_available(host, string2str(df->function))) - df->current.status = DYNCFG_STATUS_ORPHAN; + if(item) { + DYNCFG *df = dictionary_acquired_item_value(item); + if(!rrd_function_available(host, string2str(df->function))) + df->current.status = DYNCFG_STATUS_ORPHAN; + + if(cmd == DYNCFG_CMD_REMOVE) { bool delete = (df->current.status == DYNCFG_STATUS_ORPHAN); dictionary_acquired_item_release(dyncfg_globals.nodes, item); + item = NULL; if(delete) { + if(!http_access_user_has_enough_access_level_for_endpoint(rfe->user_access, df->edit_access)) { + code = dyncfg_default_response( + rfe->result.wb, HTTP_RESP_FORBIDDEN, + "dyncfg: you don't have enough edit permissions to execute this command"); + goto cleanup; + } + dictionary_del(dyncfg_globals.nodes, id); dyncfg_file_delete(id); code = dyncfg_default_response(rfe->result.wb, 200, ""); goto cleanup; } } + else 
if(cmd == DYNCFG_CMD_TEST && df->type == DYNCFG_TYPE_TEMPLATE && df->current.status != DYNCFG_STATUS_ORPHAN) { + const char *old_rfe_function = rfe->function; + char buf2[2048]; + snprintfz(buf2, sizeof(buf2), "config %s %s", dictionary_acquired_item_name(item), action); + rfe->function = buf2; + dictionary_acquired_item_release(dyncfg_globals.nodes, item); + item = NULL; + code = dyncfg_function_intercept_cb(rfe, data); + rfe->function = old_rfe_function; + return code; + } + + if(item) + dictionary_acquired_item_release(dyncfg_globals.nodes, item); } code = HTTP_RESP_NOT_FOUND; nd_log(NDLS_DAEMON, NDLP_ERR, "DYNCFG: unknown config id '%s' in call: '%s'. " "This can happen if the plugin that registered the dynamic configuration is not running now.", - action, rfe->function); + id, rfe->function); rrd_call_function_error( rfe->result.wb, @@ -248,7 +274,11 @@ static int dyncfg_config_execute_cb(struct rrd_function_execute *rfe, void *data // for which there is no id overloaded. void dyncfg_host_init(RRDHOST *host) { + // IMPORTANT: + // This function needs to be async, although it is internal. + // The reason is that it can call by itself another function that may or may not be internal (sync). + rrd_function_add(host, NULL, PLUGINSD_FUNCTION_CONFIG, 120, 1000, "Dynamic configuration", "config", HTTP_ACCESS_ANONYMOUS_DATA, - true, dyncfg_config_execute_cb, host); + false, dyncfg_config_execute_cb, host); }
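As a rough cross-check of the summary table this diff adds to `disk-requirements-and-retention.md`, the sketch below recomputes the per-tier figures from the `db_size` example in the same hunk. It assumes that "Sample Size" is `disk_used / samples`, that "Disk Free" is `100 - disk_percent`, and that retention in days is seconds divided by 86400; the diff states only the last of these. The helper name `summarize` is hypothetical and exists only for this illustration.

```python
# Values copied from the "db_size" example in the diff above.
SECONDS_PER_DAY = 86400

db_size = [
    {"tier": 0, "samples": 88078162001, "disk_used": 41156409552,
     "disk_percent": 98.1245269, "retention": 3822657,
     "expected_retention": 3895720},
    {"tier": 1, "samples": 5155155269, "disk_used": 20585157180,
     "disk_percent": 98.1576785, "retention": 10569300,
     "expected_retention": 10767675},
    {"tier": 2, "samples": 314919121, "disk_used": 5957346684,
     "disk_percent": 56.8136853, "retention": 41048640,
     "expected_retention": 72251324},
]


def summarize(tier):
    """Derive the table columns for one tier (assumed formulas, see lead-in)."""
    return {
        "tier": tier["tier"],
        # Assumption: on-disk sample size is total bytes over total samples.
        "bytes_per_sample": tier["disk_used"] / tier["samples"],
        # Assumption: free space is the complement of disk_percent.
        "disk_free_percent": 100.0 - tier["disk_percent"],
        # Stated in the docs: divide seconds by 86400 for days.
        "retention_days": tier["retention"] / SECONDS_PER_DAY,
        "expected_retention_days": tier["expected_retention"] / SECONDS_PER_DAY,
    }


rows = [summarize(t) for t in db_size]

for row in rows:
    print(
        f"tier{row['tier']}: "
        f"{row['bytes_per_sample']:.2f} B/sample, "
        f"{row['disk_free_percent']:.2f}% free, "
        f"{row['retention_days']:.1f} days now, "
        f"{row['expected_retention_days']:.1f} days expected"
    )
```

Under these assumptions the results come out at roughly 0.47, 3.99 and 18.92 bytes per sample for tiers 0, 1 and 2, close to the table's 0.46 B, 3.99 B and 18.91 B. The small gaps look like rounding versus truncation rather than a different formula, though the diff does not say which was used.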