diff --git a/CHANGELOG.md b/CHANGELOG.md index c8cc778d44333e..2a02fceb9c9e34 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -6,12 +6,22 @@ **Merged pull requests:** +- Regenerate integrations.js [\#17761](https://github.com/netdata/netdata/pull/17761) ([netdatabot](https://github.com/netdatabot)) +- add clickhouse alerts [\#17760](https://github.com/netdata/netdata/pull/17760) ([ilyam8](https://github.com/ilyam8)) +- simplify installation page [\#17759](https://github.com/netdata/netdata/pull/17759) ([Ancairon](https://github.com/Ancairon)) +- Regenerate integrations.js [\#17758](https://github.com/netdata/netdata/pull/17758) ([netdatabot](https://github.com/netdatabot)) +- Collecting metrics docs section simplification [\#17757](https://github.com/netdata/netdata/pull/17757) ([Ancairon](https://github.com/Ancairon)) +- go.d clickhouse add more metrics [\#17756](https://github.com/netdata/netdata/pull/17756) ([ilyam8](https://github.com/ilyam8)) +- mention how to remove highlight in documentation [\#17755](https://github.com/netdata/netdata/pull/17755) ([Ancairon](https://github.com/Ancairon)) +- Regenerate integrations.js [\#17752](https://github.com/netdata/netdata/pull/17752) ([netdatabot](https://github.com/netdatabot)) +- go.d clickhouse add running queries [\#17751](https://github.com/netdata/netdata/pull/17751) ([ilyam8](https://github.com/ilyam8)) - remove unused go.d/prometheus meta file [\#17749](https://github.com/netdata/netdata/pull/17749) ([ilyam8](https://github.com/ilyam8)) - Regenerate integrations.js [\#17748](https://github.com/netdata/netdata/pull/17748) ([netdatabot](https://github.com/netdatabot)) - add go.d clickhouse [\#17743](https://github.com/netdata/netdata/pull/17743) ([ilyam8](https://github.com/ilyam8)) - fix clickhouse in apps groups [\#17742](https://github.com/netdata/netdata/pull/17742) ([ilyam8](https://github.com/ilyam8)) - fix ebpf cgroup swap context [\#17740](https://github.com/netdata/netdata/pull/17740) ([ilyam8](https://github.com/ilyam8)) - Update netdata-agent-security.md [\#17738](https://github.com/netdata/netdata/pull/17738) ([Ancairon](https://github.com/Ancairon)) +- Collecting metrics docs grammar pass [\#17736](https://github.com/netdata/netdata/pull/17736) ([Ancairon](https://github.com/Ancairon)) - Grammar pass on docs [\#17735](https://github.com/netdata/netdata/pull/17735) ([Ancairon](https://github.com/Ancairon)) - Ensure that the choice of compiler and target is passed to sub-projects. [\#17732](https://github.com/netdata/netdata/pull/17732) ([Ferroin](https://github.com/Ferroin)) - Include the Host in the HTTP header \(mqtt\) [\#17731](https://github.com/netdata/netdata/pull/17731) ([stelfrag](https://github.com/stelfrag)) @@ -385,7 +395,6 @@ - Bump github.com/docker/docker from 25.0.4+incompatible to 25.0.5+incompatible in /src/go/collectors/go.d.plugin [\#17211](https://github.com/netdata/netdata/pull/17211) ([dependabot[bot]](https://github.com/apps/dependabot)) - Add -Wno-builtin-macro-redefined to compiler flags. [\#17209](https://github.com/netdata/netdata/pull/17209) ([Ferroin](https://github.com/Ferroin)) - Move bundling of JSON-C to CMake. 
[\#17207](https://github.com/netdata/netdata/pull/17207) ([Ferroin](https://github.com/Ferroin)) -- Compatibility with Prometheus HELP [\#17191](https://github.com/netdata/netdata/pull/17191) ([thiagoftsm](https://github.com/thiagoftsm)) ## [v1.45.5](https://github.com/netdata/netdata/tree/v1.45.5) (2024-05-21) @@ -424,11 +433,6 @@ - Fix alert hash table definition [\#17196](https://github.com/netdata/netdata/pull/17196) ([stelfrag](https://github.com/stelfrag)) - health: unsilence cpu % alarm [\#17194](https://github.com/netdata/netdata/pull/17194) ([ilyam8](https://github.com/ilyam8)) - Fix sum calculation in rrdr2value [\#17193](https://github.com/netdata/netdata/pull/17193) ([stelfrag](https://github.com/stelfrag)) -- Move bundling of libyaml to CMake. [\#17190](https://github.com/netdata/netdata/pull/17190) ([Ferroin](https://github.com/Ferroin)) -- add callout that snapshots only available on v1 [\#17189](https://github.com/netdata/netdata/pull/17189) ([hugovalente-pm](https://github.com/hugovalente-pm)) -- Bump github.com/vmware/govmomi from 0.36.0 to 0.36.1 in /src/go/collectors/go.d.plugin [\#17185](https://github.com/netdata/netdata/pull/17185) ([dependabot[bot]](https://github.com/apps/dependabot)) -- Bump k8s.io/client-go from 0.29.2 to 0.29.3 in /src/go/collectors/go.d.plugin [\#17184](https://github.com/netdata/netdata/pull/17184) ([dependabot[bot]](https://github.com/apps/dependabot)) -- Bump github.com/prometheus/common from 0.48.0 to 0.50.0 in /src/go/collectors/go.d.plugin [\#17182](https://github.com/netdata/netdata/pull/17182) ([dependabot[bot]](https://github.com/apps/dependabot)) ## [v1.44.3](https://github.com/netdata/netdata/tree/v1.44.3) (2024-02-12) diff --git a/docs/collecting-metrics/application-metrics.md b/docs/collecting-metrics/application-metrics.md deleted file mode 100644 index 44f52fa5f71681..00000000000000 --- a/docs/collecting-metrics/application-metrics.md +++ /dev/null @@ -1,83 +0,0 @@ - - -# Collect application metrics with Netdata - -Netdata instantly collects per-second metrics from many different types of applications running on your systems, such as -web servers, databases, message brokers, email servers, search platforms, and much more. Metrics collectors are -pre-installed with every Netdata Agent and usually require zero configuration. Netdata also collects and visualizes -resource utilization per application on Linux systems using `apps.plugin`. - -[**apps.plugin**](/src/collectors/apps.plugin/README.md) looks at the Linux process tree every second, much like `top` or -`ps fax`, and collects resource utilization information on every running process. By reading the process tree, Netdata -shows CPU, disk, networking, processes, and eBPF for every application or Linux user. Unlike `top` or `ps fax`, Netdata -adds a layer of meaningful visualization on top of the process tree metrics, such as grouping applications into useful -dimensions, and then creates per-application charts under the **Applications** section of a Netdata dashboard, per-user -charts under **Users**, and per-user group charts under **User Groups**. - -Our most popular application collectors: - -- [Prometheus endpoints](/src/go/collectors/go.d.plugin/modules/prometheus/README.md): Gathers - metrics from one or more Prometheus endpoints that use the OpenMetrics exposition format. Auto-detects more than 600 - endpoints. 
-- [Web server logs (Apache, NGINX)](/src/go/collectors/go.d.plugin/modules/weblog/README.md): - Tail access logs and provide very detailed web server performance statistics. This module is able to parse 200k+ - rows in less than half a second. -- [MySQL](/src/go/collectors/go.d.plugin/modules/mysql/README.md): Collect database global, - replication, and per-user statistics. -- [Redis](/src/go/collectors/go.d.plugin/modules/redis/README.md): Monitor database status by - reading the server's response to the `INFO` command. -- [Apache](/src/go/collectors/go.d.plugin/modules/apache/README.md): Collect Apache web server - performance metrics via the `server-status?auto` endpoint. -- [Nginx](/src/go/collectors/go.d.plugin/modules/nginx/README.md): Monitor web server status - information by gathering metrics via `ngx_http_stub_status_module`. -- [Postgres](/src/go/collectors/go.d.plugin/modules/postgres/README.md): Collect database health - and performance metrics. -- [ElasticSearch](/src/go/collectors/go.d.plugin/modules/elasticsearch/README.md): Collect search - engine performance and health statistics. Optionally collects per-index metrics. -- [PHP-FPM](/src/go/collectors/go.d.plugin/modules/phpfpm/README.md): Collect application summary - and processes health metrics by scraping the status page (`/status?full`). - -Our [supported collectors list](/src/collectors/COLLECTORS.md#service-and-application-collectors) shows all Netdata's -application metrics collectors, including those for containers/k8s clusters. - -## Collect metrics from applications running on Windows - -Netdata is fully capable of collecting and visualizing metrics from applications running on Windows systems. The only -caveat is that you must [install Netdata](/packaging/installer/README.md) on a separate system or a compatible VM because there -is no native Windows version of the Netdata Agent. - -Once you have Netdata running on that separate system, you can follow the [collectors configuration reference](/src/collectors/REFERENCE.md) documentation to tell the collector to look for exposed metrics on the Windows system's IP -address or hostname, plus the applicable port. - -For example, you have a MySQL database with a root password of `my-secret-pw` running on a Windows system with the IP -address 203.0.113.0. you can configure the [MySQL -collector](/src/go/collectors/go.d.plugin/modules/mysql/README.md) to look at `203.0.113.0:3306`: - -```yml -jobs: - - name: local - dsn: root:my-secret-pw@tcp(203.0.113.0:3306)/ -``` - -This same logic applies to any application in our [supported collectors -list](/src/collectors/COLLECTORS.md#service-and-application-collectors) that can run on Windows. - -## What's next? - -If you haven't yet seen the [supported collectors list](/src/collectors/COLLECTORS.md) give it a once-over for any -additional applications you may want to monitor using Netdata's native collectors, or the [generic Prometheus -collector](/src/go/collectors/go.d.plugin/modules/prometheus/README.md). - -Collecting all the available metrics on your nodes, and across your entire infrastructure, is just one piece of the -puzzle. Next, learn more about Netdata's famous real-time visualizations by [seeing an overview of your -infrastructure](/docs/dashboards-and-charts/home-tab.md) using Netdata Cloud. 
- - diff --git a/docs/collecting-metrics/container-metrics.md b/docs/collecting-metrics/container-metrics.md deleted file mode 100644 index 95926010333bc6..00000000000000 --- a/docs/collecting-metrics/container-metrics.md +++ /dev/null @@ -1,101 +0,0 @@ - - -# Collect container metrics with Netdata - -Thanks to close integration with Linux cgroups and the virtual files it maintains under `/sys/fs/cgroup`, Netdata can -monitor the health, status, and resource utilization of many different types of Linux containers. - -Netdata uses [cgroups.plugin](/src/collectors/cgroups.plugin/README.md) to poll `/sys/fs/cgroup` and convert the raw data -into human-readable metrics and meaningful visualizations. Through cgroups, Netdata is compatible with **all Linux -containers**, such as Docker, LXC, LXD, Libvirt, systemd-nspawn, and more. Read more about [Docker-specific -monitoring](#collect-docker-metrics) below. - -Netdata also has robust **Kubernetes monitoring** support thanks to a -[Helmchart](/packaging/installer/methods/kubernetes.md) to automate deployment, collectors for k8s agent services, and -robust [service discovery](https://github.com/netdata/agent-service-discovery/#service-discovery) to monitor the -services running inside of pods in your k8s cluster. Read more about [Kubernetes -monitoring](#collect-kubernetes-metrics) below. - -A handful of additional collectors gather metrics from container-related services, such as -[dockerd](/src/go/collectors/go.d.plugin/modules/docker/README.md) or [Docker -Engine](/src/go/collectors/go.d.plugin/modules/docker_engine/README.md). You can find all -container collectors in our supported collectors list under the -[containers/VMs](/src/collectors/COLLECTORS.md#containers-and-vms) and -[Kubernetes](/src/collectors/COLLECTORS.md#containers-and-vms) headings. - -## Collect Docker metrics - -Netdata has robust Docker monitoring thanks to the aforementioned -[cgroups.plugin](/src/collectors/cgroups.plugin/README.md). By polling cgroups every second, Netdata can produce meaningful -visualizations about the CPU, memory, disk, and network utilization of all running containers on the host system with -zero configuration. - -Netdata also collects metrics from applications running inside of Docker containers. For example, if you create a MySQL -database container using `docker run --name some-mysql -e MYSQL_ROOT_PASSWORD=my-secret-pw -d mysql:tag`, it exposes -metrics on port 3306. You can configure the [MySQL -collector](/src/go/collectors/go.d.plugin/modules/mysql/README.md) to look at `127.0.0.0:3306` for -MySQL metrics: - -```yml -jobs: - - name: local - dsn: root:my-secret-pw@tcp(127.0.0.1:3306)/ -``` - -Netdata then collects metrics from the container itself, but also dozens [MySQL-specific -metrics](/src/go/collectors/go.d.plugin/modules/mysql/README.md#charts) as well. - -### Collect metrics from applications running in Docker containers - -You could use this technique to monitor an entire infrastructure of Docker containers. The same [enable and configure](/src/collectors/REFERENCE.md) procedures apply whether an application runs on the host system or inside -a container. You may need to configure the target endpoint if it's not the application's default. - -Netdata can even [run in a Docker container](/packaging/docker/README.md) itself, and then collect metrics about the -host system, its own container with cgroups, and any applications you want to monitor. 
- -See our [application metrics doc](/docs/collecting-metrics/application-metrics.md) for details about Netdata's application metrics -collection capabilities. - -## Collect Kubernetes metrics - -We already have a few complementary tools and collectors for monitoring the many layers of a Kubernetes cluster, -_entirely for free_. These methods work together to help you troubleshoot performance or availability issues across -your k8s infrastructure. - -- A [Helm chart](https://github.com/netdata/helmchart), which bootstraps a Netdata Agent pod on every node in your - cluster, plus an additional parent pod for storing metrics and managing alert notifications. -- A [service discovery plugin](https://github.com/netdata/agent-service-discovery), which discovers and creates - configuration files for [compatible - applications](https://github.com/netdata/helmchart#service-discovery-and-supported-services) and any endpoints - covered by our [generic Prometheus - collector](/src/go/collectors/go.d.plugin/modules/prometheus/README.md). With these - configuration files, Netdata collects metrics from any compatible applications as they run _inside_ a pod. - Service discovery happens without manual intervention as pods are created, destroyed, or moved between nodes. -- A [Kubelet collector](/src/go/collectors/go.d.plugin/modules/k8s_kubelet/README.md), which runs - on each node in a k8s cluster to monitor the number of pods/containers, the volume of operations on each container, - and more. -- A [kube-proxy collector](/src/go/collectors/go.d.plugin/modules/k8s_kubeproxy/README.md), which - also runs on each node and monitors latency and the volume of HTTP requests to the proxy. -- A [cgroups collector](/src/collectors/cgroups.plugin/README.md), which collects CPU, memory, and bandwidth metrics for - each container running on your k8s cluster. - -For a holistic view of Netdata's Kubernetes monitoring capabilities, see our guide: [_Monitor a Kubernetes (k8s) cluster -with Netdata_](/docs/developer-and-contributor-corner/kubernetes-k8s-netdata.md). - -## What's next? - -Netdata is capable of collecting metrics from hundreds of applications, such as web servers, databases, messaging -brokers, and more. See more in the [application metrics doc](/docs/collecting-metrics/application-metrics.md). - -If you already have all the information you need about collecting metrics, move into Netdata's meaningful visualizations -with [seeing an overview of your infrastructure](/docs/dashboards-and-charts/home-tab.md) using Netdata Cloud. - - diff --git a/docs/collecting-metrics/system-metrics.md b/docs/collecting-metrics/system-metrics.md deleted file mode 100644 index 65caa8cdccf7a5..00000000000000 --- a/docs/collecting-metrics/system-metrics.md +++ /dev/null @@ -1,62 +0,0 @@ - - -# Collect system metrics with Netdata - -Netdata collects thousands of metrics directly from the operating systems of physical and virtual systems, IoT/edge -devices, and [containers](/docs/collecting-metrics/container-metrics.md) with zero configuration. - -To gather system metrics, Netdata uses roughly a dozen plugins, each of which has one or more collectors for very -specific metrics exposed by the host. The system metrics Netdata users interact with most for health monitoring and -performance troubleshooting are collected and visualized by `proc.plugin`, `cgroups.plugin`, and `ebpf.plugin`. 
- -[**proc.plugin**](/src/collectors/proc.plugin/README.md) gathers metrics from the `/proc` and `/sys` folders in Linux -systems, along with a few other endpoints, and is responsible for the bulk of the system metrics collected and -visualized by Netdata. It collects CPU, memory, disks, load, networking, mount points, and more with zero configuration. -It even allows Netdata to monitor its own resource utilization! - -[**cgroups.plugin**](/src/collectors/cgroups.plugin/README.md) collects rich metrics about containers and virtual machines -using the virtual files under `/sys/fs/cgroup`. By reading cgroups, Netdata can instantly collect resource utilization -metrics for systemd services, all containers (Docker, LXC, LXD, Libvirt, systemd-nspawn), and more. Learn more in the -[collecting container metrics](/docs/collecting-metrics/container-metrics.md) doc. - -[**ebpf.plugin**](/src/collectors/ebpf.plugin/README.md): Netdata's extended Berkeley Packet Filter (eBPF) collector -monitors Linux kernel-level metrics for file descriptors, virtual filesystem IO, and process management. You can use our -eBPF collector to analyze how and when a process accesses files, when it makes system calls, whether it leaks memory or -creating zombie processes, and more. - -While the above plugins and associated collectors are the most important for system metrics, there are many others. You -can find all system collectors in our [supported collectors list](/src/collectors/COLLECTORS.md#system-collectors). - -## Collect Windows system metrics - -Netdata is also capable of monitoring Windows systems. The [Windows -collector](/src/go/collectors/go.d.plugin/modules/windows/README.md) integrates with -[windows_exporter](https://github.com/prometheus-community/windows_exporter), a small Go-based binary that you can run -on Windows systems. The Windows collector then gathers metrics from an endpoint created by windows_exporter, for more -details see [the requirements](/src/go/collectors/go.d.plugin/modules/windows/README.md#requirements). - -Next, [configure](/src/go/collectors/go.d.plugin/modules/windows/README.md#configuration) the Windows -collector to point to the URL and port of your exposed endpoint. Restart Netdata with `sudo systemctl restart netdata`, or the [appropriate -method](/packaging/installer/README.md#maintaining-a-netdata-agent-installation) for your system. You'll start seeing Windows system metrics, such as CPU -utilization, memory, bandwidth per NIC, number of processes, and much more. - -For information about collecting metrics from applications _running on Windows systems_, see the [application metrics -doc](/docs/collecting-metrics/application-metrics.md#collect-metrics-from-applications-running-on-windows). - -## What's next? - -Because there's some overlap between system metrics and [container metrics](/docs/collecting-metrics/container-metrics.md), you -should investigate Netdata's container compatibility if you use them heavily in your infrastructure. - -If you don't use containers, skip ahead to collecting [application metrics](/docs/collecting-metrics/application-metrics.md) with -Netdata. 
-
-
diff --git a/docs/dashboards-and-charts/netdata-charts.md b/docs/dashboards-and-charts/netdata-charts.md
index 0304eb1ab13300..9a15ea9a7ca9a3 100644
--- a/docs/dashboards-and-charts/netdata-charts.md
+++ b/docs/dashboards-and-charts/netdata-charts.md
@@ -369,6 +369,10 @@ Selecting timeframes is useful when you see an interesting spike or change in a
|:-----------------------------------|:---------------------------------------------------------|:---------------------|
| **Highlight** a specific timeframe | `Alt + mouse selection` or `⌘ + mouse selection` (macOS)  | `n/a`                |

+> **Note**
+>
+> To clear a highlighted timeframe, simply click on the chart area.
+
### Select and zoom

You can zoom to a specific timeframe, either horizontally or vertically, by selecting a timeframe.
diff --git a/docs/deployment-guides/README.md b/docs/deployment-guides/README.md
index 26ebc2e8de39e6..1b6571b999445a 100644
--- a/docs/deployment-guides/README.md
+++ b/docs/deployment-guides/README.md
@@ -1,6 +1,6 @@
# Deployment Guides

-Netdata can be used to monitor all kinds of infrastructure, from stand-alone tiny IoT devices to complex hybrid setups combining on-premise and cloud infrastructure, mixing bare-metal servers, virtual machines and containers.
+Netdata can be used to monitor all kinds of infrastructure, from tiny stand-alone IoT devices to complex hybrid setups combining on-premise and cloud infrastructure, mixing bare-metal servers, virtual machines and containers.

There are 3 components to structure your Netdata ecosystem:

diff --git a/docs/deployment-guides/deployment-strategies.md b/docs/deployment-guides/deployment-strategies.md
index 67d9d95da78080..4d795bec2f10af 100644
--- a/docs/deployment-guides/deployment-strategies.md
+++ b/docs/deployment-guides/deployment-strategies.md
@@ -1,60 +1,38 @@
-# Deployment strategies
-
+# Deployment Examples

## Deployment Options Overview

-This section provides a quick overview of a few common deployment options. The next sections go into configuration examples and further reading.
-
-### Stand-alone Deployment
-
-To help our users have a complete experience of Netdata when they install it for the first time, a Netdata Agent with default configuration
-is a complete monitoring solution out of the box, having all these features enabled and available.
-
-The Agent will act as a _stand-alone_ Agent by default, and this is great to start out with for small setups and home labs. By [connecting each Agent to Cloud](/src/claim/README.md), you can see an overview of all your nodes, with aggregated charts and centralized alerting, without setting up a Parent.
-
-![image](https://github.com/netdata/netdata/assets/116741/6a638175-aec4-4d46-85a6-520c283ab6a8)
-
-### Parent – Child Deployment
-
-An Agent connected to a Parent is called a _Child_. It will _stream_ metrics to its Parent. The Parent can then take care of storing metrics on behalf of that node (with longer retention), handle metrics queries for showing dashboards, and provide alerting.
-
-When using Cloud, it is recommended that just the Parent is connected to Cloud. Child Agents can then be configured to have short retention, in RAM instead of on Disk, and have alerting and other features disabled. Because they don't need to connect to Cloud themselves, those children can then be further secured by not allowing outbound traffic.
-![image](https://github.com/netdata/netdata/assets/116741/cb65698d-a6b7-43ee-a2d1-c30d0a46f084)
+This section provides a quick overview of a few common deployment options for Netdata.

-This setup allows for leaner Child nodes and is good for setups with more than a handful of nodes. Metrics data remains accessible if the Child node is temporarily unavailable or decommissioned, although there is no failover in case the Parent becomes unavailable.
+You can read about [Standalone Deployment](/docs/deployment-guides/standalone-deployment.md) and [Deployment with Centralization Points](/docs/deployment-guides/deployment-with-centralization-points.md) in the dedicated pages of this section.
+The sections below provide configuration examples for these deployment concepts.

-### Active–Active Parent Deployment
+## Deployment Configuration Details

-For high availability, Parents can be configured to stream data for their children between them, and keep the data sets in sync. Child Agents are configured with the addresses of both Parent Agents, but will only stream to one of them at a time. When that Parent becomes unavailable, it reconnects to another. When the first Parent becomes available again, that Parent will catch up by receiving the backlog from the second.
+### Stand-alone

-With both Parent Agents connected to Cloud, Cloud will route queries to either Parent transparently, depending on their availability. Alerts trigger on either Parent will stream to Cloud, and Cloud will deduplicate and debounce state changes to prevent spurious notifications.
+The stand-alone setup is configured out of the box with reasonable defaults, but please consult our [configuration documentation](/docs/netdata-agent/configuration/README.md) for details, including the overview of [common configuration changes](/docs/netdata-agent/configuration/common-configuration-changes.md).

-![image](https://github.com/netdata/netdata/assets/116741/6ae2b10c-7f7d-4503-aac4-0a9381c6f80b)
+### Parent – Child

+For setups involving Parent and Child Agents, both need to be configured for [streaming](/docs/observability-centralization-points/metrics-centralization-points/configuration.md), through the configuration file `stream.conf`.

-## Configuration Details
+This will instruct the Child to stream data to the Parent and the Parent to accept streaming connections for one or more Child Agents. To secure this connection, both need a shared API key (to replace the string `API_KEY` in the examples below). Additionally, the Child can be configured with one or more addresses of Parent Agents (`PARENT_IP_ADDRESS`).

-### Stand-alone Deployment
-
-The stand-alone setup is configured out of the box with reasonable defaults, but please consult our [configuration documentation](/docs/netdata-agent/configuration/cheatsheet.md) for details, including the overview of [common configuration changes](/docs/netdata-agent/configuration/common-configuration-changes.md).
-
-### Parent – Child Deployment
-
-For setups involving Child and Parent Agents, the Agents need to be configured for [_streaming_](/src/streaming/README.md), through the configuration file `stream.conf`. This will instruct the Child to stream data to the Parent and the Parent to accept streaming connections for one or more Child Agents. To secure this connection, both need set up a shared API key (to replace the string `API_KEY` in the examples below). Additionally, the Child is configured with one or more addresses of Parent Agents (`PARENT_IP_ADDRESS`).
-An API key is a key created with `uuidgen` and is used for authentication and/or customization in the Parent side. I.e. a Child will stream using the API key, and a Parent is configured to accept connections from Child, but can also apply different options for children by using multiple different API keys. The easiest setup uses just one API key for all Child Agents.
+An API key is a key created with `uuidgen` and is used for authentication and/or customization on the Parent side. For example, a Child can stream using the API key, and a Parent can be configured to accept connections from the Child, but it can also apply different options for Children by using multiple different API keys. The easiest setup uses just one API key for all Child Agents.

#### Child config

-As mentioned above, the recommendation is to not claim the Child to Cloud directly during your setup, avoiding establishing an [ACLK](/src/aclk/README.md) connection.
+We do not recommend claiming the Child directly to Cloud during your setup.

-To reduce the footprint of the Netdata Agent on your production system, some capabilities can be switched OFF on the Child and kept ON on the Parent. In this example, Machine Learning and Alerting are disabled in the Child, so that the Parent can take the load. We also use RAM instead of disk to store metrics with limited retention, covering temporary network issues.
+This is done to reduce the footprint of the Netdata Agent on your production system, as some capabilities can be switched OFF for the Child and kept ON for the Parent.
+
+In this example, Machine Learning and Alerting are disabled for the Child, so that the Parent can take the load. We also use RAM instead of disk to store metrics with limited retention, covering temporary network issues.

##### netdata.conf

-On the child node, edit `netdata.conf` by using the edit-config script: `/etc/netdata/edit-config netdata.conf` set the following parameters:
+On the child node, edit `netdata.conf` by using the [edit-config](/docs/netdata-agent/configuration/README.md#edit-netdataconf) script and set the following parameters:

```yaml
[db]
@@ -85,9 +63,7 @@ On the child node, edit `netdata.conf` by using the edit-config script: `/etc/ne

##### stream.conf

-To edit `stream.conf`, again use the edit-config script: `/etc/netdata/edit-config stream.conf`.
-
-Set the following parameters:
+To edit `stream.conf`, again use the [edit-config](/docs/netdata-agent/configuration/README.md#edit-netdataconf) script and set the following parameters:

```yaml
[stream]
@@ -101,7 +77,7 @@ Set the following parameters:

#### Parent config

-For the Parent, besides setting up streaming, the example will also provide an example configuration of multiple [tiers](/src/database/engine/README.md#tiering) of metrics [storage](/docs/netdata-agent/configuration/optimizing-metrics-database/change-metrics-storage.md), for 10 children, with about 2k metrics each.
+For the Parent, besides setting up streaming, this example also provides configuration for multiple [tiers of metrics storage](/docs/netdata-agent/configuration/optimizing-metrics-database/change-metrics-storage.md#calculate-the-system-resources-ram-disk-space-needed-to-store-metrics), for 10 Children, with about 2k metrics each.
This allows for:
- 1s granularity at tier 0 for 1 week
- 1m granularity at tier 1 for 1 month
- 1h granularity at tier 2 for 1 year

Requiring:

##### netdata.conf

-On the Parent, edit `netdata.conf` with `/etc/netdata/edit-config netdata.conf` and set the following parameters:
+On the Parent, edit `netdata.conf` by using the [edit-config](/docs/netdata-agent/configuration/README.md#edit-netdataconf) script and set the following parameters:

```yaml
[db]
@@ -149,7 +125,7 @@ On the Parent, edit `netdata.conf` with `/etc/netdata/edit-config netdata.conf`

##### stream.conf

-On the Parent node, edit `stream.conf` with `/etc/netdata/edit-config stream.conf`, and then set the following parameters:
+On the Parent node, edit `stream.conf` by using the [edit-config](/docs/netdata-agent/configuration/README.md#edit-netdataconf) script and set the following parameters:

```yaml
[API_KEY]
@@ -157,13 +133,13 @@ On the Parent node, edit `stream.conf` with `/etc/netdata/edit-config stream.con
    enabled = yes
```

-### Active–Active Parent Deployment
+### Active–Active Parents

-In order to setup active–active streaming between Parent 1 and Parent 2, Parent 1 needs to be instructed to stream data to Parent 2 and Parent 2 to stream data to Parent 1. The Child Agents need to be configured with the addresses of both Parent Agents. The Agent will only connect to one Parent at a time, falling back to the next if the previous failed. These examples use the same API key between Parent Agents as for connections from Child Agents.
+In order to set up active–active streaming between Parent 1 and Parent 2, Parent 1 needs to be instructed to stream data to Parent 2 and Parent 2 to stream data to Parent 1. The Child Agents need to be configured with the addresses of both Parent Agents. An Agent will only connect to one Parent at a time, falling back to the next upon failure. These examples use the same API key between Parent Agents and for connections from Child Agents.

-On both Netdata Parent and all Child Agents, edit `stream.conf` with `/etc/netdata/edit-config stream.conf`:
+On both Netdata Parents and all Child Agents, edit `stream.conf` by using the [edit-config](/docs/netdata-agent/configuration/README.md#edit-netdataconf) script:

-##### stream.conf on Parent 1
+#### stream.conf on Parent 1

```yaml
[stream]
@@ -178,7 +154,7 @@ On both Netdata Parent and all Child Agents, edit `stream.conf` with `/etc/netda
    enabled = yes
```

-##### stream.conf on Parent 2
+#### stream.conf on Parent 2

```yaml
[stream]
@@ -192,7 +168,7 @@ On both Netdata Parent and all Child Agents, edit `stream.conf` with `/etc/netda
    enabled = yes
```

-##### stream.conf on Child Agents
+#### stream.conf on Child Agents

```yaml
[stream]
@@ -208,19 +184,11 @@ On both Netdata Parent and all Child Agents, edit `stream.conf` with `/etc/netda

We strongly recommend the following configuration changes for production deployments:

-1. Understand Netdata's [security and privacy design](/docs/security-and-privacy-design/README.md) and
-   [secure your nodes](/docs/netdata-agent/securing-netdata-agents.md)
+1. Understand Netdata's [security and privacy design](/docs/security-and-privacy-design/README.md) and [secure your nodes](/docs/netdata-agent/securing-netdata-agents.md)

   To safeguard your infrastructure and comply with your organization's security policies.

-2. Set up [streaming and replication](/src/streaming/README.md) to:
-
-   - Offload Netdata Agents running on production systems and free system resources for the production applications running on them.
-   - Isolate production systems from the rest of the world and improve security.
-   - Increase data retention.
-   - Make your data highly available.
-
-3. [Optimize the Netdata Agents system utilization and performance](/docs/netdata-agent/configuration/optimize-the-netdata-agents-performance.md)
+2. [Optimize the Netdata Agents system utilization and performance](/docs/netdata-agent/configuration/optimize-the-netdata-agents-performance.md)

   To save valuable system resources, especially when running on weak IoT devices.

We also suggest that you:

1. [Use Netdata Cloud to access the dashboards](/docs/netdata-cloud/monitor-your-infrastructure.md)

-   For increased security, user management and access to our latest tools for advanced dashboarding and troubleshooting.
+   For increased security, user management and access to our latest features, tools and troubleshooting solutions.

2. [Change how long Netdata stores metrics](/docs/netdata-agent/configuration/optimizing-metrics-database/change-metrics-storage.md)

-   To control Netdata's memory use, when you have a lot of ephemeral metrics.
+   To control Netdata's memory use, when you have a lot of ephemeral metrics.

3. [Use host labels](/docs/netdata-agent/configuration/organize-systems-metrics-and-alerts.md)

diff --git a/docs/deployment-guides/deployment-with-centralization-points.md b/docs/deployment-guides/deployment-with-centralization-points.md
index b3e2b40dc6fc5d..87fd4a61a87c6b 100644
--- a/docs/deployment-guides/deployment-with-centralization-points.md
+++ b/docs/deployment-guides/deployment-with-centralization-points.md
@@ -14,7 +14,7 @@ When metrics and logs are centralized, the Children are never queried for metric
| Unified infrastructure dashboards for logs | All logs are accessible via the same dashboard at Netdata Cloud, although they are unified per Netdata Parent |
| Centrally configured alerts | Yes, at Netdata Parents |
| Centrally dispatched alert notifications | Yes, at Netdata Cloud |
-| Data are exclusively on-prem | Yes, Netdata Cloud queries Netdata Agents to satisfy dashboard queries. |
+| Data are exclusively on-prem | Yes, Netdata Cloud queries Netdata Agents to satisfy dashboard queries. |

A configuration with 2 observability centralization points looks like this:

@@ -24,7 +24,7 @@ flowchart LR
    dashboard for all nodes"]]
    NC(["Netdata Cloud
-    decides which agents
+    decides which Agents
    need to be queried"])
    SA1["Netdata at AWS A1"]
@@ -93,16 +93,24 @@ flowchart LR
  SB1 & SB2 & SBN ---|stream| PB
```

-### Configuration steps for deploying Netdata with Observability Centralization Points
+## Active–Active Parent Deployment
+
+For high availability, Parents can be configured to stream data for their Children between them, and keep their data sets in sync. Children are configured with the addresses of both Parents, but will only stream to one of them at a time. When one Parent becomes unavailable, the Child reconnects to the other. When the first Parent becomes available again, that Parent will catch up by receiving the backlog from the second.
+
+With both Parent Agents connected to Netdata Cloud, Cloud will route queries to either of them transparently, depending on their availability. Alerts triggered on either Parent will stream to Cloud, and Cloud will deduplicate and debounce state changes to prevent spurious notifications.
+
+## Configuration steps for deploying Netdata with Observability Centralization Points

For Metrics:

-- Install Netdata agents on all systems and the Netdata Parents.
+- Install Netdata Agents on all systems and the Netdata Parents.

- Configure `stream.conf` at the Netdata Parents to enable streaming access with an API key.

- Configure `stream.conf` at the Netdata Children to enable streaming to the configured Netdata Parents.

+Check the [related section in our documentation](/docs/observability-centralization-points/metrics-centralization-points/README.md) for more info.
+
For Logs:

- Install `systemd-journal-remote` on all systems and the Netdata Parents.

@@ -111,11 +119,4 @@ For Logs:

- Configure `systemd-journal-upload` at the Netdata Children to enable transmission of their logs to the Netdata Parents.

-Optionally:
-
-- Disable ML, health checks and dashboard access at Netdata Children to save resources and avoid duplicate notifications.
-
-When using Netdata Cloud:
-
-- Optionally: disable dashboard access on all Netdata agents (including Netdata Parents).
-- Optionally: disable alert notifications on all Netdata agents (including Netdata Parents).
+Check the [related section in our documentation](/docs/observability-centralization-points/logs-centralization-points-with-systemd-journald/README.md) for more info.
diff --git a/docs/deployment-guides/standalone-deployment.md b/docs/deployment-guides/standalone-deployment.md
index 230f47bd5f0d4f..3138141f71953f 100644
--- a/docs/deployment-guides/standalone-deployment.md
+++ b/docs/deployment-guides/standalone-deployment.md
@@ -1,22 +1,22 @@
# Standalone Deployment

-To help our users have a complete experience of Netdata when they install it for the first time, a Netdata Agent with default configuration is a complete monitoring solution out of the box, having all its features enabled and available.
+To help our users have a complete experience of Netdata when they install it for the first time, the Netdata Agent with default configuration is a complete monitoring solution out of the box, with all its features enabled and available.

-So, each Netdata agent acts as a standalone monitoring system by default.
+So, each Netdata Agent acts as a standalone monitoring system by default.

-## Standalone agents, without Netdata Cloud
+## Standalone Agents, without Netdata Cloud

| Feature | How it works |
|:---------------------------------------------:|:----------------------------------------------------:|
-| Unified infrastructure dashboards for metrics | No, each Netdata agent provides its own dashboard |
-| Unified infrastructure dashboards for logs | No, each Netdata agent exposes its own logs |
| Centrally configured alerts | No, each Netdata Agent has its own alerts configuration |
-| Centrally dispatched alert notifications | No, each Netdata agent sends notifications by itself |
+| Unified infrastructure dashboards for metrics | No, each Netdata Agent provides its own dashboard |
+| Unified infrastructure dashboards for logs | No, each Netdata Agent exposes its own logs |
+| Centrally dispatched alert notifications | No, each Netdata Agent sends notifications by itself |
| Data are exclusively on-prem | Yes |

-When using Standalone Netdata agents, each of them offers an API and a dashboard, at its own unique URL, that looks like `http://agent-ip:19999`.
+When using Standalone Netdata Agents, each of them offers an API and a dashboard, at its own unique URL, that looks like `http://agent-ip:19999`.
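+
+As a quick check that an Agent is up and serving its API, you can query it directly (a minimal sketch; `agent-ip` is a placeholder for your node's address):
+
+```bash
+# Ask a standalone Agent for its basic build and host information
+curl "http://agent-ip:19999/api/v1/info"
+```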
-So, each of the Netdata agents has to be accessed individually and independently of the others:
+So, each of the Netdata Agents has to be accessed individually and independently of the others:

```mermaid
flowchart LR
@@ -37,7 +37,7 @@ flowchart LR
    WEB -->|URL N| SN
```

-The same is true for alert notifications. Each of the Netdata agents runs its own alerts and sends notifications by itself, according to its configuration:
+The same is true for alert notifications. Each of the Netdata Agents runs its own alerts and sends notifications by itself, according to its configuration:

```mermaid
flowchart LR
@@ -61,23 +61,23 @@ flowchart LR
    S1 & S2 & SN ==> OTHER
```

-### Configuration steps for standalone Netdata agents without Netdata Cloud
+### Configuration steps for standalone Netdata Agents without Netdata Cloud

No special configuration needed.

-- Install Netdata agents on all your systems, then access each of them via its own unique URL, that looks like `http://agent-ip:19999/`.
+- Install Netdata Agents on all your systems, then access each of them via its own unique URL, that looks like `http://agent-ip:19999/`.

-## Standalone agents, with Netdata Cloud
+## Standalone Agents, with Netdata Cloud

| Feature | How it works |
|:---------------------------------------------:|:-----------------------------------------------------------------------------------------------------------------------------------------------------------------------:|
| Unified infrastructure dashboards for metrics | Yes, via Netdata Cloud, all charts aggregate metrics from all servers. |
| Unified infrastructure dashboards for logs | All logs are accessible via the same dashboard at Netdata Cloud, although they are not unified (i.e. logs from different servers are not multiplexed into a single view) |
-| Centrally configured alerts | No, each Netdata has its own alerts configuration |
+| Centrally configured alerts | No, each Netdata Agent has its own alerts configuration |
| Centrally dispatched alert notifications | Yes, via Netdata Cloud |
| Data are exclusively on-prem | Yes, Netdata Cloud queries Netdata Agents to satisfy dashboard queries. |

-By [connecting all Netdata agents to Netdata Cloud](/src/claim/README.md), you can have a unified infrastructure view of all your nodes, with aggregated charts, without configuring [observability centralization points](/docs/observability-centralization-points/README.md).
+By [connecting all Netdata Agents to Netdata Cloud](/src/claim/README.md), you can have a unified infrastructure view of all your nodes, with aggregated charts, without configuring [observability centralization points](/docs/observability-centralization-points/README.md).
```mermaid
flowchart LR
@@ -85,7 +85,7 @@ flowchart LR
    dashboard for all nodes"]]
    NC(["Netdata Cloud
-    decides which agents
+    decides which Agents
    need to be queried"])
    S1["Standalone Netdata
@@ -100,7 +100,7 @@ flowchart LR
    NC -->|queries| S1 & S2 & SN
```

-Similarly for alerts, Netdata Cloud receives all alert transitions from all agents, decides which notifications should be sent and how, applies silencing rules, maintenance windows and based on each Netdata Cloud space and user settings, dispatches notifications:
+Similarly for alerts, Netdata Cloud receives all alert transitions from all Agents, decides which notifications should be sent and how, applies silencing rules and maintenance windows, and, based on each Netdata Cloud space and user settings, dispatches notifications:

```mermaid
flowchart LR
@@ -128,12 +128,14 @@ flowchart LR
    S1 & S2 & SN -->|alert transition| NC
```

-> Note that alerts are still triggered by Netdata agents. Netdata Cloud takes care of the notifications only.
+> **Note**
+>
+> Alerts are still triggered by Netdata Agents. Netdata Cloud only takes care of the notifications.

-### Configuration steps for standalone Netdata agents with Netdata Cloud
+### Configuration steps for standalone Netdata Agents with Netdata Cloud

-- Install Netdata agents using the commands given by Netdata Cloud, so that they will be automatically added to your Netdata Cloud space. Otherwise, install Netdata agents and then claim them via the command line or their dashboard.
+- Install Netdata Agents using the commands given by Netdata Cloud, so that they will be automatically connected to your Netdata Cloud space. Otherwise, install Netdata Agents and then claim them via the command line or their dashboard.
- Optionally: disable their direct dashboard access to secure them.
-- Optionally: disable their alert notifications to avoid receiving email notifications directly from them (email notifications are automatically enabled when a working MTA is found on the systems Netdata agents are installed).
+- Optionally: disable their alert notifications to avoid receiving email notifications directly from them (email notifications are automatically enabled when a working MTA is found on the systems where Netdata Agents are installed).
diff --git a/docs/developer-and-contributor-corner/lamp-stack.md b/docs/developer-and-contributor-corner/lamp-stack.md
index da2d3c95a8129e..bdec9e75030522 100644
--- a/docs/developer-and-contributor-corner/lamp-stack.md
+++ b/docs/developer-and-contributor-corner/lamp-stack.md
@@ -61,7 +61,7 @@ replacing `NODE` with the hostname or IP address of your system.

## Enable hardware and Linux system monitoring

-There's nothing you need to do to enable [system monitoring](/docs/collecting-metrics/system-metrics.md) and Linux monitoring with
+There's nothing you need to do to enable system monitoring and Linux monitoring with
the Netdata Agent, which autodetects metrics from CPUs, memory, disks, networking devices, and Linux processes like systemd
without any configuration. If you're using containers, Netdata automatically collects resource utilization metrics from each using the [cgroups data collector](/src/collectors/cgroups.plugin/README.md).
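+
+To confirm these collectors are running, one option is to query the Agent's API for a chart they create (a sketch; `system.cpu` is collected by `proc.plugin`, and `NODE` is the same placeholder used earlier in this guide):
+
+```bash
+# Fetch the definition of the system.cpu chart from the local Agent
+curl "http://NODE:19999/api/v1/chart?chart=system.cpu"
+```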
diff --git a/docs/metric-correlations.md b/docs/metric-correlations.md
index ed0f10a8f7461b..46da43bc63ae4d 100644
--- a/docs/metric-correlations.md
+++ b/docs/metric-correlations.md
@@ -10,7 +10,7 @@ Because Metric Correlations uses every available metric from your infrastructure

When viewing the [Metrics tab or a single-node dashboard](/docs/dashboards-and-charts/metrics-tab-and-single-node-tabs.md), the **Metric Correlations** button appears in the top right corner of the page.

-To start correlating metrics, click the **Metric Correlations** button, hold the `Alt` key (or `⌘` on macOS), and drag a selection of metrics on a single chart. The selected timeframe needs at least 15 seconds for Metric Correlation to work.
+To start correlating metrics, click the **Metric Correlations** button, then [highlight a selection of metrics](/docs/dashboards-and-charts/netdata-charts.md#highlight) on a single chart. The selected timeframe needs at least 15 seconds for Metric Correlation to work.

The menu then displays information about the selected area and reference baseline. Metric Correlations uses the reference baseline to discover which additional metrics are most closely connected to the selected metrics. The reference baseline is based upon the period immediately preceding the highlighted window and is the length of 4 times the highlighted window. This is to ensure that the reference baseline is always immediately before the highlighted window of interest and a bit longer so as to ensure it's a more representative short term baseline.
diff --git a/docs/netdata-cloud/monitor-your-infrastructure.md b/docs/netdata-cloud/monitor-your-infrastructure.md
index 7356f4331d57e8..2b741d0e37383f 100644
--- a/docs/netdata-cloud/monitor-your-infrastructure.md
+++ b/docs/netdata-cloud/monitor-your-infrastructure.md
@@ -143,17 +143,11 @@ After you've learned the basics, you should [secure your infrastructure's nodes]
one of our recommended methods. These security best practices ensure no untrusted parties gain access to the metrics collected on any of your nodes.

-### Collect metrics from systems and applications
+### Collect metrics from anywhere

-Netdata has [300+ pre-installed collectors](/src/collectors/COLLECTORS.md) that gather thousands of metrics with zero
-configuration. Collectors search each of your nodes in default locations and ports to find running applications and
-gather as many metrics as they can without you having to configure them individually.
+Netdata has [300+ pre-installed collectors](/src/collectors/COLLECTORS.md) that gather thousands of metrics with zero configuration. Collectors search each of your nodes in default locations and ports to find running applications and gather as many metrics as they can without you having to configure them individually.

-Most collectors work without configuration, should you want more info, you can read more on [how Netdata's metrics collectors work](/src/collectors/README.md) and the [Collectors configuration reference](/src/collectors/REFERENCE.md) documentation.
-
-In addition, find detailed information about which [system](/docs/collecting-metrics/system-metrics.md),
-[container](/docs/collecting-metrics/container-metrics.md), and [application](/docs/collecting-metrics/application-metrics.md) metrics you can
-collect from across your infrastructure with Netdata.
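+If a collector doesn't pick up an application automatically, you can run its module in debug mode to see why (a sketch using the go.d `nginx` module as an example; the plugin path may differ on your system):
+
+```bash
+# Run a single go.d module in debug mode, as the netdata user
+cd /usr/libexec/netdata/plugins.d/
+sudo -u netdata ./go.d.plugin -d -m nginx
+```
+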
+Check our [comprehensive integrations section](/src/collectors/COLLECTORS.md) to find information about anything you want to monitor with Netdata.

## Netdata Cloud features

diff --git a/integrations/integrations.js b/integrations/integrations.js
index c219dcc1d5833f..857f14f1d54ab3 100644
--- a/integrations/integrations.js
+++ b/integrations/integrations.js
@@ -3431,11 +3431,11 @@ export const integrations = [
        },
        "most_popular": false
    },
-    "overview": "# ClickHouse\n\nPlugin: go.d.plugin\nModule: clickhouse\n\n## Overview\n\nThis collector retrieves performance data from ClickHouse for connections, queries, resources, replication, IO, and data operations (inserts, selects, merges) using HTTP requests and ClickHouse system tables. It monitors your ClickHouse server's health and activity.\n\n\nIt sends HTTP requests to the ClickHouse [HTTP interface](https://clickhouse.com/docs/en/interfaces/http), executing SELECT queries to retrieve data from various system tables.\nSpecifically, it collects metrics from the following tables:\n\n- system.metrics\n- systemd.async_metrics\n- system.events\n- system.disks\n- system.parts\n\n\nThis collector is supported on all platforms.\n\nThis collector supports collecting metrics from multiple instances of this integration, including remote instances.\n\n\n### Default Behavior\n\n#### Auto-Detection\n\nBy default, it detects ClickHouse instances running on localhost that are listening on port 8123.\nOn startup, it tries to collect metrics from:\n\n- http://127.0.0.1:8123\n\n\n#### Limits\n\nThe default configuration for this integration does not impose any limits on data collection.\n\n#### Performance Impact\n\nThe default configuration for this integration is not expected to impose a significant performance impact on the system.\n",
+    "overview": "# ClickHouse\n\nPlugin: go.d.plugin\nModule: clickhouse\n\n## Overview\n\nThis collector retrieves performance data from ClickHouse for connections, queries, resources, replication, IO, and data operations (inserts, selects, merges) using HTTP requests and ClickHouse system tables.
It monitors your ClickHouse server's health and activity.\n\n\nIt sends HTTP requests to the ClickHouse [HTTP interface](https://clickhouse.com/docs/en/interfaces/http), executing SELECT queries to retrieve data from various system tables.\nSpecifically, it collects metrics from the following tables:\n\n- system.metrics\n- system.async_metrics\n- system.events\n- system.disks\n- system.parts\n- system.processes\n\n\nThis collector is supported on all platforms.\n\nThis collector supports collecting metrics from multiple instances of this integration, including remote instances.\n\n\n### Default Behavior\n\n#### Auto-Detection\n\nBy default, it detects ClickHouse instances running on localhost that are listening on port 8123.\nOn startup, it tries to collect metrics from:\n\n- http://127.0.0.1:8123\n\n\n#### Limits\n\nThe default configuration for this integration does not impose any limits on data collection.\n\n#### Performance Impact\n\nThe default configuration for this integration is not expected to impose a significant performance impact on the system.\n", "setup": "## Setup\n\n### Prerequisites\n\nNo action required.\n\n### Configuration\n\n#### File\n\nThe configuration file name for this integration is `go.d/clickhouse.conf`.\n\n\nYou can edit the configuration file using the `edit-config` script from the\nNetdata [config directory](/docs/netdata-agent/configuration/README.md#the-netdata-config-directory).\n\n```bash\ncd /etc/netdata 2>/dev/null || cd /opt/netdata/etc/netdata\nsudo ./edit-config go.d/clickhouse.conf\n```\n#### Options\n\nThe following options can be defined globally: update_every, autodetection_retry.\n\n\n{% details summary=\"Config options\" %}\n| Name | Description | Default | Required |\n|:----|:-----------|:-------|:--------:|\n| update_every | Data collection frequency. | 1 | no |\n| autodetection_retry | Recheck interval in seconds. Zero means no recheck will be scheduled. | 0 | no |\n| url | Server URL. | http://127.0.0.1:8123 | yes |\n| timeout | HTTP request timeout. | 1 | no |\n| username | Username for basic HTTP authentication. | | no |\n| password | Password for basic HTTP authentication. | | no |\n| proxy_url | Proxy URL. | | no |\n| proxy_username | Username for proxy basic HTTP authentication. | | no |\n| proxy_password | Password for proxy basic HTTP authentication. | | no |\n| method | HTTP request method. | GET | no |\n| body | HTTP request body. | | no |\n| headers | HTTP request headers. | | no |\n| not_follow_redirects | Redirect handling policy. Controls whether the client follows redirects. | no | no |\n| tls_skip_verify | Server certificate chain and hostname validation policy. Controls whether the client performs this check. | no | no |\n| tls_ca | Certification authority that the client uses when verifying the server's certificates. | | no |\n| tls_cert | Client TLS certificate. | | no |\n| tls_key | Client TLS key. 
| | no |\n\n{% /details %}\n#### Examples\n\n##### Basic\n\nA basic example configuration.\n\n```yaml\njobs:\n - name: local\n url: http://127.0.0.1:8123\n\n```\n##### HTTP authentication\n\nBasic HTTP authentication.\n\n{% details summary=\"Config\" %}\n```yaml\njobs:\n - name: local\n url: http://127.0.0.1:8123\n username: username\n password: password\n\n```\n{% /details %}\n##### HTTPS with self-signed certificate\n\nClickHouse with enabled HTTPS and self-signed certificate.\n\n{% details summary=\"Config\" %}\n```yaml\njobs:\n - name: local\n url: https://127.0.0.1:8123\n tls_skip_verify: yes\n\n```\n{% /details %}\n##### Multi-instance\n\n> **Note**: When you define multiple jobs, their names must be unique.\n\nCollecting metrics from local and remote instances.\n\n\n{% details summary=\"Config\" %}\n```yaml\njobs:\n - name: local\n url: http://127.0.0.1:8123\n\n - name: remote\n url: http://192.0.2.1:8123\n\n```\n{% /details %}\n", "troubleshooting": "## Troubleshooting\n\n### Debug Mode\n\nTo troubleshoot issues with the `clickhouse` collector, run the `go.d.plugin` with the debug option enabled. The output\nshould give you clues as to why the collector isn't working.\n\n- Navigate to the `plugins.d` directory, usually at `/usr/libexec/netdata/plugins.d/`. If that's not the case on\n your system, open `netdata.conf` and look for the `plugins` setting under `[directories]`.\n\n ```bash\n cd /usr/libexec/netdata/plugins.d/\n ```\n\n- Switch to the `netdata` user.\n\n ```bash\n sudo -u netdata -s\n ```\n\n- Run the `go.d.plugin` to debug the collector:\n\n ```bash\n ./go.d.plugin -d -m clickhouse\n ```\n\n", - "alerts": "## Alerts\n\nThere are no alerts configured by default for this integration.\n", - "metrics": "## Metrics\n\nMetrics grouped by *scope*.\n\nThe scope defines the instance that the metric belongs to. 
An instance is uniquely identified by a set of labels.\n\n\n\n### Per ClickHouse instance\n\nThese metrics refer to the entire monitored application.\n\nThis scope has no labels.\n\nMetrics:\n\n| Metric | Dimensions | Unit |\n|:------|:----------|:----|\n| clickhouse.connections | tcp, http, mysql, postgresql, interserver | connections |\n| clickhouse.slow_reads | slow | reads/s |\n| clickhouse.read_backoff | read_backoff | events/s |\n| clickhouse.memory_usage | used | bytes |\n| clickhouse.queries | successful, failed | queries/s |\n| clickhouse.select_queries | successful, failed | selects/s |\n| clickhouse.insert_queries | successful, failed | inserts/s |\n| clickhouse.queries_preempted | preempted | queries |\n| clickhouse.queries_memory_limit_exceeded | mem_limit_exceeded | queries/s |\n| clickhouse.queries_latency | queries_time | microseconds |\n| clickhouse.select_queries_latency | selects_time | microseconds |\n| clickhouse.insert_queries_latency | inserts_time | microseconds |\n| clickhouse.io | reads, writes | bytes/s |\n| clickhouse.iops | reads, writes | ops/s |\n| clickhouse.io_errors | read, write | errors/s |\n| clickhouse.io_seeks | lseek | ops/s |\n| clickhouse.io_file_opens | file_open | ops/s |\n| clickhouse.replicated_parts_current_activity | fetch, send, check | parts |\n| clickhouse.replicated_readonly_tables | read_only | tables |\n| clickhouse.replicated_data_loss | data_loss | events |\n| clickhouse.replicated_part_fetches | successful, failed | fetches/s |\n| clickhouse.inserted_rows | inserted | rows/s |\n| clickhouse.inserted_bytes | inserted | bytes/s |\n| clickhouse.rejected_inserts | rejected | inserts/s |\n| clickhouse.delayed_inserts | delayed | inserts/s |\n| clickhouse.delayed_inserts_throttle_time | delayed_inserts_throttle_time | milliseconds |\n| clickhouse.selected_bytes | selected | bytes/s |\n| clickhouse.selected_rows | selected | rows/s |\n| clickhouse.selected_parts | selected | parts/s |\n| clickhouse.selected_ranges | selected | ranges/s |\n| clickhouse.selected_marks | selected | marks/s |\n| clickhouse.merges | merge | ops/s |\n| clickhouse.merges_latency | merges_time | milliseconds |\n| clickhouse.merged_uncompressed_bytes | merged_uncompressed | bytes/s |\n| clickhouse.merged_rows | merged | rows/s |\n| clickhouse.merge_tree_data_writer_inserted_rows | inserted | rows/s |\n| clickhouse.merge_tree_data_writer_uncompressed_bytes | inserted | bytes/s |\n| clickhouse.merge_tree_data_writer_compressed_bytes | written | bytes/s |\n| clickhouse.uncompressed_cache_requests | hits, misses | requests/s |\n| clickhouse.mark_cache_requests | hits, misses | requests/s |\n| clickhouse.parts_count | temporary, pre_active, active, deleting, delete_on_destroy, outdated, wide, compact | parts |\n| distributed_connections | active | connections |\n| distributed_connections_attempts | connection | attempts/s |\n| distributed_connections_fail_retries | connection_retry | fails/s |\n| distributed_connections_fail_exhausted_retries | connection_retry_exhausted | fails/s |\n| distributed_files_to_insert | pending_insertions | files |\n| distributed_rejected_inserts | rejected | inserts/s |\n| distributed_delayed_inserts | delayed | inserts/s |\n| distributed_delayed_inserts_latency | delayed_time | milliseconds |\n| distributed_sync_insertion_timeout_exceeded | sync_insertion | timeouts/s |\n| distributed_async_insertions_failures | async_insertions | failures/s |\n| clickhouse.uptime | uptime | seconds |\n\n### Per disk\n\nThese metrics refer to the 
Disk.\n\nLabels:\n\n| Label | Description |\n|:-----------|:----------------|\n| disk_name | Name of the disk as defined in the [server configuration](https://clickhouse.com/docs/en/engines/table-engines/mergetree-family/mergetree#table_engine-mergetree-multiple-volumes_configure). |\n\nMetrics:\n\n| Metric | Dimensions | Unit |\n|:------|:----------|:----|\n| clickhouse.disk_space_usage | free, used | bytes |\n\n### Per table\n\nThese metrics refer to the Database Table.\n\nLabels:\n\n| Label | Description |\n|:-----------|:----------------|\n| database | Name of the database. |\n| table | Name of the table. |\n\nMetrics:\n\n| Metric | Dimensions | Unit |\n|:------|:----------|:----|\n| clickhouse.database_table_size | size | bytes |\n| clickhouse.database_table_parts | parts | parts |\n| clickhouse.database_table_parts | parts | parts |\n| clickhouse.database_table_rows | rows | rows |\n\n", + "alerts": "## Alerts\n\n\nThe following alerts are available:\n\n| Alert name | On metric | Description |\n|:------------|:----------|:------------|\n| [ clickhouse_restarted ](https://github.com/netdata/netdata/blob/master/src/health/health.d/clickhouse.conf) | clickhouse.uptime | ClickHouse has recently been restarted |\n| [ clickhouse_queries_preempted ](https://github.com/netdata/netdata/blob/master/src/health/health.d/clickhouse.conf) | clickhouse.queries_preempted | ClickHouse has queries that are stopped and waiting due to priority setting |\n| [ clickhouse_long_running_query ](https://github.com/netdata/netdata/blob/master/src/health/health.d/clickhouse.conf) | clickhouse.longest_running_query_time | ClickHouse has a long-running query exceeding the threshold |\n| [ clickhouse_rejected_inserts ](https://github.com/netdata/netdata/blob/master/src/health/health.d/clickhouse.conf) | clickhouse.rejected_inserts | ClickHouse has INSERT queries that are rejected due to high number of active data parts for partition in a MergeTree |\n| [ clickhouse_delayed_inserts ](https://github.com/netdata/netdata/blob/master/src/health/health.d/clickhouse.conf) | clickhouse.delayed_inserts | ClickHouse has INSERT queries that are throttled due to high number of active data parts for partition in a MergeTree |\n| [ clickhouse_replication_lag ](https://github.com/netdata/netdata/blob/master/src/health/health.d/clickhouse.conf) | clickhouse.replicas_max_absolute_delay | ClickHouse is experiencing replication lag greater than 5 minutes |\n| [ clickhouse_replicated_readonly_tables ](https://github.com/netdata/netdata/blob/master/src/health/health.d/clickhouse.conf) | clickhouse.replicated_readonly_tables | ClickHouse has replicated tables in readonly state due to ZooKeeper session loss/startup without ZooKeeper configured |\n| [ clickhouse_max_part_count_for_partition ](https://github.com/netdata/netdata/blob/master/src/health/health.d/clickhouse.conf) | clickhouse.max_part_count_for_partition | ClickHouse high number of parts per partition |\n| [ clickhouse_distributed_connections_failures ](https://github.com/netdata/netdata/blob/master/src/health/health.d/clickhouse.conf) | clickhouse.distributed_connections_fail_exhausted_retries | ClickHouse has failed distributed connections after exhausting all retry attempts |\n| [ clickhouse_distributed_files_to_insert ](https://github.com/netdata/netdata/blob/master/src/health/health.d/clickhouse.conf) | clickhouse.distributed_files_to_insert | ClickHouse high number of pending files to process for asynchronous insertion into Distributed tables |\n", + "metrics": "## 
Metrics\n\nMetrics grouped by *scope*.\n\nThe scope defines the instance that the metric belongs to. An instance is uniquely identified by a set of labels.\n\n\n\n### Per ClickHouse instance\n\nThese metrics refer to the entire monitored application.\n\nThis scope has no labels.\n\nMetrics:\n\n| Metric | Dimensions | Unit |\n|:------|:----------|:----|\n| clickhouse.connections | tcp, http, mysql, postgresql, interserver | connections |\n| clickhouse.slow_reads | slow | reads/s |\n| clickhouse.read_backoff | read_backoff | events/s |\n| clickhouse.memory_usage | used | bytes |\n| clickhouse.running_queries | running | queries |\n| clickhouse.queries_preempted | preempted | queries |\n| clickhouse.queries | successful, failed | queries/s |\n| clickhouse.select_queries | successful, failed | selects/s |\n| clickhouse.insert_queries | successful, failed | inserts/s |\n| clickhouse.queries_memory_limit_exceeded | mem_limit_exceeded | queries/s |\n| clickhouse.longest_running_query_time | longest_query_time | seconds |\n| clickhouse.queries_latency | queries_time | microseconds |\n| clickhouse.select_queries_latency | selects_time | microseconds |\n| clickhouse.insert_queries_latency | inserts_time | microseconds |\n| clickhouse.io | reads, writes | bytes/s |\n| clickhouse.iops | reads, writes | ops/s |\n| clickhouse.io_errors | read, write | errors/s |\n| clickhouse.io_seeks | lseek | ops/s |\n| clickhouse.io_file_opens | file_open | ops/s |\n| clickhouse.replicated_parts_current_activity | fetch, send, check | parts |\n| clickhouse.replicas_max_absolute_delay | replication_delay | seconds |\n| clickhouse.replicated_readonly_tables | read_only | tables |\n| clickhouse.replicated_data_loss | data_loss | events |\n| clickhouse.replicated_part_fetches | successful, failed | fetches/s |\n| clickhouse.inserted_rows | inserted | rows/s |\n| clickhouse.inserted_bytes | inserted | bytes/s |\n| clickhouse.rejected_inserts | rejected | inserts/s |\n| clickhouse.delayed_inserts | delayed | inserts/s |\n| clickhouse.delayed_inserts_throttle_time | delayed_inserts_throttle_time | milliseconds |\n| clickhouse.selected_bytes | selected | bytes/s |\n| clickhouse.selected_rows | selected | rows/s |\n| clickhouse.selected_parts | selected | parts/s |\n| clickhouse.selected_ranges | selected | ranges/s |\n| clickhouse.selected_marks | selected | marks/s |\n| clickhouse.merges | merge | ops/s |\n| clickhouse.merges_latency | merges_time | milliseconds |\n| clickhouse.merged_uncompressed_bytes | merged_uncompressed | bytes/s |\n| clickhouse.merged_rows | merged | rows/s |\n| clickhouse.merge_tree_data_writer_inserted_rows | inserted | rows/s |\n| clickhouse.merge_tree_data_writer_uncompressed_bytes | inserted | bytes/s |\n| clickhouse.merge_tree_data_writer_compressed_bytes | written | bytes/s |\n| clickhouse.uncompressed_cache_requests | hits, misses | requests/s |\n| clickhouse.mark_cache_requests | hits, misses | requests/s |\n| clickhouse.max_part_count_for_partition | max_parts_partition | parts |\n| clickhouse.parts_count | temporary, pre_active, active, deleting, delete_on_destroy, outdated, wide, compact | parts |\n| distributed_connections | active | connections |\n| distributed_connections_attempts | connection | attempts/s |\n| distributed_connections_fail_retries | connection_retry | fails/s |\n| distributed_connections_fail_exhausted_retries | connection_retry_exhausted | fails/s |\n| distributed_files_to_insert | pending_insertions | files |\n| distributed_rejected_inserts | rejected | inserts/s |\n| 
distributed_delayed_inserts | delayed | inserts/s |\n| distributed_delayed_inserts_latency | delayed_time | milliseconds |\n| distributed_sync_insertion_timeout_exceeded | sync_insertion | timeouts/s |\n| distributed_async_insertions_failures | async_insertions | failures/s |\n| clickhouse.uptime | uptime | seconds |\n\n### Per disk\n\nThese metrics refer to the Disk.\n\nLabels:\n\n| Label | Description |\n|:-----------|:----------------|\n| disk_name | Name of the disk as defined in the [server configuration](https://clickhouse.com/docs/en/engines/table-engines/mergetree-family/mergetree#table_engine-mergetree-multiple-volumes_configure). |\n\nMetrics:\n\n| Metric | Dimensions | Unit |\n|:------|:----------|:----|\n| clickhouse.disk_space_usage | free, used | bytes |\n\n### Per table\n\nThese metrics refer to the Database Table.\n\nLabels:\n\n| Label | Description |\n|:-----------|:----------------|\n| database | Name of the database. |\n| table | Name of the table. |\n\nMetrics:\n\n| Metric | Dimensions | Unit |\n|:------|:----------|:----|\n| clickhouse.database_table_size | size | bytes |\n| clickhouse.database_table_parts | parts | parts |\n| clickhouse.database_table_rows | rows | rows |\n\n", "integration_type": "collector", "id": "go.d.plugin-clickhouse-ClickHouse", "edit_link": "https://github.com/netdata/netdata/blob/master/src/go/collectors/go.d.plugin/modules/clickhouse/metadata.yaml", diff --git a/integrations/integrations.json b/integrations/integrations.json index 50f68316d5b82e..5db4ea0c6c1c56 100644 --- a/integrations/integrations.json +++ b/integrations/integrations.json @@ -3429,11 +3429,11 @@ }, "most_popular": false }, - "overview": "# ClickHouse\n\nPlugin: go.d.plugin\nModule: clickhouse\n\n## Overview\n\nThis collector retrieves performance data from ClickHouse for connections, queries, resources, replication, IO, and data operations (inserts, selects, merges) using HTTP requests and ClickHouse system tables. It monitors your ClickHouse server's health and activity.\n\n\nIt sends HTTP requests to the ClickHouse [HTTP interface](https://clickhouse.com/docs/en/interfaces/http), executing SELECT queries to retrieve data from various system tables.\nSpecifically, it collects metrics from the following tables:\n\n- system.metrics\n- systemd.async_metrics\n- system.events\n- system.disks\n- system.parts\n\n\nThis collector is supported on all platforms.\n\nThis collector supports collecting metrics from multiple instances of this integration, including remote instances.\n\n\n### Default Behavior\n\n#### Auto-Detection\n\nBy default, it detects ClickHouse instances running on localhost that are listening on port 8123.\nOn startup, it tries to collect metrics from:\n\n- http://127.0.0.1:8123\n\n\n#### Limits\n\nThe default configuration for this integration does not impose any limits on data collection.\n\n#### Performance Impact\n\nThe default configuration for this integration is not expected to impose a significant performance impact on the system.\n", + "overview": "# ClickHouse\n\nPlugin: go.d.plugin\nModule: clickhouse\n\n## Overview\n\nThis collector retrieves performance data from ClickHouse for connections, queries, resources, replication, IO, and data operations (inserts, selects, merges) using HTTP requests and ClickHouse system tables. 
It monitors your ClickHouse server's health and activity.\n\n\nIt sends HTTP requests to the ClickHouse [HTTP interface](https://clickhouse.com/docs/en/interfaces/http), executing SELECT queries to retrieve data from various system tables.\nSpecifically, it collects metrics from the following tables:\n\n- system.metrics\n- system.async_metrics\n- system.events\n- system.disks\n- system.parts\n- system.processes\n\n\nThis collector is supported on all platforms.\n\nThis collector supports collecting metrics from multiple instances of this integration, including remote instances.\n\n\n### Default Behavior\n\n#### Auto-Detection\n\nBy default, it detects ClickHouse instances running on localhost that are listening on port 8123.\nOn startup, it tries to collect metrics from:\n\n- http://127.0.0.1:8123\n\n\n#### Limits\n\nThe default configuration for this integration does not impose any limits on data collection.\n\n#### Performance Impact\n\nThe default configuration for this integration is not expected to impose a significant performance impact on the system.\n", "setup": "## Setup\n\n### Prerequisites\n\nNo action required.\n\n### Configuration\n\n#### File\n\nThe configuration file name for this integration is `go.d/clickhouse.conf`.\n\n\nYou can edit the configuration file using the `edit-config` script from the\nNetdata [config directory](/docs/netdata-agent/configuration/README.md#the-netdata-config-directory).\n\n```bash\ncd /etc/netdata 2>/dev/null || cd /opt/netdata/etc/netdata\nsudo ./edit-config go.d/clickhouse.conf\n```\n#### Options\n\nThe following options can be defined globally: update_every, autodetection_retry.\n\n\n| Name | Description | Default | Required |\n|:----|:-----------|:-------|:--------:|\n| update_every | Data collection frequency. | 1 | no |\n| autodetection_retry | Recheck interval in seconds. Zero means no recheck will be scheduled. | 0 | no |\n| url | Server URL. | http://127.0.0.1:8123 | yes |\n| timeout | HTTP request timeout. | 1 | no |\n| username | Username for basic HTTP authentication. | | no |\n| password | Password for basic HTTP authentication. | | no |\n| proxy_url | Proxy URL. | | no |\n| proxy_username | Username for proxy basic HTTP authentication. | | no |\n| proxy_password | Password for proxy basic HTTP authentication. | | no |\n| method | HTTP request method. | GET | no |\n| body | HTTP request body. | | no |\n| headers | HTTP request headers. | | no |\n| not_follow_redirects | Redirect handling policy. Controls whether the client follows redirects. | no | no |\n| tls_skip_verify | Server certificate chain and hostname validation policy. Controls whether the client performs this check. | no | no |\n| tls_ca | Certification authority that the client uses when verifying the server's certificates. | | no |\n| tls_cert | Client TLS certificate. | | no |\n| tls_key | Client TLS key. 
| | no |\n\n#### Examples\n\n##### Basic\n\nA basic example configuration.\n\n```yaml\njobs:\n - name: local\n url: http://127.0.0.1:8123\n\n```\n##### HTTP authentication\n\nBasic HTTP authentication.\n\n```yaml\njobs:\n - name: local\n url: http://127.0.0.1:8123\n username: username\n password: password\n\n```\n##### HTTPS with self-signed certificate\n\nClickHouse with enabled HTTPS and self-signed certificate.\n\n```yaml\njobs:\n - name: local\n url: https://127.0.0.1:8123\n tls_skip_verify: yes\n\n```\n##### Multi-instance\n\n> **Note**: When you define multiple jobs, their names must be unique.\n\nCollecting metrics from local and remote instances.\n\n\n```yaml\njobs:\n - name: local\n url: http://127.0.0.1:8123\n\n - name: remote\n url: http://192.0.2.1:8123\n\n```\n", "troubleshooting": "## Troubleshooting\n\n### Debug Mode\n\nTo troubleshoot issues with the `clickhouse` collector, run the `go.d.plugin` with the debug option enabled. The output\nshould give you clues as to why the collector isn't working.\n\n- Navigate to the `plugins.d` directory, usually at `/usr/libexec/netdata/plugins.d/`. If that's not the case on\n your system, open `netdata.conf` and look for the `plugins` setting under `[directories]`.\n\n ```bash\n cd /usr/libexec/netdata/plugins.d/\n ```\n\n- Switch to the `netdata` user.\n\n ```bash\n sudo -u netdata -s\n ```\n\n- Run the `go.d.plugin` to debug the collector:\n\n ```bash\n ./go.d.plugin -d -m clickhouse\n ```\n\n", - "alerts": "## Alerts\n\nThere are no alerts configured by default for this integration.\n", - "metrics": "## Metrics\n\nMetrics grouped by *scope*.\n\nThe scope defines the instance that the metric belongs to. An instance is uniquely identified by a set of labels.\n\n\n\n### Per ClickHouse instance\n\nThese metrics refer to the entire monitored application.\n\nThis scope has no labels.\n\nMetrics:\n\n| Metric | Dimensions | Unit |\n|:------|:----------|:----|\n| clickhouse.connections | tcp, http, mysql, postgresql, interserver | connections |\n| clickhouse.slow_reads | slow | reads/s |\n| clickhouse.read_backoff | read_backoff | events/s |\n| clickhouse.memory_usage | used | bytes |\n| clickhouse.queries | successful, failed | queries/s |\n| clickhouse.select_queries | successful, failed | selects/s |\n| clickhouse.insert_queries | successful, failed | inserts/s |\n| clickhouse.queries_preempted | preempted | queries |\n| clickhouse.queries_memory_limit_exceeded | mem_limit_exceeded | queries/s |\n| clickhouse.queries_latency | queries_time | microseconds |\n| clickhouse.select_queries_latency | selects_time | microseconds |\n| clickhouse.insert_queries_latency | inserts_time | microseconds |\n| clickhouse.io | reads, writes | bytes/s |\n| clickhouse.iops | reads, writes | ops/s |\n| clickhouse.io_errors | read, write | errors/s |\n| clickhouse.io_seeks | lseek | ops/s |\n| clickhouse.io_file_opens | file_open | ops/s |\n| clickhouse.replicated_parts_current_activity | fetch, send, check | parts |\n| clickhouse.replicated_readonly_tables | read_only | tables |\n| clickhouse.replicated_data_loss | data_loss | events |\n| clickhouse.replicated_part_fetches | successful, failed | fetches/s |\n| clickhouse.inserted_rows | inserted | rows/s |\n| clickhouse.inserted_bytes | inserted | bytes/s |\n| clickhouse.rejected_inserts | rejected | inserts/s |\n| clickhouse.delayed_inserts | delayed | inserts/s |\n| clickhouse.delayed_inserts_throttle_time | delayed_inserts_throttle_time | milliseconds |\n| clickhouse.selected_bytes | selected | bytes/s 
|\n| clickhouse.selected_rows | selected | rows/s |\n| clickhouse.selected_parts | selected | parts/s |\n| clickhouse.selected_ranges | selected | ranges/s |\n| clickhouse.selected_marks | selected | marks/s |\n| clickhouse.merges | merge | ops/s |\n| clickhouse.merges_latency | merges_time | milliseconds |\n| clickhouse.merged_uncompressed_bytes | merged_uncompressed | bytes/s |\n| clickhouse.merged_rows | merged | rows/s |\n| clickhouse.merge_tree_data_writer_inserted_rows | inserted | rows/s |\n| clickhouse.merge_tree_data_writer_uncompressed_bytes | inserted | bytes/s |\n| clickhouse.merge_tree_data_writer_compressed_bytes | written | bytes/s |\n| clickhouse.uncompressed_cache_requests | hits, misses | requests/s |\n| clickhouse.mark_cache_requests | hits, misses | requests/s |\n| clickhouse.parts_count | temporary, pre_active, active, deleting, delete_on_destroy, outdated, wide, compact | parts |\n| distributed_connections | active | connections |\n| distributed_connections_attempts | connection | attempts/s |\n| distributed_connections_fail_retries | connection_retry | fails/s |\n| distributed_connections_fail_exhausted_retries | connection_retry_exhausted | fails/s |\n| distributed_files_to_insert | pending_insertions | files |\n| distributed_rejected_inserts | rejected | inserts/s |\n| distributed_delayed_inserts | delayed | inserts/s |\n| distributed_delayed_inserts_latency | delayed_time | milliseconds |\n| distributed_sync_insertion_timeout_exceeded | sync_insertion | timeouts/s |\n| distributed_async_insertions_failures | async_insertions | failures/s |\n| clickhouse.uptime | uptime | seconds |\n\n### Per disk\n\nThese metrics refer to the Disk.\n\nLabels:\n\n| Label | Description |\n|:-----------|:----------------|\n| disk_name | Name of the disk as defined in the [server configuration](https://clickhouse.com/docs/en/engines/table-engines/mergetree-family/mergetree#table_engine-mergetree-multiple-volumes_configure). |\n\nMetrics:\n\n| Metric | Dimensions | Unit |\n|:------|:----------|:----|\n| clickhouse.disk_space_usage | free, used | bytes |\n\n### Per table\n\nThese metrics refer to the Database Table.\n\nLabels:\n\n| Label | Description |\n|:-----------|:----------------|\n| database | Name of the database. |\n| table | Name of the table. 
|\n\nMetrics:\n\n| Metric | Dimensions | Unit |\n|:------|:----------|:----|\n| clickhouse.database_table_size | size | bytes |\n| clickhouse.database_table_parts | parts | parts |\n| clickhouse.database_table_parts | parts | parts |\n| clickhouse.database_table_rows | rows | rows |\n\n", + "alerts": "## Alerts\n\n\nThe following alerts are available:\n\n| Alert name | On metric | Description |\n|:------------|:----------|:------------|\n| [ clickhouse_restarted ](https://github.com/netdata/netdata/blob/master/src/health/health.d/clickhouse.conf) | clickhouse.uptime | ClickHouse has recently been restarted |\n| [ clickhouse_queries_preempted ](https://github.com/netdata/netdata/blob/master/src/health/health.d/clickhouse.conf) | clickhouse.queries_preempted | ClickHouse has queries that are stopped and waiting due to priority setting |\n| [ clickhouse_long_running_query ](https://github.com/netdata/netdata/blob/master/src/health/health.d/clickhouse.conf) | clickhouse.longest_running_query_time | ClickHouse has a long-running query exceeding the threshold |\n| [ clickhouse_rejected_inserts ](https://github.com/netdata/netdata/blob/master/src/health/health.d/clickhouse.conf) | clickhouse.rejected_inserts | ClickHouse has INSERT queries that are rejected due to high number of active data parts for partition in a MergeTree |\n| [ clickhouse_delayed_inserts ](https://github.com/netdata/netdata/blob/master/src/health/health.d/clickhouse.conf) | clickhouse.delayed_inserts | ClickHouse has INSERT queries that are throttled due to high number of active data parts for partition in a MergeTree |\n| [ clickhouse_replication_lag ](https://github.com/netdata/netdata/blob/master/src/health/health.d/clickhouse.conf) | clickhouse.replicas_max_absolute_delay | ClickHouse is experiencing replication lag greater than 5 minutes |\n| [ clickhouse_replicated_readonly_tables ](https://github.com/netdata/netdata/blob/master/src/health/health.d/clickhouse.conf) | clickhouse.replicated_readonly_tables | ClickHouse has replicated tables in readonly state due to ZooKeeper session loss/startup without ZooKeeper configured |\n| [ clickhouse_max_part_count_for_partition ](https://github.com/netdata/netdata/blob/master/src/health/health.d/clickhouse.conf) | clickhouse.max_part_count_for_partition | ClickHouse high number of parts per partition |\n| [ clickhouse_distributed_connections_failures ](https://github.com/netdata/netdata/blob/master/src/health/health.d/clickhouse.conf) | clickhouse.distributed_connections_fail_exhausted_retries | ClickHouse has failed distributed connections after exhausting all retry attempts |\n| [ clickhouse_distributed_files_to_insert ](https://github.com/netdata/netdata/blob/master/src/health/health.d/clickhouse.conf) | clickhouse.distributed_files_to_insert | ClickHouse high number of pending files to process for asynchronous insertion into Distributed tables |\n", + "metrics": "## Metrics\n\nMetrics grouped by *scope*.\n\nThe scope defines the instance that the metric belongs to. 
An instance is uniquely identified by a set of labels.\n\n\n\n### Per ClickHouse instance\n\nThese metrics refer to the entire monitored application.\n\nThis scope has no labels.\n\nMetrics:\n\n| Metric | Dimensions | Unit |\n|:------|:----------|:----|\n| clickhouse.connections | tcp, http, mysql, postgresql, interserver | connections |\n| clickhouse.slow_reads | slow | reads/s |\n| clickhouse.read_backoff | read_backoff | events/s |\n| clickhouse.memory_usage | used | bytes |\n| clickhouse.running_queries | running | queries |\n| clickhouse.queries_preempted | preempted | queries |\n| clickhouse.queries | successful, failed | queries/s |\n| clickhouse.select_queries | successful, failed | selects/s |\n| clickhouse.insert_queries | successful, failed | inserts/s |\n| clickhouse.queries_memory_limit_exceeded | mem_limit_exceeded | queries/s |\n| clickhouse.longest_running_query_time | longest_query_time | seconds |\n| clickhouse.queries_latency | queries_time | microseconds |\n| clickhouse.select_queries_latency | selects_time | microseconds |\n| clickhouse.insert_queries_latency | inserts_time | microseconds |\n| clickhouse.io | reads, writes | bytes/s |\n| clickhouse.iops | reads, writes | ops/s |\n| clickhouse.io_errors | read, write | errors/s |\n| clickhouse.io_seeks | lseek | ops/s |\n| clickhouse.io_file_opens | file_open | ops/s |\n| clickhouse.replicated_parts_current_activity | fetch, send, check | parts |\n| clickhouse.replicas_max_absolute_delay | replication_delay | seconds |\n| clickhouse.replicated_readonly_tables | read_only | tables |\n| clickhouse.replicated_data_loss | data_loss | events |\n| clickhouse.replicated_part_fetches | successful, failed | fetches/s |\n| clickhouse.inserted_rows | inserted | rows/s |\n| clickhouse.inserted_bytes | inserted | bytes/s |\n| clickhouse.rejected_inserts | rejected | inserts/s |\n| clickhouse.delayed_inserts | delayed | inserts/s |\n| clickhouse.delayed_inserts_throttle_time | delayed_inserts_throttle_time | milliseconds |\n| clickhouse.selected_bytes | selected | bytes/s |\n| clickhouse.selected_rows | selected | rows/s |\n| clickhouse.selected_parts | selected | parts/s |\n| clickhouse.selected_ranges | selected | ranges/s |\n| clickhouse.selected_marks | selected | marks/s |\n| clickhouse.merges | merge | ops/s |\n| clickhouse.merges_latency | merges_time | milliseconds |\n| clickhouse.merged_uncompressed_bytes | merged_uncompressed | bytes/s |\n| clickhouse.merged_rows | merged | rows/s |\n| clickhouse.merge_tree_data_writer_inserted_rows | inserted | rows/s |\n| clickhouse.merge_tree_data_writer_uncompressed_bytes | inserted | bytes/s |\n| clickhouse.merge_tree_data_writer_compressed_bytes | written | bytes/s |\n| clickhouse.uncompressed_cache_requests | hits, misses | requests/s |\n| clickhouse.mark_cache_requests | hits, misses | requests/s |\n| clickhouse.max_part_count_for_partition | max_parts_partition | parts |\n| clickhouse.parts_count | temporary, pre_active, active, deleting, delete_on_destroy, outdated, wide, compact | parts |\n| distributed_connections | active | connections |\n| distributed_connections_attempts | connection | attempts/s |\n| distributed_connections_fail_retries | connection_retry | fails/s |\n| distributed_connections_fail_exhausted_retries | connection_retry_exhausted | fails/s |\n| distributed_files_to_insert | pending_insertions | files |\n| distributed_rejected_inserts | rejected | inserts/s |\n| distributed_delayed_inserts | delayed | inserts/s |\n| distributed_delayed_inserts_latency | 
delayed_time | milliseconds |\n| distributed_sync_insertion_timeout_exceeded | sync_insertion | timeouts/s |\n| distributed_async_insertions_failures | async_insertions | failures/s |\n| clickhouse.uptime | uptime | seconds |\n\n### Per disk\n\nThese metrics refer to the Disk.\n\nLabels:\n\n| Label | Description |\n|:-----------|:----------------|\n| disk_name | Name of the disk as defined in the [server configuration](https://clickhouse.com/docs/en/engines/table-engines/mergetree-family/mergetree#table_engine-mergetree-multiple-volumes_configure). |\n\nMetrics:\n\n| Metric | Dimensions | Unit |\n|:------|:----------|:----|\n| clickhouse.disk_space_usage | free, used | bytes |\n\n### Per table\n\nThese metrics refer to the Database Table.\n\nLabels:\n\n| Label | Description |\n|:-----------|:----------------|\n| database | Name of the database. |\n| table | Name of the table. |\n\nMetrics:\n\n| Metric | Dimensions | Unit |\n|:------|:----------|:----|\n| clickhouse.database_table_size | size | bytes |\n| clickhouse.database_table_parts | parts | parts |\n| clickhouse.database_table_rows | rows | rows |\n\n", "integration_type": "collector", "id": "go.d.plugin-clickhouse-ClickHouse", "edit_link": "https://github.com/netdata/netdata/blob/master/src/go/collectors/go.d.plugin/modules/clickhouse/metadata.yaml",
diff --git a/packaging/installer/README.md b/packaging/installer/README.md
index 3b747102bd7219..d15925dca877a4 100644
--- a/packaging/installer/README.md
+++ b/packaging/installer/README.md
@@ -1,44 +1,34 @@
-import { OneLineInstallWget, OneLineInstallCurl } from '@site/src/components/OneLineInstall/'
-import { InstallRegexLink, InstallBoxRegexLink } from '@site/src/components/InstallRegexLink/'
-import Tabs from '@theme/Tabs';
-import TabItem from '@theme/TabItem';
-
# Netdata Agent Installation

Netdata is very flexible and can be used to monitor all kinds of infrastructure. Read more about possible [Deployment guides](/docs/deployment-guides/README.md) to understand what better suits your needs.

## Install through Netdata Cloud

-Netdata is a free and open-source (FOSS) monitoring agent that collects thousands of hardware and software metrics from any physical or virtual system (we call them _nodes_). These metrics are organized in an easy-to-use and
-navigate interface.
-
-Netdata runs permanently on all your physical/virtual servers, containers, cloud deployments, and edge/IoT devices.
-It runs on Linux distributions (Ubuntu, Debian, CentOS, and more), container/microservice platforms (Kubernetes clusters, Docker), and many other operating systems (FreeBSD, macOS), with no `sudo` required.
+The easiest way to install Netdata on your system is via Netdata Cloud. To do so:

-To install Netdata in minutes on your platform:
+1. Sign up to [Netdata Cloud](https://app.netdata.cloud).
+2. You will be presented with an empty space, and a prompt to "Connect Nodes" with the install command for each platform.
+3. Select the platform you want to install Netdata to, copy and paste the script into your node's terminal, and run it.

-1. Sign up to [Netdata Cloud](https://app.netdata.cloud)
-2. You will be presented with an empty space, and a prompt to "Connect Nodes" with the install command for each platform
-3. Select the platform you want to install Netdata to, copy and paste the script into your node's terminal, and run it
+Once Netdata is installed, you can see the node live in your Netdata Space and charts in the [Metrics tab](/docs/dashboards-and-charts/metrics-tab-and-single-node-tabs.md). 
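To sanity-check the steps above, you can query the Agent's local API once the service is running. This is an illustrative sketch, not part of the diff; it assumes the default dashboard port 19999 on the node itself.

```go
package main

import (
	"fmt"
	"net/http"
)

func main() {
	// The Agent serves its dashboard and REST API on port 19999 by default.
	// A 200 from /api/v1/info confirms the node is up before it is claimed.
	resp, err := http.Get("http://127.0.0.1:19999/api/v1/info")
	if err != nil {
		fmt.Println("agent not reachable:", err)
		return
	}
	defer resp.Body.Close()
	fmt.Println("agent responded with", resp.Status)
}
```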
-Upon installation completing successfully, you should be able to see the node live in your Netdata Space and live charts in the Overview tab. [Take a look at our Dashboards and Charts](/docs/dashboards-and-charts/README.md) section to read more about Netdata's features.
+Take a look at our [Dashboards and Charts](/docs/dashboards-and-charts/README.md) section to read more about Netdata's features.

-## Maintaining a Netdata Agent installation
+## Post-install

-For actions like starting, stopping, restarting, updating and uninstalling the Netdata Agent take a look at your specific installation platform in the current section of our Documentation.
-
-## Configuration
+### Configuration

If you are looking to configure your Netdata Agent installation, refer to the [respective section in our Documentation](/docs/netdata-agent/configuration/README.md).

-## Data collection
+### Data collection

-If Netdata didn't autodetect all the hardware, containers, services, or applications running on your node, you should learn more about [how data collectors work](/src/collectors/README.md). If there's a [supported collector](/src/collectors/COLLECTORS.md) for metrics you need, [configure the collector](/src/collectors/REFERENCE.md) or read about its requirements to configure your endpoint to publish metrics in the correct format and endpoint.
+If Netdata didn't autodetect all the hardware, containers, services, or applications running on your node, you should learn more about [how data collectors work](/src/collectors/README.md). If there's a [supported integration](/src/collectors/COLLECTORS.md) for metrics you need, refer to its respective page and read about its requirements to configure your endpoint to publish metrics in the correct format and endpoint.

-## Alerts & notifications
+### Alerts & notifications

Netdata comes with hundreds of pre-configured alerts, designed by our monitoring gurus in parallel with our open-source community, but you may want to [edit alerts](/src/health/REFERENCE.md) or [enable notifications](/docs/alerts-and-notifications/notifications/README.md) to customize your Netdata experience.

-## Make your deployment production ready
+### Make your deployment production ready

Go through our [deployment guides](/docs/deployment-guides/README.md) for suggested configuration changes for production deployments.

@@ -48,32 +38,16 @@ Go through our [deployment guides](/docs/deployment-guides/README.md), for sugge

By default, Netdata's installation scripts enable automatic updates for both nightly and stable release channels.

-If you preferred to update your Netdata agent manually, you can disable automatic updates by using the `--no-updates`
-option when you install or update Netdata using the [automatic one-line installation
-script](/packaging/installer/methods/kickstart.md).
+If you prefer to update your Netdata Agent manually, you can disable automatic updates by using the `--no-updates`
+option when you install or update Netdata using the [automatic one-line installation script](/packaging/installer/methods/kickstart.md).

```bash
wget -O /tmp/netdata-kickstart.sh https://get.netdata.cloud/kickstart.sh && sh /tmp/netdata-kickstart.sh --no-updates
```

-With automatic updates disabled, you can choose exactly when and how you [update
-Netdata](/packaging/installer/UPDATE.md).
-
-#### Network usage of Netdata’s automatic updater
-
-The auto-update functionality set up by the installation scripts requires working internet access to function
-correctly. 
In particular, it currently requires access to GitHub (to check if a newer version of the updater script
-is available or not, as well as potentially fetching build-time dependencies that are bundled as part of the install),
-and Google Cloud Storage (to check for newer versions of Netdata and download the sources if there is a newer version).
-
-Note that the auto-update functionality will check for updates to itself independently of updates to Netdata,
-and will try to use the latest version of the updater script whenever possible. This is intended to reduce the
-amount of effort required by users to get updates working again in the event of a bug in the updater code.
-
-### Nightly vs. stable releases
+With automatic updates disabled, you can choose exactly when and how you [update Netdata](/packaging/installer/UPDATE.md).

-The Netdata team maintains two releases of the Netdata agent: **nightly** and **stable**. By default, Netdata's
-installation scripts will give you **automatic, nightly** updates, as that is our recommended configuration.
+### Nightly vs. Stable Releases

**Nightly**: We create nightly builds every 24 hours. They contain fully-tested code that fixes bugs or security flaws, or introduces new features to Netdata. Every nightly release is a candidate for then becoming a stable release—when
@@ -94,8 +68,7 @@ the community helps fix any bugs that might have been introduced in previous rel

**Pros of using stable releases:**

-- Protect yourself from the rare instance when major bugs slip through our testing and negatively affect a Netdata
- installation
+- Protect yourself from the rare instance when major bugs slip through our testing and negatively affect a Netdata installation
- Retain more control over the Netdata version you use

### Anonymous statistics
diff --git a/packaging/version b/packaging/version
index 52427f83b2e1a7..0e5f44ca669f4b 100644
--- a/packaging/version
+++ b/packaging/version
@@ -1 +1 @@
-v1.45.0-478-nightly
+v1.45.0-490-nightly
diff --git a/src/collectors/COLLECTORS.md b/src/collectors/COLLECTORS.md
index 2c6a9bf96f120e..dc0e0d0e7fd69c 100644
--- a/src/collectors/COLLECTORS.md
+++ b/src/collectors/COLLECTORS.md
@@ -1,45 +1,29 @@
# Monitor anything with Netdata

Netdata uses collectors to help you gather metrics from your favorite applications and services and view them in
-real-time, interactive charts. The following list includes collectors for both external services/applications and
-internal system metrics.
+real-time, interactive charts. The following list includes all the integrations from which Netdata can gather metrics.

-Learn more
-about [how collectors work](/src/collectors/README.md), and
-then learn how to [enable or
-configure](/src/collectors/REFERENCE.md#enable-and-disable-a-specific-collection-module) any of the below collectors using the same process.
+Learn more about [how collectors work](/src/collectors/README.md), and then learn how to [enable or configure](/src/collectors/REFERENCE.md#enable-and-disable-a-specific-collection-module) a specific collector.

-Some collectors have both Go and Python versions as we continue our effort to migrate all collectors to Go. In these
-cases, _Netdata always prioritizes the Go version_, and we highly recommend you use the Go versions for the best
-experience.
-
-If you want to use a Python version of a collector, you need to
-explicitly [disable the Go version](/src/collectors/REFERENCE.md#enable-and-disable-a-specific-collection-module),
-and enable the Python version. 
Netdata then skips the Go version and attempts to load the Python version and its
-accompanying configuration file.
+> **Note**
+>
+> Some collectors have both Go and Python versions as we continue our effort to migrate all collectors to Go. In these cases, _Netdata always prioritizes the Go version_, and we highly recommend you use the Go versions for the best experience.

## Add your application to Netdata

If you don't see the app/service you'd like to monitor in this list:

-- If your application has a Prometheus endpoint, Netdata can monitor it! Look at our
- [generic Prometheus collector](/src/go/collectors/go.d.plugin/modules/prometheus/README.md).
+- If your application has a Prometheus endpoint, Netdata can monitor it! Look at our [generic Prometheus collector](/src/go/collectors/go.d.plugin/modules/prometheus/README.md).

-- If your application is instrumented to expose [StatsD](https://blog.netdata.cloud/introduction-to-statsd/) metrics,
- see our [generic StatsD collector](/src/collectors/statsd.plugin/README.md).
+- If your application is instrumented to expose [StatsD](https://blog.netdata.cloud/introduction-to-statsd/) metrics, see our [generic StatsD collector](/src/collectors/statsd.plugin/README.md).

-- If you have data in CSV, JSON, XML or other popular formats, you may be able to use our
- [generic structured data (Pandas) collector](/src/collectors/python.d.plugin/pandas/README.md),
+- If you have data in CSV, JSON, XML or other popular formats, you may be able to use our [generic structured data (Pandas) collector](/src/collectors/python.d.plugin/pandas/README.md).

-- Check out our [GitHub issues](https://github.com/netdata/netdata/issues). Use the search bar to look for previous
- discussions about that collector—we may be looking for assistance from users such as yourself!
+- Check out our [GitHub issues](https://github.com/netdata/netdata/issues). Use the search bar to look for previous discussions about that collector—we may be looking for assistance from users such as yourself!

-- If you don't see the collector there, you can make
- a [feature request](https://github.com/netdata/netdata/issues/new/choose) on GitHub.
+- If you don't see the collector there, you can make a [feature request](https://github.com/netdata/netdata/issues/new/choose) on GitHub. 
-- If you have basic software development skills, you can add your own plugin - in [Go](/src/go/collectors/go.d.plugin/README.md#how-to-develop-a-collector) - or [Python](/docs/developer-and-contributor-corner/python-collector.md) +- If you have basic software development skills, you can add your own plugin in [Go](/src/go/collectors/go.d.plugin/README.md#how-to-develop-a-collector) or [Python](/docs/developer-and-contributor-corner/python-collector.md) ## Available Data Collection Integrations diff --git a/src/go/collectors/go.d.plugin/modules/clickhouse/charts.go b/src/go/collectors/go.d.plugin/modules/clickhouse/charts.go index 479438880bb9e3..cefcca1e2108ce 100644 --- a/src/go/collectors/go.d.plugin/modules/clickhouse/charts.go +++ b/src/go/collectors/go.d.plugin/modules/clickhouse/charts.go @@ -19,12 +19,14 @@ const ( prioDiskSpaceUsage + prioRunningQueries + prioQueriesPreempted prioQueries prioSelectQueries prioInsertQueries - prioQueriesPreempted prioQueriesMemoryLimitExceeded + prioLongestRunningQueryTime prioQueriesLatency prioSelectQueriesLatency prioInsertQueriesLatency @@ -40,6 +42,7 @@ const ( prioDatabaseTableRows prioReplicatedPartsCurrentActivity + prioReplicasMaxAbsoluteDelay prioReadOnlyReplica prioReplicatedDataLoss prioReplicatedPartFetches @@ -70,6 +73,7 @@ const ( prioUncompressedCacheRequests prioMarkCacheRequests + prioMaxPartCountForPartition prioParts prioDistributedSend @@ -95,12 +99,14 @@ var chCharts = module.Charts{ chartSlowReads.Copy(), chartReadBackoff.Copy(), + chartRunningQueries.Copy(), chartQueries.Copy(), chartSelectQueries.Copy(), chartInsertQueries.Copy(), chartQueriesPreempted.Copy(), chartQueriesMemoryLimitExceeded.Copy(), + chartLongestRunningQueryTime.Copy(), chartQueriesLatency.Copy(), chartSelectQueriesLatency.Copy(), chartInsertQueriesLatency.Copy(), @@ -112,6 +118,7 @@ var chCharts = module.Charts{ chartIOFileOpens.Copy(), chartReplicatedPartsActivity.Copy(), + chartReplicasMaxAbsoluteDelay.Copy(), chartReadonlyReplica.Copy(), chartReplicatedDataLoss.Copy(), chartReplicatedPartFetches.Copy(), @@ -142,6 +149,7 @@ var chCharts = module.Charts{ chartUncompressedCacheRequests.Copy(), chartMarkCacheRequests.Copy(), + chartMaxPartCountForPartition.Copy(), chartPartsCount.Copy(), chartDistributedConnections.Copy(), @@ -238,6 +246,28 @@ var ( ) var ( + chartRunningQueries = module.Chart{ + ID: "running_queries", + Title: "Running queries", + Units: "queries", + Fam: "queries", + Ctx: "clickhouse.running_queries", + Priority: prioRunningQueries, + Dims: module.Dims{ + {ID: "metrics_Query", Name: "running"}, + }, + } + chartQueriesPreempted = module.Chart{ + ID: "queries_preempted", + Title: "Queries waiting due to priority", + Units: "queries", + Fam: "queries", + Ctx: "clickhouse.queries_preempted", + Priority: prioQueriesPreempted, + Dims: module.Dims{ + {ID: "metrics_QueryPreempted", Name: "preempted"}, + }, + } chartQueries = module.Chart{ ID: "queries", Title: "Queries", @@ -277,17 +307,6 @@ var ( {ID: "events_FailedInsertQuery", Name: "failed", Algo: module.Incremental}, }, } - chartQueriesPreempted = module.Chart{ - ID: "queries_preempted", - Title: "Queries waiting due to priority", - Units: "queries", - Fam: "queries", - Ctx: "clickhouse.queries_preempted", - Priority: prioQueriesPreempted, - Dims: module.Dims{ - {ID: "metrics_QueryPreempted", Name: "preempted"}, - }, - } chartQueriesMemoryLimitExceeded = module.Chart{ ID: "queries_memory_limit_exceeded", Title: "Memory limit exceeded for query", @@ -302,6 +321,17 @@ var ( ) var ( + 
chartLongestRunningQueryTime = module.Chart{ + ID: "longest_running_query_time", + Title: "Longest running query time", + Units: "seconds", + Fam: "query latency", + Ctx: "clickhouse.longest_running_query_time", + Priority: prioLongestRunningQueryTime, + Dims: module.Dims{ + {ID: "LongestRunningQueryTime", Name: "longest_query_time", Div: precision}, + }, + } chartQueriesLatency = module.Chart{ ID: "queries_latency", Title: "Queries latency", @@ -456,6 +486,17 @@ var ( {ID: "metrics_ReplicatedChecks", Name: "check"}, }, } + chartReplicasMaxAbsoluteDelay = module.Chart{ + ID: "replicas_max_absolute_delay", + Title: "Replicas max absolute delay", + Units: "seconds", + Fam: "replicas", + Ctx: "clickhouse.replicas_max_absolute_delay", + Priority: prioReplicasMaxAbsoluteDelay, + Dims: module.Dims{ + {ID: "async_metrics_ReplicasMaxAbsoluteDelay", Name: "replication_delay", Div: precision}, + }, + } chartReadonlyReplica = module.Chart{ ID: "readonly_replica", Title: "Replicated tables in readonly state", @@ -746,6 +787,17 @@ var ( ) var ( + chartMaxPartCountForPartition = module.Chart{ + ID: "max_part_count_for_partition", + Title: "Max part count for partition", + Units: "parts", + Fam: "parts", + Ctx: "clickhouse.max_part_count_for_partition", + Priority: prioMaxPartCountForPartition, + Dims: module.Dims{ + {ID: "async_metrics_MaxPartCountForPartition", Name: "max_parts_partition"}, + }, + } chartPartsCount = module.Chart{ ID: "parts_count", Title: "Parts", diff --git a/src/go/collectors/go.d.plugin/modules/clickhouse/clickhouse_test.go b/src/go/collectors/go.d.plugin/modules/clickhouse/clickhouse_test.go index e4bcad9f62f75b..de78bed43477e0 100644 --- a/src/go/collectors/go.d.plugin/modules/clickhouse/clickhouse_test.go +++ b/src/go/collectors/go.d.plugin/modules/clickhouse/clickhouse_test.go @@ -24,6 +24,7 @@ var ( dataRespSystemEvents, _ = os.ReadFile("testdata/resp_system_events.csv") dataRespSystemParts, _ = os.ReadFile("testdata/resp_system_parts.csv") dataRespSystemDisks, _ = os.ReadFile("testdata/resp_system_disks.csv") + dataRespLongestQueryTime, _ = os.ReadFile("testdata/resp_longest_query_time.csv") ) func Test_testDataIsValid(t *testing.T) { @@ -35,6 +36,7 @@ func Test_testDataIsValid(t *testing.T) { "dataRespSystemEvents": dataRespSystemEvents, "dataRespSystemParts": dataRespSystemParts, "dataRespSystemDisks": dataRespSystemDisks, + "dataRespLongestQueryTime": dataRespLongestQueryTime, } { require.NotNil(t, data, name) } @@ -122,6 +124,9 @@ func TestClickHouse_Collect(t *testing.T) { "success on valid response": { prepare: prepareCaseOk, wantMetrics: map[string]int64{ + "LongestRunningQueryTime": 73, + "async_metrics_MaxPartCountForPartition": 7, + "async_metrics_ReplicasMaxAbsoluteDelay": 0, "async_metrics_Uptime": 64380, "disk_default_free_space_bytes": 165494767616, "disk_default_used_space_bytes": 45184565248, @@ -198,6 +203,7 @@ func TestClickHouse_Collect(t *testing.T) { "metrics_PartsTemporary": 0, "metrics_PartsWide": 76, "metrics_PostgreSQLConnection": 0, + "metrics_Query": 1, "metrics_QueryPreempted": 0, "metrics_ReadonlyReplica": 0, "metrics_ReplicatedChecks": 0, @@ -271,6 +277,8 @@ func prepareCaseOk(t *testing.T) (*ClickHouse, func()) { _, _ = w.Write(dataRespSystemParts) case querySystemDisks: _, _ = w.Write(dataRespSystemDisks) + case queryLongestQueryTime: + _, _ = w.Write(dataRespLongestQueryTime) default: w.WriteHeader(http.StatusNotFound) } diff --git a/src/go/collectors/go.d.plugin/modules/clickhouse/collect.go 
b/src/go/collectors/go.d.plugin/modules/clickhouse/collect.go index 63f1b67a4be220..8bb756528aa16b 100644 --- a/src/go/collectors/go.d.plugin/modules/clickhouse/collect.go +++ b/src/go/collectors/go.d.plugin/modules/clickhouse/collect.go @@ -11,6 +11,8 @@ import ( "slices" ) +const precision = 1000 + func (c *ClickHouse) collect() (map[string]int64, error) { mx := make(map[string]int64) @@ -29,6 +31,9 @@ func (c *ClickHouse) collect() (map[string]int64, error) { if err := c.collectSystemDisks(mx); err != nil { return nil, err } + if err := c.collectLongestRunningQueryTime(mx); err != nil { + return nil, err + } return mx, nil } diff --git a/src/go/collectors/go.d.plugin/modules/clickhouse/collect_system_async_metrics.go b/src/go/collectors/go.d.plugin/modules/clickhouse/collect_system_async_metrics.go index 41c2f1b377a61c..46b8fed49eaa45 100644 --- a/src/go/collectors/go.d.plugin/modules/clickhouse/collect_system_async_metrics.go +++ b/src/go/collectors/go.d.plugin/modules/clickhouse/collect_system_async_metrics.go @@ -14,15 +14,21 @@ SELECT metric, value FROM - system.asynchronous_metrics where metric like 'Uptime' FORMAT CSVWithNames + system.asynchronous_metrics +where + metric LIKE 'Uptime' + OR metric LIKE 'MaxPartCountForPartition' + OR metric LIKE 'ReplicasMaxAbsoluteDelay' FORMAT CSVWithNames ` func (c *ClickHouse) collectSystemAsyncMetrics(mx map[string]int64) error { req, _ := web.NewHTTPRequest(c.Request) req.URL.RawQuery = makeURLQuery(querySystemAsyncMetrics) - want := map[string]bool{ - "Uptime": true, + want := map[string]float64{ + "Uptime": 1, + "MaxPartCountForPartition": 1, + "ReplicasMaxAbsoluteDelay": precision, } px := "async_metrics_" @@ -34,12 +40,13 @@ func (c *ClickHouse) collectSystemAsyncMetrics(mx map[string]int64) error { case "metric": metric = value case "value": - if !want[metric] { + mul, ok := want[metric] + if !ok { return } n++ if v, err := strconv.ParseFloat(value, 64); err == nil { - mx[px+metric] = int64(v) + mx[px+metric] = int64(v * mul) } } }) diff --git a/src/go/collectors/go.d.plugin/modules/clickhouse/collect_system_metrics.go b/src/go/collectors/go.d.plugin/modules/clickhouse/collect_system_metrics.go index 8acb3f8f596317..f7c3981c82e3d0 100644 --- a/src/go/collectors/go.d.plugin/modules/clickhouse/collect_system_metrics.go +++ b/src/go/collectors/go.d.plugin/modules/clickhouse/collect_system_metrics.go @@ -50,6 +50,7 @@ func (c *ClickHouse) collectSystemMetrics(mx map[string]int64) error { } var wantSystemMetrics = map[string]bool{ + "Query": true, "TCPConnection": true, "HTTPConnection": true, "MySQLConnection": true, diff --git a/src/go/collectors/go.d.plugin/modules/clickhouse/collect_system_processes.go b/src/go/collectors/go.d.plugin/modules/clickhouse/collect_system_processes.go new file mode 100644 index 00000000000000..d31103a8f48c7f --- /dev/null +++ b/src/go/collectors/go.d.plugin/modules/clickhouse/collect_system_processes.go @@ -0,0 +1,29 @@ +// SPDX-License-Identifier: GPL-3.0-or-later + +package clickhouse + +import ( + "strconv" + + "github.com/netdata/netdata/go/go.d.plugin/pkg/web" +) + +const queryLongestQueryTime = ` +SELECT + toString(max(elapsed)) as value +FROM + system.processes FORMAT CSVWithNames +` + +func (c *ClickHouse) collectLongestRunningQueryTime(mx map[string]int64) error { + req, _ := web.NewHTTPRequest(c.Request) + req.URL.RawQuery = makeURLQuery(queryLongestQueryTime) + + return c.doOKDecodeCSV(req, func(column, value string, lineEnd bool) { + if column == "value" { + if v, err := strconv.ParseFloat(value, 
64); err == nil {
+ mx["LongestRunningQueryTime"] = int64(v * precision)
+ }
+ }
+ })
+}
diff --git a/src/go/collectors/go.d.plugin/modules/clickhouse/integrations/clickhouse.md b/src/go/collectors/go.d.plugin/modules/clickhouse/integrations/clickhouse.md
index 4bfbed46976716..f54e00e7792c32 100644
--- a/src/go/collectors/go.d.plugin/modules/clickhouse/integrations/clickhouse.md
+++ b/src/go/collectors/go.d.plugin/modules/clickhouse/integrations/clickhouse.md
@@ -28,10 +28,11 @@ It sends HTTP requests to the ClickHouse [HTTP interface](https://clickhouse.com
Specifically, it collects metrics from the following tables:

- system.metrics
-- systemd.async_metrics
+- system.async_metrics
- system.events
- system.disks
- system.parts
+- system.processes

This collector is supported on all platforms.

@@ -80,11 +81,13 @@ Metrics:
| clickhouse.slow_reads | slow | reads/s |
| clickhouse.read_backoff | read_backoff | events/s |
| clickhouse.memory_usage | used | bytes |
+| clickhouse.running_queries | running | queries |
+| clickhouse.queries_preempted | preempted | queries |
| clickhouse.queries | successful, failed | queries/s |
| clickhouse.select_queries | successful, failed | selects/s |
| clickhouse.insert_queries | successful, failed | inserts/s |
-| clickhouse.queries_preempted | preempted | queries |
| clickhouse.queries_memory_limit_exceeded | mem_limit_exceeded | queries/s |
+| clickhouse.longest_running_query_time | longest_query_time | seconds |
| clickhouse.queries_latency | queries_time | microseconds |
| clickhouse.select_queries_latency | selects_time | microseconds |
| clickhouse.insert_queries_latency | inserts_time | microseconds |
@@ -94,6 +97,7 @@ Metrics:
| clickhouse.io_seeks | lseek | ops/s |
| clickhouse.io_file_opens | file_open | ops/s |
| clickhouse.replicated_parts_current_activity | fetch, send, check | parts |
+| clickhouse.replicas_max_absolute_delay | replication_delay | seconds |
| clickhouse.replicated_readonly_tables | read_only | tables |
| clickhouse.replicated_data_loss | data_loss | events |
| clickhouse.replicated_part_fetches | successful, failed | fetches/s |
@@ -116,6 +120,7 @@ Metrics:
| clickhouse.merge_tree_data_writer_compressed_bytes | written | bytes/s |
| clickhouse.uncompressed_cache_requests | hits, misses | requests/s |
| clickhouse.mark_cache_requests | hits, misses | requests/s |
+| clickhouse.max_part_count_for_partition | max_parts_partition | parts |
| clickhouse.parts_count | temporary, pre_active, active, deleting, delete_on_destroy, outdated, wide, compact | parts |
| distributed_connections | active | connections |
| distributed_connections_attempts | connection | attempts/s |
@@ -162,14 +167,27 @@ Metrics:
|:------|:----------|:----|
| clickhouse.database_table_size | size | bytes |
| clickhouse.database_table_parts | parts | parts |
-| clickhouse.database_table_parts | parts | parts |
| clickhouse.database_table_rows | rows | rows |


## Alerts

-There are no alerts configured by default for this integration. 
+ +The following alerts are available: + +| Alert name | On metric | Description | +|:------------|:----------|:------------| +| [ clickhouse_restarted ](https://github.com/netdata/netdata/blob/master/src/health/health.d/clickhouse.conf) | clickhouse.uptime | ClickHouse has recently been restarted | +| [ clickhouse_queries_preempted ](https://github.com/netdata/netdata/blob/master/src/health/health.d/clickhouse.conf) | clickhouse.queries_preempted | ClickHouse has queries that are stopped and waiting due to priority setting | +| [ clickhouse_long_running_query ](https://github.com/netdata/netdata/blob/master/src/health/health.d/clickhouse.conf) | clickhouse.longest_running_query_time | ClickHouse has a long-running query exceeding the threshold | +| [ clickhouse_rejected_inserts ](https://github.com/netdata/netdata/blob/master/src/health/health.d/clickhouse.conf) | clickhouse.rejected_inserts | ClickHouse has INSERT queries that are rejected due to high number of active data parts for partition in a MergeTree | +| [ clickhouse_delayed_inserts ](https://github.com/netdata/netdata/blob/master/src/health/health.d/clickhouse.conf) | clickhouse.delayed_inserts | ClickHouse has INSERT queries that are throttled due to high number of active data parts for partition in a MergeTree | +| [ clickhouse_replication_lag ](https://github.com/netdata/netdata/blob/master/src/health/health.d/clickhouse.conf) | clickhouse.replicas_max_absolute_delay | ClickHouse is experiencing replication lag greater than 5 minutes | +| [ clickhouse_replicated_readonly_tables ](https://github.com/netdata/netdata/blob/master/src/health/health.d/clickhouse.conf) | clickhouse.replicated_readonly_tables | ClickHouse has replicated tables in readonly state due to ZooKeeper session loss/startup without ZooKeeper configured | +| [ clickhouse_max_part_count_for_partition ](https://github.com/netdata/netdata/blob/master/src/health/health.d/clickhouse.conf) | clickhouse.max_part_count_for_partition | ClickHouse high number of parts per partition | +| [ clickhouse_distributed_connections_failures ](https://github.com/netdata/netdata/blob/master/src/health/health.d/clickhouse.conf) | clickhouse.distributed_connections_fail_exhausted_retries | ClickHouse has failed distributed connections after exhausting all retry attempts | +| [ clickhouse_distributed_files_to_insert ](https://github.com/netdata/netdata/blob/master/src/health/health.d/clickhouse.conf) | clickhouse.distributed_files_to_insert | ClickHouse high number of pending files to process for asynchronous insertion into Distributed tables | ## Setup diff --git a/src/go/collectors/go.d.plugin/modules/clickhouse/metadata.yaml b/src/go/collectors/go.d.plugin/modules/clickhouse/metadata.yaml index b6399d4cbecbee..e9a6b9152e9e82 100644 --- a/src/go/collectors/go.d.plugin/modules/clickhouse/metadata.yaml +++ b/src/go/collectors/go.d.plugin/modules/clickhouse/metadata.yaml @@ -27,10 +27,11 @@ modules: Specifically, it collects metrics from the following tables: - system.metrics - - systemd.async_metrics + - system.async_metrics - system.events - system.disks - system.parts + - system.processes supported_platforms: include: [] exclude: [] @@ -172,7 +173,47 @@ modules: troubleshooting: problems: list: [] - alerts: [] + alerts: + - name: clickhouse_restarted + metric: clickhouse.uptime + info: ClickHouse has recently been restarted + link: https://github.com/netdata/netdata/blob/master/src/health/health.d/clickhouse.conf + - name: clickhouse_queries_preempted + metric: 
clickhouse.queries_preempted + info: ClickHouse has queries that are stopped and waiting due to priority setting + link: https://github.com/netdata/netdata/blob/master/src/health/health.d/clickhouse.conf + - name: clickhouse_long_running_query + metric: clickhouse.longest_running_query_time + info: ClickHouse has a long-running query exceeding the threshold + link: https://github.com/netdata/netdata/blob/master/src/health/health.d/clickhouse.conf + - name: clickhouse_rejected_inserts + metric: clickhouse.rejected_inserts + info: ClickHouse has INSERT queries that are rejected due to high number of active data parts for partition in a MergeTree + link: https://github.com/netdata/netdata/blob/master/src/health/health.d/clickhouse.conf + - name: clickhouse_delayed_inserts + metric: clickhouse.delayed_inserts + info: ClickHouse has INSERT queries that are throttled due to high number of active data parts for partition in a MergeTree + link: https://github.com/netdata/netdata/blob/master/src/health/health.d/clickhouse.conf + - name: clickhouse_replication_lag + metric: clickhouse.replicas_max_absolute_delay + info: ClickHouse is experiencing replication lag greater than 5 minutes + link: https://github.com/netdata/netdata/blob/master/src/health/health.d/clickhouse.conf + - name: clickhouse_replicated_readonly_tables + metric: clickhouse.replicated_readonly_tables + info: ClickHouse has replicated tables in readonly state due to ZooKeeper session loss/startup without ZooKeeper configured + link: https://github.com/netdata/netdata/blob/master/src/health/health.d/clickhouse.conf + - name: clickhouse_max_part_count_for_partition + metric: clickhouse.max_part_count_for_partition + info: ClickHouse high number of parts per partition + link: https://github.com/netdata/netdata/blob/master/src/health/health.d/clickhouse.conf + - name: clickhouse_distributed_connections_failures + metric: clickhouse.distributed_connections_fail_exhausted_retries + info: ClickHouse has failed distributed connections after exhausting all retry attempts + link: https://github.com/netdata/netdata/blob/master/src/health/health.d/clickhouse.conf + - name: clickhouse_distributed_files_to_insert + metric: clickhouse.distributed_files_to_insert + info: ClickHouse high number of pending files to process for asynchronous insertion into Distributed tables + link: https://github.com/netdata/netdata/blob/master/src/health/health.d/clickhouse.conf metrics: folding: title: Metrics @@ -212,6 +253,18 @@ modules: chart_type: area dimensions: - name: used + - name: clickhouse.running_queries + description: Running queries + unit: queries + chart_type: line + dimensions: + - name: running + - name: clickhouse.queries_preempted + description: Queries waiting due to priority + unit: queries + chart_type: line + dimensions: + - name: preempted - name: clickhouse.queries description: Queries unit: queries/s @@ -233,18 +286,18 @@ modules: dimensions: - name: successful - name: failed - - name: clickhouse.queries_preempted - description: Queries waiting due to priority - unit: queries - chart_type: line - dimensions: - - name: preempted - name: clickhouse.queries_memory_limit_exceeded description: Memory limit exceeded for query unit: queries/s chart_type: line dimensions: - name: mem_limit_exceeded + - name: clickhouse.longest_running_query_time + description: Longest running query time + unit: seconds + chart_type: line + dimensions: + - name: longest_query_time - name: clickhouse.queries_latency description: Queries latency unit: microseconds 
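A note on units here: the new `longest_running_query_time` and `replicas_max_absolute_delay` charts rely on fixed-point scaling. collect.go defines `const precision = 1000`, the collectors store `int64(v * precision)`, and the chart dimensions declare `Div: precision`. A minimal sketch of the round trip (illustrative only, reusing the test fixture value):

```go
package main

import "fmt"

// Mirrors the constant added in collect.go.
const precision = 1000

func main() {
	// system.processes reports elapsed seconds as a float; the collector
	// stores it as a scaled integer, which is what clickhouse_test.go
	// asserts ("LongestRunningQueryTime": 73 for the 0.0738 fixture).
	elapsed := 0.0738
	stored := int64(elapsed * precision)
	fmt.Println(stored) // 73

	// The chart dimension's Div: precision turns it back into
	// fractional seconds on the dashboard.
	fmt.Println(float64(stored) / precision) // 0.073
}
```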
@@ -304,6 +357,12 @@ modules:
                 - name: fetch
                 - name: send
                 - name: check
+            - name: clickhouse.replicas_max_absolute_delay
+              description: Replicas max absolute delay
+              unit: seconds
+              chart_type: line
+              dimensions:
+                - name: replication_delay
             - name: clickhouse.replicated_readonly_tables
               description: Replicated tables in readonly state
               unit: tables
@@ -439,6 +498,12 @@ modules:
               dimensions:
                 - name: hits
                 - name: misses
+            - name: clickhouse.max_part_count_for_partition
+              description: Max part count for partition
+              unit: parts
+              chart_type: line
+              dimensions:
+                - name: max_parts_partition
             - name: clickhouse.parts_count
               description: Parts
               unit: parts
@@ -551,12 +616,6 @@ modules:
               chart_type: line
               dimensions:
                 - name: parts
-            - name: clickhouse.database_table_parts
-              description: Table parts
-              unit: parts
-              chart_type: line
-              dimensions:
-                - name: parts
             - name: clickhouse.database_table_rows
               description: Table rows
               unit: rows
diff --git a/src/go/collectors/go.d.plugin/modules/clickhouse/testdata/resp_longest_query_time.csv b/src/go/collectors/go.d.plugin/modules/clickhouse/testdata/resp_longest_query_time.csv
new file mode 100644
index 00000000000000..85119aa6fddead
--- /dev/null
+++ b/src/go/collectors/go.d.plugin/modules/clickhouse/testdata/resp_longest_query_time.csv
@@ -0,0 +1,2 @@
+"value"
+"0.0738"
diff --git a/src/health/health.d/clickhouse.conf b/src/health/health.d/clickhouse.conf
new file mode 100644
index 00000000000000..e24f71830908dd
--- /dev/null
+++ b/src/health/health.d/clickhouse.conf
@@ -0,0 +1,140 @@
+# you can disable an alarm notification by setting the 'to' line to: silent
+
+ template: clickhouse_restarted
+       on: clickhouse.uptime
+    class: Error
+     type: Database
+component: ClickHouse
+     calc: $uptime
+    units: seconds
+    every: 10s
+     warn: $this > 1 AND $this < 180
+  summary: ClickHouse restart detected
+     info: ClickHouse has recently been restarted
+       to: silent
+
+ template: clickhouse_queries_preempted
+       on: clickhouse.queries_preempted
+    class: Workload
+     type: Database
+component: ClickHouse
+   lookup: max -1m unaligned
+    units: preempted_queries
+    every: 10s
+     warn: $this > 0
+    delay: down 5m multiplier 1.5 max 1h
+  summary: ClickHouse preempted queries detected
+     info: ClickHouse has queries that are stopped and waiting due to the priority setting
+       to: dba
+
+ template: clickhouse_long_running_query
+       on: clickhouse.longest_running_query_time
+    class: Latency
+     type: Database
+component: ClickHouse
+   lookup: max -1m unaligned
+    units: seconds
+    every: 10s
+     warn: $this > (($status >= $WARNING) ? (300) : (600))
+    delay: down 5m multiplier 1.5 max 1h
+  summary: ClickHouse long-running query detected
+     info: ClickHouse has a long-running query exceeding the threshold
+       to: dba
+
+ template: clickhouse_rejected_inserts
+       on: clickhouse.rejected_inserts
+    class: Workload
+     type: Database
+component: ClickHouse
+   lookup: sum -1m unaligned
+    units: rejected_inserts
+    every: 10s
+     warn: $this > 0
+    delay: down 5m multiplier 1.5 max 1h
+  summary: ClickHouse rejected INSERT queries detected
+     info: ClickHouse has INSERT queries that are rejected due to a high number of active data parts for a partition in a MergeTree table
+       to: dba
+
+ template: clickhouse_delayed_inserts
+       on: clickhouse.delayed_inserts
+    class: Workload
+     type: Database
+component: ClickHouse
+   lookup: sum -1m unaligned
+    units: delayed_inserts
+    every: 10s
+     warn: $this > 0
+    delay: down 5m multiplier 1.5 max 1h
+  summary: ClickHouse delayed INSERT queries detected
+     info: ClickHouse has INSERT queries that are throttled due to a high number of active data parts for a partition in a MergeTree table
+       to: silent
+
+ template: clickhouse_replication_lag
+       on: clickhouse.replicas_max_absolute_delay
+    class: Workload
+     type: Database
+component: ClickHouse
+   lookup: avg -1m unaligned
+    units: seconds
+    every: 10s
+     warn: $this > (($status >= $WARNING) ? (250) : (300))
+    delay: down 5m multiplier 1.5 max 1h
+  summary: ClickHouse high replication lag detected
+     info: ClickHouse is experiencing replication lag greater than 5 minutes
+       to: dba
+
+ template: clickhouse_replicated_readonly_tables
+       on: clickhouse.replicated_readonly_tables
+    class: Error
+     type: Database
+component: ClickHouse
+   lookup: max -1m unaligned
+    units: readonly_tables
+    every: 10s
+     warn: $this > 0
+    delay: down 5m multiplier 1.5 max 1h
+  summary: ClickHouse replicated tables in readonly state detected
+     info: ClickHouse has replicated tables in a readonly state due to ZooKeeper session loss or startup without ZooKeeper configured
+       to: dba
+
+ template: clickhouse_max_part_count_for_partition
+       on: clickhouse.max_part_count_for_partition
+    class: Workload
+     type: Database
+component: ClickHouse
+   lookup: avg -1m unaligned
+    units: parts
+    every: 10s
+     warn: $this > (($status >= $WARNING) ? (200) : (300))
+    delay: down 5m multiplier 1.5 max 1h
+  summary: ClickHouse high parts/partition detected
+     info: ClickHouse has a high number of parts per partition
+       to: dba
+
+ template: clickhouse_distributed_connections_failures
+       on: clickhouse.distributed_connections_fail_exhausted_retries
+    class: Error
+     type: Database
+component: ClickHouse
+   lookup: sum -1m unaligned
+    units: failures
+    every: 10s
+     warn: $this > 0
+    delay: down 5m multiplier 1.5 max 1h
+  summary: ClickHouse distributed connections failures detected
+     info: ClickHouse has failed distributed connections after exhausting all retry attempts
+       to: dba
+
+ template: clickhouse_distributed_files_to_insert
+       on: clickhouse.distributed_files_to_insert
+    class: Workload
+     type: Database
+component: ClickHouse
+   lookup: max -1m unaligned
+    units: files
+    every: 10s
+     warn: $this > (($status >= $WARNING) ? (40) : (80))
+    delay: down 5m multiplier 1.5 max 1h
+  summary: ClickHouse high files to insert detected
+     info: ClickHouse has a high number of pending files to process for asynchronous insertion into Distributed tables
+       to: silent
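
A note on the warn expressions above: forms like `$this > (($status >= $WARNING) ? (300) : (600))` implement hysteresis. The alert raises only once the value passes the higher bound, and then stays raised until the value falls back below the lower bound, so it cannot flap around a single threshold. A minimal Go sketch of that evaluation logic (illustrative only; the function and variable names are ours, not Netdata's):

```go
package main

import "fmt"

// warnLongRunningQuery mirrors clickhouse_long_running_query's warn line:
// the effective threshold depends on whether the alert is already WARNING.
func warnLongRunningQuery(longestSeconds float64, alreadyWarning bool) bool {
	threshold := 600.0 // threshold to raise while the alert is clear
	if alreadyWarning {
		threshold = 300.0 // lower threshold to clear once raised
	}
	return longestSeconds > threshold
}

func main() {
	fmt.Println(warnLongRunningQuery(650, false)) // true: crosses the 600s raise threshold
	fmt.Println(warnLongRunningQuery(400, true))  // true: still above the 300s clear threshold
	fmt.Println(warnLongRunningQuery(250, true))  // false: alert clears below 300s
}
```

The same raise/clear pairs appear in `clickhouse_replication_lag` (300/250), `clickhouse_max_part_count_for_partition` (300/200), and `clickhouse_distributed_files_to_insert` (80/40).
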