Watchdog checks service, K8S, and node level health. It is a long-running service that continuously reports and records health metrics to log files.

For example:
- Check the watchdog log and find an error:

      ssh watchdogPodHostIP
      cd /var/log/containers
      vi watchdog-xx.log

  Log example:

      {"log":"2018-05-23 02:13:14,783 - watchdog - ERROR - container_status_not_ready{pod="node-exporter-xmvbz", container="gpu-exporter", hostip="10.151.40.133"} 1\n","stream":"stderr","time":"2018-05-23T02:13:14.7838574Z"}

- Go to 10.151.40.133 to check the container's detailed exception:

      ssh 10.151.40.133
      vi /var/log/containers/node-exporter-xmvbz_default_gpu-exporter-xxx.log

  Find the error:

      nvidia cmd error.
Operators can get these logs and metrics from the watchdog container's log in two ways:

- From the K8S portal: find the watchdog pod and view its container logs.
- From the watchdog container log on the host: go to the k8s container log folder:

      cd /var/log/containers
      vi watchdog-xx.log
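Each line in this file is a standard Docker JSON container-log entry, with the watchdog message in its `log` field (see the log example above). The following is a minimal sketch of how such a file could be filtered for ERROR entries; the file name pattern and the assumption that every line is valid JSON are illustrative only.

```python
import glob
import json

# Illustrative file pattern; the actual watchdog log file name depends on the pod name.
for path in glob.glob("/var/log/containers/watchdog-*.log"):
    with open(path) as f:
        for line in f:
            try:
                entry = json.loads(line)      # Docker json-file log format
            except ValueError:
                continue                      # skip lines that are not valid JSON
            message = entry.get("log", "")
            if " - ERROR - " in message:      # watchdog messages look like "<time> - watchdog - <level> - ..."
                print(entry.get("time", ""), message.strip())
```

The metrics that watchdog reports are listed in the tables below.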
Metric name | Description |
---|---|
pai_pod_count | count of pai service pods (webportal, grafana, etc.); labels carry the pod status, e.g. phase="running", ready="true" |
pai_container_count | count of pai service containers; like pai_pod_count, the labels in pai_container_count carry the container status, e.g. state="running" |
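For illustration only, a labeled gauge of this shape can be built with the Python prometheus_client library. The metric name and the phase/ready labels come from the table above; the service-name label (`name`) and the sample values are assumptions, so this is a sketch of the metric format rather than watchdog's actual implementation.

```python
from prometheus_client.core import GaugeMetricFamily

def make_pai_pod_count():
    # Sketch of the metric shape described above; label set and values are examples.
    gauge = GaugeMetricFamily(
        "pai_pod_count",
        "count of pai service pods",
        labels=["name", "phase", "ready"])   # "name" is an assumed label for the service name
    gauge.add_metric(["webportal", "running", "true"], 1)
    gauge.add_metric(["grafana", "running", "true"], 1)
    return gauge
```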
Metric name | Description |
---|---|
pai_node_count | count of nodes in OpenPAI; labels describe the node state, e.g. ready="true", and node conditions, e.g. disk_pressure="false" |
Metric name | Description |
---|---|
docker_daemon_count | has an error key in its labels; if error != "ok", the docker daemon is not functioning correctly |
Metric name | Description |
---|---|
k8s_api_server_count | has an error key in its labels; if error != "ok", the API server is not functioning correctly |
k8s_etcd_count | has an error key in its labels; if error != "ok", etcd is not functioning correctly |
k8s_kubelet_count | has an error key in its labels; if error != "ok", the kubelet is not functioning correctly |
Metric name | Description |
---|---|
process_error_log_total | count of error/exception logs |
k8s_api_healthz_resp_latency_seconds | response latency of the k8s API healthz page |
ssh_resp_latency_seconds | latency of ssh-ing into a worker node and executing the docker daemon check cmd |
k8s_etcd_resp_latency_seconds | response latency of the etcd healthz page |
k8s_kubelet_resp_latency_seconds | response latency of the kubelet healthz page |
k8s_api_list_pods_latency_seconds | latency of listing pods from the k8s API |
k8s_api_list_nodes_latency_seconds | latency of listing nodes from the k8s API |
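Once Prometheus scrapes these watchdog metrics, an operator can also check them programmatically. The sketch below runs an instant query against the Prometheus HTTP API and prints any docker daemon or k8s component series whose error label is not "ok"; the Prometheus address is an assumption and should be replaced with the one used in your deployment.

```python
import json
import urllib.parse
import urllib.request

PROMETHEUS = "http://localhost:9090"   # assumed address; point this at your Prometheus instance

def query(expr):
    """Run an instant query against the Prometheus HTTP API and return the result series."""
    url = PROMETHEUS + "/api/v1/query?" + urllib.parse.urlencode({"query": expr})
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)["data"]["result"]

# Select docker_daemon_count / k8s_*_count series whose error label is not "ok".
for series in query('{__name__=~"docker_daemon_count|k8s_.+_count", error!="ok"}'):
    print(series["metric"])
```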
Alerting rules are under [prometheus/prometheus-alert](../prometheus-alert); we added some basic health-check rules for pai services and nodes there. You can add more alert rules by adding `*.rules` files to the prometheus/prometheus-alert directory. Read the Prometheus documentation for the rule syntax reference.