Service monitoring
Fernando Barreiro edited this page Mar 28, 2019
Harvester can push out service metrics. To enable this, add the following section to your harvester configuration file. The feature requires psutil >= 5.4.8 and harvester code from after 12 Feb 2019:
[service_monitor]
active = True
disk_volumes = data,data1
pidfile = /var/log/harvester/panda_harvester.pid
- disk_volumes is optional and takes a comma-separated list of volumes
- pidfile is mandatory only when using uwsgi
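To sanity-check the section before restarting the service, the options above can be read with Python's standard `configparser`. This is only a minimal sketch: the function name `read_service_monitor` and the returned dictionary layout are hypothetical, and it does not reproduce how harvester itself loads its configuration.

```python
import configparser

# Sample section copied from the example above.
SAMPLE = """
[service_monitor]
active = True
disk_volumes = data,data1
pidfile = /var/log/harvester/panda_harvester.pid
"""


def read_service_monitor(text):
    """Parse the [service_monitor] section (hypothetical helper, not harvester code)."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    section = parser["service_monitor"]
    # disk_volumes is optional: fall back to an empty list when it is absent.
    raw_volumes = section.get("disk_volumes", fallback="")
    volumes = [v.strip() for v in raw_volumes.split(",") if v.strip()]
    return {
        "active": section.getboolean("active", fallback=False),
        "disk_volumes": volumes,
        # pidfile is only needed with uwsgi, so it may be missing.
        "pidfile": section.get("pidfile", fallback=None),
    }


settings = read_service_monitor(SAMPLE)
print(settings)
```

Splitting on commas and stripping whitespace means that `data, data1` and `data,data1` are treated the same way in this sketch; whether harvester tolerates the extra space is not confirmed here.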
The logs will be written to panda-service_monitor.log. A healthy snippet is:
2019-03-28 03:36:15,559 panda.log.service_monitor: DEBUG Running service monitor
2019-03-28 03:36:15,576 panda.log.service_monitor: DEBUG Memory usage: 178.6640625 MiB/2.5024127947056387%, CPU usage: 0.0
2019-03-28 03:36:15,589 panda.log.service_monitor: DEBUG Disk usage of data: 69.0 %
...
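If you want to confirm from a script that the monitor is reporting values, the log lines above can be parsed with regular expressions. The following is a rough sketch that assumes the message format stays exactly as in the snippet; the helper name `parse_metrics` and the dictionary keys are hypothetical.

```python
import re

# Log lines copied from the healthy snippet above.
LOG = """\
2019-03-28 03:36:15,559 panda.log.service_monitor: DEBUG Running service monitor
2019-03-28 03:36:15,576 panda.log.service_monitor: DEBUG Memory usage: 178.6640625 MiB/2.5024127947056387%, CPU usage: 0.0
2019-03-28 03:36:15,589 panda.log.service_monitor: DEBUG Disk usage of data: 69.0 %
"""

# Patterns are derived from the sample lines only (an assumption about the format).
MEMORY_CPU = re.compile(r"Memory usage: ([\d.]+) MiB/([\d.]+)%, CPU usage: ([\d.]+)")
DISK = re.compile(r"Disk usage of (\S+): ([\d.]+) %")


def parse_metrics(log_text):
    """Collect the reported values from the log text (hypothetical helper)."""
    metrics = {"disk": {}}
    for line in log_text.splitlines():
        match = MEMORY_CPU.search(line)
        if match:
            metrics["memory_mib"] = float(match.group(1))
            metrics["memory_pct"] = float(match.group(2))
            metrics["cpu"] = float(match.group(3))
            continue
        match = DISK.search(line)
        if match:
            # One entry per volume listed in disk_volumes.
            metrics["disk"][match.group(1)] = float(match.group(2))
    return metrics


metrics = parse_metrics(LOG)
print(metrics)
```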
Once harvester is pushing out service metrics, you need to configure the thresholds and alerts on the alerting agent. Complete the XML file below and send it to the service managers, who will add it to the configuration directory:
<?xml version="1.0"?>
<instances>
<instance harvesterid="YOUR HARVESTER ID" instanceisenable="True">
<hostlist>
<host hostname="THE HOST RUNNING HARVESTER" hostisenable="True">
<contacts>
<email>WHO TO NOTIFY 1</email>
<email>WHO TO NOTIFY 2</email>
</contacts>
<metrics>
<metric name="lastsubmittedworker" enable="True">
<value>30</value>
</metric>
<metric name="lastheartbeat" enable="True">
<value>30</value>
</metric>
<metric name="memory" enable="True">
<memory_warning>50</memory_warning>
<memory_critical>80</memory_critical>
</metric>
<metric name="cpu" enable="True">
<cpu_warning>50</cpu_warning>
<cpu_critical>80</cpu_critical>
</metric>
<metric name="disk" enable="True">
<disk_warning>75</disk_warning>
<disk_critical>80</disk_critical>
</metric>
</metrics>
</host>
<!-- ... YOU CAN ADD MULTIPLE HOSTS -->
</hostlist>
</instance>
</instances>
- lastsubmittedworker and lastheartbeat: example values are 30 (minutes), 60d... You can disable a metric in cases where you don't expect regular worker submission.
- disk_warning/disk_critical, cpu_warning/cpu_critical, memory_warning/memory_critical: thresholds expressed in %, for example 50
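Before sending the file to the service managers, it can help to check that it parses and that the thresholds look sensible. The sketch below uses Python's standard `xml.etree.ElementTree`. The helper `check_thresholds` is hypothetical, and the rule it applies (each warning lies between 0 and 100 and does not exceed the matching critical value) is an assumption made for illustration, not a documented requirement of the alerting agent.

```python
import xml.etree.ElementTree as ET

# Shortened version of the file above, with placeholder values kept as-is.
SAMPLE = """<?xml version="1.0"?>
<instances>
<instance harvesterid="YOUR HARVESTER ID" instanceisenable="True">
<hostlist>
<host hostname="THE HOST RUNNING HARVESTER" hostisenable="True">
<metrics>
<metric name="lastheartbeat" enable="True">
<value>30</value>
</metric>
<metric name="memory" enable="True">
<memory_warning>50</memory_warning>
<memory_critical>80</memory_critical>
</metric>
<metric name="disk" enable="True">
<disk_warning>75</disk_warning>
<disk_critical>80</disk_critical>
</metric>
</metrics>
</host>
</hostlist>
</instance>
</instances>
"""


def check_thresholds(xml_text):
    """Return a list of problems found in the percentage thresholds (hypothetical helper)."""
    problems = []
    root = ET.fromstring(xml_text)  # raises ET.ParseError if the file is not well-formed
    for host in root.iter("host"):
        hostname = host.get("hostname")
        for metric in host.iter("metric"):
            name = metric.get("name")
            warning = metric.findtext(f"{name}_warning")
            critical = metric.findtext(f"{name}_critical")
            if warning is None or critical is None:
                continue  # metrics such as lastheartbeat use <value> instead
            w, c = float(warning), float(critical)
            # Assumed rule: 0 <= warning <= critical <= 100.
            if not 0 <= w <= c <= 100:
                problems.append(f"{hostname}: {name} warning={w} critical={c}")
    return problems


problems = check_thresholds(SAMPLE)
print(problems or "thresholds look consistent")
```

An empty list only means that the file is well-formed and passes this one assumed rule; it does not guarantee that the alerting agent will accept it.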