Nagios Integration
NOTE: See the monitor-core wiki page "Integrating Ganglia with Nagios" (https://github.com/ganglia/monitor-core/wiki/Integrating-Ganglia-with-Nagios) for an overview of the different approaches to letting the two pieces of software communicate. This page is specifically about the ganglia-web Nagios integration.
Ganglia Nagios integration is a feature included with Ganglia Web 2.2.0+. It is based on the following implementation:
http://vuksan.com/linux/nagios_scripts.html#check_ganglia_metrics
The difference is that it uses a shell script wrapper, which is more efficient because the PHP interpreter doesn't need to be spawned each time a metric is checked.
There are 5 different Ganglia checks:
- Check heartbeat
- Check single metric on a specific host
- Check multiple metrics on a specific host
- Check multiple metrics on a range of hosts defined with a regular expression
- Check that one or more metrics have the same value across a range of hosts
Ganglia uses heartbeat packets to determine whether a machine has gone down. The time since the last heartbeat is reset every time a new packet is received. The heartbeat check saves you from having to run something like check_ping to make sure a machine is alive. To use this check, copy the check_heartbeat.sh script from the nagios subdirectory of the Ganglia Web tarball. Make sure that the Ganglia Web URL inside the script is correct. This is the default:
GANGLIA_URL="http://localhost/ganglia2/nagios/check_heartbeat.php"
Define the check command in Nagios. The threshold is the number of seconds since the last reported heartbeat at which a critical alert is raised.
define command {
    command_name check_ganglia_heartbeat
    command_line /bin/sh /var/www/html/ganglia/nagios/check_heartbeat.sh host=$HOSTADDRESS$ threshold=$ARG1$
}
Now, for every host you want monitored, set check_command to:
check_command check_ganglia_heartbeat!50
This marks any node that last reported to Ganglia 50 or more seconds ago as CRITICAL.
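The threshold rule can be sketched in a few lines of Python. This only illustrates the rule stated above (critical once the last heartbeat is at least `threshold` seconds old) and is not the code in check_heartbeat.php. The function name `heartbeat_status` and its parameters are hypothetical. The exit codes are the standard Nagios plugin return codes.

```python
# Standard Nagios plugin exit codes for the two states used here.
EXIT_CODES = {"OK": 0, "CRITICAL": 2}


def heartbeat_status(seconds_since_heartbeat: int, threshold: int) -> str:
    """Return a Nagios-style status for a host's last heartbeat.

    Hypothetical helper: the host is CRITICAL when its last heartbeat
    is `threshold` seconds old or older, otherwise it is OK.
    """
    if seconds_since_heartbeat >= threshold:
        return "CRITICAL"
    return "OK"


# With check_ganglia_heartbeat!50 the threshold is 50 seconds.
status = heartbeat_status(seconds_since_heartbeat=73, threshold=50)
print(status, EXIT_CODES[status])
```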
The single-metric check tests one metric on a specific host. To use it, copy the check_ganglia_metric.sh script from the nagios subdirectory of the Ganglia Web tarball. Make sure that the Ganglia Web URL inside the script is correct. This is the default:
GANGLIA_URL="http://localhost/ganglia2/nagios/check_metric.php"
The Nagios configuration consists of defining the following command:
define command {
    command_name check_ganglia_metric
    command_line /bin/sh /var/www/html/ganglia/nagios/check_ganglia_metric.sh host=$HOSTADDRESS$ metric_name=$ARG1$ operator=$ARG2$ critical_value=$ARG3$
}
Now you can use it in a service check. For instance, to be alerted when the 1-minute load average goes over 5, add the following directive:
check_command check_ganglia_metric!load_one!more!5
To alert when free disk space drops below 10 GB:
check_command check_ganglia_metric!disk_free!less!10
Remember that the operator describes the "critical" state. For instance, if you use notequal, the state is critical when the value is NOT equal to critical_value.
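A minimal Python sketch of how an operator maps to the critical state may help. It is an illustration, not the logic in check_metric.php. The operator names more, less and notequal come from this page; equal is assumed for symmetry. Whether the boundary value itself counts as critical is not stated here, so the sketch uses strict comparisons. The name `is_critical` is hypothetical.

```python
import operator

# "more", "less" and "notequal" appear on this page; "equal" is an assumption.
# Each entry answers: is `value <op> critical_value` true, i.e. is it critical?
CRITICAL_WHEN = {
    "more": operator.gt,
    "less": operator.lt,
    "equal": operator.eq,
    "notequal": operator.ne,
}


def is_critical(value: float, op_name: str, critical_value: float) -> bool:
    """Return True when the metric value is in the critical state."""
    try:
        compare = CRITICAL_WHEN[op_name]
    except KeyError:
        raise ValueError(f"unknown operator: {op_name!r}") from None
    return compare(value, critical_value)


# check_ganglia_metric!load_one!more!5 with a current load of 6.2
print(is_critical(6.2, "more", 5))
# check_ganglia_metric!disk_free!less!10 with 25 GB free
print(is_critical(25, "less", 10))
```

The first call reports a critical state because 6.2 is more than 5. The second does not, because 25 GB free is not less than 10.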
The multiple-metrics check is a modification of the single-metric check script. It checks several metrics on the same host. For example, instead of having separate checks for disk utilization on /, /tmp and /var, which may produce three separate alerts, you get a single alert any time disk utilization goes below or above a threshold.
To use it, copy the check_multiple_metrics.sh script from the nagios subdirectory of the Ganglia Web tarball. Make sure that the Ganglia Web URL inside the script is correct. This is the default:
GANGLIA_URL="http://localhost/ganglia2/nagios/check_multiple_metrics.php"
Then define a check command in Nagios:
define command {
    command_name check_ganglia_multiple_metrics
    command_line /bin/sh /var/www/html/ganglia/nagios/check_multiple_metrics.sh host=$HOSTADDRESS$ checks='$ARG1$'
}
Then add a list of checks delimited with a colon (:). Each check consists of
metric_name,operator,critical_value
For example:
check_command check_ganglia_multiple_metrics!disk_free_rootfs,less,10:disk_free_tmp,less,20
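The structure of the checks argument can be made concrete with a short Python sketch that splits it into its parts. This is not how check_multiple_metrics.php parses the string; `Check` and `parse_checks` are hypothetical names used only for illustration.

```python
from typing import List, NamedTuple


class Check(NamedTuple):
    metric_name: str
    operator: str
    critical_value: str


def parse_checks(checks: str) -> List[Check]:
    """Split 'name,op,value:name,op,value' into Check tuples."""
    parsed = []
    for item in checks.split(":"):  # checks are delimited with ":"
        fields = item.split(",")    # fields within a check with ","
        if len(fields) != 3:
            raise ValueError(
                f"expected metric_name,operator,critical_value, got {item!r}"
            )
        parsed.append(Check(*fields))
    return parsed


for check in parse_checks("disk_free_rootfs,less,10:disk_free_tmp,less,20"):
    print(check.metric_name, check.operator, check.critical_value)
```

Values are kept as strings here because a metric may be numeric or textual.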
WARNING: A drawback of checking multiple metrics in one service is that in certain instances you may not be aware of the scale of a problem. For example, say you get an alert that /tmp is nearly full. The alert arrives over the weekend, so you figure it's not THAT critical. Afterwards /var starts filling up rapidly, which may be really serious. Unfortunately, you will not get another alert (unless you have an aggressive notification interval), because the service is already in the critical state. Beware.
Use the host regex check to test one or more metrics on a range of hosts defined by a regular expression. This is useful when you want a single alert if a particular metric is critical across a number of hosts.
To use it, copy the check_host_regex.sh script from the nagios subdirectory of the Ganglia Web tarball. Make sure that the Ganglia Web URL inside the script is correct. This is the default:
GANGLIA_URL="http://localhost/ganglia2/nagios/check_host_regex.php"
Then define a check command in Nagios:
define command {
    command_name check_ganglia_host_regex
    command_line /bin/sh /usr/share/ganglia-web2/nagios/check_host_regex.sh hreg='$ARG1$' checks='$ARG2$'
}
Then add a list of checks delimited with a colon (:). Each check consists of
metric_name,operator,critical_value
For example, to check free space on / and /tmp for any machine whose name starts with web- or app-, you would use something like this:
check_command check_ganglia_host_regex!^web-|^app-!disk_free_rootfs,less,10:disk_free_tmp,less,10
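To see which hosts a pattern such as ^web-|^app- selects, here is a small Python sketch. Python's `re` module stands in for whatever regular expression engine check_host_regex.php uses, so treat it as an approximation. The host names and the function name `matching_hosts` are made up for illustration.

```python
import re
from typing import Iterable, List


def matching_hosts(hreg: str, hostnames: Iterable[str]) -> List[str]:
    """Return the hostnames that match the regular expression `hreg`."""
    pattern = re.compile(hreg)
    return [name for name in hostnames if pattern.search(name)]


# Hypothetical host names, for illustration only.
hosts = ["web-01", "app-02", "db-01", "myweb-03"]
print(matching_hosts("^web-|^app-", hosts))
```

Because both alternatives are anchored with ^, a name such as myweb-03 is not selected even though it contains web-.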
DOWNSIDES: As with checking multiple metrics on a single host, the scale of a problem may not be apparent in certain situations, since only a single alert is generated. Also, because Nagios and Ganglia are currently decoupled, you may get an alert while a machine is under scheduled maintenance if, for example, you start writing to /tmp.
Use the value-same-everywhere check to verify that one or more metrics have the same value across a range of hosts. For example, say you want to make sure the SVN revision of the deployed code is the same across all servers. You would send the SVN revision as, for example, a string metric, then list it as a metric that needs to be the same everywhere.
To use it, copy the check_value_same_everywhere.sh script from the nagios subdirectory of the Ganglia Web tarball. Make sure that the Ganglia Web URL inside the script is correct. This is the default:
GANGLIA_URL="http://localhost/ganglia2/nagios/check_value_same_everywhere.php"
Then define a check command in Nagios:
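define command {
    command_name check_ganglia_value_same_everywhere
    command_line /bin/sh /usr/share/ganglia-web2/nagios/check_value_same_everywhere.sh hreg='$ARG1$' checks='$ARG2$'
}
The command needs a name of its own, because it cannot share check_ganglia_host_regex with the host regex check defined earlier. Here the checks argument is a comma-delimited list of metric names, for example:
check_command check_ganglia_value_same_everywhere!^web-|^app-!svn_revision,num_config_files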
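The idea behind this check can be sketched in Python as follows. It is an illustration under assumed data, not the logic in check_value_same_everywhere.php. The function name `metrics_that_differ`, the host names and the metric values are all hypothetical. A host that does not report a metric is treated as having a different value, which is a design choice of this sketch.

```python
from typing import Dict, List


def metrics_that_differ(
    values_by_host: Dict[str, Dict[str, str]], metric_names: List[str]
) -> List[str]:
    """Return the metrics whose reported values are not identical on all hosts."""
    differing = []
    for metric in metric_names:
        # A missing metric yields None, which counts as a distinct value.
        distinct = {metrics.get(metric) for metrics in values_by_host.values()}
        if len(distinct) > 1:
            differing.append(metric)
    return differing


# Made-up sample data: the two hosts agree on svn_revision only.
values = {
    "web-01": {"svn_revision": "1204", "num_config_files": "17"},
    "web-02": {"svn_revision": "1204", "num_config_files": "18"},
}
print(metrics_that_differ(values, ["svn_revision", "num_config_files"]))
```

An empty result means every listed metric has the same value on all matched hosts.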