
Nagios Integration

vvuksan edited this page Oct 1, 2012 · 5 revisions

Ganglia Nagios Integration

Ganglia Nagios integration is included with Ganglia Web 2.2.0 and later. It is based on the following implementation:

http://vuksan.com/linux/nagios_scripts.html#check_ganglia_metrics

The difference is that it uses a shell script wrapper, which is more efficient because a PHP interpreter does not need to be spawned each time a metric is checked.
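To make the wrapper idea concrete, here is a minimal sketch of what such a script might do. It is not the script shipped in the tarball: the function names are hypothetical, and it assumes that curl is installed and that the PHP endpoint's response begins with a Nagios status word (OK, WARNING, CRITICAL). The exit codes 0 to 3 are the standard Nagios plugin return codes.

```shell
#!/bin/sh
# Hypothetical sketch of a Ganglia check wrapper; not the shipped script.

GANGLIA_URL="http://localhost/ganglia2/nagios/check_heartbeat.php"

# Join key=value arguments into a URL query string: a=1 b=2 -> a=1&b=2
# (no URL encoding is done in this sketch)
build_query() {
    query=""
    for arg in "$@"; do
        if [ -z "$query" ]; then
            query=$arg
        else
            query="$query&$arg"
        fi
    done
    printf '%s\n' "$query"
}

# Map the leading status word of a response to a Nagios plugin exit code.
status_to_exit_code() {
    case $1 in
        OK*)       echo 0 ;;
        WARNING*)  echo 1 ;;
        CRITICAL*) echo 2 ;;
        *)         echo 3 ;;   # anything else is treated as UNKNOWN
    esac
}

# Fetch the result over HTTP, print it, and exit with the matching code.
# Assumption: the response starts with the status word.
run_check() {
    result=$(curl -s "$GANGLIA_URL?$(build_query "$@")")
    echo "$result"
    exit "$(status_to_exit_code "$result")"
}
```

A script built this way would end with `run_check "$@"`, so that the `key=value` arguments on the Nagios command line are passed straight through as the query string.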

There are five different Ganglia checks:

  • Check heartbeat
  • Check single metric on a specific host
  • Check multiple metrics on a specific host
  • Check multiple metrics on a range of hosts defined with a regular expression
  • Check that value(s) are the same on a set of hosts

== Check heartbeat ==

Ganglia uses heartbeat packets to determine whether a machine has gone down. The heartbeat timer is reset every time a new packet is received. This check saves you from having to run something like check_ping to make sure a machine is alive. To use it, copy the check_heartbeat.sh script from the nagios subdirectory of the Ganglia Web tarball. Make sure that the Ganglia Web URL inside the script is correct. This is the default:

{{{ GANGLIA_URL="http://localhost/ganglia2/nagios/check_heartbeat.php" }}}

Define the check command in Nagios. The threshold is the number of seconds since the last reported heartbeat after which a critical alert is raised.

{{{
define command {
  command_name check_ganglia_heartbeat
  command_line /bin/sh /var/www/html/ganglia/nagios/check_heartbeat.sh host=$HOSTADDRESS$ threshold=$ARG1$
}
}}}

Now, for every host you want monitored, change check_command to:

{{{ check_command check_ganglia_heartbeat!50 }}}

This will mark any node that reported to Ganglia 50 seconds or more ago as CRITICAL.
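The threshold rule can be written down as a one-line comparison. The function below is only an illustration of the "50 seconds or more" wording above; the function name is hypothetical, and the actual comparison is performed on the PHP side, which may differ at the exact boundary.

```shell
# Return 0 (true) when a heartbeat is old enough to be CRITICAL.
# $1 = seconds since the last heartbeat, $2 = threshold in seconds.
heartbeat_is_critical() {
    [ "$1" -ge "$2" ]
}

if heartbeat_is_critical 75 50; then
    echo "CRITICAL"
else
    echo "OK"
fi
```

To try the real check by hand before wiring it into Nagios, the command line from the definition above can be run directly, with a real host name in place of `$HOSTADDRESS$` and a number in place of `$ARG1$`.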

== Check single metric on a specific host ==

To use it, copy the check_ganglia_metric.sh script from the nagios subdirectory of the Ganglia Web tarball. Make sure that the Ganglia Web URL inside the script is correct. This is the default:

{{{ GANGLIA_URL="http://localhost/ganglia2/nagios/check_metric.php" }}}

Nagios configuration consists of defining the following command:

{{{
define command {
  command_name check_ganglia_metric
  command_line /bin/sh /var/www/html/ganglia/nagios/check_ganglia_metric.sh host=$HOSTADDRESS$ metric_name=$ARG1$ operator=$ARG2$ critical_value=$ARG3$
}
}}}

Now you can use it in a service check. For instance, if you want to be alerted when the 1-minute load average goes over 5, add the following directive:

{{{ check_command check_ganglia_metric!load_one!more!5 }}}

If you want to be alerted when free disk space drops below 10 GB:

{{{ check_command check_ganglia_metric!disk_free!less!10 }}}

Keep in mind that the operator describes the "critical" state. For instance, if you use notequal, the state is critical when the value is NOT equal to the critical value.
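The way an operator turns a metric value into a critical state can be sketched as follows. This is an illustration only: the function name is hypothetical, the real evaluation happens in the PHP script, and the `equal` operator is an assumption (the text above names only more, less and notequal). Numeric comparisons go through awk so that fractional values such as load averages work.

```shell
# Return 0 (true) when the value is in the critical state.
# $1 = current metric value, $2 = operator, $3 = critical value
is_critical() {
    value=$1
    operator=$2
    critical=$3
    case $operator in
        more)     awk -v v="$value" -v c="$critical" 'BEGIN { exit !(v > c) }' ;;
        less)     awk -v v="$value" -v c="$critical" 'BEGIN { exit !(v < c) }' ;;
        equal)    [ "$value" = "$critical" ] ;;    # assumed operator name
        notequal) [ "$value" != "$critical" ] ;;
        *)        return 3 ;;                      # unknown operator
    esac
}

# load_one of 6.2 with "more 5" is critical; disk_free of 25 with "less 10" is not.
is_critical 6.2 more 5 && echo "load_one: CRITICAL"
is_critical 25 less 10 || echo "disk_free: OK"
```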

== Check multiple metrics on a specific host ==

Check multiple metrics is a modification of the check single metric script. It checks multiple metrics on the same host. For example, instead of having separate checks for disk utilization on /, /tmp and /var, which may produce three separate alerts, you have a single alert any time disk utilization goes below or above a threshold.

To use it, copy the check_multiple_metrics.sh script from the nagios subdirectory of the Ganglia Web tarball. Make sure that the Ganglia Web URL inside the script is correct. This is the default:

{{{ GANGLIA_URL="http://localhost/ganglia2/nagios/check_multiple_metrics.php" }}}

Then define a check command in Nagios:

{{{
define command {
  command_name check_ganglia_multiple_metrics
  command_line /bin/sh /var/www/html/ganglia/nagios/check_multiple_metrics.sh host=$HOSTADDRESS$ checks='$ARG1$'
}
}}}

Then add a list of checks delimited with a colon (:). Each check consists of:

metric_name,operator,critical_value

For example:

{{{ check_command check_ganglia_multiple_metrics!disk_free_rootfs,less,10:disk_free_tmp,less,20 }}}
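The structure of the checks argument can be made explicit with a small parser. This is a sketch for illustration, assuming the format described above (checks separated by `:` and fields separated by `,`); the function name is hypothetical, and the actual parsing is done by the PHP script.

```shell
# Print one "metric operator critical_value" line per check.
# $1 = checks string, e.g. 'metric_a,less,10:metric_b,more,5'
parse_checks() {
    old_ifs=$IFS
    IFS=':'                      # split the list into individual checks
    for check in $1; do
        metric=${check%%,*}      # text before the first comma
        rest=${check#*,}         # text after the first comma
        operator=${rest%%,*}
        critical=${rest#*,}
        printf '%s %s %s\n' "$metric" "$operator" "$critical"
    done
    IFS=$old_ifs
}

parse_checks 'disk_free_rootfs,less,10:disk_free_tmp,less,20'
```

Run against the example above, this prints one line for disk_free_rootfs and one for disk_free_tmp, which makes it easy to verify a long checks string before putting it into a service definition.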

'''WARNING:''' A drawback of checking multiple metrics is that in certain cases you may not be aware of the scale of a problem. For example, say you get an alert for /tmp nearing full. You get this alert over the weekend, so you figure it's not THAT critical. After the alert, your /var starts rapidly filling up, which may be really serious. Unfortunately, you will not get another alert (unless you have an aggressive notification interval). Beware.

== Check multiple metrics on a range of hosts defined with a regular expression ==

Use this check to check one or more metrics on a range of hosts defined using a regular expression. This is useful when you want to get a single alert if a particular metric is critical across a number of hosts.

To use it, copy the check_host_regex.sh script from the nagios subdirectory of the Ganglia Web tarball. Make sure that the Ganglia Web URL inside the script is correct. This is the default:

{{{ GANGLIA_URL="http://localhost/ganglia2/nagios/check_host_regex.php" }}}

Then define a check command in Nagios:

{{{
define command {
  command_name check_ganglia_host_regex
  command_line /bin/sh /usr/share/ganglia-web2/nagios/check_host_regex.sh hreg='$ARG1$' checks='$ARG2$'
}
}}}

Then add a list of checks delimited with a colon (:). Each check consists of:

metric_name,operator,critical_value

For example, to check free space on / and /tmp for any machine whose name starts with web- or app-, you would use something like this:

{{{ check_command check_ganglia_host_regex!^web-|^app-!disk_free_rootfs,less,10:disk_free_tmp,less,10 }}}
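Before deploying a host expression, it can help to test it against a list of host names. The sketch below uses `grep -E` for that; the function name and the sample host names are hypothetical. It assumes the PHP side applies the expression as an ordinary regular expression. For simple anchors and alternation like the ones above, POSIX extended regular expressions behave the same way, but more advanced syntax may differ.

```shell
# Print only the host names (one per line on stdin) that match the expression.
# $1 = host regular expression, as passed in hreg=
match_hosts() {
    grep -E -- "$1"
}

# Sample host names, made up for this example.
printf '%s\n' web-01 app-02 db-01 myweb-03 | match_hosts '^web-|^app-'
```

With these sample names, only web-01 and app-02 are selected: db-01 matches neither alternative, and myweb-03 is excluded because `^` anchors the match to the start of the name.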

'''DOWNSIDES:''' As with checking multiple metrics on a single host, the downside of this approach is that in certain situations the scale of a problem may not be apparent, since only a single alert is generated. Also, because Nagios and Ganglia are currently decoupled, you may get an alert for a machine that is under scheduled maintenance if, for example, you start writing to /tmp on it.

== Check that value(s) are the same on a set of hosts ==

Use this check to verify that one or more metrics have the same value across a range of hosts. For example, say you want to make sure that the SVN revision of the deployed code is the same across all servers. You would send the SVN revision as a metric (e.g. a string metric), then list it as a metric that needs to be the same everywhere.

To use it, copy the check_value_same_everywhere.sh script from the nagios subdirectory of the Ganglia Web tarball. Make sure that the Ganglia Web URL inside the script is correct. This is the default:

{{{ GANGLIA_URL="http://localhost/ganglia2/nagios/check_value_same_everywhere.php" }}}

Then define a check command in Nagios:

{{{
define command {
  command_name check_ganglia_value_same_everywhere
  command_line /bin/sh /usr/share/ganglia-web2/nagios/check_value_same_everywhere.sh hreg='$ARG1$' checks='$ARG2$'
}
}}}

For example:

{{{ check_command check_ganglia_value_same_everywhere!^web-|^app-!svn_revision,num_config_files }}}
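The idea behind this check is that the set of distinct values reported for a metric should contain exactly one element. A minimal sketch of that test is shown below; the function name and sample values are hypothetical, and the real comparison is done by the PHP script.

```shell
# Succeed when every line on stdin is identical (or there is at most one line).
all_values_same() {
    count=$(sort -u | wc -l | tr -d ' ')   # number of distinct values
    [ "$count" -le 1 ]
}

# Sample svn_revision values as three hosts might report them (made up).
if printf '%s\n' 1234 1234 1235 | all_values_same; then
    echo "OK: value is the same everywhere"
else
    echo "CRITICAL: hosts report different values"
fi
```

Comparing values as text rather than as numbers keeps the test usable for string metrics such as a revision identifier.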
