Making Metrics Available to Your Alerting Mechanisms

At the beginning of this chapter, I mentioned that the monitoring system that alerts on metrics can be, and sometimes is, a different tool from the system that collects them. Nagios is an example of a monitoring and alerting tool that is commonly found alongside metrics collection systems.

One advantage of having a metrics collection system focus entirely on gathering metrics is that it leaves clear integration points for alerting on the values of those metrics. At Flickr, Ganglia was our metrics collection system, and Nagios was our monitoring and alerting system. In some cases, we tied the two together to create more sophisticated alerting criteria. Giving Nagios awareness of the metrics gathered by Ganglia allowed for a more advanced monitoring approach, in which a fault could be defined not by a single node reaching a critical threshold but by a pattern of several values that each remain below their own threshold.
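To make the idea of a multiple-value subthreshold pattern concrete, here is a minimal sketch in Python. It is not the check we ran at Flickr, and it does not call Ganglia or Nagios directly: the function name `evaluate_cluster`, the host names, and the threshold values are all hypothetical, and the metric values are assumed to have already been fetched from the metrics collection system. The only real convention it relies on is the Nagios plugin exit codes (0 for OK, 1 for WARNING, 2 for CRITICAL).

```python
from typing import Dict, Tuple

# Nagios plugin exit codes (standard plugin convention).
OK, WARNING, CRITICAL = 0, 1, 2


def evaluate_cluster(
    values: Dict[str, float],  # hypothetical: metric value per host, already collected
    critical: float,           # per-node threshold used by a classic check
    soft: float,               # lower, "elevated but not critical" level (assumed)
    min_hosts: int,            # how many elevated hosts make a pattern (assumed)
) -> Tuple[int, str]:
    """Return a Nagios-style (status, message) for one metric across a cluster."""
    # Classic check: any single node at or past the critical threshold.
    over = sorted(h for h, v in values.items() if v >= critical)
    if over:
        return CRITICAL, "critical threshold reached on: " + ", ".join(over)

    # Pattern check: several nodes elevated at once, none of them critical.
    elevated = sorted(h for h, v in values.items() if v >= soft)
    if len(elevated) >= min_hosts:
        return CRITICAL, f"subthreshold pattern on {len(elevated)} hosts: " + ", ".join(elevated)

    return OK, "no threshold or pattern matched"


if __name__ == "__main__":
    # Hypothetical sample: busy Apache processes per web server.
    busy = {"www1": 70, "www2": 72, "www3": 68, "www4": 20}
    status, message = evaluate_cluster(busy, critical=100, soft=60, min_hosts=3)
    print(status, message)
```

In this sample no single server reaches the critical value of 100, so a per-node check would stay quiet, but three servers sit above the softer level of 60 at the same time, which the pattern check reports as a fault. Whether such a pattern should page someone or only warn is a judgment call that depends on your environment.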

For example, let's say you have a cluster of web servers running Apache, and they query backend infrastructure, such as databases running MySQL or Postgres, for the information used to build web pages. A common scenario is a database query taking longer than expected, for whatever reason. The total number of active database connections increases because the connections aren't closing as quickly. As a result, the number of busy Apache processes also increases as they wait on those connections for their results. Both the web servers ...
