Providing Context for Anomaly Detection and Alerts

The main reason you'll want to collect metrics in the first place is so that, as you would with your gas gauge, you have some idea of what the infrastructure is doing and where it's headed. One of the benefits of knowing where your resources are growing (or shrinking) is the ability to make forecasts. Using forecasting to predict your infrastructure's sizing needs is called capacity planning, and there are already a couple of books written on the topic, so we won't cover it here. Suffice it to say that medium- to long-term forecasting can be more art than science when it comes to predicting infrastructure usage. It can be difficult, if not impossible, to rely on metrics collection alone to provide confidence about both the organic growth experienced by social web applications and the nonorganic, step-function growth that can accompany feature releases that dramatically drive user engagement.

Metrics collection gets really interesting when you're looking for anomalies in your usage. When you get warning or critical alerts on various pieces of your system, you should be able to find those values on a graph somewhere among all of the metrics you're gathering. Got an alert that says CPU usage on a web server is unacceptably high? Yep, it's right there on that graph. Hit the maximum number of connections on a database? Yep, it's right there on that graph. Anything that your monitoring system would alert you to, you should be able to find in the metrics you're already graphing.
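The idea above can be sketched as a simple threshold check over collected samples: the same warning and critical levels that trigger alerts are applied to the raw metric series, so every alert corresponds to a visible breach in the graphed data. This is a minimal illustration, not any particular monitoring system's API; the threshold values and sample data are hypothetical.

```python
# Hypothetical thresholds for web-server CPU usage, in percent.
WARNING_CPU = 80.0
CRITICAL_CPU = 95.0

def classify(samples):
    """Given (timestamp, value) pairs, return the samples that breach
    a threshold, tagged with their severity. Any alert your monitoring
    system fires should match one of these points on the graph."""
    alerts = []
    for ts, value in samples:
        if value >= CRITICAL_CPU:
            alerts.append((ts, value, "critical"))
        elif value >= WARNING_CPU:
            alerts.append((ts, value, "warning"))
    return alerts

# Hypothetical one-minute CPU samples: (seconds, percent used).
cpu_samples = [(0, 42.0), (60, 85.5), (120, 97.2), (180, 70.1)]
print(classify(cpu_samples))
```

Running this prints `[(60, 85.5, 'warning'), (120, 97.2, 'critical')]`: the two samples that would have fired alerts, and exactly the two spikes you would expect to see when you pull up the CPU graph.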
