A Metrics Collection System, Illustrated: Ganglia

As I've said, metrics collection systems should do the heavy lifting for you while you build and manage a growing infrastructure. I've given examples of how metrics collection can assist in forecasting and troubleshooting system and application anomalies, and it should be obvious by now that metrics collection is mandatory, not optional. Without it, you're blind. With it, you're in control of your site's destiny.

As with all tools and their implementation, the devil is in the details. Whether you choose an open source tool, use a commercial piece of metrics software, or write a collection of scripts to gather application-specific metrics, all of these approaches are variations on the same ideas. How your tool collects, aggregates, stores, and serves your metrics will make a world of difference, and how you use (and rely on) it will largely depend on how easy those operations are.

To take a look at a real-world example, I asked Matt Massie to share his insight from designing and building a metrics collection system meant to scale to thousands of nodes across disparate physical locations. Ten years ago, Matt wrote the open source metrics collection tool called Ganglia, and although it was originally written with the High Performance Computing (HPC) industry in mind, it has become popular over the years with growing web infrastructures.

Because we've discussed the what and why of metrics collection, I think it's ...
