Chapter 12

Monitoring Strategies

Real-time monitoring is the new face of testing.

—Noah Sussman

Most cloud services are built to be always on: customers expect to be able to use the service 24 hours a day, 365 days a year. Building cloud services with the uptime, reliability, and scalability that always-on operation demands takes considerable engineering. Even with a great architecture, meeting the service level agreements (SLAs) for a system that does not go down still requires a proactive monitoring strategy. This chapter discusses strategies for monitoring cloud services.

Proactive vs. Reactive Monitoring

Many IT shops are accustomed to monitoring systems to detect failures. These shops track each server's memory, CPU, and disk consumption, along with network throughput, to detect symptoms of system failures. Tools that ping URLs to check whether websites are responding are also very common. All of these monitors are reactive: they tell us either that something is failing or that something is about to fail. Reactive monitoring focuses on detection. There should be a corresponding monitoring strategy for prevention.
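As a rough illustration of the reactive style described above, the following Python sketch compares one resource reading against fixed limits and reports anything over the line. The `Sample` type, the `reactive_alerts` function, and the limit values are hypothetical names and numbers chosen for this example; they do not come from any particular monitoring tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    """One reading of server resource usage, each as a percentage."""
    cpu_percent: float
    memory_percent: float
    disk_percent: float


# Hypothetical hard limits; real values depend on the system being watched.
THRESHOLDS = {
    "cpu_percent": 90.0,
    "memory_percent": 85.0,
    "disk_percent": 95.0,
}


def reactive_alerts(sample, thresholds=THRESHOLDS):
    """Return one message per metric that has reached its fixed limit."""
    alerts = []
    for name, limit in thresholds.items():
        value = getattr(sample, name)
        if value >= limit:
            alerts.append(f"{name} at {value:.1f}% (limit {limit:.1f}%)")
    return alerts


if __name__ == "__main__":
    # A server whose CPU is already saturated triggers a single alert.
    for message in reactive_alerts(Sample(95.0, 50.0, 50.0)):
        print(message)
```

A check like this only fires once a limit has been reached, which is why it counts as detection rather than prevention.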

The goal of proactive monitoring is to prevent failures. Prevention requires a different mind-set than detection. To prevent failures, we first must define what healthy system metrics look like. Once we define the baseline metrics for a healthy system, we must watch patterns ...
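One possible way to express "define a baseline, then watch for deviation" in code is sketched below in Python. It is only an assumption about how such a check might look: the baseline is taken to be the mean and standard deviation of readings collected while the system was known to be healthy, and the function names and the tolerance of three standard deviations are hypothetical choices, not something the text prescribes.

```python
from statistics import mean, pstdev


def build_baseline(healthy_samples):
    """Summarize readings from a known-healthy period as (mean, std dev)."""
    return mean(healthy_samples), pstdev(healthy_samples)


def deviates(value, baseline, tolerance=3.0):
    """Return True when a reading is unusually far from the healthy baseline.

    `tolerance` is the number of standard deviations treated as normal
    variation; 3.0 is an arbitrary illustrative default.
    """
    center, spread = baseline
    if spread == 0:
        # A perfectly flat baseline: any change at all is a deviation.
        return value != center
    return abs(value - center) / spread > tolerance


if __name__ == "__main__":
    # Hypothetical CPU percentages recorded while the system was healthy.
    baseline = build_baseline([40, 42, 38, 41, 39])
    print(deviates(41, baseline))  # within normal variation
    print(deviates(55, baseline))  # far from baseline, yet below any hard limit
```

The point of the second reading is that 55 percent CPU would not trip a typical fixed threshold, but it is far enough from this system's healthy pattern to deserve a look before anything actually fails.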
