Step 1: Understand What You Are Monitoring

After graduating from university, I got a job managing websites at a government organization. We inherited a heterogeneous set of servers previously managed by different departments. Little documentation was available, and as we started exploring the environment, it almost felt like we were reverse-engineering it. We made an inventory of all the servers we could find and started adding them to our monitoring system. The first checks we added were for availability, using simple network pings, along with HTTP request checks and response-time measurements. To gather more information on what might cause problems, we also added checks for memory, disk, and CPU usage, as well as checks for essential processes such as SSH, HTTPD, and NTPD. Looking at these results gave us a good overview of the situation we were facing.
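
For readers who want a concrete picture of the kind of checks described above, here is a minimal sketch in Python. It is not the tooling we used at the time. The function names `check_http` and `check_disk`, the 5-second timeout, and the 90 percent disk threshold are all assumptions chosen for illustration.

```python
import shutil
import time
import urllib.error
import urllib.request


def check_http(url, timeout=5.0):
    """Issue one GET request and return (is_up, status_code, elapsed_seconds).

    Hypothetical helper: "up" here means the server answered with a
    2xx or 3xx status within the timeout.
    """
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            status = response.status
    except urllib.error.HTTPError as err:
        # The server responded, but with an error status (4xx or 5xx).
        status = err.code
    except (urllib.error.URLError, OSError):
        # No usable response: DNS failure, refused connection, timeout, etc.
        return False, None, time.monotonic() - start
    elapsed = time.monotonic() - start
    return 200 <= status < 400, status, elapsed


def check_disk(path="/", max_used_fraction=0.9):
    """Return (is_ok, used_fraction) for the filesystem containing path.

    The 0.9 threshold is an assumed example value, not a recommendation.
    """
    usage = shutil.disk_usage(path)
    used_fraction = usage.used / usage.total
    return used_fraction < max_used_fraction, used_fraction
```

Note that both checks run on, or from, the monitoring host itself, so they report only what that host can see.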

Occasionally, we got emails from people stating that they couldn't access the website. When we'd check the website and the monitoring, we'd see that everything was working fine. We'd politely reply that as far as we could see, everything was working fine, and the problem was probably on their PC and they needed to reboot. In reality, we thought these cases were typical examples of PEBKAC (Problem Exists Between Keyboard and Chair).

Then one day my boss sent me an email saying that he too was experiencing problems accessing the website, and because he urgently needed some information, I had to come to his office ASAP. I opened his browser and typed ...
