Step 2: Understand Normal Behavior

Over the years, I've done my share of election sites. It's one of those rare occasions when you can experience mass website usage in a very short period of time, a bit like the Slashdot effect, but anticipated. After the government work, I got involved in building a portal website for a broadcaster that would launch by covering the election and later turn into a major news site.

While we were setting up the new website, we were also implementing a new monitoring solution for the web farm. It was the first time the broadcaster had set up a major Internet-facing website, and it demanded a much higher security standard for that reason. None of the servers in the DMZ was allowed to connect back to the intranet, so the agents running on those servers to collect metrics such as CPU, memory, and disk usage could not report back to a central monitoring console. To comply with this, we collected the metrics by logging in over SSH and running a few scripts. Likewise, SNMP agents couldn't send traps to the master, so we had to resort to polling the relevant metrics.
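To make the pull model concrete, here is a minimal sketch of collecting one metric over SSH from the monitoring side. It is not the scripts we actually used: the function names, the choice of /proc/loadavg (which assumes Linux hosts), and the host name in the usage note are all assumptions made for illustration.

```python
import subprocess
from typing import Callable, Dict


def run_over_ssh(host: str, command: str, timeout: float = 10.0) -> str:
    """Run a command on a remote host via the ssh client and return its stdout.

    The connection is opened from the monitoring side, so the DMZ server
    never has to connect back. BatchMode=yes makes ssh fail instead of
    prompting for a password.
    """
    result = subprocess.run(
        ["ssh", "-o", "BatchMode=yes", host, command],
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,
    )
    return result.stdout


def parse_loadavg(text: str) -> Dict[str, float]:
    """Parse the first three fields of /proc/loadavg (Linux-specific assumption)."""
    one, five, fifteen = text.split()[:3]
    return {"load1": float(one), "load5": float(five), "load15": float(fifteen)}


def poll_load(host: str, runner: Callable[[str, str], str] = run_over_ssh) -> Dict[str, float]:
    """Poll the load averages of one host; `runner` can be swapped out for testing."""
    return parse_loadavg(runner(host, "cat /proc/loadavg"))
```

Calling poll_load("web01.dmz.example") against a hypothetical host would open one SSH session and return the three load averages. Every such poll costs the target machine a login and a process, which is worth keeping in mind as checks are added.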

A few weeks before the site went live, we noticed that the load on all the machines was rising, even when no test users were connected. How could that happen? Well, the monitoring itself created the load: because everybody wanted to be 100% sure that everything was OK, check after check was added to the monitoring. And to be sure that we didn't miss any information, the polling frequency ...
