Cover by Stoyan Stefanov

Safari, the world’s most comprehensive technology and business learning platform.

Find the exact information you need to solve a problem on the fly, or go deeper to master the technologies and skills you need to succeed

Start Free Trial

No credit card required

O'Reilly logo

Chapter 15. Using Intelligent Caching to Avoid the Bot Performance Tax

Matthew Prince

In 2004, Lee Holloway (https://twitter.com/icqheretic) and I started Project Honey Pot (http://www.projecthoneypot.org/). The site, which tracks online fraud and abuse, primarily consists of web pages that report the reputation of IP addresses. While we had limited resources and tried to get the most of them, I just checked Google which lists more than 31 million pages in its index that make up the www.projecthoneypot.org (http://www.projecthoneypot.org/) site.

Project Honey Pot’s pages are relatively simple and asset-light, but like many sites today they include significant dynamic content that is regularly updated at unpredictable intervals. To deliver near realtime updates, the pages need to be database driven.

To maximize performance of the site, from the beginning we used a number of different caching layers to store the most frequently accessed pages. Lee, whose background is high-performance database design, studied reports from services like Google Analytics to understand how visitors moved through the site and built caching to keep regularly accessed pages from needing to hit the database.

We thought we were pretty smart but, in spite of following the best practices of web application performance design, with alarming frequency the site would grind to a halt. The culprit turned out to be something unexpected and hidden from the view of many people optimizing web performance: automated bots. ...

Find the exact information you need to solve a problem on the fly, or go deeper to master the technologies and skills you need to succeed

Start Free Trial

No credit card required