O'Reilly logo

The Art of Capacity Planning by John Allspaw

Stay ahead with the world's most comprehensive technology and business learning platform.

With Safari, you learn the way you learn best. Get unlimited access to videos, live online training, learning paths, books, tutorials, and more.

Start Free Trial

No credit card required

Long-Term Trends

Now you know how to apply the statistics collected in Chapter 3 to immediate needs. But you may also want to view your site from a more global perspective—both in the literal sense (as your site becomes popular internationally), and in a figurative sense, as you look at the issues surrounding the product and the site's strategy.

Traffic Pattern Changes

As mentioned earlier, getting to know the peaks and valleys of your various resources and application usage is paramount to predicting the future. As you gain more and more history with your metrics, you may be able to perceive more subtle trends that will inform your long-term decisions.

For example, let's take a look at Figure 4-15, which illustrates a typical traffic pattern for a web server.

Figure 4-15 shows a pretty typical U.S. daily traffic pattern. The load rises slowly in the morning, East Coast time, as users begin browsing. These users go to lunch as West Coast users come online, keeping up the load, which finally drops off as people leave work. At this point, the load drops to only those users browsing over night.

As your usage grows, you can expect this graph to grow vertically as more users visit your site during the same peaks and valleys. But if your audience grows more internationally, the bump you see every day will widen as the number of active user time zones increases. As seen in Figure 4-16, you may even see distinct bumps after the U.S. drop-off if your site's popularity grows in a region further away than Europe.

Figure 4-16 displays two daily traffic patterns, taken one year apart, and superimposed one on top of the other. What once was a smooth bump and decline has become a two-peak bump, due to the global effect of popularity.

Typical daily web server traffic pattern

Figure 4-15. Typical daily web server traffic pattern

Daily traffic patterns grow wider with increasing international usage

Figure 4-16. Daily traffic patterns grow wider with increasing international usage

Of course, your product and marketing people are probably very aware of the demographics and geographic distribution of your audience, but tying this data to your system's resources can help you predict your capacity needs.

Figure 4-16 also shows that your web servers must sustain their peak traffic for longer periods of time. This will indicate when you should schedule any maintenance windows to minimize the effect of downtime or degraded service to the users. Notice the ratio between your peak and your low period has changed as well. This will affect how many servers you can stand to lose to failure during those periods, which is effectively the ceiling of your cluster.

It's important to watch the change in your application's traffic pattern, not only for operational issues, but to drive capacity decisions, such as whether to deploy any capacity into international data centers.

Application Usage Changes and Product Planning

A good capacity plan not only relies on system statistics such as peaks and valleys, but user behavior as well. How your users interact with your site is yet another valuable vein of data you should mine for information to help keep your crystal ball as clear as possible.

If you run an online community, you might have discussion boards in which users create new topics, make comments, and upload media such as video and photos. In addition to the previously discussed system-related metrics, such as storage consumption, video and photo processing CPU usage, and processing time, some other metrics you might want to track are:

  • Discussion posts per minute

  • Posts per day, per user

  • Video uploads per day, per user

  • Photo uploads per day, per user

Application usage is just another way of saying user engagement, to borrow a term from the product and marketing folks.

Recall back to our database-planning example. In that example, we found our database ceiling by measuring our hardware's resources (CPU, disk I/O, memory, and so on), relating them to the database's resources (queries per second, replication lag) and tying those ceilings to something we can measure from the user interaction perspective (how many photos per user are on each database).

This is where capacity planning and product management tie together. Using your system and application statistics histories, you can now predict with some (hopefully increasing) degree of accuracy what you'll need to meet future demand. But your history is only part of the picture. If your product team is planning new features, you can bet they'll affect your capacity plan in some way.

Historically, corporate culture has isolated product development from engineering. Product people develop ideas and plans for the product, while engineering develops and maintains the product once it's on the market. Both groups make forecasts for different ends, but the data used in those forecasts should tie together.

One of the best practices for a capacity planner is to develop an ongoing conversation with product management. Understanding the timeline for new features is critical to guaranteeing capacity needs don't interfere with product improvements. Having enough capacity is an engineering requirement, in the same way development time and resources are.

With Safari, you learn the way you learn best. Get unlimited access to videos, live online training, learning paths, books, interactive tutorials, and more.

Start Free Trial

No credit card required