The Art of Capacity Planning by John Allspaw


Iteration and Calibration

Producing forecasts by curve-fitting your system and application data isn't the end of your capacity planning. In order to make it accurate, you need to revisit your plan, re-fit the data, and adjust accordingly.

Ideally, you should review your forecasts periodically. Check how your capacity is doing against your predictions on a weekly, or even daily, basis. If you know you're nearing capacity on one of your resources and are awaiting delivery of new hardware, you might keep a much closer eye on it. The important thing to remember is that your plan will be accurate only if you consistently re-examine your trends and question your past predictions.
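A periodic review boils down to comparing what you predicted against what actually happened. A minimal sketch of that check might look like the following; the dates and GB figures here are illustrative, not measurements from the text:

```python
# Compare logged predictions against observed usage for a metric.
# All numbers below are made up for illustration.
predicted_gb = {"2005-08-20": 19100.0, "2005-08-27": 19900.0}
actual_gb    = {"2005-08-20": 18950.0, "2005-08-27": 19600.0}

for day in sorted(predicted_gb):
    # Positive error means the forecast overshot actual consumption.
    error_pct = 100.0 * (predicted_gb[day] - actual_gb[day]) / actual_gb[day]
    print(f"{day}: predicted {predicted_gb[day]:.0f} GB, "
          f"actual {actual_gb[day]:.0f} GB, error {error_pct:+.1f}%")
```

Tracking that error over successive reviews tells you whether your trend line is drifting and needs a re-fit.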

As an example, we can revisit our simple storage consumption data. We made a forecast based on data gathered over a 15-day period, from 7/26/05 to 8/09/05. We also found that on 8/30/2005 (roughly two weeks later), we expected to run out of space if we didn't deploy more storage. More precisely, we were slated to reach 20,446.81 GB of space, which would have exceeded our total available space of 20,480 GB.

How accurate was that prediction? Figure 4-17 shows what actually happened.

Figure 4-17. Disk consumption: predicted trend versus actual

As it turned out, we had a little more time than we thought: about four days more. We made a guess based on the trend at the time, which proved inaccurate, but at least erred in favor of allowing more time to integrate new capacity. Forecast error can either widen the window of time (as in this case) or tighten it.

This is why the process of revisiting your forecasts is critical; it's the only way to adjust your capacity plan over time. Every time you update your capacity plan, you should go back and evaluate how your previous forecasts fared.

Since your curve-fitting and trending results tend to improve as you add more data points, you should have a moving window with which you make your forecasts. The width of that forecasting window will vary depending on how long your procurement process takes.

For example, if you know it will take three months on average to order, install, and deploy capacity, you'd want your forecast goal to be three months out each time. As the months pass, you'll want to add the influence of the most recent events to your past data and recalculate your predictions, as illustrated in Figure 4-18.

Figure 4-18. A moving window of forecasts

Best Guesses

This process of plotting, predicting, and iterating can give you a lot of confidence in how you manage your capacity. You'll have accumulated a lot of data about how your current infrastructure is performing, and how close each piece is to its respective ceiling, taking into account comfortable margins of safety. This confidence is important because the capacity planning process (as we've seen) is just as much about educated guessing and luck as it is about hard science and math. Hopefully, the iterations in your planning process will point out any flawed assumptions in the working data, but it should also be said that the ceilings you're using can become flawed or obsolete over time as well.

Just as your ceilings can change depending on the hardware specifications of a server, so too can the actual metric you're assuming is your ceiling. For example, the defining metric of a database might be disk I/O, but after upgrading to a newer and faster disk subsystem, you might find the limiting factor isn't disk I/O anymore, but the single gigabit network card you're using. It bears mentioning that picking the right metric to follow can be difficult, as not all bottlenecks are obvious, and the metric you choose can change as the architecture and hardware limitations change.
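One way to keep the chosen metric honest is to periodically compare every candidate resource's peak usage against its ceiling and see which ratio is highest. A sketch of that check, with all ceilings and peaks invented for illustration:

```python
# Find the current limiting factor by comparing each metric's peak
# usage to its ceiling. All values here are assumed, not measured.
metrics = {
    "disk_iops": {"peak": 3200, "ceiling": 3500},
    "net_mbps":  {"peak": 940,  "ceiling": 1000},  # single gigabit NIC
    "cpu_pct":   {"peak": 45,   "ceiling": 85},
}

# The bottleneck is whichever metric sits closest to its ceiling.
bottleneck = max(metrics, key=lambda m: metrics[m]["peak"] / metrics[m]["ceiling"])
for name, m in metrics.items():
    print(f"{name}: {100 * m['peak'] / m['ceiling']:.0f}% of ceiling")
print(f"Current limiting factor: {bottleneck}")
```

In this fabricated example, the gigabit network card (94% of ceiling) has overtaken disk I/O (91%) as the defining metric, which is exactly the kind of shift a hardware upgrade can cause.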

During this process you might notice seasonal variations. College starts in the fall, so there might be increased usage as students browse your site for materials related to their studies (or just to avoid going to class). As another example, the holiday season in November and December almost always witnesses a bump in traffic, especially for sites involved in retail sales. At Flickr, we see both of those seasonal effects.

Taking into account these seasonal or holiday variations should be yet another influence on how wide or narrow your forecasting window might be. Obviously, the more often you recalculate your forecast, the better prepared you'll be, and the sooner you'll notice variations you didn't expect.

Diagonal Scaling Opportunities

As I pointed out near the beginning of the chapter, predicting capacity requires two essential bits of information: your ceilings and your historical data. Your historical data is etched in stone. Your ceilings are not, since each ceiling you have is indicative of a particular hardware configuration. Performance tuning can hopefully change those ceilings for the better, but upgrading the hardware to newer and better technology is also an option.

As I mentioned at the beginning of the book, new technology (such as multicore processors) can dramatically change how much horsepower you can squeeze from a single server. The forecasting process shown in this chapter allows you to not only track where you're headed on a per-node basis, but also to think about which segments of the architecture you might possibly move to new hardware options.

As discussed, due to the random access patterns of Flickr's application, our database hardware is currently bound by the amount of disk I/O the servers can manage. Each database machine currently comprises six disks in a RAID10 configuration with 16 GB of RAM. A majority of that physical RAM is given to MySQL, and the rest is used for a filesystem cache to help with disk I/O. This is a hardware bottleneck that can be mitigated in a variety of ways, including at least:

  • Spreading the load horizontally across many six-disk servers (current plan)

  • Replacing each six-disk server with hardware containing more disk spindles

  • Adding more physical RAM to assist both the filesystem and MySQL

  • Using faster I/O options such as Solid-State Disks (SSDs)

Which of these options is most likely to help our capacity footprint? Unless we have to grow the number of nodes very quickly, we might not care right now. If we have accurate forecasts for how many servers we'll need in our current configuration, we're in a good place to evaluate the alternatives.

Another long-term option may be to take advantage of the bottlenecks we have on those machines. Since our disk-bound boxes are not using many CPU cycles, we could put those mostly idle CPUs to use for other tasks, making more efficient use of the hardware. We'll talk more about efficiency and virtualization in Appendix A.
