The Art of Capacity Planning

Preface

SOMEWHERE AROUND 3 A.M. ON JULY 7TH, 2005, MY COWORKER, CAL HENDERSON, AND I WERE FINISHING up some final details before moving all of the traffic for our website, Flickr.com, to its new home: a Yahoo! data center in Texas. The original infrastructure in Vancouver was becoming more and more overloaded, and suffering from serious power and space constraints. Since Yahoo! had just acquired Flickr, it was time to bring new capacity online. It was about an hour after we changed DNS records to point to our shiny new servers that Cal happened to glance at the news. The London subway had just been bombed.

Londoners responded with their camera phones, among other things. Over the next 24 hours, Flickr saw more traffic than ever before, as photos from the disaster were uploaded to the site. News outlets began linking to the photos, and traffic on our new servers went through the roof.

It was not only a great example of citizen journalism, but also an object lesson—sadly, one born of tragedy—in capacity planning. Traffic can be sporadic and unpredictable at times. Had we not moved over to the new data center, Flickr.com wouldn’t have been available that day.

Capacity planning has been around since ancient times, with roots in everything from economics to engineering. In a basic sense, capacity planning is resource management. When resources are finite, and come at a cost, you need to do some capacity planning.

When a civil engineering firm designs a new highway system, it’s planning for capacity, as is a power company planning to deliver electricity to a metropolitan area. In some ways, their concerns have a lot in common with web operations; many of the basic concepts and concerns can be applied to all three disciplines.

While systems administration has been around since the 1960s, the branch focused on serving websites is still emerging. A large part of web operations is capacity planning and management. Those are processes, not tasks, and they are composed of many different parts. Although every organization goes about it differently, the basic concepts are the same:

Ensure proper resources (servers, storage, network, etc.) are available to handle expected and unexpected loads.
Have a clearly defined procurement and approval system in place.
Be prepared to justify capital expenditures in support of the business.
Have a deployment and management system in place to manage the resources once they are deployed.

Why I Wrote This Book

One of my frustrations as an operations engineering manager was not having somewhere to turn to help me figure out how much equipment we’d need to keep running. Existing books on the topic of computer capacity planning were focused on the mathematical theory of resource planning, rather than the practical implementation of the whole process.

A lot of literature addressed only rudimentary models of website use cases, and lacked specific information or advice. Instead, they tended to offer mathematical models designed to illustrate the principles of queuing theory, which is the foundation of traditional capacity planning. This approach might be mathematically interesting and elegant, but it doesn’t help the operations engineer when informed he has a week to prepare for some unknown amount of additional traffic—perhaps due to the launch of a super new feature—or seeing his site dying under the weight of a link from the front page of Yahoo!, Digg, or CNN.

I’ve found most books on web capacity planning were written with the implied assumption that concepts and processes found in non-web environments, such as manufacturing or industrial engineering, applied uniformly to website environments as well. While some of the theory surrounding such planning may indeed be similar, the practical application of those concepts doesn’t map very well to the short timelines of website development.

In most web development settings, it’s been my observation that change happens too fast and too often to allow for the detailed and rigorous capacity investigations common to other fields. By the time the operations engineer comes up with the queuing model for his system, new code is deployed and the usage characteristics have likely already changed dramatically. Or some other technological, social, or real-world event occurs, making all of the modeling and simulations irrelevant.

What I’ve found to be far more helpful, is talking to colleagues in the industry—people who come up against many of the same scaling and capacity issues. Over time, I’ve had contact with many different companies, each employing diverse architectures, and each experiencing different problems. But quite often they shared very similar approaches to solutions. My hope is that I can illustrate some of these approaches in this book.

Focus and Topics

This book is not about building complex models and simulations, nor is it about spending time running benchmarks over and over. It’s not about mathematical concepts such as Little’s Law, Markov chains, or Poisson arrival rates.

What this book is about is practical capacity planning and management that can take place in the real world. It’s about using real tools, and being able to adapt to changing usage on a website that will (hopefully) grow over time. When you have a flat tire on the highway, you could spend a lot of time trying to figure out the cause, or you can get on with the obvious task of installing the spare and getting back on the road.

This is the approach I’m presenting to capacity planning: adaptive, not theoretical.

Keep in mind a good deal of the information in this book will seem a lot like common sense—this is a good thing. Quite often the simplest approaches to problem solving are the best ones, and capacity planning is no exception.

This book will cover the process of capacity planning for growing websites, including measurement, procurement, and deployment. I’ll discuss some of the more popular and proven measurement tools and techniques. Most of these tools run in both LAMP and Windows-based environments. As such, I’ll try to keep the discussion as platform-agnostic as possible.

Of course, it’s beyond the scope of this book to cover the details of every database, web server, caching server, and storage solution. Instead, I’ll use examples of each to illustrate the process and concepts, but this book is not meant to be an implementation guide. The intention is to be as generic as possible when it comes to explaining resource management—it’s the process itself we want to emphasize.

For example, a database is used to store data and provide responses to queries. Most of the more popular databases allow for replicating data to other servers, which enhances redundancy, performance, and architectural decisions. It also assists the technical implementation of replication with Postgres, Oracle, or MySQL (a topic for other books). This book covers what replication means in terms of planning capacity and deployment.

Essentially, this book is about measuring, planning, and managing growth for a web application, regardless of the underlying technologies you choose.

Audience for This Book

This book is for systems, storage, database, and network administrators, engineering managers, and of course, capacity planners.

It’s intended for anyone who hopes (or perhaps fears) their website will grow like those of Facebook, Flickr, MySpace, Twitter, and others—companies that underwent the trial-by-fire process of scaling up as their usage skyrocketed. The approaches in this text come from real experience with sites where traffic has grown both heavily and rapidly. If you expect the popularity of your site will dramatically increase the amount of traffic you experience, then please read this book.

Organization of the Material

Chapter 1, Goals, Issues, and Processes in Capacity Planning, presents the issues that arise over and over on heavily trafficked websites.

Chapter 2, Setting Goals for Capacity, illustrates the various concerns involved with planning for the growth of a web application, and how capacity fits into the overall picture of availability and performance.

Chapter 3, Measurement: Units of Capacity, discusses capacity measurement and monitoring.

Chapter 4, Predicting Trends, explains how to turn measurement data into forecasts, and how trending fits into the overall planning process.

Chapter 5, Deployment, discusses concepts related to deployment; automation of installation, configuration, and management.

Appendix A, Virtualization and Cloud Computing, discusses where virtualization and cloud services fit into a capacity plan.

Appendix B, Dealing with Instantaneous Growth, offers insight into what can be done in capacity crisis situations, and some best practices for dealing with site outages.

Appendix C, Capacity Tools, is an annotated list of measurement, installation, configuration, and management tools highlighted throughout the book.

Conventions Used in This Book

The following typographical conventions are used in this book:

Italic: Indicates new terms, URLs, filenames, Unix utilities, and command-line options.
Constant width: Indicates the contents of files, the output from commands, and generally anything found in programs.
Constant width bold: Shows commands or other text that should be typed literally by the user, and parts of code or files highlighted to stand out for discussion.
Constant width italic: Shows text that should be replaced with user-supplied values.

Using Code Examples

This book is here to help you get your job done. In general, you may use the code in this book in your programs and documentation. You do not need to contact us for permission unless you’re reproducing a significant portion of the code. For example, writing a program that uses several chunks of code from this book does not require permission. Selling or distributing a CD-ROM of examples from O’Reilly books does require permission. Answering a question by citing this book and quoting example code does not require permission. Incorporating a significant amount of example code from this book into your product’s documentation does require permission.

We appreciate, but do not require, attribution. An attribution usually includes the title, author, publisher, and ISBN. For example: "The Art of Capacity Planning by John Allspaw. Copyright 2008 Yahoo! Inc., 978-0-596-51857-8.”

If you feel your use of code examples falls outside fair use or the permission given above, feel free to contact us at permissions@oreilly.com.

We’d Like to Hear from You

Please address comments and questions concerning this book to the publisher:

O’Reilly Media, Inc.

1005 Gravenstein Highway North

Sebastopol, CA 95472

800-998-9938 (in the United States or Canada)

707-829-0515 (international or local)

707-829-0104 (fax)

We have a web page for this book, where we list errata, examples, and any additional information. You can access this page at:

http://www.oreilly.com/catalog/9780596518578

To comment or ask technical questions about this book, send email to:

bookquestions@oreilly.com

For more information about our books, conferences, Resource Centers, and the O’Reilly Network, see our website at:

http://www.oreilly.com

Safari® Books Online

When you see a Safari® Books Online icon on the cover of your favorite technology book, that means the book is available online through the O’Reilly Network Safari Bookshelf.

Safari offers a solution that’s better than e-books. It’s a virtual library that lets you easily search thousands of top tech books, cut and paste code samples, download chapters, and find quick answers when you need the most accurate, current information. Try it for free at http://safari.oreilly.com.

Acknowledgments

It’s simply not possible to thank everyone enough in this single, small paragraph, but I will most certainly mention their names. Most of the material in this book was derived from experiences in the trenches, and there are many people who have toughed it out in those trenches alongside me. Peter Norby, Gil Raphaelli, Kevin Collins, Dathan Pattishall, Cal Henderson, Aaron Cope, Paul Hammond, Paul Lloyd, Serguei Mourachov and Chad Dickerson need special thanks, as does Heather Champ and the entire Flickr customer care team. Thank you Flickr development engineering: you all think like operations engineers and for that I am grateful. Thanks to Stewart Butterfield and Caterina Fake for convincing me to join the Flickr team early on. Thanks to David Filo and Hugo Gunnarsen for forcing me to back up my hardware requests with real data. Major thanks go out to Kevin Murphy for providing so much material in the automated deployment chapter. Thanks to Andy Oram and Isabel Kunkle for editing, and special thanks to my good friend Chris Colin for excellent pre-pre-editing advice.

Thanks to Adam Jacob, Matt St. Onge, Jeremy Zawodny, and Theo Schlossnagle for the super tech review.

Much thanks to Matt Mullenweg and Don MacAskill for sharing their cloud infrastructure use cases.

Most important, thanks to my wife, Elizabeth Kairys, for encouraging and supporting me in this insane endeavor. Accomplishing this without her would have been impossible.

Get The Art of Capacity Planning now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.

Start your free trial