I'm assuming you've made a few passes through Chapter 3 and have just deployed a super-awesome, totally amazing monitoring, trending, graphing, and measurement system. You're graphing everything you can get your hands on, as often as you can. You probably didn't gain anything from graphing the peak barking periods of your neighbor's dog—but hey, you did it, and I'm proud of you.
Now you'll be able to use this data (excluding the barking statistics) like a crystal ball, and predict the future like Nostradamus. But let's stop here for a moment to remember an irritating little detail: it's impossible to accurately predict the future.
Forecasting capacity needs is part intuition, and part math. It's also the art of slicing and dicing up your historical data, and making educated guesses about the future. Outside of those rare bursts and spikes of load on your system, the long-term view is hopefully one of steadily increasing usage. By putting all of this historical data into perspective, you can generate estimates for what you'll need to sustain the growth of your website. As we'll see later, the key to making accurate predictions is having an adjustable forecasting process.
A good capacity plan depends on knowing your needs for your most important resources, and how those needs change over time. Once you have gathered historical data on capacity, you can begin analyzing it with an eye toward recognizing any trends and recurring patterns.
For example, in the last chapter I recounted how at Flickr, we discovered that Sunday has historically been the highest photo upload day of the week. This is interesting for many reasons, and it leads us to other questions: has that Sunday peak changed over time, and if so, how has it changed with respect to the other days of the week? Has the highest upload day always been Sunday? Does that change as we add new members living on the other side of the International Date Line? Is Sunday still the highest upload day on holiday weekends? All of these questions can be answered once you have the data, and the answers in turn can provide a wealth of insight for planning new feature launches, operational outages, or maintenance windows.
Recognizing trends is valuable for many reasons, not just for capacity planning. When we looked at disk space consumption in Chapter 3, we stumbled upon some weekly upload patterns. Being aware of any recurring patterns can be invaluable when making decisions later on. Trends can also inform community management, customer care and support, product management, and finance. Some examples of how metrics measurement can be useful include:
Your operations group can avoid scheduling maintenance that could affect image processing machines on a Sunday, opting for a Friday instead, to minimize any adverse effects on users.
If you deploy any new code that touches the upload processing infrastructure, you might want to pay particular attention the following Sunday to see whether everything is holding up well when the system experiences its highest load.
Making customer support aware of these peak patterns allows them to gauge the effect of any user feedback regarding uploads.
Product management might want to launch new features based on the low or high traffic periods of the day. A good practice is to make sure everyone on your team knows where these metrics are located and what they mean.
Let's take a look back at the daily storage consumption data we collected in the last chapter and apply it to make a forecast of future storage needs. We already know the defining metric: total available disk space. Graphing the cumulative total of this data provides the right perspective from which to predict future needs. Taking a look at Figure 4-1, we can see where we're headed with consumption, how it's changing over time, and when we're likely to run out of space.
Now, let's add our constraint: the total currently available disk space. Let's assume for this example we have a total of 20 TB (or 20,480 GB) installed capacity. From the graph, we see we've consumed about 16 TB. Adding a solid line extending into the future to represent the total space we have installed, we obtain a graph that looks like Figure 4-2. This illustration demonstrates a fundamental principle of capacity planning: predictions require two essential bits of information, your ceilings and your historical data.
Determining when we're going to reach our space limitation is our next step. As I just suggested, we could simply draw a straight line that extends from our measured data to the point at which it intersects our current limit line. But is our growth actually linear? It may not be.
Excel calls this next step "adding a trend line," but some readers might know this process as curve fitting. This is the process by which you attempt to find a mathematical equation that mimics the data you're looking at. You can then use that equation to make educated guesses about missing values within the data. In this case, since our data is on a time line, the missing values in which we're interested are in the future. Finding a good equation to fit the data can be just as much art as science. Fortunately, Excel is one of many programs that feature curve fitting.
Changing the chart type to XY (Scatter) treats the date values as single data points. We can then use Excel's trending feature to show how this trend looks at some point in the future. Right-click the data series on the graph to display a context menu, and from that menu select Add Trendline. A dialog box will open, as shown in Figure 4-3.
Next, select a trend line type. For the time being, let's choose Polynomial, and set Order to 2. There may be good reasons to choose another trend type, depending on how variable your data is, how much data you have, and how far into the future you want to extrapolate. For more information, see the upcoming sidebar, "Fitting Curves."
In this example, the data appears about as linear as can be, but since I already know this data isn't linear over a longer period of time (it's accelerating), I'll pick a trend type that can capture some of the acceleration we know will occur.
After selecting a trend type, click the Options tab to bring up the Add Trendline options dialog box, as shown in Figure 4-4.
To show the equation that will be used to mimic our disk space data, click the checkbox for "Display equation on chart." We can also look at the R² value for this equation by clicking the "Display R-squared value on chart" checkbox.
The R² value is known in the world of statistics as the coefficient of determination. Without going into the details of how it's calculated, it's basically an indicator of how well an equation matches a certain set of data. An R² value of 1 indicates a mathematically perfect fit. With the data we're using for this example, any value above 0.85 should be sufficient. The important thing to know is that as your R² value decreases, so too should your confidence in the forecasts. Changing the trend type in the previous step affects the R² value—sometimes for better, sometimes for worse—so some experimentation is needed here when looking at different sets of data.
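For readers who would rather script this step than click through Excel, the same order-2 polynomial fit and R² calculation can be sketched in Python with numpy. The daily readings below are synthetic stand-ins generated from the trend we arrive at later in the chapter, not real measurements:

```python
import numpy as np

# Hypothetical daily disk-consumption readings (GB), one per day.
# Synthetic data generated from the quadratic trend used in this chapter.
days = np.arange(1, 16)
used_gb = 14147 + 146.96 * days + 0.7675 * days**2

# Fit a second-order polynomial ("Polynomial, Order 2" in Excel's terms).
coeffs = np.polyfit(days, used_gb, deg=2)
fitted = np.polyval(coeffs, days)

# Coefficient of determination: 1 - SS_residual / SS_total.
ss_res = np.sum((used_gb - fitted) ** 2)
ss_tot = np.sum((used_gb - np.mean(used_gb)) ** 2)
r_squared = 1 - ss_res / ss_tot

print(coeffs)     # highest-order coefficient first: ~[0.7675, 146.96, 14147]
print(r_squared)  # ~1.0, since this synthetic data is noise-free
```

With real, noisy measurements the R² value will fall below 1, and as the chapter notes, your confidence in the forecast should fall with it.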
We'll want to extend our trend line into the future, of course. We want to extend it far enough into the future such that it intersects the line corresponding to our total available space. This is the point at which we can predict we'll run out of space. Under the Forecast portion of the dialog box, enter 25 units for a value. Our units in this case are days. After you hit OK, you'll see our forecast looks similar to Figure 4-5.
The graph indicates that somewhere around day 37, we run out of disk space. Luckily, we don't need to squint at the graph to see the actual values; we have the equation used to plot that trend line. As detailed in Table 4-1, plugging the equation into Excel, and using the day units for the values of X, we find the last day we're below our disk space limit is 8/30/05.
Table 4-1. Determining the precise day you will run out of disk space
Disk available (GB): y = 0.7675x² + 146.96x + 14147
Now we know when we'll need more disk space, and we can get on with ordering and deploying it.
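Rather than tabulating the equation day by day, you can also solve directly for the day consumption crosses the installed capacity. A minimal sketch, using the trend-line equation from Table 4-1 and the quadratic formula:

```python
import math

# Trend-line equation from Table 4-1: y = 0.7675x^2 + 146.96x + 14147
a, b, c = 0.7675, 146.96, 14147.0
capacity_gb = 20480.0  # 20 TB installed capacity

# Solve a*x^2 + b*x + (c - capacity_gb) = 0 for the positive root.
disc = b * b - 4 * a * (c - capacity_gb)
day_full = (-b + math.sqrt(disc)) / (2 * a)

print(day_full)  # ~36.2: we cross the ceiling during day 37
```

This agrees with the graph's "somewhere around day 37" reading, without any squinting.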
This example of increasing disk space is about as simple as they come. But as the metric is consumption-driven, every day has a new value that contributes to the definition of our curve. We also need to factor in the peak-driven metrics that drive our capacity needs in other parts of our site. Peak-driven metrics involve resources that are continually regenerated, such as CPU time and network bandwidth. They fluctuate more dramatically and thus are more difficult to predict, so curve fitting requires more care.
In Chapter 3, we went through the exercise of establishing our database ceiling values. We discovered (through observing our system metrics) that 40 percent disk I/O wait was a critical value to avoid, because it's the threshold at which database replication begins experiencing disruptive lags.
How do we know when we'll reach this threshold? We need some warning that we're approaching our ceiling. The graphs don't show a clean, smooth line gradually bumping over the 40 percent threshold; instead, our disk I/O wait graph shows the database doing fine until a 40 percent spike occurs. We might deem occasional (and recoverable) spikes acceptable, but we need to track how our average values change over time so the spikes don't land so close to our ceiling. We also need to somehow tie I/O wait times to our database usage, and ultimately, to what that means in terms of actual application usage.
To establish some control over this unruly data, let's take a step back from the system statistics and look at the purpose this database is actually serving. In this example, we're looking at a user database. This is a server in our main database cluster, wherein a segment of Flickr users store the metadata associated with their user account: their photos, their tags, the groups they belong to, and more. The two main drivers of load on the databases are, of course, the number of photos and the number of users.
This particular database has roughly 256,000 users and 23 million photos. Over time, we realized that neither the number of users nor the number of photos is singularly responsible for how much work the database does. Taking only one of those variables into account meant ignoring the effect of the other. Indeed, there may be many users who have few or no photos; queries for their data are quite fast and not at all taxing. On the flip side, there are a handful of users who maintain enormous collections of photos.
We can look at our metrics for clues on our critical values. We have all our system metrics, our application metrics, and the historical growth of each.
We then set out to find the single most important metric that can define the ceiling for each database server. After looking at the disk I/O wait metric for each one, we were unable to discern a good correlation between I/O wait and the number of users on the database. We had some servers with over 450,000 users that were seeing healthy, but not dangerous, levels of I/O wait. Meanwhile, other servers with only 300,000 users were experiencing much higher levels of I/O wait. Looking at the number of photos wasn't helpful either—disk I/O wait didn't appear to be tied to photo population.
As it turns out, the metric that directly indicates disk I/O wait is the ratio of photos-to-users on each of the databases.
As part of our application-level dashboard, we measure on a daily basis (collected each night) how many users are stored on each database along with the number of photos associated with each user. The photos-to-user ratio is simply the total number of photos divided by the number of users. While this could be thought of as an average photos per user, the range can be quite large, with some "power" Flickr users having many thousands of photos while a majority have only tens or hundreds. By looking at how the peak disk I/O wait changes with respect to this photos per user ratio, we can get an idea of what sort of application-level metrics can be used to predict and control the use of our capacity (see Figure 4-6).
This graph was compiled from a number of our databases, and displays the peak disk I/O wait values against their current photos-to-user ratios. With this graph, we can ascertain where disk I/O wait begins to jump up. There's an elbow in our data around the 85–90 ratio when the amount of disk I/O wait jumps above the 30 percent range. Since our ceiling value is 40 percent, we'll want to ensure we keep our photos-to-user ratio in the 80–100 range. We can control this ratio within our application by distributing photos for high-volume users across many databases.
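A nightly dashboard check of this ratio is easy to sketch. The shard names and counts below are hypothetical stand-ins, and the ceiling of 100 comes from the 80-100 band identified above:

```python
# Hypothetical per-shard counts, as collected by the nightly
# application-level dashboard described earlier.
shards = {
    "db1": {"users": 256_000, "photos": 23_000_000},
    "db2": {"users": 300_000, "photos": 33_000_000},
    "db3": {"users": 450_000, "photos": 31_500_000},
}

RATIO_CEILING = 100  # stay within the 80-100 band, below the I/O-wait elbow

for name, counts in sorted(shards.items()):
    ratio = counts["photos"] / counts["users"]
    status = "REBALANCE" if ratio > RATIO_CEILING else "ok"
    print(f"{name}: {ratio:.1f} photos/user ({status})")
```

A shard flagged REBALANCE is a candidate for redistributing high-volume users across other databases, which is exactly the control knob federation gives us.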
I want to stop here for a moment to talk a bit about Flickr's database architecture. After reaching the limits of the more traditional master/slaves MySQL replication architecture (in which all writes go to the master and all reads go to the slaves), we redesigned our database layout to be federated, or sharded. This evolution in architecture is becoming increasingly common as growing sites reach higher rates of data change. I won't go into how that architectural migration came about, but it's a good example of how architecture decisions can have a positive effect on capacity planning and deployment. By federating our data across many servers, we limit our growth only by the amount of hardware we can deploy, not by the limits imposed by any single machine.
Because we're federated, we can control how users (and their photos) are spread across many databases. This essentially means each server (or pair of servers, for redundancy) contains a unique set of data. This is in contrast to the more traditional monolithic database that contains every record on a single server. More information about federated database architectures can be found in Cal Henderson's book, Building Scalable Web Sites (O'Reilly).
OK, enough diversions—let's get back to our database capacity example and summarize where we are to this point. Database replication lag is bad and we want to avoid it. We hit replication lag when we see 40 percent disk I/O wait, and we reach that threshold when we've installed enough users and photos to produce a photos-to-user ratio of 110. We know how our photo uploads and user registrations grow, because we capture that on a daily basis (Figure 4-7). We are now armed with all the information we need to make informed decisions regarding how much database hardware to buy, and when.
We can extrapolate a trend based on this data to predict how many users and photos we'll have on Flickr for the foreseeable future, then use that to gauge how our photos/user ratio will look on our databases, and whether we need to adjust the maximum amounts of users and photos to ensure an even balance across those databases.
We've found where the elbow in our performance (Figure 4-6) exists for these databases—and therefore our capacity—but what is so special about this photos/users ratio for our databases? Why does this particular value trigger performance degradation? It could be for many reasons, such as specific hardware configurations, or the types of queries that result from having that much data during peak traffic. Investigating the answers to these questions could be a worthwhile exercise, but here again I'll emphasize that we should simply expect this effect will continue and not count on any potential future optimizations.
When we forecast the capacity of a peak-driven resource, we need to track how the peaks change over time. From there, we can extrapolate from that data to predict future needs. Our web server example is a good opportunity to illustrate this process.
In Chapter 3, we identified our web server ceilings as 85 percent CPU usage for this particular hardware platform. We also confirmed CPU usage is directly correlated to the amount of work Apache is doing to serve web pages. Also as a result of our work in Chapter 3, we should be familiar with what a typical week looks like across Flickr's entire web server cluster. Figure 4-8 illustrates the peaks and valleys over the course of one week.
This data is extracted from a time in Flickr's history when we had 15 web servers. Let's suppose this data is taken today, and we have no idea how our activity will look in the future. We can assume the observations we made in the previous chapter are accurate with respect to how CPU usage and the number of busy Apache processes relate—which turns out to be a simple multiplier: 1.1. If for some reason this assumption does change, we'll know quickly, as we're tracking these metrics on a per-minute basis. According to the graph in Figure 4-8, we're seeing about 900 busy concurrent Apache processes during peak periods, load balanced across 15 web servers. That works out to about 60 processes per web server. Thus, each web server is using approximately 66 percent of total CPU (we can look at our CPU graphs to confirm this assumption).
The peaks for this sample data are what we're interested in the most. Figure 4-9 presents this data over a longer time frame, in which we see these patterns repeat.
It's these weekly peaks that we want to track and use to predict our future needs. As it turns out, for Flickr, those weekly peaks almost always fall on a Monday. If we isolate those peak values and pull a trend line into the future as we did with our disk storage example above, we'll see something similar to Figure 4-10.
If our traffic continues to increase at the current pace, this graph predicts that in another eight weeks, we can expect to experience roughly 1,300 busy Apache processes running at peak. With our 1.1 processes-to-CPU ratio, this translates to around 1,430 percent total CPU usage across our cluster. If we have defined 85 percent on each server as our upper limit, we would need 16.8 servers to handle the load. Of course, manufacturers are reluctant to sell servers in increments of tenths, so we'll round that up to 17 servers. We currently have 15 servers, so we'll need to add 2 more.
Fortunately, we already have enough data to calculate when we'll run out of web server capacity. We have 15 servers, each currently operating at 66 percent CPU usage at peak. Our upper limit on the web servers is 85 percent, which works out to 1,275 percent CPU usage across the cluster. Applying our 1.1 multiplier, this in turn corresponds to about 1,160 busy Apache processes at peak. If we trust the trend line shown in Figure 4-11, we can expect to hit that number—and run out of capacity—in the third or fourth week.
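The arithmetic in the last two paragraphs condenses into a few lines. Everything here restates figures from the running example; the 8-week peak is the value read off the trend line:

```python
import math

# Figures from the running web server example.
servers = 15
cpu_ceiling_pct = 85          # per-server upper limit from Chapter 3
procs_to_cpu = 1.1            # 1 busy Apache process ~= 1.1% of a server's CPU
peak_procs_in_8_weeks = 1300  # read off the trend line in Figure 4-10

# How many servers will the 8-week peak require?
cluster_cpu_pct = peak_procs_in_8_weeks * procs_to_cpu         # ~1,430%
servers_needed = math.ceil(cluster_cpu_pct / cpu_ceiling_pct)  # 17

# How many busy processes can the current cluster absorb at its ceiling?
max_cluster_cpu = servers * cpu_ceiling_pct                    # 1,275%
max_procs = max_cluster_cpu / procs_to_cpu                     # ~1,159

print(servers_needed, servers_needed - servers)  # 17 servers; 2 more than today
print(round(max_procs))                          # ~1,160 busy processes at peak
```

Comparing the current peak of 900 processes against that ~1,160-process limit, along the trend line, is what yields the run-out date.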
Therefore, the summary of our forecast can be presented succinctly:
We'll run out of web server capacity three to four weeks from now.
We'll need two more web servers to handle the load we expect to see in eight weeks.
Now we can begin our procurement process with detailed justifications based on hardware usage trends, not simply a wild guess. We'll want to ensure the new servers are in place before we need them, so we'll need to find out how long it will take to purchase, deliver, and install them.
This is a simplified example. Adding two web servers in three to four weeks shouldn't be too difficult or stressful. Ideally, you should have more than six data points upon which to base your forecast, and you likely won't be so close to your cluster's ceiling as in our example. But no matter how much capacity you'll need to add, or how long the timeframe actually is, the process should be the same.
When you're forecasting with peak values as we've done, it's important to remember the more data you have to fit a curve, the more accurate your forecast will be. In our example, we based our hardware justifications on six weeks worth of data. Is that enough data to constitute a trend? Possibly, but the time period on which you're basing your forecasts is of great importance as well. Maybe there is a seasonal lull or peak in traffic, and you're on the cusp of one. Maybe you're about to launch a new feature that will add extra load to the web servers within the timeframe of this forecast. These are only a few considerations for which you may need to compensate when you're making justifications to buy new hardware. A lot of variables can come into play when predicting the future, and as a result, we have to remember to treat our forecasts as what they really are: educated guesses that need constant refinement.
Our use of Excel in the previous examples was pretty straightforward. But you can automate that process by using Excel macros. And since you'll most likely be doing the same process repeatedly as your metric collection system churns out new usage data, you can benefit greatly by introducing some automation into this curve-fitting business. Other benefits can include the ability to integrate these forecasts into a dashboard, plug them into other spreadsheets, or put them into a database.
An open source program called fityk (http://fityk.sourceforge.net) does a great job of fitting equations to arbitrary data, and can handle the same range of equation types as Excel. For our purposes, the full curve-fitting abilities of fityk are overkill; it was created for analyzing scientific data that can represent wildly dynamic datasets, not just growing and decaying data. While fityk is primarily a GUI-based application (see Figure 4-12), a command-line version is also available, called cfityk. This version accepts commands that mimic what would have been done with the GUI, so it can be used to automate the curve fitting and forecasting.
The command file used by cfityk is nothing more than a script of actions you can write using the GUI version. Once you have the procedure choreographed in the GUI, you'll be able to replay the sequence with different data via the command-line tool.
If you have a carriage return–delimited file of x-y data, you can feed it into a command script that can be processed by cfityk. The syntax of the command file is relatively straightforward, particularly for our simple case. Let's go back to our storage consumption data for an example.
In the code example that follows, we have disk consumption data for a 15-day period, presented in increments of one data point per day. This data is in a file called storage-consumption.xy, and appears as displayed here:
1  14321.83119
2  14452.60193
3  14586.54003
4  14700.89417
5  14845.72223
6  15063.99681
7  15250.21164
8  15403.82607
9  15558.81815
10 15702.35007
11 15835.76298
12 15986.55395
13 16189.27423
14 16367.88211
15 16519.57105
# Fityk script. Fityk version: 0.8.2
@0 < '/home/jallspaw/storage-consumption.xy'
guess Quadratic
fit
info formula in @0
quit
This script imports our x-y data file, sets the equation type to a second-order polynomial (a quadratic equation), fits the data, and then reports information about the fit, such as the resulting formula. Running the script gives us these results:
[jallspaw:~]$ cfityk ./fit-storage.fit
1> # Fityk script. Fityk version: 0.8.2
2> @0 < '/home/jallspaw/storage-consumption.xy'
15 points. No explicit std. dev. Set as sqrt(y)
3> guess Quadratic
New function %_1 was created.
4> fit
Initial values:  lambda=0.001  WSSR=464.564
#1:  WSSR=0.90162  lambda=0.0001  d(WSSR)=-463.663  (99.8059%)
#2:  WSSR=0.736787  lambda=1e-05  d(WSSR)=-0.164833  (18.2818%)
#3:  WSSR=0.736763  lambda=1e-06  d(WSSR)=-2.45151e-05  (0.00332729%)
#4:  WSSR=0.736763  lambda=1e-07  d(WSSR)=-3.84524e-11  (5.21909e-09%)
Fit converged.
Better fit found (WSSR = 0.736763, was 464.564, -99.8414%).
5> info formula in @0
# storage-consumption
14147.4+146.657*x+0.786854*x^2
6> quit
bye...
We now have our formula to fit the data:
y = 0.786854x² + 146.657x + 14147.4
Note how closely the result matches Excel's for the same type of curve. Treating the values of x as days and those of y as our increasing disk space, we can plug in our 25-day forecast, which yields the same results as the Excel exercise. Table 4-2 lists the results generated by cfityk.
Table 4-2. Same forecast as Table 4-1, curve-fit by cfityk
Disk available (GB): y = 0.786854x² + 146.657x + 14147.4
Being able to perform curve-fitting with a cfityk script allows you to carry out forecasting on a daily or weekly basis within a cron job, and can be an essential building block for a capacity planning dashboard.
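If you'd rather keep the whole pipeline in one language, the cfityk step can be approximated in Python with numpy. This is a sketch; the function name and forecast horizon are my own invention, but the input matches the whitespace-delimited x-y file format cfityk reads:

```python
import numpy as np

def forecast_full_day(xy_path, capacity_gb, horizon_days=60):
    """Fit a quadratic to 'day consumed_gb' pairs and return the first
    forecast day that exceeds capacity, or None if it stays under the
    ceiling within the horizon. (Hypothetical helper for a dashboard.)"""
    data = np.loadtxt(xy_path)  # same x-y format the cfityk script consumes
    coeffs = np.polyfit(data[:, 0], data[:, 1], deg=2)
    last_day = int(data[-1, 0])
    for day in range(last_day + 1, last_day + 1 + horizon_days):
        if np.polyval(coeffs, day) > capacity_gb:
            return day
    return None
```

Run from cron against the nightly data dump, a function like this can feed the same "days until full" number into a capacity dashboard without shelling out to an external tool.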
Web capacity planning can borrow a few useful strategies from the older and better-researched work of mechanical, manufacturing, and structural engineering. These disciplines also need to base design and management considerations around resources and immutable limits. The design and construction of buildings, bridges, and automobiles obviously requires some intimate knowledge of the strength and durability of materials, the loads each component is expected to bear, and what their ultimate failure points are. Does this sound familiar? It should, because capacity planning for web operations shares many of those same considerations and concepts.
Under load, materials such as steel and concrete undergo physical stresses. Some have elastic properties that allow them to recover under light amounts of load, but fail under higher strains. The same concerns exist in your servers, network, or storage. When their resources reach certain critical levels—100 percent CPU or disk usage, for example—they fail. To pre-empt this failure, engineers apply what is known as a factor of safety to their design. Defined briefly, a factor of safety indicates some margin of resource allocated beyond the theoretical capacity of that resource, to allow for uncertainty in the usage.
While safety factors in the case of mechanical or structural engineering are usually part of the design phase, in web operations they should be considered as an amount of available resources that you leave aside, with respect to the ceilings you've established for each class of resource. This will enable those resources to absorb some amount of unexpected increased usage. Resources for which you should calculate safety factors include all those discussed in Chapter 3: CPU, disk, memory, network bandwidth, even entire hosts (if you run a very large site).
For example, in Chapter 3 we stipulated 85 percent CPU usage as our upper limit for web servers, in order to reserve "enough headroom to handle occasional spikes." In this case, we're allowing a 15 percent margin of "safety." When making forecasts, we need to take these safety factors into account and adjust the ceiling values appropriately.
Why a 15 percent margin? Why not 10 or 20 percent? Your safety factor is going to be somewhat of a slippery number or educated guess. Some resources, such as caching systems, can also tolerate spikes better than others, so you may want to be less conservative with a margin of safety. You should base your safety margins on "spikes" of usage that you've seen in the past. See Figure 4-13.
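The adequacy check behind that reasoning is trivial to encode. The spike values here are hypothetical stand-ins for your own observed history:

```python
# Ceiling chosen in Chapter 3: 85% CPU, which leaves a 15% margin of safety.
ceiling_pct = 85
margin_pct = 100 - ceiling_pct

# Hypothetical past spikes, as percent increase over the normal peak
# (e.g., front-page links driving 5-15% traffic bumps).
observed_spikes_pct = [5, 8, 12, 15]

# The margin is adequate if it covers the worst spike seen so far.
adequate = max(observed_spikes_pct) <= margin_pct
print(margin_pct, adequate)
```

If a new class of spike exceeds the margin, that's the signal to lower the ceiling (or add capacity) rather than hope the spike doesn't recur.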
Figure 4-13 displays the effect of a typically sized traffic spike Flickr experiences on a regular basis. It's by no means the largest. Spikes such as this one almost always occur when the front page of http://www.yahoo.com posts a prominent link to a group, a photo, or a tag search page on Flickr. This particular spike was fleeting; it lasted only about two hours while the link was up. It caused an eight percent bump in traffic to our photo servers. Seeing a 5–15 percent increase in traffic like this is quite common, and confirms that our 15 percent margin of safety is adequate.