Building Scalable Web Sites

Preface

The first web application I built was called Terrania. A visitor could come to the web site, create a virtual creature with some customizations, and then track that creature’s progress through a virtual world. Creatures would wander about, eat plants (or other creatures), fight battles, and mate with other players’ creatures. This activity would then be reported back to players by twice-daily emails summarizing the day’s events.

Calling it a web application is a bit of a stretch; at the time I certainly wouldn’t have categorized it as such. The core of the game was a program written in C++ that ran on a single machine, loading game data from a single flat file, processing everything for the game “tick,” and storing it all again in a single flat file. When I started building the game, the runtime was destined to become the server component of a client-server game architecture. Programming network data-exchange at the time was a difficult process that tended to involve writing a lot of rote code just to exchange strings between a server and client (we had no .NET in those days).

The Web gave application developers a ready-to-use platform for content delivery across a network, cutting out the trickier parts of client-server applications. We were free to build the server that did the interesting parts while building a client in simple HTML that was trivial in comparison. What would have traditionally been the client component of Terrania resided on the server, simply accessing the same flat file that the game server used. For most pages in the “client” application, I simply loaded the file into memory, parsed out the creatures that the player cared about, and displayed back some static information in HTML. To create a new creature, I appended a block of data to the end of a second file, which the server would then pick up and process each time it ran, integrating the new creatures into the game. All game processing, including the sending of progress emails, was done by the server component. The web server “client” interface was a simple C++ CGI application that could parse the game datafile in a couple of hundred lines of source.

This system was pretty satisfactory; perhaps I didn’t see the limitations at the time because I didn’t come up against any of them. The lack of interactivity through the web interface wasn’t a big deal as that was part of the game design. The only write operation performed by a player was the initial creation of the creature, leaving the rest of the game as a read-only process. Another issue that didn’t come up was concurrency. Since Terrania was largely read-only, any number of players could generate pages simultaneously. All of the writes were simple file appends that were fast enough to avoid spinning for locks. Besides, there weren’t enough players for there to be a reasonable chance of two people reading or writing at once.

A few years would pass before I got around to working with something more closely resembling a web application. While working for a new media agency, I was asked to modify some of the HTML output by a message board powered by UBB (Ultimate Bulletin Board, from Groupee, Inc.). UBB was written in Perl and ran as a CGI. Application data items, such as user accounts and the messages that comprised the discussion, were stored in flat files using a custom format. Some pages of the application were dynamic, being created on the fly from data read from the flat files. Other pages, such as the discussions themselves, were flat HTML files that were written to disk by the application as needed. This render-to-disk technique is still used in low-write, high-read setups such as weblogs, where the cost of generating the viewed pages on the fly outweighs the cost of writing files to disk (which can be a comparatively very slow operation).
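The render-to-disk idea can be sketched in a few lines. This is an illustration in Python, not UBB's actual Perl code: the function names, the `thread-<id>.html` naming scheme, and the page layout are all assumptions made for the example. The page is rebuilt only when a thread changes, and the web server can then hand out the resulting file without running any application code.

```python
import html
import os
import tempfile
from pathlib import Path


def render_thread(title, messages):
    """Build the full HTML page for one discussion thread."""
    items = "\n".join(f"<li>{html.escape(m)}</li>" for m in messages)
    return (
        f"<html><head><title>{html.escape(title)}</title></head>\n"
        f"<body><h1>{html.escape(title)}</h1>\n"
        f"<ul>\n{items}\n</ul></body></html>\n"
    )


def publish_thread(out_dir, thread_id, title, messages):
    """Re-render a thread to a static file (the infrequent write path)."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    target = out_dir / f"thread-{thread_id}.html"  # hypothetical naming scheme

    # Write to a temporary file in the same directory, then rename it over
    # the target, so a reader never sees a half-written page.
    fd, tmp_name = tempfile.mkstemp(dir=out_dir, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as tmp:
        tmp.write(render_thread(title, messages))
    # mkstemp creates the file readable by its owner only; widen that so a
    # separate web server process could read the page.
    os.chmod(tmp_name, 0o644)
    os.replace(tmp_name, target)
    return target
```

Calling `publish_thread("out", 42, "Hello", ["first post"])` after each new reply leaves `out/thread-42.html` on disk. Every page view after that is a plain file read, which is where the technique pays off when reads heavily outnumber writes.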

The great thing about the UBB was that it was written in a “scripting” language, Perl. Because the source code didn’t need to be compiled, the development cycle was massively reduced, making it much easier to tinker with things without wasting days at a time. The source code was organized into three main files: the endpoint scripts that users actually requested and two library files containing utility functions (called ubb_library.pl and ubb_library2.pl—seriously).

After a little experience working with UBB for a few commercial clients, I got fairly involved with the message board “hacking” community—a strange group of people who spent their time trying to add functionality to existing message board software. I started a site called UBB Hackers with a guy who later went on to be a programmer for Infopop, writing the next version of UBB.

Early on, UBB had very poor concurrency because it relied on nonportable file-locking code that didn’t work on Windows (one of the target platforms). If two users were replying to the same thread at the same time, the thread’s datafile could become corrupted and some of the data lost. As the number of users on any single system increased, the chance for data corruption and race conditions increased. For really active systems, rendering HTML files to disk quickly bottlenecks on file I/O. The next step now seems like it should have been obvious, but at the time it wasn’t.
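To make the locking problem concrete, here is a minimal sketch of appending a reply to a thread's datafile under an exclusive lock. It is written in Python rather than UBB's Perl, and the one-JSON-record-per-line format and the function names are assumptions for illustration only. It relies on `fcntl.flock`, which is available on Unix but not on Windows, so it has the same portability limitation described above.

```python
import fcntl
import json


def append_reply(path, author, body):
    """Append one reply to a thread's datafile under an exclusive lock."""
    record = json.dumps({"author": author, "body": body})
    # Mode "a" creates the file if needed and always writes at the end.
    with open(path, "a", encoding="utf-8") as f:
        # Block until no other writer or reader holds the lock.
        # The lock is advisory: it only protects against code that also locks.
        fcntl.flock(f, fcntl.LOCK_EX)
        try:
            f.write(record + "\n")
            f.flush()  # push the data out before the lock is released
        finally:
            fcntl.flock(f, fcntl.LOCK_UN)


def read_replies(path):
    """Read every reply back under a shared lock."""
    with open(path, "r", encoding="utf-8") as f:
        # Many readers may hold a shared lock at once; writers must wait.
        fcntl.flock(f, fcntl.LOCK_SH)
        try:
            return [json.loads(line) for line in f if line.strip()]
        finally:
            fcntl.flock(f, fcntl.LOCK_UN)
```

Even when the locking works, every writer to a busy thread queues behind a single lock, so this approach protects the data without doing anything for throughput.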

MySQL 3 changed a lot of things in the world of web applications. Before MySQL, it wasn’t as easy to use a database for storing web application data. Existing database technologies were either prohibitively expensive (Oracle), slow and difficult to work with (FileMaker), or insanely complicated to set up and maintain (PostgreSQL). With the availability of MySQL 3, things started to change. PHP 4 was just starting to get widespread acceptance and the phpMyAdmin project had been started. phpMyAdmin meant that web application developers could start working with databases without the visual design oddities of FileMaker or the arcane SQL syntax knowledge needed to drive things on the command line. I can still never remember the correct syntax for creating a table or granting access to a new user, but now I don’t need to.

MySQL brought application developers concurrency: we could read and write at the same time and our data would never get inadvertently corrupted. As MySQL progressed, we got even higher concurrency and massive performance, miles beyond what we could have achieved with flat files and render-to-disk techniques. With indexes, we could select data in arbitrary sets and orders without having to load it all into memory and walk the data structure. The possibilities were endless.

And they still are.
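The point about indexes can be shown with a small sketch. It uses SQLite from the Python standard library as a stand-in for MySQL so that it runs without a database server. The `messages` table, its columns, and the index name are hypothetical. The SQL for the table, index, and query is ordinary enough that it should carry over to MySQL with little change, though the query-plan command is SQLite-specific.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a hypothetical messages table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE messages ("
        " id INTEGER PRIMARY KEY,"
        " thread_id INTEGER, author TEXT, posted_at INTEGER)"
    )
    # The index lets the database find one thread's rows, already sorted by
    # time, without scanning every message.
    conn.execute(
        "CREATE INDEX idx_thread_posted ON messages (thread_id, posted_at)"
    )
    rows = [(1, "ann", 300), (2, "bob", 100), (1, "cat", 200), (1, "dan", 100)]
    conn.executemany(
        "INSERT INTO messages (thread_id, author, posted_at) VALUES (?, ?, ?)",
        rows,
    )
    return conn


THREAD_QUERY = (
    "SELECT author FROM messages WHERE thread_id = ? ORDER BY posted_at"
)


def thread_authors(conn, thread_id):
    """Return the authors in one thread, oldest post first."""
    return [row[0] for row in conn.execute(THREAD_QUERY, (thread_id,))]


def query_plan(conn, thread_id):
    """Return SQLite's description of how it would run the query."""
    plan = conn.execute("EXPLAIN QUERY PLAN " + THREAD_QUERY, (thread_id,))
    return [row[-1] for row in plan]  # the last column is the readable detail
```

With the four sample rows, `thread_authors(build_demo_db(), 1)` returns `['dan', 'cat', 'ann']`, and `query_plan` reports that the lookup goes through `idx_thread_posted` rather than walking the whole table.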

The current breed of web applications is still pushing the boundaries of what can be done in terms of scale, functionality, and interoperability. With the explosion of public APIs, the ability to combine multiple applications to create new services has made for a service-oriented culture. The API service model has shown us clear ways to architect our applications for flexibility and scale at a low cost.

The largest and most popular web applications of the moment, such as Flickr, Friendster, MySpace, and Wikipedia, handle billions of database queries per day, have huge datasets, and run on massive hardware platforms composed of commodity hardware. While Google might be the poster child of huge applications, these other smaller (though still huge) applications are becoming role models for the next generation of applications, now labeled Web 2.0. With increased read/write interactivity, network effects, and open APIs, the next generation of web application development is going to be very interesting.

What This Book Is About

This book is primarily about web application design: the design of software and hardware systems for web applications. We’ll be looking at application architecture, development practices, technologies, Unicode, and general infrastructural work. Perhaps as importantly, this book is about the development of web applications: the practice of building the hardware and implementing the software systems that we design. While the theory of application design is all well and good (and an essential part of the whole process), we need to recognize that the implementation plays a very important part in the construction of large applications and needs to be borne in mind during the design process. If we’re designing things that we can’t build, then we can’t know if we’re designing the right thing.

This book is not about programming. At least, not really. Rather than talking about snippets of code, function names, and so forth, we’ll be looking at generalized techniques and approaches for building web applications. While the book does contain some snippets of example code, they are just that: examples. Most of the code examples in this book can be used only in the context of a larger application or infrastructure.

A lot of what we’ll be looking at relates to designing application architectures and building application infrastructures. In the field of web applications, infrastructures tend to mean a combination of hardware platform, software platform, and maintenance and development practices. We’ll consider how all of these fit together to build a seamless infrastructure for large-scale applications.

The largest chapter in this book (Chapter 9) deals solely with scaling applications: architectural approaches to design for scalability as well as technologies and techniques that can be used to help scale existing systems. While we can hardly cover the whole field in a single chapter (we could barely cover the basics in an entire book), we’ve picked a couple of the most useful approaches for applications with common requirements. It should be noted, however, that this is hardly an exhaustive guide to scaling, and there’s plenty more to learn. For an introduction to the wider world of scalable infrastructures, you might want to pick up a copy of Performance by Design: Computer Capacity Planning by Example (Prentice Hall).

Toward the end of the book (Chapters 10 and 11), we look at techniques for keeping web applications running with event monitoring and long-term statistical tracking for capacity planning. Monitoring and alerting are core skills for anyone looking to create an application and then manage it for any length of time. For applications with custom components, or even just many components, the task of designing and building the probes and monitors often falls to the application designers, since they should best know what needs to be tracked and what constitutes an alertable state. For every component of our system, we need to design some way to check that it’s both working and working correctly.
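The distinction between "working" and "working correctly" can be sketched as a probe that records both. This is a hypothetical illustration in Python, not the approach taken in the later chapters: `ProbeResult`, `run_probe`, `alertable`, and the one-second threshold are all names and values invented for the example.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class ProbeResult:
    name: str
    alive: bool      # the component answered at all
    correct: bool    # the answer matched what we expected
    seconds: float   # how long the check took


def run_probe(name: str, check: Callable[[], object], expected) -> ProbeResult:
    """Call a component and record whether it answered, and answered right."""
    start = time.monotonic()
    try:
        answer = check()
    except Exception:
        # Any failure to answer counts as "not working".
        return ProbeResult(name, False, False, time.monotonic() - start)
    return ProbeResult(name, True, answer == expected, time.monotonic() - start)


def alertable(result: ProbeResult, max_seconds: float = 1.0) -> bool:
    """Decide whether a result should raise an alert (assumed threshold)."""
    return (not result.alive) or (not result.correct) or result.seconds > max_seconds
```

A check might insert a known row and read it back, or fetch a page and compare a known string. The important part is that `expected` comes from someone who knows what a correct answer looks like, which is why this work tends to fall to the application designers.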

In the last chapter, we’ll look at techniques for sharing data and allowing other applications to integrate with our own via data feeds and read/write APIs. While we’ll be looking at the design of component APIs throughout the book as we deal with different components in our application, the final chapter deals with ways to present those interfaces to the outside world in a safe and accessible manner. We’ll also look at the various standards that have evolved for data export and interaction and look at approaches for presenting them from our application.

What You Need to Know

This book is not meant for people building their first dynamic web site. There are plenty of good books for first timers, so we won’t be attempting to cover that ground here. As such, you’ll need to have a little experience with building dynamic web sites or applications. At a minimum you should have a little experience of exposing data for editing via web pages and managing user data.

While this book isn’t aimed solely at implementers, there are a number of practical examples. To fully appreciate these examples, a basic knowledge of programming is required. While you don’t need to know about continuations or argument currying, you’ll need to have a working knowledge of simple control structures and the basic von Neumann input-process-storage-output model.

Along with the code examples, we’ll be looking at quite a few examples on the Unix command line. Having access to a Linux box (or other Unix flavor) will make your life a lot easier. Having a server on which you can follow along with the commands and code will make everything easier to understand and have immediate practical usage. A working knowledge of the command line is assumed, so I won’t be telling you how to launch a shell, execute a command, or kill a process. If you’re new to the command line, you should pick up an introductory book before going much further—command-line experience is essential for Unix-based applications and is becoming more important even for Windows-based applications.

While the techniques in this book can be equally applied to any number of modern technologies, the examples and discussions will deal with a set of four core technologies upon which many of the largest applications are built. PHP is the main glue language used in most code examples; don’t worry if you haven’t used PHP before, as long as you’ve used another C-like language. If you’ve worked with C, C++, Java, JavaScript, or Perl, then you’ll pick up PHP in no time at all and the syntax should be immediately understandable.

For secondary code and utility work, there are some examples in Perl. While Perl is also usable as a main application language, it’s most capable in a command-line scripting and data-munging role, so it is often the sensible choice for building administration tools. Again, if you’ve worked with a C-like language, then Perl syntax is a cinch to pick up, so there’s no need to run off and buy the camel book just yet.

For the database component of our application, we’ll focus primarily on MySQL, although we’ll also touch on the other big three (Oracle, SQL Server, and PostgreSQL). MySQL isn’t always the best tool for the job, but it has many advantages over the others: it’s easy to set up, usually good enough, and probably most importantly, free. For prototyping or building small-scale applications, MySQL’s low-effort setup and administration, combined with tools like phpMyAdmin (http://www.phpmyadmin.net), make it a very attractive choice. That’s not to say that there’s no space for other database technologies for building web applications, as all four have extensive usage, but it’s also important to note that MySQL can be used for large-scale applications; many of the largest applications on the Internet use it. A basic knowledge of SQL and database theory will be useful when reading this book, as will an instance of MySQL on which you can play about and connect to example PHP scripts.

To keep in line with a Unix environment, all of the examples assume that you’re using Apache as an HTTP server. To an extent, Apache is the least important component in the tool chain, since we don’t talk much about configuring or extending it (that’s a large field in itself). While experience with Apache is beneficial when reading this book, it’s not essential. Experience with any web server software will be fine.

Practical experience with using the software is not the only requirement, however. To get the most out of this book, you’ll need to have a working knowledge of the theory behind these technologies. For each of the core protocols and standards we look at, I will cite the RFC or specification (which tends to be a little dry and impenetrable) and in most cases refer to important books in the field. While I’ll talk in some depth about HTTP, TCP/IP, MIME, and Unicode, other protocols are referred to only in passing (you’ll see over 200 acronyms). For a full understanding of the issues involved, you’re encouraged to find out about these protocols and standards yourself.

Conventions Used in This Book

Items appearing in the book are sometimes given a special appearance to set them apart from the regular text. Here’s how they look:

Italic: Used for citations of books and articles, commands, email addresses, URLs, filenames, emphasized text, and first references to terms
Constant width: Used for literals, constant values, code listings, and XML markup
Constant width italic: Used for replaceable parameter and variable names
Constant width bold: Used to highlight the portion of a code listing being discussed

Tip

Indicates a tip, suggestion, or general note. For example, we’ll tell you if a certain setting is version-specific.

Warning

Indicates a warning or caution. For example, we’ll tell you if a certain setting has some kind of negative impact on the system.

Using Code Examples

The examples from this book are freely downloadable from the book’s web site at http://www.oreilly.com/catalog/web2apps.

This book is here to help you get the job done. In general, you may use the code in this book in your programs and documentation. You do not need to contact us for permission unless you’re reproducing a significant portion of the code. For example, writing a program that uses several chunks of code from this book does not require permission. Selling or distributing a CD-ROM of examples from O’Reilly books does require permission. Answering a question by citing this book and quoting example code does not require permission. Incorporating a significant amount of example code from this book into your product’s documentation does require permission.

We appreciate, but do not require, attribution. An attribution usually includes the title, author, publisher, and ISBN. For example: “Building Scalable Web Sites by Cal Henderson. Copyright 2006 O’Reilly Media, Inc., 0-596-10235-6.”

If you feel that your use of code examples falls outside fair use or the permission given here, feel free to contact us at permissions@oreilly.com.

Safari® Enabled

When you see a Safari® Enabled icon on the cover of your favorite technology book, that means the book is available online through the O’Reilly Network Safari Bookshelf.

Safari offers a solution that’s better than e-books. It’s a virtual library that lets you easily search thousands of top tech books, cut and paste code samples, download chapters, and find quick answers when you need the most accurate, current information. Try it for free at http://safari.oreilly.com.

How to Contact Us

We have tested and verified the information in this book to the best of our ability, but you may find that features have changed (or even that we have made mistakes!). Please let us know about any errors you find, as well as your suggestions for future editions, by writing to:

O’Reilly Media, Inc.

1005 Gravenstein Highway North

Sebastopol, CA 95472

800-998-9938 (in the United States or Canada)

707-829-0515 (international or local)

707-829-0104 (fax)

We have a web page for this book, where we list errata, examples, or any additional information. You can access this page at:

http://www.oreilly.com/catalog/web2apps

To comment or ask technical questions about this book, send email to:

bookquestions@oreilly.com

You can sign up for one or more of our mailing lists at:

http://elists.oreilly.com

For more information about our books, conferences, software, Resource Centers, and the O’Reilly Network, see our web site at:

http://www.oreilly.com

Acknowledgments

I’d like to thank the original Flickr/Ludicorp team—Stewart Butterfield, George Oates, and Eric Costello—for letting me help build such an awesome product and have a chance to make something people really care about. Much of the larger scale systems design work has come from discussions with other fellow Ludicorpers John Allspaw, Serguei Mourachov, Dathan Pattishall, and Aaron Straup Cope.

I’d also like to thank my long-suffering partner Elina for not complaining too much when I ignored her for months while writing this book.

Building Scalable Web Sites by Cal Henderson