My goal in this book is to stick to general principles you can apply in any cloud environment. In reality, however, most of you are likely building on AWS. Ignoring that fact would be just plain foolish; therefore, I will use the AWS environment for the examples throughout this book.
AWS is Amazon’s umbrella name for all of its web-based technology services. It encompasses a wide variety of services, nearly all of which fall under the concept of cloud computing (well, to be honest, I have no clue how you categorize Amazon Mechanical Turk). For the purposes of this book, we will leverage the technologies that fit into Amazon’s Infrastructure Services: Amazon EC2, Amazon S3, Amazon SQS, Amazon CloudFront, and Amazon SimpleDB.
Two of these technologies—Amazon EC2 and Amazon S3—are particularly interesting in the context of transactional systems.
As I mentioned earlier, message queues are critical in grid computing and are also useful in many kinds of transactional systems. They are not, however, typical across web applications, so Amazon SQS will not be a focus in this book.
Given that the heart of a transactional system is a database, you might think Amazon SimpleDB would be a critical piece for a transactional application in the Amazon cloud. In reality, however, Amazon SimpleDB is—as its name implies—simple. Therefore, it’s not well suited to large-scale web applications. Furthermore, it is a proprietary database system, so an application too tightly coupled to Amazon SimpleDB is stuck in the Amazon cloud.
Amazon EC2 is the heart of the Amazon cloud. It provides a web services API for provisioning, managing, and deprovisioning virtual servers inside the Amazon cloud. In other words, any application anywhere on the Internet can launch a virtual server in the Amazon cloud with a single web services call.
At the time of this writing, Amazon’s EC2 footprint spans three data centers on the East Coast of the U.S. and two in Western Europe. You can sign up separately for an Amazon European data center account, but you cannot mix and match U.S. and European environments. The servers in these environments run a highly customized version of the open source Xen hypervisor using paravirtualization. This Xen environment enables the dynamic provisioning and deprovisioning of servers and provides the isolation that guest servers need from one another.
When you want to start up a virtual server in the Amazon environment, you launch a new node based on a predefined Amazon machine image (AMI). The AMI includes your operating system and any other prebuilt software. Most people start with a standard AMI based on their favorite operating system, customize it, create a new image, and then launch their servers based on their custom images.
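To make "a single web services call" a little more concrete, here is a minimal sketch of what a launch request looks like when expressed as a query URL. `RunInstances`, `ImageId`, `MinCount`, `MaxCount`, and `InstanceType` are real names from the EC2 Query API; the endpoint constant, the image ID, and the helper function are my own illustrative assumptions. The sketch deliberately stops short of a working request: a real call also needs an API version, your credentials, and a signature.

```python
from urllib.parse import urlencode

# Assumed values for illustration only; neither comes from the text.
EC2_ENDPOINT = "https://ec2.amazonaws.com/"
PLACEHOLDER_AMI = "ami-12345678"  # hypothetical image ID


def build_run_instances_url(image_id, count=1, instance_type="m1.small"):
    """Return the unsigned query URL for one RunInstances request.

    This only shows the shape of the call. It omits the Version,
    credential, and signature parameters a real request requires.
    """
    params = {
        "Action": "RunInstances",
        "ImageId": image_id,          # the AMI to boot from
        "MinCount": str(count),       # launch exactly `count` servers
        "MaxCount": str(count),
        "InstanceType": instance_type,
    }
    # Sort the parameters so the output is stable and easy to read.
    return EC2_ENDPOINT + "?" + urlencode(sorted(params.items()))


if __name__ == "__main__":
    print(build_run_instances_url(PLACEHOLDER_AMI))
```

Swapping in the ID of your own custom image is the only change needed to describe a launch of your customized server rather than a stock one.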
By itself, EC2 has two kinds of storage:
In addition, servers in EC2—like any other server on the Internet—can access Amazon S3 for cloud-based persistent storage. EC2 servers in particular see both cost savings and greater efficiencies in accessing S3.
To secure your network within the cloud, you can control virtual firewall rules that filter the traffic reaching your virtual nodes. You define these firewall rules by creating security groups and associating the rules with those groups. For example, you might create a DMZ group that allows port 80 and port 443 traffic from the public Internet into its servers, but allows no other incoming traffic.
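The DMZ group is easy to reason about if you model it as a list of allow rules with everything else denied. The following is a small in-memory sketch of that idea, not the EC2 API; the `Rule` and `SecurityGroup` classes and the sample addresses are hypothetical names I made up for illustration.

```python
from dataclasses import dataclass, field
from ipaddress import ip_address, ip_network


@dataclass(frozen=True)
class Rule:
    protocol: str  # e.g. "tcp"
    port: int      # destination port on the server
    source: str    # CIDR block the traffic may come from


@dataclass
class SecurityGroup:
    name: str
    rules: list = field(default_factory=list)

    def allow(self, protocol, port, source="0.0.0.0/0"):
        """Add an allow rule; the default source is the whole Internet."""
        self.rules.append(Rule(protocol, port, source))

    def permits(self, protocol, port, source_ip):
        """Incoming traffic passes only if some rule matches it."""
        addr = ip_address(source_ip)
        return any(
            r.protocol == protocol
            and r.port == port
            and addr in ip_network(r.source)
            for r in self.rules
        )


# The DMZ group from the text: HTTP and HTTPS from anywhere, nothing else.
dmz = SecurityGroup("dmz")
dmz.allow("tcp", 80)
dmz.allow("tcp", 443)

if __name__ == "__main__":
    print(dmz.permits("tcp", 443, "203.0.113.7"))  # web traffic
    print(dmz.permits("tcp", 22, "203.0.113.7"))   # SSH, no matching rule
```

The design point worth noticing is that there is no "deny" rule anywhere: anything you have not explicitly allowed simply never matches.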
Amazon S3 is cloud-based data storage accessible in real time via a web services API from anywhere on the Internet. Using this API, you can store any number of objects—ranging in size from 1 byte to 5 GB—in a fairly flat namespace.
It is very important not to think of Amazon S3 as a filesystem. I have seen too many people get in trouble when they expect it to act that way. First of all, it has a two-level namespace. At the first level, you have buckets; at the second, the objects you store in them. You can think of these buckets as directories, if you like, as they store the data you put in S3. Unlike traditional directories, however, you cannot organize them hierarchically: you cannot put buckets in buckets. Perhaps more significant is the fact that the bucket namespace is shared across all Amazon customers. You need to take special care in designing bucket names that will not clash with other buckets. In other words, you won’t be creating a bucket called “Documents”.
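One common way to stay out of other customers' way in the shared bucket namespace is to prefix every bucket with something you own, such as your domain name. The sketch below shows that convention along with the flat nature of object keys. The helper names and the `example-com` prefix are hypothetical, and the validation covers only a subset of the current S3 naming rules (3 to 63 characters, lowercase letters, digits, dots, and hyphens, beginning and ending with a letter or digit).

```python
import re
from urllib.parse import quote

# Subset of the S3 bucket naming rules; see the lead-in for what is covered.
_BUCKET_RE = re.compile(r"^[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]$")


def make_bucket_name(owner_prefix, purpose):
    """Combine a prefix you own with a purpose to form a bucket name."""
    raw = f"{owner_prefix}-{purpose}".lower()
    # Replace runs of disallowed characters, then trim stray separators.
    name = re.sub(r"[^a-z0-9.-]+", "-", raw).strip("-.")
    if not _BUCKET_RE.match(name):
        raise ValueError(f"not a usable bucket name: {name!r}")
    return name


def object_url(bucket, key):
    """Address of an object: one bucket plus one key, and nothing deeper.

    Slashes in the key are only a naming convention. S3 stores the key
    as a single string, not as a path through nested directories.
    """
    return f"https://{bucket}.s3.amazonaws.com/{quote(key)}"


if __name__ == "__main__":
    bucket = make_bucket_name("example-com", "Documents")
    print(bucket)
    print(object_url(bucket, "reports/2008/q4.csv"))
```

A prefix does not guarantee uniqueness, of course; it just makes a clash unlikely enough that you can treat a failed bucket creation as an exceptional case.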
Another important thing to keep in mind is that Amazon S3 is relatively slow. Actually, it is very fast for an Internet-deployed service, but if you are expecting it to respond like a local disk or a SAN, you will be very disappointed. Therefore, it is not feasible to use Amazon S3 as an operational storage medium.
Finally, access to S3 is via web services, not a filesystem or WebDAV. As a result, applications must be written specifically to store data in Amazon S3. Perhaps more to the point, you can’t simply rsync a directory with S3 without specially crafted tools that use the Amazon API and skirt the S3 limitations.
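To see why a sync tool has to be purpose-built, consider the bookkeeping it must do on its own: flatten a directory tree into keys, fingerprint each file, and compare the result against what the bucket already holds. Here is a minimal sketch of that planning step. The function names are hypothetical, and the remote listing is assumed to be a plain dictionary of key to MD5 digest that you obtained elsewhere; no S3 call is made.

```python
import hashlib
import os


def local_manifest(root):
    """Map flat, slash-separated keys to MD5 hex digests for files under root."""
    manifest = {}
    for dirpath, _dirs, files in os.walk(root):
        for fname in files:
            path = os.path.join(dirpath, fname)
            # The directory structure survives only as part of the key string.
            key = os.path.relpath(path, root).replace(os.sep, "/")
            with open(path, "rb") as fh:
                manifest[key] = hashlib.md5(fh.read()).hexdigest()
    return manifest


def plan_sync(local, remote):
    """Return (keys to upload, keys to delete) from two key-to-digest maps."""
    # Upload anything missing remotely or whose content has changed.
    upload = sorted(k for k, digest in local.items() if remote.get(k) != digest)
    # Delete anything in the bucket that no longer exists locally.
    delete = sorted(k for k in remote if k not in local)
    return upload, delete


if __name__ == "__main__":
    local = local_manifest(".")
    to_upload, to_delete = plan_sync(local, remote={})
    print(len(to_upload), "files would be uploaded")
```

Reading whole files into memory keeps the sketch short; a real tool would hash in chunks and would still need to perform the uploads and deletes through the S3 API.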
I have spent enough text describing what Amazon S3 is not—so what is it?
Amazon S3 enables you to place persistent data into the cloud and retrieve it at a later date with a near certainty that it will be there in one consistent piece when you get it back. Its key benefit is that you can simply continue to shove data into Amazon S3 and never worry about running out of storage space. In short, for most users, S3 serves as a short-term or long-term backup facility.
A messaging system typically enables multiple computers to exchange information in complete ignorance of each other. The sender simply submits a short message (up to 8 KB in Amazon SQS) into the queue and goes about its business. The recipient retrieves the message from the queue and acts upon the contents of the message.
A message, for example, can be, “Process data set 123.csv in S3 bucket s3://fancy-bucket and submit the results to message queue Y.” One advantage of a message queue system is that the sender does not need to identify a recipient or perform any error handling to deal with communication failures. The recipient does not even need to be active at the time the message is sent.
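The decoupling is easier to see in code. The following sketch uses an in-memory stand-in for a queue, not the Amazon SQS API, and the class name, the JSON field names, and the message layout are all my own assumptions. It encodes the example message above, enforces the 8 KB limit, and shows that the sender never names a recipient.

```python
import json
from collections import deque

MAX_MESSAGE_BYTES = 8 * 1024  # the SQS message limit quoted in the text


class InMemoryQueue:
    """Hypothetical stand-in for a message queue service."""

    def __init__(self):
        self._messages = deque()

    def send(self, body):
        """Store a message; the sender never learns who will read it."""
        if len(body.encode("utf-8")) > MAX_MESSAGE_BYTES:
            raise ValueError("message exceeds the 8 KB limit")
        self._messages.append(body)

    def receive(self):
        """Return the oldest waiting message, or None if the queue is empty."""
        return self._messages.popleft() if self._messages else None


work = InMemoryQueue()

# Sender: describe the job and move on. No recipient is identified.
work.send(json.dumps({
    "action": "process",
    "bucket": "fancy-bucket",
    "key": "123.csv",
    "reply_to": "Y",
}))

# Recipient: may start long after the message was sent.
task = json.loads(work.receive())

if __name__ == "__main__":
    print(task["action"], task["key"], "from", task["bucket"])
```

Keeping the message as a small pointer to data in S3, rather than the data itself, is what makes the size limit a non-issue in practice.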
The Amazon SQS system fits well into a cloud computing environment due to its simplicity. Most systems requiring a message queue need only a simple API to submit a message, retrieve it, and trust the integrity of the message within the queue. Developing and maintaining even something this simple is tedious, yet many of the commercial message queue packages are far too complex and expensive for the job.
Amazon CloudFront, a cloud-based content distribution network (CDN), is a new offering from Amazon Web Services at the time of this writing. It enables you to place your online content at the edges of the network, meaning that content is delivered from a location close to the user requesting it. In other words, a site visitor from Los Angeles can grab the same content from an Amazon server in Los Angeles that a visitor from New York is getting from a server in New York. You place the content in S3 and it gets moved to the edge points of the Amazon network for rapid delivery to content consumers.
Amazon SimpleDB is an odd combination: structured data storage with higher reliability than your typical MySQL or Oracle instance, but support for only very baseline relational storage needs. It is very powerful for people concerned more with the availability of relational data and less with the complexity of their relational model or transaction management. In my experience, this audience is a very small subset of transactional applications, though it could be particularly useful in heavy read environments, such as web content management systems.
The advantages of Amazon SimpleDB include:
No need for a database administrator (DBA)
A very simple web services API for querying the data
Availability of a clustered database management system (DBMS)
Very high scalability in terms of data storage capacity
If you need the power of a relational database, Amazon SimpleDB is not an appropriate tool. On the other hand, if your idea of an application database is Berkeley DB (bdb), Amazon SimpleDB will be the perfect tool for you.