Cloud Application Architectures

An Overview of Amazon Web Services

My goal in this book is to stick to general principles you can apply in any cloud environment. In reality, however, most of you are likely implementing in the AWS environment. Ignoring that fact would be just plain foolish, so I use the AWS environment for the examples throughout this book.

AWS is Amazon’s umbrella description of all of their web-based technology services. It encompasses a wide variety of services, all of which fall into the concept of cloud computing (well, to be honest, I have no clue how you categorize Amazon Mechanical Turk). For the purposes of this book, we will leverage the technologies that fit into their Infrastructure Services:

Amazon Elastic Compute Cloud (Amazon EC2)
Amazon Simple Storage Service (Amazon S3)
Amazon Simple Queue Service (Amazon SQS)
Amazon CloudFront
Amazon SimpleDB

Two of these technologies—Amazon EC2 and Amazon S3—are particularly interesting in the context of transactional systems.

As I mentioned earlier, message queues are critical in grid computing and are also useful in many kinds of transactional systems. They are not, however, typical across web applications, so Amazon SQS will not be a focus in this book.

Given that the heart of a transactional system is a database, you might think Amazon SimpleDB would be a critical piece for a transactional application in the Amazon cloud. In reality, however, Amazon SimpleDB is—as its name implies—simple. Therefore, it’s not well suited to large-scale web applications. Furthermore, it is a proprietary database system, so an application too tightly coupled to Amazon SimpleDB is stuck in the Amazon cloud.

Amazon Elastic Compute Cloud (EC2)

Amazon EC2 is the heart of the Amazon cloud. It provides a web services API for provisioning, managing, and deprovisioning virtual servers inside the Amazon cloud. In other words, any application anywhere on the Internet can launch a virtual server in the Amazon cloud with a single web services call.
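To make the "single web services call" concrete, here is a minimal sketch of what such a request might look like when expressed against the EC2 Query API. It only builds the request URL; it does not send anything. The endpoint, API version string, image ID, and instance type are assumptions chosen for illustration, and the function name is hypothetical. A real request must also be signed with your AWS credentials, which this sketch deliberately leaves out.

```python
from urllib.parse import urlencode

# Assumed values for illustration only; they do not come from the text.
EC2_ENDPOINT = "https://ec2.us-east-1.amazonaws.com/"
API_VERSION = "2016-11-15"


def build_run_instances_url(image_id: str, instance_type: str, count: int = 1) -> str:
    """Build the unsigned query URL for an EC2 RunInstances request.

    The URL is not usable as-is: AWS rejects requests that lack a valid
    signature, and signing is omitted here to keep the sketch short.
    """
    params = {
        "Action": "RunInstances",      # the operation that launches servers
        "ImageId": image_id,           # the AMI to boot from
        "InstanceType": instance_type,
        "MinCount": str(count),
        "MaxCount": str(count),
        "Version": API_VERSION,
    }
    return EC2_ENDPOINT + "?" + urlencode(params)


if __name__ == "__main__":
    # "ami-12345678" is a placeholder, not a real machine image.
    print(build_run_instances_url("ami-12345678", "m1.small"))
```

In practice most people let an SDK or command-line tool construct and sign this request, but the underlying operation is still one call naming an image and a count.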

At the time of this writing, Amazon’s EC2 footprint spans three data centers on the East Coast of the U.S. and two in Western Europe. You can sign up separately for an Amazon European data center account, but you cannot mix and match U.S. and European environments. The servers in these environments run a highly customized version of the open source Xen hypervisor using paravirtualization. This Xen environment enables the dynamic provisioning and deprovisioning of servers, as well as the capabilities necessary to provide an isolated computing environment for each guest server.

When you want to start up a virtual server in the Amazon environment, you launch a new node based on a predefined Amazon machine image (AMI). The AMI includes your operating system and any other prebuilt software. Most people start with a standard AMI based on their favorite operating system, customize it, create a new image, and then launch their servers based on their custom images.

By itself, EC2 has two kinds of storage:

Ephemeral storage tied to the node that expires with the node
Block storage that acts like a SAN and persists across time

Many competitors to Amazon also provide persistent internal storage for nodes to make them operate more like a traditional data center.

In addition, servers in EC2—like any other server on the Internet—can access Amazon S3 for cloud-based persistent storage. EC2 servers in particular see both cost savings and greater efficiencies in accessing S3.

To secure your network within the cloud, you can control virtual firewall rules that define how traffic is filtered on its way to your virtual nodes. You define these firewall rules by creating security groups and associating the rules with those groups. For example, you might create a DMZ group that allows port 80 and port 443 traffic from the public Internet into its servers, but allows no other incoming traffic.
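The DMZ example can be modeled in a few lines. This is a sketch of the idea only, not Amazon's implementation or API: the `Rule` and `SecurityGroup` classes are hypothetical, and the one behavior it illustrates is that inbound traffic is denied unless some rule in the group admits it.

```python
from dataclasses import dataclass, field
from ipaddress import ip_address, ip_network


@dataclass(frozen=True)
class Rule:
    """One inbound rule: a destination port and the CIDR block allowed to reach it."""
    port: int
    source: str


@dataclass
class SecurityGroup:
    """A named set of inbound rules; anything not matched by a rule is denied."""
    name: str
    rules: list = field(default_factory=list)

    def allows(self, port: int, source_ip: str) -> bool:
        addr = ip_address(source_ip)
        return any(
            rule.port == port and addr in ip_network(rule.source)
            for rule in self.rules
        )


# The DMZ group from the text: web traffic from anywhere, nothing else.
dmz = SecurityGroup("dmz", [Rule(80, "0.0.0.0/0"), Rule(443, "0.0.0.0/0")])

if __name__ == "__main__":
    print(dmz.allows(443, "203.0.113.7"))  # HTTPS from the public Internet
    print(dmz.allows(22, "203.0.113.7"))   # SSH is not listed, so it is denied
```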

Amazon Simple Storage Service (S3)

Amazon S3 is cloud-based data storage accessible in real time via a web services API from anywhere on the Internet. Using this API, you can store any number of objects—ranging in size from 1 byte to 5 GB—in a fairly flat namespace.

It is very important not to think of Amazon S3 as a filesystem. I have seen too many people get in trouble when they expect it to act that way. First of all, it has a two-level namespace. At the first level, you have buckets. You can think of these buckets as directories, if you like, as they store the data you put in S3. Unlike traditional directories, however, you cannot organize them hierarchically: you cannot put buckets in buckets. At the second level, you have the objects themselves, each stored in a bucket under a key. Perhaps more significant is the fact that the bucket namespace is shared across all Amazon customers. You need to take special care in designing bucket names that will not clash with other buckets. In other words, you won’t be creating a bucket called “Documents”.
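One common way to avoid clashes in the shared bucket namespace is to prefix every bucket name with something you own, such as a reversed domain name. The sketch below assumes that convention; the helper names and the `com-example-acme` prefix are hypothetical. The validation pattern is a simplified version of the S3 bucket naming rules as I understand them (3 to 63 characters; lowercase letters, digits, dots, and hyphens; starting and ending with a letter or digit) and omits some of the finer restrictions.

```python
import re

# Simplified bucket-name check; the real rules have additional restrictions.
_BUCKET_RE = re.compile(r"^[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]$")


def make_bucket_name(owner_prefix: str, purpose: str) -> str:
    """Combine an owner-specific prefix with a purpose to get a name unlikely to clash."""
    name = f"{owner_prefix}-{purpose}".lower()
    if not _BUCKET_RE.match(name):
        raise ValueError(f"not a usable bucket name: {name!r}")
    return name


def make_object_key(*parts: str) -> str:
    """Join parts with '/'.

    The slashes are only characters in one flat key. They look like a path,
    but the bucket contains no nested directories.
    """
    return "/".join(parts)


if __name__ == "__main__":
    print(make_bucket_name("com-example-acme", "Documents"))
    print(make_object_key("2009", "reports", "q1.pdf"))
```

The second helper shows how people usually get the feel of folders inside a bucket: by convention in the key, not by structure in the store.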

Another important thing to keep in mind is that Amazon S3 is relatively slow. Actually, it is very fast for an Internet-deployed service, but if you are expecting it to respond like a local disk or a SAN, you will be very disappointed. Therefore, it is not feasible to use Amazon S3 as an operational storage medium.

Finally, access to S3 is via web services, not a filesystem or WebDAV. As a result, applications must be written specifically to store data in Amazon S3. Perhaps more to the point, you can’t simply rsync a directory with S3 without specially crafted tools that use the Amazon API and skirt the S3 limitations.
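To give a sense of what such a specially crafted tool has to do, here is a sketch of the planning half of a directory sync. It assumes you have already obtained a listing of the bucket that maps each key to an MD5 hex digest of the stored content; how you obtain that listing is left out, and both function names are hypothetical. Nothing here talks to S3.

```python
import hashlib
from pathlib import Path


def local_digests(root: Path) -> dict:
    """Map each file under root (by relative, '/'-separated path) to its MD5 hex digest.

    Files are read whole, which is fine for a sketch but not for very large files.
    """
    digests = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            key = path.relative_to(root).as_posix()
            digests[key] = hashlib.md5(path.read_bytes()).hexdigest()
    return digests


def plan_uploads(local: dict, remote: dict) -> list:
    """Return the keys that are missing remotely or whose content differs."""
    return sorted(key for key, digest in local.items() if remote.get(key) != digest)
```

A real tool would then upload each planned key through the web services API, and would also need to decide what to do about objects that exist remotely but no longer exist locally.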

I have spent enough text describing what Amazon S3 is not—so what is it?

Amazon S3 enables you to place persistent data into the cloud and retrieve it at a later date with a near certainty that it will be there in one consistent piece when you get it back. Its key benefit is that you can simply continue to shove data into Amazon S3 and never worry about running out of storage space. In short, for most users, S3 serves as a short-term or long-term backup facility.

Cloud storage systems have unique challenges that legacy storage technologies cannot address. Storage technologies based on RAID and replication are not well suited for cloud infrastructures because they don’t scale easily to the exabyte level. Legacy storage technologies rely on redundant copies to increase reliability, resulting in systems that are not easily manageable, chew up bandwidth, and are not cost effective.

Cleversafe’s unique cloud storage platform—based on company technology trademarked under the name Dispersed Storage—divides data into slices and stores them in different geographic locations on hardware appliances. The algorithms used to divide the data are comparable to the concept of parity—but with much more sophistication—because they allow the total data to be reconstituted from a subset. For instance, you may store the data in 12 locations, any 8 of which are enough to restore it completely. This technology, known as information dispersal, achieves geographic redundancy and high availability without expensive replication of the data.

In April 2008, Cleversafe embodied its dispersal technology in hardware appliances that provide a front end to the user using standard protocols such as REST APIs or iSCSI. The appliances take on the task of splitting and routing the data to storage sites, and the stored data occupies only 1.3 to 1.6 times the original file size, versus 3 times in a replicated system.
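The storage overhead follows directly from the slice counts: if any 8 of 12 slices can rebuild the data, the stored size is 12/8 = 1.5 times the original, which sits inside the 1.3 to 1.6 range quoted above. The sketch below shows that arithmetic and, separately, the plain parity idea the text compares dispersal to. It is not Cleversafe's algorithm: simple XOR parity can rebuild only one lost slice, whereas the 12-slice example tolerates four. All function names are hypothetical.

```python
from functools import reduce


def expansion_factor(total_slices: int, needed_slices: int) -> float:
    """Stored size relative to original size when any `needed` of `total` slices suffice."""
    return total_slices / needed_slices


def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def split_with_parity(data: bytes, k: int) -> list:
    """Split data into k equal-length slices (zero-padded) plus one XOR parity slice."""
    size = -(-len(data) // k)  # ceiling division
    padded = data.ljust(size * k, b"\0")
    slices = [padded[i * size:(i + 1) * size] for i in range(k)]
    slices.append(reduce(xor_bytes, slices))  # parity = XOR of all data slices
    return slices


def recover(slices: list, original_length: int) -> bytes:
    """Rebuild the data when at most one slice (marked None) is missing."""
    missing = [i for i, s in enumerate(slices) if s is None]
    if len(missing) > 1:
        raise ValueError("single parity can only recover one lost slice")
    if missing:
        present = [s for s in slices if s is not None]
        slices = list(slices)
        # XOR of every surviving slice reproduces the missing one.
        slices[missing[0]] = reduce(xor_bytes, present)
    return b"".join(slices[:-1])[:original_length]


if __name__ == "__main__":
    print(expansion_factor(12, 8))
    parts = split_with_parity(b"dispersed storage demo", 4)
    parts[2] = None  # lose one slice
    print(recover(parts, len(b"dispersed storage demo")))
```

With four data slices and one parity slice the expansion factor is 5/4 = 1.25, which illustrates the trade: fewer extra slices mean less overhead but also less tolerance for loss.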

Companies are using Cleversafe’s Dispersed Storage appliances to build public and private cloud storage as a backend infrastructure to Software as a Service. Dispersed Storage easily fulfills the characteristics of a cloud infrastructure since it provides storage on demand and accessibility anywhere.

Dispersal also achieves higher levels of security within the cloud without necessarily needing encryption, because each slice contains too little information to be useful. This unique architecture helps address people’s concern over their data being outside of their immediate control, which often becomes a barrier to storage decisions. While a lost backup tape contains a full copy of data, access to a single appliance using Dispersed Storage results in no data breach.

Additionally, Dispersed Storage is massively scalable and designed to handle petabytes of data. By adding servers into the storage cloud with automated storage discovery, the total storage of the system can easily grow, and performance can be scaled by simply adding additional appliances. Virtualization tools enable easy deployment and on-demand provisioning. All of these capabilities streamline efforts for storage administrators.

Dispersed Storage is also designed to store and distribute large objects, the cornerstone of our media-intensive society that has become dependent on videos and images in every aspect of life. Dispersal is inherently designed for content distribution by naturally incorporating load balancing through the multitude of access choices for selecting the slices used to reconstruct the original file. This means companies do not have to deal with or pay for implementing a separate content delivery network for their stored data.

Dispersed Storage offers a novel and needed approach to cloud storage, and will be significant as cloud storage matures and displaces traditional storage methods.

Amazon Simple Queue Service (SQS)

Amazon SQS is a cornerstone of any Amazon-based grid computing effort. As with any message queue service, it accepts messages and passes them on to servers subscribing to the message queue.

A messaging system typically enables multiple computers to exchange information in complete ignorance of each other. The sender simply submits a short message (up to 8 KB in Amazon SQS) into the queue and continues about its business. The recipient retrieves the message from the queue and acts upon the contents of the message.

A message, for example, can be, “Process data set 123.csv in S3 bucket s3://fancy-bucket and submit the results to message queue Y.” One advantage of a message queue system is that the sender does not need to identify a recipient or perform any error handling to deal with communication failures. The recipient does not even need to be active at the time the message is sent.
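The decoupling described above can be sketched with an in-process queue standing in for the service. This is not the Amazon SQS API: Python's `queue.Queue` plays the role of the queue, the message fields (`action`, `bucket`, `key`, `reply_to`) are hypothetical, and the only SQS-specific detail carried over is the 8 KB size limit quoted in the text.

```python
import json
import queue

MAX_MESSAGE_BYTES = 8 * 1024  # the message size limit quoted in the text


def send(q: queue.Queue, message: dict) -> None:
    """Serialize the message, enforce the size limit, and enqueue it.

    The sender names no recipient and returns immediately.
    """
    body = json.dumps(message).encode("utf-8")
    if len(body) > MAX_MESSAGE_BYTES:
        raise ValueError("message exceeds 8 KB")
    q.put(body)


def receive(q: queue.Queue):
    """Return the next message as a dict, or None if the queue is empty."""
    try:
        return json.loads(q.get_nowait())
    except queue.Empty:
        return None


if __name__ == "__main__":
    jobs = queue.Queue()
    # The sender can enqueue work before any recipient is running.
    send(jobs, {"action": "process", "bucket": "fancy-bucket",
                "key": "123.csv", "reply_to": "Y"})
    print(receive(jobs))
```

Note that the message carries a pointer to the data in S3, not the data itself, which is how a small message limit and large data sets coexist.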

The Amazon SQS system fits well into a cloud computing environment due to its simplicity. Most systems requiring a message queue need only a simple API to submit a message, retrieve it, and trust the integrity of the message within the queue. It can be a tedious task to develop and maintain something this simple, but it is also way too complex and expensive to use many of the commercial message queue packages.

Amazon CloudFront

Amazon CloudFront, a cloud-based content distribution network (CDN), is a new offering from Amazon Web Services at the time of this writing. It enables you to place your online content at the edges of the network, meaning that content is delivered from a location close to the user requesting it. In other words, a site visitor from Los Angeles can grab the same content from an Amazon server in Los Angeles that a visitor from New York is getting from a server in New York. You place the content in S3 and it gets moved to the edge points of the Amazon network for rapid delivery to content consumers.

Amazon SimpleDB

Amazon SimpleDB is an odd combination: structured data storage with higher reliability than your typical MySQL or Oracle instance, but support for only very baseline relational storage needs. It is very powerful for people concerned more with the availability of relational data and less so with the complexity of their relational model or transaction management. In my experience, this audience is a very small subset of transactional applications, though it could be particularly useful in heavy read environments, such as web content management systems.

The advantages of Amazon SimpleDB include:

No need for a database administrator (DBA)
A very simple web services API for querying the data
Availability of a clustered database management system (DBMS)
Highly scalable data storage

If you need the power of a relational database, Amazon SimpleDB is not an appropriate tool. On the other hand, if your idea of an application database is Berkeley DB (bdb), Amazon SimpleDB will be the perfect tool for you.

Cloud Application Architectures by George Reese
