It’s Friday afternoon. Your new ebook product is launching on Monday and you have 1,000 files to convert automatically from a legacy format to EPUB. You figured it would take a minute or two per file, but that was before you tried the converter on some actual commercial ebooks, the kind with lots of really big images. It turns out that if you convert all the books one at a time on your laptop like you’d been planning, you’ll be ready to launch in 2015.
Don’t panic. There’s always EC2.
Amazon’s cloud-for-hire service is widely used by internet startups (and post-startups), but it’s a potential lifesaver for any kind of large-scale computing task. Like it or not, the publishing industry today is about producing computer-readable data. Even if you just make a couple of EPUBs a year, you’re developing software, and if you make several hundred EPUBs a year, you should probably be thinking about scalable data-crunching techniques.
In this post, I’ll describe the data conversion problem and the architecture I used to crank through almost a thousand files in a couple of hours. In part 2, I’ll walk through the Python code that powers all the pieces and assess the costs and benefits of this kind of approach. Programming skills are assumed.
Entering the EC2 world for the first time is intimidating, even for an experienced developer. There’s a lot of terminology, and books and articles get out of date fast. This post isn’t a comprehensive tutorial, but I’ll try to point out the places where it’s easy to get stuck or confused.
For my ebook conversion project, I had to solve several problems:
- I had to process a lot of files in a short amount of time; that was impossible even with the fastest single computer I could find. So I’d need parallelization.
- I had to assume that my conversion process might have bugs to be fixed or improvements to be made in the future. So I needed repeatability.
- I didn’t intend to spend my weekend turning on and off hundreds of servers, and the penalty for leaving servers running unnecessarily could be a very expensive bill. So I needed server provisioning to be scriptable.
- Once I spun up my servers in parallel, I needed a way to distribute the work across them in an orderly way. I didn’t want to waste time and money converting books unnecessarily.
EC2 solves the first three of those problems out of the box:
- It’s metered. You can rent very, very powerful hosts for only as much time as you need.
- It’s scriptable. Once you get the basics set up the way you like, it doesn’t much matter whether you want to run the same process 1 time or 1,000,000 times. I turned to the widely used boto Python library to script my EC2 provisioning process.
- It’s repeatable. This is more than just about being scriptable; you can create a snapshot of the host exactly as you want, with all of the dependencies pre-installed. If you need to run the same process again in a year, nothing will have changed.
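To give a flavor of what “scriptable” means in practice, here is a minimal sketch of launching a batch of instances with boto (the classic boto 2 API). The AMI ID, region, and instance type are placeholders, not the values from this project:

```python
NUM_INSTANCES = 20

def build_user_data(repo_dir='/home/ubuntu/ebook_converter'):
    """Assemble a boot script for each instance to run on startup."""
    return '\n'.join([
        '#!/bin/bash',
        'cd %s' % repo_dir,
        'sudo -u ubuntu git pull origin master',
    ])

def launch_workers(user_data, ami_id='ami-00000000'):
    """Start NUM_INSTANCES copies of the worker AMI, passing the boot script."""
    import boto.ec2  # imported here so the helper above works without boto installed
    conn = boto.ec2.connect_to_region('us-east-1')
    reservation = conn.run_instances(
        ami_id,
        min_count=NUM_INSTANCES,
        max_count=NUM_INSTANCES,
        instance_type='m1.large',
        user_data=user_data,
    )
    return reservation.instances
```

With credentials configured, `launch_workers(build_user_data())` would start the whole fleet in one call.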
The one thing that EC2 couldn’t magically solve for me was the distribution problem: each EC2 virtual computer would be a newborn baby, with code all ready to go but no idea which files to process. After flailing around for a while I settled on using a queuing service. Amazon includes the Simple Queue Service (SQS) with its offering, so I was all set.
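Filling an SQS queue with boto looks something like this sketch; the queue name and region are placeholders, and `unique_jobs` is a helper of my own to make sure no book gets queued (and paid for) twice:

```python
def unique_jobs(filenames):
    """Deduplicate while preserving order, so no book is converted twice."""
    seen = set()
    out = []
    for name in filenames:
        if name not in seen:
            seen.add(name)
            out.append(name)
    return out

def enqueue_books(filenames, queue_name='ebook-conversion-jobs'):
    """Push one message per source file onto the queue."""
    import boto.sqs  # imported here so unique_jobs() works without boto installed
    from boto.sqs.message import Message
    conn = boto.sqs.connect_to_region('us-east-1')
    queue = conn.create_queue(queue_name)  # returns the existing queue if it's already there
    for name in unique_jobs(filenames):
        message = Message()
        message.set_body(name)
        queue.write(message)
```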
Customizing the EC2 instance
An Amazon Machine Image (AMI) is the template your virtual computers launch from. AMIs are a complicated subject, but for general-purpose Linux needs I pick one of the official Ubuntu AMIs. Choose an instance type whose storage and memory allocation match your application’s needs. In my case, I needed a fair amount of both storage and RAM, as the conversion process generates a lot of temporary files, but it turned out I didn’t need the most powerful processor. (If you need a lot of storage or want lots of files pre-installed, you’re also going to need to configure your own Elastic Block Store volume and mount it; avoid this step if you can because it’s a pain.)
You can (and should) install all the dependencies you need on your AMI before taking a snapshot. However, if anything about your application changes, you’ll need to rebuild the AMI, an annoying and non-automatic process. User data is a simple user-authored script, provided at launch time, that runs on boot. Since my converter’s source code was in a private GitHub repository, I set up my user data script to automatically pull the latest version of the code; the AMI already had the correct SSH keys pre-installed:
```bash
#!/bin/bash
cd /home/ubuntu/ebook_converter
# User data scripts run as root, so ensure that we "su" to the local user account
sudo -u ubuntu git pull origin master
```
My conversion script runs as a Python program inside a Python virtual environment. It’s possible that the latest version of the code has new dependencies, so after the instance pulls the latest code, the user-data script also runs
```bash
su ubuntu -c "source ve/bin/activate && python setup.py develop"
```
The job queue
To solve the workload distribution problem, I needed a centralized place to store all of the books to be converted. Originally I thought I’d spin up each EC2 instance with a predefined set of books to convert and pass that list in to the user-data script, but I realized a far better approach was to use a queue. A queue is just a data structure that maintains an ordered list and allows items to be added or removed programmatically. If I’d provisioned the work list for each instance myself, I’d have to deal with potential problems like an instance crashing during processing and leaving half its workload untouched. Better to leave that to the queue.
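This crash recovery is exactly what SQS’s visibility timeout provides: a message is hidden from other readers while one worker holds it, and it is deleted only after a successful conversion, so a crashed worker’s job eventually reappears for the others. A worker’s consumption loop might look like the following sketch; `drain_queue`, `convert_book`, and the timeout values are my own placeholder names, not the project’s code:

```python
import time

def drain_queue(queue, convert_book, idle_polls=3, poll_delay=5):
    """Process messages until the queue stays empty for several polls in a row."""
    misses = 0
    while misses < idle_polls:
        # Hide the job from other workers for 10 minutes while we convert it
        message = queue.read(visibility_timeout=600)
        if message is None:
            misses += 1
            time.sleep(poll_delay)
            continue
        misses = 0
        convert_book(message.get_body())
        queue.delete_message(message)  # only now is the job really gone
```

If `convert_book` raises (or the whole instance dies), the message is never deleted, so it becomes visible again after the timeout and another worker picks it up.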
(The AMIs and the queue need to be in the same Amazon region or they won’t be able to find each other.)
The final pipeline
I ended up with a workflow like this:
- Get a list of books to convert.
- Push the list of filenames to the queue.
- Start up the maximum number of simultaneous instances (20).
- For each instance:
  - Update itself with the latest code.
  - Get a filename off the queue.
  - Go get the file.
  - Convert it.
  - Put it somewhere.
Each instance keeps requesting books off the queue until the queue is empty; at that point it can just shut itself down. If an instance crashes, there are still others to pick up the slack.
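On the client side, the first three steps of that workflow reduce to a short driver script. Here’s a hedged sketch: `find_books()` is a working helper, while `push_to_queue` and `start_instances` are hypothetical stand-ins for the boto SQS and EC2 calls, and the `.book` extension is a placeholder:

```python
import os

MAX_INSTANCES = 20

def find_books(source_dir, extension='.book'):
    """Step 1: collect the filenames to convert (extension is a placeholder)."""
    return sorted(f for f in os.listdir(source_dir) if f.endswith(extension))

def run_pipeline(source_dir, push_to_queue, start_instances):
    """Steps 2-3: queue every filename, then launch the worker fleet."""
    books = find_books(source_dir)
    push_to_queue(books)
    # No point paying for more instances than there are books to convert.
    start_instances(min(len(books), MAX_INSTANCES))
    return len(books)
```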
The entire workflow can be written straight into the user-data startup script. I ended up with this script, which self-updates the code, runs “forever” (until the queue is exhausted), and then turns itself off:
```bash
#!/bin/bash
cd /home/ubuntu/ebook_converter

# User data scripts run as root, so ensure that we "su" to the local user account
sudo -u ubuntu git pull origin master

# Invoke the Python virtual environment and update any Python dependencies
su ubuntu -c "source ve/bin/activate && python setup.py develop"

# Invoke the virtualenv and run the process
su ubuntu -c "source ve/bin/activate && python ebook_converter/process_from_queue.py"

# Once this finishes, we're out of jobs on the queue, so shut down the host
shutdown now
```
EC2 lets you remotely monitor any server, so if your job writes logging information, you can watch it in real time as each instance starts up and picks a file off the queue to download and convert.
Or better yet, take advantage of a program like multitail and watch the log output from all 20 instances scroll by simultaneously in a single terminal window.
The primary purpose of running multitail is to feel like a total rockstar and be able to email a screenshot to all your coworkers. Or better yet, send a photo of yourself drinking a beer in a hammock while 20 computers in the cloud slave away on your behalf at the click of a button.
In the next post I’ll walk through the Python code that powered the provisioning, distribution, and queuing of the jobs, as well as the economics and practicality of this approach.