Online Backup with AWS::S3

The possibility of maintaining online backups of your important files at little cost is one of the most obvious and compelling uses of S3. There are already a number of third-party tools available for backing up your files to S3 with support for file versioning and scheduled uploads. If you are looking for such a tool, check the Amazon Web Services (AWS) Solution Center to see what is available. However, because you sometimes need to create your own solution, we will work through an example that demonstrates how to create a very simple backup tool in Ruby using the AWS::S3 library.

Our objectives for this backup solution are very modest indeed. We will not store different file version snapshots, nor will we implement complex schemes to handle file renames efficiently or to split large files into smaller, more manageable chunks. Our backup process will comprise only the following steps:

  • Find all the files in a local directory to be backed up.

  • List the objects that are already present in S3.

  • Upload the local files that are not already present in S3, or whose contents have changed since the object was last uploaded to S3.

  • Delete objects stored in S3 when the corresponding local file has been deleted or renamed.

AWS::S3 Ruby Library

In this example we will use the excellent Ruby S3 library, AWS::S3, which may be found at http://amazon.rubyforge.org/. Our example is based on version 0.4.0 of this library.

AWS::S3 provides an object-oriented view of resources and operations in S3 that makes it much easier to work with than the procedural application programming interface (API) implementation we presented in Chapter 3. We will define a simple Ruby script in the file s3backup.rb that will use this library to interact with S3.
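For example, once the library is installed and a connection has been established (both covered below), reading an object's data takes a single call. The bucket and key names here are hypothetical:

# Hypothetical one-liner: read the contents of an object into a string.
data = S3Object.value('Document1.txt', 'my-backup-bucket')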

First you must install the AWS::S3 library. This library is available as a Ruby gem package or as a download from the project’s web site that you can install manually. We prefer to use the convenient gem package that you can install from the command line.

$ gem install aws-s3
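
If you want a quick sanity check that the library installed correctly, a one-liner such as the following should print the message without raising a LoadError:

# Confirm the gem can be loaded (a quick check; adjust if your gem setup differs)
$ ruby -e "require 'rubygems'; require 'aws/s3'; puts 'AWS::S3 loaded'"
AWS::S3 loaded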

S3Backup Class

Example 4-2 defines the beginning of a Ruby script that will back up your files. This script stub loads the libraries we will need, including the AWS::S3 library and the MD5 (Message-Digest algorithm 5) digest library. To keep everything nicely organized, we will define a Ruby class called S3Backup to contain our implementation methods. All the method definitions that follow in this section should be defined inside this class.

Example 4-2. S3Backup class stub: s3backup.rb

#!/usr/bin/env ruby

# Load the AWS::S3 library and include it to give us easy access to objects
require 'rubygems'
require 'aws/s3'
include AWS::S3

# Use the ruby MD5 digest tool for file/object comparisons
require 'digest/md5'

class S3Backup

  # Implementation methods will go here...
end

To establish a connection with S3, you must let the AWS::S3 library know what your AWS credentials are. Example 4-3 defines an initialize method for the S3Backup class that will include your credentials.

Example 4-3. Initialize an S3 connection: s3backup.rb

def initialize
  Base.establish_connection!(
    :access_key_id     => 'YOUR_AWS_ACCESS_KEY',
    :secret_access_key => 'YOUR_AWS_SECRET_KEY'
  )
end
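
Remember to replace the placeholder strings with your own AWS credentials. If you would rather not hard-code them in the script, one possible variation (our own, not part of the printed example) is to read them from environment variables; the variable names below are simply ones we have chosen:

# Variation: read credentials from environment variables instead of
# hard-coding them in the script.
def initialize
  Base.establish_connection!(
    :access_key_id     => ENV['AWS_ACCESS_KEY_ID'],
    :secret_access_key => ENV['AWS_SECRET_ACCESS_KEY']
  )
end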

List Backed-Up Objects

Before our program uploads files to S3, it needs to find out which files are already stored there so that only new or updated files will be uploaded. Example 4-4 defines a method that lists the contents of a bucket. As a convenience, this method will create a bucket if one does not already exist.

Example 4-4. List bucket contents: s3backup.rb

# Find a bucket and return the bucket's object listing.
# Create the bucket if it does not already exist.
def bucket_find(bucket_name)
  puts "Listing objects in bucket..."
  objects = Bucket.find(bucket_name)        
  
rescue NoSuchBucket
  puts "Creating bucket '#{bucket_name}'"
  if not Bucket.create(bucket_name)
    raise 'Unable to create bucket'
  end
  objects = Bucket.find(bucket_name)        
end
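
As a quick usage sketch (the bucket name is hypothetical, and the method will later be made private to the class), bucket_find returns an AWS::S3::Bucket object whose contents can be listed through its objects collection:

# Hypothetical usage: fetch (or create) the bucket and print its object keys.
bucket = bucket_find('my-backup-bucket')
bucket.objects.each { |obj| puts obj.key }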

Find Files to Back Up

Example 4-5 defines a method that recursively lists the files and subdirectories contained in a directory path and returns the object names the files will be given in S3. The backup script will be given a directory path by the user to indicate the root directory location of the files to back up. Any file inside this root path will be uploaded to S3, including files inside subdirectories. When we store the files in S3, each object will be given a key name corresponding to the file’s location relative to the root path.

Example 4-5. List local files: s3backup.rb

# Find all the files inside the root path, including subdirectories.
# Return an array of object names corresponding to the relative
# path of the files inside the root path.
#
# The sub_path parameter should only be used internally for recursive
# method calls.
def local_objects(root_path, sub_path = '')
  object_names = []
  
  # Include subdirectory paths if scanning a nested hierarchy.
  if sub_path.length > 0
    base_path = "#{root_path}/#{sub_path}"
  else
    base_path = root_path
  end
  
  # List files in the current scan directory
  Dir.entries("#{base_path}").each do |f|
    # Skip current and parent directory shortcuts
    next if f == '.' || f == '..'
    
    file_path = "#{base_path}/#{f}"
    object_name = (sub_path.length > 0 ? "#{sub_path}/#{f}" : f)
    
    if File.directory?(file_path)
      # Recursively find files in subdirectory
      local_objects(root_path, object_name).each do |n|
        object_names << n
      end
    else
      # Add the object key name for this file to our list
      object_names << object_name
    end        
  end
  return object_names
end
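
To illustrate, here is the kind of listing the method produces for a hypothetical directory tree (the ordering may vary, because Dir.entries does not guarantee any particular order):

# Hypothetical example. Given this layout under Documents/ImportantDirectory:
#   Document1.txt
#   Reports/summary.txt
#
# the method returns the relative paths that will be used as S3 object keys:
local_objects('Documents/ImportantDirectory')
# => ["Document1.txt", "Reports/summary.txt"]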

Back Up Files

We now have methods to list the objects in the target S3 bucket and to list the local files that will be backed up. The next step is to actually upload the new and changed files to S3. Example 4-6 defines a method to do this.

Example 4-6. Upload files: s3backup.rb

# Upload all objects that are not up-to-date in S3.
def upload_files(path, bucket, files, force=false, options={})
    files.each do |f|
        file = File.new("#{path}/#{f}", 'rb') # Open files in binary mode

        if force || bucket[f].nil?
            # Object is not present in S3, or upload has been forced
            puts "Storing object: #{f} (#{file.stat.size})"
            S3Object.store(f, open(file.path), bucket.name, options)
        else
            obj = bucket[f]

            # Ensure S3 object is latest version by comparing MD5 hash
            # after removing quote characters surrounding S3's ETag.
            remote_etag = obj.about['etag'][1..-2]
            local_etag = Digest::MD5.hexdigest(file.read)

            if remote_etag != local_etag
                puts "Updating object: #{f} (#{file.stat.size})"
                S3Object.store(f, open(file.path), bucket.name, options)
            else
                puts "Object is up-to-date: #{f}"
            end
        end
    end
end

This method loops through the local file listing and decides which files should be uploaded by checking first whether the file is already present in S3. If the file is present in the target bucket, it checks whether the local file has changed since the S3 version was created. If the file is not present, it is uploaded immediately.

If the file is already present in the bucket, we have to find out whether the local version is different from the version in S3. The method generates an MD5 hash of the local file’s contents to find out whether it differs from the object stored in S3. The S3 object’s MD5 hash value is made available as a hex-encoded value in the object’s ETag property. If the hash value of the local file and the object match, then they have identical content, and there is no need to upload the file. If the hashes do not match, then we assume the local file has been modified and that it should replace the version in S3.

It can take some time and processing power to generate the MD5 hash values for files, especially if they are large, so this hash-comparison approach slows things down. A faster alternative would be to compare the dates of the local file and the S3 object to see whether the local file is newer; but such comparisons are risky, because the object creation date reported by S3 may differ from your local system clock. Because we are more concerned with protecting our data than doing things quickly, we prefer to use hashes; it is the safest approach.
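
Note also that upload_files reads each file fully into memory to compute its hash. If your files are very large, a streaming calculation that digests the file in fixed-size chunks uses far less memory; here is a small sketch (the streamed_md5 helper name is our own invention, not part of the AWS::S3 library):

# Sketch: compute an MD5 hex digest without loading the whole file into memory.
def streamed_md5(file_path)
  digest = Digest::MD5.new
  File.open(file_path, 'rb') do |io|
    while chunk = io.read(1024 * 1024)   # digest the file 1 MB at a time
      digest.update(chunk)
    end
  end
  digest.hexdigest
end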

The upload_files method includes two optional parameters. The options parameter allows us to pass extra options to the S3Object.store method defined in the AWS::S3 library. Our script will use these options to specify an access control policy to apply to newly created objects. The method’s force parameter is a Boolean value that allows users to force files to be uploaded, even if they are already present in the bucket. This option could be handy if the user wanted to force a change to the Access Control List (ACL) policy settings of all the objects in a backup bucket.
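
For example, forcing a re-upload of every file while applying the public_read canned ACL would look like this (bucket and files come from bucket_find and local_objects, just as in the backup method defined below):

# Hypothetical call: re-upload every file and grant public read access.
upload_files('Documents/ImportantDirectory', bucket, files, true,
             :access => :public_read)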

Delete Obsolete Objects

In addition to storing files in S3, our backup script will be able to delete obsolete objects from S3 when the corresponding local file has been removed or renamed. This step will help to prevent our backup bucket from filling up with outdated files. Example 4-7 defines a method that loops through the objects present in the target bucket and checks whether the listing of local files includes a corresponding file. If there is no local file corresponding to the object, it is deleted. In a more advanced backup scenario, these outdated objects would be kept for some time, in case the local files had been deleted by mistake; but such a feature is beyond the scope of this book.

Example 4-7. Delete objects: s3backup.rb

# Delete all objects that are not present in the local file path
def delete_obsolete_objects(bucket, local_files)
  bucket.each do |obj|
    if local_files.index(obj.key).nil?
      # Obsolete object, delete it
      puts "Deleting orphan object: #{obj.key}"
      obj.delete
    end
  end
end

Putting It All Together

The final step to complete the S3Backup class is to add a method to tie together all the steps required to perform a backup. Example 4-8 defines a backup method that performs this task. The methods we defined above should only be used from within the class itself, so we will make these methods private by using Ruby’s private macro.

Example 4-8. Perform backup: s3backup.rb

# Perform a backup to S3
def backup(bucket_name, path, force=false, options={})
    # Ensure the provided path exists and is a directory
    if not File.directory?(path)
        raise "Not a directory: '#{path}'"
    end

    puts "Uploading directory path '#{path}' to bucket '#{bucket_name}'"

    # List contents of the target bucket
    bucket = bucket_find(bucket_name)

    # List local files
    files = local_objects(path)

    # Upload files and delete obsolete objects
    upload_files(path, bucket, files, force, options)
    delete_obsolete_objects(bucket, files)
end

private :bucket_find, :local_objects
private :upload_files, :delete_obsolete_objects

The S3Backup class is now functionally complete, but the class by itself cannot be run as a script. Example 4-9 defines a block of code that will automatically invoke the S3Backup class when the Ruby script file is run from the command line. Add this code to the end of the script file, outside the body of the S3Backup class.

Example 4-9. Run block: s3backup.rb

if __FILE__ == $0
    if ARGV.length < 2
        puts "Usage: #{$0} bucket path [force_flag acl_policy]"
        exit
    end

    bucket_name = ARGV[0]
    path = ARGV[1]
    force_flag = (ARGV[2] == 'true')  # Only the literal string 'true' forces a re-upload
    acl_policy = (ARGV[3].nil? ? 'private' : ARGV[3])

    s3backup = S3Backup.new
    s3backup.backup(bucket_name, path, force_flag, {:access=>acl_policy})
end

The script is now ready to run. You can try it out with some of the following commands. However, be careful not to back up your files to an S3 bucket that already contains objects you wish to keep.

# Print a help message by not specifying the required parameters
$ ruby s3backup.rb                                       
Usage: s3backup.rb bucket path [force_flag acl_policy]

# Back up the directory Documents/ImportantDirectory to the bucket my-bucket
$ ruby s3backup.rb my-bucket Documents/ImportantDirectory
Uploading directory path 'Documents/ImportantDirectory' to bucket 'my-bucket'
Listing objects in bucket...
Creating bucket 'my-bucket'
Storing object: Document1.txt (17091)
Storing object: Document2.txt (8517)
. . .

# Follow-up backups of the directory Documents/ImportantDirectory will run
# faster as only new or changed files will be uploaded
$ ruby s3backup.rb my-bucket Documents/ImportantDirectory
. . .
Object is up-to-date: Document1.txt
Object is up-to-date: Document2.txt
. . .

# Force the script to upload all the local files again, this time with the
# 'public-read' canned access control policy.
$ ruby s3backup.rb my-bucket Documents/ImportantDirectory true public-read
. . .
Storing object: Document1.txt (17091)
Storing object: Document2.txt (8517)
. . .

If you are serious about backing up your files to S3, you will likely need many backup features that are missing from this example; plus, we have not included a script to restore your files from S3 if a disaster strikes. We will leave these additional features as an exercise for the reader.
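
As a starting point for such an exercise, here is a minimal restore sketch (our own addition, making the same assumption as the backup script that a connection has already been established). It simply downloads every object in the bucket back into a local directory, recreating subdirectories from the object keys:

require 'fileutils'

# Minimal restore sketch: download every object in the bucket into 'path'.
# No overwrite checks or error handling; S3Object.stream writes each object
# to disk in chunks rather than holding it in memory.
def restore(bucket_name, path)
  Bucket.find(bucket_name).objects.each do |obj|
    file_path = "#{path}/#{obj.key}"
    FileUtils.mkdir_p(File.dirname(file_path))
    File.open(file_path, 'wb') do |file|
      S3Object.stream(obj.key, bucket_name) { |chunk| file.write(chunk) }
    end
  end
end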

Content-Length Workaround

You may experience problems using version 0.4.0 of the AWS::S3 library with some web proxies, because the method that creates a bucket does not explicitly set the Content-Length header prior to performing the PUT request. Some web proxies refuse to pass on PUT messages that do not include this header, even though the S3 service itself accepts such requests.

If you receive inexplicable Unable to create bucket error messages when you use the s3backup.rb script, try adding the workaround code in Example 4-10 to your script outside the S3Backup class.

Example 4-10. Content-Length fix: s3backup.rb

# Modification to AWS::S3 library to ensure bucket creation PUT requests 
# include a Content-Length header
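# Note: because the script includes AWS::S3 at the top level, 'class Bucket'
# below reopens AWS::S3::Bucket rather than defining a new class.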
class Bucket
    class << self
        def create(name, options = {})
            options['Content-Length'] = 0   # Explicitly set header
            validate_name!(name)
            put("/#{name}", options).success?
        end
    end
end
