A guest post by Salman Ul Haq, a techpreneur, co-founder and CEO of TunaCode, Inc.

Frequent, regular data backups are important in most environments, and though backups are a standard procedure, there are multiple ways they can be achieved. When it comes to Big Data, the stakes are higher and the cost of failure is greater. HBase is an open source distributed Big Data store that scales to billions of rows and millions of columns and can be quickly deployed over commodity servers, which makes it an ideal candidate for your Big Data storage needs. In this post, we will examine the different ways you can back up data on a live HBase cluster.

So why back up on a live HBase cluster? Is this practice safe? Will we be able to maintain data consistency? Let’s examine these questions. First off, a live cluster backup incurs zero downtime. It is of course possible to first trigger a complete cluster shutdown and then back up HBase tables and data. That process is regarded as the safer choice, since there is no chance of the backed-up data showing any inconsistencies. The problem is that you can only opt for a full-shutdown backup when you can tolerate system downtime (read the Full shutdown backup using distcp recipe in HBase Administration Cookbook to learn about full-shutdown backups). This post will focus solely on live cluster backups.

There are several ways you can back up data from a live HBase cluster:

  • Using the CopyTable utility to duplicate data from source to backup table
  • Exporting HBase tables as HDFS files and then importing them back into HBase
  • Replicating the whole HBase cluster

Replicating the complete HBase cluster is the most straightforward approach, but may not be the most resource-efficient way to back up data. It is, however, ideal for disaster recovery scenarios. With CopyTable, you can replicate complete tables within the same cluster or to another cluster, for example, your “backup cluster.” Exporting HBase tables as HDFS files and then importing them back is done with the Export and Import utilities: you can, for instance, dump an HBase table into HDFS on the same cluster and later restore that dump into an HBase table using the Import utility.
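As a quick illustration of the Export/Import route, both utilities are driven from the command line. The sketch below uses example values for the table name and HDFS path:

    # Dump the table into a directory on HDFS
    $ hbase org.apache.hadoop.hbase.mapreduce.Export mytable /backups/mytable

    # Later, load the dump back into an existing table with matching column families
    $ hbase org.apache.hadoop.hbase.mapreduce.Import mytable /backups/mytable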

Before we delve into code examples on how to back up and restore HBase tables, let’s first see how HBase addresses the concern of ending up with inconsistent data in a live cluster backup. When data is being written constantly, you may miss some (or even a large number) of the rows while the CopyTable utility is busy copying the table within the same or to a different cluster.

Cloudera, which offers tools and solutions around Apache Hadoop and related projects like HBase, has released CDH 4.2, its open source Apache Hadoop distribution. This release includes an interesting feature called “Snapshots.” As the name suggests, it takes a snapshot of a table without the need to shut down the cluster, so there is no downtime and almost no chance of any inconsistencies. Snapshots are not a replacement for CopyTable, though, because unlike CopyTable, a snapshot does not clone the data of the table. A snapshot only copies the metadata associated with a table, which is enough to restore the table to the point where the snapshot was taken. This may not be ideal for every scenario, but it certainly has its value when it comes to restoring an HBase table to a previously known working state, taking daily snapshots for data validation/verification purposes, or even meeting compliance requirements.
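On releases that ship this feature, snapshots are driven from the HBase shell. The following is a minimal sketch with example table and snapshot names; note that restoring a snapshot requires the table to be disabled first:

    hbase> snapshot 'mytable', 'mytable_snap'
    hbase> list_snapshots
    hbase> disable 'mytable'
    hbase> restore_snapshot 'mytable_snap'
    hbase> enable 'mytable'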

Let’s see how we can back up data on a live HBase cluster. As mentioned, CopyTable lets you make a copy of an HBase table on the same cluster or on a separate cluster, which may, for example, be a dedicated backup cluster. CopyTable also allows you to make incremental backups, which comes in handy for regular backups: you can instruct it to copy only the data that was added within a given timeframe (start and end timestamps):

    1. Fire up the HBase cluster
      • To get started backing up data from a running HBase source cluster into a backup cluster, first ensure that the HBase configuration file (hbase-site.xml) is configured on the client cluster and that the HBase dependency JARs are added to the Hadoop classpath, as in the sketch below.
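      • One common way to put the HBase JARs on the Hadoop classpath is to let the hbase script compute them. This is just a sketch, assuming the hbase launcher script is on your PATH:

            $ export HADOOP_CLASSPATH=$(hbase classpath)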
    2. Create an HBase backup table
      • Let’s suppose we’re copying a table named “employee_records.” We will create a backup table “employee_records_backup”:
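      • A minimal sketch in the HBase shell. We assume the backup table only needs the column family “n” that we are going to copy in the next step; adjust the families to match your source table:

            hbase> create 'employee_records_backup', 'n'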

    3. Execute the backup
      • Now it’s time to create the backup. We will only back up the column family “n”:
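      • The following CopyTable invocation is a sketch of this step; “employee_records” and “employee_records_backup” are the example table names from above (to copy into a table on a separate backup cluster instead, CopyTable also accepts a --peer.adr option pointing at the destination cluster’s ZooKeeper ensemble):

            $ hbase org.apache.hadoop.hbase.mapreduce.CopyTable \
                  --families=n \
                  --new.name=employee_records_backup \
                  employee_records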

      • This command launches a MapReduce job that scans the specified column families of the source table and simply writes the rows into the backup table using the client API. It copies the complete table. Finally, let’s see how you can back up only the data written within a specified timeframe:
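      • Again just a sketch; the start and end timestamps below are example epoch values in milliseconds:

            $ hbase org.apache.hadoop.hbase.mapreduce.CopyTable \
                  --starttime=1364774400000 \
                  --endtime=1364860800000 \
                  --families=n \
                  --new.name=employee_records_backup \
                  employee_records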

      • This will copy the data that was written into the source table within the timeframe specified by the start and end timestamps.

This post has provided details on backing up data on a live HBase cluster. For more information on using HBase, see the list of books below.

Safari Books Online has the content you need

Check out these HBase books available from Safari Books Online:

HBase: The Definitive Guide shows you how Apache HBase can fulfill your needs. As the open source implementation of Google’s BigTable architecture, HBase scales to billions of rows and millions of columns, while ensuring that write and read performance remain constant. This book provides the details you require to evaluate this high-performance, non-relational database, or put it into practice right away.
HBase Administration Cookbook provides practical examples and simple step-by-step instructions for you to administrate HBase with ease. The recipes cover a wide range of processes for managing a fully distributed, highly available HBase cluster on the cloud. Working with such a huge amount of data means that an organized and manageable process is key and this book will help you to achieve that.
HBase in Action has all of the knowledge you need to design, build, and run applications using HBase. First, it introduces you to the fundamentals of distributed systems and large scale data handling. Then, you’ll explore real-world applications and code samples with just enough theory to understand the practical techniques. You’ll see how to build applications with HBase and take advantage of the MapReduce processing framework. And along the way you’ll learn patterns and best practices.

About the author

Salman Ul Haq is a techpreneur, co-founder and CEO of TunaCode, Inc., a startup that delivers GPU-accelerated computing solutions to time-critical application domains. He holds a degree in Computer Systems Engineering. His current focus is on delivering the right solution for cloud security. He can be reached at

Tags: Apache Hadoop, Apache HBase, backup, Big Data, columns, data, live HBase cluster, table,
