
HBase lets you store big data on commodity machines. It can scale to billions of rows and millions of columns, putting real-time Big Data processing within reach of even individual developers, something that is extremely difficult, if not impossible, with a Relational Database Management System (RDBMS). For an introduction to HBase, be sure to read HBase – An OpenSource BigTable Database. In this post, I will talk about data migration, showing you how to migrate your data from MySQL into an HBase data store.

There are three primary ways to migrate data into HBase:

  • Put API
  • Bulk load tool
  • MapReduce Job

The Put API is probably the most straightforward way to import data quickly into an HBase table. It is not recommended, however, for importing huge amounts of data, because every row passes through the normal client write path. You can think of it as the “Hello World” of HBase data migration.
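As a minimal sketch (assuming a table named mytable with a column family cf already exists; the table, row, and column names here are hypothetical), a single cell can be written from the HBase shell, whose put command is a thin wrapper around the Java client's Put API:

    # write one cell: table, row key, column (family:qualifier), value
    echo "put 'mytable', 'row1', 'cf:col1', 'some value'" | $HBASE_HOME/bin/hbase shell

In application code you would do the same thing with org.apache.hadoop.hbase.client.Put, batching several puts per call to cut down on round trips.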

The bulk load tool, on the other hand, runs a MapReduce job behind the scenes to populate an HBase table. Instead of pushing every row through the client write path, the job generates files in HBase's internal storage format (HFile), which can then be handed directly to a running HBase cluster.
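A hedged sketch of that two-phase flow, using the importTSV tool (covered in detail below) as the driver; the table name mytable, the 't' column family, the HDFS paths, and the HBase jar version are all placeholders to adapt to your setup:

    # make the HBase classes and hbase-site.xml visible to Hadoop
    export HADOOP_CLASSPATH=$($HBASE_HOME/bin/hbase classpath)

    # Phase 1: run the MapReduce job, writing HFiles to an HDFS directory
    # instead of putting each row through the client write path
    hadoop jar $HBASE_HOME/hbase-0.94.6.jar importtsv \
      -Dimporttsv.columns=HBASE_ROW_KEY,t:value01 \
      -Dimporttsv.bulk.output=/tmp/hfiles \
      mytable /user/hbase-import/mytable.tsv

    # Phase 2: hand the generated HFiles directly to the running cluster
    hadoop jar $HBASE_HOME/hbase-0.94.6.jar completebulkload /tmp/hfiles mytable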

To load a huge data file quickly and reliably, you can use the importTSV tool. Data migration is a write-intensive task, so if you go the custom MapReduce job route, you will have to handle the tuning yourself to make sure the cluster does not get clogged with writes. Let’s see how you can load data from a MySQL table into an HBase table with importTSV.

First, you will have to export your table data from an RDBMS, like MySQL, into TSV format (remember, we’re using the importTSV tool). Make sure that there is a field representing the row key of each HBase table row. There are scripts that can do this for you.
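For example, the mysql command-line client can produce tab-separated output directly. A minimal sketch, assuming a database mydb and a table mytable whose primary key id will become the HBase row key (these names and the output path are hypothetical):

    # --batch prints tab-separated rows; --skip-column-names drops the header row
    mysql -u dbuser -p --batch --skip-column-names \
      -e "SELECT id, value01, value02 FROM mydb.mytable" > /tmp/mytable.tsv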

Now, since importTSV runs a MapReduce job behind the scenes to import the data, you will have to spin up MapReduce on your cluster. The MapReduce daemons can be started by executing a command like the following on the master node.
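A hedged example, assuming a Hadoop 1.x cluster with HDFS already running and the Hadoop scripts under $HADOOP_HOME (on a YARN-based cluster you would use start-yarn.sh instead):

    # start the JobTracker here and the TaskTrackers on the slave nodes
    $HADOOP_HOME/bin/start-mapred.sh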

Follow along with these four steps to load data from a TSV file into an HBase table (example commands for each step are sketched after the list):

    1. Since importTSV only reads data from HDFS, you have to copy the TSV file from the local file system to an HDFS directory.
    2. Create a target table in HBase and add a column family to it. If the table already exists, you can alter it to add the column family; the existing data will not be changed. I’ve added a ‘t’ column family where the data will be populated.
    3. Add hbase-site.xml and the HBase dependency JARs to the Hadoop classpath.
    4. Finally, run the importTSV tool against the target table and the TSV file in HDFS.
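Here is a hedged sketch of the commands for the four steps above. It assumes the target table is named mytable, the TSV file is the /tmp/mytable.tsv produced earlier (row key first, then value01 and value02), and HBase lives under $HBASE_HOME; the table name, paths, and the HBase jar version are placeholders to adapt to your cluster.

    # Step 1: copy the TSV file from the local file system into HDFS
    hadoop fs -mkdir /user/hbase-import
    hadoop fs -copyFromLocal /tmp/mytable.tsv /user/hbase-import/

    # Step 2: create the target table with a 't' column family from the HBase shell
    echo "create 'mytable', 't'" | $HBASE_HOME/bin/hbase shell

    # Step 3: put hbase-site.xml and the HBase dependency JARs on the Hadoop classpath
    export HADOOP_CLASSPATH=$($HBASE_HOME/bin/hbase classpath)

    # Step 4: run importTSV, mapping the TSV columns to the row key and the 't' family
    hadoop jar $HBASE_HOME/hbase-0.94.6.jar importtsv \
      -Dimporttsv.columns=HBASE_ROW_KEY,t:value01,t:value02 \
      mytable /user/hbase-import/mytable.tsv

Setting HADOOP_CLASSPATH from `hbase classpath` pulls in the HBase conf directory as well, so the job picks up hbase-site.xml and knows how to reach your cluster.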

What you’re doing here is running importTSV with the following parameters:

      • importtsv.columns, which specifies how the TSV columns map into HBase. The special HBASE_ROW_KEY token marks the column that holds the row key.
      • The mapping between the HBase table column family and the data columns in your TSV file, given by the remaining entries of importtsv.columns. In this case, the data was arranged in the TSV file as row_key, value01, value02 and so on.
      • The TSV file path in HDFS.

You can check the status of your MapReduce data migration job on the MapReduce admin web page (on a Hadoop 1.x cluster, the JobTracker UI).

This is one of several ways you can migrate MySQL database tables into HBase tables. You can find a lot more hands-on examples in the books referenced below.

Safari Books Online has the content you need

Check out these HBase books available from Safari Books Online:

HBase: The Definitive Guide shows you how Apache HBase can fulfill your needs. As the open source implementation of Google’s BigTable architecture, HBase scales to billions of rows and millions of columns, while ensuring that write and read performance remain constant. This book provides the details you require to evaluate this high-performance, non-relational database, or put it into practice right away.
HBase Administration Cookbook provides practical examples and simple step-by-step instructions for you to administrate HBase with ease. The recipes cover a wide range of processes for managing a fully distributed, highly available HBase cluster on the cloud. Working with such a huge amount of data means that an organized and manageable process is key and this book will help you to achieve that.
HBase in Action has all of the knowledge you need to design, build, and run applications using HBase. First, it introduces you to the fundamentals of distributed systems and large scale data handling. Then, you’ll explore real-world applications and code samples with just enough theory to understand the practical techniques. You’ll see how to build applications with HBase and take advantage of the MapReduce processing framework. And along the way you’ll learn patterns and best practices.

About the author

Salman Ul Haq is a techpreneur, co-founder and CEO of TunaCode, Inc., a startup that delivers GPU-accelerated computing solutions to time-critical application domains. He holds a degree in Computer Systems Engineering. His current focus is on delivering the right solution for cloud security. He can be reached at salman@tunacode.com.

Tags: HBase, MapReduce, Migrating Data, MySQL, RDBMS, TSV Format
