Although Apache Hadoop was created to process huge volumes of unstructured data, especially from the Internet, you will often find it sitting alongside a relational database, simply because relational databases are so prevalent. Companies, individuals, and teams considering a migration to Hadoop also need to port their existing data into it so that MapReduce jobs can use it. Although you can configure and run such a migration by hand, there are tools available to do it for you. One such tool is Sqoop (http://sqoop.apache.org), originally released by Cloudera and now an Apache project. As the name suggests, it scoops data from a relational source into HDFS, and vice versa.
Before we see how Sqoop imports data into Hadoop, let’s first look at how we could do it without a third-party tool. The obvious approach is to write a MapReduce job that pulls data from a structured source, such as a MySQL database accessed directly over JDBC, and writes it into HDFS. The primary issue with this approach is database connection handling: every mapper tries to hold its own connection while pulling data, so you must configure and manage connection pooling yourself.
Now let’s see how we do this using Sqoop. The first step is to download, install, and configure Sqoop; make sure you have Sqoop 1.4+ on your machine. Configuring Sqoop is pretty straightforward, but read the Time for action – downloading and configuring Sqoop section in Hadoop Beginner’s Guide for more details. After installing Sqoop, make sure you have the JDBC driver for your relational database (MySQL, for example) and copy it into Sqoop’s lib directory. Everything you need is now set up, and we can dump data from a MySQL table into structured files on HDFS.
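As a concrete starting point, here is one way to create the sample table used below. The schema, column names, and row values are illustrative (they are not from the original walkthrough); the script only writes the SQL to a file, which you would then feed to the mysql client:

```shell
# Write out an illustrative schema plus five sample rows.
# Load it with: mysql -u hadoopuser -p hadooptest < employees.sql
cat > employees.sql <<'SQL'
CREATE TABLE employees (
  id   INT PRIMARY KEY,
  name VARCHAR(64),
  dept VARCHAR(32)
);
INSERT INTO employees VALUES
  (1, 'Alice', 'eng'), (2, 'Bob', 'ops'), (3, 'Carol', 'eng'),
  (4, 'Dave', 'sales'), (5, 'Eve', 'hr');
SQL
```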
Create an “Employees” table in MySQL and populate it with some data; we will import this table into HDFS. Now run Sqoop to import the data:
$ sqoop import --connect jdbc:mysql://localhost/hadooptest \
    --username hadoopuser --password password --table employees
Let’s examine what just happened. With a single command, Sqoop pulled the data from the MySQL “Employees” table into Hadoop. The first argument specifies the Sqoop tool to run (import in our case); next come the JDBC URI for our MySQL database, the credentials, and the table name. Sqoop does the rest: it places the data pulled from the table into multiple files (one per mapper it spawned) under your HDFS home directory. To verify, list the output directory:
$ hadoop fs -ls employees
Found 6 items
-rw-r--r--   3 hadoop supergroup   0 2013-04-24 04:10 /user/hadoop/employees/_SUCCESS
drwxr-xr-x   - hadoop supergroup   0 2013-04-24 04:10 /user/hadoop/employees/_logs
-rw-r--r--   3 … /user/hadoop/employees/part-m-00000
-rw-r--r--   3 … /user/hadoop/employees/part-m-00001
-rw-r--r--   3 … /user/hadoop/employees/part-m-00002
-rw-r--r--   3 … /user/hadoop/employees/part-m-00003
We only had five records in the “Employees” table, but Sqoop still split the output into four files. This is because Sqoop uses four mappers by default to pull data from the MySQL source; if the source holds more data, Sqoop spreads the work across those mappers accordingly. This parallelism is something you would have to build yourself if you pulled the data manually.
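The default of four mappers can be overridden with Sqoop’s -m (--num-mappers) option; with -m 1, the import writes a single part file. The sketch below only prints the command rather than executing it, since it needs a live Sqoop installation and MySQL instance:

```shell
# Build the import command with parallelism reduced to one mapper.
# Printed here, not executed; run it where Sqoop and MySQL are available.
CMD='sqoop import --connect jdbc:mysql://localhost/hadooptest \
  --username hadoopuser --password password \
  --table employees -m 1'
echo "$CMD"
```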
In addition to importing whole tables, Sqoop’s import tool accepts a custom SQL query, letting you pick and choose which data to import. Sqoop can also load data directly into other Hadoop data stores, such as Apache Hive.
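Both variations can be sketched as follows; the query, split column, and target directory are hypothetical, and the commands are printed rather than executed since they need live Sqoop, MySQL, and Hive installations. Note that a free-form --query import must keep the literal $CONDITIONS token in its WHERE clause (Sqoop replaces it with split predicates at runtime) and must name a --target-dir:

```shell
# Store the two example invocations; the single-quoted heredoc
# delimiter keeps $CONDITIONS literal.
CMDS=$(cat <<'EOF'
sqoop import --connect jdbc:mysql://localhost/hadooptest \
  --username hadoopuser --password password \
  --query 'SELECT id, name FROM employees WHERE dept = "eng" AND $CONDITIONS' \
  --split-by id --target-dir /user/hadoop/eng_employees

sqoop import --connect jdbc:mysql://localhost/hadooptest \
  --username hadoopuser --password password \
  --table employees --hive-import
EOF
)
echo "$CMDS"
```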
You can read about more real-world examples of Hadoop in the following eBooks.
Safari Books Online has the content you need
Hadoop MapReduce Cookbook deals with many exciting topics, such as setting up Hadoop security and using MapReduce to solve analytics, classification, online marketing, recommendation, and search use cases. You will learn how to harness components from the Hadoop ecosystem, including HBase, Hadoop, Pig, and Mahout, and then how to set up cloud environments to perform Hadoop MapReduce computations.
Hadoop Beginner’s Guide removes the mystery from Hadoop, presenting Hadoop and related technologies with a focus on building working systems and getting the job done, using cloud services to do so when it makes sense. From basic concepts and initial setup through developing applications and keeping the system running as the data grows, the book gives the understanding needed to effectively use Hadoop to solve real-world problems.
Hadoop Real-World Solutions Cookbook covers loading and unloading data to and from HDFS, graph analytics with Giraph, batch data analysis using Hive, Pig, and MapReduce, machine learning approaches with Mahout, debugging and troubleshooting MapReduce, and columnar storage and retrieval of structured data using Apache Accumulo.
Hadoop in Practice collects 85 Hadoop examples and presents them in a problem/solution format. Each technique addresses a specific task you’ll face, like querying big data using Pig or writing a log file loader. You’ll explore each problem step by step, learning both how to build and deploy that specific solution along with the thinking that went into its design. As you work through the tasks, you’ll find yourself growing more comfortable with Hadoop and at home in the world of big data.
About the author
Salman Ul Haq is a techpreneur and the co-founder and CEO of TunaCode, Inc., a startup that delivers GPU-accelerated computing solutions to time-critical application domains. He holds a degree in Computer Systems Engineering. His current focus is on delivering the right solution for cloud security. He can be reached at firstname.lastname@example.org.