Chapter 6. Hadoop Ecosystem Integration

The previous chapters described various use cases in which Sqoop enables highly efficient data transfers between Hadoop and relational databases. This chapter focuses on integrating Sqoop with the rest of the Hadoop ecosystem: we will show you how to run Sqoop from within Oozie, the Hadoop workflow scheduler, and how to load your data into Hadoop’s data warehouse system, Apache Hive, and Hadoop’s database, Apache HBase.

Scheduling Sqoop Jobs with Oozie

Problem

You are using Oozie in your environment to schedule Hadoop jobs and would like to call Sqoop from within your existing workflows.

Solution

Oozie includes special Sqoop actions that you can use to call Sqoop in your workflow. For example:

<workflow-app name="sqoop-workflow" xmlns="uri:oozie:workflow:0.1">
    ...
    <action name="sqoop-action">
        <sqoop xmlns="uri:oozie:sqoop-action:0.2">
            <job-tracker>foo:8021</job-tracker>
            <name-node>bar:8020</name-node>
            <command>import --table cities --connect ...</command>
        </sqoop>
        <ok to="next"/>
        <error to="error"/>
    </action>
    ...
</workflow-app>
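
To try the action, upload the workflow definition to HDFS and submit it with the standard Oozie client. The following is a minimal sketch, assuming the workflow is deployed under /user/sqoop/sqoop-workflow on HDFS and the Oozie server runs on a host named oozie-server (both placeholders you would replace with your own values):

$ hadoop fs -put workflow.xml /user/sqoop/sqoop-workflow/workflow.xml
$ cat job.properties
oozie.wf.application.path=hdfs://bar:8020/user/sqoop/sqoop-workflow
$ oozie job -oozie http://oozie-server:11000/oozie -config job.properties -run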

Discussion

Starting with version 3.2.0, Oozie has built-in support for Sqoop: you can use the special sqoop action type in the same way you would use a MapReduce action. You have two options for specifying Sqoop parameters. The first option is to list the entire Sqoop command line in a single <command> tag, for example:

<command>import --table cities --username sqoop --password sqoop ...</command>

In this case, Oozie will split the content of the <command> tag on spaces and pass the resulting pieces to Sqoop as individual parameters. The second option is to break the command line into multiple <arg> tags, one per parameter, as shown in the sketch below.
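
Splitting on whitespace breaks down when a single parameter value itself contains spaces, for example a free-form query passed with --query. For such cases the sqoop action schema also accepts one <arg> element per parameter, which preserves embedded spaces. Here is a minimal sketch of the same import expressed with <arg> tags; the JDBC connect string is a placeholder for your own database:

<sqoop xmlns="uri:oozie:sqoop-action:0.2">
    <job-tracker>foo:8021</job-tracker>
    <name-node>bar:8020</name-node>
    <arg>import</arg>
    <arg>--connect</arg>
    <arg>jdbc:mysql://mysql.example.com/sqoop</arg>
    <arg>--table</arg>
    <arg>cities</arg>
</sqoop>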
