You are previewing Apache Sqoop Cookbook.

Apache Sqoop Cookbook

Cover of Apache Sqoop Cookbook by Jarek Jarcec Cecho... Published by O'Reilly Media, Inc.
  1. Apache Sqoop Cookbook
  2. Foreword
  3. Preface
    1. Sqoop 2
    2. Conventions Used in This Book
    3. Using Code Examples
    4. Safari® Books Online
    5. How to Contact Us
    6. Acknowledgments
      1. Jarcec Thanks
      2. Kathleen Thanks
  4. 1. Getting Started
    1. Downloading and Installing Sqoop
      1. Problem
      2. Solution
      3. Discussion
    2. Installing JDBC Drivers
      1. Problem
      2. Solution
      3. Discussion
    3. Installing Specialized Connectors
      1. Problem
      2. Solution
      3. Discussion
    4. Starting Sqoop
      1. Problem
      2. Solution
      3. Discussion
    5. Getting Help with Sqoop
      1. Problem
      2. Solution
      3. Discussion
  5. 2. Importing Data
    1. Transferring an Entire Table
      1. Problem
      2. Solution
      3. Discussion
    2. Specifying a Target Directory
      1. Problem
      2. Solution
      3. Discussion
    3. Importing Only a Subset of Data
      1. Problem
      2. Solution
      3. Discussion
    4. Protecting Your Password
      1. Problem
      2. Solution
      3. Discussion
    5. Using a File Format Other Than CSV
      1. Problem
      2. Solution
      3. Discussion
    6. Compressing Imported Data
      1. Problem
      2. Solution
      3. Discussion
    7. Speeding Up Transfers
      1. Problem
      2. Solution
      3. Discussion
      4. See Also
    8. Overriding Type Mapping
      1. Problem
      2. Solution
      3. Discussion
    9. Controlling Parallelism
      1. Problem
      2. Solution
      3. Discussion
    10. Encoding NULL Values
      1. Problem
      2. Solution
      3. Discussion
      4. See Also
    11. Importing All Your Tables
      1. Problem
      2. Solution
      3. Discussion
  6. 3. Incremental Import
    1. Importing Only New Data
      1. Problem
      2. Solution
      3. Discussion
    2. Incrementally Importing Mutable Data
      1. Problem
      2. Solution
      3. Discussion
    3. Preserving the Last Imported Value
      1. Problem
      2. Solution
      3. Discussion
    4. Storing Passwords in the Metastore
      1. Problem
      2. Solution
      3. Discussion
    5. Overriding the Arguments to a Saved Job
      1. Problem
      2. Solution
      3. Discussion
    6. Sharing the Metastore Between Sqoop Clients
      1. Problem
      2. Solution
      3. Discussion
  7. 4. Free-Form Query Import
    1. Importing Data from Two Tables
      1. Problem
      2. Solution
      3. Discussion
    2. Using Custom Boundary Queries
      1. Problem
      2. Solution
      3. Discussion
    3. Renaming Sqoop Job Instances
      1. Problem
      2. Solution
      3. Discussion
    4. Importing Queries with Duplicated Columns
      1. Problem
      2. Solution
      3. Discussion
  8. 5. Export
    1. Transferring Data from Hadoop
      1. Problem
      2. Solution
      3. Discussion
    2. Inserting Data in Batches
      1. Problem
      2. Solution
      3. Discussion
    3. Exporting with All-or-Nothing Semantics
      1. Problem
      2. Solution
      3. Discussion
    4. Updating an Existing Data Set
      1. Problem
      2. Solution
      3. Discussion
    5. Updating or Inserting at the Same Time
      1. Problem
      2. Solution
      3. Discussion
      4. See Also
    6. Using Stored Procedures
      1. Problem
      2. Solution
      3. Discussion
    7. Exporting into a Subset of Columns
      1. Problem
      2. Solution
      3. Discussion
    8. Encoding the NULL Value Differently
      1. Problem
      2. Solution
      3. Discussion
      4. See Also
    9. Exporting Corrupted Data
      1. Problem
      2. Solution
      3. Discussion
  9. 6. Hadoop Ecosystem Integration
    1. Scheduling Sqoop Jobs with Oozie
      1. Problem
      2. Solution
      3. Discussion
    2. Specifying Commands in Oozie
      1. Problem
      2. Solution
      3. Discussion
    3. Using Property Parameters in Oozie
      1. Problem
      2. Solution
      3. Discussion
    4. Installing JDBC Drivers in Oozie
      1. Problem
      2. Solution
      3. Discussion
      4. See Also
    5. Importing Data Directly into Hive
      1. Problem
      2. Solution
      3. Discussion
      4. See Also
    6. Using Partitioned Hive Tables
      1. Problem
      2. Solution
      3. Discussion
    7. Replacing Special Delimiters During Hive Import
      1. Problem
      2. Solution
      3. Discussion
    8. Using the Correct NULL String in Hive
      1. Problem
      2. Solution
      3. Discussion
      4. See Also
    9. Importing Data into HBase
      1. Problem
      2. Solution
      3. Discussion
    10. Importing All Rows into HBase
      1. Problem
      2. Solution
      3. Discussion
    11. Improving Performance When Importing into HBase
      1. Problem
      2. Solution
      3. Discussion
  10. 7. Specialized Connectors
    1. Overriding Imported boolean Values in PostgreSQL Direct Import
      1. Problem
      2. Solution
      3. Discussion
      4. See Also
    2. Importing a Table Stored in Custom Schema in PostgreSQL
      1. Problem
      2. Solution
      3. Discussion
    3. Exporting into PostgreSQL Using pg_bulkload
      1. Problem
      2. Solution
      3. Discussion
      4. See Also
    4. Connecting to MySQL
      1. Problem
      2. Solution
      3. Discussion
    5. Using Direct MySQL Import into Hive
      1. Problem
      2. Solution
      3. Discussion
      4. See Also
    6. Using the upsert Feature When Exporting into MySQL
      1. Problem
      2. Solution
      3. Discussion
      4. See Also
    7. Importing from Oracle
      1. Problem
      2. Solution
      3. Discussion
    8. Using Synonyms in Oracle
      1. Problem
      2. Solution
      3. Discussion
    9. Faster Transfers with Oracle
      1. Problem
      2. Solution
      3. Discussion
      4. See Also
    10. Importing into Avro with OraOop
      1. Problem
      2. Solution
      3. Discussion
    11. Choosing the Proper Connector for Oracle
      1. Problem
      2. Solution
      3. Discussion
    12. Exporting into Teradata
      1. Problem
      2. Solution
      3. Discussion
      4. See Also
    13. Using the Cloudera Teradata Connector
      1. Problem
      2. Solution
      3. Discussion
      4. See Also
    14. Using Long Column Names in Teradata
      1. Problem
      2. Solution
      3. Discussion
  11. About the Authors
  12. Colophon
  13. Copyright
O'Reilly logo

Chapter 1. Getting Started

This chapter will guide you through the basic prerequisites of using Sqoop. You will learn how to download and install Sqoop on your computer or on any node of your Hadoop cluster. Sqoop comes with a very detailed User Guide describing all the available parameters and basic usage. Rather than repeating the guide, this book focuses on applying the parameters to real use cases and helping you to deploy and use Sqoop effectively in your environment.

Downloading and Installing Sqoop

Problem

You want to install Sqoop on your computer or on any node in your Hadoop cluster.

Solution

Sqoop supports the Linux operating system, and there are several installation options. One option is the source tarball that is provided with every release. This tarball contains only the source code of the project. You can’t use it directly and will need to first compile the sources into binary executables. For your convenience, the Sqoop community provides a binary tarball for each major supported version of Hadoop along with the source tarball.

In addition to the tarballs, there are open source projects and commercial companies that provide operating system-specific packages. One such project, Apache Bigtop, provides rpm packages for Red Hat, CentOS, SUSE, and deb packages for Ubuntu and Debian. The biggest benefit of using packages over tarballs is their seamless integration with the operating system: for example, Configuration files are stored in /etc/ and logs in /var/log.

Discussion

This book focuses on using Sqoop rather than developing for it. If you prefer to compile the source code from source tarball into binary directly, the Developer’s Guide is a good resource.

You can download the binary tarballs from the Apache Sqoop website. All binary tarballs contain a .bin__hadoop string embedded in their name, followed by the Apache Hadoop major version that was used to generate them. For Hadoop 1.x, the archive name will include the string .bin__hadoop-1.0.0. While the naming convention suggests this tarball only works with version 1.0.0, in fact, it’s fully compatible not only with the entire 1.0.x release branch but also with version 1.1.0. It’s very important to download the binary tarball created for your Hadoop version. Hadoop has changed internal interfaces between some of the major versions; therefore, using a Sqoop tarball that was compiled against Hadoop version 1.x with, say, Hadoop version 2.x, will not work.

To install Sqoop, download the binary tarball to any machine from which you want to run Sqoop and unzip the archive. You can directly use Sqoop from within the extracted directory without any additional steps. As Sqoop is not a cluster service, you do not need to install it on all the nodes in your cluster. Having the installation available on one single machine is sufficient. As a Hadoop application, Sqoop requires that the Hadoop libraries and configurations be available on the machine. Hadoop installation instructions can be found in the Hadoop project documentation. If you want to import your data into HBase and Hive, Sqoop will need those libraries. For common functionality, these dependencies are not mandatory.

Installing packages is simpler than using tarballs. They are already integrated with the operating system and will automatically download and install most of the required dependencies during the Sqoop installation. Due to licensing, the JDBC drivers won’t be installed automatically. For those instructions, check out the section Installing JDBC Drivers.

Bigtop provides repositories that can be easily added into your system in order to find and install the dependencies. Bigtop installation instructions can be found in the Bigtop project documentation. Once Bigtop is successfully deployed, installing Sqoop is very simple and can be done with the following commands:

  • To install Sqoop on a Red Hat, CentOS, or other yum system:
$ sudo yum install sqoop
  • To install Sqoop on an Ubuntu, Debian, or other deb-based system:
$ sudo apt-get install sqoop
  • To install Sqoop on a SLES system:
$ sudo zypper install sqoop

Sqoop’s main configuration file sqoop-site.xml is available in the configuration directory (conf/ when using the tarball or /etc/sqoop/conf when using Bigtop packages). While you can further customize Sqoop, the defaults will suffice in a majority of cases. All available properties are documented in the sqoop-site.xml file. We will explain the more commonly used properties in greater detail later in the book.

Installing JDBC Drivers

Problem

Sqoop requires the JDBC drivers for your specific database server (MySQL, Oracle, etc.) in order to transfer data. They are not bundled in the tarball or packages.

Solution

You need to download the JDBC drivers and then install them into Sqoop. JDBC drivers are usually available free of charge from the database vendors’ websites. Some enterprise data stores might bundle the driver with the installation itself. After you’ve obtained the driver, you need to copy the driver’s JAR file(s) into Sqoop’s lib/ directory. If you’re using the Sqoop tarball, copy the JAR files directly into the lib/ directory after unzipping the tarball. If you’re using packages, you will need to copy the driver files into the /usr/lib/sqoop/lib directory.

Discussion

JDBC is a Java specific database-vendor independent interface for accessing relational databases and enterprise data warehouses. Upon this generic interface, each database vendor must implement a compliant driver containing required functionality. Due to licensing, the Sqoop project can’t bundle the drivers in the distribution. You will need to download and install each driver individually.

Each database vendor has a slightly different method for retrieving the JDBC driver. Most of them make it available as a free download from their websites. Please contact your database administrator if you are not sure how to retrieve the driver.

Installing Specialized Connectors

Problem

Some database systems provide special connectors, which are not part of the Sqoop distribution, and these take advantage of advanced database features. If you want to take advantage of these optimizations, you will need to individually download and install those specialized connectors.

Solution

On the node running Sqoop, you can install the specialized connectors anywhere on the local filesystem. If you plan to run Sqoop from multiple nodes, you have to install the connector on all of those nodes. To be clear, you do not have to install the connector on all nodes in your cluster, as Sqoop will automatically propagate the appropriate JARs as needed throughout your cluster.

In addition to installing the connector JARs on the local filesystem, you also need to register them with Sqoop. First, create a directory manager.d in the Sqoop configuration directory (if it does not exist already). The configuration directory might be in a different location, based on how you’ve installed Sqoop. With packages, it’s usually in the /etc/sqoop directory, and with tarballs, it’s usually in the conf/ directory. Then, inside this directory, you need to create a file (naming it after the connector is a recommended best practice) that contains the following line:

connector.fully.qualified.class.name=/full/path/to/the/jar

You can find the name of the fully qualified class in each connector’s documentation.

Discussion

A significant strength of Sqoop is its ability to work with all major and minor database systems and enterprise data warehouses. To abstract the different behavior of each system, Sqoop introduced the concept of connectors: all database-specific operations are delegated from core Sqoop to the specialized connectors. Sqoop itself bundles many such connectors; you do not need to download anything extra in order to run Sqoop. The most general connector bundled with Sqoop is the Generic JDBC Connector that utilizes only the JDBC interface. This will work with every JDBC-compliant database system. In addition to this generic connector, Sqoop also ships with specialized connectors for MySQL, Oracle, PostgreSQL, Microsoft SQL Server, and DB2, which utilize special properties of each particular database system. You do not need to explicitly select the desired connector, as Sqoop will automatically do so based on your JDBC URL.

In addition to the built-in connectors, there are many specialized connectors available for download. Some of them are further described in this book. For example, OraOop is described in Faster Transfers with Oracle, and Cloudera Connector for Teradata is described in Using the Cloudera Teradata Connector. More advanced users can develop their own connectors by following the guidelines listed in the Sqoop Developer’s Guide.

Most, if not all, of the connectors depend on the underlying JDBC drivers in order to make the connection to the remote database server. It’s imperative to install both the specialized connector and the appropriate JDBC driver. It’s also important to distinguish the connector from the JDBC driver. The connector is a Sqoop-specific pluggable piece that is used to delegate some of the functionality that might be done faster when using database-specific tweaks. The JDBC driver is also a pluggable piece. However, it is independent of Sqoop and exposes database interfaces in a portable manner for all Java applications.

Caution

Sqoop always requires both the connector and the JDBC driver.

Starting Sqoop

Problem

You’ve successfully installed and configured Sqoop, and now you want to know how to run it.

Solution

Sqoop is a command-line tool that can be called from any shell implementation such as bash or zsh. An example Sqoop command might look like the following (all parameters will be described later in the book):

sqoop import \
  -Dsqoop.export.records.per.statement=1 \
  --connect jdbc:postgresql://postgresql.example.com/database \
  --username sqoop \
  --password sqoop \
  --table cities \
  -- \
  --schema us

Discussion

The command-line interface has the following structure:

sqoop TOOL PROPERTY_ARGS SQOOP_ARGS [-- EXTRA_ARGS]

TOOL indicates the operation that you want to perform. The most important operations are import for transferring data from a database to Hadoop and export for transferring data from Hadoop to a database. PROPERTY_ARGS are a special set of parameters that are entered as Java properties in the format -Dname=value (examples appear later in the book). Property parameters are followed by SQOOP_ARGS that contain all the various Sqoop parameters.

Caution

Mixing property and Sqoop parameters together is not allowed. Furthermore, all property parameters must precede all Sqoop parameters.

You can specify EXTRA_ARGS for specialized connectors, which can be used to enter additional parameters specific to each connector. The EXTRA_ARGS parameters must be separated from the SQOOP_ARGS with a --.

Note

Sqoop has a bewildering number of command-line parameters (more than 60). Type sqoop help to retrieve the entire list. Type sqoop help TOO (e.g., sqoop help import) to get detailed information for a specific tool.

Getting Help with Sqoop

Problem

You have a question that is not answered by this book.

Solution

You can ask for help from the Sqoop community via the mailing lists. The Sqoop Mailing Lists page contains general information and instructions for using the Sqoop User and Development mailing lists. Here is an outline of the general process:

  1. First, you need to subscribe to the User list at the Sqoop Mailing Lists page.
  2. To get the most out of the Sqoop mailing lists, you may want to read Eric Raymond’s How To Ask Questions The Smart Way.
  3. Then provide the full context of your problem with details on observed or desired behavior. If appropriate, include a minimal self-reproducing example so that others can reproduce the problem you’re facing.
  4. Finally, email your question to .

Discussion

Before sending email to the mailing list, it is useful to read the Sqoop documentation and search the Sqoop mailing list archives. Most likely your question has already been asked, in which case you’ll be able to get an immediate answer by searching the archives. If it seems that your question hasn’t been asked yet, send it to .

Caution

If you aren’t already a list subscriber, your email submission will be rejected.

Your question might have to do with your Sqoop command causing an error or giving unexpected results. In the latter case, it is necessary to include enough data to reproduce the error. If the list readers can’t reproduce it, they can’t diagnose it. Including relevant information greatly increases the probability of getting a useful answer.

To that end, you’ll need to include the following information:

  • Versions: Sqoop, Hadoop, OS, JDBC
  • Console log after running with the --verbose flag

    • Capture the entire output via sqoop import … &> sqoop.log
  • Entire Sqoop command including the options-file if applicable
  • Expected output and actual output
  • Table definition
  • Small input data set that triggers the problem

    • Especially with export, malformed data is often the culprit
  • Hadoop task logs

    • Often the task logs contain further information describing the problem
  • Permissions on input files

While the project has several communication channels, the mailing lists are not only the most active but also the official channels for making decisions about the project itself. If you’re interested in learning more about or participating in the Apache Sqoop project, the mailing lists are the best way to do that.

The best content for your career. Discover unlimited learning on demand for around $1/day.