
Hadoop: The Definitive Guide, 3rd Edition

By Tom White. Published by O'Reilly Media, Inc.
  1. Hadoop: The Definitive Guide
  2. Dedication
  3. Foreword
  4. Preface
    1. Administrative Notes
    2. What’s in This Book?
    3. What’s New in the Second Edition?
    4. What’s New in the Third Edition?
    5. Conventions Used in This Book
    6. Using Code Examples
    7. Safari® Books Online
    8. How to Contact Us
    9. Acknowledgments
  5. 1. Meet Hadoop
    1. Data!
    2. Data Storage and Analysis
    3. Comparison with Other Systems
      1. Relational Database Management System
      2. Grid Computing
      3. Volunteer Computing
    4. A Brief History of Hadoop
    5. Apache Hadoop and the Hadoop Ecosystem
    6. Hadoop Releases
      1. What’s Covered in This Book
      2. Compatibility
  6. 2. MapReduce
    1. A Weather Dataset
      1. Data Format
    2. Analyzing the Data with Unix Tools
    3. Analyzing the Data with Hadoop
      1. Map and Reduce
      2. Java MapReduce
    4. Scaling Out
      1. Data Flow
      2. Combiner Functions
      3. Running a Distributed MapReduce Job
    5. Hadoop Streaming
      1. Ruby
      2. Python
    6. Hadoop Pipes
      1. Compiling and Running
  7. 3. The Hadoop Distributed Filesystem
    1. The Design of HDFS
    2. HDFS Concepts
      1. Blocks
      2. Namenodes and Datanodes
      3. HDFS Federation
      4. HDFS High-Availability
    3. The Command-Line Interface
      1. Basic Filesystem Operations
    4. Hadoop Filesystems
      1. Interfaces
    5. The Java Interface
      1. Reading Data from a Hadoop URL
      2. Reading Data Using the FileSystem API
      3. Writing Data
      4. Directories
      5. Querying the Filesystem
      6. Deleting Data
    6. Data Flow
      1. Anatomy of a File Read
      2. Anatomy of a File Write
      3. Coherency Model
    7. Data Ingest with Flume and Sqoop
    8. Parallel Copying with distcp
      1. Keeping an HDFS Cluster Balanced
    9. Hadoop Archives
      1. Using Hadoop Archives
      2. Limitations
  8. 4. Hadoop I/O
    1. Data Integrity
      1. Data Integrity in HDFS
      2. LocalFileSystem
      3. ChecksumFileSystem
    2. Compression
      1. Codecs
      2. Compression and Input Splits
      3. Using Compression in MapReduce
    3. Serialization
      1. The Writable Interface
      2. Writable Classes
      3. Implementing a Custom Writable
      4. Serialization Frameworks
    4. Avro
      1. Avro Data Types and Schemas
      2. In-Memory Serialization and Deserialization
      3. Avro Datafiles
      4. Interoperability
      5. Schema Resolution
      6. Sort Order
      7. Avro MapReduce
      8. Sorting Using Avro MapReduce
      9. Avro MapReduce in Other Languages
    5. File-Based Data Structures
      1. SequenceFile
      2. MapFile
  9. 5. Developing a MapReduce Application
    1. The Configuration API
      1. Combining Resources
      2. Variable Expansion
    2. Setting Up the Development Environment
      1. Managing Configuration
      2. GenericOptionsParser, Tool, and ToolRunner
    3. Writing a Unit Test with MRUnit
      1. Mapper
      2. Reducer
    4. Running Locally on Test Data
      1. Running a Job in a Local Job Runner
      2. Testing the Driver
    5. Running on a Cluster
      1. Packaging a Job
      2. Launching a Job
      3. The MapReduce Web UI
      4. Retrieving the Results
      5. Debugging a Job
      6. Hadoop Logs
      7. Remote Debugging
    6. Tuning a Job
      1. Profiling Tasks
    7. MapReduce Workflows
      1. Decomposing a Problem into MapReduce Jobs
      2. JobControl
      3. Apache Oozie
  10. 6. How MapReduce Works
    1. Anatomy of a MapReduce Job Run
      1. Classic MapReduce (MapReduce 1)
      2. YARN (MapReduce 2)
    2. Failures
      1. Failures in Classic MapReduce
      2. Failures in YARN
    3. Job Scheduling
      1. The Fair Scheduler
      2. The Capacity Scheduler
    4. Shuffle and Sort
      1. The Map Side
      2. The Reduce Side
      3. Configuration Tuning
    5. Task Execution
      1. The Task Execution Environment
      2. Speculative Execution
      3. Output Committers
      4. Task JVM Reuse
      5. Skipping Bad Records
  11. 7. MapReduce Types and Formats
    1. MapReduce Types
      1. The Default MapReduce Job
    2. Input Formats
      1. Input Splits and Records
      2. Text Input
      3. Binary Input
      4. Multiple Inputs
      5. Database Input (and Output)
    3. Output Formats
      1. Text Output
      2. Binary Output
      3. Multiple Outputs
      4. Lazy Output
      5. Database Output
  12. 8. MapReduce Features
    1. Counters
      1. Built-in Counters
      2. User-Defined Java Counters
      3. User-Defined Streaming Counters
    2. Sorting
      1. Preparation
      2. Partial Sort
      3. Total Sort
      4. Secondary Sort
    3. Joins
      1. Map-Side Joins
      2. Reduce-Side Joins
    4. Side Data Distribution
      1. Using the Job Configuration
      2. Distributed Cache
    5. MapReduce Library Classes
  13. 9. Setting Up a Hadoop Cluster
    1. Cluster Specification
      1. Network Topology
    2. Cluster Setup and Installation
      1. Installing Java
      2. Creating a Hadoop User
      3. Installing Hadoop
      4. Testing the Installation
    3. SSH Configuration
    4. Hadoop Configuration
      1. Configuration Management
      2. Environment Settings
      3. Important Hadoop Daemon Properties
      4. Hadoop Daemon Addresses and Ports
      5. Other Hadoop Properties
      6. User Account Creation
    5. YARN Configuration
      1. Important YARN Daemon Properties
      2. YARN Daemon Addresses and Ports
    6. Security
      1. Kerberos and Hadoop
      2. Delegation Tokens
      3. Other Security Enhancements
    7. Benchmarking a Hadoop Cluster
      1. Hadoop Benchmarks
      2. User Jobs
    8. Hadoop in the Cloud
      1. Apache Whirr
  14. 10. Administering Hadoop
    1. HDFS
      1. Persistent Data Structures
      2. Safe Mode
      3. Audit Logging
      4. Tools
    2. Monitoring
      1. Logging
      2. Metrics
      3. Java Management Extensions
    3. Maintenance
      1. Routine Administration Procedures
      2. Commissioning and Decommissioning Nodes
      3. Upgrades
  15. 11. Pig
    1. Installing and Running Pig
      1. Execution Types
      2. Running Pig Programs
      3. Grunt
      4. Pig Latin Editors
    2. An Example
      1. Generating Examples
    3. Comparison with Databases
    4. Pig Latin
      1. Structure
      2. Statements
      3. Expressions
      4. Types
      5. Schemas
      6. Functions
      7. Macros
    5. User-Defined Functions
      1. A Filter UDF
      2. An Eval UDF
      3. A Load UDF
    6. Data Processing Operators
      1. Loading and Storing Data
      2. Filtering Data
      3. Grouping and Joining Data
      4. Sorting Data
      5. Combining and Splitting Data
    7. Pig in Practice
      1. Parallelism
      2. Parameter Substitution
  16. 12. Hive
    1. Installing Hive
      1. The Hive Shell
    2. An Example
    3. Running Hive
      1. Configuring Hive
      2. Hive Services
      3. The Metastore
    4. Comparison with Traditional Databases
      1. Schema on Read Versus Schema on Write
      2. Updates, Transactions, and Indexes
    5. HiveQL
      1. Data Types
      2. Operators and Functions
    6. Tables
      1. Managed Tables and External Tables
      2. Partitions and Buckets
      3. Storage Formats
      4. Importing Data
      5. Altering Tables
      6. Dropping Tables
    7. Querying Data
      1. Sorting and Aggregating
      2. MapReduce Scripts
      3. Joins
      4. Subqueries
      5. Views
    8. User-Defined Functions
      1. Writing a UDF
      2. Writing a UDAF
  17. 13. HBase
    1. HBasics
      1. Backdrop
    2. Concepts
      1. Whirlwind Tour of the Data Model
      2. Implementation
    3. Installation
      1. Test Drive
    4. Clients
      1. Java
      2. Avro, REST, and Thrift
    5. Example
      1. Schemas
      2. Loading Data
      3. Web Queries
    6. HBase Versus RDBMS
      1. Successful Service
      2. HBase
      3. Use Case: HBase at Streamy.com
    7. Praxis
      1. Versions
      2. HDFS
      3. UI
      4. Metrics
      5. Schema Design
      6. Counters
      7. Bulk Load
  18. 14. ZooKeeper
    1. Installing and Running ZooKeeper
    2. An Example
      1. Group Membership in ZooKeeper
      2. Creating the Group
      3. Joining a Group
      4. Listing Members in a Group
      5. Deleting a Group
    3. The ZooKeeper Service
      1. Data Model
      2. Operations
      3. Implementation
      4. Consistency
      5. Sessions
      6. States
    4. Building Applications with ZooKeeper
      1. A Configuration Service
      2. The Resilient ZooKeeper Application
      3. A Lock Service
      4. More Distributed Data Structures and Protocols
    5. ZooKeeper in Production
      1. Resilience and Performance
      2. Configuration
  19. 15. Sqoop
    1. Getting Sqoop
    2. Sqoop Connectors
    3. A Sample Import
      1. Text and Binary File Formats
    4. Generated Code
      1. Additional Serialization Systems
    5. Imports: A Deeper Look
      1. Controlling the Import
      2. Imports and Consistency
      3. Direct-mode Imports
    6. Working with Imported Data
      1. Imported Data and Hive
    7. Importing Large Objects
    8. Performing an Export
    9. Exports: A Deeper Look
      1. Exports and Transactionality
      2. Exports and SequenceFiles
  20. 16. Case Studies
    1. Hadoop Usage at Last.fm
      1. Last.fm: The Social Music Revolution
      2. Hadoop at Last.fm
      3. Generating Charts with Hadoop
      4. The Track Statistics Program
      5. Summary
    2. Hadoop and Hive at Facebook
      1. Hadoop at Facebook
      2. Hypothetical Use Case Studies
      3. Hive
      4. Problems and Future Work
    3. Nutch Search Engine
      1. Data Structures
      2. Selected Examples of Hadoop Data Processing in Nutch
      3. Summary
    4. Log Processing at Rackspace
      1. Requirements/The Problem
      2. Brief History
      3. Choosing Hadoop
      4. Collection and Storage
      5. MapReduce for Logs
    5. Cascading
      1. Fields, Tuples, and Pipes
      2. Operations
      3. Taps, Schemes, and Flows
      4. Cascading in Practice
      5. Flexibility
      6. Hadoop and Cascading at ShareThis
      7. Summary
    6. TeraByte Sort on Apache Hadoop
    7. Using Pig and Wukong to Explore Billion-edge Network Graphs
      1. Measuring Community
      2. Everybody’s Talkin’ at Me: The Twitter Reply Graph
      3. Symmetric Links
      4. Community Extraction
  21. A. Installing Apache Hadoop
    1. Prerequisites
    2. Installation
    3. Configuration
      1. Standalone Mode
      2. Pseudodistributed Mode
      3. Fully Distributed Mode
  22. B. Cloudera’s Distribution Including Apache Hadoop
  23. C. Preparing the NCDC Weather Data
  24. Index
  25. About the Author
  26. Colophon
  27. Copyright

Chapter 15. Sqoop

Aaron Kimball

A great strength of the Hadoop platform is its ability to work with data in several different forms. HDFS can reliably store logs and other data from a plethora of sources, and MapReduce programs can parse diverse ad hoc data formats, extracting relevant information and combining multiple datasets into powerful results.

But to interact with data in storage repositories outside of HDFS, MapReduce programs need to use external APIs to get to this data. Often, valuable data in an organization is stored in structured data stores such as relational database management systems (RDBMSs). Apache Sqoop is an open source tool that allows users to extract data from a structured data store into Hadoop for further processing. This processing can be done with MapReduce programs or other higher-level tools such as Hive. (It’s even possible to use Sqoop to move data from a database into HBase.) When the final results of an analytic pipeline are available, Sqoop can export these results back to the data store for consumption by other clients.
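To make the round trip concrete, here is a sketch of what an import and a later export look like from the command line. The JDBC URL, database name, username, table names, and HDFS paths are placeholders for illustration, not values from this chapter; substitute your own.

```shell
# Import a database table into HDFS as delimited text files.
# jdbc:mysql://localhost/hadoopguide, sqoopuser, and widgets are
# hypothetical -- replace them with your own connection details.
sqoop import \
  --connect jdbc:mysql://localhost/hadoopguide \
  --username sqoopuser \
  --table widgets \
  --target-dir /user/hadoop/widgets

# After processing with MapReduce or Hive, export the results
# from HDFS back into a (pre-created) database table.
sqoop export \
  --connect jdbc:mysql://localhost/hadoopguide \
  --username sqoopuser \
  --table widget_results \
  --export-dir /user/hadoop/widget_results
```

The import writes one set of files per table into the target directory; the export reads records from the export directory and inserts them into the named table, which must already exist in the database.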

In this chapter, we’ll take a look at how Sqoop works and how you can use it in your data processing pipeline.

Getting Sqoop

Sqoop is available in a few places. The primary home of the project is http://sqoop.apache.org/. This repository contains all the Sqoop source code and documentation. Official releases are available at this site, as well as the source code for the version currently under development. The repository itself contains instructions ...
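If you install from a release tarball, the basic steps are unpacking the archive and putting Sqoop's bin directory on your path. The archive filename below is a placeholder; use whatever release you download from the site.

```shell
# Unpack a downloaded Sqoop release (filename is illustrative --
# check http://sqoop.apache.org/ for the current release).
tar xzf sqoop-x.y.z.tar.gz
export SQOOP_HOME=$(pwd)/sqoop-x.y.z
export PATH=$PATH:$SQOOP_HOME/bin

# Confirm the installation by listing Sqoop's available tools.
sqoop help
```

Sqoop also needs to find a Hadoop installation (typically via HADOOP_HOME or an equivalent environment variable), and a JDBC driver for your database must be available on Sqoop's classpath before imports will work.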
