
Hadoop: The Definitive Guide, 3rd Edition

By Tom White. Published by O'Reilly Media, Inc.
  1. Hadoop: The Definitive Guide
  2. Dedication
  3. Foreword
  4. Preface
    1. Administrative Notes
    2. What’s in This Book?
    3. What’s New in the Second Edition?
    4. What’s New in the Third Edition?
    5. Conventions Used in This Book
    6. Using Code Examples
    7. Safari® Books Online
    8. How to Contact Us
    9. Acknowledgments
  5. 1. Meet Hadoop
    1. Data!
    2. Data Storage and Analysis
    3. Comparison with Other Systems
      1. Relational Database Management System
      2. Grid Computing
      3. Volunteer Computing
    4. A Brief History of Hadoop
    5. Apache Hadoop and the Hadoop Ecosystem
    6. Hadoop Releases
      1. What’s Covered in This Book
      2. Compatibility
  6. 2. MapReduce
    1. A Weather Dataset
      1. Data Format
    2. Analyzing the Data with Unix Tools
    3. Analyzing the Data with Hadoop
      1. Map and Reduce
      2. Java MapReduce
    4. Scaling Out
      1. Data Flow
      2. Combiner Functions
      3. Running a Distributed MapReduce Job
    5. Hadoop Streaming
      1. Ruby
      2. Python
    6. Hadoop Pipes
      1. Compiling and Running
  7. 3. The Hadoop Distributed Filesystem
    1. The Design of HDFS
    2. HDFS Concepts
      1. Blocks
      2. Namenodes and Datanodes
      3. HDFS Federation
      4. HDFS High-Availability
    3. The Command-Line Interface
      1. Basic Filesystem Operations
    4. Hadoop Filesystems
      1. Interfaces
    5. The Java Interface
      1. Reading Data from a Hadoop URL
      2. Reading Data Using the FileSystem API
      3. Writing Data
      4. Directories
      5. Querying the Filesystem
      6. Deleting Data
    6. Data Flow
      1. Anatomy of a File Read
      2. Anatomy of a File Write
      3. Coherency Model
    7. Data Ingest with Flume and Sqoop
    8. Parallel Copying with distcp
      1. Keeping an HDFS Cluster Balanced
    9. Hadoop Archives
      1. Using Hadoop Archives
      2. Limitations
  8. 4. Hadoop I/O
    1. Data Integrity
      1. Data Integrity in HDFS
      2. LocalFileSystem
      3. ChecksumFileSystem
    2. Compression
      1. Codecs
      2. Compression and Input Splits
      3. Using Compression in MapReduce
    3. Serialization
      1. The Writable Interface
      2. Writable Classes
      3. Implementing a Custom Writable
      4. Serialization Frameworks
    4. Avro
      1. Avro Data Types and Schemas
      2. In-Memory Serialization and Deserialization
      3. Avro Datafiles
      4. Interoperability
      5. Schema Resolution
      6. Sort Order
      7. Avro MapReduce
      8. Sorting Using Avro MapReduce
      9. Avro MapReduce in Other Languages
    5. File-Based Data Structures
      1. SequenceFile
      2. MapFile
  9. 5. Developing a MapReduce Application
    1. The Configuration API
      1. Combining Resources
      2. Variable Expansion
    2. Setting Up the Development Environment
      1. Managing Configuration
      2. GenericOptionsParser, Tool, and ToolRunner
    3. Writing a Unit Test with MRUnit
      1. Mapper
      2. Reducer
    4. Running Locally on Test Data
      1. Running a Job in a Local Job Runner
      2. Testing the Driver
    5. Running on a Cluster
      1. Packaging a Job
      2. Launching a Job
      3. The MapReduce Web UI
      4. Retrieving the Results
      5. Debugging a Job
      6. Hadoop Logs
      7. Remote Debugging
    6. Tuning a Job
      1. Profiling Tasks
    7. MapReduce Workflows
      1. Decomposing a Problem into MapReduce Jobs
      2. JobControl
      3. Apache Oozie
  10. 6. How MapReduce Works
    1. Anatomy of a MapReduce Job Run
      1. Classic MapReduce (MapReduce 1)
      2. YARN (MapReduce 2)
    2. Failures
      1. Failures in Classic MapReduce
      2. Failures in YARN
    3. Job Scheduling
      1. The Fair Scheduler
      2. The Capacity Scheduler
    4. Shuffle and Sort
      1. The Map Side
      2. The Reduce Side
      3. Configuration Tuning
    5. Task Execution
      1. The Task Execution Environment
      2. Speculative Execution
      3. Output Committers
      4. Task JVM Reuse
      5. Skipping Bad Records
  11. 7. MapReduce Types and Formats
    1. MapReduce Types
      1. The Default MapReduce Job
    2. Input Formats
      1. Input Splits and Records
      2. Text Input
      3. Binary Input
      4. Multiple Inputs
      5. Database Input (and Output)
    3. Output Formats
      1. Text Output
      2. Binary Output
      3. Multiple Outputs
      4. Lazy Output
      5. Database Output
  12. 8. MapReduce Features
    1. Counters
      1. Built-in Counters
      2. User-Defined Java Counters
      3. User-Defined Streaming Counters
    2. Sorting
      1. Preparation
      2. Partial Sort
      3. Total Sort
      4. Secondary Sort
    3. Joins
      1. Map-Side Joins
      2. Reduce-Side Joins
    4. Side Data Distribution
      1. Using the Job Configuration
      2. Distributed Cache
    5. MapReduce Library Classes
  13. 9. Setting Up a Hadoop Cluster
    1. Cluster Specification
      1. Network Topology
    2. Cluster Setup and Installation
      1. Installing Java
      2. Creating a Hadoop User
      3. Installing Hadoop
      4. Testing the Installation
    3. SSH Configuration
    4. Hadoop Configuration
      1. Configuration Management
      2. Environment Settings
      3. Important Hadoop Daemon Properties
      4. Hadoop Daemon Addresses and Ports
      5. Other Hadoop Properties
      6. User Account Creation
    5. YARN Configuration
      1. Important YARN Daemon Properties
      2. YARN Daemon Addresses and Ports
    6. Security
      1. Kerberos and Hadoop
      2. Delegation Tokens
      3. Other Security Enhancements
    7. Benchmarking a Hadoop Cluster
      1. Hadoop Benchmarks
      2. User Jobs
    8. Hadoop in the Cloud
      1. Apache Whirr
  14. 10. Administering Hadoop
    1. HDFS
      1. Persistent Data Structures
      2. Safe Mode
      3. Audit Logging
      4. Tools
    2. Monitoring
      1. Logging
      2. Metrics
      3. Java Management Extensions
    3. Maintenance
      1. Routine Administration Procedures
      2. Commissioning and Decommissioning Nodes
      3. Upgrades
  15. 11. Pig
    1. Installing and Running Pig
      1. Execution Types
      2. Running Pig Programs
      3. Grunt
      4. Pig Latin Editors
    2. An Example
      1. Generating Examples
    3. Comparison with Databases
    4. Pig Latin
      1. Structure
      2. Statements
      3. Expressions
      4. Types
      5. Schemas
      6. Functions
      7. Macros
    5. User-Defined Functions
      1. A Filter UDF
      2. An Eval UDF
      3. A Load UDF
    6. Data Processing Operators
      1. Loading and Storing Data
      2. Filtering Data
      3. Grouping and Joining Data
      4. Sorting Data
      5. Combining and Splitting Data
    7. Pig in Practice
      1. Parallelism
      2. Parameter Substitution
  16. 12. Hive
    1. Installing Hive
      1. The Hive Shell
    2. An Example
    3. Running Hive
      1. Configuring Hive
      2. Hive Services
      3. The Metastore
    4. Comparison with Traditional Databases
      1. Schema on Read Versus Schema on Write
      2. Updates, Transactions, and Indexes
    5. HiveQL
      1. Data Types
      2. Operators and Functions
    6. Tables
      1. Managed Tables and External Tables
      2. Partitions and Buckets
      3. Storage Formats
      4. Importing Data
      5. Altering Tables
      6. Dropping Tables
    7. Querying Data
      1. Sorting and Aggregating
      2. MapReduce Scripts
      3. Joins
      4. Subqueries
      5. Views
    8. User-Defined Functions
      1. Writing a UDF
      2. Writing a UDAF
  17. 13. HBase
    1. HBasics
      1. Backdrop
    2. Concepts
      1. Whirlwind Tour of the Data Model
      2. Implementation
    3. Installation
      1. Test Drive
    4. Clients
      1. Java
      2. Avro, REST, and Thrift
    5. Example
      1. Schemas
      2. Loading Data
      3. Web Queries
    6. HBase Versus RDBMS
      1. Successful Service
      2. HBase
      3. Use Case: HBase at Streamy.com
    7. Praxis
      1. Versions
      2. HDFS
      3. UI
      4. Metrics
      5. Schema Design
      6. Counters
      7. Bulk Load
  18. 14. ZooKeeper
    1. Installing and Running ZooKeeper
    2. An Example
      1. Group Membership in ZooKeeper
      2. Creating the Group
      3. Joining a Group
      4. Listing Members in a Group
      5. Deleting a Group
    3. The ZooKeeper Service
      1. Data Model
      2. Operations
      3. Implementation
      4. Consistency
      5. Sessions
      6. States
    4. Building Applications with ZooKeeper
      1. A Configuration Service
      2. The Resilient ZooKeeper Application
      3. A Lock Service
      4. More Distributed Data Structures and Protocols
    5. ZooKeeper in Production
      1. Resilience and Performance
      2. Configuration
  19. 15. Sqoop
    1. Getting Sqoop
    2. Sqoop Connectors
    3. A Sample Import
      1. Text and Binary File Formats
    4. Generated Code
      1. Additional Serialization Systems
    5. Imports: A Deeper Look
      1. Controlling the Import
      2. Imports and Consistency
      3. Direct-mode Imports
    6. Working with Imported Data
      1. Imported Data and Hive
    7. Importing Large Objects
    8. Performing an Export
    9. Exports: A Deeper Look
      1. Exports and Transactionality
      2. Exports and SequenceFiles
  20. 16. Case Studies
    1. Hadoop Usage at Last.fm
      1. Last.fm: The Social Music Revolution
      2. Hadoop at Last.fm
      3. Generating Charts with Hadoop
      4. The Track Statistics Program
      5. Summary
    2. Hadoop and Hive at Facebook
      1. Hadoop at Facebook
      2. Hypothetical Use Case Studies
      3. Hive
      4. Problems and Future Work
    3. Nutch Search Engine
      1. Data Structures
      2. Selected Examples of Hadoop Data Processing in Nutch
      3. Summary
    4. Log Processing at Rackspace
      1. Requirements/The Problem
      2. Brief History
      3. Choosing Hadoop
      4. Collection and Storage
      5. MapReduce for Logs
    5. Cascading
      1. Fields, Tuples, and Pipes
      2. Operations
      3. Taps, Schemes, and Flows
      4. Cascading in Practice
      5. Flexibility
      6. Hadoop and Cascading at ShareThis
      7. Summary
    6. TeraByte Sort on Apache Hadoop
    7. Using Pig and Wukong to Explore Billion-edge Network Graphs
      1. Measuring Community
      2. Everybody’s Talkin’ at Me: The Twitter Reply Graph
      3. Symmetric Links
      4. Community Extraction
  21. A. Installing Apache Hadoop
    1. Prerequisites
    2. Installation
    3. Configuration
      1. Standalone Mode
      2. Pseudodistributed Mode
      3. Fully Distributed Mode
  22. B. Cloudera’s Distribution Including Apache Hadoop
  23. C. Preparing the NCDC Weather Data
  24. Index
  25. About the Author
  26. Colophon
  27. Copyright

Chapter 7. MapReduce Types and Formats

MapReduce has a simple model of data processing: inputs and outputs for the map and reduce functions are key-value pairs. This chapter looks at the MapReduce model in detail and, in particular, how data in various formats, from simple text to structured binary objects, can be used with this model.

MapReduce Types

The map and reduce functions in Hadoop MapReduce have the following general form:

map: (K1, V1) → list(K2, V2)
reduce: (K2, list(V2)) → list(K3, V3)
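
For example, in a word count job the map input key is a line's offset in the file and the input value is the line of text; the intermediate pairs map each word to a count of 1, and the reduce output pairs each word with its total. The types are instantiated as:

map: (LongWritable, Text) → list(Text, IntWritable)
reduce: (Text, list(IntWritable)) → list(Text, IntWritable)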

In general, the map input key and value types (K1 and V1) are different from the map output types (K2 and V2). However, the reduce input must have the same types as the map output, although the reduce output types may be different again (K3 and V3). The Java API mirrors this general form:

public class Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {

  public class Context extends MapContext<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
    // ...
  }

  protected void map(KEYIN key, VALUEIN value, 
                     Context context) throws IOException, InterruptedException {
    // ...
  }
}

public class Reducer<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {

  public class Context extends ReducerContext<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
    // ...
  }

  protected void reduce(KEYIN key, Iterable<VALUEIN> values,
                        Context context) throws IOException, InterruptedException {
    // ...
  }
}
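
To see how the abstract type parameters are bound in a real job, here is a minimal word count sketch written against the new org.apache.hadoop.mapreduce API (the class names are illustrative, not an excerpt from the book):

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Binds the type parameters to concrete Writable types:
//   K1 = LongWritable (line offset), V1 = Text (the line itself)
//   K2 = Text (a word),              V2 = IntWritable (a count of 1)
//   K3 = Text (a word),              V3 = IntWritable (the total count)
public class WordCount {

  public static class TokenizingMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      for (String token : value.toString().split("\\s+")) {
        if (!token.isEmpty()) {
          word.set(token);
          context.write(word, ONE); // emits (K2, V2) pairs
        }
      }
    }
  }

  public static class SummingReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable value : values) {
        sum += value.get();
      }
      result.set(sum);
      context.write(key, result); // emits (K3, V3) pairs
    }
  }
}

Note that the reducer's first two type parameters (Text, IntWritable) must match the mapper's last two, mirroring the requirement that the reduce input types equal the map output types.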

The context objects are used for emitting key-value pairs, and they are parameterized by the output types so that the signature of the write() method is:

public void write(KEYOUT key, VALUEOUT value)
    throws IOException, InterruptedException
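
Because the context is parameterized by the output types, an attempt to emit a key or value of the wrong type is caught at compile time rather than at runtime. For instance (an illustrative fragment, assuming the mapper sketched above):

// Inside the map() method of a Mapper<LongWritable, Text, Text, IntWritable>:
context.write(new Text("word"), new IntWritable(1));    // OK: matches (KEYOUT, VALUEOUT)
// context.write(new IntWritable(1), new Text("word")); // would not compile: types reversed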
