Hadoop: The Definitive Guide, 2nd Edition

Book description

Discover how Apache Hadoop can unleash the power of your data. This comprehensive resource shows you how to build and maintain reliable, scalable, distributed systems with the Hadoop framework -- an open source implementation of MapReduce, the programming model on which Google built its empire. Programmers will find details for analyzing datasets of any size, and administrators will learn how to set up and run Hadoop clusters.

This revised edition covers recent changes to Hadoop, including new features such as Hive, Sqoop, and Avro. It also provides illuminating case studies that illustrate how Hadoop is used to solve specific problems. Looking to get the most out of your data? This is your book.

  • Use the Hadoop Distributed File System (HDFS) for storing large datasets, then run distributed computations over those datasets with MapReduce
  • Become familiar with Hadoop’s data and I/O building blocks for compression, data integrity, serialization, and persistence
  • Discover common pitfalls and advanced features for writing real-world MapReduce programs
  • Design, build, and administer a dedicated Hadoop cluster, or run Hadoop in the cloud
  • Use Pig, a high-level query language for large-scale data processing
  • Analyze datasets with Hive, Hadoop’s data warehousing system
  • Take advantage of HBase, Hadoop’s database for structured and semi-structured data
  • Learn ZooKeeper, a toolkit of coordination primitives for building distributed systems

"Now you have the opportunity to learn about Hadoop from a master -- not only of the technology, but also of common sense and plain talk."

--Doug Cutting, Cloudera

Table of contents

  1. Hadoop: The Definitive Guide
  2. Dedication
  3. A Note Regarding Supplemental Files
  4. Foreword
  5. Preface
    1. Administrative Notes
    2. What’s in This Book?
    3. What’s New in the Second Edition?
    4. Conventions Used in This Book
    5. Using Code Examples
    6. Safari® Books Online
    7. How to Contact Us
    8. Acknowledgments
  6. 1. Meet Hadoop
    1. Data!
    2. Data Storage and Analysis
    3. Comparison with Other Systems
      1. RDBMS
      2. Grid Computing
      3. Volunteer Computing
    4. A Brief History of Hadoop
    5. Apache Hadoop and the Hadoop Ecosystem
  7. 2. MapReduce
    1. A Weather Dataset
      1. Data Format
    2. Analyzing the Data with Unix Tools
    3. Analyzing the Data with Hadoop
      1. Map and Reduce
      2. Java MapReduce
        1. A test run
        2. The new Java MapReduce API
    4. Scaling Out
      1. Data Flow
      2. Combiner Functions
        1. Specifying a combiner function
      3. Running a Distributed MapReduce Job
    5. Hadoop Streaming
      1. Ruby
      2. Python
    6. Hadoop Pipes
      1. Compiling and Running
  8. 3. The Hadoop Distributed Filesystem
    1. The Design of HDFS
    2. HDFS Concepts
      1. Blocks
      2. Namenodes and Datanodes
    3. The Command-Line Interface
      1. Basic Filesystem Operations
    4. Hadoop Filesystems
      1. Interfaces
        1. Thrift
        2. C
        3. FUSE
        4. WebDAV
        5. Other HDFS Interfaces
    5. The Java Interface
      1. Reading Data from a Hadoop URL
      2. Reading Data Using the FileSystem API
        1. FSDataInputStream
      3. Writing Data
        1. FSDataOutputStream
      4. Directories
      5. Querying the Filesystem
        1. File metadata: FileStatus
        2. Listing files
        3. File patterns
        4. PathFilter
      6. Deleting Data
    6. Data Flow
      1. Anatomy of a File Read
      2. Anatomy of a File Write
      3. Coherency Model
        1. Consequences for application design
    7. Parallel Copying with distcp
      1. Keeping an HDFS Cluster Balanced
    8. Hadoop Archives
      1. Using Hadoop Archives
      2. Limitations
  9. 4. Hadoop I/O
    1. Data Integrity
      1. Data Integrity in HDFS
      2. LocalFileSystem
      3. ChecksumFileSystem
    2. Compression
      1. Codecs
        1. Compressing and decompressing streams with CompressionCodec
        2. Inferring CompressionCodecs using CompressionCodecFactory
        3. Native libraries
          1. CodecPool
      2. Compression and Input Splits
      3. Using Compression in MapReduce
        1. Compressing map output
    3. Serialization
      1. The Writable Interface
        1. WritableComparable and comparators
      2. Writable Classes
        1. Writable wrappers for Java primitives
        2. Text
          1. Indexing
          2. Unicode
          3. Iteration
          4. Mutability
          5. Resorting to String
        3. BytesWritable
        4. NullWritable
        5. ObjectWritable and GenericWritable
        6. Writable collections
      3. Implementing a Custom Writable
        1. Implementing a RawComparator for speed
        2. Custom comparators
      4. Serialization Frameworks
        1. Serialization IDL
      5. Avro
        1. Avro data types and schemas
        2. In-memory serialization and deserialization
        3. Avro data files
        4. Interoperability
          1. Python API
          2. C API
        5. Schema resolution
        6. Sort order
        7. Avro MapReduce
    4. File-Based Data Structures
      1. SequenceFile
        1. Writing a SequenceFile
        2. Reading a SequenceFile
        3. Displaying a SequenceFile with the command-line interface
        4. Sorting and merging SequenceFiles
        5. The SequenceFile format
      2. MapFile
        1. Writing a MapFile
        2. Reading a MapFile
        3. Converting a SequenceFile to a MapFile
  10. 5. Developing a MapReduce Application
    1. The Configuration API
      1. Combining Resources
      2. Variable Expansion
    2. Configuring the Development Environment
      1. Managing Configuration
      2. GenericOptionsParser, Tool, and ToolRunner
    3. Writing a Unit Test
      1. Mapper
      2. Reducer
    4. Running Locally on Test Data
      1. Running a Job in a Local Job Runner
        1. Fixing the mapper
      2. Testing the Driver
    5. Running on a Cluster
      1. Packaging
      2. Launching a Job
      3. The MapReduce Web UI
        1. The jobtracker page
        2. The job page
      4. Retrieving the Results
      5. Debugging a Job
        1. The tasks page
        2. The task details page
        3. Handling malformed data
      6. Using a Remote Debugger
    6. Tuning a Job
      1. Profiling Tasks
        1. The HPROF profiler
        2. Other profilers
    7. MapReduce Workflows
      1. Decomposing a Problem into MapReduce Jobs
      2. Running Dependent Jobs
        1. Oozie
  11. 6. How MapReduce Works
    1. Anatomy of a MapReduce Job Run
      1. Job Submission
      2. Job Initialization
      3. Task Assignment
      4. Task Execution
        1. Streaming and Pipes
      5. Progress and Status Updates
      6. Job Completion
    2. Failures
      1. Task Failure
      2. Tasktracker Failure
      3. Jobtracker Failure
    3. Job Scheduling
      1. The Fair Scheduler
      2. The Capacity Scheduler
    4. Shuffle and Sort
      1. The Map Side
      2. The Reduce Side
      3. Configuration Tuning
    5. Task Execution
      1. Speculative Execution
      2. Task JVM Reuse
      3. Skipping Bad Records
      4. The Task Execution Environment
        1. Streaming environment variables
        2. Task side-effect files
  12. 7. MapReduce Types and Formats
    1. MapReduce Types
      1. The Default MapReduce Job
        1. The default Streaming job
        2. Keys and values in Streaming
    2. Input Formats
      1. Input Splits and Records
        1. FileInputFormat
        2. FileInputFormat input paths
        3. FileInputFormat input splits
        4. Small files and CombineFileInputFormat
        5. Preventing splitting
        6. File information in the mapper
        7. Processing a whole file as a record
      2. Text Input
        1. TextInputFormat
        2. KeyValueTextInputFormat
        3. NLineInputFormat
        4. XML
      3. Binary Input
        1. SequenceFileInputFormat
        2. SequenceFileAsTextInputFormat
        3. SequenceFileAsBinaryInputFormat
      4. Multiple Inputs
      5. Database Input (and Output)
    3. Output Formats
      1. Text Output
      2. Binary Output
        1. SequenceFileOutputFormat
        2. SequenceFileAsBinaryOutputFormat
        3. MapFileOutputFormat
      3. Multiple Outputs
        1. An example: Partitioning data
        2. MultipleOutputFormat
        3. MultipleOutputs
      4. Lazy Output
      5. Database Output
  13. 8. MapReduce Features
    1. Counters
      1. Built-in Counters
      2. User-Defined Java Counters
        1. Dynamic counters
        2. Readable counter names
        3. Retrieving counters
      3. User-Defined Streaming Counters
    2. Sorting
      1. Preparation
      2. Partial Sort
        1. An application: Partitioned MapFile lookups
      3. Total Sort
      4. Secondary Sort
        1. Java code
        2. Streaming
    3. Joins
      1. Map-Side Joins
      2. Reduce-Side Joins
    4. Side Data Distribution
      1. Using the Job Configuration
      2. Distributed Cache
        1. Usage
        2. How it works
        3. The DistributedCache API
    5. MapReduce Library Classes
  14. 9. Setting Up a Hadoop Cluster
    1. Cluster Specification
      1. Network Topology
        1. Rack awareness
    2. Cluster Setup and Installation
      1. Installing Java
      2. Creating a Hadoop User
      3. Installing Hadoop
      4. Testing the Installation
    3. SSH Configuration
    4. Hadoop Configuration
      1. Configuration Management
        1. Control scripts
        2. Master node scenarios
      2. Environment Settings
        1. Memory
        2. Java
        3. System logfiles
        4. SSH settings
      3. Important Hadoop Daemon Properties
        1. HDFS
        2. MapReduce
      4. Hadoop Daemon Addresses and Ports
      5. Other Hadoop Properties
        1. Cluster membership
        2. Buffer size
        3. HDFS block size
        4. Reserved storage space
        5. Trash
        6. Task memory limits
        7. Job scheduler
      6. User Account Creation
    5. Security
      1. Kerberos and Hadoop
        1. An example
      2. Delegation Tokens
      3. Other Security Enhancements
    6. Benchmarking a Hadoop Cluster
      1. Hadoop Benchmarks
        1. Benchmarking HDFS with TestDFSIO
        2. Benchmarking MapReduce with Sort
        3. Other benchmarks
      2. User Jobs
    7. Hadoop in the Cloud
      1. Hadoop on Amazon EC2
        1. Setup
        2. Launching a cluster
        3. Running a MapReduce job
        4. Terminating a cluster
  15. 10. Administering Hadoop
    1. HDFS
      1. Persistent Data Structures
        1. Namenode directory structure
        2. The filesystem image and edit log
        3. Secondary namenode directory structure
        4. Datanode directory structure
      2. Safe Mode
        1. Entering and leaving safe mode
      3. Audit Logging
      4. Tools
        1. dfsadmin
        2. Filesystem check (fsck)
          1. Finding the blocks for a file
        3. Datanode block scanner
        4. balancer
    2. Monitoring
      1. Logging
        1. Setting log levels
        2. Getting stack traces
      2. Metrics
        1. FileContext
        2. GangliaContext
        3. NullContextWithUpdateThread
        4. CompositeContext
      3. Java Management Extensions
    3. Maintenance
      1. Routine Administration Procedures
        1. Metadata backups
        2. Data backups
        3. Filesystem check (fsck)
        4. Filesystem balancer
      2. Commissioning and Decommissioning Nodes
        1. Commissioning new nodes
        2. Decommissioning old nodes
      3. Upgrades
        1. HDFS data and metadata upgrades
          1. Start the upgrade
          2. Wait until the upgrade is complete
          3. Check the upgrade
          4. Roll back the upgrade (optional)
          5. Finalize the upgrade (optional)
  16. 11. Pig
    1. Installing and Running Pig
      1. Execution Types
        1. Local mode
        2. MapReduce mode
      2. Running Pig Programs
      3. Grunt
      4. Pig Latin Editors
    2. An Example
      1. Generating Examples
    3. Comparison with Databases
    4. Pig Latin
      1. Structure
      2. Statements
      3. Expressions
      4. Types
      5. Schemas
        1. Validation and nulls
        2. Schema merging
      6. Functions
    5. User-Defined Functions
      1. A Filter UDF
        1. Leveraging types
      2. An Eval UDF
      3. A Load UDF
        1. Using a schema
    6. Data Processing Operators
      1. Loading and Storing Data
      2. Filtering Data
        1. FOREACH...GENERATE
        2. STREAM
      3. Grouping and Joining Data
        1. JOIN
        2. COGROUP
        3. CROSS
        4. GROUP
      4. Sorting Data
      5. Combining and Splitting Data
    7. Pig in Practice
      1. Parallelism
      2. Parameter Substitution
        1. Dynamic parameters
        2. Parameter substitution processing
  17. 12. Hive
    1. Installing Hive
      1. The Hive Shell
    2. An Example
    3. Running Hive
      1. Configuring Hive
        1. Logging
      2. Hive Services
        1. Hive clients
      3. The Metastore
    4. Comparison with Traditional Databases
      1. Schema on Read Versus Schema on Write
      2. Updates, Transactions, and Indexes
    5. HiveQL
      1. Data Types
        1. Primitive types
        2. Conversions
        3. Complex types
      2. Operators and Functions
    6. Tables
      1. Managed Tables and External Tables
      2. Partitions and Buckets
        1. Partitions
        2. Buckets
      3. Storage Formats
        1. The default storage format: Delimited text
        2. Binary storage formats: Sequence files and RCFiles
        3. An example: RegexSerDe
      4. Importing Data
        1. INSERT OVERWRITE TABLE
        2. Multitable insert
        3. CREATE TABLE...AS SELECT
      5. Altering Tables
      6. Dropping Tables
    7. Querying Data
      1. Sorting and Aggregating
      2. MapReduce Scripts
      3. Joins
        1. Inner joins
        2. Outer joins
        3. Semi joins
        4. Map joins
      4. Subqueries
      5. Views
    8. User-Defined Functions
      1. Writing a UDF
      2. Writing a UDAF
        1. A more complex UDAF
  18. 13. HBase
    1. HBasics
      1. Backdrop
    2. Concepts
      1. Whirlwind Tour of the Data Model
        1. Regions
        2. Locking
      2. Implementation
        1. HBase in operation
    3. Installation
      1. Test Drive
    4. Clients
      1. Java
        1. MapReduce
      2. Avro, REST, and Thrift
        1. REST
        2. Thrift
        3. Avro
    5. Example
      1. Schemas
      2. Loading Data
        1. Optimization notes
      3. Web Queries
    6. HBase Versus RDBMS
      1. Successful Service
      2. HBase
      3. Use Case: HBase at Streamy.com
        1. Very large items tables
        2. Very large sort merges
        3. Life with HBase
    7. Praxis
      1. Versions
      2. HDFS
      3. UI
      4. Metrics
      5. Schema Design
        1. Joins
        2. Row keys
      6. Counters
      7. Bulk Load
  19. 14. ZooKeeper
    1. Installing and Running ZooKeeper
    2. An Example
      1. Group Membership in ZooKeeper
      2. Creating the Group
      3. Joining a Group
      4. Listing Members in a Group
        1. ZooKeeper command-line tools
      5. Deleting a Group
    3. The ZooKeeper Service
      1. Data Model
        1. Ephemeral znodes
        2. Sequence numbers
        3. Watches
      2. Operations
        1. APIs
        2. Watch triggers
        3. ACLs
      3. Implementation
      4. Consistency
      5. Sessions
        1. Time
      6. States
    4. Building Applications with ZooKeeper
      1. A Configuration Service
      2. The Resilient ZooKeeper Application
        1. InterruptedException
        2. KeeperException
          1. State exceptions
          2. Recoverable exceptions
          3. Unrecoverable exceptions
        3. A reliable configuration service
      3. A Lock Service
        1. The herd effect
        2. Recoverable exceptions
        3. Unrecoverable exceptions
        4. Implementation
      4. More Distributed Data Structures and Protocols
        1. BookKeeper
    5. ZooKeeper in Production
      1. Resilience and Performance
      2. Configuration
  20. 15. Sqoop
    1. Getting Sqoop
    2. A Sample Import
    3. Generated Code
      1. Additional Serialization Systems
    4. Database Imports: A Deeper Look
      1. Controlling the Import
      2. Imports and Consistency
      3. Direct-mode Imports
    5. Working with Imported Data
      1. Imported Data and Hive
    6. Importing Large Objects
    7. Performing an Export
    8. Exports: A Deeper Look
      1. Exports and Transactionality
      2. Exports and SequenceFiles
  21. 16. Case Studies
    1. Hadoop Usage at Last.fm
      1. Last.fm: The Social Music Revolution
      2. Hadoop at Last.fm
      3. Generating Charts with Hadoop
      4. The Track Statistics Program
        1. Calculating the number of unique listeners
          1. UniqueListenerMapper
          2. UniqueListenersReducer
        2. Summing the track totals
          1. SumMapper
          2. SumReducer
        3. Merging the results
          1. MergeListenersMapper
          2. IdentityMapper
          3. SumReducer
      5. Summary
    2. Hadoop and Hive at Facebook
      1. Introduction
      2. Hadoop at Facebook
        1. History
        2. Use cases
        3. Data architecture
        4. Hadoop configuration
      3. Hypothetical Use Case Studies
        1. Advertiser insights and performance
        2. Ad hoc analysis and product feedback
        3. Data analysis
      4. Hive
        1. Overview
        2. Data organization
        3. Query language
        4. Data pipelines using Hive
      5. Problems and Future Work
        1. Fair sharing
        2. Space management
        3. Scribe-HDFS integration
        4. Improvements to Hive
    3. Nutch Search Engine
      1. Background
      2. Data Structures
        1. CrawlDb
        2. LinkDb
        3. Segments
      3. Selected Examples of Hadoop Data Processing in Nutch
        1. Link inversion
        2. Generation of fetchlists
          1. Step 1: Select, sort by score, limit by URL count per host
          2. Step 2: Invert, partition by host, sort randomly
        3. Fetcher: A multithreaded MapRunner in action
        4. Indexer: Using custom OutputFormat
      4. Summary
    4. Log Processing at Rackspace
      1. Requirements/The Problem
        1. Logs
      2. Brief History
      3. Choosing Hadoop
      4. Collection and Storage
        1. Log collection
        2. Log storage
      5. MapReduce for Logs
        1. Processing
          1. Phase 1: Map
          2. Phase 1: Reduce
          3. Phase 2: Map
          4. Phase 2: Reduce
        2. Merging for near-term search
          1. Sharding
          2. Search results
        3. Archiving for analysis
    5. Cascading
      1. Fields, Tuples, and Pipes
      2. Operations
      3. Taps, Schemes, and Flows
      4. Cascading in Practice
      5. Flexibility
      6. Hadoop and Cascading at ShareThis
      7. Summary
    6. TeraByte Sort on Apache Hadoop
    7. Using Pig and Wukong to Explore Billion-edge Network Graphs
      1. Measuring Community
      2. Everybody’s Talkin’ at Me: The Twitter Reply Graph
        1. Edge pairs versus adjacency list
        2. Degree
      3. Symmetric Links
      4. Community Extraction
        1. Get neighbors
        2. Community metrics and the 1 million × 1 million problem
        3. Local properties at global scale
  22. A. Installing Apache Hadoop
    1. Prerequisites
    2. Installation
    3. Configuration
      1. Standalone Mode
      2. Pseudo-Distributed Mode
        1. Configuring SSH
        2. Formatting the HDFS filesystem
        3. Starting and stopping the daemons
      3. Fully Distributed Mode
  23. B. Cloudera’s Distribution for Hadoop
  24. C. Preparing the NCDC Weather Data
  25. Index
  26. About the Author
  27. Colophon
  28. Copyright

Product information

  • Title: Hadoop: The Definitive Guide, 2nd Edition
  • Author(s): Tom White
  • Release date: October 2010
  • Publisher(s): O'Reilly Media, Inc.
  • ISBN: 9781449389734