Book description

Hadoop: The Definitive Guide helps you harness the power of your data. Ideal for processing large datasets, the Apache Hadoop framework is an open source implementation of the MapReduce algorithm on which Google built its empire. This comprehensive resource demonstrates how to use Hadoop to build reliable, scalable, distributed systems: programmers will find details for analyzing large datasets, and administrators will learn how to set up and run Hadoop clusters. Complete with case studies that illustrate how Hadoop solves specific problems, this book helps you:

  • Use the Hadoop Distributed File System (HDFS) for storing large datasets, and run distributed computations over those datasets using MapReduce

  • Become familiar with Hadoop's data and I/O building blocks for compression, data integrity, serialization, and persistence

  • Discover common pitfalls and advanced features for writing real-world MapReduce programs

  • Design, build, and administer a dedicated Hadoop cluster, or run Hadoop in the cloud

  • Use Pig, a high-level query language for large-scale data processing

  • Take advantage of HBase, Hadoop's database for structured and semi-structured data

  • Learn ZooKeeper, a toolkit of coordination primitives for building distributed systems

If you have lots of data -- whether it's gigabytes or petabytes -- Hadoop is the perfect solution. Hadoop: The Definitive Guide is the most thorough book available on the subject.

"Now you have the opportunity to learn about Hadoop from a master -- not only of the technology, but also of common sense and plain talk." -- Doug Cutting, Hadoop Founder, Yahoo!

Table of Contents

  1. Hadoop: The Definitive Guide
  2. Dedication
  3. A Note Regarding Supplemental Files
  4. Foreword
  5. Preface
    1. Administrative Notes
    2. What’s in This Book?
    3. Conventions Used in This Book
    4. Using Code Examples
    5. Safari® Books Online
    6. How to Contact Us
    7. Acknowledgments
  6. 1. Meet Hadoop
    1. Data!
    2. Data Storage and Analysis
    3. Comparison with Other Systems
      1. RDBMS
      2. Grid Computing
      3. Volunteer Computing
    4. A Brief History of Hadoop
    5. The Apache Hadoop Project
  7. 2. MapReduce
    1. A Weather Dataset
      1. Data Format
    2. Analyzing the Data with Unix Tools
    3. Analyzing the Data with Hadoop
      1. Map and Reduce
      2. Java MapReduce
        1. A test run
        2. The new Java MapReduce API
    4. Scaling Out
      1. Data Flow
      2. Combiner Functions
        1. Specifying a combiner function
      3. Running a Distributed MapReduce Job
    5. Hadoop Streaming
      1. Ruby
      2. Python
    6. Hadoop Pipes
      1. Compiling and Running
  8. 3. The Hadoop Distributed Filesystem
    1. The Design of HDFS
    2. HDFS Concepts
      1. Blocks
      2. Namenodes and Datanodes
    3. The Command-Line Interface
      1. Basic Filesystem Operations
    4. Hadoop Filesystems
      1. Interfaces
        1. Thrift
        2. C
        3. FUSE
        4. WebDAV
        5. Other HDFS Interfaces
    5. The Java Interface
      1. Reading Data from a Hadoop URL
      2. Reading Data Using the FileSystem API
        1. FSDataInputStream
      3. Writing Data
        1. FSDataOutputStream
      4. Directories
      5. Querying the Filesystem
        1. File metadata: FileStatus
        2. Listing files
        3. File patterns
        4. PathFilter
      6. Deleting Data
    6. Data Flow
      1. Anatomy of a File Read
      2. Anatomy of a File Write
      3. Coherency Model
        1. Consequences for application design
    7. Parallel Copying with distcp
      1. Keeping an HDFS Cluster Balanced
    8. Hadoop Archives
      1. Using Hadoop Archives
      2. Limitations
  9. 4. Hadoop I/O
    1. Data Integrity
      1. Data Integrity in HDFS
      2. LocalFileSystem
      3. ChecksumFileSystem
    2. Compression
      1. Codecs
        1. Compressing and decompressing streams with CompressionCodec
        2. Inferring CompressionCodecs using CompressionCodecFactory
        3. Native libraries
          1. CodecPool
      2. Compression and Input Splits
      3. Using Compression in MapReduce
        1. Compressing map output
    3. Serialization
      1. The Writable Interface
        1. WritableComparable and comparators
      2. Writable Classes
        1. Writable wrappers for Java primitives
        2. Text
          1. Indexing
          2. Unicode
          3. Iteration
          4. Mutability
          5. Resorting to String
        3. BytesWritable
        4. NullWritable
        5. ObjectWritable and GenericWritable
        6. Writable collections
      3. Implementing a Custom Writable
        1. Implementing a RawComparator for speed
        2. Custom comparators
      4. Serialization Frameworks
        1. Serialization IDL
    4. File-Based Data Structures
      1. SequenceFile
        1. Writing a SequenceFile
        2. Reading a SequenceFile
        3. Displaying a SequenceFile with the command-line interface
        4. Sorting and merging SequenceFiles
        5. The SequenceFile Format
      2. MapFile
        1. Writing a MapFile
        2. Reading a MapFile
        3. Converting a SequenceFile to a MapFile
  10. 5. Developing a MapReduce Application
    1. The Configuration API
      1. Combining Resources
      2. Variable Expansion
    2. Configuring the Development Environment
      1. Managing Configuration
      2. GenericOptionsParser, Tool, and ToolRunner
    3. Writing a Unit Test
      1. Mapper
      2. Reducer
    4. Running Locally on Test Data
      1. Running a Job in a Local Job Runner
        1. Fixing the mapper
      2. Testing the Driver
    5. Running on a Cluster
      1. Packaging
      2. Launching a Job
      3. The MapReduce Web UI
        1. The jobtracker page
        2. The job page
      4. Retrieving the Results
      5. Debugging a Job
        1. The tasks page
        2. The task details page
        3. Handling malformed data
      6. Using a Remote Debugger
    6. Tuning a Job
      1. Profiling Tasks
        1. The HPROF profiler
        2. Other profilers
    7. MapReduce Workflows
      1. Decomposing a Problem into MapReduce Jobs
      2. Running Dependent Jobs
  11. 6. How MapReduce Works
    1. Anatomy of a MapReduce Job Run
      1. Job Submission
      2. Job Initialization
      3. Task Assignment
      4. Task Execution
        1. Streaming and Pipes
      5. Progress and Status Updates
      6. Job Completion
    2. Failures
      1. Task Failure
      2. Tasktracker Failure
      3. Jobtracker Failure
    3. Job Scheduling
      1. The Fair Scheduler
    4. Shuffle and Sort
      1. The Map Side
      2. The Reduce Side
      3. Configuration Tuning
    5. Task Execution
      1. Speculative Execution
      2. Task JVM Reuse
      3. Skipping Bad Records
      4. The Task Execution Environment
        1. Streaming environment variables
        2. Task side-effect files
  12. 7. MapReduce Types and Formats
    1. MapReduce Types
      1. The Default MapReduce Job
        1. The default Streaming job
        2. Keys and values in Streaming
    2. Input Formats
      1. Input Splits and Records
        1. FileInputFormat
        2. FileInputFormat input paths
        3. FileInputFormat input splits
        4. Small files and CombineFileInputFormat
        5. Preventing splitting
        6. File information in the mapper
        7. Processing a whole file as a record
      2. Text Input
        1. TextInputFormat
        2. KeyValueTextInputFormat
        3. NLineInputFormat
        4. XML
      3. Binary Input
        1. SequenceFileInputFormat
        2. SequenceFileAsTextInputFormat
        3. SequenceFileAsBinaryInputFormat
      4. Multiple Inputs
      5. Database Input (and Output)
    3. Output Formats
      1. Text Output
      2. Binary Output
        1. SequenceFileOutputFormat
        2. SequenceFileAsBinaryOutputFormat
        3. MapFileOutputFormat
      3. Multiple Outputs
        1. An example: Partitioning data
        2. MultipleOutputFormat
        3. MultipleOutputs
      4. Lazy Output
      5. Database Output
  13. 8. MapReduce Features
    1. Counters
      1. Built-in Counters
      2. User-Defined Java Counters
        1. Dynamic counters
        2. Readable counter names
        3. Retrieving counters
      3. User-Defined Streaming Counters
    2. Sorting
      1. Preparation
      2. Partial Sort
        1. An application: Partitioned MapFile lookups
      3. Total Sort
      4. Secondary Sort
        1. Java code
        2. Streaming
    3. Joins
      1. Map-Side Joins
      2. Reduce-Side Joins
    4. Side Data Distribution
      1. Using the Job Configuration
      2. Distributed Cache
        1. Usage
        2. How it works
        3. The DistributedCache API
    5. MapReduce Library Classes
  14. 9. Setting Up a Hadoop Cluster
    1. Cluster Specification
      1. Network Topology
        1. Rack awareness
    2. Cluster Setup and Installation
      1. Installing Java
      2. Creating a Hadoop User
      3. Installing Hadoop
      4. Testing the Installation
    3. SSH Configuration
    4. Hadoop Configuration
      1. Configuration Management
        1. Control scripts
        2. Master node scenarios
      2. Environment Settings
        1. Memory
        2. Java
        3. System logfiles
        4. SSH settings
      3. Important Hadoop Daemon Properties
        1. HDFS
        2. MapReduce
      4. Hadoop Daemon Addresses and Ports
      5. Other Hadoop Properties
        1. Cluster membership
        2. Service-level authorization
        3. Buffer size
        4. HDFS block size
        5. Reserved storage space
        6. Trash
        7. Task memory limits
        8. Job scheduler
    5. Post Install
    6. Benchmarking a Hadoop Cluster
      1. Hadoop Benchmarks
        1. Benchmarking HDFS with TestDFSIO
        2. Benchmarking MapReduce with Sort
        3. Other benchmarks
      2. User Jobs
    7. Hadoop in the Cloud
      1. Hadoop on Amazon EC2
        1. Setup
        2. Launching a cluster
        3. Running a MapReduce job
        4. Terminating a cluster
  15. 10. Administering Hadoop
    1. HDFS
      1. Persistent Data Structures
        1. Namenode directory structure
        2. The filesystem image and edit log
        3. Secondary namenode directory structure
        4. Datanode directory structure
      2. Safe Mode
        1. Entering and leaving safe mode
      3. Audit Logging
      4. Tools
        1. dfsadmin
        2. Filesystem check (fsck)
          1. Finding the blocks for a file
        3. Datanode block scanner
        4. balancer
    2. Monitoring
      1. Logging
        1. Setting log levels
        2. Getting stack traces
      2. Metrics
        1. FileContext
        2. GangliaContext
        3. NullContextWithUpdateThread
        4. CompositeContext
      3. Java Management Extensions
    3. Maintenance
      1. Routine Administration Procedures
        1. Metadata backups
        2. Data backups
        3. Filesystem check (fsck)
        4. Filesystem balancer
      2. Commissioning and Decommissioning Nodes
        1. Commissioning new nodes
        2. Decommissioning old nodes
      3. Upgrades
        1. HDFS data and metadata upgrades
          1. Start the upgrade
          2. Wait until the upgrade is complete
          3. Check the upgrade
          4. Roll back the upgrade (optional)
          5. Finalize the upgrade (optional)
  16. 11. Pig
    1. Installing and Running Pig
      1. Execution Types
        1. Local mode
        2. Hadoop mode
      2. Running Pig Programs
      3. Grunt
      4. Pig Latin Editors
    2. An Example
      1. Generating Examples
    3. Comparison with Databases
    4. Pig Latin
      1. Structure
      2. Statements
      3. Expressions
      4. Types
      5. Schemas
        1. Validation and nulls
        2. Schema merging
      6. Functions
    5. User-Defined Functions
      1. A Filter UDF
        1. Leveraging types
      2. An Eval UDF
      3. A Load UDF
        1. Using a schema
        2. Advanced loading with Slicer
    6. Data Processing Operators
      1. Loading and Storing Data
      2. Filtering Data
        1. FOREACH .. GENERATE
        2. STREAM
      3. Grouping and Joining Data
        1. JOIN
        2. COGROUP
        3. CROSS
        4. GROUP
      4. Sorting Data
      5. Combining and Splitting Data
    7. Pig in Practice
      1. Parallelism
      2. Parameter Substitution
        1. Dynamic parameters
        2. Parameter substitution processing
  17. 12. HBase
    1. HBasics
      1. Backdrop
    2. Concepts
      1. Whirlwind Tour of the Data Model
        1. Regions
        2. Locking
      2. Implementation
        1. HBase in operation
    3. Installation
      1. Test Drive
    4. Clients
      1. Java
        1. MapReduce
      2. REST and Thrift
        1. REST
        2. Thrift
    5. Example
      1. Schemas
      2. Loading Data
        1. Optimization notes
      3. Web Queries
    6. HBase Versus RDBMS
      1. Successful Service
      2. HBase
      3. Use Case: HBase at streamy.com
        1. Very large items tables
        2. Very large sort merges
        3. Life with HBase
    7. Praxis
      1. Versions
      2. Love and Hate: HBase and HDFS
      3. UI
      4. Metrics
      5. Schema Design
        1. Joins
        2. Row keys
  18. 13. ZooKeeper
    1. Installing and Running ZooKeeper
    2. An Example
      1. Group Membership in ZooKeeper
      2. Creating the Group
      3. Joining a Group
      4. Listing Members in a Group
        1. ZooKeeper command-line tools
      5. Deleting a Group
    3. The ZooKeeper Service
      1. Data Model
        1. Ephemeral znodes
        2. Sequence numbers
        3. Watches
      2. Operations
        1. APIs
        2. Watch triggers
        3. ACLs
      3. Implementation
      4. Consistency
      5. Sessions
        1. Time
      6. States
    4. Building Applications with ZooKeeper
      1. A Configuration Service
      2. The Resilient ZooKeeper Application
        1. InterruptedException
        2. KeeperException
          1. State exceptions
          2. Recoverable exceptions
          3. Unrecoverable exceptions
        3. A reliable configuration service
      3. A Lock Service
        1. The herd effect
        2. Recoverable exceptions
        3. Unrecoverable exceptions
        4. Implementation
      4. More Distributed Data Structures and Protocols
        1. BookKeeper
    5. ZooKeeper in Production
      1. Resilience and Performance
      2. Configuration
  19. 14. Case Studies
    1. Hadoop Usage at Last.fm
      1. Last.fm: The Social Music Revolution
      2. Hadoop at Last.fm
      3. Generating Charts with Hadoop
      4. The Track Statistics Program
        1. Calculating the number of unique listeners
          1. UniqueListenerMapper
          2. UniqueListenersReducer
        2. Summing the track totals
          1. SumMapper
          2. SumReducer
        3. Merging the results
          1. MergeListenersMapper
          2. IdentityMapper
          3. SumReducer
      5. Summary
    2. Hadoop and Hive at Facebook
      1. Introduction
      2. Hadoop at Facebook
        1. History
        2. Use cases
        3. Data architecture
        4. Hadoop configuration
      3. Hypothetical Use Case Studies
        1. Advertiser insights and performance
        2. Ad hoc analysis and product feedback
        3. Data analysis
      4. Hive
        1. Overview
        2. Data organization
        3. Query language
        4. Data pipelines using Hive
      5. Problems and Future Work
        1. Fair sharing
        2. Space management
        3. Scribe-HDFS integration
        4. Improvements to Hive
    3. Nutch Search Engine
      1. Background
      2. Data Structures
        1. CrawlDb
        2. LinkDb
        3. Segments
      3. Selected Examples of Hadoop Data Processing in Nutch
        1. Link inversion
        2. Generation of fetchlists
          1. Step 1: Select, sort by score, limit by URL count per host
          2. Step 2: Invert, partition by host, sort randomly
        3. Fetcher: A multi-threaded MapRunner in action
        4. Indexer: Using custom OutputFormat
      4. Summary
    4. Log Processing at Rackspace
      1. Requirements/The Problem
        1. Logs
      2. Brief History
      3. Choosing Hadoop
      4. Collection and Storage
        1. Log collection
        2. Log storage
      5. MapReduce for Logs
        1. Processing
          1. Phase 1: Map
          2. Phase 1: Reduce
          3. Phase 2: Map
          4. Phase 2: Reduce
        2. Merging for near-term search
          1. Sharding
          2. Search results
        3. Archiving for analysis
    5. Cascading
      1. Fields, Tuples, and Pipes
      2. Operations
      3. Taps, Schemes, and Flows
      4. Cascading in Practice
      5. Flexibility
      6. Hadoop and Cascading at ShareThis
      7. Summary
    6. TeraByte Sort on Apache Hadoop
  20. A. Installing Apache Hadoop
    1. Prerequisites
    2. Installation
    3. Configuration
      1. Standalone Mode
      2. Pseudo-Distributed Mode
        1. Configuring SSH
        2. Formatting the HDFS filesystem
        3. Starting and stopping the daemons
      3. Fully Distributed Mode
  21. B. Cloudera’s Distribution for Hadoop
    1. Prerequisites
    2. Standalone Mode
    3. Pseudo-Distributed Mode
    4. Fully Distributed Mode
    5. Hadoop-Related Packages
  22. C. Preparing the NCDC Weather Data
  23. Index
  24. About the Author
  25. Colophon
  26. Copyright