Hadoop: The Definitive Guide, 4th Edition

Book description

Get ready to unlock the power of your data. With the fourth edition of this comprehensive guide, you’ll learn how to build and maintain reliable, scalable, distributed systems with Apache Hadoop. This book is ideal for programmers looking to analyze datasets of any size, and for administrators who want to set up and run Hadoop clusters.

Using Hadoop 2 exclusively, author Tom White presents new chapters on YARN and several Hadoop-related projects such as Parquet, Flume, Crunch, and Spark. You’ll learn about recent changes to Hadoop, and explore new case studies on Hadoop’s role in healthcare systems and genomics data processing.

  • Learn fundamental components such as MapReduce, HDFS, and YARN
  • Explore MapReduce in depth, including steps for developing applications with it
  • Set up and maintain a Hadoop cluster running HDFS and MapReduce on YARN
  • Learn two data formats: Avro for data serialization and Parquet for nested data
  • Use data ingestion tools such as Flume (for streaming data) and Sqoop (for bulk data transfer)
  • Understand how high-level data processing tools like Pig, Hive, Crunch, and Spark work with Hadoop
  • Learn the HBase distributed database and the ZooKeeper distributed configuration service

Table of contents

  1. Dedication
  2. Foreword
  3. Preface
    1. Administrative Notes
    2. What’s New in the Fourth Edition?
    3. What’s New in the Third Edition?
    4. What’s New in the Second Edition?
    5. Conventions Used in This Book
    6. Using Code Examples
    7. Safari® Books Online
    8. How to Contact Us
    9. Acknowledgments
  4. I. Hadoop Fundamentals
    1. 1. Meet Hadoop
      1. Data!
      2. Data Storage and Analysis
      3. Querying All Your Data
      4. Beyond Batch
      5. Comparison with Other Systems
        1. Relational Database Management Systems
        2. Grid Computing
        3. Volunteer Computing
      6. A Brief History of Apache Hadoop
      7. What’s in This Book?
    2. 2. MapReduce
      1. A Weather Dataset
        1. Data Format
      2. Analyzing the Data with Unix Tools
      3. Analyzing the Data with Hadoop
        1. Map and Reduce
        2. Java MapReduce
          1. A test run
      4. Scaling Out
        1. Data Flow
        2. Combiner Functions
          1. Specifying a combiner function
        3. Running a Distributed MapReduce Job
      5. Hadoop Streaming
        1. Ruby
        2. Python
    3. 3. The Hadoop Distributed Filesystem
      1. The Design of HDFS
      2. HDFS Concepts
        1. Blocks
        2. Namenodes and Datanodes
        3. Block Caching
        4. HDFS Federation
        5. HDFS High Availability
          1. Failover and fencing
      3. The Command-Line Interface
        1. Basic Filesystem Operations
      4. Hadoop Filesystems
        1. Interfaces
          1. HTTP
          2. C
          3. NFS
          4. FUSE
      5. The Java Interface
        1. Reading Data from a Hadoop URL
        2. Reading Data Using the FileSystem API
          1. FSDataInputStream
        3. Writing Data
          1. FSDataOutputStream
        4. Directories
        5. Querying the Filesystem
          1. File metadata: FileStatus
          2. Listing files
          3. File patterns
          4. PathFilter
        6. Deleting Data
      6. Data Flow
        1. Anatomy of a File Read
        2. Anatomy of a File Write
        3. Coherency Model
          1. Consequences for application design
      7. Parallel Copying with distcp
        1. Keeping an HDFS Cluster Balanced
    4. 4. YARN
      1. Anatomy of a YARN Application Run
        1. Resource Requests
        2. Application Lifespan
        3. Building YARN Applications
      2. YARN Compared to MapReduce 1
      3. Scheduling in YARN
        1. Scheduler Options
        2. Capacity Scheduler Configuration
          1. Queue placement
        3. Fair Scheduler Configuration
          1. Enabling the Fair Scheduler
          2. Queue configuration
          3. Queue placement
          4. Preemption
        4. Delay Scheduling
        5. Dominant Resource Fairness
      4. Further Reading
    5. 5. Hadoop I/O
      1. Data Integrity
        1. Data Integrity in HDFS
        2. LocalFileSystem
        3. ChecksumFileSystem
      2. Compression
        1. Codecs
          1. Compressing and decompressing streams with CompressionCodec
          2. Inferring CompressionCodecs using CompressionCodecFactory
          3. Native libraries
            1. CodecPool
        2. Compression and Input Splits
        3. Using Compression in MapReduce
          1. Compressing map output
      3. Serialization
        1. The Writable Interface
          1. WritableComparable and comparators
        2. Writable Classes
          1. Writable wrappers for Java primitives
          2. Text
            1. Indexing
            2. Unicode
            3. Iteration
            4. Mutability
            5. Resorting to String
          3. BytesWritable
          4. NullWritable
          5. ObjectWritable and GenericWritable
          6. Writable collections
        3. Implementing a Custom Writable
          1. Implementing a RawComparator for speed
          2. Custom comparators
        4. Serialization Frameworks
          1. Serialization IDL
      4. File-Based Data Structures
        1. SequenceFile
          1. Writing a SequenceFile
          2. Reading a SequenceFile
          3. Displaying a SequenceFile with the command-line interface
          4. Sorting and merging SequenceFiles
          5. The SequenceFile format
        2. MapFile
          1. MapFile variants
        3. Other File Formats and Column-Oriented Formats
  5. II. MapReduce
    1. 6. Developing a MapReduce Application
      1. The Configuration API
        1. Combining Resources
        2. Variable Expansion
      2. Setting Up the Development Environment
        1. Managing Configuration
        2. GenericOptionsParser, Tool, and ToolRunner
      3. Writing a Unit Test with MRUnit
        1. Mapper
        2. Reducer
      4. Running Locally on Test Data
        1. Running a Job in a Local Job Runner
        2. Testing the Driver
      5. Running on a Cluster
        1. Packaging a Job
          1. The client classpath
          2. The task classpath
          3. Packaging dependencies
          4. Task classpath precedence
        2. Launching a Job
        3. The MapReduce Web UI
          1. The resource manager page
          2. The MapReduce job page
        4. Retrieving the Results
        5. Debugging a Job
          1. The tasks and task attempts pages
          2. Handling malformed data
        6. Hadoop Logs
        7. Remote Debugging
      6. Tuning a Job
        1. Profiling Tasks
          1. The HPROF profiler
      7. MapReduce Workflows
        1. Decomposing a Problem into MapReduce Jobs
        2. JobControl
        3. Apache Oozie
          1. Defining an Oozie workflow
          2. Packaging and deploying an Oozie workflow application
          3. Running an Oozie workflow job
    2. 7. How MapReduce Works
      1. Anatomy of a MapReduce Job Run
        1. Job Submission
        2. Job Initialization
        3. Task Assignment
        4. Task Execution
          1. Streaming
        5. Progress and Status Updates
        6. Job Completion
      2. Failures
        1. Task Failure
        2. Application Master Failure
        3. Node Manager Failure
        4. Resource Manager Failure
      3. Shuffle and Sort
        1. The Map Side
        2. The Reduce Side
        3. Configuration Tuning
      4. Task Execution
        1. The Task Execution Environment
          1. Streaming environment variables
        2. Speculative Execution
        3. Output Committers
          1. Task side-effect files
    3. 8. MapReduce Types and Formats
      1. MapReduce Types
        1. The Default MapReduce Job
          1. The default Streaming job
          2. Keys and values in Streaming
      2. Input Formats
        1. Input Splits and Records
          1. FileInputFormat
          2. FileInputFormat input paths
          3. FileInputFormat input splits
          4. Small files and CombineFileInputFormat
          5. Preventing splitting
          6. File information in the mapper
          7. Processing a whole file as a record
        2. Text Input
          1. TextInputFormat
            1. Controlling the maximum line length
          2. KeyValueTextInputFormat
          3. NLineInputFormat
          4. XML
        3. Binary Input
          1. SequenceFileInputFormat
          2. SequenceFileAsTextInputFormat
          3. SequenceFileAsBinaryInputFormat
          4. FixedLengthInputFormat
        4. Multiple Inputs
        5. Database Input (and Output)
      3. Output Formats
        1. Text Output
        2. Binary Output
          1. SequenceFileOutputFormat
          2. SequenceFileAsBinaryOutputFormat
          3. MapFileOutputFormat
        3. Multiple Outputs
          1. An example: Partitioning data
          2. MultipleOutputs
        4. Lazy Output
        5. Database Output
    4. 9. MapReduce Features
      1. Counters
        1. Built-in Counters
          1. Task counters
          2. Job counters
        2. User-Defined Java Counters
          1. Dynamic counters
          2. Retrieving counters
        3. User-Defined Streaming Counters
      2. Sorting
        1. Preparation
        2. Partial Sort
        3. Total Sort
        4. Secondary Sort
          1. Java code
          2. Streaming
      3. Joins
        1. Map-Side Joins
        2. Reduce-Side Joins
      4. Side Data Distribution
        1. Using the Job Configuration
        2. Distributed Cache
          1. Usage
          2. How it works
          3. The distributed cache API
      5. MapReduce Library Classes
  6. III. Hadoop Operations
    1. 10. Setting Up a Hadoop Cluster
      1. Cluster Specification
        1. Cluster Sizing
          1. Master node scenarios
        2. Network Topology
          1. Rack awareness
      2. Cluster Setup and Installation
        1. Installing Java
        2. Creating Unix User Accounts
        3. Installing Hadoop
        4. Configuring SSH
        5. Configuring Hadoop
        6. Formatting the HDFS Filesystem
        7. Starting and Stopping the Daemons
        8. Creating User Directories
      3. Hadoop Configuration
        1. Configuration Management
        2. Environment Settings
          1. Java
          2. Memory heap size
          3. System logfiles
          4. SSH settings
        3. Important Hadoop Daemon Properties
          1. HDFS
          2. YARN
          3. Memory settings in YARN and MapReduce
          4. CPU settings in YARN and MapReduce
        4. Hadoop Daemon Addresses and Ports
        5. Other Hadoop Properties
          1. Cluster membership
          2. Buffer size
          3. HDFS block size
          4. Reserved storage space
          5. Trash
          6. Job scheduler
          7. Reduce slow start
          8. Short-circuit local reads
      4. Security
        1. Kerberos and Hadoop
          1. An example
        2. Delegation Tokens
        3. Other Security Enhancements
      5. Benchmarking a Hadoop Cluster
        1. Hadoop Benchmarks
          1. Benchmarking MapReduce with TeraSort
          2. Other benchmarks
        2. User Jobs
    2. 11. Administering Hadoop
      1. HDFS
        1. Persistent Data Structures
          1. Namenode directory structure
          2. The filesystem image and edit log
          3. Secondary namenode directory structure
          4. Datanode directory structure
        2. Safe Mode
          1. Entering and leaving safe mode
        3. Audit Logging
        4. Tools
          1. dfsadmin
          2. Filesystem check (fsck)
            1. Finding the blocks for a file
          3. Datanode block scanner
          4. Balancer
      2. Monitoring
        1. Logging
          1. Setting log levels
          2. Getting stack traces
        2. Metrics and JMX
      3. Maintenance
        1. Routine Administration Procedures
          1. Metadata backups
          2. Data backups
          3. Filesystem check (fsck)
          4. Filesystem balancer
        2. Commissioning and Decommissioning Nodes
          1. Commissioning new nodes
          2. Decommissioning old nodes
        3. Upgrades
          1. HDFS data and metadata upgrades
            1. Start the upgrade
            2. Wait until the upgrade is complete
            3. Check the upgrade
            4. Roll back the upgrade (optional)
            5. Finalize the upgrade (optional)
  7. IV. Related Projects
    1. 12. Avro
      1. Avro Data Types and Schemas
      2. In-Memory Serialization and Deserialization
        1. The Specific API
      3. Avro Datafiles
      4. Interoperability
        1. Python API
        2. Avro Tools
      5. Schema Resolution
      6. Sort Order
      7. Avro MapReduce
      8. Sorting Using Avro MapReduce
      9. Avro in Other Languages
    2. 13. Parquet
      1. Data Model
        1. Nested Encoding
      2. Parquet File Format
      3. Parquet Configuration
      4. Writing and Reading Parquet Files
        1. Avro, Protocol Buffers, and Thrift
          1. Projection and read schemas
      5. Parquet MapReduce
    3. 14. Flume
      1. Installing Flume
      2. An Example
      3. Transactions and Reliability
        1. Batching
      4. The HDFS Sink
        1. Partitioning and Interceptors
        2. File Formats
      5. Fan Out
        1. Delivery Guarantees
        2. Replicating and Multiplexing Selectors
      6. Distribution: Agent Tiers
        1. Delivery Guarantees
      7. Sink Groups
      8. Integrating Flume with Applications
      9. Component Catalog
      10. Further Reading
    4. 15. Sqoop
      1. Getting Sqoop
      2. Sqoop Connectors
      3. A Sample Import
        1. Text and Binary File Formats
      4. Generated Code
        1. Additional Serialization Systems
      5. Imports: A Deeper Look
        1. Controlling the Import
        2. Imports and Consistency
        3. Incremental Imports
        4. Direct-Mode Imports
      6. Working with Imported Data
        1. Imported Data and Hive
      7. Importing Large Objects
      8. Performing an Export
      9. Exports: A Deeper Look
        1. Exports and Transactionality
        2. Exports and SequenceFiles
      10. Further Reading
    5. 16. Pig
      1. Installing and Running Pig
        1. Execution Types
          1. Local mode
          2. MapReduce mode
        2. Running Pig Programs
        3. Grunt
        4. Pig Latin Editors
      2. An Example
        1. Generating Examples
      3. Comparison with Databases
      4. Pig Latin
        1. Structure
        2. Statements
        3. Expressions
        4. Types
        5. Schemas
          1. Using Hive tables with HCatalog
          2. Validation and nulls
          3. Schema merging
        6. Functions
          1. Other libraries
        7. Macros
      5. User-Defined Functions
        1. A Filter UDF
          1. Leveraging types
        2. An Eval UDF
          1. Dynamic invokers
        3. A Load UDF
          1. Using a schema
      6. Data Processing Operators
        1. Loading and Storing Data
        2. Filtering Data
          1. FOREACH...GENERATE
          2. STREAM
        3. Grouping and Joining Data
          1. JOIN
          2. COGROUP
          3. CROSS
          4. GROUP
        4. Sorting Data
        5. Combining and Splitting Data
      7. Pig in Practice
        1. Parallelism
        2. Anonymous Relations
        3. Parameter Substitution
          1. Dynamic parameters
          2. Parameter substitution processing
      8. Further Reading
    6. 17. Hive
      1. Installing Hive
        1. The Hive Shell
      2. An Example
      3. Running Hive
        1. Configuring Hive
          1. Execution engines
          2. Logging
        2. Hive Services
          1. Hive clients
        3. The Metastore
      4. Comparison with Traditional Databases
        1. Schema on Read Versus Schema on Write
        2. Updates, Transactions, and Indexes
        3. SQL-on-Hadoop Alternatives
      5. HiveQL
        1. Data Types
          1. Primitive types
          2. Complex types
        2. Operators and Functions
          1. Conversions
      6. Tables
        1. Managed Tables and External Tables
        2. Partitions and Buckets
          1. Partitions
          2. Buckets
        3. Storage Formats
          1. The default storage format: Delimited text
          2. Binary storage formats: Sequence files, Avro datafiles, Parquet files, RCFiles, and ORCFiles
          3. Using a custom SerDe: RegexSerDe
          4. Storage handlers
        4. Importing Data
          1. Inserts
          2. Multitable insert
          3. CREATE TABLE...AS SELECT
        5. Altering Tables
        6. Dropping Tables
      7. Querying Data
        1. Sorting and Aggregating
        2. MapReduce Scripts
        3. Joins
          1. Inner joins
          2. Outer joins
          3. Semi joins
          4. Map joins
        4. Subqueries
        5. Views
      8. User-Defined Functions
        1. Writing a UDF
        2. Writing a UDAF
          1. A more complex UDAF
      9. Further Reading
    7. 18. Crunch
      1. An Example
      2. The Core Crunch API
        1. Primitive Operations
          1. union()
          2. parallelDo()
          3. groupByKey()
          4. combineValues()
        2. Types
          1. Records and tuples
        3. Sources and Targets
          1. Reading from a source
          2. Writing to a target
          3. Existing outputs
          4. Combined sources and targets
        4. Functions
          1. Serialization of functions
          2. Object reuse
        5. Materialization
          1. PObject
      3. Pipeline Execution
        1. Running a Pipeline
          1. Asynchronous execution
          2. Debugging
        2. Stopping a Pipeline
        3. Inspecting a Crunch Plan
        4. Iterative Algorithms
        5. Checkpointing a Pipeline
      4. Crunch Libraries
      5. Further Reading
    8. 19. Spark
      1. Installing Spark
      2. An Example
        1. Spark Applications, Jobs, Stages, and Tasks
        2. A Scala Standalone Application
        3. A Java Example
        4. A Python Example
      3. Resilient Distributed Datasets
        1. Creation
        2. Transformations and Actions
          1. Aggregation transformations
        3. Persistence
          1. Persistence levels
        4. Serialization
          1. Data
          2. Functions
      4. Shared Variables
        1. Broadcast Variables
        2. Accumulators
      5. Anatomy of a Spark Job Run
        1. Job Submission
        2. DAG Construction
        3. Task Scheduling
        4. Task Execution
      6. Executors and Cluster Managers
        1. Spark on YARN
          1. YARN client mode
          2. YARN cluster mode
      7. Further Reading
    9. 20. HBase
      1. HBasics
        1. Backdrop
      2. Concepts
        1. Whirlwind Tour of the Data Model
          1. Regions
          2. Locking
        2. Implementation
          1. HBase in operation
      3. Installation
        1. Test Drive
      4. Clients
        1. Java
        2. MapReduce
        3. REST and Thrift
      5. Building an Online Query Application
        1. Schema Design
        2. Loading Data
          1. Load distribution
          2. Bulk load
        3. Online Queries
          1. Station queries
          2. Observation queries
      6. HBase Versus RDBMS
        1. Successful Service
        2. HBase
      7. Praxis
        1. HDFS
        2. UI
        3. Metrics
        4. Counters
      8. Further Reading
    10. 21. ZooKeeper
      1. Installing and Running ZooKeeper
      2. An Example
        1. Group Membership in ZooKeeper
        2. Creating the Group
        3. Joining a Group
        4. Listing Members in a Group
          1. ZooKeeper command-line tools
        5. Deleting a Group
      3. The ZooKeeper Service
        1. Data Model
          1. Ephemeral znodes
          2. Sequence numbers
          3. Watches
        2. Operations
          1. Multiupdate
          2. APIs
          3. Watch triggers
          4. ACLs
        3. Implementation
        4. Consistency
        5. Sessions
          1. Time
        6. States
      4. Building Applications with ZooKeeper
        1. A Configuration Service
        2. The Resilient ZooKeeper Application
          1. InterruptedException
          2. KeeperException
            1. State exceptions
            2. Recoverable exceptions
            3. Unrecoverable exceptions
          3. A reliable configuration service
        3. A Lock Service
          1. The herd effect
          2. Recoverable exceptions
          3. Unrecoverable exceptions
          4. Implementation
        4. More Distributed Data Structures and Protocols
          1. BookKeeper and Hedwig
      5. ZooKeeper in Production
        1. Resilience and Performance
        2. Configuration
      6. Further Reading
  8. V. Case Studies
    1. 22. Composable Data at Cerner
      1. From CPUs to Semantic Integration
      2. Enter Apache Crunch
      3. Building a Complete Picture
      4. Integrating Healthcare Data
      5. Composability over Frameworks
      6. Moving Forward
    2. 23. Biological Data Science: Saving Lives with Software
      1. The Structure of DNA
      2. The Genetic Code: Turning DNA Letters into Proteins
      3. Thinking of DNA as Source Code
      4. The Human Genome Project and Reference Genomes
      5. Sequencing and Aligning DNA
      6. ADAM, A Scalable Genome Analysis Platform
        1. Literate programming with the Avro interface description language (IDL)
        2. Column-oriented access with Parquet
        3. A simple example: k-mer counting using Spark and ADAM
      7. From Personalized Ads to Personalized Medicine
      8. Join In
    3. 24. Cascading
      1. Fields, Tuples, and Pipes
      2. Operations
      3. Taps, Schemes, and Flows
      4. Cascading in Practice
      5. Flexibility
      6. Hadoop and Cascading at ShareThis
      7. Summary
  9. A. Installing Apache Hadoop
    1. Prerequisites
    2. Installation
    3. Configuration
      1. Standalone Mode
      2. Pseudodistributed Mode
        1. Configuring SSH
        2. Formatting the HDFS filesystem
        3. Starting and stopping the daemons
        4. Creating a user directory
      3. Fully Distributed Mode
  10. B. Cloudera’s Distribution Including Apache Hadoop
  11. C. Preparing the NCDC Weather Data
  12. D. The Old and New Java MapReduce APIs
  13. Index
  14. Colophon
  15. Copyright

Product information

  • Title: Hadoop: The Definitive Guide, 4th Edition
  • Author(s): Tom White
  • Release date: April 2015
  • Publisher(s): O'Reilly Media, Inc.
  • ISBN: 9781491901632