Hadoop Beginner's Guide

Book Description

Learn how to crunch big data to extract meaning from the data avalanche

  • Learn tools and techniques that let you approach big data with relish and not fear

  • Build a complete infrastructure to handle your needs as your data grows

  • Hands-on examples in each chapter give the big picture while also giving direct experience

In Detail

Data is arriving faster than you can process it, and the overall volumes keep growing at a rate that keeps you awake at night. Hadoop can help you tame the data beast. Effective use of Hadoop, however, requires a mixture of programming, design, and system administration skills.

"Hadoop Beginner's Guide" removes the mystery from Hadoop, presenting Hadoop and related technologies with a focus on building working systems and getting the job done, using cloud services to do so when it makes sense. From basic concepts and initial setup through developing applications and keeping the system running as the data grows, the book gives the understanding needed to effectively use Hadoop to solve real-world problems.

Starting with the basics of installing and configuring Hadoop, the book explains how to develop applications, maintain the system, and use additional products to integrate with other systems.

While learning different ways to develop applications to run on Hadoop, the book also covers tools such as Hive, Sqoop, and Flume that show how Hadoop can be integrated with relational databases and log collection.

In addition to examples on Hadoop clusters on Ubuntu, uses of cloud services such as Amazon EC2 and Elastic MapReduce are covered.

Table of Contents

  1. Hadoop Beginner's Guide
    1. Table of Contents
    2. Hadoop Beginner's Guide
    3. Credits
    4. About the Author
    5. About the Reviewers
    6. www.PacktPub.com
      1. Support files, eBooks, discount offers and more
        1. Why Subscribe?
        2. Free Access for Packt account holders
    7. Preface
      1. What this book covers
      2. What you need for this book
      3. Who this book is for
      4. Conventions
      5. Time for action – heading
        1. What just happened?
        2. Pop quiz – heading
        3. Have a go hero – heading
      6. Reader feedback
      7. Customer support
        1. Downloading the example code
        2. Errata
        3. Piracy
        4. Questions
    8. 1. What It's All About
      1. Big data processing
        1. The value of data
        2. Historically for the few and not the many
          1. Classic data processing systems
            1. Scale-up
            2. Early approaches to scale-out
          2. Limiting factors
        3. A different approach
          1. All roads lead to scale-out
          2. Share nothing
          3. Expect failure
          4. Smart software, dumb hardware
          5. Move processing, not data
          6. Build applications, not infrastructure
        4. Hadoop
          1. Thanks, Google
          2. Thanks, Doug
          3. Thanks, Yahoo
          4. Parts of Hadoop
          5. Common building blocks
          6. HDFS
          7. MapReduce
          8. Better together
          9. Common architecture
          10. What it is and isn't good for
      2. Cloud computing with Amazon Web Services
        1. Too many clouds
        2. A third way
        3. Different types of costs
        4. AWS – infrastructure on demand from Amazon
          1. Elastic Compute Cloud (EC2)
          2. Simple Storage Service (S3)
          3. Elastic MapReduce (EMR)
        5. What this book covers
          1. A dual approach
      3. Summary
    9. 2. Getting Hadoop Up and Running
      1. Hadoop on a local Ubuntu host
        1. Other operating systems
      2. Time for action – checking the prerequisites
        1. What just happened?
        2. Setting up Hadoop
          1. A note on versions
      3. Time for action – downloading Hadoop
        1. What just happened?
      4. Time for action – setting up SSH
        1. What just happened?
        2. Configuring and running Hadoop
      5. Time for action – using Hadoop to calculate Pi
        1. What just happened?
        2. Three modes
      6. Time for action – configuring the pseudo-distributed mode
        1. What just happened?
        2. Configuring the base directory and formatting the filesystem
      7. Time for action – changing the base HDFS directory
        1. What just happened?
      8. Time for action – formatting the NameNode
        1. What just happened?
        2. Starting and using Hadoop
      9. Time for action – starting Hadoop
        1. What just happened?
      10. Time for action – using HDFS
        1. What just happened?
      11. Time for action – WordCount, the Hello World of MapReduce
        1. What just happened?
        2. Have a go hero – WordCount on a larger body of text
        3. Monitoring Hadoop from the browser
          1. The HDFS web UI
            1. The MapReduce web UI
      12. Using Elastic MapReduce
        1. Setting up an account in Amazon Web Services
          1. Creating an AWS account
          2. Signing up for the necessary services
      13. Time for action – WordCount on EMR using the management console
        1. What just happened?
        2. Have a go hero – other EMR sample applications
        3. Other ways of using EMR
          1. AWS credentials
          2. The EMR command-line tools
        4. The AWS ecosystem
      14. Comparison of local versus EMR Hadoop
      15. Summary
    10. 3. Understanding MapReduce
      1. Key/value pairs
        1. What it means
        2. Why key/value data?
          1. Some real-world examples
        3. MapReduce as a series of key/value transformations
        4. Pop quiz – key/value pairs
      2. The Hadoop Java API for MapReduce
        1. The 0.20 MapReduce Java API
          1. The Mapper class
          2. The Reducer class
          3. The Driver class
      3. Writing MapReduce programs
      4. Time for action – setting up the classpath
        1. What just happened?
      5. Time for action – implementing WordCount
        1. What just happened?
      6. Time for action – building a JAR file
        1. What just happened?
      7. Time for action – running WordCount on a local Hadoop cluster
        1. What just happened?
      8. Time for action – running WordCount on EMR
        1. What just happened?
        2. The pre-0.20 Java MapReduce API
        3. Hadoop-provided mapper and reducer implementations
      9. Time for action – WordCount the easy way
        1. What just happened?
      10. Walking through a run of WordCount
        1. Startup
        2. Splitting the input
        3. Task assignment
        4. Task startup
        5. Ongoing JobTracker monitoring
        6. Mapper input
        7. Mapper execution
        8. Mapper output and reduce input
        9. Partitioning
        10. The optional partition function
        11. Reducer input
        12. Reducer execution
        13. Reducer output
        14. Shutdown
        15. That's all there is to it!
        16. Apart from the combiner…maybe
          1. Why have a combiner?
      11. Time for action – WordCount with a combiner
        1. What just happened?
          1. When you can use the reducer as the combiner
      12. Time for action – fixing WordCount to work with a combiner
        1. What just happened?
        2. Reuse is your friend
        3. Pop quiz – MapReduce mechanics
      13. Hadoop-specific data types
        1. The Writable and WritableComparable interfaces
        2. Introducing the wrapper classes
          1. Primitive wrapper classes
          2. Array wrapper classes
          3. Map wrapper classes
      14. Time for action – using the Writable wrapper classes
        1. What just happened?
          1. Other wrapper classes
        2. Have a go hero – playing with Writables
          1. Making your own
      15. Input/output
        1. Files, splits, and records
        2. InputFormat and RecordReader
        3. Hadoop-provided InputFormat
        4. Hadoop-provided RecordReader
        5. OutputFormat and RecordWriter
        6. Hadoop-provided OutputFormat
        7. Don't forget Sequence files
      16. Summary
    11. 4. Developing MapReduce Programs
      1. Using languages other than Java with Hadoop
        1. How Hadoop Streaming works
        2. Why to use Hadoop Streaming
      2. Time for action – implementing WordCount using Streaming
        1. What just happened?
        2. Differences in jobs when using Streaming
      3. Analyzing a large dataset
        1. Getting the UFO sighting dataset
        2. Getting a feel for the dataset
      4. Time for action – summarizing the UFO data
        1. What just happened?
          1. Examining UFO shapes
      5. Time for action – summarizing the shape data
        1. What just happened?
      6. Time for action – correlating of sighting duration to UFO shape
        1. What just happened?
          1. Using Streaming scripts outside Hadoop
      7. Time for action – performing the shape/time analysis from the command line
        1. What just happened?
        2. Java shape and location analysis
      8. Time for action – using ChainMapper for field validation/analysis
        1. What just happened?
        2. Have a go hero
          1. Too many abbreviations
          2. Using the Distributed Cache
      9. Time for action – using the Distributed Cache to improve location output
        1. What just happened?
      10. Counters, status, and other output
      11. Time for action – creating counters, task states, and writing log output
        1. What just happened?
        2. Too much information!
      12. Summary
    12. 5. Advanced MapReduce Techniques
      1. Simple, advanced, and in-between
      2. Joins
        1. When this is a bad idea
        2. Map-side versus reduce-side joins
        3. Matching account and sales information
      3. Time for action – reduce-side join using MultipleInputs
        1. What just happened?
          1. DataJoinMapper and TaggedMapperOutput
        2. Implementing map-side joins
          1. Using the Distributed Cache
        3. Have a go hero – implementing map-side joins
          1. Pruning data to fit in the cache
          2. Using a data representation instead of raw data
          3. Using multiple mappers
        4. To join or not to join...
      4. Graph algorithms
        1. Graph 101
        2. Graphs and MapReduce – a match made somewhere
        3. Representing a graph
      5. Time for action – representing the graph
        1. What just happened?
        2. Overview of the algorithm
          1. The mapper
          2. The reducer
          3. Iterative application
      6. Time for action – creating the source code
        1. What just happened?
      7. Time for action – the first run
        1. What just happened?
      8. Time for action – the second run
        1. What just happened?
      9. Time for action – the third run
        1. What just happened?
      10. Time for action – the fourth and last run
        1. What just happened?
        2. Running multiple jobs
        3. Final thoughts on graphs
      11. Using language-independent data structures
        1. Candidate technologies
        2. Introducing Avro
      12. Time for action – getting and installing Avro
        1. What just happened?
        2. Avro and schemas
      13. Time for action – defining the schema
        1. What just happened?
      14. Time for action – creating the source Avro data with Ruby
        1. What just happened?
      15. Time for action – consuming the Avro data with Java
        1. What just happened?
        2. Using Avro within MapReduce
      16. Time for action – generating shape summaries in MapReduce
        1. What just happened?
      17. Time for action – examining the output data with Ruby
        1. What just happened?
      18. Time for action – examining the output data with Java
        1. What just happened?
        2. Have a go hero – graphs in Avro
        3. Going forward with Avro
      19. Summary
    13. 6. When Things Break
      1. Failure
        1. Embrace failure
        2. Or at least don't fear it
        3. Don't try this at home
        4. Types of failure
        5. Hadoop node failure
          1. The dfsadmin command
          2. Cluster setup, test files, and block sizes
          3. Fault tolerance and Elastic MapReduce
      2. Time for action – killing a DataNode process
        1. What just happened?
          1. NameNode and DataNode communication
        2. Have a go hero – NameNode log delving
      3. Time for action – the replication factor in action
        1. What just happened?
      4. Time for action – intentionally causing missing blocks
        1. What just happened?
          1. When data may be lost
          2. Block corruption
      5. Time for action – killing a TaskTracker process
        1. What just happened?
          1. Comparing the DataNode and TaskTracker failures
          2. Permanent failure
        2. Killing the cluster masters
      6. Time for action – killing the JobTracker
        1. What just happened?
          1. Starting a replacement JobTracker
        2. Have a go hero – moving the JobTracker to a new host
      7. Time for action – killing the NameNode process
        1. What just happened?
          1. Starting a replacement NameNode
          2. The role of the NameNode in more detail
          3. File systems, files, blocks, and nodes
          4. The single most important piece of data in the cluster – fsimage
          5. DataNode startup
          6. Safe mode
          7. SecondaryNameNode
          8. So what to do when the NameNode process has a critical failure?
          9. BackupNode/CheckpointNode and NameNode HA
          10. Hardware failure
          11. Host failure
          12. Host corruption
          13. The risk of correlated failures
        2. Task failure due to software
          1. Failure of slow running tasks
      8. Time for action – causing task failure
        1. What just happened?
        2. Have a go hero – HDFS programmatic access
          1. Hadoop's handling of slow-running tasks
          2. Speculative execution
          3. Hadoop's handling of failing tasks
        3. Have a go hero – causing tasks to fail
        4. Task failure due to data
          1. Handling dirty data through code
          2. Using Hadoop's skip mode
      9. Time for action – handling dirty data by using skip mode
        1. What just happened?
          1. To skip or not to skip...
      10. Summary
    14. 7. Keeping Things Running
      1. A note on EMR
      2. Hadoop configuration properties
        1. Default values
      3. Time for action – browsing default properties
        1. What just happened?
        2. Additional property elements
        3. Default storage location
        4. Where to set properties
      4. Setting up a cluster
        1. How many hosts?
          1. Calculating usable space on a node
          2. Location of the master nodes
          3. Sizing hardware
          4. Processor / memory / storage ratio
          5. EMR as a prototyping platform
        2. Special node requirements
        3. Storage types
          1. Commodity versus enterprise class storage
          2. Single disk versus RAID
          3. Finding the balance
          4. Network storage
        4. Hadoop networking configuration
          1. How blocks are placed
          2. Rack awareness
            1. The rack-awareness script
      5. Time for action – examining the default rack configuration
        1. What just happened?
      6. Time for action – adding a rack awareness script
        1. What just happened?
        2. What is commodity hardware anyway?
        3. Pop quiz – setting up a cluster
      7. Cluster access control
        1. The Hadoop security model
      8. Time for action – demonstrating the default security
        1. What just happened?
          1. User identity
            1. The super user
          2. More granular access control
        2. Working around the security model via physical access control
      9. Managing the NameNode
        1. Configuring multiple locations for the fsimage class
      10. Time for action – adding an additional fsimage location
        1. What just happened?
          1. Where to write the fsimage copies
        2. Swapping to another NameNode host
          1. Having things ready before disaster strikes
      11. Time for action – swapping to a new NameNode host
        1. What just happened?
          1. Don't celebrate quite yet!
          2. What about MapReduce?
        2. Have a go hero – swapping to a new NameNode host
      12. Managing HDFS
        1. Where to write data
        2. Using balancer
          1. When to rebalance
      13. MapReduce management
        1. Command line job management
        2. Have a go hero – command line job management
        3. Job priorities and scheduling
      14. Time for action – changing job priorities and killing a job
        1. What just happened?
        2. Alternative schedulers
          1. Capacity Scheduler
          2. Fair Scheduler
          3. Enabling alternative schedulers
          4. When to use alternative schedulers
      15. Scaling
        1. Adding capacity to a local Hadoop cluster
        2. Have a go hero – adding a node and running balancer
        3. Adding capacity to an EMR job flow
          1. Expanding a running job flow
      16. Summary
    15. 8. A Relational View on Data with Hive
      1. Overview of Hive
        1. Why use Hive?
        2. Thanks, Facebook!
      2. Setting up Hive
        1. Prerequisites
        2. Getting Hive
      3. Time for action – installing Hive
        1. What just happened?
      4. Using Hive
      5. Time for action – creating a table for the UFO data
        1. What just happened?
      6. Time for action – inserting the UFO data
        1. What just happened?
        2. Validating the data
      7. Time for action – validating the table
        1. What just happened?
      8. Time for action – redefining the table with the correct column separator
        1. What just happened?
        2. Hive tables – real or not?
      9. Time for action – creating a table from an existing file
        1. What just happened?
      10. Time for action – performing a join
        1. What just happened?
        2. Have a go hero – improve the join to use regular expressions
        3. Hive and SQL views
      11. Time for action – using views
        1. What just happened?
        2. Handling dirty data in Hive
        3. Have a go hero – do it!
      12. Time for action – exporting query output
        1. What just happened?
        2. Partitioning the table
      13. Time for action – making a partitioned UFO sighting table
        1. What just happened?
        2. Bucketing, clustering, and sorting... oh my!
        3. User-Defined Function
      14. Time for action – adding a new User Defined Function (UDF)
        1. What just happened?
        2. To preprocess or not to preprocess...
        3. Hive versus Pig
        4. What we didn't cover
      15. Hive on Amazon Web Services
      16. Time for action – running UFO analysis on EMR
        1. What just happened?
        2. Using interactive job flows for development
        3. Have a go hero – using an interactive EMR cluster
        4. Integration with other AWS products
      17. Summary
    16. 9. Working with Relational Databases
      1. Common data paths
        1. Hadoop as an archive store
        2. Hadoop as a preprocessing step
        3. Hadoop as a data input tool
        4. The serpent eats its own tail
      2. Setting up MySQL
      3. Time for action – installing and setting up MySQL
        1. What just happened?
        2. Did it have to be so hard?
      4. Time for action – configuring MySQL to allow remote connections
        1. What just happened?
        2. Don't do this in production!
      5. Time for action – setting up the employee database
        1. What just happened?
        2. Be careful with data file access rights
      6. Getting data into Hadoop
        1. Using MySQL tools and manual import
        2. Have a go hero – exporting the employee table into HDFS
        3. Accessing the database from the mapper
        4. A better way – introducing Sqoop
      7. Time for action – downloading and configuring Sqoop
        1. What just happened?
          1. Sqoop and Hadoop versions
          2. Sqoop and HDFS
      8. Time for action – exporting data from MySQL to HDFS
        1. What just happened?
          1. Mappers and primary key columns
          2. Other options
          3. Sqoop's architecture
        2. Importing data into Hive using Sqoop
      9. Time for action – exporting data from MySQL into Hive
        1. What just happened?
      10. Time for action – a more selective import
        1. What just happened?
          1. Datatype issues
      11. Time for action – using a type mapping
        1. What just happened?
      12. Time for action – importing data from a raw query
        1. What just happened?
        2. Have a go hero
          1. Sqoop and Hive partitions
          2. Field and line terminators
      13. Getting data out of Hadoop
        1. Writing data from within the reducer
        2. Writing SQL import files from the reducer
        3. A better way – Sqoop again
      14. Time for action – importing data from Hadoop into MySQL
        1. What just happened?
          1. Differences between Sqoop imports and exports
          2. Inserts versus updates
        2. Have a go hero
          1. Sqoop and Hive exports
      15. Time for action – importing Hive data into MySQL
        1. What just happened?
      16. Time for action – fixing the mapping and re-running the export
        1. What just happened?
          1. Other Sqoop features
            1. Incremental merge
            2. Avoiding partial exports
            3. Sqoop as a code generator
      17. AWS considerations
        1. Considering RDS
      18. Summary
    17. 10. Data Collection with Flume
      1. A note about AWS
      2. Data data everywhere...
        1. Types of data
        2. Getting network traffic into Hadoop
      3. Time for action – getting web server data into Hadoop
        1. What just happened?
        2. Have a go hero
        3. Getting files into Hadoop
        4. Hidden issues
          1. Keeping network data on the network
          2. Hadoop dependencies
          3. Reliability
          4. Re-creating the wheel
          5. A common framework approach
      4. Introducing Apache Flume
        1. A note on versioning
      5. Time for action – installing and configuring Flume
        1. What just happened?
        2. Using Flume to capture network data
      6. Time for action – capturing network traffic in a log file
        1. What just happened?
      7. Time for action – logging to the console
        1. What just happened?
        2. Writing network data to log files
      8. Time for action – capturing the output of a command to a flat file
        1. What just happened?
          1. Logs versus files
      9. Time for action – capturing a remote file in a local flat file
        1. What just happened?
        2. Sources, sinks, and channels
          1. Sources
          2. Sinks
          3. Channels
          4. Or roll your own
        3. Understanding the Flume configuration files
        4. Have a go hero
        5. It's all about events
      10. Time for action – writing network traffic onto HDFS
        1. What just happened?
      11. Time for action – adding timestamps
        1. What just happened?
        2. To Sqoop or to Flume...
      12. Time for action – multi-level Flume networks
        1. What just happened?
      13. Time for action – writing to multiple sinks
        1. What just happened?
        2. Selectors replicating and multiplexing
        3. Handling sink failure
        4. Have a go hero – handling sink failure
        5. Next, the world
        6. Have a go hero – next, the world
      14. The bigger picture
        1. Data lifecycle
        2. Staging data
        3. Scheduling
      15. Summary
    18. 11. Where to Go Next
      1. What we did and didn't cover in this book
      2. Upcoming Hadoop changes
      3. Alternative distributions
        1. Why alternative distributions?
          1. Bundling
          2. Free and commercial extensions
            1. Cloudera Distribution for Hadoop
            2. Hortonworks Data Platform
            3. MapR
            4. IBM InfoSphere Big Insights
          3. Choosing a distribution
      4. Other Apache projects
        1. HBase
        2. Oozie
        3. Whir
        4. Mahout
        5. MRUnit
      5. Other programming abstractions
        1. Pig
        2. Cascading
      6. AWS resources
        1. HBase on EMR
        2. SimpleDB
        3. DynamoDB
      7. Sources of information
        1. Source code
        2. Mailing lists and forums
        3. LinkedIn groups
        4. HUGs
        5. Conferences
      8. Summary
    19. A. Pop Quiz Answers
      1. Chapter 3, Understanding MapReduce
        1. Pop quiz – key/value pairs
        2. Pop quiz – walking through a run of WordCount
      2. Chapter 7, Keeping Things Running
        1. Pop quiz – setting up a cluster
    20. Index