Learning Hadoop 2

Book Description

Design and implement data processing, lifecycle management, and analytic workflows with the cutting-edge toolbox of Hadoop 2

In Detail

This book introduces you to the world of building data-processing applications with the wide variety of tools supported by Hadoop 2. Starting with the core components of the framework—HDFS and YARN—this book will guide you through building applications using a variety of approaches.

You will learn how YARN completely changes the relationship between MapReduce and Hadoop and allows the latter to support more varied processing approaches and a broader array of applications. These include real-time processing with Apache Samza and iterative computation with Apache Spark. Next up, we discuss Apache Pig and the dataflow data model it provides. You will discover how to use Pig to analyze a Twitter dataset.

With this book, you will be able to make your life easier by using tools such as Apache Hive, Apache Oozie, Hadoop Streaming, Apache Crunch, and Kite SDK. The last part of this book discusses the likely future direction of major Hadoop components and how to get involved with the Hadoop community.

What You Will Learn

  • Write distributed applications using the MapReduce framework

  • Go beyond MapReduce and process data in real time with Samza and iteratively with Spark

  • Familiarize yourself with data mining approaches that work with very large datasets

  • Prototype applications on a VM and deploy them to a local cluster or to a cloud infrastructure (Amazon Web Services)

  • Conduct batch and real-time data analysis using SQL-like tools

  • Build data processing flows using Apache Pig and see how it enables the easy incorporation of custom functionality

  • Define and orchestrate complex workflows and pipelines with Apache Oozie

  • Manage your data lifecycle and changes over time

Downloading the example code for this book: You can download the example code files for all Packt books you have purchased from your account at http://www.PacktPub.com. If you purchased this book elsewhere, you can visit http://www.PacktPub.com/support and register to have the files e-mailed directly to you.

    Table of Contents

    1. Learning Hadoop 2
      1. Table of Contents
      2. Learning Hadoop 2
      3. Credits
      4. About the Authors
      5. About the Reviewers
      6. www.PacktPub.com
        1. Support files, eBooks, discount offers, and more
          1. Why subscribe?
          2. Free access for Packt account holders
      7. Preface
        1. What this book covers
        2. What you need for this book
        3. Who this book is for
        4. Conventions
        5. Reader feedback
        6. Customer support
          1. Downloading the example code
          2. Errata
          3. Piracy
          4. Questions
      8. 1. Introduction
        1. A note on versioning
        2. The background of Hadoop
        3. Components of Hadoop
          1. Common building blocks
          2. Storage
          3. Computation
          4. Better together
        4. Hadoop 2 – what's the big deal?
          1. Storage in Hadoop 2
          2. Computation in Hadoop 2
        5. Distributions of Apache Hadoop
        6. A dual approach
        7. AWS – infrastructure on demand from Amazon
          1. Simple Storage Service (S3)
          2. Elastic MapReduce (EMR)
        8. Getting started
          1. Cloudera QuickStart VM
          2. Amazon EMR
            1. Creating an AWS account
            2. Signing up for the necessary services
          3. Using Elastic MapReduce
          4. Getting Hadoop up and running
            1. How to use EMR
            2. AWS credentials
          5. The AWS command-line interface
        9. Running the examples
        10. Data processing with Hadoop
          1. Why Twitter?
          2. Building our first dataset
            1. One service, multiple APIs
            2. Anatomy of a Tweet
            3. Twitter credentials
          3. Programmatic access with Python
        11. Summary
      9. 2. Storage
        1. The inner workings of HDFS
          1. Cluster startup
            1. NameNode startup
            2. DataNode startup
          2. Block replication
        2. Command-line access to the HDFS filesystem
          1. Exploring the HDFS filesystem
        3. Protecting the filesystem metadata
          1. Secondary NameNode not to the rescue
          2. Hadoop 2 NameNode HA
            1. Keeping the HA NameNodes in sync
          3. Client configuration
          4. How a failover works
        4. Apache ZooKeeper – a different type of filesystem
          1. Implementing a distributed lock with sequential ZNodes
          2. Implementing group membership and leader election using ephemeral ZNodes
          3. Java API
          4. Building blocks
          5. Further reading
        5. Automatic NameNode failover
        6. HDFS snapshots
        7. Hadoop filesystems
          1. Hadoop interfaces
            1. Java FileSystem API
            2. Libhdfs
            3. Thrift
        8. Managing and serializing data
          1. The Writable interface
          2. Introducing the wrapper classes
          3. Array wrapper classes
          4. The Comparable and WritableComparable interfaces
        9. Storing data
          1. Serialization and Containers
          2. Compression
          3. General-purpose file formats
          4. Column-oriented data formats
            1. RCFile
            2. ORC
            3. Parquet
            4. Avro
            5. Using the Java API
        10. Summary
      10. 3. Processing – MapReduce and Beyond
        1. MapReduce
        2. Java API to MapReduce
          1. The Mapper class
          2. The Reducer class
          3. The Driver class
          4. Combiner
          5. Partitioning
            1. The optional partition function
          6. Hadoop-provided mapper and reducer implementations
          7. Sharing reference data
        3. Writing MapReduce programs
          1. Getting started
          2. Running the examples
            1. Local cluster
            2. Elastic MapReduce
          3. WordCount, the Hello World of MapReduce
          4. Word co-occurrences
          5. Trending topics
            1. The Top N pattern
          6. Sentiment of hashtags
          7. Text cleanup using chain mapper
        4. Walking through a run of a MapReduce job
          1. Startup
          2. Splitting the input
          3. Task assignment
          4. Task startup
          5. Ongoing JobTracker monitoring
          6. Mapper input
          7. Mapper execution
          8. Mapper output and reducer input
          9. Reducer input
          10. Reducer execution
          11. Reducer output
          12. Shutdown
          13. Input/Output
          14. InputFormat and RecordReader
          15. Hadoop-provided InputFormat
          16. Hadoop-provided RecordReader
          17. OutputFormat and RecordWriter
          18. Hadoop-provided OutputFormat
          19. Sequence files
        5. YARN
          1. YARN architecture
            1. The components of YARN
            2. Anatomy of a YARN application
          2. Life cycle of a YARN application
            1. Fault tolerance and monitoring
          3. Thinking in layers
          4. Execution models
        6. YARN in the real world – Computation beyond MapReduce
          1. The problem with MapReduce
          2. Tez
            1. Hive-on-Tez
          3. Apache Spark
          4. Apache Samza
            1. YARN-independent frameworks
          5. YARN today and beyond
        7. Summary
      11. 4. Real-time Computation with Samza
        1. Stream processing with Samza
          1. How Samza works
          2. Samza high-level architecture
          3. Samza's best friend – Apache Kafka
          4. YARN integration
          5. An independent model
          6. Hello Samza!
          7. Building a tweet parsing job
          8. The configuration file
          9. Getting Twitter data into Kafka
          10. Running a Samza job
          11. Samza and HDFS
          12. Windowing functions
          13. Multijob workflows
          14. Tweet sentiment analysis
            1. Bootstrap streams
          15. Stateful tasks
        2. Summary
      12. 5. Iterative Computation with Spark
        1. Apache Spark
          1. Cluster computing with working sets
            1. Resilient Distributed Datasets (RDDs)
            2. Actions
          2. Deployment
            1. Spark on YARN
            2. Spark on EC2
          3. Getting started with Spark
          4. Writing and running standalone applications
            1. Scala API
            2. Java API
            3. WordCount in Java
            4. Python API
        2. The Spark ecosystem
          1. Spark Streaming
          2. GraphX
          3. MLlib
          4. Spark SQL
        3. Processing data with Apache Spark
          1. Building and running the examples
            1. Running the examples on YARN
            2. Finding popular topics
            3. Assigning a sentiment to topics
          2. Data processing on streams
            1. State management
          3. Data analysis with Spark SQL
            1. SQL on data streams
        4. Comparing Samza and Spark Streaming
        5. Summary
      13. 6. Data Analysis with Apache Pig
        1. An overview of Pig
        2. Getting started
        3. Running Pig
          1. Grunt – the Pig interactive shell
            1. Elastic MapReduce
        4. Fundamentals of Apache Pig
        5. Programming Pig
          1. Pig data types
          2. Pig functions
            1. Load/store
            2. Eval
            3. The tuple, bag, and map functions
            4. The math, string, and datetime functions
            5. Dynamic invokers
            6. Macros
          3. Working with data
            1. Filtering
            2. Aggregation
            3. Foreach
            4. Join
        6. Extending Pig (UDFs)
          1. Contributed UDFs
            1. Piggybank
            2. Elephant Bird
            3. Apache DataFu
        7. Analyzing the Twitter stream
          1. Prerequisites
          2. Dataset exploration
          3. Tweet metadata
          4. Data preparation
          5. Top n statistics
          6. Datetime manipulation
            1. Sessions
          7. Capturing user interactions
          8. Link analysis
          9. Influential users
        8. Summary
      14. 7. Hadoop and SQL
        1. Why SQL on Hadoop
          1. Other SQL-on-Hadoop solutions
        2. Prerequisites
          1. Overview of Hive
          2. The nature of Hive tables
        3. Hive architecture
          1. Data types
          2. DDL statements
          3. File formats and storage
            1. JSON
            2. Avro
            3. Columnar stores
          4. Queries
          5. Structuring Hive tables for given workloads
          6. Partitioning a table
            1. Overwriting and updating data
            2. Bucketing and sorting
            3. Sampling data
          7. Writing scripts
        4. Hive and Amazon Web Services
          1. Hive and S3
          2. Hive on Elastic MapReduce
        5. Extending HiveQL
        6. Programmatic interfaces
          1. JDBC
          2. Thrift
        7. Stinger initiative
        8. Impala
          1. The architecture of Impala
          2. Co-existing with Hive
          3. A different philosophy
          4. Drill, Tajo, and beyond
        9. Summary
      15. 8. Data Lifecycle Management
        1. What data lifecycle management is
          1. Importance of data lifecycle management
          2. Tools to help
        2. Building a tweet analysis capability
          1. Getting the tweet data
          2. Introducing Oozie
            1. A note on HDFS file permissions
            2. Making development a little easier
            3. Extracting data and ingesting into Hive
            4. A note on workflow directory structure
            5. Introducing HCatalog
              1. Using HCatalog
            6. The Oozie sharelib
            7. HCatalog and partitioned tables
          3. Producing derived data
            1. Performing multiple actions in parallel
            2. Calling a subworkflow
            3. Adding global settings
        3. Challenges of external data
          1. Data validation
            1. Validation actions
          2. Handling format changes
          3. Handling schema evolution with Avro
            1. Final thoughts on using Avro schema evolution
              1. Only make additive changes
              2. Manage schema versions explicitly
              3. Think about schema distribution
        4. Collecting additional data
          1. Scheduling workflows
          2. Other Oozie triggers
        5. Pulling it all together
          1. Other tools to help
        6. Summary
      16. 9. Making Development Easier
        1. Choosing a framework
        2. Hadoop streaming
          1. Streaming word count in Python
          2. Differences in jobs when using streaming
          3. Finding important words in text
            1. Calculate term frequency
            2. Calculate document frequency
            3. Putting it all together – TF-IDF
        3. Kite Data
          1. Data Core
          2. Data HCatalog
          3. Data Hive
          4. Data MapReduce
          5. Data Spark
          6. Data Crunch
        4. Apache Crunch
          1. Getting started
          2. Concepts
          3. Data serialization
          4. Data processing patterns
            1. Aggregation and sorting
            2. Joining data
          5. Pipelines implementation and execution
            1. SparkPipeline
            2. MemPipeline
          6. Crunch examples
            1. Word co-occurrence
            2. TF-IDF
          7. Kite Morphlines
            1. Concepts
            2. Morphline commands
        5. Summary
      17. 10. Running a Hadoop Cluster
        1. I'm a developer – I don't care about operations!
          1. Hadoop and DevOps practices
        2. Cloudera Manager
          1. To pay or not to pay
          2. Cluster management using Cloudera Manager
            1. Cloudera Manager and other management tools
          3. Monitoring with Cloudera Manager
            1. Finding configuration files
          4. Cloudera Manager API
          5. Cloudera Manager lock-in
        3. Ambari – the open source alternative
        4. Operations in the Hadoop 2 world
        5. Sharing resources
        6. Building a physical cluster
          1. Physical layout
            1. Rack awareness
            2. Service layout
            3. Upgrading a service
        7. Building a cluster on EMR
          1. Considerations about filesystems
          2. Getting data into EMR
          3. EC2 instances and tuning
        8. Cluster tuning
          1. JVM considerations
            1. The small files problem
          2. Map and reduce optimizations
        9. Security
          1. Evolution of the Hadoop security model
          2. Beyond basic authorization
          3. The future of Hadoop security
          4. Consequences of using a secured cluster
        10. Monitoring
          1. Hadoop – where failures don't matter
          2. Monitoring integration
          3. Application-level metrics
        11. Troubleshooting
          1. Logging levels
          2. Access to logfiles
          3. ResourceManager, NodeManager, and Application Manager
            1. Applications
            2. Nodes
            3. Scheduler
            4. MapReduce
            5. MapReduce v1
            6. MapReduce v2 (YARN)
            7. JobHistory Server
          4. NameNode and DataNode
        12. Summary
      18. 11. Where to Go Next
        1. Alternative distributions
          1. Cloudera Distribution for Hadoop
          2. Hortonworks Data Platform
          3. MapR
          4. And the rest…
          5. Choosing a distribution
        2. Other computational frameworks
          1. Apache Storm
          2. Apache Giraph
          3. Apache HAMA
        3. Other interesting projects
          1. HBase
          2. Sqoop
          3. Whirr
          4. Mahout
          5. Hue
        4. Other programming abstractions
          1. Cascading
        5. AWS resources
          1. SimpleDB and DynamoDB
          2. Kinesis
          3. Data Pipeline
        6. Sources of information
          1. Source code
          2. Mailing lists and forums
          3. LinkedIn groups
          4. HUGs
          5. Conferences
        7. Summary
      19. Index