Hadoop Essentials

Book Description

Delve into the key concepts of Hadoop and get a thorough understanding of the Hadoop ecosystem

In Detail

This book introduces the components and tools of the Hadoop ecosystem in a simplified manner and gives you the skills to use them effectively for faster, more efficient development of Hadoop projects.

Starting with the concepts of Hadoop YARN, MapReduce, HDFS, and other Hadoop ecosystem components, you will move on to topics such as MapReduce patterns, data management, and real-time data analysis using Hadoop. You will also get acquainted with many Hadoop ecosystem tools such as Hive, HBase, Pig, Sqoop, Flume, Storm, and Spark.

By the end of the book, you will be confident enough to begin working with Hadoop straightaway and to apply the knowledge gained in your real-world scenarios.

What You Will Learn

  • Get introduced to Hadoop, big data, and the pillars of Hadoop such as HDFS, MapReduce, and YARN

  • Understand different use cases of Hadoop along with big data analytics and real-time analysis in Hadoop

  • Explore the Hadoop ecosystem tools and effectively use them for faster development and maintenance of a Hadoop project

  • Demonstrate YARN's capacity for database processing

  • Work with Hive, HBase, and Pig on Hadoop to solve your big data problems

  • Gain insights into widely used tools such as Sqoop, Flume, Storm, and Spark using practical examples

Downloading the example code for this book: You can download the example code files for all Packt books you have purchased from your account at http://www.PacktPub.com. If you purchased this book elsewhere, you can visit http://www.PacktPub.com/support and register to have the files e-mailed directly to you.

    Table of Contents

    1. Hadoop Essentials
      1. Table of Contents
      2. Hadoop Essentials
      3. Credits
      4. About the Author
      5. Acknowledgments
      6. About the Reviewers
      7. www.PacktPub.com
        1. Support files, eBooks, discount offers, and more
          1. Why subscribe?
          2. Free access for Packt account holders
      8. Preface
        1. What this book covers
        2. What you need for this book
        3. Who this book is for
        4. Conventions
        5. Reader feedback
        6. Customer support
          1. Downloading the example code
          2. Errata
          3. Piracy
          4. Questions
      9. 1. Introduction to Big Data and Hadoop
        1. V's of big data
          1. Volume
          2. Velocity
          3. Variety
        2. Understanding big data
          1. NoSQL
            1. Types of NoSQL databases
          2. Analytical database
        3. Who is creating big data?
          1. Big data use cases
        4. Big data use case patterns
          1. Big data as a storage pattern
          2. Big data as a data transformation pattern
          3. Big data for a data analysis pattern
          4. Big data for data in a real-time pattern
          5. Big data for a low latency caching pattern
        5. Hadoop
          1. Hadoop history
          2. Description
          3. Advantages of Hadoop
          4. Uses of Hadoop
          5. Hadoop ecosystem
          6. Apache Hadoop
          7. Hadoop distributions
        6. Pillars of Hadoop
        7. Data access components
        8. Data storage component
        9. Data ingestion in Hadoop
        10. Streaming and real-time analysis
        11. Summary
      10. 2. Hadoop Ecosystem
        1. Traditional systems
          1. Database trend
        2. The Hadoop use cases
        3. Hadoop's basic data flow
        4. Hadoop integration
        5. The Hadoop ecosystem
        6. Distributed filesystem
          1. HDFS
        7. Distributed programming
        8. NoSQL databases
          1. Apache HBase
        9. Data ingestion
        10. Service programming
          1. Apache YARN
          2. Apache Zookeeper
        11. Scheduling
        12. Data analytics and machine learning
        13. System management
          1. Apache Ambari
        14. Summary
      11. 3. Pillars of Hadoop – HDFS, MapReduce, and YARN
        1. HDFS
          1. Features of HDFS
          2. HDFS architecture
            1. NameNode
            2. DataNode
            3. Checkpoint NameNode or Secondary NameNode
            4. BackupNode
          3. Data storage in HDFS
            1. Read pipeline
            2. Write pipeline
          4. Rack awareness
            1. Advantages of rack awareness in HDFS
          5. HDFS federation
            1. Limitations of HDFS 1.0
            2. The benefit of HDFS federation
          6. HDFS ports
          7. HDFS commands
        2. MapReduce
          1. The MapReduce architecture
            1. JobTracker
            2. TaskTracker
          2. Serialization data types
            1. The Writable interface
            2. WritableComparable interface
          3. The MapReduce example
          4. The MapReduce process
            1. Mapper
            2. Shuffle and sorting
            3. Reducer
          5. Speculative execution
          6. FileFormats
            1. InputFormats
            2. RecordReader
            3. OutputFormats
            4. RecordWriter
          7. Writing a MapReduce program
            1. Mapper code
            2. Reducer code
            3. Driver code
          8. Auxiliary steps
            1. Combiner
            2. Partitioner
              1. Custom partitioner
        3. YARN
          1. YARN architecture
            1. ResourceManager
            2. NodeManager
            3. ApplicationMaster
          2. Applications powered by YARN
        4. Summary
      12. 4. Data Access Components – Hive and Pig
        1. Need of a data processing tool on Hadoop
        2. Pig
          1. Pig data types
          2. The Pig architecture
            1. The logical plan
            2. The physical plan
            3. The MapReduce plan
          3. Pig modes
          4. Grunt shell
            1. Input data
            2. Loading data
            3. Dump
            4. Store
              1. FOREACH generate
            5. Filter
            6. Group By
            7. Limit
            8. Aggregation
            9. Cogroup
            10. DESCRIBE
            11. EXPLAIN
            12. ILLUSTRATE
        3. Hive
          1. The Hive architecture
            1. Metastore
            2. The Query compiler
            3. The Execution engine
          2. Data types and schemas
          3. Installing Hive
          4. Starting Hive shell
          5. HiveQL
            1. DDL (Data Definition Language) operations
            2. DML (Data Manipulation Language) operations
            3. The SQL operation
              1. Joins
              2. Aggregations
            4. Built-in functions
            5. Custom UDF (User Defined Functions)
          6. Managing tables – external versus managed
          7. SerDe
          8. Partitioning
          9. Bucketing
        4. Summary
      13. 5. Storage Component – HBase
        1. An Overview of HBase
        2. Advantages of HBase
        3. The Architecture of HBase
          1. MasterServer
          2. RegionServer
            1. WAL
            2. BlockCache
              1. LRUBlockCache
              2. SlabCache
              3. BucketCache
            3. Regions
            4. MemStore
            5. Zookeeper
        4. The HBase data model
          1. Logical components of a data model
          2. ACID properties
          3. The CAP theorem
        5. The Schema design
        6. The Write pipeline
        7. The Read pipeline
        8. Compaction
          1. The Compaction policy
          2. Minor compaction
          3. Major compaction
        9. Splitting
          1. Pre-Splitting
          2. Auto Splitting
          3. Forced Splitting
        10. Commands
          1. help
          2. Create
          3. List
          4. Put
          5. Scan
          6. Get
          7. Disable
          8. Drop
        11. HBase Hive integration
        12. Performance tuning
          1. Compression
          2. Filters
          3. Counters
          4. HBase coprocessors
        13. Summary
      14. 6. Data Ingestion in Hadoop – Sqoop and Flume
        1. Data sources
        2. Challenges in data ingestion
        3. Sqoop
        4. Connectors and drivers
        5. Sqoop 1 architecture
          1. Limitation of Sqoop 1
        6. Sqoop 2 architecture
        7. Imports
        8. Exports
        9. Apache Flume
          1. Reliability
        10. Flume architecture
          1. Multitier topology
            1. Flume master
            2. Flume nodes
            3. Components in Agent
              1. Source
              2. Sink
            4. Channels
              1. Memory channel
              2. File Channel
              3. JDBC Channel
        11. Examples of configuring Flume
          1. The Single agent example
          2. Multiple flows in an agent
            1. Configuring a multiagent setup
        12. Summary
      15. 7. Streaming and Real-time Analysis – Storm and Spark
        1. An introduction to Storm
          1. Features of Storm
          2. Physical architecture of Storm
          3. Data architecture of Storm
        2. Storm topology
        3. Storm on YARN
        4. Topology configuration example
          1. Spouts
          2. Bolts
          3. Topology
        5. An introduction to Spark
          1. Features of Spark
        6. Spark framework
          1. Spark SQL
          2. GraphX
          3. MLlib
          4. Spark streaming
        7. Spark architecture
          1. Directed Acyclic Graph engine
          2. Resilient Distributed Dataset
          3. Physical architecture
        8. Operations in Spark
          1. Transformations
          2. Actions
        9. Spark example
        10. Summary
      16. Index