Cloudera Administration Handbook

Book Description

A complete, hands-on guide to building and maintaining large Apache Hadoop clusters using Cloudera Manager and CDH5

In Detail

Apache Hadoop is an open source distributed computing technology that helps users process large volumes of data with relative ease and draw insights from that data. Cloudera, with its open source distribution of Hadoop, has made big data analytics accessible to anyone interested.

This book prepares you to be a Hadoop administrator, with special emphasis on Cloudera's CDH. It provides step-by-step instructions for setting up and managing a robust Hadoop cluster running CDH5. It also covers tools such as Cloudera Manager, which many companies use to manage Hadoop clusters with hundreds of nodes. You will learn how to set up security using Kerberos and how to use Cloudera Manager to set up alerts and events that help you monitor and troubleshoot cluster issues.

What You Will Learn

  • Understand the Apache Hadoop architecture and the future of distributed processing frameworks
  • Use HDFS and MapReduce for all file-related operations
  • Install and configure CDH to bring up an Apache Hadoop cluster
  • Configure HDFS High Availability and HDFS Federation to prevent single points of failure
  • Install and configure Cloudera Manager to perform administrator operations
  • Implement security by installing and configuring Kerberos for all services in the cluster
  • Add, remove, and rebalance nodes in a cluster using cluster management tools
  • Understand and configure the different backup options to back up your HDFS

    Table of Contents

    1. Cloudera Administration Handbook
      1. Table of Contents
      2. Cloudera Administration Handbook
      3. Credits
      4. Notice
      5. About the Author
      6. About the Reviewers
      7. www.PacktPub.com
        1. Support files, eBooks, discount offers, and more
          1. Why subscribe?
          2. Free access for Packt account holders
      8. Preface
        1. What this book covers
        2. What you need for this book
        3. Who this book is for
        4. Conventions
        5. Reader feedback
        6. Customer support
          1. Downloading the example code
          2. Errata
          3. Piracy
          4. Questions
      9. 1. Getting Started with Apache Hadoop
        1. History of Apache Hadoop and its trends
        2. Components of Apache Hadoop
        3. Understanding the Apache Hadoop daemons
          1. Namenode
          2. Secondary namenode
          3. Jobtracker
          4. Tasktracker
          5. ResourceManager
          6. NodeManager
          7. Job submission in YARN
        4. Introducing Cloudera
        5. Introducing CDH
        6. Responsibilities of a Hadoop administrator
        7. Summary
      10. 2. HDFS and MapReduce
        1. Essentials of HDFS
          1. Configuring HDFS
        2. The read/write operational flow in HDFS
          1. Writing files in HDFS
          2. Reading files in HDFS
        3. Understanding the namenode UI
        4. Understanding the secondary namenode UI
        5. Exploring HDFS commands
          1. Commonly used HDFS commands
          2. Commands to administer HDFS
        6. Getting acquainted with MapReduce
          1. Understanding the map phase
          2. Understanding the reduce phase
          3. Learning all about the MapReduce job flow
            1. Configuring MapReduce
          4. Understanding the jobtracker UI
          5. Getting MapReduce job information
        7. Summary
      11. 3. Cloudera's Distribution Including Apache Hadoop
        1. Getting started with CDH
        2. Understanding the CDH components
          1. Apache Hadoop
          2. Apache Flume NG
          3. Apache Sqoop
          4. Apache Pig
          5. Apache Hive
          6. Apache ZooKeeper
          7. Apache HBase
          8. Apache Whirr
          9. Snappy – previously known as Zippy
          10. Apache Mahout
          11. Apache Avro
          12. Apache Oozie
          13. Cloudera Search
          14. Cloudera Impala
          15. Cloudera Hue
            1. Beeswax – Hive UI
            2. Cloudera Impala UI
            3. Pig UI
            4. File Browser
            5. Metastore Manager
            6. Sqoop Jobs
            7. Job Browser
            8. Job Designs
            9. Dashboard
            10. Collection Manager
            11. Hue Shell
            12. HBase Browser
        3. Installing CDH
          1. Stopping Hadoop services
          2. Understanding a YARN cluster
        4. Installing the CDH components
          1. Installing Apache Flume
          2. Installing Apache Sqoop
          3. Installing Apache Sqoop 2
          4. Installing Apache Pig
          5. Installing Apache Hive
          6. Installing Apache Oozie
          7. Installing Apache ZooKeeper
        5. Summary
      12. 4. Exploring HDFS Federation and Its High Availability
        1. Implementing HDFS Federation
          1. Configuring HDFS Federation
            1. Configuring ViewFS for a federated HDFS
        2. Implementing HDFS High Availability
          1. The Quorum-based storage
            1. Configuring HDFS high availability by the Quorum-based storage
          2. Shared storage using NFS
            1. Configuring HDFS high availability by shared storage using NFS
              1. NameNode Journal Status for Quorum-based storage approach
              2. NameNode Journal Status for the Shared Storage-based approach
          3. Configuring automatic failover for HDFS high availability
        3. Jobtracker high availability
          1. Configuring jobtracker high availability
          2. Configuring automatic failover for jobtracker high availability
        4. Summary
      13. 5. Using Cloudera Manager
        1. Introducing Cloudera Manager
        2. Understanding the Cloudera Manager architecture
        3. Installing Cloudera Manager
        4. Navigating the Cloudera Manager Web console
          1. Navigating the Home screen
          2. Navigating the Clusters menu
          3. Exploring the Hosts menu
          4. Understanding the Diagnostics menu
          5. Understanding the Audits screen
          6. Understanding the Charts menu
          7. Understanding the Backup menu
          8. Understanding the Administration menu
        5. Configuring High Availability using Cloudera Manager
        6. Summary
      14. 6. Implementing Security Using Kerberos
        1. Understanding authentication and authorization
        2. Introducing Kerberos
        3. Understanding the Kerberos Architecture
          1. Authenticating a user
          2. Accessing a secure file server
          3. Understanding important Kerberos terms
        4. Installing Kerberos
          1. Configuring the KDC Server
          2. Testing the KDC installation
          3. Configuring the Kerberos clients
        5. Configuring Kerberos for Apache Hadoop
          1. Configuring Kerberos principal for Cloudera Manager Server
          2. Configuring the Cloudera Manager Server for Kerberos
        6. Authorization in Apache Hadoop
          1. Configuring access control lists in Hadoop
        7. Summary
      15. 7. Managing an Apache Hadoop Cluster
        1. Configuring Hadoop services using Cloudera Manager
          1. Adding a service to the cluster
          2. Removing a service from the cluster
        2. Role management in Cloudera Manager
          1. Adding a role instance to a host
            1. Adding a DataNode role to a host
            2. Adding a TaskTracker role to a host
        3. Managing hosts using Cloudera Manager
          1. Adding a new host
          2. Removing an existing host
        4. Managing multiple clusters with Cloudera Manager
        5. Rebalancing a Hadoop cluster from Cloudera Manager
          1. Adding the Balancer service to the cluster
          2. Rebalancing the cluster
        6. Summary
      16. 8. Cluster Monitoring Using Events and Alerts
        1. Monitoring Hadoop services from Cloudera Manager
        2. Understanding events and alerts
          1. Configuring events and alerts
          2. Configuring the alert delivery by an e-mail
        3. Summary
      17. 9. Configuring Backups
        1. Understanding backups
          1. Types of backups
          2. Types of storage media for backups
          3. Using cloud services for backups
        2. Understanding HDFS backups
        3. Using the distributed copy (DistCp)
        4. Configuring backups using Cloudera Manager
          1. Configuring HDFS replication
          2. Configuring Hive replication
          3. Configuring snapshots
            1. Enabling snapshot paths in HDFS
            2. Configuring a snapshot policy
        5. Summary
      18. Index