Hadoop Backup and Recovery Solutions

Book Description

Learn the best strategies for recovering data from Hadoop backup clusters and troubleshooting problems

In Detail

Hadoop offers distributed processing of large datasets across clusters and is designed to scale up from a single server to thousands of machines, with a very high degree of fault tolerance. It enables scalable, cost-effective, flexible, and fault-tolerant computing solutions that can protect very large datasets against hardware failure.
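
That fault tolerance rests on block replication: HDFS splits every file into blocks and stores each block on several DataNodes. The following sketch writes a file through the standard FileSystem API; it is a minimal illustration assuming a Hadoop 2.x client, with the NameNode URI and file path as placeholder values:

    // A minimal sketch, assuming a Hadoop 2.x client on the classpath.
    // The NameNode URI and file path are placeholder values.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsWriteExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://namenode:8020"); // placeholder URI

            FileSystem fs = FileSystem.get(conf);

            // Each block of this file is replicated across several DataNodes,
            // which is what lets the cluster survive individual disk or
            // node failures.
            FSDataOutputStream out = fs.create(new Path("/backup/demo.txt"));
            out.writeUTF("hello hdfs");
            out.close();
            fs.close();
        }
    }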

Starting off with the basics of Hadoop administration, this book moves on to the best strategies for backing up distributed storage databases.

You will gradually learn about the backup and recovery principles, discover the common failure points in Hadoop, and pick up the essentials of backing up Hive metadata. A deep dive into Apache HBase shows you the different ways of backing up data and compares them. Going forward, you'll learn how to define recovery strategies for various causes of failure, including failover, corruption, working drives, and metadata. Also covered are Hadoop metrics and MapReduce. Finally, you'll explore troubleshooting strategies and techniques for resolving failures.
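
As a taste of what is to come, the workhorse for copying data between clusters is DistCp, which runs the copy as a MapReduce job. The sketch below drives it from Java; it is a minimal illustration assuming a Hadoop 2.x client, and the cluster URIs and paths are placeholder values:

    // A minimal sketch, assuming a Hadoop 2.x client (hadoop-distcp on the
    // classpath). Cluster URIs and paths are placeholder values.
    import java.util.Arrays;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.tools.DistCp;
    import org.apache.hadoop.tools.DistCpOptions;

    public class DistCpBackupExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();

            // Copy a dataset from the production cluster to the backup cluster.
            Path source = new Path("hdfs://prod-nn:8020/data/events");
            Path target = new Path("hdfs://backup-nn:8020/data/events");
            DistCpOptions options = new DistCpOptions(Arrays.asList(source), target);

            // DistCp submits a MapReduce job and waits for it to finish.
            new DistCp(conf, options).execute();
        }
    }

The same copy can be run from the shell as hadoop distcp hdfs://prod-nn:8020/data/events hdfs://backup-nn:8020/data/events.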

What You Will Learn

  • Familiarize yourself with HDFS and daemons

  • Determine backup areas, disaster recovery principles, and backup needs

  • Understand the necessity for Hive metadata backup

  • Discover HBase to explore different backup styles, such as snapshot, replication, copy table, the HTable API, and manual backup (a minimal snapshot sketch follows this list)

  • Learn the key considerations of a recovery strategy and restore data in the event of accidental deletion

  • Tune the performance of a Hadoop cluster and recover from failure scenarios such as failover, corruption, working drives, and NameNode failure

  • Monitor node health, and explore various techniques for checks, including HDFS checks and MapReduce checks

  • Identify common hardware failure points and discover mitigation techniques
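
To make the snapshot style from the list above concrete, here is a minimal sketch of taking and restoring an HBase snapshot through the client Admin API. It assumes an HBase 1.0 or later client; the table and snapshot names are placeholder values:

    // A minimal sketch, assuming an HBase 1.0+ client on the classpath.
    // Table and snapshot names are placeholder values.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Admin;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;

    public class HBaseSnapshotExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            try (Connection connection = ConnectionFactory.createConnection(conf);
                 Admin admin = connection.getAdmin()) {
                TableName table = TableName.valueOf("usertable"); // placeholder

                // Taking a snapshot is a metadata operation; it does not
                // rewrite the table's data files.
                admin.snapshot("usertable-snap", table);

                // Restoring requires the table to be offline.
                admin.disableTable(table);
                admin.restoreSnapshot("usertable-snap");
                admin.enableTable(table);
            }
        }
    }

Chapter 4 compares this approach with replication, export, the copy table, the HTable API, and offline backup.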

Downloading the example code for this book: you can download the example code files for all Packt books you have purchased from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.

Table of Contents

    1. Hadoop Backup and Recovery Solutions
      1. Table of Contents
      2. Hadoop Backup and Recovery Solutions
      3. Credits
      4. About the Authors
      5. About the Reviewers
      6. www.PacktPub.com
        1. Support files, eBooks, discount offers, and more
          1. Why subscribe?
          2. Free access for Packt account holders
      7. Preface
        1. What this book covers
        2. What you need for this book
        3. Who this book is for
        4. Conventions
        5. Reader feedback
        6. Customer support
        7. Downloading the example code
        8. Errata
        9. Piracy
        10. Questions
      8. 1. Knowing Hadoop and Clustering Basics
        1. Understanding the need for Hadoop
          1. Apache Hive
          2. Apache Pig
          3. Apache HBase
          4. Apache HCatalog
        2. Understanding HDFS design
          1. Getting familiar with HDFS daemons
            1. Scenario 1 – writing data to the HDFS cluster
            2. Scenario 2 – reading data from the HDFS cluster
        3. Understanding the basics of Hadoop cluster
        4. Summary
      9. 2. Understanding Hadoop Backup and Recovery Needs
        1. Understanding the backup and recovery philosophies
          1. Replication of data using DistCp
            1. Updating and overwriting using DistCp
          2. The backup philosophy
            1. Changes since the last backup
            2. The rate of new data arrival
            3. The size of the cluster
            4. Priority of the datasets
            5. Selecting the datasets or parts of datasets
            6. The timelines of data backups
            7. Reducing the window of possible data loss
            8. Backup consistency
            9. Avoiding invalid backups
          3. The recovery philosophy
        2. Knowing the necessity of backing up Hadoop
        3. Determining backup areas – what should I back up?
          1. Datasets
            1. Block size – a large file divided into blocks
            2. Replication factor
            3. A list of all the blocks of a file
            4. A list of DataNodes for each block – sorted by distance
            5. The ACK package
            6. The checksums
            7. The number of under-replicated blocks
            8. The secondary NameNode
              1. Fixing the disk that has been corrupted or repairing it
              2. Recovering the edit log
              3. Recovering the state from the secondary NameNode
            9. Active and passive nodes in second generation Hadoop
            10. Hardware failure
              1. Data corruption on disk
              2. Disk/node failure
              3. Rack failure
            11. Software failure
          2. Applications
          3. Configurations
        4. Is taking backup enough?
          1. Understanding the disaster recovery principle
            1. Knowing a disaster
          2. The need for recovery
          3. Understanding recovery areas
        5. Summary
      10. 3. Determining Backup Strategies
        1. Knowing the areas to be protected
        2. Understanding the common failure types
          1. Hardware failure
            1. Host failure
            2. Using commodity hardware
            3. Hardware failures may lead to loss of data
          2. User application failure
            1. Software causing task failure
            2. Failure of slow-running tasks
              1. How Hadoop handles slow-running tasks
                1. Speculative execution
            3. Hadoop's handling of failing tasks
            4. Task failure due to data
              1. Data loss or corruption
              2. No live node contains block errors
            5. Bad data handling – through code
            6. Hadoop's skip mode
              1. Handling skip mode in Hadoop
        3. Learning a way to define the backup strategy
          1. Why do I need a strategy?
          2. What should be considered in a strategy?
            1. Filesystem check (fsck)
            2. Filesystem balancer
            3. Upgrading your Hadoop cluster
            4. Designing network layout and rack awareness
            5. Most important areas to consider while defining a backup strategy
        4. Understanding the need for backing up Hive metadata
          1. What is Hive?
          2. Hive replication
        5. Summary
      11. 4. Backing Up Hadoop
        1. Data backup in Hadoop
          1. Distributed copy
          2. Architectural approach to backup
        2. HBase
          1. HBase history
          2. HBase introduction
          3. Understanding the HBase data model
            1. Accessing HBase data
        3. Approaches to backing up HBase
          1. Snapshots
            1. Operations involved in snapshots
              1. Snapshot operation commands
          2. HBase replication
            1. Modes of replication
          3. Export
          4. The copy table
          5. HTable API
          6. Offline backup
          7. Comparing backup options
        4. Summary
      12. 5. Determining Recovery Strategy
        1. Knowing the key considerations of recovery strategy
        2. Disaster failure at data centers
          1. How HDFS handles failures at data centers
            1. Automatic failover configuration
            2. How automatic failover configuration works
            3. How to configure automatic failover
              1. The transitionToActive and transitionToStandBy commands
              2. Failover
              3. The getServiceState command
              4. The checkHealth command
          2. How HBase handles failures at data centers
        3. Restoring a point-in time copy for auditing
        4. Restoring a data copy due to user error or accidental deletion
        5. Defining recovery strategy
          1. Centralized configuration
          2. Monitoring
          3. Alerting
            1. Teeing versus copying
        6. Summary
      13. 6. Recovering Hadoop Data
        1. Failover to backup cluster
          1. Installation and configuration
            1. The user and group settings
            2. Java installation
            3. Password-less SSH configuration
            4. ZooKeeper installation
            5. Hadoop installation
          2. The test installation of Hadoop
          3. Hadoop configuration for an automatic failover
            1. Preparing for the HA state in ZooKeeper
            2. Formatting and starting NameNodes
            3. Starting the ZKFC services
            4. Starting DataNodes
            5. Verifying an automatic failover
        2. Importing a table or restoring a snapshot
        3. Pointing the HBase root folder to the backup location
        4. Locating and repairing corruptions
        5. Recovering a drive from the working state
        6. Lost files
        7. The recovery of NameNode
          1. What did we do just now?
        8. Summary
      14. 7. Monitoring
        1. Monitoring overview
        2. Metrics of Hadoop
          1. FileContext
          2. GangliaContext
          3. NullContextWithUpdateThread
          4. CompositeContext
          5. Java Management Extension
        3. Monitoring node health
          1. Hadoop host monitoring
          2. Hadoop process monitoring
          3. The HDFS checks
          4. The MapReduce checks
        4. Cluster monitoring
          1. Managing the HDFS cluster
        5. Logging
          1. Log output written via log4j
          2. Setting the log levels
          3. Getting stack traces
        6. Summary
      15. 8. Troubleshooting
        1. Understanding troubleshooting approaches
        2. Understanding common failure points
          1. Human errors
          2. Configuration issues
          3. Hardware failures
          4. Resource allocation issues
        3. Identifying the root cause
        4. Knowing issue resolution techniques
        5. Summary
      16. Index