Hadoop Real-World Solutions Cookbook - Second Edition

Book Description

Over 90 hands-on recipes to help you learn and master the intricacies of Apache Hadoop 2.X, YARN, Hive, Pig, Oozie, Flume, Sqoop, Apache Spark, and Mahout

About This Book

  • Implement outstanding Machine Learning use cases on your own analytics models and processes.

  • Solutions to common problems when working with the Hadoop ecosystem.

  • Step-by-step implementation of end-to-end big data use cases.

Who This Book Is For

Readers who have a basic knowledge of big data systems and want to advance their knowledge with hands-on recipes.

What You Will Learn

  • Install and maintain a Hadoop 2.X cluster and its ecosystem.

  • Write advanced Map Reduce programs and understand design patterns.

  • Perform advanced data analysis using Hive, Pig, and Map Reduce programs.

  • Import and export data from various sources using Sqoop and Flume.

  • Store data in various file formats such as Text, Sequence, Parquet, ORC, and RC files.

  • Apply machine learning principles with libraries such as Mahout.

  • Process batch and stream data using Apache Spark.

In Detail

Handling big data is now a requirement for most organizations, which produce huge amounts of data every day. With the arrival of Hadoop-like tools, it has become easier for everyone to solve big data problems efficiently and at minimal cost. Grasping machine learning techniques will help you build predictive models and use this data to make the right decisions for your organization.

Hadoop Real-World Solutions Cookbook gives readers insights into learning and mastering big data via recipes. The book not only clarifies most big data tools in the market but also provides best practices for using them. The recipes are based on the latest versions of Apache Hadoop 2.X, YARN, Hive, Pig, Sqoop, Flume, Apache Spark, Mahout, and many other ecosystem tools. This cookbook is packed with handy recipes you can apply to your own everyday issues. Each chapter provides in-depth recipes that can be referenced easily, along with detailed practices for the latest technologies such as YARN and Apache Spark. On completing this book, readers will be able to consider themselves big data experts.

This guide is an invaluable tutorial if you are planning to implement a big data warehouse for your business.

Style and approach

An easy-to-follow guide that walks you through the world of big data. Each tool in the Hadoop ecosystem is explained in detail, and the recipes are arranged so that readers can implement them sequentially. Plenty of reference links are provided for advanced reading.

Downloading the example code for this book: You can download the example code files for all Packt books you have purchased from your account at http://www.PacktPub.com. If you purchased this book elsewhere, you can visit http://www.PacktPub.com/support and register to receive the code files.

Table of Contents

    1. Hadoop Real-World Solutions Cookbook Second Edition
      1. Table of Contents
      2. Hadoop Real-World Solutions Cookbook Second Edition
      3. Credits
      4. About the Author
      5. Acknowledgements
      6. About the Reviewer
      7. www.PacktPub.com
        1. eBooks, discount offers, and more
          1. Why Subscribe?
      8. Preface
        1. What this book covers
        2. What you need for this book
        3. Who this book is for
        4. Conventions
        5. Reader feedback
        6. Customer support
          1. Downloading the example code
          2. Downloading the color images of this book
          3. Errata
          4. Piracy
          5. Questions
      9. 1. Getting Started with Hadoop 2.X
        1. Introduction
        2. Installing a single-node Hadoop Cluster
          1. Getting ready
          2. How to do it...
          3. How it works...
            1. Hadoop Distributed File System (HDFS)
            2. Yet Another Resource Negotiator (YARN)
          4. There's more
        3. Installing a multi-node Hadoop cluster
          1. Getting ready
          2. How to do it...
          3. How it works...
        4. Adding new nodes to existing Hadoop clusters
          1. Getting ready
          2. How to do it...
          3. How it works...
        5. Executing the balancer command for uniform data distribution
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. There's more...
        6. Entering and exiting from the safe mode in a Hadoop cluster
          1. How to do it...
          2. How it works...
        7. Decommissioning DataNodes
          1. Getting ready
          2. How to do it...
          3. How it works...
        8. Performing benchmarking on a Hadoop cluster
          1. Getting ready
          2. How to do it...
            1. TestDFSIO
            2. NNBench
            3. MRBench
          3. How it works...
      10. 2. Exploring HDFS
        1. Introduction
        2. Loading data from a local machine to HDFS
          1. Getting ready
          2. How to do it...
          3. How it works...
        3. Exporting HDFS data to a local machine
          1. Getting ready
          2. How to do it...
          3. How it works...
        4. Changing the replication factor of an existing file in HDFS
          1. Getting ready
          2. How to do it...
          3. How it works...
        5. Setting the HDFS block size for all the files in a cluster
          1. Getting ready
          2. How to do it...
          3. How it works...
        6. Setting the HDFS block size for a specific file in a cluster
          1. Getting ready
          2. How to do it...
          3. How it works...
        7. Enabling transparent encryption for HDFS
          1. Getting ready
          2. How to do it...
          3. How it works...
        8. Importing data from another Hadoop cluster
          1. Getting ready
          2. How to do it...
          3. How it works...
        9. Recycling deleted data from trash to HDFS
          1. Getting ready
          2. How to do it...
          3. How it works...
        10. Saving compressed data in HDFS
          1. Getting ready
          2. How to do it...
          3. How it works...
      11. 3. Mastering Map Reduce Programs
        1. Introduction
        2. Writing the Map Reduce program in Java to analyze web log data
          1. Getting ready
          2. How to do it...
          3. How it works...
        3. Executing the Map Reduce program in a Hadoop cluster
          1. Getting ready
          2. How to do it
          3. How it works...
        4. Adding support for a new writable data type in Hadoop
          1. Getting ready
          2. How to do it...
          3. How it works...
        5. Implementing a user-defined counter in a Map Reduce program
          1. Getting ready
          2. How to do it...
          3. How it works...
        6. Map Reduce program to find the top X
          1. Getting ready
          2. How to do it...
          3. How it works
        7. Map Reduce program to find distinct values
          1. Getting ready
          2. How to do it
          3. How it works...
        8. Map Reduce program to partition data using a custom partitioner
          1. Getting ready
          2. How to do it...
          3. How it works...
        9. Writing Map Reduce results to multiple output files
          1. Getting ready
          2. How to do it...
          3. How it works...
        10. Performing Reduce side Joins using Map Reduce
          1. Getting ready
          2. How to do it
          3. How it works...
        11. Unit testing the Map Reduce code using MRUnit
          1. Getting ready
          2. How to do it...
          3. How it works...
      12. 4. Data Analysis Using Hive, Pig, and Hbase
        1. Introduction
        2. Storing and processing Hive data in a sequential file format
          1. Getting ready
          2. How to do it...
          3. How it works...
        3. Storing and processing Hive data in the RC file format
          1. Getting ready
          2. How to do it...
          3. How it works...
        4. Storing and processing Hive data in the ORC file format
          1. Getting ready
          2. How to do it...
          3. How it works...
        5. Storing and processing Hive data in the Parquet file format
          1. Getting ready
          2. How to do it...
          3. How it works...
        6. Performing FILTER By queries in Pig
          1. Getting ready
          2. How to do it...
          3. How it works...
        7. Performing Group By queries in Pig
          1. Getting ready
          2. How to do it...
          3. How it works...
        8. Performing Order By queries in Pig
          1. Getting ready
          2. How to do it...
          3. How it works...
        9. Performing JOINS in Pig
          1. Getting ready
          2. How to do it...
          3. How it works
            1. Replicated Joins
            2. Skewed Joins
            3. Merge Joins
        10. Writing a user-defined function in Pig
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. There's more...
        11. Analyzing web log data using Pig
          1. Getting ready
          2. How to do it...
          3. How it works...
        12. Performing the Hbase operation in CLI
          1. Getting ready
          2. How to do it
          3. How it works...
        13. Performing Hbase operations in Java
          1. Getting ready
          2. How to do it
          3. How it works...
        14. Executing the MapReduce programming with an Hbase Table
          1. Getting ready
          2. How to do it
          3. How it works
      13. 5. Advanced Data Analysis Using Hive
        1. Introduction
        2. Processing JSON data in Hive using JSON SerDe
          1. Getting ready
          2. How to do it...
          3. How it works...
        3. Processing XML data in Hive using XML SerDe
          1. Getting ready
          2. How to do it...
          3. How it works
        4. Processing Hive data in the Avro format
          1. Getting ready
          2. How to do it...
          3. How it works...
        5. Writing a user-defined function in Hive
          1. Getting ready
          2. How to do it
          3. How it works...
        6. Performing table joins in Hive
          1. Getting ready
          2. How to do it...
            1. Left outer join
            2. Right outer join
            3. Full outer join
            4. Left semi join
          3. How it works...
        7. Executing map side joins in Hive
          1. Getting ready
          2. How to do it...
          3. How it works...
        8. Performing context Ngram in Hive
          1. Getting ready
          2. How to do it...
          3. How it works...
        9. Call Data Record Analytics using Hive
          1. Getting ready
          2. How to do it...
          3. How it works...
        10. Twitter sentiment analysis using Hive
          1. Getting ready
          2. How to do it...
          3. How it works
        11. Implementing Change Data Capture using Hive
          1. Getting ready
          2. How to do it
          3. How it works
        12. Multiple table inserting using Hive
          1. Getting ready
          2. How to do it
          3. How it works
      14. 6. Data Import/Export Using Sqoop and Flume
        1. Introduction
        2. Importing data from RDBMS to HDFS using Sqoop
          1. Getting ready
          2. How to do it...
          3. How it works...
        3. Exporting data from HDFS to RDBMS
          1. Getting ready
          2. How to do it...
          3. How it works...
        4. Using query operator in Sqoop import
          1. Getting ready
          2. How to do it...
          3. How it works...
        5. Importing data using Sqoop in compressed format
          1. Getting ready
          2. How to do it...
          3. How it works...
        6. Performing Atomic export using Sqoop
          1. Getting ready
          2. How to do it...
          3. How it works...
        7. Importing data into Hive tables using Sqoop
          1. Getting ready
          2. How to do it...
          3. How it works...
        8. Importing data into HDFS from Mainframes
          1. Getting ready
          2. How to do it...
          3. How it works...
        9. Incremental import using Sqoop
          1. Getting ready
          2. How to do it...
          3. How it works...
        10. Creating and executing Sqoop job
          1. Getting ready
          2. How to do it...
          3. How it works...
        11. Importing data from RDBMS to Hbase using Sqoop
          1. Getting ready
          2. How to do it...
          3. How it works...
        12. Importing Twitter data into HDFS using Flume
          1. Getting ready
          2. How to do it...
          3. How it works
        13. Importing data from Kafka into HDFS using Flume
          1. Getting ready
          2. How to do it...
          3. How it works
        14. Importing web logs data into HDFS using Flume
          1. Getting ready
          2. How to do it...
          3. How it works...
      15. 7. Automation of Hadoop Tasks Using Oozie
        1. Introduction
        2. Implementing a Sqoop action job using Oozie
          1. Getting ready
          2. How to do it...
          3. How it works
        3. Implementing a Map Reduce action job using Oozie
          1. Getting ready
          2. How to do it...
          3. How it works...
        4. Implementing a Java action job using Oozie
          1. Getting ready
          2. How to do it
          3. How it works
        5. Implementing a Hive action job using Oozie
          1. Getting ready
          2. How to do it...
          3. How it works...
        6. Implementing a Pig action job using Oozie
          1. Getting ready
          2. How to do it...
          3. How it works
        7. Implementing an e-mail action job using Oozie
          1. Getting ready
          2. How to do it...
          3. How it works...
        8. Executing parallel jobs using Oozie (fork)
          1. Getting ready
          2. How to do it...
          3. How it works...
        9. Scheduling a job in Oozie
          1. Getting ready
          2. How to do it...
          3. How it works...
      16. 8. Machine Learning and Predictive Analytics Using Mahout and R
        1. Introduction
        2. Setting up the Mahout development environment
          1. Getting ready
          2. How to do it...
          3. How it works...
        3. Creating an item-based recommendation engine using Mahout
          1. Getting ready
          2. How to do it...
          3. How it works...
        4. Creating a user-based recommendation engine using Mahout
          1. Getting ready
          2. How to do it...
          3. How it works...
        5. Predictive analytics on Bank Data using Mahout
          1. Getting ready
          2. How to do it...
          3. How it works...
        6. Text data clustering using K-Means using Mahout
          1. Getting ready
          2. How to do it...
          3. How it works...
        7. Population Data Analytics using R
          1. Getting ready
          2. How to do it...
          3. How it works...
        8. Twitter Sentiment Analytics using R
          1. Getting ready
          2. How to do it...
          3. How it works...
        9. Performing Predictive Analytics using R
          1. Getting ready
          2. How to do it...
          3. How it works...
      17. 9. Integration with Apache Spark
        1. Introduction
        2. Running Spark standalone
          1. Getting ready
          2. How to do it...
          3. How it works...
        3. Running Spark on YARN
          1. Getting ready
          2. How to do it...
          3. How it works...
        4. Performing Olympics Athletes analytics using the Spark Shell
          1. Getting ready
          2. How to do it...
          3. How it works...
        5. Creating Twitter trending topics using Spark Streaming
          1. Getting ready
          2. How to do it...
          3. How it works...
        6. Twitter trending topics using Spark streaming
          1. Getting ready
          2. How to do it...
          3. How it works...
        7. Analyzing Parquet files using Spark
          1. Getting ready
          2. How to do it...
          3. How it works...
        8. Analyzing JSON data using Spark
          1. Getting ready
          2. How to do it...
          3. How it works...
        9. Processing graphs using Graph X
          1. Getting ready
          2. How to do it...
          3. How it works...
        10. Conducting predictive analytics using Spark MLlib
          1. Getting ready
          2. How to do it...
          3. How it works...
      18. 10. Hadoop Use Cases
        1. Introduction
        2. Call Data Record analytics
          1. Getting ready
          2. How to do it...
            1. Problem Statement
            2. Solution
          3. How it works...
        3. Web log analytics
          1. Getting ready
          2. How to do it...
            1. Problem statement
            2. Solution
          3. How it works...
        4. Sensitive data masking and encryption using Hadoop
          1. Getting ready
          2. How to do it...
            1. Problem statement
            2. Solution
          3. How it works...
      19. Index