Hadoop MapReduce v2 Cookbook - Second Edition

Book Description

Explore the Hadoop MapReduce v2 ecosystem to gain insights from very large datasets

In Detail

This book starts with installing Hadoop YARN, MapReduce, HDFS, and other Hadoop ecosystem components, then moves on to MapReduce patterns and to using Hadoop for analytics, classification, online marketing, recommendations, and data indexing and searching. You will learn how to take advantage of Hadoop ecosystem projects including Hive, HBase, Pig, Mahout, Nutch, and Giraph, and you will be introduced to deploying Hadoop in cloud environments.

Finally, you will be able to apply the knowledge you have gained to your own real-world scenarios to achieve the best possible results.

What You Will Learn

  • Configure and administer Hadoop YARN, MapReduce v2, and HDFS clusters

  • Use Hive, HBase, Pig, Mahout, and Nutch with Hadoop v2 to solve your big data problems easily and effectively

  • Solve large-scale analytics problems using MapReduce-based applications

  • Tackle complex problems such as classifications, finding relationships, online marketing, recommendations, and searching using Hadoop MapReduce and other related projects

  • Perform massive text data processing using Hadoop MapReduce and other related projects

  • Deploy your clusters to cloud environments

Downloading the example code

You can download the example code files for all Packt books you have purchased from your account at http://www.PacktPub.com. If you purchased this book elsewhere, you can visit http://www.PacktPub.com/support and register to have the files e-mailed directly to you.

Table of Contents

    1. Hadoop MapReduce v2 Cookbook Second Edition
      1. Table of Contents
      2. Hadoop MapReduce v2 Cookbook Second Edition
      3. Credits
      4. About the Author
      5. Acknowledgments
      6. About the Author
      7. About the Reviewers
      8. www.PacktPub.com
        1. Support files, eBooks, discount offers, and more
          1. Why Subscribe?
          2. Free Access for Packt account holders
      9. Preface
        1. What this book covers
        2. What you need for this book
        3. Who this book is for
        4. Conventions
        5. Reader feedback
        6. Customer support
          1. Downloading the example code
          2. Errata
          3. Piracy
          4. Questions
      10. 1. Getting Started with Hadoop v2
        1. Introduction
          1. Hadoop Distributed File System – HDFS
          2. Hadoop YARN
          3. Hadoop MapReduce
          4. Hadoop installation modes
        2. Setting up Hadoop v2 on your local machine
          1. Getting ready
          2. How to do it...
          3. How it works...
        3. Writing a WordCount MapReduce application, bundling it, and running it using the Hadoop local mode
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. There's more...
          5. See also
        4. Adding a combiner step to the WordCount MapReduce program
          1. How to do it...
          2. How it works...
          3. There's more...
        5. Setting up HDFS
          1. Getting ready
          2. How to do it...
          3. See also
        6. Setting up Hadoop YARN in a distributed cluster environment using Hadoop v2
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. See also
        7. Setting up Hadoop ecosystem in a distributed cluster environment using a Hadoop distribution
          1. Getting ready
          2. How to do it...
          3. There's more...
        8. HDFS command-line file operations
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. There's more...
        9. Running the WordCount program in a distributed cluster environment
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. There's more...
        10. Benchmarking HDFS using DFSIO
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. There's more...
        11. Benchmarking Hadoop MapReduce using TeraSort
          1. Getting ready
          2. How to do it...
          3. How it works...
      11. 2. Cloud Deployments – Using Hadoop YARN on Cloud Environments
        1. Introduction
        2. Running Hadoop MapReduce v2 computations using Amazon Elastic MapReduce
          1. Getting ready
          2. How to do it...
          3. See also
        3. Saving money using Amazon EC2 Spot Instances to execute EMR job flows
          1. How to do it...
          2. There's more...
          3. See also
        4. Executing a Pig script using EMR
          1. How to do it...
          2. There's more...
            1. Starting a Pig interactive session
        5. Executing a Hive script using EMR
          1. How to do it...
          2. There's more...
            1. Starting a Hive interactive session
          3. See also
        6. Creating an Amazon EMR job flow using the AWS Command Line Interface
          1. Getting ready
          2. How to do it...
          3. There's more...
          4. See also
        7. Deploying an Apache HBase cluster on Amazon EC2 using EMR
          1. Getting ready
          2. How to do it...
          3. See also
        8. Using EMR bootstrap actions to configure VMs for the Amazon EMR jobs
          1. How to do it...
          2. There's more...
        9. Using Apache Whirr to deploy an Apache Hadoop cluster in a cloud environment
          1. How to do it...
          2. How it works...
          3. See also
      12. 3. Hadoop Essentials – Configurations, Unit Tests, and Other APIs
        1. Introduction
        2. Optimizing Hadoop YARN and MapReduce configurations for cluster deployments
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. There's more...
        3. Shared user Hadoop clusters – using Fair and Capacity schedulers
          1. How to do it...
          2. How it works...
          3. There's more...
        4. Setting classpath precedence to user-provided JARs
          1. How to do it...
          2. How it works...
        5. Speculative execution of straggling tasks
          1. How to do it...
          2. There's more...
        6. Unit testing Hadoop MapReduce applications using MRUnit
          1. Getting ready
          2. How to do it...
          3. See also
        7. Integration testing Hadoop MapReduce applications using MiniYarnCluster
          1. Getting ready
          2. How to do it...
          3. See also
        8. Adding a new DataNode
          1. Getting ready
          2. How to do it...
          3. There's more...
            1. Rebalancing HDFS
          4. See also
        9. Decommissioning DataNodes
          1. How to do it...
          2. How it works...
          3. See also
        10. Using multiple disks/volumes and limiting HDFS disk usage
          1. How to do it...
        11. Setting the HDFS block size
          1. How to do it...
          2. There's more...
          3. See also
        12. Setting the file replication factor
          1. How to do it...
          2. How it works...
          3. There's more...
          4. See also
        13. Using the HDFS Java API
          1. How to do it...
          2. How it works...
          3. There's more...
            1. Configuring the FileSystem object
            2. Retrieving the list of data blocks of a file
      13. 4. Developing Complex Hadoop MapReduce Applications
        1. Introduction
        2. Choosing appropriate Hadoop data types
          1. How to do it...
          2. There's more...
          3. See also
        3. Implementing a custom Hadoop Writable data type
          1. How to do it...
          2. How it works...
          3. There's more...
          4. See also
        4. Implementing a custom Hadoop key type
          1. How to do it...
          2. How it works...
          3. See also
        5. Emitting data of different value types from a Mapper
          1. How to do it...
          2. How it works...
          3. There's more...
          4. See also
        6. Choosing a suitable Hadoop InputFormat for your input data format
          1. How to do it...
          2. How it works...
          3. There's more...
          4. See also
        7. Adding support for new input data formats – implementing a custom InputFormat
          1. How to do it...
          2. How it works...
          3. There's more...
          4. See also
        8. Formatting the results of MapReduce computations – using Hadoop OutputFormats
          1. How to do it...
          2. How it works...
          3. There's more...
        9. Writing multiple outputs from a MapReduce computation
          1. How to do it...
          2. How it works...
            1. Using multiple input data types and multiple Mapper implementations in a single MapReduce application
          3. See also
        10. Hadoop intermediate data partitioning
          1. How to do it...
          2. How it works...
          3. There's more...
            1. TotalOrderPartitioner
            2. KeyFieldBasedPartitioner
        11. Secondary sorting – sorting Reduce input values
          1. How to do it...
          2. How it works...
          3. See also
        12. Broadcasting and distributing shared resources to tasks in a MapReduce job – Hadoop DistributedCache
          1. How to do it...
          2. How it works...
          3. There's more...
            1. Distributing archives using the DistributedCache
            2. Adding resources to the DistributedCache from the command line
            3. Adding resources to the classpath using the DistributedCache
        13. Using Hadoop with legacy applications – Hadoop streaming
          1. How to do it...
          2. How it works...
          3. There's more...
          4. See also
        14. Adding dependencies between MapReduce jobs
          1. How to do it...
          2. How it works...
          3. There's more...
        15. Hadoop counters to report custom metrics
          1. How to do it...
          2. How it works...
      14. 5. Analytics
        1. Introduction
        2. Simple analytics using MapReduce
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. There's more...
        3. Performing GROUP BY using MapReduce
          1. Getting ready
          2. How to do it...
          3. How it works...
        4. Calculating frequency distributions and sorting using MapReduce
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. There's more...
        5. Plotting the Hadoop MapReduce results using gnuplot
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. There's more...
        6. Calculating histograms using MapReduce
          1. Getting ready
          2. How to do it...
          3. How it works...
        7. Calculating Scatter plots using MapReduce
          1. Getting ready
          2. How to do it...
          3. How it works...
        8. Parsing a complex dataset with Hadoop
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. There's more...
        9. Joining two datasets using MapReduce
          1. Getting ready
          2. How to do it...
          3. How it works...
      15. 6. Hadoop Ecosystem – Apache Hive
        1. Introduction
        2. Getting started with Apache Hive
          1. How to do it...
          2. See also
        3. Creating databases and tables using Hive CLI
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. There's more...
            1. Hive data types
            2. Hive external tables
            3. Using the describe formatted command to inspect the metadata of Hive tables
        4. Simple SQL-style data querying using Apache Hive
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. There's more...
            1. Using Apache Tez as the execution engine for Hive
          5. See also
        5. Creating and populating Hive tables and views using Hive query results
          1. Getting ready
          2. How to do it...
        6. Utilizing different storage formats in Hive - storing table data using ORC files
          1. Getting ready
          2. How to do it...
          3. How it works...
        7. Using Hive built-in functions
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. There's more...
          5. See also
        8. Hive batch mode - using a query file
          1. How to do it...
          2. How it works...
          3. There's more...
          4. See also
        9. Performing a join with Hive
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. See also
        10. Creating partitioned Hive tables
          1. Getting ready
          2. How to do it...
        11. Writing Hive User-defined Functions (UDF)
          1. Getting ready
          2. How to do it...
          3. How it works...
        12. HCatalog – performing Java MapReduce computations on data mapped to Hive tables
          1. Getting ready
          2. How to do it...
          3. How it works...
        13. HCatalog – writing data to Hive tables from Java MapReduce computations
          1. Getting ready
          2. How to do it...
          3. How it works...
      16. 7. Hadoop Ecosystem II – Pig, HBase, Mahout, and Sqoop
        1. Introduction
        2. Getting started with Apache Pig
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. There's more...
          5. See also
        3. Joining two datasets using Pig
          1. How to do it...
          2. How it works...
          3. There's more...
        4. Accessing a Hive table data in Pig using HCatalog
          1. Getting ready
          2. How to do it...
          3. There's more...
          4. See also
        5. Getting started with Apache HBase
          1. Getting ready
          2. How to do it...
          3. There's more...
          4. See also
        6. Data random access using Java client APIs
          1. Getting ready
          2. How to do it...
          3. How it works...
        7. Running MapReduce jobs on HBase
          1. Getting ready
          2. How to do it...
          3. How it works...
        8. Using Hive to insert data into HBase tables
          1. Getting ready
          2. How to do it...
          3. See also
        9. Getting started with Apache Mahout
          1. How to do it...
          2. How it works...
          3. There's more...
        10. Running K-means with Mahout
          1. Getting ready
          2. How to do it...
          3. How it works...
        11. Importing data to HDFS from a relational database using Apache Sqoop
          1. Getting ready
          2. How to do it...
        12. Exporting data from HDFS to a relational database using Apache Sqoop
          1. Getting ready
          2. How to do it...
      17. 8. Searching and Indexing
        1. Introduction
        2. Generating an inverted index using Hadoop MapReduce
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. There's more...
            1. Outputting a random accessible indexed InvertedIndex
          5. See also
        3. Intradomain web crawling using Apache Nutch
          1. Getting ready
          2. How to do it...
          3. See also
        4. Indexing and searching web documents using Apache Solr
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. See also
        5. Configuring Apache HBase as the backend data store for Apache Nutch
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. See also
        6. Whole web crawling with Apache Nutch using a Hadoop/HBase cluster
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. See also
        7. Elasticsearch for indexing and searching
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. See also
        8. Generating the in-links graph for crawled web pages
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. See also
      18. 9. Classifications, Recommendations, and Finding Relationships
        1. Introduction
        2. Performing content-based recommendations
          1. How to do it...
          2. How it works...
          3. There's more...
        3. Classification using the naïve Bayes classifier
          1. How to do it...
          2. How it works...
        4. Assigning advertisements to keywords using the Adwords balance algorithm
          1. How to do it...
          2. How it works...
          3. There's more...
      19. 10. Mass Text Data Processing
        1. Introduction
        2. Data preprocessing using Hadoop streaming and Python
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. There's more...
          5. See also
        3. De-duplicating data using Hadoop streaming
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. See also
        4. Loading large datasets to an Apache HBase data store – importtsv and bulkload
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. There's more...
            1. Data de-duplication using HBase
          5. See also
        5. Creating TF and TF-IDF vectors for the text data
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. See also
        6. Clustering text data using Apache Mahout
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. See also
        7. Topic discovery using Latent Dirichlet Allocation (LDA)
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. See also
        8. Document classification using Mahout Naive Bayes Classifier
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. See also
      20. Index