Hadoop MapReduce Cookbook

Book Description

Learn how to use Hadoop MapReduce to analyze large and complex datasets with this comprehensive cookbook. Over fifty recipes with step-by-step instructions quickly take your Hadoop skills to the next level.

  • Learn to process large and complex data sets, starting simply, then diving in deep

  • Solve complex big data problems such as classifications, finding relationships, online marketing and recommendations

  • More than 50 Hadoop MapReduce recipes, presented in a simple and straightforward manner, with step-by-step instructions and real-world examples

In Detail

    We are facing an avalanche of data. The unstructured data we gather can contain many insights that might hold the key to business success or failure. The ability to analyze and process this data with Hadoop MapReduce is one of the most sought-after skills in today's job market.

    Hadoop MapReduce Cookbook is a one-stop guide to processing large and complex data sets using the Hadoop ecosystem. The book starts with simple examples and then works through in-depth big data use cases.

    Hadoop MapReduce Cookbook presents more than 50 ready-to-use Hadoop MapReduce recipes in a simple and straightforward manner, with step-by-step instructions and real-world examples.

    Start with how to install, then configure, extend, and administer Hadoop. Then write simple examples, learn MapReduce patterns, harness the Hadoop landscape, and finally jump to the cloud.

    The book covers topics such as setting up Hadoop security and using MapReduce for analytics, classification, online marketing, recommendation, and search use cases. You will learn how to harness components from the Hadoop ecosystem, including HBase, Hive, Pig, and Mahout, and then learn how to set up cloud environments to perform Hadoop MapReduce computations.

    Hadoop MapReduce Cookbook teaches you how to process large and complex data sets using real examples, providing a comprehensive guide to getting things done with Hadoop MapReduce.

    Table of Contents

    1. Hadoop MapReduce Cookbook
      1. Table of Contents
      2. Hadoop MapReduce Cookbook
      3. Credits
      4. About the Authors
      5. About the Reviewers
      6. www.PacktPub.com
        1. Support files, eBooks, discount offers and more
          1. Why Subscribe?
          2. Free Access for Packt account holders
      7. Preface
        1. What this book covers
        2. What you need for this book
        3. Who this book is for
        4. Conventions
        5. Reader feedback
        6. Customer support
          1. Downloading the example code
          2. Errata
          3. Piracy
          4. Questions
      8. 1. Getting Hadoop Up and Running in a Cluster
        1. Introduction
        2. Setting up Hadoop on your machine
          1. Getting ready
          2. How to do it...
          3. How it works...
        3. Writing a WordCount MapReduce sample, bundling it, and running it using standalone Hadoop
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. There's more...
        4. Adding the combiner step to the WordCount MapReduce program
          1. How to do it...
          2. How it works...
          3. There's more...
        5. Setting up HDFS
          1. Getting ready
          2. How to do it...
          3. How it works...
        6. Using HDFS monitoring UI
          1. Getting ready
          2. How to do it...
        7. HDFS basic command-line file operations
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. There's more...
        8. Setting Hadoop in a distributed cluster environment
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. There's more...
        9. Running the WordCount program in a distributed cluster environment
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. There's more...
        10. Using MapReduce monitoring UI
          1. How to do it...
          2. How it works...
      9. 2. Advanced HDFS
        1. Introduction
        2. Benchmarking HDFS
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. There's more...
          5. See also
        3. Adding a new DataNode
          1. Getting ready
          2. How to do it...
          3. There's more...
            1. Rebalancing HDFS
          4. See also
        4. Decommissioning DataNodes
          1. How to do it...
          2. How it works...
          3. See also
        5. Using multiple disks/volumes and limiting HDFS disk usage
          1. How to do it...
        6. Setting HDFS block size
          1. How to do it...
          2. There's more...
          3. See also
        7. Setting the file replication factor
          1. How to do it...
          2. How it works...
          3. There's more...
          4. See also
        8. Using HDFS Java API
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. There's more...
            1. Configuring the FileSystem object
            2. Retrieving the list of data blocks of a file
          5. See also
        9. Using HDFS C API (libhdfs)
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. There's more...
            1. Configuring using HDFS configuration files
          5. See also
        10. Mounting HDFS (Fuse-DFS)
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. There's more...
            1. Building libhdfs
          5. See also
        11. Merging files in HDFS
          1. How to do it...
          2. How it works...
      10. 3. Advanced Hadoop MapReduce Administration
        1. Introduction
        2. Tuning Hadoop configurations for cluster deployments
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. There's more...
        3. Running benchmarks to verify the Hadoop installation
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. There's more...
        4. Reusing Java VMs to improve the performance
          1. How to do it...
          2. How it works...
        5. Fault tolerance and speculative execution
          1. How to do it...
          2. How it works...
        6. Debug scripts – analyzing task failures
          1. Getting ready
          2. How to do it...
          3. How it works...
        7. Setting failure percentages and skipping bad records
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. There's more...
        8. Shared-user Hadoop clusters – using fair and other schedulers
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. There's more...
        9. Hadoop security – integrating with Kerberos
          1. Getting ready
          2. How to do it...
          3. How it works...
        10. Using the Hadoop Tool interface
          1. How to do it...
          2. How it works...
      11. 4. Developing Complex Hadoop MapReduce Applications
        1. Introduction
        2. Choosing appropriate Hadoop data types
          1. How to do it...
          2. There's more...
          3. See also
        3. Implementing a custom Hadoop Writable data type
          1. How to do it...
          2. How it works...
          3. There's more...
          4. See also
        4. Implementing a custom Hadoop key type
          1. How to do it...
          2. How it works...
          3. See also
        5. Emitting data of different value types from a mapper
          1. How to do it...
          2. How it works...
          3. There's more...
          4. See also
        6. Choosing a suitable Hadoop InputFormat for your input data format
          1. How to do it...
          2. How it works...
          3. There's more...
            1. Using multiple input data types and multiple mapper implementations in a single MapReduce application
          4. See also
        7. Adding support for new input data formats – implementing a custom InputFormat
          1. How to do it...
          2. How it works...
          3. There's more...
          4. See also
        8. Formatting the results of MapReduce computations – using Hadoop OutputFormats
          1. How to do it...
          2. How it works...
          3. There's more...
        9. Hadoop intermediate (map to reduce) data partitioning
          1. How to do it...
          2. How it works...
          3. There's more...
            1. TotalOrderPartitioner
            2. KeyFieldBasedPartitioner
        10. Broadcasting and distributing shared resources to tasks in a MapReduce job – Hadoop DistributedCache
          1. How to do it...
          2. How it works...
          3. There's more...
            1. Distributing archives using the DistributedCache
            2. Adding resources to the DistributedCache from the command line
            3. Adding resources to the classpath using DistributedCache
          4. See also
        11. Using Hadoop with legacy applications – Hadoop Streaming
          1. How to do it...
          2. How it works...
          3. There's more...
          4. See also
        12. Adding dependencies between MapReduce jobs
          1. How to do it...
          2. How it works...
          3. There's more...
        13. Hadoop counters for reporting custom metrics
          1. How to do it...
          2. How it works...
      12. 5. Hadoop Ecosystem
        1. Introduction
        2. Installing HBase
          1. How to do it...
          2. How it works...
          3. There's more...
        3. Data random access using Java client APIs
          1. Getting ready
          2. How to do it...
          3. How it works...
        4. Running MapReduce jobs on HBase (table input/output)
          1. Getting ready
          2. How to do it...
          3. How it works...
        5. Installing Pig
          1. How to do it...
          2. How it works...
          3. There's more...
        6. Running your first Pig command
          1. How to do it...
          2. How it works...
        7. Set operations (join, union) and sorting with Pig
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. There's more...
        8. Installing Hive
          1. Getting ready
          2. How to do it...
          3. How it works...
        9. Running a SQL-style query with Hive
          1. Getting ready
          2. How to do it...
          3. How it works...
        10. Performing a join with Hive
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. There's more...
        11. Installing Mahout
          1. How to do it...
          2. How it works...
        12. Running K-means with Mahout
          1. Getting ready
          2. How to do it...
          3. How it works...
        13. Visualizing K-means results
          1. Getting ready
          2. How to do it...
          3. How it works...
      13. 6. Analytics
        1. Introduction
        2. Simple analytics using MapReduce
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. There's more...
        3. Performing Group-By using MapReduce
          1. Getting ready
          2. How to do it...
          3. How it works...
        4. Calculating frequency distributions and sorting using MapReduce
          1. Getting ready
          2. How to do it...
          3. How it works...
        5. Plotting the Hadoop results using GNU Plot
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. There's more...
        6. Calculating histograms using MapReduce
          1. Getting ready
          2. How to do it...
          3. How it works...
        7. Calculating scatter plots using MapReduce
          1. Getting ready
          2. How to do it...
          3. How it works...
        8. Parsing a complex dataset with Hadoop
          1. Getting ready
          2. How to do it...
          3. How it works...
        9. Joining two datasets using MapReduce
          1. Getting ready
          2. How to do it...
          3. How it works...
      14. 7. Searching and Indexing
        1. Introduction
        2. Generating an inverted index using Hadoop MapReduce
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. There's more...
          5. See also
        3. Intra-domain web crawling using Apache Nutch
          1. Getting ready
          2. How to do it...
          3. See also
        4. Indexing and searching web documents using Apache Solr
          1. Getting Ready
          2. How to do it
          3. How it works
          4. See also
        5. Configuring Apache HBase as the backend data store for Apache Nutch
          1. Getting ready
          2. How to do it
          3. How it works...
          4. See also
        6. Deploying Apache HBase on a Hadoop cluster
          1. Getting ready
          2. How to do it
          3. How it works...
          4. See also
        7. Whole web crawling with Apache Nutch using a Hadoop/HBase cluster
          1. Getting ready
          2. How to do it
          3. How it works
          4. See also
        8. ElasticSearch for indexing and searching
          1. Getting ready
          2. How to do it
          3. How it works
          4. See also
        9. Generating the in-links graph for crawled web pages
          1. Getting ready
          2. How to do it
          3. How it works
          4. See also
      15. 8. Classifications, Recommendations, and Finding Relationships
        1. Introduction
        2. Content-based recommendations
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. There's more...
        3. Hierarchical clustering
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. There's more...
        4. Clustering an Amazon sales dataset
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. There's more...
        5. Collaborative filtering-based recommendations
          1. Getting ready
          2. How to do it...
          3. How it works...
        6. Classification using Naive Bayes Classifier
          1. Getting ready
          2. How to do it...
          3. How it works...
        7. Assigning advertisements to keywords using the Adwords balance algorithm
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. There's more...
      16. 9. Mass Text Data Processing
        1. Introduction
        2. Data preprocessing (extract, clean, and format conversion) using Hadoop Streaming and Python
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. There's more...
          5. See also
        3. Data de-duplication using Hadoop Streaming
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. See also
        4. Loading large datasets to an Apache HBase data store using importtsv and bulkload tools
          1. Getting ready
          2. How to do it…
          3. How it works...
          4. There's more...
            1. Data de-duplication using HBase
          5. See also
        5. Creating TF and TF-IDF vectors for the text data
          1. Getting ready
          2. How to do it…
          3. How it works…
          4. See also
        6. Clustering the text data
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. See also
        7. Topic discovery using Latent Dirichlet Allocation (LDA)
          1. Getting ready
          2. How to do it…
          3. How it works…
          4. See also
        8. Document classification using Mahout Naive Bayes classifier
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. See also
      17. 10. Cloud Deployments: Using Hadoop on Clouds
        1. Introduction
        2. Running Hadoop MapReduce computations using Amazon Elastic MapReduce (EMR)
          1. Getting ready
          2. How to do it...
          3. See also
        3. Saving money by using Amazon EC2 Spot Instances to execute EMR job flows
          1. How to do it...
          2. There's more...
          3. See also
        4. Executing a Pig script using EMR
          1. How to do it...
          2. There's more...
            1. Starting a Pig interactive session
          3. See also
        5. Executing a Hive script using EMR
          1. How to do it...
          2. There's more...
            1. Starting a Hive interactive session
          3. See also
        6. Creating an Amazon EMR job flow using the Command Line Interface
          1. How to do it...
          2. There's more...
          3. See also
        7. Deploying an Apache HBase Cluster on Amazon EC2 cloud using EMR
          1. Getting ready
          2. How to do it...
          3. See also
        8. Using EMR Bootstrap actions to configure VMs for the Amazon EMR jobs
          1. How to do it...
          2. There's more...
        9. Using Apache Whirr to deploy an Apache Hadoop cluster in a cloud environment
          1. How to do it...
          2. How it works...
          3. See also
        10. Using Apache Whirr to deploy an Apache HBase cluster in a cloud environment
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. See also
      18. Index