Hadoop Real-World Solutions Cookbook

Book Description

Realistic, simple code examples to solve problems at scale with Hadoop and related technologies

  • Solutions to common problems when working in the Hadoop environment

  • Recipes for (un)loading data, analytics, and troubleshooting

  • In-depth code examples demonstrating various analytic models, analytic solutions, and common best practices

In Detail

This book helps developers become more comfortable and proficient with solving problems in the Hadoop space. Readers will become more familiar with a wide variety of Hadoop-related tools and best practices for implementation.

Hadoop Real-World Solutions Cookbook will teach readers how to build solutions using tools such as Apache Hive, Pig, MapReduce, Mahout, Giraph, HDFS, Accumulo, Redis, and Ganglia.

Hadoop Real-World Solutions Cookbook provides in-depth explanations and code examples. Each chapter contains a set of recipes that pose, and then solve, technical challenges; the recipes can be completed in any order. A recipe breaks a single problem down into discrete steps that are easy to follow. The book covers (un)loading to and from HDFS, graph analytics with Giraph, batch data analysis using Hive, Pig, and MapReduce, machine learning approaches with Mahout, debugging and troubleshooting MapReduce, and columnar storage and retrieval of structured data using Apache Accumulo.

Hadoop Real-World Solutions Cookbook will give readers the examples they need to apply Hadoop technology to their own problems.

Table of Contents

  1. Hadoop Real-World Solutions Cookbook
    1. Table of Contents
    2. Hadoop Real-World Solutions Cookbook
    3. Credits
    4. About the Authors
    5. About the Reviewers
    6. www.packtpub.com
      1. Support files, eBooks, discount offers and more
        1. Why Subscribe?
        2. Free Access for Packt account holders
    7. Preface
      1. What this book covers
      2. What you need for this book
      3. Who this book is for
      4. Conventions
      5. Reader feedback
      6. Customer support
        1. Downloading the example code
        2. Errata
        3. Piracy
        4. Questions
    8. 1. Hadoop Distributed File System – Importing and Exporting Data
      1. Introduction
      2. Importing and exporting data into HDFS using Hadoop shell commands
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
        5. See also
      3. Moving data efficiently between clusters using Distributed Copy
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
      4. Importing data from MySQL into HDFS using Sqoop
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
        5. See also
      5. Exporting data from HDFS into MySQL using Sqoop
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. See also
      6. Configuring Sqoop for Microsoft SQL Server
        1. Getting ready
        2. How to do it...
        3. How it works...
      7. Exporting data from HDFS into MongoDB
        1. Getting ready
        2. How to do it...
        3. How it works...
      8. Importing data from MongoDB into HDFS
        1. Getting ready
        2. How to do it...
        3. How it works...
      9. Exporting data from HDFS into MongoDB using Pig
        1. Getting ready
        2. How to do it...
        3. How it works...
      10. Using HDFS in a Greenplum external table
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
      11. Using Flume to load data into HDFS
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
    9. 2. HDFS
      1. Introduction
      2. Reading and writing data to HDFS
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
      3. Compressing data using LZO
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
        5. See also
      4. Reading and writing data to SequenceFiles
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
        5. See also
      5. Using Apache Avro to serialize data
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
        5. See also
      6. Using Apache Thrift to serialize data
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. See also
      7. Using Protocol Buffers to serialize data
        1. Getting ready
        2. How to do it...
        3. How it works...
      8. Setting the replication factor for HDFS
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
        5. See also
      9. Setting the block size for HDFS
        1. Getting ready
        2. How to do it...
        3. How it works...
    10. 3. Extracting and Transforming Data
      1. Introduction
      2. Transforming Apache logs into TSV format using MapReduce
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
        5. See also
      3. Using Apache Pig to filter bot traffic from web server logs
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
        5. See also
      4. Using Apache Pig to sort web server log data by timestamp
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
        5. See also
      5. Using Apache Pig to sessionize web server log data
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. See also
      6. Using Python to extend Apache Pig functionality
        1. Getting ready
        2. How to do it...
        3. How it works...
      7. Using MapReduce and secondary sort to calculate page views
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. See also
      8. Using Hive and Python to clean and transform geographical event data
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
          1. Making every column type String
          2. Type casting values using the AS keyword
          3. Testing the script locally
      9. Using Python and Hadoop Streaming to perform a time series analytic
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
          1. Using Hadoop Streaming with any language that can read from stdin and write to stdout
          2. Using the -file parameter to pass additional required files for MapReduce jobs
      10. Using MultipleOutputs in MapReduce to name output files
        1. Getting ready
        2. How to do it...
        3. How it works...
      11. Creating custom Hadoop Writable and InputFormat to read geographical event data
        1. Getting ready
        2. How to do it...
        3. How it works...
    11. 4. Performing Common Tasks Using Hive, Pig, and MapReduce
      1. Introduction
      2. Using Hive to map an external table over weblog data in HDFS
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
          1. LOCATION must point to a directory, not a file
          2. Dropping an external table does not delete the data stored in the table
          3. You can add data to the path specified by LOCATION
      3. Using Hive to dynamically create tables from the results of a weblog query
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
          1. CREATE TABLE AS cannot be used to create external tables
          2. DROP temporary tables
      4. Using the Hive string UDFs to concatenate fields in weblog data
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
          1. The UDF concat_ws() function will not automatically cast parameters to String
          2. Alias your concatenated field
          3. The concat_ws() function supports variable length parameter arguments
        5. See also
      5. Using Hive to intersect weblog IPs and determine the country
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
          1. Hive supports multitable joins
          2. The ON operator for inner joins does not support inequality conditions
        5. See also
      6. Generating n-grams over news archives using MapReduce
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
          1. Use caution when invoking FileSystem.delete()
          2. Use NullWritable to avoid unnecessary serialization overhead
      7. Using the distributed cache in MapReduce to find lines that contain matching keywords over news archives
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
          1. Use the distributed cache to pass JAR dependencies to map/reduce task JVMs
          2. Distributed cache does not work in local jobrunner mode
      8. Using Pig to load a table and perform a SELECT operation with GROUP BY
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. See also
    12. 5. Advanced Joins
      1. Introduction
      2. Joining data in the Mapper using MapReduce
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
        5. See also
      3. Joining data using Apache Pig replicated join
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
        5. See also
      4. Joining sorted data using Apache Pig merge join
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
        5. See also
      5. Joining skewed data using Apache Pig skewed join
        1. Getting ready
        2. How to do it...
        3. How it works...
      6. Using a map-side join in Apache Hive to analyze geographical events
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
          1. Auto-convert to map-side join whenever possible
          2. Map-join behavior
        5. See also
      7. Using optimized full outer joins in Apache Hive to analyze geographical events
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
          1. Common join versus map-side join
          2. STREAMTABLE hint
          3. Table ordering in the query matters
      8. Joining data using an external key-value store (Redis)
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
    13. 6. Big Data Analysis
      1. Introduction
      2. Counting distinct IPs in weblog data using MapReduce and Combiners
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
          1. The Combiner does not always have to be the same class as your Reducer
          2. Combiners are not guaranteed to run
      3. Using Hive date UDFs to transform and sort event dates from geographic event data
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
          1. Date format strings follow Java SimpleDateFormat guidelines
          2. Default date and time formats
        5. See also
      4. Using Hive to build a per-month report of fatalities over geographic event data
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
          1. The coalesce() method can take variable length arguments
          2. Date reformatting code template
        5. See also
      5. Implementing a custom UDF in Hive to help validate source reliability over geographic event data
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
          1. Check out the existing UDFs
          2. User-defined table and aggregate functions
          3. Export HIVE_AUX_JARS_PATH in your environment
        5. See also
      6. Marking the longest period of non-violence using Hive MAP/REDUCE operators and Python
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
          1. SORT BY versus DISTRIBUTE BY versus CLUSTER BY versus ORDER BY
          2. MAP and REDUCE keywords are shorthand for SELECT TRANSFORM
        5. See also
      7. Calculating the cosine similarity of artists in the Audioscrobbler dataset using Pig
        1. Getting ready
        2. How to do it...
        3. How it works...
      8. Trim Outliers from the Audioscrobbler dataset using Pig and datafu
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
    14. 7. Advanced Big Data Analysis
      1. Introduction
      2. PageRank with Apache Giraph
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
          1. Keep up with the Apache Giraph community
          2. Read and understand the Google Pregel paper
        5. See also
      3. Single-source shortest-path with Apache Giraph
        1. Getting ready
        2. How to do it...
        3. How it works...
          1. First superstep (S0)
          2. Second superstep (S1)
        4. See also
      4. Using Apache Giraph to perform a distributed breadth-first search
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
          1. Apache Giraph jobs often require scalability tuning
      5. Collaborative filtering with Apache Mahout
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. See also
      6. Clustering with Apache Mahout
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. See also
      7. Sentiment classification with Apache Mahout
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
    15. 8. Debugging
      1. Introduction
      2. Using Counters in a MapReduce job to track bad records
        1. Getting ready
        2. How to do it...
        3. How it works...
      3. Developing and testing MapReduce jobs with MRUnit
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
        5. See also
      4. Developing and testing MapReduce jobs running in local mode
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
        5. See also
      5. Enabling MapReduce jobs to skip bad records
        1. How to do it...
        2. How it works...
        3. There's more...
      6. Using Counters in a streaming job
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
        5. See also
      7. Updating task status messages to display debugging information
        1. Getting ready
        2. How to do it...
        3. How it works...
      8. Using illustrate to debug Pig jobs
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. See also
    16. 9. System Administration
      1. Introduction
      2. Starting Hadoop in pseudo-distributed mode
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
        5. See also
      3. Starting Hadoop in distributed mode
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
        5. See also
      4. Adding new nodes to an existing cluster
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
        5. See also
      5. Safely decommissioning nodes
        1. Getting ready
        2. How to do it...
        3. How it works...
      6. Recovering from a NameNode failure
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
      7. Monitoring cluster health using Ganglia
        1. Getting ready
        2. How to do it...
        3. How it works...
      8. Tuning MapReduce job parameters
        1. Getting ready
        2. How to do it...
        3. How it works...
    17. 10. Persistence Using Apache Accumulo
      1. Introduction
      2. Designing a row key to store geographic events in Accumulo
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
          1. Lexicographic sorting of keys
          2. Z-order curve
        5. See also
      3. Using MapReduce to bulk import geographic event data into Accumulo
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
          1. AccumuloTableAssistant.java
          2. Split points
          3. AccumuloOutputFormat versus AccumuloFileOutputFormat
        5. See also
      4. Setting a custom field constraint for inputting geographic event data in Accumulo
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
          1. Bundled Constraint classes
          2. Installing a constraint on each TabletServer
        5. See also
      5. Limiting query results using the regex filtering iterator
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. See also
      6. Counting fatalities for different versions of the same key using SumCombiner
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
          1. Combiners are on a per-key basis, not across all keys
          2. Combiners can be applied at scan time or applied to the table configuration for incoming mutations
        5. See also
      7. Enforcing cell-level security on scans using Accumulo
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
          1. Writing mutations for unauthorized scanning
          2. ColumnVisibility is part of the key
          3. Supporting more complex Boolean expressions
        5. See also
      8. Aggregating sources in Accumulo using MapReduce
        1. Getting ready
        2. How to do it...
        3. How it works...
    18. Index