Hadoop: Data Processing and Modelling

Book description

Unlock the power of your data with the Hadoop 2.X ecosystem and its data warehousing techniques across large data sets

About This Book

  • Conquer the mountain of data using Hadoop 2.X tools
  • The authors succeed in creating a context for Hadoop and its ecosystem
  • Hands-on examples and recipes giving the bigger picture and helping you to master Hadoop 2.X data processing platforms
  • Overcome challenging data processing problems using this exhaustive Hadoop 2.X course

Who This Book Is For

This course is for Java developers with scripting knowledge who want to move into the Hadoop and Big Data segment of the IT industry. Whether you are a Hadoop novice or an expert, this book will take you to the most advanced level of Hadoop 2.X.

What You Will Learn

  • Best practices for setup and configuration of Hadoop clusters, tailoring the system to the problem at hand
  • Integration with relational databases, using Hive for SQL queries and Sqoop for data transfer
  • Installing and maintaining a Hadoop 2.X cluster and its ecosystem
  • Advanced data analysis using Hive, Pig, and MapReduce programs
  • Machine learning principles with libraries such as Mahout, plus batch and stream data processing using Apache Spark
  • Understand the changes involved in the move from Hadoop 1.0 to Hadoop 2.0
  • Dive into YARN and Storm, and use YARN to integrate Storm with Hadoop
  • Deploy Hadoop on Amazon Elastic MapReduce, discover HDFS replacements, and learn about HDFS Federation

In Detail

To paraphrase Marc Andreessen, data is eating the world. In today's age of Big Data, businesses produce data in huge volumes every day, and this rising tide of data needs to be organized and analyzed in a secure way. With proper and effective use of Hadoop, you can build new, improved models, and based on them you will be able to make the right decisions.

The first module, Hadoop Beginner's Guide, walks you through understanding Hadoop and how to use it, with very detailed instructions. Commands are explained in sections called "What just happened?" for greater clarity and understanding.

The second module, Hadoop Real World Solutions Cookbook, 2nd edition, is an essential tutorial for effectively implementing a big data warehouse in your business, with detailed, practical coverage of the latest technologies such as YARN and Spark.

Big data has become a key basis of competition and a new wave of productivity growth. Once you are familiar with the basics and have implemented end-to-end big data use cases, you will move on to the third module, Mastering Hadoop.

If you need to take your Hadoop skill set to the next level after you have nailed the basics and the advanced concepts, this course is indispensable. When you finish it, you will be able to tackle real-world scenarios and become a big data expert, using the tools and knowledge gained from its step-by-step tutorials and recipes.

Style and approach

This course covers everything from the basic concepts of Hadoop to the advanced mechanisms you need to master to become a big data expert. The goal is to help you learn the essentials through step-by-step tutorials and then move on to recipes offering a range of real-world solutions. It covers all the important aspects of Hadoop, from system design and configuration to machine learning principles with various libraries, with chapters illustrated by code fragments and schematic diagrams. It is a comprehensive course that explores Hadoop from the basics to the most advanced techniques available in Hadoop 2.X.

Table of contents

  1. Hadoop: Data Processing and Modelling
    1. Table of Contents
    2. Hadoop: Data Processing and Modelling
    3. Hadoop: Data Processing and Modelling
    4. Credits
    5. Preface
      1. What this learning path covers
        1. Hadoop Beginner's Guide
        2. Hadoop Real World Solutions Cookbook, 2nd edition
        3. Mastering Hadoop
      2. What you need for this learning path
      3. Who this learning path is for
      4. Reader feedback
      5. Customer support
        1. Downloading the example code
        2. Errata
        3. Piracy
        4. Questions
    6. 1. Module 1
      1. 1. What It's All About
        1. Big data processing
          1. The value of data
          2. Historically for the few and not the many
            1. Classic data processing systems
              1. Scale-up
              2. Early approaches to scale-out
            2. Limiting factors
          3. A different approach
            1. All roads lead to scale-out
            2. Share nothing
            3. Expect failure
            4. Smart software, dumb hardware
            5. Move processing, not data
            6. Build applications, not infrastructure
          4. Hadoop
            1. Thanks, Google
            2. Thanks, Doug
            3. Thanks, Yahoo
            4. Parts of Hadoop
            5. Common building blocks
            6. HDFS
            7. MapReduce
            8. Better together
            9. Common architecture
            10. What it is and isn't good for
        2. Cloud computing with Amazon Web Services
          1. Too many clouds
          2. A third way
          3. Different types of costs
          4. AWS – infrastructure on demand from Amazon
            1. Elastic Compute Cloud (EC2)
            2. Simple Storage Service (S3)
            3. Elastic MapReduce (EMR)
          5. What this book covers
            1. A dual approach
        3. Summary
      2. 2. Getting Hadoop Up and Running
        1. Hadoop on a local Ubuntu host
          1. Other operating systems
        2. Time for action – checking the prerequisites
          1. What just happened?
          2. Setting up Hadoop
            1. A note on versions
        3. Time for action – downloading Hadoop
          1. What just happened?
        4. Time for action – setting up SSH
          1. What just happened?
          2. Configuring and running Hadoop
        5. Time for action – using Hadoop to calculate Pi
          1. What just happened?
          2. Three modes
        6. Time for action – configuring the pseudo-distributed mode
          1. What just happened?
          2. Configuring the base directory and formatting the filesystem
        7. Time for action – changing the base HDFS directory
          1. What just happened?
        8. Time for action – formatting the NameNode
          1. What just happened?
          2. Starting and using Hadoop
        9. Time for action – starting Hadoop
          1. What just happened?
        10. Time for action – using HDFS
          1. What just happened?
        11. Time for action – WordCount, the Hello World of MapReduce
          1. What just happened?
          2. Have a go hero – WordCount on a larger body of text
          3. Monitoring Hadoop from the browser
            1. The HDFS web UI
              1. The MapReduce web UI
        12. Using Elastic MapReduce
          1. Setting up an account in Amazon Web Services
            1. Creating an AWS account
            2. Signing up for the necessary services
        13. Time for action – WordCount on EMR using the management console
          1. What just happened?
          2. Have a go hero – other EMR sample applications
          3. Other ways of using EMR
            1. AWS credentials
            2. The EMR command-line tools
          4. The AWS ecosystem
        14. Comparison of local versus EMR Hadoop
        15. Summary
      3. 3. Understanding MapReduce
        1. Key/value pairs
          1. What it means
          2. Why key/value data?
            1. Some real-world examples
          3. MapReduce as a series of key/value transformations
          4. Pop quiz – key/value pairs
        2. The Hadoop Java API for MapReduce
          1. The 0.20 MapReduce Java API
            1. The Mapper class
            2. The Reducer class
            3. The Driver class
        3. Writing MapReduce programs
        4. Time for action – setting up the classpath
          1. What just happened?
        5. Time for action – implementing WordCount
          1. What just happened?
        6. Time for action – building a JAR file
          1. What just happened?
        7. Time for action – running WordCount on a local Hadoop cluster
          1. What just happened?
        8. Time for action – running WordCount on EMR
          1. What just happened?
          2. The pre-0.20 Java MapReduce API
          3. Hadoop-provided mapper and reducer implementations
        9. Time for action – WordCount the easy way
          1. What just happened?
        10. Walking through a run of WordCount
          1. Startup
          2. Splitting the input
          3. Task assignment
          4. Task startup
          5. Ongoing JobTracker monitoring
          6. Mapper input
          7. Mapper execution
          8. Mapper output and reduce input
          9. Partitioning
          10. The optional partition function
          11. Reducer input
          12. Reducer execution
          13. Reducer output
          14. Shutdown
          15. That's all there is to it!
          16. Apart from the combiner…maybe
            1. Why have a combiner?
        11. Time for action – WordCount with a combiner
          1. What just happened?
            1. When you can use the reducer as the combiner
        12. Time for action – fixing WordCount to work with a combiner
          1. What just happened?
          2. Reuse is your friend
          3. Pop quiz – MapReduce mechanics
        13. Hadoop-specific data types
          1. The Writable and WritableComparable interfaces
          2. Introducing the wrapper classes
            1. Primitive wrapper classes
            2. Array wrapper classes
            3. Map wrapper classes
        14. Time for action – using the Writable wrapper classes
          1. What just happened?
            1. Other wrapper classes
          2. Have a go hero – playing with Writables
            1. Making your own
        15. Input/output
          1. Files, splits, and records
          2. InputFormat and RecordReader
          3. Hadoop-provided InputFormat
          4. Hadoop-provided RecordReader
          5. OutputFormat and RecordWriter
          6. Hadoop-provided OutputFormat
          7. Don't forget Sequence files
        16. Summary
      4. 4. Developing MapReduce Programs
        1. Using languages other than Java with Hadoop
          1. How Hadoop Streaming works
          2. Why to use Hadoop Streaming
        2. Time for action – implementing WordCount using Streaming
          1. What just happened?
          2. Differences in jobs when using Streaming
        3. Analyzing a large dataset
          1. Getting the UFO sighting dataset
          2. Getting a feel for the dataset
        4. Time for action – summarizing the UFO data
          1. What just happened?
            1. Examining UFO shapes
        5. Time for action – summarizing the shape data
          1. What just happened?
        6. Time for action – correlating of sighting duration to UFO shape
          1. What just happened?
            1. Using Streaming scripts outside Hadoop
        7. Time for action – performing the shape/time analysis from the command line
          1. What just happened?
          2. Java shape and location analysis
        8. Time for action – using ChainMapper for field validation/analysis
          1. What just happened?
          2. Have a go hero
            1. Too many abbreviations
            2. Using the Distributed Cache
        9. Time for action – using the Distributed Cache to improve location output
          1. What just happened?
        10. Counters, status, and other output
        11. Time for action – creating counters, task states, and writing log output
          1. What just happened?
          2. Too much information!
        12. Summary
      5. 5. Advanced MapReduce Techniques
        1. Simple, advanced, and in-between
        2. Joins
          1. When this is a bad idea
          2. Map-side versus reduce-side joins
          3. Matching account and sales information
        3. Time for action – reduce-side join using MultipleInputs
          1. What just happened?
            1. DataJoinMapper and TaggedMapperOutput
          2. Implementing map-side joins
            1. Using the Distributed Cache
          3. Have a go hero - Implementing map-side joins
            1. Pruning data to fit in the cache
            2. Using a data representation instead of raw data
            3. Using multiple mappers
          4. To join or not to join...
        4. Graph algorithms
          1. Graph 101
          2. Graphs and MapReduce – a match made somewhere
          3. Representing a graph
        5. Time for action – representing the graph
          1. What just happened?
          2. Overview of the algorithm
            1. The mapper
            2. The reducer
            3. Iterative application
        6. Time for action – creating the source code
          1. What just happened?
        7. Time for action – the first run
          1. What just happened?
        8. Time for action – the second run
          1. What just happened?
        9. Time for action – the third run
          1. What just happened?
        10. Time for action – the fourth and last run
          1. What just happened?
          2. Running multiple jobs
          3. Final thoughts on graphs
        11. Using language-independent data structures
          1. Candidate technologies
          2. Introducing Avro
        12. Time for action – getting and installing Avro
          1. What just happened?
          2. Avro and schemas
        13. Time for action – defining the schema
          1. What just happened?
        14. Time for action – creating the source Avro data with Ruby
          1. What just happened?
        15. Time for action – consuming the Avro data with Java
          1. What just happened?
          2. Using Avro within MapReduce
        16. Time for action – generating shape summaries in MapReduce
          1. What just happened?
        17. Time for action – examining the output data with Ruby
          1. What just happened?
        18. Time for action – examining the output data with Java
          1. What just happened?
          2. Have a go hero – graphs in Avro
          3. Going forward with Avro
        19. Summary
      6. 6. When Things Break
        1. Failure
          1. Embrace failure
          2. Or at least don't fear it
          3. Don't try this at home
          4. Types of failure
          5. Hadoop node failure
            1. The dfsadmin command
            2. Cluster setup, test files, and block sizes
            3. Fault tolerance and Elastic MapReduce
        2. Time for action – killing a DataNode process
          1. What just happened?
            1. NameNode and DataNode communication
          2. Have a go hero – NameNode log delving
        3. Time for action – the replication factor in action
          1. What just happened?
        4. Time for action – intentionally causing missing blocks
          1. What just happened?
            1. When data may be lost
            2. Block corruption
        5. Time for action – killing a TaskTracker process
          1. What just happened?
            1. Comparing the DataNode and TaskTracker failures
            2. Permanent failure
          2. Killing the cluster masters
        6. Time for action – killing the JobTracker
          1. What just happened?
            1. Starting a replacement JobTracker
          2. Have a go hero – moving the JobTracker to a new host
        7. Time for action – killing the NameNode process
          1. What just happened?
            1. Starting a replacement NameNode
            2. The role of the NameNode in more detail
            3. File systems, files, blocks, and nodes
            4. The single most important piece of data in the cluster – fsimage
            5. DataNode startup
            6. Safe mode
            7. SecondaryNameNode
            8. So what to do when the NameNode process has a critical failure?
            9. BackupNode/CheckpointNode and NameNode HA
            10. Hardware failure
            11. Host failure
            12. Host corruption
            13. The risk of correlated failures
          2. Task failure due to software
            1. Failure of slow running tasks
        8. Time for action – causing task failure
          1. What just happened?
          2. Have a go hero – HDFS programmatic access
            1. Hadoop's handling of slow-running tasks
            2. Speculative execution
            3. Hadoop's handling of failing tasks
          3. Have a go hero – causing tasks to fail
          4. Task failure due to data
            1. Handling dirty data through code
            2. Using Hadoop's skip mode
        9. Time for action – handling dirty data by using skip mode
          1. What just happened?
            1. To skip or not to skip...
        10. Summary
      7. 7. Keeping Things Running
        1. A note on EMR
        2. Hadoop configuration properties
          1. Default values
        3. Time for action – browsing default properties
          1. What just happened?
          2. Additional property elements
          3. Default storage location
          4. Where to set properties
        4. Setting up a cluster
          1. How many hosts?
            1. Calculating usable space on a node
            2. Location of the master nodes
            3. Sizing hardware
            4. Processor / memory / storage ratio
            5. EMR as a prototyping platform
          2. Special node requirements
          3. Storage types
            1. Commodity versus enterprise class storage
            2. Single disk versus RAID
            3. Finding the balance
            4. Network storage
          4. Hadoop networking configuration
            1. How blocks are placed
            2. Rack awareness
              1. The rack-awareness script
        5. Time for action – examining the default rack configuration
          1. What just happened?
        6. Time for action – adding a rack awareness script
          1. What just happened?
          2. What is commodity hardware anyway?
          3. Pop quiz – setting up a cluster
        7. Cluster access control
          1. The Hadoop security model
        8. Time for action – demonstrating the default security
          1. What just happened?
            1. User identity
              1. The super user
            2. More granular access control
          2. Working around the security model via physical access control
        9. Managing the NameNode
          1. Configuring multiple locations for the fsimage class
        10. Time for action – adding an additional fsimage location
          1. What just happened?
            1. Where to write the fsimage copies
          2. Swapping to another NameNode host
            1. Having things ready before disaster strikes
        11. Time for action – swapping to a new NameNode host
          1. What just happened?
            1. Don't celebrate quite yet!
            2. What about MapReduce?
          2. Have a go hero – swapping to a new NameNode host
        12. Managing HDFS
          1. Where to write data
          2. Using balancer
            1. When to rebalance
        13. MapReduce management
          1. Command line job management
          2. Have a go hero – command line job management
          3. Job priorities and scheduling
        14. Time for action – changing job priorities and killing a job
          1. What just happened?
          2. Alternative schedulers
            1. Capacity Scheduler
            2. Fair Scheduler
            3. Enabling alternative schedulers
            4. When to use alternative schedulers
        15. Scaling
          1. Adding capacity to a local Hadoop cluster
          2. Have a go hero – adding a node and running balancer
          3. Adding capacity to an EMR job flow
            1. Expanding a running job flow
        16. Summary
      8. 8. A Relational View on Data with Hive
        1. Overview of Hive
          1. Why use Hive?
          2. Thanks, Facebook!
        2. Setting up Hive
          1. Prerequisites
          2. Getting Hive
        3. Time for action – installing Hive
          1. What just happened?
        4. Using Hive
        5. Time for action – creating a table for the UFO data
          1. What just happened?
        6. Time for action – inserting the UFO data
          1. What just happened?
          2. Validating the data
        7. Time for action – validating the table
          1. What just happened?
        8. Time for action – redefining the table with the correct column separator
          1. What just happened?
          2. Hive tables – real or not?
        9. Time for action – creating a table from an existing file
          1. What just happened?
        10. Time for action – performing a join
          1. What just happened?
          2. Have a go hero – improve the join to use regular expressions
          3. Hive and SQL views
        11. Time for action – using views
          1. What just happened?
          2. Handling dirty data in Hive
          3. Have a go hero – do it!
        12. Time for action – exporting query output
          1. What just happened?
          2. Partitioning the table
        13. Time for action – making a partitioned UFO sighting table
          1. What just happened?
          2. Bucketing, clustering, and sorting... oh my!
          3. User-Defined Function
        14. Time for action – adding a new User Defined Function (UDF)
          1. What just happened?
          2. To preprocess or not to preprocess...
          3. Hive versus Pig
          4. What we didn't cover
        15. Hive on Amazon Web Services
        16. Time for action – running UFO analysis on EMR
          1. What just happened?
          2. Using interactive job flows for development
          3. Have a go hero – using an interactive EMR cluster
          4. Integration with other AWS products
        17. Summary
      9. 9. Working with Relational Databases
        1. Common data paths
          1. Hadoop as an archive store
          2. Hadoop as a preprocessing step
          3. Hadoop as a data input tool
          4. The serpent eats its own tail
        2. Setting up MySQL
        3. Time for action – installing and setting up MySQL
          1. What just happened?
          2. Did it have to be so hard?
        4. Time for action – configuring MySQL to allow remote connections
          1. What just happened?
          2. Don't do this in production!
        5. Time for action – setting up the employee database
          1. What just happened?
          2. Be careful with data file access rights
        6. Getting data into Hadoop
          1. Using MySQL tools and manual import
          2. Have a go hero – exporting the employee table into HDFS
          3. Accessing the database from the mapper
          4. A better way – introducing Sqoop
        7. Time for action – downloading and configuring Sqoop
          1. What just happened?
            1. Sqoop and Hadoop versions
            2. Sqoop and HDFS
        8. Time for action – exporting data from MySQL to HDFS
          1. What just happened?
            1. Mappers and primary key columns
            2. Other options
            3. Sqoop's architecture
          2. Importing data into Hive using Sqoop
        9. Time for action – exporting data from MySQL into Hive
          1. What just happened?
        10. Time for action – a more selective import
          1. What just happened?
            1. Datatype issues
        11. Time for action – using a type mapping
          1. What just happened?
        12. Time for action – importing data from a raw query
          1. What just happened?
          2. Have a go hero
            1. Sqoop and Hive partitions
            2. Field and line terminators
        13. Getting data out of Hadoop
          1. Writing data from within the reducer
          2. Writing SQL import files from the reducer
          3. A better way – Sqoop again
        14. Time for action – importing data from Hadoop into MySQL
          1. What just happened?
            1. Differences between Sqoop imports and exports
            2. Inserts versus updates
          2. Have a go hero
            1. Sqoop and Hive exports
        15. Time for action – importing Hive data into MySQL
          1. What just happened?
        16. Time for action – fixing the mapping and re-running the export
          1. What just happened?
            1. Other Sqoop features
              1. Incremental merge
              2. Avoiding partial exports
              3. Sqoop as a code generator
        17. AWS considerations
          1. Considering RDS
        18. Summary
      10. 10. Data Collection with Flume
        1. A note about AWS
        2. Data data everywhere...
          1. Types of data
          2. Getting network traffic into Hadoop
        3. Time for action – getting web server data into Hadoop
          1. What just happened?
          2. Have a go hero
          3. Getting files into Hadoop
          4. Hidden issues
            1. Keeping network data on the network
            2. Hadoop dependencies
            3. Reliability
            4. Re-creating the wheel
            5. A common framework approach
        4. Introducing Apache Flume
          1. A note on versioning
        5. Time for action – installing and configuring Flume
          1. What just happened?
          2. Using Flume to capture network data
        6. Time for action – capturing network traffic in a log file
          1. What just happened?
        7. Time for action – logging to the console
          1. What just happened?
          2. Writing network data to log files
        8. Time for action – capturing the output of a command to a flat file
          1. What just happened?
            1. Logs versus files
        9. Time for action – capturing a remote file in a local flat file
          1. What just happened?
          2. Sources, sinks, and channels
            1. Sources
            2. Sinks
            3. Channels
            4. Or roll your own
          3. Understanding the Flume configuration files
          4. Have a go hero
          5. It's all about events
        10. Time for action – writing network traffic onto HDFS
          1. What just happened?
        11. Time for action – adding timestamps
          1. What just happened?
          2. To Sqoop or to Flume...
        12. Time for action – multi level Flume networks
          1. What just happened?
        13. Time for action – writing to multiple sinks
          1. What just happened?
          2. Selectors replicating and multiplexing
          3. Handling sink failure
          4. Have a go hero - Handling sink failure
          5. Next, the world
          6. Have a go hero - Next, the world
        14. The bigger picture
          1. Data lifecycle
          2. Staging data
          3. Scheduling
        15. Summary
      11. 11. Where to Go Next
        1. What we did and didn't cover in this book
        2. Upcoming Hadoop changes
        3. Alternative distributions
          1. Why alternative distributions?
            1. Bundling
            2. Free and commercial extensions
              1. Cloudera Distribution for Hadoop
              2. Hortonworks Data Platform
              3. MapR
              4. IBM InfoSphere Big Insights
            3. Choosing a distribution
        4. Other Apache projects
          1. HBase
          2. Oozie
          3. Whir
          4. Mahout
          5. MRUnit
        5. Other programming abstractions
          1. Pig
          2. Cascading
        6. AWS resources
          1. HBase on EMR
          2. SimpleDB
          3. DynamoDB
        7. Sources of information
          1. Source code
          2. Mailing lists and forums
          3. LinkedIn groups
          4. HUGs
          5. Conferences
        8. Summary
      12. A. Pop Quiz Answers
        1. Chapter 3, Understanding MapReduce
          1. Pop quiz – key/value pairs
          2. Pop quiz – walking through a run of WordCount
        2. Chapter 7, Keeping Things Running
          1. Pop quiz – setting up a cluster
    7. 2. Module 2
      1. 1. Getting Started with Hadoop 2.X
        1. Introduction
        2. Installing a single-node Hadoop Cluster
          1. Getting ready
          2. How to do it...
          3. How it works...
            1. Hadoop Distributed File System (HDFS)
            2. Yet Another Resource Negotiator (YARN)
          4. There's more
        3. Installing a multi-node Hadoop cluster
          1. Getting ready
          2. How to do it...
          3. How it works...
        4. Adding new nodes to existing Hadoop clusters
          1. Getting ready
          2. How to do it...
          3. How it works...
        5. Executing the balancer command for uniform data distribution
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. There's more...
        6. Entering and exiting from the safe mode in a Hadoop cluster
          1. How to do it...
          2. How it works...
        7. Decommissioning DataNodes
          1. Getting ready
          2. How to do it...
          3. How it works...
        8. Performing benchmarking on a Hadoop cluster
          1. Getting ready
          2. How to do it...
            1. TestDFSIO
            2. NNBench
            3. MRBench
          3. How it works...
      2. 2. Exploring HDFS
        1. Introduction
        2. Loading data from a local machine to HDFS
          1. Getting ready
          2. How to do it...
          3. How it works...
        3. Exporting HDFS data to a local machine
          1. Getting ready
          2. How to do it...
          3. How it works...
        4. Changing the replication factor of an existing file in HDFS
          1. Getting ready
          2. How to do it...
          3. How it works...
        5. Setting the HDFS block size for all the files in a cluster
          1. Getting ready
          2. How to do it...
          3. How it works...
        6. Setting the HDFS block size for a specific file in a cluster
          1. Getting ready
          2. How to do it...
          3. How it works...
        7. Enabling transparent encryption for HDFS
          1. Getting ready
          2. How to do it...
          3. How it works...
        8. Importing data from another Hadoop cluster
          1. Getting ready
          2. How to do it...
          3. How it works...
        9. Recycling deleted data from trash to HDFS
          1. Getting ready
          2. How to do it...
          3. How it works...
        10. Saving compressed data in HDFS
          1. Getting ready
          2. How to do it...
          3. How it works...
      3. 3. Mastering Map Reduce Programs
        1. Introduction
        2. Writing the Map Reduce program in Java to analyze web log data
          1. Getting ready
          2. How to do it...
          3. How it works...
        3. Executing the Map Reduce program in a Hadoop cluster
          1. Getting ready
          2. How to do it
          3. How it works...
        4. Adding support for a new writable data type in Hadoop
          1. Getting ready
          2. How to do it...
          3. How it works...
        5. Implementing a user-defined counter in a Map Reduce program
          1. Getting ready
          2. How to do it...
          3. How it works...
        6. Map Reduce program to find the top X
          1. Getting ready
          2. How to do it...
          3. How it works
        7. Map Reduce program to find distinct values
          1. Getting ready
          2. How to do it
          3. How it works...
        8. Map Reduce program to partition data using a custom partitioner
          1. Getting ready
          2. How to do it...
          3. How it works...
        9. Writing Map Reduce results to multiple output files
          1. Getting ready
          2. How to do it...
          3. How it works...
        10. Performing Reduce side Joins using Map Reduce
          1. Getting ready
          2. How to do it
          3. How it works...
        11. Unit testing the Map Reduce code using MRUnit
          1. Getting ready
          2. How to do it...
          3. How it works...
      4. 4. Data Analysis Using Hive, Pig, and Hbase
        1. Introduction
        2. Storing and processing Hive data in a sequential file format
          1. Getting ready
          2. How to do it...
          3. How it works...
        3. Storing and processing Hive data in the RC file format
          1. Getting ready
          2. How to do it...
          3. How it works...
        4. Storing and processing Hive data in the ORC file format
          1. Getting ready
          2. How to do it...
          3. How it works...
        5. Storing and processing Hive data in the Parquet file format
          1. Getting ready
          2. How to do it...
          3. How it works...
        6. Performing FILTER By queries in Pig
          1. Getting ready
          2. How to do it...
          3. How it works...
        7. Performing Group By queries in Pig
          1. Getting ready
          2. How to do it...
          3. How it works...
        8. Performing Order By queries in Pig
          1. Getting ready
          2. How to do it..
          3. How it works...
        9. Performing JOINS in Pig
          1. Getting ready
          2. How to do it...
          3. How it works
            1. Replicated Joins
            2. Skewed Joins
            3. Merge Joins
        10. Writing a user-defined function in Pig
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. There's more...
        11. Analyzing web log data using Pig
          1. Getting ready
          2. How to do it...
          3. How it works...
        12. Performing the Hbase operation in CLI
          1. Getting ready
          2. How to do it
          3. How it works...
        13. Performing Hbase operations in Java
          1. Getting ready
          2. How to do it
          3. How it works...
        14. Executing the MapReduce programming with an Hbase Table
          1. Getting ready
          2. How to do it
          3. How it works
      5. 5. Advanced Data Analysis Using Hive
        1. Introduction
        2. Processing JSON data in Hive using JSON SerDe
          1. Getting ready
          2. How to do it...
          3. How it works...
        3. Processing XML data in Hive using XML SerDe
          1. Getting ready
          2. How to do it...
          3. How it works
        4. Processing Hive data in the Avro format
          1. Getting ready
          2. How to do it...
          3. How it works...
        5. Writing a user-defined function in Hive
          1. Getting ready
          2. How to do it
          3. How it works...
        6. Performing table joins in Hive
          1. Getting ready
          2. How to do it...
            1. Left outer join
            2. Right outer join
            3. Full outer join
            4. Left semi join
          3. How it works...
        7. Executing map side joins in Hive
          1. Getting ready
          2. How to do it...
          3. How it works...
        8. Performing context Ngram in Hive
          1. Getting ready
          2. How to do it...
          3. How it works...
        9. Call Data Record Analytics using Hive
          1. Getting ready
          2. How to do it...
          3. How it works...
        10. Twitter sentiment analysis using Hive
          1. Getting ready
          2. How to do it...
          3. How it works
        11. Implementing Change Data Capture using Hive
          1. Getting ready
          2. How to do it
          3. How it works
        12. Multiple table inserting using Hive
          1. Getting ready
          2. How to do it
          3. How it works
      6. 6. Data Import/Export Using Sqoop and Flume
        1. Introduction
        2. Importing data from RDBMS to HDFS using Sqoop
          1. Getting ready
          2. How to do it...
          3. How it works...
        3. Exporting data from HDFS to RDBMS
          1. Getting ready
          2. How to do it...
          3. How it works...
        4. Using query operator in Sqoop import
          1. Getting ready
          2. How to do it...
          3. How it works...
        5. Importing data using Sqoop in compressed format
          1. Getting ready
          2. How to do it...
          3. How it works...
        6. Performing Atomic export using Sqoop
          1. Getting ready
          2. How to do it...
          3. How it works...
        7. Importing data into Hive tables using Sqoop
          1. Getting ready
          2. How to do it...
          3. How it works...
        8. Importing data into HDFS from Mainframes
          1. Getting ready
          2. How to do it...
          3. How it works...
        9. Incremental import using Sqoop
          1. Getting ready
          2. How to do it...
          3. How it works...
        10. Creating and executing Sqoop job
          1. Getting ready
          2. How to do it...
          3. How it works...
        11. Importing data from RDBMS to Hbase using Sqoop
          1. Getting ready
          2. How to do it...
          3. How it works...
        12. Importing Twitter data into HDFS using Flume
          1. Getting ready
          2. How to do it...
          3. How it works
        13. Importing data from Kafka into HDFS using Flume
          1. Getting ready
          2. How to do it...
          3. How it works
        14. Importing web logs data into HDFS using Flume
          1. Getting ready
          2. How to do it...
          3. How it works...
      7. 7. Automation of Hadoop Tasks Using Oozie
        1. Introduction
        2. Implementing a Sqoop action job using Oozie
          1. Getting ready
          2. How to do it...
          3. How it works
        3. Implementing a Map Reduce action job using Oozie
          1. Getting ready
          2. How to do it...
          3. How it works...
        4. Implementing a Java action job using Oozie
          1. Getting ready
          2. How to do it
          3. How it works
        5. Implementing a Hive action job using Oozie
          1. Getting ready
          2. How to do it...
          3. How it works...
        6. Implementing a Pig action job using Oozie
          1. Getting ready
          2. How to do it...
          3. How it works
        7. Implementing an e-mail action job using Oozie
          1. Getting ready
          2. How to do it...
          3. How it works...
        8. Executing parallel jobs using Oozie (fork)
          1. Getting ready
          2. How to do it...
          3. How it works...
        9. Scheduling a job in Oozie
          1. Getting ready
          2. How to do it...
          3. How it works...
      8. 8. Machine Learning and Predictive Analytics Using Mahout and R
        1. Introduction
        2. Setting up the Mahout development environment
          1. Getting ready
          2. How to do it...
          3. How it works...
        3. Creating an item-based recommendation engine using Mahout
          1. Getting ready
          2. How to do it...
          3. How it works...
        4. Creating a user-based recommendation engine using Mahout
          1. Getting ready
          2. How to do it...
          3. How it works...
        5. Using Predictive analytics on Bank Data using Mahout
          1. Getting ready
          2. How to do it...
          3. How it works...
        6. Clustering text data using K-Means
          1. Getting ready
          2. How to do it...
          3. How it works...
        7. Performing Population Data Analytics using R
          1. Getting ready
          2. How to do it...
          3. How it works...
        8. Performing Twitter Sentiment Analytics using R
          1. Getting ready
          2. How to do it...
          3. How it works...
        9. Performing Predictive Analytics using R
          1. Getting ready
          2. How to do it...
          3. How it works...
      9. 9. Integration with Apache Spark
        1. Introduction
        2. Running Spark standalone
          1. Getting ready
          2. How to do it...
          3. How it works...
        3. Running Spark on YARN
          1. Getting ready
          2. How to do it...
          3. How it works...
        4. Olympics Athletes analytics using the Spark Shell
          1. Getting ready
          2. How to do it...
          3. How it works...
        5. Creating Twitter trending topics using Spark Streaming
          1. Getting ready
          2. How to do it...
          3. How it works...
        6. Twitter trending topics using Spark streaming
          1. Getting ready
          2. How to do it...
          3. How it works...
        7. Analyzing Parquet files using Spark
          1. Getting ready
          2. How to do it...
          3. How it works...
        8. Analyzing JSON data using Spark
          1. Getting ready
          2. How to do it...
          3. How it works...
        9. Processing graphs using Graph X
          1. Getting ready
          2. How to do it...
          3. How it works...
        10. Conducting predictive analytics using Spark MLlib
          1. Getting ready
          2. How to do it...
          3. How it works...
      10. 10. Hadoop Use Cases
        1. Introduction
        2. Call Data Record analytics
          1. Getting ready
          2. How to do it...
            1. Problem Statement
            2. Solution
          3. How it works...
        3. Web log analytics
          1. Getting ready
          2. How to do it...
            1. Problem statement
            2. Solution
          3. How it works...
        4. Sensitive data masking and encryption using Hadoop
          1. Getting ready
          2. How to do it...
            1. Problem statement
            2. Solution
          3. How it works...
    8. 3. Module 3
      1. 1. Hadoop 2.X
        1. The inception of Hadoop
        2. The evolution of Hadoop
          1. Hadoop's genealogy
            1. Hadoop-0.20-append
            2. Hadoop-0.20-security
            3. Hadoop's timeline
        3. Hadoop 2.X
          1. Yet Another Resource Negotiator (YARN)
            1. Architecture overview
          2. Storage layer enhancements
            1. High availability
            2. HDFS Federation
            3. HDFS snapshots
            4. Other enhancements
          3. Support enhancements
        4. Hadoop distributions
          1. Which Hadoop distribution?
            1. Performance
            2. Scalability
            3. Reliability
            4. Manageability
          2. Available distributions
            1. Cloudera Distribution of Hadoop (CDH)
            2. Hortonworks Data Platform (HDP)
            3. MapR
            4. Pivotal HD
        5. Summary
      2. 2. Advanced MapReduce
        1. MapReduce input
          1. The InputFormat class
          2. The InputSplit class
        2. The RecordReader class
        3. Hadoop's "small files" problem
        4. Filtering inputs
        5. The Map task
          1. The dfs.blocksize attribute
          2. Sort and spill of intermediate outputs
          3. Node-local Reducers or Combiners
          4. Fetching intermediate outputs – Map-side
        6. The Reduce task
          1. Fetching intermediate outputs – Reduce-side
          2. Merge and spill of intermediate outputs
        7. MapReduce output
          1. Speculative execution of tasks
        8. MapReduce job counters
        9. Handling data joins
          1. Reduce-side joins
          2. Map-side joins
        10. Summary
      3. 3. Advanced Pig
        1. Pig versus SQL
        2. Different modes of execution
        3. Complex data types in Pig
        4. Compiling Pig scripts
          1. The logical plan
          2. The physical plan
          3. The MapReduce plan
        5. Development and debugging aids
          1. The DESCRIBE command
          2. The EXPLAIN command
          3. The ILLUSTRATE command
        6. The advanced Pig operators
          1. The advanced FOREACH operator
            1. The FLATTEN operator
            2. The nested FOREACH operator
            3. The COGROUP operator
            4. The UNION operator
            5. The CROSS operator
          2. Specialized joins in Pig
            1. The Replicated join
            2. Skewed joins
            3. The Merge join
        7. User-defined functions
          1. The evaluation functions
            1. The aggregate functions
              1. The Algebraic interface
              2. The Accumulator interface
            2. The filter functions
          2. The load functions
          3. The store functions
        8. Pig performance optimizations
          1. The optimization rules
          2. Measurement of Pig script performance
          3. Combiners in Pig
          4. Memory for the Bag data type
          5. Number of reducers in Pig
          6. The multiquery mode in Pig
        9. Best practices
          1. The explicit usage of types
          2. Early and frequent projection
          3. Early and frequent filtering
          4. The usage of the LIMIT operator
          5. The usage of the DISTINCT operator
          6. The reduction of operations
          7. The usage of Algebraic UDFs
          8. The usage of Accumulator UDFs
          9. Eliminating nulls in the data
          10. The usage of specialized joins
          11. Compressing intermediate results
          12. Combining smaller files
        10. Summary
      4. 4. Advanced Hive
        1. The Hive architecture
          1. The Hive metastore
          2. The Hive compiler
          3. The Hive execution engine
          4. The supporting components of Hive
        2. Data types
        3. File formats
          1. Compressed files
          2. ORC files
          3. The Parquet files
        4. The data model
          1. Dynamic partitions
            1. Semantics for dynamic partitioning
          2. Indexes on Hive tables
        5. Hive query optimizers
        6. Advanced DML
          1. The GROUP BY operation
          2. ORDER BY versus SORT BY clauses
          3. The JOIN operator and its types
            1. Map-side joins
          4. Advanced aggregation support
          5. Other advanced clauses
        7. UDF, UDAF, and UDTF
        8. Summary
      5. 5. Serialization and Hadoop I/O
        1. Data serialization in Hadoop
          1. Writable and WritableComparable
          2. Hadoop versus Java serialization
        2. Avro serialization
          1. Avro and MapReduce
          2. Avro and Pig
          3. Avro and Hive
          4. Comparison – Avro versus Protocol Buffers / Thrift
        3. File formats
          1. The Sequence file format
            1. Reading and writing Sequence files
          2. The MapFile format
          3. Other data structures
        4. Compression
          1. Splits and compressions
          2. Scope for compression
        5. Summary
      6. 6. YARN – Bringing Other Paradigms to Hadoop
        1. The YARN architecture
          1. Resource Manager (RM)
          2. Application Master (AM)
          3. Node Manager (NM)
          4. YARN clients
        2. Developing YARN applications
          1. Writing YARN clients
          2. Writing the Application Master entity
        3. Monitoring YARN
        4. Job scheduling in YARN
          1. CapacityScheduler
          2. FairScheduler
        5. YARN commands
          1. User commands
          2. Administration commands
        6. Summary
      7. 7. Storm on YARN – Low Latency Processing in Hadoop
        1. Batch processing versus streaming
        2. Apache Storm
          1. Architecture of an Apache Storm cluster
          2. Computation and data modeling in Apache Storm
          3. Use cases for Apache Storm
          4. Developing with Apache Storm
          5. Apache Storm 0.9.1
        3. Storm on YARN
          1. Installing Apache Storm-on-YARN
            1. Prerequisites
          2. Installation procedure
        4. Summary
      8. 8. Hadoop on the Cloud
        1. Cloud computing characteristics
        2. Hadoop on the cloud
        3. Amazon Elastic MapReduce (EMR)
          1. Provisioning a Hadoop cluster on EMR
        4. Summary
      9. 9. HDFS Replacements
        1. HDFS – advantages and drawbacks
        2. Amazon AWS S3
          1. Hadoop support for S3
        3. Implementing a filesystem in Hadoop
        4. Implementing an S3 native filesystem in Hadoop
        5. Summary
      10. 10. HDFS Federation
        1. Limitations of the older HDFS architecture
        2. Architecture of HDFS Federation
          1. Benefits of HDFS Federation
          2. Deploying federated NameNodes
        3. HDFS high availability
          1. Secondary NameNode, Checkpoint Node, and Backup Node
          2. High availability – edits sharing
          3. Useful HDFS tools
          4. Three-layer versus four-layer network topology
        4. HDFS block placement
          1. Pluggable block placement policy
        5. Summary
      11. 11. Hadoop Security
        1. The security pillars
        2. Authentication in Hadoop
          1. Kerberos authentication
          2. The Kerberos architecture and workflow
          3. Kerberos authentication and Hadoop
          4. Authentication via HTTP interfaces
        3. Authorization in Hadoop
          1. Authorization in HDFS
            1. Identity of an HDFS user
            2. Group listings for an HDFS user
            3. HDFS APIs and shell commands
            4. Specifying the HDFS superuser
            5. Turning off HDFS authorization
          2. Limiting HDFS usage
            1. Name quotas in HDFS
            2. Space quotas in HDFS
          3. Service-level authorization in Hadoop
        4. Data confidentiality in Hadoop
          1. HTTPS and encrypted shuffle
            1. SSL configuration changes
            2. Configuring the keystore and truststore
        5. Audit logging in Hadoop
        6. Summary
      12. 12. Analytics Using Hadoop
        1. Data analytics workflow
        2. Machine learning
        3. Apache Mahout
        4. Document analysis using Hadoop and Mahout
          1. Term frequency
          2. Document frequency
          3. Term frequency – inverse document frequency
          4. Tf-Idf in Pig
          5. Cosine similarity distance measures
          6. Clustering using k-means
          7. K-means clustering using Apache Mahout
        5. RHadoop
        6. Summary
      13. 13. Hadoop for Microsoft Windows
        1. Deploying Hadoop on Microsoft Windows
          1. Prerequisites
          2. Building Hadoop
          3. Configuring Hadoop
          4. Deploying Hadoop
        2. Summary
    9. A. Bibliography
    10. Index

Product information

  • Title: Hadoop: Data Processing and Modelling
  • Author(s): Garry Turkington, Tanmay Deshpande, Sandeep Karanth
  • Release date: August 2016
  • Publisher(s): Packt Publishing
  • ISBN: 9781787125162