Learn By Example: Hadoop, MapReduce for Big Data problems

Video Description

A hands-on workout in Hadoop, MapReduce and the art of thinking "parallel"

About This Video

  • Recommend friends in a Social Networking site: Generate Top 10 friend recommendations using a Collaborative filtering algorithm.
  • Build an Inverted Index for Search Engines: Use MapReduce to parallelize the humongous task of building an inverted index for a search engine.
  • Generate Bigrams from text: Generate bigrams and compute their frequency distribution in a corpus of text.
  • Build your Hadoop cluster:
    • Install Hadoop in Standalone, Pseudo-Distributed and Fully Distributed modes
    • Set up a Hadoop cluster using Linux VMs.
    • Set up a cloud Hadoop cluster on AWS with Cloudera Manager.
  • Understand HDFS, MapReduce and YARN and their interaction
  • Customize your MapReduce Jobs:
    • Chain multiple MR jobs together
    • Write your own Customized Partitioner
    • Total Sort: Globally sort a large amount of data by sampling input files
    • Secondary sorting
    • Unit tests with MRUnit
    • Integrate with Python using the Hadoop Streaming API
  • And of course, all the basics:
    • MapReduce: Mapper, Reducer, Sort/Merge, Partitioning, Shuffle and Sort
    • HDFS & YARN: Namenode, Datanode, Resource manager, Node manager, the anatomy of a MapReduce application, YARN Scheduling, Configuring HDFS and YARN to performance tune your cluster.
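
To give a flavour of the bigram exercise listed above, here is a minimal sketch in plain Python that imitates the map, shuffle and reduce steps in a single process. It is not the course's code and does not use Hadoop; the function names (`map_bigrams`, `shuffle`, `reduce_counts`, `count_bigrams`) and the sample corpus are assumptions made up for illustration.

```python
from collections import defaultdict


def map_bigrams(line):
    """Map step: emit ((word, next_word), 1) for each adjacent word pair."""
    words = line.lower().split()
    for first, second in zip(words, words[1:]):
        yield (first, second), 1


def shuffle(pairs):
    """Shuffle step: group emitted values by key, as the framework would."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(key, values):
    """Reduce step: sum the counts emitted for one bigram."""
    return key, sum(values)


def count_bigrams(lines):
    """Run map, shuffle and reduce in sequence over an iterable of lines."""
    emitted = (pair for line in lines for pair in map_bigrams(line))
    return dict(reduce_counts(k, v) for k, v in shuffle(emitted).items())


if __name__ == "__main__":
    # Hypothetical two-line corpus, used only to exercise the sketch.
    corpus = ["the quick brown fox", "the quick red fox"]
    for bigram, count in sorted(count_bigrams(corpus).items()):
        print(bigram, count)
```

On a real cluster the map calls would run on different nodes and the framework would perform the grouping; the sketch only shows how the work divides into independent units.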

In Detail

This course is a zoom-in, zoom-out, hands-on workout involving Hadoop, MapReduce and the art of thinking parallel. It is both broad and deep: it covers the individual components of Hadoop in great detail and also gives you a higher-level picture of how they interact with each other. You will get hands-on with Hadoop very early on and learn how to set up your own cluster using both VMs and the cloud. All the major features of MapReduce are covered, including advanced topics like Total Sort and Secondary Sort. MapReduce completely changed the way people thought about processing Big Data. Breaking down any problem into parallelizable units is an art, and the examples in this course will train you to think in parallel.
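
As a rough illustration of what breaking a problem into parallelizable units can look like, the sketch below builds a tiny inverted index in plain Python. Each document is mapped independently and each word is reduced independently, which is the property that lets the work spread across machines. This is an assumption-laden toy, not the course's implementation: the names `map_document`, `reduce_postings` and `build_inverted_index`, and the sample documents, are hypothetical.

```python
from collections import defaultdict


def map_document(doc_id, text):
    """Map step: emit (word, doc_id) once per distinct word in a document."""
    for word in set(text.lower().split()):
        yield word, doc_id


def reduce_postings(word, doc_ids):
    """Reduce step: collapse the document ids for one word into a sorted list."""
    return word, sorted(set(doc_ids))


def build_inverted_index(documents):
    """Map every document, group the output by word, then reduce each group."""
    grouped = defaultdict(list)
    for doc_id, text in documents.items():
        for word, emitted_id in map_document(doc_id, text):
            grouped[word].append(emitted_id)
    return dict(reduce_postings(w, ids) for w, ids in grouped.items())


if __name__ == "__main__":
    # Hypothetical documents, used only to exercise the sketch.
    docs = {"d1": "hadoop runs mapreduce", "d2": "mapreduce sorts keys"}
    for word, postings in sorted(build_inverted_index(docs).items()):
        print(word, postings)
```

No map call needs to see another document and no reduce call needs to see another word, so both phases can in principle run in parallel.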

Table of Contents

  1. Chapter 1 : Introduction
    1. You, this course and Us 00:01:53
  2. Chapter 2 : Why is Big Data a Big Deal
    1. The Big Data Paradigm 00:14:21
    2. Serial vs Distributed Computing 00:08:37
    3. What is Hadoop? 00:07:25
    4. HDFS or the Hadoop Distributed File System 00:11:01
    5. MapReduce Introduced 00:11:39
    6. YARN or Yet Another Resource Negotiator 00:04:01
  3. Chapter 3 : Installing Hadoop in a Local Environment
    1. Hadoop Install Modes 00:08:33
    2. Hadoop Standalone mode Install 00:15:47
    3. Hadoop Pseudo-Distributed mode Install 00:11:45
  4. Chapter 4 : The MapReduce "Hello World"
    1. The basic philosophy underlying MapReduce 00:08:50
    2. MapReduce - Visualized And Explained 00:09:04
    3. MapReduce - Digging a little deeper at every step 00:10:21
    4. "Hello World" in MapReduce 00:10:30
    5. The Mapper 00:09:48
    6. The Reducer 00:07:47
    7. The Job 00:12:28
  5. Chapter 5 : Run a MapReduce Job
    1. Get comfortable with HDFS 00:10:59
    2. Run your first MapReduce Job 00:14:30
  6. Chapter 6 : Juicing your MapReduce - Combiners, Shuffle and Sort and The Streaming API
    1. Parallelize the reduce phase - use the Combiner 00:14:40
    2. Not all Reducers are Combiners 00:14:31
    3. How many mappers and reducers does your MapReduce have? 00:08:24
    4. Parallelizing reduce using Shuffle And Sort 00:14:55
    5. MapReduce is not limited to the Java language - Introducing the Streaming API 00:05:06
    6. Python for MapReduce 00:12:19
  7. Chapter 7 : HDFS and YARN
    1. HDFS - Protecting against data loss using replication 00:15:39
    2. HDFS - Name nodes and why they're critical 00:06:54
    3. HDFS - Checkpointing to backup name node information 00:11:16
    4. YARN - Basic components 00:08:40
    5. YARN - Submitting a job to YARN 00:13:16
    6. YARN - Plug in scheduling policies 00:14:27
    7. YARN - Configure the scheduler 00:12:33
  8. Chapter 8 : MapReduce Customizations For Finer Grained Control
    1. Setting up your MapReduce to accept command line arguments 00:13:48
    2. The Tool, ToolRunner and GenericOptionsParser 00:12:36
    3. Configuring properties of the Job object 00:10:41
    4. Customizing the Partitioner, Sort Comparator, and Group Comparator 00:15:17
  9. Chapter 9 : The Inverted Index, Custom Data Types for Keys, Bigram Counts and Unit Tests!
    1. The heart of search engines - The Inverted Index 00:14:47
    2. Generating the inverted index using MapReduce 00:10:32
    3. Custom data types for keys - The Writable Interface 00:10:30
    4. Represent a Bigram using a WritableComparable 00:13:20
    5. MapReduce to count the Bigrams in input text 00:08:33
    6. Test your MapReduce job using MRUnit 00:13:48
  10. Chapter 10 : Input and Output Formats and Customized Partitioning
    1. Introducing the File Input Format 00:12:49
    2. Text And Sequence File Formats 00:10:22
    3. Data partitioning using a custom partitioner 00:07:11
    4. Make the custom partitioner real in code 00:10:25
    5. Total Order Partitioning 00:10:11
    6. Input Sampling, Distribution, Partitioning and configuring these 00:09:05
    7. Secondary Sort 00:14:34
  11. Chapter 11 : Recommendation Systems using Collaborative Filtering
    1. Introduction to Collaborative Filtering 00:07:25
    2. Friend recommendations using chained MR jobs 00:17:16
    3. Get common friends for every pair of users - the first MapReduce 00:14:50
    4. Top 10 friend recommendation for every user - the second MapReduce 00:13:46
  12. Chapter 12 : Hadoop as a Database
    1. Structured data in Hadoop 00:14:09
    2. Running an SQL Select with MapReduce 00:15:31
    3. Running an SQL Group By with MapReduce 00:14:02
    4. A MapReduce Join - The Map Side 00:14:20
    5. A MapReduce Join - The Reduce Side 00:13:08
    6. A MapReduce Join - Sorting and Partitioning 00:08:50
    7. A MapReduce Join - Putting it all together 00:13:46
  13. Chapter 13 : K-Means Clustering
    1. What is K-Means Clustering? 00:14:04
    2. A MapReduce job for K-Means Clustering 00:16:34
    3. K-Means Clustering - Measuring the distance between points 00:13:52
    4. K-Means Clustering - Custom Writables for Input/Output 00:08:27
    5. K-Means Clustering - Configuring the Job 00:10:50
    6. K-Means Clustering - The Mapper and Reducer 00:11:23
    7. K-Means Clustering: The Iterative MapReduce Job 00:03:40
  14. Chapter 14 : Setting up a Hadoop Cluster
    1. Manually configuring a Hadoop cluster (Linux VMs) 00:13:51
    2. Getting started with Amazon Web Services 00:06:26
    3. Start a Hadoop Cluster with Cloudera Manager on AWS 00:13:05
  15. Chapter 15 : Appendix
    1. Set up a Virtual Linux Instance (For Windows users) 00:15:59
    2. [For Linux/Mac OS Shell Newbies] Path and other Environment Variables 00:08:26