The Ultimate Hands-on Hadoop

Video Description

Tame Your Big Data

About This Video

  • Design distributed systems that manage "big data" using Hadoop and related technologies.

  • Use HDFS and MapReduce for storing and analyzing data at scale.

  • Use Pig and Spark to create scripts to process data on a Hadoop cluster in more complex ways.

  • Analyze relational data using Hive and MySQL

  • Analyze non-relational data using HBase, Cassandra, and MongoDB

  • Query data interactively with Drill, Phoenix, and Presto

  • Choose an appropriate data storage technology for your application

  • Understand how Hadoop clusters are managed by YARN, Tez, Mesos, ZooKeeper, Zeppelin, Hue, and Oozie.

  • Publish data to your Hadoop cluster using Kafka, Sqoop, and Flume

  • Consume streaming data using Spark Streaming, Flink, and Storm

In Detail

    The world of Hadoop and "Big Data" can be intimidating: hundreds of different technologies with cryptic names form the Hadoop ecosystem. With this course, you'll not only understand what those systems are and how they fit together, but you'll also go hands-on and learn how to use them to solve real business problems!

    This course is comprehensive, covering over 25 different technologies in over 14 hours of video lectures. It's filled with hands-on activities and exercises, so you get some real experience in using Hadoop; it's not just theory.

    You'll find a range of activities in this course for people at every level. If you're a project manager who just wants to learn the buzzwords, there are web UIs for many of the activities in the course that require no programming knowledge. If you're comfortable with command lines, we'll show you how to work with them too. And if you're a programmer, I'll challenge you with writing real scripts on a Hadoop system using Scala, Pig Latin, and Python.

    Table of Contents

    1. Chapter 1 : Learn all the buzzwords! And install Hadoop
      1. [Activity] Introduction, and install Hadoop on your desktop! 00:17:00
      2. Hadoop Overview and History 00:07:45
      3. Overview of Hadoop Ecosystem 00:16:47
      4. Tips for Using This Course 00:01:26
    2. Chapter 2 : Using Hadoop's Core: HDFS and MapReduce
      1. HDFS: What it is, and how it works 00:13:54
      2. [Activity] Install the MovieLens dataset into HDFS using the Ambari UI 00:06:20
      3. [Activity] Install the MovieLens dataset into HDFS using the command line 00:07:51
      4. MapReduce: What it is, and how it works 00:10:40
      5. How MapReduce distributes processing 00:12:57
      6. MapReduce example: Break down movie ratings by rating score 00:11:36
      7. [Activity] Installing Python, MRJob, and nano 00:07:34
      8. [Activity] Code up the ratings histogram MapReduce job and run it 00:07:36
      9. [Exercise] Rank Movies by their popularity 00:07:07
      10. [Activity] Check your results against mine! 00:08:24
    3. Chapter 3 : Programming Hadoop with Pig
      1. Introducing Ambari 00:09:50
      2. Introducing Pig 00:06:26
      3. Example: Find the oldest movie with 5-star rating using Pig 00:15:08
      4. [Activity] Find old 5-star movies with Pig 00:09:40
      5. More Pig Latin 00:07:34
      6. [Exercise] Find the most-rated one-star movie 00:01:56
      7. Pig Challenge: Compare Your Results to Mine! 00:05:37
    4. Chapter 4 : Programming Hadoop with Spark
      1. Why Spark? 00:10:07
      2. Resilient Distributed Datasets (RDD) 00:10:14
      3. [Activity] Find the movie with the lowest average rating - with RDDs 00:15:34
      4. Datasets and Spark 2.0 00:06:28
      5. [Activity] Find the movie with the lowest average rating - with DataFrames 00:10:01
      6. [Activity] Movie recommendations with MLlib 00:12:16
      7. [Exercise] Filter the lowest-rated movies by number of ratings 00:02:51
      8. [Activity] Check your results against mine! 00:06:40
    5. Chapter 5 : Using relational data stores with Hadoop
      1. What is Hive? 00:06:32
      2. [Activity] Use Hive to find the most popular movie 00:10:46
      3. How Hive works 00:09:11
      4. [Exercise] Use Hive to find the movie with the highest average rating 00:01:56
      5. Compare your solution to mine 00:04:11
      6. Integrating MySQL with Hadoop 00:08:00
      7. [Activity] Install MySQL and import our movie data 00:07:36
      8. [Activity] Use Sqoop to import data from MySQL to HDFS/Hive 00:07:31
      9. [Activity] Use Sqoop to export data from Hadoop to MySQL 00:07:17
    6. Chapter 6 : Using non-relational data stores with Hadoop
      1. Why NoSQL? 00:13:55
      2. What is HBase? 00:12:55
      3. [Activity] Import movie ratings into HBase 00:13:29
      4. [Activity] Use HBase with Pig to import data at scale 00:11:20
      5. Cassandra Overview 00:14:51
      6. [Activity] Installing Cassandra 00:11:44
      7. [Activity] Write Spark output into Cassandra 00:11:01
      8. MongoDB overview 00:16:54
      9. [Activity] Install MongoDB, and integrate Spark with MongoDB 00:12:45
      10. [Activity] Using the MongoDB shell 00:07:48
      11. Choosing a database technology 00:15:59
      12. [Exercise] Choose a database for a given problem 00:05:00
    7. Chapter 7 : Querying Your Data Interactively
      1. Overview of Drill 00:07:56
      2. [Activity] Setting up Drill 00:11:20
      3. [Activity] Querying across multiple databases with Drill 00:07:07
      4. Overview of Phoenix 00:08:56
      5. [Activity] Install Phoenix and query HBase with it 00:07:08
      6. [Activity] Integrate Phoenix with Pig 00:11:46
      7. Overview of Presto 00:06:40
      8. [Activity] Install Presto, and query Hive with it 00:12:27
      9. [Activity] Query both Cassandra and Hive using Presto 00:09:01
    8. Chapter 8 : Managing your Cluster
      1. YARN Explained 00:10:02
      2. Tez explained 00:04:56
      3. [Activity] Use Hive on Tez and measure the performance benefit 00:08:36
      4. Mesos explained 00:07:14
      5. ZooKeeper explained 00:13:11
      6. [Activity] Simulating a failing master with ZooKeeper 00:06:48
      7. Oozie explained 00:11:56
      8. [Activity] Set up a simple Oozie workflow 00:16:39
      9. Zeppelin overview 00:05:02
      10. [Activity] Use Zeppelin to analyze movie ratings, part 1 00:12:28
      11. [Activity] Use Zeppelin to analyze movie ratings, part 2 00:09:47
      12. Hue Overview 00:08:08
      13. Other technologies worth mentioning 00:04:35
    9. Chapter 9 : Feeding Data to your Cluster
      1. Kafka explained 00:09:48
      2. [Activity] Setting up Kafka, and publishing some data 00:07:24
      3. [Activity] Publishing web logs with Kafka 00:10:21
      4. Flume explained 00:10:16
      5. [Activity] Set up Flume and publish logs with it 00:07:46
      6. [Activity] Set up Flume to monitor a directory and store its data in HDFS 00:09:12
    10. Chapter 10 : Analyzing Streams of Data
      1. Spark Streaming: Introduction 00:14:28
      2. [Activity] Analyze web logs published with Flume using Spark streaming 00:14:21
      3. [Exercise] Monitor Flume-published logs for errors in real time 00:02:02
      4. Exercise solution: Aggregating HTTP access codes with Spark Streaming 00:04:25
      5. Apache Storm: Introduction 00:09:28
      6. [Activity] Count words with Storm 00:14:35
      7. Flink: An Overview 00:06:53
      8. [Activity] Counting words with Flink 00:10:21
    11. Chapter 11 : Designing Real-World Systems
      1. The Best of the Rest 00:09:25
      2. Review: How the pieces fit together 00:06:30
      3. Understanding your requirements 00:08:03
      4. Sample application: consume web server logs and keep track of top sellers 00:10:07
      5. Sample application: serving movie recommendations to a website 00:11:18
      6. [Exercise] Design a system to report web sessions per day 00:02:53
      7. Exercise solution: Design a system to count daily sessions 00:04:24
    12. Chapter 12 : Learning More
      1. Books and online resources 00:05:33
      2. Bonus lecture: Discounts on my other big data / data science courses! 00:02:26