Hadoop and Spark Fundamentals

Video description

9+ Hours of Video Instruction

The perfect (and fast) way to get started with Hadoop and Spark

Hadoop and Spark Fundamentals LiveLessons provides 9+ hours of video introduction to the Apache Hadoop Big Data ecosystem. The tutorial includes background information and explains the core components of Hadoop, including the Hadoop Distributed File System (HDFS), MapReduce, the YARN resource manager, and YARN Frameworks. In addition, it demonstrates how to use Hadoop at several levels, including the native Java interface, C++ pipes, and the universal streaming program interface. Examples include how to use benchmarks and high-level tools, including the Apache Pig scripting language, Apache Hive "SQL-like" interface, Apache Flume for streaming input, Apache Sqoop for import and export of relational data, and Apache Oozie for Hadoop workflow management. In addition, there is comprehensive coverage of Spark, PySpark, and the Zeppelin web-GUI. The steps for easily installing a working Hadoop/Spark system on a desktop/laptop and on a local stand-alone cluster using the powerful Ambari GUI are also included. All software used in these LiveLessons is open source and freely available for your use and experimentation. A bonus lesson includes a quick primer on the Linux command line as used with Hadoop and Spark.

Downloads associated with this LiveLesson can be found at https://www.clustermonkey.net/download/LiveLessons/Hadoop_Fundamentals/

About the Instructor

Douglas Eadline, PhD, began his career as a practitioner and a chronicler of the Linux cluster HPC revolution and now documents big data analytics. Starting with the first Beowulf Cluster how-to document, Doug has written hundreds of articles, white papers, and instructional documents covering High Performance Computing (HPC) and Data Analytics. Prior to starting and editing the popular ClusterMonkey.net website in 2005, he served as editor-in-chief for ClusterWorld Magazine, and was senior HPC editor for Linux Magazine. Currently, he is a writer and consultant to the HPC/Data Analytics industry and leader of the Limulus Personal Cluster Project. He is author of Hadoop Fundamentals LiveLessons and Apache Hadoop YARN Fundamentals LiveLessons videos from Pearson, and book coauthor of Apache Hadoop YARN: Moving beyond MapReduce and Batch Processing with Apache Hadoop 2 and Practical Data Science with Hadoop and Spark: Designing and Building Effective Analytics at Scale. He is also the sole author of Hadoop 2 Quick Start Guide: Learn the Essentials of Big Data Computing in the Apache Hadoop 2 Ecosystem.

Skill Level

  • Beginner
  • Intermediate

Learn How To

  • Understand Hadoop design and key components
  • Understand how the MapReduce process works in Hadoop
  • Understand the relationship of Spark and Hadoop
  • Understand key aspects of the new YARN design and Frameworks
  • Use, administer, and program HDFS
  • Run and administer Hadoop/Spark programs
  • Write basic MapReduce/Spark programs
  • Install Hadoop/Spark on a laptop/desktop
  • Run Apache Pig, Hive, Flume, Sqoop, Oozie, and Spark applications
  • Perform basic data ingest with Hive and Spark
  • Use the Zeppelin web-GUI for Spark/Hive programming
  • Install and administer Hadoop with the Apache Ambari GUI tool

Who Should Take This Course

  • Users, developers, and administrators interested in learning the fundamental aspects and operations of the open source Hadoop and Spark ecosystems

Course Requirements

  • Basic understanding of programming and development
  • A working knowledge of Linux systems and tools
  • Familiarity with Bash, Python, Java, and C++

Lesson 1: Background Concepts
This lesson introduces Hadoop and Spark along with the many aspects and features that enable the analysis of large unstructured data sets. Many discussions about Hadoop ignore the fundamental change Hadoop brings to data management. Doug explains this key point using the data lake metaphor, and then provides background on how the Hadoop data platform, MapReduce, and Spark fit into the data analytics landscape. A bonus lesson for new Linux users is also included; it covers the basics of the command line interface used throughout these lessons.

Lesson 2: Running Hadoop on a Desktop or Laptop
A real Hadoop installation, whether it is a local cluster or in the cloud, can be difficult to configure and expensive to run. To make the examples in this tutorial more accessible, you learn how to install the Hortonworks HDP Sandbox on a desktop or laptop. The "Sandbox" is a freely available Hadoop virtual machine that provides a full Hadoop environment (including Spark). You can use this environment to try most of the examples in this tutorial. If you would rather learn about Hadoop and Spark installation details, Doug also demonstrates a direct install on a single Linux machine using the latest Hadoop and Spark binary code.

Lesson 3: The Hadoop Distributed File System
The backbone of Hadoop is the Hadoop Distributed File System, or HDFS. In this lesson, you learn the basics of HDFS and how it differs from many standard file systems used today. In particular, Doug explains why various design trade-offs provide HDFS with a performance edge in big data applications. You also learn how to navigate HDFS using the Hadoop tools and how to use HDFS in user programs. Finally, Doug presents some of the new features available in HDFS, including high availability, federation, snapshots, and NFS access.

Lesson 4: Hadoop MapReduce
If the Hadoop Distributed File System is the backbone of Hadoop, then MapReduce is the muscle that operates on big data. In this lesson, Doug shows you how MapReduce compares to a traditional search approach. From there, he shows you how to compile and run a Java MapReduce application. Deeper background on how MapReduce works is presented along with how to use MapReduce with other languages and how to do simple debugging of a MapReduce program.
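
As a rough illustration of the MapReduce idea described in this lesson, the following is a minimal in-memory sketch in Python, not the Java application developed in the course. The function names (map_phase, shuffle_phase, reduce_phase, word_count) are hypothetical and chosen only for this example; a real Hadoop job distributes these phases across a cluster and reads its input from HDFS.

```python
from collections import defaultdict


def map_phase(lines):
    """Emit a (word, 1) pair for every word in every input line."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def shuffle_phase(pairs):
    """Group values by key, mimicking what the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Sum the grouped counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}


def word_count(lines):
    """Chain the three phases over an in-memory list of lines."""
    return reduce_phase(shuffle_phase(map_phase(lines)))


if __name__ == "__main__":
    sample = ["see spot run", "run spot run"]
    print(word_count(sample))
```

The point of the sketch is the division of labor: the map step looks at one line at a time and the reduce step looks at one key at a time, which is what allows both steps to run in parallel on many machines.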

Lesson 5: Hadoop MapReduce Examples
This lesson continues with MapReduce examples. Doug first shows you a multifile word count program, and then moves on to a more practical log file analysis. From there, he demonstrates how to use a really large text file, like Wikipedia. The lesson concludes with some examples of running MapReduce benchmarks and using the YARN job browser.
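
To give a feel for what a log file analysis might look like when written for the streaming interface, here is a small Python sketch. It is an assumption-laden illustration, not the example used in the lesson: the log format (date, time, level, message) is hypothetical, as are the names mapper, reducer, and run_locally. Under Hadoop Streaming, the mapper and reducer would run as separate programs reading standard input and writing tab-separated key/value lines; here they are plain functions so the pipeline can be tried without a cluster.

```python
from itertools import groupby


def mapper(lines):
    """Emit 'LEVEL<TAB>1' per log line; the level is assumed to be the third field."""
    for line in lines:
        fields = line.split()
        if len(fields) >= 3:  # skip lines that do not match the assumed format
            yield f"{fields[2]}\t1"


def reducer(sorted_lines):
    """Sum counts per key; input must already be sorted by key."""
    parsed = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for key, group in groupby(parsed, key=lambda kv: kv[0]):
        total = sum(int(count) for _, count in group)
        yield f"{key}\t{total}"


def run_locally(lines):
    """Simulate map, sort (standing in for the shuffle), and reduce in one process."""
    return list(reducer(sorted(mapper(lines))))


if __name__ == "__main__":
    sample_log = [
        "2018-06-01 12:00:01 INFO started",
        "2018-06-01 12:00:02 ERROR disk full",
        "2018-06-01 12:00:03 INFO done",
    ]
    for row in run_locally(sample_log):
        print(row)
```

The sort between the two steps matters: the reducer relies on all lines for a given key arriving together, which is the guarantee the framework provides between the map and reduce stages.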

Lesson 6: Higher Level Tools
While Hadoop is very effective at presenting a basic scalable MapReduce model, some higher-level approaches have been developed. In this lesson, Doug teaches you how to use Apache Pig, a Hadoop scripting language that simplifies using MapReduce. In addition, he shows you how to use Apache Hive QL, an SQL-like language that enables higher-level "ad hoc" queries using MapReduce and HDFS. Finally, the Oozie workflow manager is presented.

Lesson 7: Using the Spark Language
Spark has become a popular tool for data analytics. In this lesson, Doug provides some of the basic aspects of the Spark language and demonstrates the Python-Spark interface, PySpark, with a simple command line example. Additional aspects of the Spark language are also used in the next two lessons.

Lesson 8: Getting Data into Hadoop HDFS
The first, and often overlooked, step in data analytics is "data ingest." As was demonstrated in Lesson 3, files can be simply copied into HDFS. However, there are methods that can preserve and import structure that could be lost with simple copying. In this lesson, Doug demonstrates how to import data into Hive tables and use Spark to import data into HDFS. He also demonstrates importing log and other streaming data directly into HDFS using Apache Flume. Finally, a complete example of using Apache Sqoop to import and export a relational database into and out of HDFS is presented.

Lesson 9: Using the Zeppelin Web Interface
Although many of the early Hadoop applications were developed using the command line interface, new web-based GUI tools such as Apache Zeppelin offer a more user-friendly approach to application development. In this lesson, Doug walks through the Zeppelin interface and shows how to create an interactive Zeppelin notebook for a simple Spark application.

Lesson 10: Learning Basic Hadoop Installation and Administration
One of the challenges facing Hadoop users and administrators is setting up a real cluster for production use. In this lesson, Doug teaches you how to use the Ambari web GUI to install, monitor, and administer a full Hadoop installation. He also provides a few important command line tools that will help with basic administration. Finally, some additional HDFS features such as snapshots and NFSv3 mounts are demonstrated.

About Pearson Video Training

Pearson publishes expert-led video tutorials covering a wide selection of technology topics designed to teach you the skills you need to succeed. These professional and personal technology videos feature world-leading author instructors published by your trusted technology brands: Addison-Wesley, Cisco Press, Pearson IT Certification, Prentice Hall, Sams, and Que. Topics include: IT Certification, Network Security, Cisco Technology, Programming, Web Development, Mobile Development, and more. Learn more about Pearson Video Training at http://www.informit.com/video.

Table of contents

  1. Introduction
    1. Hadoop and Spark Fundamentals: Introduction
  2. Lesson 1: Background Concepts
    1. Learning objectives
    2. 1.1 Understand Big Data and analytics
    3. 1.2 Understand Hadoop as a data platform
    4. 1.3 Understand Hadoop MapReduce basics
    5. 1.4 Understand Spark language basics
    6. 1.5 Learn the Linux command line features
    7. 1.6 Preview Hadoop V3 new features
  3. Lesson 2: Running Hadoop on a Desktop or Laptop
    1. Learning objectives
    2. 2.1 Install Hortonworks Hadoop and Spark HDP Sandbox
    3. 2.2 Install from Hadoop sources--Part 1
    4. 2.2 Install from Hadoop sources--Part 2
    5. 2.3 Install from Spark sources
  4. Lesson 3: The Hadoop Distributed File System
    1. Learning objectives
    2. 3.1 Understand HDFS basics
    3. 3.2 Use HDFS command line tools
    4. 3.3 Use HDFS in programs
    5. 3.4 Utilize additional features of HDFS
  5. Lesson 4: Hadoop MapReduce
    1. Learning objectives
    2. 4.1 Understand the MapReduce paradigm
    3. 4.2 Develop and run a Java MapReduce application
    4. 4.3 Understand how MapReduce works
  6. Lesson 5: Hadoop MapReduce Examples
    1. Learning objectives
    2. 5.1 Use the Streaming Interface
    3. 5.2 Use the Pipes interface
    4. 5.3 Run the Hadoop grep example
    5. 5.4 Debugging MapReduce
    6. 5.5 Understand Hadoop Version 2 MapReduce
    7. 5.6 Use Hadoop Version 2 features--Part 1
    8. 5.6 Use Hadoop Version 2 features--Part 2
  7. Lesson 6: Higher Level Tools
    1. Learning objectives
    2. 6.1 Demonstrate a Pig example
    3. 6.2 Demonstrate a Hive example
    4. 6.3 Demonstrate an Oozie example--Part 1
    5. 6.3 Demonstrate an Oozie example--Part 2
  8. Lesson 7: Using the Spark Language
    1. Learning objectives
    2. 7.1 Learn Spark language basics
    3. 7.2 Demonstrate a PySpark command line example
  9. Lesson 8: Getting Data into Hadoop HDFS
    1. Learning objectives
    2. 8.1 Import data into Hive tables
    3. 8.2 Use Spark to import data into HDFS
    4. 8.3 Demonstrate a Flume Example--Part 1
    5. 8.3 Demonstrate a Flume Example--Part 2
    6. 8.4 Demonstrate a Sqoop Example--Part 1
    7. 8.4 Demonstrate a Sqoop Example--Part 2
  10. Lesson 9: Using the Zeppelin Web Interface
    1. Learning objectives
    2. 9.1 Understand Zeppelin features
    3. 9.2 Deconstruct a Spark application in Zeppelin
  11. Lesson 10: Learning Basic Hadoop Installation and Administration
    1. Learning objectives
    2. 10.1 Install and configure Hadoop using Ambari--Part 1
    3. 10.1 Install and configure Hadoop using Ambari--Part 2
    4. 10.2 Perform simple administration and monitoring with Ambari
    5. 10.3 Perform simple command line administration
    6. 10.4 Utilize additional features of HDFS
  12. Summary
    1. Hadoop and Spark Fundamentals: Summary

Product information

  • Title: Hadoop and Spark Fundamentals
  • Author(s): Douglas Eadline
  • Release date: June 2018
  • Publisher(s): Pearson
  • ISBN: 0134770862