Live Online Training

Hands-on Introduction to Apache Hadoop and Spark Programming

A quick-start introduction to the important facets of big data analytics

Douglas Eadline

This live training course provides the "first touch" hands-on experience needed to start using essential tools in the Apache Hadoop and Spark ecosystem. Tools presented include the Hadoop Distributed File System (HDFS), Apache Pig, Hive, Sqoop, Flume, and Spark. The topics are presented in a "soup-to-nuts" fashion with minimal assumptions about prior experience. The programming examples include data ingest and one data analytics example. After completing the course, attendees will have the skills needed to begin their own projects.

What you'll learn, and how you can apply it

  • Navigate and use the Hadoop Distributed File System (HDFS)
  • Run, monitor, inspect, and stop applications in a Hadoop environment
  • Start and run Apache Pig, Hive, and Spark applications from the command line
  • Start and use the Zeppelin web GUI for Hive and Spark application development
  • Use Flume and Sqoop to import/export log files and databases into HDFS

This training course is for you because...

  • You're a beginning developer who wants to quickly learn how to navigate the Hadoop and Spark development environment
  • You're an administrator tasked with providing and supporting a Hadoop/Spark environment for your organization
  • You're a data scientist who does not yet have experience with scalable tools such as Hadoop and Spark

Prerequisites

A basic understanding of the Linux command line (bash shell and simple text editing) and some experience with Python.
If you want to run the examples, a functioning Hadoop/Spark environment is required (see below).

Setup Instructions:

To run the examples, you will need a functioning Hadoop environment. We recommend the Hortonworks HDP Sandbox (https://hortonworks.com/products/sandbox/). If you wish to follow along, install and test the sandbox at least one day before the class.
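
As a quick check that the sandbox is working, a login and a couple of commands are enough. A minimal sketch, assuming the VirtualBox sandbox with its default SSH port forward (2222) and the stock maria_dev account:

    ssh -p 2222 maria_dev@localhost    # log in to the sandbox
    hdfs dfs -ls /                     # list the top-level HDFS directories
    hadoop version                     # confirm the Hadoop client is available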

The resources listed below offer other methods to install a Hadoop and/or Spark environment directly from the Apache web site using a Linux desktop or laptop. They also provide instructions on how to install the Hortonworks HDP Sandbox using VirtualBox, along with step-by-step notes files to assist with installation.

Recommended preparation:

Hadoop Fundamentals LiveLessons (video)

Practical Data Science with Hadoop and Spark: Designing and Building Effective Analytics at Scale (book)

Hadoop 2 Quick-Start Guide: Learn the Essentials of Big Data Computing in the Apache Hadoop 2 Ecosystem (book)

About your instructor

  • Douglas Eadline, PhD, began his career as an analytical chemist with an interest in computer methods. Starting with the first Beowulf how-to document, Doug has written hundreds of articles, white papers, and instructional documents covering many aspects of Linux HPC and Hadoop computing. Prior to starting and editing the popular ClusterMonkey.net website in 2005, he served as editor-in-chief for ClusterWorld Magazine and was senior HPC editor for Linux Magazine. Currently, he is a writer and consultant to the HPC/analytics industry. His recent video tutorials and books include Hadoop Fundamentals LiveLessons (video, Addison-Wesley), Hadoop 2 Quick-Start Guide: Learn the Essentials of Big Data Computing in the Apache Hadoop 2 Ecosystem (Addison-Wesley), High Performance Computing for Dummies (Wiley), and Practical Data Science with Hadoop and Spark: Designing and Building Effective Analytics at Scale (co-author, Addison-Wesley).

Schedule

The timeframes are only estimates and may vary according to how the class is progressing.

Day 1

Note: All example commands are available in annotated notes files that can be used to run the same examples after the course is complete. Commands can be cut, pasted, and run from the notes files, allowing students to repeat (or modify) all course examples.

Segment 1: Introduction and Quick Overview of Hadoop and Spark (40 mins)

  • Instructor explains how the course will work (sit back and watch, try on your own later)
  • This segment is all slides and provides background on Hadoop and Spark
  • There will be about 10 minutes for questions

Segment 2: Using the Hadoop Distributed File System (HDFS) (25 mins)

  • Instructor will provide background on HDFS and demonstrate how to use basic commands on a real cluster (a short sketch follows this list)
  • If needed, there will be a 10-minute question and answer period
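
For a flavor of what this segment covers, a few representative HDFS commands follow (a minimal sketch; the file and directory names are illustrative, not course files):

    hdfs dfs -mkdir -p /user/hands-on                          # create an HDFS directory
    hdfs dfs -put war-and-peace.txt /user/hands-on             # copy a local file into HDFS
    hdfs dfs -ls /user/hands-on                                # list the directory contents
    hdfs dfs -get /user/hands-on/war-and-peace.txt copy.txt   # copy a file back to local disk
    hdfs dfs -rm /user/hands-on/war-and-peace.txt              # remove the HDFS copy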

Break: 10 minutes

Segment 3: Running and Monitoring Hadoop Applications (35 mins)

  • Instructor demonstrates how to run Hadoop example applications and benchmarks (sketched below)
  • A live tour of the YARN web GUI will be presented for a running application
  • If needed, there will be a 10-minute question and answer period
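
The commands involved look roughly like the following sketch (the example-jar path follows the HDP layout and varies by distribution; <application-id> is a placeholder reported by the -list command):

    # run the bundled MapReduce pi benchmark with 8 maps and 100000 samples per map
    yarn jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-examples.jar pi 8 100000
    yarn application -list                       # show running applications
    yarn application -status <application-id>    # inspect one application
    yarn application -kill <application-id>      # stop an application

The YARN web GUI itself is typically served by the ResourceManager on port 8088.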

Segment 4: Using Apache Pig (20 mins)

  • Instructor will present a simple Apache Pig example (see the sketch below)
  • Starting Pig locally, on a cluster, and with Tez acceleration will be demonstrated
  • If needed, there will be a 5-minute question and answer period
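
A sketch of the kind of example involved, using the colon-delimited /etc/passwd file as handy sample data (the script and file names are illustrative):

    hdfs dfs -put /etc/passwd passwd          # stage some sample data in HDFS
    cat > id.pig <<'EOF'
    A = LOAD 'passwd' USING PigStorage(':');  -- load colon-delimited records
    B = FOREACH A GENERATE $0 AS id;          -- keep the first field (the user name)
    DUMP B;                                   -- print the result
    EOF
    pig id.pig            # run on the cluster (MapReduce mode)
    pig -x tez id.pig     # the same script with Tez acceleration
    pig -x local id.pig   # local mode (reads 'passwd' from the local directory instead)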

Break: 10 minutes

Segment 5: Using Apache Hive (30 mins)

  • Instructor will demonstrate a simple interactive Hive-SQL example using example data (a sketch follows this list)
  • Running the same example from a script will also be presented
  • If needed, there will be a 10-minute question and answer period
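
An interactive session of the kind demonstrated might look like this hedged sketch (the table, column, and file names are illustrative, not course data):

    hive
    hive> CREATE TABLE logs (ip STRING, url STRING)
        >   ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
    hive> LOAD DATA INPATH 'sample-logs.tsv' INTO TABLE logs;
    hive> SELECT url, COUNT(*) AS hits FROM logs GROUP BY url;
    hive> quit;

    hive -f logs-report.sql    # the same statements run non-interactively from a script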

Day 2

Segment 6: Running Apache Spark (pySpark) (35 mins)

  • The interactive pySpark word count example will be explained to illustrate RDDs, mapping, reducing, filtering, and lambda functions (a sketch follows this list)
  • A stand-alone pi estimator program will be demonstrated
  • If needed, there will be a 10-minute question and answer period
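
To preview the idea, here is a minimal stand-alone word count sketch (the input file name is illustrative; the pi estimator line assumes SPARK_HOME points at a standard Spark installation):

    cat > wordcount.py <<'EOF'
    from pyspark import SparkContext

    sc = SparkContext(appName="WordCount")
    text = sc.textFile("war-and-peace.txt")            # an RDD of lines read from HDFS
    counts = (text.flatMap(lambda line: line.split())  # map: split each line into words
                  .map(lambda word: (word, 1))         # map: pair each word with a 1
                  .reduceByKey(lambda a, b: a + b))    # reduce: sum the pairs per word
    frequent = counts.filter(lambda kv: kv[1] > 500)   # filter: keep the frequent words
    for word, n in frequent.collect():
        print("%s\t%d" % (word, n))
    sc.stop()
    EOF
    spark-submit wordcount.py                          # run the stand-alone program

    spark-submit $SPARK_HOME/examples/src/main/python/pi.py 10   # the bundled pi estimator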

Break: 5 minutes

Segment 7: Running Apache Sqoop (30 mins)

  • A full example of taking data from MySQL to Hadoop/HDFS and back to MySQL will be demonstrated (sketched below)
  • Various Sqoop options will be demonstrated
  • If needed, there will be a 10-minute question and answer period
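
The round trip looks roughly like this sketch (the database name, credentials, and table names are illustrative, not course values):

    # MySQL -> HDFS: import the 'orders' table into an HDFS directory
    sqoop import --connect jdbc:mysql://localhost/retail \
      --username sqoop --password sqoop \
      --table orders --target-dir /user/hands-on/orders -m 1

    # HDFS -> MySQL: export the files back into a (pre-created) table
    sqoop export --connect jdbc:mysql://localhost/retail \
      --username sqoop --password sqoop \
      --table orders_copy --export-dir /user/hands-on/orders -m 1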

Segment 8: Using Apache Flume (20 mins)

  • A Flume example will demonstrate how to move web log data into Hadoop/HDFS (see the sketch below)
  • If needed, there will be a 5-minute question and answer period
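
A Flume agent is defined by a small properties file naming a source, a channel, and a sink. A hedged sketch of a tail-the-web-log configuration (the paths and names are illustrative):

    cat > weblog.conf <<'EOF'
    a1.sources = r1
    a1.channels = c1
    a1.sinks = k1
    a1.sources.r1.type = exec
    a1.sources.r1.command = tail -F /var/log/httpd/access_log
    a1.sources.r1.channels = c1
    a1.channels.c1.type = memory
    a1.sinks.k1.type = hdfs
    a1.sinks.k1.hdfs.path = /user/hands-on/weblogs
    a1.sinks.k1.channel = c1
    EOF
    flume-ng agent -n a1 -f weblog.conf    # start the agent defined above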

Break: 10 minutes

Segment 9: A Walking Tour of the Apache Zeppelin Web Interface (20 mins)

  • The major features of the Zeppelin web notebook will be demonstrated
  • If needed, there will be a 5-minute question and answer period

Segment 10: Example Analytics Application using Apache Zeppelin (30 mins)

  • A simple banking application notebook will be demonstrated using Apache Zeppelin (a sketch follows this list)
  • The example includes CSV input, RDD/DataFrame usage, and interactive plotting
  • If needed, there will be a 10-minute question and answer period
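
A single notebook paragraph gives the idea. A hedged sketch, assuming a Zeppelin pyspark interpreter in which the spark session and the ZeppelinContext z are predefined (the interpreter name varies by installation, e.g. %pyspark or %spark2.pyspark, and the file and column names are illustrative):

    %pyspark
    df = spark.read.csv("/user/hands-on/bank.csv", header=True, inferSchema=True)
    z.show(df.groupBy("marital").count())   # Zeppelin renders this as an interactive table/chart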

Break: 10 minutes

Segment 11: Wrap-up / Where to Go Next (20 mins)

  • A brief summary of course takeaways
  • The download URL for all course notes, data, and a DOS-to-Linux/HDFS cheat sheet
  • Resources for installing Hadoop/Spark/Zeppelin on your own hardware are provided
  • If needed, there will be a final 5-10 minute question and answer period