Live Online Training

Hands-on Introduction to Apache Hadoop and Spark Programming

A quick-start introduction to the important facets of big data analytics

Douglas Eadline

This live training course provides the hands-on "first touch" experience needed to start using essential tools in the Apache Hadoop and Spark ecosystem. Tools presented include the Hadoop Distributed File System (HDFS), Apache Pig, Hive, Sqoop, Flume, and Spark. The topics are presented in a "soup-to-nuts" fashion with minimal assumptions about prior experience. The programming examples cover data ingest and one data analytics example. After completing the course, attendees will have the skills needed to begin their own projects.
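The specific analytics example taught in the course isn't spelled out here, but the classic first program in both Hadoop MapReduce and Spark is a word count. As a language-level preview of the map/reduce idea (plain Python, no cluster required; the input data below is illustrative, not from the course), the two phases look roughly like:

```python
from collections import Counter

def map_phase(lines):
    # Map: emit a (word, 1) pair for every word in every input line
    for line in lines:
        for word in line.lower().split():
            yield (word, 1)

def reduce_phase(pairs):
    # Reduce: sum the counts per word -- what a Hadoop reducer or
    # Spark's reduceByKey does in parallel across a cluster
    counts = Counter()
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

# Illustrative input, standing in for a file stored in HDFS
lines = ["to be or not to be"]
result = reduce_phase(map_phase(lines))
# result == {"to": 2, "be": 2, "or": 1, "not": 1}
```

On a real cluster the same logic is distributed: the map phase runs on the nodes holding the data blocks, and the framework shuffles pairs by key before the reduce phase.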

What you'll learn and how you can apply it

  • Navigate and use the Hadoop Distributed File System (HDFS)
  • Run, monitor, inspect, and stop applications in a Hadoop environment
  • Start and run Apache Pig, Hive, and Spark applications from the command line
  • Start and use the Zeppelin web GUI for Hive and Spark application development
  • Use Flume and Sqoop to import/export log files and databases into HDFS
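Hive applications are largely ordinary SQL run against files stored in HDFS. As a rough local stand-in (using Python's built-in sqlite3 in place of a Hive warehouse; the table name and rows below are made up for illustration), a query of the kind covered in the Hive segment might look like:

```python
import sqlite3

# In Hive, CREATE TABLE maps a schema onto files in HDFS;
# here an in-memory SQLite database stands in for the warehouse.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE logs (host TEXT, bytes INTEGER)")
conn.executemany(
    "INSERT INTO logs VALUES (?, ?)",
    [("web01", 200), ("web02", 150), ("web01", 300)],
)

# The same SELECT ... GROUP BY would run in the hive CLI
# or in a Zeppelin notebook paragraph.
rows = conn.execute(
    "SELECT host, SUM(bytes) FROM logs GROUP BY host ORDER BY host"
).fetchall()
print(rows)  # [('web01', 500), ('web02', 150)]
```

The point of Hive is exactly this familiarity: analysts write SQL, and the engine translates it into distributed jobs over HDFS data.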

This training course is for you because...

  • You're a beginning developer who wants to quickly learn to navigate the Hadoop and Spark development environment
  • You're an administrator tasked with providing and supporting a Hadoop/Spark environment for your organization
  • You're a data scientist who does not yet have experience with scalable tools like Hadoop or Spark

Prerequisites

Basic understanding of the Linux command line, including the bash shell and simple text editing, and some experience with Python.
A functioning Hadoop/Spark environment if you want to run the examples (see below).

Setup Instructions:

To run the examples, you will need a functioning Hadoop environment. We recommend the Hortonworks HDP Sandbox (https://hortonworks.com/products/sandbox/). If you wish to follow along, install and test the sandbox at least one day before the class.

Two of the resources below describe other ways to install a Hadoop and/or Spark environment directly from the Apache website on a Linux desktop or laptop. They also explain how to install the Hortonworks HDP Sandbox using VirtualBox and include step-by-step notes files to assist with installation.

Recommended preparation:

Hadoop Fundamentals LiveLessons (video)

Practical Data Science with Hadoop and Spark: Designing and Building Effective Analytics at Scale (book)

Hadoop 2 Quick-Start Guide: Learn the Essentials of Big Data Computing in the Apache Hadoop 2 Ecosystem (book)

About your instructor

  • Douglas Eadline, PhD, began his career as a practitioner and chronicler of the Linux cluster HPC revolution and now documents big data analytics. Starting with the first Beowulf Cluster how-to document, Doug has written hundreds of articles, white papers, and instructional documents covering virtually all aspects of High Performance Computing (HPC). Prior to starting and editing the popular ClusterMonkey.net website in 2005, he served as editor-in-chief of ClusterWorld Magazine and as senior HPC editor for Linux Magazine. Currently, he is a writer and consultant to the HPC/data analytics industry and leader of the Limulus Personal Cluster Project (http://limulus.basement-supercomputing.com). He is the author of the Hadoop Fundamentals LiveLessons and Apache Hadoop YARN Fundamentals LiveLessons videos from Pearson, coauthor of the books Apache Hadoop YARN: Moving beyond MapReduce and Batch Processing with Apache Hadoop 2 and Practical Data Science with Hadoop and Spark: Designing and Building Effective Analytics at Scale, and sole author of Hadoop 2 Quick-Start Guide: Learn the Essentials of Big Data Computing in the Apache Hadoop 2 Ecosystem.

Schedule

The time frames are estimates only and may vary according to how the class is progressing.

Day 1: Total time approximately 190 minutes (about 3.2 hours), with 55 minutes allotted to questions

Segment 1: Quick Overview of Hadoop and Spark (20 mins)

  • Instructor will present material and answer questions
  • Participants will listen and ask questions
  • Questions: 10 minutes

Segment 2: Using the Hadoop Distributed File System (HDFS) (25 mins)

  • Instructor will present material and answer questions
  • Participants will listen and ask questions
  • Questions: 10 minutes

Segment 3: Running and Monitoring Hadoop Applications (30 mins)

  • Instructor will present material and answer questions
  • Participants will listen and ask questions
  • Questions: 10 minutes

Break: 5 minutes

Segment 4: Using Apache Pig (15 mins)

  • Instructor will present material and answer questions
  • Participants will listen and ask questions
  • Questions: 5 minutes

Segment 5: Using Apache Hive (20 mins)

  • Instructor will present material and answer questions
  • Participants will listen and ask questions
  • Questions: 10 minutes

Segment 6: Running Apache Spark (20 mins)

  • Instructor will present material and answer questions
  • Participants will listen and ask questions
  • Questions: 10 minutes

Day 2: Total time approximately 160 minutes (about 2.7 hours), with 40 minutes allotted to questions

Segment 7: Running Apache Sqoop (30 mins)

  • Instructor will present material and answer questions
  • Participants will listen and ask questions
  • Questions: 10 minutes

Segment 8: Using Apache Flume (20 mins)

  • Instructor will present material and answer questions
  • Participants will listen and ask questions
  • Questions: 10 minutes

Break: 5 minutes

Segment 9: A Walking Tour of the Apache Zeppelin Web Interface (20 mins)

  • Instructor will present material and answer questions
  • Participants will listen and ask questions
  • Questions: 5 minutes

Segment 10: Creating an Analytics Application with Zeppelin (30 mins)

  • Instructor will present material and answer questions
  • Participants will listen and ask questions
  • Questions: 10 minutes

Segment 11: Wrap-up / Where to Go Next (15 mins)

  • Instructor will present material and answer questions
  • Participants will listen and ask questions
  • Questions: 5 minutes