Live Online Training

Hands-on Introduction to Apache Hadoop and Spark Programming

A quick-start introduction to the important facets of big data analytics

Douglas Eadline

This live training course provides the "first touch" hands-on experience needed to start using essential tools in the Apache Hadoop and Spark ecosystem. Tools presented include the Hadoop Distributed File System (HDFS), Apache Pig, Hive, Sqoop, Flume, and Spark. Topics are presented in a "soup-to-nuts" fashion with minimal assumptions about prior experience. The programming examples include data ingest and one data analytics example. After completing the course, attendees will have the skills needed to begin their own projects.

What you'll learn and how you can apply it

  • Navigate and use the Hadoop Distributed File System (HDFS)
  • Run, monitor, inspect, and stop applications in a Hadoop environment
  • Start and run Apache Pig, Hive, and Spark applications from the command line
  • Start and use the Zeppelin web GUI for Hive and Spark application development
  • Use Flume to ingest log files into HDFS and Sqoop to move data between databases and HDFS

This training course is for you because...

  • You are a beginning developer who wants to quickly learn to navigate the Hadoop and Spark development environment
  • You are an administrator tasked with providing and supporting a Hadoop/Spark environment for your organization
  • You are a data scientist who does not yet have experience with scalable tools like Hadoop and Spark

Prerequisites

A basic understanding of the Linux command line, including the bash shell and simple text editing, plus some experience with Python. If you want to run the examples, you will also need a functioning Hadoop/Spark environment (see below).

Setup Instructions:

To run the examples, you will need a functioning Hadoop environment. We recommend the Hortonworks HDP Sandbox (https://hortonworks.com/products/sandbox/). If you wish to follow along, install and test the sandbox at least one day before the class.

The resources listed below offer other methods for installing a Hadoop and/or Spark environment directly from the Apache website on a Linux desktop or laptop. They also provide instructions for installing the Hortonworks HDP Sandbox using VirtualBox, along with step-by-step notes files to assist with installation.

Recommended preparation:

Hadoop Fundamentals LiveLessons (video)

Practical Data Science with Hadoop and Spark: Designing and Building Effective Analytics at Scale (book)

Hadoop 2 Quick-Start Guide: Learn the Essentials of Big Data Computing in the Apache Hadoop 2 Ecosystem (book)

About your instructor

  • Douglas Eadline, PhD, began his career as an analytical chemist with an interest in computer methods. Starting with the first Beowulf how-to document, Doug has written instructional documents covering many aspects of Linux HPC (High Performance Computing) and Hadoop computing. Doug currently serves as editor of the ClusterMonkey.net website and was previously editor of ClusterWorld Magazine and senior HPC editor for Linux Magazine. He is also an active writer and consultant to the HPC/analytics industry. His recent video tutorials and books include Hadoop and Spark Fundamentals LiveLessons (video, Addison-Wesley), Hadoop 2 Quick-Start Guide (Addison-Wesley), High Performance Computing for Dummies (Wiley), and Practical Data Science with Hadoop and Spark (co-author, Addison-Wesley).

Schedule

The timeframes are only estimates and may vary according to how the class is progressing.

Day 1: Total time approximately 190 minutes (about 3.2 hours), with 55 minutes allotted to questions

Segment 1: Quick Overview of Hadoop and Spark (20 mins)

  • Instructor will present material and answer questions
  • Participants will listen and ask questions
  • Questions: 10 minutes

Segment 2: Using the Hadoop Distributed File System (HDFS) (25 mins)

  • Instructor will present material and answer questions
  • Participants will listen and ask questions
  • Questions: 10 minutes
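
To give a feel for this segment, the sketch below drives the standard "hdfs dfs" commands from Python. It is an illustration rather than course material; the paths and file names are made up.

```python
import subprocess

def hdfs(*args):
    """Run an 'hdfs dfs' subcommand and return its standard output."""
    result = subprocess.run(["hdfs", "dfs", *args],
                            capture_output=True, text=True, check=True)
    return result.stdout

hdfs("-mkdir", "-p", "/user/hands-on")            # create a directory in HDFS
hdfs("-put", "local-data.txt", "/user/hands-on")  # copy a local file into HDFS
print(hdfs("-ls", "/user/hands-on"))              # list the directory contents
hdfs("-get", "/user/hands-on/local-data.txt", "copy-of-data.txt")  # copy back out
```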

Segment 3: Running and Monitoring Hadoop Applications (30 mins)

  • Instructor will present material and answer questions
  • Participants will listen and ask questions
  • Questions: 10 minutes
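
A minimal sketch of what this segment covers: launch the stock MapReduce pi estimator, then inspect applications with the YARN CLI. The jar path below is typical of HDP installs and is an assumption; other distributions differ.

```python
import subprocess

# Assumption: the HDP layout; adjust for your distribution.
EXAMPLES_JAR = ("/usr/hdp/current/hadoop-mapreduce-client/"
                "hadoop-mapreduce-examples.jar")

# Launch the pi estimator: 16 map tasks, 100000 samples each.
subprocess.run(["yarn", "jar", EXAMPLES_JAR, "pi", "16", "100000"], check=True)

# List YARN applications; a running job can be stopped with
# "yarn application -kill <application-id>".
print(subprocess.run(["yarn", "application", "-list"],
                     capture_output=True, text=True, check=True).stdout)
```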

Break: 5 minutes

Segment 4: Using Apache Pig (15 mins)

  • Instructor will present material and answer questions
  • Participants will listen and ask questions
  • Questions: 5 minutes
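
As a taste of Pig, here is a minimal sketch of the classic first Pig Latin exercise (list the user names in /etc/passwd), written to a file and run in Pig's local mode. The script and file names are illustrative.

```python
import subprocess

# A short Pig Latin script: split /etc/passwd on ':' and list the user names.
script = """\
users = LOAD '/etc/passwd' USING PigStorage(':');
names = FOREACH users GENERATE $0 AS name;
DUMP names;
"""
with open("first.pig", "w") as f:
    f.write(script)

# "-x local" runs Pig against the local file system rather than the cluster.
subprocess.run(["pig", "-x", "local", "first.pig"], check=True)
```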

Segment 5: Using Apache Hive (20 mins)

  • Instructor will present material and answer questions
  • Participants will listen and ask questions
  • Questions: 10 minutes
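
A minimal Hive sketch, with an illustrative table and input file: create a tab-delimited table, load a file already in HDFS, and run an aggregate query from the command line.

```python
import subprocess

# HiveQL: create a tab-delimited table, load data, and query it.
hql = """
CREATE TABLE IF NOT EXISTS web_logs (ip STRING, ts STRING, url STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t';
LOAD DATA INPATH '/user/hands-on/web.log' INTO TABLE web_logs;
SELECT ip, COUNT(*) AS hits FROM web_logs GROUP BY ip;
"""
subprocess.run(["hive", "-e", hql], check=True)
```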

Segment 6: Running Apache Spark (20 mins)

  • Instructor will present material and answer questions
  • Participants will listen and ask questions
  • Questions: 10 minutes
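
A sketch of the canonical first Spark program, a PySpark word count. The input path is illustrative, and the script would be launched with spark-submit.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("WordCount").getOrCreate()

# Classic word count over a text file stored in HDFS.
lines = spark.sparkContext.textFile("/user/hands-on/war-and-peace.txt")
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))

for word, count in counts.take(10):
    print(word, count)

spark.stop()
```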

Day 2: Total time approximately 160 minutes (about 2.7 hours), with 40 minutes allotted to questions

Segment 7: Running Apache Sqoop (30 mins)

  • Instructor will present material and answer questions
  • Participants will listen and ask questions
  • Questions: 10 minutes
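
A sketch of the kind of Sqoop import covered here; the database URL, table, and credentials are hypothetical placeholders, not course materials.

```python
import subprocess

subprocess.run([
    "sqoop", "import",
    "--connect", "jdbc:mysql://localhost/retail_db",  # hypothetical database
    "--username", "student",                          # placeholder credentials
    "--password", "secret",
    "--table", "orders",                              # hypothetical table
    "--target-dir", "/user/hands-on/orders",          # HDFS destination
    "-m", "1",                                        # one mapper for a small demo
], check=True)
```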

Segment 8: Using Apache Flume (20 mins)

  • Instructor will present material and answer questions
  • Participants will listen and ask questions
  • Questions: 10 minutes
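
A minimal Flume sketch: a single-agent configuration that tails a log file into HDFS, written out and started with flume-ng. All agent, file, and path names are illustrative.

```python
import subprocess

# One agent (a1) with one exec source, one memory channel, and one HDFS sink.
conf = """\
a1.sources = src1
a1.channels = ch1
a1.sinks = snk1
a1.sources.src1.type = exec
a1.sources.src1.command = tail -F /var/log/messages
a1.sources.src1.channels = ch1
a1.channels.ch1.type = memory
a1.sinks.snk1.type = hdfs
a1.sinks.snk1.hdfs.path = /user/hands-on/flume-logs
a1.sinks.snk1.channel = ch1
"""
with open("log-agent.conf", "w") as f:
    f.write(conf)

# Start the agent; it runs until interrupted.
subprocess.run(["flume-ng", "agent", "--name", "a1",
                "--conf-file", "log-agent.conf"], check=True)
```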

Break: 5 minutes

Segment 9: A Walking Tour of the Apache Zeppelin Web Interface (20 mins)

  • Instructor will present material and answer questions
  • Participants will listen and ask questions
  • Questions: 5 minutes

Segment 10: Creating an Analytics Application with Zeppelin (30 mins)

  • Instructor will present material and answer questions
  • Participants will listen and ask questions
  • Questions: 10 minutes
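
To suggest what such an application looks like, here is a sketch of a single PySpark paragraph as it might appear in a Zeppelin note. Inside Zeppelin, spark (a SparkSession) and z (the ZeppelinContext) are predefined; the data file and column names are illustrative.

```python
# %pyspark  <- the interpreter directive that starts a PySpark paragraph
# Load an illustrative CSV, aggregate it, and hand the result to Zeppelin's
# built-in display, which renders it as an interactive table or chart.
df = spark.read.csv("/user/hands-on/sensor.csv", header=True, inferSchema=True)
summary = df.groupBy("sensor_id").avg("reading")
z.show(summary)
```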

Segment 11: Wrap-Up/Where to Go Next (15 mins)

  • Instructor will present material and answer questions
  • Participants will listen and ask questions
  • Questions: 5 minutes