Live Online Training

Apache Hadoop, Spark and Big Data Foundations

Learn the value proposition behind scalable data analytics tools

Douglas Eadline

This live training course covers the essential introductory aspects of Hadoop, Spark, and Big Data, including a concise overview of the Hadoop and Spark ecosystem. After completing the workshop, attendees will have a working understanding of the Hadoop/Spark value proposition for their organization and a clear background on Big Data technologies.

What you'll learn and how you can apply it

  • An understanding of Hadoop as a data platform
  • How the "Data Lake" and Big Data are changing data analytics
  • A basic understanding of the differences and similarities among Hadoop tools
  • How to navigate a congested market and understand how these technologies can work for your organization
  • For developers: a solid foundation for learning how to use the various tools mentioned in the presentation

This training course is for you because...

  • CIOs and other managers who need to get up to speed quickly on scalable big data technologies
  • Developers or administrators (DevOps) who want to learn how the key pieces of the Hadoop and Spark ecosystem fit together
  • Data scientists who do not have experience with scalable tools like Hadoop or Spark

Prerequisites

  • Basic understanding of data center operations (servers, storage, networks, database)

Recommended preparation:

Hadoop Fundamentals LiveLessons (video)

Hadoop 2 Quick-Start Guide: Learn the Essentials of Big Data Computing in the Apache Hadoop 2 Ecosystem (book)

About your instructor

  • Douglas Eadline, PhD, began his career as a practitioner and chronicler of the Linux cluster HPC revolution and now documents big data analytics. Starting with the first Beowulf Cluster how-to document, Doug has written hundreds of articles, white papers, and instructional documents covering virtually all aspects of High Performance Computing (HPC). Prior to starting and editing the popular ClusterMonkey.net website in 2005, he served as editor-in-chief of ClusterWorld Magazine and was senior HPC editor for Linux Magazine. Currently, he is a writer and consultant to the HPC/data analytics industry and leader of the Limulus Personal Cluster Project (http://limulus.basement-supercomputing.com). He is the author of the Hadoop Fundamentals LiveLessons and Apache Hadoop YARN Fundamentals LiveLessons videos from Pearson, coauthor of the books Apache Hadoop YARN: Moving beyond MapReduce and Batch Processing with Apache Hadoop 2 and Practical Data Science with Hadoop and Spark: Designing and Building Effective Analytics at Scale, and sole author of Hadoop 2 Quick-Start Guide: Learn the Essentials of Big Data Computing in the Apache Hadoop 2 Ecosystem.

Schedule

The timeframes are only estimates and may vary according to how the class is progressing.

Segment 1: Why is Hadoop Such a Big Deal? (50 mins)

  • A Brief History of Apache Hadoop
  • What is Big Data?
  • Hadoop as a Data Lake
  • Apache Hadoop V2 is a Platform
  • The Apache Hadoop Project Ecosystem
  • Hadoop Interfaces for New Users
  • Questions: 10 minutes

Break: 5 minutes

Segment 2: Hadoop Distributed File System (HDFS) Basics (25 mins)

  • How HDFS works
  • Questions: 10 minutes

Segment 3: Hadoop MapReduce Framework (25 mins)

  • The MapReduce Model
  • MapReduce Data Flow
  • Questions: 10 minutes
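To preview what this segment covers, the MapReduce data flow can be sketched in plain Python as a word count, the classic example. This is only a single-process illustration of the map, shuffle, and reduce phases; real Hadoop MapReduce distributes each phase across a cluster.

```python
from collections import defaultdict

def map_phase(lines):
    # Map: emit a (word, 1) pair for every word in the input
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle_phase(pairs):
    # Shuffle: group all values by key so each reducer sees one word
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: sum the counts for each word
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["Hadoop and Spark", "Spark and big data"]
counts = reduce_phase(shuffle_phase(map_phase(lines)))
print(counts["spark"])  # prints 2
```

The same three-phase structure underlies every MapReduce job; only the map and reduce functions change.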

Break: 5 minutes

Segment 4: Making Life Easier: Spark (20 mins)

  • Spark Basics and Components
  • Spark RDDs and DataFrames
  • Spark vs MapReduce
  • Questions: 10 minutes
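A key idea behind Spark RDDs covered in this segment is lazy evaluation: transformations such as map and filter only build a pipeline, and nothing executes until an action requests a result. The sketch below is plain Python (not Spark itself, and the `FakeRDD` class is invented for illustration) that mimics this behavior with generators.

```python
class FakeRDD:
    """A toy stand-in for a Spark RDD, built on lazy generators."""

    def __init__(self, data):
        self._data = data  # an iterable; may be a pending generator

    def map(self, fn):
        # Transformation: returns a new FakeRDD without computing anything
        return FakeRDD(fn(x) for x in self._data)

    def filter(self, pred):
        # Transformation: also lazy
        return FakeRDD(x for x in self._data if pred(x))

    def collect(self):
        # Action: forces the whole pipeline to run
        return list(self._data)

rdd = FakeRDD(range(10))
result = rdd.map(lambda x: x * x).filter(lambda x: x % 2 == 0).collect()
print(result)  # prints [0, 4, 16, 36, 64]
```

In real Spark the same chaining style applies (`sc.parallelize(range(10)).map(...).filter(...).collect()`), with the pipeline executed in parallel across the cluster; DataFrames add a tabular schema and a query optimizer on top of this model.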

Segment 5: Real World Applications/Wrap-up (15 mins)

  • Questions: 10 minutes