Live Online training

Big Data & Hadoop for Beginners

Understand the tools and concepts within the big data ecosystem

Jesse Anderson

In this class, we’ll focus on the information that developers and technical teams need to be successful with Big Data. This class will give you the tools to understand the concepts behind the most popular open source frameworks, such as Apache Hadoop and Apache Hive.

We’ll introduce Big Data’s extensive ecosystem. You’ll learn the specific job each of these technologies does and where it fits in the ecosystem.

What you'll learn and how you can apply it

  • You’ll learn about Apache Hadoop and its two main components, HDFS and MapReduce.
  • You'll learn how HDFS works and how you access it.
  • You’ll learn how MapReduce works and its API.
  • You’ll see how you can leverage your existing SQL skills with Apache Hive.
  • You’ll learn how Hive’s query language, HiveQL (also called HQL), differs from and resembles SQL.
  • What Big Data is and the tools that are available.
  • What a shuffle sort is and how it works.
  • How Hive runs its SQL queries on Hadoop.
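
To give a feel for the shuffle sort mentioned above, here is a minimal single-process sketch in Python. It is not the Hadoop MapReduce API and is not taken from the course materials; the function names (map_phase, shuffle_sort, reduce_phase, word_count) and the sample input are hypothetical, chosen only for illustration. In a real Hadoop job the shuffle also moves data across the network and partitions keys among reducers, which this sketch leaves out.

```python
from collections import defaultdict


def map_phase(lines):
    """Mapper: emit a (word, 1) pair for every word in every input line."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def shuffle_sort(pairs):
    """Shuffle sort: group all values by key, then order the groups by key.

    Assumption: everything fits in one process. Hadoop does this step
    across machines, between the map and reduce phases.
    """
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_phase(grouped):
    """Reducer: collapse each key's list of values into a single total."""
    for key, values in grouped:
        yield key, sum(values)


def word_count(lines):
    """Run the three phases in order and return (word, count) pairs."""
    return list(reduce_phase(shuffle_sort(map_phase(lines))))


if __name__ == "__main__":
    sample = ["big data big tools", "data tools data"]
    print(word_count(sample))
```

The point of the sketch is that the reducer never sees individual pairs: by the time it runs, the shuffle sort has already gathered every value for a key into one group and ordered the groups by key.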

Touches on:

  • Apache Pig
  • Apache Crunch
  • Apache Beam
  • Apache Oozie
  • Hue
  • Apache Solr
  • Apache HBase
  • Apache Spark
  • Apache Storm/Heron
  • Apache Flink
  • Apache NiFi
  • Kafka Streams
  • Apache Impala and Presto
  • Apache Kafka

And you’ll be able to:

  • Write a simple MapReduce program
  • Understand how MapReduce works
  • Write a simple query with Hive

This training course is for you because...

  • You are a Software Engineer and need to write Hadoop MapReduce code and execute queries.
  • You are a Software Architect and need to understand the Big Data ecosystem.
  • You are a Business Analyst who needs to write more complex analyses with Hadoop MapReduce and Apache Hive.
  • You are a Business Intelligence Analyst who needs to learn how to run complex analytics at scale.
  • You are a Quality Assurance Engineer who needs to test Hadoop MapReduce code.
  • You are a DBA who needs to understand Big Data and know how to run SQL queries with it.
  • You are a Technical Manager who needs to understand the technical side of Big Data.

Prerequisites

  • All attendees will need a technical background. To complete the programming exercises, attendees must be able to program in one of the following languages: Java, Ruby, Python, or Perl.
  • If you are taking this class at your place of work, verify with your network administrator that you can access ports 4822 and 8080. If those ports aren't open, please ask your network administrator to open them.

*PRIOR TO CLASS, YOU NEED TO SET UP YOUR VIRTUAL MACHINE. SEE THE FOLLOWING INSTRUCTIONS: https://www.dropbox.com/s/mrnnuayrw844ejt/Intro_to_Big_Data_For_Developers_VM_Instructions%20%282%29.pdf?dl=0

Recommended Preparation:

Introduction to Apache Hive

Using Spark in the Hadoop Ecosystem

Taming Big Data with MapReduce and Hadoop - Hands On!

What the #@)*$ is Big Data? A Holistic View of Data and Algorithms

How Big Data Changes Everything

Planning for Big Data

Resources for Further Learning:

On complexity in big data

Is my developer team ready for big data?

What will become of Big Data?

Why should or shouldn’t you become a Data Engineer?

About your instructor

  • Jesse Anderson is the Managing Director at Big Data Institute. He trains teams on Big Data at companies ranging from startups to the Fortune 100. This includes training on cutting-edge technology like Apache Kafka, Apache Hadoop, and Apache Spark. He has taught thousands of students the skills to become Data Engineers.

    He is widely regarded as an expert in the field and for his novel teaching practices. Jesse is published by O’Reilly and Pragmatic Programmers. He has been covered by prestigious media outlets such as The Wall Street Journal, CNN, BBC, NPR, Engadget, and Wired.

Schedule

The timeframes are only estimates and may vary according to how the class is progressing.

Schedule/Outline:

Introduction (10 minutes)

Lecture: Thinking in big data—the concepts and reasons for using big data solutions; how HDFS works (80 minutes)

Break (15 minutes)

Hands-on exercise: Use HDFS via the shell commands (30 minutes)

Lecture: MapReduce basics (60 minutes)

Break (60 minutes)

Demonstration: Coding with MapReduce and the MapReduce API (30 minutes)

Hands-on exercise: Write a MapReduce job (60 minutes)

Break (15 minutes)

Lecture: Apache Hive and its SQL-like language, HiveQL (60 minutes)

Hands-on exercise: Use the simple features of the HiveQL language to write a query (30 minutes)