Big Data & Hadoop for Beginners
Understand the tools and concepts within the big data ecosystem
In this class, we’ll focus on the information that developers and technical teams need to be successful with Big Data. This class will give you the tools to understand the concepts behind the most popular open source frameworks like Apache Hadoop and Apache Hive.
We’ll introduce Big Data’s extensive ecosystem. You’ll learn the specific job each of these technologies does and where it fits in the ecosystem.
What you'll learn and how you can apply it
- You’ll learn about Apache Hadoop and its two main components, HDFS and MapReduce.
- You'll learn how HDFS works and how you access it.
- You’ll learn how MapReduce works and its API.
- You’ll see how you can leverage your existing SQL skills with Apache Hive.
- You’ll learn how Hive’s language, HiveQL (HQL), is similar to and differs from SQL.
- You’ll learn what Big Data is and which tools are available.
- You’ll learn what a shuffle sort is and how it works.
- You’ll learn how Hive runs its SQL queries on Hadoop.
You’ll also get a brief introduction to the rest of the ecosystem:
- Apache Pig
- Apache Crunch
- Apache Beam
- Apache Oozie
- Apache Solr
- Apache HBase
- Apache Spark
- Apache Storm/Heron
- Apache Flink
- Apache NiFi
- Kafka Streams
- Apache Impala and Presto
- Apache Kafka
And you’ll be able to:
- Write a simple MapReduce program
- Understand how MapReduce works
- Write a simple query with Hive
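To give a flavor of what “understanding how MapReduce works” means, here is a tiny, self-contained Java simulation of the map → shuffle/sort → reduce flow for word count. This is an illustrative sketch only: the class name and input data are invented, and the real Hadoop API uses `Mapper` and `Reducer` classes distributed across a cluster rather than in-memory lists.

```java
import java.util.*;

// A single-process sketch of the MapReduce flow for word count.
// In real Hadoop, these three phases run distributed across a cluster.
public class WordCountSketch {

    // Map phase: emit a (word, 1) pair for every word in every input line.
    static List<Map.Entry<String, Integer>> map(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            for (String word : line.toLowerCase().split("\\s+")) {
                if (!word.isEmpty()) {
                    pairs.add(Map.entry(word, 1));
                }
            }
        }
        return pairs;
    }

    // Shuffle/sort phase: group all values by key, with keys in sorted order.
    static SortedMap<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        SortedMap<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return grouped;
    }

    // Reduce phase: sum the grouped counts for each word.
    static Map<String, Integer> reduce(SortedMap<String, List<Integer>> grouped) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        grouped.forEach((word, ones) -> {
            int sum = 0;
            for (int one : ones) sum += one;
            counts.put(word, sum);
        });
        return counts;
    }

    public static void main(String[] args) {
        List<String> input = List.of("big data is big", "data moves fast");
        System.out.println(reduce(shuffle(map(input))));
        // prints {big=2, data=2, fast=1, is=1, moves=1}
    }
}
```

The shuffle sort covered in the lecture is the middle step: it is what guarantees each reducer sees all the values for a given key, already grouped and ordered.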
This training course is for you because...
- You are a Software Engineer and need to write Hadoop MapReduce code and execute queries.
- You are a Software Architect and need to understand the Big Data ecosystem.
- You are a Business Analyst who needs to write more complex analyses with Hadoop MapReduce and Apache Hive.
- You are a Business Intelligence Analyst who needs to learn how to run complex analytics at scale.
- You are a Quality Assurance Engineer who needs to test Hadoop MapReduce code.
- You are a DBA who needs to understand Big Data and know how to run SQL queries with it.
- You are a Technical Manager who needs to understand the technical side of Big Data.
- All attendees will need a technical background. To complete the programming exercises, attendees will need to be able to program in one of the following languages: Java, Ruby, Python, or Perl.
- If you are taking this class at your place of work, verify with your network administrator that you can access ports 4822 and 8080. If those ports aren't open, please ask your network administrator to open them.
VIRTUAL MACHINE SETUP INSTRUCTIONS NEEDED PRIOR TO CLASS
Resources for Further Learning:
About your instructor
Jesse Anderson is the Managing Director at Big Data Institute. He provides Big Data training at companies ranging from startups to Fortune 100 firms, including training on cutting-edge technologies like Apache Kafka, Apache Hadoop, and Apache Spark. He has taught thousands of students the skills to become Data Engineers.
He is widely regarded as an expert in the field and for his novel teaching practices. Jesse has been published by O’Reilly and Pragmatic Programmers, and he has been covered in prestigious publications such as The Wall Street Journal, CNN, BBC, NPR, Engadget, and Wired.
The timeframes below are only estimates and may vary according to how the class is progressing.
Introduction (10 minutes)
Lecture: Thinking in big data—the concepts and reasons for using big data solutions; how HDFS works (80 minutes)
Break: 15 mins
Hands-on exercise: Use HDFS via the shell commands (30 minutes)
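As a preview of this exercise, the HDFS shell mirrors familiar Unix file commands. The commands below assume a running HDFS cluster, and the file and directory names are illustrative:

```shell
# List the root of the distributed filesystem
hdfs dfs -ls /

# Create a home-relative directory and copy a local file into HDFS
hdfs dfs -mkdir -p input
hdfs dfs -put data.txt input/

# Read the file back and check its space usage
hdfs dfs -cat input/data.txt
hdfs dfs -du -h input
```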
Lecture: MapReduce basics (60 minutes)
Break: 1 hour
Demonstration: Coding with MapReduce and the MapReduce API (30 minutes)
Hands-on exercise: Write a MapReduce job (60 minutes)
Break: 15 mins
Lecture: Apache Hive and its SQL-like language, HiveQL (60 minutes)
Hands-on exercise: Use the simple features of the HiveQL language to write a query (30 minutes)
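To illustrate HiveQL's SQL-like feel, a query of the kind this exercise covers might look like the sketch below. The table and column names are invented for illustration; Hive compiles the query into jobs that run on Hadoop:

```sql
-- Hypothetical table of web traffic stored in HDFS
CREATE TABLE page_views (user_id STRING, url STRING, view_time TIMESTAMP)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

-- Standard SQL aggregation works as expected in HiveQL
SELECT url, COUNT(*) AS views
FROM page_views
GROUP BY url
ORDER BY views DESC
LIMIT 10;
```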