Learning Path: A Beginner's Guide to Architecting Big Data Applications

Video Description

Whether you’re a data engineer who needs to plan and implement a big data pipeline or a manager interested in learning how tools in the Hadoop technology stack address business goals, these videos will walk you through how to plan your big data solution. You’ll receive an introduction to the concepts of Apache Hadoop, and training on key components including Apache HBase, YARN, Cassandra, Kafka, and Spark.

Table of Contents

  1. Introduction
    1. Introduction And Course Overview 00:00:56
    2. About The Author 00:00:48
    3. Getting Started With A Hadoop Installation 00:01:32
  2. What Is Hadoop?
    1. What Is Hadoop? 00:04:56
    2. What Is HDFS? - Scalable Storage 00:02:02
    3. Understanding Block Storage 00:00:56
    4. Block Replication And Resilience 00:02:50
    5. HDFS Architecture - The Name Node And The Data Nodes 00:02:56
    6. Parallel Performance 00:01:12
    7. What Is YARN? - Scalable Compute 00:04:30
    8. YARN: Plug-In Processing Engines 00:02:01
    9. Overview Of MapReduce 00:06:13
    10. Using Different Languages 00:02:34
  3. Options For Data Input
    1. Importing Data 00:02:59
    2. The Hadoop Client 00:02:26
    3. Overview Of Sqoop 00:02:33
    4. Overview Of Flume 00:02:07
    5. Other Import Tools 00:02:55
  4. Hadoop Tools
    1. What Is Pig? 00:03:40
    2. What Is Hive? 00:04:35
    3. Comparing Hive To SQL 00:02:34
    4. Hive Architecture 00:02:25
    5. What Is HCatalog? 00:01:37
    6. Hive Interfaces 00:02:32
    7. Apache Storm 00:02:00
    8. Apache Spark 00:05:53
    9. Hadoop Security 00:01:46
    10. Overview Of Oozie 00:01:43
    11. Mahout 00:01:58
    12. HBase And Other Data Stores: HBase, Accumulo, Etc. 00:05:02
    13. Apache Kafka 00:01:21
    14. Cluster Management 00:02:32
  5. Conclusion
    1. Distributions And Where To Go From Here 00:03:42
    2. Conclusion 00:00:29
  6. Introduction
    1. Course Agenda And Instructor 00:02:50
    2. How To Access Your Working Files 00:01:15
  7. Core Hadoop Components
    1. Basic Overview Of Hadoop Core Components: HDFS 00:02:29
    2. Hadoop Core Components Overview 00:06:52
    3. What Is Map/Reduce? 00:08:52
  8. YARN: Components And Architecture
    1. Pre-YARN Architecture 00:06:20
    2. YARN Architecture And Daemons 00:08:22
  9. Scheduling, Running And Monitoring Applications In YARN
    1. Running Jobs In YARN 00:10:11
    2. YARN Parameters 00:05:01
    3. YARN Cluster Resource Allocation 00:02:42
    4. Failure Handling 00:04:18
    5. YARN Logs 00:09:29
    6. Hands On With YARN 00:13:22
  10. Conclusion
    1. Summary 00:04:25
  11. Introduction
    1. What Is HBase 00:04:13
    2. What To Expect 00:02:31
    3. About The Author 00:00:35
    4. How To Access Your Working Files 00:01:15
  12. Administration Basics
    1. HBase Deployment Architecture 00:04:36
    2. HBase Fault Tolerance 00:04:37
    3. Hardware Recommendations 00:06:57
    4. Software Recommendations 00:05:28
    5. HBase Deployment At Scale 00:08:13
    6. Installation With Cloudera Manager 00:05:46
    7. Basic Static Configuration 00:06:45
    8. Rolling Restarts And Upgrades 00:04:18
    9. Interacting With HBase 00:05:36
  13. Troubleshooting
    1. Troubleshooting Methodology 00:08:02
    2. Troubleshooting Distributed Clusters 00:07:58
    3. Administration From The Command Line 00:06:10
    4. Using The HBase UI 00:06:07
    5. Using The Metrics 00:04:18
    6. Using The Logs 00:06:27
  14. Tuning
    1. Basic HBase Tuning 00:01:22
    2. Generating Load And Load Test Tool 00:07:37
    3. Generating With YCSB 00:07:39
    4. Region Tuning 00:06:47
    5. Table Storage Tuning 00:07:40
    6. Memory Tuning 00:05:33
    7. Tuning With Failures 00:07:04
    8. Tuning For Modern Hardware 00:08:29
  15. Operations Continuity
    1. Operational Continuity 00:07:19
    2. Corruption: hbck 00:07:02
    3. Corruption: Other Tools 00:04:05
    4. Security 00:07:28
    5. Security Demo 00:12:05
    6. Backups: Snapshots 00:04:20
    7. Backups: Import / Export / Copy Table 00:06:32
    8. Cluster Replication 00:09:16
  16. Ecosystem
    1. HBase Proxy Servers, Thrift And Rest 00:03:43
    2. Hue 00:03:27
    3. HBase With Apache Phoenix 00:04:08
  17. Conclusion
    1. Wrap-Up And Thank You 00:03:20
  18. Introduction To Cassandra
    1. Introducing The Course 00:04:41
    2. Understanding What Cassandra Is 00:04:58
    3. Learning What Cassandra Is Being Used For 00:04:56
    4. Understanding The System Requirements 00:06:54
    5. How To Access Your Working Files 00:01:15
    6. Opening The Main Virtual Machine 00:02:53
    7. Pop Quiz - Intro to Cassandra 00:01:24
  19. Getting Started With The Architecture
    1. Understanding That Cassandra Is A Distributed Database 00:02:23
    2. Learning What Snitch Is For 00:03:53
    3. Learning What Gossip Is For 00:01:52
    4. Learning How Data Gets Distributed 00:05:35
    5. Learning About Replication 00:02:12
    6. Learning About Virtual Nodes 00:03:01
    7. Pop Quiz - Getting Started with Architecture 00:01:25
  20. Installing Cassandra
    1. Downloading Cassandra 00:02:48
    2. Ensuring Oracle Java 7 Is Installed 00:02:02
    3. Installing Cassandra 00:03:44
    4. Viewing The Main Configuration File 00:02:46
    5. Providing Cassandra With Permission To Directories 00:01:46
    6. Starting Cassandra 00:03:41
    7. Checking Status 00:04:00
    8. Accessing The Cassandra system.log File 00:02:06
    9. Pop Quiz - Installing Cassandra 00:01:28
  21. Communicating With Cassandra
    1. Understanding Ways To Communicate With Cassandra 00:03:47
    2. Using CQLSH 00:02:29
    3. Pop Quiz - Communicating with Cassandra 00:01:08
  22. Creating A Database
    1. Understanding A Cassandra Database 00:01:54
    2. Defining A Keyspace 00:04:57
    3. Deleting A Keyspace 00:00:52
    4. Pop Quiz - Creating a Database 00:01:53
    5. Lab: Create A Second Database 00:02:39
  23. Creating A Table
    1. Creating A Table 00:01:49
    2. Defining Columns And Data Types 00:02:48
    3. Defining A Primary Key 00:01:49
    4. Recognizing A Partition Key 00:02:44
    5. Specifying A Descending Clustering Order 00:03:02
    6. Pop Quiz - Creating a Table 00:01:54
    7. Lab: Create A Second Table 00:02:33
  24. Inserting Data
    1. Understanding Ways To Write Data 00:01:28
    2. Using The INSERT INTO Command 00:04:45
    3. Using The COPY Command 00:05:53
    4. How Data Is Stored In Cassandra 00:04:21
    5. How Data Is Stored On Disk 00:05:29
    6. Pop Quiz - Inserting Data 00:02:15
    7. Lab: Insert Data 00:09:10
  25. Modeling Data
    1. Understanding Data Modeling In Cassandra 00:01:21
    2. Using A WHERE Clause 00:04:17
    3. Understanding Secondary Indexes 00:02:18
    4. Creating A Secondary Index 00:01:38
    5. Defining A Composite Partition Key 00:09:34
    6. Pop Quiz - Modeling Data 00:03:34
  26. Creating An Application
    1. Understanding Cassandra Drivers 00:02:31
    2. Exploring The DataStax Java Driver 00:03:14
    3. Setting Up A Development Environment 00:04:04
    4. Creating An Application Page 00:04:51
    5. Acquiring The DataStax Java Driver Files 00:03:24
    6. Getting The DataStax Java Driver Files Through Maven 00:02:23
    7. Providing The DataStax Java Driver Files Manually 00:02:36
    8. Connecting To A Cassandra Cluster 00:03:39
    9. Executing A Query 00:07:47
    10. Displaying Query Results - Part 1 00:05:59
    11. Displaying Query Results - Part 2 00:07:20
    12. Using An MVC Pattern 00:04:59
    13. Pop Quiz - Creating an Application 00:02:50
    14. Lab: Create A Second Application - Part 1 00:05:20
    15. Lab: Create A Second Application - Part 2 00:09:49
    16. Lab: Create A Second Application - Part 3 00:03:08
  27. Updating And Deleting Data
    1. Updating Data 00:03:39
    2. Understanding How Updating Works 00:03:55
    3. Deleting Data 00:07:10
    4. Understanding Tombstones 00:07:18
    5. Using TTLs 00:05:09
    6. Updating A TTL 00:02:38
    7. Pop Quiz - Updating and Deleting Data 00:02:38
    8. Lab: Update And Delete Data 00:07:00
  28. Selecting Hardware
    1. Understanding Hardware Choices 00:00:30
    2. Understanding RAM And CPU Recommendations 00:02:45
    3. Selecting Storage 00:04:08
    4. Deploying In The Cloud 00:04:07
    5. Pop Quiz - Selecting Hardware 00:02:06
  29. Adding Nodes To A Cluster
    1. Understanding Cassandra Nodes 00:03:39
    2. Having A Network Connection - Part 1 00:05:35
    3. Having A Network Connection - Part 2 00:05:02
    4. Having A Network Connection - Part 3 00:04:46
    5. Specifying The IP Address Of A Node In Cassandra 00:04:12
    6. Specifying Seed Nodes 00:06:30
    7. Bootstrapping A Node 00:06:18
    8. Cleaning Up A Node 00:02:59
    9. Using cassandra-stress 00:10:33
    10. Pop Quiz - Adding Nodes to a Cluster 00:01:39
    11. Lab: Add A Third Node 00:10:42
  30. Monitoring A Cluster
    1. Understanding Cassandra Monitoring Tools 00:00:46
    2. Using Nodetool 00:04:54
    3. Using JConsole 00:03:24
    4. Learning About OpsCenter 00:03:24
    5. Pop Quiz - Monitoring a Cluster 00:01:49
  31. Repairing Nodes
    1. Understanding Repair 00:05:17
    2. Repairing Nodes 00:04:17
    3. Understanding Consistency - Part 1 00:06:26
    4. Understanding Consistency - Part 2 00:04:33
    5. Understanding Hinted Handoff 00:03:30
    6. Understanding Read Repair 00:01:58
    7. Pop Quiz - Repairing Nodes 00:03:30
    8. Lab: Repair Nodes For A Keyspace 00:05:45
  32. Removing A Node
    1. Understanding Removing A Node 00:00:54
    2. Decommissioning A Node 00:04:36
    3. Putting A Node Back Into Service 00:06:38
    4. Removing A Dead Node 00:06:42
    5. Pop Quiz - Removing a Node 00:04:10
    6. Lab: Put A Node Back Into Service 00:05:00
  33. Redefining A Cluster For Multiple Data Centers
    1. Redefining For Multiple Data Centers - Part 1 00:04:50
    2. Redefining For Multiple Data Centers - Part 2 00:05:59
    3. Changing Snitch Type 00:05:25
    4. Modifying cassandra-rackdc.properties 00:07:45
    5. Changing Replication Strategy - Part 1 00:05:55
    6. Changing Replication Strategy - Part 2 00:03:58
    7. Pop Quiz - Redefining a Cluster 00:02:30
  34. Resources For Further Learning
    1. Accessing Documentation 00:02:51
    2. Reading Blogs And Books 00:04:53
    3. Watching Video Recordings 00:04:05
    4. Posting Questions 00:04:10
    5. Attending Events 00:03:00
    6. Wrap Up 00:01:03
    7. The Case for Kafka 00:11:23
    8. The Basics 00:09:10
    9. Setting up a Kafka Cluster 00:15:30
    10. Writing a Kafka Producer 00:14:33
    11. Writing a Kafka Consumer 00:16:34
    12. Using Kafka from Python 00:08:03
    13. Troubleshooting Kafka 00:29:29
    14. Integrating Kafka and Hadoop with Flafka 00:26:06
    15. Kafka Availability and Consistency 00:22:38
    16. Kafka Ecosystem 00:13:13
    17. Future of Kafka 00:08:53
    18. Pre-Flight Check 00:13:08
    19. Spark Deconstructed 00:14:31
    20. A Brief History 00:23:28
    21. Simple Spark Apps 00:25:07
    22. Spark Essentials 00:35:18
    23. Spark Examples 00:21:55
    24. Unifying the Pieces - Spark SQL 00:24:07
    25. Unifying the Pieces - Spark Streaming 00:14:48
    26. Unifying the Pieces - MLlib and GraphX 00:20:00
    27. Unified Workflows Demo 00:22:35
    28. The Full SDLC 00:04:01
    29. Developer Certification 00:06:10
    30. Resources 00:04:44
    31. Introduction - Why DataFrames? 00:02:28
    32. ETL to Prepare the Data from Capital Bikeshare 00:02:46
    33. Create a DataFrame, Explore using SQL 00:02:47
    34. Data Preparation for Machine Learning Models 00:05:33
    35. Build a Classifier Using Naive Bayes 00:04:43
    36. Build a Classifier Using Decision Trees 00:02:26
    37. Build a Classifier Using Random Forests 00:02:20
    38. Use a DataFrame to Compare Models 00:04:15
    39. Parquet as a Best Practice with DataFrames 00:00:58
    40. How to Store a DataFrame with Parquet 00:03:25
    41. How to Read a DataFrame Back in From Parquet 00:02:57
    42. Use SQL to Estimate Route Durations 00:01:41
    43. Data Preparation for GraphX - Model Route Costs 00:04:43
    44. Use PageRank to Rank Popular Stations 00:03:14
    45. Optimize Routes to Columbus Circle 00:03:43
    46. Compare Results with Google Maps 00:01:58
    47. Analyze a Popular Tourist Route 00:02:30
    48. Examples of How to Use DataFrames in Python 00:02:57
    49. Summary - The New DataFrames Features in Spark 00:01:03
  35. Introduction
    1. About Alluxio And The Course 00:03:38
    2. About The Author 00:01:24
  36. Using Alluxio Locally
    1. Downloading Alluxio 00:03:03
    2. Starting The System Locally 00:05:09
    3. Interacting Via The Shell 00:02:45
    4. Browsing The Web UI 00:03:53
  37. Examples With Alluxio
    1. Setting Up Alluxio With Spark And S3 00:06:15
    2. Running Spark on Alluxio with S3 00:05:29
    3. Using Alluxio With Unified Namespace 00:06:05
  38. Deploying Alluxio On A Cluster
    1. Deploying Alluxio In AWS 00:07:49
  39. Conclusion
    1. Contributing To The Project And Conclusion 00:03:52