Sams Teach Yourself: Big Data Analytics with Microsoft HDInsight® in 24 Hours

Book Description

NOTE: This is a Safari Enhanced Edition. Hour 25, "Getting Started with Apache HBase on HDInsight," and Hour 26, "Integration of Enterprise Data Warehouse with Hadoop and the Microsoft Analytics Platform System," are available exclusively to Safari subscribers.

In just 24 lessons of one hour or less, Sams Teach Yourself Big Data Analytics with Microsoft HDInsight in 24 Hours helps you leverage Hadoop’s power on a flexible, scalable cloud platform using Microsoft’s newest business intelligence, visualization, and productivity tools.

This book’s straightforward, step-by-step approach shows you how to provision, configure, monitor, and troubleshoot HDInsight and use Hadoop cloud services to solve real analytics problems. You’ll gain more of Hadoop’s benefits, with less complexity—even if you’re completely new to Big Data analytics. Every lesson builds on what you’ve already learned, giving you a rock-solid foundation for real-world success.

Practical, hands-on examples show you how to apply what you learn

Quizzes and exercises help you test your knowledge and stretch your skills

Notes and tips point out shortcuts and solutions

Learn how to…

· Master core Big Data and NoSQL concepts, value propositions, and use cases

· Work with key Hadoop features, such as HDFS2 and YARN

· Quickly install, configure, and monitor Hadoop (HDInsight) clusters in the cloud

· Automate provisioning, customize clusters, install additional Hadoop projects, and administer clusters

· Integrate, analyze, and report with Microsoft BI and Power BI

· Automate workflows for data transformation, integration, and other tasks

· Use Apache HBase on HDInsight

· Use Sqoop or SSIS to move data to or from HDInsight

· Perform R-based statistical computing on HDInsight datasets

· Accelerate analytics with Apache Spark

· Run real-time analytics on high-velocity data streams

· Write MapReduce, Hive, and Pig programs

Register your book for convenient access to downloads, updates, and corrections as they become available.

Table of Contents

  1. About This E-Book
  2. Title Page
  3. Copyright Page
  4. Contents at a Glance
  5. Table of Contents
  6. About the Authors
  7. Dedications
  8. Acknowledgments
  9. We Want to Hear from You!
  10. Reader Services
  11. Introduction
    1. Who Should Read This Book
    2. How This Book Is Organized
    3. Conventions Used in This Book
      1. Try It Yourself
    4. System Requirements
  12. Part I: Understanding Big Data, Hadoop 1.0, and 2.0
    1. Hour 1. Introduction of Big Data, NoSQL, and Business Value Proposition
      1. Types of Analysis
      2. Types of Data
        1. Structured Data
        2. Unstructured Data
        3. Semi-Structured Data
      3. Big Data
        1. Volume Characteristics of Big Data
        2. Variety Characteristics of Big Data
        3. Velocity Characteristics of Big Data
        4. What Big Data Is Not
      4. Managing Big Data
        1. More Data, More Accurate Models
        2. More—and Cheaper—Computing Power and Storage
        3. Increased Awareness of the Competition and a Means to Proactively Win Over Competitors
        4. Availability of New Tools and Technologies to Process and Manage Big Data
      5. NoSQL Systems
        1. NoSQL Versus RDBMS
        2. Major Types of NoSQL Technologies
        3. Benefits of Using NoSQL Systems
        4. Limitations of NoSQL Systems
      6. Big Data, NoSQL Systems, and the Business Value Proposition
      7. Application of Big Data and Big Data Solutions
      8. Summary
      9. Q&A
    2. Hour 2. Introduction to Hadoop, Its Architecture, Ecosystem, and Microsoft Offerings
      1. What Is Apache Hadoop?
      2. Architecture of Hadoop and Hadoop Ecosystems
        1. Hadoop Distributed File System
        2. MapReduce
        3. Hadoop Ecosystems
      3. What’s New in Hadoop 2.0
        1. Single Point of Failure
        2. Limited to Running MapReduce Jobs on HDFS
        3. Low Computing Resource Utilization
        4. Horizontal Scaling Performance Issue
        5. Overly Crowded JobTracker
      4. Architecture of Hadoop 2.0
        1. HDFS High Availability
        2. HDFS Federation
        3. HDFS Snapshot
      5. Tools and Technologies Needed with Big Data Analytics
        1. Data Acquisition
        2. Data Storage
        3. Data Analysis
        4. Data Visualization
        5. Data Management
        6. Development and Monitoring Tools
      6. Major Players and Vendors for Hadoop
        1. Cloudera
        2. Hortonworks
        3. MapR
        4. Amazon
        5. Microsoft
      7. Deployment Options for Microsoft Big Data Solutions
        1. On-Premises
        2. Cloud
      8. Summary
      9. Q&A
    3. Hour 3. Hadoop Distributed File System Versions 1.0 and 2.0
      1. Introduction to HDFS
      2. HDFS Architecture
        1. File Split in HDFS
        2. Block Placement and Replication in HDFS
        3. Writing to HDFS
        4. Reading from HDFS
        5. Handling Failures
        6. Delete Files from HDFS to Decrease the Replication Factor
      3. Rack Awareness
        1. Making Clusters Rack Aware
      4. WebHDFS
      5. Accessing and Managing HDFS Data
        1. HDFS Command-Line Interface
        2. Using MapReduce, Hive, Pig, or Sqoop
      6. What’s New in HDFS 2.0
        1. HDFS High Availability
        2. HDFS Federation
        3. HDFS Snapshot
      7. Summary
      8. Q&A
        1. Quiz
        2. Answers
    4. Hour 4. The MapReduce Job Framework and Job Execution Pipeline
      1. Introduction to MapReduce
      2. MapReduce Architecture
        1. MapReduce Job Request and Response Flow
        2. TaskTracker and Data Node Co-location
      3. MapReduce Job Execution Flow
        1. Multiple Input and Output Format
        2. Mapper
        3. Partitioner
        4. Reducer
        5. Combiner
        6. Driver
        7. Tool Interface
        8. Context Object
      4. Summary
      5. Q&A
        1. Quiz
        2. Answers
    5. Hour 5. MapReduce—Advanced Concepts and YARN
      1. DistributedCache
      2. Hadoop Streaming
      3. MapReduce Joins
        1. Map-Side Join
        2. Reduce-Side Join
      4. Bloom Filter
      5. Performance Improvement
        1. Use of Compression
        2. Reusing Java Virtual Machine
        3. MapReduce Job Scheduling
        4. Fair Scheduler
        5. Capacity Scheduler
      6. Handling Failures
        1. JobTracker Failure
        2. TaskTracker Failure
        3. Task Failure
        4. Speculative Execution
        5. Handling Bad Records
      7. Counter
      8. YARN
        1. Different Components of YARN
        2. Node Manager
        3. Container
        4. Job Execution Flow in YARN
      9. Uber-Tasking Optimization
      10. Failures in YARN
        1. Task Failure
        2. Application Master Failure
        3. Node Manager Failure
        4. Resource Manager Failure
      11. Resource Manager High Availability and Automatic Failover in YARN
        1. How to Reach an Active Resource Manager
      12. Summary
      13. Q&A
        1. Quiz
        2. Answers
  13. Part II: Getting Started with HDInsight and Understanding Its Different Components
    1. Hour 6. Getting Started with HDInsight, Provisioning Your HDInsight Service Cluster, and Automating HDInsight Cluster Provisioning
      1. Introduction to Microsoft Azure
        1. Azure Storage Service
      2. Understanding HDInsight Service
        1. HDInsight Cluster Deployment
      3. Provisioning HDInsight on the Azure Management Portal
        1. Enabling a Remote Desktop Connection via the Remote Desktop Protocol
        2. Verifying HDInsight Setup
      4. Automating HDInsight Provisioning with PowerShell
        1. Prerequisites
        2. Provisioning HDInsight Cluster
        3. Verifying HDInsight Setup with PowerShell
      5. Managing and Monitoring HDInsight Cluster and Job Execution
      6. Summary
      7. Q&A
      8. Exercise
    2. Hour 7. Exploring Typical Components of HDFS Cluster
      1. HDFS Cluster Components
        1. Understanding Name Node Functionality
        2. Why the Secondary Name Node Is Not a Standby Node
        3. Standby Name Node
      2. HDInsight Cluster Architecture
      3. High Availability in HDInsight
        1. HA Based on Quorum-Based Storage
        2. Failover Detection Using ZooKeeper
      4. Summary
      5. Q&A
        1. Quiz
        2. Answers
    3. Hour 8. Storing Data in Microsoft Azure Storage Blob
      1. Understanding Storage in Microsoft Azure
      2. Benefits of Azure Storage Blob over HDFS
      3. Azure Storage Explorer Tools
        1. Azure Storage Explorer
        2. AZCopy
        3. Azure PowerShell
        4. Hadoop Command Line
        5. HDInsight Storage Architecture Details
        6. Configuring the Default File System
        7. Understanding the Impact of Blob Storage on Performance and Data Locality
      4. Summary
      5. Q&A
        1. Quiz
        2. Answers
    4. Hour 9. Working with Microsoft Azure HDInsight Emulator
      1. Getting Started with HDInsight Emulator
        1. Setting Up Microsoft HDInsight Emulator
      2. Setting Up Microsoft Azure Emulator for Storage
        1. Setting Up Microsoft Storage Emulator
      3. Summary
      4. Q&A
        1. Quiz
        2. Answers
  14. Part III: Programming MapReduce and HDInsight Script Action
    1. Hour 10. Programming MapReduce Jobs
      1. MapReduce Hello World!
        1. Running a Java MapReduce Program on HDInsight Emulator
      2. Analyzing Flight Delays with MapReduce
      3. Serialization Frameworks for Hadoop
        1. Avro
      4. Hadoop Streaming
      5. Summary
      6. Q&A
        1. Quiz
        2. Answers
    2. Hour 11. Customizing the HDInsight Cluster with Script Action
      1. Identifying the Need for Cluster Customization
      2. Developing Script Action
        1. Using the HDInsightUtilities Module
      3. Consuming Script Action
        1. Using Script Action with the Azure Management Portal
        2. Using Script Action with PowerShell
        3. Using Script Action with HDInsight .NET SDK
      4. Running a Giraph Job on a Customized HDInsight Cluster
      5. Testing Script Action with HDInsight Emulator
      6. Summary
      7. Q&A
        1. Quiz
        2. Answers
  15. Part IV: Querying and Processing Big Data in HDInsight
    1. Hour 12. Getting Started with Apache Hive and Apache Tez in HDInsight
      1. Introduction to Apache Hive
      2. Getting Started with Apache Hive in HDInsight
        1. Using the Hive Command-Line Interface
        2. Using PowerShell Scripting
        3. Using the Cluster Dashboard
      3. Azure HDInsight Tools for Visual Studio
        1. Connecting to HDInsight Cluster from Visual Studio
        2. Viewing Existing Table Properties and Data
        3. Viewing Hive Jobs on HDInsight Cluster
        4. Creating New Tables in Hive
        5. Writing Hive Queries
        6. Creating a Hive Application
      4. Programmatically Using the HDInsight .NET SDK
      5. Introduction to Apache Tez
        1. Using the Apache Tez Engine with Hive on HDInsight
      6. Summary
      7. Q&A
      8. Exercise
    2. Hour 13. Programming with Apache Hive, Apache Tez in HDInsight, and Apache HCatalog
      1. Programming with Hive in HDInsight
        1. Running Examples on HDInsight Emulator
        2. Comparison with RDBMS Databases
        3. Database or Schema
      2. Using Tables in Hive
        1. Internal Table
        2. External Table
        3. Internal and External Tables
        4. Supported Data Types for Columns in Hive Tables
        5. Other Clauses Used When Creating a Table in Hive
      3. Serialization and Deserialization
        1. CREATE TABLE AS SELECT Command
        2. CREATE TABLE LIKE Command
        3. Temporary Table
        4. Creating Table Views
      4. Data Load Processes for Hive Tables
        1. Data Manipulation Language
        2. Built-in Functions in Hive
      5. Querying Data from Hive Tables
        1. Writing Data Analysis Queries
        2. Partition Switching or Swapping
        3. Dynamic Partition Insert
        4. Creating Datasets for Analysis
        5. Data Analysis of Timely Departure Percentage, Based on Airline
        6. Data Analysis of Cancelled Flights, Based on Cancellation Reason
      6. Indexing in Hive
      7. Apache Tez in Action
      8. Apache HCatalog
      9. Summary
      10. Q&A
      11. Exercise
    3. Hour 14. Consuming HDInsight Data from Microsoft BI Tools over Hive ODBC Driver: Part 1
      1. Introduction to Hive ODBC Driver
        1. 32-Bit Versus 64-Bit Hive ODBC Driver
        2. Setting Up the Hive ODBC Driver
        3. Configuring the 32-Bit Driver
      2. Introduction to Microsoft Power BI
      3. Accessing Hive Data from Microsoft Excel
      4. Summary
      5. Q&A
    4. Hour 15. Consuming HDInsight Data from Microsoft BI Tools over Hive ODBC Driver: Part 2
      1. Accessing Hive Data from PowerPivot
        1. Reporting and Data Visualization with PowerPivot
        2. Reporting and Data Visualization with Excel
        3. Reporting and Data Visualization with Power View
        4. Reporting and Data Visualization with Power Map
      2. Accessing Hive Data from SQL Server
        1. Accessing Data from SQL Server Analysis Services
        2. Accessing Data from SQL Server Reporting Services
      3. Accessing HDInsight Data from Power Query
      4. Summary
      5. Q&A
      6. Exercise
    5. Hour 16. Integrating HDInsight with SQL Server Integration Services
      1. The Need for Data Movement
      2. Introduction to SSIS
      3. Analyzing On-time Flight Departure with SSIS
        1. Scenario Prerequisites
        2. Package Variables
        3. Setting Up Azure PowerShell for Automation
      4. Provisioning HDInsight Cluster
        1. Executing Hive Query
        2. Loading Query Results to a SQL Azure Table
        3. Executing the Package
      5. Summary
      6. Q&A
        1. Quiz
        2. Answers
    6. Hour 17. Using Pig for Data Processing
      1. Introduction to Pig Latin
      2. Using Pig to Count Cancelled Flights
        1. Uploading Data to an HDInsight Cluster for Processing
        2. Defining Pig Relations
        3. Filtering Pig Relations
        4. Grouping Records by Cancellation Code
        5. Summarizing Cancelled Flights by Reason
        6. Retrieving the Cancellation Description by Joining Relations
        7. Saving Results to the File System
      3. Using HCatalog in a Pig Latin Script
        1. Specifying Parallelism in Pig Latin
      4. Submitting Pig Jobs with PowerShell
        1. Adding Azure Subscription
        2. Creating a Pig Job Definition
        3. Submitting a Pig Job for Execution
        4. Getting the Job Output
      5. Summary
      6. Q&A
        1. Quiz
        2. Answers
    7. Hour 18. Using Sqoop for Data Movement Between RDBMS and HDInsight
      1. What Is Sqoop?
        1. Importing Data to HDInsight Clusters
        2. Importing to Hive
        3. Exporting Data from HDFS
        4. Understanding the Export Process
      2. Using Sqoop Import and Export Commands
      3. Using Sqoop with PowerShell
      4. Summary
      5. Q&A
        1. Quiz
        2. Answers
  16. Part V: Managing Workflow and Performing Statistical Computing
    1. Hour 19. Using Oozie Workflows and Job Orchestration with HDInsight
      1. Introduction to Oozie
        1. Oozie Workflow
      2. Determining On-time Flight Departure Percentage with Oozie
        1. Scenario Prerequisites
        2. Creating an Oozie Workflow
        3. Executing the Workflow
        4. Monitoring Job Status
        5. Querying the Results
      3. Submitting an Oozie Workflow with HDInsight .NET SDK
      4. Coordinating Workflows with Oozie
      5. Oozie Compared to SSIS
      6. Summary
      7. Q&A
        1. Quiz
        2. Answers
    2. Hour 20. Performing Statistical Computing with R
      1. Introduction to R
        1. Installing R on Windows
        2. Loading External Data
        3. Performing Rudimentary Data Analysis
      2. Integrating R with Hadoop
      3. Enabling R on HDInsight
        1. Installing R on HDInsight
        2. Using R with HDInsight
      4. Summary
      5. Q&A
        1. Quiz
        2. Answers
  17. Part VI: Performing Interactive Analytics and Machine Learning
    1. Hour 21. Performing Big Data Analytics with Spark
      1. Introduction to Spark
        1. Installing Spark on HDInsight
      2. Spark Programming Model
        1. Log Mining with the Spark Shell
      3. Blending SQL Querying with Functional Programs
        1. Hive Compared to Spark SQL
        2. Using SQL Blended with Functional Code to Analyze Crime Data
      4. Summary
      5. Q&A
        1. Quiz
        2. Answers
    2. Hour 22. Microsoft Azure Machine Learning
      1. History of Traditional Machine Learning
      2. Introduction to Azure ML
        1. Benefits of Azure ML
      3. Azure ML Workspace
        1. Azure ML Studio
      4. Processes to Build Azure ML Solutions
      5. Getting Started with Azure ML
        1. Retrieving Data into Azure ML Modules
        2. Using the Descriptive Statistics Module
      6. Creating Predictive Models with Azure ML
      7. Publishing Azure ML Models as Web Services
      8. Summary
      9. Q&A
      10. Exercise
  18. Part VII: Performing Real-time Analytics
    1. Hour 23. Performing Stream Analytics with Storm
      1. Introduction to Storm
        1. Understanding the Storm Architecture
      2. Using SCP.NET to Develop Storm Solutions
      3. Analyzing Speed Limit Violation Incidents with Storm
        1. Creating the Storm Topology
        2. Creating the SQL Azure Table to Store Violation Counts
        3. Submitting the Topology to the HDInsight Storm Cluster
      4. Summary
      5. Q&A
        1. Quiz
        2. Answers
    2. Hour 24. Introduction to Apache HBase on HDInsight
      1. Introduction to Apache HBase
        1. When to Use HBase
      2. HBase Architecture
        1. Creating HBase Tables
        2. Writing Data to HBase Tables
        3. Reading Data from HBase Tables
        4. Data Distribution and Storage
        5. Compaction of Data
      3. Creating HDInsight Cluster with HBase
        1. Using the Azure Management Portal
        2. Using PowerShell Scripting
        3. Verifying the Created HDInsight with HBase Cluster
      4. Summary
      5. Q&A
  19. Part VIII: Bonus Chapters
    1. Hour 25. Getting Started with Apache HBase on HDInsight
      1. Working with HBase Tables
        1. Table Management Commands
        2. Data Manipulation Commands
        3. Reading Data from HBase Tables
        4. General HBase Commands
      2. Programmatically Accessing HBase Using C#
      3. HBase Versus Hive
        1. Differences in Hive and HBase Tables
      4. Using Apache Phoenix with HBase Tables
      5. Summary
      6. Q&A
    2. Hour 26. Integration of Enterprise Data Warehouse with Hadoop and the Microsoft Analytics Platform System
      1. Integrating Enterprise Data Warehouse with the Hadoop World
      2. Microsoft Analytics Platform System
        1. HDInsight
        2. PolyBase Is a Game Changer
        3. SQL Server Parallel Data Warehouse
      3. How SQL Server PDW Works
        1. Database in SQL Server PDW
        2. Replicated Table
        3. Distributed Table
        4. Query Processing
        5. Data Loading
        6. The Importance of Statistics and How They Work in PDW
        7. Clustered Column-Store Index in SQL Server PDW
      4. Integrated PolyBase Queries Across SQL Server PDW and Hadoop
        1. Workload Management
        2. Monitoring and Managing Microsoft APS Appliances
      5. Summary
      6. Q&A
        1. Exercises
  20. Index
  21. Code Snippets