Pro Microsoft HDInsight: Hadoop on Windows

Book Description

Pro Microsoft HDInsight is a complete guide to deploying and using Apache Hadoop on the Microsoft Windows Azure platform. The information in this book enables you to process enormous volumes of structured as well as unstructured data easily using HDInsight, Microsoft's own distribution of Apache Hadoop. Furthermore, the blend of Infrastructure as a Service (IaaS) and Platform as a Service (PaaS) offerings available through Windows Azure lets you take advantage of Hadoop's processing power without the worry of creating, configuring, maintaining, or managing your own cluster.

With data volumes growing explosively, the open source Apache Hadoop framework is gaining traction, and it benefits from a huge ecosystem that has risen around the core functionality of the Hadoop Distributed File System (HDFS™) and Hadoop MapReduce. Pro Microsoft HDInsight equips you with the knowledge, confidence, and technique to configure and manage this ecosystem on Windows Azure. The book is an excellent choice for anyone aspiring to be a data scientist or data engineer, putting you a step ahead in the data mining field.

  • Guides you through installation and configuration of an HDInsight cluster on Windows Azure

  • Provides clear examples of configuring and executing MapReduce jobs

  • Helps you consume data and diagnose errors from the Windows Azure HDInsight Service

What you'll learn

  • Create and manage HDInsight clusters on Windows Azure

  • Understand the different HDInsight services and configuration files

  • Develop and run MapReduce jobs using .NET and PowerShell

  • Consume data from client applications like Microsoft Excel and Power View

  • Monitor job executions and logs

  • Troubleshoot common problems

Who this book is for

Pro Microsoft HDInsight: Hadoop on Windows is an excellent choice for developers in business intelligence and predictive analysis who want an extra technological edge on the Microsoft Windows and Windows Azure platforms. The book is for people who love to slice and dice data and to identify trends and patterns that support creative, intelligent decision making.

Table of Contents

  1. Title Page
  2. Dedication
  3. Contents at a Glance
  4. Contents
  5. About the Author
  6. About the Technical Reviewers
  7. Acknowledgments
  8. Introduction
  9. CHAPTER 1: Introducing HDInsight
    1. What Is Big Data, and Why Now?
    2. How Is Big Data Different?
    3. Is Big Data the Right Solution for You?
    4. The Apache Hadoop Ecosystem
    5. Microsoft HDInsight: Hadoop on Windows
    6. Combining HDInsight with Your Business Processes
    7. Summary
  10. CHAPTER 2: Understanding Windows Azure HDInsight Service
    1. Microsoft’s Cloud-Computing Platform
    2. Windows Azure HDInsight Service
    3. Summary
  11. CHAPTER 3: Provisioning Your HDInsight Service Cluster
    1. Creating the Storage Account
    2. Creating a SQL Azure Database
    3. Deploying Your HDInsight Cluster
    4. Customizing Your Cluster Creation
    5. Configuring the Cluster User and Hive/Oozie Storage
    6. Choosing Your Storage Account
    7. Finishing the Cluster Creation
    8. Monitoring the Cluster
    9. Configuring the Cluster
    10. Summary
  12. CHAPTER 4: Automating HDInsight Cluster Provisioning
    1. Using the Hadoop .NET SDK
    2. Using the PowerShell cmdlets for HDInsight
    3. Command-Line Interface (CLI)
    4. Summary
  13. CHAPTER 5: Submitting Jobs to Your HDInsight Cluster
    1. Using the Hadoop .NET SDK
    2. Using PowerShell
    3. Using MRRunner
    4. Summary
  14. CHAPTER 6: Exploring the HDInsight Name Node
    1. Accessing the HDInsight Name Node
    2. Hadoop Command Line
    3. Hadoop Web Interfaces
    4. HDInsight Windows Services
    5. Installation Directory
    6. Summary
  15. CHAPTER 7: Using Windows Azure HDInsight Emulator
    1. Installing the Emulator
    2. Verifying the Installation
    3. Using the Emulator
    4. Future Directions
    5. Summary
  16. CHAPTER 8: Accessing HDInsight over Hive and ODBC
    1. Hive: The Hadoop Data Warehouse
    2. Working with Hive
    3. Hive Storage
    4. The Hive ODBC Driver
    5. Summary
  17. CHAPTER 9: Consuming HDInsight from Self-Service BI Tools
    1. PowerPivot Enhancements
    2. Creating a Stock Report
    3. Power View for Excel
    4. Power BI: The Future
    5. Summary
  18. CHAPTER 10: Integrating HDInsight with SQL Server Integration Services
    1. SSIS as an ETL Tool
    2. Creating the Project
    3. Creating the Data Flow
    4. Creating the Source Hive Connection
    5. Creating the Destination SQL Connection
    6. Creating the Hive Source Component
    7. Creating the SQL Destination Component
    8. Mapping the Columns
    9. Running the Package
    10. Summary
  19. CHAPTER 11: Logging in HDInsight
    1. Service Logs
    2. Hadoop log4j Log Files
    3. Log4j Framework
    4. Windows ODBC Tracing
    5. Logging Windows Azure Storage Blob Operations
    6. Logging in Windows Azure HDInsight Emulator
    7. Summary
  20. CHAPTER 12: Troubleshooting Cluster Deployments
    1. Cluster Creation
    2. Installer Logs
    3. Troubleshooting Visual Studio Deployments
    4. Troubleshooting PowerShell Deployments
    5. Summary
  21. CHAPTER 13: Troubleshooting Job Failures
    1. MapReduce Jobs
    2. Hive Jobs
    3. Pig Jobs
    4. Sqoop Jobs
    5. Windows Azure Storage Blob
    6. Connectivity Failures
    7. Summary
  22. Index