HDInsight Essentials - Second Edition

Book Description

Learn how to build and deploy a modern big data architecture to empower your business

In Detail

Traditional relational databases are ineffective at dealing with the challenges presented by Big Data. A Hadoop-based architecture offers a radical solution, as it is designed specifically to handle huge sets of unstructured data.

This book takes you through the journey of building a modern data lake architecture using HDInsight, a Hadoop-based service that allows you to manage high-volume, high-velocity data in the Microsoft Azure Cloud. Featuring a wealth of practical examples, the book offers tips and techniques for provisioning your own HDInsight cluster to ingest, organize, transform, and analyze data.

As you are guided through HDInsight, you'll explore the wider Hadoop ecosystem with plenty of working examples of Hadoop technologies, including Hive, Pig, MapReduce, HBase, and Storm, and of analytics solutions, including Excel Power Query, Power Map, and Power BI.

What You Will Learn

  • Explore core features of Hadoop, including HDFS2 and YARN, the new resource manager for Hadoop

  • Build your HDInsight cluster in minutes and learn how to administer it using Azure PowerShell

  • Discover what's new in Hadoop 2.X and the reference architecture for a modern data lake based on Hadoop

  • Find out more about a data lake vision and its core capabilities

  • Ingest and organize your data into HDInsight

  • Utilize open source software, including Hive, Pig, and MapReduce, to transform data and make it available for decision makers

  • Get to grips with architectural considerations for scalability, maintainability, and security

Table of Contents

    1. HDInsight Essentials Second Edition
      1. Table of Contents
      2. HDInsight Essentials Second Edition
      3. Credits
      4. About the Author
      5. About the Reviewers
        1. Support files, eBooks, discount offers, and more
          1. Why subscribe?
          2. Free access for Packt account holders
          3. Instant updates on new Packt books
      7. Preface
        1. What this book covers
        2. What you need for this book
        3. Who this book is for
        4. Conventions
        5. Reader feedback
        6. Customer support
          1. Downloading the example code
          2. Errata
          3. Piracy
          4. Questions
      8. 1. Hadoop and HDInsight in a Heartbeat
        1. Data is everywhere
          1. Business value of big data
        2. Hadoop concepts
          1. Brief history of Hadoop
          2. Core components
          3. Hadoop cluster layout
          4. HDFS overview
            1. Writing a file to HDFS
            2. Reading a file from HDFS
            3. HDFS basic commands
          5. YARN overview
            1. YARN application life cycle
            2. YARN workloads
        3. Hadoop distributions
        4. HDInsight overview
          1. HDInsight and Hadoop relationship
        5. Hadoop on Windows deployment options
          1. Microsoft Azure HDInsight Service
          2. HDInsight Emulator
          3. Hortonworks Data Platform (HDP) for Windows
        6. Summary
      9. 2. Enterprise Data Lake using HDInsight
        1. Enterprise Data Warehouse architecture
          1. Source systems
          2. Data warehouse
            1. Storage
            2. Processing
          3. User access
          4. Provisioning and monitoring
          5. Data governance and security
          6. Pain points of EDW
        2. The next generation Hadoop-based Enterprise data architecture
          1. Source systems
          2. Data Lake
            1. Storage
            2. Processing
          3. User access
            1. Provisioning and monitoring
            2. Data governance, security, and metadata
        3. Journey to your Data Lake dream
          1. Ingestion and organization
          2. Transformation (rules driven)
          3. Access, analyze, and report
        4. Tools and technology for Hadoop ecosystem
        5. Use case powered by Microsoft HDInsight
          1. Problem statement
          2. Solution
            1. Source systems
            2. Storage
            3. Processing
            4. User access
          3. Benefits
        6. Summary
      10. 3. HDInsight Service on Azure
        1. Registering for an Azure account
        2. Azure storage
        3. Provisioning an HDInsight cluster
          1. Cluster topology
          2. Provisioning using Azure PowerShell
            1. Creating a storage container
            2. Provisioning a new HDInsight cluster
        4. HDInsight management dashboard
          1. Dashboard
          2. Monitor
          3. Configuration
        5. Exploring clusters using the remote desktop
          1. Running a sample MapReduce
        6. Deleting the cluster
        7. HDInsight Emulator for the development
          1. Installing HDInsight Emulator
          2. Installation verification
          3. Using HDInsight Emulator
        8. Summary
      11. 4. Administering Your HDInsight Cluster
        1. Monitoring cluster health
        2. Name Node status
          1. The Name Node Overview page
          2. Datanode Status
          3. Utilities and logs
        3. Hadoop Service Availability
        4. YARN Application Status
        5. Azure storage management
          1. Configuring your storage account
          2. Monitoring your storage account
          3. Managing access keys
          4. Deleting your storage account
        6. Azure PowerShell
          1. Access Azure Blob storage using Azure PowerShell
        7. Summary
      12. 5. Ingest and Organize Data Lake
        1. End-to-end Data Lake solution
        2. Ingesting to Data Lake using HDFS command
          1. Connecting to a Hadoop client
          2. Getting your files on the local storage
          3. Transferring to HDFS
        3. Loading data to Azure Blob storage using Azure PowerShell
        4. Loading files to Data Lake using GUI tools
          1. Storage access keys
          2. Storage tools
          3. CloudXplorer
            1. Key benefits
            2. Registering your storage account
            3. Uploading files to your Blob storage
        5. Using Sqoop to move data from RDBMS to Data Lake
          1. Key benefits
          2. Two modes of using Sqoop
          3. Using Sqoop to import data (SQL to Hadoop)
        6. Organizing your Data Lake in HDFS
        7. Managing file metadata using HCatalog
          1. Key benefits
          2. Using HCatalog Command Line to create tables
        8. Summary
      13. 6. Transform Data in the Data Lake
        1. Transformation overview
        2. Tools for transforming data in Data Lake
          1. HCatalog
          2. Persisting HCatalog metastore in a SQL database
          3. Apache Hive
            1. Hive architecture
            2. Starting Hive in HDInsight
            3. Basic Hive commands
          4. Apache Pig
            1. Pig architecture
            2. Starting Pig in HDInsight node
            3. Basic Pig commands
          5. Pig or Hive
          6. MapReduce
            1. The mapper code
            2. The reducer code
            3. The driver code
            4. Executing MapReduce on HDInsight
          7. Azure PowerShell for execution of Hadoop jobs
        3. Transformation for the OTP project
          1. Cleaning data using Pig
          2. Executing Pig script
          3. Registering a refined and aggregate table using Hive
          4. Executing Hive script
          5. Reviewing results
        4. Other tools used for transformation
          1. Oozie
          2. Spark
        5. Summary
      14. 7. Analyze and Report from Data Lake
        1. Data access overview
        2. Analysis using Excel and Microsoft Hive ODBC driver
          1. Prerequisites
          2. Step 1 – installing the Microsoft Hive ODBC driver
          3. Step 2 – creating Hive ODBC Data Source
          4. Step 3 – importing data to Excel
        3. Analysis using Excel Power Query
          1. Prerequisites
          2. Step 1 – installing the Microsoft Power Query for Excel
          3. Step 2 – importing Azure Blob storage data into Excel
          4. Step 3 – analyzing data using Excel
        4. Other BI features in Excel
          1. PowerPivot
          2. Power View and Power Map
          3. Step 1 – importing Azure Blob storage data into Excel
          4. Step 2 – launch map view
          5. Step 3 – configure the map
          6. Power BI Catalog
        5. Ad hoc analysis using Hive
        6. Other alternatives for analysis
          1. RHadoop
          2. Apache Giraph
          3. Apache Mahout
          4. Azure Machine Learning
        7. Summary
      15. 8. HDInsight 3.1 New Features
        1. HBase
          1. HBase positioning in Data Lake and use cases
          2. Provisioning HDInsight HBase cluster
          3. Creating a sample HBase schema
            1. Designing the airline on-time performance table
            2. Connecting to HBase using the HBase shell
            3. Creating an HBase table
            4. Loading data to the HBase table
            5. Querying data from the HBase table
          4. HBase additional information
        2. Storm
          1. Storm positioning in Data Lake
          2. Storm key concepts
          3. Provisioning HDInsight Storm cluster
          4. Running a sample Storm topology
            1. Connecting to Storm using Storm shell
            2. Running the Storm Wordcount topology
            3. Monitoring status of the Wordcount topology
          5. Additional information on Storm
        3. Apache Tez
        4. Summary
      16. 9. Strategy for a Successful Data Lake Implementation
        1. Challenges on building a production Data Lake
        2. The success path for a production Data Lake
          1. Identifying the big data problem
          2. Proof of technology for Data Lake
          3. Form a Data Lake Center of Excellence
            1. Executive sponsors
            2. Data Lake consumers
            3. Development
            4. Operations and infrastructure
        3. Architectural considerations
          1. Extensible and modular
          2. Metadata-driven solution
          3. Integration strategy
          4. Security
        4. Online resources
        5. Summary
      17. Index