Data Lake Development with Big Data

Book Description

Explore architectural approaches to building Data Lakes that ingest, index, manage, and analyze massive amounts of data using Big Data technologies

About This Book

  • Comprehend the intricacies of architecting a Data Lake and build a data strategy around your current data architecture

  • Efficiently manage vast amounts of data and deliver it to multiple applications and systems with a high degree of performance and scalability

  • Packed with industry best practices and use-case scenarios to get you up and running

Who This Book Is For

This book is for architects and senior managers who are responsible for building a strategy around their current data architecture. It helps them identify the need for a Data Lake implementation in an enterprise context. Readers will need a good knowledge of master data management and information lifecycle management, and experience with Big Data technologies.

What You Will Learn

  • Identify the need for a Data Lake in your enterprise context and learn to architect a Data Lake

  • Learn to build various tiers of a Data Lake, such as data intake, management, consumption, and governance, with a focus on practical implementation scenarios

  • Find out the key considerations for building each tier of the Data Lake

  • Understand Hadoop-oriented data transfer mechanisms for ingesting data in batch, micro-batch, and real-time modes

  • Explore various data integration needs and learn how to perform data enrichment and data transformations using Big Data technologies

  • Enable data discovery on the Data Lake so that users can find the data they need

  • Discover how data is packaged and provisioned for consumption

  • Comprehend the importance of including data governance disciplines while building a Data Lake

In Detail

A Data Lake is a highly scalable platform for storing huge volumes of multistructured data from disparate sources with centralized data management services. This book examines the potential of Data Lakes and explores architectural approaches to building Data Lakes that ingest, index, manage, and analyze massive amounts of data using batch and real-time processing frameworks. It guides you through building a Data Lake that is managed by Hadoop and accessed as required by other Big Data applications.

Data Lakes can be viewed as having three capabilities: intake, management, and consumption. This book takes readers through each of these stages of developing a Data Lake and guides them, using best practices, in developing these capabilities. It also explores often-ignored yet crucial considerations in building Data Lakes, with a focus on how to architect data governance, security, data quality, data lineage tracking, metadata management, and semantic data tagging. By the end of this book, you will have a good understanding of how to build a Data Lake for Big Data and will be able to use Data Lakes for efficient and easy data processing and analytics.

Style and Approach

Data Lake Development with Big Data provides architectural approaches to building a Data Lake. It follows a use case-based approach in which practical implementation scenarios for each key component are explained. It also helps you understand how these use cases are implemented in a Data Lake. The chapters are organized to mirror the sequential flow of data through a Data Lake.

Downloading the example code for this book: You can download the example code files for all Packt books you have purchased from your account at http://www.PacktPub.com. If you purchased this book elsewhere, you can visit http://www.PacktPub.com/support and register to receive the code files.

Table of Contents

    1. Data Lake Development with Big Data
      1. Table of Contents
      2. Data Lake Development with Big Data
      3. Credits
      4. About the Authors
      5. Acknowledgement
      6. About the Reviewer
      7. www.PacktPub.com
        1. Support files, eBooks, discount offers, and more
          1. Why subscribe?
          2. Free access for Packt account holders
      8. Preface
        1. What this book covers
        2. What you need for this book
        3. Who this book is for
        4. Conventions
        5. Reader feedback
        6. Customer support
          1. Errata
          2. Piracy
          3. Questions
      9. 1. The Need for Data Lake
        1. Before the Data Lake
        2. Need for Data Lake
        3. Defining Data Lake
        4. Key benefits of Data Lake
        5. Challenges in implementing a Data Lake
        6. When to go for a Data Lake implementation
        7. Data Lake architecture
          1. Architectural considerations
          2. Architectural composition
          3. Architectural details
            1. Understanding Data Lake layers
              1. The Data Governance and Security Layer
              2. The Information Lifecycle Management layer
              3. The Metadata Layer
            2. Understanding Data Lake tiers
              1. The Data Intake tier
                1. The Source System Zone
                2. The Transient Zone
                3. The Raw Zone
                4. Batch Raw Storage
                5. The real-time Raw Storage
              2. The Data Management tier
                1. The Integration Zone
                2. The Enrichment Zone
                3. The Data Hub Zone
              3. The Data Consumption tier
                1. The Data Discovery Zone
                2. The Data Provisioning Zone
        8. Summary
      10. 2. Data Intake
        1. Understanding Intake tier zones
          1. Source System Zone functionalities
            1. Understanding connectivity processing
            2. Understanding Intake Processing for data variety
              1. Structured data
                1. The need for integrating Structured Data in the Data Lake
                2. Structured data loading approaches
              2. Semi-structured data
                1. The need for integrating semi-structured data in the Data Lake
                2. Semi-structured data loading approaches
              3. Unstructured data
                1. The need for integrating Unstructured data in the Data Lake
                2. Unstructured data loading approaches
          2. Transient Landing Zone functionalities
            1. File validation checks
              1. File duplication checks
              2. File integrity checks
              3. File size checks
              4. File periodicity checks
            2. Data Integrity checks
              1. Checking record counts
              2. Checking for column counts
              3. Schema validation checks
          3. Raw Storage Zone functionalities
            1. Data lineage processes
              1. Watermarking process
              2. Metadata capture
            2. Deep Integrity checks
              1. Bit Level Integrity checks
              2. Periodic checksum checks
            3. Security and governance
            4. Information Lifecycle Management
          4. Practical Data Ingestion scenarios
        2. Architectural guidance
          1. Structured data use cases
          2. Semi-structured and unstructured data use cases
          3. Big Data tools and technologies
            1. Ingestion of structured data
              1. Sqoop
                1. Use case scenarios for Sqoop
              2. WebHDFS
                1. Use case scenarios for WebHDFS
            2. Ingestion of streaming data
              1. Apache Flume
                1. Use case scenarios for Flume
              2. Fluentd
                1. Use case scenarios for Fluentd
              3. Kafka
                1. Use case scenarios for Kafka
              4. Amazon Kinesis
                1. Use case scenarios for Kinesis
              5. Apache Storm
                1. Use case scenarios for Storm
        3. Summary
      11. 3. Data Integration, Quality, and Enrichment
        1. Introduction to the Data Management Tier
        2. Understanding Data Integration
          1. Introduction to Data Integration
            1. Prominent features of Data Integration
              1. Loosely coupled Integration
              2. Ease of use
              3. Secure access
              4. High-quality data
              5. Lineage tracking
          2. Practical Data Integration scenarios
          3. The workings of Data Integration
            1. Raw data discovery
            2. Data quality assessment
              1. Profiling the data
            3. Data cleansing
              1. Deletion of missing, null, or invalid values
              2. Imputation of missing, null, or invalid values
            4. Data transformations
              1. Unstructured text transformation techniques
              2. Structured data transformations
            5. Data enrichment
            6. Collect metadata and track data lineage
          4. Traditional Data Integration versus Data Lake
            1. Data pipelines
              1. Addressing the limitations using Data Lake
            2. Data partitioning
              1. Addressing the limitations using Data Lake
            3. Scale on demand
              1. Addressing the limitations using Data Lake
            4. Data ingest parallelism
              1. Addressing the limitations using Data Lake
            5. Extensibility
              1. Addressing the limitations using Data Lake
        3. Big Data tools and technologies
          1. Syncsort
            1. Use case scenarios for Syncsort
          2. Talend
            1. Use case scenarios for Talend
          3. Pentaho
            1. Use case scenarios for Pentaho
        4. Summary
      12. 4. Data Discovery and Consumption
        1. Understanding the Data Consumption tier
          1. Data Consumption – Traditional versus Data Lake
          2. An introduction to Data Consumption
          3. Practical Data Consumption scenarios
        2. Data Discovery and metadata
          1. Enabling Data Discovery
            1. Data classification
              1. Classifying unstructured data
                1. Named entity recognition
                2. Topic modeling
                3. Text clustering
              2. Applications of data classification
            2. Relation extraction
              1. Extracting relationships from unstructured data
                1. Feature-based methods
                2. Understanding how feature-based methods work
                3. Implementation
                4. Semantic technologies
                5. Understanding how semantic technologies work
                6. Implementation
              2. Extracting Relationships from structured data
              3. Applications of relation extraction
            3. Indexing data
              1. Inverted index
                1. Understanding how inverted index works
                2. Implementation
              2. Applications of Indexing
          2. Performing Data Discovery
            1. Semantic search
              1. Word sense disambiguation
              2. Latent Semantic Analysis
            2. Faceted search
            3. Fuzzy search
              1. Edit distance
              2. Wildcard and regular expressions
        3. Data Provisioning and metadata
          1. Data publication
          2. Data subscription
          3. Data Provisioning functionalities
            1. Data formatting
            2. Data selection
          4. Data Provisioning approaches
          5. Post-provisioning processes
        4. Architectural guidance
          1. Data Discovery
            1. Big Data tools and technologies
              1. Elasticsearch
                1. Use case scenarios for Elasticsearch
              2. IBM InfoSphere Data Explorer
                1. Use case scenarios for IBM InfoSphere Data Explorer
              3. Tableau
                1. Use case scenarios for Tableau
              4. Splunk
                1. Use case scenarios for Splunk
          2. Data Provisioning
            1. Big Data tools and technologies
              1. Data Dispatch
                1. Use case scenarios for Data Dispatch
        5. Summary
      13. 5. Data Governance
        1. Understanding Data Governance
          1. Introduction to Data Governance
            1. The need for Data Governance
            2. Governing Big Data in the Data Lake
          2. Data Governance – Traditional versus Data Lake
          3. Practical Data Governance scenarios
        2. Data Governance components
          1. Metadata management and lineage tracking
          2. Data security and privacy
            1. Big Data implications for security and privacy
            2. Security issues in the Data Lake tiers
              1. The Intake Tier
              2. The Management Tier
              3. The Consumption Tier
          3. Information Lifecycle Management
            1. Big Data implications for ILM
            2. Implementing ILM using Data Lake
              1. The Intake Tier
              2. The Management Tier
              3. The Consumption Tier
        3. Architectural guidance
          1. Big Data tools and technologies
            1. Apache Falcon
              1. Understanding how Falcon works
              2. Use case scenarios for Falcon
            2. Apache Atlas
              1. Understanding how Atlas works
              2. Use case scenarios for Atlas
            3. IBM Big Data platform
              1. Understanding how governance is provided in IBM Big Data platform
              2. Use case scenarios for IBM Big Data platform
        4. The current and future trends
          1. Data Lake and future enterprise trajectories
          2. Future Data Lake technologies
        5. Summary
      14. Index