Data Lake Architecture: Designing the Data Lake and Avoiding the Garbage Dump

Book Description

Organizations invest incredible amounts of time and money obtaining and then storing big data in data stores called data lakes. But how many of these organizations can actually get the data back out in a usable form? Very few can turn the data lake into an information gold mine. Most wind up with garbage dumps.

Data Lake Architecture will explain how to build a useful data lake, where data scientists and data analysts can solve business challenges and identify new business opportunities. Learn how to structure data lakes as well as analog, application, and text-based data ponds to provide maximum business value. Understand the role of the raw data pond and when to use an archival data pond. Leverage the four key ingredients for data lake success: metadata, integration mapping, context, and metaprocess.

Bill Inmon opened our eyes to the architecture and benefits of a data warehouse, and now he takes us to the next level of data lake architecture.

Table of Contents

  1. Introduction
  2. Chapter 1 Data Lakes
    1. Enter Big Data
    2. Enter the Data Lake
    3. “One Way” Data Lake
    4. In Summary
  3. Chapter 2 Transforming the Data Lake
    1. Metadata
    2. Integration Mapping
    3. Context
    4. Metaprocess
    5. Data Scientist
    6. General Usability
    7. In Summary
  4. Chapter 3 Inside the Data Lake
    1. Analog Data
    2. Application Data
    3. Textual Data
    4. Another Perspective
    5. In Summary
  5. Chapter 4 Data Ponds
    1. Conditioning Data
    2. Raw Data Pond
    3. Analog Data Pond
    4. Application Data Pond
    5. Textual Data Pond
    6. Data Passing Directly Into the Data Ponds
    7. Archival Data Pond
    8. In Summary
  6. Chapter 5 Generic Structure of the Data Pond
    1. Pond Descriptor
    2. Pond Target
    3. Pond Data
    4. Pond Metadata
    5. Pond Metaprocess
    6. Pond Transformation Criteria
    7. In Summary
  7. Chapter 6 Analog Data Pond
    1. Analog Data Issues
    2. Data Descriptor
    3. Capturing Raw Data/Transforming Raw Data
    4. Transforming/Conditioning Raw Analog Data
    5. Data Excision
    6. Clustering Data
    7. Data Relationships
    8. Probability of Future Usage
    9. Outliers
    10. Specialized Ad Hoc Analysis
    11. In Summary
  8. Chapter 7 Application Data Pond
    1. DNA of Data
    2. Descriptors
    3. Standard Database Format
    4. Basic Organization of Data
    5. Integration of Data
    6. Data Model
    7. Necessity of Integration
    8. Pointing From One Application to the Next
    9. Intersecting Applications
    10. Subsets of Data in the Application Data Pond
    11. In Summary
  9. Chapter 8 Textual Data Pond
    1. Uniform Data and the Computer
    2. Valuable Text
    3. Textual Disambiguation
    4. Text Sent to the Data Pond
    5. Output of Textual Disambiguation
    6. Inherent Complexity
    7. Textual Disambiguation Functionality
    8. Taxonomies and Ontologies
    9. Value of Text and Context
    10. Tracing Text Back to the Source
    11. Mechanics of Disambiguation
    12. Analyzing the Database
    13. Visualizing the Results
    14. In Summary
  10. Chapter 9 Comparing the Ponds
    1. Similarities Across the Data Ponds
    2. Dissimilarities Across the Data Ponds
    3. Relational Format for Final State Data
    4. Technology Differences
    5. Total Expected Volume of Data in the Data Pond
    6. Moving Data From Pond to Pond
    7. Doing Analysis From Multiple Ponds
    8. Using Metadata to Relate Data From Different Ponds
    9. What if…?
    10. In Summary
  11. Chapter 10 Using the Infrastructure
    1. “One Way” Data Lake
    2. Transforming the Data Lake
    3. Transformation Technology
    4. Some Analytical Questions
    5. Querying Textual Data
    6. Real Analysis
    7. In Summary
  12. Chapter 11 Search and Analysis
    1. Confusion Spread by the Vendors
    2. In Summary
  13. Chapter 12 Business Value in the Data Ponds
    1. Business Value in the Analog Data Pond
    2. Business Value in the Application Data Pond
    3. Business Value in the Textual Data Pond
    4. Percent of Records That Have Business Value
    5. In Summary
  14. Chapter 13 Additional Topics
    1. High System Level Documentation
    2. Detailed Data Pond Level Documentation
    3. What Data Flows Into the Data Lake/Data Pond?
    4. Where Does Analysis Occur?
    5. The Age of Data
    6. Security of Data
    7. In Summary
  15. Chapter 14 Analytical and Integration Tools
    1. Visualization
    2. Search and Qualify
    3. Textual Disambiguation
    4. Statistical Analysis
    5. Classical ETL Processing
    6. In Summary
  16. Chapter 15 Archiving Data Ponds
    1. Criteria for Removal
    2. Structural Alteration
    3. Creating Independent Indexes for Archival Data
    4. In Summary
  17. Glossary
  18. References
  19. Index