Data Architecture: A Primer for the Data Scientist

Book Description

Today, the world is trying to create and educate data scientists because of the phenomenon of Big Data, and everyone is looking deeply into this technology. But no one is looking at the larger architectural picture of how Big Data needs to fit within existing systems (data warehousing systems). Taking a look at the larger picture into which Big Data fits gives the data scientist the necessary context for how the pieces of the puzzle should fit together. Most references on Big Data look at only one tiny part of a much larger whole. Until the data gathered can be put into an existing framework or architecture, it can’t be used to its full potential. Data Architecture: A Primer for the Data Scientist addresses the larger architectural picture of how Big Data fits with the existing information infrastructure, an essential topic for the data scientist.

Drawing upon years of practical experience, and using numerous examples and an easy-to-understand framework, W.H. Inmon and Daniel Linstedt define the importance of data architecture and how it can be used effectively to harness Big Data within existing systems. You’ll be able to:

  • Turn textual information into a form that can be analyzed by standard tools
  • Make the connection between analytics and Big Data
  • Understand how Big Data fits within an existing systems environment
  • Conduct analytics on repetitive and non-repetitive data


  • Discusses the often-overlooked value in Big Data, non-repetitive data, and why there is significant business value in using it
  • Shows how to turn textual information into a form that can be analyzed by standard tools
  • Explains how Big Data fits within an existing systems environment
  • Presents new opportunities that are afforded by the advent of Big Data
  • Demystifies the murky waters of repetitive and non-repetitive data in Big Data

Table of Contents

  1. Cover
  2. Title page
  3. Table of Contents
  4. Copyright
  5. Dedication
  6. Preface
  7. About the Authors
  8. 1.1: Corporate Data
    1. Abstract
    2. The Totality of Data Across the Corporation
    3. Dividing Unstructured Data
    4. Business Relevancy
    5. Big Data
    6. The Great Divide
    7. The Continental Divide
    8. The Complete Picture
  9. 1.2: The Data Infrastructure
    1. Abstract
    2. Two Types of Repetitive Data
    3. Repetitive Structured Data
    4. Repetitive Big Data
    5. The Two Infrastructures
    6. What’s being Optimized?
    7. Comparing the Two Infrastructures
  10. 1.3: The “Great Divide”
    1. Abstract
    2. Classifying Corporate Data
    3. The “Great Divide”
    4. Repetitive Unstructured Data
    5. Nonrepetitive Unstructured Data
    6. Different Worlds
  11. 1.4: Demographics of Corporate Data
    1. Abstract
  12. 1.5: Corporate Data Analysis
    1. Abstract
  13. 1.6: The Life Cycle of Data – Understanding Data Over Time
    1. Abstract
  14. 1.7: A Brief History of Data
    1. Abstract
    2. Paper Tape and Punch Cards
    3. Magnetic Tapes
    4. Disk Storage
    5. Database Management System
    6. Coupled Processors
    7. Online Transaction Processing
    8. Data Warehouse
    9. Parallel Data Management
    10. Data Vault
    11. Big Data
    12. The Great Divide
  15. 2.1: A Brief History of Big Data
    1. Abstract
    2. An Analogy – Taking the High Ground
    3. Taking the High Ground
    4. Standardization with the 360
    5. Online Transaction Processing
    6. Enter Teradata and Massively Parallel Processing
    7. Then Came Hadoop and Big Data
    8. IBM and Hadoop
    9. Holding the High Ground
  16. 2.2: What is Big Data?
    1. Abstract
    2. Another Definition
    3. Large Volumes
    4. Inexpensive Storage
    5. The Roman Census Approach
    6. Unstructured Data
    7. Data in Big Data
    8. Context in Repetitive Data
    9. Nonrepetitive Data
    10. Context in Nonrepetitive Data
  17. 2.3: Parallel Processing
    1. Abstract
  18. 2.4: Unstructured Data
    1. Abstract
    2. Textual Information Everywhere
    3. Decisions Based on Structured Data
    4. The Business Value Proposition
    5. Repetitive and Nonrepetitive Unstructured Information
    6. Ease of Analysis
    7. Contextualization
    8. Some Approaches to Contextualization
    9. MapReduce
    10. Manual Analysis
  19. 2.5: Contextualizing Repetitive Unstructured Data
    1. Abstract
    2. Parsing Repetitive Unstructured Data
    3. Recasting the Output Data
  20. 2.6: Textual Disambiguation
    1. Abstract
    2. From Narrative into an Analytical Database
    3. Input into Textual Disambiguation
    4. Mapping
    5. Input/Output
    6. Document Fracturing/Named Value Processing
    7. Preprocessing a Document
    8. Emails – A Special Case
    9. Spreadsheets
    10. Report Decompilation
  21. 2.7: Taxonomies
    1. Abstract
    2. Data Models and Taxonomies
    3. Applicability of Taxonomies
    4. What is a Taxonomy?
    5. Taxonomies in Multiple Languages
    6. Dynamics of Taxonomies and Textual Disambiguation
    7. Taxonomies and Textual Disambiguation – Separate Technologies
    8. Different Types of Taxonomies
    9. Taxonomies – Maintenance Over Time
  22. 3.1: A Brief History of Data Warehouse
    1. Abstract
    2. Early Applications
    3. Online Applications
    4. Extract Programs
    5. 4GL Technology
    6. Personal Computers
    7. Spreadsheets
    8. Integrity of Data
    9. Spider-Web Systems
    10. The Maintenance Backlog
    11. The Data Warehouse
    12. To an Architected Environment
    13. To the CIF
    14. DW 2.0
  23. 3.2: Integrated Corporate Data
    1. Abstract
    2. Many Applications
    3. Looking Across the Corporation
    4. More Than One Analyst
    5. ETL Technology
    6. The Challenges of Integration
    7. The Benefits of a Data Warehouse
    8. The Granular Perspective
  24. 3.3: Historical Data
    1. Abstract
  25. 3.4: Data Marts
    1. Abstract
    2. Granular Data
    3. Relational Database Design
    4. The Data Mart
    5. Key Performance Indicators
    6. The Dimensional Model
    7. Combining the Data Warehouse and Data Marts
  26. 3.5: The Operational Data Store
    1. Abstract
    2. Online Transaction Processing on Integrated Data
    3. The Operational Data Store
    4. ODS and the Data Warehouse
    5. ODS Classes
    6. External Updates into the ODS
    7. The ODS/Data Warehouse Interface
  27. 3.6: What a Data Warehouse is Not
    1. Abstract
    2. A Simple Data Warehouse Architecture
    3. Online High-Performance Transaction Processing in the Data Warehouse
    4. Integrity of Data
    5. The Data Warehouse Workload
    6. Statistical Processing from the Data Warehouse
    7. The Frequency of Statistical Processing
    8. The Exploration Warehouse
  28. 4.1: Introduction to Data Vault
    1. Abstract
    2. Data Vault 2.0 Modeling
    3. Data Vault 2.0 Methodology Defined
    4. Data Vault 2.0 Architecture
    5. Data Vault 2.0 Implementation
    6. Business Benefits of Data Vault 2.0
    7. Data Vault 1.0
  29. 4.2: Introduction to Data Vault Modeling
    1. Abstract
    2. A Data Vault Model Concept
    3. Data Vault Model Defined
    4. Components of a Data Vault Model
    5. Data Vault and Data Warehousing
    6. Translating to Data Vault Modeling
    7. Data Restructure
    8. Basic Rules of Data Vault Modeling
    9. Why We Need Many-to-Many Link Structures
    10. Hash keys Instead of Sequence Numbers
  30. 4.3: Introduction to Data Vault Architecture
    1. Abstract
    2. Data Vault 2.0 Architecture
    3. How NoSQL Fits into the Architecture
    4. Data Vault 2.0 Architecture Objectives
    5. Data Vault 2.0 Modeling Objective
    6. Hard and Soft Business Rules
    7. Managed SSBI and the Architecture
  31. 4.4: Introduction to Data Vault Methodology
    1. Abstract
    2. Data Vault 2.0 Methodology Overview
    3. CMMI and Data Vault 2.0 Methodology
    4. CMMI Versus Agility
    5. Project Management Practices and SDLC Versus CMMI and Agile
    6. Six Sigma and Data Vault 2.0 Methodology
    7. Total Quality Management
  32. 4.5: Introduction to Data Vault Implementation
    1. Abstract
    2. Implementation Overview
    3. The Importance of Patterns
    4. Reengineering and Big Data
    5. Virtualize Our Data Marts
    6. Managed Self-Service BI
  33. 5.1: The Operational Environment – A Short History
    1. Abstract
    2. Commercial Uses of the Computer
    3. The First Applications
    4. Ed Yourdon and the Structured Revolution
    5. System Development Life Cycle
    6. Disk Technology
    7. Enter the Database Management System
    8. Response Time and Availability
    9. Corporate Computing Today
  34. 5.2: The Standard Work Unit
    1. Abstract
    2. Elements of Response Time
    3. An Hourglass Analogy
    4. The Racetrack Analogy
    5. Your Vehicle Runs as Fast as the Vehicle in Front of It
    6. The Standard Work Unit
    7. The Service Level Agreement
  35. 5.3: Data Modeling for the Structured Environment
    1. Abstract
    2. The Purpose of the Road Map
    3. Granular Data Only
    4. The Entity Relationship Diagram
    5. The DIS
    6. Physical Database Design
    7. Relating the Different Levels of the Data Model
    8. An Example of the Linkage
    9. Generic Data Models
    10. Operational Data Models and Data Warehouse Data Models
  36. 5.4: Metadata
    1. Abstract
    2. Typical Metadata
    3. The Repository
    4. Using Metadata
    5. Analytical Uses of Metadata
    6. Looking at Multiple Systems
    7. The Lineage of Data
    8. Comparing Existing Systems to Proposed Systems
  37. 5.5: Data Governance of Structured Data
    1. Abstract
    2. A Corporate Activity
    3. Motivations for Data Governance
    4. Repairing Data
    5. Granular, Detailed Data
    6. Documentation
    7. Data Stewardship
  38. 6.1: A Brief History of Data Architecture
    1. Abstract
  39. 6.2: Big Data/Existing Systems Interface
    1. Abstract
    2. The Big Data/Existing Systems Interface
    3. The Repetitive Raw Big Data/Existing Systems Interface
    4. Exception-Based Data
    5. The Nonrepetitive Raw Big Data/Existing Systems Interface
    6. Into the Existing Systems Environment
    7. The “Context-Enriched” Big Data Environment
    8. Analyzing Structured Data/Unstructured Data Together
  40. 6.3: The Data Warehouse/Operational Environment Interface
    1. Abstract
    2. The Operational/Data Warehouse Interface
    3. The Classical ETL Interface
    4. The Operational Data Store/ETL Interface
    5. The Staging Area
    6. Changed Data Capture
    7. Inline Transformation
    8. ELT Processing
  41. 6.4: Data Architecture – A High-Level Perspective
    1. Abstract
    2. A High-Level Perspective
    3. Redundancy
    4. The System of Record
    5. Different Communities
  42. 7.1: Repetitive Analytics – Some Basics
    1. Abstract
    2. Different Kinds of Analysis
    3. Looking for Patterns
    4. Heuristic Processing
    5. The Sandbox
    6. The “Normal” Profile
    7. Distillation, Filtering
    8. Subsetting Data
    9. Filtering Data
    10. Repetitive Data and Context
    11. Linking Repetitive Records
    12. Log Tape Records
    13. Analyzing Points of Data
    14. Data Over Time
  43. 7.2: Analyzing Repetitive Data
    1. Abstract
    2. Log Data
    3. Active/Passive Indexing of Data
    4. Summary/Detailed Data
    5. Metadata in Big Data
    6. Linking Data
  44. 7.3: Repetitive Analysis
    1. Abstract
    2. Internal, External Data
    3. Universal Identifiers
    4. Security
    5. Filtering, Distillation
    6. Archiving Results
    7. Metrics
  45. 8.1: Nonrepetitive Data
    1. Abstract
    2. Inline Contextualization
    3. Taxonomy/Ontology Processing
    4. Custom Variables
    5. Homographic Resolution
    6. Acronym Resolution
    7. Negation Analysis
    8. Numeric Tagging
    9. Date Tagging
    10. Date Standardization
    11. List Processing
    12. Associative Word Processing
    13. Stop Word Processing
    14. Word Stemming
    15. Document Metadata
    16. Document Classification
    17. Proximity Analysis
    18. Functional Sequencing within Textual ETL
    19. Internal Referential Integrity
    20. Preprocessing, Postprocessing
  46. 8.2: Mapping
    1. Abstract
  47. 8.3: Analytics from Nonrepetitive Data
    1. Abstract
    2. Call Center Information
    3. Medical Records
  48. 9.1: Operational Analytics
    1. Abstract
    2. Transaction Response Time
  49. 10.1: Operational Analytics
    1. Abstract
  50. 11.1: Personal Analytics
    1. Abstract
  51. 12.1: A Composite Data Architecture
    1. Abstract
  52. Glossary
  53. Index