DW 2.0: The Architecture for the Next Generation of Data Warehousing

Book Description

Data warehousing has been around for 20 years and has become part of the information technology infrastructure. It originally grew in response to the corporate need for information rather than data, and it supplies integrated, granular, and historical data to the corporation.

There are many kinds of data warehouses, in large part due to the evolution and differing paths of software and hardware vendors. But DW 2.0, defined by the author in many talks, articles, and his b-eye-network newsletter, which reaches 65,000 professionals monthly, is the clearly identified and defined next-generation data warehouse.

The book carries that theme and describes the future of data warehousing that is technologically possible now, at both an architectural level and a technology level. The perspective of the book is from the top down: looking at the overall architecture and then delving into the issues underlying the components. The benefit is that people who are building or using a data warehouse can see what lies ahead and can determine, at the most practical level, what new technology to buy, how to plan extensions to the data warehouse, what can be salvaged from the current system, and how to justify the expense.

All of this gives the experienced data warehouse professional exactly what is needed to implement the new generation DW 2.0.

* First book on the new generation of data warehouse architecture, DW 2.0.
* Written by the "father of the data warehouse", Bill Inmon, a columnist and newsletter editor of The Bill Inmon Channel on the Business Intelligence Network.
* Long-overdue, comprehensive coverage of the implementation of the technology and tools that enable the new generation of the DW: metadata, temporal data, ETL, unstructured data, and data quality control.

Table of Contents

  1. Copyright
    1. Dedication
  2. The Morgan Kaufmann Series in Data Management Systems
  3. Preface
  4. Acknowledgments
  5. About the Authors
  6. 1. A brief history of data warehousing and first-generation data warehouses
    1. Data base management systems
    2. Online applications
    3. Personal computers and 4GL technology
    4. The spider web environment
    5. Evolution from the business perspective
    6. The data warehouse environment
    7. What is a data warehouse?
    8. Integrating data—a painful experience
    9. Volumes of data
    10. A different development approach
    11. Evolution to the DW 2.0 environment
    12. The business impact of the data warehouse
    13. Various components of the data warehouse environment
      1. ETL—extract/transform/load
      2. ODS—operational data store
      3. Data mart
      4. Exploration warehouse
    14. The evolution of data warehousing from the business perspective
    15. Other notions about a data warehouse
    16. The active data warehouse
    17. The federated data warehouse approach
    18. The star schema approach
    19. The data mart data warehouse
    20. Building a “real” data warehouse
    21. Summary
  7. 2. An introduction to DW 2.0
    1. DW 2.0—a new paradigm
    2. DW 2.0—from the business perspective
    3. The life cycle of data
    4. Reasons for the different sectors
    5. Metadata
    6. Access of data
    7. Structured data/unstructured data
    8. Textual analytics
    9. Blather
    10. The issue of terminology
    11. Specific text/general text
    12. Metadata—a major component
    13. Local metadata
    14. A foundation of technology
    15. Changing business requirements
    16. The flow of data within DW 2.0
    17. Volumes of data
    18. Useful applications
    19. DW 2.0 and referential integrity
    20. Reporting in DW 2.0
    21. Summary
  8. 3. DW 2.0 components—about the different sectors
    1. The Interactive Sector
    2. The Integrated Sector
    3. The Near Line Sector
    4. The Archival Sector
    5. Unstructured processing
    6. From the business perspective
    7. Summary
  9. 4. Metadata in DW 2.0
    1. Reusability of data and analysis
    2. Metadata in DW 2.0
    3. Active repository/passive repository
    4. The active repository
    5. Enterprise metadata
    6. Metadata and the system of record
    7. Taxonomy
    8. Internal taxonomies/external taxonomies
    9. Metadata in the Archival Sector
    10. Maintaining metadata
    11. Using metadata—an example
    12. From the end-user perspective
    13. Summary
  10. 5. Fluidity of the DW 2.0 technology infrastructure
    1. The technology infrastructure
    2. Rapid business changes
    3. The treadmill of change
    4. Getting off the treadmill
    5. Reducing the length of time for IT to respond
    6. Semantically temporal, semantically static data
    7. Semantically temporal data
    8. Semantically stable data
    9. Mixing semantically stable and unstable data
    10. Separating semantically stable and unstable data
    11. Mitigating business change
    12. Creating snapshots of data
    13. A historical record
    14. Dividing data
    15. From the end-user perspective
    16. Summary
  11. 6. Methodology and approach for DW 2.0
    1. Spiral methodology—a summary of key features
    2. The seven streams approach—an overview
    3. Enterprise reference model stream
    4. Enterprise knowledge coordination stream
    5. Information factory development stream
    6. Data profiling and mapping stream
    7. Data correction stream (previously called the Data Cleansing Stream)
    8. Infrastructure stream
    9. Total information quality management stream
    10. Summary
  12. 7. Statistical processing and DW 2.0
    1. Two types of transactions
    2. Using statistical analysis
    3. The integrity of the comparison
    4. Heuristic analysis
    5. Freezing data
    6. Exploration processing
    7. The frequency of analysis
    8. The exploration facility
    9. The sources for exploration processing
    10. Refreshing exploration data
    11. Project-based data
    12. Data marts and the exploration facility
    13. A backflow of data
    14. Using exploration data internally
    15. From the perspective of the business analyst
    16. Summary
  13. 8. Data models and DW 2.0
    1. An intellectual road map
    2. The data model and business
    3. The scope of integration
    4. Making the distinction between granular and summarized data
    5. Levels of the data model
    6. Data models and the Interactive Sector
    7. The corporate data model
    8. A transformation of models
    9. Data models and unstructured data
    10. From the perspective of the business user
    11. Summary
  14. 9. Monitoring the DW 2.0 environment
    1. Monitoring the DW 2.0 environment
    2. The transaction monitor
    3. Monitoring data quality
    4. A data warehouse monitor
    5. The transaction monitor—response time
    6. Peak-period processing
    7. The ETL data quality monitor
    8. The data warehouse monitor
    9. Dormant data
    10. From the perspective of the business user
    11. Summary
  15. 10. DW 2.0 and security
    1. Protecting access to data
    2. Encryption
    3. Drawbacks
    4. The firewall
    5. Moving data offline
    6. Limiting encryption
    7. A direct dump
    8. The data warehouse monitor
    9. Sensing an attack
    10. Security for near line data
    11. From the perspective of the business user
    12. Summary
  16. 11. Time-variant data
    1. All data in DW 2.0—relative to time
    2. Time relativity in the Interactive Sector
    3. Data relativity elsewhere in DW 2.0
    4. Transactions in the Integrated Sector
    5. Discrete data
    6. Continuous time span data
    7. A sequence of records
    8. Nonoverlapping records
    9. Beginning and ending a sequence of records
    10. Continuity of data
    11. Time-collapsed data
    12. Time variance in the Archival Sector
    13. From the perspective of the end user
    14. Summary
  17. 12. The flow of data in DW 2.0
    1. The flow of data throughout the architecture
    2. Entering the Interactive Sector
    3. The role of ETL
    4. Data flow into the Integrated Sector
    5. Data flow into the Near Line Sector
    6. Data flow into the Archival Sector
    7. The falling probability of data access
    8. Exception-based flow of data
    9. From the perspective of the business user
    10. Summary
  18. 13. ETL processing and DW 2.0
    1. Changing states of data
    2. Where ETL fits
    3. From application data to corporate data
    4. ETL in online mode
    5. ETL in batch mode
    6. Source and target
    7. An ETL mapping
    8. Changing states—an example
    9. More complex transformations
    10. ETL and throughput
    11. ETL and metadata
    12. ETL and an audit trail
    13. ETL and data quality
    14. Creating ETL
    15. Code creation or parametrically driven ETL
    16. ETL and rejects
    17. Changed data capture
    18. ELT
    19. From the perspective of the business user
    20. Summary
  19. 14. DW 2.0 and the granularity manager
    1. The granularity manager
    2. Raising the level of granularity
    3. Filtering data
    4. The functions of the granularity manager
    5. Home-grown versus third-party granularity managers
    6. Parallelizing the granularity manager
    7. Metadata as a by-product
    8. From the perspective of the business user
    9. Summary
  20. 15. DW 2.0 and performance
    1. Good performance—a cornerstone for DW 2.0
    2. Online response time
    3. Analytical response time
    4. The flow of data
    5. Queues
    6. Heuristic processing
    7. Analytical productivity and response time
    8. Many facets to performance
    9. Indexing
    10. Removing dormant data
    11. End-user education
    12. Monitoring the environment
    13. Capacity planning
    14. Metadata
    15. Batch parallelization
    16. Parallelization for transaction processing
    17. Workload management
    18. Data marts
    19. Exploration facilities
    20. Separation of transactions into classes
    21. Service level agreements
    22. Protecting the Interactive Sector
    23. Partitioning data
    24. Choosing the proper hardware
    25. Separating farmers and explorers
    26. Physically group data together
    27. Check automatically generated code
    28. From the perspective of the business user
    29. Summary
  21. 16. Migration
    1. Houses and cities
    2. Migration in a perfect world
    3. The perfect world almost never happens
    4. Adding components incrementally
    5. Adding the Archival Sector
    6. Creating enterprise metadata
    7. Building the metadata infrastructure
    8. “Swallowing” source systems
    9. ETL as a shock absorber
    10. Migration to the unstructured environment
    11. From the perspective of the business user
    12. Summary
  22. 17. Cost justification and DW 2.0
    1. Is DW 2.0 worth it?
    2. Macro-level justification
    3. A micro-level cost justification
    4. Company B has DW 2.0
    5. Creating new analysis
    6. Executing the steps
    7. So how much does all of this cost?
    8. Consider company B
    9. Factoring the cost of DW 2.0
    10. Reality of information
    11. The real economics of DW 2.0
    12. The time value of information
    13. The value of integration
    14. Historical information
    15. First-generation DW and DW 2.0—the economics
    16. From the perspective of the business user
    17. Summary
  23. 18. Data quality in DW 2.0
    1. The DW 2.0 data quality tool set
    2. Data profiling tools and the reverse-engineered data model
    3. Data model types
    4. Data profiling inconsistencies challenge top-down modeling
    5. Summary
  24. 19. DW 2.0 and unstructured data
    1. DW 2.0 and unstructured data
    2. Reading text
    3. Where to do textual analytical processing
    4. Integrating text
    5. Simple editing
    6. Stop words
    7. Synonym replacement
    8. Synonym concatenation
    9. Homographic resolution
    10. Creating themes
    11. External glossaries/taxonomies
    12. Stemming
    13. Alternate spellings
    14. Text across languages
    15. Direct searches
    16. Indirect searches
    17. Terminology
    18. Semistructured data/VALUE = NAME data
    19. The technology needed to prepare the data
    20. The relational data base
    21. Structured/unstructured linkage
    22. From the perspective of the business user
    23. Summary
  25. 20. DW 2.0 and the system of record
    1. Other systems of record
    2. From the perspective of the business user
    3. Summary
  26. 21. Miscellaneous topics
    1. Data marts
    2. The convenience of a data mart
    3. Transforming data mart data
    4. Monitoring DW 2.0
    5. Moving data from one data mart to another
    6. Bad data
    7. A balancing entry
    8. Resetting a value
    9. Making corrections
    10. The speed of movement of data
    11. Data warehouse utilities
    12. Summary
  27. 22. Processing in the DW 2.0 environment
    1. Summary
  28. 23. Administering the DW 2.0 environment
    1. The data model
    2. Architectural administration
      1. Defining the moment when an Archival Sector will be needed
      2. Determining whether the Near Line Sector is needed
    3. Metadata administration
    4. Data base administration
    5. Stewardship
    6. Systems and technology administration
    7. Management administration of the DW 2.0 environment
      1. Prioritization and prioritization conflicts
      2. Budget
      3. Scheduling and determination of milestones
      4. Allocation of resources
      5. Managing consultants
    8. Summary