Building the Unstructured Data Warehouse: Architecture, Analysis, and Design

Book Description

Learn essential techniques from data warehouse legend Bill Inmon on how to build the reporting environment your business needs now!

Answers to many valuable business questions hide in text. How well can your existing reporting environment extract the necessary text from email, spreadsheets, and documents, and put it in a useful format for analytics and reporting? Transforming the traditional data warehouse into an efficient unstructured data warehouse requires additional skills from the analyst, architect, designer, and developer. This book prepares you to implement an unstructured data warehouse. Through clear explanations, examples, and case studies, you will learn new techniques and tips for obtaining and analyzing text.

Master these ten objectives:

  • Build an unstructured data warehouse using the 11-step approach

  • Integrate text and describe it in terms of homogeneity, relevance, medium, volume, and structure

  • Overcome challenges including blather, the Tower of Babel, and lack of natural relationships

  • Avoid the Data Junkyard and combat the Spider's Web

  • Reuse techniques perfected in the traditional data warehouse and Data Warehouse 2.0, including iterative development

  • Apply essential techniques for textual Extract, Transform, and Load (ETL) such as phrase recognition, stop word filtering, and synonym replacement

  • Design the Document Inventory system and link unstructured text to structured data

  • Leverage indexes for efficient text analysis and taxonomies for useful external categorization

  • Manage large volumes of data using advanced techniques such as backward pointers

  • Evaluate technology choices suitable for unstructured data processing, such as data warehouse appliances

The following outline briefly describes each chapter's content:

  • Chapter 1 defines unstructured data and explains why text is the main focus of this book.

  • Chapter 2 addresses the challenges one faces when managing unstructured data.

  • Chapter 3 discusses the DW 2.0 architecture, which leads into the role of the unstructured data warehouse. It defines the unstructured data warehouse and outlines its benefits. Several features of the conventional data warehouse can be leveraged for the unstructured data warehouse, including ETL processing, textual integration, and iterative development.

  • Chapter 4 focuses on the heart of the unstructured data warehouse: Textual Extract, Transform, and Load (ETL).

  • Chapter 5 describes the 11 steps required to develop the unstructured data warehouse.

  • Chapter 6 describes how to inventory documents for maximum analysis value, as well as link the unstructured text to structured data for even greater value.

  • Chapter 7 goes through each of the different types of indexes necessary to make text analysis efficient. Indexes range from simple indexes, which are fast to create and are good if the analyst really knows what needs to be analyzed before the indexing process begins, to complex combined indexes, which can be made up of any and all of the other kinds of indexes.

  • Chapter 8 explains taxonomies and how they can be used within the unstructured data warehouse.

  • Chapter 9 explains ways of coping with large amounts of unstructured data. Techniques such as keeping the unstructured data at its source and using backward pointers are discussed. The chapter explains why iterative development is so important.

  • Chapter 10 focuses on challenges and some technology choices that are suitable for unstructured data processing. In addition, the data warehouse appliance is discussed.

  • Chapters 11, 12, and 13 put all of the previously discussed techniques and approaches in context through three case studies.

Table of Contents

  1. Contents at a Glance
  2. Contents
  3. Introduction
  5. Unstructured Data Warehouse Essentials
  6. CHAPTER 1
  7. Exploring our Unstructured World
    1. Text Form
      1. Documents
      2. Email
      3. Spreadsheets
      4. Embedded Text
    2. Text Characteristics
      1. Homogeneity
      2. Format
      3. Medium
      4. Volume
      5. Structure
  8. CHAPTER 2
  9. Managing Unstructured Data
    1. Volume
    2. Blather
    3. The Tower of Babel
    4. Spelling
    5. Lack of Natural Relationships
    6. Storage Format
    7. Data Junkyards
    8. Paper
  10. CHAPTER 3
  11. Evolving to the Unstructured Data Warehouse
    1. We Have Come a Long Way
    2. Caught in the Spider’s Web
    3. Data Warehouse to the Rescue
    4. Data Warehouse 2.0 to the Rescue
      1. Unstructured Components
    5. Unstructured Data Warehouse to the Rescue
      1. The Thematic Approach
      2. Advantages Over a Traditional Search Engine
    6. Leveraging the Traditional Data Warehouse
      1. ETL Processing
      2. Integration
      3. Iteration
  12. CHAPTER 4
  13. Extracting, Transforming, and Loading Text (Note: Much of the material in this chapter describes intellectual property that is patent pending. If you need to use this material in any way, such as designing a product, refining the design of a product, or creating a new product, please contact Forest Rim Technology for licensing information.)
    1. Extracting Text (The ‘E’ of ETL)
      1. Knowing the Source
      2. Reading Documents Only Once
      3. Identifying Common File Types
      4. Acquiring the “Read” Interface
    2. Transforming Text (The ‘T’ of ETL)
      1. Words and Phrases
      2. Stop Words
      3. Case
      4. Punctuation
      5. Font
      6. Stem
      7. Synonym Replacement
      8. Alternate Spelling
      9. Conceptual Abstractions
      10. Homographic Resolution
      11. Negativity Exclusion
      12. Inline Additions
      13. Recognizing Extensions of a Concept
      14. Patterns
      15. Proximity Analysis
      16. Clustering
    3. Loading Text (The ‘L’ of ETL)
      1. Using the Move/Remove Utility
      2. Reviewing the Output Tables
      3. Knowing the Final Destination
      4. Managing Volumes of Data
      5. Performing Checkpoint Processing
    4. Textual ETL Examples
      1. Email
      2. Spreadsheets
  14. CHAPTER 5
  15. Developing the Unstructured Data Warehouse
    1. SDLC
    2. Spiral Approach
    3. Hybrid Approach
      1. Understand the business problem and business context
      2. Survey the data sources to determine which data is useful
      3. Select and customize taxonomies
      4. Select the initial set of data
      5. Determine future iterations and source document requirements
      6. Choose the textual ETL tool
      7. Load parameters for transformations
      8. Execute ETL scripts with initial set of data
      9. Examine results and make adjustments if needed
      10. Execute ETL scripts on remaining iterations
      11. Continuous business analysis and make adjustments if needed
    4. Putting the Steps Together
  17. Unstructured Data Warehouse Advanced Topics
  18. CHAPTER 6
  19. Inventorying and Linking Text
    1. Document Inventory
    2. Document Classification
    3. Linking Unstructured to Structured Data
      1. A Probabilistic Linkage
      2. Dynamic Linkages
      3. Static Linkages
      4. Dynamic versus Static
  20. CHAPTER 7
  21. Using Indexes
    1. Simple Index
    2. Fractured Index
    3. Named Value Index
    4. Taxonomy (or External Categorization) Index
    5. Patterned Index
    6. Homographic Index
    7. Alternate Spelling Index
    8. Stemmed Words Index
    9. Clustered Index
    10. Combined Index
    11. Leverage Multiple Indexing Strategies
      1. Semistructured (Sub Doc) Processing
  22. CHAPTER 8
  23. Leveraging Taxonomies
    1. Simple Taxonomy
    2. Pairs of Words
    3. Preferred Taxonomy
    4. External Categorization
    5. Real World Problems
      1. Hierarchies Within the Taxonomy
      2. Multiple Types Within the Taxonomy
      3. Recursion Within the Taxonomy
      4. Relationships Between Taxonomies
    6. Taxonomies and Data Modeling
  24. CHAPTER 9
  25. Coping with Large Amounts of Data
    1. Keeping Unstructured Data in Place
    2. Implementing Backward Pointers
    3. Doing Iterative Development
    4. Avoiding Rework
    5. Screening Data
    6. Removing Extraneous Data
    7. Selecting Appropriate Index Types
    8. Parallelizing the Workload
    9. Building Small Logically Related Tables
    10. Dividing Data into Sectors
  26. CHAPTER 10
  27. Selecting Technology
    1. Processing Structured Data
      1. Data Warehouse Performance
    2. Processing Unstructured Data
    3. Data Warehouse Appliance
      1. Appliance Architecture
      2. Data Distribution
      3. Workload
      4. Best Practices for Implementing Data Warehouse Appliances
      5. Using the Data Warehouse Appliance to build the Unstructured Database
      6. Example of Processing Unstructured Data
  29. Unstructured Data Warehouse Case Studies
  30. CHAPTER 11
  31. The Ablatz Medical Group
    1. Information Systems
    2. Special Treatment Collections
    3. Users
    4. Integration
    5. Unstructured Text
    6. Sources of Data
    7. Textual Operating Parameters
    8. Visualization
  32. CHAPTER 12
  33. The Eastern Hills Oil Company
  34. CHAPTER 13
  35. The Amber Oil Company
    1. Maximizing Search Engines
      1. Legacy Search
      2. Relevancy Rankings
  36. Suggested Reading
  37. Index