Tapping into Unstructured Data: Integrating Unstructured Data and Textual Analytics into Business Intelligence

Book Description

“The authors, the best minds on the topic, are breaking new ground. They show how every organization can realize the benefits of a system that can search and present complex ideas or data from what has been a mostly untapped source of raw data.”

--Randy Chalfant, CTO, Sun Microsystems

The Definitive Guide to Unstructured Data Management and Analysis--From the World’s Leading Information Management Expert

A wealth of invaluable information exists in unstructured textual form, but organizations have found it difficult or impossible to access and utilize it. This is changing rapidly: new approaches finally make it possible to glean useful knowledge from virtually any collection of unstructured data.

William H. Inmon--the father of data warehousing--and Anthony Nesavich introduce the next data revolution: unstructured data management. Inmon and Nesavich cover all you need to know to make unstructured data work for your organization. You’ll learn how to bring it into your existing structured data environment, leverage existing analytical infrastructure, and implement textual analytic processing technologies to solve new problems and uncover new opportunities. Inmon and Nesavich introduce breakthrough techniques covered in no other book--including the powerful role of textual integration, new ways to integrate textual data into data warehouses, and new SQL techniques for reading and analyzing text. They also present five chapter-length, real-world case studies--demonstrating unstructured data at work in medical research, insurance, chemical manufacturing, contracting, and beyond.

This book will be indispensable to every business and technical professional trying to make sense of a large body of unstructured text: managers, database designers, data modelers, DBAs, researchers, and end users alike.

Coverage includes

  • What unstructured data is, and how it differs from structured data

  • First generation technology for handling unstructured data, from search engines to ECM--and its limitations

  • Integrating text so it can be analyzed with a common, colloquial vocabulary: integration engines, ontologies, glossaries, and taxonomies

  • Processing semistructured data: uncovering patterns, words, identifiers, and conflicts

  • Novel processing opportunities that arise when text is freed from context

  • Architecture and unstructured data: Data Warehousing 2.0

  • Building unstructured relational databases and linking them to structured data

  • Visualizations and Self-Organizing Maps (SOMs), including Compudigm and Raptor solutions

  • Capturing knowledge from spreadsheet data and email

  • Implementing and managing metadata: data models, data quality, and more

William H. Inmon is founder, president, and CTO of Inmon Data Systems. He is the father of the data warehouse concept, the corporate information factory, and the government information factory. Inmon has written 47 books on data warehouse, database, and information technology management, as well as more than 750 articles for trade journals such as Data Management Review, Byte, Datamation, and ComputerWorld. His b-eye-network.com newsletter currently reaches 55,000 people.

Anthony Nesavich worked at Inmon Data Systems, where he developed multiple reports that successfully query unstructured data.

Preface

1 Unstructured Textual Data in the Organization

2 The Environments of Structured Data and Unstructured Data

3 First Generation Textual Analytics

4 Integrating Unstructured Text into the Structured Environment

5 Semistructured Data

6 Architecture and Textual Analytics

7 The Unstructured Database

8 Analyzing a Combination of Unstructured Data and Structured Data

9 Analyzing Text Through Visualization

10 Spreadsheets and Email

11 Metadata in Unstructured Data

12 A Methodology for Textual Analytics

13 Merging Unstructured Databases into the Data Warehouse

14 Using SQL to Analyze Text

15 Case Study--Textual Analytics in Medical Research

16 Case Study--A Database for Harmful Chemicals

17 Case Study--Managing Contracts Through an Unstructured Database

18 Case Study--Creating a Corporate Taxonomy (Glossary)

19 Case Study--Insurance Claims

Glossary

Index

Table of Contents

    1. Copyright
      1. Dedication
    2. Preface
    3. Acknowledgments
    4. About the Authors
    5. 1. Unstructured Textual Data in the Organization
      1. Unstructured Textual Data
      2. Unstructured Textual Data and Organizational Functions
      3. Unstructured Data and Its Characteristics
      4. Updating Structured and Unstructured Data
      5. The Challenges of Unstructured Textual Data and Analytical Processing
      6. The Opportunities of Unstructured Textual Data
      7. Summary
    6. 2. The Environments of Structured Data and Unstructured Data
      1. The Structured Environment
        1. Databases
        2. Speed of Storage, Retrieval
      2. The Unstructured Environment
        1. Emails
        2. Spreadsheets
        3. Transcripted Telephone Conversations
        4. Medical Records
        5. Legal Information
        6. Corporate Contracts
        7. Unstructured Data—Found Everywhere
      3. The Analytical Environment
        1. Doing Textual Analytics in the Unstructured Environment or the Structured Environment
        2. Bringing Unstructured Data into the Structured Environment
      4. Summary
    7. 3. First Generation Textual Analytics
      1. Simplicity
      2. Search Engines That Look for Patterns
      3. Search Engine—The Hit
      4. Where Unstructured Data Resides
      5. Searching an Index
        1. A Crawler
      6. Search Arguments and Schedulers
      7. Tagging
      8. Searching in Multiple Languages
      9. Collecting Output in a Taxonomy
      10. Hyperlink References
      11. Federated Queries
      12. Integration
      13. Enhancing the Search Argument
        1. Wild Cards
        2. Boolean Expressions
      14. Visualizing Text—First Generation
      15. Understanding Context
      16. Summary
    8. 4. Integrating Unstructured Text into the Structured Environment
      1. Possibilities of Unstructured Systems
      2. Integrating Unstructured Textual Data
        1. Reading the Unstructured Textual Data
        2. Choosing a File Type
        3. Reading Unstructured Data from Voice Recordings
      3. The Importance of Integration
        1. Simple Search
        2. Indirect Search of Alternate Terms
        3. Indirect Search of Related Terms
        4. Permutations of Words
      4. The Issues of Textual Integration
      5. External Categorization
      6. Simple Integration Applications
        1. Examining the Contents of Existing Unstructured Data
        2. Enterprise Metadata Repository
        3. Customer Communications
        4. The Resulting Architecture
      7. Choosing the Best Types of Integration
        1. Ways in Which Integration Occurs
        2. Performance Limitations
        3. Disadvantages to Integration
      8. Summary
    9. 5. Semistructured Data
      1. The Many Forms of Semistructured Data
        1. Common Patterns of Data
        2. Prefacing Values
        3. Default Values
        4. Conflicting Values
      2. The Degree of Accuracy
      3. Preprocessing Semistructured Data
        1. Variable Output
        2. Semistructured Processing in an Unstructured Environment
        3. Preparation Time
      4. Summary
    10. 6. Architecture and Textual Analytics
      1. The Growth of Information Systems
        1. The Maturing Need for Information
        2. The Need for Data Integrity
      2. The Need to Include Unstructured Data
        1. A Fundamental Difference
        2. First Generation of Analytical Processing
        3. The Second Generation of Textual Analytics
      3. DW 2.0—Data Warehouse Architecture
      4. Summary
    11. 7. The Unstructured Database
      1. The General Flow of Data
        1. Preventing a Blob of Unstructured Data
      2. A Partial Relational Table
        1. A Document/Word Table
        2. The Key of the Relational Table
      3. Different Indexes
        1. Indexing Semistructured Data
        2. Indexes for Both Semistructured and Unstructured Data
      4. Managing Large Volumes of Data
        1. Bring Everything into the Structured Environment
        2. Selectively Storing Terms
      5. Simple Pointer
        1. Moving the Underlying Document
        2. Alternatives to Moved Documents
        3. Synonyms and Homographs
        4. External Categories
      6. Using the Unstructured Database for Analysis
        1. Access Through a Standard SQL Interface
        2. The High Level Design
        3. Accessing Multiple Tables
        4. “Hot” Data
      7. Summary
    12. 8. Analyzing a Combination of Unstructured Data and Structured Data
      1. Intersecting Structured and Unstructured Data
        1. Linking Communications
          1. A Word of Caution—False Positives
        2. Other Types of Linkages
        3. A Probabilistic Link
        4. Dynamic and Static Links
      2. Accessing Linkages
        1. Periodically Monitoring Linkage
        2. Independently Processing the Static Link
        3. A Join Based on the Link
      3. Submitting Real-Time Online Queries
        1. Enhancing Performance Through Priority Ranking
        2. Filtering Data Based on Probability of Access
        3. Looking at Links Partially
      4. Unstructured Text and Future Capabilities
      5. Summary
    13. 9. Analyzing Text Through Visualization
      1. Creating a Textual Visualization
        1. Selecting Homogeneous Data
        2. Integrating the Text
        3. Creating a Database Format
        4. Creating the Visualization
        5. Other Types of SOMs
        6. Iterative Development of the Visualization
      2. Analytical Activities
      3. Recasting a SOM Visualization
        1. Creating a SOM for Semistructured Data
      4. Summary
    14. 10. Spreadsheets and Email
      1. The Challenges of Spreadsheets
        1. Strategies for Access
        2. A Unique Spreadsheet Identifier
        3. Unstructured Data in the Spreadsheet
        4. Identifying Cells
        5. Cross Equivalency
        6. Global Spreadsheet Analysis
      2. Emails
        1. Screening Emails for Useful Data
        2. Collecting Customer Communications
      3. Summary
    15. 11. Metadata in Unstructured Data
      1. Metadata in the Unstructured Environment
        1. Internal Metadata
        2. External Metadata
      2. Approaching Metadata Through the Data Model
        1. The Basis for the Data Model
        2. The Levels of the Data Model
        3. How the Components Relate—The Larger Perspective
      3. Two Types of Data Model—External and Internal
        1. How Is the Internal Data Model Created?
        2. Creating the External Data Model
        3. Using Generic Data Models
        4. The Corporate Glossary
        5. Corporate Taxonomy
        6. Corporate Ontology
        7. The Role of the Data Model
      4. Data Quality in the Unstructured Environment
      5. Summary
    16. 12. A Methodology for Textual Analytics
      1. Preparing the Basic Components
      2. Specifying the Processing
      3. Organizing the Source Data
      4. Analyzing the Results
      5. Summary
    17. 13. Merging Unstructured Databases into the Data Warehouse
      1. The Data Warehouse
      2. Linking Databases
        1. Communications-Based Linkage Between Unstructured and Structured Data
        2. Identifier-Based Linkage Between Unstructured and Structured Data
      3. An Integrated Data Warehouse
        1. Multi-Database Linkage
        2. ETL for Unstructured Data
      4. Housing Databases in DBMS Technology
        1. Binding the Databases Together
        2. Federated Analytical Technology
      5. Simple Queries
      6. Summary
    18. 14. Using SQL to Analyze Text
      1. A Simple Query
      2. Indirect Query
      3. Proximity Search
      4. Basic Word String Snippet Report
      5. Summary
    19. 15. Case Study—Textual Analytics in Medical Research
      1. Medical Records
      2. The Cardiology Study
        1. The Challenge of Terminology
          1. The Tower of Babel
        2. Volumes of Data
      3. Revealing a Strategy
        1. The First Iteration
        2. The Second Iteration
        3. The Third Iteration
        4. Fourth Iteration
        5. Fifth Iteration
      4. Textual Analytics
        1. Standard Analytical Tools
    20. 16. Case Study—A Database for Harmful Chemicals
      1. The Institution’s Data on Harmful Chemicals
        1. An Abundance of Paper Reports
      2. A Request for Analyses Leading to the Requirement for Storing Data Electronically
        1. Putting Paper Data in an Electronic Format
        2. Creating the Standard Set of Reference Tables
        3. Text Editing and Integration
      3. Textual Analytics
      4. Periodic Additions
    21. 17. Case Study—Managing Contracts Through an Unstructured Database
      1. A Hodgepodge of Contracts
      2. A Plan for Textual Analytics
      3. Iterative Development
      4. Analytical Processing
    22. 18. Case Study—Creating a Corporate Taxonomy (Glossary)
      1. A Technological Jumble
      2. The IT Problem
      3. The Issue of Terminology
      4. Creating a Corporate Taxonomy (Glossary)
        1. Creating Definitions and Equivocations
      5. Using the Taxonomy
    23. 19. Case Study—Insurance Claims
      1. A Collection of Inconsistent Claims
      2. A Wealth of Information
      3. Creating the Integrated Data Warehouse
        1. First Iteration
        2. Second Iteration
        3. Third Iteration
        4. Fourth Iteration
    24. Glossary