Data Quality Assessment

Book Description

Imagine a group of prehistoric hunters armed with stone-tipped spears. Their primitive weapons made hunting large animals, such as mammoths, dangerous work. Over time, however, a new breed of hunters developed. They would stretch the skin of a previously killed mammoth on the wall and throw their spears, observing which spear, thrown from which angle and distance, penetrated the skin best. The data they gathered helped them make better spears and develop better hunting strategies.

Quality data is the key to any advancement, whether from the Stone Age to the Bronze Age or from the Information Age to whatever age comes next. The success of corporations and government institutions largely depends on how efficiently they can collect, organize, and use data about products, customers, competitors, and employees.

Fortunately, improving your data quality does not have to be such a mammoth task. Data Quality Assessment is a must-read for anyone who needs to understand, correct, or prevent data quality issues in their organization. Skipping theory and focusing purely on what is practical and what works, this text contains a proven approach to identifying, warehousing, and analyzing data errors. Master techniques in data profiling and gathering metadata, designing data quality rules, organizing rule and error catalogues, and constructing the dimensional data quality scorecard.

David Wells, Director of Education of the Data Warehousing Institute, says "This is one of those books that marks a milestone in the evolution of a discipline. Arkady's insights and techniques fuel the transition of data quality management from art to science -- from crafting to engineering. From deep experience, with thoughtful structure, and with engaging style Arkady brings the discipline of data quality to practitioners."

Table of Contents

  1. Chapter 1 Causes of Data Quality Problems
    1. 1.1. Initial Data Conversion
    2. 1.2. System Consolidations
    3. 1.3. Manual Data Entry
    4. 1.4. Batch Feeds
    5. 1.5. Real-Time Interfaces
    6. 1.6. Data Processing
    7. 1.7. Data Cleansing
    8. 1.8. Data Purging
    9. 1.9. Changes Not Captured
    10. 1.10. System Upgrades
    11. 1.11. New Data Uses
    12. 1.12. Loss of Expertise
    13. 1.13. Process Automation
  2. Chapter 2 Data Quality Program Overview
    1. 2.1. Data Quality Assessment
    2. 2.2. Data Cleansing
    3. 2.3. Monitoring Data Integration Interfaces
    4. 2.4. Ensuring Data Quality in Data Conversion and Consolidation
    5. 2.5. Building Data Quality Meta Data Warehouse
  3. Chapter 3 Data Quality Assessment Overview
    1. 3.1. Project Team
    2. 3.2. Project Plan Overview
    3. 3.3. Planning Phase
    4. 3.4. Preparation Phase
      1. 3.4.1. Loading Data to Staging Area
      2. 3.4.2. Gathering General Meta Data
      3. 3.4.3. Designing Data Quality Meta Data Warehouse
    5. 3.5. Implementation Phase
      1. 3.5.1. Data Profiling
      2. 3.5.2. Designing Data Quality Rules
    6. 3.6. Fine-Tuning Phase
    7. 3.7. Ongoing Data Quality Monitoring
  4. Chapter 4 Attribute Domain Constraints
    1. 4.1. Introduction to Attribute Domain Constraints
    2. 4.2. Attribute Profiling
    3. 4.3. Optionality Constraints
    4. 4.4. Attribute Format Constraints
    5. 4.5. Valid Value Constraints
    6. 4.6. Precision Constraints
  5. Chapter 5 Relational Integrity Rules
    1. 5.1. Relational Data Model Basics
    2. 5.2. Identity Rules
    3. 5.3. Reference Rules
    4. 5.4. Cardinal Rules
    5. 5.5. Inheritance Rules
  6. Chapter 6 Rules for Historical Data
    1. 6.1. Introduction to Historical Data
    2. 6.2. Basic Data Quality Rules for Historical Data
      1. 6.2.1. Currency Rules
      2. 6.2.2. Retention Rules
      3. 6.2.3. Continuity and Granularity Rules
    3. 6.3. Advanced Data Quality Rules for Historical Data
      1. 6.3.1. Timeline Patterns
      2. 6.3.2. Value Patterns
    4. 6.4. Data Quality Rules for Event Histories
      1. 6.4.1. Event Dependencies
      2. 6.4.2. Event Conditions
      3. 6.4.3. Event-Specific Attribute Constraints
  7. Chapter 7 Rules for State-Dependent Objects
    1. 7.1. Introduction to State-Dependent Objects
    2. 7.2. Identifying State-Dependent Entities
    3. 7.3. Profiling State-Transition Models
      1. 7.3.1. State and Terminator Profiling
      2. 7.3.2. State-Transition Profiling
      3. 7.3.3. Action Profiling
      4. 7.3.4. Conclusion
    4. 7.4. Rules Derived from State-Transition Diagrams
      1. 7.4.1. Domain Constraints
      2. 7.4.2. Transition Constraints
    5. 7.5. Timeline Constraints
      1. 7.5.1. Continuity Rules
      2. 7.5.2. Duration Rules
      3. 7.5.3. State Duration Profiling
      4. 7.5.4. Cumulative Duration Rules
    6. 7.6. Advanced Rules
      1. 7.6.1. Action-Specific Attribute Constraints
      2. 7.6.2. State-Specific Attribute Constraints
      3. 7.6.3. Action Pre-Conditions and Post-Conditions
  8. Chapter 8 Attribute Dependency Rules
    1. 8.1. Introduction to Attribute Dependency Rules
      1. 8.1.1. Redundant Attributes
      2. 8.1.2. Derived Attributes
      3. 8.1.3. Partially Dependent Attributes
      4. 8.1.4. Attributes with Conditional Optionality
      5. 8.1.5. Correlated Attributes
    2. 8.2. Identifying Dependencies through Analysis
      1. 8.2.1. Gathering Expert Knowledge
      2. 8.2.2. Investigating Data Relationships
      3. 8.2.3. Data Gazing
    3. 8.3. Identifying Dependencies through Data Profiling
      1. 8.3.1. Value Affinity
      2. 8.3.2. Value Correlation
      3. 8.3.3. Value Clustering
    4. 8.4. Identifying Dependencies Across Data Sources
      1. 8.4.1. Step 1 – Identifying Secondary Data Sources
      2. 8.4.2. Step 2 – Qualifying Secondary Data Sources
      3. 8.4.3. Step 3 – Subject Matching
      4. 8.4.4. Step 4 – Identifying Related Entities and Attributes
  9. Chapter 9 Implementing Data Quality Rules
    1. 9.1. Project Scope and Rule Design
    2. 9.2. Selecting Optimal Rule Design
      1. 9.2.1. Rule Aggregation
      2. 9.2.2. Rule Specialization
      3. 9.2.3. Derived Rules
      4. 9.2.4. Error Grouping
    3. 9.3. Rule Cataloguing
      1. 9.3.1. Rule Catalogue Components
      2. 9.3.2. Rule Catalogue Data Model
    4. 9.4. Rule Coding
      1. 9.4.1. Writing Individual Programs for Each Rule
      2. 9.4.2. Using Parameterized Rule Engine
      3. 9.4.3. Combining Two Approaches
  10. Chapter 10 Fine-Tuning Data Quality Rules
    1. 10.1. Rule Imperfections
    2. 10.2. Rule Fine-Tuning Process
    3. 10.3. Identifying Rule Imperfections
      1. 10.3.1. What to Validate?
      2. 10.3.2. How to Select Validation Sample?
      3. 10.3.3. How to Perform Validation in Iterations?
    4. 10.4. Analyzing Imperfection Patterns
    5. 10.5. Eliminating False Positives
    6. 10.6. Handling False Negatives
    7. 10.7. Handling Uncertainty in Error Location
  11. Chapter 11 Cataloguing Errors
    1. 11.1. Error Catalogue Basics
    2. 11.2. Recording Missing Records
    3. 11.3. Errors Affecting Multiple Records
    4. 11.4. Error Groups
    5. 11.5. Subject-Level Error Tracking
    6. 11.6. Error Messages
  12. Chapter 12 Measuring Data Quality Scores
    1. 12.1. Introduction to Aggregate Scores
      1. 12.1.1. Scores Measuring Impact of Bad Data
      2. 12.1.2. Scores Identifying Sources of Bad Data
      3. 12.1.3. Scores Identifying Location of Bad Data
      4. 12.1.4. Record-Level and Subject-Level Scores
    2. 12.2. Score Tabulation Process Overview
    3. 12.3. Building Score Catalogue
      1. 12.3.1. Defining Score Objective
      2. 12.3.2. Identifying Relevant Data Elements
      3. 12.3.3. Identifying Relevant Data Quality Rules
      4. 12.3.4. Defining Relevant Subject Populations
      5. 12.3.5. Defining Relevant Recordsets
    4. 12.4. Tabulating Record-Level Scores
      1. 12.4.1. Counting All Relevant Records
      2. 12.4.2. Counting Erroneous Records
      3. 12.4.3. Counting Missing Records
    5. 12.5. Adjusting Scores for Rule Imperfections
    6. 12.6. Tabulating Subject-Level Scores
  13. Chapter 13 Data Quality Meta Data Warehouse
    1. 13.1. Data Quality Assessment Meta Data
      1. 13.1.1. Step 1 – Gathering General Meta Data
      2. 13.1.2. Step 2 – Data Analysis and Profiling
      3. 13.1.3. Step 3 – Populating Staging Area
      4. 13.1.4. Step 4 – Designing Data Quality Rules
      5. 13.1.5. Step 5 – Implementing Data Quality Rules
      6. 13.1.6. Step 6 – Fine-Tuning Data Quality Rules
      7. 13.1.7. Step 7 – Tabulating Aggregate Scores
    2. 13.2. Data Quality Scorecard
      1. 13.2.1. Score Summary
      2. 13.2.2. Score Decompositions
      3. 13.2.3. Intermediate Error Reports
      4. 13.2.4. Atomic Level Information
      5. 13.2.5. Miscellaneous Definitions
    3. 13.3. Other DQMDW Functions and Reports
  14. Chapter 14 Recurrent Data Quality Assessment
    1. 14.1. Basics of Recurrent Data Quality Assessment
    2. 14.2. Data Quality Changes on Atomic Level
    3. 14.3. Adding Time Dimension to DQMDW
    4. 14.4. Executing Assessment Runs Against Production Data