Anonymizing Health Data

Book Description

With this practical book, you will learn proven methods for anonymizing health data to help your organization share meaningful datasets without exposing patient identity. Leading experts Khaled El Emam and Luk Arbuckle walk you through a risk-based methodology, using case studies from their efforts to de-identify hundreds of datasets.

Table of Contents

  1. Preface
    1. Audience
    2. Conventions Used in This Book
    3. Safari® Books Online
    4. How to Contact Us
    5. Acknowledgements
  2. 1. Introduction
    1. To Anonymize or Not to Anonymize
      1. Consent, or Anonymization?
      2. Penny Pinching
      3. People Are Private
    2. The Two Pillars of Anonymization
      1. Masking Standards
      2. De-Identification Standards
        1. Lists
        2. Heuristics
        3. Risk-based methodology
    3. Anonymization in the Wild
      1. Organizational Readiness
      2. Making It Practical
      3. Use Cases
    4. Stigmatizing Analytics
    5. Anonymization in Other Domains
    6. About This Book
  3. 2. A Risk-Based De-Identification Methodology
    1. Basic Principles
    2. Steps in the De-Identification Methodology
      1. Step 1: Selecting Direct and Indirect Identifiers
      2. Step 2: Setting the Threshold
      3. Step 3: Examining Plausible Attacks
      4. Step 4: De-Identifying the Data
      5. Step 5: Documenting the Process
    3. Measuring Risk Under Plausible Attacks
      1. T1: Deliberate Attempt at Re-Identification
      2. T2: Inadvertent Attempt at Re-Identification
      3. T3: Data Breach
      4. T4: Public Data
    4. Measuring Re-Identification Risk
      1. Probability Metrics
      2. Information Loss Metrics
    5. Risk Thresholds
      1. Choosing Thresholds
      2. Meeting Thresholds
    6. Risky Business
  4. 3. Cross-Sectional Data: Research Registries
    1. Process Overview
      1. Secondary Uses and Disclosures
      2. Getting the Data
      3. Formulating the Protocol
      4. Negotiating with the Data Access Committee
    2. BORN Ontario
      1. BORN Data Set
    3. Risk Assessment
      1. Threat Modeling
      2. Results
      3. Year on Year: Reusing Risk Analyses
    4. Final Thoughts
  5. 4. Longitudinal Discharge Abstract Data: State Inpatient Databases
    1. Longitudinal Data
      1. Don’t Treat It Like Cross-Sectional Data
    2. De-Identifying Under Complete Knowledge
      1. Approximate Complete Knowledge
      2. Exact Complete Knowledge
      3. Implementation
      4. Generalization Under Complete Knowledge
    3. The State Inpatient Database (SID) of California
      1. The SID of California and Open Data
    4. Risk Assessment
      1. Threat Modeling
      2. Results
    5. Final Thoughts
  6. 5. Dates, Long Tails, and Correlation: Insurance Claims Data
    1. The Heritage Health Prize
    2. Date Generalization
      1. Randomizing Dates Independently of One Another
      2. Shifting the Sequence, Ignoring the Intervals
      3. Generalizing Intervals to Maintain Order
      4. Dates and Intervals and Back Again
      5. A Different Anchor
      6. Other Quasi-Identifiers
      7. Connected Dates
    3. Long Tails
      1. The Risk from Long Tails
      2. Threat Modeling
      3. Number of Claims to Truncate
      4. Which Claims to Truncate
    4. Correlation of Related Items
      1. Expert Opinions
      2. Predictive Models
      3. Implications for De-Identifying Data Sets
    5. Final Thoughts
  7. 6. Longitudinal Events Data: A Disaster Registry
    1. Adversary Power
      1. Keeping Power in Check
      2. Power in Practice
      3. A Sample of Power
    2. The WTC Disaster Registry
      1. Capturing Events
      2. The WTC Data Set
      3. The Power of Events
    3. Risk Assessment
      1. Threat Modeling
      2. Results
    4. Final Thoughts
  8. 7. Data Reduction: Research Registry Revisited
    1. The Subsampling Limbo
      1. How Low Can We Go?
      2. Not for All Types of Risk
      3. BORN to Limbo!
    2. Many Quasi-Identifiers
      1. Subsets of Quasi-Identifiers
      2. Covering Designs
      3. Covering BORN
    3. Final Thoughts
  9. 8. Free-Form Text: Electronic Medical Records
    1. Not So Regular Expressions
    2. General Approaches to Text Anonymization
    3. Ways to Mark the Text as Anonymized
    4. Evaluation Is Key
      1. Appropriate Metrics, Strict but Fair
      2. Standards for Recall, and a Risk-Based Approach
      3. Standards for Precision
    5. Anonymization Rules
    6. Informatics for Integrating Biology and the Bedside (i2b2)
      1. i2b2 Text Data Set
    7. Risk Assessment
      1. Threat Modeling
      2. A Rule-Based System
      3. Results
    8. Final Thoughts
  10. 9. Geospatial Aggregation: Dissemination Areas and ZIP Codes
    1. Where the Wild Things Are
    2. Being Good Neighbors
      1. Distance Between Neighbors
      2. Circle of Neighbors
      3. Round Earth
      4. Flat Earth
    3. Clustering Neighbors
      1. We All Have Boundaries
      2. Fast Nearest Neighbor
    4. Too Close to Home
      1. Levels of Geoproxy Attacks
      2. Measuring Geoproxy Risk
    5. Final Thoughts
  11. 10. Medical Codes: A Hackathon
    1. Codes in Practice
    2. Generalization
      1. The Digits of Diseases
      2. The Digits of Procedures
      3. The (Alpha)Digits of Drugs
    3. Suppression
    4. Shuffling
    5. Final Thoughts
  12. 11. Masking: Oncology Databases
    1. Schema Shmema
    2. Data in Disguise
      1. Field Suppression
      2. Randomization
      3. Pseudonymization
      4. Frequency of Pseudonyms
    3. Masking On the Fly
    4. Final Thoughts
  13. 12. Secure Linking
    1. Let’s Link Up
    2. Doing It Securely
      1. Don’t Try This at Home
      2. The Third-Party Problem
      3. Basic Layout for Linking Up
    3. The Nitty-Gritty Protocol for Linking Up
      1. Bringing Paillier to the Parties
      2. Matching on the Unknown
    4. Scaling Up
      1. Cuckoo Hashing
      2. How Fast Does a Cuckoo Run?
    5. Final Thoughts
  14. 13. De-Identification and Data Quality
    1. Useful Data from Useful De-Identification
    2. Degrees of Loss
    3. Workload-Aware De-Identification
      1. Questions to Improve Data Utility
    4. Final Thoughts
  15. Colophon
  16. Index
  17. Copyright