Sharing Big Data Safely

Book Description

Many data-driven companies are moving to protect certain types of data against intrusion, leaks, and unauthorized access. But how do you lock down data while still granting access to the people who need to see it? In this practical book, authors Ted Dunning and Ellen Friedman offer two novel solutions that you can implement right away.

Table of Contents

  1. Preface
    1. Preface
    2. Who Should Use This Book
  2. 1. So Secure It’s Lost
    1. Safe Access in Secure Big Data Systems
  3. 2. The Challenge: Sharing Data Safely
    1. Surprising Outcomes with Anonymity
    2. The Netflix Prize
    3. Unexpected Results from the Netflix Contest
    4. Implications of Breaking Anonymity
    5. Be Alert to the Possibility of Cross-Reference Datasets
    6. New York Taxicabs: Threats to Privacy
    7. Sharing Data Safely
  4. 3. Data on a Need-to-Know Basis
    1. Views: A Secure Way to Limit What Is Seen
    2. Why Limit Access?
    3. Apache Drill Views for Granular Security
    4. How Views Work
    5. Summary of Need-to-Know Methods
  5. 4. Fake Data Gives Real Answers
    1. The Surprising Thing About Fake Data
    2. Keep It Simple: log-synth
    3. Log-synth Use Case 1: Broken Large-Scale Hive Query
    4. Log-synth Use Case 2: Fraud Detection Model for Common Point of Compromise
      1. What Thieves Do
      2. Why Machine Learning Experts Were Consulted
      3. Using log-synth to Generate Fake User Histories
    5. Summary: Fake Data and log-synth to Safely Work with Secure Data
  6. 5. Fixing a Broken Large-Scale Query
    1. A Description of the Problem
    2. Determining What the Synthetic Data Needed to Be
    3. Schema for the Synthetic Data
    4. Generating the Synthetic Data
    5. Tips and Caveats
    6. What to Do from Here?
  7. 6. Fraud Detection
    1. What Is Really Important?
    2. The User Model
    3. Sampler for the Common Point of Compromise
    4. How the Breach Model Works
    5. Results of the Entire System Together
    6. Handy Tricks
    7. Summary
  8. 7. A Detailed Look at log-synth
    1. Goals
    2. Maintaining Simplicity: The Role of JSON in log-synth
    3. Structure
    4. Sampling Complex Values
    5. Structuring and De-structuring Samplers
    6. Extending log-synth
    7. Using log-synth with Apache Drill
    8. Choice of Data Generators
    9. R is for Random
    10. Benchmark Systems
    11. Probabilistic Programming
    12. Differential Privacy Preserving Systems
    13. Future Directions for log-synth
  9. 8. Sharing Data Safely: Practical Lessons
  10. A. Additional Resources
    1. Log-synth Open Source Software
    2. Apache Drill and Drill SQL Views
    3. General Resources and References
      1. Cheapside Hoard and Treasures
      2. Codes and Cipher
      3. Netflix Prize
      4. Problems with Data Sharing
      5. Additional O’Reilly Books by Dunning and Friedman