Principles of Big Data

Book Description

Principles of Big Data helps readers avoid the common mistakes that endanger all Big Data projects. By stressing simple, fundamental concepts, this book teaches readers how to organize large volumes of complex data and how to achieve data permanence when the content of the data is constantly changing. General methods for data verification and validation, as specifically applied to Big Data resources, are emphasized throughout. The book demonstrates how adept analysts can find relationships among data objects held in disparate Big Data resources when those objects are endowed with semantic support (i.e., organized in classes of uniquely identified data objects). Readers will learn how their data can be integrated with data from other resources, and how the data extracted from Big Data resources can be used for purposes beyond those imagined by the data creators.



• Learn general methods for specifying Big Data in a way that is understandable to humans and to computers.

• Avoid the pitfalls in Big Data design and analysis.

• Understand how to create and use Big Data safely and responsibly, in keeping with the laws, regulations, and ethical standards that apply to the acquisition, distribution, and integration of Big Data resources.

Table of Contents

  1. Cover image
  2. Title page
  3. Table of Contents
  4. Copyright
  5. Dedication
  6. Acknowledgments
  7. Author Biography
  8. Preface
  9. Introduction
    1. Definition of Big Data
    2. Big Data Versus Small Data
    3. Whence Comest Big Data?
    4. The Most Common Purpose of Big Data is to Produce Small Data
    5. Opportunities
    6. Big Data Moves to the Center of the Information Universe
  10. Chapter 1. Providing Structure to Unstructured Data
    1. Background
    2. Machine Translation
    3. Autocoding
    4. Indexing
    5. Term Extraction
    6. References
  11. Chapter 2. Identification, Deidentification, and Reidentification
    1. Background
    2. Features of an Identifier System
    3. Registered Unique Object Identifiers
    4. Really Bad Identifier Methods
    5. Embedding Information in an Identifier: Not Recommended
    6. One-Way Hashes
    7. Use Case: Hospital Registration
    8. Deidentification
    9. Data Scrubbing
    10. Reidentification
    11. Lessons Learned
    12. References
  12. Chapter 3. Ontologies and Semantics
    1. Background
    2. Classifications, the Simplest of Ontologies
    3. Ontologies, Classes with Multiple Parents
    4. Choosing a Class Model
    5. Introduction to Resource Description Framework Schema
    6. Common Pitfalls in Ontology Development
    7. References
  13. Chapter 4. Introspection
    1. Background
    2. Knowledge of Self
    3. eXtensible Markup Language
    4. Introduction to Meaning
    5. Namespaces and the Aggregation of Meaningful Assertions
    6. Resource Description Framework Triples
    7. Reflection
    8. Use Case: Trusted Time Stamp
    9. Summary
    10. References
  14. Chapter 5. Data Integration and Software Interoperability
    1. Background
    2. The Committee to Survey Standards
    3. Standard Trajectory
    4. Specifications and Standards
    5. Versioning
    6. Compliance Issues
    7. Interfaces to Big Data Resources
    8. References
  15. Chapter 6. Immutability and Immortality
    1. Background
    2. Immutability and Identifiers
    3. Data Objects
    4. Legacy Data
    5. Data Born from Data
    6. Reconciling Identifiers across Institutions
    7. Zero-Knowledge Reconciliation
    8. The Curator’s Burden
    9. References
  16. Chapter 7. Measurement
    1. Background
    2. Counting
    3. Gene Counting
    4. Dealing with Negations
    5. Understanding Your Control
    6. Practical Significance of Measurements
    7. Obsessive-Compulsive Disorder: The Mark of a Great Data Manager
    8. References
  17. Chapter 8. Simple but Powerful Big Data Techniques
    1. Background
    2. Look At the Data
    3. Data Range
    4. Denominator
    5. Frequency Distributions
    6. Mean and Standard Deviation
    7. Estimation-Only Analyses
    8. Use Case: Watching Data Trends with Google Ngrams
    9. Use Case: Estimating Movie Preferences
    10. References
  18. Chapter 9. Analysis
    1. Background
    2. Analytic Tasks
    3. Clustering, Classifying, Recommending, and Modeling
    4. Data Reduction
    5. Normalizing and Adjusting Data
    6. Big Data Software: Speed and Scalability
    7. Find Relationships, Not Similarities
    8. References
  19. Chapter 10. Special Considerations in Big Data Analysis
    1. Background
    2. Theory in Search of Data
    3. Data in Search of a Theory
    4. Overfitting
    5. Bigness Bias
    6. Too Much Data
    7. Fixing Data
    8. Data Subsets in Big Data: Neither Additive nor Transitive
    9. Additional Big Data Pitfalls
    10. References
  20. Chapter 11. Stepwise Approach to Big Data Analysis
    1. Background
    2. Step 1. A Question Is Formulated
    3. Step 2. Resource Evaluation
    4. Step 3. A Question Is Reformulated
    5. Step 4. Query Output Adequacy
    6. Step 5. Data Description
    7. Step 6. Data Reduction
    8. Step 7. Algorithms Are Selected, If Absolutely Necessary
    9. Step 8. Results Are Reviewed and Conclusions Are Asserted
    10. Step 9. Conclusions Are Examined and Subjected to Validation
    11. References
  21. Chapter 12. Failure
    1. Background
    2. Failure Is Common
    3. Failed Standards
    4. Complexity
    5. When Does Complexity Help?
    6. When Redundancy Fails
    7. Save Money; Don’t Protect Harmless Information
    8. After Failure
    9. Use Case: Cancer Biomedical Informatics Grid, a Bridge too Far
    10. References
  22. Chapter 13. Legalities
    1. Background
    2. Responsibility for the Accuracy and Legitimacy of Contained Data
    3. Rights to Create, Use, and Share the Resource
    4. Copyright and Patent Infringements Incurred by Using Standards
    5. Protections for Individuals
    6. Consent
    7. Unconsented Data
    8. Good Policies Are a Good Policy
    9. Use Case: The Havasupai Story
    10. References
  23. Chapter 14. Societal Issues
    1. Background
    2. How Big Data Is Perceived
    3. The Necessity of Data Sharing, Even When It Seems Irrelevant
    4. Reducing Costs and Increasing Productivity with Big Data
    5. Public Mistrust
    6. Saving Us from Ourselves
    7. Hubris and Hyperbole
    8. References
  24. Chapter 15. The Future
    1. Background
    2. Last Words
    3. References
  25. Glossary
  26. References
  27. Index