Data Simplification

Book Description

Data Simplification: Taming Information With Open Source Tools addresses the simple fact that modern data is too big and complex to analyze in its native form. Data simplification is the process whereby large and complex data is rendered usable. Complex data must be simplified before it can be analyzed, but the process of data simplification is anything but simple, requiring a specialized set of skills and tools.

This book provides data scientists from every scientific discipline with the methods and tools to simplify their data for immediate analysis or long-term storage in a form that can be readily repurposed or integrated with other data.

Drawing upon years of practical experience, and using numerous examples and use cases, Jules Berman discusses the principles, methods, and tools that must be studied and mastered to achieve data simplification; the open source tools, free utilities, and snippets of code that can be reused and repurposed to simplify data; natural language processing and machine translation as tools to simplify data; and data summarization and visualization and the role they play in making data useful for the end user.

  • Discusses data simplification principles, methods, and tools that must be studied and mastered
  • Provides open source tools, free utilities, and snippets of code that can be reused and repurposed to simplify data
  • Explains how to best utilize indexes to search, retrieve, and analyze textual data
  • Shows the data scientist how to apply ontologies, classifications, classes, properties, and instances to data using tried and true methods

Table of Contents

  1. Cover image
  2. Title page
  3. Table of Contents
  4. Copyright
  5. Dedication
  6. Foreword
  7. Preface
    1. Organization of this book
    2. Chapter Organization
    3. How to Read this Book
    4. Nota Bene
    5. Glossary
  8. Author Biography
  9. Chapter 1: The Simple Life
    1. Abstract
    2. 1.1 Simplification Drives Scientific Progress
    3. 1.2 The Human Mind is a Simplifying Machine
    4. 1.3 Simplification in Nature
    5. 1.4 The Complexity Barrier
    6. 1.5 Getting Ready
    7. Open Source Tools
    8. Glossary
  10. Chapter 2: Structuring Text
    1. Abstract
    2. 2.1 The Meaninglessness of Free Text
    3. 2.2 Sorting Text, the Impossible Dream
    4. 2.3 Sentence Parsing
    5. 2.4 Abbreviations
    6. 2.5 Annotation and the Simple Science of Metadata
    7. 2.6 Specifications Good, Standards Bad
    8. Open Source Tools
    9. Glossary
  11. Chapter 3: Indexing Text
    1. Abstract
    2. 3.1 How Data Scientists Use Indexes
    3. 3.2 Concordances and Indexed Lists
    4. 3.3 Term Extraction and Simple Indexes
    5. 3.4 Autoencoding and Indexing with Nomenclatures
    6. 3.5 Computational Operations on Indexes
    7. Open Source Tools
    8. Glossary
  12. Chapter 4: Understanding Your Data
    1. Abstract
    2. 4.1 Ranges and Outliers
    3. 4.2 Simple Statistical Descriptors
    4. 4.3 Retrieving Image Information
    5. 4.4 Data Profiling
    6. 4.5 Reducing Data
    7. Open Source Tools
    8. Glossary
  13. Chapter 5: Identifying and Deidentifying Data
    1. Abstract
    2. 5.1 Unique Identifiers
    3. 5.2 Poor Identifiers, Horrific Consequences
    4. 5.3 Deidentifiers and Reidentifiers
    5. 5.4 Data Scrubbing
    6. 5.5 Data Encryption and Authentication
    7. 5.6 Timestamps, Signatures, and Event Identifiers
    8. Open Source Tools
    9. Glossary
  14. Chapter 6: Giving Meaning to Data
    1. Abstract
    2. 6.1 Meaning and Triples
    3. 6.2 Driving Down Complexity with Classifications
    4. 6.3 Driving Up Complexity With Ontologies
    5. 6.4 The Unreasonable Effectiveness of Classifications
    6. 6.5 Properties That Cross Multiple Classes
    7. Open Source Tools
    8. Glossary
  15. Chapter 7: Object-Oriented Data
    1. Abstract
    2. 7.1 The Importance of Self-Explaining Data
    3. 7.2 Introspection and Reflection
    4. 7.3 Object-Oriented Data Objects
    5. 7.4 Working with Object-Oriented Data
    6. Open Source Tools
    7. Glossary
  16. Chapter 8: Problem Simplification
    1. Abstract
    2. 8.1 Random Numbers
    3. 8.2 Monte Carlo Simulations
    4. 8.3 Resampling and Permutating
    5. 8.4 Verification, Validation, and Reanalysis
    6. 8.5 Data Permanence and Data Immutability
    7. Open Source Tools
    8. Glossary
  17. Index