Book description

In this insightful book, you'll learn from the best data practitioners in the field just how wide-ranging -- and beautiful -- working with data can be. Join 39 contributors as they explain how they developed simple and elegant solutions on projects ranging from the Mars lander to a Radiohead video. With Beautiful Data, you will:

  • Explore the opportunities and challenges involved in working with the vast number of datasets made available by the Web

  • Learn how to visualize trends in urban crime, using maps and data mashups

  • Discover the challenges of designing a data processing system that works within the constraints of space travel

  • Learn how crowdsourcing and transparency have combined to advance the state of drug research

  • Understand how new data can automatically trigger alerts when it matches or overlaps pre-existing data

  • Learn about the massive infrastructure required to create, capture, and process DNA data

That's only a small sample of what you'll find in Beautiful Data. For anyone who handles data, this is a truly fascinating book. Contributors include:

  • Nathan Yau

  • Jonathan Follett and Matt Holm

  • J.M. Hughes

  • Raghu Ramakrishnan, Brian Cooper, and Utkarsh Srivastava

  • Jeff Hammerbacher

  • Jason Dykes and Jo Wood

  • Jeff Jonas and Lisa Sokol

  • Jud Valeski

  • Alon Halevy and Jayant Madhavan

  • Aaron Koblin with Valdean Klump

  • Michal Migurski

  • Jeff Heer

  • Coco Krumme

  • Peter Norvig

  • Matt Wood and Ben Blackburne

  • Jean-Claude Bradley, Rajarshi Guha, Andrew Lang, Pierre Lindenbaum, Cameron Neylon, Antony Williams, and Egon Willighagen

  • Lukas Biewald and Brendan O'Connor

  • Hadley Wickham, Deborah Swayne, and David Poole

  • Andrew Gelman, Jonathan P. Kastellec, and Yair Ghitza

  • Toby Segaran

Table of Contents

  1. Beautiful Data
    1. SPECIAL OFFER: Upgrade this ebook with O’Reilly
    2. A Note Regarding Supplemental Files
    3. Preface
      1. How This Book Is Organized
      2. Conventions Used in This Book
      3. Using Code Examples
      4. How to Contact Us
      5. Safari® Books Online
    4. 1. Seeing Your Life in Data
      1. Personal Environmental Impact Report (PEIR)
      2. your.flowingdata (YFD)
      3. Personal Data Collection
        1. Working Data Collection into Routine
          1. Asynchronous data collection
      4. Data Storage
      5. Data Processing
      6. Data Visualization
        1. PEIR
          1. Mapping location-based data
          2. Experimenting with visual cues
          3. Mapping multivariate location traces
          4. Choosing a color scheme
          5. Making trips interactive
          6. Displaying distributions
          7. Sharing personal data
        2. YFD
      7. The Point
      8. How to Participate
    5. 2. The Beautiful People: Keeping Users in Mind When Designing Data Collection Methods
      1. Introduction: User Empathy Is the New Black
        1. What Is UX?
        2. The Benefits of Applying UX Best Practices to Data Collection
      2. The Project: Surveying Customers About a New Luxury Product
      3. Specific Challenges to Data Collection
        1. Challenges of Accessibility
        2. Challenges of Perception
          1. Building trust
          2. Length of survey
          3. Accurate data collection
          4. Motivation
      4. Designing Our Solution
        1. Design Philosophy
        2. Designing the Form Layout
          1. Web form typography and accessibility
          2. Giving them some space
          3. Accommodating different browsers and testing for compatibility
          4. Interaction design considerations: Dynamic form length
          5. Designing trust
          6. Designing for accurate data collection
          7. Motivation
          8. Reporting the live data results
      5. Results and Reflection
    6. 3. Embedded Image Data Processing on Mars
      1. Abstract
      2. Introduction
      3. Some Background
      4. To Pack or Not to Pack
      5. The Three Tasks
      6. Slotting the Images
      7. Passing the Image: Communication Among the Three Tasks
      8. Getting the Picture: Image Download and Processing
      9. Image Compression
      10. Downlink, or, It's All Downhill from Here
      11. Conclusion
    7. 4. Cloud Storage Design in a PNUTShell
      1. Introduction
      2. Updating Data
        1. The Challenge
        2. Our Approach
          1. More on mastership
          2. Supporting ordered data
          3. Trading off consistency for availability
      3. Complex Queries
        1. The Challenge
        2. Our Approach
      4. Comparison with Other Systems
        1. Google's BigTable
        2. Amazon's Dynamo
        3. Microsoft Azure SDS
        4. Other Related Systems
        5. Other Systems at Yahoo!
      5. Conclusion
      6. Acknowledgments
      7. References
    8. 5. Information Platforms and the Rise of the Data Scientist
      1. Libraries and Brains
      2. Facebook Becomes Self-Aware
      3. A Business Intelligence System
      4. The Death and Rebirth of a Data Warehouse
      5. Beyond the Data Warehouse
      6. The Cheetah and the Elephant
      7. The Unreasonable Effectiveness of Data
      8. New Tools and Applied Research
      9. MAD Skills and Cosmos
      10. Information Platforms As Dataspaces
      11. The Data Scientist
      12. Conclusion
    9. 6. The Geographic Beauty of a Photographic Archive
      1. Beauty in Data: Geograph
      2. Visualization, Beauty, and Treemaps
        1. What Is Beauty in Visual Data Exploration?
        2. Making Treemaps Beautiful: A Geographic Perspective
      3. A Geographic Perspective on Geograph Term Use
        1. Representing the Term Hierarchy
        2. Representing Absolute Location with Color
        3. Representing Relative Location with Spatial Treemaps
        4. Representing Location Displacement
      4. Beauty in Discovery
      5. Reflection and Conclusion
      6. Acknowledgments
      7. References
    10. 7. Data Finds Data
      1. Introduction
      2. The Benefits of Just-in-Time Discovery
      3. Corruption at the Roulette Wheel
      4. Enterprise Discoverability
      5. Federated Search Ain't All That
      6. Directories: Priceless
      7. Relevance: What Matters and to Whom?
      8. Components and Special Considerations
        1. The Existence of, and Availability of, Observations
        2. The Ability to Extract and Classify Features from the Observations
        3. The Ability to Efficiently Discover Related Historical Context
        4. The Ability to Make Assertions (Same or Related) About New Observations
        5. The Ability to Recognize When New Observations Reverse Earlier Assertions
        6. The Ability to Accumulate and Persist This Asserted Context
        7. The Ability to Recognize the Formation of Relevance/Insight
        8. The Ability to Notify the Appropriate Entity of Such Insight
      9. Privacy Considerations
      10. Conclusion
    11. 8. Portable Data in Real Time
      1. Introduction
      2. The State of the Art
        1. Transport
          1. XMPP
          2. BitTorrent
          3. Proprietary/P2P
        2. Formats
        3. APIs
        4. Polling
          1. Rate limiting
          2. Getting it right
          3. Zero miles per gallon efficiency
        5. Events
          1. HTML 5 events
        6. WAN Scale Events
      3. Social Data Normalization
        1. Business Value of Data
          1. Public versus private
      4. Conclusion: Mediation via Gnip
    12. 9. Surfacing the Deep Web
      1. What Is the Deep Web?
      2. Alternatives to Offering Deep-Web Access
        1. Basics of HTML Form Processing
        2. Queries and Query Templates
        3. Selecting Input Combinations
          1. Quality of query templates
          2. Informativeness test
          3. Searching for informative query templates
        4. Predicting Input Values
          1. Generic text inputs
          2. Typed text inputs
      3. Conclusion and Future Work
      4. References
    13. 10. Building Radiohead's House of Cards
      1. How It All Started
      2. The Data Capture Equipment
        1. Velodyne Lidar
        2. Geometric Informatics
      3. The Advantages of Two Data Capture Systems
      4. The Data
      5. Capturing the Data, aka "The Shoot"
        1. The Outdoor Lidar Shoot
        2. The Indoor Lidar Shoot
        3. The Indoor GeoVideo Shoot
      6. Processing the Data
      7. Post-Processing the Data
      8. Launching the Video
      9. Conclusion
    14. 11. Visualizing Urban Data
      1. Introduction
      2. Background
      3. Cracking the Nut
      4. Making It Public
      5. Revisiting
      6. Conclusion
    15. 12. The Design of Sense.us
      1. Visualization and Social Data Analysis
      2. Data
      3. Visualization
        1. Design Considerations
          1. Foster personal relevance
          2. Provide effective visual encodings
          3. Make each display distinct
          4. Support intuitive exploration
          5. Be engaging and playful
        2. Visualization Designs
          1. Job Voyager
          2. Birthplace Voyager
          3. U.S. census state map and scatterplot
          4. Population pyramid
          5. Implementation details
      4. Collaboration
        1. View Sharing
        2. Doubly Linked Discussion
        3. Pointing via Graphical Annotation
        4. Collecting and Linking Views
        5. Awareness and Social Navigation
        6. Unobtrusive Collaboration
      5. Voyagers and Voyeurs
        1. Hunting for Patterns
        2. Making Sense of It All
        3. Crowd Surfing
      6. Conclusion
      7. References
    16. 13. What Data Doesn't Do
      1. When Doesn't Data Drive?
        1. 1. More Data Isn't Always Better
        2. 2. More Data Isn't Always Easy
        3. 3. Data Alone Doesn't Explain
        4. 4. Data Isn't Good for a Single Answer
        5. 5. Data Doesn't Predict
        6. 6. Probability Isn't Intuitive
        7. 7. Probabilities Aren't Intuitive
        8. 8. The Real World Doesn't Create Random Variables
        9. 9. Data Doesn't Stand Alone
        10. 10. Data Isn't Free from the Eye of the Beholder
      2. Conclusion
      3. References
    17. 14. Natural Language Corpus Data
      1. Word Segmentation
      2. Secret Codes
      3. Spelling Correction
      4. Other Tasks
        1. Language Identification
        2. Spam Detection and Other Classification Tasks
        3. Author Identification (Stylometry)
        4. Document Unshredding and DNA Sequencing
        5. Machine Translation
      5. Discussion and Conclusion
      6. Acknowledgments
    18. 15. Life in Data: The Story of DNA
      1. DNA As a Data Store
        1. DNA Makes RNA Makes Proteins
        2. Hacking Your DNA Data Store with Drugs
        3. Cancer
        4. Replication
        5. Cracking the Code
        6. DNA As Digital Storage
        7. Evolution As an Algorithm
      2. DNA As a Data Source
        1. A Quantum Leap
        2. "My God, It's Full of Bases…"
      3. Fighting the Data Deluge
        1. The Sanger Institute's Sequencing Platform
          1. Project management
        2. Flexible Data Capture
        3. Instrument and Data Management
      4. The Future of DNA
        1. How to Become a Genetic Hacker
        2. Next Next-Gen
        3. The Era of Big Data
      5. Acknowledgments
    19. 16. Beautifying Data in the Real World
      1. The Problem with Real Data
      2. Providing the Raw Data Back to the Notebook
      3. Validating Crowdsourced Data
      4. Representing the Data Online
        1. Unique Identifiers for Chemical Entities
        2. Open Data and Accessible Services Enable a Wide Range of Visualization and Analysis Options
        3. Integrating Data with a Central Aggregation Service
        4. Enabling Data Integration via Unique Identifiers and Self-Describing Data Formats
      5. Closing the Loop: Visualizations to Suggest New Experiments
      6. Building a Data Web from Open Data and Free Services
      7. Acknowledgments
      8. References
    20. 17. Superficial Data Analysis: Exploring Millions of Social Stereotypes
      1. Introduction
      2. Preprocessing the Data
      3. Exploring the Data
      4. Age, Attractiveness, and Gender
      5. Looking at Tags
      6. Which Words Are Gendered?
      7. Clustering
      8. Conclusion
      9. Acknowledgments
      10. References
    21. 18. Bay Area Blues: The Effect of the Housing Crisis
      1. Introduction
      2. How Did We Get the Data?
      3. Geocoding
      4. Data Checking
      5. Analysis
      6. The Influence of Inflation
      7. The Rich Get Richer and the Poor Get Poorer
      8. Geographic Differences
      9. Census Information
      10. Exploring San Francisco
      11. Conclusion
      12. References
    22. 19. Beautiful Political Data
      1. Example 1: Redistricting and Partisan Bias
      2. Example 2: Time Series of Estimates
      3. Example 3: Age and Voting
      4. Example 4: Public Opinion and Senate Voting on Supreme Court Nominees
      5. Example 5: Localized Partisanship in Pennsylvania
      6. Conclusion
      7. References
    23. 20. Connecting Data
      1. What Public Data Is There, Really?
      2. The Possibilities of Connected Data
      3. Within Companies
      4. Impediments to Connecting Data
        1. The Representation Problem
        2. Shared Nouns and Shared Verbs
        3. The Same Thing with Different Names
        4. Different Things with the Same Name
      5. Possible Solutions
        1. Matching on Multiple Fields
        2. Collective Reconciliation
      6. Conclusion
    24. A. Contributors
    25. Index
    26. About the Authors
    27. COLOPHON