Beautiful Data

Cover of Beautiful Data by Jeff Hammerbacher... Published by O'Reilly Media, Inc.

Chapter 1. Seeing Your Life in Data

Nathan Yau

In the not-too-distant past, the Web was about sharing, broadcasting, and distribution. But the tide is turning: the Web is moving toward the individual. Applications spring up every month that let people track, monitor, and analyze their habits and behaviors in hopes of gaining a better understanding of themselves and their surroundings. People can track eating habits, exercise, time spent online, sexual activity, monthly cycles, sleep, mood, and finances online. If you are interested in a certain aspect of your life, chances are that an application exists to track it.

Personal data collection is of course nothing new. In the 1930s, Mass Observation, a social research group in Britain, collected data on various aspects of everyday life (such as beards and eyebrows, shouts and gestures of motorists, and behavior of people at war memorials) to gain a better understanding of the country. However, data collection methods have improved since the 1930s. Collection is no longer limited to a pencil-and-paper notepad or a manual counter. Data can be collected automatically with mobile phones and handheld computers, so that constant flows of data and information upload to servers, databases, and so-called data warehouses at all times of the day.

With these advances in data collection technologies, the data streams have also developed into something much heftier than the tally counts reported by Mass Observation participants. Data can update in real time, and as a result, people want up-to-date information.

It is not enough to simply supply people with gigabytes of data, though. Not everyone is a statistician or computer scientist, and not everyone wants to sift through large data sets. This is a challenge that we face frequently with personal data collection.

While the types of data collection and data returned might have changed over the years, individuals' needs have not. That is to say that individuals who collect data about themselves and their surroundings still do so to gain a better understanding of the information that lies within the flowing data. Most of the time we are not after the numbers themselves; we are interested in what the numbers mean. It is a subtle difference but an important one. This need calls for systems that can handle personal data streams, process them efficiently and accurately, and dispense information to nonprofessionals in a way that is understandable and useful. We want something that is more than a spreadsheet of numbers. We want the story in the data.

Constructing such a system requires careful design in both analysis and aesthetics. This was important when we implemented the Personal Environmental Impact Report (PEIR), a tool that allows people to see how they affect the environment and how the environment affects them on a micro-level; and your.flowingdata (YFD), an in-development project that enables users to collect data about themselves via Twitter, a microblogging service.

For PEIR, I am the frontend developer, and I mostly work on the user interface and data visualization. As for YFD, I am the only person who works on it, so my responsibilities are a bit different, but my focus is still on the visualization side of things. Although PEIR and YFD are fairly different in data type, collection, and processing, their goals are similar. PEIR and YFD are built to provide information to the individual. Neither is meant as an endpoint. Rather, they are meant to spur curiosity in how everyday decisions play a big role in how we live and to start conversations on personal data. After a brief background on PEIR and YFD, I discuss personal data collection, storage, and analysis with this idea in mind. I then go into depth on the design process behind PEIR and YFD data visualizations, which can be generalized to personal data visualization as a whole. Ultimately, we want to show individuals the beauty in their personal data.

Personal Environmental Impact Report (PEIR)

PEIR is developed by the Center for Embedded Networked Sensing at the University of California at Los Angeles, or more specifically, the Urban Sensing group. We focus on using everyday mobile technologies (e.g., cell phones) to collect data about our surroundings and ourselves so that people can gain a better understanding of how they interact with what is around them. For example, DietSense is an online service that allows people to self-monitor their food choices and further request comments from dietary specialists; Family Dynamics helps families and life coaches document key features of a family's daily interactions, such as colocation and family meals; and Walkability helps residents and pedestrian advocates make observations and voice their concerns about neighborhood walkability and connections to public transit.[1] All of these projects let people get involved in their communities with just their mobile phones. We use a phone's built-in sensors, such as its camera, GPS, and accelerometer, to collect data, which we use to provide information.

PEIR applies similar principles. A person downloads a small piece of software called Campaignr onto his phone, and it runs in the background. As he goes about his daily activities—jogging around the track, driving to and from work, or making a trip to the grocery store, for example—the phone uploads GPS data to PEIR's central servers every two minutes. This includes latitude, longitude, altitude, velocity, and time. We use this data to estimate an individual's impact on and exposure to the environment. Environmental pollution sensors are not required. Instead, we use what is already available on many mobile phones—GPS—and then pass this data with context, such as weather, into established environmental models. Finally, we visualize the environmental impact and exposure data. The challenge at this stage is to communicate meaning in data that is unfamiliar to most. What does it mean to emit 1,000 kilograms of carbon in a week? Is that a lot or is that a little? We have to keep the user and purpose in mind, as they drive the system design from the visualization down to the data collection and storage.
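To make the shape of this data concrete, here is a minimal Python sketch of one GPS upload and of how consecutive uploads could be turned into a distance traveled. It is an illustration only, not PEIR's actual code: the `GpsSample` class, its field names and units, and the `haversine_km` and `trip_distance_km` helpers are all assumptions made for this example, and the great-circle distance stands in for whatever inputs the established environmental models really take.

```python
from dataclasses import dataclass
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


@dataclass(frozen=True)
class GpsSample:
    """One upload from the phone. Field names and units are hypothetical."""
    time_s: float        # seconds since an arbitrary epoch
    latitude: float      # degrees
    longitude: float     # degrees
    altitude_m: float    # meters
    velocity_mps: float  # meters per second


def haversine_km(a: GpsSample, b: GpsSample) -> float:
    """Great-circle distance between two samples, ignoring altitude."""
    dlat = radians(b.latitude - a.latitude)
    dlon = radians(b.longitude - a.longitude)
    h = (sin(dlat / 2) ** 2
         + cos(radians(a.latitude)) * cos(radians(b.latitude)) * sin(dlon / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * asin(sqrt(h))


def trip_distance_km(samples: list[GpsSample]) -> float:
    """Sum the distances between time-ordered consecutive samples."""
    ordered = sorted(samples, key=lambda s: s.time_s)
    return sum(haversine_km(p, q) for p, q in zip(ordered, ordered[1:]))
```

A trace of samples taken two minutes apart would be passed to `trip_distance_km` as a list. Turning that distance into an impact or exposure estimate would still need the context described above, such as weather, and a real environmental model, which this sketch deliberately leaves out.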

[1] CENS Urban Sensing,
