Beautiful Data

Edited by Toby Segaran and Jeff Hammerbacher. Published by O'Reilly Media, Inc.

Comparison with Other Systems

When we began thinking about PNUTS, two other massive-scale database systems, from Google and Amazon, had recently been announced, and a third, from Microsoft, would later be made public. As we developed our design, we examined these systems carefully to see whether some or all of their ideas could be useful to us. Some of their ideas influenced us, but we decided to build a new system with an architecture that differs in many ways. We now look at each of these systems and discuss why we decided to depart from their design principles.

Google's BigTable

BigTable (Chang et al. 2006) is a system designed to support many of Google's web applications. The system is based on horizontally partitioning a "big table" into many smaller tablets and scattering those tablets across servers. This basic approach to scalability, as well as features such as flexible schemas and ordered storage, is similar to the approach we took. However, there were several design decisions where we diverged from BigTable.

The first major difference was in our approach to replication. BigTable is built on top of the Google File System (GFS; Ghemawat et al. 2003), and GFS handles the replication of data by synchronously updating three copies of the data on three different servers. This approach works well in a single colo, where interserver latencies are low. However, synchronously updating servers in three different, widely dispersed colos is too expensive; Alice might wait a long time for her status to be updated, especially if her friends access a datacenter with a poor connection to the Internet backbone. To support cross-colo replication, we developed the timeline consistency model, and the associated mechanisms for mastership, load balancing, and failure handling.
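To make the timeline consistency idea concrete, here is a minimal Python sketch. All names (`Colo`, `TimelineTable`, `write`, `sync`) are invented for illustration and are not the PNUTS API; the point is only that routing every write for a record through that record's master colo serializes its updates into a single version timeline, while propagation to the other colos happens asynchronously:

```python
class Colo:
    """One replica site (datacenter) holding a local copy of records."""
    def __init__(self, name):
        self.name = name
        self.store = {}  # key -> (version, value)

class TimelineTable:
    """Hypothetical sketch of timeline consistency via per-record mastership."""
    def __init__(self, colos):
        self.colos = {c.name: c for c in colos}
        self.master_of = {}  # key -> name of the colo mastering that record
        self.pending = []    # queued async replication: (key, version, value, dest)

    def write(self, colo_name, key, value):
        # Every write to a record is routed to that record's master colo,
        # which assigns the next version number -- this is what serializes
        # the record's updates into one timeline.
        master = self.master_of.setdefault(key, colo_name)
        mcolo = self.colos[master]
        version = mcolo.store.get(key, (0, None))[0] + 1
        mcolo.store[key] = (version, value)
        for name in self.colos:  # propagate asynchronously, not now
            if name != master:
                self.pending.append((key, version, value, name))

    def sync(self):
        # Deliver queued updates; a replica only moves forward along the
        # timeline, so it always holds *some* consistent version, maybe stale.
        for key, version, value, name in self.pending:
            if version > self.colos[name].store.get(key, (0, None))[0]:
                self.colos[name].store[key] = (version, value)
        self.pending = []
```

Before `sync` runs, a read in a remote colo still sees the old version; it is stale, but never out of order.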

We also decided not to enforce the separation between database server and filesystem that is enforced between BigTable and GFS. GFS was originally designed and optimized for scan-oriented workloads of large files (for example, for MapReduce). BigTable uses GFS by keeping a version history of each record, compacted into a file format called SSTables to save space. This means that on record reads and updates, the data must be decoded and encoded into this compressed format. Moreover, the scan-oriented nature of GFS makes BigTable useful for column-oriented scans (such as "retrieve all the locations of all the users"). In contrast, our primary workload is to read or update a single version of a single record or a small range of records. Thus, we store data on disk as complete records organized into a B-tree. This approach is optimized for quickly locating, and updating in-place, individual records identified by primary key.
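The contrast can be sketched with Python's standard `bisect` module. This is not PNUTS code; it simply illustrates why keeping complete records in primary-key order suits a point-read, point-update, small-range workload: lookups and in-place updates touch one slot, and a small range scan is a contiguous slice.

```python
import bisect

class OrderedRecordStore:
    """Illustrative sketch: whole records kept in primary-key order."""
    def __init__(self):
        self.keys = []     # sorted primary keys
        self.records = []  # complete records, parallel to keys

    def upsert(self, key, record):
        i = bisect.bisect_left(self.keys, key)
        if i < len(self.keys) and self.keys[i] == key:
            self.records[i] = record  # update the record in place
        else:
            self.keys.insert(i, key)
            self.records.insert(i, record)

    def get(self, key):
        i = bisect.bisect_left(self.keys, key)
        if i < len(self.keys) and self.keys[i] == key:
            return self.records[i]
        return None

    def range(self, lo, hi):
        # A small range scan is just a contiguous run of keys.
        i = bisect.bisect_left(self.keys, lo)
        j = bisect.bisect_right(self.keys, hi)
        return list(zip(self.keys[i:j], self.records[i:j]))
```

No per-read decoding of a compacted version history is needed: the current record is stored whole, ready to be returned or overwritten.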

PNUTS differs from BigTable in other ways as well. For example, we support multiple tables for an application, instead of one large table, and we support hash as well as ordered tables. A follow-on to BigTable, called MegaStore (Furman et al. 2008), adds transactions, indexes, and a richer API, but still follows the basic architectural tenets of BigTable.

Amazon's Dynamo

Dynamo (DeCandia et al. 2007) is one of the systems Amazon has recently built for large-scale data workloads, and it is the one most closely aligned with our goal of a highly available, massive-scale structured record store. (Records in Dynamo are referred to as objects.) Dynamo provides write availability by allowing applications to write to any replica, lazily propagating those updates to the other replicas via a gossip protocol (explained next).

The decision to lazily propagate updates to deal with slow and failure-prone networks matches our own; however, our mechanism for replication is quite different. In a gossip protocol, an update is propagated to randomly chosen replicas, which in turn propagate it to other randomly chosen replicas. This randomness is essential to the probabilistic guarantees offered by the protocol, which ensures that most replicas are updated relatively quickly. In our setting, however, randomness is decidedly suboptimal. Consider an update Alice makes to her status in a colo on the west coast of the U.S. Under gossip, this update may be randomly propagated to a replica in Singapore, which then randomly propagates the update to a replica in Texas, which then randomly propagates the update to a replica in Tokyo. The update has crossed the Pacific Ocean three times, whereas a more deterministic approach could conserve scarce trans-Pacific backbone bandwidth and transfer it (and other updates) only once. Moreover, gossip requires the replica propagating the update to know which servers in which other colos have replicas, which makes it hard to move data between servers for load balancing or recovery.
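A toy simulation makes the bandwidth argument concrete. The colo names, the chain-style gossip (each replica hands the update to one random replica that has not yet seen it), and the fixed fan-out schedule are all illustrative assumptions, not the real protocols:

```python
import random

COLOS = ["us-west", "texas", "singapore", "tokyo"]  # invented colo names
US = {"us-west", "texas"}

def pacific_hops(path):
    # Count consecutive hops that cross between the US and Asia colos.
    return sum((a in US) != (b in US) for a, b in zip(path, path[1:]))

def gossip_path(origin, rng):
    # Deliberately simplified gossip: the update is handed along a chain
    # of randomly chosen replicas until every colo has seen it.
    seen, path = {origin}, [origin]
    while len(seen) < len(COLOS):
        nxt = rng.choice([c for c in COLOS if c not in seen])
        path.append(nxt)
        seen.add(nxt)
    return path

def deterministic_path(origin):
    # One of many possible fixed schedules: visit the remaining colos in a
    # fixed order, so from a US colo the update crosses the Pacific once.
    return [origin] + [c for c in COLOS if c != origin]
```

Starting on the US west coast, the fixed schedule crosses the Pacific once, while the random chain can cross it up to three times.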

Another key difference with Dynamo is the consistency protocol. Gossip lends itself to an eventual consistency model: all data replicas will eventually match, but in the interim, while updates are propagating, replicas can be inconsistent. In particular, replicas can have a state that is later deemed "invalid." Consider, for example, Alice, who updates her status from "Sleeping" to "Busy" and then updates her location from "Home" to "Work." Because of the order of updates, the only valid states of the record (from Alice's perspective, which is what matters) are (Sleeping,Home), (Busy,Home), and (Busy,Work). Under eventual consistency, if the two updates are made at different replicas, some replicas might receive the update to "Work" first, meaning that those replicas temporarily show a state of (Sleeping,Work). If Alice's boss sees this status, Alice might be in trouble! Applications that rely on multiple updates being applied to a record in the proper order need a stronger guarantee than eventual consistency. Although our timeline consistency model allows replicas to be stale, even stale replicas have a consistent version that reflects the proper update ordering.
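The Alice example can be written out directly. The replica model below is a deliberate simplification: an "eventually consistent" replica applies updates in whatever order they arrive, while a "timeline" replica only ever holds a prefix of the version-ordered history:

```python
# Alice's ordered update history: version 1, then version 2.
updates = [
    (1, {"status": "Busy"}),    # Sleeping -> Busy
    (2, {"location": "Work"}),  # Home -> Work
]
INITIAL = {"status": "Sleeping", "location": "Home"}

def eventual_replica(arrival_order):
    # Applies updates in arrival order; returns every state it exposed.
    state = dict(INITIAL)
    states = [dict(state)]
    for _version, change in arrival_order:
        state.update(change)
        states.append(dict(state))
    return states

def timeline_replica(prefix_len):
    # A (possibly stale) timeline replica holds a prefix of the history.
    state = dict(INITIAL)
    for _version, change in updates[:prefix_len]:
        state.update(change)
    return state
```

When the updates arrive reversed, the eventual replica briefly exposes (Sleeping, Work), a state Alice's record never passed through; every timeline prefix, by contrast, is one of the three valid states.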

There are various other differences with Dynamo: Dynamo provides only a hash table and not an ordered table, and we have opted for a more flexible mapping of data to servers in order to improve load balancing and recovery (especially for ordered tables, which might have unpredictable hot spots). Amazon also provides other storage systems besides Dynamo: S3 for storing blobs of data, and SimpleDB for executing queries over structured, indexed data. Although SimpleDB provides a richer API, it requires that the application come up with a partitioning of the data such that each partition is within a fixed size limit. Thus, data growth within a partition is restricted.

Microsoft Azure SDS

Microsoft has built a massive-scale version of SQL Server, called SQL Data Services (SDS), as part of its Azure services offering. Again, the focus is on scalability through horizontal partitioning. A nice feature of SDS is the enhanced query capability made available by extensively indexing data and providing SQL Server as the query-processing engine. However, SDS achieves this query expressiveness by rigidly enforcing partitioning: applications create their own partitions and cannot easily repartition data. Thus, although you can ask expressive queries over a partition, if a partition grows or becomes hot, the system cannot easily or automatically relieve the hotspot by splitting the partition. Our decision to hide partitioning behind the abstraction of a table allows us to make and change partitioning decisions for load and recovery reasons. While this means that our query model is less expressive (since we do not support complex queries that cross partitions), we are continuing to look at ways to enhance our query functionality (for example, through views, as described earlier).
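The split-behind-the-abstraction point can be sketched as follows. The `Table` class and its tablet bookkeeping are hypothetical, not SDS or PNUTS code; the point is that because the application only sees `put` and `get` on a table, the system is free to split an oversized or hot tablet without any application involvement:

```python
class Table:
    """Sketch of a table that hides its own partitioning into tablets."""
    def __init__(self, max_tablet_size=2):
        self.tablets = [{}]   # dicts covering consecutive key ranges
        self.boundaries = []  # split keys between adjacent tablets
        self.max = max_tablet_size

    def _tablet_index(self, key):
        i = 0
        while i < len(self.boundaries) and key >= self.boundaries[i]:
            i += 1
        return i

    def put(self, key, value):
        i = self._tablet_index(key)
        self.tablets[i][key] = value
        if len(self.tablets[i]) > self.max:
            self._split(i)  # automatic: the application never sees this

    def _split(self, i):
        # Split an oversized tablet at its median key.
        keys = sorted(self.tablets[i])
        mid = keys[len(keys) // 2]
        lo = {k: v for k, v in self.tablets[i].items() if k < mid}
        hi = {k: v for k, v in self.tablets[i].items() if k >= mid}
        self.tablets[i:i + 1] = [lo, hi]
        self.boundaries.insert(i, mid)

    def get(self, key):
        return self.tablets[self._tablet_index(key)].get(key)
```

An application that had to create its own fixed partitions, as in SDS or SimpleDB, would have no equivalent of `_split` available to it.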

Another difference with SDS is that PNUTS has geographic replication built in as a first-class feature of the system. In at least the first release of SDS, the workload is expected to live within a single datacenter, and remote copies are only used in case of a total failure of the primary replica. We want Alice's friends in Singapore, Berlin, and Rio de Janeiro to have their own local, first-class copies of Alice's updates.

Other Related Systems

A variety of other systems have been built by companies that have scalability and flexibility needs similar to ours. Facebook has built Cassandra (Lakshman et al. 2008), a peer-to-peer data store with a BigTable-like data model built on a Dynamo-like infrastructure. Consequently, Cassandra provides only eventual consistency.

Sharded databases (such as the MySQL sharding approach used by Flickr [Pattishall] and Facebook [Sobel 2008]) provide scalability by partitioning the data across many servers; however, sharding systems do not typically provide as much flexibility for scaling or globally replicating data as we desire. Data must be prepartitioned, just like in SimpleDB. Also, only one of the replicas can be the master and accept writes. In PNUTS, all replicas in different data centers can accept writes (although for different records).

Other Systems at Yahoo!

PNUTS is one of several cloud systems being built at Yahoo!. Two other components of the cloud are also targeted at data management, although they focus on a different set of problems than PNUTS. Hadoop, an open source implementation of the MapReduce framework (Dean and Ghemawat 2007), provides massively parallel analytical processing over large datafiles. Hadoop includes a filesystem, HDFS, which is optimized for scans, since MapReduce jobs are primarily scan-oriented workloads. In contrast, PNUTS is focused on reads and writes of individual records. Another system is MObStor, which is designed to store and serve massive objects such as images or video. MObStor's goal is to provide low-latency retrieval and inexpensive storage for objects that do not change. Since many applications need a combination of record storage, data analysis, and object retrieval, we are working on ways to seamlessly integrate the three systems. A survey of our efforts to integrate these systems into a comprehensive cloud appears in Cooper et al. (2009).
