You are previewing Beautiful Data.

Beautiful Data

Cover of Beautiful Data by Jeff Hammerbacher... Published by O'Reilly Media, Inc.
  1. Beautiful Data
    1. SPECIAL OFFER: Upgrade this ebook with O’Reilly
    2. A Note Regarding Supplemental Files
    3. Preface
      1. How This Book Is Organized
      2. Conventions Used in This Book
      3. Using Code Examples
      4. How to Contact Us
      5. Safari® Books Online
    4. 1. Seeing Your Life in Data
      1. Personal Environmental Impact Report (PEIR)
      2. your.flowingdata (YFD)
      3. Personal Data Collection
      4. Data Storage
      5. Data Processing
      6. Data Visualization
      7. The Point
      8. How to Participate
    5. 2. The Beautiful People: Keeping Users in Mind When Designing Data Collection Methods
      1. Introduction: User Empathy Is the New Black
      2. The Project: Surveying Customers About a New Luxury Product
      3. Specific Challenges to Data Collection
      4. Designing Our Solution
      5. Results and Reflection
    6. 3. Embedded Image Data Processing on Mars
      1. Abstract
      2. Introduction
      3. Some Background
      4. To Pack or Not to Pack
      5. The Three Tasks
      6. Slotting the Images
      7. Passing the Image: Communication Among the Three Tasks
      8. Getting the Picture: Image Download and Processing
      9. Image Compression
      10. Downlink, or, It's All Downhill from Here
      11. Conclusion
    7. 4. Cloud Storage Design in a PNUTShell
      1. Introduction
      2. Updating Data
      3. Complex Queries
      4. Comparison with Other Systems
      5. Conclusion
      6. Acknowledgments
      7. References
    8. 5. Information Platforms and the Rise of the Data Scientist
      1. Libraries and Brains
      2. Facebook Becomes Self-Aware
      3. A Business Intelligence System
      4. The Death and Rebirth of a Data Warehouse
      5. Beyond the Data Warehouse
      6. The Cheetah and the Elephant
      7. The Unreasonable Effectiveness of Data
      8. New Tools and Applied Research
      9. MAD Skills and Cosmos
      10. Information Platforms As Dataspaces
      11. The Data Scientist
      12. Conclusion
    9. 6. The Geographic Beauty of a Photographic Archive
      1. Beauty in Data: Geograph
      2. Visualization, Beauty, and Treemaps
      3. A Geographic Perspective on Geograph Term Use
      4. Beauty in Discovery
      5. Reflection and Conclusion
      6. Acknowledgments
      7. References
    10. 7. Data Finds Data
      1. Introduction
      2. The Benefits of Just-in-Time Discovery
      3. Corruption at the Roulette Wheel
      4. Enterprise Discoverability
      5. Federated Search Ain't All That
      6. Directories: Priceless
      7. Relevance: What Matters and to Whom?
      8. Components and Special Considerations
      9. Privacy Considerations
      10. Conclusion
    11. 8. Portable Data in Real Time
      1. Introduction
      2. The State of the Art
      3. Social Data Normalization
      4. Conclusion: Mediation via Gnip
    12. 9. Surfacing the Deep Web
      1. What Is the Deep Web?
      2. Alternatives to Offering Deep-Web Access
      3. Conclusion and Future Work
      4. References
    13. 10. Building Radiohead's House of Cards
      1. How It All Started
      2. The Data Capture Equipment
      3. The Advantages of Two Data Capture Systems
      4. The Data
      5. Capturing the Data, aka "The Shoot"
      6. Processing the Data
      7. Post-Processing the Data
      8. Launching the Video
      9. Conclusion
    14. 11. Visualizing Urban Data
      1. Introduction
      2. Background
      3. Cracking the Nut
      4. Making It Public
      5. Revisiting
      6. Conclusion
    15. 12. The Design of
      1. Visualization and Social Data Analysis
      2. Data
      3. Visualization
      4. Collaboration
      5. Voyagers and Voyeurs
      6. Conclusion
      7. References
    16. 13. What Data Doesn't Do
      1. When Doesn't Data Drive?
      2. Conclusion
      3. References
    17. 14. Natural Language Corpus Data
      1. Word Segmentation
      2. Secret Codes
      3. Spelling Correction
      4. Other Tasks
      5. Discussion and Conclusion
      6. Acknowledgments
    18. 15. Life in Data: The Story of DNA
      1. DNA As a Data Store
      2. DNA As a Data Source
      3. Fighting the Data Deluge
      4. The Future of DNA
      5. Acknowledgments
    19. 16. Beautifying Data in the Real World
      1. The Problem with Real Data
      2. Providing the Raw Data Back to the Notebook
      3. Validating Crowdsourced Data
      4. Representing the Data Online
      5. Closing the Loop: Visualizations to Suggest New Experiments
      6. Building a Data Web from Open Data and Free Services
      7. Acknowledgments
      8. References
    20. 17. Superficial Data Analysis: Exploring Millions of Social Stereotypes
      1. Introduction
      2. Preprocessing the Data
      3. Exploring the Data
      4. Age, Attractiveness, and Gender
      5. Looking at Tags
      6. Which Words Are Gendered?
      7. Clustering
      8. Conclusion
      9. Acknowledgments
      10. References
    21. 18. Bay Area Blues: The Effect of the Housing Crisis
      1. Introduction
      2. How Did We Get the Data?
      3. Geocoding
      4. Data Checking
      5. Analysis
      6. The Influence of Inflation
      7. The Rich Get Richer and the Poor Get Poorer
      8. Geographic Differences
      9. Census Information
      10. Exploring San Francisco
      11. Conclusion
      12. References
    22. 19. Beautiful Political Data
      1. Example 1: Redistricting and Partisan Bias
      2. Example 2: Time Series of Estimates
      3. Example 3: Age and Voting
      4. Example 4: Public Opinion and Senate Voting on Supreme Court Nominees
      5. Example 5: Localized Partisanship in Pennsylvania
      6. Conclusion
      7. References
    23. 20. Connecting Data
      1. What Public Data Is There, Really?
      2. The Possibilities of Connected Data
      3. Within Companies
      4. Impediments to Connecting Data
      5. Possible Solutions
      6. Conclusion
    24. A. Contributors
    25. Index
    26. About the Authors
    27. COLOPHON
    28. SPECIAL OFFER: Upgrade this ebook with O’Reilly

Chapter 4. Cloud Storage Design in a PNUTShell

Brian F. Cooper

Raghu Ramakrishnan

Utkarsh Srivastava


YAHOO! RUNS SOME OF THE WORLD'S MOST POPULAR WEBSITES, AND EVERY MONTH OVER HALF A BILLION people visit those sites. These websites are powered by database infrastructures that store user profiles, photos, restaurant reviews, blog posts, and a remarkable array of other kinds of data. Yahoo! has developed and deployed mature, stable database architectures to support its sites, and to provide low-latency access to data so that pages load quickly.

Unfortunately, these systems suffer from some important limitations. First, adding system capacity is often difficult, requiring months of planning and data reorganization, and impacting the quality of service experienced by applications during the transition. Some systems have a hard upper limit on the scale they can support, even if sufficient hardware were to be added. Second, many systems were designed a long time ago, with a single datacenter in mind. Since then, Yahoo! has grown to a global brand with a large user base spread all over the world. To provide these users with a good experience, we have to replicate data to be close to them so that their pages load quickly. Since the database systems did not provide global replication as a built-in feature, applications had to build it themselves, resulting in complex application logic and brittle infrastructure. Because of all the effort required to deploy a large-scale, geographically replicated database architecture, it was hard to quickly roll out new applications or new features of existing applications that depended on that architecture.

PNUTS is a system that aims to support Yahoo!'s websites and application platforms and address these limitations (Cooper et al. 2008). It is designed to be operated as a storage cloud that efficiently handles mixed read and write workloads from tenant applications and supports global data replication natively. Like many other distributed systems, PNUTS achieves high performance and scalability by horizontally partitioning the data across an array of storage servers. Complex analysis or decision-support workloads are not our focus. Our system makes two properties first-class features, baked in from the start:


Data is partitioned across servers, and adding capacity is as easy as adding new servers. The system smoothly transfers load to the new servers.


Data is automatically replicated around the world. Once the developer tells the system at which colos[3] to replicate the data, the system takes care of the details of making it happen, including the details of handling failures (of machines, links, and even entire colos).

We also set several other goals for the system. In particular, we want application developers to be able to focus on the logic of their application, not on the nuts and bolts of operating the database. So we decided to make the database hosted, and to provide a simple, clean API to allow a developer to store and access data without having to tune a large number of parameters. Because the system is to be hosted, we wanted to make it as self-maintainable as possible.

While all of these goals are important to us, building a database system that could both scale-out and globally replicate data was the most compelling and immediate value proposition for the company. And as we began to design the system, it became clear that this required us to rethink many well-understood and long-used mechanisms in database systems (Ramakrishnan and Gehrke 2002).

The key idea we use to achieve both scale-out and geo-replication is to carry out only simple, cheap operations synchronously, and to do all the expensive heavy lifting asynchronously in the background. For example, when a user in California is trying to tag a photo with a keyword, she definitely does not want to wait for the system to commit that tag to the Singapore replica of the tag database (the network latency from California to Singapore can be as high as a second). However, she still wants her friend in Singapore to be able to see the tag, so the Singapore replica must be updated asynchronously in the background, quickly (in seconds or less) and reliably.

As another example of how we leverage asynchrony, consider queries such as aggregations and joins that typically require examining data on many different servers. As we scale out, the probability that some of these servers are slow or down increases, thereby adversely affecting request latency. To remedy this problem, we can maintain materialized views that reorganize the base data so that (a predetermined set of) complex queries can be answered by accessing a single server. Similar to database replicas, updating each view synchronously would be prohibitively slow on writes. Hence, our approach is to update views asynchronously.

In the rest of this chapter, we explore the implications of focusing on scale-out and geo-replication as first-class features. We illustrate the main issues with an example, explain our basic approach, and discuss several issues and extensions. We then compare PNUTS with alternative approaches. Our discussion concentrates on the design philosophy, rather than the details of system architecture or implementation, and covers some features that are not in the current production version of the system in order to highlight the choices made in the overall approach.

[3] Colocation facility, or data center. Yahoo! operates a large number of these, spread across the world.

The best content for your career. Discover unlimited learning on demand for around $1/day.