Complex Queries

As web applications become more complex and interesting, they need to retrieve and combine information from the database in new and different ways. Next, we examine how to support those queries at a massive scale.

The Challenge

Our system is optimized for queries that touch one or just a few records. In particular, we can look up records by primary key; once we know Alice's username, it is straightforward to determine which partition contains her profile record and read it while loading her page. Also, our system can store data as hash-partitioned or range-partitioned tables. For range-partitioned tables, we can conduct range scans over ordered ranges of primary keys. For example, we might store Alice's friends list by having one record per connection, where the primary key of each connection is the pair of user IDs for Alice and the friend (Table 4-7).

Table 4-7. Friends table

User1    User2
Alice    Bob
Alice    Charles
Alice    Dave

In a range-partitioned table, all of the records prefixed with "Alice" will be clustered, and a short-range scan will be able to pick them up.
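To make that access pattern concrete, here is a minimal runnable sketch in Python. The sorted in-memory list and the prefix_scan helper stand in for the ordered table and its scan operation; they are illustrative assumptions, not the PNUTS API:

    import bisect

    # Toy model of a range-partitioned (ordered) table: records are kept
    # sorted by composite primary key (User1, User2), so all of Alice's
    # friend links sit next to each other. Illustrates the access pattern
    # only; this is not the real system's interface.
    friends = sorted([
        ("Alice", "Bob"),
        ("Alice", "Charles"),
        ("Alice", "Dave"),
        ("Bob", "Zach"),
    ])

    def prefix_scan(table, user):
        """Return every record whose key is prefixed with the given user."""
        lo = bisect.bisect_left(table, (user,))
        hi = bisect.bisect_right(table, (user, "\uffff"))
        return table[lo:hi]

    print(prefix_scan(friends, "Alice"))
    # [('Alice', 'Bob'), ('Alice', 'Charles'), ('Alice', 'Dave')]

Because the matching records are physically adjacent, the scan touches one partition (or a few neighboring ones) rather than the whole table.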

Now imagine that we want to add another feature to our social network site. Users can post photos and then comment on one another's photos. Alice might comment on Bob's photo, Charles's photo, and Dave's photo. When we display a photo, we want to show the set of comments associated with that photo. We also want to show Alice the set of comments she has made on other people's photos. We specify the primary key of the comments table as (PhotoID,CommentID) and store it as an ordered YDOT table (Table 4-8), so that all comments for the same photo are clustered and can be retrieved by a range query.

Table 4-8. Photo comments table

PhotoID     CommentID   Comment       Commenter
Photo123    18          Cool          Mary
Photo123    22          Pretty        Alice
Photo123    29          Interesting   Charles

How can we collect the set of comments that Alice has made? We have to perform a join between Alice's profile record (which contains her username as a key) and the comment records (which have Alice's username as a foreign key). Because of our scale-out architecture, data is partitioned across many servers, so computing the join can require accessing many servers. This expensive operation drives up the latency of requests, both because multiple servers must be contacted and because a single query generates a great deal of server load (which slows down other requests).

Another type of query that can be expensive to compute in a scale-out system is group-by-aggregate queries. Imagine that users specify hobbies, and we want to count the number of users who have each hobby so that we can show Alice which hobbies are most popular. Such a query requires scanning all of the data and maintaining counts. The table scan will place prohibitive load on the system and certainly cannot be done synchronously, as Alice's page will take forever to load.
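The following toy sketch shows why the synchronous version is untenable; the small in-memory list stands in for a user table spread across many servers, and the attribute names are illustrative:

    from collections import Counter

    # A handful of records standing in for a table partitioned across
    # many servers.
    all_user_records = [
        {"username": "Alice", "hobbies": ["hiking", "photography"]},
        {"username": "Bob",   "hobbies": ["hiking"]},
        {"username": "Mary",  "hobbies": ["chess"]},
    ]

    # The synchronous version: every record in every partition must be
    # scanned before Alice's page can render -- prohibitive at scale.
    hobby_counts = Counter(
        hobby
        for record in all_user_records
        for hobby in record.get("hobbies", [])
    )
    print(hobby_counts.most_common(2))  # [('hiking', 2), ('photography', 1)]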

These examples show that while point lookups and range scans can be executed quickly, more expensive join and aggregation queries cannot be executed synchronously.

Our Approach

Our key principle for handling expensive operations is to do them asynchronously, but expensive queries cannot really be handled this way; we do not want to make Alice come back repeatedly to check whether the asynchronous query collecting all of her comments has completed.

Materialized views (Agrawal et al. 2009) can, however, be maintained asynchronously, and when Alice logs in she can quickly (and synchronously) query the view.[6] Although an asynchronously maintained view can be stale compared to the base data, the application already must be built to cope with stale replicas, so dealing with stale view data is usually acceptable. In fact, we treat a materialized view as a special kind of replica that both replicates and transforms data. By using the same mechanism that updates replicas to also update views, we ensure that views have similar reliability and consistency guarantees as replicated base data, without having to design and implement a second mechanism.
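A minimal sketch of the "view as a special replica" idea, assuming a single update stream feeds replicas and views alike; the record layout and the re-keying transform are illustrative, not the production code:

    # The same update refreshes full replicas and the materialized view;
    # the only difference is that the view transforms each record (here,
    # re-keying a comment by its commenter).
    replicas = [dict(), dict()]   # full copies, keyed like the base table
    comments_by_user = dict()     # view: re-keyed by (commenter, photo, comment)

    def apply_update(record):
        base_key = (record["photo_id"], record["comment_id"])
        for replica in replicas:              # plain replication
            replica[base_key] = record
        view_key = (record["commenter"], record["photo_id"],
                    record["comment_id"])
        comments_by_user[view_key] = record   # replicate + transform

    apply_update({"photo_id": "Photo123", "comment_id": 22,
                  "comment": "Pretty", "commenter": "Alice"})

Routing view updates through the replication path is what gives views the same reliability and staleness characteristics as ordinary replicas.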

Even though view maintenance is done in the background, we still want to make it cheap. If view maintenance takes too many system resources, it will either disrupt synchronous read and write requests (adding latency to every query), or we will have to throttle it to run slowly, at which point the view will be so stale as to possibly be unusable. Thus, we have to find ways to make view maintenance efficient. Consider the earlier example where we want to show Alice all of the comments she has made on other people's photos. We will create a materialized view where comment data is reorganized to be clustered by the foreign key (username of the commenter) rather than the primary key. Then, all of the comments made by Alice will be clustered together. We can also place Alice's profile record in the view, keyed by her username, so that her profile and her comments are clustered. Computing the key/foreign key join is as easy as scanning the set of view records prefixed with "Alice", and then joining them. The result is shown in Table 4-9.

Table 4-9. Co-clustering joining profile and comment records

Alice    West        32    Alice Smith    ← Profile record
Alice    Photo123    22    Pretty         ← Comment records
Alice    Photo203    43    Nice
Alice    Photo418    33    OK

Note that we do not prejoin the profile and comment records in the view. By merely co-locating records that would join, we make join maintenance cheap: whenever there is an update to a base record, we only have to update a single view record, even if that view record would join with multiple other records.
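Here is a small, self-contained sketch of the read path over such a view, with a sorted list modeling the ordered view table of Table 4-9; the record layout is an assumption for illustration. One prefix scan returns Alice's profile record and her comment records, and the join is assembled in memory:

    # Co-clustered view: profile and comment records share the prefix
    # "Alice", so a single short scan retrieves everything needed.
    view = sorted([
        ("Alice", "comment:Photo123:22", {"comment": "Pretty"}),
        ("Alice", "comment:Photo203:43", {"comment": "Nice"}),
        ("Alice", "comment:Photo418:33", {"comment": "OK"}),
        ("Alice", "profile",             {"location": "West", "age": 32}),
        ("Bob",   "profile",             {"location": "East"}),
    ])

    rows = [r for r in view if r[0] == "Alice"]          # one range scan
    profile = next(r[2] for r in rows if r[1] == "profile")
    comments = [r[2] for r in rows if r[1].startswith("comment:")]
    print(profile, len(comments))   # {'location': 'West', 'age': 32} 3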

How can we store profile and comment records in the same table? In a traditional database it would be difficult, since the two records have different schemas. However, a core feature of PNUTS is its ability to represent flexible schemas. Different records in the same table can have different sets of attributes. This feature is very useful in web applications since web data is often sparse; a database of items for sale will have different attributes (e.g., color, weight, RAM, flavor) depending on what kind of item it is. It turns out that flexible schemas are also key to implementing materialized join views so that we can colocate joining records from different tables.
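As a toy illustration of this flexibility (Python dicts standing in for records, attribute names invented for the example), three records in one table can carry entirely different attribute sets:

    # Flexible schema: each record stores only the attributes that apply
    # to it; no column is reserved for "flavor" on the laptop or "ram_gb"
    # on the tea, so sparse web data stays sparse.
    items = {
        "item1": {"kind": "laptop",  "color": "silver",  "ram_gb": 16},
        "item2": {"kind": "tea",     "flavor": "jasmine", "weight_g": 100},
        "item3": {"kind": "t-shirt", "color": "blue",    "size": "M"},
    }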

The asynchronous view approach is useful for helping to answer other kinds of queries as well. A group-by-aggregation query can be effectively answered by a materialized view that has pregrouped, and maybe even preaggregated, the data. There are even "simple" queries, such as a selection over a nonprimary key attribute, that can be most effectively answered by a materialized view. Consider a query for users who live in Sunnyvale, California. Since our user table is keyed by username, this query normally requires an expensive table scan. However, we can use the materialized view mechanism to build a secondary index over the "location" field of the table, store the index in an ordered YDOT table, and then conduct a range scan over the "Sunnyvale, California" index records to answer our query (Table 4-10).

Table 4-10. Location index

Location         Username
Sunnyvale, CA    Alice
Sunnyvale, CA    Mary
Sunnyvale, CA    Steve
Sunnyvale, CA    Zach
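A runnable toy version of this index-as-view, again with a sorted list standing in for the ordered YDOT table; the maintenance function models the asynchronous view update, and all names here are illustrative:

    import bisect

    # Index records keyed by (location, username), kept in key order so
    # the Sunnyvale query becomes a prefix range scan rather than a full
    # table scan.
    location_index = sorted([
        ("Sunnyvale, CA", "Alice"),
        ("Sunnyvale, CA", "Mary"),
        ("Brooklyn, NY",  "Zoe"),
    ])

    def on_profile_update(username, old_location, new_location):
        """Asynchronous view maintenance: move the user's index record."""
        if old_location is not None:
            location_index.remove((old_location, username))
        bisect.insort(location_index, (new_location, username))

    def users_in(location):
        lo = bisect.bisect_left(location_index, (location,))
        hi = bisect.bisect_right(location_index, (location, "\uffff"))
        return [user for _, user in location_index[lo:hi]]

    on_profile_update("Steve", None, "Sunnyvale, CA")
    print(users_in("Sunnyvale, CA"))   # ['Alice', 'Mary', 'Steve']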

As with materialized views in other systems, we can create them effectively only if we know in advance what kinds of queries to expect. Luckily, in web-serving workloads, the queries are usually templates known in advance with specific parameters (such as the location or username) bound at runtime. As such, application developers know in advance which queries are complex enough to require materializing a view. To ask ad hoc queries over data stored in PNUTS, developers have to use our plug-ins to pull data out of our system into a compute grid running Hadoop, the open source implementation of MapReduce.

Once we have a few different mechanisms for handling complex queries, it will be useful to implement a query planner to help execute queries effectively. A planner helps remove some of the burden from the application developer, who can write declarative queries without worrying too much about how they will be executed. However, an effective query planner at our scale will require sophisticated statistics collection, load monitoring, network monitoring, and a variety of other mechanisms to make sure the planner has enough information about all the possible bottlenecks in the system to make the most effective query plan.



[6] Materialized views are not currently in the production version of the system.
