Beautiful Visualization

Cover of Beautiful Visualization by Noah Iliinsky... Published by O'Reilly Media, Inc.

Chapter 4. Color: The Cinderella of Data Visualization

Michael Driscoll

Avoiding catastrophe becomes the first principle in bringing color to information: Above all, do no harm.

Edward Tufte, Envisioning Information (Graphics Press)

COLOR IS ONE OF THE MOST ABUSED AND NEGLECTED tools in data visualization: we abuse it when we make poor color choices, and we neglect it when we rely on poor software defaults. Yet despite its historically poor treatment at the hands of engineers and end users alike, if used wisely, color is unrivaled as a visualization tool.

Most of us would think twice before walking outside in fluorescent red Underoos®. If only we were as cautious in choosing colors for infographics! The difference is that few of us design our own clothes, while we must all be our own infographics tailors in order to get colors that fit our purposes (at least until good palettes—like ColorBrewer—become commonplace).

While obsessing about how to implement color on the Dataspora Labs PitchFX viewer, I began with a basic motivating question: why use color in data graphics? We'll consider that question next.

Why Use Color in Data Graphics?

For a simple dataset, a single color is sufficient (even preferable). For example, Figure 4-1 shows a scatterplot of 287 pitches thrown by Major League pitcher Oscar Villarreal in 2008. With just two dimensions of data to describe—the x and y locations in the strike zone—black and white is sufficient. In fact, this scatterplot is a perfectly lossless representation of the dataset (assuming no data points overlap perfectly).

Figure 4-1. Location of pitches indicated in an x/y plane

But what if we'd like to know more? For instance, what kinds of pitches (curveballs, fastballs) landed where? Or what was their speed? Visualizations occupy two dimensions, but the world they describe is rarely so confined.

The defining challenge of data visualization is projecting high-dimensional data onto a low-dimensional canvas. As a rule, one should never do the reverse (visualize more dimensions than already exist in the data).

Getting back to our pitching example, if we want to layer another dimension of data—pitch type—into our plot, we have several methods at our disposal:

  1. Plotting symbols. We can vary the glyphs that we use (circles, triangles, etc.).

  2. Small multiples. We can vary extra dimensions in space, creating a series of smaller plots.

  3. Color. We can color our data, encoding extra dimensions inside a color space.

Which technique you employ in a visualization should depend on the nature of the data and the medium of your canvas. I will describe these three by way of example.

1. Vary Your Plotting Symbols

In Figure 4-2, I've layered the categorical dimension of pitch type into our plot by using four different plotting symbols.

Figure 4-2. Location and pitch type indicated by plotting symbols

I consider this visualization an abject failure. There are two reasons why graphs like this one make our heads hurt: distinguishing glyphs demands extra attention (versus what academics call "preattentively processed" cues like color), and even after we've visually decoded the symbols, we must map those symbols to their semantic categories. (Admittedly, this can be mitigated with Chernoff faces or other iconic symbols, where the categorical mapping is self-evident.)

2. Use Small Multiples on a Canvas

While Edward Tufte has done much to promote the use of small multiples in information graphics, folding additional dimensions into a partitioned canvas has a distinguished pedigree. This technique has been employed everywhere from Galileo's sunspot illustrations to William Cleveland's trellis plots. And as Scott McCloud's unexpected tour de force on comics makes clear, panels of pictures possess a narrative power that a single, undivided canvas lacks.

In Figure 4-3, plots of the four types of pitches that Oscar throws are arranged horizontally. By reducing our plot sizes, we've given up some resolution in positional information. But in return, patterns that were invisible in our first plot and obscured in our second (by varied symbols) are now made clear (Oscar throws his fastballs low, but his sliders high).

Figure 4-3. Location and pitch type indicated by facets

Multiplying plots in space works especially well on printed media, which can display more than 10 times as many dots per square inch as a screen. Additional plots can be arranged in both columns and rows, with the result being a matrix of scatterplots (in R, see the splom function).
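As a rough illustration of small multiples in R, the sketch below uses the lattice package's xyplot with a conditioning variable to draw one panel per pitch type. The data frame, its column names, and the pitch-type labels are hypothetical stand-ins generated at random, not the chapter's PitchFX data.

```r
library(lattice)

## Hypothetical stand-in for the pitch data: x/y location plus pitch type.
set.seed(1)
pitchTypes <- c("fastball", "slider", "curveball", "changeup")
pitches <- data.frame(
    x    = rnorm(200),
    y    = rnorm(200, mean = 2.5, sd = 0.8),
    type = factor(sample(pitchTypes, 200, replace = TRUE), levels = pitchTypes)
)

## One panel per pitch type, arranged horizontally (4 columns, 1 row).
facets <- xyplot(y ~ x | type, data = pitches, layout = c(4, 1),
                 xlab = "horizontal location", ylab = "vertical location")
print(facets)
```

The formula y ~ x | type is what partitions the canvas: everything after the bar becomes a faceting dimension, so adding a second conditioning variable would extend the panels into rows as well as columns.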

3. Add Color to Your Data

In Figure 4-4, I've used color as a means of encoding a fourth dimension of our pitching data: the speed of pitches thrown. The palette I've chosen is a divergent palette that moves along one dimension (think of it as the "redness-blueness" dimension) in the Lab color space,[27] while maintaining a constant level of luminosity.

Figure 4-4. Location and pitch type, with pitch velocity indicated by a one-dimensional color palette

On the one hand, holding luminosity constant has advantages, because luminosity (similar to brightness) determines a color's visual impact. Bright colors pop, and dark colors recede. A color ramp that varies luminosity along with hue will highlight data points as an artifact of color choice.

On the other hand, luminosity, unlike hue, possesses an inherent order, making it suitable for mapping to quantitative (rather than categorical) dimensions of data.

Because I am going to use luminosity to encode yet another dimension later, I decided to use hue for encoding speed here; it suits our purposes well enough. I chose only seven gradations of color, so I'm downsampling (in a lossy way) our speed data. Segmentation of our color ramp into many more colors would make it difficult to distinguish them.
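The lossy downsampling described above can be sketched in a few lines of base R. This is only an approximation of the idea: the speeds are made-up values, and the ramp is built with hcl() (polar LUV) at constant luminance rather than with the Lab palette used for the figures.

```r
## Hypothetical pitch speeds in miles per hour (not the chapter's data).
speed <- c(72.5, 78.1, 84.0, 88.3, 90.6, 93.2, 95.8)

## Cut the continuous range into seven equal-width bins (lossy downsampling).
nBins <- 7
speedBin <- cut(speed, breaks = nBins, labels = FALSE)

## Seven colors stepping from blue through purple to red.
## Chroma (c) and luminance (l) are held constant so no bin "pops" more
## than another; only the hue changes.
ramp <- hcl(h = seq(260, 360, length.out = nBins), c = 60, l = 60)

## Each pitch gets the color of the bin its speed falls into.
pointColors <- ramp[speedBin]
```

Every speed within a bin ends up with the same color, which is exactly the information given up in exchange for gradations that remain easy to tell apart.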

I've also chosen to use filled circles as the plotting symbol in this version, as opposed to the open circles used in all the previous plots. This improves the perception of each pitch's speed via its color: small patches of color are less perceptible. However, a consequence of this choice—compounded by the decision to work with a series of smaller plots—is that more points overlap. Hence, we've further degraded some of the positional information. (We'll attempt to recover some of this information in just a moment.)

So Why Bother with Color?

As compared to most print media, computer displays have fewer units of space but a broader color gamut. So, color is a compensatory strength.

For multidimensional data, color can convey additional dimensions inside a unit of space, and can do so instantly. Color differences can be detected within 200 milliseconds, before you're even conscious of paying attention (the "preattentive" concept I mentioned earlier).

But the most important reason to use color in multivariate graphics is that color is itself multidimensional. Our perceptual color space, however you slice it, is three-dimensional.

We've now brought color to bear on our visualization, but we've only encoded a single dimension: speed. This leads us to another question.

If Color Is Three-Dimensional, Can I Encode Three Dimensions with It?

In theory, yes—Colin Ware (2000) researched this exact question using red, blue, and green as the three axes. (There are other useful ways of dividing the color spectrum, as we will soon see.) In practice, though, it's difficult. It turns out that asking observers to assess the amount of "redness," "blueness," and "greenness" of points is possible, but doing so is not intuitive.

Another complicating factor is that a nontrivial fraction of the population has some form of colorblindness. In its dichromatic forms (in contrast to normal trichromacy), one of the three cone types is missing, which effectively reduces color perception to two dimensions.

And finally, our sensation of color is not equal along all dimensions: there are fewer perceptible shades of yellow than of blue. It's thought that the closely related "red" and "green" receptors emerged via duplication of a single long-wavelength receptor (useful for distinguishing ripe from unripe fruits, according to one just-so story).

Because of the prevalence of colorblindness in the population, and because of the challenge of encoding three dimensions in color, I believe color is best used to encode no more than two dimensions of data.

Luminosity As a Means of Recovering Local Density

For the last iteration of our pitching plot, shown in Figure 4-5, I will introduce luminosity as a means of encoding the local density of points. This allows us to recover some of the positional information lost to overlapping, filled plotting symbols.

Figure 4-5. Location and pitch type, with pitch velocity and local density indicated by a two-dimensional color palette (see inset for details)

Here we have effectively employed a two-dimensional color palette, with blueness-redness varying along one axis to denote speed and luminosity varying along the other to denote local density. As detailed in the Methods section, these plots were created using the colorspace package in R, which provides the ability to specify colors in any of the major color spaces (RGB, HSV, Lab). Because the Lab color space varies chromaticity independently from luminosity, I chose it for creating this particular two-dimensional palette.
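The chapter does not say how local density was computed, so the following is only one plausible sketch: count, for each pitch, how many other pitches fall within a fixed radius, then scale that count onto a luminosity range. The coordinates, the radius, and the names localDensity and lumLevel are all assumptions made for illustration.

```r
## Hypothetical pitch locations: a tight cluster plus one stray pitch.
x <- c(0.0, 0.1, 0.1, 0.2, 2.0)
y <- c(0.0, 0.0, 0.1, 0.1, 2.0)

## Pairwise Euclidean distances between all points.
d <- as.matrix(dist(cbind(x, y)))

## Local density: number of other points within `radius` of each point.
radius <- 0.5                               # assumed neighborhood size
localDensity <- rowSums(d <= radius) - 1    # subtract 1 to exclude the point itself

## Scale density onto luminosity: sparse points stay light (60),
## the densest points become dark (25).
lumLevel <- 60 - (60 - 25) * localDensity / max(localDensity)
```

A kernel density estimate would give a smoother result; neighbor counting is used here only because it is short and easy to check by hand.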

One final point about using luminosity is that observing colors in a data visualization involves overloading, in the programming sense. That is, we rely on cognitive functions that were developed for one purpose (seeing lions) and use them for another (seeing lines).

We can overload color any way we want, but whenever possible we should choose mappings that are natural. Mapping pitch density to luminosity feels right because the darker shadows in our pitch plots imply depth. Likewise, when sampling from the color space, we might as well choose colors found in nature. These are the palettes our eyes were gazing at for millions of years before the RGB color space showed up.

Looking Forward: What About Animation?

This discussion has focused on using static graphics in general, and color in particular, as a means of visualizing multivariate data. I've purposely neglected one very powerful dimension: time. The ability to animate graphics multiplies by several orders of magnitude the amount of information that can be packed into a visualization (a stunning example is Aaron Koblin's visualizations of U.S. and Canadian flight patterns, explored in Chapter 6). But packing that information into a time-varying data structure involves considerable effort, and animating data in a way that is informative, not simply aesthetically pleasing, remains challenging. Canonical forms of animated visualizations (equivalent to the histograms, box plots, and scatterplots of the static world) are still a ways off, but frameworks like Processing[28] are a promising start toward their development.

Methods

All of the visualizations here were developed using the R programming language and the lattice graphics package; the palette code also relies on the colorspace package. The R code for building a two-dimensional color palette follows:

## colorPalette.R
## builds an (n × m) 2D palette
## by mixing 2 hues (col1, col2) across m columns
## and stepping between two luminosities (lum1, lum2) across n rows
## returns a matrix of the hex RGB values
makePalette <- function(col1,col2,lum1,lum2,m,n,...) {
    C <- matrix(data=NA,ncol=m,nrow=n)
    alpha <- seq(0,1,length.out=m)
    ## for each luminosity level (rows)
    lum <- seq(lum1,lum2,length.out = n)
    for (i in 1:n) {
         c1 <- LAB(lum[i], coords(col1)[2], coords(col1)[3])
         c2 <- LAB(lum[i], coords(col2)[2], coords(col2)[3])
         ## for each mixture level (columns)
         for (j in 1:m) {
             c <- mixcolor(alpha[j],c1,c2)
             hexc <- hex(c,fixup=TRUE)
             C[i,j] <- hexc
         }
    }
    return(C)
}

## plot a vector or matrix of RGB colors
plotPalette <- function(C,...) {
    if (!is.matrix(C)) {
        n <- 1
        C <- t(matrix(data=C))
    } else {
        n <- dim(C)[1]
    }
    ## set up an empty plotting region, one unit of height per palette row
    plot(0, 0, type="n", xlim = c(0, 1), ylim = c(0, n), axes = FALSE, ...)

    ## helper function for plotting one row of rectangles
    plotRectangle <- function(col, ybot=0, ytop=1, border = "light gray") {
        n <- length(col)
        rect(0:(n-1)/n, ybot, 1:n/n, ytop, col=col, border=border)
    }

    for (i in 1:n) {
        plotRectangle(C[i,], ybot=i-1, ytop=i)
    }
}

## Let's put it all together.
## We make two colors in the LAB space, and then plot a 2D palette
## with luminosity going from 60 down to 25.
library(colorspace)
lightRed <- LAB(50,48,48)
lightBlue <- LAB(50,-48,-48)
C <- makePalette(col1=lightBlue, col2=lightRed, lum1=60, lum2=25, m=7, n=7)
plotPalette(C, xlab='speed', ylab='density')
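Once a palette matrix exists, each data point still has to be assigned one of its cells. The sketch below shows one way that lookup might work: bin speed into a column index and density into a row index, then index the matrix with both. To stay self-contained it builds a stand-in 7 × 7 palette with base R's hcl() instead of calling makePalette, and the per-pitch values and the helper toIndex are hypothetical.

```r
## Stand-in 7 x 7 palette: rows vary luminance (light to dark),
## columns vary hue (blue to red). In practice the matrix returned by
## makePalette() would be used instead.
n <- 7
m <- 7
lum <- seq(60, 25, length.out = n)
hue <- seq(260, 360, length.out = m)
C <- outer(lum, hue, function(l, h) hcl(h = h, c = 50, l = l))

## Hypothetical per-pitch values.
speed        <- c(75, 85, 95)
localDensity <- c(0, 4, 9)

## Bin a numeric vector into k equal-width classes, returning integer indices.
toIndex <- function(v, k) cut(v, breaks = k, labels = FALSE)

rowIdx <- toIndex(localDensity, n)   # density picks the luminosity row
colIdx <- toIndex(speed, m)          # speed picks the hue column

## A two-column index matrix selects one palette cell per pitch.
pointColors <- C[cbind(rowIdx, colIdx)]
```

The resulting pointColors vector can be passed as the col argument of a plotting call, so the slowest, sparsest pitch gets the lightest blue cell and the fastest, densest pitch gets the darkest red one.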

Conclusion

As this example has demonstrated, color, used thoughtfully and responsibly, can be an incredibly valuable tool in visualizing high-dimensional data. The final product (five-dimensional pitch plots for all available data for the 2008 season) can be explored via the PitchFX Django-driven web tool at Dataspora Labs (http://labs.dataspora.com/gameday/).

References and Further Reading

Few, Stephen. 2006. Information Dashboard Design, Chapter 4. Sebastopol, CA: O'Reilly Media.

Ihaka, Ross. Lectures 12–14 on Information Visualization. Department of Statistics, University of Auckland. http://www.stat.auckland.ac.nz/~ihaka/120/lectures.html.

Sarkar, Deepayan. 2008. Lattice: Multivariate Data Visualization with R. New York: Springer-Verlag.

Tufte, Edward. 2001. Envisioning Information, Chapter 4. Cheshire, CT: Graphics Press.

Ware, Colin. 2000. Information Visualization, Chapter 4. San Francisco, CA: Morgan Kaufmann.
