Chapter 4. Components of a Visualization

The previous two chapters outlined the process of refining a question into tasks. Chapter 2 broke each task down into components: actions, objects, measures, and partitions. These terms help identify where and how to turn fuzzy tasks into specific, actionable ones. Then, Chapter 3 discussed in more detail how to solicit the use scenarios and user stories that motivate the decisions made about proxies during operationalization.

The process in Chapter 2 concluded with a well-operationalized task and promised that this can lead to a visualization. But it did not discuss how to translate an operationalized task into a visualization. There is one step left before we can start doing visualization: we must understand the data.

This chapter takes the first step toward translating these descriptions into visualizations. Understanding the characteristics of the data will make it easier to select an appropriate visualization. Chapter 5 then describes specific visualizations to match the data characteristics outlined here: more specifically, the data’s dimensions and measures, and how it is grouped and aggregated. In Chapter 6, we’ll look at how views can be combined to support rich, dynamic analysis of complex tasks and data.

Dimensions and Measures

The attributes of the data serve particular roles in a task. A dimension is an attribute that groups, separates, or filters data items. A measure is an attribute that addresses the question of interest and that the analyst expects to vary across the dimensions. Both the measures and the dimensions might be attributes directly found in the dataset or derived attributes calculated from the existing data.

In different fields, these terms get somewhat different names. In the sciences, it’s more common to talk about independent variables (those that the experimenter manipulates) and dependent variables (the outcomes of the experiment). The intuition is the same for task operationalization, although in many business intelligence scenarios, for example, the data analyst cannot actually control who walks into the store or visits the website.

The term metric is sometimes used to describe a measure that stands as a proxy for a desired value.1 One virtue of a visualization approach is the ability to handle multiple metrics at once. Rather than trying to reduce everything to a single number, the analyst can look at several different measures. For example, it is reasonable to say “The fastest route is getting faster, and that’s good, but the variance is really brutal.” Chapter 6 discusses several techniques to visualize multiple metrics.

Example: International Towing & Ice Cream

This section discusses different data types with a motivating example. Sue is a data analyst for International Towing & Ice Cream (ITIC), a fictional company that provides a variety of important roadside services. ITIC’s products and services are purchased on the road, so their location is important—and, as in any ice cream delivery service, so is the temperature (Table 4-1).

Table 4-1. Sample metrics

| Time | Customer | Sales location (lat) | Sales location (lon) | Product category | Product | Temperature | Revenue |
|---|---|---|---|---|---|---|---|
| June 17, 10:30 am | 0121 | 47.6062 | -122.332 | Roadside | Towing | 84 | $100 |
| June 17, 10:35 am | 0232 | 33.26 | -112.04 | Roadside | Flat | 96 | $50 |
| June 17, 10:37 am | 0304 | 37.52 | -122.16 | Delivery | Ice cream | 103 | $10 |

The operationalization and data counseling process helped Sue realize that she wants to display sales grouped by product categories. Because product purchases vary over time—on a daily cycle, a weekly rhythm, and by season—she will want to look at sales, divided among categories, over time and locations. For example, Sue might look at the total revenue by product; in this case, the product is the dimension, while the revenue is a measure.

Dimensions

The dimensions of the data are the ways in which the data varies. Chapter 2 discussed partitions on the data; these partitions can be seen as dimensions.

In the ITIC example, there are a number of dimensions:

  • Temperature

  • Time

  • Product

  • Location

There are several different types of data here. When choosing good visualizations to explore data, it is important to recognize the type, as different charts are suited to different data types. For example, a visualization that works well for showing time of day may not be effective for showing geospatial location.

The next section looks at the types of data used in visualizations; the visualizations in Chapter 5 are indexed on these data types. A user may have data that needs to be changed into a different representation. The following section describes a selection of ways to transform between data types.

Types of Data

Chapter 5 examines a variety of charts. The charts are indexed to the user task and can be selected based on the types of dimensions and measures.

Data attributes can be divided into three principal types:

Continuous (interval and ratio) data

Consists of ordered, equally and meaningfully spaced values. Ratio data has a meaningful zero point, and so can be added or subtracted: 10 feet plus 20 feet adds to 30 feet. Interval values, on the other hand, lack a meaningful zero point. As such, differences between interval values can be computed, but two interval values cannot be added together: values like dates, pH readings, and oven temperatures are interval data. In the ITIC example, the temperature and time of day are both continuous data. In many scenarios, ratio data is a likely measure: revenue and sales amount are examples of ratio data.

Ordinal data

Consists of discrete values that are ordered, but that cannot be meaningfully added or subtracted. Rankings are a good example of ordinal data: if a runner comes in first in one race and ninth in another, they did not come in a total of tenth, and it is not clear how to compare them to the runner who came in fifth twice.

Categorical data

Consists of discrete values; every item falls into a single category. Categorical data has no particular ordering—north does not logically come before or after west. In visualization, knowing something about the cardinality—the number of distinct values—of categorical data is important. In using categorical data for an axis or a color scale, there should be few enough categories that it makes sense to group the data into them and for the list of categories to be readable and comparable.
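As a rough illustration of checking cardinality before placing a categorical attribute on an axis or color scale, the sketch below counts the distinct values in the three sample rows of Table 4-1. The helper name `cardinality` and the cutoff of 10 categories are assumptions made for this example, not rules from the text.

```python
# Product category and product columns, copied from the three rows of Table 4-1.
rows = [
    {"category": "Roadside", "product": "Towing"},
    {"category": "Roadside", "product": "Flat"},
    {"category": "Delivery", "product": "Ice cream"},
]


def cardinality(rows, attribute):
    """Return the number of distinct values an attribute takes."""
    return len({row[attribute] for row in rows})


# Hypothetical cutoff: an illustrative limit on how many categories stay readable.
MAX_READABLE_CATEGORIES = 10

for attribute in ("category", "product"):
    n = cardinality(rows, attribute)
    verdict = "fine as-is" if n <= MAX_READABLE_CATEGORIES else "consider reducing"
    print(f"{attribute}: {n} distinct values ({verdict})")
```

On a real product catalog the same check would likely flag the product column long before the category column, which is one reason to look for hierarchies in the data.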

In addition, there are three specialized forms of data that are worth discussing on their own as they have specific mappings to visualization chart types:

Temporal data

This is a form of interval data that has a time component. While a single timestamp refers to a single time (e.g., “November 20, 2010, 8:01 am”), it can be interpreted in a broad variety of ways.

Temporal data is often interpreted cyclically and hierarchically. Time comes in cycles (e.g., “every day at 8:00 am,” or “weekdays from 8 to 9 am”). Time may be grouped into ranges (e.g., “November 2010”), and can be placed against a number of calendars (e.g., fiscal years, calendar years, workdays). Times can be subtracted to get a duration, which is ratio data. Visualization toolkits often offer powerful tools for organizing temporal data.

Geographical data

Refers to places; it is inherently two-dimensional (or three-dimensional, in some cases). It may come in the form of positions, outlines of shapes, or names of places. It can often be grouped into categorical data with the help of an atlas to assign zip codes, city names, or other relevant groupings.

Relational data

This is data that connects two other points: this might be from a hierarchy or a network. For example, the fact that some number of commuters go from one place to another is relational data; so is the fact that one person reports to another. When data items are categorized, they sometimes are represented as relational; the relation is between the data item and its category.
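The different readings of temporal data described above can be sketched with Python's standard `datetime` module. The timestamps come from Table 4-1; the table gives no year, so the year 2012 used here is an assumption, as are the variable names.

```python
from collections import Counter
from datetime import datetime

# Timestamps from Table 4-1. The year is NOT in the table; 2012 is assumed.
sales_times = [
    datetime(2012, 6, 17, 10, 30),
    datetime(2012, 6, 17, 10, 35),
    datetime(2012, 6, 17, 10, 37),
]

# Subtracting two timestamps (interval data) yields a duration (ratio data).
gap = sales_times[1] - sales_times[0]
gap_minutes = gap.total_seconds() / 60

# Durations have a meaningful zero, so they can be added together.
total = gap + (sales_times[2] - sales_times[1])

# Two timestamps, by contrast, cannot be added: Python raises TypeError.
try:
    sales_times[0] + sales_times[1]
    can_add_timestamps = True
except TypeError:
    can_add_timestamps = False

# Cyclic reading: hour of day, ignoring the date.
sales_by_hour = Counter(t.hour for t in sales_times)

# Hierarchical reading: group each timestamp into its (year, month) range.
sales_by_month = Counter((t.year, t.month) for t in sales_times)
```

The same three values end up as a duration, a position in a daily cycle, or a member of a monthly range, depending on which interpretation the task calls for.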

Transforming Between Dimension Types

Different data types can be difficult to fit into particular visualization types. Often, transforming between data types may help simplify the data into a form that can be processed more easily. This section highlights a few of the most common and useful transformations:

Categorical-to-ordinal and ordinal-to-categorical

Categorical data almost always has to be interpreted in some order or another. Conversely, many visualizations are described as taking categorical data when the user actually has ordinal data. Each type may be interpreted as the other as needed, provided that the order in ordinal data is always preserved.

Continuous to ordinal

Continuous data can be difficult to deal with as a dimension, so it is sometimes transformed into ordinal data. In the ITIC example, the analyst might group a number of entries together into hot and cool temperatures, or might separate mornings and afternoons. This process makes analysis far more tractable: it is useful to make statements like “We sold twice as much ice cream on hot days as we did on cool days.” Unfortunately, this imposes a hard line on otherwise smooth data: if 80 degrees and above is considered hot, then a day when it’s one degree cooler (79 degrees) is now a qualitatively different sort of day than an 80-degree day. When a continuous measure is broken into ordered groups, it is referred to as binning.
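A minimal sketch of binning, using the temperature and revenue columns of Table 4-1 and the 80-degree cutoff mentioned in the text. The function name `temperature_bin` and the two-level scale are illustrative assumptions.

```python
# Ordinal scale: the order of the labels carries meaning.
BIN_ORDER = ["cool", "hot"]


def temperature_bin(degrees, hot_threshold=80):
    """Map a continuous temperature onto the ordered cool/hot scale."""
    return "hot" if degrees >= hot_threshold else "cool"


# (temperature, revenue in dollars) pairs from Table 4-1.
sales = [(84, 100), (96, 50), (103, 10)]

# Sum revenue within each bin, keeping the bins in their ordinal order.
revenue_by_bin = {label: 0 for label in BIN_ORDER}
for degrees, revenue in sales:
    revenue_by_bin[temperature_bin(degrees)] += revenue

# The hard line: one degree apart, yet in different bins.
print(temperature_bin(79), temperature_bin(80))
```

All three sample sales fall in the hot bin, so this tiny dataset cannot support a hot-versus-cool comparison; moving `hot_threshold` changes that, which shows how much the choice of cutoff shapes the result.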

Ordinal to continuous

While ordinal values cannot be directly added, they can be assigned point values. This is familiar from sporting events, like the Olympics, where top scores tend to be very similar. As such, the rank is a more useful measure than the actual value. To assign overall winners across multiple rounds, though, each rank is transformed into points. The points can then be added and ranked.
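The rank-to-points idea might look like the following sketch. The scoring table and the competitors' ranks are made up for illustration; no real event's rules are implied.

```python
# Hypothetical scoring table: rank -> points. Any decreasing scheme would do.
POINTS = {1: 5, 2: 3, 3: 1}

# Made-up finishing ranks for three competitors across two rounds.
ranks = {"A": [1, 2], "B": [2, 3], "C": [3, 1]}

# Transform each rank into points, then add the points per competitor.
totals = {
    name: sum(POINTS.get(rank, 0) for rank in round_ranks)
    for name, round_ranks in ranks.items()
}

# Rank the competitors again, this time by total points.
standings = sorted(totals, key=totals.get, reverse=True)

print(totals)     # {'A': 8, 'B': 4, 'C': 6}
print(standings)  # ['A', 'C', 'B']
```

The point values are a design decision: a different table could reorder the standings, so the mapping should be chosen deliberately and stated alongside the results.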

Reducing cardinality for categorical data

Categorical data refers to the data column within its context. A company’s entire product catalog probably has too many items to be analyzed with categorical techniques unless the analyst is looking specifically at a particular subproduct. Rolling together smaller categories into an “other” category, for example, can reduce cardinality; so can finding implicit or explicit hierarchies in the data.
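One way to roll small categories together is sketched below. The helper `lump_small_categories`, its `keep` parameter, and the list of product labels are assumptions for illustration; the labels extend the ITIC products with a made-up one so that there is something to lump.

```python
from collections import Counter


def lump_small_categories(values, keep=2, other_label="Other"):
    """Keep the `keep` most frequent categories and relabel the rest."""
    top = {value for value, _count in Counter(values).most_common(keep)}
    return [value if value in top else other_label for value in values]


# Made-up sequence of product labels, one per sale.
products = [
    "Towing", "Towing", "Towing",
    "Flat", "Flat",
    "Ice cream",
    "Jump start",
]

lumped = lump_small_categories(products)
print(Counter(lumped))
```

Four categories become three here. The lumped group hides whatever distinguishes its members, so it works best when the small categories are not the focus of the task.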

Drilldowns

The drilldown is a common interactive technique between several hierarchical dimensions. Drilling down merely means moving the focus of attention from a higher-level dimension to a single, lower-level component: an analyst might drill down from a view that shows multiple years to focus on the year 2012, and then look at the months within it. Drilling from nation to region to state to city is common, or from business units to teams, or feature areas in telemetry data to features to specific events.

Rollups

The rollup is the logical opposite of the drilldown: grouping items that share a hierarchical level and shifting the focus up a level.

Pivoting data

The pivot operation summarizes items that have been grouped together. In the ITIC example, communicating total revenue by product category would require that the data be pivoted along that column. (Roadside total revenue would then add to $150; total delivery to $10.)
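The pivot along product category can be sketched in a few lines of plain Python, using the three rows of Table 4-1. The variable names are assumptions; a dataframe library or a spreadsheet pivot table would serve equally well.

```python
from collections import defaultdict

# (product category, product, revenue in dollars) from Table 4-1.
rows = [
    ("Roadside", "Towing", 100),
    ("Roadside", "Flat", 50),
    ("Delivery", "Ice cream", 10),
]

# Group by the dimension (category) and sum the measure (revenue).
revenue_by_category = defaultdict(int)
for category, _product, revenue in rows:
    revenue_by_category[category] += revenue

print(dict(revenue_by_category))  # {'Roadside': 150, 'Delivery': 10}
```

Grouping on the product column instead of the category column would be the corresponding drilldown; grouping on category, as here, is the rollup.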

Dimensionality Reduction and Clustering

In the machine learning work that is increasingly important for dealing with large datasets, some core techniques fall under the umbrellas of dimensionality reduction and clustering. Although it is far outside the scope of this book to discuss how these techniques work, it is worth briefly considering what they do to data, so that their effects can be taken into account during operationalization.

Dimensionality reduction is a way of reducing a large number of different measures into a smaller set of metrics. The intent is that the reduced metrics are a simpler description of the complex space that retains most of the meaning. For example, a movie recommendation service might keep hundreds of individual dimensions about a user, such as the set of movies that she has reviewed and watched. These dimensions are both difficult to interpret alone and far too sparse to be useful: most movies have been watched by comparatively few users. Dimensionality reduction attempts to reduce these to a smaller set of useful dimensions, such as “likes horror movies,” which can be more directly analyzed and inspected. The outcome dimensions are usually continuous; depending on the technique, they may even produce ratio data, so that one movie is twice as much a horror film as another.

Clustering techniques are similarly useful for reducing a large number of items into a smaller set of groups. A clustering technique finds groups of items that are logically near each other and gathers them together. For example, the movie recommender service might cluster users into groups. Analysts can then carry out analyses on individual groups.

Examining Actions

Chapter 2 discussed some of the core actions in tasks, but left the concept rather broad. The action helps identify candidate visualizations and encodings. Some single visualizations can address multiple actions: a bar chart can allow a user to find a specific value, identify the largest or smallest value, roughly guess an average, or compare two or more bars to each other. On the other hand, some tasks are particularly well-supported by one visualization or another; for example, a node-link diagram can be great for tracing paths through a network.

Some of the actions that often come up describe:

  • Finding and reading individual values in the data

  • Characterizing the distribution of a dimension: minimum, maximum, outliers, central tendency, sort order, etc.

  • Identifying the trend of a metric over time (or some other dimension)

There are also more complex actions:

  • Comparing a value across a category (“dollars from store A versus store B”)

  • Comparing a metric to another metric (“height versus weight of subjects” or “salary distribution of men versus women”)

  • Contrasting a metric with many others (“Seattle versus other cities”)

  • Clustering values (“divide consumers into market segments”)

Many of these actions look like statistical tasks (e.g., “I want to know if men or women spend, on average, more money at our store”). Indeed, if an analyst needs only one or two of these tasks, then a visualization probably is not necessary.

Multiple tasks, however, are often linked: an analyst may want to be able to explore the distribution to find reasonable cutoffs, or explore subdivisions of the data across a range of different dimensions. For example, an analyst may want to see how a distribution of product sales looks when the data is partitioned by store, or product, or even by the display aisle, or an analyst may want to switch from making comparisons of older women versus younger women to older women versus older men. A visualization tool can often support this more open-ended exploration better than statistical tests.

Action keywords can cue which visualization to use. Tasks like “Compare one object to another across multiple dimensions” are a cue that the analyst might want to compare multiple series. In contrast, “How is this item different?” suggests that the analyst might want to pull out a single item to compare to a background set of items. “Are any items different?” is a cue to look for visualizations that help show outliers.

The next two chapters look at how to choose a visualization based on the operationalization and the concepts described here.

1 Though the distinction between a metric and a measure makes for entertaining online debates, this book sees the two as effectively synonymous.
