A guest post by Tom Barker, a software engineer, an engineering manager, a professor and an author. Currently he is Director of Software Engineering and Development at Comcast, and an Adjunct Professor at Philadelphia University. He has authored Pro JavaScript Performance: Monitoring and Visualization, Pro Data Visualization with R and JavaScript, and Technical Management: A Primer, and can be reached at @tomjbarker.

In a previous post, I looked at using scatterplots to identify relationships between sets of data. I talked about the different types of relationships that can exist between data sets, such as positive and negative correlation. The idea was framed in terms of team dynamics: do you see any correlation between the number of people on a team and the amount of work that the team can complete, or between the amount of work completed and the number of defects generated?

In this post I will talk about my current favorite type of chart: the parallel coordinate chart. It is my favorite because it clearly shows nuanced relationships between several different variables. It works much like a scatterplot, but scales to many axes instead of just two.

## What are Parallel Coordinate Charts?

Parallel coordinate charts are visualizations that consist of `N` vertical axes, each representing a unique data set, with lines drawn across the axes. The lines show the relationship between the axes, much like scatterplots, and the patterns that the lines form indicate the type of relationship. You can also gather details about the relationships between the axes from the way the lines cluster. Let’s take a look at this using the chart below as an example.

I’ve constructed the chart above from the `Seatbelts` data set that comes built into R. To see a breakdown of the dataset, type `?Seatbelts` at the R command line. I’ve extracted a subset of the available columns to better highlight the relationships in the data.

The dataset represents the number of drivers killed in car accidents in Great Britain before and after it became compulsory to wear seat belts. The axes represent the number of drivers killed, the distance driven, the cost of gas at the time, and whether or not there was a seat belt law in place.
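A chart along these lines can be sketched as follows. This is a minimal sketch rather than the exact code behind the figure: it assumes the four axes map to the `DriversKilled`, `kms`, `PetrolPrice`, and `law` columns of `Seatbelts`, and it uses `parcoord()` from the `MASS` package.

```r
library(MASS)  # provides parcoord()

# Seatbelts is a multivariate time series; convert it to a data frame
seatbelts <- as.data.frame(Seatbelts)

# Assumed column choice: drivers killed, distance driven,
# petrol price, and whether the seat belt law was in effect
sb_subset <- seatbelts[, c("DriversKilled", "kms", "PetrolPrice", "law")]

# One vertical axis per column, one line per monthly observation;
# var.label = TRUE prints the minimum and maximum of each axis
parcoord(sb_subset, var.label = TRUE)
```

Each axis is scaled independently to its own minimum and maximum, so the vertical position of a line is only meaningful relative to that axis.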

There are a number of useful ways to read parallel coordinates. If you look at the lines between a single pair of axes, you can see the relationship between those two data sets. For example, if you look at the relationship between the price of gas and the seat belt law, you can see that the price of gas falls in a fairly tight range while the seat belt law was in place, but covers a wide range of prices while it was not. This relationship could imply a lot of different things, but since I know the data, I know it’s because there is a much smaller sample size for deaths after the law was put in place. There were about 14 years’ worth of data before the seat belt law, but only about 2 years’ worth after it.
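The sample-size point can be checked directly against the data. The sketch below counts the monthly observations with and without the law and compares the spread of petrol prices in each group; the variable names are illustrative.

```r
seatbelts <- as.data.frame(Seatbelts)

# Number of monthly observations without (0) and with (1) the law in effect
law_counts <- table(seatbelts$law)
law_counts

# Lowest and highest petrol price within each group
price_range <- tapply(seatbelts$PetrolPrice, seatbelts$law, range)
price_range
```

The `law = 1` group holds only 23 of the 192 monthly rows, which is why its lines bunch together on the price axis.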

You can also trace lines across all of the axes to see how each of the axes relates. This is difficult to do with all of the lines the same color, but when you change the color and shading of lines, you can more easily see the patterns across the chart. Let’s take the existing chart and assign colors to the lines, which gives us the results shown below.

From this figure, you can begin to see the patterns that exist in the data. The lines with the lowest number of deaths also have the greatest distances driven, and mainly fall into the period after the seat belt law was enacted. Again, note that we have a much smaller sample size for the period after the seat belt law than for the period before it, but you can see how useful and telling it is to be able to trace the interconnectedness of these data points.
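One way to color the lines is sketched below. The color scheme here is an assumption chosen for illustration, not necessarily the one used in the figure: lines are grey for months before the law and red for months after it, which makes the post-law cluster easy to trace across every axis.

```r
library(MASS)  # provides parcoord()

seatbelts <- as.data.frame(Seatbelts)
sb_subset <- seatbelts[, c("DriversKilled", "kms", "PetrolPrice", "law")]

# One color per row: grey before the law, red once it is in effect
line_colors <- ifelse(sb_subset$law == 1, "red", "grey60")

# col accepts a vector of colors, applied to the lines in row order
parcoord(sb_subset, col = line_colors, var.label = TRUE)
```

Passing `rainbow(nrow(sb_subset))` as `col` instead gives every line its own hue, which shows progression over time rather than a two-group split.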

## History of Parallel Coordinate Plots

Before moving further into the topic of parallel coordinates, let’s briefly examine their history. The idea of using parallel coordinates on vertical axes was invented in 1885 by Maurice d’Ocagne when he created the nomograph and the field of nomography. Nomographs are essentially graphical tools for calculating values according to a mathematical rule. The classic example of a nomograph still in use today is the line on a thermometer that shows values in both Fahrenheit and Celsius. Or think of rulers that show values in inches on one side and centimeters on the other. Ron Doerfler has written an extensive thesis on nomography. Doerfler also hosts a site called Modern Nomograms that “offers eye-catching and useful graphical calculators uniquely designed for today’s applications”.

The term parallel coordinates and the concept it represents were rediscovered and popularized by Alfred Inselberg while he was studying at the University of Illinois. Currently Dr. Inselberg is a Professor at Tel Aviv University and a Senior Fellow at the San Diego Supercomputer Center. Dr. Inselberg has also published a book on the subject, Parallel Coordinates: Visual Multidimensional Geometry and Its Applications. He has also published a paper on how to effectively read and use parallel coordinates, entitled The Multidimensional Detective, available from the IEEE.

## Applying the Concept

We understand that parallel coordinates are used to visualize the relationship between multiple variables, but how is that useful for a member or leader of an agile engineering team? In previous posts I’ve been talking about quantifying and visualizing the defect backlog, the sources of production incidents, and even the amount of work that teams commit to. Arguably, balancing these aspects of product development can be one of the most challenging activities that a team does.

With each iteration, either formally or informally, you have to decide how much effort to put toward each of these concerns: working on new features, fixing bugs on existing features, and addressing production incidents from direct feedback from users. And these are just a sampling of the nuances that every product team must juggle. You may also have to factor in time to spend on updating infrastructure.

You can use parallel coordinates to visualize this balance, both for documentation and as a tool for analysis when starting new sprints.

There are several different approaches that you can take with this. Using the data from the last post covering scatterplots, you could look at the running totals per iteration. Recall that this data was a total of points committed to per iteration, per team. You can augment this data with a snapshot of how many bugs and production incidents are in each team’s backlog, how many new bugs were opened during the iteration, as well as how many members there were on the team, so that the data looks much like this below:

To begin using this, simply read it into R:
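A minimal sketch of that step is shown here. It assumes the data is stored as a comma-separated file with a header row; the file name, the sample rows, and every column name other than `Team` are hypothetical and invented only so that the example runs on its own.

```r
# Hypothetical sample rows, invented purely for illustration
sample_csv <- c(
  "Sprint,Team,TotalPoints,TeamSize,BugBacklog,ProdIncidents,NewBugs",
  "1,Red,30,5,40,3,6",
  "1,Blue,22,4,55,5,9",
  "2,Red,34,5,36,2,4",
  "2,Blue,25,5,51,4,7"
)

# Write the sample to a temporary file standing in for the real data file
tv_file <- file.path(tempdir(), "teamvelocity.csv")
writeLines(sample_csv, tv_file)

# Read the file into a data frame, keeping Team as a character column
teamvelocity <- read.csv(tv_file, header = TRUE, stringsAsFactors = FALSE)
str(teamvelocity)
```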

We can then create a new data frame with all of the columns from our `teamvelocity` variable, except the `Team` column. That column is a string, and the `parcoord()` function (from R’s `MASS` package) will throw an error if you include strings in the object that you pass in to it. Team information also wouldn’t make sense as an axis in this context, since the lines drawn in the chart will themselves be representative of the teams.

You then pass the new object into the `parcoord()` function. You also pass the result of the `rainbow()` function into the `col` parameter, and set the `var.label` parameter to `TRUE` to make the upper and lower boundaries of each axis visible on the chart.
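Dropping the column might look like the sketch below. The data frame is a small hypothetical stand-in, and all column names apart from `Team` are assumptions.

```r
# Hypothetical stand-in for the teamvelocity data
teamvelocity <- data.frame(
  Sprint        = c(1, 1, 2, 2),
  Team          = c("Red", "Blue", "Red", "Blue"),
  TotalPoints   = c(30, 22, 34, 25),
  BugBacklog    = c(40, 55, 36, 51),
  ProdIncidents = c(3, 5, 2, 4),
  stringsAsFactors = FALSE
)

# Keep every column except Team, leaving only numeric columns
velocity_numeric <- teamvelocity[, names(teamvelocity) != "Team"]
str(velocity_numeric)
```

Selecting by name rather than by position keeps the step working if the column order in the source file changes.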
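Put together, the call can be sketched as follows, again with a hypothetical stand-in data frame whose column names (other than `Team`) are assumptions. Note that `rainbow()` takes the number of colors to generate, so it is given the row count to produce one color per line.

```r
library(MASS)  # provides parcoord()

# Hypothetical stand-in for the teamvelocity data
teamvelocity <- data.frame(
  Sprint        = c(1, 1, 2, 2),
  Team          = c("Red", "Blue", "Red", "Blue"),
  TotalPoints   = c(30, 22, 34, 25),
  BugBacklog    = c(40, 55, 36, 51),
  ProdIncidents = c(3, 5, 2, 4),
  stringsAsFactors = FALSE
)

# parcoord() needs numeric columns only, so drop Team first
velocity_numeric <- teamvelocity[, names(teamvelocity) != "Team"]

# One rainbow color per row; var.label = TRUE prints each axis's min and max
line_colors <- rainbow(nrow(velocity_numeric))
parcoord(velocity_numeric, col = line_colors, var.label = TRUE)
```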

This produces the visualization shown here:

This presents an interesting story. It tells you what the effects of each sprint are on your respective backlogs, including both bugs and production incidents. It also tells you how many bugs were opened with each sprint, and the relationship between each of these data points.

## Conclusion

But this is just the beginning. There are any number of ways you could expand this example, including digging deeper to uncover the amount of effort spent towards each axis.


### 2 Responses to “Mastering Parallel Coordinate Charts in R”

1. #### Ritz

Hi,

I am also trying to make a parallel coordinate chart.

Besides the first y axis, is there any way I can set the limits of the rest of the y axes using the parcoord function, for example having all the other y axes begin at 0 and end at 5?

2. #### Tom Barker

The easiest way I can think of would be to take a subset of your data that represents the limits you want to visualize, and pour that into parcoord. There are other packages with similar functionality, like ggplot2 and lattice, but I think truncating the source data is still the simplest way.