A guest blog post by Tom Barker, a software engineer, engineering manager, professor, and author who can be reached at @tomjbarker.
A number of years ago, after a good amount of introspection and analysis, I realized that most of the issues I dealt with on a daily basis could be traced back to a single root cause: communication. Bugs were the result of requirements not being communicated somewhere in the mix between Product, Engineering, and QA. The same goes for contention over velocity, and even over which features were completed in a given sprint.
But the most insidious type of communication breakdown was recency bias. If you aren’t familiar with recency bias, it is an investing term for the misconception that current market conditions indicate future performance. If the bug count for a particular day was high, the consensus among stakeholders was that quality was slipping. The inverse happened too: if we were in a particularly quiet period for production incidents, there could be the impression that system stability was at an all-time high. Of course, neither conclusion is justified by a single data point.
I realized at that point that I could make my working life considerably less stressful if I could just find a way to communicate out the salient data that would give the necessary context for a given situation.
It was also around this time that I attended my first Velocity conference. The conference was full of the best and brightest in our field communicating nuanced performance data without issue to a room full of hundreds, all by using data visualizations, which I will explore in depth in this post.
I realized that I could use the same technique to tell the story of the work being done in my organization. I jumped right in and began creating team health reports for all of my stakeholders. These reports covered a range of topics: the number of bugs in our backlog over time, the number of bugs per product, production incidents by product, and even which day and time saw the most code check-ins from my teams (my original intent was to make sure no meetings were ever scheduled during that window, but I soon realized the teams were most active checking in precisely because no meetings were scheduled at that time).
Once I had my first taste, I couldn’t get enough; I submerged myself in the world of data visualization. The initial exposure at Velocity (specifically seeing John Rauser speak) started it, but it was the work of Edward Tufte that taught me that, much like any other form of communication, data visualization has general best practices and rules of syntax.
What Is Data Visualization?
Data visualization is the art and practice of gathering, analyzing, and graphically representing empirical information to tell a story. I include gathering and analyzing data in my definition because they are the most important parts. It is simple enough to throw a data structure at the plot() function in the R command line and see something, but for your visualizations to be meaningful, you need to make sure both that you are gathering the correct data and that you understand what the data is telling you.
Gathering the correct data is domain dependent, and as such, completely subjective to your particular business, environment, and software stack.
Understanding the data is about knowing your domain, but also performing statistical analysis on your raw data. Your raw data might look like a large CSV file, but you can use tools like R to determine quartiles, distributions, and other interesting aggregate facts. These facts tell a story, a story you want to tell via visuals.
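As a minimal sketch of that kind of analysis in R (using the built-in rivers data set as a stand-in for your own raw data; for a real CSV you would swap in read.csv()):

```r
# Stand-in for raw data: lengths of 141 major North American rivers.
# For your own data, something like: x <- read.csv("mydata.csv")$some_column
x <- rivers

summary(x)                             # min, quartiles, mean, max at a glance
quantile(x, probs=c(0.25, 0.5, 0.75))  # just the quartiles
sd(x)                                  # spread around the mean
```

A few lines like these turn a wall of raw numbers into the aggregate facts that your charts will then communicate.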
The graphical representation of data will generally be in the form of charts and can be classified by the type of data that is being represented.
Types of Data Visualizations
When looking at how discrete values change over time, you use a time series chart. Time series charts generally have an x-axis and a y-axis: the y-axis is the vertical axis that shows the range of values in the data set, while the x-axis is the horizontal axis along the bottom that shows the span of time the data set covers. Time series charts also have at least one line plotting where the two values intersect:
You can see in the time series above that the y-axis covers a range of 100 to 600 and the x-axis displays a span of time from 1950 to 1960. For reference, the above chart is from a data set named AirPassengers that is packaged with R and denotes the monthly airline passenger totals.
What airline and what destination aren’t shown, but the pattern in the data is clear: over the course of 10 years the number of passengers grew nearly exponentially, yet within that same period there are recurring months with relatively low patronage. Imagine coming off the highs of the peaks above, which are the summer months, and seeing your customer base shrink nearly in half. Without the context of the pattern in the data, that might feel like a catastrophic downturn. With that context, we can see that it is just the natural flow of consumer habits in this particular business sector.
The time series above can be recreated in R with this code:
plot(AirPassengers, col="red", bty="7", lwd=2, xlab="", ylab="",
     main="International Airline Passengers (Total)")
The time series chart is the creation of William Playfair, who first used it to represent economic data over time in his 1786 work, The Commercial and Political Atlas. To appreciate the depth and importance of Playfair’s book, pick up the edition that recently came back into print thanks to Howard Wainer, Ian Spence, and Cambridge University Press. It is a precursor to our modern www.data.gov.
In the same book, Playfair invented another type of chart, one he deemed lesser in value: the bar chart. For one particular data set he did not have dates associated with the data points, so he was unable to show the change in values over time. His solution was to compare the discrete values to each other, as bars side by side.
See below for an example of a bar chart. Notice that the y-axis lists states, the x-axis shows a range of values from 0 to 14, and red bars denote the value for each state:
Why Playfair considered this chart of less importance than the time series should be obvious here: the values above are a snapshot in time. We can’t see how they change, so we can’t identify patterns over time. We can, however, see how the discrete values compare to each other within that specific snapshot.
We can recreate the above chart with the following R code:
# Murder rate (per 100,000) is column 5 of the built-in state.x77 data set
x <- sort(state.x77[,5], decreasing=FALSE)
opar <- par(no.readonly=TRUE)   # save current graphics settings
par(mar=c(5,5,5,5))             # widen margins to fit the state names
barplot(x, horiz=TRUE, space=2, cex.axis=0.6, cex.names=0.6,
        col="red", main="Murder Rate by State (1976)")
par(opar)                       # restore settings
Sometimes we want to see how the values in a data set are distributed. This is definitely the case when looking at performance data: if we have a range of performance numbers, we certainly care about their distribution. Say we pull a summary of the performance data, showing the minimum, maximum, and quartiles, and we see a large range. That’s concerning, so we next need to know how those values are distributed.
Let’s look at this in the context of something everyone can relate to: the total area of US states, in square miles. Let’s first pull up a summary of the data (using R’s built-in state.area data set):

summary(state.area)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   1214   37320   56220   72370   83230  589800
This is interesting, but how are those values distributed? How many states are in the 589800 range? Let’s look at a histogram to see:
Interesting. While the mean is 72370, we can see from the histogram that roughly 80% of states fall into the 0-100000 range, yet there are some significant outliers.
The histogram above can be generated with this R code:

hist(state.area, col="green", main="Distribution of Total Area by State", freq=TRUE)
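Since state.area lines up element-for-element with R’s built-in state.name vector, one way to put names on those outliers is a simple logical index (the 100,000 square mile cutoff here is just an illustrative threshold, not part of the original analysis):

```r
# Which states account for the right-hand tail of the histogram?
state.name[state.area > 100000]
```

This kind of quick follow-up query is exactly the “understand your data” step: the histogram tells you outliers exist, and a one-liner tells you who they are.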
Finally, if you want to see how two variables correlate, you use a scatterplot to reveal potential relationships in the data. Scatterplots have two axes, each representing one of the two variables being compared. There are usually points on the plot showing the intersections of the paired values, and sometimes a line showing the direction of the correlation. A line starting low on the left and rising to the right indicates a positive correlation, meaning that as one variable increases, so does the other. A line starting high on the left and falling to the right indicates a negative correlation, meaning that as one variable increases, the other decreases.
See below for a scatterplot demonstrating the positive correlation between stopping distance and speed in cars. Notice that as speed increases so does stopping distance:
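The scatterplot above comes from the cars data set that ships with R (stopping distance in feet versus speed in mph). A minimal version, with a fitted regression line added to show the direction of the correlation, might look like this:

```r
plot(cars$speed, cars$dist, col="blue", pch=19,
     xlab="Speed (mph)", ylab="Stopping Distance (ft)",
     main="Stopping Distance by Speed")
# The fitted line rises from left to right, confirming the positive correlation
abline(lm(dist ~ speed, data=cars), col="red", lwd=2)
```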
We’ll end with a word of caution. Data visualization is all about telling a story, so it’s important to understand your data before telling a story around it. And it is ethically important to tell as accurate a story as you can with this tool.