
A guest post by Tom Barker, a software engineer, an engineering manager, a professor and an author. Currently he is Director of Software Engineering and Development at Comcast, and an Adjunct Professor at Philadelphia University. He has authored Pro JavaScript Performance: Monitoring and Visualization, Pro Data Visualization with R and JavaScript, and Technical Management: A Primer, and can be reached at @tomjbarker.

In previous posts I’ve written about using R, from getting up to speed using the R console, to ingesting and parsing external data, to object-oriented programming in R, and even how to distribute your R scripts over the Web. Here we will explore how to craft specific data visualizations in R, starting with the scatterplot.

Scatterplots are charts that plot two separate data sets against each other, each on its own axis, displayed as points on a Cartesian grid (x and y coordinates). Scatterplots are used to try to identify relationships between the two variables. The pattern, or lack of a pattern, that the points form indicates the relationship. At a very high level, relationships can be:

• Positive correlation, where one variable increases as the other increases. This is demonstrated by the dots forming a line trending diagonally upward from left to right:

This shows a positive correlation between total phones in North America and Europe.

• Negative correlation, where one variable increases as the other decreases. This is demonstrated by the dots forming a line trending downward from left to right:

This shows the negative correlation between body weight and time passing (for a person on a diet).

• No correlation, demonstrated (or not) by a scatterplot that has no discernible trend line:

This shows no correlation between the number of accidental deaths in the US and the passage of time over a year.
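The three cases above can also be expressed as numbers with R's `cor()` function, which returns a value near 1 for a positive correlation, near -1 for a negative one, and near 0 when there is none. The sketch below is only illustrative: it uses R's built-in `WorldPhones` data set for the positive case, which is not necessarily the data behind the chart above, and the diet and random data are made up.

```r
# Positive: telephones in North America vs. Europe (built-in WorldPhones matrix).
r_pos <- cor(WorldPhones[, "N.Amer"], WorldPhones[, "Europe"])

# Negative: made-up diet data, with weight falling as the days pass.
set.seed(42)                                    # make the random noise repeatable
days   <- 1:30
weight <- 200 - 0.5 * days + rnorm(30, sd = 0.5)
r_neg  <- cor(days, weight)

# None: two unrelated random samples.
x <- rnorm(200)
y <- rnorm(200)
r_none <- cor(x, y)

# Draw the three scatterplots side by side, then restore the old layout.
old_par <- par(mfrow = c(1, 3))
plot(WorldPhones[, "N.Amer"], WorldPhones[, "Europe"],
     xlab = "North America", ylab = "Europe", main = "Positive")
plot(days, weight, xlab = "Days", ylab = "Weight", main = "Negative")
plot(x, y, main = "None")
par(old_par)

print(c(positive = r_pos, negative = r_neg, none = r_none))
```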

Of course, simply identifying a correlation between two variables or data sets does not imply that one directly causes the other, hence the maxim that correlation does not imply causation. For example, see the negative correlation chart above. If we were to assume direct causation between the two axes (weight and number of days), we would be assuming that the passing of time itself caused body weight to decrease.

Michael Friendly and Daniel Denis have written a thoughtful and thoroughly researched article on the history of scatterplots, originally published in the Journal of the History of the Behavioral Sciences, Vol. 41, in 2005 and available on Friendly’s website at http://www.datavis.ca/papers/friendly-scat.pdf. Their article is recommended reading: it tries to trace the very first recorded scatterplots and the first time a chart was called a scatterplot, and it deftly delineates the difference between a scatterplot and a time series (a time series always has time as one of its variables, whereas a scatterplot can use any discrete values on either axis).

## Correlation Analysis

To make this real for us, we can apply this methodology to concrete concepts that are applicable to software engineering. Let’s say we wanted to look at the relationship between the number of team members and a team’s velocity.

To begin this analysis, let’s export the total story points for each sprint along with the team name. We should compile all of these data points into a single file that we will name teamvelocity.txt. Our file should hold one row per team per sprint; for example, rows for the 12.1 and 12.2 sprints for the teams named Red and Gold (arbitrary names for teams that are working on the same product, just with different bodies of work).

Let’s add a column to represent the total number of team members on each team for each sprint.
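One possible layout for the file is sketched below. The column names `TotalPoints`, `Team` and `TotalDevs` come from the text; the `Sprint` column name, the comma separator and every number are assumptions made up for illustration, not real team data.

```r
# Made-up sample rows: none of these numbers come from a real team.
# Assumed layout: comma-separated, with a header row naming the columns.
sample_rows <- c(
  "Sprint,TotalPoints,Team,TotalDevs",
  "12.1,25,Gold,4",
  "12.1,38,Red,6",
  "12.2,30,Gold,5",
  "12.2,45,Red,7"
)

# Write the rows to teamvelocity.txt in the working directory.
writeLines(sample_rows, "teamvelocity.txt")

# Print the file back out to confirm what was written.
cat(readLines("teamvelocity.txt"), sep = "\n")
```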

Excellent, let’s now read this into R:
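A minimal sketch of the read step, assuming the file is comma-separated with a header row (the helper name `read_velocity` is hypothetical). `stringsAsFactors = TRUE` is set explicitly because R 4.0 and later no longer convert strings to factors by default.

```r
# Hypothetical helper: read the velocity export into a data frame.
# Assumes a comma-separated file whose first row names the columns.
read_velocity <- function(path = "teamvelocity.txt") {
  read.table(path, header = TRUE, sep = ",", stringsAsFactors = TRUE)
}

# Only attempt the read when the file is actually present.
if (file.exists("teamvelocity.txt")) {
  teamvelocity <- read_velocity()
  str(teamvelocity)  # inspect the column names and types
}
```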

Next, let’s create a scatterplot using the `plot()` function to compare the total points that each team completed per sprint against how many members were on the team for that sprint. We pass `teamvelocity$TotalPoints` and `teamvelocity$TotalDevs` as the first two parameters, set the type to “`p`”, and give meaningful labels to our axes.
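The call described above might look like the following sketch. The helper name `velocity_scatter` and the axis label text are assumptions; only the column names and the `type` value come from the text.

```r
# Hypothetical helper wrapping the plot() call described above.
# Assumes `tv` has numeric TotalPoints and TotalDevs columns.
velocity_scatter <- function(tv) {
  plot(tv$TotalPoints, tv$TotalDevs,
       type = "p",                                  # "p" draws points only
       xlab = "Story points completed per sprint",  # assumed label text
       ylab = "Team members")                       # assumed label text
}

# Draw the chart once the data has been read in.
if (exists("teamvelocity")) {
  velocity_scatter(teamvelocity)
}
```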

This creates the scatterplot that we can see in the next figure, where, as we add more members to a team, the number of story points that they can complete in an iteration, or sprint, also increases.
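To put a number on the trend, one option is the Pearson correlation coefficient from `cor()`. This is a sketch rather than part of the original analysis, and the helper name `velocity_correlation` is hypothetical.

```r
# Hypothetical helper: Pearson correlation between points completed and team size.
# Assumes `tv` has numeric TotalPoints and TotalDevs columns.
velocity_correlation <- function(tv) {
  cor(tv$TotalPoints, tv$TotalDevs)
}

# A value close to 1 matches the upward trend seen in the scatterplot.
if (exists("teamvelocity")) {
  print(velocity_correlation(teamvelocity))
}
```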

If we wanted greater insight into the data that we have so far, such as which points belong to which team, we could surface that information with a bubble chart. We can create bubble charts using the `symbols()` function. We pass `TotalPoints` and `TotalDevs` into `symbols()`, just as we did for `plot()`, but we also pass the `Team` column to a parameter named `circles`, which specifies the radius of each circle drawn on our chart. `circles` expects numbers and `Team` is a string, so this relies on `Team` being a factor, whose underlying integer codes supply the radii. Versions of R before 4.0 convert strings to factors by default when reading data; newer versions need the conversion to be requested. We also set the fill color of the circle with the `bg` parameter and the stroke color of the circle with the `fg` parameter.
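A sketch of the bubble chart call follows. The helper name `velocity_bubbles`, the colors, the `inches` scaling and the axis labels are all assumptions. The team names are converted to numeric codes explicitly so that the example does not depend on how the data was read in.

```r
# Hypothetical helper wrapping the symbols() call described above.
# Assumes `tv` has TotalPoints, TotalDevs and Team columns.
velocity_bubbles <- function(tv) {
  # circles must be numeric, so map each team name to its factor code
  # (levels sort alphabetically, e.g. Gold = 1, Red = 2).
  radius <- as.numeric(factor(tv$Team))
  symbols(tv$TotalPoints, tv$TotalDevs,
          circles = radius,
          inches = 0.3,        # assumed: the largest circle is 0.3 inches in radius
          bg = "#CCCCCC",      # assumed fill color
          fg = "#000000",      # assumed stroke color
          xlab = "Story points completed per sprint",
          ylab = "Team members")
  invisible(radius)            # return the radii without printing them
}

# Draw the chart once the data has been read in.
if (exists("teamvelocity")) {
  velocity_bubbles(teamvelocity)
}
```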

The above R code should produce a bubble chart that looks like this:

## Conclusion

The intention here is not to imply anything about the sample data (there are in fact times when adding team members lowers a team’s velocity, such as when new members need to be ramped up) but to show how we can apply exploratory data analysis to concepts that we work with every day to gather more insight into our processes.
