A guest post by Tom Barker, a software engineer, engineering manager, professor, and author. He is currently Director of Software Engineering and Development at Comcast and an Adjunct Professor at Philadelphia University. He has authored Pro JavaScript Performance: Monitoring and Visualization, Pro Data Visualization using R and JavaScript, and Technical Management: A Primer, and can be reached at @tomjbarker.

In previous posts I’ve written about using R, from getting up to speed with the R console, to ingesting and parsing external data, to object-oriented programming in R, and even how to distribute your R scripts over the Web. Here we will explore how to craft specific data visualizations in R, starting with the scatterplot.

Scatterplots are charts that plot two data sets against each other, each on its own axis, displayed as points on a Cartesian grid (x and y coordinates). Scatterplots are used to identify relationships between the two variables. The pattern, or lack of a pattern, that the points form indicates the relationship. At a very high level, relationships can be:

  • Positive correlation, where one variable increases as the other increases. This is demonstrated by the dots forming a line trending diagonally upward from left to right:

    [Figure: positive correlation]

    This shows a positive correlation between total phones in North America and Europe.

  • Negative correlation, where one variable increases as the other decreases. This is demonstrated by the dots forming a line trending downward from left to right:

    [Figure: negative correlation]

    This shows the negative correlation between body weight and time passing (for a person on a diet).

  • No correlation, demonstrated (or not) by a scatterplot that has no discernible trend line:

    [Figure: no correlation]

    This shows no discernible correlation in the number of accidental deaths in the US over the course of a year.
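
As a rough numeric companion to the patterns above, R's cor() function returns a correlation coefficient between -1 and 1. The sketch below is illustrative only: it assumes the built-in WorldPhones dataset for the phones example (which may or may not be the data behind the figure above) and uses simulated weight readings, invented here, for the diet example.

```r
# Positive correlation: telephones in North America vs. Europe,
# taken from R's built-in WorldPhones matrix (an assumed data source).
phones_r <- cor(WorldPhones[, "N.Amer"], WorldPhones[, "Europe"])

# Negative correlation: simulated (made-up) body weight over 30 days of a diet.
set.seed(42)                      # make the random noise reproducible
days   <- 1:30
weight <- 200 - 0.5 * days + rnorm(30, sd = 0.5)
diet_r <- cor(days, weight)

# Values near 1 or -1 indicate a strong linear relationship;
# values near 0 indicate little or no linear relationship.
print(c(phones = phones_r, diet = diet_r))
```

A coefficient close to 1 matches the upward-trending chart, and one close to -1 matches the downward-trending chart.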

Of course, simply identifying a correlation between two data sets does not imply that one directly causes the other, hence the maxim that correlation does not imply causation. For example, see the negative correlation chart above. If we were to assume direct causation between the two axes (weight and number of days), we would be assuming that the passing of time itself caused body weight to decrease.

Michael Friendly and Daniel Denis have written a thoughtful and thoroughly researched article on the history of scatterplots, originally published in the Journal of the History of the Behavioral Sciences, Vol. 41, in 2005 and available on Friendly’s website at http://www.datavis.ca/papers/friendly-scat.pdf. Their article is recommended reading: it tries to trace the very first recorded scatterplots and the first time a chart was called a scatterplot, and it deftly delineates the difference between a scatterplot and a time series (a time series always has time as one of its variables, while a scatterplot can use any discrete values as data points).

Correlation Analysis

To make this real for us, we can apply this methodology to concrete concepts that are applicable to software engineering. Let’s say we wanted to look at the relationship between the number of team members and a team’s velocity.

To begin this analysis, let’s export the total story points for each sprint along with the team name. We should compile all of these data points into a single file named teamvelocity.txt, with one row per team per sprint: for example, rows for the 12.1 and 12.2 sprints for teams named Red and Gold (arbitrary names for teams that are working on the same product, just with different bodies of work).

Let’s add a column to represent the number of team members on each team for each sprint, so that every row now also carries a team size.
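
Here is a minimal sketch of what teamvelocity.txt might contain at this point. Only the TotalPoints, TotalDevs, and Team column names, the sprint numbers, and the team names come from the text; the Sprint column name, the comma-separated layout, and every number below are placeholders invented for illustration. The snippet writes the file from R so that the example is reproducible.

```r
# Hypothetical contents of teamvelocity.txt (all values are made up).
sample_lines <- c(
  "Sprint,TotalPoints,TotalDevs,Team",
  "12.1,25,6,Gold",
  "12.2,28,6,Gold",
  "12.1,15,4,Red",
  "12.2,18,5,Red"
)

# Write the sample file into the current working directory.
writeLines(sample_lines, "teamvelocity.txt")

# Echo the file back to confirm what was written.
cat(readLines("teamvelocity.txt"), sep = "\n")
```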

Excellent. Let’s now read this file into R as a data frame named teamvelocity.
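
One way to do this, sketched under the assumption that the file is comma-separated with a header row (the original layout is not shown, so adjust sep to match your own export). To keep the snippet runnable on its own, it first creates a small placeholder file with invented numbers if teamvelocity.txt does not already exist.

```r
# Path to the data file; assumed to sit in the working directory.
data_file <- "teamvelocity.txt"

# For a self-contained demo, create a placeholder file if none exists.
# The Sprint column name and all numeric values are made up.
if (!file.exists(data_file)) {
  writeLines(c("Sprint,TotalPoints,TotalDevs,Team",
               "12.1,25,6,Gold",
               "12.2,28,6,Gold",
               "12.1,15,4,Red",
               "12.2,18,5,Red"),
             data_file)
}

# Read the file into a data frame, using the first row as column names.
teamvelocity <- read.table(data_file, header = TRUE, sep = ",")

# Inspect the column names and types that R inferred.
str(teamvelocity)
```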

Next, let’s create a scatterplot using the plot() function to compare the total points that each team completed in a sprint against how many members were on the team for that sprint. We pass teamvelocity$TotalPoints and teamvelocity$TotalDevs as the first two parameters (the x and y values, respectively), set the type to “p” to draw points, and give meaningful labels for our axes.
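
A sketch of such a call follows. The data frame is built inline from invented placeholder numbers so that the snippet runs on its own, and the axis labels and title are just one possible wording, not the original post's.

```r
# Placeholder data standing in for the contents of teamvelocity.txt.
# The Sprint column name and all numeric values are made up.
teamvelocity <- data.frame(
  Sprint      = c(12.1, 12.2, 12.1, 12.2),
  TotalPoints = c(25, 28, 15, 18),
  TotalDevs   = c(6, 6, 4, 5),
  Team        = c("Gold", "Gold", "Red", "Red")
)

# x = total points, y = team size; type = "p" draws points only.
plot(teamvelocity$TotalPoints, teamvelocity$TotalDevs,
     type = "p",
     xlab = "Total Points Completed per Sprint",
     ylab = "Team Members",
     main = "Team Velocity vs. Team Size")
```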

This creates the scatterplot shown in the next figure, where, as we add more members to a team, the number of story points that the team can complete in an iteration, or sprint, also increases.

[Figure: scatterplot of total story points against team members]

If we wanted greater insight into the data that we have so far, such as which points belong to which team, we could surface that information with a bubble chart. We can create bubble charts using the symbols() function. We pass TotalPoints and TotalDevs into symbols(), just as we did for plot(), but we also pass the Team column into a parameter named circles, which specifies the radius of each circle drawn on our chart. Since, for our example, Team is a string, R will convert it to a factor when the file is read (this was the default before R 4.0.0; in later versions the column stays a character vector unless we convert it ourselves, for example with factor()), and the factor’s underlying integer codes determine the radii. We also set the fill color of the circles with the bg parameter and the stroke color with the fg parameter.
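
A sketch of such a call follows, again on invented placeholder data. It converts the team names to numeric codes explicitly, so it does not depend on how a given R version reads strings. The colors, the inches scaling, and the text() labels are choices made here for illustration and are not described in the text above.

```r
# Placeholder data standing in for the contents of teamvelocity.txt.
# The Sprint column name and all numeric values are made up.
teamvelocity <- data.frame(
  Sprint      = c(12.1, 12.2, 12.1, 12.2),
  TotalPoints = c(25, 28, 15, 18),
  TotalDevs   = c(6, 6, 4, 5),
  Team        = c("Gold", "Gold", "Red", "Red")
)

# Convert team names to numeric codes (Gold = 1, Red = 2), because
# the circles argument must be numeric.
team_codes <- as.numeric(factor(teamvelocity$Team))

symbols(teamvelocity$TotalPoints, teamvelocity$TotalDevs,
        circles = team_codes,
        inches  = 0.3,          # largest symbol dimension, in inches
        bg      = "#EEEEEE",    # fill color (arbitrary choice)
        fg      = "#3333CC",    # outline color (arbitrary choice)
        xlab    = "Total Points Completed per Sprint",
        ylab    = "Team Members",
        main    = "Team Velocity vs. Team Size by Team")

# Label each bubble with its team name so the mapping is explicit.
text(teamvelocity$TotalPoints, teamvelocity$TotalDevs,
     labels = teamvelocity$Team, cex = 0.7)
```

Because the radius here only encodes a category, circle size distinguishes the teams but carries no quantitative meaning; mapping team to color instead would be an equally reasonable design.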

The above R code should produce a bubble chart that looks like this:

[Figure: bubble chart]

Conclusion

The intention here is not to imply anything about the sample data (there are in fact times when adding team members lowers a team’s velocity, such as when new members need to be ramped up) but to show how we can apply exploratory data analysis to concepts that we work with every day to gather more insight into our processes.

For more details about R, see the resources below from Safari Books Online.

Safari Books Online has the content you need

R for Everyone: Advanced Analytics and Graphics shows how by using the open source R language, you can build powerful statistical models to answer many of your most challenging questions. R has traditionally been difficult for non-statisticians to learn, and most R books assume far too much knowledge to be of help. R for Everyone is the solution. You’ll download and install R; navigate and use the R environment; master basic program control, data import, and manipulation; and walk through several essential tests. Then, building on this foundation, you’ll construct several complete models, both linear and nonlinear, and use some data mining techniques.
The Art of R Programming is both broad in its coverage of various language constructs and data structures, and deep and comprehensive in explaining them. It provides working examples, and illuminates the R philosophy: a clean functional language with strong vector operation support, and a “do more with less typing” foundation that can make programs an order of magnitude smaller and more expressive.
Pro Data Visualization using R and JavaScript, by Tom Barker, makes the R language approachable, and promotes the idea of data gathering and analysis. You’ll see how to use R to interrogate and analyze your data, and then use the D3 JavaScript library to format and display that data in an elegant, informative, and interactive way. You will learn how to gather data effectively, and also how to understand the philosophy and implementation of each type of chart, so as to be able to represent the results visually.
Pro JavaScript Performance: Monitoring and Visualization, by Tom Barker, gives you the tools to observe and track the performance of your web applications over time from multiple perspectives, so that you are always aware of, and can fix, all aspects of your performance.
Learning R will help you learn how to perform data analysis with the R language and software environment, even if you have little or no programming experience. With the tutorials in this hands-on guide, you’ll learn how to use the essential R tools you need to know to analyze data, including data types and programming concepts.
