A guest blog post by Tom Barker, a software engineer, an engineering manager, a professor and an author who can be reached at @tomjbarker.

If you haven’t heard of R before, R is both an environment and a language for running statistical computations and producing data graphics. Both anecdotally and per Google Trends, R is the language and tool most closely associated with creating data visualizations. The Google Trends chart below shows how interest in R closely tracks interest in data visualization.

fig1

Ross Ihaka and Robert Gentleman created R in 1993 while at the University of Auckland. The R environment is the runtime environment that you develop and run R in, and the R language is the programming language that you develop in. R is the successor to the S language, a statistical programming language that came out of Bell Labs in 1976.

When talking about data visualization, R is the de facto standard for creating your own visualizations. Compare this to, say, Splunk, which may be the de facto standard for log introspection but only includes data visualization as a secondary capability.

R is free to download from http://www.r-project.org/. Here is a screenshot of the R environment:

fig2

R is lightweight, self-contained, and easy to use – but there are some points where R differs from most other languages. It is these points that make the language seem unapproachable to some developers, at least at first. In this post I will introduce you to the language and cover some of these points of difference.

When I use R, my general workflow is to ingest data, analyze it, and then visualize the results.

So let’s take a look at how to accomplish each step.

Ingest Data

When reading data into R, we generally will use the read.table() or read.csv() function. This opens a file and returns the content of that file.
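The original listing is not reproduced here, so the following is only a minimal sketch of such a call. The file contents and the bugFile variable are hypothetical, and the data is written to a temporary file so the example runs on its own; only the name bugData comes from the text.

```r
# Hypothetical sample data, written to a temporary file so the sketch is self-contained.
bugFile <- tempfile(fileext = ".csv")
writeLines(c(
  "Date,Feature,Severity",
  "2013-01-07,Login,3",
  "2013-01-08,Search,1",
  "2013-01-09,Checkout,2"
), bugFile)

# read.csv() reads comma-separated data and returns it as a data frame.
bugData <- read.csv(bugFile)
print(bugData)
```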

In the above example we store the contents of the file in the variable bugData. Notice that we use the <- operator in R instead of the = like in most other languages.

There are certain parameters that we can pass in to read.table(). Among the most often used of these parameters are sep, header, row.names, and col.names.

The sep parameter specifies what character is used to separate columns in the file:

The header parameter indicates if the first row contains header information or content:
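As a rough illustration of both parameters, the sketch below reads the same pipe-delimited file twice, once with and once without a header. The file contents and variable names are made up for the example.

```r
# Hypothetical pipe-delimited data whose first row holds the column names.
bugFile <- tempfile(fileext = ".txt")
writeLines(c(
  "Date|Feature|Severity",
  "2013-01-07|Login|3",
  "2013-01-08|Search|1"
), bugFile)

# sep = "|" names the column delimiter; header = TRUE treats row one as column names.
withHeader <- read.table(bugFile, sep = "|", header = TRUE)

# With header = FALSE the first row is read as data and columns get default names.
noHeader <- read.table(bugFile, sep = "|", header = FALSE)

print(names(withHeader))
print(names(noHeader))
print(nrow(noHeader))
```

Comparing the two results shows why header matters: getting it wrong either loses the column names or turns them into a row of data.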

The row.names parameter allows us to specify identifiers for the rows of our data. This parameter can accept either a vector of strings to use as row names, or the name (or number) of the column whose values should be used as row names. In the example below we are using the data in the Feature column to serve as row names.
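A minimal sketch of that idea follows. Only the Feature column name comes from the text; the Open and Closed columns and their values are hypothetical.

```r
# Hypothetical data: one row per feature, with made-up bug counts.
bugFile <- tempfile(fileext = ".csv")
writeLines(c(
  "Feature,Open,Closed",
  "Login,4,10",
  "Search,2,7",
  "Checkout,6,3"
), bugFile)

# row.names = "Feature" uses that column's values as row names
# and removes it from the regular columns.
bugData <- read.csv(bugFile, row.names = "Feature")

print(rownames(bugData))
# Rows can now be looked up by name instead of by number.
print(bugData["Search", "Open"])
```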

Remember that by default, R uses incrementing numbers as row IDs. Keep in mind that the row names need to be unique for each row.

Finally, the col.names parameter allows us to set the column names for our data set. Just like row.names, col.names accepts a vector of strings to use as the column names.
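One way this might look, assuming a headerless file; the data and the columnNames variable are hypothetical:

```r
# Hypothetical data file with no header row.
bugFile <- tempfile(fileext = ".csv")
writeLines(c(
  "Login,4,10",
  "Search,2,7"
), bugFile)

# c() combines its arguments into a vector, here three column names.
columnNames <- c("Feature", "Open", "Closed")

# header = FALSE because the file has no header; col.names supplies the names instead.
bugData <- read.csv(bugFile, header = FALSE, col.names = columnNames)
print(names(bugData))
```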

Notice in the example above that we created the vector of column names using the c() function. The c() function accepts any number of arguments, here N strings, and combines them into a vector of length N that contains all of the passed-in values.

In the above examples we created variables, including vectors and data frames, but we didn’t talk much about what they are. Let’s take a step back and look at the data types that R supports and how to use them. This is one of the areas where R differs from most other languages, in that data types and the naming of data types are more reflective of mathematical structures than they are of other programming languages.

Data types in R are called modes, and can be either numeric, character, logical, complex, raw or list. We can use the mode() function to check the mode of a variable.

Character and numeric modes correspond to string and number data types. Logical modes are Boolean values. The complex mode is for complex numbers. The raw mode is used to store raw byte data.
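To make the modes concrete, here is a small sketch that calls mode() on one value of each kind. The values and the names in the vector are arbitrary.

```r
# Collect the mode of one example value per data type.
modes <- c(
  number   = mode(42),
  string   = mode("bug"),
  boolean  = mode(TRUE),
  complex  = mode(3+2i),
  bytes    = mode(as.raw(255)),
  combined = mode(list(1, "a"))
)
print(modes)
```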

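Beyond single values, R organizes data into structures such as vectors, matrices, and data frames. Of these, data frames report the list mode, while vectors and matrices report the mode of the elements they hold. Vectors are single dimensional arrays that can only hold a single mode or data type at a time.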

Data frames are like arrays in most other languages – they are containers that hold different types of data, referenced by index. The main difference between data frames and arrays is that data frames see the data that they contain as rows and columns, and combinations of the two. This just means that individual elements in a data frame are referenced by their column and row combination, so that df[1,1] points to the value in the first column and the first row of the data frame df, but df[1] returns the entire first column in the data frame df.
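The indexing rules above can be tried out on a small data frame. This is just a sketch; the data frame df and its contents are invented for illustration.

```r
# A small hypothetical data frame with one character and one numeric column.
df <- data.frame(
  Feature = c("Login", "Search", "Checkout"),
  Open    = c(4, 2, 6),
  stringsAsFactors = FALSE
)

firstCell   <- df[1, 1]  # value in the first row and first column
firstColumn <- df[1]     # the entire first column, returned as a one-column data frame
secondRow   <- df[2, ]   # the entire second row
openCounts  <- df$Open   # a column by name, returned as a plain vector

print(firstCell)
print(firstColumn)
print(secondRow)
print(openCounts)
```

Note that df[1] keeps the data frame wrapper, whereas df$Open and df[[1]] unwrap the column into a vector.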

Matrices are just like data frames, except that while data frames can hold different data types, matrices can only hold one type of data.
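A quick sketch of that difference, using made-up values: when a matrix is given mixed input it coerces everything to a single type, while a data frame keeps a type per column.

```r
# A numeric matrix: every element shares one type.
counts <- matrix(c(4, 2, 6, 10, 7, 3), nrow = 3,
                 dimnames = list(c("Login", "Search", "Checkout"),
                                 c("Open", "Closed")))
countsMode <- mode(counts)

# Mixing numbers and strings in a matrix coerces everything to character.
mixed <- matrix(c(1, "a", 2, "b"), nrow = 2)
mixedMode <- mode(mixed)

# A data frame keeps a separate type for each column.
df <- data.frame(id = c(1, 2), label = c("a", "b"), stringsAsFactors = FALSE)
columnModes <- sapply(df, mode)

print(countsMode)
print(mixedMode)
print(columnModes)
```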

Data Analysis

We now know how to read in data and, generally, how to store it. Next, let’s look at some statistical analysis tools that come with R.

The first function in R that we’ll look at is the summary() function. The summary() function accepts an object and returns the following key descriptive metrics, grouped by column:

  • minimum value
  • first quartile
  • median (the second quartile) for numbers, or frequency counts for factor columns
  • mean
  • third quartile
  • maximum value

This allows us to see the range of values at a glance and get a high-level idea of the breakdown. See below, where we pull a summary for Johnson & Johnson earnings data for 1960 to 1980.
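The original listing is not shown, but a sketch along these lines should work, assuming the data in question is the JohnsonJohnson time series that ships with R's built-in datasets package (quarterly earnings per share). The earningsSummary name is just for illustration.

```r
# JohnsonJohnson is a built-in quarterly time series, so no file needs to be loaded.
earningsSummary <- summary(JohnsonJohnson)

# Prints the minimum, quartiles, median, mean, and maximum in one line.
print(earningsSummary)
```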

What if we were curious about the standard deviation across the data set? R provides an sd() function that calculates it for us:
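A minimal sketch, again assuming the built-in JohnsonJohnson data set stands in for the earnings data; the variable names are illustrative.

```r
# sd() returns the sample standard deviation of the values.
earningsSd <- sd(JohnsonJohnson)
print(earningsSd)

# It is the square root of the sample variance returned by var().
earningsVar <- var(as.numeric(JohnsonJohnson))
print(sqrt(earningsVar))
```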

This is just the first taste of statistical analysis tools available in R. There are books and books about just this subject, as specific as you could want to get.

Data Visualization

Finally let’s look at how we can visualize our data in R.

R provides the plot() function, which can be used to create time series charts. We can either pass in a complete data structure, like in the example below (this works when R has a plot method for that object’s class), or we can pass in two vectors to serve as the x and y values of the chart.
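Both calling styles can be sketched as follows, assuming the built-in JohnsonJohnson time series as the data. The titles, axis labels, and variable names are illustrative choices.

```r
# Style 1: pass the whole object. JohnsonJohnson has class "ts",
# and R knows how to plot time series directly.
plot(JohnsonJohnson, main = "Quarterly earnings per share", ylab = "Dollars")

# Style 2: pass separate x and y vectors.
quarter  <- as.numeric(time(JohnsonJohnson))  # time points as decimal years
earnings <- as.numeric(JohnsonJohnson)        # the earnings values
plot(quarter, earnings, type = "l", xlab = "Year", ylab = "Dollars")
```

When run non-interactively (for example with Rscript), R typically writes the charts to a PDF file instead of opening a window.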

To read more about the plot function type ?plot in your R console.

fig3

R also provides a barplot() function to create bar charts. The barplot function accepts either a matrix or a vector value as the data structure. If you want to use a data frame, you can usually convert it to a matrix as done in the example below.
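The conversion might look like the sketch below. The bugs data frame and its counts are hypothetical, and transposing with t() is just one layout choice.

```r
# Hypothetical bug counts per feature, stored as a data frame.
bugs <- data.frame(
  Open   = c(4, 2, 6),
  Closed = c(10, 7, 3),
  row.names = c("Login", "Search", "Checkout")
)

# barplot() does not accept a data frame, so convert it to a matrix first.
bugMatrix <- as.matrix(bugs)

# t() flips rows and columns so each feature becomes one group of bars;
# beside = TRUE draws Open and Closed side by side instead of stacked.
barplot(t(bugMatrix), beside = TRUE, legend.text = TRUE,
        main = "Bugs by feature", ylab = "Count")
```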

fig4

For more information on the barplot function, type ?barplot into your R console.

R provides the hist() function to create histograms. The hist() function accepts a vector of values.
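As a small sketch, again assuming the built-in JohnsonJohnson data as input (the title, label, and variable names are illustrative):

```r
# hist() expects a plain numeric vector, so drop the time series attributes.
earnings <- as.numeric(JohnsonJohnson)

# hist() draws the chart and also returns the bins it computed.
earningsHist <- hist(earnings, main = "Distribution of quarterly earnings",
                     xlab = "Earnings per share (dollars)")

# The counts per bin add up to the number of observations.
print(earningsHist$counts)
```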

fig5

To read more about the hist() function type ?hist into your R console.

These are just some of the native plotting functions in R. There are also a staggering number of independently developed open-source libraries that you can load in to extend the types of charts that can be produced, from maps to heatmaps to parallel coordinate charts (see below).

fig6

I hope that you now see R as being a little more approachable, and that I’ve conveyed at least a little of the potential and depth that R has and makes available.

For more details about R with data visualization, see the resources below from Safari Books Online.

Safari Books Online has the content you need

Pro Data Visualization using R and JavaScript, by Tom Barker, makes the R language approachable and promotes the idea of data gathering and analysis. You’ll see how to use R to interrogate and analyze your data, and then use the D3 JavaScript library to format and display that data in an elegant, informative, and interactive way. You will learn how to gather data effectively, and also how to understand the philosophy and implementation of each type of chart, so as to be able to represent the results visually.
Pro JavaScript Performance: Monitoring and Visualization, by Tom Barker, gives you the tools to observe and track the performance of your web applications over time from multiple perspectives, so that you are always aware of, and can fix, all aspects of your performance.
Learning R will help you learn how to perform data analysis with the R language and software environment, even if you have little or no programming experience. With the tutorials in this hands-on guide, you’ll learn how to use the essential R tools you need to know to analyze data, including data types and programming concepts.

About the author

Tom Barker is a software engineer, an engineering manager, a professor and an author. Currently he is the Senior Manager of Web Development at Comcast, and an Adjunct Professor at Philadelphia University. He has authored Pro JavaScript Performance: Monitoring and Visualization, Pro Data Visualization using R and JavaScript, and Technical Management: A Primer, and can be reached at @tomjbarker.