A guest post by Tom Barker, a software engineer, an engineering manager, a professor and an author. Currently he is the Senior Manager of Web Development at Comcast, and an Adjunct Professor at Philadelphia University. He has authored Pro JavaScript Performance: Monitoring and Visualization, Pro Data Visualization using R and JavaScript, and Technical Management: A Primer, and can be reached at @tomjbarker.

In a previous article I talked at length about visualization and R. In the world of data visualization, R is one of the standard bearers. Its sole purpose is to perform statistical analysis and the visualization of data. It’s completely self-contained, meaning it is not just a language, but also the environment to run the language. It is also free, and the R environment is open source.

I recently gave a talk on data visualization at Philly.rb, and one of the questions afterward was how someone could get started creating data visualizations. My response was to simply download R, read in some data they already had, such as their Apache access logs, and start analyzing and visualizing interesting metrics from that. They could then chart their error rates, their top errors, the geographic locations their users were clustered around, and things of that nature. The immediate response from the audience was: how do we ingest data into R? I realized this topic needed more fleshing out, so in this article I will focus on reading and parsing external data in R.

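There are several ways to import data into R. To read in a standard comma-separated file, or an otherwise delimited flat file, you use the read family of functions. It comes in several flavors: read.table() is the general-purpose function, while read.csv(), read.csv2(), read.delim(), and read.delim2() are variants of it with different defaults for the separator, the decimal mark, and the header.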

Imagine that your data file looks like the following:
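As a stand-in, the sketch below writes a hypothetical file with three lines of five comma-separated values each and echoes it back. The file name and every value are invented for illustration; any file of the same shape would do.

```r
# Hypothetical sample data: 3 lines, 5 comma-separated values each.
sampleLines <- c(
  "1,18,24,3,0.5",
  "2,22,31,5,0.7",
  "3,19,27,4,0.6"
)

# Write to R's temporary directory so the example leaves nothing behind.
dataPath <- file.path(tempdir(), "sample_data.csv")
writeLines(sampleLines, dataPath)

# Show the raw file contents, one line per row.
cat(readLines(dataPath), sep = "\n")
```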

You can read in this file using the read.table() function. The read.table() function has a parameter named sep that allows you to specify the character to use as the column delimiter. Let’s set that to be a comma:
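A minimal sketch of that call, assuming a hypothetical three-line file (the values and the variable names dataPath and temp are invented here):

```r
# Hypothetical stand-in for the data file described above.
dataPath <- tempfile(fileext = ".csv")
writeLines(c("1,18,24,3,0.5", "2,22,31,5,0.7", "3,19,27,4,0.6"), dataPath)

# sep = "," tells read.table() to split each line on commas.
temp <- read.table(dataPath, sep = ",")

print(temp)       # columns are auto-named V1 through V5
print(dim(temp))  # number of rows, then number of columns
```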

If you print the value of your file to the screen, you can see that read.table() created a data frame with 3 rows and 5 columns: one row for each line in the flat file, and one column for each comma-separated value in the flat file.

If your data file contained header information, you could pass TRUE for the header parameter of read.table(), so that the first row of the file is used as column names instead of a data row.
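For instance, assuming a hypothetical file whose first line holds the column names (the names and values below are invented, and the text argument stands in for a file path):

```r
# Hypothetical data with a header line followed by two data lines.
csvText <- "id,low,high,count,ratio
1,18,24,3,0.5
2,22,31,5,0.7"

# header = TRUE turns the first line into column names.
withHeader <- read.table(text = csvText, sep = ",", header = TRUE)

print(colnames(withHeader))
print(nrow(withHeader))  # the header line is not counted as data
```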

Note that read.table() also accepts a row.names parameter that allows you to specify identifiers for your rows. By default R uses incrementing numbers as row IDs. Keep in mind that the row names need to be unique for each row.
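A short sketch of row names in use, with hypothetical data and labels (passing a character vector is one of the forms row.names accepts):

```r
# Hypothetical three-line data set, supplied inline instead of from a file.
csvText <- "1,18,24,3,0.5\n2,22,31,5,0.7\n3,19,27,4,0.6"

# One unique label per row.
labeled <- read.table(
  text = csvText,
  sep = ",",
  row.names = c("first", "second", "third")
)

print(labeled)
print(labeled["second", ])  # rows can now be looked up by name
```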

Now, what if you want to have column names, but the first line of your file is not header information? You can use the col.names parameter to specify a vector that you can use as column names.

First you’ll create a vector named columnNames that will hold the strings to use as the column names:
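The names themselves depend on your data; the five used here are hypothetical placeholders, one per column:

```r
# One name per column in the file, in order.
columnNames <- c("id", "low", "high", "count", "ratio")
print(columnNames)
```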

Then you’ll read in the data, passing in your vector to the col.names parameter:
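Put together, a minimal sketch looks like this (the data and the column names are hypothetical, and the text argument stands in for a file path):

```r
# Hypothetical column names, one per column.
columnNames <- c("id", "low", "high", "count", "ratio")

# Hypothetical data with no header line.
csvText <- "1,18,24,3,0.5\n2,22,31,5,0.7\n3,19,27,4,0.6"

# col.names replaces the default V1, V2, ... names.
named <- read.table(text = csvText, sep = ",", col.names = columnNames)
print(named)
```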

Now that we’ve covered how to read in CSV files, let’s look at the answer to the original question that sparked this article. If you want to read in your Apache access logs you can simply do the following:

You can read the file, often located at /var/log/apache2/access_log (the exact path varies by platform), into a variable named log.
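The sketch below assumes the log uses Apache's combined log format and, so that it runs anywhere, reads two hypothetical log lines from a temporary file rather than the real path. The quote and comment.char settings are a defensive choice on my part, not something the log format requires: they keep apostrophes and # characters in URLs or user agents from being misread.

```r
# Two hypothetical lines in Apache's combined log format.
sampleLog <- c(
  '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326 "http://www.example.com/start.html" "Mozilla/4.08 [en] (Win98; I ;Nav)"',
  '192.168.0.7 - - [10/Oct/2000:13:56:02 -0700] "GET /missing.html HTTP/1.0" 404 209 "-" "Mozilla/4.08 [en] (Win98; I ;Nav)"'
)
logPath <- tempfile(fileext = ".log")
writeLines(sampleLog, logPath)
# On a real server, logPath would instead point at the access log itself.

# Fields are whitespace-delimited; double-quoted strings stay together.
log <- read.table(
  logPath,
  quote = "\"",        # treat only double quotes as quoting characters
  comment.char = "",   # do not treat # as the start of a comment
  stringsAsFactors = FALSE
)

print(dim(log))
print(log$V7)  # HTTP status codes in this layout
```

Because the bracketed timestamp contains a space, it is split across two columns, which is why each line yields ten columns here.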

If you want to see what the column names are in the data set you can call the colnames() function, passing in your log data frame:
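Because an access log has no header line, read.table() generates the names itself. A small sketch, using one hypothetical line in Apache's common log format in place of a real log:

```r
# One hypothetical access log line, supplied inline instead of from a file.
log <- read.table(
  text = '127.0.0.1 - - [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326',
  quote = "\""
)

# Auto-generated names: V1, V2, and so on, one per column.
print(colnames(log))
```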

That covers reading in CSV files, but what if your data is JSON? To read in JSON data you can use an R package called rjson. This will allow you to read in and parse JSON with the fromJSON() function.

Say you have a JSON data structure formatted like so:
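As a stand-in, the following sketch writes a hypothetical JSON document to a temporary file and prints it. Every value is invented; only the shape matters: a results array of objects with body, byline, date, title, and url properties.

```r
# Hypothetical JSON document with a top-level "results" array.
jsonText <- '{
  "results": [
    {
      "body": "Text of the first article",
      "byline": "By A. Writer",
      "date": "2014-01-05",
      "title": "First title",
      "url": "http://example.com/first"
    },
    {
      "body": "Text of the second article",
      "byline": "By B. Author",
      "date": "2014-01-06",
      "title": "Second title",
      "url": "http://example.com/second"
    }
  ]
}'

# Save it so it can be read back like any other data file.
jsonPath <- tempfile(fileext = ".json")
writeLines(jsonText, jsonPath)
cat(readLines(jsonPath), sep = "\n")
```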

First you need to make sure that you have rjson installed. You can install rjson by calling the install.packages() function:
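The plain call is install.packages("rjson"). The guard around it below is an optional convenience, not part of the package's instructions, that skips the download when rjson is already present. It assumes network access and a configured CRAN mirror.

```r
# Install rjson from CRAN only if it is not already available.
if (!requireNamespace("rjson", quietly = TRUE)) {
  install.packages("rjson")
}
```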

To use rjson you first need to load it into your current R session using the library() function:
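Assuming rjson is already installed, loading it looks like this:

```r
# Attach rjson to the current session.
library(rjson)

# Confirm that fromJSON() is now visible.
print(exists("fromJSON"))
```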

Next, you can read your JSON file using the fromJSON function:
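A minimal sketch, assuming rjson is installed and using a hypothetical JSON file written on the spot (the file contents and the variable names jsonPath and data are invented for illustration):

```r
library(rjson)

# Hypothetical JSON file with a "results" array of two objects.
jsonPath <- tempfile(fileext = ".json")
writeLines(
  '{"results": [
     {"body": "First body", "byline": "By A. Writer", "date": "2014-01-05",
      "title": "First title", "url": "http://example.com/first"},
     {"body": "Second body", "byline": "By B. Author", "date": "2014-01-06",
      "title": "Second title", "url": "http://example.com/second"}
   ]}',
  jsonPath
)

# The file argument tells fromJSON() to read from disk rather than a string.
data <- fromJSON(file = jsonPath)

print(names(data))           # top-level keys
print(length(data$results))  # number of objects in the results array
```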

The structure of the object that is created looks like the following:
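One way to inspect it is str(). The sketch below does not call rjson; it builds by hand a nested list of the shape described in this article (a results list of named lists), with hypothetical values, so that it runs on its own.

```r
# Hand-built stand-in for the parsed JSON: a list holding a "results"
# list, where each entry is a named list of five string fields.
data <- list(
  results = list(
    list(body = "First body", byline = "By A. Writer", date = "2014-01-05",
         title = "First title", url = "http://example.com/first"),
    list(body = "Second body", byline = "By B. Author", date = "2014-01-06",
         title = "Second title", url = "http://example.com/second")
  )
)

# str() prints a compact outline of the nested structure.
str(data)

# Reaching one value means indexing through every level.
print(data$results[[2]]$title)
```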

You can see that it mirrors the JSON hierarchy of having a results array that contains objects with body, byline, date, title, and url properties. But if you want to access individual cells, you need to reach into the results array, which can get cumbersome.

You can, however, restructure your data to resemble a more traditional data frame in R. You simply need to create a new data frame, iterate through the results array, pull out each value row by row, and insert new rows into your new data frame:
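One possible version of that loop is sketched below. It is not necessarily the original listing: the name grid is hypothetical, and the input is a hand-built list with invented values that mimics the parsed JSON, so the example runs without rjson.

```r
# Hand-built stand-in for the parsed JSON (hypothetical values).
data <- list(
  results = list(
    list(body = "First body", byline = "By A. Writer", date = "2014-01-05",
         title = "First title", url = "http://example.com/first"),
    list(body = "Second body", byline = "By B. Author", date = "2014-01-06",
         title = "Second title", url = "http://example.com/second")
  )
)

# Start with an empty data frame that has the target columns.
grid <- data.frame(
  body = character(), byline = character(), date = character(),
  title = character(), url = character(),
  stringsAsFactors = FALSE
)

# Append one row per entry in the results list.
for (item in data$results) {
  grid <- rbind(grid, data.frame(
    body = item$body, byline = item$byline, date = item$date,
    title = item$title, url = item$url,
    stringsAsFactors = FALSE
  ))
}

print(grid)
```

Growing a data frame with rbind() inside a loop is easy to follow but slow for large inputs; building the rows with lapply() and combining them once with do.call(rbind, ...) is a common alternative.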

This produces a data frame that looks like the following:

You can access individual cell data the same way you would any other data frame:
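For example, with a small hypothetical data frame of the same kind (the name grid and its values are invented for illustration):

```r
# Hypothetical two-row data frame of article metadata.
grid <- data.frame(
  title = c("First title", "Second title"),
  url = c("http://example.com/first", "http://example.com/second"),
  stringsAsFactors = FALSE
)

print(grid[1, "title"])  # row 1 of the title column
print(grid$url[2])       # second value of the url column
print(grid[2, ])         # the entire second row
```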

Hopefully this gives you a taste of how easy it is to get started reading data into R. And this is just the beginning. I haven’t even talked about reading and parsing XML or accessing databases from R.

I hope that you now see R as being a little more approachable, and that I’ve conveyed at least a little of the potential and depth that R has and makes available to you.

For more details about R with data visualization, see the resources below from Safari Books Online.

Safari Books Online has the content you need

Pro Data Visualization using R and JavaScript, by Tom Barker, makes the R language approachable, and promotes the idea of data gathering and analysis. You’ll see how to use R to interrogate and analyze your data, and then use the D3 JavaScript library to format and display that data in an elegant, informative, and interactive way. You will learn how to gather data effectively, and also how to understand the philosophy and implementation of each type of chart, so as to be able to represent the results visually.

Pro JavaScript Performance: Monitoring and Visualization, by Tom Barker, gives you the tools to observe and track the performance of your web applications over time from multiple perspectives, so that you are always aware of, and can fix, all aspects of your performance.

Learning R will help you learn how to perform data analysis with the R language and software environment, even if you have little or no programming experience. With the tutorials in this hands-on guide, you’ll learn how to use the essential R tools you need to know to analyze data, including data types and programming concepts.
