Exploring the Data

There are many great tools for data analysis. Some of the most commonly used are compared in Table 17-2.

Table 17-2. Comparison of data analysis packages

| Name | Advantages | Disadvantages | Open source? | Typical users |
| --- | --- | --- | --- | --- |
| R | Library support; visualization | Steep learning curve | Yes | Statistics |
| Matlab | Elegant matrix support; visualization | Expensive; incomplete statistics support | No | Engineering |
| SciPy/NumPy/Matplotlib | Python: flexible and general-purpose programming language | Components poorly integrated | Yes | Engineering |
| Excel | Easy; visual; flexible | Large data sets; weak numeric and programming support | No | Business |
| SAS | Very large data sets | Very baroque; hardest to learn | No | Business |
| SPSS, Stata | Easy statistical analysis | Inflexible | No | Science (bio and social) |

We like to use R, an open source statistical and visualization programming environment with a vibrant and growing development community. It has emerged as a de facto standard among statisticians. For exploratory data analysis, we prefer it to the other options because of its graphing libraries, its convenient indexing notation, and its amazing array of statistically sophisticated, community-maintained packages. You can read about it and download it at http://www.r-project.org; see also the references at the end of this chapter.
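To give a feel for the indexing notation mentioned above, here is a minimal sketch in base R. The data frame `scores` and its columns `rater` and `score` are made up for illustration; they are not taken from the data set used in this chapter.

```r
# Hypothetical data: column names and values are invented for illustration.
scores <- data.frame(
  rater = c("a", "b", "c", "d"),
  score = c(2, 5, 3, 4),
  stringsAsFactors = FALSE
)

scores$score                          # one column, selected by name
scores[2, ]                           # second row, all columns
high <- scores[scores$score > 3, ]    # rows selected by a logical condition
high$rater                            # "b" "d"
mean(scores$score[scores$rater != "a"])  # 4
```

The same bracket syntax handles positions, names, and logical conditions, which is much of what makes quick slicing during exploration feel lightweight.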

R provides many excellent tools for looking at what's in the data. From its interactive interpreter:

> # Load the data...
> data = read.delim("http://data.doloreslabs.com/face_scores.tsv", sep="\t")
> # ...and plot.
> plot(data)

Given a basic table of ...
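Beyond `plot`, a handful of other base R functions are useful for a first look at a freshly loaded table. The sketch below assumes a small synthetic data frame, `faces`, as a stand-in; its column names and values are invented and do not reflect the contents of the file loaded above.

```r
# Hypothetical stand-in for a downloaded table; columns and values are invented.
set.seed(1)
faces <- data.frame(
  age_guess = round(rnorm(50, mean = 30, sd = 8)),
  score     = sample(1:5, 50, replace = TRUE),
  seconds   = round(runif(50, min = 1, max = 20), 1)
)

dim(faces)       # number of rows and columns
head(faces)      # first six rows
str(faces)       # column types at a glance
summary(faces)   # min, quartiles, mean, and max for each column

# For a data frame with more than two columns, plot() draws a
# scatterplot matrix; hist() shows the distribution of one column.
plot(faces)
hist(faces$score)
```

Run interactively, the text summaries print to the console and the two plotting calls open a graphics window; in a non-interactive session R writes the plots to a file instead.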
