Exploratory data analysis (EDA) is an approach to examining and describing data to gain insight, discover structure, and detect anomalies and outliers. John Tukey (1915–2000), an American mathematician and statistician who pioneered many of the techniques now used in EDA, stated in his 1977 book Exploratory Data Analysis (Tukey (1977)) that “Exploratory data analysis is detective work—numerical detective work—counting detective work—or graphical detective work.” In this chapter, we will learn many of the basic techniques and tools for gaining insight into data.
Statistical software packages can easily do the calculations needed for the basic plots and numeric summaries of data. We will use the software package R. We will assume that you have gone through the introduction to R available at the web site https://sites.google.com/site/ChiharaHesterberg.
In Chapter 1, we described data on the lengths of flight delays of two major airlines flying from LaGuardia Airport in New York City in 2009. Some basic questions we might ask include: How many of these flights were flown by United Airlines, and how many by American Airlines? How many flights flown by each of these airlines were delayed more than 30 minutes?
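As a rough illustration of how the two counting questions above might be answered in R, the sketch below builds a six-row toy data frame and tabulates it. The rows are invented for illustration and are not values from the FlightDelays data set; the column name Delay (delay in minutes) and the object names flights, carrier_counts, and late_counts are assumptions made for this example.

```r
# Toy stand-in for the flight delay data. These six rows are invented
# for illustration only; they are NOT taken from the real data set.
flights <- data.frame(
  Carrier = factor(c("UA", "AA", "AA", "UA", "AA", "UA")),
  Delay   = c(45, -3, 12, 31, 60, 5)  # assumed column: delay in minutes
)

# How many flights did each airline fly?
carrier_counts <- table(flights$Carrier)
print(carrier_counts)

# How many flights by each airline were delayed more than 30 minutes?
# Cross-tabulate carrier against the logical condition Delay > 30.
late_counts <- table(flights$Carrier, flights$Delay > 30)
print(late_counts)
```

With the real data loaded into a data frame, the same two calls to table() would apply, provided the column names match those assumed here.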
A categorical variable is one that places the observations into groups. For instance, in the FlightDelays data set, Carrier is a categorical variable (we will also call this a factor variable) with two levels, UA and AA. Other data sets might ...