Summary Statistics

R includes a variety of functions for calculating summary statistics.

To calculate the mean of a vector, use the mean function. To find the smallest or largest value, use the min or max function. As an example, let’s use the dow30 data set that we created in “An extended example.” This data set is also available in the nutshell package:

> library(nutshell)
> data(dow30)
> mean(dow30$Open)
[1] 36.24574
> min(dow30$Open)
[1] 0.99
> max(dow30$Open)
[1] 122.45

For each of these functions, the argument na.rm specifies how NA values are treated. By default, if any value in the vector is NA, then the value NA is returned. Specify na.rm=TRUE to ignore missing values:

> mean(c(1,2,3,4,5,NA))
[1] NA
> mean(c(1,2,3,4,5,NA),na.rm=TRUE)
[1] 3
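
The same argument applies to min, max, and range. The following is a minimal sketch using a small made-up vector x (an assumption for illustration, not part of the dow30 data):

```r
# x is a hypothetical vector containing one missing value
x <- c(1, 2, 3, 4, 5, NA)

min_default <- min(x)                  # NA, because x contains an NA
min_clean   <- min(x, na.rm = TRUE)    # 1
max_clean   <- max(x, na.rm = TRUE)    # 5
rng_clean   <- range(x, na.rm = TRUE)  # 1 5
```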

Optionally, you can reduce the influence of outliers when using the mean function. To do this, use the trim argument to specify the fraction of observations (between 0 and 0.5) to drop from each end of the sorted vector before the mean is computed:

> mean(c(-1,0:100,2000))
[1] 68.4369
> mean(c(-1,0:100,2000),trim=0.1)
[1] 50
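
To see what trim=0.1 does here, the trimmed mean can be reproduced by hand. This is only a sketch of the idea, not R's own implementation; the names n, k, kept, and manual are hypothetical:

```r
x <- c(-1, 0:100, 2000)
trim <- 0.1

n <- length(x)          # 103 observations
k <- floor(n * trim)    # 10 observations dropped from each end

# Sort, then keep everything except the k smallest and k largest values
kept <- sort(x)[(k + 1):(n - k)]   # the values 9 through 91

manual  <- mean(kept)              # mean of the remaining 83 values
builtin <- mean(x, trim = trim)    # should agree with manual
```

Because the outlier 2000 (and the low value -1) fall inside the trimmed tails, they no longer affect the result.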

To calculate the minimum and maximum at the same time, use the range function. This returns a vector of length two containing the minimum and maximum values, in that order:

> range(dow30$Open)
[1]   0.99 122.45
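
Because range returns an ordinary numeric vector, its elements can be indexed or passed to other functions. A short sketch, using a hypothetical vector x rather than the dow30 data:

```r
# x is a made-up vector for illustration
x <- c(4.2, 0.99, 122.45, 30)

r <- range(x)      # c(min(x), max(x))
lowest  <- r[1]    # the minimum
highest <- r[2]    # the maximum

# diff() on the range gives the spread (maximum minus minimum)
spread <- diff(r)
```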

Another useful function is quantile. This function can be used to return the values at different percentiles (specified by the probs argument):

> quantile(dow30$Open, probs=c(0,0.25,0.5,0.75,1.0))
     0%     25%     50%     75%    100% 
  0.990  19.655  30.155  51.680 122.450
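
The probs argument accepts any probabilities between 0 and 1, not just quartiles. As a sketch, assuming a simple vector x of the integers 0 through 100 (chosen so the results are easy to check):

```r
# x is a hypothetical vector: the integers 0 through 100
x <- 0:100

# 10th, 50th, and 90th percentiles; the result is a named vector
p <- quantile(x, probs = c(0.1, 0.5, 0.9))

p_names  <- names(p)    # "10%" "50%" "90%"
p_values <- unname(p)   # 10 50 90

# The 50th percentile matches the median
same_as_median <- isTRUE(all.equal(p_values[2], median(x)))
```

Use unname when you want the plain numbers without the percentile labels.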

You can return this specific set of values (minimum, 25th percentile, median, 75th percentile, ...
