Filtering missing data before or during the actual analysis

Suppose we want to calculate the mean actual elapsed time of the flights:

> mean(hflights$ActualElapsedTime)
[1] NA

The result is NA, of course: as identified previously, this variable contains missing values, and almost every R operation involving NA returns NA. We can overcome this issue in two ways:

> mean(hflights$ActualElapsedTime, na.rm = TRUE)
[1] 129.3237
> mean(na.omit(hflights$ActualElapsedTime))
[1] 129.3237
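
Both calls return the same mean, but they get there differently: na.rm = TRUE drops the missing values inside mean(), while na.omit() first builds a cleaned copy of the vector and records which elements it dropped. The following is a minimal sketch on a made-up vector (x and its values are hypothetical, not taken from hflights), showing both the NA propagation and the two remedies:

```r
# Hypothetical toy vector (minutes); not taken from hflights
x <- c(120, NA, 95, 143, NA)

# NA propagates through arithmetic, aggregation and comparison
x + 1          # 121 NA 96 144 NA
mean(x)        # NA
x > 100        # TRUE NA FALSE TRUE NA
NA | TRUE      # TRUE: one of the few operations that does not return NA
is.na(x)       # FALSE TRUE FALSE FALSE TRUE

# Remedy 1: drop the missing values inside the call
m1 <- mean(x, na.rm = TRUE)

# Remedy 2: build a cleaned copy first, then aggregate
cleaned <- na.omit(x)
m2 <- mean(cleaned)

all.equal(m1, m2)            # TRUE
length(cleaned)              # 3
attr(cleaned, "na.action")   # indices of the dropped elements (2 and 5)
```

The na.action attribute is the practical difference: the cleaned vector remembers which positions were removed, which can be useful when the same rows must be dropped elsewhere.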

Are there any performance differences between the two? Or is there some other way to decide which method to use?

> library(microbenchmark)
> NA.RM   <- function()
+              mean(hflights$ActualElapsedTime, na.rm = TRUE)
> NA.OMIT <- function()
+              mean(na.omit(hflights$ActualElapsedTime))
> microbenchmark(NA.RM(), ...
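
A similar comparison can be tried without the hflights data or the microbenchmark package. The following is only a rough sketch under stated assumptions: it swaps in base R's system.time() for microbenchmark(), and a simulated vector (its size and share of missing values are arbitrary) for the real column. The timings it prints depend on the machine and are not the book's results.

```r
set.seed(42)

# Simulated stand-in for hflights$ActualElapsedTime (assumption: the
# length, distribution and number of NAs are arbitrary choices)
x <- rnorm(1e5, mean = 130, sd = 60)
x[sample(length(x), 1500)] <- NA

NA.RM   <- function() mean(x, na.rm = TRUE)
NA.OMIT <- function() mean(na.omit(x))

# Both approaches must agree before it makes sense to time them
stopifnot(isTRUE(all.equal(NA.RM(), NA.OMIT())))

# Repeat each call so the total is measurable; results vary by machine
system.time(for (i in 1:200) NA.RM())
system.time(for (i in 1:200) NA.OMIT())
```

system.time() reports only a single total per run, so it is much coarser than the per-call distribution that microbenchmark() provides; treat it as a quick sanity check rather than a proper benchmark.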


Mastering Data Analysis with R by Gergely Daroczi
