Filtering missing data before or during the actual analysis
Let's suppose we want to calculate the mean of the actual length of flights:
> mean(hflights$ActualElapsedTime)
[1] NA
The result is NA, of course: as identified previously, this variable contains missing values, and almost every R operation involving NA results in NA. So let's overcome this issue as follows:
> mean(hflights$ActualElapsedTime, na.rm = TRUE)
[1] 129.3237
> mean(na.omit(hflights$ActualElapsedTime))
[1] 129.3237
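To see the same behavior without loading the dataset, here is a minimal sketch on a made-up vector. The vector x and its values are purely illustrative and are not taken from hflights.

```r
# Hypothetical toy vector with one missing value (not from hflights)
x <- c(2, 4, NA, 6)

mean(x)                # NA: the missing value propagates through mean()
mean(x, na.rm = TRUE)  # 4: mean() drops the NA internally
mean(na.omit(x))       # 4: the NA is removed before mean() is called

# na.omit() returns the shortened vector and records what it dropped
length(na.omit(x))                # 3
attr(na.omit(x), "na.action")     # position of the removed element
```

Both approaches give the same answer here. The difference is where the filtering happens: na.rm = TRUE is an argument handled inside mean(), while na.omit() builds a cleaned copy of the data that can be reused by functions that have no na.rm argument.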
Any performance issues there? Or other means of deciding which method to use?
> library(microbenchmark)
> NA.RM <- function()
+     mean(hflights$ActualElapsedTime, na.rm = TRUE)
> NA.OMIT <- function()
+     mean(na.omit(hflights$ActualElapsedTime))
> microbenchmark(NA.RM(), ...
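A comparison along the same lines can be sketched without the hflights data or any extra package. The example below is only a rough illustration under stated assumptions: x is a synthetic stand-in for the elapsed-time column, the helper names na_rm and na_omit are hypothetical, and base R's system.time() is swapped in for microbenchmark, so the timings are coarse and will vary by machine.

```r
# Synthetic stand-in for the elapsed-time column (assumption: not real data)
set.seed(42)
x <- rnorm(1e5, mean = 130, sd = 60)
x[sample(length(x), 3000)] <- NA   # inject 3000 missing values

# Hypothetical helpers wrapping the two approaches
na_rm   <- function() mean(x, na.rm = TRUE)
na_omit <- function() mean(na.omit(x))

# Both approaches should agree on the result
stopifnot(isTRUE(all.equal(na_rm(), na_omit())))

# Coarse timing over repeated calls, using base R only
reps   <- 200
t_rm   <- system.time(for (i in seq_len(reps)) na_rm())[["elapsed"]]
t_omit <- system.time(for (i in seq_len(reps)) na_omit())[["elapsed"]]
c(na.rm = t_rm, na.omit = t_omit)
```

Checking that the two results match before timing them is worth the extra line: a speed comparison is only meaningful if both functions compute the same thing.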