Finding and Removing Duplicates
Data sources often contain duplicate values. Depending on how you plan to use the data, the duplicates might cause problems. It’s a good idea to check for duplicates in your data (if they aren’t supposed to be there).
R provides some useful functions for detecting duplicate values.
Suppose that you accidentally included one stock ticker twice (say, GE) when you fetched stock quotes:
> my.tickers.2 <- c("GE","GOOG","AAPL","AXP","GS","GE")
> my.quotes.2 <- get.multiple.quotes(my.tickers.2, from=as.Date("2009-01-01"),
+    to=as.Date("2009-03-31"), interval="m")
One such function is duplicated, which returns a logical vector indicating which elements are duplicates of values with lower indices. Let's apply duplicated to the data frame my.quotes.2:
> duplicated(my.quotes.2)
 [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[12] FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE
As expected, duplicated shows that the last three rows are duplicates of earlier rows. You can use the resulting vector to remove the duplicates:
> my.quotes.unique <- my.quotes.2[!duplicated(my.quotes.2),]
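Because get.multiple.quotes and the fetched stock data aren't reproduced here, the following is a minimal, self-contained sketch of the same technique using a toy data frame (a hypothetical stand-in for my.quotes.2):

```r
# Toy stand-in for my.quotes.2: the last row repeats the first.
toy.quotes <- data.frame(
  symbol = c("GE", "GOOG", "AAPL", "GE"),
  close  = c(12.4, 341.2, 105.1, 12.4)
)

# duplicated() marks each row that repeats an earlier row
duplicated(toy.quotes)
# [1] FALSE FALSE FALSE  TRUE

# Negate the logical vector and use it as a row index to drop duplicates
toy.unique <- toy.quotes[!duplicated(toy.quotes), ]
nrow(toy.unique)
# [1] 3
```

Note that duplicated keeps the first occurrence of each row and flags only the later repeats, so the retained rows preserve their original order.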
Alternatively, you could use the unique function to remove the duplicate values:
> my.quotes.unique <- unique(my.quotes.2)
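The two approaches produce the same result: for data frames, unique is essentially a shorthand for subsetting with !duplicated. A quick sketch confirming this on a toy data frame (a hypothetical stand-in, since my.quotes.2 itself isn't reproduced here):

```r
# Toy data frame with one duplicated row
toy.quotes <- data.frame(
  symbol = c("GE", "GOOG", "AAPL", "GE"),
  close  = c(12.4, 341.2, 105.1, 12.4)
)

# Remove duplicates two ways
via.duplicated <- toy.quotes[!duplicated(toy.quotes), ]
via.unique     <- unique(toy.quotes)

# Both yield the same rows, in the same order
identical(via.duplicated, via.unique)
# [1] TRUE
```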