Finding and Removing Duplicates

Data sources often contain duplicate values. Depending on how you plan to use the data, the duplicates might cause problems. It’s a good idea to check for duplicates in your data (if they aren’t supposed to be there).

R provides some useful functions for detecting duplicate values.
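To see how these functions behave before applying them to real data, here is a minimal sketch on a simple vector (base R only): `duplicated` flags repeats of values seen earlier, and `unique` drops them.

```r
x <- c(1, 2, 2, 3, 1)

# duplicated() returns TRUE for elements that repeat an earlier value
duplicated(x)
# [1] FALSE FALSE  TRUE FALSE  TRUE

# unique() keeps only the first occurrence of each value
unique(x)
# [1] 1 2 3
```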

Suppose that you accidentally included one stock ticker twice (say, GE) when you fetched stock quotes:

> my.tickers.2 <- c("GE","GOOG","AAPL","AXP","GS","GE")
> my.quotes.2 <- get.multiple.quotes(my.tickers.2, from=as.Date("2009-01-01"),
+ to=as.Date("2009-03-31"), interval="m")

One such function is duplicated, which returns a logical vector indicating which elements are duplicates of values that appear earlier (that is, at lower indices). Applied to a data frame, duplicated flags duplicate rows. Let’s apply duplicated to the data frame my.quotes.2:

> duplicated(my.quotes.2)
 [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[12] FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE

As expected, duplicated shows that the last three rows are duplicates of earlier rows. You can use the resulting vector to remove duplicates:

> my.quotes.unique <- my.quotes.2[!duplicated(my.quotes.2),]
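This keeps the first occurrence of each duplicate. If you would rather keep the last occurrence, base R’s duplicated accepts a fromLast argument that scans from the end of the vector; a sketch on a small vector:

```r
tickers <- c("GE", "GOOG", "GE")

# Default: the later repeat is flagged as the duplicate
duplicated(tickers)
# [1] FALSE FALSE  TRUE

# fromLast = TRUE: the earlier occurrence is flagged instead,
# so subsetting with !duplicated(..., fromLast = TRUE) keeps the last copy
duplicated(tickers, fromLast = TRUE)
# [1]  TRUE FALSE FALSE
```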

Alternatively, you could use the unique function to remove the duplicate values:

> my.quotes.unique <- unique(my.quotes.2)
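For data frames, unique compares whole rows and is essentially a convenience wrapper around the duplicated subsetting shown above. A minimal sketch on a toy data frame (the column names and values here are hypothetical, not from the stock-quote example):

```r
# Toy data frame with one repeated row (hypothetical values)
df <- data.frame(ticker = c("GE", "GOOG", "GE"),
                 close  = c(10.5, 300.0, 10.5))

# unique() keeps each distinct row once...
nrow(unique(df))
# [1] 2

# ...and gives the same result as subsetting with !duplicated()
identical(unique(df), df[!duplicated(df), ])
```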
