Finding and Removing Duplicates

Data sources often contain duplicate values. Depending on how you plan to use the data, the duplicates might cause problems. It’s a good idea to check for duplicates in your data (if they aren’t supposed to be there).

R provides some useful functions for detecting duplicate values.
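To see how these functions behave before applying them to real data, here is a minimal sketch on a simple vector (base R only): `duplicated` flags repeats of values seen earlier, and `unique` drops them.

```r
x <- c(1, 2, 2, 3, 1)

# duplicated() returns TRUE for elements that repeat an earlier value
duplicated(x)
# [1] FALSE FALSE  TRUE FALSE  TRUE

# unique() keeps only the first occurrence of each value
unique(x)
# [1] 1 2 3
```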

Suppose that you accidentally included one stock ticker twice (say, GE) when you fetched stock quotes:

> my.tickers.2 <- c("GE","GOOG","AAPL","AXP","GS","GE")
> my.quotes.2 <- get.multiple.quotes(my.tickers.2, from=as.Date("2009-01-01"),
+ to=as.Date("2009-03-31"), interval="m")

One such function is duplicated, which returns a logical vector indicating which elements are duplicates of values that appear earlier (that is, at lower indices). Applied to a data frame, duplicated flags duplicate rows. Let’s apply duplicated to the data frame my.quotes.2:

> duplicated(my.quotes.2)
 [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[12] FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE

As expected, duplicated shows that the last three rows are duplicates of earlier rows. You can use the resulting vector to remove duplicates:

> my.quotes.unique <- my.quotes.2[!duplicated(my.quotes.2),]
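This keeps the first occurrence of each duplicate. If you would rather keep the last occurrence, base R’s duplicated accepts a fromLast argument that scans from the end of the vector; a sketch on a small vector:

```r
tickers <- c("GE", "GOOG", "GE")

# Default: the later repeat is flagged as the duplicate
duplicated(tickers)
# [1] FALSE FALSE  TRUE

# fromLast = TRUE: the earlier occurrence is flagged instead,
# so subsetting with !duplicated(..., fromLast = TRUE) keeps the last copy
duplicated(tickers, fromLast = TRUE)
# [1]  TRUE FALSE FALSE
```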

Alternatively, you could use the unique function to remove the duplicate values:

> my.quotes.unique <- unique(my.quotes.2)
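For data frames, unique compares whole rows and is essentially a convenience wrapper around the duplicated subsetting shown above. A minimal sketch on a toy data frame (the column names and values here are hypothetical, not from the stock-quote example):

```r
# Toy data frame with one repeated row (hypothetical values)
df <- data.frame(ticker = c("GE", "GOOG", "GE"),
                 close  = c(10.5, 300.0, 10.5))

# unique() keeps each distinct row once...
nrow(unique(df))
# [1] 2

# ...and gives the same result as subsetting with !duplicated()
identical(unique(df), df[!duplicated(df), ])
```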
