Identifying and removing duplicates

Relational databases work on the idea that items of data can be uniquely identified. However hard we try, there will always be bad data arriving from somewhere. This recipe shows you how to diagnose that and clean up the mess.

Getting ready

Let's start by looking at our example table, cust. It has a duplicate value in customerid:

postgres=# SELECT * FROM cust;
 customerid | firstname | lastname | age
------------+-----------+----------+-----
          1 | Philip    | Marlowe  |  38
          2 | Richard   | Hannay   |  42
          3 | Holly     | Martins  |  25
          4 | Harry     | Palmer   |  36
          4 | Mark      | Hall     |  47
(5 rows)

Before you delete duplicate data, remember that sometimes, it isn't the data that is wrong; it is your understanding of it. In those cases, ...

Get PostgreSQL 9 Administration Cookbook - Second Edition now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.

Start your free trial

PostgreSQL 9 Administration Cookbook - Second Edition by Simon Riggs, Gianni Ciolli, Hannu Krosing, Gabriele Bartolini

Identifying and removing duplicates

Getting ready

Don’t leave empty-handed

It’s yours, free.

Check it out now on O’Reilly