Cover by Toby Segaran, Jeff Hammerbacher

Possible Solutions

Although this remains an unsolved problem in the general case, people have tried a number of ideas that work in certain circumstances. Some of these approaches will be dead ends, but others, when further developed, seem to have the potential to work on a wide range of data sets.

Matching on Multiple Fields

In Chapter 7, "Data Finds Data," Jeff Jonas describes a hypothetical employee who could be discovered to also be a shoplifter through a combination of his name and his address. In that case, a combination of a name and an address is sufficient evidence to suggest that two different records in fact represent the same person. Jeff would also be quick to point out that he's come across cases where a "Patrick Smith" and a "Patricia Smith" shared an address and both went by "Pat Smith," so if you're not careful it's easy to get trapped in a maze of exceptions to otherwise obvious rules.

Still, this illustrates the most basic and common approach to matching items in data sets: choose a set of parameters and create a set of fixed rules that tell you whether things match or not. For example, "do two people have the same name and the same address?" or "do two films have the same name and the same release year?"
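The fixed-rule approach described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names (normalize, same_on), the record layout, and the sample records are all hypothetical and do not come from the chapter.

```python
from typing import Callable, Dict, List

# A record is a plain mapping of field name to string value (an assumption
# made for this sketch; real data sets will have their own layout).
Record = Dict[str, str]


def normalize(value: str) -> str:
    """Lowercase and collapse whitespace so trivial differences don't block a match."""
    return " ".join(value.lower().split())


def same_on(fields: List[str]) -> Callable[[Record, Record], bool]:
    """Build a fixed rule: two records match if every listed field agrees."""
    def rule(a: Record, b: Record) -> bool:
        return all(normalize(a[f]) == normalize(b[f]) for f in fields)
    return rule


# The two example rules from the text, expressed as field lists.
same_person = same_on(["name", "address"])
same_film = same_on(["title", "year"])

# Hypothetical sample records, invented for illustration.
employee = {"name": "Pat Smith", "address": "12 Elm St"}
shoplifter = {"name": "pat  smith", "address": "12 elm st"}
neighbor = {"name": "Pat Smith", "address": "98 Oak Ave"}

print(same_person(employee, shoplifter))  # True
print(same_person(employee, neighbor))    # False
```

Note that a rule like same_person would also report a match for a Patrick Smith and a Patricia Smith who share an address and are both recorded as "Pat Smith," which is exactly the kind of exception described earlier.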

This approach will work in many cases, but it has a few drawbacks. First of all, it requires the developer to identify the fields and rules by which things match. This can be incredibly tedious, since when they realize ...
