Chapter 17. String Matching

Chapter 16 concentrated on efficient techniques for finding one string within another. This chapter focuses on matching whole strings, and, in particular, attempting to find matches between non-identical yet similar strings. This can be very useful for detecting duplicate entries in a database, spell-checking documents, and even searching for genes in DNA.

This chapter discusses the following topics:

  • Understanding Soundex

  • Understanding Levenshtein word distance

Understanding Soundex

Soundex encoding is one of a class of algorithms known as phonetic encoding algorithms. Phonetic encoding takes strings and converts similar sounding words into the same encoded value (much like a hash function).

Soundex, developed by R. C. Russell to process data collected from the 1980 census, is also known as the Russell Soundex algorithm and has been used in its original form and with many variations in numerous applications—ranging from human resource management to genealogy, and, of course, census taking—in an attempt to eliminate data duplication that occurs because of differences in the spelling of people's surnames.

In 1970, Robert L. Taft, working as part of the New York State Identification and Intelligence project (NYSII), published a paper titled "Name Search Techniques," in which he presented findings on two phonetic encoding schemes. One of these was Soundex, the other an algorithm developed by the NYSII based on extensive statistical analysis of real data. The NYSII ...

Get Beginning Algorithms now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.