5.8. Find Repeated Words

Problem

You’re editing a document and would like to check it for any incorrectly repeated words. You want to find these doubled words despite capitalization differences, such as with “The the”. You also want to allow differing amounts of whitespace between words, even if this causes the words to extend across more than one line.

Solution

A backreference matches something that has been matched before, and therefore provides the key ingredient for this recipe:

\b([A-Z]+)\s+\1\b
Regex options: Case insensitive
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
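
As a minimal sketch of how this regex might be applied, the following Python snippet lists each doubled word together with its position in the text. The function name find_repeated_words and the sample string are illustrative assumptions, not part of the recipe; re.IGNORECASE stands in for the "Case insensitive" option listed above.

```python
import re

# The recipe's regex; re.IGNORECASE supplies the "Case insensitive" option.
REPEATED_WORD = re.compile(r"\b([A-Z]+)\s+\1\b", re.IGNORECASE)

def find_repeated_words(text):
    """Return (matched_text, start_offset) pairs for each doubled word.

    Hypothetical helper for illustration only.
    """
    return [(m.group(0), m.start()) for m in REPEATED_WORD.finditer(text)]

# Sample text with a capitalization difference and a repeat split across lines.
sample = "The the quick brown fox jumps over\nover the lazy dog."
matches = find_repeated_words(sample)

for matched_text, offset in matches:
    print(f"offset {offset}: {matched_text!r}")
```

Because \s matches line breaks as well as spaces and tabs, the second repeat is found even though the two words sit on different lines.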

If you want to use this regular expression to keep the first word but remove subsequent duplicate words, replace all matches with backreference 1. Another approach is to highlight matches by surrounding them with other characters (such as an HTML tag), so you can more easily identify them during later inspection. Recipe 3.15 shows how you can use backreferences in your replacement text, which you’ll need to do to implement either of these approaches.
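
Both approaches can be sketched in Python with re.sub, assuming the same regex as above. The function names and the choice of the <b> tag are assumptions made for illustration. In Python replacement text, \1 inserts the first captured word and \g<0> inserts the whole match.

```python
import re

# The recipe's regex; re.IGNORECASE supplies the "Case insensitive" option.
REPEATED_WORD = re.compile(r"\b([A-Z]+)\s+\1\b", re.IGNORECASE)

def remove_repeats(text):
    """Keep the first word of each doubled pair (hypothetical helper)."""
    # \1 is the text captured by group 1, so its capitalization is preserved.
    return REPEATED_WORD.sub(r"\1", text)

def highlight_repeats(text):
    """Wrap each doubled pair in <b>...</b> for later inspection (hypothetical helper)."""
    # \g<0> refers to the entire match, including the whitespace between the words.
    return REPEATED_WORD.sub(r"<b>\g<0></b>", text)

cleaned = remove_repeats("The the cat sat on on the mat.")
marked = highlight_repeats("It is is fine")

print(cleaned)
print(marked)
```

Note that this regex matches words in pairs, so a single replacement pass over a word repeated three times in a row ("the the the") still leaves one doubled pair behind.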

If you just want to find repeated words so you can manually examine whether they need to be corrected, Recipe 3.7 shows the code you need. A text editor or grep-like tool, such as those mentioned in Tools for Working with Regular Expressions in Chapter 1, can help you find repeated words while providing the context needed to determine whether the words in question are in fact used correctly.
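
If you would rather script the manual review than use an editor, a rough Python sketch along these lines reports each match with its line number and a little surrounding text. The function name, the width parameter, and the output layout are all assumptions chosen for illustration.

```python
import re

# The recipe's regex; re.IGNORECASE supplies the "Case insensitive" option.
REPEATED_WORD = re.compile(r"\b([A-Z]+)\s+\1\b", re.IGNORECASE)

def repeated_words_with_context(text, width=20):
    """Return (line_number, snippet) pairs for each doubled word.

    Hypothetical helper: `width` is the number of characters of context
    to keep on each side of the match.
    """
    results = []
    for m in REPEATED_WORD.finditer(text):
        # Line numbers are 1-based: count the newlines before the match starts.
        line_no = text.count("\n", 0, m.start()) + 1
        start = max(0, m.start() - width)
        end = min(len(text), m.end() + width)
        # Collapse runs of whitespace so multiline matches print on one line.
        snippet = " ".join(text[start:end].split())
        results.append((line_no, snippet))
    return results

report = repeated_words_with_context("This is fine.\nBut this this is not.")
for line_no, snippet in report:
    print(f"line {line_no}: ...{snippet}...")
```

The reported line number is the line on which the first of the two words begins, which matters when a repeat spans a line break.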

Discussion

There are two things needed to match something that ...
