Detecting Duplicate Words

Problem

You want to check for doubled words in a document.

Solution

Use backreferences in your regular expression.

Discussion

Parentheses in a pattern make the regular expression engine remember what matched that part of the pattern. Later in your pattern, you can refer to the actual string that matched with \1 (indicating the string matched by the first set of parentheses), \2 (for the second string matched by the second set of parentheses), and so on. Don’t use $1; it would be treated as a variable and interpolated before the match began. If you match /([A-Z])\1/, that says to match a capital letter followed not just by any capital letter, but by whichever one was captured by the first set of parentheses in that pattern.

This sample code reads its input files by paragraph, with the definition of paragraph following Perl’s notion of two or more contiguous newlines. Within each paragraph, it finds all duplicate words. It ignores case and can match across newlines.

Here we use /x to embed whitespace and comments to make the regular expression readable. /i lets us match both instances of "is" in the sentence "Is is this ok?". We use /g in a while loop to keep finding duplicate words until we run out of text. Within the pattern, use \b (word boundary) and \s (whitespace) to help pick out whole words and avoid matching "This".

$/ = ''; # paragrep mode while (<>) { while ( m{ \b # start at a word boundary (begin letters) (\S+) # find chunk of non-whitespace \b # ...

Get Perl Cookbook now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.