Perl Cookbook by Nathan Torkington, Tom Christiansen

Detecting Duplicate Words

Problem

You want to check for doubled words in a document.

Solution

Use backreferences in your regular expression.

Discussion

Parentheses in a pattern make the regular expression engine remember what matched that part of the pattern. Later in the same pattern, you can refer to the actual string that matched with \1 (the string matched by the first set of parentheses), \2 (the string matched by the second set), and so on. Don’t use $1 inside the pattern; it would be treated as a variable and interpolated before the match began. The pattern /([A-Z])\1/ matches a capital letter followed not just by any capital letter, but by whichever one was captured by the first set of parentheses in that pattern.
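As a minimal sketch of the /([A-Z])\1/ pattern described above, the following wraps it in a helper. The sub name has_doubled_capital and the sample strings are illustrative assumptions, not part of the recipe.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the recipe): returns 1 if $text contains
# the same capital letter twice in a row, 0 otherwise.
sub has_doubled_capital {
    my ($text) = @_;
    # \1 must match the very letter that ([A-Z]) captured.
    return $text =~ /([A-Z])\1/ ? 1 : 0;
}

print has_doubled_capital("AARDVARK"), "\n";   # prints 1: "AA" is a doubled capital
print has_doubled_capital("ABCD"), "\n";       # prints 0: adjacent capitals, but never the same one twice
```

The second call shows why the backreference matters: /[A-Z][A-Z]/ would match "AB", but \1 insists on the captured letter itself.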

This sample code reads its input files by paragraph, where a paragraph follows Perl’s notion: a block of text terminated by two or more contiguous newlines. Within each paragraph, it finds all duplicate words. It ignores case and can match across newlines.
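To see paragraph-at-a-time reading in isolation, here is a small sketch that sets $/ to the empty string and reads from an in-memory filehandle. The sub name read_paragraphs and the sample text are assumptions made for illustration; the recipe itself reads from the files named on the command line.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the recipe): split a string into paragraphs
# by reading it through an in-memory filehandle in paragraph mode.
sub read_paragraphs {
    my ($text) = @_;
    open(my $fh, '<', \$text) or die "cannot open in-memory handle: $!";
    local $/ = '';    # paragraph mode: a run of blank lines ends one record
    my @paragraphs = <$fh>;
    close($fh);
    return @paragraphs;
}

my @paras = read_paragraphs("first line\nsecond line\n\n\n\nnext paragraph\n");
print scalar(@paras), " paragraphs\n";    # prints "2 paragraphs"
```

Using local keeps the change to $/ confined to the helper, so other reads in the program still work line by line.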

Here we use /x to embed whitespace and comments that make the regular expression readable. /i lets us match both instances of "is" in the sentence "Is is this ok?". We use /g in a while loop to keep finding duplicate words until we run out of text. Within the pattern, \b (word boundary) and \s (whitespace) help pick out whole words, so that the "is" at the end of "This" is not mistaken for a word of its own.

    $/ = '';                # paragrep mode
    while (<>) {
        while ( m{
                    \b      # start at a word boundary (begin letters)
                    (\S+)   # find chunk of non-whitespace
                    \b      # ...
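The listing above is cut off in this excerpt, so the following is a separate, self-contained sketch of the same idea rather than a reconstruction of it. It combines /x, /i, and /g with \b, \s, and a \1 backreference on an in-memory string. The sub name doubled_words, the use of \w+ for a word, and the sample text are all assumptions.

```perl
use strict;
use warnings;

# Hypothetical helper (not the recipe's listing): return each word that is
# immediately repeated in $text, ignoring case and allowing newlines between.
sub doubled_words {
    my ($text) = @_;
    my @found;
    while ( $text =~ m{
                \b (\w+) \b        # a whole word, captured as group 1
                (?: \s+ \1 \b )+   # whitespace, then the same word again, one or more times
            }xig )
    {
        push @found, $1;
    }
    return @found;
}

my @dups = doubled_words("Is is this ok? This is\nis fine, the\nthe end.");
print join(",", @dups), "\n";    # prints "Is,is,the"
```

Note that "This is" on its own yields nothing: \b prevents the match from starting at the "is" inside "This", while the "is" that follows it is reported only because another "is" comes after the newline.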
