2.6. Match Whole Words

Problem

Create a regex that matches cat in My cat is brown, but not in category or bobcat. Create another regex that matches cat in staccato, but not in any of the three previous subject strings.

Solution

Word boundaries

\bcat\b
Regex options: None
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby

Nonboundaries

\Bcat\B
Regex options: None
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
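
As a quick check, both solutions can be run against the subject strings from the problem statement. This is a minimal sketch using Python's re module; the variable names are illustrative assumptions and are not part of the recipe.

```python
import re

# The two regexes from the Solution section
whole_word = re.compile(r"\bcat\b")
inside_word = re.compile(r"\Bcat\B")

# Subject strings taken from the problem statement
subjects = ["My cat is brown", "category", "bobcat", "staccato"]

# Collect the subjects in which each regex finds a match
whole_word_hits = [s for s in subjects if whole_word.search(s)]
inside_word_hits = [s for s in subjects if inside_word.search(s)]

print(whole_word_hits)   # ['My cat is brown']
print(inside_word_hits)  # ['staccato']
```
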

Discussion

Word boundaries

The regular expression token \b is called a word boundary. It matches at the start or the end of a word. By itself, it results in a zero-length match. \b is an anchor, just like the tokens introduced in the previous section.

Strictly speaking, \b matches in these three positions:

  • Before the first character in the subject, if the first character is a word character

  • After the last character in the subject, if the last character is a word character

  • Between two characters in the subject, where one is a word character and the other is not a word character
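
The three positions listed above can be made visible by asking the regex engine for every zero-length \b match in a string. The following is a sketch in Python (3.7 or later is assumed); the helper name boundary_positions is hypothetical.

```python
import re

def boundary_positions(subject):
    # Return every index in subject at which the word boundary token matches.
    # Each match is zero-length, so start() and end() are equal.
    return [m.start() for m in re.finditer(r"\b", subject)]

# Start and end of the subject, since both edge characters are word characters
print(boundary_positions("cat"))     # [0, 3]

# Also on both sides of the space between the two words
print(boundary_positions("My cat"))  # [0, 2, 3, 6]

# No boundary at the very start or end: the edge characters are not word characters
print(boundary_positions("!cat!"))   # [1, 4]
```
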

None of the flavors discussed in this book have separate tokens for matching only before or only after a word. Unless you want a regex that consists of nothing but a word boundary, such tokens aren’t needed: the tokens before or after the \b in your regular expression determine where \b can match. The \b in \bx and !\b can match only at the start of a word. The \b in x\b and \b! can match only at the end of a word. x\bx and !\b! can never match anywhere. ...
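
The effect of the neighboring tokens can be checked directly. The sketch below, again in Python, tries each of the six patterns mentioned above on a few short subject strings chosen purely for illustration.

```python
import re

# (pattern, subject) pairs; the subjects are made-up test strings
cases = [
    (r"\bx", "x-ray"),  # x at the start of a word
    (r"\bx", "box"),    # x in the middle or end of a word
    (r"!\b", "!x"),     # ! directly before a word
    (r"!\b", "x!"),     # ! directly after a word
    (r"x\b", "box"),    # x at the end of a word
    (r"x\b", "xy"),     # x followed by another word character
    (r"\b!", "x!"),     # ! directly after a word
    (r"\b!", "!x"),     # ! directly before a word
    (r"x\bx", "xx"),    # two word characters: no boundary between them
    (r"x\bx", "x x"),   # the boundary after x is followed by a space, not x
    (r"!\b!", "!!"),    # two nonword characters: no boundary between them
]

# Map each pair to True if the pattern matches somewhere in the subject
results = {(p, s): bool(re.search(p, s)) for p, s in cases}

for (pattern, subject), matched in results.items():
    print(f"{pattern!r:10} on {subject!r:8} -> {matched}")
```

Only the first of each of the first four pattern pairs matches, and the last three cases never match, in line with the rules stated above.
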
