5.5. Find Any Word Not Followed by a Specific Word

Problem

You want to match any word that is not immediately followed by the word cat, ignoring any whitespace, punctuation, or other nonword characters that appear in between.

Solution

Negative lookahead is the secret ingredient for this regular expression:

\b\w+\b(?!\W+cat\b)
Regex options: Case insensitive
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby

Recipes 3.7 and 3.14 show examples of how you might want to implement this regular expression in code.
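
As a rough illustration of the regex above, here is a minimal Python sketch. The helper name words_not_before_cat and the sample sentences are invented for this example and are not taken from the recipes cited above; re.IGNORECASE stands in for the case insensitive option listed with the regex.

```python
import re

# The recipe's regex, compiled with the case-insensitive option.
NOT_BEFORE_CAT = re.compile(r"\b\w+\b(?!\W+cat\b)", re.IGNORECASE)

def words_not_before_cat(text):
    """Return every word in text that is not followed by the word 'cat'."""
    return NOT_BEFORE_CAT.findall(text)

# "fat" and "other" are skipped because "cat" is the next word.
print(words_not_before_cat("My fat cat chased the other cat."))
# ['My', 'cat', 'chased', 'the', 'cat']

# Punctuation between the words is ignored, and case does not matter.
print(words_not_before_cat("dog, CAT"))
# ['CAT']

# "catalog" is not the word "cat", so "big" still matches.
print(words_not_before_cat("big catalog"))
# ['big', 'catalog']
```

Note that the word cat itself is returned whenever it is not followed by another cat, since the regex only restricts what comes after each word.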

Discussion

As with many other recipes in this chapter, word boundaries (\b) and the word character token (\w) work together to match a complete word. You can find in-depth descriptions of these features in Recipe 2.6.

The (?!...) surrounding the second part of this regex is a negative lookahead. Lookahead tells the regex engine to temporarily step forward in the string, to check whether the pattern inside the lookahead can be matched just ahead of the current position. It does not consume any of the characters matched inside the lookahead. Instead, it merely asserts whether a match is possible. Since we’re using a negative lookahead, the result of the assertion is inverted. In other words, if the pattern inside the lookahead can be matched just ahead, the match attempt fails, and the regex engine moves forward to try all over again starting from the next character in the subject string. You can find much more detail about lookahead (and its counterpart, lookbehind) in Recipe 2.16 ...
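
To make the non-consuming behavior concrete, the following Python sketch prints the span of each match. The sample string and variable names are assumptions chosen for illustration. A positive lookahead variant, (?=...), is included only for contrast; it is not part of this recipe's solution.

```python
import re

text = "black cat, white dog"

# Positive lookahead (for contrast): "black" matches because " cat" follows,
# yet the match ends at index 5. The space and "cat" are checked, not consumed.
positive = re.search(r"\b\w+\b(?=\W+cat\b)", text)
print(positive.group(), positive.span())
# black (0, 5)

# Negative lookahead (the recipe's regex): the attempt at "black" fails,
# so the engine moves on and finds the remaining words.
negative = [(m.group(), m.span())
            for m in re.finditer(r"\b\w+\b(?!\W+cat\b)", text)]
print(negative)
# [('cat', (6, 9)), ('white', (11, 16)), ('dog', (17, 20))]
```

In both cases each span covers only the word itself, which shows that the text examined by the lookahead never becomes part of the match.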
