Cover by Steven Levithan, Jan Goyvaerts

2.16. Test for a Match Without Adding It to the Overall Match

Problem

Find any word that occurs between a pair of HTML bold tags, without including the tags in the regex match. For instance, if the subject is My <b>cat</b> is furry, the only valid match should be cat.

Solution

(?<=<b>)\w+(?=</b>)
Regex options: Case insensitive
Regex flavors: .NET, Java, PCRE, Perl, Python, Ruby 1.9

Ruby 1.8 and JavaScript engines that predate ECMAScript 2018 support the lookahead (?=</b>), but not the lookbehind (?<=<b>).
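As a rough illustration, the solution can be tried in Python, one of the flavors listed above. This is only a sketch: the name BOLD_WORD and the second subject string are made up for the example, and the case-insensitive option is passed as re.IGNORECASE.

```python
import re

# The recipe's regex, compiled with the "case insensitive" option.
# BOLD_WORD is an illustrative name, not something from the recipe.
BOLD_WORD = re.compile(r"(?<=<b>)\w+(?=</b>)", re.IGNORECASE)

subject = "My <b>cat</b> is furry"
print(BOLD_WORD.findall(subject))  # ['cat']

# With case insensitivity, upper-case tags are accepted as well.
mixed = "A <B>Dog</B> and a <b>bird</b>"
print(BOLD_WORD.findall(mixed))  # ['Dog', 'bird']
```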

Discussion

Lookaround

The four kinds of lookaround groups supported by modern regex flavors have the special property of giving up the text matched by the part of the regex inside the lookaround. Essentially, lookaround checks whether certain text can be matched without actually matching it.
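One way to see that lookaround tests text without adding it to the match is to compare match spans. The sketch below uses Python's re module; the variable names are illustrative, and the second pattern (the same regex with the tags matched directly) is included only for contrast.

```python
import re

subject = "My <b>cat</b> is furry"

# With lookaround, the tags are checked but are not part of the match.
with_lookaround = re.search(r"(?<=<b>)\w+(?=</b>)", subject)
print(with_lookaround.group(), with_lookaround.span())  # cat (6, 9)

# Without lookaround, the tags are consumed and become part of the match.
without_lookaround = re.search(r"<b>\w+</b>", subject)
print(without_lookaround.group(), without_lookaround.span())  # <b>cat</b> (3, 13)
```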

Lookaround that looks backward is called lookbehind. This is the only regular expression construct that will traverse the text from right to left instead of from left to right. The syntax for positive lookbehind is (?<=text). The four characters (?<= form the opening bracket. What you can put inside the lookbehind, here represented by text, varies among regular expression flavors. But simple literal text, such as (?<=<b>), always works.
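To make the flavor differences concrete, here is a small sketch using Python's re module, which accepts literal text in a lookbehind but rejects a lookbehind whose length is not fixed. Other flavors draw the line elsewhere, and the variable-width pattern below is invented purely to trigger the restriction.

```python
import re

subject = "My <b>cat</b> is furry"

# Simple literal text inside the lookbehind is accepted.
literal = re.compile(r"(?<=<b>)\w+")
print(literal.search(subject).group())  # cat

# A lookbehind of variable length (\s* can match any number of spaces)
# is refused by Python's re module when the pattern is compiled.
rejected = False
try:
    re.compile(r"(?<=<b\s*>)\w+")
except re.error as exc:
    rejected = True
    print("rejected:", exc)
```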

Lookbehind checks to see whether the text inside the lookbehind occurs immediately to the left of the position that the regular expression engine has reached. If you match (?<=<b>) against My <b>cat</b> is furry, the lookbehind will fail to match until the regular expression starts the match attempt ...
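The position check can be observed directly by running the bare lookbehind over the subject and recording where it succeeds. This is a sketch in Python; the variable names are illustrative. Because the lookbehind matches no characters itself, each successful match is empty and only its position is of interest.

```python
import re

subject = "My <b>cat</b> is furry"

# Collect every position at which the lookbehind alone succeeds.
matches = list(re.finditer(r"(?<=<b>)", subject))
positions = [m.start() for m in matches]
print(positions)  # [6]

# The match itself is empty; the text to its left is the opening tag.
print(repr(matches[0].group()))  # ''
print(subject[positions[0] - 3:positions[0]])  # <b>
```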
