Cover by Steven Levithan, Jan Goyvaerts

2.13. Choose Minimal or Maximal Repetition

Problem

Match a pair of <p> and </p> XHTML tags and the text between them. The text between the tags can include other XHTML tags.

Solution

<p>.*?</p>
Regex options: Dot matches line breaks
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
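
As a rough illustration of how the solution might be applied, the following Python sketch runs the regex over the two-paragraph excerpt shown in the Discussion. The variable names are made up for this example, and re.DOTALL is assumed here as Python's equivalent of the "dot matches line breaks" option.

```python
import re

# Sample text copied from the excerpt in this recipe's Discussion.
xhtml = (
    "<p>\n"
    "The very <em>first</em> task is to find the beginning of a paragraph.\n"
    "</p>\n"
    "<p>\n"
    "Then you have to find the end of the paragraph\n"
    "</p>\n"
)

# re.DOTALL lets the dot match line breaks as well as other characters.
paragraph = re.compile(r"<p>.*?</p>", re.DOTALL)

# Each list item is one <p>...</p> pair, including the text between the tags.
paragraphs = paragraph.findall(xhtml)
print(len(paragraphs))  # 2
```

Each element of paragraphs holds a single paragraph, and the nested <em> tag in the first one does not stop the match early.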

Discussion

All the quantifiers discussed in Recipe 2.12 are greedy, meaning they try to repeat as many times as possible, giving back only when required to allow the remainder of the regular expression to match.
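
The "giving back" behavior can be observed with a small Python sketch. The pattern and subject string below are invented for this illustration and are not part of the recipe.

```python
import re

# Greedy: \d+ first grabs all five digits, then gives one back so that the
# trailing \d still has a digit left to match.
greedy = re.match(r"(\d+)(\d)", "12345")
print(greedy.groups())  # ('1234', '5')
```

The first group ends up with four digits rather than five, because the greedy plus had to release one digit for the rest of the regex to succeed.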

This can make it hard to pair tags in XHTML (which is a version of XML and therefore requires every opening tag to be matched by a closing tag). Consider the following simple excerpt of XHTML:

<p>
The very <em>first</em> task is to find the beginning of a paragraph.
</p>
<p>
Then you have to find the end of the paragraph
</p>

There are two opening <p> tags and two closing </p> tags in the excerpt. You want to match the first <p> with the first </p>, because they mark a single paragraph. Note that this paragraph contains a nested <em> tag, so the regex can’t simply stop when it encounters a < character.
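
To see why stopping at a < character is not enough, consider a hypothetical alternative (not the recipe's solution) that uses a negated character class between the tags. This Python sketch assumes the excerpt above as the subject text.

```python
import re

xhtml = (
    "<p>\n"
    "The very <em>first</em> task is to find the beginning of a paragraph.\n"
    "</p>\n"
    "<p>\n"
    "Then you have to find the end of the paragraph\n"
    "</p>\n"
)

# Hypothetical attempt: allow anything except "<" between the tags.
no_angle = re.compile(r"<p>[^<]*</p>")

# The first paragraph is skipped, because its <em> tag contains a "<".
found = no_angle.findall(xhtml)
print(len(found))  # 1
```

Only the second paragraph is found, so this approach silently drops any paragraph that contains a nested tag.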

Take a look at one incorrect solution for the problem in this recipe:

<p>.*</p>
Regex options: Dot matches line breaks
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby

The only difference from the solution is that this regex lacks the question mark after the asterisk. It uses the same greedy asterisk explained in Recipe 2.12.
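
The effect of the missing question mark can be checked with a short Python sketch that runs both regexes against the excerpt. The variable names are chosen for this example only, and re.DOTALL is assumed as the "dot matches line breaks" option.

```python
import re

xhtml = (
    "<p>\n"
    "The very <em>first</em> task is to find the beginning of a paragraph.\n"
    "</p>\n"
    "<p>\n"
    "Then you have to find the end of the paragraph\n"
    "</p>\n"
)

lazy = re.search(r"<p>.*?</p>", xhtml, re.DOTALL)
greedy = re.search(r"<p>.*</p>", xhtml, re.DOTALL)

# Count the closing tags inside each match to see how far it extends.
print(lazy.group().count("</p>"))    # 1
print(greedy.group().count("</p>"))  # 2
```

The lazy version stops at the first </p>, while the greedy version runs on to the last </p> in the text and swallows both paragraphs in one match.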

After matching the first <p> tag in ...
