9.3. Remove All XML-Style Tags Except <em> and <strong>

Problem

You want to remove all tags in a string except <em> and <strong>.

As a variation, you want to remove all tags other than <em> and <strong>, and also remove any <em> and <strong> tags that contain attributes.

Solution

This problem is a good fit for negative lookahead (explained in Recipe 2.16). Here, negative lookahead lets you match anything that looks like a tag, except when one of the listed tag names comes immediately after the opening < or </. If you then replace all matches with an empty string (following the code in Recipe 3.14), only the approved tags are left behind.

Solution 1: Match tags except <em> and <strong>

</?(?!(?:em|strong)\b)[a-z](?:[^>"']|"[^"]*"|'[^']*')*>
Regex options: Case insensitive
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
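
As a rough illustration of the replace-with-empty-string step, the following Python sketch applies the Solution 1 regex to a sample string. The function name strip_tags and the sample input are made up for this example and do not come from the recipe; re.IGNORECASE stands in for the "Case insensitive" option.

```python
import re

# Solution 1 regex, copied from the recipe. IGNORECASE mirrors the
# "Case insensitive" regex option listed with it.
TAG_EXCEPT_EM_STRONG = re.compile(
    r"""</?(?!(?:em|strong)\b)[a-z](?:[^>"']|"[^"]*"|'[^']*')*>""",
    re.IGNORECASE,
)

def strip_tags(text: str) -> str:
    """Hypothetical helper: delete every match, leaving <em>/<strong> tags."""
    return TAG_EXCEPT_EM_STRONG.sub("", text)

sample = '<p class="x">Keep <em>this</em> and <strong id="s">that</strong>, drop <b>bold</b>.</p>'
print(strip_tags(sample))
# prints: Keep <em>this</em> and <strong id="s">that</strong>, drop bold.
```

Note that <strong id="s"> survives here because Solution 1 only checks the tag name. The word boundary is what makes a tag such as <embed> count as "not em": after the letters em comes another word character, so the lookahead does not block the match and the tag is removed.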

In free-spacing mode:

< /?                   # Permit closing tags
(?!
    (?: em | strong )  # List of tags to avoid matching
    \b                 # Word boundary avoids partial word matches
)
[a-z]                  # Tag name initial character must be a-z
(?: [^>"']             # Any character except >, ", or '
  | "[^"]*"            # Double-quoted attribute value
  | '[^']*'            # Single-quoted attribute value
)*
>
Regex options: Case insensitive, free-spacing
Regex flavors: .NET, Java, XRegExp, PCRE, Perl, Python, Ruby
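
The free-spacing version can be tried in Python by combining re.VERBOSE with re.IGNORECASE. The sketch below is only a sanity check, under the assumption that Python's verbose mode treats this listing the same way as the other flavors do; the sample strings and the variable names COMPACT, VERBOSE, and samples are invented for illustration.

```python
import re

# Compact form of the Solution 1 regex, for comparison.
COMPACT = re.compile(
    r"""</?(?!(?:em|strong)\b)[a-z](?:[^>"']|"[^"]*"|'[^']*')*>""",
    re.IGNORECASE,
)

# Free-spacing form, copied from the recipe. VERBOSE ignores whitespace
# outside character classes and treats "#" to end of line as a comment.
VERBOSE = re.compile(
    r"""
    < /?                   # Permit closing tags
    (?!
        (?: em | strong )  # List of tags to avoid matching
        \b                 # Word boundary avoids partial word matches
    )
    [a-z]                  # Tag name initial character must be a-z
    (?: [^>"']             # Any character except >, ", or '
      | "[^"]*"            # Double-quoted attribute value
      | '[^']*'            # Single-quoted attribute value
    )*
    >
    """,
    re.IGNORECASE | re.VERBOSE,
)

# Illustrative inputs; the first has a ">" inside a quoted attribute value.
samples = [
    """<a title="1 > 0" href='x'>link</a> <em>ok</em>""",
    "<STRONG>loud</STRONG><br>",
    "<embed>not em</embed>",
]

for s in samples:
    # Both forms should remove exactly the same tags.
    assert VERBOSE.sub("", s) == COMPACT.sub("", s)

print(VERBOSE.sub("", samples[0]))
# prints: link <em>ok</em>
```

The first sample shows why the quoted-value alternatives matter: the > inside title="1 > 0" is consumed as part of the attribute value rather than being taken as the end of the tag.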

Solution 2: Match tags except <em> and <strong>, and any tags that contain attributes

With one change (replacing the \b with \s*>), you can make the regex also match any <em> and <strong> tags that contain ...
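
The change described above can be sketched in Python as follows. The pattern here is reconstructed only from the stated substitution (\b replaced by \s*> inside the lookahead); it is an assumption that this matches the full Solution 2 listing, and the names TAG_OR_ATTRIBUTED and sample are hypothetical.

```python
import re

# Assumed Solution 2 pattern: Solution 1 with \b replaced by \s*>, so the
# lookahead only protects <em>/<strong> tags that close right after the name.
TAG_OR_ATTRIBUTED = re.compile(
    r"""</?(?!(?:em|strong)\s*>)[a-z](?:[^>"']|"[^"]*"|'[^']*')*>""",
    re.IGNORECASE,
)

sample = 'A <em>plain</em> and <em class="x">styled</em> <strong >word</strong><br>'
print(TAG_OR_ATTRIBUTED.sub("", sample))
# prints: A <em>plain</em> and styled</em> <strong >word</strong>
```

In this run the attributed opening tag <em class="x"> is removed while its closing </em> is kept, because the closing tag carries no attributes. Depending on the use case, the leftover closing tag may need separate handling.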
