8.6. Find a Specific Attribute in XML-Style Tags

Problem

You want to find tags within an (X)HTML or XML file that contain a specific attribute, such as id.

This recipe covers several variations on the same problem. Suppose that you want to match each of the following types of strings using separate regular expressions:

  • Tags that contain an id attribute.

  • <div> tags that contain an id attribute.

  • Tags that contain an id attribute with the value my-id.

  • Tags that contain my-class within their class attribute value (classes are separated by whitespace).

Solution

Tags that contain an id attribute (quick and dirty)

If you want to do a quick search in a text editor that lets you preview your results, the following (overly simplistic) regex might do the trick:

<[^>]+\sid\b[^>]*>
Regex options: Case insensitive
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
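As a rough illustration of how this regex behaves outside a text editor, here is a minimal Python sketch. The sample markup and the names ID_TAG and sample are made up for this example and are not part of the recipe; the Case insensitive option is mapped to Python's re.IGNORECASE flag.

```python
import re

# The quick-and-dirty pattern from this recipe, compiled case-insensitively.
ID_TAG = re.compile(r"<[^>]+\sid\b[^>]*>", re.IGNORECASE)

# Hypothetical sample markup, invented for illustration only.
sample = '<div id="main"><p class="x">text</p><span ID=note>hi</span></div>'

# findall returns each whole match, because the pattern has no capturing groups.
matches = ID_TAG.findall(sample)
print(matches)
```

With this sample, only the opening <div> and <span> tags should be reported; the <p> tag and the closing tags carry no id attribute.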

Here’s a breakdown of the regex in free-spacing mode:

<         # Start of the tag
[^>]+     # Tag name, attributes, etc.
\s id \b  # The target attribute name, as a whole word
[^>]*     # The remainder of the tag, including the id attribute's value
>         # End of the tag
Regex options: Case insensitive, free-spacing
Regex flavors: .NET, Java, PCRE, Perl, Python, Ruby
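The free-spacing breakdown can be used as a working pattern in flavors that support it. The following sketch assumes Python, where free-spacing mode is enabled with re.VERBOSE; the sample string and the variable names are hypothetical.

```python
import re

# Same pattern as the one-line version, written in free-spacing mode.
# Under re.VERBOSE, unescaped whitespace and # comments in the pattern are ignored.
ID_TAG_VERBOSE = re.compile(r"""
    <         # Start of the tag
    [^>]+     # Tag name, attributes, etc.
    \s id \b  # The target attribute name, as a whole word
    [^>]*     # The remainder of the tag, including the id attribute's value
    >         # End of the tag
""", re.IGNORECASE | re.VERBOSE)

# One-line version, for comparison.
ID_TAG_COMPACT = re.compile(r"<[^>]+\sid\b[^>]*>", re.IGNORECASE)

# Hypothetical sample markup, invented for illustration only.
sample = '<a href="#top" id="link1">top</a> <img src="x.png">'

verbose_matches = ID_TAG_VERBOSE.findall(sample)
compact_matches = ID_TAG_COMPACT.findall(sample)
print(verbose_matches)
print(verbose_matches == compact_matches)
```

Both forms should find the same tags; the free-spacing form only changes how the pattern is written, not what it matches.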

Tags that contain an id attribute (more reliable)

Unlike the regex just shown, this next take on the same problem supports quoted attribute values that contain literal > characters, and it doesn’t match tags that merely contain the word id within one of their attributes’ values:

<(?:[^>"']|"[^"]*"|'[^']*')+?\sid\s*=\s*("[^"]*"|'[^']*')↵ ...
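To make the two weaknesses of the quick-and-dirty regex concrete, the sketch below runs that earlier pattern (not the more reliable one, which is not reproduced here) against two hypothetical tags. The test strings and variable names are assumptions made for illustration.

```python
import re

# The quick-and-dirty pattern from earlier in this recipe.
QUICK = re.compile(r"<[^>]+\sid\b[^>]*>", re.IGNORECASE)

# Hypothetical tags, invented for illustration only.
gt_in_value = '<img id="pic" alt="a > b" src="x.png">'  # literal > inside a quoted value
id_in_value = '<p title="your id here">'                # the word id inside a value, no id attribute

# Weakness 1: the match ends at the first >, even though it sits inside quotes.
truncated = QUICK.search(gt_in_value).group()
print(truncated)

# Weakness 2: the tag matches although it has no id attribute at all.
false_positive = QUICK.search(id_in_value)
print(false_positive is not None)
```

The first match is cut short inside the alt value, and the second tag is reported even though id appears only as a word in its title text.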
