8.9. Find Words Within XML-Style Comments

Problem

You want to find all occurrences of the word TODO within (X)HTML or XML comments. For example, you want to match only the second TODO, the one inside the comment, in the following string:

This "TODO" is not within a comment, but the next one is. <!-- TODO: ↵
Come up with a cooler comment for this example. -->

Solution

There are at least two approaches to this problem, and both have their advantages. The first tactic, described as the Two-step approach, is to find comments with an outer regex, and then search within each match using a separate regex or even a plain text search. That works best if you’re writing code to do the job, since separating the task into two steps keeps things simple and fast. However, if you’re searching through files using a text editor or grep tool, splitting the task in two won’t work unless your tool of choice offers a special option to search within matches found by another regex.[8]

When you need to find words within comments using a single regex, you can accomplish this with the help of lookaround. This second method is shown in a later section.
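As a rough illustration of how lookaround can confine a match to comments, here is a minimal Python sketch. The pattern is an assumption for illustration and is not necessarily the regex the later section presents: it matches TODO only if a closing --> can be reached without first passing an opening <!--. The names SINGLE_STEP_RE and sample are hypothetical, as are the \b word boundaries.

```python
import re

# Assumed pattern, for illustration only: match the word TODO when a
# closing "-->" follows before any new "<!--" opener is encountered.
SINGLE_STEP_RE = re.compile(
    r"\bTODO\b(?=(?:(?!<!--).)*?-->)",
    re.DOTALL,  # let the dot cross line breaks inside the lookahead
)

sample = (
    'This "TODO" is not within a comment, but the next one is. <!-- TODO: \n'
    "Come up with a cooler comment for this example. -->"
)

matches = list(SINGLE_STEP_RE.finditer(sample))
for m in matches:
    print(m.start(), m.group())
```

This check only looks forward, so it assumes comments are well formed; a stray --> outside any comment could make it report a TODO that is not actually commented out.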

Two-step approach

When it’s a workable option, the better solution is to split the task in two: search for comments, and then search within those comments for TODO.

Here’s how you can find comments:

<!--.*?-->
Regex options: Dot matches line breaks
Regex flavors: .NET, Java, PCRE, Perl, Python, Ruby
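The two steps might be wired together along these lines in Python. This is a sketch under stated assumptions rather than the book's own listing: the function name find_todos_in_comments and the variable names are hypothetical, and the inner \bTODO\b pattern with word boundaries is an assumed way to express "the word TODO". re.DOTALL is Python's "dot matches line breaks" option.

```python
import re

# Outer regex from the text above; DOTALL lets the dot match line breaks.
COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)
# Inner regex (an assumption): TODO as a whole word.
TODO_RE = re.compile(r"\bTODO\b")

def find_todos_in_comments(text):
    """Return (start, end) offsets in text of each TODO inside a comment."""
    spans = []
    for comment in COMMENT_RE.finditer(text):
        # Search only within the text of the comment just matched.
        for todo in TODO_RE.finditer(comment.group()):
            offset = comment.start()  # convert to offsets in the full text
            spans.append((offset + todo.start(), offset + todo.end()))
    return spans

sample = (
    'This "TODO" is not within a comment, but the next one is. <!-- TODO: \n'
    "Come up with a cooler comment for this example. -->"
)

spans = find_todos_in_comments(sample)
print(spans)
```

Because the inner search runs only on comment text, the quoted "TODO" at the start of the sample is never examined.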

JavaScript doesn’t have a “dot matches line breaks” option, but you can use an all-inclusive ...
