9.9. Remove XML-Style Comments by Steven Levithan, Jan Goyvaerts

9.9. Remove XML-Style Comments

Problem

You want to remove comments from an (X)HTML or XML document. For example, you want to remove development comments from a web page before it is served to web browsers, or you want to perform subsequent searches without finding any matches within comments.

Solution

Finding comments is not a difficult task, thanks to the availability of lazy quantifiers. Here is the regular expression for the job:

<!--.*?-->
Regex options: Dot matches line breaks
Regex flavors: .NET, Java, XRegExp, PCRE, Perl, Python, Ruby

That’s pretty straightforward. As usual, though, standard JavaScript has traditionally lacked a “dot matches line breaks” option (unless you use the XRegExp library; the native s flag arrived only with ECMAScript 2018), so you’ll need to replace the dot with an all-inclusive character class for the regular expression to match comments that span more than one line. The following version works with standard JavaScript:

<!--[\s\S]*?-->
Regex options: None
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
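
To see how the two variants relate, here is a minimal sketch in Python, one of the flavors listed above. It assumes that Python's re.DOTALL flag plays the role of the “dot matches line breaks” option; the sample string and the variable names are made up for illustration.

```python
import re

# Hypothetical sample: one comment that spans two lines.
sample = "<p>one</p><!-- first\nsecond --><p>two</p>"

# Dot with the "dot matches line breaks" option (re.DOTALL in Python).
dotall = re.compile(r"<!--.*?-->", re.DOTALL)

# Same pattern without the option: the dot cannot cross the line break,
# so the multi-line comment is not matched.
plain_dot = re.compile(r"<!--.*?-->")

# All-inclusive character class: needs no option at all.
char_class = re.compile(r"<!--[\s\S]*?-->")

print(dotall.findall(sample))      # ['<!-- first\nsecond -->']
print(plain_dot.findall(sample))   # []
print(char_class.findall(sample))  # ['<!-- first\nsecond -->']
```

The character-class version behaves the same whether or not the option is set, which is why it is the more portable of the two.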

To remove the comments, replace all matches with the empty string (i.e., nothing). Recipe 3.14 lists code to replace all matches of a regex.
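
As a rough illustration of that replacement step (not the code from Recipe 3.14), the following Python sketch replaces every match with the empty string using re.sub. The names COMMENT, remove_comments, and page are hypothetical and chosen only for this example.

```python
import re

# The option-free version of the regex from the Solution.
COMMENT = re.compile(r"<!--[\s\S]*?-->")

def remove_comments(markup: str) -> str:
    """Return markup with every match of COMMENT replaced by the empty string."""
    return COMMENT.sub("", markup)

# Hypothetical input with a single-line and a multi-line comment.
page = "<ul>\n<!-- TODO: add items -->\n<li>a</li><!-- dev\nnote --></ul>"

print(repr(remove_comments(page)))  # '<ul>\n\n<li>a</li></ul>'
```

Only the comments themselves are removed; surrounding whitespace, such as the line breaks around the first comment, stays in place.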

Discussion

How it works

At the beginning and end of this regular expression are the literal character sequences <!-- and -->. Since none of those characters are special in regex syntax (except within character classes, where hyphens create ranges), they don’t need to be escaped. That just leaves the .*? or [\s\S]*? in the middle of the regex to examine ...
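
One way to see what the lazy quantifier in the middle contributes is to compare it with its greedy counterpart. The sketch below is an illustration in Python with a made-up input string; the greedy pattern is shown only for contrast and is not part of the recipe.

```python
import re

# Hypothetical input: two comments with content between them that must survive.
text = "<!-- a -->keep<!-- b -->"

# Lazy quantifier: stops at the first --> after each <!--.
lazy = re.findall(r"<!--.*?-->", text, re.DOTALL)

# Greedy quantifier: runs to the last --> in the input,
# swallowing the text between the two comments.
greedy = re.findall(r"<!--.*-->", text, re.DOTALL)

print(lazy)    # ['<!-- a -->', '<!-- b -->']
print(greedy)  # ['<!-- a -->keep<!-- b -->']

# The delimiters are written unescaped and still match literally.
print(bool(re.fullmatch(r"<!---->", "<!---->")))  # True
```

With the greedy form, replacing matches with the empty string would also delete "keep", which is why the recipe relies on the lazy quantifier.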
