
Regular Expressions Cookbook, 2nd Edition by Steven Levithan, Jan Goyvaerts


9.6. Decode XML Entities

Problem

You want to convert all character entities defined by the XML standard to their corresponding literal characters. The conversion should handle named character references (such as &amp;, &lt;, and &quot;) as well as numeric character references (be they in decimal notation as &#0931; or &#931;, or in hexadecimal notation as &#x03A3;, &#x3A3;, or &#x3a3;).

Solution

Regular expression

&(?:#([0-9]+)|#x([0-9a-fA-F]+)|([0-9a-zA-Z]+));
Regex options: None
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby

This regular expression includes three capturing groups. Only one of the groups participates in any particular match and captures a value. Using three groups like this allows you to easily check which type of entity was matched.
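As a rough illustration of how the three groups separate the entity types, the following Python sketch reports which group participated in each match. The function name classify_entities and the labels "decimal", "hexadecimal", and "named" are made up for this example and do not come from the recipe.

```python
import re

# The regular expression from this recipe, unchanged.
ENTITY_RE = re.compile(r"&(?:#([0-9]+)|#x([0-9a-fA-F]+)|([0-9a-zA-Z]+));")

def classify_entities(text):
    """Return a (kind, captured_text) pair for each entity reference in text.

    Hypothetical helper for illustration only.
    """
    found = []
    for match in ENTITY_RE.finditer(text):
        decimal, hexadecimal, name = match.groups()
        # Exactly one of the three groups is not None for any given match.
        if decimal is not None:
            found.append(("decimal", decimal))
        elif hexadecimal is not None:
            found.append(("hexadecimal", hexadecimal))
        else:
            found.append(("named", name))
    return found

print(classify_entities("&#0931; &#x3a3; &lt;"))
```

In Python, a group that did not participate in the match is reported as None, so a simple None check is enough to tell the three cases apart.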

Replace matches with their corresponding literal characters

Use the regular expression just shown, together with the code in Recipe 3.16. The code examples listed there show how to perform a search-and-replace with replacement text generated in code.

When writing your replacement callback function, use backreferences to determine the appropriate replacement character. If group 1 captured a value, backreference 1 holds a numeric character reference in decimal notation, possibly with leading zeros. If group 2 captured a value, backreference 2 holds a numeric character reference in hexadecimal notation, possibly with leading zeros. If group 3 captured a value, backreference 3 holds an entity name. Use a lookup object, dictionary, ...
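The callback described above might look roughly like this in Python. This is a minimal sketch, not the book's own listing: the names decode_xml_entities, _replace_entity, and XML_ENTITIES are hypothetical, the lookup dictionary holds only the five entities predefined by XML, and leaving unknown names and out-of-range code points unchanged is an assumption made here.

```python
import re

# The regular expression from this recipe, unchanged.
ENTITY_RE = re.compile(r"&(?:#([0-9]+)|#x([0-9a-fA-F]+)|([0-9a-zA-Z]+));")

# Lookup dictionary for the five entities predefined by XML.
XML_ENTITIES = {"lt": "<", "gt": ">", "amp": "&", "quot": '"', "apos": "'"}

def _replace_entity(match):
    """Replacement callback: pick the literal character for one match."""
    decimal, hexadecimal, name = match.groups()
    try:
        if decimal is not None:
            # int() accepts leading zeros in a string, so "0931" gives 931.
            return chr(int(decimal, 10))
        if hexadecimal is not None:
            return chr(int(hexadecimal, 16))
    except (ValueError, OverflowError):
        # Assumption: a code point outside the Unicode range is left as-is.
        return match.group(0)
    # Assumption: an unknown entity name is left untouched.
    return XML_ENTITIES.get(name, match.group(0))

def decode_xml_entities(text):
    """Decode named and numeric character references in a single pass."""
    return ENTITY_RE.sub(_replace_entity, text)

print(decode_xml_entities("&lt;p&gt; &#0931; &#x3a3; &amp;lt;"))
```

Because all replacements happen in one pass, the output of one replacement is never rescanned: &amp;lt; becomes &lt; rather than <. The sketch does not check whether a numeric reference denotes a character that XML actually permits.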
