Expat Encodings

XML documents may be encoded in character sets other than Unicode as long as they can be mapped into the Unicode character set. Expat places further restrictions on encodings; see the xmlparse.h header file in the Expat distribution for details.

Expat has built-in encodings for: UTF-8, ISO-8859-1, UTF-16, and US-ASCII. Encodings are set through either the XML declaration encoding attribute or the ProtocolEncoding option to XML::Parser or XML::Parser::Expat.
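As a minimal sketch of the second route, the following passes ProtocolEncoding to XML::Parser for a document that carries no XML declaration. The sample document and the $text accumulator are made up for illustration, and the example assumes XML::Parser is installed.

```perl
use strict;
use warnings;
use XML::Parser;

# Illustrative input: no XML declaration, and the byte \xE9 is
# "e with acute accent" in ISO-8859-1 (it is not valid UTF-8 here).
my $latin1_doc = "<name>Ren\xE9</name>";

my $text = '';
my $parser = XML::Parser->new(
    # Tell Expat how to decode the bytes, since the document does not say.
    ProtocolEncoding => 'ISO-8859-1',
    Handlers         => {
        # Collect character data as it is reported.
        Char => sub { my ($expat, $str) = @_; $text .= $str },
    },
);

$parser->parse($latin1_doc);
print "Parsed character data successfully\n" if length $text;
```

Without the ProtocolEncoding option, Expat would treat the input as UTF-8 and should reject the stray \xE9 byte as not well-formed.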

For encodings other than the built-ins, Expat calls the function load_encoding in the XML::Parser::Expat package with the encoding name. This function searches the directories listed in @XML::Parser::Expat::Encoding_Path for a file whose name matches the lowercased encoding name with a .enc extension, and loads the first match it finds.
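The search described above can be sketched in plain Perl. The helper find_encoding_file is hypothetical and is not part of XML::Parser; it only illustrates the lowercase-name, .enc-extension, first-match-wins lookup, and the directory list in the usage lines is an assumption.

```perl
use strict;
use warnings;
use File::Spec;

# Hypothetical helper (not an XML::Parser function): return the path of the
# first encoding map for $name found in @dirs, or nothing if none exists.
sub find_encoding_file {
    my ($name, @dirs) = @_;

    # The map file is the lowercased encoding name plus ".enc".
    my $file = lc($name) . '.enc';

    for my $dir (@dirs) {
        my $candidate = File::Spec->catfile($dir, $file);
        return $candidate if -e $candidate;    # first match wins
    }
    return;
}

# Example usage with an assumed search list; the current directory is
# searched before /usr/local/share/encodings.
my $path = find_encoding_file('ISO-8859-2', '.', '/usr/local/share/encodings');
print defined $path ? "Would load $path\n" : "No map file found\n";
```

Because the first match wins, a directory placed at the front of @XML::Parser::Expat::Encoding_Path (for example with unshift) should take precedence over the directories already in the list.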

If you wish to build your own encoding maps, check out the XML::Encoding module from CPAN.
