The htmlentitydefs Module

The htmlentitydefs module contains a dictionary with many ISO Latin-1 character entities used by HTML. Its use is demonstrated in Example 5-10.

Example 5-10. Using the htmlentitydefs Module

File: htmlentitydefs-example-1.py

import htmlentitydefs

entities = htmlentitydefs.entitydefs

for entity in "amp", "quot", "copy", "yen":
    print entity, "=", entities[entity]

amp = &
quot = "
copy = ©
yen = ¥
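The listing above targets Python 2. On Python 3 the module was renamed html.entities, and its entitydefs dictionary maps each name to a text string. The following is a minimal sketch of the same lookup under that assumption; the helper name lookup is hypothetical and exists only for illustration.

```python
# Python 3 sketch: htmlentitydefs is available as html.entities
from html.entities import entitydefs, name2codepoint

def lookup(names):
    # hypothetical helper: map each entity name to its replacement text
    return {name: entitydefs[name] for name in names}

for name, text in lookup(["amp", "quot", "copy", "yen"]).items():
    # name2codepoint gives the numeric code point for the same name
    print(name, "=", text, "(code point %d)" % name2codepoint[name])
```

The name2codepoint dictionary is handy when the numeric value is needed rather than the character itself.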

Example 5-11 shows how to combine regular expressions with this dictionary to translate entities in a string (the opposite of cgi.escape).

Example 5-11. Using the htmlentitydefs Module to Translate Entities

File: htmlentitydefs-example-2.py

import htmlentitydefs
import re
import cgi

pattern = re.compile(r"&(\w+?);")

def descape_entity(m, defs=htmlentitydefs.entitydefs):
    # callback: translate one entity to its ISO Latin value
    try:
        return defs[m.group(1)]
    except KeyError:
        return m.group(0) # use as is

def descape(string):
    return pattern.sub(descape_entity, string)

print descape("&lt;spam&amp;eggs&gt;")
print descape(cgi.escape("<spam&eggs>"))

<spam&eggs>
<spam&eggs>
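The same regular-expression callback can be sketched for Python 3, assuming html.entities.name2codepoint in place of htmlentitydefs.entitydefs and html.escape in place of cgi.escape (which newer Python versions no longer provide). This is an illustrative port, not the book's listing.

```python
import re
from html import escape
from html.entities import name2codepoint

pattern = re.compile(r"&(\w+?);")

def descape_entity(m, defs=name2codepoint):
    # callback: translate one named entity to its character
    try:
        return chr(defs[m.group(1)])
    except KeyError:
        return m.group(0)  # unknown name: use as is

def descape(text):
    return pattern.sub(descape_entity, text)

print(descape("&lt;spam&amp;eggs&gt;"))
print(descape(escape("<spam&eggs>")))
```

Like the original, this sketch leaves unknown names and numeric references such as &#169; untouched. Python 3.4 and later also ship html.unescape, which handles those cases and is usually the simpler choice.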

Finally, Example 5-12 shows how to translate reserved XML characters and ISO Latin-1 characters into an XML string. This is similar to cgi.escape, but it also replaces non-ASCII characters.

Example 5-12. Escaping ISO Latin-1 Entities

File: htmlentitydefs-example-3.py

import htmlentitydefs
import re, string

# this pattern matches substrings of reserved and non-ASCII characters
pattern = re.compile(r"[&<>\"\x80-\xff]+")
...
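Because the listing above is cut off, the following is only a sketch of one way such an escaper might work on Python 3, not a reconstruction of the book's code. It assumes html.entities.codepoint2name for the name lookup; the function names escape_char and escape_entities are hypothetical.

```python
import re
from html.entities import codepoint2name

# matches runs of reserved characters and Latin-1 non-ASCII characters
pattern = re.compile(r'[&<>"\x80-\xff]+')

def escape_char(ch):
    # prefer a named entity; fall back to a numeric character reference
    code = ord(ch)
    name = codepoint2name.get(code)
    if name is not None:
        return "&%s;" % name
    return "&#%d;" % code

def escape_entities(text):
    # replace every character in each matched run
    return pattern.sub(lambda m: "".join(map(escape_char, m.group())), text)

print(escape_entities("<spam&eggs>"))
print(escape_entities("caf\xe9 \xa9 2000"))
```

Characters in the matched range that have no entity name, such as those between \x80 and \x9f, come out as numeric references in this sketch.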
