Converting HTML entities with BeautifulSoup

HTML entities are strings such as "&amp;" or "&lt;". They encode ordinary ASCII characters that have special meaning in HTML. For example, "&lt;" is the entity for "<": a literal "<" can't appear in the text of an HTML document because it marks the start of a tag, so it has to be escaped as "&lt;". Likewise, "&amp;" is the entity for "&", which, as we've just seen, is the character that begins an entity. If you need to process the text within an HTML document, you'll want to convert these entities back to their normal characters so that you can recognize and handle them appropriately.
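To make the mapping between entities and characters concrete, here is a minimal sketch that uses Python's standard-library html module rather than BeautifulSoup, so it runs without installing anything. The sample string is made up for illustration, and this is not the recipe's own code.

```python
from html import escape, unescape

# Entity-encoded text as it might appear inside an HTML document
# (a made-up sample string).
encoded = "1 &lt; 2 &amp;&amp; 3 &gt; 2"

# unescape() turns each entity back into the character it stands for.
decoded = unescape(encoded)
print(decoded)

# escape() goes the other way, replacing "&", "<" and ">" with entities.
# quote=False leaves quotation marks untouched.
print(escape(decoded, quote=False))
```

The first print shows the text with "<", ">" and "&" restored, and the second shows that escaping the decoded text gives back the original entity-encoded string.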

Getting ready

You'll need to install BeautifulSoup, which you should be able to do ...
