Dealing with non-ASCII text and HTML entities

HTML is less structured than the result of a database query or a pandas DataFrame. You may be tempted to manipulate HTML with regular expressions or string functions, but that approach works only in a limited number of cases. You are better off using specialized Python libraries to process HTML. In this recipe, we will use the clean_html() function of the lxml library. This function strips all JavaScript and CSS from an HTML page.
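To illustrate why a parser is more dependable than regular expressions, here is a minimal sketch. It uses the standard library's html.parser module as a stand-in so that it runs without extra packages; it is not lxml's clean_html() and does not reproduce its behavior. The names TextExtractor and visible_text, and the sample page, are hypothetical and chosen only for this example.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        # convert_charrefs=True turns entities such as &eacute; into characters.
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are outside script/style elements.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)


def visible_text(html_source):
    """Return the text of an HTML string without script or style content."""
    parser = TextExtractor()
    parser.feed(html_source)
    parser.close()
    return parser.text()


# Hypothetical sample page: the script contains markup-like text that a
# naive tag-stripping regular expression would wrongly keep as content.
page = (
    "<html><head><style>p { color: red; }</style>"
    "<script>var s = '<p>not content</p>';</script></head>"
    "<body><p>Caf&eacute; menu</p></body></html>"
)

print(visible_text(page))
```

The parser tracks whether it is inside a script or style element, which a pattern that simply deletes tags cannot do. A dedicated library such as lxml handles many more cases than this sketch.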

American Standard Code for Information Interchange (ASCII) was the dominant encoding standard on the Internet until the end of 2007, when UTF-8 (8-bit Unicode) took over first place. ASCII is limited to the English alphabet and has no support for the alphabets of other languages. ...
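The ASCII limitation is easy to observe with the standard library. The following is a minimal sketch, not the recipe's own code; the sample string and the function names (can_encode_ascii, to_ascii_with_entities) are assumptions made for illustration.

```python
import html


def can_encode_ascii(text):
    """Return True if every character in text fits in ASCII."""
    try:
        text.encode("ascii")
    except UnicodeEncodeError:
        return False
    return True


def to_ascii_with_entities(text):
    """Encode text as ASCII bytes, replacing other characters
    with numeric character references such as &#233;."""
    return text.encode("ascii", errors="xmlcharrefreplace")


sample = "Caf\u00e9"  # "Café": the last character is outside ASCII

# ASCII cannot represent the accented character.
print(can_encode_ascii(sample))

# UTF-8 can: the three ASCII letters take one byte each, the accented one two.
utf8_bytes = sample.encode("utf-8")
print(len(utf8_bytes))

# In HTML, a non-ASCII character can be written as a character reference,
# and html.unescape() converts such references back to text.
escaped = to_ascii_with_entities(sample)
print(escaped)
print(html.unescape(escaped.decode("ascii")) == sample)
```

Encoding with errors="xmlcharrefreplace" is one way to keep a document pure ASCII while preserving non-ASCII characters as HTML entities; html.unescape() reverses the step.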
