Detecting and converting character encodings
Text processing often turns up text in a nonstandard character encoding. Ideally, all text would be ASCII or UTF-8, but that is not the reality. When you have text that is neither ASCII nor UTF-8 and you don't know its encoding, you'll need to detect the encoding and convert the text to a standard one before doing further processing.
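To make the problem concrete, here is a minimal sketch using only Python's built-in codecs. It assumes the source encoding is already known (Latin-1 in this example); the helper name reencode_to_utf8 is hypothetical and is not part of the recipe's code.

```python
def reencode_to_utf8(raw: bytes, source_encoding: str) -> bytes:
    """Decode bytes from a known source encoding and re-encode them as UTF-8."""
    return raw.decode(source_encoding).encode("utf-8")


# "café" stored as Latin-1: the accented letter is the single byte 0xE9.
latin1_bytes = "café".encode("latin-1")

# Treating those bytes as UTF-8 fails, because 0xE9 starts a multi-byte
# sequence that is never completed.
try:
    latin1_bytes.decode("utf-8")
    decoded_as_utf8 = True
except UnicodeDecodeError:
    decoded_as_utf8 = False

# Once the real encoding is known, conversion is a decode followed by an encode.
utf8_bytes = reencode_to_utf8(latin1_bytes, "latin-1")
```

The hard part in practice is the step this sketch skips: working out the source encoding when nobody tells you what it is.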
Getting ready
You'll need to install the charade module using sudo pip install charade or sudo easy_install charade. You can learn more about charade at https://pypi.python.org/pypi/charade.
How to do it...
Encoding detection and conversion functions are provided in encoding.py. These are simple wrapper functions around the charade ...
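The contents of encoding.py are not shown in this excerpt, so the following is only a sketch of what such wrappers might look like, not the book's code. The names detect_encoding and to_unicode are hypothetical. The sketch relies on charade.detect(), which takes a bytes object and returns a dictionary with 'encoding' and 'confidence' keys. As an added assumption, it falls back to chardet (which exposes the same detect() call) and then to a UTF-8 default when neither module is installed.

```python
import importlib


def _load_detector():
    """Return the detect() function from charade or chardet, or None if neither is installed."""
    for module_name in ("charade", "chardet"):
        try:
            return importlib.import_module(module_name).detect
        except ImportError:
            continue
    return None


def detect_encoding(data: bytes, default: str = "utf-8") -> str:
    """Guess the encoding of raw bytes; use `default` when no guess is available."""
    detect = _load_detector()
    if detect is not None:
        guess = detect(data)  # e.g. {'encoding': ..., 'confidence': ...}
        if guess.get("encoding"):
            return guess["encoding"]
    return default


def to_unicode(data: bytes) -> str:
    """Decode raw bytes to str using the detected encoding."""
    encoding = detect_encoding(data)
    try:
        return data.decode(encoding)
    except (UnicodeDecodeError, LookupError):
        # The guess was wrong or names a codec Python lacks:
        # decode as UTF-8 and mark undecodable bytes instead of failing.
        return data.decode("utf-8", errors="replace")
```

A call such as to_unicode(raw_bytes) returns a str that can then be written out as UTF-8. Detection is statistical, so very short inputs can be misidentified; checking the 'confidence' value before trusting a guess is a reasonable extra safeguard.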