Python Cookbook, 3rd Edition by Brian K. Jones, David Beazley

2.12. Sanitizing and Cleaning Up Text

Problem

Some bored script kiddie has entered the text “pýtĥöñ” into a form on your web page and you’d like to clean it up somehow.

Solution

Sanitizing and cleaning up text comes up in a wide variety of problems involving text parsing and data handling. At a very simple level, you might use basic string functions (e.g., str.upper() and str.lower()) to convert text to a standard case. Simple replacements using str.replace() or re.sub() can focus on removing or changing very specific character sequences. You can also normalize text using unicodedata.normalize(), as shown in Recipe 2.9.
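
As a rough illustration of how these simple tools can be chained, here is a minimal sketch. The helper name basic_clean and the particular choices it makes (NFC normalization, lowercasing, collapsing spaces and tabs) are assumptions made for this example, not part of the recipe:

```python
import re
import unicodedata

def basic_clean(text):
    # Hypothetical helper: chains the simple steps described above.
    text = unicodedata.normalize('NFC', text)  # one consistent composed form
    text = text.lower()                        # convert to a standard case
    text = text.replace('\r\n', '\n')          # change one specific sequence
    text = re.sub(r'[ \t]+', ' ', text)        # collapse runs of spaces/tabs
    return text

print(repr(basic_clean('PYTHON \t is\tAwesome\r\n')))
# 'python is awesome\n'
```

Each step targets one narrow kind of mess, which is why this approach gets unwieldy once many different characters need handling.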

However, you might want to take the sanitization process a step further. Perhaps, for example, you want to eliminate whole ranges of characters or strip diacritical marks. To do so, you can turn to the often-overlooked str.translate() method. To illustrate, suppose you’ve got a messy string such as the following:

>>> s = 'pýtĥöñ\fis\tawesome\r\n'
>>> s
'pýtĥöñ\x0cis\tawesome\r\n'
>>>

The first step is to clean up the whitespace. To do this, make a small translation table and use translate():

>>> remap = {
...     ord('\t') : ' ',
...     ord('\f') : ' ',
...     ord('\r') : None      # Deleted
... }
>>> a = s.translate(remap)
>>> a
'pýtĥöñ is awesome\n'
>>>

As you can see here, whitespace characters such as \t and \f have been remapped to a single space. The carriage return \r has been deleted entirely. Characters that do not appear in the table, such as the trailing \n, pass through unchanged.
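
For comparison, the same kind of table can also be built with the built-in str.maketrans(), which takes the characters to replace, their replacements, and the characters to delete. The sketch below assumes the same sample string as above; the variable name table is chosen for this example only:

```python
s = 'pýtĥöñ\fis\tawesome\r\n'

# Table written by hand: keys are code points, values are replacements
# (None means "delete this character").
remap = {ord('\t'): ' ', ord('\f'): ' ', ord('\r'): None}

# Equivalent table from str.maketrans(x, y, z): each character in x maps
# to the character at the same position in y; characters in z are deleted.
table = str.maketrans('\t\f', '  ', '\r')

cleaned = s.translate(table)
print(repr(cleaned))
print(cleaned == s.translate(remap))   # True
```

The two tables are not identical dictionaries (str.maketrans() stores replacement code points rather than one-character strings), but translate() treats them the same way.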

You can take this remapping idea a step further and build much bigger tables. ...
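
One way a much bigger table might look, assuming the goal is the diacritic stripping mentioned earlier: map every Unicode combining character to None, decompose the text so that accents become separate combining characters, and then translate. The names cmb_chrs and strip_marks are hypothetical, and this is only a sketch of the idea:

```python
import sys
import unicodedata

# Map the code point of every combining character to None (delete it).
# unicodedata.combining() returns a nonzero class for combining characters.
cmb_chrs = dict.fromkeys(
    c for c in range(sys.maxunicode + 1)
    if unicodedata.combining(chr(c))
)

def strip_marks(text):
    # NFD splits characters such as 'ý' into a base letter plus a
    # combining mark, which the table then removes.
    decomposed = unicodedata.normalize('NFD', text)
    return decomposed.translate(cmb_chrs)

print(repr(strip_marks('pýtĥöñ is awesome\n')))
# 'python is awesome\n'
```

Building the table scans the entire Unicode range once, so it makes sense to create it a single time and reuse it.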
