Chapter 7. Mangle Data Like a Pro

In this chapter, you’ll learn many techniques for taming data. Most of them concern these built-in Python data types:

strings

Sequences of Unicode characters, used for text data.

bytes and bytearrays

Sequences of eight-bit integers, used for binary data.
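To make the distinction concrete, here is a minimal sketch of the three types side by side. The variable names (text, data, buf) are chosen only for illustration.

```python
# A str holds Unicode characters; its length counts characters.
text = "café"

# A bytes object holds integers in the range 0-255 and is immutable.
data = bytes([99, 97, 102, 233])

# A bytearray holds the same kind of values but can be changed in place.
buf = bytearray(data)
buf[0] = 67  # replace the first value (99, 'c') with 67 ('C')

# Trying the same assignment on bytes raises TypeError.
try:
    data[0] = 67
    bytes_is_mutable = True
except TypeError:
    bytes_is_mutable = False

print(len(text), list(data), list(buf), bytes_is_mutable)
```

Indexing a bytes or bytearray object returns an integer, not a one-character string, which is one quick way to tell binary data from text.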

Text Strings

Text is the most familiar type of data to most readers, so we’ll begin with some of the powerful features of text strings in Python.

Unicode

All of the text examples in this book thus far have been plain old ASCII. ASCII was defined in the 1960s, when computers were the size of refrigerators and only slightly better at performing computations. The basic unit of computer storage is the byte, which can store 256 unique values in its eight bits. For various reasons, ASCII used only seven of those bits (128 unique values): 26 uppercase letters, 26 lowercase letters, 10 digits, some punctuation symbols, some spacing characters, and some nonprinting control codes.
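The arithmetic above can be checked directly in Python. This is a small sketch using the standard string module and the built-in ord() function. The sample strings are illustrative, and str.isascii() assumes Python 3.7 or later.

```python
import string

# Seven bits give 128 values; eight bits give 256.
seven_bit_values = 2 ** 7
eight_bit_values = 2 ** 8

# The letter and digit counts mentioned in the text.
n_upper = len(string.ascii_uppercase)
n_lower = len(string.ascii_lowercase)
n_digits = len(string.digits)

# Every character of an ASCII string has a code point below 128.
sample = "hot dog"
codes = [ord(ch) for ch in sample]
all_below_128 = all(code < 128 for code in codes)

# str.isascii() (Python 3.7+) performs the same check.
plain_is_ascii = sample.isascii()
umlaut_is_ascii = "Gewürztraminer".isascii()

print(seven_bit_values, eight_bit_values, n_upper, n_lower, n_digits)
print(codes, all_below_128, plain_is_ascii, umlaut_is_ascii)
```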

Unfortunately, the world has more letters than ASCII provides. You could have a hot dog at a diner, but never a Gewürztraminer at a café. Many attempts have been made to add more letters and symbols, and you’ll see them at times. Two of them are:

  • Latin-1, or ISO 8859-1

  • Windows code page 1252

Each of these uses all eight bits, but even that’s not enough, especially when you need non-European languages. Unicode is an ongoing international standard to define the characters of all the world’s languages, plus symbols ...
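The limits of these encodings are easy to see by trying to encode a few strings. The following is a minimal sketch: try_encode is a hypothetical helper written for this illustration, and the sample strings are arbitrary. It relies only on str.encode() and the codec names "ascii", "latin-1", and "cp1252" from the standard library.

```python
def try_encode(text, encoding):
    """Return text encoded as bytes, or None if the encoding cannot represent it."""
    try:
        return text.encode(encoding)
    except UnicodeEncodeError:
        return None

# ASCII has no é, but both eight-bit encodings store it in a single byte.
cafe_ascii = try_encode("café", "ascii")
cafe_latin1 = try_encode("café", "latin-1")
cafe_cp1252 = try_encode("café", "cp1252")

# The two eight-bit encodings disagree about some characters:
# the euro sign exists in Windows code page 1252 but not in Latin-1.
euro_latin1 = try_encode("€", "latin-1")
euro_cp1252 = try_encode("€", "cp1252")

# Neither has room for non-European scripts.
japanese_latin1 = try_encode("日本語", "latin-1")
japanese_cp1252 = try_encode("日本語", "cp1252")

print(cafe_ascii, cafe_latin1, cafe_cp1252)
print(euro_latin1, euro_cp1252)
print(japanese_latin1, japanese_cp1252)
```

With only 256 slots per byte, each eight-bit encoding has to choose which characters to leave out.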
