Appendix C. Understanding Unicode

Some Background on Characters

Before we see what Unicode is, it makes sense to step back slightly and think about just what it means to store “characters” in digital files. Anyone who uses a tool like a text editor usually just thinks of what they are doing as entering some characters: numbers, letters, punctuation, and so on. But behind the scenes a little bit more is going on. “Characters” that are stored on digital media must be stored as sequences of ones and zeros, and some encoding and decoding must happen to turn these ones and zeros into the characters we see on a screen or type in with a keyboard.
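To make the round trip between characters and bits concrete, here is a minimal sketch in Python 3. The sample string and the choice of the ASCII encoding are assumptions made purely for illustration; any other encoding would follow the same encode/decode pattern with different bit patterns.

```python
# Illustrative sample only: a short string of plain ASCII characters.
text = "Hi!"

# Encoding: turn the characters into bytes (ASCII is assumed here).
encoded = text.encode("ascii")

# Show each byte as the eight ones and zeros that would be stored.
bits = " ".join(format(byte, "08b") for byte in encoded)
print(bits)  # 01001000 01101001 00100001

# Decoding: turn the stored bytes back into characters.
decoded = encoded.decode("ascii")
print(decoded)  # Hi!
```

The same three bytes could be read back as entirely different characters under a different encoding, which is why the writer and the reader of a file have to agree on one.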

Sometime around the 1960s, a few decisions were made about just what ones and zeros (bits) would represent ...
