Chapter 7: Encodings

It’s sometimes easy to forget, ensconced as many of us are in an English-speaking bubble, that the world is a multilingual place. It has different languages, different alphabets, and different sets of symbols—not just English with its 26 letters, simple punctuation, and handful of symbols.

Even if we’re not trying to work directly in one of these many different languages, though, it doesn’t take very long when working with text in our scripts and programs to encounter the frustration of character-encoding issues. Output littered with boxes and question marks, odd characters showing up unexpectedly (seeing Ã¶ instead of ö, for example), the dreaded “invalid byte sequence” error: these are all character-encoding issues, and ...
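To make two of these symptoms concrete, here is a minimal Ruby sketch of how they can arise. The helper names misread_as_latin1 and invalid_sequence_message are hypothetical, introduced only for this illustration. The sketch assumes the garbled output comes from UTF-8 bytes being read as ISO-8859-1, which is one common cause among several.

```ruby
# Hypothetical helper: treat the bytes of a UTF-8 string as if they were
# ISO-8859-1, then transcode the result to UTF-8 for display.
def misread_as_latin1(utf8_text)
  utf8_text.dup.force_encoding("ISO-8859-1").encode("UTF-8")
end

# Hypothetical helper: label raw bytes as UTF-8 without checking them,
# then run a regular expression over the result. Returns the error
# message if Ruby rejects the string, or nil if the bytes were valid.
def invalid_sequence_message(bytes)
  mislabeled = bytes.pack("C*").force_encoding("UTF-8")
  mislabeled =~ /x/
  nil
rescue ArgumentError => e
  e.message
end

# "ö" is the two bytes C3 B6 in UTF-8. Read one byte at a time as
# ISO-8859-1, those bytes become the two characters "Ã" and "¶".
puts misread_as_latin1("\u00F6")

# 0xF6 is "ö" in ISO-8859-1 but is not a valid byte sequence in UTF-8.
puts invalid_sequence_message([0xF6]).inspect

# Declaring the real source encoding before transcoding recovers the text.
puts [0xF6].pack("C*").force_encoding("ISO-8859-1").encode("UTF-8")
```

In both cases the bytes themselves are intact; only the encoding label attached to them is wrong, which is why the last line can recover the original character.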

