Detecting Encodings

So far, we’ve looked only at conversions between known character encodings: that is, converting text from one particular encoding to another when we know what both of them are. But sometimes we don’t know exactly what we’re dealing with. In those cases, we have to resort to guessing the character encoding in question.

Ideally, we’d follow one general procedure: guess the character encoding of the text and, if the answer is anything other than UTF-8, convert the text to UTF-8.

This is definitely possible, but it will always be a guess, so it isn’t going to work 100 percent of the time. But often a guess is all we need or all we can do, so it’s definitely worth exploring.
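To make the guess-then-convert logic concrete, here is a minimal sketch that uses only Ruby's built-in String and Encoding classes. It is not necessarily the approach this section goes on to use. The helper name `guess_and_convert_to_utf8` and the candidate list are assumptions made for illustration: the method tries each candidate in order and keeps the first one under which the bytes are valid and convertible.

```ruby
# Candidate encodings to try, most likely first. This list is an
# assumption for illustration; adjust it to the data you expect.
CANDIDATE_ENCODINGS = [
  Encoding::UTF_8,
  Encoding::Windows_1252,
  Encoding::ISO_8859_1
].freeze

# Hypothetical helper: returns a UTF-8 copy of +bytes+, guessing the
# source encoding from +candidates+, or nil if no candidate fits.
def guess_and_convert_to_utf8(bytes, candidates = CANDIDATE_ENCODINGS)
  candidates.each do |encoding|
    # Reinterpret the raw bytes under the candidate encoding.
    text = bytes.dup.force_encoding(encoding)
    next unless text.valid_encoding?

    begin
      return text.encode(Encoding::UTF_8)
    rescue EncodingError
      # Some byte has no mapping in this candidate; try the next one.
      next
    end
  end
  nil
end

latin1_bytes = "caf\xE9".b   # "café" as it would be stored in Latin-1
puts guess_and_convert_to_utf8(latin1_bytes)   # prints "café"
```

Because single-byte encodings accept almost any byte sequence, the order of the candidates effectively decides the result, which is exactly why this remains a guess rather than a detection.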

In Ruby, we can do this guessing ...
