Removing Smart Quotes

Problem

You want simple ASCII text out of a document in MS Word, but when you save it as text some odd characters still remain.

Solution

Translate the odd characters back to simple ASCII like this:

$ tr '\221\222\223\224\226\227' '\047\047""--' <odd.txt >plain.txt

Discussion

Such “smart quotes” come from the Windows-1252 character set, and may also show up in email messages that you save as text. To quote from Wikipedia on this subject:

A few mail clients send curved quotes using the Windows-1252 codes but mark the text as ISO-8859-1 causing problems for decoders that do not make the dubious assumption that C1 control codes in ISO-8859-1 text were meant to be Windows-1252 printable characters.

To clean up such text, we can use the tr command. The 221 and 222 (octal) characters are the opening and closing curved single quotes, and will be translated to simple single quotes. We write the replacement single quote in octal (047) because a literal single quote would end the single-quoted shell string. The 223 and 224 (octal) are the opening and closing curved double quotes, and will be translated to simple double quotes. The double quotes can be typed literally within the second argument since the single quotes protect them from shell interpretation. The 226 and 227 (octal) are the en dash and em dash, and will be translated to hyphens. The second hyphen in the second argument is not strictly needed with GNU tr, which repeats the last character of the second argument to match the length of the first, but POSIX leaves that behavior unspecified, so it’s better to be specific.
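
To see the translation at work without a Word document on hand, the following sketch builds a small test file containing the six bytes discussed above and runs the same tr command on it. The sample text is made up for illustration, and the LC_ALL=C setting is an assumption that is not part of the recipe: it is a precaution so that tr treats the input strictly as single bytes in a multibyte locale. Note that it overwrites odd.txt and plain.txt in the current directory.

```shell
# Create a sample file holding the six Windows-1252 bytes (octal 221-224, 226, 227).
# The wording of the sample is hypothetical; only the byte values matter.
printf '\221single\222 \223double\224 en\226dash em\227dash\n' > odd.txt

# Same translation as in the Solution; LC_ALL=C (an added precaution) asks
# for byte-wise processing regardless of the current locale.
LC_ALL=C tr '\221\222\223\224\226\227' '\047\047""--' < odd.txt > plain.txt

# Show the cleaned-up result.
cat plain.txt
# prints: 'single' "double" en-dash em-dash
```

Running od -c on odd.txt and plain.txt before and after is a handy way to confirm exactly which bytes were changed.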
