Removing Smart Quotes
Problem
You want simple ASCII text out of a document in MS Word, but when you save it as text some odd characters still remain.
Solution
Translate the odd characters back to simple ASCII like this:
$ tr '\221\222\223\224\226\227' '\047\047""\055\055' <odd.txt >plain.txt
Discussion
Such “smart quotes” come from the Windows-1252 character set, and may also show up in email messages that you save as text. To quote from Wikipedia on this subject:
A few mail clients send curved quotes using the Windows-1252 codes but mark the text as ISO-8859-1 causing problems for decoders that do not make the dubious assumption that C1 control codes in ISO-8859-1 text were meant to be Windows-1252 printable characters.
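Before translating anything, it can help to check whether a file actually contains these bytes. The following is a minimal sketch, not part of the original recipe: count_smart is a hypothetical helper name, and the sample input is made up for illustration. It assumes a tr that accepts octal escapes (GNU and BSD tr both do) and sets LC_ALL=C so the input is treated as raw bytes.

```shell
# count_smart is a hypothetical helper name, not part of the original recipe.
# It reads standard input and prints how many Windows-1252 smart-quote
# and dash bytes (octal 221-224, 226, 227) it contains.
count_smart() {
    # -c complements the set and -d deletes, so only the six bytes survive;
    # wc -c then counts them. LC_ALL=C makes tr work on raw bytes.
    LC_ALL=C tr -cd '\221\222\223\224\226\227' | wc -c
}

# Made-up sample: one opening and one closing curved double quote.
printf 'plain \223quoted\224 text\n' | count_smart
```

To check a real file, redirect it into the function instead, as in count_smart <odd.txt. A count of zero suggests the file needs no cleanup of this kind.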
To clean up such text, we can use the tr command. The 221 and 222 (octal) curved single quotes will be translated to simple single quotes. We specify the replacements in octal (047) to make it easier on us, since the shell uses single quotes as a delimiter. The 223 and 224 (octal) are opening and closing curved double quotes, and will be translated to simple double quotes. The double quotes can be typed within the second argument since the single quotes protect them from shell interpretation. The 226 and 227 (octal) are dash characters and will be translated to hyphens. Take care with those hyphens: tr reads a hyphen between two characters as a range, so a literal "-- at the end of the second argument can be taken as the range from the double quote to the hyphen rather than as two hyphens. Writing them in octal (\055\055) avoids that. And no, the second hyphen is not technically needed with GNU or BSD tr, since they repeat the last character to match the length of the first argument, but it’s better to be specific.
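The translation described above can be tried on a small sample without needing a Word document. This is only a sketch under stated assumptions: unsmart is a hypothetical helper name, the sample sentence is invented, and every replacement character is written in octal (047, 042, 055) so that none of them can be mistaken for a quote delimiter or a tr range. LC_ALL=C is set because some tr implementations reject these bytes as invalid input in a UTF-8 locale.

```shell
# unsmart is a hypothetical helper name, not part of the original recipe.
# It reads standard input and writes it back with the six Windows-1252
# bytes replaced by plain ASCII quotes and hyphens.
unsmart() {
    LC_ALL=C tr '\221\222\223\224\226\227' '\047\047\042\042\055\055'
}

# Build a made-up sample line containing five of the six bytes, then clean it.
printf '\223It\222s here\224 \226 or \227 there\n' | unsmart
```

On the sample line this should produce plain double quotes, a plain apostrophe, and two hyphens. For a real file, the same function can be used as unsmart <odd.txt >plain.txt.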
See Also