Converting Between Character Sets

The ultimate solution to this character set morass is to use Unicode in either UTF-16 or UTF-8 format for all your XML documents. An increasing number of tools support one of these two formats natively; even the unassuming Notepad in Windows NT 4.0, 2000, and XP offers an option to save files in Unicode. Microsoft Word 97 and later saves the text of its documents in Unicode, although unlike XML documents, Word files are hardly pure text: much of the binary data in a Word file is not Unicode or any other kind of text. However, Word 2000 and later can save plain text files in Unicode.

To save as plain Unicode text in Word 2000, select the format Encoded Text from the Save As Type: choice menu in Word’s Save As dialog box. Then select one of the four Unicode formats in the resulting File Conversion dialog box. In Word 2003, select the plain text format instead. When you save, Word pops up a dialog box that prompts you for the encoding. Choose Other Encoding and then select one of the four Unicode formats in the list box on the right.

Most current tools are still adapted primarily for vendor-specific character sets that can’t handle more than a few languages at one time. Thus, learning how to convert your documents from proprietary to more standard character sets is crucial.
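As a rough illustration of what such a conversion involves, the following Python sketch decodes bytes from one character set and re-encodes them in another. The function names transcode and convert_file are hypothetical, and Windows code page 1252 (cp1252) is only an assumed example of a vendor-specific source encoding; substitute whatever encoding your files actually use.

```python
def transcode(data: bytes, src_encoding: str, dst_encoding: str = "utf-8") -> bytes:
    """Decode bytes from src_encoding and re-encode them in dst_encoding.

    Raises UnicodeDecodeError if the data is not valid in src_encoding, and
    UnicodeEncodeError if a character cannot be represented in dst_encoding.
    """
    text = data.decode(src_encoding)   # bytes -> Unicode string
    return text.encode(dst_encoding)   # Unicode string -> bytes


def convert_file(src_path: str, dst_path: str,
                 src_encoding: str = "cp1252",   # assumed example source encoding
                 dst_encoding: str = "utf-8") -> None:
    """Hypothetical helper: convert a whole file from one encoding to another."""
    # Read and write in binary mode so line endings are left untouched.
    with open(src_path, "rb") as src:
        data = src.read()
    with open(dst_path, "wb") as dst:
        dst.write(transcode(data, src_encoding, dst_encoding))
    # This sketch does not rewrite an XML encoding declaration; if the document
    # has one, it must be updated separately to name the new encoding.
```

For example, convert_file("chapter.xml", "chapter-utf8.xml") would read a cp1252 file and write a UTF-8 copy; both file names here are placeholders.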

Some of the better XML and HTML editors let you choose the character set you wish to save in and perform automatic conversions from the native character set you use for editing. ...
