Unicode Demystified

Book Description

"Rich has a clear, colloquial style that allows him to make even complex Unicode matters understandable. People dealing with Unicode will find this book a valuable resource."

--Dr. Mark Davis, President, The Unicode Consortium

As the software marketplace becomes more global in scope, programmers are recognizing the importance of the Unicode standard for engineering robust software that works across multiple regions, countries, languages, alphabets, and scripts. Unicode Demystified offers an in-depth introduction to the encoding standard and provides the tools and techniques necessary to create today's globally interoperable software systems.

An ideal complement to the specifics found in The Unicode Standard, Version 3.0 (Addison-Wesley, 2000), this practical guidebook brings the "big picture" of Unicode into focus for the day-to-day programmer and the internationalization specialist alike. Beginning with a structural overview of the standard and a discussion of its heritage and motivations, the book then shifts focus to the various writing systems represented by Unicode--along with the challenges associated with each. From there, the book looks at Unicode in action and presents strategies for implementing various aspects of the standard.

Topics covered include:

  • The basics of Unicode--what it is and what it isn't

  • The history and development of character encoding

  • The architecture and salient features of Unicode, including character properties, normalization forms, and storage and serialization formats

  • The character repertoire: scripts of Europe, the Middle East, Africa, Asia, and more, plus numbers, punctuation, symbols, and special characters

  • Implementation techniques: conversions, searching and sorting, rendering, and editing

  • Using Unicode with the Internet, programming languages, and operating systems

With this book as a guide, programmers now have the tools necessary to understand, create, and deploy dynamic software systems across today's increasingly global marketplace.


Table of Contents

  1. Copyright
  2. Foreword
  3. Preface
  4. Unicode in Essence: An Architectural Overview of the Unicode Standard
    1. Language, Computers, and Unicode
      1. What Unicode Is
      2. What Unicode Isn't
      3. The Challenge of Representing Text in Computers
      4. What This Book Does
      5. How This Book Is Organized
    2. A Brief History of Character Encoding
      1. Prehistory
      2. Single-Byte Encoding Systems
      3. Character Encoding Terminology
      4. Multiple-Byte Encoding Systems
      5. ISO 10646 and Unicode
      6. How the Unicode Standard Is Maintained
    3. Architecture: Not Just a Pile of Code Charts
      1. The Unicode Character–Glyph Model
      2. Character Positioning
      3. The Principle of Unification
      4. Multiple Representations
      5. Flavors of Unicode
      6. Character Semantics
      7. Unicode Versions and Unicode Technical Reports
      8. Arrangement of the Encoding Space
      9. Conforming to the Standard
    4. Combining Character Sequences and Unicode Normalization
      1. How Unicode Non-spacing Marks Work
      2. Canonical Decompositions
      3. Canonical Accent Ordering
      4. Double Diacritics
      5. Compatibility Decompositions
      6. Singleton Decompositions
      7. Hangul
      8. Unicode Normalization Forms
      9. Grapheme Clusters
    5. Character Properties and the Unicode Character Database
      1. Where to Get the Unicode Character Database
      2. The UNIDATA Directory
      3. UnicodeData.txt
      4. PropList.txt
      5. General Character Properties
      6. General Category
      7. Other Categories
      8. Properties of Letters
      9. Properties of Digits, Numerals, and Mathematical Symbols
      10. Layout-Related Properties
      11. Normalization-Related Properties
      12. Unihan.txt
    6. Unicode Storage and Serialization Formats
      1. A Historical Note
      2. UTF-32
      3. UTF-16 and the Surrogate Mechanism
      4. Endian-ness and the Byte Order Mark
      5. UTF-8
      6. CESU-8
      7. UTF-EBCDIC
      8. UTF-7
      9. Standard Compression Scheme for Unicode
      10. BOCU
      11. Detecting Unicode Storage Formats
  5. Unicode in Depth: A Guided Tour of the Character Repertoire
    1. Scripts of Europe
      1. The Western Alphabetic Scripts
      2. The Latin Alphabet
      3. Diacritical Marks
      4. The Greek Alphabet
      5. The Cyrillic Alphabet
      6. The Armenian Alphabet
      7. The Georgian Alphabet
    2. Scripts of the Middle East
      1. Bidirectional Text Layout
      2. The Unicode Bidirectional Layout Algorithm
      3. Bidirectional Text in a Text-Editing Environment
      4. The Hebrew Alphabet
      5. The Arabic Alphabet
      6. The Syriac Alphabet
      7. The Thaana Script
    3. Scripts of India and Southeast Asia
      1. Devanagari
      2. Bengali
      3. Gurmukhi
      4. Gujarati
      5. Oriya
      6. Tamil
      7. Telugu
      8. Kannada
      9. Malayalam
      10. Sinhala
      11. Thai
      12. Lao
      13. Khmer
      14. Myanmar
      15. Tibetan
      16. The Philippine Scripts
    4. Scripts of East Asia
      1. The Han Characters
      2. Variant Forms of Han Characters
      3. Han Characters in Unicode
      4. Ideographic Description Sequences
      5. Bopomofo
      6. Japanese
      7. Korean
      8. Half-width and Full-width Characters
      9. Vertical Text Layout
      10. Ruby
      11. Yi
    5. Scripts from Other Parts of the World
      1. Mongolian
      2. Ethiopic
      3. Cherokee
      4. Canadian Aboriginal Syllables
      5. Historical Scripts
    6. Numbers, Punctuation, Symbols, and Specials
      1. Numbers
      2. Punctuation
      3. Special Characters
      4. Symbols Used with Numbers
      5. Other Symbols and Miscellaneous Characters
  6. Unicode in Action: Implementing and Using the Unicode Standard
    1. Techniques and Data Structures for Handling Unicode Text
      1. Useful Data Structures
      2. Testing for Membership in a Class
      3. Mapping Single Characters to Other Values
      4. Mapping Single Characters to Multiple Values
      5. Mapping Multiple Characters to Other Values
      6. Single Versus Multiple Tables
    2. Conversions and Transformations
      1. Converting Between Unicode Encoding Forms
      2. Unicode Normalization
      3. Converting Between Unicode and Other Standards
      4. Case Mapping and Case Folding
      5. Transliteration
    3. Searching and Sorting
      1. The Basics of Language-Sensitive String Comparison
      2. Language-Sensitive Comparison on Unicode Text
      3. Language-Insensitive String Comparison
      4. Sorting
      5. Searching
      6. Using Unicode with Regular Expressions
    4. Rendering and Editing
      1. Line Breaking
      2. Line Layout
      3. Glyph Selection and Positioning
      4. Special Text-Editing Considerations
    5. Unicode and Other Technologies
      1. Unicode and the Internet
      2. Unicode and Programming Languages
      3. Unicode and Operating Systems
      4. Conclusion
    6. Glossary
    7. Bibliography
      1. The Unicode Standard
      2. Other Standards Documents
      3. Books and Magazine Articles
      4. Unicode Conference Papers
      5. Other Papers
      6. Online Resources