Strings

Strings are sequences of characters. However, what constitutes a character depends greatly on the language used and the settings of the operating system on which the application runs. Gone are the days when you could assume each character in a string is represented by a single byte. Multibyte encodings (either fixed length or variable length) of Unicode are needed to accurately store text in today’s global economy.

More recently designed languages, such as Java and C#, have a multibyte fundamental character type, whereas a char in C and C++ is always a single byte. (Recent versions of C and C++ also define a character type wchar_t, which is usually multibyte.) Even with built-in multibyte character types, properly handling all cases of Unicode can be tricky: There are more than 100,000 code points (representation-independent character definitions) defined in Unicode, so they can’t all be represented with a single 2-byte Java or C# char. This problem is typically solved using variable-length encodings, which use sequences of more than one fundamental character type to represent some code points.
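
To see the effect in practice, here is a minimal Java sketch (the emoji code point U+1F600 is just an arbitrary example of a character outside the 16-bit range) showing that a single code point can require two chars, so a string’s length() and its code point count can disagree:

public class CodePointDemo {
    public static void main(String[] args) {
        // U+1F600 (an emoji outside the Basic Multilingual Plane) cannot fit
        // in one 16-bit Java char, so it is stored as a surrogate pair.
        String s = new String(Character.toChars(0x1F600));

        System.out.println(s.length());                      // 2 -- two chars
        System.out.println(s.codePointCount(0, s.length())); // 1 -- one code point
    }
}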

One such encoding is UTF-16, used to encode strings in Java and C#. UTF-16 represents most of the commonly used Unicode code points in a single 16-bit char and uses two 16-bit chars to represent the remainder. UTF-8, another common encoding, is frequently used for text stored in files or transmitted across networks. UTF-8 uses one to four 8-bit chars to encode all Unicode code points.
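
As a rough illustration of the size difference (the sample characters here are arbitrary), the following Java sketch uses the standard java.nio.charset classes to count how many bytes the same code points occupy under UTF-8 and UTF-16:

import java.nio.charset.StandardCharsets;

public class EncodingSizeDemo {
    public static void main(String[] args) {
        // Sample characters chosen arbitrarily: ASCII, accented Latin, and an emoji.
        String[] samples = { "A", "é", new String(Character.toChars(0x1F600)) };

        for (String s : samples) {
            // UTF_16BE is used instead of UTF_16 so that no byte-order mark is added.
            System.out.printf("U+%04X: %d byte(s) in UTF-8, %d byte(s) in UTF-16%n",
                    s.codePointAt(0),
                    s.getBytes(StandardCharsets.UTF_8).length,
                    s.getBytes(StandardCharsets.UTF_16BE).length);
        }
    }
}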
