3.12. Detecting Illegal UTF-8 Characters

Problem

Your program accepts external input in UTF-8 encoding. You need to make sure that the UTF-8 encoding is valid.

Solution

Scan the input string for illegal UTF-8 sequences. If any illegal sequences are detected, reject the input.
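
As a minimal sketch of such a scan (not the original text's own routine), the function below walks a byte buffer and returns 0 as soon as it meets an illegal sequence. The name utf8_is_valid_sketch is hypothetical. The sketch also assumes the stricter rules of RFC 3629: sequences of at most four bytes, no overlong forms, no surrogate code points, and nothing above 0x10FFFF. That is narrower than the full 31-bit range covered in the Discussion.

```c
#include <stddef.h>

/* Hypothetical helper, not from the original text. Returns 1 if the len
 * bytes at s are well-formed UTF-8 under the RFC 3629 rules (an assumption
 * of this sketch), and 0 otherwise. */
int utf8_is_valid_sketch(const unsigned char *s, size_t len)
{
    size_t i = 0;

    while (i < len) {
        unsigned char lead = s[i];
        unsigned long cp;      /* code point being decoded */
        unsigned long min_cp;  /* smallest value that needs this length */
        size_t follow;         /* continuation bytes announced by lead */
        size_t k;

        if (lead < 0x80) {                    /* 0xxxxxxx: single byte */
            i++;
            continue;
        } else if ((lead & 0xE0) == 0xC0) {   /* 110xxxxx */
            follow = 1; cp = lead & 0x1F; min_cp = 0x80;
        } else if ((lead & 0xF0) == 0xE0) {   /* 1110xxxx */
            follow = 2; cp = lead & 0x0F; min_cp = 0x800;
        } else if ((lead & 0xF8) == 0xF0) {   /* 11110xxx */
            follow = 3; cp = lead & 0x07; min_cp = 0x10000;
        } else {
            return 0;  /* stray continuation byte, or a 5/6-byte lead */
        }

        if (len - i - 1 < follow)
            return 0;  /* sequence is cut off by the end of the input */

        for (k = 1; k <= follow; k++) {
            if ((s[i + k] & 0xC0) != 0x80)
                return 0;  /* continuation bytes must look like 10xxxxxx */
            cp = (cp << 6) | (unsigned long)(s[i + k] & 0x3F);
        }

        if (cp < min_cp)
            return 0;  /* overlong form of a smaller value */
        if (cp > 0x10FFFFUL)
            return 0;  /* beyond the RFC 3629 limit */
        if (cp >= 0xD800UL && cp <= 0xDFFFUL)
            return 0;  /* surrogate code points are not characters */

        i += follow + 1;
    }
    return 1;
}
```

A caller would typically reject the input outright when the function returns 0, rather than trying to repair it. The length is passed explicitly so that the check does not depend on the buffer being zero-terminated.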

Discussion

UTF-8 is an encoding that represents multibyte character sets in a way that is backward-compatible with single-byte character sets such as ASCII. Another advantage of UTF-8 is that a zero-valued (NUL) byte never appears in the data unless it encodes an actual NUL character. Encodings such as Unicode’s UCS-2 may (and often do) contain NUL bytes as “padding” when they are treated as byte streams. For example, the letter “A” is 0x41 in ASCII or UTF-8, but it is 0x0041 in UCS-2.
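
To make the “A” example concrete, the sketch below writes out both encodings byte by byte and counts the zero-valued bytes in each. The helper name count_zero_bytes_sketch and the two arrays are hypothetical, and big-endian byte order is assumed for the UCS-2 form.

```c
#include <stddef.h>

/* Hypothetical helper: counts the zero-valued bytes in a buffer. */
size_t count_zero_bytes_sketch(const unsigned char *buf, size_t len)
{
    size_t i, zeros = 0;

    for (i = 0; i < len; i++) {
        if (buf[i] == 0x00)
            zeros++;
    }
    return zeros;
}

/* The letter "A" in each encoding, one array element per byte. */
static const unsigned char a_utf8[]    = { 0x41 };
static const unsigned char a_ucs2_be[] = { 0x00, 0x41 };  /* big-endian assumed */
```

Code that treats its input as a C string would stop at the first zero byte of the UCS-2 form, whereas the UTF-8 form passes through byte-oriented string functions intact.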

The first byte in a UTF-8 sequence determines the number of bytes that follow it to make up the complete sequence. If the upper bit of the first byte is clear, the byte is a complete single-byte sequence. Otherwise, the number of consecutive upper bits set in the first byte, minus one, indicates the number of bytes that follow. The bit immediately following that run is always clear, and the remaining bits are used as part of the character encoding. Each byte that follows the first always has its uppermost bit set and the next bit clear (binary 10xxxxxx); its remaining six bits are combined with the encoding bits from the other bytes in the sequence to compute the character. Table 3-2 lists the binary encodings for the range of characters from 0x00000000 to 0x7FFFFFFF.

Table 3-2. UTF-8 encoding byte sequences

Byte range        UTF-8 binary representation

0x00000000 ...
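
The first-byte rule described in the Discussion can be sketched as a small classifier. The function names below are hypothetical and not taken from the original text. The sketch follows the 0x00000000 to 0x7FFFFFFF scheme described above, so it accepts first bytes that announce up to five following bytes. It only classifies bytes by their upper bits and does not by itself prove that a sequence is legal.

```c
/* Hypothetical helper: returns the number of bytes that should follow the
 * given first byte, or -1 if the byte cannot start a sequence. */
int utf8_follow_count_sketch(unsigned char lead)
{
    int ones = 0;  /* length of the run of set bits at the top of the byte */

    while (ones < 8 && (lead & (0x80 >> ones)))
        ones++;

    if (ones == 0)
        return 0;   /* 0xxxxxxx: complete single-byte sequence */
    if (ones == 1)
        return -1;  /* 10xxxxxx: a following byte, never a first byte */
    if (ones > 6)
        return -1;  /* 0xFE and 0xFF do not start any sequence */
    return ones - 1;
}

/* Hypothetical helper: nonzero if b has the 10xxxxxx form that every byte
 * after the first must have. */
int utf8_is_following_byte_sketch(unsigned char b)
{
    return (b & 0xC0) == 0x80;
}
```

For instance, 0xC3 has two upper bits set, so one byte must follow it, while 0x80 on its own can only appear after a first byte.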
