3.12. Detecting Illegal UTF-8 Characters
Problem
Your program accepts external input in UTF-8 encoding. You need to make sure that the UTF-8 encoding is valid.
Solution
Scan the input string for illegal UTF-8 sequences. If any illegal sequences are detected, reject the input.
Discussion
UTF-8 is an encoding that represents multibyte character sets in a way that is backward-compatible with single-byte character sets. Another advantage of UTF-8 is that the encoded data never contains a NULL (zero) byte unless the text itself contains an actual NULL character. Encodings such as Unicode's UCS-2 may (and often do) contain NULL bytes as "padding" when they are treated as byte streams. For example, the letter "A" is 0x41 in ASCII or UTF-8, but it is 0x0041 in UCS-2.
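To make the padding point concrete, here is a minimal sketch that counts zero bytes in two hand-written encodings of the letter "A". The helper `count_zero_bytes` and the array names are illustrative and do not come from the original text; the UCS-2 array assumes big-endian byte order.

```c
#include <stddef.h>

/* The letter "A" as a UTF-8 byte stream (identical to ASCII). */
static const unsigned char utf8_A[] = { 0x41 };

/* The same letter as a UCS-2 byte stream; big-endian order is assumed. */
static const unsigned char ucs2_A[] = { 0x00, 0x41 };

/* Hypothetical helper: count the zero-valued bytes in a buffer. */
static size_t count_zero_bytes(const unsigned char *buf, size_t len) {
  size_t i, zeros = 0;

  for (i = 0; i < len; i++)
    if (buf[i] == 0x00) zeros++;
  return zeros;
}
```

Calling `count_zero_bytes(utf8_A, sizeof utf8_A)` finds no zero bytes, while the UCS-2 array contains one. Code that treats the UCS-2 stream as a C string would therefore stop at the first byte.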
The first byte in a UTF-8 sequence determines the number of bytes that follow it to make up the complete sequence. A single-byte sequence has its upper bit clear. In a multibyte sequence, the number of consecutive upper bits set in the first byte, minus one, is the number of bytes that follow. A bit that is never set immediately follows that run of set bits, and the remaining bits are used as part of the character encoding. Each byte that follows the first byte always has its upper two bits set and unset, respectively (binary 10); its remaining six bits are combined with the encoding bits from the other bytes in the sequence to compute the character. Table 3-2 lists the binary encodings for the range of characters from 0x00000000 to 0x7FFFFFFF.
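The first-byte rule can be sketched in a few lines of C. The function names `utf8_trailing_bytes` and `utf8_is_continuation` are hypothetical, and the sketch assumes the 31-bit scheme described here, in which a sequence may be up to six bytes long.

```c
/* Hypothetical helper: return the number of bytes that must follow a lead
 * byte, or -1 if the byte cannot start a sequence. */
static int utf8_trailing_bytes(unsigned char lead) {
  int ones = 0;

  /* Count the consecutive set bits, starting from the most significant. */
  while (ones < 8 && (lead & (0x80 >> ones))) ones++;

  if (ones == 0) return 0;   /* 0xxxxxxx: single-byte sequence           */
  if (ones == 1) return -1;  /* 10xxxxxx: a following byte, not a lead   */
  if (ones > 6) return -1;   /* 0xFE and 0xFF never start a sequence     */
  return ones - 1;           /* set bits minus one                       */
}

/* A following byte has its upper two bits set and unset: 10xxxxxx. */
static int utf8_is_continuation(unsigned char b) {
  return (b & 0xC0) == 0x80;
}
```

For example, the lead byte 0xE2 (binary 11100010) has three upper bits set, so two more bytes are expected.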
Table 3-2. UTF-8 encoding byte sequences
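Byte range | UTF-8 binary representation |
---|---|
0x00000000 to 0x0000007F | 0xxxxxxx |
0x00000080 to 0x000007FF | 110xxxxx 10xxxxxx |
0x00000800 to 0x0000FFFF | 1110xxxx 10xxxxxx 10xxxxxx |
0x00010000 to 0x001FFFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx |
0x00200000 to 0x03FFFFFF | 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx |
0x04000000 to 0x7FFFFFFF | 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx |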
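The scan called for in the Solution might look like the following. This is a minimal sketch, not the recipe's own implementation: the name `utf8_is_valid` is hypothetical, the input is assumed to arrive as a buffer with an explicit length, and the sequence layout is the 31-bit scheme described above. As an extra assumption, the sketch also rejects overlong forms (a character encoded in more bytes than it needs), a check the discussion above does not cover.

```c
#include <stddef.h>

/* Hypothetical validator: returns 1 if buf[0..len) is well-formed under the
 * 31-bit UTF-8 scheme (sequences of one to six bytes), 0 otherwise. */
static int utf8_is_valid(const unsigned char *buf, size_t len) {
  /* Smallest value that needs 0..5 following bytes; anything lower
   * inside a longer sequence is an overlong encoding. */
  static const unsigned long min_value[] = {
    0x0UL, 0x80UL, 0x800UL, 0x10000UL, 0x200000UL, 0x4000000UL
  };
  size_t i = 0;

  while (i < len) {
    unsigned char lead = buf[i++];
    unsigned long cp;
    int trailing, k;

    if (lead < 0x80) continue;                          /* 0xxxxxxx */
    else if ((lead & 0xE0) == 0xC0) { trailing = 1; cp = lead & 0x1F; }
    else if ((lead & 0xF0) == 0xE0) { trailing = 2; cp = lead & 0x0F; }
    else if ((lead & 0xF8) == 0xF0) { trailing = 3; cp = lead & 0x07; }
    else if ((lead & 0xFC) == 0xF8) { trailing = 4; cp = lead & 0x03; }
    else if ((lead & 0xFE) == 0xFC) { trailing = 5; cp = lead & 0x01; }
    else return 0;          /* stray following byte, 0xFE, or 0xFF */

    if ((size_t)trailing > len - i) return 0;           /* truncated input */
    for (k = 0; k < trailing; k++) {
      unsigned char b = buf[i++];
      if ((b & 0xC0) != 0x80) return 0;                 /* not 10xxxxxx */
      cp = (cp << 6) | (b & 0x3F);
    }
    if (cp < min_value[trailing]) return 0;             /* overlong form */
  }
  return 1;
}
```

A caller would reject the input whenever the function returns 0. For instance, the two-byte sequence 0xC0 0x80 is refused because it is an overlong encoding of the NULL character, and a buffer ending in 0xE2 0x82 is refused because the third byte of the sequence is missing.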