Language

Sometimes insight into the person responsible for a message or web site can come from the language they use to express themselves. Part of this is glaringly obvious. If someone sends me an email in Korean, then it is a good bet that she is Korean. But in the case of English, the most common language used on the Internet, you cannot assume that to be the author’s native language.

But careful examination may reveal clues about the author's native language. In most cases, these will add weight to other clues about location and nationality. In others, they will conflict with the other evidence, suggesting that the author is using a computer in a foreign country or that he is a resident of that country.

Email is usually the richest source of this type of clue. Here you want to look at the headers Content-Transfer-Encoding and Content-Type. These occur in the main block of mail headers or in each block of a multipart message. Here is a simple example:

    Content-Transfer-Encoding: quoted-printable
    Content-Type: text/html; charset="iso-8859-1"
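If you have many messages to examine, pulling these two headers out by hand gets tedious. The following is a minimal sketch, not a definitive tool, that uses Python's standard email package to collect the content type, character set, and transfer encoding from the main header block and from every part of a multipart message. The sample message and the function name charset_clues are assumptions made up for illustration; only the two header lines come from the example above.

```python
from email import message_from_string

# Hypothetical sample message: the two headers are the ones shown above,
# the HTML body is invented for this illustration.
raw = (
    'Content-Transfer-Encoding: quoted-printable\n'
    'Content-Type: text/html; charset="iso-8859-1"\n'
    '\n'
    '<p>caf=E9</p>\n'
)

def charset_clues(raw_message):
    """Return (content type, charset, transfer encoding) for each message part."""
    msg = message_from_string(raw_message)
    clues = []
    # walk() yields the message itself and then any nested parts,
    # so single-part and multipart messages are handled the same way.
    for part in msg.walk():
        clues.append((
            part.get_content_type(),
            part.get_content_charset(),              # None if no charset is declared
            part.get("Content-Transfer-Encoding"),   # None if the header is absent
        ))
    return clues

clues = charset_clues(raw)
print(clues)
```

For a multipart message, the container parts typically report no character set, so the entries worth reading are the ones for the text parts.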

The Content-Type header is the more important of the two, but it helps to know a little about content encoding first.

The original specification for email was set up to handle only the first 128 characters of the ASCII character set, which can be encoded in 7 bits. That was fine for basic messages in English or in other languages that use this basic character set. But for languages with even a few special characters, such as a German umlaut or French accented characters, the ...
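The 7-bit limit is easy to see in practice. The sketch below, which assumes Python and its standard quopri module, takes a German word containing an umlaut, shows that it cannot be expressed in plain ASCII, and then encodes its ISO-8859-1 bytes as quoted-printable, the transfer encoding named in the example headers. The sample word is an assumption chosen for illustration.

```python
import quopri

text = "für"  # sample German word containing an umlaut

# Plain 7-bit ASCII has no code for the umlaut.
try:
    text.encode("ascii")
    fits_in_ascii = True
except UnicodeEncodeError:
    fits_in_ascii = False

# In ISO-8859-1 the umlaut is the single byte 0xFC, which needs the eighth bit.
raw_bytes = text.encode("iso-8859-1")

# Quoted-printable rewrites that byte as the three ASCII characters "=FC".
encoded = quopri.encodestring(raw_bytes)

# Decoding reverses the process, provided the charset is known.
decoded = quopri.decodestring(encoded).decode("iso-8859-1")

print(fits_in_ascii, encoded, decoded)
```

Every byte of the encoded form falls within the 7-bit range, but the bytes only turn back into the right characters if the reader knows which character set to apply.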
