In §6.5.2, “Implementing Categories That Do Not Conform to the Classical Theory,” we briefly discussed the machine learning technique known as clustering, which creates a system of categories for classifying a set of resources or documents for which measures of inter-item similarity can be calculated. Clustering programs do not start with a set of resources that are already classified, which makes them unsupervised techniques. The categories they create maximize the similarity of resources within a category and maximize the differences between categories, but these statistically designed categories are not always meaningful ones that people can name and use. We ended Chapter 6 by suggesting that it is often better to start with a designed classification scheme and then train computers with supervised learning techniques to assign new resources to the categories.
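To make the idea of unsupervised clustering concrete, here is a minimal sketch using k-means, one common clustering algorithm. The text above does not name a particular algorithm, so the choice of k-means, the function name `k_means`, the farthest-point seeding, and the two-feature `resources` data are all assumptions made for illustration.

```python
from math import dist  # Euclidean distance, available in Python 3.8+


def k_means(points, k, iterations=10):
    """Group points into k clusters. No category labels are supplied."""
    # Assumed seeding: start from the first point, then repeatedly add the
    # point farthest from the centroids chosen so far (deterministic).
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(dist(p, c) for c in centroids))
        )

    clusters = []
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its current members.
        for i, members in enumerate(clusters):
            if members:
                centroids[i] = tuple(sum(c) / len(members) for c in zip(*members))
    return clusters


# Hypothetical resources, each described by two numeric features.
resources = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1),
             (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
clusters = k_means(resources, k=2)
for number, members in enumerate(clusters):
    print(number, members)
```

The resulting groups are identified only by an index. Nothing in the procedure gives them a name or guarantees that they correspond to a category a person would recognize, which is the limitation noted above.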
Because text is important, ubiquitous, and easy for computers to process, it should not be surprising that a great many computational classification problems involve it. Some of these problems are relatively simple, like identifying the language in which a text is written, which is solved by comparing the probabilities of contiguous strings of one, two, and three characters in the text against the probabilities of those strings in different languages. For example, in English the most likely strings are “the”, “and”, “to”, “of”, “a”, “in”, and so on. But if the most likely strings are “der”, “die”,