Words intended to represent concepts: that is the questionable foundation upon which information retrieval is built. Words in the content. Words in the query. Even collections of images and software and physical objects rely on words in the form of metadata for representation and retrieval. And words are imprecise, ambiguous, indeterminate, vague, opaque; you get the picture. Our language bubbles with synonyms, homonyms, acronyms, and even contronyms (words with contradictory meanings in different contexts such as sanction, cleave, and bi-weekly). And this is before we even talk about the epic numbers of spelling errors committed on a daily basis. In The Mother Tongue, author Bill Bryson shares a wealth of colorful facts about language, including:
The residents of the Trobriand Islands of Papua New Guinea have a hundred words for yams, while the Maoris of New Zealand have thirty-five words for dung.
In the OED, round alone (that is, without variants like rounded and roundup) takes 7 pages to define, or about 15,000 words of text.
English retains probably the richest vocabulary, and most diverse shading of meanings, of any language.... No other language has so many words all saying the same thing.
Interestingly, when this ambiguity of language is subjected to statistical analysis, familiar patterns indicative of power laws, shown in Figure 3-3, emerge. First observed by the Italian economist Vilfredo Pareto in the early 1900s, power laws result in many small ...