CHAPTER 2

TEXT PATTERNS

2.1 INTRODUCTION

Have you ever remembered a certain passage in a book but forgotten where it was? With the advent of electronic texts, this unpleasant experience has been replaced by the joy of using a search utility. Computers have limitations, but their ability to do what they are told without tiring is invaluable when it comes to combing through large electronic documents. Many of the more sophisticated techniques later in this book rely on an initial analysis that starts with one or more searches.

Before beginning with text patterns, consider the following question. Since humans are experts at understanding text, and, at present, computers are essentially illiterate, can a procedure as simple as a search really find something unexpected to a human? Yes, it can, and here is an example. Anyone fluent in English knows that "the" precedes its noun, so the following sentence is clearly ungrammatical.

(2.1) Dog the is hungry.

Putting "the" before the noun corrects the problem, so sentence 2.2 is correct.

(2.2) The dog is hungry.
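
The kind of search involved here can be sketched in a few lines. The following is a minimal illustration, not the book's own method: it is written in Python for brevity, and the function name words_after_the is a hypothetical helper introduced only for this example. It reports which word follows each occurrence of "the", which is enough to tell sentences 2.1 and 2.2 apart.

```python
import re

def words_after_the(text):
    """Return the word following each occurrence of "the" (any capitalization).

    Hypothetical helper for illustration; the name is not from the text.
    """
    # \b is a word boundary, so "the" inside "other" or "Another" is skipped.
    # \s+ requires whitespace after "the", so "then" and "theme" are skipped.
    return re.findall(r"\bthe\s+(\w+)", text, flags=re.IGNORECASE)

print(words_after_the("Dog the is hungry."))  # sentence 2.1
print(words_after_the("The dog is hungry."))  # sentence 2.2
```

For sentence 2.1 the word after "the" is the verb "is", while for sentence 2.2 it is the noun "dog". Run over a large text, the same pattern would list every word that follows "the", which a reader could then scan for surprises.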

A systematically collected sample of text is called a corpus (its plural form is corpora), and large corpora have been collected to study language. For example, the Cambridge International Corpus has over 800 million words and is used in Cambridge University Press language reference books [26]. Since a book has roughly 500 words on a page, this corresponds to roughly 1.6 million pages of text. In such a corpus, is it possible to find a ...
