CHAPTER 6

CONCORDANCE LINES AND CORPUS LINGUISTICS

6.1 INTRODUCTION

A corpus (plural corpora) is a collection of texts that have been put together to research one or more aspects of language. The term comes from Latin, where it means body. Not surprisingly, corpus linguistics is the study of language using a corpus.

The idea of collecting language samples is old. For example, Samuel Johnson’s dictionary was the first in English to emphasize how words are used by supplying over 100,000 quotations (see the introduction of the abridged version edited by Lynch [61] for more details). Note that his dictionary is still in print. In fact, a complete digital facsimile of the first edition is available [62].

In the spirit of Samuel Johnson, a number of large corpora have been developed to support language references, for example, the Longman Dictionary of American English [74] or the Cambridge Grammar of English [26]. To analyze such corpora, this chapter creates concordances.

The next section introduces a few ideas of statistical sampling, and then considers how to apply these to text sampling. The rest of this chapter discusses examples of concordancing, which provide ample opportunity to apply the Perl programming techniques covered in the earlier chapters.

6.2 SAMPLING

Sampling replaces measuring every object in a population with measuring only the objects in a subset. If the sample is representative of the population, then estimates can be computed along with their accuracy. Although taking ...
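As a minimal sketch of the idea, the following Perl code draws a simple random sample without replacement and estimates the population mean together with its standard error. The population (word lengths from a short made-up sentence), the sample size of 8, and the subroutine names simple_random_sample and mean_and_se are all assumptions made for illustration and do not come from the text. The standard error formula used here ignores the finite population correction.

```perl
use strict;
use warnings;
use List::Util qw(shuffle sum);

# Hypothetical population: word lengths from a short made-up sentence.
my @population = map { length($_) } qw(
    a corpus is a collection of texts that have been put together
    to research one or more aspects of language
);

# Draw a simple random sample of $n items without replacement.
sub simple_random_sample {
    my ($n, @items) = @_;
    die "sample size exceeds population size\n" if $n > @items;
    my @shuffled = shuffle(@items);
    return @shuffled[0 .. $n - 1];
}

# Return the sample mean and its estimated standard error.
# Requires at least two values so the sample variance is defined.
sub mean_and_se {
    my @x    = @_;
    my $n    = @x;
    my $mean = sum(@x) / $n;
    my $var  = sum(map { ($_ - $mean) ** 2 } @x) / ($n - 1);
    return ($mean, sqrt($var / $n));
}

my @sample = simple_random_sample(8, @population);
my ($mean, $se) = mean_and_se(@sample);

printf "sample mean     = %.2f (standard error %.2f)\n", $mean, $se;
printf "population mean = %.2f\n", sum(@population) / @population;
```

Because the sample is random, each run prints a different estimate. Over many runs the sample means scatter around the population mean, and the standard error indicates the typical size of that scatter.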
