  • Morgen Kimbrell thinks this is interesting:

The Penn TreeBank (Marcus et al. 1993) is a 4.5-million-word corpus that contains texts from four sources: the Wall Street Journal, the Brown Corpus, ATIS, and the Switchboard Corpus. By contrast, the BNC is a 100-million-word corpus that contains texts from a broad range of genres, domains, and media.


Cover of Natural Language Annotation for Machine Learning


other corpora