Taming Text: How to Find, Organize, and Manipulate It by Grant S. Ingersoll, Thomas S. Morton, and Andrew L. Farris

Chapter 2. Foundations of taming text

In this chapter

  • Understanding text processing building blocks like tokenizing, chunking, parsing, and part-of-speech tagging
  • Extracting text from common file formats using the Apache Tika open source project

Naturally, before we can get started with the hard-core text-taming processes, we need a little warm-up first. We’ll start by laying the groundwork with a short high school English refresher covering topics such as tokenization, stemming, parts of speech, and phrases and clauses. Each of these steps can play an important role in the quality of the results you’ll see when building applications that use text. For instance, the seemingly simple act of splitting up words, especially in languages ...
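
To make the point about splitting words concrete, here is a minimal sketch in Python. It is an illustration only, not code from the book: the function names `whitespace_tokenize` and `regex_tokenize` are hypothetical, and the regular expression is one simple assumption about what counts as a token in English text.

```python
import re


def whitespace_tokenize(text):
    """Split on runs of whitespace only; punctuation stays attached to words."""
    return text.split()


def regex_tokenize(text):
    """Return words (keeping internal apostrophes) and punctuation as separate tokens.

    The pattern is an assumption chosen for illustration; real tokenizers
    handle many more cases (hyphens, URLs, abbreviations, other languages).
    """
    return re.findall(r"\w+(?:'\w+)*|[^\w\s]", text)


sentence = "Don't split naively, please."
print(whitespace_tokenize(sentence))  # ["Don't", 'split', 'naively,', 'please.']
print(regex_tokenize(sentence))       # ["Don't", 'split', 'naively', ',', 'please', '.']
```

The whitespace version leaves "naively," and "please." with punctuation attached, so a later lookup for "naively" would miss it. The regex version separates punctuation but still relies on whitespace and English-style word boundaries, which is exactly the assumption that breaks down in languages that do not delimit words with spaces.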
