Chapter 7. Extracting Information from Text

For any given question, it’s likely that someone has written the answer down somewhere. The amount of natural language text that is available in electronic form is truly staggering, and is increasing every day. However, the complexity of natural language can make it very difficult to access the information in that text. The state of the art in NLP is still a long way from being able to build general-purpose representations of meaning from unrestricted text. If we instead focus our efforts on a limited set of questions or “entity relations,” such as “where are different facilities located” or “who is employed by what company,” we can make significant progress. The goal of this chapter is to answer the following questions:

  1. How can we build a system that extracts structured data from unstructured text?

  2. What are some robust methods for identifying the entities and relationships described in a text?

  3. Which corpora are appropriate for this work, and how do we use them for training and evaluating our models?

Along the way, we’ll apply techniques from the last two chapters to the problems of chunking and named entity recognition.

Information Extraction

Information comes in many shapes and sizes. One important form is structured data, where there is a regular and predictable organization of entities and relationships. For example, we might be interested in the relation between companies and locations. Given a particular company, we would like to be able ...
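
To make the idea of structured data concrete, here is a minimal sketch of the company-and-location relation stored as (entity, relation, entity) tuples. The company names, place names, the "IN" relation label, and the helper functions `locations_of` and `companies_in` are all hypothetical and invented for illustration; they are not taken from the text.

```python
# Hypothetical (organization, relation, location) tuples.
# All names are invented for illustration only.
relations = [
    ("Acme Corp", "IN", "Springfield"),
    ("Globex", "IN", "Shelbyville"),
    ("Initech", "IN", "Springfield"),
]

def locations_of(company, rels):
    """Return the locations recorded for the given company."""
    return [loc for (org, rel, loc) in rels if rel == "IN" and org == company]

def companies_in(location, rels):
    """Return the companies recorded as operating in the given location."""
    return [org for (org, rel, loc) in rels if rel == "IN" and loc == location]

print(locations_of("Acme Corp", relations))
print(companies_in("Springfield", relations))
```

Once the data has this regular shape, a question about companies and locations reduces to a simple filter over the tuples. The hard part, and the subject of this chapter, is getting from unrestricted text to tuples like these.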
