Big Data Glossary by Pete Warden

Chapter 7. NLP

Natural language processing (NLP) is a subset of data processing that’s so crucial, it earned its own section. Its focus is taking messy, human-created text and extracting meaningful information. As you can imagine, this chaotic problem domain has spawned a large variety of approaches, with each tool most useful for particular kinds of text. There’s no magic bullet that will understand written information as well as a human, but if you’re prepared to adapt your use of the results to handle some errors and don’t expect miracles, you can pull out some powerful insights.

The Natural Language Toolkit (NLTK) is a collection of Python modules and datasets that implement common natural language processing techniques. It offers the building blocks that you need to build more complex algorithms for specific problems. For example, you can use it to break up texts into sentences, break sentences into words, stem words by removing common suffixes (like -ing from English verbs), or use machine-readable dictionaries to spot synonyms. The framework is used by most researchers in the field, so you’ll often find cutting-edge approaches included as modules or as algorithms built from its modules. There are also a large number of compatible datasets available, as well as ample documentation.
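As a rough illustration of those building blocks, the sketch below chains sentence splitting, word splitting, and stemming. It assumes the third-party `nltk` package is installed; the function name `stem_sentences` is made up for this example. It deliberately uses an untrained Punkt tokenizer and the Porter stemmer because neither needs a separately downloaded model. Synonym lookup through WordNet is left out because it requires fetching the WordNet corpus first.

```python
# Requires the third-party package: pip install nltk
from nltk.stem import PorterStemmer
from nltk.tokenize import wordpunct_tokenize
from nltk.tokenize.punkt import PunktSentenceTokenizer


def stem_sentences(text):
    """Split text into sentences, then into words, then stem each word.

    Hypothetical helper for illustration; not part of NLTK itself.
    """
    # An untrained Punkt tokenizer needs no downloaded model. It splits on
    # sentence-final punctuation but knows no abbreviations, so input such
    # as "Dr. Smith" may be split incorrectly.
    sentence_splitter = PunktSentenceTokenizer()
    stemmer = PorterStemmer()

    result = []
    for sentence in sentence_splitter.tokenize(text):
        # wordpunct_tokenize separates alphanumeric runs from punctuation;
        # the isalpha() filter then drops the punctuation tokens.
        words = [w for w in wordpunct_tokenize(sentence) if w.isalpha()]
        result.append([stemmer.stem(w) for w in words])
    return result


if __name__ == "__main__":
    sample = "She was running quickly. They kept processing words."
    for stems in stem_sentences(sample):
        print(stems)
```

Each inner list holds the stems for one sentence. Stems are not always dictionary words, which is one of the errors you need to be prepared to handle downstream.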

NLTK isn’t aimed at developers looking for an off-the-shelf solution to domain-specific problems. Its flexibility does mean you need a basic familiarity with the NLP world before you can create solutions ...