Chapter 16 Natural Language Processing

Natural language processing (NLP) is a collection of techniques for working with human language. Examples include flagging e-mails as spam, using Twitter to assess public sentiment, and finding which text documents are about similar topics. Many data scientists never actually need to touch NLP. But enough of them end up needing it, and it is sufficiently different from other subjects, that it deserves a chapter in this book.

This chapter will start with several generic sections about NLP datasets and big-picture concepts. Then I will switch gears to the core NLP concepts, moving from the simple, quick-and-dirty techniques to more complicated ones.

I also want to emphasize that NLP techniques are not strictly limited to language. I've seen them used to parse computer log files, figuring out what “sentences” the computer generates. Personally, I first learned many of the statistical techniques while working in bioinformatics.

16.1 Do I Even Need NLP?

The first question to ask when using NLP is whether you even need it. There is often pressure from customers and bosses to solve problems using NLP, because it is seen as some kind of magical silver bullet. But in my experience, NLP is hard to implement, and it is prone to bizarre errors that are obviously wrong when a human looks at them.

I've seen people bang their heads against a problem using NLP techniques, only to eventually give up and try solving the problem ...
