Chapter 8. Reading and Writing Natural Languages

So far the data we have worked with generally has been in the form of numbers or countable values. In most cases, we’ve simply stored the data without conducting any analysis after the fact. In this chapter, we’ll attempt to tackle the tricky subject of the English language.1

How does Google know what you’re looking for when you type “cute kitten” into its Image Search? Because of the text that surrounds the cute kitten images. How does YouTube know to bring up a certain Monty Python sketch when you type “dead parrot” into its search bar? Because of the title and description text that accompanies each uploaded video. 

In fact, even typing in terms such as “deceased bird monty python” immediately brings up the same “Dead Parrot” sketch, even though the page itself contains no mention of the words “deceased” or “bird.” Google knows that a “hot dog” is a food and that a “boiling puppy” is an entirely different thing. How? It’s all statistics!

Although you might not think that text analysis has anything to do with your project, understanding the concepts behind it can be extremely useful for all sorts of machine learning, as well as the more general ability to model real-world problems in probabilistic and algorithmic terms. 

For instance, the Shazam music service can identify audio as containing a certain song recording, even if that audio contains ambient noise or distortion. Google is working ...

Get Web Scraping with Python now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.