Chapter 8. Reading and Writing Natural Languages

So far, the data we have worked with has generally been in the form of numbers or countable values. In most cases, we’ve simply stored the data without conducting any analysis after the fact. In this chapter, we’ll attempt to tackle the tricky subject of the English language.¹

How does Google know what you’re looking for when you type “cute kitten” into its Image Search? Because of the text that surrounds the cute kitten images. How does YouTube know to bring up a certain Monty Python sketch when you type “dead parrot” into its search bar? Because of the title and description text that accompanies each uploaded video.

In fact, even typing in terms such as “deceased bird monty python” immediately brings up the same “Dead Parrot” sketch, even though the page itself contains no mention of the words “deceased” or “bird.” Google knows that a “hot dog” is a food and that a “boiling puppy” is an entirely different thing. How? It’s all statistics!

Although you might not think that text analysis has anything to do with your project, understanding the concepts behind it can be extremely useful for all sorts of machine learning, as well as the more general ability to model real-world problems in probabilistic and algorithmic terms.

For instance, the Shazam music service can identify audio as containing a certain song recording, even if that audio contains ambient noise or distortion. Google is working ...


Web Scraping with Python by Ryan Mitchell
