Tokenizing and normalizing text

Extracting the contents of the page is just the first step. Before we get to the fun part of analyzing what the article contains (or, if you looked at blog posts, what they are about), we need to split the whole article into sentences and further into words.
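
As a rough illustration of that splitting step, here is a minimal sketch that uses only Python's standard library. The function names, the regular expressions, and the sample text are assumptions made for this example, not the recipe's own code; libraries such as NLTK provide far more robust tokenizers that handle abbreviations, quotations, and similar edge cases.

```python
import re

def split_sentences(text):
    """Naively split text after '.', '!' or '?' when followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def split_words(sentence):
    """Pull out word tokens made of letters, digits, and apostrophes."""
    return re.findall(r"[A-Za-z0-9']+", sentence)

# Hypothetical sample text, used only for illustration.
sample = "She says it works. He said it failed! Who is right?"

sentences = split_sentences(sample)
tokens = [split_words(s) for s in sentences]

for sentence, words in zip(sentences, tokens):
    print(sentence, "->", words)
```

A rule this simple will break on text such as "Dr. Smith arrived.", which is one reason a trained tokenizer is usually preferred on real articles.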

Once that is done, another issue remains: any text contains sentences in different tenses, uses of the passive voice, and rarely seen grammatical constructs. For the purpose of extracting the topic or analyzing the sentiment, we do not need to treat the words said and says separately; the word say is enough. Thus, we will also look at normalizing the text, that is, bringing all the different versions of the same word ...
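
To make the idea of normalization concrete, the sketch below maps inflected forms onto a single base form. The lookup table `BASE_FORMS` and the function `normalize` are hypothetical and hand-written for this example only; in practice a stemmer or lemmatizer from an NLP library would take the place of the table.

```python
# Hypothetical, hand-written lookup table: inflected form -> base form.
# A real stemmer or lemmatizer would replace this dictionary.
BASE_FORMS = {
    "says": "say",
    "said": "say",
    "saying": "say",
}

def normalize(tokens):
    """Lowercase each token and map it to its base form when one is known."""
    result = []
    for token in tokens:
        lowered = token.lower()
        result.append(BASE_FORMS.get(lowered, lowered))
    return result

normalized = normalize(["She", "says", "he", "said", "so"])
print(normalized)
```

Tokens that are missing from the table pass through unchanged apart from lowercasing, so the approach only works as far as the table reaches.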
