Text analysis and TF-IDF on notes

Having discussed how to download a list of notes and activities for a given page or user, we now shift our focus to the textual analysis of their content.

For each post published by a given user, we want to extract the most interesting keywords, which could be used to summarize the post itself.
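As a rough illustration of the idea, the following is a minimal sketch of TF-IDF keyword extraction using only the standard library. The function names (`tokenize`, `top_keywords`), the naive whitespace tokenizer, and the particular weighting (relative term frequency multiplied by the logarithm of the inverse document frequency) are assumptions made for this example, not necessarily the approach used later in the chapter. It also assumes the posts are already plain text.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase, split on whitespace, and trim surrounding punctuation."""
    tokens = (word.strip('.,;:!?"\'()') for word in text.lower().split())
    return [token for token in tokens if token]


def top_keywords(posts, n=3):
    """Return the n highest-scoring TF-IDF terms for each post."""
    docs = [tokenize(post) for post in posts]
    # Document frequency: the number of posts in which each term appears
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))
    total = len(docs)
    results = []
    for tokens in docs:
        term_freq = Counter(tokens)
        scores = {
            term: (count / len(tokens)) * math.log(total / doc_freq[term])
            for term, count in term_freq.items()
        }
        # Highest score first; ties are broken alphabetically
        ranked = sorted(scores, key=lambda term: (-scores[term], term))
        results.append(ranked[:n])
    return results


posts = ["python python data", "data mining", "data social"]
print(top_keywords(posts, n=1))
```

A term that appears in every post, such as "data" in this toy input, receives a score of zero, so the terms that are specific to each post rise to the top.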

While this is intuitively a simple exercise, there are a few subtleties to consider. On the practical side, the content of each post is not always a clean piece of text: it can include HTML tags. Before we can carry out our computation, we need to extract the clean text. While the JSON object returned by the Google+ API has a clear structure, the content itself is not necessarily ...
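To make the cleaning step concrete, here is a minimal sketch that strips HTML tags from a string using the `html.parser` module from the standard library. The helper name `strip_html`, the choice of tags treated as line breaks, and the sample input are assumptions made for illustration. A dedicated HTML library could be used instead, and this is not necessarily the method the chapter adopts.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect text nodes, ignoring the tags themselves."""

    # Tags that visually break a line; a space is inserted in their place
    BREAKING_TAGS = {"br", "p", "div"}

    def __init__(self):
        super().__init__()  # character references are decoded by default
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.BREAKING_TAGS:
            self.chunks.append(" ")

    def handle_data(self, data):
        self.chunks.append(data)


def strip_html(content):
    """Return the text of an HTML fragment with whitespace normalized."""
    parser = _TextExtractor()
    parser.feed(content)
    parser.close()  # flush any text still buffered by the parser
    return " ".join("".join(parser.chunks).split())


print(strip_html("Hello <b>world</b>!<br />New &amp; improved"))
```

The cleaned string can then be tokenized and scored like any other plain-text document.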
