Case study - AlphaGo tweets analytics

Now that we have a good understanding of GraphX, let's apply our newly gained knowledge to analyze a retweet network. As with any big data project, the first task is to define a pipeline: identify the data elements, the source, the transformations, the mapping, and the processing.

Data pipeline

For this case study, I collected Twitter data pertaining to the AlphaGo project:

Figure: Data pipeline

While the full mechanics of data collection from Twitter are out of scope, I will quickly mention the main steps:

  1. Using Python and the tweepy framework, you can download the tweets mentioning the hashtag #alphago. Initially, pull all the tweets that Twitter ...
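Once the tweets are downloaded, a retweet network needs (retweeter, original author) pairs. The following is only a minimal sketch of that extraction, not the exact code used for this case study. It assumes each tweet has been saved as a dictionary in the Twitter v1.1 JSON layout (user.screen_name, entities.hashtags, retweeted_status); the helper names and the sample records are hypothetical and exist purely for illustration.

```python
from collections import Counter


def has_hashtag(tweet, tag):
    """Return True if the tweet's entities include the hashtag (case-insensitive)."""
    hashtags = tweet.get("entities", {}).get("hashtags", [])
    return any(h.get("text", "").lower() == tag.lower() for h in hashtags)


def retweet_edges(tweets, tag="alphago"):
    """Count (retweeter, original_author) pairs among retweets carrying the hashtag."""
    edges = Counter()
    for tweet in tweets:
        original = tweet.get("retweeted_status")
        # Skip original tweets and tweets that do not mention the hashtag.
        if original is None or not has_hashtag(tweet, tag):
            continue
        pair = (tweet["user"]["screen_name"], original["user"]["screen_name"])
        edges[pair] += 1
    return edges


# Hypothetical records, made up for illustration; real data would come from the download step.
sample = [
    {"user": {"screen_name": "alice"},
     "entities": {"hashtags": [{"text": "AlphaGo"}]},
     "retweeted_status": {"user": {"screen_name": "bob"}}},
    {"user": {"screen_name": "alice"},
     "entities": {"hashtags": [{"text": "alphago"}]},
     "retweeted_status": {"user": {"screen_name": "bob"}}},
    {"user": {"screen_name": "carol"},
     "entities": {"hashtags": [{"text": "alphago"}]}},
    {"user": {"screen_name": "dave"},
     "entities": {"hashtags": [{"text": "golang"}]},
     "retweeted_status": {"user": {"screen_name": "bob"}}},
]

if __name__ == "__main__":
    for (src, dst), count in retweet_edges(sample).items():
        print(src, dst, count)
```

Each resulting pair maps naturally onto a directed edge, with the count as a candidate edge attribute, which is the shape of data a GraphX graph is built from.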
