You are previewing Mining the Social Web.

Mining the Social Web

Cover of Mining the Social Web by Matthew A. Russell Published by O'Reilly Media, Inc.
  1. Mining the Social Web
  2. SPECIAL OFFER: Upgrade this ebook with O’Reilly
  3. Preface
    1. Content Updates
      1. February 22, 2012
    2. To Read This Book?
    3. Or Not to Read This Book?
    4. Tools and Prerequisites
    5. Conventions Used in This Book
    6. Using Code Examples
    7. Safari® Books Online
    8. How to Contact Us
    9. Acknowledgments
  4. 1. Introduction: Hacking on Twitter Data
    1. Installing Python Development Tools
    2. Collecting and Manipulating Twitter Data
      1. Tinkering with Twitter’s API
      2. Frequency Analysis and Lexical Diversity
      3. Visualizing Tweet Graphs
      4. Synthesis: Visualizing Retweets with Protovis
    3. Closing Remarks
  5. 2. Microformats: Semantic Markup and Common Sense Collide
    1. XFN and Friends
    2. Exploring Social Connections with XFN
      1. A Breadth-First Crawl of XFN Data
    3. Geocoordinates: A Common Thread for Just About Anything
      1. Wikipedia Articles + Google Maps = Road Trip?
    4. Slicing and Dicing Recipes (for the Health of It)
    5. Collecting Restaurant Reviews
    6. Summary
  6. 3. Mailboxes: Oldies but Goodies
    1. mbox: The Quick and Dirty on Unix Mailboxes
    2. mbox + CouchDB = Relaxed Email Analysis
      1. Bulk Loading Documents into CouchDB
      2. Sensible Sorting
      3. Map/Reduce-Inspired Frequency Analysis
      4. Sorting Documents by Value
      5. couchdb-lucene: Full-Text Indexing and More
    3. Threading Together Conversations
      1. Look Who’s Talking
    4. Visualizing Mail “Events” with SIMILE Timeline
    5. Analyzing Your Own Mail Data
      1. The Graph Your (Gmail) Inbox Chrome Extension
    6. Closing Remarks
  7. 4. Twitter: Friends, Followers, and Setwise Operations
    1. RESTful and OAuth-Cladded APIs
      1. No, You Can’t Have My Password
    2. A Lean, Mean Data-Collecting Machine
      1. A Very Brief Refactor Interlude
      2. Redis: A Data Structures Server
      3. Elementary Set Operations
      4. Souping Up the Machine with Basic Friend/Follower Metrics
      5. Calculating Similarity by Computing Common Friends and Followers
      6. Measuring Influence
    3. Constructing Friendship Graphs
      1. Clique Detection and Analysis
      2. The Infochimps “Strong Links” API
      3. Interactive 3D Graph Visualization
    4. Summary
  8. 5. Twitter: The Tweet, the Whole Tweet, and Nothing but the Tweet
    1. Pen : Sword :: Tweet : Machine Gun (?!?)
    2. Analyzing Tweets (One Entity at a Time)
      1. Tapping (Tim’s) Tweets
      2. Who Does Tim Retweet Most Often?
      3. What’s Tim’s Influence?
      4. How Many of Tim’s Tweets Contain Hashtags?
    3. Juxtaposing Latent Social Networks (or #JustinBieber Versus #TeaParty)
      1. What Entities Co-Occur Most Often with #JustinBieber and #TeaParty Tweets?
      2. On Average, Do #JustinBieber or #TeaParty Tweets Have More Hashtags?
      3. Which Gets Retweeted More Often: #JustinBieber or #TeaParty?
      4. How Much Overlap Exists Between the Entities of #TeaParty and #JustinBieber Tweets?
    4. Visualizing Tons of Tweets
      1. Visualizing Tweets with Tricked-Out Tag Clouds
      2. Visualizing Community Structures in Twitter Search Results
    5. Closing Remarks
  9. 6. LinkedIn: Clustering Your Professional Network for Fun (and Profit?)
    1. Motivation for Clustering
    2. Clustering Contacts by Job Title
      1. Standardizing and Counting Job Titles
      2. Common Similarity Metrics for Clustering
      3. A Greedy Approach to Clustering
      4. Hierarchical and k-Means Clustering
    3. Fetching Extended Profile Information
    4. Geographically Clustering Your Network
      1. Mapping Your Professional Network with Google Earth
      2. Mapping Your Professional Network with Dorling Cartograms
    5. Closing Remarks
  10. 7. Google+: TF-IDF, Cosine Similarity, and Collocations
    1. Harvesting Google+ Data
    2. Data Hacking with NLTK
    3. Text Mining Fundamentals
      1. A Whiz-Bang Introduction to TF-IDF
      2. Querying Google+ Data with TF-IDF
    4. Finding Similar Documents
      1. The Theory Behind Vector Space Models and Cosine Similarity
      2. Clustering Posts with Cosine Similarity
      3. Visualizing Similarity with Graph Visualizations
    5. Bigram Analysis
      1. How the Collocation Sausage Is Made: Contingency Tables and Scoring Functions
    6. Tapping into Your Gmail
      1. Accessing Gmail with OAuth
      2. Fetching and Parsing Email Messages
    7. Before You Go Off and Try to Build a Search Engine…
    8. Closing Remarks
  11. 8. Blogs et al.: Natural Language Processing (and Beyond)
    1. NLP: A Pareto-Like Introduction
      1. Syntax and Semantics
      2. A Brief Thought Exercise
    2. A Typical NLP Pipeline with NLTK
    3. Sentence Detection in Blogs with NLTK
    4. Summarizing Documents
      1. Analysis of Luhn’s Summarization Algorithm
    5. Entity-Centric Analysis: A Deeper Understanding of the Data
      1. Quality of Analytics
    6. Closing Remarks
  12. 9. Facebook: The All-in-One Wonder
    1. Tapping into Your Social Network Data
      1. From Zero to Access Token in Under 10 Minutes
      2. Facebook’s Query APIs
    2. Visualizing Facebook Data
      1. Visualizing Your Entire Social Network
      2. Visualizing Mutual Friendships Within Groups
      3. Where Have My Friends All Gone? (A Data-Driven Game)
      4. Visualizing Wall Data As a (Rotating) Tag Cloud
    3. Closing Remarks
  13. 10. The Semantic Web: A Cocktail Discussion
    1. An Evolutionary Revolution?
    2. Man Cannot Live on Facts Alone
      1. Open-World Versus Closed-World Assumptions
      2. Inferencing About an Open World with FuXi
    3. Hope
  14. Index
  15. About the Author
  16. Colophon
  17. SPECIAL OFFER: Upgrade this ebook with O’Reilly

Collecting and Manipulating Twitter Data

In the extremely unlikely event that you don’t know much about Twitter yet, it’s a real-time, highly social microblogging service that allows you to post short messages of 140 characters or less; these messages are called tweets. Unlike social networks like Facebook and LinkedIn, where a connection is bidirectional, Twitter has an asymmetric network infrastructure of “friends” and “followers.” Assuming you have a Twitter account, your friends are the accounts that you are following and your followers are the accounts that are following you. While you can choose to follow all of the users who are following you, this generally doesn’t happen because you only want your Home Timeline[6] to include tweets from accounts whose content you find interesting. Twitter is an important phenomenon from the standpoint of its incredibly high number of users, as well as its use as a marketing device and emerging use as a transport layer for third-party messaging services. It offers an extensive collection of APIs, and although you can use a lot of the APIs without registering, it’s much more interesting to build up and mine your own network. Take a moment to review Twitter’s liberal terms of service, API documentation, and API rules, which allow you to do just about anything you could reasonably expect to do with Twitter data before doing any heavy-duty development. The rest of this book assumes that you have a Twitter account and enough friends/followers that you have data to mine.


The official Twitter account for this book is @SocialWebMining .

Tinkering with Twitter’s API

A minimal wrapper around Twitter’s web API is available through a package called twitter ( can be installed with easy_install per the norm:

$  easy_install twitter
Searching for twitter

...truncated output...

Finished processing dependencies for twitter

The package also includes a handy command-line utility and IRC bot, so after installing the module you should be able to simply type twitter in a shell to get a usage screen about how to use the command-line utility. However, we’ll focus on working within the interactive Python interpreter. We’ll work though some examples, but note that you can always skim the documentation by running pydoc from the terminal. *nix users can simply type pydoc twitter.Twitter to view the documentation on the Twitter class, while Windows users need to type python -mpydoc twitter.Twitter. If you find yourself reviewing the documentation for certain modules often, you can elect to pass the -w option to pydoc and write out an HTML page that you can save and bookmark in your browser. It’s also worth knowing that running pydoc on a module or class brings up the inline documentation in the same way that running the help() command in the interpreter would. Try typing help(twitter.Twitter) in the interpreter to see for yourself.

Without further ado, let’s find out what people are talking about by inspecting the trends available to us through Twitter’s trends API. Let’s fire up the interpreter and initiate a search. Try following along with Example 1-3, and use the help() function as needed to try to answer as many of your own questions as possible before proceeding.

Example 1-3. Retrieving Twitter trends

>>> import twitter
>>> twitter_api = twitter.Twitter(domain="", api_version='1')
>>> WORLD_WOE_ID = 1 # The Yahoo! Where On Earth ID for the entire world
>>> world_trends = twitter_api.trends._(WORLD_WOE_ID) # get back a callable
>>> [ trend['name'] for trend in world_trends()[0]['trends'] ] # iterate through the trends
[u'#ZodiacFacts', u'#nowplaying', u'#ItsOverWhen', u'#Christoferdrew',
u'Justin Bieber', u'#WhatwouldItBeLike', u'#Sagittarius', u'SNL', u'#SurveySays',

Since you’re probably wondering, the pattern for using the twitter module is simple and predictable: instantiate the Twitter class with a base URL and then invoke methods on the object that correspond to URL contexts. For example, twitter_api._trends(WORLD_WOE_ID) initiates an HTTP call to GET, which you could type into your web browser to get the same set of results. As further context for the previous interpreter session, this chapter was originally drafted on a Saturday night, so it’s not a coincidence that the trend SNL (Saturday Night Live, a popular comedy show that airs in the United States) appears in the list. Now might be a good time to go ahead and bookmark the official Twitter API documentation since you’ll be referring to it quite frequently.

Given that SNL is trending, the next logical step might be to grab some search results about it by using the search API to search for tweets containing that text and then print them out in a readable way as a JSON structure. Example 1-4 illustrates.

Example 1-4. Paging through Twitter search results

>>> twitter_search = twitter.Twitter(domain="")
>>> search_results = []
>>> for page in range(1,6):
...     search_results.append("SNL", rpp=100, page=page))

The code fetches and stores five consecutive batches (pages) of results for a query (q) of SNL, with 100 results per page (rpp). It’s again instructive to observe that the equivalent REST query[7] that we execute in the loop is of the form The trivial mapping between the REST API and the twitter module makes it very simple to write Python code that interacts with Twitter services. After executing the search, the search_results list contains five objects, each of which is a batch of 100 results. You can print out the results in a readable way for inspection by using the json package that comes built-in as of Python Version 2.6, as shown in Example 1-5.

Example 1-5. Pretty-printing Twitter data as JSON

>>> import json
>>> print json.dumps(search_results, sort_keys=True, indent=1)
  "completed_in": 0.088122000000000006, 
  "max_id": 11966285265, 
  "next_page": "?page=2&max_id=11966285265&rpp=100&q=SNL", 
  "page": 1, 
  "query": "SNL", 
  "refresh_url": "?since_id=11966285265&q=SNL", 
  "results": [
    "created_at": "Sun, 11 Apr 2010 01:34:52 +0000", 
    "from_user": "bieber_luv2", 
    "from_user_id": 106998169, 
    "geo": null, 
    "id": 11966285265, 
    "iso_language_code": "en", 
    "metadata": {
     "result_type": "recent"
    "profile_image_url": "", 
    "source": "<a href="">web</a>", 
    "text": " ...truncated... im nt gonna go to sleep happy unless i see @justin...", 
    "to_user_id": null
            ... output truncated - 99 more tweets ...

  "results_per_page": 100, 
  "since_id": 0

    ... output truncated - 4 more pages ...


As of Nov 7, 2011, the from_user_id field in each search result does correspond to the tweet author’s actual Twitter id, which was previously not the case. See Twitter API Issue #214 for details on the evolution and resolution of this issue.

We’ll wait until later in the book to pick apart many of the details in this query (see Chapter 5); the important observation at the moment is that the tweets are keyed by results in the response. We can distill the text of the 500 tweets into a list with the following approach. Example 1-6 illustrates a double list comprehension that’s indented to illustrate the intuition behind it being nothing more than a nested loop.

Example 1-6. A simple list comprehension in Python

>>> tweets = [ r['text'] \
...     for result in search_results \
...         for r in result['results'] ]

List comprehensions are used frequently throughout this book. Although they can look quite confusing if written on a single line, printing them out as nested loops clarifies the meaning. The result of tweets in this particular case is equivalent to defining an empty list called tweets and invoking tweets.append(r['text']) in the same kind of nested loop as presented here. See the “Data Structures” section in the official Python tutorial for more details. List comprehensions are particularly powerful because they usually yield substantial performance gains over nested lists and provide an intuitive (once you’re familiar with them) yet terse syntax.

Frequency Analysis and Lexical Diversity

One of the most intuitive measurements that can be applied to unstructured text is a metric called lexical diversity. Put simply, this is an expression of the number of unique tokens in the text divided by the total number of tokens in the text, which are elementary yet important metrics in and of themselves. It could be computed as shown in Example 1-7.

Example 1-7. Calculating lexical diversity for tweets

>>> words = []
>>> for t in tweets:
...     words += [ w for w in t.split() ]
>>> len(words) # total words
>>> len(set(words)) # unique words
>>> 1.0*len(set(words))/len(words) # lexical diversity
>>> 1.0*sum([ len(t.split()) for t in tweets ])/len(tweets) # avg words per tweet


Prior to Python 3.0, the division operator applies the floor function and returns an integer value (unless one of the operands is a floating-point value). Multiply either the numerator or the denominator by 1.0 to avoid truncation errors.

One way to interpret a lexical diversity of around 0.23 would be to say that about one out of every four words in the aggregated tweets is unique. Given that the average number of words in each tweet is around 14, that translates to just over 3 unique words per tweet. Without introducing any additional information, that could be interpreted as meaning that each tweet carries about 20 percent unique information. What would be interesting to know at this point is how “noisy” the tweets are with uncommon abbreviations users may have employed to stay within the 140 characters, as well as what the most frequent and infrequent terms used in the tweets are. A distribution of the words and their frequencies would be helpful. Although these are not difficult to compute, we’d be better off installing a tool that offers a built-in frequency distribution and many other tools for text analysis.

The Natural Language Toolkit (NLTK) is a popular module we’ll use throughout this book: it delivers a vast amount of tools for various kinds of text analytics, including the calculation of common metrics, information extraction, and natural language processing (NLP). Although NLTK isn’t necessarily state-of-the-art as compared to ongoing efforts in the commercial space and academia, it nonetheless provides a solid and broad foundation—especially if this is your first experience trying to process natural language. If your project is sufficiently sophisticated that the quality or efficiency that NLTK provides isn’t adequate for your needs, you have approximately three options, depending on the amount of time and money you are willing to put in: scour the open source space for a suitable alternative by running comparative experiments and benchmarks, churn through whitepapers and prototype your own toolkit, or license a commercial product. None of these options is cheap (assuming you believe that time is money) or easy.

NLTK can be installed per the norm with easy_install, but you’ll need to restart the interpreter to take advantage of it. You can use the cPickle module to save (“pickle”) your data before exiting your current working session, as shown in Example 1-8.

Example 1-8. Pickling your data

>>> f = open("myData.pickle", "wb")
>>> import cPickle
>>> cPickle.dump(words, f)
>>> f.close() 
$  easy_install nltk
Searching for nltk

...truncated output...

Finished processing dependencies for nltk


If you encounter an “ImportError: No module named yaml” problem when you try to import nltk, execute an easy_install pyYaml, which should clear it up.

After installing NLTK, you might want to take a moment to visit its official website, where you can review its documentation. This includes the full text of Steven Bird, Ewan Klein, and Edward Loper’s Natural Language Processing with Python (O’Reilly), NLTK’s authoritative reference.

What are people talking about right now?

Among the most compelling reasons for mining Twitter data is to try to answer the question of what people are talking about right now. One of the simplest techniques you could apply to answer this question is basic frequency analysis. NLTK simplifies this task by providing an API for frequency analysis, so let’s save ourselves some work and let NLTK take care of those details. Example 1-9 demonstrates the findings from creating a frequency distribution and takes a look at the 50 most frequent and least frequent terms.

Example 1-9. Using NLTK to perform basic frequency analysis

>>> import nltk
>>> import cPickle
>>> words = cPickle.load(open("myData.pickle"))
>>> freq_dist = nltk.FreqDist(words)
>>> freq_dist.keys()[:50] # 50 most frequent tokens
[u'snl', u'on', u'rt', u'is', u'to', u'i', u'watch', u'justin', u'@justinbieber', 
u'be', u'the', u'tonight', u'gonna', u'at', u'in', u'bieber', u'and', u'you', 
u'watching', u'tina', u'for', u'a', u'wait', u'fey', u'of', u'@justinbieber:', 
u'if', u'with', u'so', u"can't", u'who', u'great', u'it', u'going', 
u'im', u':)', u'snl...', u'2nite...', u'are', u'cant', u'dress', u'rehearsal', 
u'see', u'that', u'what', u'but', u'tonight!', u':d', u'2', u'will']

>>> freq_dist.keys()[-50:] # 50 least frequent tokens
[u'what?!', u'whens', u'where', u'while', u'white', u'whoever', u'whoooo!!!!', 
u'whose', u'wiating', u'wii', u'wiig', u'win...', u'wink.', u'wknd.', u'wohh', u'won',
 u'wonder', u'wondering', u'wootwoot!', u'worked', u'worth', u'xo.', u'xx', u'ya', 
u'ya<3miranda', u'yay', u'yay!', u'ya\u2665', u'yea', u'yea.', u'yeaa', u'yeah!', 
u'yeah.', u'yeahhh.', u'yes,', u'yes;)', u'yess', u'yess,', u'you!!!!!', 
u"you'll", u'you+snl=', u'you,', u'youll', u'youtube??', u'youu<3', 
u'youuuuu', u'yum', u'yumyum', u'~', u'\xac\xac']


Python 2.7 added a collections.Counter ( class that facilitates counting operations. You might find it useful if you’re in a situation where you can’t easily install NLTK, or if you just want to experiment with the latest and greatest classes from Python’s standard library.

A very quick skim of the results from Example 1-9 shows that a lot more useful information is carried in the frequent tokens than the infrequent tokens. Although some work would need to be done to get a machine to recognize as much, the frequent tokens refer to entities such as people, times, and activities, while the infrequent terms amount to mostly noise from which no meaningful conclusion could be drawn.

The first thing you might have noticed about the most frequent tokens is that “snl” is at the top of the list. Given that it is the basis of the original search query, this isn’t surprising at all. Where it gets more interesting is when you skim the remaining tokens: there is apparently a lot of chatter about a fellow named Justin Bieber, as evidenced by the tokens @justinbieber, justin, and bieber. Anyone familiar with SNL would also know that the occurrences of the tokens “tina” and “fey” are no coincidence, given Tina Fey’s longstanding affiliation with the show. Hopefully, it’s not too difficult (as a human) to skim the tokens and form the conjecture that Justin Bieber is a popular guy, and that a lot of folks were very excited that he was going to be on the show on the Saturday evening the search query was executed.

At this point, you might be thinking, “So what? I could skim a few tweets and deduce as much.” While that may be true, would you want to do it 24/7, or pay someone to do it for you around the clock? And what if you were working in a different domain that wasn’t as amenable to skimming random samples of short message blurbs? The point is that frequency analysis is a very simple, yet very powerful tool that shouldn’t be overlooked just because it’s so obvious. On the contrary, it should be tried out first for precisely the reason that it’s so obvious and simple. Thus, one preliminary takeaway here is that the application of a very simple technique can get you quite a long way toward answering the question, “What are people talking about right now?”

As a final observation, the presence of “rt” is also a very important clue as to the nature of the conversations going on. The token RT is a special symbol that is often prepended to a message to indicate that you are retweeting it on behalf of someone else. Given the high frequency of this token, it’s reasonable to infer that there were a large amount of duplicate or near-duplicate tweets involving the subject matter at hand. In fact, this observation is the basis of our next analysis.


The token RT can be prepended to a message to indicate that it is being relayed, or “retweeted” in Twitter parlance. For example, a tweet of “RT @SocialWebMining Justin Bieber is on SNL 2nite. w00t?!?” would indicate that the sender is retweeting information gained via the user @SocialWebMining . An equivalent form of the retweet would be “Justin Bieber is on SNL 2nite. w00t?!? Ummm…(via @SocialWebMining)”.

Extracting relationships from the tweets

Because the social web is first and foremost about the linkages between people in the real world, one highly convenient format for storing social web data is a graph. Let’s use NetworkX to build out a graph connecting Twitterers who have retweeted information. We’ll include directionality in the graph to indicate the direction that information is flowing, so it’s more precisely called a digraph . Although the Twitter APIs do offer some capabilities for determining and analyzing statuses that have been retweeted, these APIs are not a great fit for our current use case because we’d have to make a lot of API calls back and forth to the server, which would be a waste of the API calls included in our quota.


At the time this book was written, Twitter imposes a rate limit of 350 API calls per hour for authenticated requests; anonymous requests are limited to 150 per hour. You can read more about the specifics at In Chapters 4 and 5, we’ll discuss techniques for making the most of the rate limiting, as well as some other creative options for collecting data.

Besides, we can use the clues in the tweets themselves to reliably extract retweet information with a simple regular expression. By convention, Twitter usernames begin with an @ symbol and can only include letters, numbers, and underscores. Thus, given the conventions for retweeting, we only have to search for the following patterns:

  • RT followed by a username

  • via followed by a username

Although Chapter 5 introduces a module specifically designed to parse entities out of tweets, Example 1-10 demonstrates that you can use the re module to compile[8] a pattern and extract the originator of a tweet in a lightweight fashion, without any special libraries.

Example 1-10. Using regular expressions to find retweets

>>> import re
>>> rt_patterns = re.compile(r"(RT|via)((?:\b\W*@\w+)+)", re.IGNORECASE)
>>> example_tweets = ["RT @SocialWebMining Justin Bieber is on SNL 2nite. w00t?!?",
...     "Justin Bieber is on SNL 2nite. w00t?!? (via @SocialWebMining)"]
>>> for t in example_tweets:
...     rt_patterns.findall(t)
[('RT', ' @SocialWebMining')]
[('via', ' @SocialWebMining')]

In case it’s not obvious, the call to findall returns a list of tuples in which each tuple contains either the matching text or an empty string for each group in the pattern; note that the regex does leave a leading space on the extracted entities, but that’s easily fixed with a call to strip(), as demonstrated in Example 1-11. Since neither of the example tweets contains both of the groups enclosed in the parenthetical expressions, one string is empty in each of the tuples.


Regular expressions are a basic programming concept whose explanation is outside the scope of this book. The re module documentation is a good place to start getting up to speed, and you can always consult Friedl’s classic Mastering Regular Expressions (O’Reilly) if you want to learn more than you’ll probably ever need to know about them.

Given that the tweet data structure as returned by the API provides the username of the person tweeting and the newly found ability to extract the originator of a retweet, it’s a simple matter to load this information into a NetworkX graph. Let’s create a graph in which nodes represent usernames and a directed edge between two nodes signifies that there is a retweet relationship between the nodes. The edge itself will carry a payload of the tweet ID and tweet text itself.

Example 1-11 demonstrates the process of generating such a graph. The basic steps involved are generalizing a routine for extracting usernames in retweets, flattening out the pages of tweets into a flat list for easier processing in a loop, and finally, iterating over the tweets and adding edges to a graph. Although we’ll generate an image of the graph later, it’s worthwhile to note that you can gain a lot of insight by analyzing the characteristics of graphs without necessarily visualizing them.

Example 1-11. Building and analyzing a graph describing who retweeted whom

>>> import networkx as nx
>>> import re
>>> g = nx.DiGraph()
>>> all_tweets = [ tweet 
...                for page in search_results 
...                    for tweet in page["results"] ]
>>> def get_rt_sources(tweet):
...     rt_patterns = re.compile(r"(RT|via)((?:\b\W*@\w+)+)", re.IGNORECASE)
...     return [ source.strip() 
...              for tuple in rt_patterns.findall(tweet) 
...                  for source in tuple 
...                      if source not in ("RT", "via") ]
>>> for tweet in all_tweets:
...     rt_sources = get_rt_sources(tweet["text"])
...     if not rt_sources: continue
...     for rt_source in rt_sources:
...         g.add_edge(rt_source, tweet["from_user"], {"tweet_id" : tweet["id"]})
>>> g.number_of_nodes()
>>> g.number_of_edges()
>>> g.edges(data=True)[0]
(u'@ericastolte', u'bonitasworld', {'tweet_id': 11965974697L})
>>> len(nx.connected_components(g.to_undirected()))
>>> sorted(
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,  
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,  
2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 6, 6, 9, 37] 

The built-in operations that NetworkX provides are a useful starting point to make sense of the data, but it’s important to keep in mind that we’re only looking at a very small slice of the overall conversation happening on Twitter about SNL—500 tweets out of potentially tens of thousands (or more). For example, the number of nodes in the graph tells us that out of 500 tweets, there were 160 users involved in retweet relationships with one another, with 125 edges connecting those nodes. The ratio of 160/125 (approximately 1.28) is an important clue that tells us that the average degree of a node is approximately one—meaning that although some nodes are connected to more than one other node, the average is approximately one connection per node.

The call to connected_components shows us that the graph consists of 37 subgraphs and is not fully connected. The output of degree might seem a bit cryptic at first, but it actually confirms insight we’ve already gleaned: think of it as a way to get the gist of how well connected the nodes in the graph are without having to render an actual graph. In this case, most of the values are 1, meaning all of those nodes have a degree of 1 and are connected to only one other node in the graph. A few values are between 2 and 9, indicating that those nodes are connected to anywhere between 2 and 9 other nodes. The extreme outlier is the node with a degree of 37. The gist of the graph is that it’s mostly composed of disjoint nodes, but there is one very highly connected node. Figure 1-1 illustrates a distribution of degree as a column chart. The trendline shows that the distribution closely follows a Power Law and has a “heavy” or “long” tail. Although the characteristics of distributions with long tails are by no means treated with rigor in this book, you’ll find that lots of distributions we’ll encounter exhibit this property, and you’re highly encouraged to take the initiative to dig deeper if you feel the urge. A good starting point is Zipf’s law.

A distribution illustrating the degree of each node in the graph, which reveals insight into the graph’s connectedness

Figure 1-1. A distribution illustrating the degree of each node in the graph, which reveals insight into the graph’s connectedness

We’ll spend a lot more time in this book using automatable heuristics to make sense of the data; this chapter is intended simply as an introduction to rattle your brain and get you thinking about ways that you could exploit data with the low-hanging fruit that’s available to you. Before we wrap up this chapter, however, let’s visualize the graph just to be sure that our intuition is leading us in the right direction.

Visualizing Tweet Graphs

Graphviz is a staple in the visualization community. This section introduces one possible approach for visualizing graphs of tweet data: exporting them to the DOT language, a simple text-based format that Graphviz consumes. Graphviz binaries for all platforms can be downloaded from its official website, and the installation is straightforward regardless of platform. Once Graphviz is installed, *nix users should be able to easy_install pygraphviz per the norm to satisfy the PyGraphviz dependency NetworkX requires to emit DOT. Windows users will most likely experience difficulties installing PyGraphviz,[9] but this turns out to be of little consequence since it’s trivial to tailor a few lines of code to generate the DOT language output that we need in this section.

Example 1-12 illustrates an approach that works for both platforms.

Example 1-12. Generating DOT language output is easy regardless of platform

OUT = ""

    nx.drawing.write_dot(g, OUT)
except ImportError, e:

    # Help for Windows users:
    # Not a general-purpose method, but representative of
    # the same output write_dot would provide for this graph
    # if installed and easy to implement

    dot = ['"%s" -> "%s" [tweet_id=%s]' % (n1, n2, g[n1][n2]['tweet_id']) \
        for n1, n2 in g.edges()]
    f = open(OUT, 'w')
    f.write('strict digraph {\n%s\n}' % (';\n'.join(dot),))

The DOT output that is generated is of the form shown in Example 1-13.

Example 1-13. Example DOT language output

strict digraph {
"@ericastolte" -> "bonitasworld" [tweet_id=11965974697];
"@mpcoelho" -> "Lil_Amaral" [tweet_id=11965954427];
"@BieberBelle123" -> "BELIEBE4EVER" [tweet_id=11966261062];
"@BieberBelle123" -> "sabrina9451" [tweet_id=11966197327];

With DOT language output on hand, the next step is to convert it into an image. Graphviz itself provides a variety of layout algorithms to visualize the exported graph; circo, a tool used to render graphs in a circular-style layout, should work well given that the data suggested that the graph would exhibit the shape of an ego graph with a “hub and spoke”-style topology, with one central node being highly connected to many nodes having a degree of 1. On a *nix platform, the following command converts the file exported from NetworkX into an file that you can open in an image viewer (the result of the operation is displayed in Figure 1-2):

$ circo -Tpng -Osnl_search_results
Our search results rendered in a circular layout with Graphviz

Figure 1-2. Our search results rendered in a circular layout with Graphviz

Windows users can use the GVedit application to render the file as shown in Figure 1-3. You can read more about the various Graphviz options in the online documentation. Visual inspection of the entire graphic file confirms that the characteristics of the graph align with our previous analysis, and we can visually confirm that the node with the highest degree is @justinbieber, the subject of so much discussion (and, in case you missed that episode of SNL, the guest host of the evening). Keep in mind that if we had harvested a lot more tweets, it is very likely that we would have seen many more interconnected subgraphs than are evidenced in the sampling of 500 tweets that we have been analyzing. Further analysis of the graph is left as a voluntary exercise for the reader, as the primary objective of this chapter was to get your development environment squared away and whet your appetite for more interesting topics.

Graphviz appears elsewhere in this book, and if you consider yourself to be a data scientist (or are aspiring to be one), it is a tool that you’ll want to master. That said, we’ll also look at many other useful approaches to visualizing graphs. In the chapters to come, we’ll cover additional outlets of social web data and techniques for analysis.

Synthesis: Visualizing Retweets with Protovis

A turn-key example script that synthesizes much of the content from this chapter and adds a visualization is how we’ll wrap up this chapter. In addition to spitting some useful information out to the console, it accepts a search term as a command line parameter, fetches, parses, and pops up your web browser to visualize the data as an interactive HTML5-based graph. It is available through the official code repository for this book at You are highly encouraged to try it out. We’ll revisit Protovis, the underlying visualization toolkit for this example, in several chapters later in the book. Figure 1-4 illustrates Protovis output from this script. The boilerplate in the sample script is just the beginning—much more can be done!

Windows users can use GVedit instead of interacting with Graphviz at the command prompt

Figure 1-3. Windows users can use GVedit instead of interacting with Graphviz at the command prompt

[7] If you’re not familiar with REST, see the sidebar RESTful Web Services in Chapter 7 for a brief explanation.

[8] In the present context, compiling a regular expression means transforming it into bytecode so that it can be executed by a matching engine written in C.

[9] See NetworkX Ticket #117, which reveals that this has been a long-standing issue that somehow has not garnered the support to be overcome even after many years of frustration. The underlying issue has to do with the need to compile C code during the easy_install process. The ability to work around this issue fairly easily by generating DOT language output may be partly responsible for why it has remained unresolved for so long.

The best content for your career. Discover unlimited learning on demand for around $1/day.