
Mining the Social Web

Cover of Mining the Social Web by Matthew A. Russell Published by O'Reilly Media, Inc.
  1. Mining the Social Web
  2. SPECIAL OFFER: Upgrade this ebook with O’Reilly
  3. Preface
    1. Content Updates
      1. February 22, 2012
    2. To Read This Book?
    3. Or Not to Read This Book?
    4. Tools and Prerequisites
    5. Conventions Used in This Book
    6. Using Code Examples
    7. Safari® Books Online
    8. How to Contact Us
    9. Acknowledgments
  4. 1. Introduction: Hacking on Twitter Data
    1. Installing Python Development Tools
    2. Collecting and Manipulating Twitter Data
      1. Tinkering with Twitter’s API
      2. Frequency Analysis and Lexical Diversity
      3. Visualizing Tweet Graphs
      4. Synthesis: Visualizing Retweets with Protovis
    3. Closing Remarks
  5. 2. Microformats: Semantic Markup and Common Sense Collide
    1. XFN and Friends
    2. Exploring Social Connections with XFN
      1. A Breadth-First Crawl of XFN Data
    3. Geocoordinates: A Common Thread for Just About Anything
      1. Wikipedia Articles + Google Maps = Road Trip?
    4. Slicing and Dicing Recipes (for the Health of It)
    5. Collecting Restaurant Reviews
    6. Summary
  6. 3. Mailboxes: Oldies but Goodies
    1. mbox: The Quick and Dirty on Unix Mailboxes
    2. mbox + CouchDB = Relaxed Email Analysis
      1. Bulk Loading Documents into CouchDB
      2. Sensible Sorting
      3. Map/Reduce-Inspired Frequency Analysis
      4. Sorting Documents by Value
      5. couchdb-lucene: Full-Text Indexing and More
    3. Threading Together Conversations
      1. Look Who’s Talking
    4. Visualizing Mail “Events” with SIMILE Timeline
    5. Analyzing Your Own Mail Data
      1. The Graph Your (Gmail) Inbox Chrome Extension
    6. Closing Remarks
  7. 4. Twitter: Friends, Followers, and Setwise Operations
    1. RESTful and OAuth-Cladded APIs
      1. No, You Can’t Have My Password
    2. A Lean, Mean Data-Collecting Machine
      1. A Very Brief Refactor Interlude
      2. Redis: A Data Structures Server
      3. Elementary Set Operations
      4. Souping Up the Machine with Basic Friend/Follower Metrics
      5. Calculating Similarity by Computing Common Friends and Followers
      6. Measuring Influence
    3. Constructing Friendship Graphs
      1. Clique Detection and Analysis
      2. The Infochimps “Strong Links” API
      3. Interactive 3D Graph Visualization
    4. Summary
  8. 5. Twitter: The Tweet, the Whole Tweet, and Nothing but the Tweet
    1. Pen : Sword :: Tweet : Machine Gun (?!?)
    2. Analyzing Tweets (One Entity at a Time)
      1. Tapping (Tim’s) Tweets
      2. Who Does Tim Retweet Most Often?
      3. What’s Tim’s Influence?
      4. How Many of Tim’s Tweets Contain Hashtags?
    3. Juxtaposing Latent Social Networks (or #JustinBieber Versus #TeaParty)
      1. What Entities Co-Occur Most Often with #JustinBieber and #TeaParty Tweets?
      2. On Average, Do #JustinBieber or #TeaParty Tweets Have More Hashtags?
      3. Which Gets Retweeted More Often: #JustinBieber or #TeaParty?
      4. How Much Overlap Exists Between the Entities of #TeaParty and #JustinBieber Tweets?
    4. Visualizing Tons of Tweets
      1. Visualizing Tweets with Tricked-Out Tag Clouds
      2. Visualizing Community Structures in Twitter Search Results
    5. Closing Remarks
  9. 6. LinkedIn: Clustering Your Professional Network for Fun (and Profit?)
    1. Motivation for Clustering
    2. Clustering Contacts by Job Title
      1. Standardizing and Counting Job Titles
      2. Common Similarity Metrics for Clustering
      3. A Greedy Approach to Clustering
      4. Hierarchical and k-Means Clustering
    3. Fetching Extended Profile Information
    4. Geographically Clustering Your Network
      1. Mapping Your Professional Network with Google Earth
      2. Mapping Your Professional Network with Dorling Cartograms
    5. Closing Remarks
  10. 7. Google+: TF-IDF, Cosine Similarity, and Collocations
    1. Harvesting Google+ Data
    2. Data Hacking with NLTK
    3. Text Mining Fundamentals
      1. A Whiz-Bang Introduction to TF-IDF
      2. Querying Google+ Data with TF-IDF
    4. Finding Similar Documents
      1. The Theory Behind Vector Space Models and Cosine Similarity
      2. Clustering Posts with Cosine Similarity
      3. Visualizing Similarity with Graph Visualizations
    5. Bigram Analysis
      1. How the Collocation Sausage Is Made: Contingency Tables and Scoring Functions
    6. Tapping into Your Gmail
      1. Accessing Gmail with OAuth
      2. Fetching and Parsing Email Messages
    7. Before You Go Off and Try to Build a Search Engine…
    8. Closing Remarks
  11. 8. Blogs et al.: Natural Language Processing (and Beyond)
    1. NLP: A Pareto-Like Introduction
      1. Syntax and Semantics
      2. A Brief Thought Exercise
    2. A Typical NLP Pipeline with NLTK
    3. Sentence Detection in Blogs with NLTK
    4. Summarizing Documents
      1. Analysis of Luhn’s Summarization Algorithm
    5. Entity-Centric Analysis: A Deeper Understanding of the Data
      1. Quality of Analytics
    6. Closing Remarks
  12. 9. Facebook: The All-in-One Wonder
    1. Tapping into Your Social Network Data
      1. From Zero to Access Token in Under 10 Minutes
      2. Facebook’s Query APIs
    2. Visualizing Facebook Data
      1. Visualizing Your Entire Social Network
      2. Visualizing Mutual Friendships Within Groups
      3. Where Have My Friends All Gone? (A Data-Driven Game)
      4. Visualizing Wall Data As a (Rotating) Tag Cloud
    3. Closing Remarks
  13. 10. The Semantic Web: A Cocktail Discussion
    1. An Evolutionary Revolution?
    2. Man Cannot Live on Facts Alone
      1. Open-World Versus Closed-World Assumptions
      2. Inferencing About an Open World with FuXi
    3. Hope
  14. Index
  15. About the Author
  16. Colophon
  17. SPECIAL OFFER: Upgrade this ebook with O’Reilly

A Lean, Mean Data-Collecting Machine

In principle, fetching Twitter data is dirt simple: make a request, store the response, and repeat as needed. But all sorts of real-world stuff gets in the way, such as network I/O, the infamous fail whale,[25] and those pesky API rate limits. Fortunately, it’s not too difficult to handle such issues, so long as you do a bit of forward planning and anticipate the things that could (and will) go wrong.

When executing a long-running program that’s eating away at your rate limit, writing robust code is especially important; you want to handle any exceptional conditions that could occur, do your best to remedy the situation, and—in the event that your best just isn’t good enough—save state and leave an indication of how to pick things back up where they left off. In other words, when you write data-harvesting code for a platform like Twitter, you must assume that it will throw curve balls at you. There will be atypical conditions you’ll have to handle, and they’re often more the norm than the exception.

The code we’ll develop is semi-rugged in that it deals with the most common things that can go wrong and is patterned so that you can easily extend it to handle new circumstances if they arise. That said, there are two specific HTTP errors you are highly likely to encounter when harvesting even modest amounts of Twitter data: a 401 Error (Not Authorized) and a 503 Error (Over Capacity). The former occurs when you attempt to access data that a user has protected, while the latter is basically unpredictable.

Whenever Twitter returns an HTTP error, the twitter module throws a TwitterHTTPError exception, which can be handled like any other Python exception with a try/except block. Example 4-2 illustrates a minimal code block that harvests some friend IDs and handles some of the more common exceptional conditions.

Note

You’ll need to create a Twitter app in order to get a consumer key and secret that can be used with the Twitter examples in this book. It’s painless and only takes a moment.

Example 4-2. Using OAuth to authenticate and grab some friend data (friends_followers__get_friends.py)

# -*- coding: utf-8 -*-

import sys
import time
import cPickle
import twitter
from twitter.oauth_dance import oauth_dance

# Go to http://twitter.com/apps/new to create an app and get these items

consumer_key = ''
consumer_secret = ''

SCREEN_NAME = sys.argv[1]
friends_limit = 10000

(oauth_token, oauth_token_secret) = oauth_dance('MiningTheSocialWeb',
        consumer_key, consumer_secret)
t = twitter.Twitter(domain='api.twitter.com', api_version='1',
                    auth=twitter.oauth.OAuth(oauth_token, oauth_token_secret,
                    consumer_key, consumer_secret))

ids = []
wait_period = 2  # secs
cursor = -1

while cursor != 0:
    if wait_period > 3600:  # 1 hour
        print 'Too many retries. Saving partial data to disk and exiting'
        f = file('%s.friend_ids' % str(cursor), 'wb')
        cPickle.dump(ids, f)
        f.close()
        exit()

    try:
        response = t.friends.ids(screen_name=SCREEN_NAME, cursor=cursor)
        ids.extend(response['ids'])
        wait_period = 2
    except twitter.api.TwitterHTTPError, e:
        if e.e.code == 401:
            print 'Encountered 401 Error (Not Authorized)'
            print 'User %s is protecting their tweets' % (SCREEN_NAME, )
            break  # Nothing to be gained by retrying; bail out of the loop
        elif e.e.code in (502, 503):
            print 'Encountered %i Error. Trying again in %i seconds' % (e.e.code,
                    wait_period)
            time.sleep(wait_period)
            wait_period *= 1.5
            continue
        elif t.account.rate_limit_status()['remaining_hits'] == 0:
            status = t.account.rate_limit_status()
            now = time.time()  # UTC
            when_rate_limit_resets = status['reset_time_in_seconds']  # UTC
            sleep_time = when_rate_limit_resets - now
            print 'Rate limit reached. Trying again in %i seconds' % (sleep_time,
                    )
            time.sleep(sleep_time)
            continue
        else:
            raise e  # Something unexpected happened; don't mask it

    cursor = response['next_cursor']
    print 'Fetched %i ids for %s' % (len(ids), SCREEN_NAME)
    if len(ids) >= friends_limit:
        break

# do something interesting with the IDs

print ids

The twitter.oauth module provides read_token_file and write_token_file convenience functions for storing and retrieving your OAuth token and OAuth token secret, so you don’t have to manually enter a PIN to authenticate each time.
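
The token files themselves are nothing exotic. The following sketch mimics the round trip (an illustrative stand-in, not the actual twitter.oauth implementation), which shows why persisting the pair lets you skip repeating the PIN-based dance:

```python
import os
import tempfile

# Hypothetical stand-ins for twitter.oauth's read_token_file/write_token_file:
# persist the OAuth token and secret, one per line, and read them back later.

def write_token_file(path, oauth_token, oauth_token_secret):
    with open(path, 'w') as f:
        f.write('%s\n%s\n' % (oauth_token, oauth_token_secret))

def read_token_file(path):
    with open(path) as f:
        return (f.readline().strip(), f.readline().strip())

path = os.path.join(tempfile.gettempdir(), 'mtsw_demo.tokens')
write_token_file(path, 'token123', 'secret456')
print(read_token_file(path))  # ('token123', 'secret456')
```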

Note

In OAuth 2.0 parlance, “client” describes the same role as a “consumer” in OAuth 1.0, thus the use of the variable names consumer_key and consumer_secret in the preceding listing.

There are several noteworthy items about the listing:

  • You can obtain your own consumer_key and consumer_secret by registering an application with Twitter at http://dev.twitter.com/apps/new. These two items, along with the credentials returned through the “OAuth dance,” are what enable you to provide an application with access to your account data (your friends list, in this particular example).

  • The online documentation for Twitter’s social graph APIs states that requests for friend/follower data will return up to 5,000 IDs per call. In the event that there are more than 5,000 IDs to be returned, a nonzero cursor value is also returned, which can be used to navigate forward to the next batch. This particular example “stops short” at a maximum of 10,000 ID values, but friends_limit could be an arbitrarily larger number.

  • Given that the /friends/ids resource returns up to 5,000 IDs at a time, regular user accounts could retrieve up to 1,750,000 IDs before rate limiting would kick in based on a 350 requests/hour metric. While it might be an anomaly for a user to have that many friends on Twitter, it’s not at all uncommon for popular users to have many times that many followers.

  • It’s not clear from any official documentation or the example code itself, but ID values in the results seem to be in reverse chronological order, so the first value will be the person you most recently followed, and the last value will be the first person you followed. Requests for followers via t.followers.ids appear to return results in the same order.
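
The cursoring scheme described in the list above is easy to simulate locally. Here is a toy stand-in for the /friends/ids resource (note that real Twitter cursors are opaque values, whereas this sketch uses list offsets for simplicity):

```python
def fake_friends_ids(all_ids, cursor, page_size=5000):
    # Toy stand-in for /friends/ids: return up to page_size IDs plus a
    # next_cursor that stays nonzero while more batches remain.
    start = 0 if cursor == -1 else cursor
    next_start = start + page_size
    return {'ids': all_ids[start:next_start],
            'next_cursor': 0 if next_start >= len(all_ids) else next_start}

universe = list(range(12500))  # pretend this user has 12,500 friends
ids, cursor = [], -1
while cursor != 0:
    response = fake_friends_ids(universe, cursor)
    ids.extend(response['ids'])
    cursor = response['next_cursor']

print(len(ids))  # 12500, gathered in batches of 5,000, 5,000, and 2,500
```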

At this point, you’ve only been introduced to a few Twitter APIs. These are sufficiently powerful to answer a number of interesting questions about your account or any other nonprotected account, but there are numerous other APIs out there. We’ll look at some more of them shortly, but first, let’s segue into a brief interlude to refactor Example 4-2.

A Very Brief Refactor Interlude

Given that virtually all interesting code listings involving Twitter data will repeatedly involve performing the OAuth dance and making robust requests that can stand up to the litany of things you have to assume might go wrong, it’s very worthwhile to establish a pattern for performing these tasks. The approach we’ll take is to isolate the OAuth logic in a login() function and the robust request logic in a makeTwitterRequest() function, so that Example 4-2 can be rewritten as the refactored version shown in Example 4-3:

Example 4-3. Example 4-2 refactored to use two common utilities for OAuth and making API requests (friends_followers__get_friends_refactored.py)

# -*- coding: utf-8 -*-

import sys
import time
import cPickle
import twitter
from twitter__login import login
from twitter__util import makeTwitterRequest 

friends_limit = 10000

# You may need to setup your OAuth settings in twitter__login.py
t = login()

def getFriendIds(screen_name=None, user_id=None, friends_limit=10000):
    assert screen_name is not None or user_id is not None

    ids = []
    cursor = -1
    while cursor != 0:
        params = dict(cursor=cursor)
        if screen_name is not None:
            params['screen_name'] = screen_name
        else:
            params['user_id'] = user_id

        response = makeTwitterRequest(t, t.friends.ids, **params)

        ids.extend(response['ids'])
        cursor = response['next_cursor']
        print >> sys.stderr, \
            'Fetched %i ids for %s' % (len(ids), screen_name or user_id)
        if len(ids) >= friends_limit:
            break

    return ids

if __name__ == '__main__':
    ids = getFriendIds(sys.argv[1], friends_limit=10000)

    # do something interesting with the ids
    print ids

From here on out, we’ll continue to use twitter__login and twitter__util to keep the examples as crisp and simple as possible. It’s worthwhile to take a moment and peruse the source for these modules online before reading further. They’ll appear again and again, and twitter__util will soon come to have a number of commonly used convenience functions in it.
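
While the twitter__util source is best read online, the retry pattern at its heart can be sketched in a few lines. This is a simplified assumption of what makeTwitterRequest encapsulates, with a hypothetical TransientAPIError standing in for twitter.api.TwitterHTTPError:

```python
import time

class TransientAPIError(Exception):
    """Hypothetical stand-in for a recoverable HTTP error (e.g., 502/503)."""

def make_request_with_backoff(request, wait_period=2, max_wait=3600, **params):
    # Keep retrying with an exponentially growing wait period, giving up
    # for good once a single wait would exceed max_wait.
    while True:
        try:
            return request(**params)
        except TransientAPIError:
            if wait_period > max_wait:
                raise
            time.sleep(wait_period)
            wait_period *= 1.5
```

In the real module, the exception’s HTTP status code is inspected so that, for example, a 401 aborts immediately while only transient errors trigger the backoff.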

The next section introduces Redis, a powerful data structures server that has quickly gained a fierce following due to its performance and simplicity.

Redis: A Data Structures Server

As we’ve already observed, planning ahead is important when you want to execute a potentially long-running program to scarf down data from the Web, because lots of things can go wrong. But what do you do with all of that data once you get it? You may initially be tempted to just store it to disk. In the situation we’ve just been looking at, that might result in a directory structure similar to the following:

./
screen_name1/
    friend_ids.json
    follower_ids.json
    user_info.json
screen_name2/
    ...
...

This looks pretty reasonable until you harvest all of the friends/followers for a very popular user—then, depending on your platform, you may be faced with a directory containing millions of subdirectories that’s relatively unusable because you can’t easily browse it (if at all) in a terminal. Saving all this information to disk might also require that you maintain a registry of some sort to keep track of all the screen names, because the time required to generate a directory listing for millions of files might not yield a desirable performance profile. If the app that uses the data then becomes threaded, you may end up with multiple writers needing to access the same file at the same time, which means dealing with file locking and the like. That’s probably not a place you want to go. All we really need in this case is a system that makes it trivially easy to store basic key/value pairs, along with a simple key encoding scheme—something like a disk-backed dictionary would be a good start. This next snippet demonstrates the construction of a key by concatenating a screen name, a delimiter, and a data structure name:

s = {}
s["screen_name1$friend_ids"] = [1,2,3, ...]
s["screen_name1$friend_ids"] # returns [1,2,3, ...]

But wouldn’t it be cool if the map could automatically compute set operations so that we could just tell it to do something like:

s.intersection("screen_name1$friend_ids", "screen_name1$follower_ids")

to automatically compute “mutual friends” for a Twitterer (i.e., to figure out which of their friends are following them back)? Well, there’s an open source project called Redis that provides exactly that kind of capability. Redis is trivial to install, blazingly fast (written in C), scales well, is actively maintained, and has a great Python client with accompanying documentation available. Taking Redis for a test drive is as simple as installing it and starting up the server. (Windows users can save themselves some headaches by grabbing a binary that’s maintained by servicestack.net.) Then, just run easy_install redis to obtain a nice Python client that provides trivial access to everything it has to offer. For example, the previous snippet translates to the following Redis code:

import redis

r = redis.Redis(host='localhost', port=6379, db=0) # Default params
[ r.sadd("screen_name1$friend_ids", i) for i in [1, 2, 3, ...] ]
r.smembers("screen_name1$friend_ids") # Returns the members as a set

Note that while sadd and smembers are set-based operations, Redis includes operations specific to other types of data structures as well, such as lists, hashes, and sorted sets. The set operations turn out to be of particular interest because they provide the answers to many of the questions posed at the beginning of this chapter. It’s worthwhile to take a moment to review the documentation for the Redis Python client to get a better appreciation of all it can do. Recall that you can simply execute a command like pydoc redis.Redis to quickly browse documentation from a terminal.

Note

See “Redis: under the hood” for an awesome technical deep dive into how Redis works internally.

Elementary Set Operations

The most common set operations you’ll likely encounter are the union, intersection, and difference operations. Recall that the difference between a set and a list is that a set is unordered and contains only unique members, while a list is ordered and may contain duplicate members. Since Version 2.4, Python has provided built-in support for sets via the set data structure. Table 4-1 illustrates some examples of common set operations for a trivially small universe of discourse involving friends and followers:

Friends = {Abe, Bob}, Followers = {Bob, Carol}

Table 4-1. Sample set operations for Friends and Followers

Operation              Result            Comment
Friends ∪ Followers    Abe, Bob, Carol   Someone’s overall network
Friends ∩ Followers    Bob               Someone’s mutual friends
Friends − Followers    Abe               People a person is following, but who are not following that person back
Followers − Friends    Carol             People who are following someone but are not being followed back
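
Table 4-1 maps directly onto Python’s built-in set operators, as this quick check shows:

```python
# The same trivially small universe of discourse as Table 4-1.
friends = set(['Abe', 'Bob'])
followers = set(['Bob', 'Carol'])

print(sorted(friends | followers))  # union: ['Abe', 'Bob', 'Carol']
print(sorted(friends & followers))  # intersection: ['Bob']
print(sorted(friends - followers))  # difference: ['Abe']
print(sorted(followers - friends))  # difference: ['Carol']
```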

As previously mentioned, Redis provides native operations for computing common set operations. A few of the most relevant ones for the upcoming work at hand include:

smembers

Returns all of the members of a set

scard

Returns the cardinality of a set (the number of members in the set)

sinter

Computes the intersection for a list of sets

sdiff

Computes the difference for a list of sets

mget

Returns a list of string values for a list of keys

mset

Stores a list of string values against a list of keys

sadd

Adds an item to a set (and creates the set if it doesn’t already exist)

keys

Returns a list of keys matching a glob-like pattern

Skimming the pydoc for Python’s built-in set data type should convince you of the close mapping between it and the Redis APIs.

Souping Up the Machine with Basic Friend/Follower Metrics

Redis should serve you well on your quest to efficiently process and analyze vast amounts of Twitter data for certain kinds of queries. Adapting Example 4-2 with some additional logic to house data in Redis requires only a simple change, and Example 4-4 is an update that computes some basic friend/follower statistics. Native functions in Redis are used to compute the set operations.

Example 4-4. Harvesting, storing, and computing statistics about friends and followers (friends_followers__friend_follower_symmetry.py)

# -*- coding: utf-8 -*-

import sys
import locale
import time
import functools
import twitter
import redis
from twitter__login import login

# A template-like function for maximizing code reuse,
# which is essentially a wrapper around makeTwitterRequest
# with some additional logic in place for interfacing with 
# Redis
from twitter__util import _getFriendsOrFollowersUsingFunc

# Creates a consistent key value for a user given a screen name
from twitter__util import getRedisIdByScreenName

SCREEN_NAME = sys.argv[1]

MAXINT = sys.maxint

# For nice number formatting
locale.setlocale(locale.LC_ALL, '')  

# You may need to setup your OAuth settings in twitter__login.py

t = login()

# Connect using default settings for localhost
r = redis.Redis()  

# Some wrappers around _getFriendsOrFollowersUsingFunc 
# that bind the first two arguments

getFriends = functools.partial(_getFriendsOrFollowersUsingFunc, 
                               t.friends.ids, 'friend_ids', t, r)

getFollowers = functools.partial(_getFriendsOrFollowersUsingFunc,
                                 t.followers.ids, 'follower_ids', t, r)

screen_name = SCREEN_NAME

# get the data

print >> sys.stderr, 'Getting friends for %s...' % (screen_name, )
getFriends(screen_name, limit=MAXINT)

print >> sys.stderr, 'Getting followers for %s...' % (screen_name, )
getFollowers(screen_name, limit=MAXINT)

# use redis to compute the numbers

n_friends = r.scard(getRedisIdByScreenName(screen_name, 'friend_ids'))

n_followers = r.scard(getRedisIdByScreenName(screen_name, 'follower_ids'))

n_friends_diff_followers = r.sdiffstore('temp',
                                        [getRedisIdByScreenName(screen_name,
                                        'friend_ids'),
                                        getRedisIdByScreenName(screen_name,
                                        'follower_ids')])
r.delete('temp')

n_followers_diff_friends = r.sdiffstore('temp',
                                        [getRedisIdByScreenName(screen_name,
                                        'follower_ids'),
                                        getRedisIdByScreenName(screen_name,
                                        'friend_ids')])
r.delete('temp')

n_friends_inter_followers = r.sinterstore('temp',
        [getRedisIdByScreenName(screen_name, 'follower_ids'),
        getRedisIdByScreenName(screen_name, 'friend_ids')])
r.delete('temp')

print '%s is following %s' % (screen_name, locale.format('%d', n_friends, True))
print '%s is being followed by %s' % (screen_name, locale.format('%d',
                                      n_followers, True))
print '%s of %s are not following %s back' % (locale.format('%d',
        n_friends_diff_followers, True), locale.format('%d', n_friends, True),
        screen_name)
print '%s of %s are not being followed back by %s' % (locale.format('%d',
        n_followers_diff_friends, True), locale.format('%d', n_followers, True),
        screen_name)
print '%s has %s mutual friends' \
    % (screen_name, locale.format('%d', n_friends_inter_followers, True))

Aside from the use of functools.partial (http://docs.python.org/library/functools.html) to create getFriends and getFollowers from a common piece of parameter-bound code, Example 4-4 should be pretty straightforward. There’s one other very subtle thing to notice: there isn’t a call to r.save in Example 4-4, which means that the settings in redis.conf dictate when data is persisted to disk. By default, Redis stores data in memory and asynchronously snapshots data to disk according to a schedule that’s dictated by whether or not a number of changes have occurred within a specified time interval. The risk with asynchronous writes is that you might lose data if certain unexpected conditions, such as a system crash or power outage, were to occur. Redis provides an “append only” option that you can enable in redis.conf to hedge against this possibility.
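
By default the option is off; enabling it amounts to a two-line change in redis.conf (option names as of the Redis releases contemporary with this book; consult the comments in your own redis.conf for details):

```
appendonly yes        # log every write operation to the append-only file
appendfsync everysec  # fsync the log once per second, a common middle ground
```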

Note

It is highly recommended that you enable the appendonly option in redis.conf to protect against data loss; see the “Append Only File HOWTO” for helpful details.

Consider the following output, relating to Tim O’Reilly’s network of followers. Keeping in mind that there’s a rate limit of 350 OAuth requests per hour, you could expect this code to take a little less than an hour to run, because approximately 300 API calls would need to be made to collect all the follower ID values:

timoreilly is following 663
timoreilly is being followed by 1,423,704
131 of 663 are not following timoreilly back
1,423,172 of 1,423,704 are not being followed back by timoreilly
timoreilly has 532 mutual friends

Note that while you could settle for harvesting a smaller number of followers to avoid the rate-limit-imposed wait, the API documentation does not state that taking the first N pages’ worth of data yields a truly random sample, and it appears that data is returned in reverse chronological order, so you may not be able to extrapolate in a predictable way if your logic depends on it. For example, if the first 10,000 followers returned just so happened to contain the 532 mutual friends, extrapolating from those data points would result in a skewed analysis, because those results are not at all representative of the larger population. For a very popular Twitterer such as Britney Spears, with well over 5,000,000 followers, somewhere in the neighborhood of 1,000 API calls would be required to fetch all of the followers, spread over approximately a four-hour period. In general, the wait is probably worth it for this kind of data, and you could use the Twitter streaming APIs to keep your data up-to-date so that you never have to go through the entire ordeal again.
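
The arithmetic behind these time estimates is straightforward. Here is a back-of-the-envelope sketch covering the pure request budget only; retries, the granularity of the hourly reset window, and fail whales all push real elapsed time higher:

```python
RATE_LIMIT = 350      # authenticated OAuth requests per hour
IDS_PER_CALL = 5000   # maximum IDs returned per social graph API call

def calls_needed(n_ids):
    return -(-n_ids // IDS_PER_CALL)  # ceiling division

def hours_needed(n_ids):
    return calls_needed(n_ids) / float(RATE_LIMIT)

print(calls_needed(1423704))   # 285 calls: a little under an hour of requests
print(calls_needed(5000000))   # 1000 calls: roughly three hours of requests
```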

Warning

One common source of error for some kinds of analysis is to forget about the overall size of a population relative to your sample. For example, randomly sampling 10,000 of Tim O’Reilly’s friends and followers would actually give you the full population of his friends, yet only a tiny fraction of his followers. Depending on the sophistication of your analysis, the sample size relative to the overall size of a population can make a difference in determining whether the outcome of an experiment is statistically significant, and the level of confidence you can have about it.

Given even these basic friend/follower stats, a couple of questions that lead us toward other interesting analyses naturally follow. For example, who are the 131 people who are not following Tim O’Reilly back? Given the various possibilities that could be considered about friends and followers, the “Who isn’t following me back?” question is one of the more interesting ones and arguably can provide a lot of insight about a person’s interests. So, how can we answer this question?

Staring at a list of user IDs isn’t very riveting, so resolving those user IDs to actual user objects is the first obvious step. Example 4-5 extends Example 4-4 by encapsulating common error-handling code into reusable form. It also provides a function that demonstrates how to resolve those ID values to screen names using the /users/lookup API, which accepts a list of up to 100 user IDs or screen names and returns the same basic user information that you saw earlier with /users/show.

Example 4-5. Resolving basic user information such as screen names from IDs (friends_followers__get_user_info.py)

# -*- coding: utf-8 -*-

import sys
import json
import redis
from twitter__login import login

# A makeTwitterRequest call through to the /users/lookup 
# resource, which accepts a comma separated list of up 
# to 100 screen names. Details are fairly uninteresting. 
# See also http://dev.twitter.com/doc/get/users/lookup
from twitter__util import getUserInfo

if __name__ == "__main__":
    screen_names = sys.argv[1:]

    t = login()
    r = redis.Redis()

    print json.dumps(
            getUserInfo(t, r, screen_names=screen_names),
            indent=4
          )

Although not reproduced in its entirety, the getUserInfo function that’s imported from twitter__util is essentially just a makeTwitterRequest to the /users/lookup resource using a list of screen names. The following snippet demonstrates:

def getUserInfo(t, r, screen_names):
    info = []
    response = makeTwitterRequest(t, 
                                  t.users.lookup,
                                  screen_name=','.join(screen_names)
                                 )

    for user_info in response:
        r.set(getRedisIdByScreenName(user_info['screen_name'], 'info.json'),
              json.dumps(user_info))
        r.set(getRedisIdByUserId(user_info['id'], 'info.json'),
              json.dumps(user_info))

    info.extend(response)

    return info

It’s worthwhile to note that getUserInfo stores the same user information under two different keys: the user ID and the screen name. Storing both of these keys allows us to easily look up a screen name given a user ID value and a user ID value given a screen name. Translating a user ID value to a screen name is a particularly useful operation since the social graph APIs for getting friends and followers return only ID values, which have no intuitive value until they are resolved against screen names and other basic user information. While there is redundant storage involved in this scheme, compared to other approaches, the convenience is arguably worth it. Feel free to take a leaner approach if storage is a concern.
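
The key-building helpers themselves aren’t reproduced here, but following the delimiter scheme introduced earlier, they presumably amount to something like the following hypothetical sketches (the real twitter__util implementations may format keys differently):

```python
# Hypothetical key builders following the '$'-delimited scheme shown earlier.

def getRedisIdByScreenName(screen_name, key_name):
    # e.g., 'screen_name$timoreilly$info.json'
    return 'screen_name$%s$%s' % (screen_name, key_name)

def getRedisIdByUserId(user_id, key_name):
    # e.g., 'user_id$2384071$info.json'
    return 'user_id$%s$%s' % (user_id, key_name)

print(getRedisIdByScreenName('timoreilly', 'info.json'))
```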

An example user information object for Tim O’Reilly follows in Example 4-6, illustrating the kind of information available about Twitterers. The sky is the limit with what you can do with data that’s this rich. We won’t mine the user descriptions and tweets of the folks who aren’t following Tim back and put them in print, but you should have enough to work with should you wish to conduct that kind of analysis.

Example 4-6. Example user object represented as JSON data for Tim O’Reilly

{
    "id": 2384071,
    "verified": true,
    "profile_sidebar_fill_color": "e0ff92",
    "profile_text_color": "000000",
    "followers_count": 1423326,
    "protected": false,
    "location": "Sebastopol, CA",
    "profile_background_color": "9ae4e8",
    "status": {
        "favorited": false,
        "contributors": null,
        "truncated": false,
        "text": "AWESOME!! RT @adafruit: a little girl asks after seeing adafruit ...",
        "created_at": "Sun May 30 00:56:33 +0000 2010",
        "coordinates": null,
        "source": "<a href=\"http://www.seesmic.com/\" rel=\"nofollow\">Seesmic</a>",
        "in_reply_to_status_id": null,
        "in_reply_to_screen_name": null,
        "in_reply_to_user_id": null,
        "place": null,
        "geo": null,
        "id": 15008936780
    },
    "utc_offset": -28800,
    "statuses_count": 11220,
    "description": "Founder and CEO, O'Reilly Media. Watching the alpha geeks...",
    "friends_count": 662,
    "profile_link_color": "0000ff",
    "profile_image_url": "http://a1.twimg.com/profile_images/941827802/IMG_...jpg",
    "notifications": false,
    "geo_enabled": true,
    "profile_background_image_url": "http://a1.twimg.com/profile_background_...gif",
    "name": "Tim O'Reilly",
    "lang": "en",
    "profile_background_tile": false,
    "favourites_count": 10,
    "screen_name": "timoreilly",
    "url": "http://radar.oreilly.com",
    "created_at": "Tue Mar 27 01:14:05 +0000 2007",
    "contributors_enabled": false,
    "time_zone": "Pacific Time (US & Canada)",
    "profile_sidebar_border_color": "87bc44",
    "following": false
}

The refactored logic for handling HTTP errors and obtaining user information in batches is provided in the following sections. Note that the handleTwitterHTTPError function intentionally doesn’t include error handling for every conceivable error case, because the appropriate action varies from situation to situation. For example, in the event of a urllib2.URLError (operation timed out) triggered because someone unplugged your network cable, you might want to prompt the user for a specific course of action.
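As one illustration of the general pattern involved, here is a minimal retry-with-backoff sketch. This is not the book’s handleTwitterHTTPError; the function and exception names are hypothetical stand-ins for a transient failure such as an HTTP 503 “fail whale” response:

```python
import time

class OverCapacityError(Exception):
    """Hypothetical stand-in for a transient HTTP 503 response."""

def make_request_with_retries(request, max_retries=3, initial_wait=1):
    """Call request(), retrying with exponential backoff on
    transient over-capacity errors."""
    wait = initial_wait
    for attempt in range(max_retries + 1):
        try:
            return request()
        except OverCapacityError:
            if attempt == max_retries:
                raise  # give up after exhausting the retries
            time.sleep(wait)
            wait *= 2  # back off: 1s, 2s, 4s, ...

# Simulate a request that fails twice before succeeding
attempts = {'n': 0}
def flaky_request():
    attempts['n'] += 1
    if attempts['n'] < 3:
        raise OverCapacityError()
    return 'ok'

result = make_request_with_retries(flaky_request, initial_wait=0)
print(result)  # ok
```

Errors that signal a permanent problem (such as a bad request) should generally not be retried at all, which is one reason a one-size-fits-all handler is hard to write.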

Example 4-5 brings to light some good news and some not-so-good news. The good news is that resolving the user IDs to user objects containing a byline, location information, the latest tweet, etc. is a treasure trove of information. The not-so-good news is that it’s quite expensive to do this in terms of rate limiting, given that you can only get data back in batches of 100. For Tim O’Reilly’s friends, that’s only seven API calls. For his followers, however, it’s over 14,000, which would take nearly two days to collect, given a rate limit of 350 calls per hour (and no glitches in harvesting).
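The arithmetic behind those estimates is easy to check. The friend and follower counts come from Example 4-6, and 350 calls per hour was Twitter’s authenticated rate limit at the time:

```python
import math

FRIENDS, FOLLOWERS = 662, 1423326  # counts from Example 4-6
BATCH_SIZE = 100                   # /users/lookup resolves up to 100 IDs per call
RATE_LIMIT = 350                   # authenticated API calls per hour

friend_calls = int(math.ceil(FRIENDS / float(BATCH_SIZE)))
follower_calls = int(math.ceil(FOLLOWERS / float(BATCH_SIZE)))
hours = follower_calls / float(RATE_LIMIT)

print(friend_calls)          # 7 calls cover all of the friends
print(follower_calls)        # 14,234 calls for the followers
print(round(hours / 24, 1))  # about 1.7 days, assuming no glitches
```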

However, given a full collection of anyone’s friends and followers ID values, you can randomly sample and calculate measures of statistical significance to your heart’s content. Redis provides the srandmember function that fits the bill perfectly. You pass it the name of a set, such as timoreilly$follower_ids, and it returns a random member of that set.
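In redis-py that is simply r.srandmember('timoreilly$follower_ids'). The sampling logic itself is the same as drawing from any set, as this small standalone sketch shows; the ID values here are made up:

```python
import random

# A made-up set of follower ID values, standing in for the
# timoreilly$follower_ids set stored in Redis
follower_ids = set(range(1000))

# Draw a single random member, analogous to Redis's SRANDMEMBER
sampled_id = random.choice(list(follower_ids))

# Or draw k distinct members for statistical work
sample = random.sample(list(follower_ids), 100)

print(sampled_id in follower_ids)  # True
print(len(set(sample)))            # 100 distinct IDs
```

Because random.sample draws without replacement, the 100 IDs are guaranteed distinct, which is usually what you want when estimating population statistics.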

Calculating Similarity by Computing Common Friends and Followers

Another piece of low-hanging fruit we can go after is computing the friends and followers that two or more Twitterers have in common. Within a given universe, these folks might be interesting for a couple of reasons. For one, they’re the “common thread” connecting otherwise disparate networks, which you might interpret as a type of similarity metric: if two users both follow a large number of the same people, you might conclude that they have very similar interests. From there, you might analyze the information embedded in the tweets of the common friends to gain more insight into what those people have in common, if anything, or draw other conclusions. It turns out that computing common friends and followers is just a set operation away.

Example 4-7 illustrates the use of Redis’s sinterstore function, which stores the result of a set intersection, and introduces locale.format for pretty-printing so that the output is easier to read.

Example 4-7. Finding common friends/followers for multiple Twitterers, with output that’s easier on the eyes (friends_followers__friends_followers_in_common.py)

# -*- coding: utf-8 -*-

import sys
import redis

from twitter__util import getRedisIdByScreenName

# A pretty-print function for numbers
from twitter__util import pp

r = redis.Redis()

def friendsFollowersInCommon(screen_names):
    r.sinterstore('temp$friends_in_common', 
                  [getRedisIdByScreenName(screen_name, 'friend_ids') 
                      for screen_name in screen_names]
                 )

    r.sinterstore('temp$followers_in_common',
                  [getRedisIdByScreenName(screen_name, 'follower_ids')
                      for screen_name in screen_names]
                 )

    print 'Friends in common for %s: %s' % (', '.join(screen_names),
            pp(r.scard('temp$friends_in_common')))

    print 'Followers in common for %s: %s' % (', '.join(screen_names),
            pp(r.scard('temp$followers_in_common')))

    # Clean up scratch workspace

    r.delete('temp$friends_in_common')
    r.delete('temp$followers_in_common')

if __name__ == "__main__":
    if len(sys.argv) < 3:
        print >> sys.stderr, "Please supply at least two screen names."
        sys.exit(1)

    # Note:
    # The assumption is that the screen names you are 
    # supplying have already been added to Redis.
    # See friends_followers__get_friends__refactored.py

    friendsFollowersInCommon(sys.argv[1:])

Note that although the values in the working sets are ID values, you could easily use Redis’s srandmember function to sample friends and followers, and use the getUserInfo function from Example 4-5 to resolve useful information such as screen names, most recent tweets, and locations.

Measuring Influence

When someone shares information via a service such as Twitter, it’s only natural to wonder how far that information penetrates into the overall network by being retweeted. It seems fair to assume that the more followers a person has, the greater the potential for that person’s tweets to be retweeted. Users who have a relatively high percentage of their originally authored tweets retweeted can be said to be more influential than users who are retweeted infrequently. Users who have a relatively high percentage of their tweets retweeted, even tweets they did not originally author, might be said to be mavens: people who are exceptionally well connected and like to share information.[26] One trivial way to measure the relative influence of two or more users is simply to compare their follower counts, since every follower has a direct view of their tweets. We already know from Example 4-6 that we can get the number of followers (and friends) for a user via the /users/lookup and /users/show APIs. Extracting that information from these APIs is trivial enough:

for screen_name in screen_names:
    _json = json.loads(r.get(getRedisIdByScreenName(screen_name, "info.json")))
    n_friends, n_followers = _json['friends_count'], _json['followers_count']
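A minimal sketch of the comparison itself, operating on user objects shaped like the one in Example 4-6, might look as follows. The data here is made up, and the dict of JSON blobs stands in for what the loop above reads out of Redis:

```python
import json

# Made-up user info blobs shaped like the Example 4-6 user object
blobs = {
    'timoreilly': json.dumps({'friends_count': 662,
                              'followers_count': 1423326}),
    'someuser': json.dumps({'friends_count': 500,
                            'followers_count': 1200}),
}

def rank_by_followers(screen_names):
    """Rank users by raw follower count -- the trivial influence metric."""
    counts = {}
    for screen_name in screen_names:
        user = json.loads(blobs[screen_name])
        counts[screen_name] = user['followers_count']
    return sorted(counts, key=counts.get, reverse=True)

print(rank_by_followers(['someuser', 'timoreilly']))
# ['timoreilly', 'someuser']
```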

Counting followers is interesting, but there’s so much more that can be done. For example, a given user may not have the popularity of an information maven like Tim O’Reilly, but if such a maven follows you and retweets you, you’ve suddenly tapped into a vast network of people who might start to follow you once they’ve determined that you’re also interesting. Thus, a better approach to calculating users’ potential influence is not only to compare their numbers of followers, but to spider out into the network a couple of levels. In fact, we can use the very same breadth-first approach that was introduced in Example 2-4.

Example 4-8 illustrates a generalized crawl function that accepts a list of screen names, a crawl depth, and parameters that control how many friends and followers to retrieve. The friends_limit and followers_limit parameters control how many items to fetch from the social graph APIs (in batches of 5,000), while friends_sample and followers_sample control how many user objects to retrieve (in batches of 100). An updated function for getUserInfo is also included to reflect the pass-through of the sampling parameters.

Example 4-8. Crawling friends/followers connections (friends_followers__crawl.py)

# -*- coding: utf-8 -*-

import sys
import redis
import functools
from twitter__login import login
from twitter__util import getUserInfo
from twitter__util import _getFriendsOrFollowersUsingFunc

# Guard against a missing command-line argument; checked again in __main__
SCREEN_NAME = sys.argv[1] if len(sys.argv) > 1 else None

t = login()
r = redis.Redis()

# Some wrappers around _getFriendsOrFollowersUsingFunc that 
# create convenience functions

getFriends = functools.partial(_getFriendsOrFollowersUsingFunc, 
                               t.friends.ids, 'friend_ids', t, r)
getFollowers = functools.partial(_getFriendsOrFollowersUsingFunc,
                                 t.followers.ids, 'follower_ids', t, r)

def crawl(
    screen_names,
    friends_limit=10000,
    followers_limit=10000,
    depth=1,
    friends_sample=0.2,
    followers_sample=0.0,
    ):

    getUserInfo(t, r, screen_names=screen_names)
    for screen_name in screen_names:
        friend_ids = getFriends(screen_name, limit=friends_limit)
        follower_ids = getFollowers(screen_name, limit=followers_limit)

        friends_info = getUserInfo(t, r, user_ids=friend_ids, 
                                   sample=friends_sample)

        followers_info = getUserInfo(t, r, user_ids=follower_ids,
                                     sample=followers_sample)

        next_queue = [u['screen_name'] for u in friends_info + followers_info]

        d = 1
        while d < depth:
            d += 1
            (queue, next_queue) = (next_queue, [])
            for _screen_name in queue:
                friend_ids = getFriends(_screen_name, limit=friends_limit)
                follower_ids = getFollowers(_screen_name, limit=followers_limit)

                next_queue.extend(friend_ids + follower_ids)

            # Note that getUserInfo takes a keyword argument between 0.0 and
            # 1.0 called sample that allows you to crawl only a random sample
            # of nodes at any given level of the graph

            getUserInfo(t, r, user_ids=next_queue, sample=friends_sample)

if __name__ == '__main__':
    if len(sys.argv) < 2:
        print >> sys.stderr, "Please supply at least one screen name."
        sys.exit(1)
    else:
        crawl([SCREEN_NAME])

        # The data is now in the system. Do something interesting. For example, 
        # find someone's most popular followers as an indicator of potential influence.
        # See friends_followers__calculate_avg_influence_of_followers.py

Assuming you’ve run crawl with high enough numbers for friends_limit and followers_limit to get all of a user’s friend IDs and follower IDs, all that remains is to take a large enough random sample and calculate interesting metrics, such as the average number of followers one level out. It could also be fun to look at a user’s top N followers to get an idea of who that user might be influencing. Example 4-9 demonstrates one possible approach that pulls the data out of Redis and calculates Tim O’Reilly’s most popular followers.

Example 4-9. Calculating a Twitterer’s most popular followers (friends_followers__calculate_avg_influence_of_followers.py)

# -*- coding: utf-8 -*-

import sys
import json
import locale
import redis
from prettytable import PrettyTable

# Pretty printing numbers
from twitter__util import pp 

# These functions create consistent keys from 
# screen names and user id values
from twitter__util import getRedisIdByScreenName 
from twitter__util import getRedisIdByUserId

SCREEN_NAME = sys.argv[1]

locale.setlocale(locale.LC_ALL, '')

def calculate():
    r = redis.Redis()  # Default connection settings on localhost

    follower_ids = list(r.smembers(getRedisIdByScreenName(SCREEN_NAME,
                        'follower_ids')))

    followers = r.mget([getRedisIdByUserId(follower_id, 'info.json')
                       for follower_id in follower_ids])
    followers = [json.loads(f) for f in followers if f is not None]

    freqs = {}
    for f in followers:
        cnt = f['followers_count']
        if cnt not in freqs:
            freqs[cnt] = []

        freqs[cnt].append({'screen_name': f['screen_name'], 'user_id': f['id']})

    # It could take a few minutes to calculate freqs, so store a snapshot for later use

    r.set(getRedisIdByScreenName(SCREEN_NAME, 'follower_freqs'),
          json.dumps(freqs))

    keys = freqs.keys()
    keys.sort()

    print 'The top 10 followers from the sample:'

    fields = ['Screen Name', 'Followers Count']
    pt = PrettyTable(fields=fields)
    [pt.set_field_align(f, 'l') for f in fields]

    for (user, freq) in reversed([(user['screen_name'], k) for k in keys[-10:]
                                    for user in freqs[k]]):
        pt.add_row([user, pp(freq)])

    pt.printt()

    all_freqs = [k for k in keys for user in freqs[k]]
    avg = sum(all_freqs) / len(all_freqs)

    print "\nThe average number of followers for %s's followers: %s" \
        % (SCREEN_NAME, pp(avg))

# psyco can only compile functions, so wrap code in a function

try:
    import psyco
    psyco.bind(calculate)
except ImportError, e:
    pass  # psyco not installed

calculate()

Note

In many common number-crunching situations, the psyco module can dynamically compile code and produce dramatic speed improvements. It’s totally optional but definitely worth a hard look if you’re performing calculations that take more than a few seconds.

Output follows for a sample of about 150,000 (approximately 10%) of Tim O’Reilly’s followers. For statistical analysis, a sample this large relative to the population ensures a tiny margin of error at a very high confidence level.[27] That is, the results can be considered very representative, though not quite the same thing as the absolute truth about the population:

The top 10 followers from the sample:
aplusk 4,993,072
BarackObama 4,114,901
mashable 2,014,615
MarthaStewart 1,932,321
Schwarzenegger 1,705,177
zappos 1,689,289
Veronica 1,612,827
jack 1,592,004
stephenfry 1,531,813
davos 1,522,621

The average number of followers for timoreilly's followers: 445
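The “tiny margin of error” claim is easy to sanity-check. The sketch below uses a standard conservative formula for the margin of error of a sampled proportion with a finite-population correction; this is an illustration, not necessarily the exact calculation behind the footnote:

```python
import math

def margin_of_error(n, population, z=2.576):
    """Worst-case (p = 0.5) margin of error for a sampled proportion,
    with a finite-population correction; z = 2.576 gives 99% confidence."""
    p = 0.5
    se = math.sqrt(p * (1 - p) / n)
    fpc = math.sqrt((population - n) / float(population - 1))
    return z * se * fpc

# Sample of ~150,000 out of ~1.42 million followers (Example 4-6)
moe = margin_of_error(150000, 1423326)
print(round(moe * 100, 2))  # margin of error, as a percentage
```

The result is a fraction of one percent, comfortably supporting the claim that a sample of this size is very representative.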

Interestingly, a few familiar names show up on the list, including some of the most popular Twitterers of all time: Ashton Kutcher (@aplusk), Barack Obama, Martha Stewart, and Arnold Schwarzenegger, among others. Removing these top 10 followers and recalculating lowers the average number of followers of Tim’s followers to approximately 284. Removing any follower with fewer than 10 followers of her own, however, dramatically increases the number to more than 1,000! Noting that there are tens of thousands of followers in this range and briefly perusing their profiles does bring some reality into the situation: many of these users are spam accounts, users who are protecting their tweets, etc. Culling out both the top 10 followers and all followers having fewer than 10 followers of their own might be a reasonable metric to work with; doing both results in a number around 800, which is still quite high. There must be something to be said for getting retweeted by a popular Twitterer who has lots of connections to other popular Twitterers.
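The kind of recalculation just described is simple to reproduce. This sketch applies both cuts (dropping the top 10 and any account with fewer than 10 followers) to a made-up list of follower counts; the numbers are invented and won’t reproduce the figures above:

```python
# Made-up follower counts for a sample of followers: a couple of
# celebrity-scale outliers, a long tail of typical accounts, and
# some near-empty (spam or protected) accounts
counts = [5000000, 2000000] + [500] * 100 + [3] * 50

def trimmed_average(counts, top_n=10, min_followers=10):
    """Average follower count after dropping the top_n outliers and
    any account with fewer than min_followers followers."""
    kept = sorted(counts)[:-top_n] if top_n else sorted(counts)
    kept = [c for c in kept if c >= min_followers]
    return sum(kept) / float(len(kept)) if kept else 0.0

print(sum(counts) / float(len(counts)))  # raw average, skewed by the outliers
print(trimmed_average(counts))           # both cuts applied: 500.0
```

As in the text, the handful of mega-popular accounts dominates the raw average, while the trimmed version lands near the typical account’s count.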



[25] Whenever Twitter goes over capacity, an HTTP 503 error is issued. In a browser, the error page displays an image of the now infamous “fail whale.” See http://twitter.com/503.

[26] See The Tipping Point by Malcolm Gladwell (Back Bay Books) for a great discourse on mavens.

[27] It’s about a 0.14 margin of error for a 99% confidence level.
