You are previewing Mining the Social Web.

Mining the Social Web

Cover of Mining the Social Web by Matthew A. Russell Published by O'Reilly Media, Inc.
  1. Mining the Social Web
  2. SPECIAL OFFER: Upgrade this ebook with O’Reilly
  3. Preface
    1. Content Updates
      1. February 22, 2012
    2. To Read This Book?
    3. Or Not to Read This Book?
    4. Tools and Prerequisites
    5. Conventions Used in This Book
    6. Using Code Examples
    7. Safari® Books Online
    8. How to Contact Us
    9. Acknowledgments
  4. 1. Introduction: Hacking on Twitter Data
    1. Installing Python Development Tools
    2. Collecting and Manipulating Twitter Data
      1. Tinkering with Twitter’s API
      2. Frequency Analysis and Lexical Diversity
      3. Visualizing Tweet Graphs
      4. Synthesis: Visualizing Retweets with Protovis
    3. Closing Remarks
  5. 2. Microformats: Semantic Markup and Common Sense Collide
    1. XFN and Friends
    2. Exploring Social Connections with XFN
      1. A Breadth-First Crawl of XFN Data
    3. Geocoordinates: A Common Thread for Just About Anything
      1. Wikipedia Articles + Google Maps = Road Trip?
    4. Slicing and Dicing Recipes (for the Health of It)
    5. Collecting Restaurant Reviews
    6. Summary
  6. 3. Mailboxes: Oldies but Goodies
    1. mbox: The Quick and Dirty on Unix Mailboxes
    2. mbox + CouchDB = Relaxed Email Analysis
      1. Bulk Loading Documents into CouchDB
      2. Sensible Sorting
      3. Map/Reduce-Inspired Frequency Analysis
      4. Sorting Documents by Value
      5. couchdb-lucene: Full-Text Indexing and More
    3. Threading Together Conversations
      1. Look Who’s Talking
    4. Visualizing Mail “Events” with SIMILE Timeline
    5. Analyzing Your Own Mail Data
      1. The Graph Your (Gmail) Inbox Chrome Extension
    6. Closing Remarks
  7. 4. Twitter: Friends, Followers, and Setwise Operations
    1. RESTful and OAuth-Cladded APIs
      1. No, You Can’t Have My Password
    2. A Lean, Mean Data-Collecting Machine
      1. A Very Brief Refactor Interlude
      2. Redis: A Data Structures Server
      3. Elementary Set Operations
      4. Souping Up the Machine with Basic Friend/Follower Metrics
      5. Calculating Similarity by Computing Common Friends and Followers
      6. Measuring Influence
    3. Constructing Friendship Graphs
      1. Clique Detection and Analysis
      2. The Infochimps “Strong Links” API
      3. Interactive 3D Graph Visualization
    4. Summary
  8. 5. Twitter: The Tweet, the Whole Tweet, and Nothing but the Tweet
    1. Pen : Sword :: Tweet : Machine Gun (?!?)
    2. Analyzing Tweets (One Entity at a Time)
      1. Tapping (Tim’s) Tweets
      2. Who Does Tim Retweet Most Often?
      3. What’s Tim’s Influence?
      4. How Many of Tim’s Tweets Contain Hashtags?
    3. Juxtaposing Latent Social Networks (or #JustinBieber Versus #TeaParty)
      1. What Entities Co-Occur Most Often with #JustinBieber and #TeaParty Tweets?
      2. On Average, Do #JustinBieber or #TeaParty Tweets Have More Hashtags?
      3. Which Gets Retweeted More Often: #JustinBieber or #TeaParty?
      4. How Much Overlap Exists Between the Entities of #TeaParty and #JustinBieber Tweets?
    4. Visualizing Tons of Tweets
      1. Visualizing Tweets with Tricked-Out Tag Clouds
      2. Visualizing Community Structures in Twitter Search Results
    5. Closing Remarks
  9. 6. LinkedIn: Clustering Your Professional Network for Fun (and Profit?)
    1. Motivation for Clustering
    2. Clustering Contacts by Job Title
      1. Standardizing and Counting Job Titles
      2. Common Similarity Metrics for Clustering
      3. A Greedy Approach to Clustering
      4. Hierarchical and k-Means Clustering
    3. Fetching Extended Profile Information
    4. Geographically Clustering Your Network
      1. Mapping Your Professional Network with Google Earth
      2. Mapping Your Professional Network with Dorling Cartograms
    5. Closing Remarks
  10. 7. Google+: TF-IDF, Cosine Similarity, and Collocations
    1. Harvesting Google+ Data
    2. Data Hacking with NLTK
    3. Text Mining Fundamentals
      1. A Whiz-Bang Introduction to TF-IDF
      2. Querying Google+ Data with TF-IDF
    4. Finding Similar Documents
      1. The Theory Behind Vector Space Models and Cosine Similarity
      2. Clustering Posts with Cosine Similarity
      3. Visualizing Similarity with Graph Visualizations
    5. Bigram Analysis
      1. How the Collocation Sausage Is Made: Contingency Tables and Scoring Functions
    6. Tapping into Your Gmail
      1. Accessing Gmail with OAuth
      2. Fetching and Parsing Email Messages
    7. Before You Go Off and Try to Build a Search Engine…
    8. Closing Remarks
  11. 8. Blogs et al.: Natural Language Processing (and Beyond)
    1. NLP: A Pareto-Like Introduction
      1. Syntax and Semantics
      2. A Brief Thought Exercise
    2. A Typical NLP Pipeline with NLTK
    3. Sentence Detection in Blogs with NLTK
    4. Summarizing Documents
      1. Analysis of Luhn’s Summarization Algorithm
    5. Entity-Centric Analysis: A Deeper Understanding of the Data
      1. Quality of Analytics
    6. Closing Remarks
  12. 9. Facebook: The All-in-One Wonder
    1. Tapping into Your Social Network Data
      1. From Zero to Access Token in Under 10 Minutes
      2. Facebook’s Query APIs
    2. Visualizing Facebook Data
      1. Visualizing Your Entire Social Network
      2. Visualizing Mutual Friendships Within Groups
      3. Where Have My Friends All Gone? (A Data-Driven Game)
      4. Visualizing Wall Data As a (Rotating) Tag Cloud
    3. Closing Remarks
  13. 10. The Semantic Web: A Cocktail Discussion
    1. An Evolutionary Revolution?
    2. Man Cannot Live on Facts Alone
      1. Open-World Versus Closed-World Assumptions
      2. Inferencing About an Open World with FuXi
    3. Hope
  14. Index
  15. About the Author
  16. Colophon
  17. SPECIAL OFFER: Upgrade this ebook with O’Reilly
O'Reilly logo

Chapter 4. Twitter: Friends, Followers, and Setwise Operations

Unless you’ve been cryogenically frozen for the past couple of years, you’ve no doubt heard of Twitter—a microblogging service that can be used to broadcast short (maximum 140 characters) status updates. Whether you love it, hate it, or are indifferent, it’s undeniable that Twitter has reshaped the way people communicate on the Web. This chapter makes a modest attempt to introduce some rudimentary analytic functions that you can implement by taking advantage of the Twitter APIs to answer a number of interesting questions, such as:

  • How many friends/followers do I have?

  • Who am I following that is not following me back?

  • Who is following me that I am not following back?

  • Who are the friendliest and least friendly people in my network?

  • Who are my “mutual friends” (people I’m following that are also following me)?

  • Given all of my followers and all of their followers, what is my potential influence if I get retweeted?

Warning

Twitter’s API is constantly evolving. It is highly recommended that you follow the Twitter API account, @TwitterAPI, and check any differences between the text and actual behavior you are seeing against the official docs.

This chapter analyzes relationships among Twitterers, while the next chapter hones in on the actual content of tweets. The code we’ll develop for this chapter is relatively robust in that it takes into consideration common issues such as the infamous Twitter rate limits,[23] network I/O errors, potentially managing large volumes of data, etc. The final result is a fairly powerful command-line utility that you should be able to adapt easily for your own custom uses (http://github.com/ptwobrussell/Mining-the-Social-Web/blob/master/python_code/TwitterSocialGraphUtility.py).

Note

Having the tools on hand to harvest and mine your own tweets is essential. However, be advised that initiatives to archive historical Twitter data in the U.S. Library of Congress may soon render the inconveniences and headaches associated with harvesting and API rate-limiting non-issues for many forms of analysis. Firms such as Infochimps are also emerging and providing a medium for acquiring various kinds of Twitter data (among other things). A query for Twitter data at Infochimps turns up everything from archives for #worldcup tweets to analyses of how smileys are used.

RESTful and OAuth-Cladded APIs

The year 2010 will be remembered by some as the transition period in which Twitter started to become “all grown up.” Basic HTTP authentication got replaced with OAuth[24] (more on this shortly), documentation improved, and API statuses became transparent, among other things. Twitter search APIs that were introduced with Twitter’s acquisition of Summize collapsed into the “traditional” REST API, while the streaming APIs gained increasing use for production situations. If Twitter were on the wine menu, you might pick up the 2010 bottle and say, “it was a good year—a very good year.” All that said, there’s a lot of useful information tucked away online, and this chapter aims not to reproduce any more of it than is absolutely necessary.

Most of the development in this chapter revolves around the social graph APIs for getting the friends and followers of a user, the API for getting extended user information (name, location, last tweet, etc.) for a list of users, and the API for getting tweet data. An entire book of its own (literally) could be written to explore additional possibilities, but once you’ve learned the ropes, your imagination will have no problems taking over. Plus, certain exercises always have to be left for the “interested reader,” right?

The Python client we’ll use for Twitter is quite simply named twitter (the same one we’ve already seen at work in Chapter 1). It provides a minimal wrapper around Twitter’s RESTful web services. There’s very little documentation about this module because you simply construct requests in the same manner that the URL is pieced together in Twitter’s online documentation. For example, a request from the terminal that retrieves Tim O’Reilly’s user info simply involves dispatching a request to the /users/show resource as a curl command, as follows:

$ curl 'http://api.twitter.com/1/users/show.json?screen_name=timoreilly'

Note

curl is a handy tool that can be used to transfer data to/from a server using a variety of protocols, and it is especially useful for making HTTP requests from a terminal. It comes standard and is usually in the PATH on most *nix systems, but Windows users may need to download and configure it.

There are a couple of subtleties about this request. First, Twitter has a versioned API, so the appearance of /1 as the URL context denotes that Version 1 of the API is in use. Next, a user_id could have been passed in instead of a screen_name had one been available. The mapping of the curl command to the equivalent Python script in Example 4-1 should be obvious.

Example 4-1. Fetching extended information about a Twitter user

# -*- coding: utf-8 -*-

import twitter
import json

screen_name = 'timoreilly'
t = twitter.Twitter(domain='api.twitter.com', api_version='1')
response = t.users.show(screen_name=screen_name)
print json.dumps(response, sort_keys=True, indent=4)

In case you haven’t reviewed Twitter’s online docs yet, it’s probably worthwhile to explicitly mention that the /users/show API call does not require authentication, and it has some specific peculiarities depending on whether a user has “protected” his tweets in the privacy settings. The /users/lookup API call is very similar to /users/show except that it requires authentication and allows you to pass in a comma-separated list of screen_name or user_id values so that you can perform batch lookups. To obtain authorization to use Twitter’s API, you’ll need to learn about OAuth, which is the topic of the next section.

No, You Can’t Have My Password

OAuth stands for “open authorization.” In an effort to be as forward-looking as possible, this section provides a very cursory overview of OAuth 2.0, an emerging authorization scheme that Facebook has already implemented from a clean slate (see Chapter 9) and that Twitter plans to support “soon”. It will eventually become the new industry standard. As this book was written, Twitter and many other web services support OAuth 1.0a, as defined by RFC 5849. However, the landscape is expected to shift, with OAuth 2.0 streamlining work for developers and promoting better user experiences. While the terminology and details of this section are specific to OAuth 2.0, the basic workflow involved in OAuth 1.0a is very similar. Both schemes are “three-legged” in that they involve an exchange of information (often called a “dance”) among a client application that needs access to a protected resource, a resource owner such as a social network, and an end user who needs to authorize the client application to access the protected resource (without giving it a username/password combination).

Note

It is purely coincidence that Twitter’s current (and only) API Version is 1.0 and that Twitter currently supports OAuth 1.0a. It is likely that Version 1 of Twitter’s API will eventually also support OAuth 2.0 when it is ready.

So, in a nutshell, OAuth provides a way for you to authorize an application to access data you have stored away in another application without having to share your username and password. The IETF OAuth 2.0 Protocol spec isn’t nearly as scary as it might sound, and you should take a little time to peruse it because OAuth is popping up everywhere—especially in the social network landscape. Here’s the gist of the major steps involved:

  • You (the end user) want to authorize an application of some sort (the client) to access some of your data (a scope) that’s managed by a web service (the resource owner).

    • You’re smart and you know better than to give the app your credentials directly.

  • Instead of asking for your password, the client redirects you to the resource owner, and you authorize a scope for the client directly with the resource owner.

    • The client identifies itself with a unique client identifier and a way to contact it once authorization has taken place by the end user.

  • Assuming the end user authorizes the client, the client is notified and given an authorization code confirming that the end user has authorized it to access a scope.

    • But there’s a small problem if we stop here: given that the client has identified itself with an identifier that is necessarily not a secret, a malicious client could have fraudulently identified itself and masqueraded as being created by a trusted publisher, effectively deceiving the end user to authorize it.

  • The client presents the authorization code it just received along with its client identifier and corresponding client secret to the resource owner and gets back an access token. The combination of client identifier, client secret, and authorization code ensures that the resource owner can positively identify the client and its authorization.

    • The access token may optionally be short-lived and need to be refreshed by the client.

  • The client uses the access token to make requests on behalf of the end user until the access token is revoked or expires.

Section 1.4.1 of the spec provides more details and the basis for Figure 4-1.

The web server flow for OAuth 2.0. Start with the web client and follow the arrows in lexographic order so that you end back at the web client after completing steps A–E.

Figure 4-1. The web server flow for OAuth 2.0. Start with the web client and follow the arrows in lexographic order so that you end back at the web client after completing steps A–E.

Again, OAuth 2.0 is a relatively new beast, and as of late 2010, various details are still being hammered out before the spec is officially published as an RFC. Your best bet for really understanding it is to sweat through reading the spec, sketching out some flows, asking yourself questions, role playing as the malicious user, etc. To get a feel for some of the politics associated with OAuth, review the Ars Technica article by Ryan Paul, “Twitter: A Case Study on How to Do OAuth Wrong”. Also check out Eran Hammer-Lahav’s (the first author on the OAuth 2.0 Protocol working draft) response, along with the fairly detailed rebuttal by Ben Adida, a cryptography expert.

Note

As a late-breaking convenience for situations that involve what Twitter calls a “single-user use case”, Twitter also offers a streamlined fast track for getting the credentials you need to make requests without having to implement the entire OAuth flow. Read more about it at http://dev.twitter.com/pages/oauth_single_token. This chapter and the following chapter, however, are written with the assumption that you’ll want to implement the standard OAuth flow.



[23] At the time of this writing, Twitter limits OAuth requests to 350 per hour and anonymous requests to 150 per hour. Twitter believes that these limits should be sufficient for most every client application, and that if they’re not, you’re probably building your app wrong.

[24] As of December 2010, Twitter implements OAuth 1.0a, but you should expect to see support for OAuth 2.0 sometime in 2011.

The best content for your career. Discover unlimited learning on demand for around $1/day.