Mining the Social Web

By Matthew A. Russell. Published by O'Reilly Media, Inc.

Chapter 1. Introduction: Hacking on Twitter Data

Although we could get started with an extended discussion of specific social networking APIs, schemaless design, or many other things, let’s instead dive right into some introductory examples that illustrate how simple it can be to collect and analyze some social web data. This chapter is a drive-by tutorial that aims to motivate you and get you thinking about some of the issues that the rest of the book revisits in greater detail. We’ll start off by getting our development environment ready and then quickly move on to collecting and analyzing some Twitter data.

Installing Python Development Tools

The example code in this book is written in Python, so if you already have a recent version of Python and easy_install on your system, you obviously know your way around and should probably skip the remainder of this section. If you don’t already have Python installed, the bad news is that you’re probably not already a Python hacker. But don’t worry, because you will be soon; Python has a way of doing that to people because it is easy to pick up and learn as you go along. Users of all platforms can find instructions for downloading and installing Python at http://www.python.org/download/, but it is highly recommended that Windows users install ActivePython, which automatically adds Python to your path at the Windows Command Prompt (henceforth referred to as a “terminal”) and comes with easy_install, which we’ll discuss in just a moment.

The examples in this book were authored in and tested against the latest Python 2.7 branch, but they should also work fine with other relatively up-to-date versions of Python. At the time this book was written, Python Version 2 was still the status quo in the Python community, and it is recommended that you stick with it unless you are confident that all of the dependencies you’ll need have been ported to Version 3 and you are willing to debug any idiosyncrasies involved in the switch.

Once Python is installed, you should be able to type python in a terminal to spawn an interpreter. Try following along with Example 1-1.

Example 1-1. Your very first Python interpreter session

>>> print "Hello World"
Hello World
>>> #this is a comment
...
>>> for i in range(0,10): # a loop
...     print i, # the comma suppresses line breaks
... 
0 1 2 3 4 5 6 7 8 9
>>> numbers = [ i for i in range(0,10) ] # a list comprehension
>>> print numbers
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
>>> if 10 in numbers: # conditional logic
...     print True
... else:
...     print False
... 
False
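
Because the examples assume Python 2 syntax, it can be handy to confirm which interpreter you are actually running before going further. The following is only a minimal sketch: the function name describe_python is hypothetical (it does not come from the book), and the code calls print with a single parenthesized string so that it should run on both Python 2.7 and Python 3.

```python
import sys

def describe_python(version_info=None):
    """Return a short note about the given (or the running) Python version."""
    # Default to the running interpreter; a plain tuple such as (2, 7, 3) also works
    if version_info is None:
        version_info = sys.version_info
    major, minor = version_info[0], version_info[1]
    label = "Python %d.%d" % (major, minor)
    if major == 2:
        return label + ": print statements like those in Example 1-1 should run as written"
    # Under Python 3, print is a function rather than a statement
    return label + ": print is a function here, so write print(x) instead of print x"

print(describe_python())
```

If the note reports a Python 3 interpreter, expect to adapt the print statements (and possibly other details) in the listings as you follow along.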

One other tool you’ll want to have on hand is easy_install,[5] which is similar to a package manager on Linux systems; it allows you to effortlessly install Python packages instead of downloading, building, and installing them from source. You can download the latest version of easy_install from http://pypi.python.org/pypi/setuptools, where there are specific instructions for each platform. Generally speaking, *nix users will want to sudo easy_install so that modules are written to Python’s global installation directories. It is assumed that Windows users have taken the advice to use ActivePython, which automatically includes easy_install as part of its installation.

Note

Windows users might also benefit from reviewing the blog post “Installing easy_install…could be easier”, which discusses some common problems related to compiling C code that you may encounter when running easy_install.

Once you have properly configured easy_install, you should be able to run the following command to install NetworkX—a package we’ll use throughout the book for building and analyzing graphs—and observe similar output:

$ easy_install networkx
Searching for networkx

...truncated output...

Finished processing dependencies for networkx

With NetworkX installed, you might think that you could just import it from the interpreter and get right to work, but occasionally some packages might surprise you. For example, suppose this were to happen:

>>> import networkx
Traceback (most recent call last):

... truncated output ...

ImportError: No module named numpy

An ImportError like this one usually means that a package is missing. In this illustration, the module we installed, networkx, has an unsatisfied dependency called numpy, a highly optimized collection of tools for scientific computing. Usually, another invocation of easy_install fixes the problem, and this situation is no different. Just close your interpreter and install the dependency by typing easy_install numpy in the terminal:

$ easy_install numpy
Searching for numpy

...truncated output...

Finished processing dependencies for numpy
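
Rather than discovering missing packages one traceback at a time, you could check several imports up front. The sketch below is an illustration, not part of the book's code: find_missing is a hypothetical helper name, and it relies only on the standard library's importlib.import_module, catching the same ImportError shown above.

```python
import importlib

def find_missing(module_names):
    """Return the names from module_names that cannot be imported."""
    missing = []
    for name in module_names:
        try:
            importlib.import_module(name)
        except ImportError:
            # The same error the interpreter raised for numpy above
            missing.append(name)
    return missing

# Check the two packages discussed in this section
missing = find_missing(["networkx", "numpy"])
if missing:
    print("Missing packages: " + ", ".join(missing))
else:
    print("All imports succeeded")
```

Any name it reports is a candidate for another easy_install invocation. Note that this check only sees top-level import failures; it does not verify versions.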

Now that numpy is installed, you should be able to open up a new interpreter, import networkx, and use it to build up graphs. Example 1-2 demonstrates.

Example 1-2. Using NetworkX to create a graph of nodes and edges

>>> import networkx
>>> g=networkx.Graph()
>>> g.add_edge(1,2)
>>> g.add_node("spam")
>>> print g.nodes()
[1, 2, 'spam']
>>> print g.edges()
[(1, 2)]
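
To get a feel for what a graph object keeps track of, it may help to see the same session against a toy structure. The class below is a hypothetical stand-in written for illustration only; it is not how NetworkX is implemented. It assumes an undirected graph stored as a dictionary that maps each node to the set of its neighbors.

```python
class TinyGraph(object):
    """Toy undirected graph for illustration; not NetworkX's implementation."""

    def __init__(self):
        self.adj = {}  # node -> set of neighboring nodes

    def add_node(self, n):
        # Adding a node that already exists leaves it unchanged
        self.adj.setdefault(n, set())

    def add_edge(self, u, v):
        # Adding an edge implicitly adds both endpoints
        self.add_node(u)
        self.add_node(v)
        self.adj[u].add(v)
        self.adj[v].add(u)

    def nodes(self):
        return list(self.adj)

    def edges(self):
        # Report each undirected edge once, even though it is stored twice
        seen = set()
        result = []
        for u in self.adj:
            for v in self.adj[u]:
                if (v, u) not in seen:
                    seen.add((u, v))
                    result.append((u, v))
        return result

g = TinyGraph()
g.add_edge(1, 2)
g.add_node("spam")
print(sorted(g.nodes(), key=str))
print(g.edges())
```

As in Example 1-2, adding the edge between 1 and 2 creates both nodes, while "spam" exists as a node with no edges.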

At this point, you have some of your core Python development tools installed and are ready to move on to some more interesting tasks. If most of the content in this section has been a learning experience for you, it would be worthwhile to review the official Python tutorial online before proceeding further.



[5] Although the examples in this book use the well-known easy_install, the Python community has slowly been gravitating toward pip, another installation tool you should be aware of and that generally “just works” with any package that can be easy_install’d. If you have git tooling already installed, pip is also handy for installing directly from GitHub repositories for packages that aren't available through PyPI, as illustrated in Exploring the Graph API one connection at a time.
