Analyzing Tweets (One Entity at a Time)

CouchDB makes a great storage medium for collecting tweets because, just like the email messages we looked at in Chapter 3, they are conveniently represented as JSON-based documents and lend themselves to map/reduce analysis with very little effort. Our next example script harvests tweets from time lines; it is relatively robust and should be easy to understand because all of the modules and much of the code have already been introduced in earlier chapters. One subtle consideration in reviewing it is that it uses a simple map/reduce job to compute the maximum ID value of the tweets already stored and passes that value back in as a query constraint so as to avoid pulling duplicate data from Twitter's API. See the documentation for the since_id parameter of the time line APIs for more details.
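
To make that max-ID computation concrete, here is a minimal sketch, assuming the couchdb package used in earlier chapters, a CouchDB server configured with the Python view server, and an illustrative database of tweet documents whose id field holds the tweet's numeric ID:

    import couchdb
    from couchdb.design import ViewDefinition

    server = couchdb.Server('http://localhost:5984')
    db = server['tweets-user-timeline']  # illustrative database name

    # Map function: emit each stored tweet's numeric ID
    def id_mapper(doc):
        yield (None, doc['id'])

    # Reduce function: collapse the emitted IDs down to the maximum
    def max_finding_reducer(keys, values, rereduce):
        return max(values)

    view = ViewDefinition('index', 'max_tweet_id', id_mapper,
                          max_finding_reducer, language='python')
    view.sync(db)

    # The single reduced row is the highest tweet ID already stored;
    # passing it as since_id keeps subsequent requests duplicate-free
    since_id = int([row.value for row in db.view('index/max_tweet_id')][0])

The appeal of this design is that the reduce step collapses all of the emitted IDs into a single small row on the server, so the client never has to page through the stored tweets just to learn where it left off.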

It may also be informative to note that the user time line makes available at most around the 3,200 most recent tweets, while the home time line[30] returns around 800 statuses; thus, it's not very expensive (in terms of counting toward your rate limit) to pull all of the data that's available. Perhaps less intuitive when first interacting with the time line APIs is that requests for the public time line return only 20 tweets, and those tweets are refreshed only every 60 seconds. To collect larger amounts of public data, you need to use the streaming API.
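
As a rough illustration of what exhausting a user time line looks like, here is a minimal sketch, assuming the twitter package from earlier chapters and the v1 REST API of that era; the screen name is illustrative, and authentication details are omitted:

    import twitter

    t = twitter.Twitter(domain='api.twitter.com', api_version='1')

    statuses = []
    page = 1
    while True:
        # Each page returns up to 200 statuses; an empty page means the
        # API has no more to give (roughly the 3,200-status ceiling)
        results = t.statuses.user_timeline(screen_name='timoreilly',
                                           count=200, page=page)
        if not results:
            break
        statuses.extend(results)
        page += 1

    print('Fetched %s statuses' % len(statuses))

At 200 statuses per request, even a fully saturated user time line costs only about 16 requests, which is why pulling everything available is cheap relative to your rate limit.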

For example, if you wanted to learn a little more about Tim O’Reilly, “Silicon Valley’s ...
