Mining the Social Web by Matthew A. Russell

Threading Together Conversations

As a first attempt at threading conversations, you might start with some basic string heuristics on the Subject header of each message and eventually work your way up to inspecting senders, recipients, and date stamps in an attempt to piece things together. Fortunately, mail servers are slightly more sophisticated than you might think and, as you know from mbox: The Quick and Dirty on Unix Mailboxes, messages carry Message-ID, In-Reply-To, and References headers that can be used to extract conversations from a mailbox. A message threading algorithm commonly known as “jwz threading”[20] takes all of this into account and provides a reasonable approach to parsing out message threads. All of the specifics for the algorithm can be found online at http://www.jwz.org/doc/threading.html.

The implementation we’ll be using is a fairly straightforward modification[21] of the one found in the Mail Trends project, which provides some other useful out-of-the-box tools. Given that no check-ins for the project hosted on Google Code have occurred since early 2008, it’s unclear whether Mail Trends is being actively maintained anywhere. The project nonetheless provides a useful starting point for mail analysis, as evidenced by the salvaging of its jwz threading code.
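To make the role of these headers concrete, here is a minimal sketch of header-based threading in Python. It is not the jwz algorithm itself, nor the Mail Trends implementation: it only links each message to a single parent, and it assumes the last entry in References (or, failing that, the first entry in In-Reply-To) names that parent. The function names `parent_id` and `build_threads` are hypothetical and chosen for illustration only.

```python
from email import message_from_string


def parent_id(msg):
    """Return the Message-ID of the most likely parent, or None.

    Assumption: the last References entry is the direct parent; if there
    is no References header, fall back to the first In-Reply-To entry.
    """
    refs = (msg.get("References") or "").split()
    if refs:
        return refs[-1]
    in_reply_to = (msg.get("In-Reply-To") or "").split()
    return in_reply_to[0] if in_reply_to else None


def build_threads(messages):
    """Group messages into threads using only their ID headers.

    Returns (roots, children): roots is a list of Message-IDs that start
    a thread, and children maps a Message-ID to the IDs of its replies.
    """
    # Index messages by Message-ID; messages without one are skipped.
    by_id = {m["Message-ID"].strip(): m for m in messages if m["Message-ID"]}

    roots, children = [], {}
    for mid, msg in by_id.items():
        pid = parent_id(msg)
        if pid in by_id and pid != mid:
            children.setdefault(pid, []).append(mid)
        else:
            # No known parent in this mailbox, so treat it as a thread root.
            roots.append(mid)
    return roots, children


if __name__ == "__main__":
    # Hypothetical sample messages, not taken from any real mailbox.
    raw = [
        "Message-ID: <1@example.com>\nSubject: Lunch?\n\nAnyone?",
        "Message-ID: <2@example.com>\nIn-Reply-To: <1@example.com>\n"
        "Subject: Re: Lunch?\n\nSure.",
    ]
    roots, children = build_threads([message_from_string(r) for r in raw])
    print(roots, children)
```

One simplification worth noting: a reply whose parent is absent from the mailbox simply becomes a root here, whereas the full algorithm described at the URL above also handles missing messages and uses the complete References chain.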

Let’s go ahead and take a look at the overall workflow in Example 3-14, and then we’ll dive into a few more of the details.

Example 3-14. Creating discussion threads from mbox data via “jwz threading” ...
