Full Text Comparison

An alternative approach is to compare messages using their entire content, taking into account the insertion and deletion of words and changes in spelling and punctuation. This lets you use all the information in the text, rather than a single word or phrase, and it spares you from defining a specific pattern that may not work as well as you had hoped.

Text comparison in this general sense is not a simple problem, and simple tools such as grep or diff are not up to the task. Tools based on dynamic programming, which I discuss briefly in Chapter 8 in the context of uncovering plagiarism, are too computationally expensive to be used here. Fortunately, a variety of open source text search tools are available. Most of these operate by indexing the significant words in each document and then efficiently comparing those indexes. In its basic form, this approach treats each word separately and so discards the considerable information contained in how words are arranged in sentences. For email searches, that loss is not an important factor. Some of the leading tools in this area include WebGlimpse (http://webglimpse.net/), Swish-e (http://swish-e.org/), and Lucene (http://lucene.apache.org/). Efficient text comparison is a major component of Internet search engines and, not surprisingly, these open source tools tend to focus on that application.
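
To make the idea of indexing significant words and comparing the indexes more concrete, here is a minimal Python sketch. It is not how WebGlimpse, Swish-e, or Lucene are implemented; it simply counts the words in each message, drops a few common ones, and scores two messages by the cosine similarity of their word counts. The stop-word list and the function names word_index and cosine_similarity are assumptions made up for this illustration.

```python
import re
from collections import Counter
from math import sqrt

# Hypothetical, deliberately tiny stop-word list (an assumption for this
# sketch); real indexing tools use much larger ones.
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is",
              "you", "your", "for", "on", "at", "this", "that", "with"}


def word_index(text):
    """Return a Counter of the significant (non-stop) words in text."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return Counter(w for w in words if w not in STOP_WORDS)


def cosine_similarity(a, b):
    """Score two word indexes from 0.0 (no shared words) to 1.0."""
    # A Counter returns 0 for words it has not seen, so b[word] is safe.
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # at least one message had no significant words
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    first = word_index("Please verify your account details now")
    second = word_index("Verify the account details now, please!")
    print(cosine_similarity(first, second))
```

Because each index records only which words occur and how often, two messages that use the same words in a different order score as essentially identical. That is the word-arrangement information this basic form of the approach gives up.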

Rather than show how one of these tools can be adapted for email searching, I have chosen ...
