Word Analysis

Bayes' Rule enables calculation of the probability that a message is spam, given the observed probabilities that various words indicated spam (or non-spam) in the past. One drawback of non-Bayesian filtering, which looks only for certain keywords, addresses, or other patterns, is that it lacks a “big picture” of the message. Initial Bayesian spam filters chose only 30 words to examine [1, 2]. Newer filters [4] look much more deeply.
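As a rough illustration of the calculation described above, the following Python sketch combines per-word spam probabilities into a single score for a message. It assumes the words are independent (the usual "naive" simplification) and keeps only the words whose probabilities lie furthest from 0.5. The function name, the default probability for unseen words, and the clamping bounds are assumptions made for this example, not values taken from any particular filter.

```python
from math import prod

def spam_probability(word_probs, message_words, max_words=30, unknown=0.4):
    """Combine per-word spam probabilities into one score for a message.

    word_probs maps a word to the observed probability that a message
    containing it is spam. Words never seen before get `unknown`
    (an assumed default for this sketch).
    """
    # Clamp to (0, 1) so that a single 0.0 or 1.0 cannot force a zero denominator.
    probs = [min(0.99, max(0.01, word_probs.get(w, unknown)))
             for w in set(message_words)]
    # Keep the most "interesting" words: those furthest from the neutral 0.5.
    probs.sort(key=lambda p: abs(p - 0.5), reverse=True)
    top = probs[:max_words]
    spam = prod(top)                  # evidence that the message is spam
    ham = prod(1.0 - p for p in top)  # evidence that it is legitimate
    return spam / (spam + ham)

# Hypothetical training results, for illustration only.
observed = {"viagra": 0.99, "free": 0.90, "meeting": 0.05}
score = spam_probability(observed, ["free", "viagra"])
```

Limiting the calculation to `max_words` mirrors the early filters' practice of examining a fixed number of words; raising the limit lets more of the message contribute to the score.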

One author [5] carefully reduced words to their stems (for example, reducing “mails” and “mailing” to “mail”). Graham [1, 2] took care to extend his analysis to message headers, which is intuitive because certain sources of email issue only spam.
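The two ideas above, stemming and header analysis, can be sketched in a few lines of Python. This is only an illustration: the suffix list is deliberately tiny and is not the stemming method used in [5], and tagging header words with the header name (for example, "subject*free") is an assumed convention for this example. It also assumes a single-part, plain-text message.

```python
import re
from email import message_from_string

SUFFIXES = ("ing", "s")  # hypothetical, deliberately minimal suffix list

def crude_stem(word):
    """Strip one known suffix, leaving at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def tokenize(raw_message):
    """Return header tokens (tagged with the header name) and stemmed body tokens."""
    msg = message_from_string(raw_message)
    tokens = []
    # Header words are tagged so that "free" in a Subject line is counted
    # separately from "free" in the body.
    for name, value in msg.items():
        for word in re.findall(r"[a-z0-9]+", value.lower()):
            tokens.append(f"{name.lower()}*{word}")
    body = msg.get_payload()
    if isinstance(body, str):  # multipart messages are ignored in this sketch
        for word in re.findall(r"[a-z0-9]+", body.lower()):
            tokens.append(crude_stem(word))
    return tokens

raw = "From: ads@example.com\nSubject: Free mails\n\nMailing the mails\n"
tokens = tokenize(raw)
```

Here both “Mailing” and “mails” in the body become the token "mail", while the sender's domain survives as a separate header token that a filter can score on its own.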

Bill Yerazunis, author of the spam-filtering ...
