The data

The spam dataset is available from the SpamAssassin project at the following link:

http://spamassassin.apache.org/

The easy ham (not spam) folder contains 2551 files.


A spam message may include both HTML tags and plain text. In this case, we are only interested in the subject line, so we need to write code that extracts the subject from every file.
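As a rough illustration of that idea, and not the book's own listing, the following sketch reads every file in a folder and keeps its Subject header. It assumes each file holds one raw e-mail message; the function names `read_subject` and `collect_subjects` are hypothetical, and it relies only on Python's standard `email` module.

```python
import os
from email import message_from_binary_file


def read_subject(path):
    """Return the Subject header of one e-mail file, or None if it has none."""
    # Open in binary mode: the messages are not guaranteed to be valid UTF-8.
    with open(path, "rb") as handle:
        message = message_from_binary_file(handle)
    raw = message["Subject"]
    if raw is None:
        return None
    # Collapse folded header lines and repeated whitespace into single spaces.
    return " ".join(str(raw).split())


def collect_subjects(folder):
    """Collect the non-empty subject lines of all files directly inside folder."""
    subjects = []
    for name in sorted(os.listdir(folder)):
        path = os.path.join(folder, name)
        if os.path.isfile(path):
            subject = read_subject(path)
            if subject:
                subjects.append(subject)
    return subjects
```

A call such as `collect_subjects("easy_ham")` would return the list of subjects, where `"easy_ham"` is an assumed local path to the extracted folder. This sketch leaves MIME encoded-words (subjects beginning with `=?`) undecoded, and it skips messages that have no subject.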


This example will show you how to preprocess the SpamAssassin data using Python in order to collect all the subject lines from the e-mails.

First, we ...
