Clustering text

Text clustering has many applications. It groups similar documents based on the words they contain. A common example is grouping related news articles together. This section shows how to cluster text using Mahout.

The dataset

We will use the Reuters-21578 dataset for the clustering example. This dataset is a collection of Reuters newswire articles. We will download the archive and then extract it with tar into the reuters-sgm folder. The first command sets MAHOUT_LOCAL so that Mahout runs locally rather than on a Hadoop cluster. Move to the data/chapter10 directory and execute the following commands:

export MAHOUT_LOCAL=TRUE
curl http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.tar.gz -o reuters21578.tar.gz

mkdir -p reuters-sgm

tar xzf reuters21578.tar.gz -C reuters-sgm ...
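Before going further, it can help to confirm that the extraction produced what we expect. The following is a minimal sketch, not part of the original walkthrough: the helper names count_sgm_files and count_articles are hypothetical, and it assumes that the archive unpacks into .sgm files in which each article record starts with a <REUTERS tag on its own line.

```shell
#!/bin/sh
# Sketch: sanity-check an extracted Reuters archive.
# Assumption: article records start with a "<REUTERS" tag inside *.sgm files.

# Print the number of .sgm files directly inside directory $1.
count_sgm_files() {
    find "$1" -maxdepth 1 -type f -name '*.sgm' | wc -l | tr -d ' '
}

# Print the number of lines containing "<REUTERS" across all .sgm files in $1.
# -a makes grep treat the input as text, in case a file holds non-ASCII bytes.
count_articles() {
    cat "$1"/*.sgm | grep -a -c '<REUTERS'
}

# Only run the check if the folder created by the commands above exists.
if [ -d reuters-sgm ]; then
    echo "SGML files: $(count_sgm_files reuters-sgm)"
    echo "Articles:   $(count_articles reuters-sgm)"
fi
```

If the download was complete, the article count should presumably be close to the 21578 in the dataset's name. A count of zero would suggest that the download or the extraction did not finish.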
