Learning Apache Mahout by Chandramani Tiwary

Clustering text

Text clustering has many applications. It groups similar documents based on the words they contain. A common example is clustering news articles into groups of related stories. We will discuss how to implement text clustering using Mahout.

The dataset

We will use the Reuters-21578 dataset for the clustering example. It is a collection of Reuters newswire articles. We will download the archive and then extract it with tar into the reuters-sgm folder. Move to the directory data/chapter10 and execute the following commands:

export MAHOUT_LOCAL=TRUE
curl http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.tar.gz -o reuters21578.tar.gz

mkdir -p reuters-sgm

tar xzf reuters21578.tar.gz -C reuters-sgm ...
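Before going further, it can help to confirm that the extraction produced what is expected. The following is a minimal sketch, not part of Mahout or of the book's own commands: `check_sgm_dir` is a hypothetical helper that counts the `.sgm` files in a directory and the lines containing an opening `<REUTERS` tag. It assumes each article in the archive begins with such a tag on its own line.

```shell
#!/bin/sh
# Hypothetical helper (not a Mahout command): summarize an extracted directory.
# Usage: check_sgm_dir reuters-sgm
check_sgm_dir() {
  dir="$1"
  # Count the .sgm files directly inside the directory.
  files=$(find "$dir" -maxdepth 1 -type f -name '*.sgm' | wc -l | tr -d ' ')
  if [ "$files" -gt 0 ]; then
    # Count lines containing an opening <REUTERS tag across all files.
    # Assumption: each article opens with this tag on its own line.
    docs=$(cat "$dir"/*.sgm | LC_ALL=C grep -c '<REUTERS')
  else
    docs=0
  fi
  echo "files=$files docs=$docs"
}
```

Running `check_sgm_dir reuters-sgm` from data/chapter10 prints a single summary line. A result of `files=0 docs=0` suggests that the download or the extraction did not complete.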
