Text clustering has many applications. It groups similar documents based on the words they contain. One of the most common examples is clustering news articles into groups of related stories. We will discuss how to implement text clustering using Mahout.
We will use the Reuters dataset (Reuters-21578) for the clustering example. This dataset is a collection of Reuters newswire articles. We will download the dataset and then extract it using tar to the reuters-sgm folder. Move to the directory data/chapter10 and execute the following commands:
export MAHOUT_LOCAL=TRUE
curl http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.tar.gz -o reuters21578.tar.gz
mkdir -p reuters-sgm
tar xzf reuters21578.tar.gz -C reuters-sgm
...
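Before moving on, it can be useful to confirm that the extraction step actually produced the SGML files. The following is a minimal sketch of such a check, not part of Mahout or of the original commands: count_sgm_files is a hypothetical helper name, and it assumes the archive unpacks its .sgm files directly into the target folder rather than into a subdirectory.

```shell
# Hypothetical helper (an assumption for illustration, not a Mahout command):
# prints the number of .sgm files located directly inside the given directory.
count_sgm_files() {
  dir="$1"
  # Fail clearly if the extraction target does not exist.
  if [ ! -d "$dir" ]; then
    echo "directory not found: $dir" >&2
    return 1
  fi
  # Count regular files ending in .sgm; strip the padding some wc versions add.
  find "$dir" -maxdepth 1 -type f -name '*.sgm' | wc -l | tr -d '[:space:]'
}
```

For instance, running count_sgm_files reuters-sgm from data/chapter10 prints the number of extracted files. The Reuters-21578 distribution is documented as shipping 22 SGML data files, so a noticeably different count may point to an incomplete download or extraction.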