Clustering text data using K-Means

In this recipe, we are going to look at how to cluster text data using Mahout's implementation of the K-Means algorithm. K-Means is a very popular clustering algorithm; you can read more about it at https://en.wikipedia.org/wiki/K-means_clustering.
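Before handing the work to Mahout, it can help to see the idea behind K-Means on a tiny scale. The following is a minimal sketch in plain Python, not Mahout's implementation: it assumes simple term-count vectors, Euclidean distance, and hand-picked starting centroids, and the names vectorize, distance, and kmeans are hypothetical helpers introduced only for this illustration.

```python
from collections import Counter
import math


def vectorize(docs):
    """Turn each document into a term-count vector over a shared vocabulary."""
    vocab = sorted({word for doc in docs for word in doc.lower().split()})
    vectors = []
    for doc in docs:
        counts = Counter(doc.lower().split())
        vectors.append([float(counts[word]) for word in vocab])
    return vectors


def distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(vectors, initial_centroids, max_iterations=10):
    """Alternate assignment and update steps until assignments stop changing."""
    centroids = [list(c) for c in initial_centroids]
    assignments = None
    for _ in range(max_iterations):
        # Assignment step: each vector goes to its nearest centroid.
        new_assignments = [
            min(range(len(centroids)), key=lambda i: distance(v, centroids[i]))
            for v in vectors
        ]
        if new_assignments == assignments:
            break
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members.
        for i in range(len(centroids)):
            members = [v for v, a in zip(vectors, assignments) if a == i]
            if members:
                centroids[i] = [sum(col) / len(members) for col in zip(*members)]
    return assignments


# Toy documents, made up for this sketch.
docs = [
    "hadoop cluster storage",
    "hadoop cluster nodes",
    "apple banana fruit",
    "banana fruit salad",
]
vectors = vectorize(docs)
# Seed with the first and third documents (an arbitrary choice for this example).
labels = kmeans(vectors, [vectors[0], vectors[2]])
print(labels)  # prints [0, 0, 1, 1]
```

The two Hadoop-related documents end up in one cluster and the two fruit-related documents in the other. A real text-clustering job would typically use weighted vectors and a more careful choice of starting centroids, and would run distributed across the cluster rather than in a single process.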

Getting ready

To perform this recipe, you should have a running Hadoop cluster as well as the latest version of Mahout installed on it.

How to do it...

In this recipe, we are going to use Mahout's K-Means algorithm to cluster text data. To do this, we first need to get some text data and copy it to HDFS:

hadoop fs -mkdir /kmeans
hadoop fs -put mydata.txt /kmeans/input

In order to execute the K-Means job on the given data, we first ...


Hadoop: Data Processing and Modelling by Garry Turkington, Tanmay Deshpande, Sandeep Karanth
