Clustering text data using K-Means
In this recipe, we use Mahout's implementation of the K-Means algorithm to cluster text data. K-Means is a very popular clustering algorithm; you can read more about it at https://en.wikipedia.org/wiki/K-means_clustering.
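For intuition about what the algorithm does, here is a minimal single-machine sketch of K-Means in plain Python. It is not Mahout's implementation and does not run on Hadoop; the `kmeans` function, the sample 2-D points, and the choice of initial centroids are all made up for illustration. It only shows the two steps the algorithm repeats: assign each point to its nearest centroid, then move each centroid to the mean of its points.

```python
import math


def kmeans(points, centroids, max_iters=100):
    """Illustrative K-Means on small in-memory data (hypothetical helper)."""
    centroids = [tuple(c) for c in centroids]
    assignments = []
    for _ in range(max_iters):
        # Assignment step: attach each point to its nearest centroid.
        assignments = [
            min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its assigned points.
        new_centroids = []
        for i, old in enumerate(centroids):
            members = [p for p, a in zip(points, assignments) if a == i]
            if members:
                mean = tuple(sum(xs) / len(members) for xs in zip(*members))
                new_centroids.append(mean)
            else:
                # An empty cluster keeps its previous centroid.
                new_centroids.append(old)
        # Stop once the centroids no longer move.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, assignments


# Made-up sample data: two obvious groups of 2-D points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, assignments = kmeans(points, centroids=[points[0], points[3]])
print(assignments)  # [0, 0, 0, 1, 1, 1]
```

The number of clusters is fixed up front by how many initial centroids are passed in, which is the main design decision when applying K-Means to any dataset.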
Getting ready
To perform this recipe, you should have a running Hadoop cluster as well as the latest version of Mahout installed on it.
How to do it...
We are going to use Mahout's K-Means algorithm to cluster the available text data. To do this, we first need to get some text data and copy it to HDFS:
hadoop fs -mkdir /kmeans
hadoop fs -put mydata.txt /kmeans/input
In order to execute the K-Means job on the given data, we first ...