Grouping similar text documents with k-means clustering methods

A computer program cannot interpret the meaning of a sentence directly, so it has no built-in way to group documents by similarity. However, if we convert the documents into a numeric matrix (a document-term matrix, in which each row is a document and each column counts the occurrences of a term), a program can compute the distance between each pair of documents and group similar ones together.
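To make the idea concrete, here is a minimal sketch in base R that builds a small document-term matrix and computes pairwise Euclidean distances. The four headlines are made up for illustration and do not come from the recipe's data set, and the whitespace tokenizer is a simplifying assumption; the recipe itself may rely on a text-mining package instead.

```r
# Toy headlines (hypothetical; not taken from news.RData)
docs <- c(
  "stock market falls on rate fears",
  "stock market rallies as rate fears ease",
  "local team wins championship game",
  "team wins game in overtime"
)

# Naive tokenizer (an assumption): lower-case, then split on whitespace
tokens <- strsplit(tolower(docs), "\\s+")
vocab  <- sort(unique(unlist(tokens)))

# Document-term matrix: one row per document, one column per term,
# each cell holding the number of times the term occurs in the document
dtm <- t(vapply(
  tokens,
  function(tk) as.numeric(table(factor(tk, levels = vocab))),
  numeric(length(vocab))
))
dimnames(dtm) <- list(paste0("doc", seq_along(docs)), vocab)

# Pairwise Euclidean distances between the document rows
d <- dist(dtm)
print(round(as.matrix(d), 2))
```

In this toy example, the two finance headlines share several terms, so the distance between them is smaller than the distance between a finance headline and a sports headline.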

In this recipe, we demonstrate how to compute the distance between text documents and how we can cluster similar text documents with the k-means method.
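As a rough illustration of the clustering step, the sketch below runs base R's `kmeans()` on a toy document-term matrix. The headlines, the choice of two clusters, and the helper name `build_dtm` are all assumptions made for this example; they are not the recipe's own data or code.

```r
# Toy headlines (hypothetical; not taken from news.RData)
docs <- c(
  "stock market falls on rate fears",
  "stock market rallies as rate fears ease",
  "local team wins championship game",
  "team wins game in overtime"
)

# build_dtm is a hypothetical helper: term counts per document
build_dtm <- function(texts) {
  tokens <- strsplit(tolower(texts), "\\s+")
  vocab  <- sort(unique(unlist(tokens)))
  m <- t(vapply(
    tokens,
    function(tk) as.numeric(table(factor(tk, levels = vocab))),
    numeric(length(vocab))
  ))
  dimnames(m) <- list(paste0("doc", seq_along(texts)), vocab)
  m
}

dtm <- build_dtm(docs)

# k-means starts from random centers, so fix the seed and use
# several restarts (nstart) to get a stable result
set.seed(123)
fit <- kmeans(dtm, centers = 2, nstart = 25)

# Cluster label assigned to each document
print(fit$cluster)
```

The cluster numbers themselves are arbitrary labels; what matters is which documents share a label. The number of clusters must be chosen in advance, and two is used here only because the toy data has two obvious topics.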

Getting ready

In this recipe, we use news titles as clustering input. You can find the data on the author's GitHub page at https://github.com/ywchiu/rcookbook/raw/master/chapter12/news.RData.

How to do it...
