The segmentation of documents

To identify groups of cleaned terms based on their frequency and association across the documents of the corpus, one might run, for example, the classic hierarchical clustering algorithm directly on our tdm matrix.
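As a minimal sketch of this idea, the following example builds a term-document matrix from a toy corpus and passes it to hclust. The docs vector is hypothetical and only stands in for the cleaned package descriptions, and the tdm object is rebuilt here so that the block runs on its own; the choice of two clusters is likewise an assumption made for illustration.

```r
library(tm)

# Hypothetical toy corpus standing in for the cleaned package descriptions
docs <- c(
  "linear regression models",
  "regression models and mixed models",
  "plot maps and spatial data",
  "spatial data and maps"
)
corpus <- VCorpus(VectorSource(docs))

# In a term-document matrix, rows are terms and columns are documents
tdm <- TermDocumentMatrix(corpus)
m <- as.matrix(tdm)

# dist() compares rows, so the terms are being clustered here;
# hclust() uses complete linkage unless told otherwise
hc <- hclust(dist(m))

# Cut the tree into an (arbitrarily chosen) number of term groups
term_groups <- cutree(hc, k = 2)
```

Calling plot(hc) draws the dendrogram, which is usually the easiest way to decide where to cut the tree. On a real corpus the matrix tends to be very sparse, so it may be worth dropping rare terms first, for example with removeSparseTerms, before computing distances.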

On the other hand, if you would rather cluster the R packages based on their descriptions, you should compute a new matrix with DocumentTermMatrix instead of the previously used TermDocumentMatrix, so that each row represents a document rather than a term. Calling the clustering algorithm on this matrix then results in the segmentation of the packages.
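The document-level variant can be sketched in the same way. Again, the docs vector is a hypothetical stand-in for the package descriptions, and the number of clusters is an arbitrary choice for illustration.

```r
library(tm)

# Hypothetical toy corpus; each element plays the role of one package description
docs <- c(
  "linear regression models",
  "regression models and mixed models",
  "plot maps and spatial data",
  "spatial data and maps"
)
corpus <- VCorpus(VectorSource(docs))

# In a document-term matrix, rows are documents and columns are terms
dtm <- DocumentTermMatrix(corpus)
m <- as.matrix(dtm)

# dist() compares rows again, but the rows are now documents
hc <- hclust(dist(m))

# One cluster label per document (that is, per package)
doc_groups <- cutree(hc, k = 2)
```

The only change from the term-level sketch is the orientation of the matrix: because dist works on rows, switching from TermDocumentMatrix to DocumentTermMatrix is what turns a clustering of terms into a clustering of documents.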

For more details on the available methods, algorithms, and guidance on choosing the appropriate functions for clustering, please see Chapter 10, Classification and Clustering. For ...

Mastering Data Analysis with R by Gergely Daroczi
