8

SIMILARITY, CATEGORIZATION AND CLUSTERING

Automatic text categorization (ATC) has become one of the most significant commercial uses of content analysis techniques. At its core, ATC allows the computer to organize documents based on their content, offering a filtering capability orders of magnitude more sophisticated than simple keyword searching. The vector space model underlying ATC can also be used for other types of similarity analysis, from comparing pairs of documents to the automated grouping of entire archives. Clustering is an especially powerful technique that allows the machine to determine the set of categories that work best for a particular text collection, rather than placing documents into predefined ...
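To make the vector space idea concrete, here is a minimal sketch in Python. It assumes raw term counts as the vector weights and cosine similarity as the comparison measure, which is one common choice and not necessarily the weighting developed in this chapter. The function names (term_vector, cosine_similarity, categorize) and the sample texts are hypothetical and exist only for illustration.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Turn a document into a sparse term-frequency vector (word -> count).

    Assumption: simple lowercase alphabetic tokens, no stemming or stop words.
    """
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two term vectors (0.0 if either is empty)."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def categorize(document, category_examples):
    """Assign a document to the predefined category whose example text
    it most resembles. category_examples maps label -> example text."""
    doc_vec = term_vector(document)
    return max(
        category_examples,
        key=lambda label: cosine_similarity(
            doc_vec, term_vector(category_examples[label])
        ),
    )


# Hypothetical categories, each described by a single example text.
examples = {
    "sports": "the team won the match and scored a goal",
    "finance": "the bank raised interest rates on loans",
}
label = categorize("the striker scored a late goal in the match", examples)
print(label)
```

The same cosine_similarity function serves both uses described above: comparing one pair of documents directly, or scoring a document against every category. A clustering procedure would reuse it as well, but would group documents by their mutual similarity instead of relying on labeled examples.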

