BBC dataset

In 2006, Greene and Cunningham collected the BBC dataset to study a particular document-clustering challenge using support vector machines. The dataset consists of 2,225 documents published on the BBC News website from 2004 to 2005, corresponding to stories from five topical areas: business, entertainment, politics, sport, and technology. The dataset is available at the following website: http://mlg.ucd.ie/datasets/bbc.html.

We can download the raw text files under the Dataset: BBC section. The website also offers an already processed version of the dataset, but, for this example, we want to process the raw text ourselves. The ZIP contains five folders, one per topic. The actual documents are placed in the corresponding ...
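Before processing the text, it can help to confirm that the extracted archive has the expected layout. The following is a minimal sketch, not part of the book's code: it assumes the ZIP was extracted to a local folder (the default path bbc, the class name BbcTopicCounts, and the method name countDocumentsPerTopic are all hypothetical) and counts the files found in each topic folder.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper: summarizes an extracted copy of the raw BBC text files.
final class BbcTopicCounts {

    // Returns the number of regular files in each immediate subfolder of root,
    // keyed by folder name. Files that sit directly in root are ignored.
    static Map<String, Long> countDocumentsPerTopic(Path root) throws IOException {
        List<Path> topicDirs;
        try (Stream<Path> entries = Files.list(root)) {
            topicDirs = entries.filter(Files::isDirectory).collect(Collectors.toList());
        }

        Map<String, Long> counts = new TreeMap<>(); // sorted by topic name
        for (Path topicDir : topicDirs) {
            try (Stream<Path> docs = Files.list(topicDir)) {
                long n = docs.filter(Files::isRegularFile).count();
                counts.put(topicDir.getFileName().toString(), n);
            }
        }
        return counts;
    }

    public static void main(String[] args) throws IOException {
        // Assumption: the archive was extracted to ./bbc unless a path is given.
        Path root = Paths.get(args.length > 0 ? args[0] : "bbc");

        Map<String, Long> counts = countDocumentsPerTopic(root);
        long total = 0;
        for (Map.Entry<String, Long> entry : counts.entrySet()) {
            System.out.println(entry.getKey() + ": " + entry.getValue());
            total += entry.getValue();
        }
        System.out.println("total: " + total);
    }
}
```

If the extracted folder matches the description above, the program should list five topic folders, and the total should come to 2,225 documents. A different result would suggest an incomplete download or a different folder layout.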
