In previous chapters, you've looked at different ways of dealing with word counts for textual data. For purposes of comparison, it's useful to try these first and see what sort of results you get, then compare them with the results of feature extraction. If you have the code that you wrote for those chapters, you can import those modules and try them here on your feeds. If not, don't worry—this section illustrates how these methods work on the sample data.
Bayesian classification is, as you've seen, a supervised learning method. If you were to try to use the classifier built in Chapter 6, you would first be required to classify several examples of stories to train the classifier. The classifier would then be able to put later stories into your predefined categories. Besides the obvious downside of having to do the initial training, this approach also suffers from the limitation that the developer has to decide what all the different categories are. All the classifiers you've seen so far, such as decision trees and support-vector machines, will have this same limitation when applied to a dataset of this kind.
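To make the supervised requirement concrete, here is a minimal naive Bayes sketch. This is not the Chapter 6 implementation — the class name, category labels, and word lists are invented for illustration — but it shows the two constraints described above: every training document must be hand-labeled, and the classifier can only ever return one of the categories it was trained on.

```python
from collections import defaultdict
import math

class NaiveBayes:
    """Minimal multinomial naive Bayes over word-list features."""

    def __init__(self):
        # category -> {word: count}
        self.word_counts = defaultdict(lambda: defaultdict(int))
        # category -> number of training documents
        self.cat_counts = defaultdict(int)

    def train(self, words, cat):
        # The supervised step: every training document needs a label,
        # and the set of possible categories is fixed by the trainer.
        self.cat_counts[cat] += 1
        for w in words:
            self.word_counts[cat][w] += 1

    def classify(self, words):
        total_docs = sum(self.cat_counts.values())
        vocab = {w for counts in self.word_counts.values() for w in counts}
        best_cat, best_score = None, float('-inf')
        for cat, n_docs in self.cat_counts.items():
            # Log prior plus per-word log likelihoods with add-one smoothing
            score = math.log(n_docs / total_docs)
            cat_total = sum(self.word_counts[cat].values())
            for w in words:
                count = self.word_counts[cat].get(w, 0)
                score += math.log((count + 1) / (cat_total + len(vocab)))
            if score > best_score:
                best_cat, best_score = cat, score
        return best_cat

cl = NaiveBayes()
cl.train(['iphone', 'apple', 'phone'], 'iphone')            # labeled by hand
cl.train(['microsoft', 'windows', 'software'], 'microsoft') # labeled by hand
print(cl.classify(['apple', 'iphone']))  # → iphone
```

A new story about a topic outside these two categories would still be forced into 'iphone' or 'microsoft' — the classifier cannot discover categories on its own, which is exactly the limitation feature extraction addresses.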
If you'd like to try the Bayesian classifier on this dataset, you'll need to place the module you built in Chapter 6 in your working directory. You can use the articlewords dictionary as is for the feature set of each article. Try this in your Python session:
>>> def wordmatrixfeatures(x):
...     return [wordvec[w] for w in range(len(x)) if x[w] > 0]
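To see what this feature-extraction function returns, here is a self-contained sketch with a hypothetical wordvec list and a made-up row of word counts (in your actual session, wordvec and the matrix rows come from your feed data):

```python
# Hypothetical stand-ins for the real data: wordvec maps a column index to
# a word, and each matrix row holds one article's word counts.
wordvec = ['apple', 'iphone', 'microsoft', 'windows']
row = [2, 1, 0, 0]  # counts for one sample article

def wordmatrixfeatures(x):
    # Return the words whose count in this row is nonzero
    return [wordvec[w] for w in range(len(x)) if x[w] > 0]

print(wordmatrixfeatures(row))  # → ['apple', 'iphone']
```

The function simply converts a numeric row back into the list of words it contains, which is the form of feature set the classifier expects.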