Practical Data Analysis - Second Edition by Dr. Sampath Kumar, Hector Cuesta

The algorithm

We use the list_words() function to get the unique words of a text that have more than three characters, converted to lowercase:

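def list_words(text):
    """Return the unique lowercase words longer than three characters."""
    words = []
    # Lowercase first so that "Spam" and "spam" count as the same word
    words_tmp = text.lower().split()
    for w in words_tmp:
        # Keep each word once and skip words of three characters or fewer
        if w not in words and len(w) > 3:
            words.append(w)
    return words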
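To see what the function returns, here is a small usage sketch. The sample strings are made up for illustration and do not come from the book's dataset. Note that split() only separates on whitespace, so punctuation stays attached to the words.

```python
def list_words(text):
    """Return the unique lowercase words longer than three characters."""
    words = []
    for w in text.lower().split():
        if w not in words and len(w) > 3:
            words.append(w)
    return words


# Illustrative input only (not from the book's data)
sample = "Buy cheap pills now and buy more pills"
result = list_words(sample)
print(result)  # ['cheap', 'pills', 'more']

# Punctuation is not stripped: "offer!" and "offer," are different words
punctuated = list_words("Free offer! Free offer, today")
print(punctuated)  # ['free', 'offer!', 'offer,', 'today']
```

Short words such as "buy", "now", and "and" are dropped by the length filter, and the repeated "pills" appears only once.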

Tip

For a more advanced term-document matrix, we can use the Python textmining package, available at:

https://pypi.python.org/pypi/textmining/1.0

The training() function creates the variables that store the data needed for the classification. The c_words variable is a dictionary of the unique words and their number of occurrences in the text (frequency), by category. The c_categories variable is a dictionary that maps each category to its number of texts. Finally, c_text and c_total_words store the ...
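As a rough illustration of the first two structures described above, the following sketch builds c_words and c_categories from a small labeled sample. It is not the book's training() function: the helper name count_by_category, the (text, category) input format, and the sample data are all assumptions made for this example, and the remaining variables are left out because their description is cut off here.

```python
def list_words(text):
    """Return the unique lowercase words longer than three characters."""
    words = []
    for w in text.lower().split():
        if w not in words and len(w) > 3:
            words.append(w)
    return words


def count_by_category(texts):
    """Hypothetical helper: texts is assumed to be a list of
    (text, category) pairs."""
    c_words = {}       # word -> {category: number of texts containing it}
    c_categories = {}  # category -> number of texts
    for text, category in texts:
        c_categories[category] = c_categories.get(category, 0) + 1
        for w in list_words(text):
            per_category = c_words.setdefault(w, {})
            per_category[category] = per_category.get(category, 0) + 1
    return c_words, c_categories


# Made-up sample data for illustration
samples = [
    ("cheap pills online", "spam"),
    ("meeting notes attached", "ham"),
    ("cheap flights online", "spam"),
]
c_words, c_categories = count_by_category(samples)
print(c_categories)      # {'spam': 2, 'ham': 1}
print(c_words["cheap"])  # {'spam': 2}
```

Because list_words() removes duplicates within a text, each count in this sketch is the number of texts in a category that contain the word, not the total number of times the word appears.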
