- Initialize a new Python file and import the following packages:
```python
import numpy as np
from nltk.corpus import brown
from chunking import splitter
```
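The `splitter` function comes from the `chunking` module built in an earlier recipe. If that file is not at hand, here is a minimal sketch of the assumed interface, where `splitter(content, num_words)` returns a list of chunks of roughly `num_words` words each:

```python
# chunking.py -- minimal sketch of the assumed splitter interface
def splitter(content, num_words):
    words = content.split(' ')
    chunks = []
    # Group consecutive words into blocks of num_words words
    for i in range(0, len(words), num_words):
        chunks.append(' '.join(words[i:i + num_words]))
    return chunks
```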
- Define the main function and read the input data from the Brown corpus:
```python
if __name__ == '__main__':
    # Take the first 10,000 words of the Brown corpus as input text
    content = ' '.join(brown.words()[:10000])
```
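If the Brown corpus has not been downloaded yet, NLTK will raise a lookup error when `brown.words()` is called; it can be fetched once with the standard downloader:

```python
import nltk

# One-time download of the Brown corpus data
nltk.download('brown')
```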
- Split the text content into chunks:
```python
    # Number of words in each chunk
    num_of_words = 2000
    num_chunks = []
    count = 0

    texts_chunk = splitter(content, num_of_words)
```
- Index these text chunks by building a list of dictionaries, one per chunk:
```python
    for text in texts_chunk:
        num_chunk = {'index': count, 'text': text}
        num_chunks.append(num_chunk)
        count += 1
```
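A quick sanity check at this point: with the 10,000-word input and 2,000-word chunks above, five chunks are expected. The print statements below are illustrative, not part of the original recipe:

```python
    # Expect 10,000 words / 2,000 words per chunk = 5 chunks
    print('Number of text chunks =', len(num_chunks))
    print('First chunk starts with:', num_chunks[0]['text'][:50])
```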
- Extract a document-term matrix, which effectively counts the number of occurrences of each word in each chunk:
```python
    from sklearn.feature_extraction.text import CountVectorizer
```
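As a sketch of how the document-term matrix could then be built from the chunks; the variable names and the default `CountVectorizer()` settings below are illustrative assumptions (parameters such as `min_df` and `max_df` can be passed to filter rare or overly common words):

```python
    # Fit the vectorizer on the chunk texts; each row of the resulting
    # matrix is one chunk, each column one vocabulary word
    vectorizer = CountVectorizer()
    doc_term_matrix = vectorizer.fit_transform(
        [chunk['text'] for chunk in num_chunks])

    vocab = np.array(vectorizer.get_feature_names_out())
    print('Vocabulary size =', len(vocab))
    print('Document-term matrix shape =', doc_term_matrix.shape)
```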