CountVectorizer

CountVectorizer and CountVectorizerModel work on counts of words (tokens). They use the words in text documents to build vectors containing token counts. A predefined dictionary of words can be supplied to identify the tokens to count. If no dictionary is available, CountVectorizer acts as an estimator and builds the vocabulary from the documents. Based on that vocabulary it generates a CountVectorizerModel, which produces sparse representations of the documents over the vocabulary. These representations can then be passed as input to NLP algorithms such as LDA.
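The fit-then-transform idea described above can be sketched in plain Python. This is only an illustration of the concept, not the Spark API: the names fit_vocabulary and transform, the vocab_size and min_df parameters, and the tie-breaking rule are assumptions made for this example.

```python
from collections import Counter


def fit_vocabulary(docs, vocab_size=None, min_df=1):
    """Build a term -> index mapping from tokenized documents (hypothetical helper)."""
    term_counts = Counter()  # total occurrences of each term in the corpus
    doc_freq = Counter()     # number of documents containing each term
    for tokens in docs:
        term_counts.update(tokens)
        doc_freq.update(set(tokens))
    # Keep only terms that appear in at least min_df documents.
    kept = [t for t in term_counts if doc_freq[t] >= min_df]
    # Most frequent terms first; ties broken alphabetically (an assumption).
    kept.sort(key=lambda t: (-term_counts[t], t))
    if vocab_size is not None:
        kept = kept[:vocab_size]
    return {term: index for index, term in enumerate(kept)}


def transform(tokens, vocabulary):
    """Return a sparse count vector as (vector_size, {index: count})."""
    # Tokens outside the vocabulary are ignored.
    counts = Counter(vocabulary[t] for t in tokens if t in vocabulary)
    return len(vocabulary), dict(sorted(counts.items()))


docs = [["a", "b", "c"], ["a", "b", "b", "c", "a"]]
vocabulary = fit_vocabulary(docs)                    # the "estimator" step
vectors = [transform(d, vocabulary) for d in docs]   # the "model" step
print(vocabulary)
print(vectors)
```

Only the non-zero counts are stored for each document, which is what makes the representation sparse when the vocabulary is large.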

CountVectorizer counts word frequencies within each document, whereas TF-IDF measures the importance of a word with regard to the whole corpus. CountVectorizer is one of the tools used to convert the text to a vector ...
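To make the contrast between raw counts and TF-IDF concrete, here is a small sketch in plain Python. It assumes one common smoothed form of inverse document frequency, log((N + 1) / (df + 1)); the function names and the toy corpus are invented for the illustration.

```python
import math
from collections import Counter


def inverse_document_frequency(docs):
    """Smoothed IDF per term: log((N + 1) / (df + 1)), an assumed variant."""
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each term once per document
    return {t: math.log((n_docs + 1) / (df + 1)) for t, df in doc_freq.items()}


def tf_idf(tokens, idf):
    """Weight each raw term count by the term's corpus-level IDF."""
    return {t: count * idf[t] for t, count in Counter(tokens).items() if t in idf}


# Toy corpus (made up for this example); "spark" occurs in every document.
corpus = [
    ["spark", "counts", "words"],
    ["spark", "builds", "vectors"],
    ["spark", "spark", "lda"],
]
idf = inverse_document_frequency(corpus)
counts = Counter(corpus[2])        # what a count-based vector records
weights = tf_idf(corpus[2], idf)   # what TF-IDF records
print(counts)
print(weights)
```

In the last document "spark" has the highest raw count, yet its TF-IDF weight is zero because it appears in every document, while the rarer "lda" receives a positive weight.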
