Tokenizing text data

When we deal with text, we need to break it down into smaller pieces for analysis. This is where tokenization comes into the picture. Tokenization is the process of dividing input text into pieces such as words or sentences; these pieces are called tokens. Depending on the task at hand, we can define our own methods for dividing the text into tokens. Let's take a look at how to tokenize input text using NLTK.

Create a new Python file and import the following packages:

from nltk.tokenize import sent_tokenize, \
        word_tokenize, WordPunctTokenizer

Define some input text that will be used for tokenization:

# Define input text
input_text = "Do you know how tokenization works? It's actually quite interesting! Let's analyze a couple of sentences and figure it out."
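
The remainder of this walkthrough applies each of the three imported tokenizers to this input text. The following is a minimal sketch of that step, assuming the standard NLTK calls: sent_tokenize and word_tokenize are plain functions, while WordPunctTokenizer is a class whose instances expose a tokenize method. Note that the tokenizer models must be downloaded once before the first two calls will run.

import nltk
nltk.download('punkt')  # one-time model download ('punkt_tab' on recent NLTK versions)

# Sentence tokenizer: splits the input into a list of sentences
print("\nSentence tokenizer:")
print(sent_tokenize(input_text))

# Word tokenizer (Penn Treebank style): splits "It's" into "It" and "'s"
print("\nWord tokenizer:")
print(word_tokenize(input_text))

# WordPunct tokenizer: splits at every punctuation boundary,
# so "It's" becomes the three tokens "It", "'", and "s"
print("\nWord punct tokenizer:")
print(WordPunctTokenizer().tokenize(input_text))

Comparing the outputs shows how the tokenizers differ: the sentence tokenizer returns whole sentences, the word tokenizer keeps contraction suffixes such as 's as single tokens, and the WordPunct tokenizer breaks the text apart at every punctuation character.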
