Chapter 6. Handling Text

6.0 Introduction

Unstructured text data, like the contents of a book or a tweet, is both one of the most interesting sources of features and one of the most complex to handle. In this chapter, we will cover strategies for transforming text into information-rich features. This is not to say that the recipes covered here are comprehensive. There exist entire academic disciplines focused on handling this and similar types of data, and the contents of all their techniques would fill a small library. Despite this, there are some commonly used techniques, and a knowledge of these will add valuable tools to our preprocessing toolbox.

6.1 Cleaning Text

Problem

You have some unstructured text data and want to complete some basic cleaning.

Solution

Most basic text cleaning operations should only replace Python’s core string operations, in particular strip, replace, and split:

# Create text
text_data = ["   Interrobang. By Aishwarya Henriette     ",
             "Parking And Going. By Karl Gautier",
             "    Today Is The night. By Jarek Prakash   "]

# Strip whitespaces
strip_whitespace = [string.strip() for string in text_data]

# Show text
strip_whitespace
['Interrobang. By Aishwarya Henriette',
 'Parking And Going. By Karl Gautier',
 'Today Is The night. By Jarek Prakash']
# Remove periods
remove_periods = [string.replace(".", "") for string in strip_whitespace]

# Show text
remove_periods
['Interrobang By Aishwarya Henriette', 'Parking And Going By Karl Gautier', 'Today Is The night By ...

Get Machine Learning with Python Cookbook now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.