Chapter 8. Cleaning Your Dirty Data

So far in this book, you’ve sidestepped the problem of badly formatted data by using generally well-formatted data sources and dropping data entirely if it deviated from what you were expecting. But in web scraping, you often can’t be too picky about where your data comes from or what it looks like.

Errant punctuation, inconsistent capitalization, line breaks, and misspellings make dirty data a big problem on the web. This chapter covers a few tools and techniques that help you prevent the problem at the source by changing the way you write code, and clean the data after it’s in the database.

Cleaning in Code

Just as you write code to handle overt exceptions, you should practice defensive coding to handle the unexpected.

In linguistics, an n-gram is a sequence of n words used in text or speech. When doing natural language analysis, it can often be handy to break up a piece of text by looking for commonly used n-grams, or recurring sets of words that are often used together.
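
To make the definition concrete, here is a minimal sketch that lists the n-grams in a short string. The function name `word_ngrams` and the sample sentence are illustrative assumptions, not part of the chapter's own listing, and the sketch assumes the words are already separated by plain whitespace.

```python
def word_ngrams(text, n):
    """Return every run of n consecutive words in text, in order."""
    # str.split() with no argument splits on any whitespace
    # and never produces empty tokens.
    words = text.split()
    # Slide a window of width n across the word list.
    return [words[i:i + n] for i in range(len(words) - n + 1)]

print(word_ngrams("the quick brown fox", 2))
# [['the', 'quick'], ['quick', 'brown'], ['brown', 'fox']]
```

A text with fewer than n words simply yields an empty list, because the range is empty.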

This section focuses on obtaining properly formatted n-grams rather than on using them for analysis. Later, in Chapter 9, you will see 2-grams and 3-grams in action for text summarization and analysis.

The following returns a list of 2-grams found in the Wikipedia article on the Python programming language:

from urllib.request import urlopen
from bs4 import BeautifulSoup

def getNgrams(content, n):
  content = content.split(' ')
  output = []
  for i in range ...
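
Splitting on a single space, as the getNgrams listing above does, leaves empty strings and embedded line breaks in the tokens whenever the text contains repeated spaces or newlines. One possible pre-cleaning step is sketched below. The helper name `clean_text` is hypothetical, and the choices it makes (collapsing all whitespace, trimming punctuation only from the ends of words, lowercasing everything) are assumptions that may not suit every data set. Lowercasing in particular discards the distinction between, say, a proper noun and a common word.

```python
import re
import string

def clean_text(text):
    """Normalize raw text into a list of cleaned, lowercase words."""
    # Collapse line breaks, tabs, and repeated spaces into single spaces.
    text = re.sub(r'\s+', ' ', text).strip()
    cleaned = []
    for word in text.split(' '):
        # Trim ASCII punctuation from both ends of the word but leave
        # interior characters alone, so hyphenated words stay intact.
        word = word.strip(string.punctuation).lower()
        if word:  # skip tokens that were nothing but punctuation
            cleaned.append(word)
    return cleaned

print(clean_text("Dirty   data,\nON the web!"))
# ['dirty', 'data', 'on', 'the', 'web']
```

The returned list can then be sliced into n-grams in place of the raw `content.split(' ')` tokens.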
