18

Sorting Out Address Data

Cleaning address data is challenging because the address information is divided across multiple lines. The delimiters may be tabs, commas, or line breaks, and both the number of lines and their positional order vary from record to record.
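As a rough illustration of the delimiter problem, here is a minimal Python sketch that breaks one raw address into its parts wherever a tab, comma, or line break occurs. The function name `split_address` and the sample address are hypothetical, made up for illustration. The sketch only separates the parts and makes no attempt to decide which part is the street, town, or postal code.

```python
import re

# Assumption: tabs, commas, and line breaks are the only delimiters in play.
# A run of adjacent delimiters (for example ",\t") counts as a single break.
_DELIMITERS = re.compile(r"[\t,\r\n]+")


def split_address(raw):
    """Return the non-empty, whitespace-trimmed parts of one raw address."""
    parts = _DELIMITERS.split(raw)
    return [part.strip() for part in parts if part.strip()]


# Hypothetical sample record mixing all three delimiter styles.
sample = "1 High Street,\tAnytown\nAB1 2CD"
print(split_address(sample))
```

Because the number of parts differs from one record to the next, the result is a list of variable length. Working out what each part means is still a manual or rule-based step.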

Automating the whole process is not easy because the formatting varies so widely, although parts of it may yield to some scripting. Unless the dataset is extremely large, you might find that loading the address data into Excel and manipulating it there makes for a handy “workbench.” We might start with something like this:

Figure 18a Raw address data

First, apply ...
