Chapter 10. Acquisition

Most of the interesting public data sources are poorly structured, full of noise, and hard to access. I probably spend more time turning messy source data into something usable than I do on the rest of the data analysis process combined, so I’m very thankful that multiple tools are emerging to help.

Google Refine is an update to the Freebase Gridworks tool for cleaning up large, messy spreadsheets. It has been designed to make it easy to correct the most common errors you’ll encounter in human-created datasets. For example, it’s easy to spot and correct common problems like typos or inconsistencies in text values and to change cells from one format to another. There’s also rich support for linking data by calling APIs with the data contained in existing rows to augment the spreadsheet with information from external sources.
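To make the idea of spotting inconsistent text values concrete, here is a minimal Python sketch of one common approach: reduce each value to a normalized key and group values that share a key. This is an illustration, not Refine’s own code; the function names `fingerprint` and `cluster_values` and the sample city names are assumptions made for the example.

```python
import re
import unicodedata
from collections import defaultdict


def fingerprint(value):
    """Reduce a text value to a normalized key (hypothetical helper)."""
    # Split accented characters into base letter + accent, then drop accents.
    text = unicodedata.normalize("NFKD", value)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Lowercase and remove punctuation.
    text = re.sub(r"[^\w\s]", "", text.lower())
    # Sort the unique tokens so word order and extra spaces do not matter.
    return " ".join(sorted(set(text.split())))


def cluster_values(values):
    """Group values with the same key; keep groups with more than one spelling."""
    groups = defaultdict(list)
    for value in values:
        groups[fingerprint(value)].append(value)
    return [group for group in groups.values() if len(set(group)) > 1]


# Made-up sample column with typical human-entered inconsistencies.
cities = ["New York", "new york", "York, New", "Boston", "New  York.", "boston"]
clusters = cluster_values(cities)
for group in clusters:
    print(group)
```

Each printed group is a candidate for merging into a single canonical spelling. The normalization is deliberately crude, so a person should still review the groups before rewriting any cells.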

Refine doesn’t let you do anything you can’t do with other tools, but its power comes from how well it supports a typical extract-and-transform workflow. It feels like a good step up in abstraction, packaging processes that would typically take multiple steps in a scripting language or spreadsheet package into single operations with sensible defaults.
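For comparison, this is roughly what the multi-step scripting version of one such operation looks like: augmenting each row with a value looked up from an external source. It is a sketch under stated assumptions. The `augment_rows` helper is hypothetical, the sample data is invented, and `fake_country_lookup` is a stand-in for a real API call so that the example runs without network access.

```python
import csv
import io


def augment_rows(rows, key, lookup, new_column):
    """Return copies of rows with new_column filled in by lookup(row[key])."""
    cache = {}  # avoid repeating the lookup for values seen before
    augmented = []
    for row in rows:
        value = row[key].strip()  # tolerate stray whitespace in the cell
        if value not in cache:
            cache[value] = lookup(value)
        augmented.append({**row, new_column: cache[value]})
    return augmented


def fake_country_lookup(city):
    """Stand-in for an external API call (assumption for this example)."""
    table = {"Paris": "France", "Lima": "Peru"}
    return table.get(city, "")


# Invented sample spreadsheet; note the trailing space after the first "Paris".
raw = "city,visits\nParis ,12\nLima,7\nParis,3\n"
rows = list(csv.DictReader(io.StringIO(raw)))
augmented = augment_rows(rows, "city", fake_country_lookup, "country")
for row in augmented:
    print(row)
```

Even this small version needs parsing, whitespace cleanup, caching, and merging the result back into each row, which is the kind of bookkeeping a single operation with sensible defaults hides.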

Needlebase provides a point-and-click interface for extracting structured information from web pages. As a user, you select elements on an example page that contain the data you’re interested in, and the tool then uses the patterns you’ve defined to pull out information ...
