Chapter 8. Data Cleanup: Standardizing and Scripting

You’ve learned how to match, parse, and find duplicates in your data, and you’ve started exploring the wonderful world of data cleanup. As you grow to understand your datasets and the questions you’d like to answer with them, you’ll want to think about standardizing your data as well as automating your cleanup.

In this chapter, we’ll explore how and when to standardize your data and when to test and script your data cleanup. If you are managing regular updates or additions to the dataset, you’ll want to make the cleanup process as efficient and clear as possible so you can spend more time analyzing and reporting. We’ll begin by standardizing and normalizing your dataset and determining what to do if your dataset is not normalized.

Normalizing and Standardizing Your Data

Depending on your data and the type of research you are conducting, standardizing and normalizing your dataset might mean calculating new values using the values you currently have, or it might mean applying standardizations or normalizations across a particular column or value.

Normalization, from a statistical view, often means calculating new values from a dataset to place the data on a particular scale. For example, you might need to normalize test scores to a common scale so you can accurately view the distribution. You might also need to normalize data so you can accurately see percentiles, or percentiles across different groups ...
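To make the idea of rescaling concrete, here is a minimal sketch of two common statistical normalizations applied to a small list of test scores: min-max scaling (which maps values onto a 0 to 1 range) and z-scores (which express each value as a number of standard deviations from the mean). The function names `min_max_scale` and `z_scores` and the sample scores are illustrative assumptions, not taken from the book's own code, and the sketch uses only Python's standard library.

```python
from statistics import mean, pstdev


def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0.0 and the maximum to 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so there is no spread to scale.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def z_scores(values):
    """Express each value as its distance from the mean in population standard deviations."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # No variation in the data, so every value sits exactly at the mean.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Hypothetical test scores, used only for illustration.
scores = [55, 70, 85, 100]

scaled = min_max_scale(scores)
standardized = z_scores(scores)

print(scaled)
print(standardized)
```

Which scale you choose depends on the question you are asking: min-max scaling keeps values within fixed bounds, while z-scores make it easier to compare groups that have different means and spreads.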
