SANITY CHECK AND DATA VISUALIZATION

Linda L. Briggs, in her 2010 TDWI article, interviewed Danette McGilvray, a former data quality expert at Hewlett Packard, who said that “Data quality is a measurable and ongoing process effort.”10 Before building any data table for business analytics, we recommend that your team run the following standard sanity checks:

  • Missing value percentage: Avoid keeping any variable with more than 50% missing values.
  • Outliers: Look for extreme values; an age of 140, for instance, would definitely be suspicious. We recommend removing extreme values or adjusting them.
  • Suspicious definition and unknown value coding.
  • Invalid or erroneous data.
  • Data distribution: Understand how your data are distributed and how that distribution affects your business. Do your customers skew younger or older? Is their average spending higher or lower?
  • Inconsistent values: A single field containing values defined as both character and integer data types.
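
Three of the checks above (missing value percentage, outliers, and inconsistent values) are mechanical enough to sketch in code. The following is a minimal illustration in plain Python, not a method prescribed by the book: the sample records, the field names, the helper functions, and the plausible age range of 0 to 120 are all assumptions made for the example. Only the 50% missing-value threshold comes from the list above.

```python
from collections import Counter

# Hypothetical customer records; None marks a missing value.
records = [
    {"age": 34, "zip": "94105", "spend": 120.0},
    {"age": 140, "zip": 94107, "spend": None},
    {"age": 29, "zip": "10001", "spend": None},
    {"age": None, "zip": "60601", "spend": None},
]


def missing_pct(rows, field):
    """Percentage of rows where `field` is absent or None."""
    missing = sum(1 for r in rows if r.get(field) is None)
    return 100.0 * missing / len(rows)


def out_of_range(rows, field, low, high):
    """Non-missing values of `field` outside the plausible range [low, high]."""
    return [r[field] for r in rows
            if r.get(field) is not None and not (low <= r[field] <= high)]


def type_counts(rows, field):
    """Count the Python types seen in `field`, ignoring missing values."""
    return Counter(type(r[field]).__name__ for r in rows
                   if r.get(field) is not None)


fields = ["age", "zip", "spend"]
# Variables above the 50% missing-value threshold are candidates to drop.
too_sparse = [f for f in fields if missing_pct(records, f) > 50.0]
# Assumed plausible age range of 0 to 120; adjust to your business context.
suspicious_ages = out_of_range(records, "age", 0, 120)
# More than one type in a single field signals inconsistent values.
zip_types = type_counts(records, "zip")
mixed_zip = len(zip_types) > 1

print(too_sparse, suspicious_ages, dict(zip_types))
```

Whether a flagged value is removed or adjusted remains a judgment call for your team; the sketch only surfaces candidates for review.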

On top of this sanity check list, it is important to visualize the data. Most business analytics companies offer a variety of visualization tools. Visualization helps you quickly spot data inconsistencies. We will cover data visualization in more detail in a separate chapter.
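
To illustrate how even a crude picture exposes an inconsistency, here is a minimal sketch of a text histogram in plain Python. It is a stand-in for the dedicated visualization tools mentioned above, not a replacement for them; the sample ages, the bin width of 10, and the function names are assumptions made for the example.

```python
from collections import Counter

# Hypothetical customer ages; 140 is a planted data-entry error.
ages = [23, 27, 31, 34, 35, 38, 42, 45, 47, 58, 140]


def bin_counts(values, width):
    """Count values per bin, keyed by each bin's lower edge."""
    return Counter((v // width) * width for v in values)


def text_histogram(values, width=10):
    """Render one line per bin, including empty bins, as a bar of '#' marks."""
    counts = bin_counts(values, width)
    lines = []
    for start in range(min(counts), max(counts) + width, width):
        label = f"{start:>3}-{start + width - 1:<3}"
        # Counter returns 0 for bins with no values, giving an empty bar.
        lines.append(f"{label} {'#' * counts[start]}".rstrip())
    return "\n".join(lines)


print(text_histogram(ages))
```

In the printed output, the lone bar in the 140-149 bin sits far from the rest of the distribution, separated by a run of empty bins, which is the kind of gap a summary statistic such as the mean would hide.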

It is also essential to understand the process by which the data are created in a given system. In the end, familiarization with the extraction and transformation processes, as well as with other data hygiene steps, is vital for you to ...
