Summary

In this chapter, we built on Chapter 2, Data Ingest and Egress Patterns, in which we integrated data from multiple source systems and ingested it into Hadoop. The next step is to find clues about each field's data type by examining its constituent values. We examine the values to see whether they are misrepresented, whether their units are misinterpreted, or whether the context of the units is derived incorrectly. This sleuthing mechanism is discussed in more detail in the data type inference pattern.
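
As a reminder of the idea behind the data type inference pattern, the following is a minimal sketch in plain Python rather than the chapter's Pig implementation. It classifies each raw value by trying progressively looser parses and then reports the dominant type for the column. The function names, the sample values, and the single `%Y-%m-%d` date format are assumptions made for illustration only.

```python
from collections import Counter
from datetime import datetime


def infer_value_type(value: str) -> str:
    """Classify one raw string value as null, int, float, date, or string."""
    text = value.strip()
    if text == "":
        return "null"
    try:
        int(text)
        return "int"
    except ValueError:
        pass
    try:
        float(text)
        return "float"
    except ValueError:
        pass
    try:
        # Assumption: dates arrive as YYYY-MM-DD; real data may need more formats.
        datetime.strptime(text, "%Y-%m-%d")
        return "date"
    except ValueError:
        pass
    return "string"


def infer_column_type(values):
    """Return (dominant non-null type, counts per type) for a column of raw strings."""
    counts = Counter(infer_value_type(v) for v in values)
    non_null = {t: n for t, n in counts.items() if t != "null"}
    if not non_null:
        return "null", dict(counts)
    dominant = max(non_null, key=non_null.get)
    return dominant, dict(counts)


# Hypothetical column: mostly integers, with one float, one blank, one stray token.
sample = ["42", "17", "", "3.5", "n/a", "8"]
dominant, counts = infer_column_type(sample)
print(dominant, counts)
```

Values that do not match the dominant type (here the float and the stray `n/a`) are the ones worth inspecting for misrepresentation or unit problems.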

In the basic statistical profiling pattern, we examine whether the values meet the quality expectations of the use case by collecting statistical information on the numeric values to answer the following questions: For a numeric field, ...
