Choosing Pig for validation and cleansing

Implementing the validation and cleansing code in Pig, within the Hadoop environment, reduces the time-quality trade-off and removes the need to move data to external systems for cleansing. A high-level overview of the implementation is depicted in the following diagram:

Implementing validation and cleansing in Pig

The following are the advantages of performing data cleansing within the Hadoop environment using Pig:

  • Improved overall performance since validation and cleansing are done in the same environment. There is no need to transfer data to external systems for cleansing.
  • Pig is highly suitable to write ...
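As a rough illustration of validation and cleansing running entirely inside Hadoop, the following Pig Latin sketch separates valid from invalid records and then normalizes the valid ones. The paths, the relation names, and the three-field schema (id, name, age) are hypothetical and chosen only for this example; they do not come from the book.

```pig
-- Hypothetical input: comma-separated (id, name, age) records in HDFS.
-- All fields are loaded as chararray so malformed values can be inspected.
raw = LOAD 'input/customers.csv' USING PigStorage(',')
      AS (id:chararray, name:chararray, age:chararray);

-- Validation: a record is valid when id is present and age is all digits.
SPLIT raw INTO
    valid IF (id IS NOT NULL AND age IS NOT NULL AND age MATCHES '[0-9]+'),
    invalid OTHERWISE;

-- Cleansing: trim whitespace, normalize the case of name, cast age to int.
cleansed = FOREACH valid GENERATE
    TRIM(id)          AS id,
    UPPER(TRIM(name)) AS name,
    (int)age          AS age;

-- Both outputs stay in HDFS; nothing is shipped to an external system.
STORE cleansed INTO 'output/customers_clean'    USING PigStorage(',');
STORE invalid  INTO 'output/customers_rejected' USING PigStorage(',');
```

Keeping the rejected records in their own output, rather than discarding them, is one possible design choice; it lets them be reviewed or reprocessed later without rereading the raw input.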
