Truncation

Truncation is another variant of erasing: we cut all the input data down to a uniform size. It is useful when we are confident that the resulting information loss is acceptable to the later stages of the pipeline.
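As a minimal sketch of uniform-size truncation, the snippet below cuts every value down to a fixed number of characters. The function name `truncate_fixed`, the `width` parameter, and the sample timestamps are assumptions made for illustration; they do not come from any particular library or dataset.

```python
def truncate_fixed(values, width):
    """Return each string cut down to at most `width` characters.

    Values shorter than `width` are returned unchanged (no padding).
    """
    return [value[:width] for value in values]


# Hypothetical sample data: ISO-style timestamps where only the date matters.
timestamps = ["2017-03-14T09:26:53", "2017-03-15T18:02:11"]

# Keeping the first 10 characters discards the time-of-day portion.
print(truncate_fixed(timestamps, 10))
```

Whatever lies beyond the cut-off is lost for good, so this only makes sense when later stages never need it.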

Truncation can also be intelligent, where we use what we know about the data we are dealing with. Consider this example of email addresses:

Input                 Output   What's truncated
alice@localhost.com   alice    @localhost.com
bob@localhost.com     bob      @localhost.com
rob@localhost.com     rob      @localhost.com

In the preceding examples, the domain portion of every email address is truncated because all of the addresses belong to the same domain. This technique saves storage space.
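The email example can be sketched in a few lines of Python. This is an illustrative sketch rather than a definitive implementation: the helper name `truncate_domain` is hypothetical, and it assumes each address contains a single `@` separating the local part from the domain.

```python
def truncate_domain(email):
    """Split an address into (kept part, truncated part).

    Assumes the first "@" separates the local part from the domain.
    If there is no "@", nothing is truncated.
    """
    local, separator, domain = email.partition("@")
    return local, separator + domain


addresses = ["alice@localhost.com", "bob@localhost.com", "rob@localhost.com"]

for address in addresses:
    kept, truncated = truncate_domain(address)
    print(f"{address} -> {kept} (truncated: {truncated})")
```

Because every address here shares the same domain, the truncated part is identical for each row and does not need to be stored with every record.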
