Compression

A recurring theme in this book is the need to reduce storage and network data transfer. When dealing with large volumes of data, anything that reduces either of the two gives an efficiency boost in terms of both speed and cost. Compression is one such strategy that can help make a Hadoop-based system efficient.

All compression techniques trade off speed against space. In general, the higher the space savings, the slower the compression technique, and vice versa. Many compression techniques are also tunable along this tradeoff. For example, the gzip compression tool has options -1 to -9, where -1 optimizes for speed and -9 for space.
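The speed-space tradeoff described above can be observed directly. The following is a minimal sketch, not a benchmark: it uses Python's standard zlib module, which implements the same DEFLATE algorithm as gzip and accepts the same 1 to 9 levels. The helper names make_sample and compress_at_level and the synthetic log-like records are assumptions made for illustration only; actual sizes and timings depend on the data and the machine.

```python
import time
import zlib


def make_sample(n_records: int = 20000) -> bytes:
    """Build synthetic, repetitive log-like data (hypothetical sample input)."""
    lines = (f"record {i}: user={i % 97}, status=OK\n" for i in range(n_records))
    return "".join(lines).encode("utf-8")


def compress_at_level(data: bytes, level: int) -> tuple[int, float]:
    """Compress data at one zlib level; return (compressed size, seconds taken)."""
    start = time.perf_counter()
    compressed = zlib.compress(data, level)
    elapsed = time.perf_counter() - start
    # Compression is lossless: decompressing must give back the original bytes.
    if zlib.decompress(compressed) != data:
        raise ValueError("round trip failed")
    return len(compressed), elapsed


if __name__ == "__main__":
    data = make_sample()
    print(f"original size: {len(data)} bytes")
    # Level 1 favors speed, level 9 favors space, level 6 is zlib's usual default.
    for level in (1, 6, 9):
        size, seconds = compress_at_level(data, level)
        print(f"level {level}: {size} bytes in {seconds * 1000:.2f} ms")
```

Running the script prints the compressed size and elapsed time for each level, so the two ends of the spectrum can be compared on data that resembles your own workload.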

The following figure shows the different compression algorithms in the speed-space spectrum. The ...
