Chapter 19. Compression

Mark Jason Dominus

You are probably familiar with Unix compress, gzip, or bzip2 utilities, or the DOS pkzip utility. These programs all make files smaller; such files are compressed. Compressed files take less disk space and less network bandwidth when you move them around. The downside of compressed files is that it they are full of unreadable gibberish; you usually have to run another program to uncompress them before you can use them again. In this article we’ll see how file compression works, and I’ll demonstrate a simple module that includes functions for compressing and uncompressing files.

Morse Code

The idea behind data compression is very simple. In a typical text file, every character takes up the same amount of space: eight bits. The letter “e” is represented by the bits 01100101; the letter “Z” is represented by the bits 010110010. But in a text file, “e” occurs much more frequently than “Z”—about 75 times as frequently. If you could give the common symbols short codes, you’d save space.

This isn’t a new idea. It was exploited by Samuel Morse in the Morse code, a very early digital data transmission protocol. Morse code was designed to send text files over telegraph wires. A telegraph is very simple; it has a switch at one end, and when you close the switch, an electric current travels through a wire to the other end, where there is a relay that makes a click. By tapping the switch at one end, you make the relay at the other end click. Letters and ...

Get Computer Science & Perl Programming now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.