ITPRNN homepage | ITILA homepage

# COMPRESSION ALGORITHMS

Some solutions to exercise 5.29 (page 103), Information Theory, Inference, and Learning Algorithms. [As usual I recommend that you not look at these solutions until you have thought hard about your own.] When making your own solution, you may find it useful to have an implementation of the Huffman algorithm. You are welcome to use my perl program huffman.p or python program huffman10.py.

## Complete solutions

Here are example compression programs, mainly written in C, all designed to compress a sparse file containing N=10000 bits of which roughly 0.01 are 1s.

You can download all the programs and a Makefile in this tar file.

 Some notes on an approach using runlength codes postscript | pdf

### Huffman using run lengths

In my runlength/Huffman encoder the maximum runlength is 69 for the reasons explained in this document (postscript) | (pdf).

I assume the length of the file is known to the decoder; this allows the compressed file to be about 6 bits shorter than if I ensured that the file is self-delimiting in some way, for example, using the EOF character.

My encoder is a C program, RLencode.c; the decoder is a perl program, RLdecode.p because I didn't want to figure out how to write the decoder in C. I am pretty sure that these programs will work on any files.

The usage is

``` RLencode   < filep.01.10000 > file.RLZ
RLdecode.p < file.RLZ > decoded
```
The encoder gets the file into 831 bits.

If the source filelength is changed from 10000, please add the Nmax=blah argument to inform the decoder of the correct filelength. Thus: RLdecode.p Nmax=10000 < file.RLZ > decoded

### Arithmetic coding

It's hard to make an arithmetic code that works perfectly. ACencode and ACdecode work on all test files I have tried, but I am still not certain they will always always work! (Indeed, I reckon there's a probability of 1/million or so per megabyte of compressed output that this algorithm will get into trouble. For a better-written arithmetic coding algorithm, please see Radford Neal's arithmetic coder in C, or my 'compressors in python' page.)

Arithmetic coding is nice because it lends itself to adaptive models, corresponding for example to the belief that the bias of the bent coin is not known, and should be inferred as we go; or to the belief that the bias of the bent coin might change with time.

The two programs have two choices of compilation options, corresponding to two possible adaptive models. One model (well suited to the competition problem) asserts that the bias of the coin is known to be accurately very close to 0.01; the other asserts that the bias is unknown and could be anything in the ballpark 0.01-0.99. This choice is determined in the file ACdefns.h, which is included at compilation time by both programs.

The programs are used thus:

``` make ACdecode
make ACencode
ACencode < filep.01.10000  > file.ACZ
ACdecode < file.ACZ  > file.decoded
```
This encoder gets the sparse file into 829 bits. The decoder makes use of the known source file length, N=10000.

The results achieved by arithmetic coding are especially impressive for even larger files. For example, I made a million-bit source file using randNchooseM.p N=1000000 M=10000 > million.dat and compared ACencode with the runlength encoder RLencode. The compressed file lengths were

```   80797 file.ACZ
81025 file.RLZ
```

### Golomb code

The Golomb code is a very simple code that is both a runlength code and an approximate arithmetic coder. The encoder has just two adjustable parameters, one bit (here set to 0) which identifies the more probable symbol, and an integer m (here set to 6 or 7) whose value defines the implicit probability of the less probable symbol, via

p1 ~= ln(2) 2-m.
We define M = 2m.

To encode a file, the Golomb encoder outputs a 1 every time the stream contains M consecutive 0s. Whenever it encounters (for some r between 0 and M-1) a string of r consecutive 0s followed by a 1, it outputs a 0 followed by the integer r encoded as an m-bit binary number.

This encoder may be viewed as a special case of the runlength-based Huffman code with a maximum runlength to be a power of 2, and all runs assigned equal implicit probability.

One may also view it as an approximate arithmetic coder; adaptation may be performed by adjusting m. The Golomb coder was the starting point for the Z-coder, an excellent compression algorithm used inside djvu.

The programs are used thus:

``` make GolombEncode
make GolombDecode
GolombEncode < filep.01.10000  > file.GZ
GolombDecode < file.GZ  > file.decoded
```
This encoder gets the sparse file into 870 bits (when m=7) and 838 bits (when m=6). That's very close to optimal, isn't it! The decoder does not make use of a known source file length; when it hits an EOF symbol, it stops decoding and terminates the file correctly.

 Code C_alpha n traditional binary length c_alpha(n) 1 1 1 1 2 10 2 010 3 11 2 011 4 100 3 00100 5 101 3 00101 6 110 3 00110 . . . . 45 101101 6 00000101101

### Run length code with cheap encoder for integers

We can make a really small compression algorithm that is reasonably well suited to sparse files by simply counting the run lengths of 1s and 0s, then coding those integers into binary with a simple uniquely decodeable code.

The program IRL.c encodes the runlengths using the code C_alpha, described in Chapter 7 of Information Theory, Inference, and Learning Algorithms. (Ch 7: Codes for Integers).

The same program can function as an encoder or a decoder; add the command line argument "-1" to select the decoder.

``` encoding usage: IRL < filep.01.10000  > file.IRLZ
decoding usage: IRL -1 <  file.IRLZ > decoded
```
This program does not use an explicit probabilistic model; instead, it uses an implicit probabilistic model defined by the chosen codelengths for integers. For example, according to C_alpha, the implicit probabilities of 2 and 3 are both 1/8, and the probabilities of 4, 5, 6, and 7 are all 1/32.

This program gets the sparse file into 1286 bits.

## Further ideas for other solutions

### Position code plus `bits back'

Here's a fun idea. To encode a file of length N=10,000, of which roughly 100 of the bits are 1s, we could encode the position of each of the 1s in the file. Since each position can be represented by a 14-bit integer, the compressed file length will be roughly 100x14 = 1400 bits.

Now, that's some way off from the expected Shannon information content, 800 bits or so. Why?

Well, an encoding of all the positions has redundancy in it in the sense that the encoder is free to choose the order of the encoded bit-positions. This freedom means that the encoder has the opportunity to encode not only the 100 bit-positions but also an arbitrary choice of one from the 100! (one hundred factorial) possible permutations. In order to make that choice, the encoder could sub-contract to his friend Fred, who also wants to communicate over this channel, the decision about the choice of permutation. Receiving the permuted string of bits, the receiver can then deduce not only the sparse file but also Fred's choice. And Fred can use that choice to convey another file of size log2(100!) bits, which is very roughly 100 (log2 (100/e) ), or 520 bits.

So the net cost of communicating this way is (total transmitted length in bits) - (length in bits of Fred's hidden transmission) ~= 1400 - 520 ~= 880 bits.

Which is (pretty near) the expected Shannon information content!

This idea is called bits-back coding. We encode in an arbitrary way, that requires some extra bits to resolve the arbitrariness; then claim back the cost of those extra bits by selling them to Fred. Now we get to the really fun bit: can you write a compression method such that the encoder himself plays the role of Fred? - i.e., the encoder chooses the permutation of the bit-positions in a way that conveys some of the bit-positions!

### Related concepts

• How should Fred turn his 600-bit file into a permutation of the 100 bits? We need an efficient and practical program. And a decoder that turns a permutation back into bits.
• A nearby concept is this: imagine that Joe wants to communicate an unordered selection of r objects from N objects. How should he encode this selection in bits?

 ITPRNN homepage | ITILA homepage

David MacKay
Last modified: Mon Feb 18 17:35:17 2008