All files pointed to by this page are free software © David MacKay December 2005. License: GPL
This page offers a library of compression algorithms in Python.
Several of these packages use the doctest package to offer automated testing. For example, running
python IntegerCodes.py causes 12 tests to be executed, all of them defined in the documentation of the individual functions. The same tests are also run if you open IntegerCodes.py in emacs and type
C-c C-c
IntegerCodes.py
This module supplies four functions that implement three codes for integers. See Information Theory, Inference, and Learning Algorithms (Ch 7: Codes for Integers). This package uses the doctest module to test that it is functioning correctly.
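The module's actual function names are not reproduced on this page; as an illustration, one of the codes from that chapter, C_alpha (the Elias gamma code), could be sketched like this:

```python
def c_alpha(n):
    """C_alpha (Elias gamma) code for an integer n >= 1: the binary
    representation of n, preceded by len-1 zeroes so the decoder
    knows how many more bits to read."""
    b = format(n, 'b')
    return '0' * (len(b) - 1) + b

def c_alpha_decode(bits):
    """Read one C_alpha codeword from the front of a '0'/'1' string;
    return the decoded integer and the unread remainder."""
    z = 0
    while bits[z] == '0':   # count the leading zeroes
        z += 1
    return int(bits[z:2 * z + 1], 2), bits[2 * z + 1:]
```

For example, c_alpha(9) is '0001001'; because the zero count announces the codeword length, codewords can be concatenated and still decoded unambiguously.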
Huffman3.py
The Huffman3 package provides a Huffman algorithm, spitting out an optimal binary symbol code for a given set of probabilities. It also returns two objects that can be used for encoding and decoding with the functions encode and decode.
    ~/python/compression/huffman$ python Huffman3.py
    Reading in frequencies (and optional symbol names) from stdin
    50    <<< these four lines are the user's input,
    25    <<<     terminated by Control-D
    12    <<<
    13    <<<
    #Symbol Count Codeword
    1       (50)  1
    2       (25)  01
    3       (12)  000
    4       (13)  001

Alternatively, a slightly fancier usage includes symbol names in each line of stdin:

    ~/python/compression/huffman$ echo -e " 50 a \n 25 b \n 12 c \n 13 d" > ExampleCounts
    ~/python/compression/huffman$ python Huffman3.py < ExampleCounts
    Reading in frequencies (and optional symbol names) from stdin
    #Symbol Count Codeword
    a       (50)  1
    b       (25)  01
    c       (12)  000
    d       (13)  001

Functions supplied by the package:
Examples illustrating use of this package:
See also the file Example.py for a small Python program that uses this package, and RLE.py for a bigger example.
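For readers who want the gist without opening the package, here is a minimal heap-based sketch of the Huffman construction (not Huffman3's own implementation; for the counts in the transcript above it yields the same codeword lengths, though the 0/1 assignments may differ):

```python
import heapq

def huffman(freqs):
    """freqs: dict mapping symbol -> count.  Returns dict symbol -> codeword.
    Repeatedly merges the two least-frequent trees, prefixing '0' to one
    side's codewords and '1' to the other's."""
    heap = [(count, i, {sym: ''}) for i, (sym, count) in enumerate(freqs.items())]
    heapq.heapify(heap)
    counter = len(heap)            # tie-breaker so dicts are never compared
    while len(heap) > 1:
        c0, _, code0 = heapq.heappop(heap)
        c1, _, code1 = heapq.heappop(heap)
        merged = {s: '0' + w for s, w in code0.items()}
        merged.update({s: '1' + w for s, w in code1.items()})
        heapq.heappush(heap, (c0 + c1, counter, merged))
        counter += 1
    return heap[0][2]
```

For the counts 50, 25, 12, 13 this produces codewords of lengths 1, 2, 3, 3, matching the transcript.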
Example.py
This is a compression algorithm for compressing files containing the four symbols {a,b,c,d}. The assumed probabilities are {0.5,0.25,0.125,0.125}. This example uses the Huffman package to create its Huffman code and to handle encoding and decoding. If you run this package from within emacs with C-c C-c, it runs a test called easytest(). The package can also be used directly from a shell to compress or uncompress data received via stdin or stdout. The default behaviour is compression; to get uncompression, give an additional argument (for example --uncompress).

Usage at a shell prompt (compression first, then uncompression):

    $ echo -n "aaaabbcd" > Example.txt
    $ python Example.py < Example.txt > Example.zip
    $ python Example.py --uncompress < Example.zip > Example.unc
    $ diff Example.unc Example.txt
HuffmanL.py
The alternate package HuffmanL.py returns a simpler structure for decoding. Instead of using an "internalnode" class, the Huffman algorithm returns a binary list of binary lists of binary lists...
For example, the decoder for the code

    #Symbol Count Codeword
    0       (10)  00
    1       (9)   111
    2       (8)   101
    3       (7)   100
    4       (6)   011
    5       (5)   1101
    6       (4)   1100
    7       (3)   0101
    8       (2)   01001
    9       (1)   01000

can be represented by the simple list of lists:

    A = [[0, [[[9, 8], 7], 4]], [[3, 2], [[6, 5], 1]]]

Decoding then can be done like this (pseudocode):

    l = list(string)
    p = A
    while len(l) > 0:
        c = int(l.pop(0))
        p = p[c]
        if p is a string:
            output.append(p); p = A
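The pseudocode above becomes runnable if the leaf symbols are stored as strings, so that a leaf is distinguishable from an internal [left, right] pair (an illustrative sketch, not HuffmanL's own code):

```python
# Nested-list decoder for the code tabulated above; leaves are strings.
A = [['0', [[['9', '8'], '7'], '4']], [['3', '2'], [['6', '5'], '1']]]

def decode(bits, tree):
    """Decode a string of '0'/'1' characters with the nested-list tree."""
    output, p = [], tree
    for c in bits:
        p = p[int(c)]             # descend left (0) or right (1)
        if isinstance(p, str):    # reached a leaf: emit it, restart at root
            output.append(p)
            p = tree
    return ''.join(output)

print(decode('00' + '111' + '01000', A))  # codewords for 0, 1 and 9 -> '019'
```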
RLE.py
This is a run-length encoding compression algorithm that uses the Huffman algorithm to define a code for run lengths. The package contains the following functions:
There are also three test functions. If RLE.py is run from a terminal, it invokes compress_it (reading from stdin) if there are no arguments, or uncompress_it (also reading from stdin) if there is one argument.
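The run-extraction step that such a compressor needs can be sketched with itertools.groupby (the package's own function names are not assumed here):

```python
from itertools import groupby

def run_lengths(s):
    """Return the first character and the list of run lengths,
    e.g. '0001101' -> ('0', [3, 2, 1, 1]).  Run symbols alternate,
    so the lengths plus the first character reconstruct the string."""
    return s[0], [len(list(g)) for _, g in groupby(s)]

def from_runs(first, lengths):
    """Inverse of run_lengths: emit alternating runs of first/other bit."""
    other = '1' if first == '0' else '0'
    return ''.join((first if i % 2 == 0 else other) * n
                   for i, n in enumerate(lengths))
```

A run-length compressor would then feed the list of lengths to a Huffman (or integer) code.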
block.py
This is a block compression algorithm that uses the Huffman algorithm. This simple block compressor assumes that the source file is an exact multiple of the block length. The encoding does not itself delimit the size of the file, so the decoder needs to know where the end of the compressed file is. Thus we must either ensure the decoder knows the original file's length, or we have to use a special end-of-file character. A superior compressor would first encode the source file's length at the head of the compressed file; then the decoder would be able to stop at the right point and detect any truncation errors. I'll fix this in block2.py. The package contains the following functions:
There are also three test functions. If block.py is run from a terminal, it invokes compress_it (reading from stdin) if there are no arguments, or uncompress_it (also reading from stdin) if there is one argument.
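The length-header idea promised for block2.py could look like this (a sketch under assumed names, not block2.py itself):

```python
import struct

def add_header(compressed, original_length):
    """Prepend the source file's length as a 4-byte big-endian integer,
    so the decoder knows exactly when to stop."""
    return struct.pack('>I', original_length) + compressed

def read_header(data):
    """Split off the header; return (original_length, compressed_payload)."""
    (original_length,) = struct.unpack('>I', data[:4])
    return original_length, data[4:]
```

With the length at the head, no special end-of-file character is needed and truncated files can be detected.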
ac_encode.py |
Arithmetic coding -- the king of compressors
For another arithmetic coder by Philip, including file input and output functions, see Philip's page.
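To convey the flavour of the algorithm (an illustration of the principle, not ac_encode.py's code): each symbol narrows an interval [low, high) in proportion to its probability, and any number inside the final interval identifies the whole message.

```python
probs = {'a': 0.5, 'b': 0.25, 'c': 0.125, 'd': 0.125}  # assumed toy model

def interval(message, probs):
    """Return the [low, high) interval an arithmetic coder would narrow
    down to for this message (floats only; no bit output)."""
    low, high = 0.0, 1.0
    for sym in message:
        span = high - low
        cum = 0.0                       # cumulative probability before sym
        for s, p in probs.items():
            if s == sym:
                low, high = low + span * cum, low + span * (cum + p)
                break
            cum += p
    return low, high
```

A real coder emits bits incrementally and rescales the interval as it narrows, which avoids the floating-point underflow this toy version would suffer on long messages.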
golomb.py |
'Golomb' Compressor for a sparse file, with many 0s and few 1s.
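golomb.py's own interface is not shown on this page; the underlying idea, sketched here with a power-of-two Golomb parameter (i.e. a Rice code, an assumption made for simplicity), encodes each run of 0s between 1s cheaply:

```python
def rice_encode(n, k):
    """Golomb code with parameter m = 2**k (a Rice code): the quotient
    n // m in unary (q ones then a zero), then the remainder in k bits."""
    q, r = n >> k, n & ((1 << k) - 1)
    return '1' * q + '0' + format(r, '0{}b'.format(k))

def rice_decode(bits, k):
    """Read one codeword from the front of bits; return (n, remainder)."""
    q = 0
    while bits[q] == '1':    # unary part: count the ones
        q += 1
    r = int(bits[q + 1:q + 1 + k], 2)
    return (q << k) + r, bits[q + 1 + k:]
```

For a sparse file one would encode the length of each run of 0s this way; the parameter k is tuned to the typical run length.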
IRL.py |
Run-length compressor, "without explicit probabilistic model", for redundant files. It compresses the lengths of runs of 1s or 0s into a uniquely decodeable representation, C_alpha, as described in Information Theory, Inference, and Learning Algorithms (Ch 7: Codes for Integers). Runs of 0s and 1s are treated identically, so this compressor is expected to perform 'quite well' for a wide range of redundant sources. It is not only quite good for files like '000000000000000100000000000000000001010000000' and '1111111111111111011111111111111111111111001111111' but also especially good for files like '11111111111111111111000000000000000000000', which none of the other 'Bent Coin' compressors handle well.

Usage I: this package can be executed directly as follows:

    COMPRESS:   python IRL.py < BentCoinFile > tmp
    UNCOMPRESS: python IRL.py --uncompress < tmp > tmp2

If this package is executed by C-c C-c from within emacs, it runs the function "test", which performs a few small encodings and decodings.

Usage II: see IRLc.py and IRLd.py for examples of programs that use this package.
More details: encode one string's runs using the code C_alpha. If self-delimiting, prepend the number of runs, so the decoder will know when to stop reading. The first run's first character is sent in the clear; all subsequent runs alternate between 0 and 1. This compressor can be used in two ways: the 'smart' way and the 'nonsmart' way. The 'smart' way generates slightly longer compressed files that are fully self-delimiting, so you can concatenate the compressed files together end to end, and the decoder can read along, uncompressing each file separately. This smartness is achieved by prepending the number of runs to the compressed string. The package includes functions encode and decode, which produce the compressed file and optionally prepend an encoded integer to its head, thus making it self-delimiting. decode calls either smartdecode (which carefully reads just the right number of bits from the compressed list in order to recover one file) or simpledecode (which reads until the end of the compressed list is reached, and which thus requires either that the receiver know the file length, or that the file system have a special end-of-file character).
position.py |
Position codes: in position.py, the decoder must know the compressed file length. This position code illustrates a way to read command-line arguments.

Usage examples:

    position.py -bits 5 < /home/mackay/compress/BentCoinFile > encoded.pos5
    position.py -bits 5 -decode 1 < encoded.pos5 > recovered5
    position.py -bits 8 < /home/mackay/compress/BentCoinFile > encoded.pos8
    position.py -bits 8 -decode 1 < encoded.pos8 > recovered8
    position.py -bits 6 < /home/mackay/compress/BentCoinFile > encoded.pos6
    position.py -bits 6 -decode 1 < encoded.pos6 > recovered6
    position.py -bits 7 < /home/mackay/compress/BentCoinFile > encoded.pos7
    position.py -bits 7 -decode 1 < encoded.pos7 > recovered7
    position.py -multipleblocks 0 -bits 14 < /home/mackay/compress/BentCoinFile > encoded.pos
    position.py -multipleblocks 0 -bits 14 -decode 1 < encoded.pos > recovered

In position2.py, the encoded file is self-delimiting; the decoder does not need to know the compressed file length.
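One simplified reading of the position-code idea (this sketch ignores position.py's -multipleblocks option and argument parsing, and the function names are assumptions): transmit the index of every 1 as a fixed-width integer of the given number of bits, which is economical when 1s are sparse.

```python
def position_encode(s, bits):
    """Send each index at which s has a '1' as a fixed-width binary number."""
    return ''.join(format(i, '0{}b'.format(bits))
                   for i, c in enumerate(s) if c == '1')

def position_decode(encoded, bits, filelength):
    """Rebuild the file; the decoder must be told the original file length."""
    ones = {int(encoded[j:j + bits], 2) for j in range(0, len(encoded), bits)}
    return ''.join('1' if i in ones else '0' for i in range(filelength))
```

The decoder also needs to know where the compressed stream ends, which is exactly the limitation position2.py removes by making the file self-delimiting.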
LZ.py |
Lempel-Ziv compression: LZ.py implements a simple Lempel-Ziv algorithm, as described in MacKay (2003), Chapter 6, page 119. Each chunk conveys a pointer to the dictionary and one bit.
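The parsing stage of that algorithm can be sketched as follows (an illustrative sketch, not LZ.py's code): the source is split into the shortest substrings never seen before, and each phrase is transmitted as a pointer to its prefix in the dictionary plus one new bit.

```python
def lz_parse(s):
    """Parse a binary string into never-seen-before phrases; each phrase
    is (dictionary index of its prefix, final bit).  The empty string
    occupies dictionary slot 0.  An incomplete final phrase is dropped
    in this sketch."""
    dictionary = {'': 0}
    phrases, current = [], ''
    for bit in s:
        current += bit
        if current not in dictionary:
            phrases.append((dictionary[current[:-1]], bit))
            dictionary[current] = len(dictionary)
            current = ''
    return phrases
```

For '1011010100010' the phrases are 1, 0, 11, 01, 010, 00, 10, each encoded as a (pointer, bit) pair.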
LZ2.py |
Lempel-Ziv compression (2): LZ2.py implements another version of the Lempel-Ziv algorithm, in which each chunk conveys a pointer to the source file and a length of match.
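A sketch of that variant (illustrative only, not LZ2.py itself): each chunk records how far back in the already-seen text to start copying, how many characters to copy, and the next fresh character.

```python
def lz77_parse(s):
    """Greedy parse into (offset, length, next_char) chunks; offset is
    measured backwards from the current position, and a match may
    overlap the text currently being produced."""
    i, chunks = 0, []
    while i < len(s):
        best_len, best_off = 0, 0
        for off in range(1, i + 1):
            l = 0
            # keep room for the literal next character at the end
            while i + l < len(s) - 1 and s[i - off + l] == s[i + l]:
                l += 1
            if l > best_len:
                best_len, best_off = l, off
        chunks.append((best_off, best_len, s[i + best_len]))
        i += best_len + 1
    return chunks
```

For 'aaab' this yields a literal 'a' followed by a copy of length 2 (note the overlapping copy) and the literal 'b'.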