LZJB
tags: compression
LZJB is a lossless compression algorithm created by Jeff Bonwick. Outside of the original source code, I only know of one description of the algorithm, in a short piece on Wikibooks.
From the description and a hundred or so sample input/output pairs, here is a clean description (with code!) of the decompressor.1
Wikibooks says:
The LZJB decoder fetches a "literal/copy" bit from the compressed file.
This is only mostly true. The first byte of an LZJB stream is a control byte. Each bit in that byte is a "literal/copy" byte as described. Once all those bits have been handled (each one advancing the input stream by 1 or 2 bytes), another control byte is read and the process repeats.
A 0 bit means: read 1 byte from the input and append it to the output.
A 1 bit means: read 2 bytes from the input, the high 6 bits of the first byte
are length, and the remaining 10 bits are distance. The 2 bits left from the
length byte are the high bits of distance. Because it takes 2 bytes to encode
a copy instruction, LZJB does not encode repeated sequences of length 1 or 2.
Instead, length is incremented by 3, allowing sequences of length [3,66) to
be expressed.
Then:
- Call the current end of the output
end. - Seek
distancebytes back from the current end of the output. - Copy
lengthbytes starting from that position to the end of the output. - If
lengthis greater thandistance(as happens commonly), the bytes fromdistancetoendare repeatedly appended untillengthis smaller thandistance, then a final partial copy is performed.
def decompress(bs, size): bs = bytearray(bs) out = bytearray() blen = len(bs) pos = 0 while pos < blen and len(out) < size: control = bs[pos] pos += 1 for i in xrange(8): b = control & (1 << i) > 0 if not pos < blen: break if not b: out.append(bs[pos]) pos += 1 else: length = (bs[pos] >> 2) + 3 distance = (bs[pos] & 0b11) << 8 | bs[pos+1] pos += 2 backref = out[-distance:] lookup = backref * (length / distance) + backref[:(length % distance)] out += lookup return out[:size]
-
Implementing a compressor is left as an exercise to the interested reader. ↩