LZJB

tags: compression

LZJB is a lossless compression algorithm created by Jeff Bonwick. Outside of the original source code, I only know of one description of the algorithm, in a short piece on Wikibooks.

From the description and a hundred or so sample input/output pairs, here is a clean description (with code!) of the decompressor.¹

Wikibooks says:

The LZJB decoder fetches a "literal/copy" bit from the compressed file.

This is only mostly true. The first byte of an LZJB stream is a control byte. Each bit in that byte is a "literal/copy" byte as described. Once all those bits have been handled (each one advancing the input stream by 1 or 2 bytes), another control byte is read and the process repeats.

A 0 bit means: read 1 byte from the input and append it to the output.

A 1 bit means: read 2 bytes from the input, the high 6 bits of the first byte are length, and the remaining 10 bits are distance. The 2 bits left from the length byte are the high bits of distance. Because it takes 2 bytes to encode a copy instruction, LZJB does not encode repeated sequences of length 1 or 2. Instead, length is incremented by 3, allowing sequences of length [3,66) to be expressed. Then:

Call the current end of the output end.
Seek distance bytes back from the current end of the output.
Copy length bytes starting from that position to the end of the output.
If length is greater than distance (as happens commonly), the bytes from distance to end are repeatedly appended until length is smaller than distance, then a final partial copy is performed.

def decompress(bs, size):
    bs = bytearray(bs)
    out = bytearray()
    blen = len(bs)
    pos = 0
    while pos < blen and len(out) < size:
        control = bs[pos]
        pos += 1
        for i in xrange(8):
            b = control & (1 << i) > 0
            if not pos < blen:
                break
            if not b:
                out.append(bs[pos])
                pos += 1
            else:
                length = (bs[pos] >> 2) + 3 
                distance = (bs[pos] & 0b11) << 8 | bs[pos+1]
                pos += 2
                backref = out[-distance:]
                lookup = backref * (length / distance) + backref[:(length % distance)]
                out += lookup
    return out[:size]

Implementing a compressor is left as an exercise to the interested reader. ↩