blog

LZJB

tags: compression

LZJB is a lossless compression algorithm created by Jeff Bonwick. Outside of the original source code, I only know of one description of the algorithm, in a short piece on Wikibooks.

From the description and a hundred or so sample input/output pairs, here is a clean description (with code!) of the decompressor.1

Wikibooks says:

The LZJB decoder fetches a "literal/copy" bit from the compressed file.

This is only mostly true. The first byte of an LZJB stream is a control byte. Each bit in that byte is a "literal/copy" byte as described. Once all those bits have been handled (each one advancing the input stream by 1 or 2 bytes), another control byte is read and the process repeats.

A 0 bit means: read 1 byte from the input and append it to the output.

A 1 bit means: read 2 bytes from the input, the high 6 bits of the first byte are length, and the remaining 10 bits are distance. The 2 bits left from the length byte are the high bits of distance. Because it takes 2 bytes to encode a copy instruction, LZJB does not encode repeated sequences of length 1 or 2. Instead, length is incremented by 3, allowing sequences of length [3,66) to be expressed. Then:

def decompress(bs, size):
    bs = bytearray(bs)
    out = bytearray()
    blen = len(bs)
    pos = 0
    while pos < blen and len(out) < size:
        control = bs[pos]
        pos += 1
        for i in xrange(8):
            b = control & (1 << i) > 0
            if not pos < blen:
                break
            if not b:
                out.append(bs[pos])
                pos += 1
            else:
                length = (bs[pos] >> 2) + 3 
                distance = (bs[pos] & 0b11) << 8 | bs[pos+1]
                pos += 2
                backref = out[-distance:]
                lookup = backref * (length / distance) + backref[:(length % distance)]
                out += lookup
    return out[:size]

  1. Implementing a compressor is left as an exercise to the interested reader.