Disclaimer: this is not final. If you find mistakes, please report them.

Part VIII: Missing bits, streams, and cache, fixing bugs

This post is part of a series on entropy coding and writing our own Zstandard library in Rust.

Part I · II · III · IV · V · VI · VII · VIII · IX · X

Welcome back to the series. In the last episode, we recovered compressed literals from a Zstandard archive, building an efficient canonical Huffman decoder along the way.

We tested our decoder on a well-formed stream and saw that it worked well, but then we tested it again on an actual file and got incorrect output. Today we fix this, and pick a couple of other low-hanging fruits.

Where we stand

A block is correctly decoded when we have

  • Recovered (and if needed, decompressed) the literals stream
  • Recovered (and decompressed) the sequence stream
  • Executed the sequences on our literals

Here are the things that we don’t yet support:

  1. Splitting blocks exactly at the right positions
  2. Literals that are compressed in Treeless mode
  3. Literals that are compressed in Compressed mode but use an uncompressed Huffman distribution
  4. Sequences that are compressed in either Rle, Fse, or Repeat mode
  5. Special “repeat codes” in sequence execution
  6. Multiple streams

(There are other nice things we can do, and some optimisations, but if we handled all of the above we would have a fully-functioning Zstandard decoder.)

Amongst the above points, we’ll address 1, 4, 5, and 6 today.

Addressing (2) requires a way to reuse a table across blocks; we’ll think about how to do that at a later time. Addressing (3) is not a priority: I struggled to find any compressed file that uses it, and I’m not even sure it’s ever actually used.

Splitting blocks

Our current code for block splitting was written in Part III. We read the block’s header and recover

  • An initial offset
  • The decompressed size for the literals stream
  • The compressed size for the literals stream
  • The number of streams (1 or 4)
  • The block header length

At the time, we only concerned ourselves with Raw or Rle blocks, so we discarded the “compressed” size information.

When the block is Compressed, this information matters, for the reason highlighted at the end of Part VII: if we don’t stop it, Huffman decoding will never stop, consuming all our data and souls. So the only thing we need to do is trim the literals stream:

// File: zrs/zarc/src/decompress.rs

// ...

fn process_compressed_block(input: Vec<u8>) -> Result<Vec<u8>, CzarError> {

    // ..

    // Determine header format
    let (
        s_len,     // Initial offset
        rsize_len, // Decompressed size
        csize_len, // Compressed size
        nstreams,  // Number of streams
        hlen,      // Header length
    ) = match literal_type {
        LiteralType::Raw | LiteralType::Rle => match lt_form {
            0 | 2 => (1, 5, 0, 1, 1),
            1 => (2, 12, 0, 1, 2),
            3 => (2, 20, 0, 1, 3),
            _ => panic!(), // Impossible
        },
        LiteralType::Compressed | LiteralType::Treeless => {
            match lt_form {
                0 => (2, 10, 10, 1, 3),
                1 => (2, 10, 10, 4, 3),
                2 => (2, 14, 14, 4, 4),
                3 => (2, 18, 18, 4, 5),
                _ => panic!(), // Impossible
            }
        }
    };

    // ..

    // Rest of the input to be processed
    let rest = &input[hlen..];

    // New! Trim literals stream
    let (literals_stream, rest) = match literal_type {
        LiteralType::Raw | LiteralType::Rle =>
            (&rest[..rsize as usize], &rest[rsize as usize..]),
        LiteralType::Compressed | LiteralType::Treeless => {
            (&rest[..csize as usize], &rest[csize as usize..])
        }
    };


    // Extract literals
    let (literals, skip) = get_literals(literals_stream, rsize, nstreams, literal_type);

    // ...
}

Hey, why don’t we use the split_at method here (available on slices, and therefore on Vec)? It’s a perfect fit! Try it as an exercise.
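
In case you want to compare notes afterwards, here’s one possible version of that exercise; it reuses the rsize and csize values we already read from the header:

// File: zrs/zarc/src/decompress.rs (possible solution to the exercise)

    // Vec<u8> dereferences to a slice, so split_at is available on `rest`
    let (literals_stream, rest) = match literal_type {
        LiteralType::Raw | LiteralType::Rle => rest.split_at(rsize as usize),
        LiteralType::Compressed | LiteralType::Treeless => rest.split_at(csize as usize),
    };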

Supporting more sequence modes

Handling RLE mode

RLE mode for sequence items is relatively simple, and we can handle it by constructing a tANS/FSE table with a single entry, then decoding as usual:

  • symbol is the one byte content read from the input
  • nb_bits and baseline are zero
  • the table’s “accuracy” is also zero

// File: zrs/zarc/src/fse.rs

// ...

impl DTable {
    pub fn from_rle(symbol: u8) -> Self {
        Self {
            table: vec![State {
                symbol,
                nb_bits: 0,
                baseline: 0,
            }],
        }
    }
    // ...
}

Handling FSE mode

If any of the sequence items is in Fse mode, then a tANS/FSE distribution is provided immediately. We already wrote a function fse::load_fse_weights to extract this distribution, so we should be good.

Putting it all together

// File: zrs/zarc/src/decompress.rs

// ...

fn get_sequences(input: &[u8]) -> Result<Vec<LzSequence>, CzarError> {
    // ...

    // New: we match on the mode
    let (dtable1, accuracy1) = match literals_mode {
        SeqInfo::Predefined => (
            // Predefined mode, this was already there!
            fse::DTable::from_weights(
                &fse::DEFAULT_LITLEN_WEIGHT,  //
                fse::DEFAULT_LITLEN_ACCURACY, //
            ),
            fse::DEFAULT_LITLEN_ACCURACY,
        ),
        SeqInfo::Rle => {
            // New! RLE mode
            // Read one symbol
            let symbol = sequences[0];
            sequences = &sequences[1..];
            (fse::DTable::from_rle(symbol), 0)
        }
        SeqInfo::Fse => {
            // New! FSE mode
            // Read the table from file
            let bitstream = sequences.view_bits::<Lsb0>().to_bitvec();
            let (fse_weights, accuracy, skip) = fse::load_fse_weights(&bitstream);
            sequences = &sequences[skip..];
            (fse::DTable::from_weights(&fse_weights, accuracy), accuracy)
        }
        _ => unimplemented!("Literals length distribution: {:?}", literals_mode),
    };

    // ...

    // New! We use the table and accuracy obtained above
    let mut dec_litlen = fse::Decoder::use_table(dtable1, accuracy1);

    // ...

}

And we do the same for offsets and match lengths. That’s a lot of repeated code, and we could factor it out. We should.

Ooookay, let’s just create a load_seq_table function that contains the code above and returns a DTable together with the accuracy and how many bytes were read. Then we call this function three times from get_sequences, providing it with the right default parameters each time.
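
Here’s a rough sketch of what that function could look like. The exact signature is up to us, and the types I use for the default weights and accuracy are placeholders for whatever we already use for the predefined tables:

// File: zrs/zarc/src/decompress.rs (sketch; the default-parameter types are illustrative)

fn load_seq_table(
    input: &[u8],             // Remaining bytes of the sequences section
    mode: SeqInfo,            // Compression mode for this symbol type
    default_weights: &[i16],  // Predefined distribution, e.g. fse::DEFAULT_LITLEN_WEIGHT
    default_accuracy: u8,     // Accuracy of the predefined distribution
) -> (fse::DTable, u8, usize) // (table, accuracy, bytes consumed)
{
    match mode {
        // Predefined: use the distribution hard-coded in the spec, consume nothing
        SeqInfo::Predefined => (
            fse::DTable::from_weights(default_weights, default_accuracy),
            default_accuracy,
            0,
        ),
        // Rle: a single byte gives the only symbol, accuracy is zero
        SeqInfo::Rle => (fse::DTable::from_rle(input[0]), 0, 1),
        // Fse: the distribution is stored right here in the block
        SeqInfo::Fse => {
            let bitstream = input.view_bits::<Lsb0>().to_bitvec();
            let (weights, accuracy, skip) = fse::load_fse_weights(&bitstream);
            (fse::DTable::from_weights(&weights, accuracy), accuracy, skip)
        }
        // Repeat: still unsupported
        other => unimplemented!("Sequence table mode: {:?}", other),
    }
}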

Special repeat codes: caching offsets

When we read offsets, values from 1 to 3 correspond to special repeat codes (a value of 0 is not possible).

The theory

The codes are Special1, Special2, Special3, and they correspond to changing values, which are updated as we process the file every time we handle a `Compressed` block.

This makes it essentially impossible to process compressed blocks in parallel. I’m assuming that these codes make compression rates better somehow; otherwise it’s hard to explain the added complexity and the lost opportunity for parallelism.

Basically, these special codes allow reusing a recently-used offset. When that offset is large, using a special code instead shaves a few bits. When no special code arises, i.e., when we only see “standard” offset codes, this is exactly what happens:

  • Initial value for special codes is (1, 4, 8)
  • Reading a “standard” code 123 updates values to (123, 1, 4)
  • Reading a “standard” code 456 updates values to (456, 123, 1)
  • Reading a “standard” code 789 updates values to (789, 456, 123)
  • Etc.

“Interesting” things happen when we use one of the special codes. Assume for now that literal_len > 0, and that the special-code values are the ones we ended up with above, (789, 456, 123); each step below continues from the state left by the previous one:

  • Reading the special code 1, we output its value 789. Values are (789, 456, 123)
  • Reading the special code 2, we output its value 456. We then swap values for code 1 and code 2: (456, 789, 123).
  • Reading the special code 3, we output its value 123. We then rotate values to: (123, 456, 789)

The swapping/rotating for codes 2 and 3 is there so that the order always reflects the most recently used values.

The logic is different if we have literal_len = 0. Starting from our values (123, 456, 789), and again chaining each step from the previous one:

  • Reading the special code 1, we output the second value 456, and swap: (456, 123, 789)
  • Reading the special code 2, we output the third value 789, and rotate: (789, 456, 123)
  • Reading the special code 3, we output the first value minus one, i.e. 788, as a brand new value. Values become (788, 789, 456)

Here’s a Python demo that processes the example from the spec

# File: special_codes.py

def process_offset_code(offset, lit_len, state):
    if offset not in ["special1", "special2", "special3"]:
        # Standard code
        state = [offset] + state[:2]
        return offset, state

    # Otherwise, special code
    if lit_len != 0:
        # Normal case
        if offset == "special2":
            state[0], state[1] = state[1], state[0]
        elif offset == "special3":
            state = [state[2], state[0], state[1]]
    else:
        # Special case
        if offset == "special1":
            state[0], state[1] = state[1], state[0]
        elif offset == "special2":
            state = [state[2], state[0], state[1]]
        else:
            state = [state[0] - 1, state[0], state[1]]

    return state[0], state

offsets = [1111, "special1", 2222, 1111, 3333,
           "special2", "special3", "special3", "special1"]
litlens = [11, 22, 22, 111, 33, 22, 33, 0, 0]

state = [1, 4, 8]
for (offset, litlen) in zip(offsets, litlens):
    output, state = process_offset_code(offset, litlen, state)
    print("output = %s state = %s" % (output, state))

This works like a charm!

Except for the last step. We get a final state of [2222, 2221, 1111], but the spec says we should get [2222, 2221, 3333]. I’ve banged my head on that for some time. Here’s my diagnosis: it seems that this is yet another case of the spec being wrong. It’s becoming a meme at this point.

The practice

In practice, we read offset codes before we read literal lengths. We’ll start by updating our read_offset_code function:

// File: zrs/zarc/src/fse.rs

// ...

// New!
#[derive(Debug, Clone, Copy)]
pub enum OffsetCode {
    Normal(u32),
    Special1,
    Special2,
    Special3,
}

// New return type
pub fn read_offset_code(code: u8, bitstream: &mut BitVec<LocalBits, u8>) -> OffsetCode {
    let remaining = bitstream.split_off(bitstream.len() - code as usize);

    let value = bits_to_integer(&remaining);
    let offset_value = (1 << code) + value;
    if offset_value > 3 {
        OffsetCode::Normal((offset_value - 3) as u32)
    } else {
        match offset_value {
            1 => OffsetCode::Special1,
            2 => OffsetCode::Special2,
            3 => OffsetCode::Special3,
            _ => panic!(), // Impossible
        }
    }
}
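
As a quick sanity check of the arithmetic: if the offset code decoded from the tANS/FSE state is 6 and the next 6 bits encode the value 5, then offset_value = (1 << 6) + 5 = 69; since 69 > 3, the actual offset is 69 - 3 = 66. Note that the special values 1 to 3 can only arise from offset codes 0 and 1.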

Then we’ll add a few lines to our decompressor:

// File: zrs/zarc/src/decompress.rs

// ...

fn get_sequences(input: &[u8]) -> Result<Vec<LzSequence>, CzarError> {
    // ...

    // New (and temporary!)
    let mut offset_cache = [1, 4, 8];

    for i in 0..nb_sequences {

        let offset_code = fse::read_offset_code(code_offset, &mut bitstream);

        let match_len = fse::read_code(code_matchlen, &fse::MATCHLEN_CODE, &mut bitstream);

        let literal_len = fse::read_code(code_litlen, &fse::LITLEN_CODE, &mut bitstream);

        // New!
        let offset = fse::process_offset_code(offset_code, literal_len > 0, &mut offset_cache);

        // ...
    }
    // ...
}

We added an offset_cache variable which is currently internal to get_sequences. This is fine for testing, but if we plan to deal with more than one block (which we do), we’ll have to bring it one level higher. That’s a problem for future me; for now we’re just figuring things out.

We haven’t written the process_offset_code function yet, but it should do pretty much what the Python code above does.

// File: zrs/zarc/src/fse.rs

// ...

// New!
pub fn process_offset_code(offset_code: OffsetCode, standard: bool, cache: &mut [u32; 3]) -> u32 {
    if let OffsetCode::Normal(offset) = offset_code {
        // Non-special code
        cache[2] = cache[1]; cache[1] = cache[0]; cache[0] = offset;
        return offset;
    }

    if standard {
        // Normal case
        match offset_code {
            OffsetCode::Special1 => {}
            OffsetCode::Special2 => {
                let tmp = cache[0]; cache[0] = cache[1]; cache[1] = tmp;
            }
            OffsetCode::Special3 => {
                let tmp = cache[2];  cache[2] = cache[1];
                cache[1] = cache[0]; cache[0] = tmp;
            }
            _ => panic!(), // Impossible
        }
    } else {
        match offset_code {
            OffsetCode::Special1 => {
                let tmp = cache[0]; cache[0] = cache[1]; cache[1] = tmp;
            }
            OffsetCode::Special2 => {
                let tmp = cache[2];  cache[2] = cache[1];
                cache[1] = cache[0]; cache[0] = tmp;
            }
            OffsetCode::Special3 => {
                cache[2] = cache[1]; cache[1] = cache[0];
                cache[0] -= 1;
            }
            _ => panic!(), // Impossible
        }
    }

    cache[0]
}
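
To convince ourselves that this matches the worked example from the theory section, here’s a quick test sketch (assuming it sits in a test module next to process_offset_code in fse.rs):

// File: zrs/zarc/src/fse.rs (hypothetical test module)

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn special_offset_codes_follow_the_worked_example() {
        // Start from the state used in the prose: (789, 456, 123)
        let mut cache = [789u32, 456, 123];

        // literal_len > 0
        assert_eq!(process_offset_code(OffsetCode::Special1, true, &mut cache), 789);
        assert_eq!(cache, [789, 456, 123]);
        assert_eq!(process_offset_code(OffsetCode::Special2, true, &mut cache), 456);
        assert_eq!(cache, [456, 789, 123]);
        assert_eq!(process_offset_code(OffsetCode::Special3, true, &mut cache), 123);
        assert_eq!(cache, [123, 456, 789]);

        // literal_len == 0
        assert_eq!(process_offset_code(OffsetCode::Special1, false, &mut cache), 456);
        assert_eq!(cache, [456, 123, 789]);
        assert_eq!(process_offset_code(OffsetCode::Special2, false, &mut cache), 789);
        assert_eq!(cache, [789, 456, 123]);
        assert_eq!(process_offset_code(OffsetCode::Special3, false, &mut cache), 788);
        assert_eq!(cache, [788, 789, 456]);
    }
}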

Multiple streams

Zstandard allows a compressed file to split Compressed or Treeless literals into 4 streams, which are decoded using the same table, independently. Decoded streams are concatenated, in order.

When 1 stream is used, it immediately follows the Huffman weight header. But when 4 streams are used, the Huffman header is followed by a “jump table” made of 3 u16 values, giving the lengths of the first three streams (the fourth stream’s size is whatever remains).

At this stage it may be time to refactor our code a little bit to avoid getting too messy: we’ll take the tANS/FSE literal handling out of our get_literals function, and put all the information about how literals are represented into a struct LiteralHeaderFormat:

// File: zrs/zarc/src/decompress.rs

// ...

fn decompress_fse_literals(
    stream: &[u8],
    format: LiteralHeaderFormat,
) -> Result<Vec<u8>, CzarError> {
    // Read Huffman table if present
    let (header_size, htable) = if format.literal_type == LiteralType::Compressed {
        extract_huffman_table(stream)
    } else {
        // TODO: reuse Huffman table
        unimplemented!("Treeless Huffman");
    };
    let (_, stream) = stream.split_at(header_size);

    // Decompress stream
    if !format.multiple_streams {
        // Single Huffman stream
        Ok(htable.decompress(stream))
    } else {
        // 4 Huffman streams
        // Read jump table
        let (jump_table, rest) = stream.split_at(6);
        let [size1, size2, size3] = read_jump_table(jump_table);

        // Decompose streams
        let (stream1, rest) = rest.split_at(size1);
        let (stream2, rest) = rest.split_at(size2);
        let (stream3, stream4) = rest.split_at(size3);

        // Decompress streams
        let literals1 = htable.decompress(stream1);
        let literals2 = htable.decompress(stream2);
        let literals3 = htable.decompress(stream3);
        let literals4 = htable.decompress(stream4);

        // Combine results
        let combined = [literals1, literals2, literals3, literals4].concat();

        Ok(combined)
    }
}

// ...
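
Two things used above aren’t spelled out: the LiteralHeaderFormat struct and the read_jump_table helper. Here are minimal sketches of both. The struct’s field set is a guess (only literal_type and multiple_streams are used in the code above; I’m assuming the sizes from the literals header travel with them), and the jump table is simply three little-endian u16 values:

// File: zrs/zarc/src/decompress.rs (sketches; exact shapes may differ)

// A guess at the struct's contents
struct LiteralHeaderFormat {
    literal_type: LiteralType, // Raw, Rle, Compressed, or Treeless
    multiple_streams: bool,    // false: 1 stream, true: 4 streams
    regenerated_size: usize,   // Decompressed size of the literals (assumed field)
    compressed_size: usize,    // Compressed size of the literals (assumed field)
}

// The jump table is 6 bytes: three little-endian u16 values giving the sizes
// of the first three streams
fn read_jump_table(bytes: &[u8]) -> [usize; 3] {
    [
        u16::from_le_bytes([bytes[0], bytes[1]]) as usize,
        u16::from_le_bytes([bytes[2], bytes[3]]) as usize,
        u16::from_le_bytes([bytes[4], bytes[5]]) as usize,
    ]
}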

In the process of checking this code, I found a small but consequential bug in the Huffman decompression procedure: we forgot to output the last symbol. That’s the reason we were missing \n during our earlier tests; I didn’t bother looking into it until now. This is easy to fix:

// File: zrs/zarc/src/huffman.rs

// ...

    pub fn decompress(&self, input: &[u8]) -> Vec<u8> {
        // ..
            if (entry.nb_bits as usize) < bitstream.len() {
                // Output symbol
                result.push(entry.symbol);
                // Consume nb_bits
                bitstream = &bitstream[entry.nb_bits as usize..];
            } else {

                // New! We forgot to do this
                result.push(entry.symbol);
                break;
            }
            // ..
    }

Tracking a subtle bug

Now that we have support for a decent proportion of cases, we probably want to check that all is well, and perhaps spot a few bugs lurking in the shadows. The good thing is that we have already implemented the checksum back in Part II so we can determine whether our decompressed file exactly matches the expectation.

We also need a bunch of test files, just in case.

I took files from the Canterbury Corpus, which should provide a good variety of cases. I added to that my own selection of works: Shakespeare’s Sonnet 42 and the opening paragraph of Beowulf.

Here are the results as of now:

  • The following files are decompressed exactly:
    • Shakespeare’s Sonnet 42
    • xargs.1
    • asyoulik.txt
    • grammar.lsp
    • fields.c
  • The following files are “almost” decompressed exactly
    • cp.html
    • Beowulf’s opening paragraph
  • The following files cause an error during sequence execution:
    • kennedy.xls
    • plrabn12.txt
    • lcet10.txt
    • ptt5
    • pic
    • alice29.txt

What I mean by “almost exact” above is the following: we seem to have everything in place except for a few characters. In cp.html the only error is on character 0xfc, which is replaced by 0xfa. In Beowulf, it seems that the only issue is that we incorrectly output character 0xc2 instead of 0xc3.


Fig. 1: Error in the first word of Beowulf.


Note that because the text is UTF-8 encoded, a single error may cause several characters to appear differently. Because character 0xc3 is replaced by 0xc2, we get ¦ instead of æ, ¾ instead of þ, and ° instead of ð. Capital thorn Þ is also replaced by the non-printable control character “privacy message”. So there’s really one error, but it affects multiple letters.
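
To see this concretely, here’s a tiny standalone snippet (not part of the decoder) showing how a single wrong lead byte changes the decoded character:

// A quick illustration, unrelated to the decoder itself
fn main() {
    let expected = String::from_utf8(vec![0xC3, 0xA6]).unwrap(); // "æ"
    let obtained = String::from_utf8(vec![0xC2, 0xA6]).unwrap(); // "¦"
    println!("{} instead of {}", obtained, expected);
}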

Clearly, the only possible cause for an error of this kind is in the Huffman table generation. Let’s create a third file that has issues, just to have some data. A French pangram did the trick.

  • Expected: 0xfc = 252, 0xc3 = 195, 0xc5 = 197
  • Obtained: 0xfa = 250, 0xc2 = 194, 0xc4 = 196

Printing the Huffman weights for each file, it turns out that the last entry is, respectively, 250, 194, and 196. Since the last entry is not stored but appended by us after reading the Huffman weights, it must be that we stop reading too early and miss some weights.

Fixing the bug

So it’s an issue about where we stop reading the Huffman weights. And it’s a subtle issue. See, we are not supposed to stop at the end of the bitstream when decoding these weights (as we do when decoding other tANS/FSE coded data elsewhere). If we hit the end early, meaning that we need to read additional bits to determine the next state, it is assumed that the extra bits are zero.

Why does Zstandard do this instead of just, you know, storing the few extra zeros directly in the file? Probably because every bit counts? I mean it’s not going to change much, and it does add some headaches to decoding. Heh.

So the logic of what we’ve been doing so far was this:

  • Try to update state
  • If it fails, add extra zeros, update state, and stop decoding

But that’s not what we’re supposed to do. For once, the spec did say something. Here’s the relevant paragraph:

“If updating state after decoding a symbol would require more bits than remain in the stream, it is assumed that extra bits are 0. Then, symbols for each of the final states are decoded and the process is complete.”

That means two things:

  • The final states, plural: remember that we have two decoders running one after the other. When we hit the end, we only flushed the current decoder. What we’re supposed to do is flush both the current decoder and the other one.
  • In case of overflow, we should not consider the newly computed state as valid; we don’t decode a symbol from it

Another mistake we made was to stop decoding when we reached the end of the stream. Indeed, tANS/FSE can read 0 bits and still output something. As long as the state update doesn’t require reading any bits, we can and should proceed.

We just need to change our decoding logic when reading Huffman weights, to capture the overflow situation and do the flushing right:

// File: zrs/zarc/src/decompress.rs

fn extract_huffman_table(input: &[u8]) -> (usize, huffman::HuffmanCodec) {
    // ...

    // Interleaved decoders
    let mut decoder1 = fse::Decoder::use_table(dtable.clone(), accuracy);
    let mut decoder2 = fse::Decoder::use_table(dtable, accuracy);

    // New! Initial states
    let mut symbol1 = decoder1.load_initial_state(&mut bitstream);
    let mut symbol2 = decoder2.load_initial_state(&mut bitstream);

    loop {
        let next_symbol1 = decoder1.decoder_step(&mut bitstream);

        // New! Handle overflow
        match next_symbol1 {
            Some(weight) => {
                huffman_weights.push(symbol1);
                symbol1 = weight;
            }
            _ => {
                // New! Flush both states
                huffman_weights.push(symbol1);
                huffman_weights.push(symbol2);
                break;
            }
        }

        let next_symbol2 = decoder2.decoder_step(&mut bitstream);

        // New! Handle overflow
        match next_symbol2 {
            Some(weight) => {
                huffman_weights.push(symbol2);
                symbol2 = weight;
            }
            _ => {
                // New! Flush both states
                huffman_weights.push(symbol2);
                huffman_weights.push(symbol1);
                break;
            }
        }

        // New! Don't stop when stream is empty
        if huffman_weights.len() > 255 {
            break;
        }
    }
    // ...
}

And just with that, we solved the strange Unicode errors that popped up in two of our test files. Magic!

Cliffhanger: Sequence errors, final nail in the coffin

The only issues that seem to remain when decompressing our test files are the errors during sequence execution. Specifically, what happens is an attempt to subtract with overflow when trying to perform a copy-paste operation: we attempt to copy too far back.

Handling this would require some refactoring to allow one block to access the contents of another. Since we need to pass other information around as well (offset caches, etc.), maybe it’s now worth putting in the effort to do this cleanly. And while we’re at it, perhaps we’ll tidy things up a bit.

Almost there! If all goes according to keikaku we’ll have a fully featured decompressor next post! Stay tuned!

Did you like this? Do you want to support this kind of stuff? Please share around. You can also buy me a coffee.