Writing ourselves a Zstandard codec in Rust. Part II.
Disclaimer: this is not final. If you find mistakes, please report them.
Part II: Packing, unpacking, checksum
This post is part of a series on entropy coding and writing our own Zstandard library in Rust.
Part I · II · III · IV · V · VI · VII · VIII · IX · X
Welcome back to the series. In the last episode we created the czar library, the part of this project responsible for handling Zstandard .zst files, or rather “frames” as they are called in the specification.
It is not yet a fully-featured zstd
equivalent, far from it. Today, we’ll add some of the missing features:
- Compute the checksum and verify it
- Create an archive from user-provided data
- Extract the contents of this archive
Our library also lacked some user-friendliness, so maybe that’s what we should tackle first.
Better error messages
The way we handle errors is by returning a CzarError:
// In zrs/czar/src/lib.rs
#[derive(Debug)]
pub enum CzarError {
InvalidMagic,
IoError(std::io::Error),
}
That’s only two errors. But there are many places where errors can occur in reading a user-provided archive, and if we want to help ourselves when debugging (as well as future library users) we need to be slightly more specific about what exactly went wrong.
We’ll do that in two steps: first, we’ll try to reduce the number of different errors raised by a function to a minimum, except for “high-level” functions (such as unpack
). Then, in such “high-level” functions, we’ll raise an error explaining the failure and its cause.
Let’s start with get_descriptor
. First we revert its signature back to using std::io::Error
, which is the only error it can raise:
fn get_descriptor<T: Read>(input: &mut T) -> Result<FrameDescriptor, std::io::Error>
Then introduce a dedicated error, raised during unpack
:
// In zrs/czar/src/lib.rs
#[derive(Debug)]
pub enum CzarError {
InvalidMagic,
IoError(std::io::Error),
// New!
InvalidDescriptor(std::io::Error),
}
// ...
pub fn unpack<T: Read>(
input: &mut T,
) -> Result<(FrameDescriptor, Vec<Block>, Option<u32>), CzarError> {
// ...
// Old
let frame_descriptor = get_descriptor(input)?;
// New!
let frame_descriptor = match get_descriptor(input) {
Ok(desc) => desc,
Err(ioerr) => return Err(CzarError::InvalidDescriptor(ioerr)),
};
// ...
}
We’ve done almost nothing there, just wrapped the error, but that way, if an IO error causes a descriptor to be invalid, we’ll know that the descriptor was invalid and that it was caused by an IO error. Previously we only knew that some IO error occurred. So that’s better now.
Let’s do the same thing with get_block
: change its error type back to std::io::Error
then
// In zrs/czar/src/lib.rs
#[derive(Debug)]
pub enum CzarError {
InvalidMagic,
IoError(std::io::Error),
InvalidDescriptor(std::io::Error),
// New!
IncompleteBlock(std::io::Error),
}
// ...
pub fn unpack<T: Read>(
input: &mut T,
) -> Result<(FrameDescriptor, Vec<Block>, Option<u32>), CzarError> {
// ...
// Old
let block = get_block(input)?;
// New
let block = match get_block(input) {
Ok(block) => block,
Err(ioerr) => return Err(CzarError::IncompleteBlock(ioerr)),
};
// ...
}
Left as an exercise to the reader: add an IncompleteChecksum error, raised when we fail to read a checksum.
We now have a better idea of what fails during unpacking.
Archive creation
Last time, we manually used our data structures and the pack
function to assemble a Zstandard archive. We could then successfully run zstd
to decompress it. It is time to start automating this process.
We create a new crate in the workspace, responsible for Zstandard archive compression, or in short, zarc
:
# In zrs/Cargo.toml
[workspace]
members = [
"czar",
"zarc", # New!
]
$ cargo new --lib zarc
Created library `zarc` package
# In zrs/zarc/Cargo.toml
[package]
edition = "2021"
name = "zarc"
version = "0.1.0"
[dependencies]
czar = {path = "../czar"}
As indicated in the TOML file, zarc
will rely on czar
. We’ll export a function that takes some input, compresses it, and writes the result to some output:
// In zrs/zarc/src/lib.rs
use std::io::{Read, Write};
pub fn compress<R: Read, W: Write>(input: &mut R, output: &mut W) -> Result<(), czar::CzarError> {
// TODO
}
Since we already wrote the pack
function, all we need to figure out is:
- What frame descriptor to use
- How to chop up the input into blocks
- How to compute the checksum (which is optional)
Let’s start easy: we’ll only consider Raw
blocks at first.
Recall that there are three types of blocks: Raw (which are not compressed), Rle (which are RLE compressed), and Compressed (which are compressed with ANS).
So what we do now is simply cut the input into slices:
// In zrs/zarc/src/lib.rs
use czar::{Block, BlockType};
use std::{
io::{Read, Write},
};
pub fn compress<R: Read, W: Write>(input: &mut R, output: &mut W) -> Result<(), czar::CzarError> {
let mut blocks = vec![];
let mut buffer = vec![];
input.read_to_end(&mut buffer)?;
let in_size = buffer.len();
let max_block_size = 1 << 20; // Arbitrary
for i in (0..in_size).step_by(max_block_size) {
let last_block = in_size - i <= max_block_size; // <= so the final, possibly full-sized chunk is marked last
let block_size = if last_block {
in_size - i
} else {
max_block_size
} as u32;
// Temporary: only Raw
let block_type = BlockType::Raw;
let contents = buffer[i..i + block_size as usize].to_vec();
let block = Block {
last_block,
block_type,
block_size,
contents,
};
blocks.push(block);
}
// TODO: compute checksum
// TODO: determine frame descriptor
// TODO: pack
Ok(())
}
We’ve got a small problem here: BlockType isn’t public, and Block’s fields are private too. Let’s change that by adding the pub keyword there, as sketched below.
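For reference, here is roughly what the now-public types look like. This is a sketch based on the fields we use throughout this post; your Part I code may differ in the details:
// In zrs/czar/src/lib.rs (sketch)
#[derive(Debug)]
pub enum BlockType {
    Raw,
    Rle,
    Compressed,
    UnknownBlockType(u8), // Reserved/invalid value read from a block header
}
#[derive(Debug)]
pub struct Block {
    pub last_block: bool,      // Set on the final block of the frame
    pub block_type: BlockType, // Raw, Rle or Compressed
    pub block_size: u32,       // Block_Size field from the block header
    pub contents: Vec<u8>,     // Block payload as read from the frame
}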
On to the frame descriptor! For now, let’s keep it simple and pub all its fields too:
// In zrs/czar/src/lib.rs
// ...
#[derive(Debug)]
pub struct FrameDescriptor {
// Mandatory fields
pub size_flag: u8, // Whether decompressed size is included
pub single_segment_flag: bool, // Whether decompressed data is contiguous
pub checksum_flag: bool, // Whether we add a checksum at the end
pub dict_id_flag: u8, // Size of dictionary ID (if provided)
// Optional fields
pub win_desc: Option<WindowDescriptor>, // Helps allocate memory
pub dict_id: Option<u32>, // Dictionary ID
pub content_size: Option<u64>, // Size of decompressed data
}
// ...
// In zrs/zarc/src/lib.rs
pub fn compress<R: Read, W: Write>(input: &mut R, output: &mut W) -> Result<(), czar::CzarError> {
// ...
let (size_flag, content_size) = get_size_descriptor(in_size);
// Temporary
let frame_descriptor = FrameDescriptor {
size_flag,
single_segment_flag: true,
checksum_flag: false,
dict_id_flag: 0,
win_desc: None,
dict_id: None,
content_size: Some(content_size),
};
// TODO: compute checksum
czar::pack(output, frame_descriptor, &blocks, None)?;
Ok(())
}
This ad hoc frame descriptor is just there to make the archive work; it sets only the necessary minimum. The get_size_descriptor function determines what flag to use:
// In zrs/zarc/src/lib.rs
fn get_size_descriptor(in_size: usize) -> (u8, u64) {
const _2_32: usize = (1 << 32) - 1;
match in_size {
0..=255 => (0, in_size as u64),
256..=65535 => (1, (in_size - 256) as u64),
65536..=_2_32 => (2, in_size as u64),
_ => (3, in_size as u64),
}
}
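To convince ourselves the mapping is right, here is a quick sanity check we could add as a test later (the test name and values are mine, not from the original code). Note that for the 2-byte case the format stores the size minus 256, hence the subtraction above:
// In zrs/zarc/src/lib.rs (hypothetical sanity check)
#[test]
fn test_size_descriptor() {
    assert_eq!(get_size_descriptor(100), (0, 100)); // fits in a 1-byte field
    assert_eq!(get_size_descriptor(300), (1, 44)); // 2-byte field stores size - 256
    assert_eq!(get_size_descriptor(100_000), (2, 100_000)); // 4-byte field
}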
Let’s try this:
// In zrs/zarc/src/lib.rs
#[cfg(test)]
mod tests {
use crate::*;
#[test]
fn test_raw() -> Result<(), czar::CzarError> {
use std::fs::File;
let mut message = "Hello world of Zstandard!\n".as_bytes();
let mut f = File::create("../assets/archive.txt.zst")?;
compress(&mut message, &mut f)?;
Ok(())
}
}
$ cargo test -p zarc
$ cd ../assets
$ zstd -d archive.txt.zst
archive.txt.zst : 26 bytes
$ cat archive.txt
Hello world of Zstandard!
Success!
No? Okay we’re not actually compressing anything here. But we do have well-formed archives.
The checksum
Oh right, there’s still one small thing we haven’t discussed. The checksum: after all blocks have been pushed, we can optionally add a u32
value acting as a checksum for the decompressed data. This way, if we somehow messed up decompression, we may be able to detect it.
The content checksum is the result of the xxh64
hash function on the decompressed data with a seed of zero, of which the 4 lowest bytes are stored in little-endian format.
It turns out a twox-hash
crate exists with a Rust implementation of this particular hash function, so let’s use it!
# In zrs/zarc
$ cargo add twox-hash
// In zrs/zarc/src/lib.rs
// ...
use core::hash::Hasher;
fn compute_checksum(input: &[u8]) -> u32 {
// Compute xxh64 of input
let mut hasher = twox_hash::XxHash64::with_seed(0);
hasher.write(input);
let hash_value = hasher.finish();
// Keep only 4 lowest bytes
let lower_word = (hash_value & 0xffffffff) as u32;
lower_word
}
Could it be that simple?
// In zrs/zarc/src/lib.rs
// ...
pub fn compress<R: Read, W: Write>(input: &mut R, output: &mut W) -> Result<(), czar::CzarError> {
// ..
let checksum = compute_checksum(&buffer); // New!
for i in (0..in_size).step_by(max_block_size) {
// ...
}
let (size_flag, content_size) = get_size_descriptor(in_size);
// Temporary
let frame_descriptor = FrameDescriptor {
size_flag,
single_segment_flag: true,
checksum_flag: true, // New!
dict_id_flag: 0,
win_desc: None,
dict_id: None,
content_size: Some(content_size),
};
czar::pack(
output,
frame_descriptor,
&blocks,
Some(checksum), // New!
)?;
Ok(())
}
$ cargo test -p zarc
$ cd ../assets
$ zstd -d archive.txt.zst
archive.txt.zst : 26 bytes
# No error
It is that simple! Who knew. Easy!
Unless… okay there’s one tiny little detail, so let me highlight it now. We’re putting all our input at once into the compute_checksum
function. When we have only a couple of bytes, as in our example, it doesn’t matter. But once we are digesting terabytes of data, we won’t be able to hold it all in memory at once. The “right” way to do things would be to proceed block by block, updating the hash as we go.
The reason we don’t do that right now is that, well, we are already storing all our input in memory (in buffer
) anyway. At some point we’ll need to address that: only read as much as we can chew, block by block, and compute the hash as we go. But not today.
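For reference, the change will be small when we get to it: the xxh64 state can be fed incrementally, so hashing the input chunk by chunk yields the same digest as hashing it all at once. A quick sketch (the chunked-input shape and the function name are mine):
// Hypothetical streaming variant of compute_checksum
use core::hash::Hasher;
fn compute_checksum_streaming<'a>(chunks: impl IntoIterator<Item = &'a [u8]>) -> u32 {
    let mut hasher = twox_hash::XxHash64::with_seed(0);
    for chunk in chunks {
        hasher.write(chunk); // Update the running hash, one chunk at a time
    }
    (hasher.finish() & 0xffff_ffff) as u32
}
// compute_checksum_streaming(buffer.chunks(max_block_size)) gives the same
// result as compute_checksum(&buffer).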
Today we celebrate.
Decompression
Let’s keep up the good work: time to decompress!
The unpack
function we’ve updated earlier is enough to handle the kind of files our compress
produces. Let’s do some refactoring first:
$ cd zarc/src
$ mv lib.rs compress.rs
// In zrs/zarc/src/lib.rs
pub mod compress;
pub mod decompress;
use core::hash::Hasher;
// Moved from compress.rs
fn compute_checksum(input: &[u8]) -> u32 {
// ...
}
// Moved from compress.rs
#[cfg(test)]
mod tests {
// ...
}
// In zrs/zarc/src/compress.rs
use crate::compute_checksum; // Now in lib.rs
// ...
// In zrs/zarc/src/decompress.rs
use czar::{Block, BlockType, FrameDescriptor};
use std::io::{Read, Write};
use crate::compute_checksum;
pub fn decompress<R: Read, W: Write>(input: &mut R, output: &mut W) -> Result<(), czar::CzarError> {
let (frame_descriptor, blocks, checksum) = czar::unpack(input)?;
// TODO
}
What do we do now? We’ll start simple: ignoring the frame descriptor, we’ll proceed through blocks. If the blocks are Raw
we merely copy their contents to the output:
// In zrs/zarc/src/decompress.rs
pub fn decompress<R: Read, W: Write>(input: &mut R, output: &mut W) -> Result<(), czar::CzarError> {
let (_frame_descriptor, blocks, checksum) = czar::unpack(input)?;
let mut result = vec![];
for block in blocks {
match block.block_type {
BlockType::Raw => result.extend(block.contents),
_ => unimplemented!(),
}
}
// TODO: checksum
output.write_all(&result)?;
output.flush()?;
Ok(())
}
Ok let’s try this:
// In zrs/zarc/src/lib.rs
// ...
#[cfg(test)]
mod tests {
// ...
#[test]
fn test_draw() -> Result<(), czar::CzarError> {
use std::fs::File;
let mut f = File::open("../assets/archive.txt.zst")?;
let mut g = File::create("../assets/archive2.txt")?;
decompress::decompress(&mut f, &mut g)?;
Ok(())
}
}
$ cargo test -p zarc
$ cd ../assets
$ cat archive2.txt
Hello world of Zstandard!
# No error occurred
As an exercise to the reader, perform the checksum verification (when the checksum is provided), raising an error when a mismatch happens.
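If you want to compare notes afterwards, here is one possible shape for it, assuming a ChecksumMismatch variant that you would add to CzarError (it does not exist in the code above):
// In zrs/zarc/src/decompress.rs, replacing the checksum TODO (sketch)
// Assumes a hypothetical czar::CzarError::ChecksumMismatch variant.
if let Some(expected) = checksum {
    let actual = compute_checksum(&result);
    if actual != expected {
        return Err(czar::CzarError::ChecksumMismatch);
    }
}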
Our decompressor has the same limitation as our compressor, namely we also put all of the input (and the output, which may be orders of magnitude larger!) in memory. This is bad practice. This is dangerous. And this is not necessary. Here too we’ll want to only have one block in memory at a time. We’ll do this. I promise. We’ll do this. At some later point.
Before that, there’s one sweet low-hanging fruit we can’t resist picking. Decompressing Rle
blocks just means repeating the same symbol. It is so easy even I can do it, so let’s do it (and raise errors for bad block types while we’re at it; this uses an InvalidBlockType variant you’ll want to add to CzarError, in the same spirit as the errors above):
// In zrs/zarc/src/decompress.rs
pub fn decompress<R: Read, W: Write>(input: &mut R, output: &mut W) -> Result<(), czar::CzarError> {
// ...
for block in blocks {
match block.block_type {
BlockType::Raw => result.extend(block.contents),
// New!
BlockType::Rle => result.extend(vec![block.contents[0]; block.block_size as usize]),
BlockType::Compressed => unimplemented!(),
BlockType::UnknownBlockType(x) => return Err(CzarError::InvalidBlockType(x)),
}
}
// ...
}
And just like that we cover 2 out of 3 valid cases! Enough for one day. Next time we’ll look into actually compressing data. Go to Part III.
Did you like this? Do you want to support this kind of stuff? Please share around. You can also buy me a coffee.