Disclaimer: this is not final. If you find mistakes, please report them.

I recently had to use data compression in one of my Rust projects. Expectedly, the Rust ecosystem had plenty of high-quality options to choose from, when supporting most formats. But I couldn’t put my hands on a Rust Zstandard library [*].

So I decided to write my own. And I felt I should talk bout it. I fell down the rabbit hole and learnt one things or two about data compression, and what makes Zstandard as impressive as it is. In the process I’ll explain how it works. The rules of the game are simple: I can only rely on the RFC and on the reference implementation (called ztsd) to try and make a working decompressor (and maybe compressor) that can handle at least some files, and we want to understand all the steps.

[*] At the time of writing, the only crate mentioning Zstandard is a binding of the reference C implementation, I felt we could do better.

Part I: Reading frames

This is part of a series of posts:

Part I · II · III · IV · V · VI · VII · VIII · IX · X

Did you like this? Do you want to support this kind of stuff? Please share around. You can also buy me a coffee.

Modern lossless compression

Lossless data compression is a way to reduce the amount of memory necessary to store data, without altering the data. This contrasts with lossy compression, which tolerates some degree of alteration in exchange for even higher compression rates.

There are four key ideas in the field of lossless compression:

Dictionary compression: if a particular sequence appears often in the data stream, we store the sequence in a “dictionary”, assign it an index, and replace the sequence wherever it appears by a reference to the dictionary entry. The reference is shorter than the sequence it replaces, hence we use less data to say the same thing. Two well-known approaches to this are RLE and (variants of) the Lempel-Ziv (LZ) algorithm
Entropy coding: if a symbol (say, a particular character) appears very often in the data stream, we may use less than a full byte to represent it. Three popular approaches to doing this are Huffman coding, arithmetic coding, and asymmetric number systems coding (ASN).
Data interleaving: reordering data may improve the performance of compression, for instance by bunching together repeated symbols. Popular transformations include move-to-front, the Burrows-Wheeler transform, zigzag ordering.
Context modeling: in some contexts, it is possible to “guess” the next symbol going to appear in the data stream with reasonable accuracy. Sometimes the prediction is not quite correct, and we only need to store corrective information (rather than the full symbol).

Ultimately, these ideas are combined to give any particular recipe:

BZIP2 : BWT/MTF + RLE + Huffman
DEFLATE (used in zip, gzip, png): LZ(SS) + Huffman
7zip: LZ(77) + Markov context modeling + Arithmetic coding
Brotli: LZ(77) + Huffman + 2nd order context modeling
Zstandard / LZFSE: LZ(77) + RLE + Huffman + ASN

Note: lossy compression algorithms also use these as part of their pipelines!

In this series we’ll focus on Zstandard (because I said so!), and we’ll explore the design and merits of each of its ingredients. If all goes well, we’ll get a crate handling this data format, so follow along!

This project will be successful if we can get an existing Zstandard archive to correctly decrypt, and if we can create a Zstandard archive containing compressed files that the reference implementation can decrypt.

Archive format

We’ll be building a Rust library, call it zrs, to handle Zstandard archives. If we take a step back, we can decompose our work into

the “archive” part, which has encrypted data together with metadata;
the compression / decompression part.

The compression stage is usually the one requiring the most work, because we may have to decide on the best course of action (i.e., the best way to compress a piece of data) amongst several alternatives.

The archive part is “simply” parsing a file, so that’s where we’ll begin today. To do that, we first create a Rust workspace

$ mkdir zrs
$ cd zrs

# In zrs/Cargo.toml

[workspace]
members = [
  "czar",
]

And within this workspace, we create the czar library which will handle Zstandard archives

$ cargo new --lib czar
Created library `czar` package

All good to go.

Zstandard archives are made of blocks. Some information is put at the beginning of the file to help us read these blocks. It looks like this:

(Magic number) (Header) (Block 1) (Block 2) … (Block N) (Checksum)

Magic number

The “magic number” is a constant:

// In zrs/czar/src/lib.rs

pub const MAGIC : [u8; 4] = [0x28, 0xB5, 0x2F, 0xFD];

Frame header

The frame header tells us all we need to know about this file. It has mandatory and optional fields. Let’s write a struct to capture it:

// In zrs/czar/src/lib.rs

struct FrameDescriptor {
    // Mandatory fields
    size_flag: u8,                  // Whether decompressed size is included
    single_segment_flag: bool,      // Whether decompressed data is contiguous
    checksum_flag: bool,            // Whether we add a checksum at the end
    dict_id_flag: u8,               // Size of dictionary ID (if provided)

    // Optional fields
    win_desc: Option<WindowDescriptor>, // Helps allocate memory
    dict_id: Option<u32>,               // Dictionary ID
    content_size: Option<u64>           // Size of decompressed data
}

struct WindowDescriptor {
    exponent: u8,
    mantissa: u8
}

The “dictionary ID” is something that will intervene during decompression. For now we don’t need to worry too much about it.

Blocks

Blocks contain compressed data. The data is split into blocks rather than compressed as a whole: this improves the performance in several ways. The format of a block is captured as follows:

// In zrs/czar/src/lib.rs

pub struct Block {
    last_block: bool,   // This is the last block in file
    block_type: BlockType, // See below
    block_size: u32,    // Size of block ()
    contents: Vec<u8>   // Contents of this block
}

As appears above, a block can be of different types: not compressed (raw), ANS-compressed, or RLE-compressed. Let’s capture this with an enum:

// In zrs/czar/src/lib.rs

enum BlockType {
    Compressed,             // Compressed with ANS coding
    Raw,                    // Not compressed: plain data
    Rle,                    // Compressed with RLE coding
    UnknownBlockType(u8)    // Anything else
}

Let’s get to it!

In this first post, we won’t try to compress or decompress anything, but we’ll do the necessary to extract blocks from an archive (or to put them there).

We’ll start by making a real archive with zstd:

$ mkdir assets
$ cd assets
$ echo "This is a sample file with some text inside, just to check that zstd does what I think it does." > hello.txt
$ zstd hello.txt

This creates a file hello.txt.zst which we’ll try to read (or at least, for now, extract the blocks).

We’ll be using the Read trait, which allows us to progressively consume the file byte per byte. So we are going to write a function like:

// In zrs/czar/src/lib.rs

use std::io::Read;

pub fn unpack<T: Read>(input: &mut T)  -> Vec<Block>  {
    let magic = get_magic(input);
    let frame_descriptor = get_descriptor(input);

    // ...
}

Here get_magic and get_descriptor would call input.read_exact(&mut buffer) to obtain bytes from the input. But what if we try to read more than there is? An error would be triggered. Let’s allow ourselves to catch it:

// In zrs/czar/src/lib.rs

//                                       👇
pub fn unpack<T: Read>(input: &mut T)  -> Result<Vec<Block>, std::io::Error>  {
    //                          👇
    let magic = get_magic(input)?;

    //                                          👇
    let frame_descriptor = get_descriptor(input)?;

    // ...
}

Now we can proceed:

// In zrs/czar/src/lib.rs

fn get_magic<T: Read>(input: &mut T) -> Result<[u8; 4], std::io::Error> {
    let mut buffer = [0; 4];
    input.read_exact(&mut buffer)?;
    Ok(buffer)
}

The frame descriptor has variable size and requires some more logic to determine whether optional fields are there, but nothing extraordinary. We also need some bit fiddling to extract the mandatory fields.

// In zrs/czar/src/lib.rs

fn get_descriptor<T: Read>(input: &mut T) -> Result<FrameDescriptor, std::io::Error> {
    let mut buffer = [0; 1];
    input.read_exact(&mut buffer)?;
    let byte_data = buffer[0];

    // Mandatory fields
    let dict_id_flag = byte_data & 0x3; // Bits 0-1
    let checksum_flag = (byte_data >> 2) & 1 == 1; // Bit 2
    let single_segment_flag = (byte_data >> 5) & 1 == 1; // Bit 5
    let size_flag = byte_data >> 6; // Bits 6-7

    // Determine which optional fields are present
    let has_window_desc = !single_segment_flag;
    let has_dict_id = dict_id_flag > 0;
    let has_content_size = (size_flag > 0) || single_segment_flag;

    // Optional fields

    // Window descriptor
    let win_desc = if has_window_desc {
        input.read_exact(&mut buffer)?;
        let byte_data = buffer[0];
        let mantissa = byte_data & 0x3;
        let exponent = byte_data >> 2;
        Some(WindowDescriptor { exponent, mantissa })
    } else {
        None
    };

    // Dictionary ID
    let dict_id = if has_dict_id {
        let mut buffer = vec![0; dict_id_flag as _];
        input.read_exact(&mut buffer)?;

        // Padding with extra zeros
        let buffer = [buffer, vec![0; 4 - dict_id_flag]].concat();

        Some(u32::from_be_bytes(buffer))
    } else {
        None
    };

    // Content size
    let content_size = if has_content_size {
        let field_size = 1 << size_flag;

        let mut buffer = vec![0; field_size];
        input.read_exact(&mut buffer)?;

        // Padding with extra zeros
        let buffer = [buffer, vec![0; 8 - field_size]].concat();

        let mut value = u64::from_le_bytes(buffer);

        if size_flag == 2 {
            // Special case
            value += 256;
        }

        Some(value)
    } else {
        None
    };

    Ok(FrameDescriptor {
        size_flag,
        single_segment_flag,
        checksum_flag,
        dict_id_flag,
        win_desc,
        dict_id,
        content_size,
    })
}

Not so bad? But it doesn’t compile:

$ cargo check
# ...
error[E0308]: mismatched types
   --> czar/src/lib.rs:113:33
    |
113 |         Some(u32::from_be_bytes(buffer))
    |                                 ^^^^^^ expected array `[u8; 4]`, found struct `Vec`
    |
    = note: expected array `[u8; 4]`
              found struct `Vec<u8>`

error[E0308]: mismatched types
   --> czar/src/lib.rs:124:44
    |
124 |         let mut value = u64::from_le_bytes(buffer);
    |                                            ^^^^^^ expected array `[u8; 8]`, found struct `Vec`
    |
    = note: expected array `[u8; 8]`
              found struct `Vec<u8>`
# ...

The compiler just needs a little convincing to trust us that there is no problem here.

// In zrs/czar/src/lib.rs

use std::convert::TryInto;

// ...

        //                             👇
        Some(u32::from_be_bytes(buffer.try_into().unwrap()))

        // ...

        //                                       👇
        let mut value = u64::from_le_bytes(buffer.try_into().unwrap());

We can unwrap because we know the conversion will always succeed, so there is no risk of panicking.

Let’s put our efforts so far to the test. Make our frame descriptor debug-able:

// In zrs/czar/src/lib.rs

// 👇
#[derive(Debug)]
struct FrameDescriptor {
    // ...
}

// 👇
#[derive(Debug)]
struct WindowDescriptor {
    // ...
}

so we can print their contents for debugging. Let’s keep it simple:

// In zrs/czar/src/lib.rs

pub fn unpack<T: Read>(input: &mut T) -> Result<Vec<Block>, std::io::Error> {
    let magic = get_magic(input)?;
    let frame_descriptor = get_descriptor(input)?;


    // 👇 Temporary, we'll change that later
    println!("magic = {:?}", magic);
    println!("frame_desc = {:?}", frame_descriptor);
    unimplemented!()
}

Then let’s write a test that runs our hello.txt.zst file through this

// In zrs/czar/src/lib.rs

#[cfg(test)]
mod tests {
    use crate::*;
    #[test]
    fn test_unpack() {
        use std::fs::File;

        let mut f = File::open("../assets/hello.txt.zst").unwrap();
        unpack(&mut f).unwrap();
    }
}

Let’s run this!

$ cargo test

---- tests::test_unpack stdout ----
magic = [40, 181, 47, 253]
frame_desc = FrameDescriptor { size_flag: 0, single_segment_flag: true, checksum_flag: true, dict_id_flag: 0, win_desc: None, dict_id: None, content_size: Some(96) }
thread 'tests::test_unpack' panicked at 'not implemented', czar/src/lib.rs:160:5
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

The panic is completely normal here! We left an unimplemented!() in the body of unpack. So it all goes through and tells us that the decompressed data size is 96 bytes which is exactly the length of our original file. Good!

We also learn that there is a checksum at the end (it doesn’t need to alway be present).

Blocks now

Ok so we need to process blocks now. This will be very similar to the frame descriptor, and even a bit simpler:

// In zrs/czar/src/lib.rs

pub fn get_block<T: Read>(input: &mut T) -> Result<Block, std::io::Error> {
    let mut buffer = vec![0; 3];
    input.read_exact(&mut buffer)?;
    buffer.push(0); // Pad with one zero
    let byte_data = u32::from_le_bytes(buffer.try_into().unwrap());


    let last_block = byte_data & 1 == 1; // Bit 0
    let block_type = match (byte_data >> 1) & 3 {
        // Bits 1-2
        0 => BlockType::Raw,
        1 => BlockType::Rle,
        2 => BlockType::Compressed,
        x => BlockType::UnknownBlockType(x as u8),
    };
    let block_size = byte_data >> 3; // Bits 3-23

    let content_size = match block_type {
        BlockType::Rle => 1,
        _ => block_size,
    };

    // Get block content
    let mut contents = vec![0; content_size as _];
    input.read_exact(&mut contents)?;

    Ok(Block {
        last_block,
        block_type,
        block_size,
        contents,
    })
}

Let’s make out BlockType and Block structs derive the Debug trait, and test this

// In zrs/czar/src/lib.rs

pub fn unpack<T: Read>(input: &mut T) -> Result<Vec<Block>, std::io::Error> {
    let magic = get_magic(input)?;
    let frame_descriptor = get_descriptor(input)?;

    // 👇 Temporary, we'll change that later
    let block = get_block(input)?;

    println!("magic = {:?}", magic);
    println!("frame_desc = {:?}", frame_descriptor);
    // 👇
    println!("block = {:?}", block);

    unimplemented!()
}

We get this:

magic = [40, 181, 47, 253]
frame_desc = FrameDescriptor {
    size_flag: 0,
    single_segment_flag: true,
    checksum_flag: true,
    dict_id_flag: 0,
    win_desc: None,
    dict_id: None,
    content_size: Some(96)
}

block = Block {
    last_block: true,
    block_type: Compressed,
    block_size: 81,
    contents: [194, /* ... */ 10]
}

And no error was caused whatsoever! (except our deliberate unimplemented!). This looks good! Let’s modify unpack to get blocks until we read the last one:

// In zrs/czar/src/lib.rs

pub fn unpack<T: Read>(input: &mut T) -> Result<Vec<Block>, std::io::Error> {
    let magic = get_magic(input)?;
    let frame_descriptor = get_descriptor(input)?;

    let mut blocks = vec![];
    loop {
        // We assume at least one block is present
        let block = get_block(input)?;
        let is_last_block = block.last_block;

        blocks.push(block);
        if is_last_block {
            break;
        }
    }

    // Note: we are currently ignoring the checksum, if any

    Ok(blocks)
}

Tidying up

Are we finished? Well… not quite. There are two things that we haven’t quite handled yet:

We may want to return the frame_descriptor together with blocks, since it contains useful information for decompressing. This is left as an exercise to the reader.
We haven’t checked the “magic” number, which has to match our expected value, and

Nothing too bad, this is easy to do. But what should we do when the check fails?

We should raise an error. But this is not an std::io::Error, it’s something different. Let’s define a struct for our errors:

// In zrs/czar/src/lib.rs

#[derive(Debug)]
pub enum CzarError {
    InvalidMagic,
    IoError(std::io::Error)
}

and replace all other occurences of std::io::Error by CzarError. We just need to tell the compiler how to convert:

// In zrs/czar/src/lib.rs

impl From<std::io::Error> for CzarError {
    fn from(err: std::io::Error) -> Self {
        Self::IoError(err)
    }
}

This is what unpack looks like right now, and it correctly parses our hello.txt.zstfile:

// In zrs/czar/src/lib.rs

pub fn unpack<T: Read>(input: &mut T) -> Result<(FrameDescriptor, Vec<Block>), CzarError> {
    let magic = get_magic(input)?;

    if magic != MAGIC {
        return Err(CzarError::InvalidMagic);
    }

    let frame_descriptor = get_descriptor(input)?;

    let mut blocks = vec![];
    loop {
        // We assume at least one block is present
        let block = get_block(input)?;
        let is_last_block = block.last_block;

        blocks.push(block);
        if is_last_block {
            break;
        }
    }

    // Ignoring checksum if any

    Ok((frame_descriptor, blocks))
}

Exercise for the reader: extract the checksum as well!

Ok we’re done now!

Yes… well no… not quite done yet. But don’t worry, it’s almost nothing. Our ̀czar library can unpack but it cannot pack. Let’s fix that and we’ll be feature-complete, pinky promise.

// In zrs/czar/src/lib.rs

// 👇
use std::io::Write;

// ...

// 👇
pub fn pack<W: Write>(
    output: &mut W,
    frame_descriptor: FrameDescriptor,
    blocks: &[Block],
    checksum: Option<u32>
) -> Result<(), CzarError> {
    // Write magic number
    output.write_all(&MAGIC)?;

    // Write frame descriptor
    write_descriptor(output, frame_descriptor)?;

    for block in blocks {
        write_block(output, block)?;
    }

    if let Some(value) = checksum {
        // Add checksum
        output.write_all(value.to_le_bytes())?;
    }

    output.flush()?;

    Ok(())
}

// 👇
fn write_descriptor<W: Write>(output: &mut W, desc: FrameDescriptor) -> Result<(), CzarError> {
    let to_bit = |b| if b { 1 } else { 0 };

    // Mandatory fields
    let first_byte = desc.dict_id_flag
        + (to_bit(desc.checksum_flag) << 2)
        + (to_bit(desc.single_segment_flag) << 5)
        + (desc.size_flag << 5);

    output.write_all(&[first_byte])?;

    // Window descriptor
    if let Some(value) = desc.win_desc {
        let byte = value.mantissa + value.exponent >> 2;
        output.write_all(&[byte])?;
    }

    // Dictionary ID
    if let Some(value) = desc.dict_id {
        let bytes = value.to_le_bytes();
        // Truncate
        let bytes = &bytes[..desc.dict_id_flag as _];
        output.write_all(bytes)?;
    }

    // Content size
    if let Some(value) = desc.content_size {
        let field_size = 1 << desc.size_flag;
        let bytes = value.to_le_bytes();
        // Truncate
        let bytes = &bytes[..field_size as _];
        output.write_all(bytes)?;
    }

    Ok(())
}

// 👇
fn write_block<W: Write>(output: &mut W, block: &Block) -> Result<(), CzarError> {
    let to_bit = |b| if b { 1 } else { 0 };

    let first_word = to_bit(block.last_block)
        + (match block.block_type {
            BlockType::Raw => 0,
            BlockType::Rle => 1,
            BlockType::Compressed => 2,
            BlockType::UnknownBlockType(x) => x,
        } << 1) as u32
        + (block.block_size << 3);

    // Truncate to 3 bytes
    let bytes = first_word.to_le_bytes();
    let bytes = &bytes[..3];
    output.write_all(bytes)?;

    // Write contents
    output.write_all(&block.contents)?;

    Ok(())
}

There we are! We can now pack and unpack archives! Let’s do a very simple sanity check: we’ll unpack our file, and repack it. We should get the same file.

We can then try to remove the checksum (if present).

// In zrs/czar/src/lib.rs

    // ...

    // 👇
    #[test]
    fn test_unpack_repack() -> Result<(), CzarError> {
        use std::fs::File;

        let mut f = File::open("../assets/hello.txt.zst")?;
        let mut g = File::create("../assets/copycat.txt.zst")?;

        let unpacked = unpack(&mut f)?;

        // Remove checksum, leave rest unchanged
        let (mut frame_descriptor, blocks, _checksum) = unpacked;
        frame_descriptor.checksum_flag = false;

        pack(&mut g, frame_descriptor, &blocks, None)?;

        Ok(())
    }

Let’s try this:

$ cargo test
# No error
$ cd assets
$ diff hello.txt.zst copycat.txt.zst
Binary files hello.txt.zst and copycat.txt.zst differ
$ zstd -d copycat.txt.zst
copycat.txt.zst     : 96 bytes
# Decompressed without error
$ diff copycat.txt hello.txt
# No difference

Now we can build valid Zstandard archives, yay! We can’t really compress yet, though. Or can we? Let’s conclude this post with an example:

// In zrs/czar/src/lib.rs


    // ...

    // 👇
    #[test]
    fn test_rle_pack() -> Result<(), CzarError> {
        use std::fs::File;
        let mut f = File::create("../assets/creation.txt.zst")?;
        let frame_descriptor = FrameDescriptor {
            size_flag: 0,
            single_segment_flag: true,
            checksum_flag: false,
            dict_id_flag: 0,
            win_desc: None,
            dict_id: None,
            content_size: Some(13 + 14 + 15),
        };

        let block1 = Block {
            last_block: false,
            block_type: BlockType::Raw,
            block_size: 13,
            contents: b"Hello world!\n".to_vec(),
        };

        let block2 = Block {
            last_block: false,
            block_type: BlockType::Rle,
            block_size: 14,
            contents: b"=".to_vec(),
        };

        let block3 = Block {
            last_block: true,
            block_type: BlockType::Raw,
            block_size: 15,
            contents: b"Another block!\n".to_vec(),
        };

        pack(&mut f, frame_descriptor, &[block1, block2, block3], None)?;
        Ok(())
    }

This test creates a file that contains three blocks: the Raw blocks are just copied verbatim, whereas the Rle block is expanded to fill block_size. Lo and behold!

$ cargo test
# No error
$ cd assets/
$ zstd -d creation.txt.zst
creation.txt.zst    : 42 bytes
# No error
# A file creation.txt has appeared!
$ cat creation.txt
Hello world!
==============Another block!

We can create perfectly valid Zstandard archives! We’re done! Are we? No?

Oh right, we have to address the inflatable elephant in the room: compression. In the next post we’ll improve a little bit our current czar library, make it nicer and tidier, and start concerning ourselves with preparing actually compressing files. Go to Part II.