succinct
frankmcsherry committed Sep 11, 2024
1 parent 9c4d548 commit a6a0c89
Showing 1 changed file with 278 additions and 0 deletions.
278 changes: 278 additions & 0 deletions posts/2024-09-10.md
## A Fascinating Diversion into Compression

I've recently been working on columnar data representations in Rust.
The intent is that if you have lists of some complex type `T`, say a `struct` or an `enum` or a list itself (`Vec`), you might have better options than storing the types themselves in a list.
The [`columnar` repository](https://github.com/frankmcsherry/columnar) contains the work, and a [previous post](https://github.com/frankmcsherry/blog/blob/master/posts/2024-08-25.md) goes into detail about the process.

What I want to talk about today is the curious case of `Result<S, T>`.

We'll start with a pretty standard intro to how you might represent this in a columnar form, but by the end of the post we'll have developed a way to compress JSON objects.
I was surprised and delighted by the connection, which I stumbled upon more than discovered, and would love to hear more about it if you understand where it comes from.

### Re-orienting `Result` to columnar form

The `Result<S, T>` type is a [sum type](https://en.wikipedia.org/wiki/Tagged_union), each instance of which can either be `Ok(S)` or `Err(T)`.
That is, it is either one type or the other type, but not both. It is different from a pair `(S, T)` which always has both types.
In addition to either `S` or `T`, it also has to communicate *which* of the two it is (they might be the same, or the same size).

The size of a `Result` is determined by the sizes of `S` and `T`, but it clearly needs to be able to hold either of the two, and a spare bit to indicate which it is.
For example, the size of a `Result<u64, i64>` is 16 bytes, but the size of a `Result<u64, u8>` is also 16 bytes.
These might be larger than you expect because of [alignment](https://doc.rust-lang.org/reference/type-layout.html): types in Rust are sized to an integral multiple of their alignment, which is at least the largest alignment among their members (typically 8 bytes for a `u64`).
If you have a sequence of `Result<u64, u8>` you'll need 16 bytes times as many elements as you have, even if many of them are the `u8` variant.
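
As a quick sanity check of those numbers, here is a minimal sketch that asks the compiler directly. The helper names `size_and_align` and `result_vec_bytes` are made up for illustration, and the figures assume a typical 64-bit target; Rust does not promise a particular layout for `Result`.

```rust
use std::mem::{align_of, size_of};

/// (size, alignment) in bytes of a type, as reported by the compiler.
fn size_and_align<T>() -> (usize, usize) {
    (size_of::<T>(), align_of::<T>())
}

/// Bytes of element storage for a `Vec<Result<u64, u8>>` holding `len` elements.
fn result_vec_bytes(len: usize) -> usize {
    len * size_of::<Result<u64, u8>>()
}
```

On such a target both `size_and_align::<Result<u64, i64>>()` and `size_and_align::<Result<u64, u8>>()` should come back as 16 bytes with 8-byte alignment, so a thousand `Result<u64, u8>` elements occupy 16,000 bytes no matter how many of them are the `u8` variant.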

An alternate way to represent the same information as a sequence of results is to [demultiplex](https://en.wikipedia.org/wiki/Multiplexer) the two variants into two sequences, leaving enough information to re-interleave them.
```rust
/// Replacement for `Vec<Result<S, T>>`.
struct ResultColumns<S, T> {
    /// Variant, and corresponding offset.
    indexes: Vec<Result<usize, usize>>,
    /// The `S` variants in order.
    s_store: Vec<S>,
    /// The `T` variants in order.
    t_store: Vec<T>,
}
```
In this example, the backing data are stored in `s_store` and `t_store`, and elements can be retrieved from the information in `indexes`:
```rust
impl<S, T> ResultColumns<S, T> {
    /// Not quite `&Result<S, T>` but pretty close.
    fn get(&self, index: usize) -> Result<&S, &T> {
        // Each entry names the variant and the position in that variant's store.
        match self.indexes[index] {
            Ok(index) => Ok(&self.s_store[index]),
            Err(index) => Err(&self.t_store[index]),
        }
    }
}
```

This pattern allows us to store `S` and `T` records separately, but with some overhead.
Each element also requires a `Result<usize, usize>`, which .. is the same size as the hypothetical `Result<u64, u8>` we started with.
So that's not great.
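
To see the pieces fit together, here is a small self-contained sketch of the same container with a `push` to go with `get`. The `new` and `push` methods are assumptions added for illustration; they are not taken from the `columnar` crate.

```rust
/// Replacement for `Vec<Result<S, T>>`, following the layout described above.
struct ResultColumns<S, T> {
    /// Variant, and corresponding offset into the matching store.
    indexes: Vec<Result<usize, usize>>,
    s_store: Vec<S>,
    t_store: Vec<T>,
}

impl<S, T> ResultColumns<S, T> {
    fn new() -> Self {
        Self { indexes: Vec::new(), s_store: Vec::new(), t_store: Vec::new() }
    }

    /// Hypothetical `push`: route the payload to its store and record where it went.
    fn push(&mut self, item: Result<S, T>) {
        match item {
            Ok(s) => {
                self.indexes.push(Ok(self.s_store.len()));
                self.s_store.push(s);
            }
            Err(t) => {
                self.indexes.push(Err(self.t_store.len()));
                self.t_store.push(t);
            }
        }
    }

    /// Re-interleave: look up the variant and offset, then borrow from that store.
    fn get(&self, index: usize) -> Result<&S, &T> {
        match self.indexes[index] {
            Ok(i) => Ok(&self.s_store[i]),
            Err(i) => Err(&self.t_store[i]),
        }
    }
}
```

Pushing `Ok(10)`, `Err(1)`, `Err(2)`, `Ok(20)` leaves two values in each store, and `get` hands them back in the original interleaved order.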

### Succinter data structures

It turns out that `indexes` stores way more information than we strictly need.
We need to be able to see which variant, `Ok` or `Err`, an element is, and figure out where to find it in the corresponding store.

We could alternately use a [bit vector](https://en.wikipedia.org/wiki/Bit_array), a bit like a `Vec<bool>`, to record the information in `indexes`.
To implement `get(index)` we would look in the bit vector to determine which variant is at that location.
Then, to determine where to find it .. ah .. we could count the number of occurrences of that bit that precede the one we found.
```rust
impl<S, T> ResultColumns<S, T> {
    /// Not quite `&Result<S, T>` but pretty close.
    fn get(&self, index: usize) -> Result<&S, &T> {
        // `indexes` is now a `Vec<bool>`: `false` for `Ok`, `true` for `Err`.
        let bit = self.indexes[index];
        // Count the earlier elements of the same variant.
        let pos = self.indexes
            .iter()
            .take(index)
            .filter(|x| **x == bit)
            .count();
        match bit {
            false => Ok(&self.s_store[pos]),
            true => Err(&self.t_store[pos]),
        }
    }
}
```

And you are probably already screaming about how inefficient this is.

This implementation is quite snug with respect to space.
Beyond the `S` and `T` data itself, we store one additional bit for every record.
It's hard to imagine using any less, but it's also hard to imagine going any slower than scanning all of `self.indexes`.
The good news is that if we allow the memory to creep up just a .. bit .. we can recover random access.

In the field of [succinct data structures](https://en.wikipedia.org/wiki/Succinct_data_structure) folks study among other things the problem of succinct indexable dictionaries.
We will think of these as bit sequences, with additional support for a `rank` function that says for each position how many bits are set before it.
It turns out, this is exactly what we want: the `rank` function (combined with `index`) tells us exactly where to find our data!
If we make the arbitrary decision that `1` corresponds to `S`, then it looks like this:
```rust
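impl<S, T> ResultColumns<S, T> {
    /// Not quite `&Result<S, T>` but pretty close.
    fn get(&self, index: usize) -> Result<&S, &T> {
        let bit = self.indexes[index];
        // Number of set bits (`S` variants) before `index`.
        let rank = self.indexes.rank(index);
        match bit {
            // A set bit is an `S`: `rank` earlier `S` values precede it.
            true => Ok(&self.s_store[rank]),
            // An unset bit is a `T`: the other `index - rank` earlier values are `T`s.
            false => Err(&self.t_store[index - rank]),
        }
    }
}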
```
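
To make the `rank` / `index - rank` arithmetic concrete, here is a minimal sketch over a plain slice of `bool`s. The functions `naive_rank` and `locate` are hypothetical names, and the rank here is computed by a linear scan rather than by a succinct structure.

```rust
/// Number of set bits strictly before `index` (a naive, linear-time `rank`).
fn naive_rank(bits: &[bool], index: usize) -> usize {
    bits[..index].iter().filter(|b| **b).count()
}

/// Where element `index` lives: `Ok(pos)` in the `S` store, `Err(pos)` in the `T` store.
/// Assumes a set bit marks an `S`, matching the convention in the text.
fn locate(bits: &[bool], index: usize) -> Result<usize, usize> {
    let ones = naive_rank(bits, index);
    if bits[index] {
        Ok(ones)
    } else {
        // Everything before `index` that is not an `S` must be a `T`.
        Err(index - ones)
    }
}
```

For the bits `1, 0, 1, 1, 0` the five elements land at `Ok(0)`, `Err(0)`, `Ok(1)`, `Ok(2)`, `Err(1)`: the last element has three ones before it, so it is the second `T`.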

The cost of these succinct indexable dictionaries is "barely more than a bit".
Formally, it needs to be `1 + o(1)` bits for each element, so in the limit basically just a bit.

They are pretty complicated, so I implemented one that has exactly two bits for each element.
This is *not* "succinct" in the technical sense, but instead "compact", according to Wikipedia.
Whatever it is, you can draw the overhead down close to one bit, at some cost (Guy Jacobson's PhD thesis is ~$40 from ProQuest).

### Compacter data structures

I thought I'd talk through my implementation, which is not very smart but is very easy.
```rust
struct CompactBits {
    counts: Vec<u64>,   // counts ones in preceding `values`.
    values: Vec<u64>,   // contains bits packed together.
    last_word: u64,     // in-progress bits, not yet 64 of them.
    last_bits: u8,      // number of in-progress bits.
}
```
Without going into great detail, we pack the bits 64 at a time into `u64` words, and put those in a list (`values`).
At the same time, we also maintain the running sum of the number of ones in these blocks (`counts`).
There may be some bits that aren't cleanly packed into a `u64`, and we hold on to these separately.

To randomly access bits we look things up either in `values` or in `last_word`.
To determine the `rank`, we get the running sum from `counts`, and then count various bits in the word holding our bit.
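
The description above can be sketched in a few dozen lines. This is a guess at the shape of such an implementation, not the code from the repository. In particular it assumes `counts[i]` holds the number of ones in `values[..=i]` (an inclusive running sum), and the `push`, `get`, `rank`, and `word` methods are names chosen for illustration.

```rust
/// Bits with `rank` support, using the fields described in the text.
#[derive(Default)]
struct CompactBits {
    counts: Vec<u64>, // assumption: counts[i] = ones in values[..=i]
    values: Vec<u64>, // completed 64-bit words
    last_word: u64,   // in-progress bits, fewer than 64 of them
    last_bits: u8,    // number of in-progress bits
}

impl CompactBits {
    fn push(&mut self, bit: bool) {
        self.last_word |= (bit as u64) << self.last_bits;
        self.last_bits += 1;
        if self.last_bits == 64 {
            // Seal the word and extend the running sum of ones.
            let ones = self.counts.last().copied().unwrap_or(0)
                + u64::from(self.last_word.count_ones());
            self.counts.push(ones);
            self.values.push(self.last_word);
            self.last_word = 0;
            self.last_bits = 0;
        }
    }

    /// The word holding block `block`: a sealed word, or the in-progress one.
    fn word(&self, block: usize) -> u64 {
        self.values.get(block).copied().unwrap_or(self.last_word)
    }

    fn get(&self, index: usize) -> bool {
        (self.word(index / 64) >> (index % 64)) & 1 == 1
    }

    /// Number of set bits strictly before `index`.
    fn rank(&self, index: usize) -> usize {
        let block = index / 64;
        // Ones in all earlier words come from the running sum.
        let before = if block == 0 { 0 } else { self.counts[block - 1] };
        // Ones below our position within the word come from a masked popcount.
        let mask = (1u64 << (index % 64)) - 1;
        before as usize + (self.word(block) & mask).count_ones() as usize
    }
}
```

Both `get` and `rank` touch one entry of `counts` and one word, so they run in constant time regardless of how many bits have been pushed.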

As you can see, it's pretty simple, and only two bits for each element because there is one `u64` count for every `u64` of bits.
In fact, we could drop down to 1.5 bits by just using `u32` for the running counts, because .. maybe we don't plan to hold more than 4B elements?
Let's not do that now, but perhaps by the end of the post you'll have a better idea.

### Adaptive Dictionary Compression

We are about to escalate things, but it's actually a very easy and pleasant path to take.
Don't get stressed by the intimidating section heading!

When we implemented our `get(index)` function we were able to find our record in either variant storage.
We did this by looking up its position, and then looking in the storage.
You might have noticed that we either used `rank` or `index - rank`, depending on which variant it was.
The one that corresponds to our variant .. was the answer we wanted! Tada!

What is at the other location, though?

For any `index` that results in `rank`, both `rank` and `index - rank` are positions, one in `s_store` and one in `t_store`.
One of the two holds the data we indexed to find.
The other one sits .. right after the most recent other variant in the list?
What could you possibly want to do with that?

Let's imagine we have a sequence of `S` that we thought might have repetitions in it.
There are probably some interesting ways to encode this, but here's a really easy one using `ResultColumns<S, ()>`:
1. If the item is not a repeat of the item before it, insert `Ok(item)`.
2. If the item is a repeat of the item before it, insert `Err(())`.

We inserted repeats using the almost meaningless `Err(())`, an error variant containing the empty tuple (which takes no space and stores no information).
We inserted non-repeats using `Ok(item)`, which will land `item` into `s_store`.
If we go and look things up the normal way, we'll find the non-repeats, and find nothing meaningful for the repeats.

Instead let's look things up a non-normal way.
1. If we find an `Ok(item)` we will produce `item`.
2. If we find an `Err(())` we will .. instead look up the most recent `Ok(item)` at the time of the index's insertion and return `item`.

When we find the `Err` variant we actually find the data we are looking for in the other store.
The *encoding* used the most recent value to lead us to choose the error variant, and we can decode with the same context.

The `ResultColumns<S, ()>` ends up storing two bits per element, and only the deduplicated `S` values (although only removing adjacent duplicates).
That's potentially pretty handy compression, and we didn't even have to invent anything to do it.
It just sort of happens.
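
Here is a minimal sketch of that adjacent-duplicate scheme. It assumes a plain `Vec<bool>` in place of the rank-supporting bits (so `get` scans linearly), and `DedupColumns` is a hypothetical name rather than anything in `columnar`.

```rust
/// Adjacent-duplicate removal in the spirit of `ResultColumns<S, ()>`.
struct DedupColumns<S> {
    bits: Vec<bool>, // `true` marks `Ok(item)`, `false` marks `Err(())`
    s_store: Vec<S>, // only the non-repeated items
}

impl<S: PartialEq> DedupColumns<S> {
    fn new() -> Self {
        Self { bits: Vec::new(), s_store: Vec::new() }
    }

    fn push(&mut self, item: S) {
        if self.s_store.last() == Some(&item) {
            // A repeat of the item before it: record only a bit.
            self.bits.push(false);
        } else {
            self.bits.push(true);
            self.s_store.push(item);
        }
    }

    fn get(&self, index: usize) -> &S {
        // Naive rank: set bits strictly before `index`.
        let rank = self.bits[..index].iter().filter(|b| **b).count();
        if self.bits[index] {
            &self.s_store[rank]
        } else {
            // A repeat: the most recent `Ok` sits just before position `rank`.
            // The first element is never a repeat, so `rank` is at least one here.
            &self.s_store[rank - 1]
        }
    }
}
```

Pushing `a, a, b, b, b, a` stores only `a, b, a`, yet every one of the six positions still reads back correctly.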

### Adaptive Dictionary Compression, part 2

Deduplication isn't exactly dictionary compression, so let's fix that.

```rust
ResultColumns<S, u8>
```

Boom.
Ok.
Done here.

Ah, in more detail, then.
Rather than use `Err(())` to encode "direct repetition", we'll use `Err(offset)` to indicate "recent repetition".
The offset will tell us how far back to go in `s_store` to find our value.
An offset of zero indicates "the previous value" and an offset of 10 indicates "go back ten values".

```rust
/// Insert an item by first checking the previous 256.
fn push(&mut self, item: S) {
    // Look backwards for a matching value, most recent first.
    let offset = self.inner
        .s_store
        .iter()
        .rev()
        .take(256)
        .position(|x| *x == item);

    if let Some(back) = offset {
        // `back` is below 256, so it fits in a `u8`.
        self.inner.push(Err(back as u8));
    } else {
        self.inner.push(Ok(item));
    }
}
```

Although it may look like we are just looking at the current `s_store` for values, we'll be able to return to this exact point with `index` and `rank`.
```rust
/// Retrieve an item reference by index.
fn get(&self, index: usize) -> &S {
    match self.inner.get(index) {
        Ok(item) => item,
        Err(back) => {
            // Go backwards from `s_store` at time of insertion.
            // `rank` counts the `S` values pushed before `index`; the newest is at `rank - 1`.
            let pos = self.inner.indexes.rank(index) - 1;
            &self.inner.s_store[pos - (*back as usize)]
        },
    }
}
```

We've encoded each element in a sequence of `S` as a `u8` back-reference to a matching element or, if that fails, as the element itself.
Ideally this is often a `u8` rather than whatever `S` is.
Informally, we are using the previous 256 distinct `S` values as a dictionary (yes distinct, because how could they repeat?).
As the sequence moves along, the dictionary we use adapts, admittedly in a primitive but not unhelpful way.
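
Putting `push` and `get` together, a self-contained sketch might look like the following. It flattens `ResultColumns<S, u8>` into its three parts and uses a plain `Vec<bool>` with a linear-scan rank, so it shows the encoding rather than the performance. The name `Lookback` is an assumption made for this example.

```rust
/// Lookback encoding in the spirit of `ResultColumns<S, u8>`.
struct Lookback<S> {
    bits: Vec<bool>,  // `true` for a stored value, `false` for a back-reference
    s_store: Vec<S>,  // values that had no recent match
    t_store: Vec<u8>, // how far back the match was, counted from the newest value
}

impl<S: PartialEq> Lookback<S> {
    fn new() -> Self {
        Self { bits: Vec::new(), s_store: Vec::new(), t_store: Vec::new() }
    }

    fn push(&mut self, item: S) {
        // Search the most recent 256 stored values, newest first.
        let found = self.s_store.iter().rev().take(256).position(|x| *x == item);
        match found {
            Some(back) => {
                self.bits.push(false);
                self.t_store.push(back as u8); // `back` is below 256
            }
            None => {
                self.bits.push(true);
                self.s_store.push(item);
            }
        }
    }

    fn get(&self, index: usize) -> &S {
        // Stored values pushed strictly before `index`.
        let rank = self.bits[..index].iter().filter(|b| **b).count();
        if self.bits[index] {
            &self.s_store[rank]
        } else {
            // The newest value at insertion time was at `rank - 1`; step back from there.
            let back = self.t_store[index - rank] as usize;
            &self.s_store[rank - 1 - back]
        }
    }
}
```

Feeding it `id, login, id, url, login, id` stores three strings and three one-byte offsets (`1, 1, 2`), and each position decodes against the state of `s_store` as it was when that element was pushed.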

### Trying it out

I have a bunch of JSON that I downloaded from the internet, and am trying to get to work with `columnar`.
One thing you might know about JSON is that a JSON value can be an "object": a map from strings to other JSON values.
Commonly, these strings are field or attribute names, and they do not always demonstrate a rich heterogeneity.

Without wanting to overwhelm you with detail (JSON is structurally recursive, and columnarization is complicated), let's look at how we might store objects.
Our columnar JSON container has a member that initially looks like:

```rust
/// Columnar representation of `Vec<(String, Value)>`.
objects: VecColumns<(StringColumns, Vec<ValueIdx>)>,
```

There are a few details here, but the important ones are:
1. `StringColumns` stores as many strings as you like, by concatenating their bytes and recording offsets.
2. `VecColumns` stores as many lists of things as you like, by concatenating the lists and recording offsets.
3. There is a sneaky `( , )` combinator in there that stores pairs of things in pairs of storage.

The tl;dr is that all of the `String`s across all of the objects will be packed in sequence in the `StringColumns`.
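
For a feel of what "concatenating their bytes and recording offsets" means, here is a rough sketch. `ConcatStrings` is a made-up stand-in, not the actual `StringColumns` type, and it assumes end offsets are recorded as `usize`.

```rust
/// Many strings stored as one byte buffer plus end offsets.
#[derive(Default)]
struct ConcatStrings {
    bytes: Vec<u8>,   // all string bytes, back to back
    ends: Vec<usize>, // ends[i] is where string `i` stops in `bytes`
}

impl ConcatStrings {
    fn push(&mut self, s: &str) {
        self.bytes.extend_from_slice(s.as_bytes());
        self.ends.push(self.bytes.len());
    }

    fn get(&self, index: usize) -> &str {
        // A string starts where the previous one ended.
        let start = if index == 0 { 0 } else { self.ends[index - 1] };
        let end = self.ends[index];
        std::str::from_utf8(&self.bytes[start..end])
            .expect("only whole `&str` values are pushed")
    }

    fn len(&self) -> usize {
        self.ends.len()
    }
}
```

There is one allocation for all of the text and one for the offsets, instead of one allocation per `String`, and every repeated key such as `"id"` still costs its full length in `bytes`.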

If you load up the 11,351 JSON records (26.1MB), you end up with `9,460,926` bytes in the `StringColumns`.
These are unsurprisingly largely repetitions of things like `"id"`, `"login"`, and `"gravatar_id"`.
Ideally these will compress up nicely.

We'll make a minor modification to the type, wrapping `StringColumns` with a `LookbackColumns` wrapper that employs the techniques above.
```rust
/// Columnar representation of `Vec<(String, Value)>`.
objects: VecColumns<(LookbackColumns<StringColumns>, Vec<ValueIdx>)>,
```
This ends up with `772,022` bytes of data in the `LookbackColumns`.
Of those bytes, almost none are text.
They break down as:
* `indexes`: 153,840 bytes (two bits per element)
* `s_store`: 2,958 bytes of text (all non-repeat elements)
* `t_store`: 615,224 bytes (one byte per repeat element)

The bytes in `t_store` are surprisingly varied.
They go as large as 168, for which there are 10,162 entries (out of at least 615,224 strings used as names in objects).
Clearly this will depend on your data, as will the efficacy of the approach generally.
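
As a small arithmetic check on the breakdown above, the three parts do add up to the reported total, and the ratio against the uncompressed `StringColumns` comes out a little over 12x. The constants are copied from the text; the function names are just for illustration.

```rust
/// Byte counts reported in the text.
const INDEXES: u64 = 153_840;
const S_STORE: u64 = 2_958;
const T_STORE: u64 = 615_224;
const UNCOMPRESSED: u64 = 9_460_926;

/// Total bytes in the `LookbackColumns`.
fn lookback_total() -> u64 {
    INDEXES + S_STORE + T_STORE
}

/// How many times smaller the lookback form is than the plain strings.
fn compression_ratio() -> f64 {
    UNCOMPRESSED as f64 / lookback_total() as f64
}
```

Most of what remains is the one byte per repeat in `t_store`, which is where any further savings would have to come from.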

### Rounding up

I basically stumbled on the approach above while poking around learning about succinct data structures (which the thing I implemented is not).
I was impressed by how little overhead there can be to recording variants of different sizes, given how accustomed I have become to the bloat in Rust, and all without having to give up random access.

But I was more flabbergasted by the ability to use the *other variants* as a way of compressing information.
Just recording offsets backwards results in a natural dictionary encoding that requires no auxiliary structures or code or what have you.
It just kind of works.

I have a few other things planned, to try and fit other techniques into the same framework.
For example, many of the string and vector offsets are (or are nearly) linear functions of `index`.
This happens when you are actually depositing fixed lengths that are not known *a priori*.
It seems entirely reasonable to encode the offsets as `Result<usize, u8>` indicating either a fixed position, or an edit to a linear interpolation from the previous fixed position.
Or (goodness) some other more complex interpolation from the previous several points.
