succinct

frankmcsherry · Sep 11, 2024 · a6a0c89 · a6a0c89
1 parent 9c4d548
commit a6a0c89
Showing 1 changed file with 278 additions and 0 deletions.
diff --git a/posts/2024-09-10.md b/posts/2024-09-10.md
@@ -0,0 +1,278 @@
+## A Fascinating Diversion into Compression
+
+I've recently been working on columnar data representations in Rust.
+The intent is that if you have lists of some complex type `T`, say a `struct` or an `enum` or a list itself (`Vec`), you might have better options than storing the types themselves in a list.
+The [`columnar` repository](https://github.com/frankmcsherry/columnar) contains the work, and a [previous post](https://github.com/frankmcsherry/blog/blob/master/posts/2024-08-25.md) goes in to detail about the process.
+
+What I want to talk about today is the curious case of `Result<S, T>`.
+
+We'll start with a pretty standard intro to how you might represent this in a columnar form, but by the end of the post we'll have developed a way to compress JSON objects.
+I was suprised and delighted by the connection, which I stumbled upon more than discovered, and would love to hear more about if you understand where this came from.
+
+### Re-orienting `Result` to columnar form
+
+The `Result<S, T>` type is a [sum type](https://en.wikipedia.org/wiki/Tagged_union), each instance of which can either be `Ok(S)` or `Err(T)`.
+That is, it is either one type or the other type, but not both. It is different from a pair `(S, T)` which always has both types.
+In addition to either `S` or `T`, it also has to communicate *which* of the two it is (they might be the same, or the same size).
+
+The size of a `Result` is determined by the sizes of `S` and `T`, but it clearly needs to be able to hold either of the two, and a spare bit to indicate which it is.
+For example, the size of a `Result<u64, i64>` is 16 bytes, but the size of a `Result<u64, u8>` is also 16 bytes.
+These might be larger than you expect because of [alignment](https://doc.rust-lang.org/reference/type-layout.html): types in Rust end up sized at least to an integral multiple of their alignment, which is generally at least the size of the largest member.
+If you have a sequence of `Result<u64, u8>` you'll need as 16 bytes times as many elements as you have, even if many of them are the `u8` variant.
+
+An alternate way to represent the same information as a sequence of results, is to [demultiplex](https://en.wikipedia.org/wiki/Multiplexer) the two variants into two sequences, leaving enough information to re-interleave them.
+```rust
+/// Replacement for `Vec<Result<S, T>>`.
+struct ResultColumns<S, T> {
+    /// Variant, and corresponding offset.
+    indexes: Vec<Result<usize, usize>>,
+    /// The `S` variants in order.
+    s_store: Vec<S>,
+    /// The `T` variants in order.
+    t_store: Vec<T>,
+}
+```
+In this example, the backing data are stored in `s_store` and `t_store`, and elements can be retrieved from the information in `indexes`:
+```rust
+impl<S, T> ResultColumns<S, T> {
+    /// Not quite `&Result<S, T>` but pretty close.
+    fn get(&self, index: usize) -> Result<&S, &T> {
+        match self.indexes[index] {
+            Ok(index) => &self.s_store[index],
+            Err(index) => &self.t_store[index],
+        }
+    }
+}
+```
+
+This pattern allows us to store `S` and `T` records separately, but with some overhead.
+Each element also requires a `Result<usize, usize>`, which .. is the same size as the hypothetical `Result<u64, u8>` we started with.
+So that's not great.
+
+### Succinter data structures
+
+It turns out that `indexes` stores way more information than we strictly need.
+We need to be able to see which variant, `Ok` or `Err`, an element is, and figure out where to find it in the corresponding store.
+
+We could alternately use a [bit vector](https://en.wikipedia.org/wiki/Bit_array), a bit like a `Vec<bool>`, to record the information in `indexes`.
+To implement `get(index)` we would look in the bit vector to determine which variant is at that location.
+Then, to determine where to find it .. ah .. we could count the number of occurences of that bit that precedes the one we found.
+```rust
+impl<S, T> ResultColumns<S, T> {
+    /// Not quite `&Result<S, T>` but pretty close.
+    fn get(&self, index: usize) -> Result<&S, &T> {
+        let bit = self.indexes[index];
+        let pos = self.indexes
+                      .iter()
+                      .take(index)
+                      .filter(|x| == bit)
+                      .count();
+        match bit {
+            0 => Ok(&self.s_store[pos]),
+            1 => Err(&self.t_store[pos]),
+        }
+    }
+}
+```
+
+And you are probably already screaming about how inefficient this is.
+
+This implementation is quite snug with respect to space.
+Beyond the `S` and `T` data itself, we store one additional bit for every record.
+It's hard to imagine using any less, but it's also hard to imagine going any slower than scanning all of `self.indexes`.
+The good news is that if we allow the memory to creep up just a .. bit .. we can recover random access.
+
+In the field of [succinct data structures](https://en.wikipedia.org/wiki/Succinct_data_structure) folks study among other things the problem of succinct indexable dictionaries.
+We will think of these as bit sequences, with additional support for a `rank` function that says for each position how many bits are set before it.
+It turns out, this is exactly what we want: the `rank` function (combined with `index`) tells us exactly where to find our data!
+If we make the arbitrary decision that `1` corresponds to `S`, then it looks like this:
+```rust
+impl<S, T> ResultColumns<S, T> {
+    /// Not quite `&Result<S, T>` but pretty close.
+    fn get(&self, index: usize) -> Result<&S, &T> {
+        let bit = self.indexes[index];
+        let rank = self.indexes.rank(index);
+        match bit {
+            0 => Ok(&self.s_store[rank]),
+            1 => Err(&self.t_store[index - rank]),
+        }
+    }
+}
+```
+
+The cost of these succinct indexable dictionaries is "barely more than a bit".
+Formally, it needs to be `1 + o(1)` for each element, so in the limit basically just a bit.
+
+They are pretty complicated, so I implemented one that has exactly two bits for each element.
+This is *not* "succinct" in the technical sense, but instead "compact", according to Wikipedia.
+Whatever it is, you can draw the overhead down close to one bit, at some cost (Guy Jacobson's PhD thesis is ~$40 from ProQuest).
+
+### Compacter data structures
+
+I thought I'd talk through my implementation, which is not very smart but is very easy.
+```rust
+struct CompactBits {
+    counts: Vec<u64>,   // counts ones in preceding `values`.
+    values: Vec<u64>,   // contains bits packed together.
+    last_word: u64,     // in-progress bits, not yet 64 of them.
+    last_bits: u8,      // number of in-progress bits.
+}
+```
+Without going in to great detail, we pack all the bits into a `u64`, and put those in a list (`values`).
+At the same time, we also maintain the running sum of the number of ones in these blocks (`counts`).
+There may be some bits that aren't cleanly packed into a `u64`, and we hold on to these separately.
+
+To randomly access bits we look things up either in `values` or in `last_word`.
+To determine the `rank`, we get the running sum from `counts`, and then count various bits in the word holding our bit.
+
+As you can see, it's pretty simple and only two bits for each element because the counts pace the bits themselves.
+In fact, we could drop down to 1.5 bits by just using `u32` for the running counts, because .. maybe we don't plan to hold more than 4B elements?
+Let's not do that now, but perhaps by the end of the post you'll have a better idea.
+
+### Adaptive Dictionary Compression
+
+We are about to escalate things, but it's actually a very easy and pleasant path to take.
+Don't get stressed by the intimidating section heading!
+
+When we implemented our `get(index)` function we were able to find our record in either variant storage.
+We did this by looking up its position, and then looking in the storage.
+You might have noticed that we either used `rank` or `index - rank`, depending on which variant it was.
+The one that corresponds to our variant .. was the answer we wanted! Tada!
+
+What is at the other location, though?
+
+For any `index` that results in `rank`, both `s_store[rank]` and `t_store[index - rank]` contain data.
+One of the two of them is the data we indexed to find.
+The other one of them is .. the most recent other variant in the list?
+What could you possibly want to do with that?
+
+Let's imagine we have a sequence of `S` that we thought might have repetitions in it.
+There are probably some interesting ways to encode this, but here's a really easy one using `ResultColumns<S, ()>`:
+1. If the item is not a repeat of the item before it, insert `Ok(item)`.
+2. If the item is a repeat of the item before it, insert `Err(())`.
+
+We inserted repeats using the almost meaningless `Err(())`, an error variant containing the empty tuple (which takes no space and stores no information).
+We inserted non-repeats using `Ok(item)`, which will land `item` into `s_store`.
+If we go and look things up the normal way, we'll find the non-repeats, and find nothing meaningful for the repeats.
+
+Instead let's look things up a non-normal way.
+1. If we find an `Ok(item)` we will produce `item`.
+2. If we find an `Err(())` we will .. instead look up the most recent `Ok(item)` at the time of the index's insertion and return `item`.
+
+When we find the `Err` variant we actually find the data we are looking for in the other store.
+The *encoding* used the most recent value to lead us to choose the error variant, and we can decode with the same context.
+
+The `ResultColumns<S, ()>` ends up storing two bits per element, and only the deduplicated `S` values (although only removing adjacent duplicates).
+That's potentially pretty handy compression, and we didn't even have to invent anything to do it.
+It just sort of happens.
+
+### Adaptive Dictionary Compression, part 2
+
+Deduplication isn't exactly dictionary compression, so let's fix that.
+
+```rust
+ResultColumns<S, u8>
+```
+
+Boom.
+Ok.
+Done here.
+
+Ah, in more detail, then.
+Rather than use `Err(())` to encode "direct repetition", we'll use `Err(offset)` to indicate "recent repetition".
+The offset will tell us how far back to go in `s_store` to find our value.
+An offset of zero indicates "the previous value" and an offset of 10 indicates "go back ten values".
+
+```rust
+/// Insert an item by first checking the previous 256.
+fn push(&mut self, item: S) {
+    // Look backwards for a matching value.
+    let offset = self.inner
+                     .s_store
+                     .iter()
+                     .take(256)
+                     .position(|x| x == item);
+
+    if let Some(back) = offset {
+        self.inner.push(Err(back));
+    } else {
+        self.inner.push(Ok(item));
+    }
+}
+```
+
+Although it may look like we are just looking at the current `s_store` for values, we'll be able to return to this exact point with `index` and `rank`.
+```rust
+/// Retrieve an item reference by index.
+fn get(&self, index: usize) -> &S { 
+    match self.inner.get(index) {
+        Ok(item) => item,
+        Err(back) => {
+            // Go backwards from `s_store` at time of insertion.
+            let pos = self.inner.indexes.rank(index) - 1;
+            self.inner.s_store.get(pos - (*back as usize))
+        },
+    }
+}
+```
+
+We've encoded elements in a sequence of `S` with (ideally) a `u8` back-reference to a matching element, or if that fails then the element itself.
+Ideally this is often a `u8` rather than whatever `S` is.
+Informally, we are using the previous 256 distinct `S` values as a dictionary (yes distinct, because how could they repeat?).
+As the sequence moves along, the dictionary we use adapts, admittedly in a primitive but not unhelpful way.
+
+### Trying it out
+
+I have a bunch of JSON that I downloaded from the internet, and am trying to get to work with `columnar`.
+One thing you might know about JSON is that a JSON value can be an "object": a map from strings to other JSON values.
+Commonly, these strings are field or attribute names, and they do not always demonstrate a rich heterogeneity.
+
+Without wanting to overwhelm you with detail (JSON is structurally recursive, and columnarization is complicated), let's look at how we might store objects.
+Our columnar JSON container has a member that initially looks like:
+
+```rust
+/// Columnar representation of `Vec<(String, Value)>`.
+objects: VecColumns<(StringColumns, Vec<ValueIdx>)>,
+```
+
+There are a few details here, but the important ones are:
+1. `StringColumns` stores as many strings as you like, by concatenating their bytes and recording offsets.
+2. `VecColumns` stores as many list of things as you like, by concatenating the lists and recording offsets.
+3. There is a sneaky `( , )` combinator in there that stores pairs of things in pairs of storage.
+
+The tl;dr is that all of the `String`s across all of the objects will be packed in sequence in the `StringColumns`.
+
+If you load up the 11,351 JSON records (26.1MB), you end up with `9,460,926` bytes in the `StringColumns`.
+These are unsurprisingly largely repetitions of things like `"id"`, `"login"`, and `"gravatar_id"`.
+Ideally these will compress up nicely.
+
+We'll make a minor modification to the type, wrapping `StringColumns` with a `LookbackColumns` wrapper that employs the techniques above.
+```rust
+/// Columnar representation of `Vec<(String, Value)>`.
+objects: VecColumns<(LookbackColumns<StringColumns>, Vec<ValueIdx>)>,
+```
+This ends up with `772,022` bytes data in the `LookbackColumns`.
+Of those bytes, almost none are text.
+They break down as:
+* `indexes`: 153,840 bytes (two bits per element)
+* `s_store`:   2,958 bytes text (all non-repeat elements)
+* `t_store`: 615,224 bytes (one byte per repeat element)
+
+The bytes in `t_store` are surprisingly varied.
+They go as large as 168, for which there are 10,162 entries (out of at least 615,224 strings used as names in objects).
+Clearly this will depend on your data, as will the efficacy of the approach generally.
+
+### Rounding up
+
+I basically stumbled on the approach above while poking around learning about succinct data structures (which the thing I implemented is not).
+I was impressed by how little overhead there can be to recording variants of different sizes, given how accustomed to it I have become to the bloat in Rust, and all without having to give up random access.
+
+But I was more flabbergasted by the ability to use the *other variants* as a way of compressing information.
+Just recording offsets backwards results in a natural dictionary encoding that requires no auxiliary structures or code or what have you.
+It just kind of works.
+
+I have a few other things planned, to try and fit other techniques into the same framework.
+For example, many of the string and vector offsets are (or are nearly) linear functions of `index`.
+This happens when you are actually depositing fixed lengths that are not know *a priori*.
+It seems entirely reasonable to encode the offsets as `Result<usize, u8>` indicating either a fixed position, or an edit to a linear interpolation from the previous fixed position.
+Or (goodness) some other more complex interpolation from the previous several points.