Suggest new encoding scheme #2686

antonlerner · 2021-08-17T15:47:08Z

Description

Currently, XDR codec scheme is implemented in a way that is very CPU and memory intensive, causing nodes to allocate a lot of memory and eventually crash.
We want to explore more efficient codecs .
Codecs must meet the following pre requisites:

non ambiguity: every implementation of the spec will give the same encoding
deterministic: same struct will be serialised the same way at all times
cross platform support: since we want to allow implementation in many languages, codec library must be supported by different languages
robust and "battle tested"

In addition to the "must haves" we want our chosen library to be CPU and memory efficient.

dshulyak · 2021-09-07T06:07:47Z

So far I have reviewed several options:

ssz

It is a relatively simple non-self describing binary encoding, designed to replace RLP in ethereum2.
Fixed length items are encoded in the order of fields, with integers as a little endian.
Dynamic length items are written at the offset after all fixed-length items.

Example:

type Example struct {
    Field1 uint32
    Field2 [20]byte
    Field3 []byte // dynamic byte array
    Field4 [32]byte
}

Encoding is a concantenation from: little_endian(field1) || field2 || little_endian(offset(4+20+4+32)) || field4 || field3.
Offset is computed from fixed-length fields, offsets, and length for dynamic fields that precede other dynamic fields.

This is an optimization to support resource-restricted environments so that they can decode any field without allocating memory for the whole struct. The primary target is an EVM, so we may not need it. The pros of this optimization are questionable for go-spacemesh, as SVM (which may have potentially have used this optimization) declares its own encoding.

Besides, basic types (fixed/dynamic byte slices, uint8-64, bool, structs) ssz supports:

bitlist, bitvector (packing multiple boolean variables as bits efficiently)
union type

Another part of ssz specifies a particular merkleization scheme together with an encoding, so that every object in the protocol can be hashed accordingly.

There are libraries in many languages (go, rust, java, nim, python, js) that are actively supported because they are used in eth2 clients. I have used this one https://github.com/ferranbt/fastssz , it is efficient and well tested but doesn't implement the streaming interface and misses some minor things mostly related to style that we will need.

deterministic cbor

It generally satisfies the properties that we are looking for. But as it is self-describing (includes tags in a manner similar to json) the length of the encoding is somewhat longer. So I decided not to pursue this option.

@tal-m suggested to take a look at https://borsh.io/

It shares a lot of similarities with ssz (excluding merkleization which we won't use).

The most significant difference is that the offset is not used to encode dynamic length fields. So if we consider my example above, encoding with borsch will look like: little_endian(field1) || field2 || little_endian(len(field3)) || field3 || field4.
The resulting length of the encoding is the same, but doesn't support lookup without decoding the whole structure.

Besides that, there are a few minor differences:

in borsh map types are supported, must be lexicographically sorted. we don't use them in existing data structures and probably won't need too
ssz doesn't support float64/32. we don't use them
better encoding for null values (Option type in borsh, Union[null, Struct] in ssz)

Libraries are implemented by near org: https://github.com/orgs/near/repositories?q=borsh&type=&language=&sort=
In particular golang library uses reflection and won't be efficient, but we can write an efficient one easily as the encoding is straightforward.

tal-m · 2021-09-09T00:53:39Z

Thanks for the summary! That's very helpful.

Ideally, I think we would be using the same serialization library for any kind of serializing we do in our code, so if there's a single library that can support both our crypto requirements and SVM use-cases (including potential future use-cases) that is an advantage. For example, support for maps might come in under that category...

Have you looked at Cap'n Proto? It seems that the latest versions define a Canonicalized encoding that would make it a viable option (unlike protobufs or flatbuffers, which explictly say they're not deterministic).

dshulyak · 2021-09-09T07:21:58Z

Ideally, I think we would be using the same serialization library for any kind of serializing we do in our code, so if there's a single library that can support both our crypto requirements and SVM use-cases (including potential future use-cases) that is an advantage. For example, support for maps might come in under that category...

I think that it would be good if we can use consistent encoding everywhere if we assume that multiple spacemesh clients may exist in the future. Based on available SMIPs SVM can use any of ssz or borsh, but according to @YaronWittenstein current SVM encoding is more compact. Maybe we can use it in go-spacemesh too.

Have you looked at Cap'n Proto? It seems that the latest versions define a Canonicalized encoding that would make it a viable option (unlike protobufs or flatbuffers, which explictly say they're not deterministic).

In the previous project we considered using it and decided not to use it:

it makes it much harder to work with types. instead of language based setters/getters programmer has to use API defined by the generated capnproto type. Pretty much all code needs to be refactored to work using new API
at least in golang capnproto doesn't seem to be zero-overhead based on this comparison. i don't know if it is an implementation thing or not. and in general zero overhead promise is negated if data is compressed for the wire.

So, it introduces additional complexity without obvious benefits.

For spacemesh use case, simple binary protocol seems to be a better choice (anything similar to ssz, borsh as they are both very straightforward) as it will be more error-proof, and needs to be concerned only with some security considerations from capnproto.

dshulyak · 2021-09-15T13:09:37Z

After some thoughts, i am leaning towards borsh:

i am not completely satisfied with fastssz library, and writing new library will be more complex in comparison with borsh
ssz was designed specifically for eth2 and may evolve in the future in some unexpected way (due to merkleization requirements for example), borsh on the other hand seems to be simpler and finished spec.

i made a lib for golang https://github.com/dshulyak/borsh , it provides a streaming interface and doesn't rely on reflection. for now it defines support for types that we currently use in the go-sm codebase.

dshulyak · 2021-10-11T07:56:43Z

One more option is to stick with xdr, and rewrite the library using the code generation approach. I haven't looked into xdr details before, but it is actually not that different from borsh (xdr uses big endian instead of little endian for integers, but in general encoding schema is the same).

UPD: actually xdr seems to have more overhead in comparison with borsh based on spacemeshos/SMIPS#23

nkryuchkov · 2021-12-27T11:17:15Z

The XDR library consumes an excessive amount of memory on parsing of some data: #3014

nkryuchkov · 2021-12-27T16:06:54Z

#2701 may be caused by a bug in the XDR library, I have no other ideas (needs to be confirmed)

antonlerner added the technical debt label Aug 17, 2021

lrettig added help wanted optimization labels Aug 26, 2021

moshababo assigned lrettig Oct 11, 2021

moshababo mentioned this issue Oct 11, 2021

Replace dynamic XDR codec with more efficient encoding #2545

Closed

lrettig assigned dshulyak and unassigned lrettig Jan 12, 2022

dshulyak removed their assignment Jan 13, 2022

dshulyak closed this as completed Jun 12, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Suggest new encoding scheme #2686

Suggest new encoding scheme #2686

antonlerner commented Aug 17, 2021

dshulyak commented Sep 7, 2021 •

edited

Loading

tal-m commented Sep 9, 2021

dshulyak commented Sep 9, 2021

dshulyak commented Sep 15, 2021

dshulyak commented Oct 11, 2021 •

edited

Loading

nkryuchkov commented Dec 27, 2021 •

edited

Loading

nkryuchkov commented Dec 27, 2021

Suggest new encoding scheme #2686

Suggest new encoding scheme #2686

Comments

antonlerner commented Aug 17, 2021

Description

dshulyak commented Sep 7, 2021 • edited Loading

tal-m commented Sep 9, 2021

dshulyak commented Sep 9, 2021

dshulyak commented Sep 15, 2021

dshulyak commented Oct 11, 2021 • edited Loading

nkryuchkov commented Dec 27, 2021 • edited Loading

nkryuchkov commented Dec 27, 2021

dshulyak commented Sep 7, 2021 •

edited

Loading

dshulyak commented Oct 11, 2021 •

edited

Loading

nkryuchkov commented Dec 27, 2021 •

edited

Loading