This PR speculatively proposes a new way of capturing logs and querying them. It introduces a LogBuilder that captures the logs as they appear and later builds them into an immutable file. To make this efficient, each address/topic is hashed using XXHash64, a fast non-cryptographic hash that produces a 64-bit ulong. Due to the birthday paradox, collisions become likely once the number of distinct topics approaches 4 billion (2^32, the square root of the 2^64 possible hash values).
Entries
Each LogEntry is indexed by its address and all the topics. To distinguish topics by position, different seeds are used for hashing. This impacts the collision probability slightly. The block number and the transaction number are encoded as a ulong->uint mapping: the ulong represents the hash, while the uint encodes the (block, tx) tuple. This gives 12 bytes of storage for each topic/address of a log entry.
Writing and deduplication
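For reference, the in-memory entry described under Entries can be pictured with the following Python sketch. It is an illustration only, not the PR's code: blake2b stands in for XXHash64 (which is not in the Python standard library), using the position as the seed is an assumption, and the 16/16 bit split of the uint is hypothetical, since the description does not say how (block, tx) is packed.

```python
import hashlib
import struct


def key_hash(data: bytes, position: int) -> int:
    """64-bit key for an address (position 0) or a topic (position 1..).

    Stand-in for a seeded XXHash64: the position is used as the seed (assumed),
    so the same bytes at different positions give different keys.
    """
    salt = position.to_bytes(8, "little")
    digest = hashlib.blake2b(data, digest_size=8, salt=salt).digest()
    return int.from_bytes(digest, "little")


def pack(block_offset: int, tx: int) -> int:
    """Pack (block, tx) into 32 bits; the 16/16 split is a hypothetical layout."""
    if not (0 <= block_offset < 1 << 16 and 0 <= tx < 1 << 16):
        raise ValueError("block offset and tx index must each fit in 16 bits")
    return (block_offset << 16) | tx


def unpack(value: int) -> tuple[int, int]:
    return value >> 16, value & 0xFFFF


def entries_for_log(address: bytes, topics: list[bytes], block_offset: int, tx: int):
    """One (key, value) pair for the address and one per topic."""
    value = pack(block_offset, tx)
    keys = [key_hash(address, 0)]
    keys += [key_hash(topic, i + 1) for i, topic in enumerate(topics)]
    return [(key, value) for key in keys]


def serialize(key: int, value: int) -> bytes:
    """8-byte hash followed by a 4-byte (block, tx) value: 12 bytes per entry."""
    return struct.pack("<QI", key, value)
```

A log with one address and two topics yields three such pairs, each 12 bytes when serialized.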
When a builder is flushed to an IBufferWriter<byte>, its entries are first sorted by their hashes. To make the lookup faster, keys and values are kept as separate arrays. After sorting, if a key (topic/address) appears multiple times, it is encoded in a different way. First, all the corresponding (block, tx) values are sorted so that they can be diff-encoded with varints. They are written to the output buffer, and the sequence is sealed by a special entry at the end that points to the beginning of the sequence. The address of this entry is what is mapped to the topic in a given file. This allows frequently occurring topics to be encoded very efficiently without penalizing the case of a unique address/topic. The file is sealed by writing a single int that represents how many topics there are in it. The rest can be derived.
Example
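The following Python sketch walks through the flush described above on a tiny input. It is a hedged illustration, not the PR's format: the byte layout (sequences first, then the key array, the value array and the final int), the 4-byte seal entry, and the FLAG bit that marks a value as a seal address are all assumptions made for the sake of a runnable example.

```python
import bisect
import struct

FLAG = 1 << 31  # assumed marker: the value is the address of a seal entry


def write_varint(out: bytearray, n: int) -> None:
    while n >= 0x80:
        out.append((n & 0x7F) | 0x80)
        n >>= 7
    out.append(n)


def read_varint(buf: bytes, pos: int) -> tuple[int, int]:
    result = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if byte < 0x80:
            return result, pos
        shift += 7


def build(entries: list[tuple[int, int]]) -> bytes:
    """entries: (key hash, packed (block, tx) value below 2**31) pairs."""
    groups: dict[int, list[int]] = {}
    for key, value in entries:
        groups.setdefault(key, []).append(value)
    keys = sorted(groups)
    out = bytearray()
    values = []
    for key in keys:
        vals = sorted(groups[key])
        if len(vals) == 1:
            values.append(vals[0])  # unique key: store the value directly
            continue
        start, prev = len(out), 0
        for v in vals:  # diff encoding with varints
            write_varint(out, v - prev)
            prev = v
        seal = len(out)
        out += struct.pack("<I", start)  # seal entry points back to the start
        values.append(FLAG | seal)
    out += struct.pack(f"<{len(keys)}Q", *keys)    # keys as one array
    out += struct.pack(f"<{len(keys)}I", *values)  # values as a separate array
    out += struct.pack("<i", len(keys))            # single int sealing the file
    return bytes(out)


def lookup(file: bytes, key: int) -> list[int]:
    (count,) = struct.unpack_from("<i", file, len(file) - 4)
    values_at = len(file) - 4 - 4 * count
    keys_at = values_at - 8 * count
    keys = struct.unpack_from(f"<{count}Q", file, keys_at)
    i = bisect.bisect_left(keys, key)
    if i == count or keys[i] != key:
        return []
    (value,) = struct.unpack_from("<I", file, values_at + 4 * i)
    if not value & FLAG:
        return [value]
    seal = value & ~FLAG
    (pos,) = struct.unpack_from("<I", file, seal)
    result, prev = [], 0
    while pos < seal:
        delta, pos = read_varint(file, pos)
        prev += delta
        result.append(prev)
    return result
```

With the entries `[(30, 7), (10, 300), (10, 5), (20, 1), (10, 70000)]`, key 10 is stored as a 6-byte varint sequence plus a 4-byte seal entry, while keys 20 and 30 keep their single value inline.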
Size considerations
mainnet
Merging and querying
As a single builder can encode up to a few hundred thousand blocks (probably not possible due to memory constraints), merging of files must be implemented. As files are ordered both by keys and by entries, merging can be thought of as merging two sorted enumerables (as in merge sort).
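A minimal Python sketch of such a merge, under stated assumptions: each file is modeled as an iterable of (key, sorted values) pairs ordered by key, and the values of both files are assumed to be directly comparable (for example, absolute rather than file-relative block numbers). The name merge_files is hypothetical.

```python
import heapq
from typing import Iterable, Iterator

Entry = tuple[int, list[int]]  # (key hash, sorted (block, tx) values)


def merge_files(a: Iterable[Entry], b: Iterable[Entry]) -> Iterator[Entry]:
    """Merge two key-ordered streams; values of equal keys are merged as well."""
    ia, ib = iter(a), iter(b)
    ea, eb = next(ia, None), next(ib, None)
    while ea is not None and eb is not None:
        if ea[0] < eb[0]:
            yield ea
            ea = next(ia, None)
        elif eb[0] < ea[0]:
            yield eb
            eb = next(ib, None)
        else:
            # Same key in both files: merge the two sorted value lists.
            yield ea[0], list(heapq.merge(ea[1], eb[1]))
            ea, eb = next(ia, None), next(ib, None)
    while ea is not None:
        yield ea
        ea = next(ia, None)
    while eb is not None:
        yield eb
        eb = next(ib, None)
```

Because both inputs are consumed in order, the merged output can be streamed straight into a new file without holding either input fully in memory.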
Querying is not implemented yet. A plain binary search over billions of keys is impractical, so an additional index is required. We could introduce simple skip lists at the top, or split keys by prefixes. If an additional index is introduced, it can still be written to the output buffer in a single pass. With this design, queries that require AND would issue separate searches, each resulting in an enumerable of (block, tx) values, and these enumerables would then be intersected. Keeping the files small, within a limited range of blocks, can limit the length of the enumerables and improve the speed.
GC
If the builder builds files in block ranges, like 64k blocks at a time, files could be named using the starting block number. Then GC would remove the oldest files.
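As a rough Python sketch of this naming scheme: the 64k range comes from the text, while the zero-padded name, the `.logs` extension and the `keep` parameter are hypothetical choices made only for illustration.

```python
RANGE = 64 * 1024  # blocks per file, as suggested above


def file_name(block: int) -> str:
    """Name of the file that covers `block`, derived from the range's starting block."""
    start = block - block % RANGE
    return f"{start:012d}.logs"  # hypothetical naming convention


def files_to_remove(names: list[str], keep: int) -> list[str]:
    """Oldest files first; everything beyond the newest `keep` files is collected."""
    ordered = sorted(names, key=lambda name: int(name.split(".")[0]))
    return ordered[: max(0, len(ordered) - keep)]
```

Since the starting block is part of the name, GC needs no index of its own: sorting the names is enough to find the oldest ranges.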
Issues
The diff encoding may be heavy to search through. If needed, when grouping 64k blocks is not enough, a different encoding or a skip list can be added.
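To make the search cost concrete, here is a hedged Python sketch of the AND query outlined under Merging and querying. Each diff-encoded sequence has to be decoded from its start, so the cost is linear in the sequence length. The deltas are given as plain integers here (varint decoding is left out), and the function names are hypothetical.

```python
from itertools import accumulate
from typing import Iterable, Iterator


def decode(deltas: Iterable[int]) -> Iterator[int]:
    """Lazily turn diff-encoded values back into sorted (block, tx) values."""
    return accumulate(deltas)


def intersect(a: Iterable[int], b: Iterable[int]) -> Iterator[int]:
    """Intersect two sorted streams in a single pass over each."""
    ia, ib = iter(a), iter(b)
    x, y = next(ia, None), next(ib, None)
    while x is not None and y is not None:
        if x == y:
            yield x
            x, y = next(ia, None), next(ib, None)
        elif x < y:
            x = next(ia, None)
        else:
            y = next(ib, None)
```

For example, `list(intersect(decode([5, 295, 69700]), decode([300, 10, 69690])))` walks the sequences 5, 300, 70000 and 300, 310, 70000 and keeps the two values they share. A skip list over the sequence would let the loop jump ahead instead of decoding every delta.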