
Platform Independent Distribution of Data #2

Open
hkalodner opened this issue Sep 6, 2017 · 1 comment

Comments

@hkalodner
Collaborator

Currently, BlockSci data files consist of the direct serialization of BlockSci's data structures from memory onto disk, in the exact layout the compiler assigns. This allows data to be loaded with near-zero processing, but it also makes distributing BlockSci's data files difficult, since the format depends on the many factors that can affect the memory layout of C++ classes.

Creating a platform-independent intermediate data format would allow us to distribute our processed blockchain data so that others could download it rather than having to run BlockSci's parser themselves.

Furthermore, incremental data updates could be posted, which would allow people to maintain fairly up-to-date copies of BlockSci's blockchain data without running a full node.

@mplattner
Collaborator

Solving this issue would significantly lower the entry barrier to deploying and using BlockSci.

Thus, I tried to brainstorm what the possible issues could be, specifically regarding @hkalodner's "many factors that can affect the memory layout of C++ classes". However, I can't come up with many factors; any help is appreciated here.

Possible issues: struct alignment and padding, endianness, pointers, and the data structures of RocksDB and Google's DenseHashMap.

  • Struct alignment and padding (of memory-mapped files): I think this might be resolved by using struct packing, e.g. __attribute__((__packed__)). This causes a performance penalty on many platforms, which could be avoided by manually padding the structures (e.g. by inserting dummy variables) so they match the natural layout of common 64-bit platforms. The data would then be portable, but still optimized (padded and aligned correctly) for common platforms.
  • Endianness: Most common architectures use little-endian, so this shouldn't be a problem. A check that warns the user about the "wrong" (big) endianness might be helpful for uncommon architectures.
  • Pointers: The memory-mapped files do not contain any (raw) pointers, so this should not be a problem.
  • Facebook's RocksDB: We need to check if the persistent files of RocksDB are platform-independent.
  • Google's DenseHashMap (parser only): We need to check if the serialized data using the built-in serializer is platform-independent.

If supporting only common platforms is enough, most of the above can probably be solved rather easily; see the suggestion in the first bullet.

Another possible (maybe more elegant, but also more time-consuming) solution is to use a platform-neutral format like Google's Protocol Buffers (protobuf) before distributing the files. Something like blocksci_parser config.json export <path> to export data to protobuf format, and blocksci_parser config.json import <path> to import distributed data.
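A protobuf export could start from a schema along these lines (a rough sketch only; the message and field names are hypothetical and do not correspond to BlockSci's actual structures):

```protobuf
// Hypothetical export schema for illustration, not BlockSci's real layout.
syntax = "proto3";

package blocksci.export;

message Output {
  uint64 value = 1;
  bytes script = 2;
}

message Transaction {
  bytes hash = 1;
  uint32 locktime = 2;
  repeated Output outputs = 3;
}

message Block {
  bytes hash = 1;
  uint32 height = 2;
  repeated Transaction txes = 3;
}
```

Since protobuf defines its wire format independently of the host's alignment and endianness, this would sidestep all of the layout issues above at the cost of an explicit export/import step.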

This is just a first step towards solving this issue, and there are several open questions.
E.g., should parsing locally still work for distributed (pre-parsed) BlockSci data? (This determines whether we need to ship the parser state.)
